biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 10:46:31 +00:00

e231b98387 Fix mHC Sinkhorn test: row sums expected to be off (eps after softmax)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 10:45:04 +00:00

b5f29be169 Add mHC Sinkhorn CUDA kernel test

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 10:44:57 +00:00

6cb5078821 Fix mHC Sinkhorn kernel: remove VLA, remove Python fallback

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 10:20:41 +00:00

c89762ecdd Fix set_indexer_keys_fp8 None guard + store comp_pos in mixed storage

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 10:19:55 +00:00

1f69f61363 Add detailed comment: why compressed KV uses FP8 not NVFP4

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 10:08:46 +00:00

edc8e7ee8d KV-1/KV-2: Mixed FP8+BF16 compressed KV (DeepSeek V4 paper format)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 10:04:02 +00:00

12b6365b42 Fix RoPE test: use proper cos/sin cache

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 10:02:10 +00:00

f566b9b748 Fix FP8 quantize return type (2-tuple not 3)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 10:01:10 +00:00

bdb25ee5cd Add production-value unit tests for kv_quantize kernels

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 10:00:59 +00:00

7ef6402936 KV-1/KV-2/KV-3: NVFP4 compressed KV + FP8 indexer keys

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 09:49:13 +00:00

40dd56eac2 KV-1: Fix shared memory corruption in block_reduce

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 09:46:34 +00:00

0fefadedd4 KV-1: Fix FP8 round-trip mismatch in fused quantize

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 09:44:03 +00:00

d74ff5768d KV diag test

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 09:41:18 +00:00

c2664281c3 KV-1/KV-2: Fix quantize kernel — each thread handles 16-elem blocks independently

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 09:37:56 +00:00

f23320b5b2 KV-1/KV-2: Fused compress+NVFP4 quantize kernels + dequant

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 09:30:07 +00:00

107d62dd76 docs: update PERFORMANCE_AUDIT.md — Part 1 (P0-P3) landed, Part 2 KV cache next

biondizzle pushed tag v-p0p1p2p3-fused-swiglu-cuda-rope-20260602 to biondizzle/nvfp4-megamoe-kernel

2026-06-02 09:22:22 +00:00

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 09:08:10 +00:00

3c295f225a P3: integrate CUDA RoPE kernel into single_shot — 732 launches/token eliminated

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 09:06:39 +00:00

54a9b6961b fix: rope_cuda path — kernels/cuda not ops/cuda

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 09:05:24 +00:00

2bbbead984 P3: CUDA RoPE kernel — single launch per call (vs 5-6 PyTorch ops)