biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 10:46:31 +00:00
e231b98387 Fix mHC Sinkhorn test: row sums expected to be off (eps after softmax)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 10:45:04 +00:00
b5f29be169 Add mHC Sinkhorn CUDA kernel test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 10:44:57 +00:00
6cb5078821 Fix mHC Sinkhorn kernel: remove VLA, remove Python fallback
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 10:20:41 +00:00
c89762ecdd Fix set_indexer_keys_fp8 None guard + store comp_pos in mixed storage
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 10:19:55 +00:00
1f69f61363 Add detailed comment: why compressed KV uses FP8 not NVFP4
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 10:08:46 +00:00
edc8e7ee8d KV-1/KV-2: Mixed FP8+BF16 compressed KV (DeepSeek V4 paper format)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 10:04:02 +00:00
12b6365b42 Fix RoPE test: use proper cos/sin cache
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 10:02:10 +00:00
f566b9b748 Fix FP8 quantize return type (2-tuple not 3)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 10:01:10 +00:00
bdb25ee5cd Add production-value unit tests for kv_quantize kernels
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 10:00:59 +00:00
7ef6402936 KV-1/KV-2/KV-3: NVFP4 compressed KV + FP8 indexer keys
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 09:49:13 +00:00
40dd56eac2 KV-1: Fix shared memory corruption in block_reduce
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 09:46:34 +00:00
0fefadedd4 KV-1: Fix FP8 round-trip mismatch in fused quantize
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 09:44:03 +00:00
d74ff5768d KV diag test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 09:41:18 +00:00
c2664281c3 KV-1/KV-2: Fix quantize kernel — each thread handles 16-elem blocks independently
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 09:37:56 +00:00
f23320b5b2 KV-1/KV-2: Fused compress+NVFP4 quantize kernels + dequant
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 09:30:07 +00:00
107d62dd76 docs: update PERFORMANCE_AUDIT.md — Part 1 (P0-P3) landed, Part 2 KV cache next
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 09:08:10 +00:00
3c295f225a P3: integrate CUDA RoPE kernel into single_shot — 732 launches/token eliminated
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 09:06:39 +00:00
54a9b6961b fix: rope_cuda path — kernels/cuda not ops/cuda
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 09:05:24 +00:00
2bbbead984 P3: CUDA RoPE kernel — single launch per call (vs 5-6 PyTorch ops)