biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 17:02:37 +00:00
e07d79868f CUDA graph: Fix _assemble_scales_single_group swizzle size
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 16:52:31 +00:00
0ca7bed0e1 CUDA graph: Fix sync violations found by B200 detector
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 16:38:36 +00:00
46a3a51832 CUDA graph: Fix per-step allocations in decode loop
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 16:37:23 +00:00
a9ea30353c CUDA graph: Fix sync violations (Category 1-2)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 16:34:35 +00:00
caac8ae108 Fix syntax error: 'is not not None' -> 'is not None'
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 16:34:18 +00:00
ba68212fa7 Add CUDA graph readiness detector (Section A of GETTING_CUDAGRAPH_READY.md)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 15:52:02 +00:00
ca5bc814d5 Fix compressor: do not add positional bias to KV content
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 15:45:22 +00:00
4fe73fe713 auto: pre-test commit
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 14:57:53 +00:00
f577ed97f4 Fix: Use PyTorch dequant_nvfp4 for weight dequantization (compressor/indexer/router gate)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 14:48:54 +00:00
1121cd7b47 Add CUDA_LAUNCH_BLOCKING=1 to catch async errors
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 14:38:26 +00:00
f3bb0ca08c Fix dequant gsa: use ws2 only, NOT input_scale * ws2
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 14:27:01 +00:00
470e65fb19 Fix dequant gsb: input_scale * ws2, not 1.0 * ws2
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 14:19:44 +00:00
2dd16d5789 Switch compressor + indexer weights_proj to BF16 F.linear
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 14:17:09 +00:00
95e45a87e3 Add explicit .to(dev) on W_gate after transpose — belt and suspenders
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 14:14:13 +00:00
ef94c48957 Simplify router gate: dequant NVFP4 → BF16, F.linear (no FP8 middleman)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 14:10:30 +00:00
715602c87c Switch lm_head to BF16 + router gate to FP8_E4M3
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 14:00:37 +00:00
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 13:57:17 +00:00
89510601f5 Revert compressor pos bias fix + SwiGLU clamp ordering from commit 3320abf
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 13:48:46 +00:00
f05ee6cd69 Revert SE BF16 fallback — produced garbage output