biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 17:02:37 +00:00

e07d79868f CUDA graph: Fix _assemble_scales_single_group swizzle size

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 16:52:31 +00:00

0ca7bed0e1 CUDA graph: Fix sync violations found by B200 detector

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 16:38:36 +00:00

46a3a51832 CUDA graph: Fix per-step allocations in decode loop

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 16:37:23 +00:00

a9ea30353c CUDA graph: Fix sync violations (Category 1-2)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 16:34:35 +00:00

caac8ae108 Fix syntax error: 'is not not None' -> 'is not None'

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 16:34:18 +00:00

ba68212fa7 Add CUDA graph readiness detector (Section A of GETTING_CUDAGRAPH_READY.md)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 15:52:02 +00:00

ca5bc814d5 Fix compressor: do not add positional bias to KV content

biondizzle pushed tag v-precision-floor-fix-20260603 to biondizzle/nvfp4-megamoe-kernel

2026-06-03 15:51:43 +00:00

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 15:45:22 +00:00

4fe73fe713 auto: pre-test commit

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 14:57:53 +00:00

f577ed97f4 Fix: Use PyTorch dequant_nvfp4 for weight dequantization (compressor/indexer/router gate)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 14:48:54 +00:00

1121cd7b47 Add CUDA_LAUNCH_BLOCKING=1 to catch async errors

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 14:38:26 +00:00

f3bb0ca08c Fix dequant gsa: use ws2 only, NOT input_scale * ws2

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 14:27:01 +00:00

470e65fb19 Fix dequant gsb: input_scale * ws2, not 1.0 * ws2

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 14:19:44 +00:00

2dd16d5789 Switch compressor + indexer weights_proj to BF16 F.linear

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 14:17:09 +00:00

95e45a87e3 Add explicit .to(dev) on W_gate after transpose — belt and suspenders

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 14:14:13 +00:00

ef94c48957 Simplify router gate: dequant NVFP4 → BF16, F.linear (no FP8 middleman)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 14:10:30 +00:00

715602c87c Switch lm_head to BF16 + router gate to FP8_E4M3

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 14:00:37 +00:00

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 13:57:17 +00:00

89510601f5 Revert compressor pos bias fix + SwiGLU clamp ordering from commit 3320abf

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 13:48:46 +00:00

f05ee6cd69 Revert SE BF16 fallback — produced garbage output