biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 07:13:16 +00:00
f3a7dc1598 FOOTGUN #0: num_tma_load_bytes MUST include V bytes. Fix v27, v29, comment all. Update README.
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 07:08:53 +00:00
d873284e84 v29: FIX DEADLOCK - add V bytes to num_tma_load_bytes. V=I(128,128) cosine 1.0
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 06:46:16 +00:00
21c9a4823e README: update with v28/v29 deadlock investigation, FMHA softmax bridge trace, new footguns
6ec308dd50 v29 (padded V, deadlocks), v30 (diag copy, works) — debugging epilogue deadlock with (128,128) PV
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 05:55:29 +00:00
261fe5e43e even more stuff
3fdb9f008b v28 attempt: PV MMA (128,64) - cosine 0.004, debugging
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 05:17:17 +00:00
ce63ec7ee9 README: Bug 4 root cause — TMEM layout mismatch (128,64) PV A-fragment vs softmax P write
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 05:09:04 +00:00
1f6fb856ea more stuff
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 04:40:30 +00:00
1314a67135 Stage B progress: PV works for square (128,128), broken for (128,64)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 04:10:20 +00:00
2cbb0369fd Stage B: pipeline deadlock fixed, V MN-major applied, PV output garbage
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 00:12:50 +00:00
1eca10e39f Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 20:26:31 +00:00
e678afcde0 Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 07:15:05 +00:00
d3a7e7a286 stuff
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 06:51:15 +00:00
1b4742a438 Update README: reflect current state, add C128A/C4A topk + warmup fixes
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 06:48:12 +00:00
a793751fe8 Fix warmup compilation + add sparse topk metadata kernels
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 05:46:17 +00:00
dd7af0cd8a feat: GPU-native SWA + sparse decode attention kernels (CuTeDSL)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 04:42:59 +00:00
16b3094bdb README: comprehensive update with current kernel status
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 04:40:00 +00:00
efa0a156a0 Update README with final kernel status
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 04:39:50 +00:00
fffb2144ae Custom CUDA kernel for de-interleave plus NVFP4 quantize
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 04:20:49 +00:00
7fa81e6990 Remove debug print statements from pipeline
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 04:13:58 +00:00
d775d1075d Fused SwiGLU epilogue with granularity-8 weight interleave
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 03:30:37 +00:00
f8716a1fa1 docs: rewrite README.md with current project state