biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 17:30:51 +00:00
9cbdc92744 Restructure: cutedsl/ -> dsv4/ with proper layering
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 17:15:17 +00:00
94b26ebdf0 Fix: add scale_softmax_log2, use O TMEM rescale for C9 normalization
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 17:11:06 +00:00
3f6f4a3ad8 Stage C: online softmax kernel (WIP) - test_fmha_v3_softmax.py
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 15:43:03 +00:00
ed712c4939 README: full roadmap — Stage C (real softmax), D (paged KV), E (production kernel)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 15:36:09 +00:00
b4f6c6b702 Update both READMEs: Stage B complete, document TMEM overlap root cause
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 15:32:52 +00:00
2d10319e9f Fix TMEM overlap in test_pv64_with_softmax.py too — cosine 0.999999
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 15:30:27 +00:00
73f38acf74 STAGE B BUG 4b FIXED: TMEM P/O overlap + FMHA V reconstruction
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 15:20:18 +00:00
dba4279364 Stage B Bug 4b debugging: P/A alias proven working, V layout issue for (128,64) PV
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 12:52:37 +00:00
5b9d0280d9 FMHA v3: KV-tile interleaving pipeline - QK works, Bug 4b blocks PV
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 11:49:08 +00:00
b0912fbf85 Stage B: PV(128,64) test + v2 pipeline fixes
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 10:50:35 +00:00
579b2f6abe stuff and stuff
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 10:47:48 +00:00
04cbea072e FMHA v1: pv_mma_tiler=(128,64,128) works with V=I, fails with real V (SMEM layout bug)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 09:59:40 +00:00
241ac2bf94 README: Bug 4 ROOT CAUSE CONFIRMED - V SMEM 1 K-tile + PV 8 K-phases mismatch. Zero-pad V workaround correct.
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 09:56:56 +00:00
6c0a6cf50b Root cause FOUND: V SMEM only holds 1 K-tile (2048 BF16), but PV MMA iterates 8 K-phases. For non-(128,128) PV, most K-phases read wrong V data. Zero-padded V works because V is (128,128) covering all 8 K-phases. FMHA interleaves QK+PV per KV-tile to avoid this.
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 09:47:11 +00:00
c59974bcef README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 09:44:24 +00:00
80da5c51a6 Key finding: PV A-fragment layout is IDENTICAL for (128,128)/(128,32)/(128,16) PV. Bug is NOT TMEM alias. cta_tile_shape_mnk wrong for non-(128,128) PV. V SMEM and O C-fragment sizes look correct. Debugging V/epilogue paths.
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 09:20:12 +00:00
464390fcfc Update README: Bug 4 status, (128,16) PV zero output, (128,128) PV zero-pad workaround (cosine 1.0)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 09:10:15 +00:00
6a352cf6da TMEM alias analysis: (128,16) PV broken, (128,128) PV with zero-pad works. Root cause: PV A-fragment layout differs from QK C-fragment layout for (128,16) PV, causing TMEM column mismatch. Using (128,128) PV as workaround.
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 09:00:45 +00:00
73cb3a3277 Debugging TMEM alias for (128,16) PV: zero output confirmed, PV reads from wrong TMEM columns. Need to align softmax P write with PV A-fragment layout.
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 08:45:57 +00:00
768a696373 Stage B N-tiling: (128,16) PV MMA compiles and runs, cosine 0.36 (TMEM alias mismatch bug). FMHA head_dim=64 passes. Debugging TMEM layout alignment.