biondizzle
Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 23:37:12 +00:00
b5cd1b88c9 D2: add shape debug print for mQ/mK
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 23:35:51 +00:00
df3146eb53 D2: hardcode a_major=MN for multi-CTA (Q is always MN-major in FMHA)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 23:34:43 +00:00
e809e71253 D2: use tensor indexing q[0] instead of local_tile for layout extraction
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 23:33:35 +00:00
49c4189195 D2: fix LayoutEnum for multi-dim Q (use head-0 view for layout)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 23:30:10 +00:00
2b76b691cb fix: block_idx() returns tuple, use [1] for y
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 23:27:40 +00:00
4c79e5533e D2: add multi-CTA grid with block_idx_y for Q/O head indexing
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 22:58:24 +00:00
335e310c79 Update D2 status in README
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 22:57:51 +00:00
e3e67c3992 NVFP4-3: enable 2-CTA UMMA when MMA tile M >= 256 (1.7-1.9x throughput)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 22:52:24 +00:00
e0339a92fc D2: revert multi-CTA grid params (using per-head launch approach instead)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 22:49:45 +00:00
a5271821a8 D2: add scale test (more heads, larger hd)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 22:48:24 +00:00
d563c93fc5 D2: add per-head launch test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 22:44:12 +00:00
9b476d87f9 fix: compare un-normalized O against un-normalized reference
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 22:41:33 +00:00
0ca7b58a6a D1: fully revert LSE change back to original sfw_idx==0 guard
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 22:39:28 +00:00
db353ec35a D2: add simple n_h=1 regression test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 22:28:19 +00:00
4418e04a28 D1: revert per-row LSE to sfw_idx=0 for now (debugging D2 regression)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 22:26:11 +00:00
2cc66bff68 D2: add initial multi-head test file
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 22:24:30 +00:00
49e66fb6e4 D1: corrected KV merge test with proper normalized output formula
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 22:23:09 +00:00
c47f648617 fix lse verify
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 22:22:34 +00:00
3577e09603 D1: add LSE verification test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-24 22:21:53 +00:00
674c5b9c18 D1: fix per-row LSE output + add KV merge test v2 with per-row LSE