biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:01:14 +00:00
d54bce6a6d fix: correct SMEM size for MMA 4-warp test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:00:29 +00:00
be45e87891 test: MMA→4-warp TMEM read — do warps see different rows?
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 22:59:33 +00:00
6b0d57074a test: TMEM cross-warp visibility with different sync strategies
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 22:58:50 +00:00
77d190278e test: simpler TMEM 4-warp read — direct store+load
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 22:58:01 +00:00
91b03bd6bd test: verify 4-warp TMEM read with 32x32b.x8 after MMA
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 22:56:30 +00:00
28e04a5ea8 fix: use __cvta_generic_to_shared directly for 64-bit compat
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 22:56:17 +00:00
1d6a95df32 fix: typo in tmem row offset test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 22:56:01 +00:00
cf6fe71368 test: verify TMEM 32x32b.x8 row offset addressing
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 22:53:51 +00:00
4cfb707405 fix: correct SMEM size calculation in multirow test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 22:52:53 +00:00
863a030c3b fmha_multirow: rewrite with 32x32b.x8 only, no s_p_vals, row_page addressing
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 21:08:17 +00:00
1ba304db3e stuff
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 20:13:54 +00:00
deaa3ec725 CRITICAL FIX: Q/K SMEM canonical layout must use local d (0..15) not full_d — UMMA descriptor reads from sQ0/sK0 start, not offset
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 20:10:18 +00:00
08694b8136 Fix multi-row softmax v3: 32x32b.x8 with per-lane per-row (no wmax/wsum), per-row sRowMax/sRowSum arrays
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 20:08:32 +00:00
aaa76c1af1 Rewrite multi-row softmax using 16x256b.x1 TMEM reads for proper multi-row access
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 20:06:17 +00:00
5e3c61184c Fix multi-row softmax: remove cross-lane wmax/wsum — each lane handles its own row independently
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 20:05:02 +00:00
bf4dfd131b Fix nvcc goto-bypasses-init: move var decls before goto targets
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 20:04:46 +00:00
2b09d4f2ef Fix nvcc goto-bypasses-init in multi-row test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 20:04:31 +00:00
d8b421ccee Multi-row FMHA kernel (Milestone 4): T>1 prefill support with 4-warp parallel softmax
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 19:35:07 +00:00
adc88613fa Milestone 5 COMPLETE: multi-head FMHA grid launch verified on B200
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 19:33:06 +00:00
3fd302e7a0 Fix nvcc goto-bypasses-init errors in multi-head test