nvfp4-megamoe-kernel/dsv4
biondizzle 863a030c3b fmha_multirow: rewrite with 32x32b.x8 only, no s_p_vals, row_page addressing
- Kill 64KB s_p_vals buffer — P is streamed per K-tile through sPk
- All TMEM ops use 32x32b.x8 exclusively (16x256b.x1 crashes on 2nd call)
- T>32: 4 softmax warps use row_page offset in TMEM address (row<<16)
- Lane l in warp w handles row w*32+l
- Two-pass softmax: pass 1 row_max, pass 2 exp/sum interleaved with PV
- PV: N=16 sub-tiles, SS MMA sPk(128,16) × sV(16,16) → TMEM
- Epilogue: 32x32b.x8 TMEM read, normalize, BF16 → GMEM
- SMEM budget: ~14KB (well within 232KB)
2026-05-28 22:52:52 +00:00
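The data flow described in the commit message can be sketched on the host as a plain C++ reference. This is a minimal sketch under stated assumptions, not the kernel itself. The names `row_of`, `tmem_addr`, and `attend_row` are hypothetical. The TMEM layout (row/lane index in the upper 16 bits, column in the lower 16) is inferred from the `row<<16` note. The 16-wide K-tile mirrors the sPk(128,16) shape. Everything is computed in `float`, not BF16, and no TMEM, SMEM, or MMA instructions are modeled.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: lane l of softmax warp w handles row w*32 + l.
constexpr std::uint32_t row_of(std::uint32_t warp, std::uint32_t lane) {
    return warp * 32u + lane;
}

// Hypothetical helper: assumes the row (lane) index sits in the upper
// 16 bits of the address and the column in the lower 16 bits.
constexpr std::uint32_t tmem_addr(std::uint32_t base, std::uint32_t row,
                                  std::uint32_t col) {
    return base + (row << 16) + col;
}

// Two-pass softmax(s) * V for a single row.
//   s : scores for this row, length K
//   v : V matrix, K x d, row-major
// P is never stored for the whole row: each K-tile of P lives only in
// p_tile (standing in for sPk) and is consumed by the PV product at once.
std::vector<float> attend_row(const std::vector<float>& s,
                              const std::vector<float>& v,
                              std::size_t d, std::size_t k_tile = 16) {
    const std::size_t k = s.size();
    std::vector<float> out(d, 0.0f);
    if (k == 0 || k_tile == 0) return out;  // nothing to attend to

    // Pass 1: row maximum, for numerical stability.
    float row_max = -std::numeric_limits<float>::infinity();
    for (float x : s) row_max = std::max(row_max, x);

    // Pass 2: exp and running sum, interleaved with the PV accumulation.
    float row_sum = 0.0f;
    std::vector<float> p_tile(k_tile);
    for (std::size_t k0 = 0; k0 < k; k0 += k_tile) {
        const std::size_t n = std::min(k_tile, k - k0);
        for (std::size_t j = 0; j < n; ++j) {
            p_tile[j] = std::exp(s[k0 + j] - row_max);
            row_sum += p_tile[j];
        }
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t c = 0; c < d; ++c)
                out[c] += p_tile[j] * v[(k0 + j) * d + c];
    }

    // Epilogue: normalize by the row sum.
    for (std::size_t c = 0; c < d; ++c) out[c] /= row_sum;
    return out;
}
```

Because normalization is deferred to the epilogue, pass 2 needs only the row maximum from pass 1. That is what lets each P tile be discarded as soon as it has been multiplied into the accumulator, so no full-row P buffer is required.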