Files
nvfp4-megamoe-kernel/dsv4
biondizzle b3020c2811 6-warp specialized FMHA kernel — ALL HD=16/64/128/256 PASS cos 0.999997+
Warp layout (192 threads):
- Warps 0-3: Softmax + correction + epilogue
- Warp 4: MMA (QK + PV GEMM)
- Warp 5: Data staging (Q/K/V loads, direct GMEM for now)
CTA-wide __syncthreads() sync between phases.

Fix: removed spurious inv_sum normalization in epilogue
(MMA output is already correctly scaled with softmax'd P).

Files: fmha_6warp.cuh + test_fmha_6warp*.cu
2026-05-28 16:34:14 +00:00
..