biondizzle
b3020c2811
6-warp specialized FMHA kernel: all head dims (HD=16/64/128/256) pass, cosine similarity >= 0.999997
Warp layout (192 threads):
- Warps 0-3: Softmax + correction + epilogue
- Warp 4: MMA (QK + PV GEMM)
- Warp 5: Data staging (Q/K/V loads; reads directly from GMEM for now)
Phases are separated by CTA-wide __syncthreads() barriers.
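
The warp layout above can be sketched as a thread-index-to-role mapping. This is only an illustration: the names `WarpRole`, `warp_role`, `threads_with_role`, and the `FMHA_HD` macro are hypothetical and are not taken from fmha_6warp.cuh. It is written as plain C++ so it can be checked on the host, and the macro is meant to let the same helper be called from device code when built with nvcc.

```cpp
#include <cassert>  // used by the sanity checks in the usage note

// Under nvcc, mark helpers as callable from host and device; otherwise no-op.
#if defined(__CUDACC__)
#define FMHA_HD __host__ __device__
#else
#define FMHA_HD
#endif

// Hypothetical role names for the three warp groups in the commit message.
enum class WarpRole { Softmax, Mma, Load };

constexpr int kWarpSize   = 32;
constexpr int kNumWarps   = 6;
constexpr int kNumThreads = kNumWarps * kWarpSize;  // 192 threads per CTA

// Map a thread index (threadIdx.x in a kernel) to its warp's role:
// warps 0-3 -> softmax/correction/epilogue, warp 4 -> MMA, warp 5 -> loads.
FMHA_HD constexpr WarpRole warp_role(int thread_idx) {
    return (thread_idx / kWarpSize) < 4  ? WarpRole::Softmax
         : (thread_idx / kWarpSize) == 4 ? WarpRole::Mma
                                         : WarpRole::Load;
}

// Count how many threads in the CTA carry a given role.
inline int threads_with_role(WarpRole role) {
    int count = 0;
    for (int t = 0; t < kNumThreads; ++t) {
        if (warp_role(t) == role) ++count;
    }
    return count;
}
```

In a kernel, each thread would presumably branch once on `warp_role(threadIdx.x)`. Because `__syncthreads()` is a CTA-wide barrier, every one of the 192 threads has to reach it, so the barrier calls would sit outside the role-specific branches. A host-side check such as `assert(warp_role(128) == WarpRole::Mma);` confirms the boundaries.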
Fix: removed a spurious inv_sum normalization from the epilogue.
The PV MMA already consumes softmax-normalized P, so its output needs no further scaling.
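
A small host-side reference, assuming a single query row and scalar V values for brevity, shows why a second inv_sum scaling is wrong once P is already normalized. `RowResult` and `attend_row` are hypothetical names for this sketch and do not come from the kernel source.

```cpp
#include <algorithm>
#include <cassert>  // used by the sanity checks in the usage note
#include <cmath>
#include <cstddef>
#include <vector>

// Output of one attention row plus the softmax denominator that produced it.
struct RowResult {
    float out;      // sum_i P_i * V_i, with P already normalized
    float exp_sum;  // sum_i exp(s_i - max), i.e. 1 / inv_sum
};

// Reference attention for one query row with scalar values (head dim 1).
// Assumes scores and values are non-empty and have the same length.
inline RowResult attend_row(const std::vector<float>& scores,
                            const std::vector<float>& values) {
    const float row_max = *std::max_element(scores.begin(), scores.end());

    float exp_sum = 0.0f;
    for (float s : scores) exp_sum += std::exp(s - row_max);

    float out = 0.0f;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        // P is normalized here, before it is multiplied with V.
        const float p = std::exp(scores[i] - row_max) / exp_sum;
        out += p * values[i];
    }
    return {out, exp_sum};
}
```

With scores `{0, 0}` and values `{2, 4}`, `attend_row` returns `out == 3` and `exp_sum == 2`. Multiplying `out` by `1 / exp_sum` again would give 1.5, which is the kind of double normalization the fix removes.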
Files: fmha_6warp.cuh + test_fmha_6warp*.cu
2026-05-28 16:34:14 +00:00