biondizzle
208af3eadd
FMHA v3 Stage-C full: 12-warp pipeline with real softmax + correction + epilogue
- Softmax warps (0-3): online row max, exp2 scaling, P store, vec broadcast
- Correction warps (4-7): online O rescale, final normalization, SMEM write
- MMA warp (8): Q@K^T -> S and P@V -> O, with proper pipeline chaining
- TMA warp (9): Q/K/V load
- Epilogue warp (10): TMA store O from SMEM to GMEM
- Empty warp (11): TMEM dealloc mbarrier init
- Pipeline chain: mma_s -> softmax -> s_corr -> correction -> corr_epi -> epilogue
- Plus mma_corr -> correction for O rescale
- Reference test uses softmax(Q@K^T/sqrt(d))@V
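The softmax and correction roles listed above can be illustrated on the host with a minimal pure-Python sketch. It is not the kernel code: the function names, the `tile` parameter, and the folding of `1/sqrt(d)` and `log2(e)` into a single factor are assumptions made for illustration. It processes K/V in tiles, keeps a running row max and row sum, rescales the O accumulator whenever the max grows, normalizes at the end, and is checked against the same reference formula the commit names, softmax(Q@K^T/sqrt(d))@V.

```python
import math

LOG2_E = math.log2(math.e)  # exp(x) == 2 ** (x * LOG2_E)


def reference_attention(Q, K, V):
    """Plain softmax(Q @ K^T / sqrt(d)) @ V, one query row at a time."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / denom
                    for j in range(len(V[0]))])
    return out


def online_attention(Q, K, V, tile=2):
    """Tiled attention with online softmax (hypothetical host-side sketch)."""
    # Assumption: fold 1/sqrt(d) and log2(e) into one factor so that the
    # exponentials can be taken in base 2.
    scale_log2 = LOG2_E / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        row_max = -math.inf   # running row max, in the log2 domain
        row_sum = 0.0         # running softmax denominator
        acc = [0.0] * dv      # unnormalized O accumulator
        for start in range(0, len(K), tile):
            k_tile = K[start:start + tile]
            v_tile = V[start:start + tile]
            # MMA role: S = q @ K_tile^T, scaled into the log2 domain
            s = [scale_log2 * sum(a * b for a, b in zip(q, k)) for k in k_tile]
            # Softmax role: update the row max and compute P with base-2 exp
            new_max = max(row_max, max(s))
            p = [2.0 ** (x - new_max) for x in s]
            # Correction role: rescale earlier partial results to the new max
            correction = 2.0 ** (row_max - new_max)  # 0.0 on the first tile
            row_sum = row_sum * correction + sum(p)
            # MMA role: O += P @ V_tile, applied on top of the rescaled O
            acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, v_tile))
                   for j, a in enumerate(acc)]
            row_max = new_max
        # Final normalization by the accumulated row sum
        out.append([a / row_sum for a in acc])
    return out
```

Calling `online_attention(Q, K, V, tile=t)` for any tile size should agree with `reference_attention(Q, K, V)` up to floating-point rounding, which is the property a reference test of this kind checks.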
2026-05-22 09:18:56 +00:00