nvfp4-megamoe-kernel/tests/unit
biondizzle 90c3372040 refactor: TMA FMHA kernel — 4-warp, proven pattern, full pipeline
Complete rewrite of fmha_6warp_tma.cuh based on lessons learned:
- 128 threads (4 warps) instead of 192 (6 warps) — simpler, proven
- Warp 0: TMA load + softmax, Warp 1: MMA + TMEM alloc
- TMA: mbarrier.arrive.expect_tx (root cause fix), phase parity tracking
- Q loaded directly (T=1 decode), K/V via TMA
- Per-K-sub-tile Q and K loading into (128,16) canonical buffers
- Full softmax + PV GEMM + epilogue pipeline
- Test updated to match new kernel signature
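
The expect_tx and phase parity items above can be illustrated with a small host-side model. This is only a sketch of the general mbarrier semantics, not code from fmha_6warp_tma.cuh: `ModelBarrier`, `run_rounds`, and the byte counts are hypothetical names and values chosen for illustration. It assumes a single barrier with an arrival count of 1 that is reused every iteration, so the waiter has to flip its tracked parity bit after each completed phase.

```cpp
#include <cstdint>

// Host-side model of one mbarrier (hypothetical, for illustration only).
// A phase completes when pending arrivals AND pending transaction bytes
// both reach zero; completion flips the phase bit and re-arms the barrier.
struct ModelBarrier {
    uint32_t expected_arrivals;   // arrival count the barrier was initialised with
    uint32_t pending_arrivals;    // arrivals still missing in the current phase
    int64_t  pending_tx_bytes = 0;
    uint32_t phase = 0;           // parity of the current, still incomplete phase

    explicit ModelBarrier(uint32_t arrivals)
        : expected_arrivals(arrivals), pending_arrivals(arrivals) {}

    // Models mbarrier.arrive.expect_tx: register the bytes the async copy
    // will deliver, then count one arrival.
    void arrive_expect_tx(uint32_t bytes) {
        pending_tx_bytes += bytes;
        --pending_arrivals;
        maybe_complete();
    }

    // Models the copy engine signalling that `bytes` have landed in shared memory.
    void complete_tx(uint32_t bytes) {
        pending_tx_bytes -= bytes;
        maybe_complete();
    }

    // Models a parity wait: true once the phase with this parity has completed.
    bool test_wait_parity(uint32_t parity) const { return phase != parity; }

private:
    void maybe_complete() {
        if (pending_arrivals == 0 && pending_tx_bytes == 0) {
            phase ^= 1u;                          // next phase has opposite parity
            pending_arrivals = expected_arrivals; // re-arm for reuse
        }
    }
};

// Runs `iters` load/wait rounds on one reused barrier. Returns how many rounds
// behaved as intended: not ready while bytes are in flight, ready once they land.
int run_rounds(int iters, uint32_t tile_bytes) {
    ModelBarrier bar(1);
    uint32_t parity = 0;  // waiter-side parity, flipped after every completed wait
    int ok = 0;
    for (int i = 0; i < iters; ++i) {
        bar.arrive_expect_tx(tile_bytes);
        bool ready_before = bar.test_wait_parity(parity);
        bar.complete_tx(tile_bytes);
        bool ready_after = bar.test_wait_parity(parity);
        if (!ready_before && ready_after) ++ok;
        parity ^= 1u;
    }
    return ok;
}
```

In this model, calling `run_rounds(8, 2048)` reports all eight rounds as well-behaved. If the arrival registers zero expected bytes, the phase completes on the arrival alone, so a waiter could proceed before any data has landed. The commit message does not describe the earlier failure in detail, so this is offered only as background on why the byte count matters.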
2026-05-29 18:50:58 +00:00