nvfp4-megamoe-kernel

biondizzle c69f3668e1 feat: TMA async FMHA kernel — WORKING on B200

Three critical CUDA 13 fixes that made TMA work:
1. globalStrides are given in BYTES, not elements (root cause of the descriptor creation failures)
2. Tensor map data type is BFLOAT16 instead of UINT16
3. mbarrier wait uses a selp.b32 polling pattern (@p bra HANGS on SM100!)
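
Fix 1 can be illustrated with a small host-side sketch. It assumes a row-major 2D BF16 tensor and relies on the documented rule for cuTensorMapEncodeTiled that each globalStrides entry is a byte count and a multiple of 16. The helper names (row_stride_bytes, padded_pitch_elems) are hypothetical and not taken from this commit.

```cpp
#include <cstdint>
#include <stdexcept>

// Assumptions for this sketch (not from the commit):
// BF16 elements are 2 bytes, and tensor-map global strides
// must be a multiple of 16 bytes.
constexpr std::uint64_t kBf16Bytes = 2;
constexpr std::uint64_t kStrideAlignBytes = 16;

// Hypothetical helper: byte stride between consecutive rows of a
// row-major 2D BF16 tensor whose rows are row_pitch_elems apart.
// This is the value that belongs in globalStrides[0]; passing
// row_pitch_elems itself (an element count) is the bug in fix 1.
inline std::uint64_t row_stride_bytes(std::uint64_t row_pitch_elems) {
    const std::uint64_t bytes = row_pitch_elems * kBf16Bytes;
    if (bytes % kStrideAlignBytes != 0) {
        throw std::invalid_argument("row stride must be a multiple of 16 bytes");
    }
    return bytes;
}

// Hypothetical helper: smallest row pitch (in elements) that is at
// least cols and whose byte size is a multiple of 16.
inline std::uint64_t padded_pitch_elems(std::uint64_t cols) {
    const std::uint64_t elems_per_unit = kStrideAlignBytes / kBf16Bytes;  // 8
    return (cols + elems_per_unit - 1) / elems_per_unit * elems_per_unit;
}
```

For a pitch of 16 BF16 elements the stride to pass is 32, not 16. A pitch of 4 elements (8 bytes) is rejected by the helper, and padded_pitch_elems(4) gives 8 as the nearest valid pitch.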

Also includes the CUTLASS driver workaround (clear bit 21 for driver versions <= 13.1).

Verified: 2D TMA load of (128,16) BF16 tile = 0 mismatches.
Kernel: fmha_6warp_tma_kernel with per-sub-tile TMA loads for Q, K, V.
Test: test_fmha_tma.cu with padded Q allocations and per-head descriptors.
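
A mismatch count like the one reported above could be computed on the host roughly as follows. This is a sketch only: it assumes the loaded tile is copied back as raw 16-bit BF16 patterns and compared bit for bit against a reference, with independent row pitches so a padded source can be checked. count_mismatches is a hypothetical name; the actual check in test_fmha_tma.cu may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: compares a rows x cols tile stored in `got`
// (row pitch got_pitch, in elements) against the same tile in `ref`
// (row pitch ref_pitch) and returns the number of differing elements.
// BF16 values are compared as raw 16-bit patterns, so padding
// columns beyond `cols` are ignored.
inline std::size_t count_mismatches(const std::vector<std::uint16_t>& ref,
                                    std::size_t ref_pitch,
                                    const std::vector<std::uint16_t>& got,
                                    std::size_t got_pitch,
                                    std::size_t rows,
                                    std::size_t cols) {
    std::size_t mismatches = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            if (ref[r * ref_pitch + c] != got[r * got_pitch + c]) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

For the (128,16) tile, rows = 128 and cols = 16; a correct load returns 0.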

2026-05-29 07:02:07 +00:00