TMA store uses cp.async.bulk.tensor.2d.global.shared::cta.tile.bulk_group
NOT mbarrier::complete_tx::bytes. Completion tracked via:
- cp.async.bulk.commit_group (after issuing stores)
- cp.async.bulk.wait_group.read 0 (wait for all groups)
Removed sMbarStore from SMEM allocations (no longer needed).
- fmha_multitile_capi.cu: C API wrapper for TMA multi-tile kernel
Creates TMA descriptors per (head, batch), launches kernel
- fmha_multitile_op.py: nvcc precompile + ctypes loader
- production.py: dispatch to multitile for N>128 or hd=512
- Reverted fmha_6warp_multihead.cuh to working single-tile version
- The TMA multi-tile kernel already passes 72 configs (D1.5)
HD=64/128/256/512 × T=1/4/32/128 × s_k=128/256/384/512
Single-tile path is IDENTICAL to the working pre-P5 kernel.
Multi-tile path uses FA2 online softmax with sOacc accumulator.
Runtime branch on is_multi_tile = (n_kv_tiles > 1).
- Kernel loops over KV tiles internally with running max/sum rescale
- SMEM accumulator sOacc[hd] replaces TMEM accumulation across tiles
- P is UN-NORMALIZED for multi-tile (exp(s-max), not /sum)
- Per KV tile: QK→softmax→PV→TMEM→read→add to sOacc
- Final: O = sOacc / running_sum
- Single tile (n_kv_tiles=1): same as before, no rescale
- Updated CAPI, Python loader, production.py fast path
- Added multi-tile test cases (N=256, 512)
Evidence: TMA loads succeed with 128B-aligned GMEM on all descriptor configs.
The bit-21 workaround was NOT needed. The 'misaligned address' crashes were
caused by passing non-128B-aligned GMEM pointers to cp.async.bulk.tensor.
Added docs/p4_tma_hang_resolution.md with root cause and fix.
Cleaned up stale P4 test files.
- test_p3_fast_decode.py: clean kernel test + full API test
- Removed debug tests (sanity, v_debug, v_ref_debug)
- Double normalization fix verified: kernel output matches reference
at cos >= 0.999990 across all MHA/MQA/GQA configs
P was already normalized in softmax step. PV = P_norm @ V gives the
correct attention output. Dividing by row_sum again in the epilogue
produces O = O_correct / row_sum (128x too small for uniform data).