Fixed ALL loops to use self.n_kv_tiles (Python int) instead of
cute.size(gK, mode=[3]) which returned 1 for all n values.
Results:
n=128: cos 0.999998 ✅ PASS (single tile, full softmax + normalize)
n=256: cos 0.801156 (2 tiles, O rescale partially working)
n=512: CUDA launch failure (pipeline can't cycle past kv_stage=2)
The n=256 improvement (0.71 → 0.80) confirms:
1. TMA fix (None,0,None,0) loads both KV tiles correctly
2. Softmax processes both tiles with online row_max/row_sum tracking
3. O rescale (O *= acc_scale for kt > 0) is partially working
4. Final normalize (O *= 1/row_sum) works correctly
Remaining:
- n=256 cos 0.80 → 0.9999: O rescale precision issue
- n≥384: pipeline cycling (kv_stage=2 can only hold 2 tiles)
- Need to increase kv_stage or fix pipeline state cycling
The MMA loop (cutlass.range) and MMA consumer loop (range) also used
cute.size(gK, mode=[3]) which returns 1 for all n. Fixed all 3 loops:
1. TMA load loop (cutlass.range, line 215)
2. MMA consumer loop (range, line 231)
3. Softmax loop (range, line 324)
This was causing the deadlock — MMA only produced S[0] while softmax
waited for S[1].
cute.size(gK, mode=[3]) returns 1 for ALL n values — mode 3 is batch,
not KV tiles. self.n_kv_tiles = s_k // 128 is the correct Python int.
This is why softmax only processed kt=0 for all n.
Setup the correction_rescale atoms BEFORE the softmax loop so they can be
shared between per-tile O rescale and final normalize. Uses the working
2D register tensor pattern for final normalize. O rescale uses simple
1D rmem tensor per sub-tile (same as example10).
Previous O rescale attempt broke n=128 (0.464773).
Revert to known-good softmax code, only apply TMA fix:
tBgK[(None,None,0,0)] → tBgK[(None,0,None,0)]
Expected: n=128 cos 0.999998 (same as working), n=256 cos 0.71 (TMA fix loads 2 tiles but no O rescale)
The make_rmem_tensor(tTMEM_LOADcO.shape) creates a 1D tensor that doesn't
match the paired atom layout. The working pattern uses a 2D register tensor
with sub-tile composition (tTMrO_i_ = tTMrO[None, i] + composition).
- Moves correction_rescale atom setup before softmax loop (needed for O rescale)
- Adds O *= acc_scale for kt > 0, before softmax_done_bar.arrive()
- Uses same paired Ld32x32bOp/St32x32bOp(corr_tile_size=16) atoms as final normalize
- Final normalize (O *= 1/row_sum) uses same atoms, no duplicate setup
- Fixes softmax loop to use self.n_kv_tiles (Python int) not n_kv_tiles (CuTeDSL symbolic)
- This should fix n=256 cos 0.71 → 0.9999