Fixed all loops to use self.n_kv_tiles (a Python int) instead of
cute.size(gK, mode=[3]), which returned 1 for every value of n.
Results:
n=128: cos 0.999998 ✅ PASS (single tile, full softmax + normalize)
n=256: cos 0.801156 (2 tiles, O rescale partially working)
n=512: CUDA launch failure (pipeline can't cycle past kv_stage=2)
The n=256 improvement (cos 0.71 → 0.80) indicates:
1. TMA fix (None,0,None,0) loads both KV tiles correctly
2. Softmax processes both tiles with online row_max/row_sum tracking
3. O rescale (O *= acc_scale for kt > 0) is partially working
4. Final normalize (O *= 1/row_sum) works correctly
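
For reference, a minimal pure-Python sketch of the online softmax that items 2-4 describe, for a single query row. This is not the kernel code: the function names and the list-based layout are hypothetical, and only row_max, row_sum, acc_scale, and kt mirror names used above. It may be useful as a host-side check of what the n=256 output should converge to.

```python
import math


def attention_row_reference(scores, values):
    """Non-tiled softmax-weighted sum of `values` for one query row."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]


def attention_row_tiled(scores, values, tile):
    """Online softmax over KV tiles of width `tile` for one query row."""
    dim = len(values[0])
    row_max = float("-inf")
    row_sum = 0.0
    out = [0.0] * dim
    n_kv_tiles = (len(scores) + tile - 1) // tile  # plain Python int

    for kt in range(n_kv_tiles):
        s_tile = scores[kt * tile:(kt + 1) * tile]
        v_tile = values[kt * tile:(kt + 1) * tile]
        new_max = max(row_max, max(s_tile))

        if kt > 0:
            # Everything accumulated so far was scaled by exp(-row_max);
            # rescale both row_sum and O to the new running max.
            acc_scale = math.exp(row_max - new_max)
            row_sum *= acc_scale
            out = [o * acc_scale for o in out]

        # Unnormalized probabilities for this tile under the new max.
        p = [math.exp(s - new_max) for s in s_tile]
        row_sum += sum(p)
        for w, v in zip(p, v_tile):
            for d in range(dim):
                out[d] += w * v[d]
        row_max = new_max

    # Final normalize: O *= 1 / row_sum.
    inv = 1.0 / row_sum
    return [o * inv for o in out]
```

In this sketch row_sum and O are rescaled by the same acc_scale before the new tile is accumulated. If the kernel rescales only one of them, or uses a row_max from the wrong tile, the result would drift from the reference even though each tile is loaded correctly.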
Remaining:
- n=256 cos 0.80 → 0.9999: O rescale precision issue
- n≥384 (3 or more KV tiles): pipeline cycling (kv_stage=2 can only hold 2 tiles)
- Need to increase kv_stage or fix pipeline state cycling
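
As a rough illustration of the second option, a host-side Python sketch of how a 2-stage buffer could serve more than 2 tiles when the slot index wraps and a phase bit flips on each wrap. StageState and simulate are hypothetical names for this sketch, not CuTe DSL APIs, and the model assumes the consumer releases a slot before the producer reuses it.

```python
from dataclasses import dataclass


@dataclass
class StageState:
    """Hypothetical pipeline state: slot index plus a phase bit."""
    stages: int
    index: int = 0
    phase: int = 0

    def advance(self):
        # Move to the next slot; wrap to slot 0 and flip the phase
        # after the last slot.
        self.index += 1
        if self.index == self.stages:
            self.index = 0
            self.phase ^= 1


def simulate(n_kv_tiles, kv_stage):
    """Return (tile, slot, phase) in the order tiles are consumed."""
    producer = StageState(kv_stage)
    consumer = StageState(kv_stage)
    slots = [None] * kv_stage  # None means the slot is free
    consumed = []
    next_tile = 0

    while len(consumed) < n_kv_tiles:
        # Producer: load tiles into free slots until the buffer is full.
        while next_tile < n_kv_tiles and slots[producer.index] is None:
            slots[producer.index] = next_tile
            next_tile += 1
            producer.advance()

        # Consumer: use one tile, then release its slot for reuse.
        tile = slots[consumer.index]
        consumed.append((tile, consumer.index, consumer.phase))
        slots[consumer.index] = None
        consumer.advance()

    return consumed
```

With simulate(4, 2), tiles 2 and 3 reuse slots 0 and 1 with the phase bit set to 1, so all four tiles pass through two stages in order. Whether the launch failure at n=512 comes from a missing wrap, a missing release, or something else is not established here.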