- Moves correction_rescale atom setup before softmax loop (needed for O rescale)
- Adds O *= acc_scale for kt > 0, before softmax_done_bar.arrive()
- Uses same paired Ld32x32bOp/St32x32bOp(corr_tile_size=16) atoms as final normalize
- Final normalize (O *= 1/row_sum) uses same atoms, no duplicate setup
- Fixes softmax loop to use self.n_kv_tiles (Python int) not n_kv_tiles (CuTeDSL symbolic)
- This should fix n=256 cos 0.71 → 0.9999