Files
nvfp4-megamoe-kernel/tests
biondizzle ba31a1cedf Add per-tile O rescale (O *= acc_scale) to softmax loop
- Moves correction_rescale atom setup before softmax loop (needed for O rescale)
- Adds O *= acc_scale for kt > 0, before softmax_done_bar.arrive()
- Uses same paired Ld32x32bOp/St32x32bOp(corr_tile_size=16) atoms as final normalize
- Final normalize (O *= 1/row_sum) uses same atoms, no duplicate setup
- Fixes softmax loop to use self.n_kv_tiles (Python int) not n_kv_tiles (CuTeDSL symbolic)
- This should fix n=256 cos 0.71 → 0.9999
2026-05-23 00:22:12 +00:00
..