Replace hand-constructed Ld32x32bOp/St32x32bOp TMEM round-trip with the
proven correction epilogue pattern from fused_swiglu.py:
1. O rescale (kt>0): TMEM→REGS (paired load), multiply by acc_scale,
REGS→TMEM (paired store via retile_to_S). The load and store use
matching layouts, so there is no layout mismatch.
2. Final O output: one-way TMEM→REGS→SMEM→GMEM using
epilogue_tmem_copy_and_partition + epilogue_smem_copy_and_partition
+ TMA partition. In registers, either normalize (divide by row_sum)
or, for the D5a path, cast the raw accumulator to BF16.
This fixes both D1.5 issues:
- Issue 1: TMEM round-trip corruption (hand-constructed atoms)
- Issue 2: O rescale for multi-KV-tile (kt>0)
Supports normalize=True (normalize in-kernel) and normalize=False
(D5a: normalization done externally).
Uses epilog_sync_bar + c_pipe for SMEM→GMEM, replacing epilogue_tma_store.
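
Reference sketch of the math behind steps 1 and 2, for one query row in
plain Python. This is not the kernel code: attend_row and its arguments
are hypothetical, and acc_scale = exp(old_max - new_max) is assumed from
the standard online-softmax recurrence (the kernel may compute it
differently, e.g. with exp2). The BF16 cast is not modeled.

```python
import math

def attend_row(scores, values, tile, normalize=True):
    """Online-softmax attention for one query row, processed in KV tiles.

    scores: list of float logits; values: list of equal-length float lists.
    Returns (o, row_sum); o is divided by row_sum only when normalize=True.
    """
    dim = len(values[0])
    o = [0.0] * dim            # running output accumulator (stands in for O)
    row_max = -math.inf
    row_sum = 0.0
    for kt, start in enumerate(range(0, len(scores), tile)):
        s = scores[start:start + tile]
        v = values[start:start + tile]
        new_max = max(row_max, max(s))
        if kt > 0:
            # O rescale: only needed from the second KV tile onward.
            acc_scale = math.exp(row_max - new_max)  # assumed definition
            o = [x * acc_scale for x in o]
            row_sum *= acc_scale
        p = [math.exp(x - new_max) for x in s]
        row_sum += sum(p)
        for p_j, v_j in zip(p, v):
            o = [x + p_j * y for x, y in zip(o, v_j)]
        row_max = new_max
    if normalize:
        # In-kernel style: divide by row_sum before writing out.
        o = [x / row_sum for x in o]
    return o, row_sum
```

With normalize=False the caller divides by the returned row_sum, which
should match the normalize=True result up to rounding.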