- Remove hand-constructed TMEM round-trips (3% layout mismatch error)
- Use CUTLASS get_tmem_load_op + get_smem_store_op paired atoms
- One-way trip: TMEM -> reg (normalize) -> SMEM -> GMEM
- SMEM-P path: zero-fill stub (proper copy TBD)
- Keep per-tile O rescale atoms for n>128 support
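
Reviewer note: the sketch below is a plain-Python reference for the numerics that the per-tile O rescale and the final register-side normalize are presumably meant to reproduce. It assumes this is a flash-attention-style kernel using online softmax, which the change description does not state outright. The function names, the `tile` parameter, and the test data are hypothetical, and nothing here models TMEM, SMEM, or the CUTLASS copy atoms.

```python
import math


def attention_row_direct(q, keys, values):
    """One query row of softmax(q . K^T) @ V, computed in a single pass."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    denom = sum(p)
    dim = len(values[0])
    return [sum(pj * v[d] for pj, v in zip(p, values)) / denom for d in range(dim)]


def attention_row_tiled(q, keys, values, tile=128):
    """Same result, but keys/values are consumed tile by tile.

    The accumulator `o` stays unnormalized and is rescaled whenever the
    running max changes; division by the denominator happens once at the end.
    """
    m = -math.inf                   # running max of the scores seen so far
    denom = 0.0                     # running softmax denominator
    o = [0.0] * len(values[0])      # unnormalized output accumulator
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # evaluates to 0.0 on the first tile
        p = [math.exp(s - m_new) for s in scores]
        denom = denom * scale + sum(p)
        o = [oj * scale for oj in o]             # per-tile O rescale
        for pj, v in zip(p, v_tile):
            for d in range(len(o)):
                o[d] += pj * v[d]
        m = m_new
    return [oj / denom for oj in o]              # final normalize


def make_inputs(n=300, dim=4):
    """Deterministic test data with n > 128, so several tiles are visited."""
    q = [math.cos(0.5 * d) for d in range(dim)]
    keys = [[math.sin(0.37 * i + d) for d in range(dim)] for i in range(n)]
    values = [[math.cos(0.11 * i - d) for d in range(dim)] for i in range(n)]
    return q, keys, values
```

Comparing `attention_row_tiled` against `attention_row_direct` on inputs with more than 128 keys is one way to check that a rescale step has not drifted. It says nothing about layout correctness, which is what the paired atoms address.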