The correction epilog (a one-way TMEM→reg→SMEM→GMEM trip) is the right approach, but the TMA store from SMEM needs the tile to be partitioned properly, and that still needs more work. Reverting to the known-working state (with the 3% TMEM round-trip error) to focus on the SMEM-P write first.