Previous attempts used tOtO0 (from pv_thr.make_fragment_C) and corrupted the data. This version uses tCtO_base (from pv_mma.make_fragment_C), which is the same tensor the epilogue already reads O from successfully. The load and store atoms are both built from the same tCtO_i, obtained via composition, following the CUTLASS correction_rescale pattern.
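
A toy model of why the choice of view may matter. This is plain Python, not CUTLASS code: the flat list standing in for tensor memory, both layout functions, and the per-row scale factors are hypothetical, and the actual failure mode of the earlier attempts may differ. It only shows that a per-row rescale done through a view that disagrees with the reader's view applies the wrong factor to some elements, while using one shared view for both load and store keeps the result consistent.

```python
ROWS, COLS = 4, 8


def mma_layout(row, col):
    # Hypothetical accumulator layout (row-major); stands in for the view
    # the epilogue reads O through.
    return row * COLS + col


def other_layout(row, col):
    # Hypothetical mismatched layout (column-major) over the same addresses;
    # stands in for a fragment built from a different partitioner.
    return col * ROWS + row


def make_tmem():
    # Fill the toy memory through the accumulator layout.
    tmem = [0.0] * (ROWS * COLS)
    for r in range(ROWS):
        for c in range(COLS):
            tmem[mma_layout(r, c)] = float(r * COLS + c + 1)
    return tmem


def rescale(tmem, layout, row_scale):
    # Load and store go through the SAME layout, so every address is
    # touched exactly once; only the row it is attributed to can be wrong.
    for r in range(ROWS):
        for c in range(COLS):
            addr = layout(r, c)
            tmem[addr] = tmem[addr] * row_scale[r]


def read_accumulator(tmem):
    # The "epilogue": always reads through the accumulator layout.
    return [[tmem[mma_layout(r, c)] for c in range(COLS)] for r in range(ROWS)]


row_scale = [1.0, 0.5, 0.25, 0.125]  # powers of two keep float math exact
expected = [
    [(r * COLS + c + 1) * row_scale[r] for c in range(COLS)]
    for r in range(ROWS)
]

good = make_tmem()
rescale(good, mma_layout, row_scale)    # same view as the reader

bad = make_tmem()
rescale(bad, other_layout, row_scale)   # view disagrees with the reader

matches_good = read_accumulator(good) == expected
matches_bad = read_accumulator(bad) == expected
```

In this sketch only the run that rescales through the reader's own layout reproduces the expected values; the mismatched run scales some elements by another row's factor.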