- TMA: issue two tma::copy calls per K-block (K_box=1, 2 SF K-columns) - UTCCP: double loop for 2 K-columns, correct SMEM offsets - TMEM: double SFA/SFB column counts (SF_BLOCK_M/32 * 2) - Heuristic: fix smem_size (2× SF, packed FP4 A/B, packed send buffers, no amax) - Staging kernel: fix double-count bug in packed_k_mask