UTCCP 4x32dp128bit always writes 4 TMEM cols per 128-element group, regardless of 1X vs 4X.

4X only changes the MMA interpretation, not the UTCCP column count.

Reverted from (*4, stride i*8) to the same values as 1X (stride i*4):
- kNumSFATmemCols: SF_BLOCK_M/32 (was SF_BLOCK_M/32*4)
- kNumSFBTmemCols: SF_BLOCK_N/32 (was SF_BLOCK_N/32*4)
- UTCCP stride: i*4 (was i*8)
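
The arithmetic behind the revert can be checked with a small compile-time sketch. This is illustrative only: the helper names (num_sf_tmem_cols, utccp_col_offset) and the example block sizes are assumptions, not code from the kernel. Only the formulas (SF_BLOCK/32 columns, column offset i*4, 4 columns per 128-element group) come from this change.

```cpp
#include <cstdint>

// Stated above: one UTCCP 4x32dp128bit copy writes 4 TMEM columns
// per 128-element group, for both 1X and 4X.
constexpr uint32_t kColsPerUtccpGroup  = 4;
constexpr uint32_t kElemsPerUtccpGroup = 128;

// Hypothetical helper: TMEM columns reserved for a scale-factor block.
constexpr uint32_t num_sf_tmem_cols(uint32_t sf_block) { return sf_block / 32; }

// Hypothetical helper: TMEM column offset of the i-th 128-element group.
constexpr uint32_t utccp_col_offset(uint32_t i) { return i * 4; }

// Example block sizes, assumed for illustration only.
constexpr uint32_t SF_BLOCK_M = 128;
constexpr uint32_t SF_BLOCK_N = 256;

constexpr uint32_t kNumSFATmemCols = num_sf_tmem_cols(SF_BLOCK_M);
constexpr uint32_t kNumSFBTmemCols = num_sf_tmem_cols(SF_BLOCK_N);

static_assert(kNumSFATmemCols == 4, "128 elements -> 4 TMEM columns");
static_assert(kNumSFBTmemCols == 8, "256 elements -> 8 TMEM columns");

// With stride i*4 the groups tile the allocation exactly: the last group
// ends on the last reserved column, with no gap and no overrun.
static_assert(utccp_col_offset(SF_BLOCK_N / kElemsPerUtccpGroup - 1)
                  + kColsPerUtccpGroup == kNumSFBTmemCols,
              "groups must exactly fill the reserved columns");
```

With the previous stride of i*8, the second group in this example would start at column 8, which is already past the 8 columns that SF_BLOCK_N/32 reserves.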