DeepGEMM/deep_gemm
biondizzle aa97a3f949 fix: correct TMEM column layout for scale_vec::4X
UTCCP 4x32dp128bit always writes 4 TMEM columns per 128-element group,
regardless of 1X vs 4X. 4X changes only how the MMA interprets the data,
not the UTCCP column count. Revert the 4X layout (column count *4, stride i*8) to match 1X (stride i*4):
- kNumSFATmemCols: SF_BLOCK_M/32 (was SF_BLOCK_M/32*4)
- kNumSFBTmemCols: SF_BLOCK_N/32 (was SF_BLOCK_N/32*4)
- UTCCP stride: i*4 (was i*8)
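
The arithmetic behind these three values can be sketched as below. This is a minimal illustration, not the DeepGEMM source: the helper names (`num_sf_tmem_cols`, `utccp_col_offset`, `kColsPerUtccpGroup`, `kElemsPerUtccpGroup`) are hypothetical, and the block sizes 128 and 256 are example inputs chosen only to check the numbers.

```cpp
#include <cstdint>

// Hypothetical helpers mirroring the commit message; not DeepGEMM's actual code.

// Per the commit: one UTCCP 4x32dp128bit copy covers a 128-element group
// and writes 4 TMEM columns, for both 1X and 4X.
constexpr uint32_t kElemsPerUtccpGroup = 128;
constexpr uint32_t kColsPerUtccpGroup = 4;

// TMEM columns needed for a scale-factor block of `sf_block` elements.
// 4 columns per 128 elements is the same as sf_block / 32.
constexpr uint32_t num_sf_tmem_cols(uint32_t sf_block) {
    return sf_block / 32;
}

// Column offset of the i-th UTCCP copy: stride i*4, not i*8.
constexpr uint32_t utccp_col_offset(uint32_t i) {
    return i * kColsPerUtccpGroup;
}

// A 128-element block is one group, so it needs exactly 4 columns.
static_assert(num_sf_tmem_cols(128) == 4, "one group -> 4 columns");

// A 256-element block is two groups; the second copy starts at column 4
// and ends exactly at the allocated column count (8).
static_assert(num_sf_tmem_cols(256) == 8, "two groups -> 8 columns");
static_assert(utccp_col_offset(256 / kElemsPerUtccpGroup - 1) + kColsPerUtccpGroup
                  == num_sf_tmem_cols(256),
              "stride i*4 packs the groups with no gap");
```

With the previous stride of i*8, the second copy in the 256-element example would start at column 8, which is past the 8 columns that SF_BLOCK/32 allocates.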
2026-05-11 23:44:12 +00:00