The forward-map approach (src -> layout_sf(m, k)) failed because CuTe's
layout operator requires coordinates matching the nested shape rank, and
passing flat (int, int) to a ((32,4),K) shape triggers Mismatched Ranks.
New approach: iterate over CUTLASS dest indices, use idx2crd to get the
hierarchical coordinate, flatten it, then extract logical (m, k_sf) by
interpreting the flattened sub-coordinates correctly:
flat[0..2] = (inner_M, sub_M, tile_M) -> m = tile_M*128 + inner_M*4 + sub_M
flat[3..5] = (inner_K, sub_K, tile_K) -> k_sf = tile_K*4 + sub_K
(inner_K is within one SF group — same byte, so ignored for k_sf)
Previous bug: get<0> and get<1> of flatten gave (inner_M, sub_M) — both
M sub-indices. K information was never extracted, so only k_group=0 worked.
Dest buffer is zero-initialized so padding slots (where m >= MN or
k_sf >= K_sf) stay zero.