- Allocation: cute::size(cute::filter_zeros(layout)) matches CUTLASS examples
- Kernel: layout_sf(make_coord(mn, k_sf*16, 0)), with no branching on LayoutRank
- Avoids silent fallthrough that wrote dst[0] for all threads
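
For illustration only, here is a minimal plain C++ sketch of the fallthrough pattern described above. It does not use CuTe. ToyLayout, offset_branching, and offset_uniform are hypothetical names, and the assumption that every 16 consecutive k elements share one scale factor is inferred from the k_sf*16 coordinate, not taken from the actual kernel.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a scale-factor layout (not cute::Layout).
// Assumption: 16 consecutive k elements map to the same scale factor.
struct ToyLayout {
    std::size_t stride_mn;
    std::size_t stride_sf;
    std::size_t operator()(std::size_t mn, std::size_t k, std::size_t /*l*/) const {
        return mn * stride_mn + (k / 16) * stride_sf;
    }
};

// Branching version: only Rank == 3 is handled. Any other rank keeps the
// default offset of 0, so every caller ends up writing to dst[0].
template <int Rank>
std::size_t offset_branching(const ToyLayout& layout, std::size_t mn, std::size_t k_sf) {
    std::size_t off = 0;
    if (Rank == 3) {
        off = layout(mn, k_sf * 16, 0);
    }
    return off;  // silent fallthrough for unhandled ranks
}

// Uniform version: one coordinate form for every caller, no rank branching.
std::size_t offset_uniform(const ToyLayout& layout, std::size_t mn, std::size_t k_sf) {
    return layout(mn, k_sf * 16, 0);
}

// Small self-check comparing the two versions.
void demo() {
    ToyLayout layout{4, 1};
    assert(offset_uniform(layout, 3, 2) == 14);       // 3*4 + (32/16)*1
    assert(offset_branching<3>(layout, 3, 2) == 14);  // handled rank agrees
    assert(offset_branching<4>(layout, 3, 2) == 0);   // unhandled rank collapses to 0
}
```

The point of the sketch is that the uniform version has no unhandled case, so there is no path on which the offset silently stays at 0.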