Two fixes:
1. CuTe layout uses element-space K, not group-space K. k_group=3 with
SFVecSize=16 maps to k_elem=48 in the layout, not k=3.
Added an SFVecSize param to the remap kernel; it now multiplies k_sf
by SFVecSize before passing the result to layout_sf().
2. Zero-init CUTLASS dest buffer before remap. The layout pads to
tile boundaries (128x64), so dest is larger than M*K_sf. Padding
slots that the remap never writes held garbage, which caused
sporadic wrong results.
Also fixed grid size to use source count (M*K_sf), not dest size.
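
Sketch of the index and size arithmetic behind both fixes, in host-side
Python for illustration only. Assumptions not taken from the kernel:
the helper names are hypothetical, the 128x64 tile is read as 128 rows
by 64 elements along K, the remap launches one thread per source scale
factor, and threads_per_block=256 is a placeholder.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def sf_group_to_elem(k_sf, sf_vec_size):
    """Fix 1: convert a scale-factor group index to the element-space K
    coordinate that the layout expects."""
    return k_sf * sf_vec_size


def padded_sf_count(m, k_sf_count, sf_vec_size, tile_m=128, tile_k_elem=64):
    """Fix 2: number of scale-factor slots in the padded dest buffer.
    Assumes the tile is tile_m rows by tile_k_elem elements along K."""
    sf_per_tile_k = tile_k_elem // sf_vec_size
    padded_m = ceil_div(m, tile_m) * tile_m
    padded_k_sf = ceil_div(k_sf_count, sf_per_tile_k) * sf_per_tile_k
    return padded_m * padded_k_sf


def remap_grid_size(m, k_sf_count, threads_per_block=256):
    """Grid size from the source count (M*K_sf), one thread per source
    scale factor (assumed), not from the padded dest size."""
    return ceil_div(m * k_sf_count, threads_per_block)


if __name__ == "__main__":
    m, k_sf_count, sf_vec_size = 100, 3, 16
    print("k_elem for k_sf=3:", sf_group_to_elem(3, sf_vec_size))
    print("source count:", m * k_sf_count)
    print("padded dest count:", padded_sf_count(m, k_sf_count, sf_vec_size))
    print("grid size:", remap_grid_size(m, k_sf_count))
```

Under these assumptions, M=100 and K_sf=3 give 300 source entries but
512 dest slots, so 212 slots are never written by the remap and need
the zero-init.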