nvfp4-megamoe-kernel/src/nvfp4_megamoe_kernel
biondizzle 196ee37fdb fix: rewrite SF remap kernel — source-iterating with layout_sf(m, k_elem)
Ripped out idx2crd + flatten + get<> approach entirely. New kernel
iterates over source indices (m, k_group) and uses layout_sf(m, k_elem)
to compute the CUTLASS destination offset. CuTe handles nested shape
decomposition internally — no rank inspection needed.

The K coordinate is passed in element space (k_elem = k_group * SFVecSize),
as the layout expects. The kernel iterates over groups rather than individual
elements: all 16 elements within a group share one SF byte, so per-element
iteration would issue 16x redundant writes.

Grid size based on source count (MN * K_sf), not dest buffer size.
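
The indexing described above can be sketched on the host in plain C++. This is an illustration under stated assumptions, not the kernel from this commit: `ToyLayoutSF` is a hypothetical stand-in for the CuTe `layout_sf` (a simple column-major mapping rather than the real nested CUTLASS layout), `remap_sf` is an invented name, and the source buffer is assumed to be row-major with shape (MN, K_sf). The loop index plays the role of the global thread index, so its bound (MN * K_sf) corresponds to the grid being sized from the source count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 16 elements share one scale-factor (SF) byte, per the commit message.
constexpr int kSFVecSize = 16;

// Hypothetical stand-in for the CuTe layout_sf. It takes (m, k_elem) with K in
// element space and returns a destination offset. The real layout is a nested
// CUTLASS layout; this toy version simply stores SFs column-major.
struct ToyLayoutSF {
    int MN;
    std::size_t operator()(int m, int k_elem) const {
        const int k_group = k_elem / kSFVecSize;  // element space -> group index
        return static_cast<std::size_t>(k_group) * MN + m;
    }
};

// Source-iterating remap: one iteration per source SF byte. On the GPU each
// iteration would be one thread, with idx derived from blockIdx/threadIdx.
template <class Layout>
void remap_sf(const std::uint8_t* src, std::uint8_t* dst,
              int MN, int K_sf, const Layout& layout_sf) {
    const int total = MN * K_sf;  // source count, not destination buffer size
    for (int idx = 0; idx < total; ++idx) {
        const int m = idx / K_sf;                   // assumed row-major source
        const int k_group = idx % K_sf;
        const int k_elem = k_group * kSFVecSize;    // K coordinate in element space
        dst[layout_sf(m, k_elem)] = src[idx];       // one write per group
    }
}

// Small helper for trying the sketch: remaps a 3 x 2 block of SF bytes.
inline std::vector<std::uint8_t> remap_example() {
    const int MN = 3, K_sf = 2;
    const std::vector<std::uint8_t> src = {0, 1, 2, 3, 4, 5};
    std::vector<std::uint8_t> dst(src.size(), 0xFF);
    remap_sf(src.data(), dst.data(), MN, K_sf, ToyLayoutSF{MN});
    return dst;
}
```

With the toy layout the destination is exactly MN * K_sf bytes. A real CUTLASS SF layout may describe a larger, padded destination, which is one reason to bound the loop by the source count instead.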
2026-05-14 15:28:44 +00:00