nvfp4-megamoe-kernel

Files

biondizzle 196ee37fdb fix: rewrite SF remap kernel — source-iterating with layout_sf(m, k_elem)

Removed the idx2crd + flatten + get<> approach entirely. The new kernel
iterates over source indices (m, k_group) and uses layout_sf(m, k_elem)
to compute the CUTLASS destination offset. CuTe handles nested shape
decomposition internally, so no rank inspection is needed.

The K coordinate is passed in element space (k_group * SFVecSize), as the
layout expects. The kernel iterates over groups rather than individual
elements: all 16 elements within a group share one SF byte, so this
avoids 16x redundant writes.

Grid size is based on the source count (MN * K_sf), not the destination
buffer size.

2026-05-14 15:28:44 +00:00
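
The source-iterating remap described in this commit can be modeled on the host as a rough sketch. Everything here is illustrative: make_toy_layout_sf and remap_sf are hypothetical names, the toy layout is a stand-in and not the real CUTLASS scale-factor layout, and the source is assumed to be row-major (m, k_group). The only things taken from the commit message are the group size of 16, the element-space K coordinate, and the MN * K_sf iteration count.

```python
SF_VEC_SIZE = 16  # elements per scale-factor (SF) group, per the commit message


def make_toy_layout_sf(k_sf, tile_m=4):
    """Hypothetical stand-in for layout_sf; NOT the real CUTLASS layout.

    Maps (m, k_elem) to a destination offset where rows are grouped into
    tiles of tile_m and the row-within-tile index varies fastest.
    """
    def layout_sf(m, k_elem):
        k_group = k_elem // SF_VEC_SIZE      # 16 elements collapse to one SF
        tile, m_in_tile = divmod(m, tile_m)
        return (tile * k_sf + k_group) * tile_m + m_in_tile
    return layout_sf


def remap_sf(src, mn, k_sf, layout_sf, dst_size=None):
    """Scatter row-major SF bytes src[m * k_sf + k_group] into the layout."""
    dst = bytearray(dst_size if dst_size is not None else mn * k_sf)
    for idx in range(mn * k_sf):             # one "thread" per source SF byte
        m, k_group = divmod(idx, k_sf)
        k_elem = k_group * SF_VEC_SIZE       # element-space K coordinate
        dst[layout_sf(m, k_elem)] = src[idx]
    return dst


mn, k_sf = 8, 3
layout_sf = make_toy_layout_sf(k_sf)
src = bytes(range(mn * k_sf))
dst = remap_sf(src, mn, k_sf, layout_sf)
```

The loop bound is the number of source scale factors, so a destination buffer that is padded beyond MN * K_sf (passed here via the assumed dst_size parameter) does not change how many writes are issued.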

cutlass_nvfp4_gemm

fix: rewrite SF remap kernel — source-iterating with layout_sf(m, k_elem)

2026-05-14 15:28:44 +00:00

__init__.py

cleanup: remove abandoned TileLang and Mojo files

2026-05-14 12:44:47 +00:00

nvfp4_mega_moe.py

fix: unpack_ue4m3_u32 — uint32 lacks CUDA bitwise ops, use int32

2026-05-14 13:44:42 +00:00

symm_buffer.py

Initial: TileLang NVFP4 mega_moe kernel package

2026-05-13 15:44:51 +00:00

weight_transform.py

debug: add weight_scale_2 shape/value logging in weight transform

2026-05-14 14:19:35 +00:00