DeepGEMM/deep_gemm/mega
biondizzle 74bf612771 NVFP4 mega MoE: sf_id=0 fix for scale_vec::4X + UINT8 TMA + SF pipeline + interleaving
Root cause of ILLEGAL_INSTRUCTION: make_runtime_instr_desc_with_sf_id(instr_desc, k, k)
passed sf_id=1 for k=1 (the second UMMA atom), but mxf4nvf4 with scale_vec::4X requires
sf_id=0 for every atom: the hardware implicitly reads 4 SF positions per atom from a
single TMEM region, so a non-zero sf_id makes it access invalid TMEM offsets.
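
The rule above can be sketched in a few lines. This is an illustration in Python, not the kernel's C++ code: `select_sf_id` and its `scale_vec_4x` flag are hypothetical names, and the non-4X branch (passing the atom index through) is an assumption that simply mirrors the pre-fix call.

```python
def select_sf_id(k: int, scale_vec_4x: bool) -> int:
    """Pick the sf_id to encode in the instruction descriptor for UMMA atom k.

    Hypothetical helper, not part of DeepGEMM. Only the 4X rule comes from
    the commit message: with scale_vec::4X the sf_id must be 0 for every atom.
    """
    if k < 0:
        raise ValueError("atom index must be non-negative")
    if scale_vec_4x:
        # The hardware reads all 4 SF positions of the atom from one TMEM
        # region, so the id is never advanced.
        return 0
    # Assumption: other scale_vec modes keep the old behaviour (sf_id = k).
    return k


# The failing case from the message: second atom (k=1) under scale_vec::4X.
print(select_sf_id(1, scale_vec_4x=True))
```

Pinning the value in one helper, rather than at each call site, would keep a later atom loop from reintroducing a non-zero sf_id.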

Also includes:
- UINT8 TMA for packed FP4 (avoids 16U4 driver bugs)
- NVFP4 SF pipeline: 2 K-columns per BLOCK_K for group_size=16
- MN-major SF interleaving for gate/up L1 weights
- Fix contiguous copy for SF byte view
- Preserve MN-major layout in SF interleave
- Force contiguous on SF tensors before C++ call
- Unpack weight tuples before printing
- Single transpose back to MN-major (don't double-transpose)
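
The arithmetic behind the first two items can be sketched as follows. Everything here is an assumption made for illustration: the helper names are hypothetical, BLOCK_K is taken to be 128 elements, and one SF "K-column" is taken to hold 4 one-byte scale factors packed into a 32-bit word. Under those assumptions, group_size=16 gives the 2 K-columns per BLOCK_K mentioned above, and a packed FP4 row occupies K/2 bytes when viewed as UINT8.

```python
def sf_columns_per_block_k(block_k: int, group_size: int, sf_per_column: int = 4) -> int:
    """Number of packed SF columns covering one BLOCK_K slice.

    Assumption: one scale factor per `group_size` elements along K, and
    `sf_per_column` one-byte scale factors packed per column.
    """
    if block_k % group_size != 0:
        raise ValueError("block_k must be a multiple of group_size")
    num_sf = block_k // group_size
    # Round up so a partially filled column still counts as one column.
    return (num_sf + sf_per_column - 1) // sf_per_column


def packed_fp4_uint8_shape(rows: int, k: int) -> tuple:
    """Shape of a (rows, k) FP4 matrix when viewed as bytes (2 values per byte)."""
    if k % 2 != 0:
        raise ValueError("k must be even to pack two FP4 values per byte")
    return (rows, k // 2)


print(sf_columns_per_block_k(128, 16))   # group_size=16, assumed BLOCK_K=128
print(packed_fp4_uint8_shape(256, 128))  # byte view of a 256 x 128 FP4 tile
```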
2026-05-12 20:26:13 +00:00