Root cause of ILLEGAL_INSTRUCTION: make_runtime_instr_desc_with_sf_id(instr_desc, k, k) passed sf_id=1 for k=1 (second UMMA atom), but mxf4nvf4 with scale_vec::4X requires sf_id=0 always: the hardware implicitly reads 4 SF positions per atom from a single TMEM region. Non-zero sf_id causes the hardware to access invalid TMEM offsets.

Also includes:
- UINT8 TMA for packed FP4 (avoids 16U4 driver bugs)
- NVFP4 SF pipeline: 2 K-columns per BLOCK_K for group_size=16
- MN-major SF interleaving for gate/up L1 weights
- Fix contiguous copy for SF byte view
- Preserve MN-major layout in SF interleave
- Force contiguous on SF tensors before C++ call
- Unpack weight tuples before printing
- Single transpose back to MN-major (don't double-transpose)
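
For reference, the sf_id rule and the "2 K-columns" figure above can be sketched as plain host-side C++. This is an illustration only, not the kernel code: sf_id_for_atom and sf_columns_per_block_k are hypothetical helper names, and BLOCK_K=128 with 4 one-byte scale factors packed per 32-bit word are assumptions that happen to reproduce the 2-column figure. The non-4X branch simply mirrors the original call, which passed k.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not in the source): the SF sub-index to place in the
// runtime instruction descriptor for the k-th UMMA atom.
// Per the commit message, with scale_vec::4X each atom already consumes all
// 4 SF positions of its region, so the index must stay 0 for every atom.
// The non-4X branch mirrors the original call, which passed k.
constexpr std::uint32_t sf_id_for_atom(bool scale_vec_4x, std::uint32_t k) {
    return scale_vec_4x ? 0u : k;
}

// Hypothetical helper (not in the source): number of packed SF columns that
// one BLOCK_K slice needs.
// ASSUMPTION: scale factors are one byte each and packed sfs_per_word to a
// 32-bit word; block_k is a multiple of group_size * sfs_per_word.
constexpr std::uint32_t sf_columns_per_block_k(std::uint32_t block_k,
                                               std::uint32_t group_size,
                                               std::uint32_t sfs_per_word) {
    return block_k / group_size / sfs_per_word;
}

// ASSUMPTION: BLOCK_K = 128. With group_size = 16 that is 8 scale factors
// per row, i.e. 2 packed columns; with group_size = 32 it is 1.
static_assert(sf_columns_per_block_k(128, 16, 4) == 2, "NVFP4, group_size=16");
static_assert(sf_columns_per_block_k(128, 32, 4) == 1, "group_size=32");

// The failing case from the commit message: k = 1 must still map to sf_id 0.
static_assert(sf_id_for_atom(true, 1) == 0, "4X: sf_id is always 0");
```

Under these assumptions the fix amounts to passing sf_id_for_atom(true, k) for both SF arguments instead of k.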