Commit Graph

9 Commits

Author SHA1 Message Date
74bf612771 NVFP4 mega MoE: sf_id=0 fix for scale_vec::4X + UINT8 TMA + SF pipeline + interleaving
Root cause of ILLEGAL_INSTRUCTION: the call make_runtime_instr_desc_with_sf_id(instr_desc, k, k)
passed sf_id=1 for k=1 (the second UMMA atom), but mxf4nvf4 with scale_vec::4X always requires
sf_id=0: the hardware implicitly reads 4 SF positions per atom from a single
TMEM region, so a non-zero sf_id makes it access invalid TMEM offsets.
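
The root cause above can be illustrated with a small toy model. This is only a sketch of the selection rule, not the kernel code: `select_sf_id` and the mode strings are hypothetical names, and the behaviour for modes other than 4X is an assumption taken from the pre-fix call pattern `(instr_desc, k, k)`.

```python
# Toy model of the SF ID choice described in the commit message.
# `select_sf_id` and the mode strings are hypothetical, for illustration only.

def select_sf_id(atom_index: int, scale_vec: str) -> int:
    """Return the SF ID to encode in the runtime instruction descriptor."""
    if scale_vec == "4X":
        # Per the commit message: with scale_vec::4X the hardware reads all
        # 4 SF positions per atom from one TMEM region, so the ID stays 0.
        return 0
    # Assumption: other modes keep the pre-fix behaviour (sf_id == atom index).
    return atom_index

NUM_ATOMS = 2  # two UMMA atoms, k = 0 and k = 1, as in the commit message

buggy_ids = [k for k in range(NUM_ATOMS)]                      # pre-fix: sf_id = k
fixed_ids = [select_sf_id(k, "4X") for k in range(NUM_ATOMS)]  # post-fix

print("pre-fix sf_id per atom: ", buggy_ids)
print("post-fix sf_id per atom:", fixed_ids)
```

In this model the pre-fix list contains a non-zero ID for the second atom, which is the case the commit identifies as triggering the illegal instruction.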

Also includes:
- UINT8 TMA for packed FP4 (avoids 16U4 driver bugs)
- NVFP4 SF pipeline: 2 K-columns per BLOCK_K for group_size=16
- MN-major SF interleaving for gate/up L1 weights
- Fix contiguous copy for SF byte view
- Preserve MN-major layout in SF interleave
- Force contiguous on SF tensors before C++ call
- Unpack weight tuples before printing
- Single transpose back to MN-major (don't double-transpose)
2026-05-12 20:26:13 +00:00
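
The packed-FP4 and SF pipeline items in the list above come down to simple size arithmetic, sketched below. Several values are assumptions not stated in the commit: a BLOCK_K of 128 elements, one 1-byte scale factor per group, and 4 scale-factor bytes packed into each 32-bit K-column. The helper names are hypothetical.

```python
# Size arithmetic for packed FP4 data and its scale factors (SF).
# Assumptions (not stated in the commit): BLOCK_K = 128, 1-byte scale factors,
# and 4 SF bytes packed into one 32-bit "K-column".
SF_BYTES_PER_COLUMN = 4

def packed_fp4_bytes(num_elements: int) -> int:
    """Bytes needed when two 4-bit values share one byte (the UINT8 view)."""
    if num_elements % 2 != 0:
        raise ValueError("packed FP4 needs an even number of elements")
    return num_elements // 2

def sf_columns_per_block(block_k: int, group_size: int) -> int:
    """Number of 32-bit SF columns covering one BLOCK_K slice of a row."""
    if block_k % group_size != 0:
        raise ValueError("block_k must be a multiple of group_size")
    groups = block_k // group_size
    # Ceiling division: a partially filled column still occupies a full column.
    return -(-groups // SF_BYTES_PER_COLUMN)

BLOCK_K = 128  # assumed value

print(packed_fp4_bytes(BLOCK_K))          # bytes per row slice in the UINT8 view
print(sf_columns_per_block(BLOCK_K, 16))  # group_size=16 (NVFP4)
print(sf_columns_per_block(BLOCK_K, 32))  # group_size=32, for comparison
```

Under these assumptions, group_size=16 yields 8 scale factors per BLOCK_K and therefore 2 columns, which is consistent with the "2 K-columns per BLOCK_K" item.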
49e5646b42 fix: remove duplicate kInt8 case — kPackedFP4 is already kInt8
kPackedFP4 = torch::kInt8, so the kInt8 case was a duplicate.
The real fix was in mega_nvfp4.hpp: changing kUInt8→kInt8 so
tensors match the existing kPackedFP4 path in the TMA switch.
2026-05-11 22:55:28 +00:00
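
Why the extra case was a duplicate can be shown with a minimal model of a dtype dispatch table. The constants and the `build_dispatch` helper are hypothetical stand-ins, and the map-type strings are placeholders rather than real driver enum values. In C++ a repeated `case` label is rejected at compile time; this sketch performs the equivalent check at run time.

```python
# Hypothetical stand-ins for the dtype constants named in the commit message.
K_INT8 = "int8"
K_PACKED_FP4 = K_INT8  # alias, per the commit: kPackedFP4 = torch::kInt8

def build_dispatch(cases):
    """Build a dtype -> tensor-map-type table, rejecting repeated dtypes."""
    table = {}
    for dtype, tmap_type in cases:
        if dtype in table:
            raise ValueError(f"duplicate case for dtype {dtype!r}")
        table[dtype] = tmap_type
    return table

# One case per distinct dtype works.
ok_table = build_dispatch([(K_PACKED_FP4, "TMAP_TYPE_PLACEHOLDER")])

# Listing both the alias and its target collides on the same key.
try:
    build_dispatch([
        (K_PACKED_FP4, "TMAP_TYPE_PLACEHOLDER"),
        (K_INT8, "TMAP_TYPE_PLACEHOLDER"),
    ])
    duplicate_detected = False
except ValueError as err:
    duplicate_detected = True
    print(err)
```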
80df24a641 fix: add kInt8 dtype support to TMA descriptor + change activation tensors to kInt8
- runtime_utils.hpp: added kInt8 -> CU_TENSOR_MAP_DATA_TYPE_UINT8 mapping
- mega_nvfp4.hpp: changed activation tensor dtypes from kUInt8 to kInt8
  (same byte layout, but kInt8 is recognized by the TMA dtype switch)
2026-05-11 22:54:47 +00:00
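
The "same byte layout" remark can be checked with a short standard-library sketch: reading the same bytes as signed or unsigned 8-bit integers changes only the interpretation, not the storage. The sample byte values are arbitrary.

```python
import struct

# Four arbitrary raw bytes, e.g. a slice of a packed tensor buffer.
raw = bytes([0x00, 0x7F, 0x80, 0xFF])

# Interpret the same storage as unsigned and as signed 8-bit integers.
as_uint8 = list(struct.unpack("4B", raw))
as_int8 = list(struct.unpack("4b", raw))

print(as_uint8)  # [0, 127, 128, 255]
print(as_int8)   # [0, 127, -128, -1]

# Writing the signed values back reproduces the original bytes exactly,
# so a byte-wise copy engine sees identical data under either dtype label.
round_trip = struct.pack("4b", *as_int8)
print(round_trip == raw)  # True
```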
Chenggang Zhao
7f2a703ed5 [Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304)
* Merge with private repo

* Update README

* Update README

* Update README

* Add PyTorch requirements

* Fix sync scopes for MQA logits (#256)

* Update README
2026-04-17 09:45:14 +08:00
Zhean Xu
0f5f266202 Multiple updates and refactorings (#280) 2026-01-16 17:06:52 +08:00
Ray Wang
38f8ef73a4 Multiple updates and refactorings (#231) 2025-11-21 17:49:47 +08:00
Chenggang Zhao
07b82fb8cd Fix old CUDA compatibility 2025-10-01 20:29:15 +08:00
Ray Wang
3f71de7aa9 Make various updates and fixes (#198) 2025-09-25 16:19:07 +08:00
Ray Wang
9da4a23561 Add more GPU architectures support (#112)
* Add more GPU architectures support

* Update layout.py

* Optimize performance, Add SM90 support, Add 1D2D SM100 support

* Add fmtlib submodule at commit 553ec11

---------

Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
2025-07-18 11:32:22 +08:00