DeepGEMM

Author	SHA1	Message	Date
biondizzle	74bf612771	NVFP4 mega MoE: sf_id=0 fix for scale_vec::4X + UINT8 TMA + SF pipeline + interleaving Root cause of ILLEGAL_INSTRUCTION: make_runtime_instr_desc_with_sf_id(instr_desc, k, k) passed sf_id=1 for k=1 (second UMMA atom), but mxf4nvf4 with scale_vec::4X requires sf_id=0 always — the hardware implicitly reads 4 SF positions per atom from a single TMEM region. Non-zero sf_id causes the hardware to access invalid TMEM offsets. Also includes: - UINT8 TMA for packed FP4 (avoids 16U4 driver bugs) - NVFP4 SF pipeline: 2 K-columns per BLOCK_K for group_size=16 - MN-major SF interleaving for gate/up L1 weights - Fix contiguous copy for SF byte view - Preserve MN-major layout in SF interleave - Force contiguous on SF tensors before C++ call - Unpack weight tuples before printing - Single transpose back to MN-major (don't double-transpose)	2026-05-12 20:26:13 +00:00
biondizzle	49e5646b42	fix: remove duplicate kInt8 case — kPackedFP4 is already kInt8 kPackedFP4 = torch::kInt8, so the kInt8 case was a duplicate. The real fix was in mega_nvfp4.hpp: changing kUInt8→kInt8 so tensors match the existing kPackedFP4 path in the TMA switch.	2026-05-11 22:55:28 +00:00
biondizzle	80df24a641	fix: add kInt8 dtype support to TMA descriptor + change activation tensors to kInt8 - runtime_utils.hpp: added kInt8 -> CU_TENSOR_MAP_DATA_TYPE_UINT8 mapping - mega_nvfp4.hpp: changed activation tensor dtypes from kUInt8 to kInt8 (same byte layout, but kInt8 is recognized by the TMA dtype switch)	2026-05-11 22:54:47 +00:00
Chenggang Zhao	7f2a703ed5	[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304 ) * Merge with private repo * Update README * Update README * Update README * Add PyTorch requirements * Fix sync scopes for MQA logits (#256) * Update README	2026-04-17 09:45:14 +08:00
Zhean Xu	0f5f266202	Multiple updates and refactorings (#280 )	2026-01-16 17:06:52 +08:00
Ray Wang	38f8ef73a4	Multiple updates and refactorings (#231 )	2025-11-21 17:49:47 +08:00
Chenggang Zhao	07b82fb8cd	Fix old CUDA compatibility	2025-10-01 20:29:15 +08:00
Ray Wang	3f71de7aa9	Make various updates and fixes (#198 )	2025-09-25 16:19:07 +08:00
Ray Wang	9da4a23561	Add more GPU architectures support (#112 ) * Add more GPU architectures support * Update layout.py * Optimize performance, Add SM90 support, Add 1D2D SM100 support * Add fmtlib submodule at commit 553ec11 --------- Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>	2025-07-18 11:32:22 +08:00

9 Commits