DeepGEMM

Author	SHA1	Message	Date
biondizzle	b3d1aae038	feat: full FP4 activations for mxf4nvf4 - E2M1 packed A side + UE4M3 scales mxf4nvf4 requires BOTH A and B to be FP4 (E2M1 packed). Changes: - a_dtype_t: float_e4m3_t → float_e2m1_unpacksmem_t - UMMA_K: 32 → 64 (FP4 MMA atom) - L1 epilogue: FP8 quant → E2M1 FP4 quantization with nearest-neighbor - L1 output SMEM: packed E2M1 (2 per byte), TMA store uint8 - TMA descriptors: adjusted for FP4 packing (K/2 bytes per row) - SymmBuffer: uint8 activations, shape (M, K//2) - Staging kernel: BF16 → E2M1 packed + UE4M3 block16 scales	2026-05-11 20:29:08 +00:00
biondizzle	36b439ee26	feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales) - New CUDA kernel: sm100_fp8_nvfp4_mega_moe_impl - kGranK=16 (NVFP4 group_size=16, vs MXFP4's 32) - kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction - float_ue4m3_t scale factor type in instruction descriptor - SF layout: scale_vec::4X (4 TMEM sub-columns per UMMA atom) - UTCCP column stride: i8 (vs MXFP4's i4) for 4X layout - L1 epilogue: UE4M3 activation scales (float→cutlass::float_e4m3_t) - SF loading: kNumSFUint32 = kHidden/64 (4 UE4M3 per int32) - New PTX wrappers: SM100_MMA_MXF4NVF4_2x1SM_SS, SM100_MMA_MXF4NVF4_SS - Python API: - fp8_nvfp4_mega_moe() with recipe=(1,1,16) - transform_nvfp4_weights_for_mega_moe() for UE4M3→int32 UTCCP packing - _pack_nvfp4_sf_for_utccp() helper - C++ bindings: - mega_nvfp4.hpp with NVFP4-specific SymmBuffer (SF stride K/16) - JIT kernel header with kGranK=16 TMA descriptors - Registered in python_api.cpp NOTE: Both SFA and SFB must use UE4M3 (scale_format_ is 1-bit, shared). The L1 epilogue converts float→UE4M3 for activation scales.	2026-05-11 05:41:08 +00:00
Zhean Xu	891d57b4db	Add various optimizations and Mega MoE benchmarks (#316) * Merge with private repo * Add Mega MoE Benchmark * Minor fix * Update --------- Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>	2026-04-24 18:41:37 +08:00
Chenggang Zhao	7f2a703ed5	[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304) * Merge with private repo * Update README * Update README * Update README * Add PyTorch requirements * Fix sync scopes for MQA logits (#256) * Update README	2026-04-17 09:45:14 +08:00
Zhean Xu	0f5f266202	Multiple updates and refactorings (#280)	2026-01-16 17:06:52 +08:00
yurekami	6be0eb31d9	fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm The function sm90_bf16_k_grouped_gemm was incorrectly using SM100ArchSpec to calculate TMA descriptor block sizes. Since this file is the SM90 implementation, it should consistently use SM90ArchSpec like the other functions in this file (sm90_bf16_gemm, sm90_m_grouped_bf16_gemm_contiguous, etc.). This fixes a copy-paste error that could cause incorrect block size calculations on SM90 (Hopper) GPUs. Fixes #242 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2026-01-01 05:06:36 +09:00
Ray Wang	38f8ef73a4	Multiple updates and refactorings (#231)	2025-11-21 17:49:47 +08:00
Ray Wang	ec5e9ed0b8	Fix SM90 MQA logits (#229)	2025-11-19 10:50:36 +08:00
Ray Wang	2f9d87877e	Use larger MMA shape (#227)	2025-11-14 11:38:15 +08:00
Chenggang Zhao	07b82fb8cd	Fix old CUDA compatibility	2025-10-01 20:29:15 +08:00
Simon Mo	59f2c07cf2	Add SM100 kernels (#201) Signed-off-by: simon-mo <simon.mo@hey.com>	2025-09-29 17:07:28 +08:00
Chenggang Zhao	80ceeb2c76	Add SM90 kernels (#200)	2025-09-29 17:00:23 +08:00
Ray Wang	3f71de7aa9	Make various updates and fixes (#198)	2025-09-25 16:19:07 +08:00
yukuai26	79f48ee15a	Fix multicast bug and optimize masked GEMM (#193) * Fix multicast bug and profile masked GEMM * Updates and lint --------- Co-authored-by: Kuai Yu <yukuai@deepseek.com> Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>	2025-09-12 17:12:27 +08:00
Ray Wang	f85ec649d7	Make various updates and fixes: (#164) - Add BF16 support for SM90 and SM100 - Refactor Python APIs - Other fixes and code refactoring	2025-08-15 18:32:35 +08:00
zhonghui-J	3254b758e2	Polish get_best_configs modeling. (#158)	2025-08-14 16:50:21 +08:00
LJC00118	7b6b5563b9	Fix smxx layout assertion (#154)	2025-08-05 10:38:06 +08:00
Ray Wang	d9c363f86f	Make various updates and fixes: - Add support for legacy CUDA versions; now compatible with CUDA 12.3 and newer - Add support for NVRTC compilation - Other fixes and code refactoring	2025-08-02 19:52:22 -07:00
Chenggang Zhao	c50deed14c	Code lint	2025-07-30 10:39:30 +08:00
LJC00118	6bc75b549e	Fix smxx layout assertion (#141) * Fix assertion error in smxx_layout.hpp for mn % 4 != 0 cases * Fix assertion error in smxx_layout.hpp for mn % 4 != 0 cases * Align submodule files * Fix assertion error in smxx_layout.hpp for mn % 4 != 0 cases * fix(smxx_layout): support mn%4!=0 and num_groups>1 via torch * fix(smxx_layout): support mn%4!=0 and num_groups>1 via torch * fix: correct logic for entering get_mn_major_tma_aligned_packed_ue8m0_tensor_torch	2025-07-30 10:36:54 +08:00
Ray Wang	9da4a23561	Add more GPU architectures support (#112) * Add more GPU architectures support * Update layout.py * Optimize performance, Add SM90 support, Add 1D2D SM100 support * Add fmtlib submodule at commit 553ec11 --------- Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>	2025-07-18 11:32:22 +08:00

21 Commits
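
Note on b3d1aae038: that commit describes E2M1 activations packed two per byte with shape (M, K//2) and one scale per block of 16. The pure-Python sketch below illustrates that layout only; it is not the repository's staging kernel. The function names are hypothetical, the nibble order (even index in the low nibble) and tie-breaking toward the smaller magnitude are assumptions, and scales are kept as plain floats rather than encoded as UE4M3.

```python
# Magnitudes representable in E2M1; the list index equals the 3-bit code.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_e2m1(x: float) -> int:
    """Return a 4-bit code: bit 3 is the sign, bits 0-2 the nearest magnitude."""
    mag = min(abs(x), 6.0)  # saturate at the largest E2M1 magnitude
    # min() keeps the first minimum, so ties go to the smaller magnitude (assumption).
    idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
    sign = 0x8 if x < 0 else 0x0
    return sign | idx


def pack_e2m1(codes):
    """Pack two 4-bit codes per byte; even index in the low nibble (assumption)."""
    assert len(codes) % 2 == 0
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def quantize_row(row, block=16):
    """Quantize one row of K floats into K//2 packed bytes plus K//block scales."""
    assert len(row) % block == 0
    codes, scales = [], []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        amax = max(abs(v) for v in chunk)
        # Map the block's largest magnitude onto 6.0, the E2M1 maximum.
        scale = amax / 6.0 if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(quantize_e2m1(v / scale) for v in chunk)
    return pack_e2m1(codes), scales
```

A row with K = 32 yields 16 packed bytes and 2 scales, matching the K/2 bytes per row mentioned in the commit message.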