DeepGEMM

Author	SHA1	Message	Date
biondizzle	30d72e7ef5	fix: packed FP4 for mxf4nvf4 — correct SMEM layout, UMMA descriptors, L1 epilogue Key changes: - a_dtype_t/b_dtype_t: float_e2m1_t (packed 4-bit) with sizeof_bits_v==4 assert - kSwizzleAMode/BMode: BLOCK_K/2 (64 bytes packed, not 128 unpacked) - SMEM sizes: LOAD_BLOCK_M * BLOCK_K / 2 (packed byte count) - Token layouts: kHidden/2, kIntermediateHidden/2 (packed bytes) - TMA loads: BLOCK_K/2 inner dim, uint8_t, byte offsets k_block_idx*(BLOCK_K/2) - UMMA descriptors: BLOCK_K/2 template param, uint8_t dtype, UMMA_K/2 advance - L1 epilogue: dropped STSM, direct st.shared.u16 with packed nibbles, no swizzle (v1) - Pybind buffer sizes: hidden/2, intermediate_hidden/2 with packed tensor shapes - Host TMA descriptors: hidden/2 K-dims, block_k/2 inner, fp4_unpacked_smem=false - L1 output TMA: block_n/4 inner, no swizzle (CU_TENSOR_MAP_SWIZZLE_NONE)	2026-05-11 21:59:21 +00:00
biondizzle	0ac73a82f9	fix: L1 output uses unpacked E2M1 (1 byte/element) like FP8 - float_e2m1_unpacksmem_t: sizeof=1, SMEM is 1 byte/element (not packed) - TMA load unpacks 2 E2M1/global-byte → 2 SMEM bytes - UMMA reads unpacked SMEM, packs internally for mxf4nvf4 - L1→L2 handoff: unpacked format (same byte count as FP8) - Epilogue: 4 E2M1 bytes per uint32 STSM atom, same as FP8 - Dispatch TMA: kHidden bytes (unpacked), not kHidden/2 - Added static_assert on sizeof(a_dtype_t) and sizeof(b_dtype_t) - Note: no bandwidth savings at L1→L2 boundary for v1	2026-05-11 21:27:35 +00:00
biondizzle	091b974736	fix: L1 epilogue uses STSM with XOR swizzle for E2M1 FP4 output Keep STSM (not naive SMEM write) so TMA reads correct bank layout. Pack 4 E2M1 nibbles into uint32 per STSM atom with XOR swizzle. Known perf note: 32B swizzle zone for L1 output (land for v1).	2026-05-11 20:57:34 +00:00
biondizzle	a554de8b24	fix: dispatch TMA byte counts for FP4 (kHidden/2), rename fp8→fp4 layout refs	2026-05-11 20:47:58 +00:00
biondizzle	b3d1aae038	feat: full FP4 activations for mxf4nvf4 - E2M1 packed A side + UE4M3 scales mxf4nvf4 requires BOTH A and B to be FP4 (E2M1 packed). Changes: - a_dtype_t: float_e4m3_t → float_e2m1_unpacksmem_t - UMMA_K: 32 → 64 (FP4 MMA atom) - L1 epilogue: FP8 quant → E2M1 FP4 quantization with nearest-neighbor - L1 output SMEM: packed E2M1 (2 per byte), TMA store uint8 - TMA descriptors: adjusted for FP4 packing (K/2 bytes per row) - SymmBuffer: uint8 activations, shape (M, K//2) - Staging kernel: BF16 → E2M1 packed + UE4M3 block16 scales	2026-05-11 20:29:08 +00:00
biondizzle	2cd86ff5e7	fix: UE8M0→float32 reinterpret in fold_global_scale (Bug #7)	2026-05-11 19:40:01 +00:00
biondizzle	47621bb990	add NVFP4SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe Python wrapper The C++ binding was registered but there was no Python wrapper. vLLM patch imports get_symm_buffer_for_nvfp4_mega_moe from deep_gemm.mega.	2026-05-11 16:25:08 +00:00
biondizzle	86a1263f44	fix: gran_k=16 in transform_sf + sm_100a arch for NVFP4 mega_moe - transform_sf_into_required_layout: add gran_k=16 branch for NVFP4 UE4M3 scales (4 per int32, group_size=16). Previously only handled 32/128. - get_arch: always return '100a' for SM100, never '100f'. The family variant lacks mxf4nvf4 (NVFP4 block-scaled MMA) support, causing 'scale_vec::4X not supported on sm_100f' errors. - transform_nvfp4_weights_for_mega_moe: fold weight_scale_2 into block scales, pack UE4M3→int32, transpose MN-major, call transform_sf_into_required_layout with gran_k=16.	2026-05-11 16:11:11 +00:00
biondizzle	fbdddaccf4	revert: restore mxf4nvf4/block16 code (correct path for sm_100a) Reverted to commit 36b439e's NVFP4 kernel code: - kGranK=16, mxf4nvf4.block_scale.scale_vec::4X - float_ue4m3_t instruction descriptor - Block16 SF layout (4X TMEM) - UE4M3 L1 epilogue - No UE4M3→UE8M0 conversion, no block16→block32 merge The mxf4nvf4.scale_vec::4X PTX instruction compiles successfully on both sm_100 and sm_100f with CUDA 13.0. The previous build 17 error was likely from a different cause, not the arch flag. Python: reverted transform_nvfp4_weights_for_mega_moe to use pack_ue4m3_to_int32 with gran_k=16, no UE8M0 conversion.	2026-05-11 15:02:47 +00:00
biondizzle	57c629ed1b	fix: cast to int32 before >> 23 (uint32 doesn't support right-shift)	2026-05-11 09:45:54 +00:00
biondizzle	6d7231a50e	fix: reinterpret float32 bits as uint32 before >> 23 for UE8M0	2026-05-11 09:42:03 +00:00
biondizzle	03b8c99ee1	fix: use mxf8f6f4 (UE8M0) on SM100 — mxf4nvf4 requires SM103+ B200 (SM100) does NOT support kind::mxf4nvf4 at all (neither 2X nor 4X). Only mxf8f6f4.block_scale with UE8M0 scales is available on SM100. Strategy: keep NVFP4 E2M1 weights, convert UE4M3 block scales → UE8M0 in the weight transformation. This is a scale format adaptation for hardware compatibility, not a format conversion. Changes: - Kernel: back to mxf8f6f4 instruction + float_ue8m0_t descriptor - L1 epilogue: back to UE8M0 (>> 23) activation scales - Python: merge block16→block32, convert UE4M3→float32→UE8M0 - Packing: uint8 (UE8M0) → int32, same as MXFP4	2026-05-11 09:28:45 +00:00
biondizzle	cd7a612175	debug: add shape logging to SF packing	2026-05-11 08:54:14 +00:00
biondizzle	dcebe033e2	fix: use scale_vec::2X (block32) for SM100 B200 compatibility scale_vec::4X (block16) requires SM103/SM120 (B300/GB300), not SM100 (B200). Revert to block32 with UE4M3 scales. Same TMEM layout as MXFP4 but with UE4M3 scale format instead of UE8M0. Changes: - kGranK: 16 → 32 - PTX: scale_vec::4X → scale_vec::2X - SF layout: same as MXFP4 (K/32, K/128 for int32 packed) - UTCCP: i8 → i4 (2X layout, same as MXFP4) - TMEM columns: same as MXFP4 (SF_BLOCK_M/32, SF_BLOCK_N/32) - Python: merge NVFP4 block16→block32 scales (max of adjacent pairs) - recipe: (1,1,16) → (1,1,32)	2026-05-11 08:36:59 +00:00
biondizzle	deff80c9c1	fix: add Python wrapper for NVFP4 SymmBuffer allocation get_symm_buffer_for_nvfp4_mega_moe uses _C.get_symm_buffer_size_for_nvfp4_mega_moe to allocate the correct buffer size (2x SF entries due to group_size=16). Custom init to avoid SymmBuffer's hardcoded MXFP4 allocation.	2026-05-11 08:05:21 +00:00
biondizzle	8d02eb38fa	fix: transpose SF to MN-major layout before TMA stride checks transform_sf_into_required_layout expects MN-major input (stride(-2)=1). Our packed int32 SF is K-major (stride(-1)=1). Transpose the last two dims, make contiguous, then transpose back so data is in MN-major order.	2026-05-11 07:32:10 +00:00
biondizzle	7154500f22	fix: reshape SF to 2D before transform_sf_into_required_layout The C++ check_sf_layout stride assertion fails on 3D (experts, mn, K//64) tensors. Reshape to 2D (experts*mn, K//64) before calling the transform function, matching the expected stride layout.	2026-05-11 07:30:54 +00:00
biondizzle	388fd8dcfd	fix: pack UE4M3 into int32 before transform_sf_into_required_layout The C++ transform function expects int32 (for kInt type) with 4 UE4M3 bytes packed per int32. We pack first, then transform for TMA alignment and UTCCP transpose with recipe (1, 16).	2026-05-11 07:05:11 +00:00
biondizzle	acae75e109	fix: use transform_sf_into_required_layout for proper TMA-aligned SF Instead of custom _pack_nvfp4_sf_for_utccp, use DeepGEMM's C++ transform_sf_into_required_layout with recipe (1, 1, 16) for NVFP4. This handles TMA alignment and UTCCP transpose correctly.	2026-05-11 06:54:34 +00:00
biondizzle	5cb4fcaef3	fix: cast uint8 weights to int8 (kPackedFP4) for DeepGEMM compatibility	2026-05-11 06:36:32 +00:00
biondizzle	bbf9a5f46a	feat: fold weight_scale_2 into block scales in NVFP4 transform - transform_nvfp4_weights_for_mega_moe now accepts weight_scale_2 - Folds global scale into block scales: UE4M3 * FP32 -> UE4M3 - Dequantize to f32, multiply by global scale, clamp [0,448], re-quantize - This is needed because the kernel only applies one level of block scaling	2026-05-11 05:42:16 +00:00
biondizzle	36b439ee26	feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales) - New CUDA kernel: sm100_fp8_nvfp4_mega_moe_impl - kGranK=16 (NVFP4 group_size=16, vs MXFP4's 32) - kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction - float_ue4m3_t scale factor type in instruction descriptor - SF layout: scale_vec::4X (4 TMEM sub-columns per UMMA atom) - UTCCP column stride: i8 (vs MXFP4's i4) for 4X layout - L1 epilogue: UE4M3 activation scales (float→cutlass::float_e4m3_t) - SF loading: kNumSFUint32 = kHidden/64 (4 UE4M3 per int32) - New PTX wrappers: SM100_MMA_MXF4NVF4_2x1SM_SS, SM100_MMA_MXF4NVF4_SS - Python API: - fp8_nvfp4_mega_moe() with recipe=(1,1,16) - transform_nvfp4_weights_for_mega_moe() for UE4M3→int32 UTCCP packing - _pack_nvfp4_sf_for_utccp() helper - C++ bindings: - mega_nvfp4.hpp with NVFP4-specific SymmBuffer (SF stride K/16) - JIT kernel header with kGranK=16 TMA descriptors - Registered in python_api.cpp NOTE: Both SFA and SFB must use UE4M3 (scale_format_ is 1-bit, shared). The L1 epilogue converts float→UE4M3 for activation scales.	2026-05-11 05:41:08 +00:00
Zhean Xu	891d57b4db	Add various optimizations and Mega MoE benchmarks (#316) * Merge with private repo * Add Mega MoE Benchmark * Minor fix * Update --------- Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>	2026-04-24 18:41:37 +08:00
Chenggang Zhao	7f2a703ed5	[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304) * Merge with private repo * Update README * Update README * Update README * Add PyTorch requirements * Fix sync scopes for MQA logits (#256) * Update README	2026-04-17 09:45:14 +08:00
Ray Wang	d30fc36c8f	Fix sync issue of TMEM alloc/dealloc (#292)	2026-03-22 16:41:28 +08:00
Xin Qiu	35c4bc8771	fix: k_grouped_fp8_gemm_nt_contiguous crashes with n = 768 on H100 (#238)	2026-02-25 10:13:54 +08:00
Ray Wang	477618cd51	Fix a sync issue in SM100 MQA logits (#285)	2026-02-03 17:29:49 +08:00
Zhean Xu	0f5f266202	Multiple updates and refactorings (#280)	2026-01-16 17:06:52 +08:00
Ray Wang	38f8ef73a4	Multiple updates and refactorings (#231)	2025-11-21 17:49:47 +08:00
Zhean Xu	bb4424aad4	Fix sum_k * shape_m overflow	2025-11-19 11:51:36 +08:00
Ray Wang	ec5e9ed0b8	Fix SM90 MQA logits (#229)	2025-11-19 10:50:36 +08:00
Ray Wang	2f9d87877e	Use larger MMA shape (#227)	2025-11-14 11:38:15 +08:00
oliver könig	9f196058ae	chore: Build and store bdist wheels (#181) * build: Minor tweeks for wheel build Signed-off-by: oliver könig <okoenig@nvidia.com> * ci: Workflows for wheel build Signed-off-by: oliver könig <okoenig@nvidia.com> * fix Signed-off-by: oliver könig <okoenig@nvidia.com> * fix Signed-off-by: oliver könig <okoenig@nvidia.com> * build: Add CachedWheel Signed-off-by: oliver könig <okoenig@nvidia.com> * add version to init Signed-off-by: oliver könig <okoenig@nvidia.com> * revert Signed-off-by: oliver könig <okoenig@nvidia.com> * revert Signed-off-by: oliver könig <okoenig@nvidia.com> * revert Signed-off-by: oliver könig <okoenig@nvidia.com> * v2 Signed-off-by: oliver könig <okoenig@nvidia.com> * update Signed-off-by: oliver könig <okoenig@nvidia.com> * test Signed-off-by: oliver könig <okoenig@nvidia.com> * from packaging.version import parse Signed-off-by: oliver könig <okoenig@nvidia.com> * local version Signed-off-by: oliver könig <okoenig@nvidia.com> * remove file Signed-off-by: oliver könig <okoenig@nvidia.com> * revert Signed-off-by: oliver könig <okoenig@nvidia.com> * Updates and lint * revert missing cudaextension args Signed-off-by: oliver könig <okoenig@nvidia.com> * Add timeout * fix DG settings Signed-off-by: oliver könig <okoenig@nvidia.com> * DG_USE_LOCAL_VERSION Signed-off-by: oliver könig <okoenig@nvidia.com> * Update version * Detect local changes * Minor fix * Revert CUTLASS * Unify options --------- Signed-off-by: oliver könig <okoenig@nvidia.com> Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>	2025-10-10 18:23:40 +08:00
Chenggang Zhao	c1bf4cae4b	Fix version	2025-10-01 20:31:27 +08:00
Chenggang Zhao	07b82fb8cd	Fix old CUDA compatibility	2025-10-01 20:29:15 +08:00
Simon Mo	59f2c07cf2	Add SM100 kernels (#201) Signed-off-by: simon-mo <simon.mo@hey.com>	2025-09-29 17:07:28 +08:00
Chenggang Zhao	80ceeb2c76	Add SM90 kernels (#200)	2025-09-29 17:00:23 +08:00
Ray Wang	3f71de7aa9	Make various updates and fixes (#198)	2025-09-25 16:19:07 +08:00
zhonghui-J	2da871e304	Fix grouped gemms performance issue. (#168)	2025-08-22 17:35:43 +08:00
Chenggang Zhao	e38c2e3103	Remove comments	2025-08-22 17:32:04 +08:00
Chenggang Zhao	f20256fd50	Compatible with CUDA 13	2025-08-22 17:30:47 +08:00
xiweny	affdb1cd90	Add sm_100f support and make nvcc 13 happy (#157) Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>	2025-08-22 17:19:32 +08:00
Ray Wang	f85ec649d7	Make various updates and fixes: (#164) - Add BF16 support for SM90 and SM100 - Refactor Python APIs - Other fixes and code refactoring	2025-08-15 18:32:35 +08:00
Ray Wang	d9c363f86f	Make various updates and fixes: - Add support for legacy CUDA versions; now compatible with CUDA 12.3 and newer - Add support for NVRTC compilation - Other fixes and code refactoring	2025-08-02 19:52:22 -07:00
yukuai26	aff9da0aba	Fix SM90 GEMM (#149) * Fix sm90 GEMM * Fix typo --------- Co-authored-by: Kuai Yu <yukuai@deepseek.com>	2025-08-01 10:36:49 +08:00
Ray Wang	9da4a23561	Add more GPU architectures support (#112) * Add more GPU architectures support * Update layout.py * Optimize performance, Add SM90 support, Add 1D2D SM100 support * Add fmtlib submodule at commit 553ec11 --------- Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>	2025-07-18 11:32:22 +08:00
Chenggang Zhao	03d0be3d2d	Simplify expression	2025-07-02 14:07:05 +08:00
fy1214	3fc6728dee	[add] fix smem_barrier size in wgrad way (#122)	2025-07-02 14:05:36 +08:00
yukuai	e82c4139da	Revert "Fixed the bug in get_swizzle_mode function related to elem_size setting. (#115)" This reverts commit `ac428e25e0`. This PR causes wgrad to hang during testing. Revert it until we resolve the issue	2025-06-23 17:13:36 +08:00
TherLF	ac428e25e0	Fixed the bug in get_swizzle_mode function related to elem_size setting. (#115)	2025-06-23 09:37:10 +08:00
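
Several of the NVFP4 commits above describe small bit-level transforms: packing two E2M1 values per byte (shape `(M, K//2)`), packing four 1-byte scale factors per int32, taking `>> 23` of the float32 bit pattern for UE8M0, and merging block16 scales into block32 by taking the max of adjacent pairs. The following is a minimal pure-Python sketch of those ideas, not the repository's implementation. All function names are hypothetical, and the nibble order (even index in the low nibble) and little-endian word packing are assumptions that the commit messages do not state.

```python
import struct


def pack_nibbles(codes):
    """Pack 4-bit codes two per byte (assumed: even index goes in the low nibble)."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        if not (0 <= lo < 16 and 0 <= hi < 16):
            raise ValueError("each code must fit in 4 bits")
        out.append(lo | (hi << 4))
    return bytes(out)


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: one byte becomes two 4-bit codes."""
    codes = []
    for b in packed:
        codes.extend((b & 0xF, b >> 4))
    return codes


def pack_bytes_to_int32(scale_bytes):
    """Group 1-byte scale factors four at a time into signed 32-bit words
    (assumed little-endian, so the first scale lands in the lowest byte)."""
    if len(scale_bytes) % 4:
        raise ValueError("need a multiple of 4 scale bytes")
    n = len(scale_bytes) // 4
    return list(struct.unpack("<%di" % n, bytes(scale_bytes)))


def float32_biased_exponent(x):
    """Reinterpret a float32 as uint32 and keep the 8 exponent bits (bits >> 23).
    This truncates the mantissa; a real kernel may round differently."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF


def merge_adjacent_max(scales):
    """Merge per-16 block scales into per-32 scales via max of adjacent pairs."""
    if len(scales) % 2:
        raise ValueError("need an even number of scales")
    return [max(a, b) for a, b in zip(scales[0::2], scales[1::2])]
```

For example, a row of K = 8 four-bit codes packs into 4 bytes, which matches the `K//2` byte count mentioned in the commits, and `float32_biased_exponent(1.0)` yields the bias value 127.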


131 Commits