f98c1f7fd5
fix: add gran_k=16 (NVFP4) support to transform_sf_into_required_layout
...
The C++ function only handled gran_k=32 and 128 (MXFP4/FP8).
Added gran_k=16 for NVFP4 group_size=16 support.
2026-05-11 07:13:00 +00:00
388fd8dcfd
fix: pack UE4M3 into int32 before transform_sf_into_required_layout
...
The C++ transform function expects int32 (for kInt type) with 4 UE4M3
bytes packed per int32. We pack first, then transform for TMA alignment
and UTCCP transpose with recipe (1, 16).
2026-05-11 07:05:11 +00:00
acae75e109
fix: use transform_sf_into_required_layout for proper TMA-aligned SF
...
Instead of custom _pack_nvfp4_sf_for_utccp, use DeepGEMM's C++
transform_sf_into_required_layout with recipe (1, 1, 16) for NVFP4.
This handles TMA alignment and UTCCP transpose correctly.
2026-05-11 06:54:34 +00:00
5cb4fcaef3
fix: cast uint8 weights to int8 (kPackedFP4) for DeepGEMM compatibility
2026-05-11 06:36:32 +00:00
aa9e53d5b2
feat: add build script for in-container compilation
2026-05-11 05:53:07 +00:00
328a352119
feat: add Dockerfile for NVFP4 mega moe build
2026-05-11 05:52:41 +00:00
bbf9a5f46a
feat: fold weight_scale_2 into block scales in NVFP4 transform
...
- transform_nvfp4_weights_for_mega_moe now accepts weight_scale_2
- Folds global scale into block scales: UE4M3 * FP32 -> UE4M3
- Dequantize to f32, multiply by global scale, clamp [0,448], re-quantize
- This is needed because the kernel only applies one level of block scaling
2026-05-11 05:42:16 +00:00
42c215d49b
docs: add NVFP4 mega MoE kernel README
2026-05-11 05:41:25 +00:00
36b439ee26
feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales)
...
- New CUDA kernel: sm100_fp8_nvfp4_mega_moe_impl
- kGranK=16 (NVFP4 group_size=16, vs MXFP4's 32)
- kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction
- float_ue4m3_t scale factor type in instruction descriptor
- SF layout: scale_vec::4X (4 TMEM sub-columns per UMMA atom)
- UTCCP column stride: i*8 (vs MXFP4's i*4) for 4X layout
- L1 epilogue: UE4M3 activation scales (float→cutlass::float_e4m3_t)
- SF loading: kNumSFUint32 = kHidden/64 (4 UE4M3 per int32)
- New PTX wrappers: SM100_MMA_MXF4NVF4_2x1SM_SS, SM100_MMA_MXF4NVF4_SS
- Python API:
- fp8_nvfp4_mega_moe() with recipe=(1,1,16)
- transform_nvfp4_weights_for_mega_moe() for UE4M3→int32 UTCCP packing
- _pack_nvfp4_sf_for_utccp() helper
- C++ bindings:
- mega_nvfp4.hpp with NVFP4-specific SymmBuffer (SF stride K/16)
- JIT kernel header with kGranK=16 TMA descriptors
- Registered in python_api.cpp
NOTE: Both SFA and SFB must use UE4M3 (scale_format_ is 1-bit, shared).
The L1 epilogue converts float→UE4M3 for activation scales.
2026-05-11 05:41:08 +00:00
Zhean Xu
891d57b4db
Add various optimizations and Mega MoE benchmarks ( #316 )
...
* Merge with private repo
* Add Mega MoE Benchmark
* Minor fix
* Update
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com >
2026-04-24 18:41:37 +08:00
Chenggang Zhao
7f2a703ed5
[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes ( #304 )
...
* Merge with private repo
* Update README
* Update README
* Update README
* Add PyTorch requirements
* Fix sync scopes for MQA logits (#256 )
* Update README
2026-04-17 09:45:14 +08:00
Ray Wang
d30fc36c8f
Fix sync issue of TMEM alloc/dealloc ( #292 )
2026-03-22 16:41:28 +08:00
Xin Qiu
35c4bc8771
fix: k_grouped_fp8_gemm_nt_contiguous crashes with n = 768 on H100 ( #238 )
2026-02-25 10:13:54 +08:00
Ray Wang
477618cd51
Fix a sync issue in SM100 MQA logits ( #285 )
2026-02-03 17:29:49 +08:00
Zhean Xu
0f5f266202
Multiple updates and refactorings ( #280 )
2026-01-16 17:06:52 +08:00
Zhean Xu
3ccf40c53a
Merge pull request #270 from yurekami/fix/sm90-archspec-bug
...
fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm
2026-01-06 09:56:33 +08:00
yurekami
6be0eb31d9
fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm
...
The function sm90_bf16_k_grouped_gemm was incorrectly using SM100ArchSpec
to calculate TMA descriptor block sizes. Since this file is the SM90
implementation, it should consistently use SM90ArchSpec like the other
functions in this file (sm90_bf16_gemm, sm90_m_grouped_bf16_gemm_contiguous,
etc.).
This fixes a copy-paste error that could cause incorrect block size
calculations on SM90 (Hopper) GPUs.
Fixes #242
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2026-01-01 05:06:36 +09:00
Chenggang Zhao
9b680f4284
Update install.sh
2025-12-05 17:06:48 +08:00
AJ WISE
659a279bbd
Better error handling, code consistency, compile-time safety ( #234 )
2025-12-05 16:49:52 +08:00
Ray Wang
38f8ef73a4
Multiple updates and refactorings ( #231 )
2025-11-21 17:49:47 +08:00
Zhean Xu
bb4424aad4
Fix sum_k * shape_m overflow
2025-11-19 11:51:36 +08:00
Chenggang Zhao
8da33d6bd9
Clean up
2025-11-19 11:00:55 +08:00
Guoteng
f63d7f24d6
fix: prevent int32 overflow in k-grouped GEMM size calculations ( #226 )
2025-11-19 10:52:08 +08:00
Ray Wang
ec5e9ed0b8
Fix SM90 MQA logits ( #229 )
2025-11-19 10:50:36 +08:00
Ray Wang
2f9d87877e
Use larger MMA shape ( #227 )
2025-11-14 11:38:15 +08:00
oliver könig
c9f8b34dcd
Merge pull request #220 from ko3n1g/ko3n1g/chore/revert-name-change
...
Ko3n1g/chore/revert name change
2025-10-15 16:30:23 +02:00
oliver könig
237041a257
revert
...
Signed-off-by: oliver könig <okoenig@nvidia.com >
2025-10-15 14:29:57 +00:00
oliver könig
f82018273d
chore: Revert name change
...
Signed-off-by: oliver könig <okoenig@nvidia.com >
2025-10-15 14:29:16 +00:00
oliver könig
737e420fad
chore: Rename project to ds-deem-gemm
...
Signed-off-by: oliver könig <okoenig@nvidia.com >
2025-10-15 12:44:21 +00:00
oliver könig
2b8a8e24f8
Update publish.yml
2025-10-15 13:00:51 +02:00
oliver könig
9528451969
Ko3n1g/chore/rename to deepgemm ( #217 )
...
* py3.8
Signed-off-by: oliver könig <okoenig@nvidia.com >
* chore: Rename from `deep_geem` to `deepgemm`
Signed-off-by: oliver könig <okoenig@nvidia.com >
---------
Signed-off-by: oliver könig <okoenig@nvidia.com >
2025-10-15 18:13:42 +08:00
oliver könig
93b3c28fa8
ci: Fixes for pre-built wheels ( #214 )
...
* build: Allow NGC builds
Signed-off-by: oliver könig <okoenig@nvidia.com >
* reduce grid
Signed-off-by: oliver könig <okoenig@nvidia.com >
* update grid
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* upgrade cuda action
Signed-off-by: oliver könig <okoenig@nvidia.com >
* remove test
Signed-off-by: oliver könig <okoenig@nvidia.com >
* py3.8
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* exclude
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* torch-version
Signed-off-by: oliver könig <okoenig@nvidia.com >
* py3.8/torch2.1/cuda12.3
Signed-off-by: oliver könig <okoenig@nvidia.com >
* Update publish.yml
* fix grid
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* cuda11.8
Signed-off-by: oliver könig <okoenig@nvidia.com >
* no hopper for 118
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
---------
Signed-off-by: oliver könig <okoenig@nvidia.com >
2025-10-14 13:05:47 +08:00
Chenggang Zhao
f8f41145da
Use CUDA runtime API to get device prop instead of ATen
2025-10-11 09:16:31 +08:00
oliver könig
9f196058ae
chore: Build and store bdist wheels ( #181 )
...
* build: Minor tweeks for wheel build
Signed-off-by: oliver könig <okoenig@nvidia.com >
* ci: Workflows for wheel build
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* build: Add CachedWheel
Signed-off-by: oliver könig <okoenig@nvidia.com >
* add version to init
Signed-off-by: oliver könig <okoenig@nvidia.com >
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com >
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com >
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com >
* v2
Signed-off-by: oliver könig <okoenig@nvidia.com >
* update
Signed-off-by: oliver könig <okoenig@nvidia.com >
* test
Signed-off-by: oliver könig <okoenig@nvidia.com >
* from packaging.version import parse
Signed-off-by: oliver könig <okoenig@nvidia.com >
* local version
Signed-off-by: oliver könig <okoenig@nvidia.com >
* remove file
Signed-off-by: oliver könig <okoenig@nvidia.com >
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com >
* Updates and lint
* revert missing cudaextension args
Signed-off-by: oliver könig <okoenig@nvidia.com >
* Add timeout
* fix DG settings
Signed-off-by: oliver könig <okoenig@nvidia.com >
* DG_USE_LOCAL_VERSION
Signed-off-by: oliver könig <okoenig@nvidia.com >
* Update version
* Detect local changes
* Minor fix
* Revert CUTLASS
* Unify options
---------
Signed-off-by: oliver könig <okoenig@nvidia.com >
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com >
2025-10-10 18:23:40 +08:00
Jun Jiang
6e74faad5c
Upgrade to CUTLASS 4.2.1 ( #203 )
2025-10-09 09:09:22 +08:00
PGFLMG
239112cb4c
Fix syntax errors and correct the conditional statements ( #206 )
2025-10-01 20:31:43 +08:00
Chenggang Zhao
c1bf4cae4b
Fix version
2025-10-01 20:31:27 +08:00
Chenggang Zhao
07b82fb8cd
Fix old CUDA compatibility
2025-10-01 20:29:15 +08:00
Chenggang Zhao
594953acce
Update version number
2025-09-29 17:12:21 +08:00
Chenggang Zhao
0ed3b949d0
Update README
2025-09-29 17:10:12 +08:00
Simon Mo
59f2c07cf2
Add SM100 kernels ( #201 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-29 17:07:28 +08:00
Chenggang Zhao
80ceeb2c76
Add SM90 kernels ( #200 )
2025-09-29 17:00:23 +08:00
Chenggang Zhao
904b721731
Update README
2025-09-25 16:27:57 +08:00
Ray Wang
3f71de7aa9
Make various updates and fixes ( #198 )
2025-09-25 16:19:07 +08:00
yukuai26
79f48ee15a
Fix multicast bug and optimize masked GEMM ( #193 )
...
* Fix multicast bug and profile masked GEMM
* Updates and lint
---------
Co-authored-by: Kuai Yu <yukuai@deepseek.com >
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com >
2025-09-12 17:12:27 +08:00
Chenggang Zhao
ea9c5d9270
Use driver API
2025-08-28 09:40:49 +08:00
Rain Jiang
51d1e9cdd3
Support compilation with CUDA 13.0 ( #174 )
2025-08-27 09:30:08 +08:00
Chenggang Zhao
0e49c3353b
Refactor compiler version checks and arch flags
2025-08-27 09:28:21 +08:00
PGFLMG
3a93f4eb28
Fix B200 cu128 NVCC compilation failed ( #173 )
2025-08-27 09:07:18 +08:00
Chenggang Zhao
9c3783beb2
Fix CUBIN symbol name compatibility
2025-08-26 17:43:26 +08:00