Commit Graph

193 Commits

Author SHA1 Message Date
328a352119 feat: add Dockerfile for NVFP4 mega moe build 2026-05-11 05:52:41 +00:00
bbf9a5f46a feat: fold weight_scale_2 into block scales in NVFP4 transform
- transform_nvfp4_weights_for_mega_moe now accepts weight_scale_2
- Folds global scale into block scales: UE4M3 * FP32 -> UE4M3
- Dequantize to f32, multiply by global scale, clamp [0,448], re-quantize
- This is needed because the kernel only applies one level of block scaling
2026-05-11 05:42:16 +00:00
42c215d49b docs: add NVFP4 mega MoE kernel README 2026-05-11 05:41:25 +00:00
36b439ee26 feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales)
- New CUDA kernel: sm100_fp8_nvfp4_mega_moe_impl
  - kGranK=16 (NVFP4 group_size=16, vs MXFP4's 32)
  - kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction
  - float_ue4m3_t scale factor type in instruction descriptor
  - SF layout: scale_vec::4X (4 TMEM sub-columns per UMMA atom)
  - UTCCP column stride: i*8 (vs MXFP4's i*4) for 4X layout
  - L1 epilogue: UE4M3 activation scales (float→cutlass::float_e4m3_t)
  - SF loading: kNumSFUint32 = kHidden/64 (4 UE4M3 per int32)

- New PTX wrappers: SM100_MMA_MXF4NVF4_2x1SM_SS, SM100_MMA_MXF4NVF4_SS

- Python API:
  - fp8_nvfp4_mega_moe() with recipe=(1,1,16)
  - transform_nvfp4_weights_for_mega_moe() for UE4M3→int32 UTCCP packing
  - _pack_nvfp4_sf_for_utccp() helper

- C++ bindings:
  - mega_nvfp4.hpp with NVFP4-specific SymmBuffer (SF stride K/16)
  - JIT kernel header with kGranK=16 TMA descriptors
  - Registered in python_api.cpp

NOTE: Both SFA and SFB must use UE4M3 (scale_format_ is 1-bit, shared).
The L1 epilogue converts float→UE4M3 for activation scales.
2026-05-11 05:41:08 +00:00
Zhean Xu
891d57b4db Add various optimizations and Mega MoE benchmarks (#316)
* Merge with private repo

* Add Mega MoE Benchmark

* Minor fix

* Update

---------

Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
2026-04-24 18:41:37 +08:00
Chenggang Zhao
7f2a703ed5 [Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304)
* Merge with private repo

* Update README

* Update README

* Update README

* Add PyTorch requirements

* Fix sync scopes for MQA logits (#256)

* Update README
2026-04-17 09:45:14 +08:00
Ray Wang
d30fc36c8f Fix sync issue of TMEM alloc/dealloc (#292) 2026-03-22 16:41:28 +08:00
Xin Qiu
35c4bc8771 fix: k_grouped_fp8_gemm_nt_contiguous crashes with n = 768 on H100 (#238) 2026-02-25 10:13:54 +08:00
Ray Wang
477618cd51 Fix a sync issue in SM100 MQA logits (#285) 2026-02-03 17:29:49 +08:00
Zhean Xu
0f5f266202 Multiple updates and refactorings (#280) 2026-01-16 17:06:52 +08:00
Zhean Xu
3ccf40c53a Merge pull request #270 from yurekami/fix/sm90-archspec-bug
fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm
2026-01-06 09:56:33 +08:00
yurekami
6be0eb31d9 fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm
The function sm90_bf16_k_grouped_gemm was incorrectly using SM100ArchSpec
to calculate TMA descriptor block sizes. Since this file is the SM90
implementation, it should consistently use SM90ArchSpec like the other
functions in this file (sm90_bf16_gemm, sm90_m_grouped_bf16_gemm_contiguous,
etc.).

This fixes a copy-paste error that could cause incorrect block size
calculations on SM90 (Hopper) GPUs.

Fixes #242

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-01 05:06:36 +09:00
Chenggang Zhao
9b680f4284 Update install.sh 2025-12-05 17:06:48 +08:00
AJ WISE
659a279bbd Better error handling, code consistency, compile-time safety (#234) 2025-12-05 16:49:52 +08:00
Ray Wang
38f8ef73a4 Multiple updates and refactorings (#231) 2025-11-21 17:49:47 +08:00
Zhean Xu
bb4424aad4 Fix sum_k * shape_m overflow 2025-11-19 11:51:36 +08:00
Chenggang Zhao
8da33d6bd9 Clean up 2025-11-19 11:00:55 +08:00
Guoteng
f63d7f24d6 fix: prevent int32 overflow in k-grouped GEMM size calculations (#226) 2025-11-19 10:52:08 +08:00
Ray Wang
ec5e9ed0b8 Fix SM90 MQA logits (#229) 2025-11-19 10:50:36 +08:00
Ray Wang
2f9d87877e Use larger MMA shape (#227) 2025-11-14 11:38:15 +08:00
oliver könig
c9f8b34dcd Merge pull request #220 from ko3n1g/ko3n1g/chore/revert-name-change
Ko3n1g/chore/revert name change
2025-10-15 16:30:23 +02:00
oliver könig
237041a257 revert
Signed-off-by: oliver könig <okoenig@nvidia.com>
2025-10-15 14:29:57 +00:00
oliver könig
f82018273d chore: Revert name change
Signed-off-by: oliver könig <okoenig@nvidia.com>
2025-10-15 14:29:16 +00:00
oliver könig
737e420fad chore: Rename project to ds-deem-gemm
Signed-off-by: oliver könig <okoenig@nvidia.com>
2025-10-15 12:44:21 +00:00
oliver könig
2b8a8e24f8 Update publish.yml 2025-10-15 13:00:51 +02:00
oliver könig
9528451969 Ko3n1g/chore/rename to deepgemm (#217)
* py3.8

Signed-off-by: oliver könig <okoenig@nvidia.com>

* chore: Rename from `deep_geem` to `deepgemm`

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
2025-10-15 18:13:42 +08:00
oliver könig
93b3c28fa8 ci: Fixes for pre-built wheels (#214)
* build: Allow NGC builds

Signed-off-by: oliver könig <okoenig@nvidia.com>

* reduce grid

Signed-off-by: oliver könig <okoenig@nvidia.com>

* update grid

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* upgrade cuda action

Signed-off-by: oliver könig <okoenig@nvidia.com>

* remove test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* py3.8

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* exclude

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* torch-version

Signed-off-by: oliver könig <okoenig@nvidia.com>

* py3.8/torch2.1/cuda12.3

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Update publish.yml

* fix grid

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cuda11.8

Signed-off-by: oliver könig <okoenig@nvidia.com>

* no hopper for 118

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
2025-10-14 13:05:47 +08:00
Chenggang Zhao
f8f41145da Use CUDA runtime API to get device prop instead of ATen 2025-10-11 09:16:31 +08:00
oliver könig
9f196058ae chore: Build and store bdist wheels (#181)
* build: Minor tweeks for wheel build

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Workflows for wheel build

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* build: Add CachedWheel

Signed-off-by: oliver könig <okoenig@nvidia.com>

* add version to init

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

* v2

Signed-off-by: oliver könig <okoenig@nvidia.com>

* update

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* from packaging.version import parse

Signed-off-by: oliver könig <okoenig@nvidia.com>

* local version

Signed-off-by: oliver könig <okoenig@nvidia.com>

* remove file

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Updates and lint

* revert missing cudaextension args

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Add timeout

* fix DG settings

Signed-off-by: oliver könig <okoenig@nvidia.com>

* DG_USE_LOCAL_VERSION

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Update version

* Detect local changes

* Minor fix

* Revert CUTLASS

* Unify options

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
2025-10-10 18:23:40 +08:00
Jun Jiang
6e74faad5c Upgrade to CUTLASS 4.2.1 (#203) 2025-10-09 09:09:22 +08:00
PGFLMG
239112cb4c Fix syntax errors and correct the conditional statements (#206) 2025-10-01 20:31:43 +08:00
Chenggang Zhao
c1bf4cae4b Fix version 2025-10-01 20:31:27 +08:00
Chenggang Zhao
07b82fb8cd Fix old CUDA compatibility 2025-10-01 20:29:15 +08:00
Chenggang Zhao
594953acce Update version number 2025-09-29 17:12:21 +08:00
Chenggang Zhao
0ed3b949d0 Update README 2025-09-29 17:10:12 +08:00
Simon Mo
59f2c07cf2 Add SM100 kernels (#201)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-09-29 17:07:28 +08:00
Chenggang Zhao
80ceeb2c76 Add SM90 kernels (#200) 2025-09-29 17:00:23 +08:00
Chenggang Zhao
904b721731 Update README 2025-09-25 16:27:57 +08:00
Ray Wang
3f71de7aa9 Make various updates and fixes (#198) 2025-09-25 16:19:07 +08:00
yukuai26
79f48ee15a Fix multicast bug and optimize masked GEMM (#193)
* Fix multicast bug and profile masked GEMM

* Updates and lint

---------

Co-authored-by: Kuai Yu <yukuai@deepseek.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
2025-09-12 17:12:27 +08:00
Chenggang Zhao
ea9c5d9270 Use driver API 2025-08-28 09:40:49 +08:00
Rain Jiang
51d1e9cdd3 Support compilation with CUDA 13.0 (#174) 2025-08-27 09:30:08 +08:00
Chenggang Zhao
0e49c3353b Refactor compiler version checks and arch flags 2025-08-27 09:28:21 +08:00
PGFLMG
3a93f4eb28 Fix B200 cu128 NVCC compilation failed (#173) 2025-08-27 09:07:18 +08:00
Chenggang Zhao
9c3783beb2 Fix CUBIN symbol name compatibility 2025-08-26 17:43:26 +08:00
ZiTian Zhao
89b4089d24 Update test files in README documentation (#169)
- Replace non-existent test_core.py with test_bf16.py
- Add test_fp8.py and test_lazy_init.py to testing section
2025-08-25 09:43:10 +08:00
zhonghui-J
2da871e304 Fix grouped gemms performance issue. (#168) 2025-08-22 17:35:43 +08:00
Chenggang Zhao
e38c2e3103 Remove comments 2025-08-22 17:32:04 +08:00
Chenggang Zhao
f20256fd50 Compatible with CUDA 13 2025-08-22 17:30:47 +08:00
xiweny
affdb1cd90 Add sm_100f support and make nvcc 13 happy (#157)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-22 17:19:32 +08:00