Xin Qiu
35c4bc8771
fix: k_grouped_fp8_gemm_nt_contiguous crashes with n = 768 on H100 ( #238 )
2026-02-25 10:13:54 +08:00
Ray Wang
477618cd51
Fix a sync issue in SM100 MQA logits ( #285 )
2026-02-03 17:29:49 +08:00
Zhean Xu
0f5f266202
Multiple updates and refactorings ( #280 )
2026-01-16 17:06:52 +08:00
Zhean Xu
3ccf40c53a
Merge pull request #270 from yurekami/fix/sm90-archspec-bug
...
fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm
2026-01-06 09:56:33 +08:00
yurekami
6be0eb31d9
fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm
...
The function sm90_bf16_k_grouped_gemm was incorrectly using SM100ArchSpec
to calculate TMA descriptor block sizes. Since this file is the SM90
implementation, it should consistently use SM90ArchSpec like the other
functions in this file (sm90_bf16_gemm, sm90_m_grouped_bf16_gemm_contiguous,
etc.).
This fixes a copy-paste error that could cause incorrect block size
calculations on SM90 (Hopper) GPUs.
Fixes #242
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2026-01-01 05:06:36 +09:00
Chenggang Zhao
9b680f4284
Update install.sh
2025-12-05 17:06:48 +08:00
AJ WISE
659a279bbd
Better error handling, code consistency, compile-time safety ( #234 )
2025-12-05 16:49:52 +08:00
Ray Wang
38f8ef73a4
Multiple updates and refactorings ( #231 )
2025-11-21 17:49:47 +08:00
Zhean Xu
bb4424aad4
Fix sum_k * shape_m overflow
2025-11-19 11:51:36 +08:00
Chenggang Zhao
8da33d6bd9
Clean up
2025-11-19 11:00:55 +08:00
Guoteng
f63d7f24d6
fix: prevent int32 overflow in k-grouped GEMM size calculations ( #226 )
2025-11-19 10:52:08 +08:00
Ray Wang
ec5e9ed0b8
Fix SM90 MQA logits ( #229 )
2025-11-19 10:50:36 +08:00
Ray Wang
2f9d87877e
Use larger MMA shape ( #227 )
2025-11-14 11:38:15 +08:00
oliver könig
c9f8b34dcd
Merge pull request #220 from ko3n1g/ko3n1g/chore/revert-name-change
...
Ko3n1g/chore/revert name change
2025-10-15 16:30:23 +02:00
oliver könig
237041a257
revert
...
Signed-off-by: oliver könig <okoenig@nvidia.com >
2025-10-15 14:29:57 +00:00
oliver könig
f82018273d
chore: Revert name change
...
Signed-off-by: oliver könig <okoenig@nvidia.com >
2025-10-15 14:29:16 +00:00
oliver könig
737e420fad
chore: Rename project to ds-deem-gemm
...
Signed-off-by: oliver könig <okoenig@nvidia.com >
2025-10-15 12:44:21 +00:00
oliver könig
2b8a8e24f8
Update publish.yml
2025-10-15 13:00:51 +02:00
oliver könig
9528451969
Ko3n1g/chore/rename to deepgemm ( #217 )
...
* py3.8
Signed-off-by: oliver könig <okoenig@nvidia.com >
* chore: Rename from `deep_geem` to `deepgemm`
Signed-off-by: oliver könig <okoenig@nvidia.com >
---------
Signed-off-by: oliver könig <okoenig@nvidia.com >
2025-10-15 18:13:42 +08:00
oliver könig
93b3c28fa8
ci: Fixes for pre-built wheels ( #214 )
...
* build: Allow NGC builds
Signed-off-by: oliver könig <okoenig@nvidia.com >
* reduce grid
Signed-off-by: oliver könig <okoenig@nvidia.com >
* update grid
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* upgrade cuda action
Signed-off-by: oliver könig <okoenig@nvidia.com >
* remove test
Signed-off-by: oliver könig <okoenig@nvidia.com >
* py3.8
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* exclude
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* torch-version
Signed-off-by: oliver könig <okoenig@nvidia.com >
* py3.8/torch2.1/cuda12.3
Signed-off-by: oliver könig <okoenig@nvidia.com >
* Update publish.yml
* fix grid
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* cuda11.8
Signed-off-by: oliver könig <okoenig@nvidia.com >
* no hopper for 118
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
---------
Signed-off-by: oliver könig <okoenig@nvidia.com >
2025-10-14 13:05:47 +08:00
Chenggang Zhao
f8f41145da
Use CUDA runtime API to get device prop instead of ATen
2025-10-11 09:16:31 +08:00
oliver könig
9f196058ae
chore: Build and store bdist wheels ( #181 )
...
* build: Minor tweeks for wheel build
Signed-off-by: oliver könig <okoenig@nvidia.com >
* ci: Workflows for wheel build
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* build: Add CachedWheel
Signed-off-by: oliver könig <okoenig@nvidia.com >
* add version to init
Signed-off-by: oliver könig <okoenig@nvidia.com >
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com >
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com >
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com >
* v2
Signed-off-by: oliver könig <okoenig@nvidia.com >
* update
Signed-off-by: oliver könig <okoenig@nvidia.com >
* test
Signed-off-by: oliver könig <okoenig@nvidia.com >
* from packaging.version import parse
Signed-off-by: oliver könig <okoenig@nvidia.com >
* local version
Signed-off-by: oliver könig <okoenig@nvidia.com >
* remove file
Signed-off-by: oliver könig <okoenig@nvidia.com >
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com >
* Updates and lint
* revert missing cudaextension args
Signed-off-by: oliver könig <okoenig@nvidia.com >
* Add timeout
* fix DG settings
Signed-off-by: oliver könig <okoenig@nvidia.com >
* DG_USE_LOCAL_VERSION
Signed-off-by: oliver könig <okoenig@nvidia.com >
* Update version
* Detect local changes
* Minor fix
* Revert CUTLASS
* Unify options
---------
Signed-off-by: oliver könig <okoenig@nvidia.com >
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com >
2025-10-10 18:23:40 +08:00
Jun Jiang
6e74faad5c
Upgrade to CUTLASS 4.2.1 ( #203 )
2025-10-09 09:09:22 +08:00
PGFLMG
239112cb4c
Fix syntax errors and correct the conditional statements ( #206 )
2025-10-01 20:31:43 +08:00
Chenggang Zhao
c1bf4cae4b
Fix version
2025-10-01 20:31:27 +08:00
Chenggang Zhao
07b82fb8cd
Fix old CUDA compatibility
2025-10-01 20:29:15 +08:00
Chenggang Zhao
594953acce
Update version number
2025-09-29 17:12:21 +08:00
Chenggang Zhao
0ed3b949d0
Update README
2025-09-29 17:10:12 +08:00
Simon Mo
59f2c07cf2
Add SM100 kernels ( #201 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-29 17:07:28 +08:00
Chenggang Zhao
80ceeb2c76
Add SM90 kernels ( #200 )
2025-09-29 17:00:23 +08:00
Chenggang Zhao
904b721731
Update README
2025-09-25 16:27:57 +08:00
Ray Wang
3f71de7aa9
Make various updates and fixes ( #198 )
2025-09-25 16:19:07 +08:00
yukuai26
79f48ee15a
Fix multicast bug and optimize masked GEMM ( #193 )
...
* Fix multicast bug and profile masked GEMM
* Updates and lint
---------
Co-authored-by: Kuai Yu <yukuai@deepseek.com >
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com >
2025-09-12 17:12:27 +08:00
Chenggang Zhao
ea9c5d9270
Use driver API
2025-08-28 09:40:49 +08:00
Rain Jiang
51d1e9cdd3
Support compilation with CUDA 13.0 ( #174 )
2025-08-27 09:30:08 +08:00
Chenggang Zhao
0e49c3353b
Refactor compiler version checks and arch flags
2025-08-27 09:28:21 +08:00
PGFLMG
3a93f4eb28
Fix B200 cu128 NVCC compilation failed ( #173 )
2025-08-27 09:07:18 +08:00
Chenggang Zhao
9c3783beb2
Fix CUBIN symbol name compatibility
2025-08-26 17:43:26 +08:00
ZiTian Zhao
89b4089d24
Update test files in README documentation ( #169 )
...
- Replace non-existent test_core.py with test_bf16.py
- Add test_fp8.py and test_lazy_init.py to testing section
2025-08-25 09:43:10 +08:00
zhonghui-J
2da871e304
Fix grouped gemms performance issue. ( #168 )
2025-08-22 17:35:43 +08:00
Chenggang Zhao
e38c2e3103
Remove comments
2025-08-22 17:32:04 +08:00
Chenggang Zhao
f20256fd50
Compatible with CUDA 13
2025-08-22 17:30:47 +08:00
xiweny
affdb1cd90
Add sm_100f support and make nvcc 13 happy ( #157 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com >
2025-08-22 17:19:32 +08:00
Ray Wang
f85ec649d7
Make various updates and fixes: ( #164 )
...
- Add BF16 support for SM90 and SM100
- Refactor Python APIs
- Other fixes and code refactoring
2025-08-15 18:32:35 +08:00
zhonghui-J
3254b758e2
Polish get_best_configs modeling. ( #158 )
2025-08-14 16:50:21 +08:00
fzyzcjy
6d3717d541
Update test_fp8.py ( #159 )
2025-08-14 16:47:57 +08:00
LJC00118
7b6b5563b9
Fix smxx layout assertion ( #154 )
2025-08-05 10:38:06 +08:00
Ray Wang
3979c0576e
Merge pull request #151 from RayWang96/update_jit
...
Make various updates and fixes
2025-08-03 11:04:02 +08:00
Ray Wang
d9c363f86f
Make various updates and fixes:
...
- Add support for legacy CUDA versions; now compatible with CUDA 12.3 and newer
- Add support for NVRTC compilation
- Other fixes and code refactoring
2025-08-02 19:52:22 -07:00
yukuai26
aff9da0aba
Fix SM90 GEMM ( #149 )
...
* Fix sm90 GEMM
* Fix typo
---------
Co-authored-by: Kuai Yu <yukuai@deepseek.com >
2025-08-01 10:36:49 +08:00