Commit Graph

187 Commits

Author SHA1 Message Date
Ray Wang
d30fc36c8f Fix sync issue of TMEM alloc/dealloc (#292) 2026-03-22 16:41:28 +08:00
Xin Qiu
35c4bc8771 fix: k_grouped_fp8_gemm_nt_contiguous crashes with n = 768 on H100 (#238) 2026-02-25 10:13:54 +08:00
Ray Wang
477618cd51 Fix a sync issue in SM100 MQA logits (#285) 2026-02-03 17:29:49 +08:00
Zhean Xu
0f5f266202 Multiple updates and refactorings (#280) 2026-01-16 17:06:52 +08:00
Zhean Xu
3ccf40c53a Merge pull request #270 from yurekami/fix/sm90-archspec-bug
fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm
2026-01-06 09:56:33 +08:00
yurekami
6be0eb31d9 fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm
The function sm90_bf16_k_grouped_gemm was incorrectly using SM100ArchSpec
to calculate TMA descriptor block sizes. Since this file is the SM90
implementation, it should consistently use SM90ArchSpec like the other
functions in this file (sm90_bf16_gemm, sm90_m_grouped_bf16_gemm_contiguous,
etc.).

This fixes a copy-paste error that could cause incorrect block size
calculations on SM90 (Hopper) GPUs.

Fixes #242

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-01 05:06:36 +09:00
Chenggang Zhao
9b680f4284 Update install.sh 2025-12-05 17:06:48 +08:00
AJ WISE
659a279bbd Better error handling, code consistency, compile-time safety (#234) 2025-12-05 16:49:52 +08:00
Ray Wang
38f8ef73a4 Multiple updates and refactorings (#231) 2025-11-21 17:49:47 +08:00
Zhean Xu
bb4424aad4 Fix sum_k * shape_m overflow 2025-11-19 11:51:36 +08:00
Chenggang Zhao
8da33d6bd9 Clean up 2025-11-19 11:00:55 +08:00
Guoteng
f63d7f24d6 fix: prevent int32 overflow in k-grouped GEMM size calculations (#226) 2025-11-19 10:52:08 +08:00
Ray Wang
ec5e9ed0b8 Fix SM90 MQA logits (#229) 2025-11-19 10:50:36 +08:00
Ray Wang
2f9d87877e Use larger MMA shape (#227) 2025-11-14 11:38:15 +08:00
oliver könig
c9f8b34dcd Merge pull request #220 from ko3n1g/ko3n1g/chore/revert-name-change
Ko3n1g/chore/revert name change
2025-10-15 16:30:23 +02:00
oliver könig
237041a257 revert
Signed-off-by: oliver könig <okoenig@nvidia.com>
2025-10-15 14:29:57 +00:00
oliver könig
f82018273d chore: Revert name change
Signed-off-by: oliver könig <okoenig@nvidia.com>
2025-10-15 14:29:16 +00:00
oliver könig
737e420fad chore: Rename project to ds-deem-gemm
Signed-off-by: oliver könig <okoenig@nvidia.com>
2025-10-15 12:44:21 +00:00
oliver könig
2b8a8e24f8 Update publish.yml 2025-10-15 13:00:51 +02:00
oliver könig
9528451969 Ko3n1g/chore/rename to deepgemm (#217)
* py3.8

Signed-off-by: oliver könig <okoenig@nvidia.com>

* chore: Rename from `deep_geem` to `deepgemm`

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
2025-10-15 18:13:42 +08:00
oliver könig
93b3c28fa8 ci: Fixes for pre-built wheels (#214)
* build: Allow NGC builds

Signed-off-by: oliver könig <okoenig@nvidia.com>

* reduce grid

Signed-off-by: oliver könig <okoenig@nvidia.com>

* update grid

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* upgrade cuda action

Signed-off-by: oliver könig <okoenig@nvidia.com>

* remove test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* py3.8

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* exclude

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* torch-version

Signed-off-by: oliver könig <okoenig@nvidia.com>

* py3.8/torch2.1/cuda12.3

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Update publish.yml

* fix grid

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cuda11.8

Signed-off-by: oliver könig <okoenig@nvidia.com>

* no hopper for 118

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
2025-10-14 13:05:47 +08:00
Chenggang Zhao
f8f41145da Use CUDA runtime API to get device prop instead of ATen 2025-10-11 09:16:31 +08:00
oliver könig
9f196058ae chore: Build and store bdist wheels (#181)
* build: Minor tweeks for wheel build

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Workflows for wheel build

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* build: Add CachedWheel

Signed-off-by: oliver könig <okoenig@nvidia.com>

* add version to init

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

* v2

Signed-off-by: oliver könig <okoenig@nvidia.com>

* update

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* from packaging.version import parse

Signed-off-by: oliver könig <okoenig@nvidia.com>

* local version

Signed-off-by: oliver könig <okoenig@nvidia.com>

* remove file

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Updates and lint

* revert missing cudaextension args

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Add timeout

* fix DG settings

Signed-off-by: oliver könig <okoenig@nvidia.com>

* DG_USE_LOCAL_VERSION

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Update version

* Detect local changes

* Minor fix

* Revert CUTLASS

* Unify options

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
2025-10-10 18:23:40 +08:00
Jun Jiang
6e74faad5c Upgrade to CUTLASS 4.2.1 (#203) 2025-10-09 09:09:22 +08:00
PGFLMG
239112cb4c Fix syntax errors and correct the conditional statements (#206) 2025-10-01 20:31:43 +08:00
Chenggang Zhao
c1bf4cae4b Fix version 2025-10-01 20:31:27 +08:00
Chenggang Zhao
07b82fb8cd Fix old CUDA compatibility 2025-10-01 20:29:15 +08:00
Chenggang Zhao
594953acce Update version number 2025-09-29 17:12:21 +08:00
Chenggang Zhao
0ed3b949d0 Update README 2025-09-29 17:10:12 +08:00
Simon Mo
59f2c07cf2 Add SM100 kernels (#201)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-09-29 17:07:28 +08:00
Chenggang Zhao
80ceeb2c76 Add SM90 kernels (#200) 2025-09-29 17:00:23 +08:00
Chenggang Zhao
904b721731 Update README 2025-09-25 16:27:57 +08:00
Ray Wang
3f71de7aa9 Make various updates and fixes (#198) 2025-09-25 16:19:07 +08:00
yukuai26
79f48ee15a Fix multicast bug and optimize masked GEMM (#193)
* Fix multicast bug and profile masked GEMM

* Updates and lint

---------

Co-authored-by: Kuai Yu <yukuai@deepseek.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
2025-09-12 17:12:27 +08:00
Chenggang Zhao
ea9c5d9270 Use driver API 2025-08-28 09:40:49 +08:00
Rain Jiang
51d1e9cdd3 Support compilation with CUDA 13.0 (#174) 2025-08-27 09:30:08 +08:00
Chenggang Zhao
0e49c3353b Refactor compiler version checks and arch flags 2025-08-27 09:28:21 +08:00
PGFLMG
3a93f4eb28 Fix B200 cu128 NVCC compilation failed (#173) 2025-08-27 09:07:18 +08:00
Chenggang Zhao
9c3783beb2 Fix CUBIN symbol name compatibility 2025-08-26 17:43:26 +08:00
ZiTian Zhao
89b4089d24 Update test files in README documentation (#169)
- Replace non-existent test_core.py with test_bf16.py
- Add test_fp8.py and test_lazy_init.py to testing section
2025-08-25 09:43:10 +08:00
zhonghui-J
2da871e304 Fix grouped gemms performance issue. (#168) 2025-08-22 17:35:43 +08:00
Chenggang Zhao
e38c2e3103 Remove comments 2025-08-22 17:32:04 +08:00
Chenggang Zhao
f20256fd50 Compatible with CUDA 13 2025-08-22 17:30:47 +08:00
xiweny
affdb1cd90 Add sm_100f support and make nvcc 13 happy (#157)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-22 17:19:32 +08:00
Ray Wang
f85ec649d7 Make various updates and fixes: (#164)
- Add BF16 support for SM90 and SM100
- Refactor Python APIs
- Other fixes and code refactoring
2025-08-15 18:32:35 +08:00
zhonghui-J
3254b758e2 Polish get_best_configs modeling. (#158) 2025-08-14 16:50:21 +08:00
fzyzcjy
6d3717d541 Update test_fp8.py (#159) 2025-08-14 16:47:57 +08:00
LJC00118
7b6b5563b9 Fix smxx layout assertion (#154) 2025-08-05 10:38:06 +08:00
Ray Wang
3979c0576e Merge pull request #151 from RayWang96/update_jit
Make various updates and fixes
2025-08-03 11:04:02 +08:00
Ray Wang
d9c363f86f Make various updates and fixes:
- Add support for legacy CUDA versions; now compatible with CUDA 12.3 and newer
- Add support for NVRTC compilation
- Other fixes and code refactoring
2025-08-02 19:52:22 -07:00