31 Commits

Author SHA1 Message Date
Zhean Xu
891d57b4db Add various optimizations and Mega MoE benchmarks (#316)
* Merge with private repo

* Add Mega MoE Benchmark

* Minor fix

* Update

---------

Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
2026-04-24 18:41:37 +08:00
Chenggang Zhao
7f2a703ed5 [Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304)
* Merge with private repo

* Update README

* Update README

* Update README

* Add PyTorch requirements

* Fix sync scopes for MQA logits (#256)

* Update README
2026-04-17 09:45:14 +08:00
Zhean Xu
0f5f266202 Multiple updates and refactorings (#280) 2026-01-16 17:06:52 +08:00
yurekami
6be0eb31d9 fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm
The function sm90_bf16_k_grouped_gemm was incorrectly using SM100ArchSpec
to calculate TMA descriptor block sizes. Since this file is the SM90
implementation, it should consistently use SM90ArchSpec like the other
functions in this file (sm90_bf16_gemm, sm90_m_grouped_bf16_gemm_contiguous,
etc.).

This fixes a copy-paste error that could cause incorrect block size
calculations on SM90 (Hopper) GPUs.

Fixes #242

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-01 05:06:36 +09:00
Ray Wang
38f8ef73a4 Multiple updates and refactorings (#231) 2025-11-21 17:49:47 +08:00
Chenggang Zhao
8da33d6bd9 Clean up 2025-11-19 11:00:55 +08:00
Guoteng
f63d7f24d6 fix: prevent int32 overflow in k-grouped GEMM size calculations (#226) 2025-11-19 10:52:08 +08:00
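
The overflow class addressed by f63d7f24d6 can be illustrated with a minimal sketch. It is not the code from the commit: the shapes and helper names below are hypothetical, and `ctypes.c_int32` only mimics the wraparound that a 32-bit `int` product typically shows in C++ (where signed overflow is formally undefined behavior).

```python
import ctypes


def size_int32(m: int, total_k: int) -> int:
    """Mimic `int size = m * total_k;` by truncating the product to 32 bits."""
    return ctypes.c_int32(m * total_k).value


def size_int64(m: int, total_k: int) -> int:
    """Mimic `int64_t size = static_cast<int64_t>(m) * total_k;`."""
    return ctypes.c_int64(m * total_k).value


# Hypothetical shapes, chosen only so that the product exceeds 2**31 - 1.
m, total_k = 7168, 524288

wrapped = size_int32(m, total_k)  # goes negative once truncated to 32 bits
exact = size_int64(m, total_k)    # the intended element count

print(wrapped, exact)
```

Widening one operand to 64 bits before multiplying is the usual remedy, because the multiplication itself must happen in the wider type.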
Ray Wang
ec5e9ed0b8 Fix SM90 MQA logits (#229) 2025-11-19 10:50:36 +08:00
Ray Wang
2f9d87877e Use larger MMA shape (#227) 2025-11-14 11:38:15 +08:00
Chenggang Zhao
f8f41145da Use CUDA runtime API to get device prop instead of ATen 2025-10-11 09:16:31 +08:00
Chenggang Zhao
07b82fb8cd Fix old CUDA compatibility 2025-10-01 20:29:15 +08:00
Simon Mo
59f2c07cf2 Add SM100 kernels (#201)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-09-29 17:07:28 +08:00
Chenggang Zhao
80ceeb2c76 Add SM90 kernels (#200) 2025-09-29 17:00:23 +08:00
Ray Wang
3f71de7aa9 Make various updates and fixes (#198) 2025-09-25 16:19:07 +08:00
yukuai26
79f48ee15a Fix multicast bug and optimize masked GEMM (#193)
* Fix multicast bug and profile masked GEMM

* Updates and lint

---------

Co-authored-by: Kuai Yu <yukuai@deepseek.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
2025-09-12 17:12:27 +08:00
Chenggang Zhao
ea9c5d9270 Use driver API 2025-08-28 09:40:49 +08:00
Chenggang Zhao
0e49c3353b Refactor compiler version checks and arch flags 2025-08-27 09:28:21 +08:00
PGFLMG
3a93f4eb28 Fix B200 cu128 NVCC compilation failed (#173) 2025-08-27 09:07:18 +08:00
Chenggang Zhao
9c3783beb2 Fix CUBIN symbol name compatibility 2025-08-26 17:43:26 +08:00
Chenggang Zhao
f20256fd50 Compatible with CUDA 13 2025-08-22 17:30:47 +08:00
xiweny
affdb1cd90 Add sm_100f support and make nvcc 13 happy (#157)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-22 17:19:32 +08:00
Ray Wang
f85ec649d7 Make various updates and fixes: (#164)
- Add BF16 support for SM90 and SM100
- Refactor Python APIs
- Other fixes and code refactoring
2025-08-15 18:32:35 +08:00
zhonghui-J
3254b758e2 Polish get_best_configs modeling. (#158) 2025-08-14 16:50:21 +08:00
LJC00118
7b6b5563b9 Fix smxx layout assertion (#154) 2025-08-05 10:38:06 +08:00
Ray Wang
d9c363f86f Make various updates and fixes:
- Add support for legacy CUDA versions; now compatible with CUDA 12.3 and newer
- Add support for NVRTC compilation
- Other fixes and code refactoring
2025-08-02 19:52:22 -07:00
Chenggang Zhao
c50deed14c Code lint 2025-07-30 10:39:30 +08:00
LJC00118
6bc75b549e Fix smxx layout assertion (#141)
* Fix assertion error in smxx_layout.hpp for mn % 4 != 0 cases

* Fix assertion error in smxx_layout.hpp for mn % 4 != 0 cases

* Align submodule files

* Fix assertion error in smxx_layout.hpp for mn % 4 != 0 cases

* fix(smxx_layout): support mn%4!=0 and num_groups>1 via torch

* fix(smxx_layout): support mn%4!=0 and num_groups>1 via torch

* fix: correct logic for entering get_mn_major_tma_aligned_packed_ue8m0_tensor_torch
2025-07-30 10:36:54 +08:00
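
The `mn % 4 != 0` cases named in 6bc75b549e can be pictured as a padding problem. The sketch below is an illustration under stated assumptions, not the logic in `smxx_layout.hpp`: it assumes a 16-byte alignment requirement and 4-byte packed elements, and the helper names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return (a + b - 1) // b


def align(x: int, multiple: int) -> int:
    """Round x up to the next multiple of `multiple`."""
    return ceil_div(x, multiple) * multiple


def tma_aligned_size(mn: int, element_size: int) -> int:
    """Pad mn so that one row of `element_size`-byte items is 16-byte aligned.

    Assumption: a 16-byte alignment requirement; with 4-byte elements this
    means mn is padded to a multiple of 4.
    """
    alignment_bytes = 16
    assert alignment_bytes % element_size == 0
    return align(mn, alignment_bytes // element_size)


for mn in (128, 130, 131):
    print(mn, tma_aligned_size(mn, 4))
```

Under these assumptions, sizes that are already a multiple of 4 stay unchanged and every other size is padded upward, which is why a layout routine has to distinguish the logical `mn` from the padded stride.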
dan_the_3rd
4b4e4f20dd Update system.hpp (#133) 2025-07-28 17:01:05 +08:00
Chenggang Zhao
187656694f Code lint 2025-07-21 11:00:50 +08:00
Ray Wang
436a56314c Use std::filesystem::directory_iterator instead of std::filesystem::recursive_directory_iterator to avoid an ABI breakage we met (#131) 2025-07-21 10:44:20 +08:00
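
The behavioral difference between the two iterators in 436a56314c can be sketched with a Python analog. This is a swapped-in illustration: `os.scandir` stands in for `std::filesystem::directory_iterator` and `os.walk` for `std::filesystem::recursive_directory_iterator`. It shows only what each one visits, not the ABI breakage the commit mentions, and the file names are hypothetical.

```python
import os
import tempfile


def list_flat(root: str) -> list[str]:
    """Files directly inside root (non-recursive, like directory_iterator)."""
    with os.scandir(root) as entries:
        return sorted(e.name for e in entries if e.is_file())


def list_recursive(root: str) -> list[str]:
    """Files anywhere below root (like recursive_directory_iterator)."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)


# Build a small throwaway tree with one top-level file and one nested file.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "nested"))
    for rel in ("a.cubin", os.path.join("nested", "b.cubin")):
        with open(os.path.join(root, rel), "w") as f:
            f.write("")
    flat = list_flat(root)
    recursive = list_recursive(root)

print(flat)
print(recursive)
```

Switching to the non-recursive form is only safe when the files of interest sit directly in the scanned directory, since nested entries are no longer visited.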
Ray Wang
9da4a23561 Add more GPU architectures support (#112)
* Add more GPU architectures support

* Update layout.py

* Optimize performance, Add SM90 support, Add 1D2D SM100 support

* Add fmtlib submodule at commit 553ec11

---------

Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
2025-07-18 11:32:22 +08:00