03b8c99ee1
fix: use mxf8f6f4 (UE8M0) on SM100 — mxf4nvf4 requires SM103+
...
B200 (SM100) does NOT support kind::mxf4nvf4 at all (neither 2X nor 4X).
Only mxf8f6f4.block_scale with UE8M0 scales is available on SM100.
Strategy: keep NVFP4 E2M1 weights, convert UE4M3 block scales → UE8M0
in the weight transformation. This is a scale format adaptation for
hardware compatibility, not a format conversion.
Changes:
- Kernel: back to mxf8f6F4 instruction + float_ue8m0_t descriptor
- L1 epilogue: back to UE8M0 (>> 23) activation scales
- Python: merge block16→block32, convert UE4M3→float32→UE8M0
- Packing: uint8 (UE8M0) → int32, same as MXFP4
2026-05-11 09:28:45 +00:00
b856c57ba6
fix: kGranK=32 in C++ binding (was still 16 from old block16 code)
2026-05-11 09:09:32 +00:00
cd7a612175
debug: add shape logging to SF packing
2026-05-11 08:54:14 +00:00
dcebe033e2
fix: use scale_vec::2X (block32) for SM100 B200 compatibility
...
scale_vec::4X (block16) requires SM103/SM120 (B300/GB300), not SM100 (B200).
Revert to block32 with UE4M3 scales. Same TMEM layout as MXFP4 but with
UE4M3 scale format instead of UE8M0.
Changes:
- kGranK: 16 → 32
- PTX: scale_vec::4X → scale_vec::2X
- SF layout: same as MXFP4 (K/32, K/128 for int32 packed)
- UTCCP: i*8 → i*4 (2X layout, same as MXFP4)
- TMEM columns: same as MXFP4 (SF_BLOCK_M/32, SF_BLOCK_N/32)
- Python: merge NVFP4 block16→block32 scales (max of adjacent pairs)
- recipe: (1,1,16) → (1,1,32)
2026-05-11 08:36:59 +00:00
deff80c9c1
fix: add Python wrapper for NVFP4 SymmBuffer allocation
...
get_symm_buffer_for_nvfp4_mega_moe uses _C.get_symm_buffer_size_for_nvfp4_mega_moe
to allocate the correct buffer size (2x SF entries due to group_size=16).
Custom init to avoid SymmBuffer's hardcoded MXFP4 allocation.
2026-05-11 08:05:21 +00:00
acbe006498
docs: update debugging log in README
2026-05-11 07:33:02 +00:00
8d02eb38fa
fix: transpose SF to MN-major layout before TMA stride checks
...
transform_sf_into_required_layout expects MN-major input (stride(-2)=1).
Our packed int32 SF is K-major (stride(-1)=1). Transpose the last two
dims, make contiguous, then transpose back so data is in MN-major order.
2026-05-11 07:32:10 +00:00
7154500f22
fix: reshape SF to 2D before transform_sf_into_required_layout
...
The C++ check_sf_layout stride assertion fails on 3D (experts, mn, K//64)
tensors. Reshape to 2D (experts*mn, K//64) before calling the transform
function, matching the expected stride layout.
2026-05-11 07:30:54 +00:00
f98c1f7fd5
fix: add gran_k=16 (NVFP4) support to transform_sf_into_required_layout
...
The C++ function only handled gran_k=32 and 128 (MXFP4/FP8).
Added gran_k=16 for NVFP4 group_size=16 support.
2026-05-11 07:13:00 +00:00
388fd8dcfd
fix: pack UE4M3 into int32 before transform_sf_into_required_layout
...
The C++ transform function expects int32 (for kInt type) with 4 UE4M3
bytes packed per int32. We pack first, then transform for TMA alignment
and UTCCP transpose with recipe (1, 16).
2026-05-11 07:05:11 +00:00
acae75e109
fix: use transform_sf_into_required_layout for proper TMA-aligned SF
...
Instead of custom _pack_nvfp4_sf_for_utccp, use DeepGEMM's C++
transform_sf_into_required_layout with recipe (1, 1, 16) for NVFP4.
This handles TMA alignment and UTCCP transpose correctly.
2026-05-11 06:54:34 +00:00
5cb4fcaef3
fix: cast uint8 weights to int8 (kPackedFP4) for DeepGEMM compatibility
2026-05-11 06:36:32 +00:00
aa9e53d5b2
feat: add build script for in-container compilation
2026-05-11 05:53:07 +00:00
328a352119
feat: add Dockerfile for NVFP4 mega moe build
2026-05-11 05:52:41 +00:00
bbf9a5f46a
feat: fold weight_scale_2 into block scales in NVFP4 transform
...
- transform_nvfp4_weights_for_mega_moe now accepts weight_scale_2
- Folds global scale into block scales: UE4M3 * FP32 -> UE4M3
- Dequantize to f32, multiply by global scale, clamp [0,448], re-quantize
- This is needed because the kernel only applies one level of block scaling
2026-05-11 05:42:16 +00:00
42c215d49b
docs: add NVFP4 mega MoE kernel README
2026-05-11 05:41:25 +00:00
36b439ee26
feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales)
...
- New CUDA kernel: sm100_fp8_nvfp4_mega_moe_impl
- kGranK=16 (NVFP4 group_size=16, vs MXFP4's 32)
- kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction
- float_ue4m3_t scale factor type in instruction descriptor
- SF layout: scale_vec::4X (4 TMEM sub-columns per UMMA atom)
- UTCCP column stride: i*8 (vs MXFP4's i*4) for 4X layout
- L1 epilogue: UE4M3 activation scales (float→cutlass::float_e4m3_t)
- SF loading: kNumSFUint32 = kHidden/64 (4 UE4M3 per int32)
- New PTX wrappers: SM100_MMA_MXF4NVF4_2x1SM_SS, SM100_MMA_MXF4NVF4_SS
- Python API:
- fp8_nvfp4_mega_moe() with recipe=(1,1,16)
- transform_nvfp4_weights_for_mega_moe() for UE4M3→int32 UTCCP packing
- _pack_nvfp4_sf_for_utccp() helper
- C++ bindings:
- mega_nvfp4.hpp with NVFP4-specific SymmBuffer (SF stride K/16)
- JIT kernel header with kGranK=16 TMA descriptors
- Registered in python_api.cpp
NOTE: Both SFA and SFB must use UE4M3 (scale_format_ is 1-bit, shared).
The L1 epilogue converts float→UE4M3 for activation scales.
2026-05-11 05:41:08 +00:00
Zhean Xu
891d57b4db
Add various optimizations and Mega MoE benchmarks ( #316 )
...
* Merge with private repo
* Add Mega MoE Benchmark
* Minor fix
* Update
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com >
2026-04-24 18:41:37 +08:00
Chenggang Zhao
7f2a703ed5
[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes ( #304 )
...
* Merge with private repo
* Update README
* Update README
* Update README
* Add PyTorch requirements
* Fix sync scopes for MQA logits (#256 )
* Update README
2026-04-17 09:45:14 +08:00
Ray Wang
d30fc36c8f
Fix sync issue of TMEM alloc/dealloc ( #292 )
2026-03-22 16:41:28 +08:00
Xin Qiu
35c4bc8771
fix: k_grouped_fp8_gemm_nt_contiguous crashes with n = 768 on H100 ( #238 )
2026-02-25 10:13:54 +08:00
Ray Wang
477618cd51
Fix a sync issue in SM100 MQA logits ( #285 )
2026-02-03 17:29:49 +08:00
Zhean Xu
0f5f266202
Multiple updates and refactorings ( #280 )
2026-01-16 17:06:52 +08:00
Zhean Xu
3ccf40c53a
Merge pull request #270 from yurekami/fix/sm90-archspec-bug
...
fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm
2026-01-06 09:56:33 +08:00
yurekami
6be0eb31d9
fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm
...
The function sm90_bf16_k_grouped_gemm was incorrectly using SM100ArchSpec
to calculate TMA descriptor block sizes. Since this file is the SM90
implementation, it should consistently use SM90ArchSpec like the other
functions in this file (sm90_bf16_gemm, sm90_m_grouped_bf16_gemm_contiguous,
etc.).
This fixes a copy-paste error that could cause incorrect block size
calculations on SM90 (Hopper) GPUs.
Fixes #242
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2026-01-01 05:06:36 +09:00
Chenggang Zhao
9b680f4284
Update install.sh
2025-12-05 17:06:48 +08:00
AJ WISE
659a279bbd
Better error handling, code consistency, compile-time safety ( #234 )
2025-12-05 16:49:52 +08:00
Ray Wang
38f8ef73a4
Multiple updates and refactorings ( #231 )
2025-11-21 17:49:47 +08:00
Zhean Xu
bb4424aad4
Fix sum_k * shape_m overflow
2025-11-19 11:51:36 +08:00
Chenggang Zhao
8da33d6bd9
Clean up
2025-11-19 11:00:55 +08:00
Guoteng
f63d7f24d6
fix: prevent int32 overflow in k-grouped GEMM size calculations ( #226 )
2025-11-19 10:52:08 +08:00
Ray Wang
ec5e9ed0b8
Fix SM90 MQA logits ( #229 )
2025-11-19 10:50:36 +08:00
Ray Wang
2f9d87877e
Use larger MMA shape ( #227 )
2025-11-14 11:38:15 +08:00
oliver könig
c9f8b34dcd
Merge pull request #220 from ko3n1g/ko3n1g/chore/revert-name-change
...
Ko3n1g/chore/revert name change
2025-10-15 16:30:23 +02:00
oliver könig
237041a257
revert
...
Signed-off-by: oliver könig <okoenig@nvidia.com >
2025-10-15 14:29:57 +00:00
oliver könig
f82018273d
chore: Revert name change
...
Signed-off-by: oliver könig <okoenig@nvidia.com >
2025-10-15 14:29:16 +00:00
oliver könig
737e420fad
chore: Rename project to ds-deem-gemm
...
Signed-off-by: oliver könig <okoenig@nvidia.com >
2025-10-15 12:44:21 +00:00
oliver könig
2b8a8e24f8
Update publish.yml
2025-10-15 13:00:51 +02:00
oliver könig
9528451969
Ko3n1g/chore/rename to deepgemm ( #217 )
...
* py3.8
Signed-off-by: oliver könig <okoenig@nvidia.com >
* chore: Rename from `deep_geem` to `deepgemm`
Signed-off-by: oliver könig <okoenig@nvidia.com >
---------
Signed-off-by: oliver könig <okoenig@nvidia.com >
2025-10-15 18:13:42 +08:00
oliver könig
93b3c28fa8
ci: Fixes for pre-built wheels ( #214 )
...
* build: Allow NGC builds
Signed-off-by: oliver könig <okoenig@nvidia.com >
* reduce grid
Signed-off-by: oliver könig <okoenig@nvidia.com >
* update grid
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* upgrade cuda action
Signed-off-by: oliver könig <okoenig@nvidia.com >
* remove test
Signed-off-by: oliver könig <okoenig@nvidia.com >
* py3.8
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* exclude
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* torch-version
Signed-off-by: oliver könig <okoenig@nvidia.com >
* py3.8/torch2.1/cuda12.3
Signed-off-by: oliver könig <okoenig@nvidia.com >
* Update publish.yml
* fix grid
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* cuda11.8
Signed-off-by: oliver könig <okoenig@nvidia.com >
* no hopper for 118
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
---------
Signed-off-by: oliver könig <okoenig@nvidia.com >
2025-10-14 13:05:47 +08:00
Chenggang Zhao
f8f41145da
Use CUDA runtime API to get device prop instead of ATen
2025-10-11 09:16:31 +08:00
oliver könig
9f196058ae
chore: Build and store bdist wheels ( #181 )
...
* build: Minor tweeks for wheel build
Signed-off-by: oliver könig <okoenig@nvidia.com >
* ci: Workflows for wheel build
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com >
* build: Add CachedWheel
Signed-off-by: oliver könig <okoenig@nvidia.com >
* add version to init
Signed-off-by: oliver könig <okoenig@nvidia.com >
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com >
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com >
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com >
* v2
Signed-off-by: oliver könig <okoenig@nvidia.com >
* update
Signed-off-by: oliver könig <okoenig@nvidia.com >
* test
Signed-off-by: oliver könig <okoenig@nvidia.com >
* from packaging.version import parse
Signed-off-by: oliver könig <okoenig@nvidia.com >
* local version
Signed-off-by: oliver könig <okoenig@nvidia.com >
* remove file
Signed-off-by: oliver könig <okoenig@nvidia.com >
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com >
* Updates and lint
* revert missing cudaextension args
Signed-off-by: oliver könig <okoenig@nvidia.com >
* Add timeout
* fix DG settings
Signed-off-by: oliver könig <okoenig@nvidia.com >
* DG_USE_LOCAL_VERSION
Signed-off-by: oliver könig <okoenig@nvidia.com >
* Update version
* Detect local changes
* Minor fix
* Revert CUTLASS
* Unify options
---------
Signed-off-by: oliver könig <okoenig@nvidia.com >
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com >
2025-10-10 18:23:40 +08:00
Jun Jiang
6e74faad5c
Upgrade to CUTLASS 4.2.1 ( #203 )
2025-10-09 09:09:22 +08:00
PGFLMG
239112cb4c
Fix syntax errors and correct the conditional statements ( #206 )
2025-10-01 20:31:43 +08:00
Chenggang Zhao
c1bf4cae4b
Fix version
2025-10-01 20:31:27 +08:00
Chenggang Zhao
07b82fb8cd
Fix old CUDA compatibility
2025-10-01 20:29:15 +08:00
Chenggang Zhao
594953acce
Update version number
2025-09-29 17:12:21 +08:00
Chenggang Zhao
0ed3b949d0
Update README
2025-09-29 17:10:12 +08:00
Simon Mo
59f2c07cf2
Add SM100 kernels ( #201 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-29 17:07:28 +08:00
Chenggang Zhao
80ceeb2c76
Add SM90 kernels ( #200 )
2025-09-29 17:00:23 +08:00