6a348d543d
fix: use raw cudaDeviceSynchronize instead of DG_CUDA_CHECK macro
nvfp4-mega-moe
biondizzle
2026-05-13 12:17:26 +00:00
c08a28888d
debug: sync + printf before mega_moe kernel launch
biondizzle
2026-05-13 12:15:49 +00:00
ad335c38fb
tweax n shit
biondizzle
2026-05-12 23:16:44 +00:00
8b27e85ee5
fix: advance TMEM SF start column per UMMA atom for scale_vec::4X
biondizzle
2026-05-12 20:56:35 +00:00
74bf612771
NVFP4 mega MoE: sf_id=0 fix for scale_vec::4X + UINT8 TMA + SF pipeline + interleaving
biondizzle
2026-05-12 20:26:13 +00:00
26a8ab75a1
NVFP4: fix SF pipeline — 2 K-cols per BLOCK_K for group=16
biondizzle
2026-05-12 08:08:17 +00:00
680874d067
NVFP4 L1 epilogue: group_size=16 SF layout
biondizzle
2026-05-12 07:08:08 +00:00
c0850a6859
Fix weight TMA descriptors: packed E2M1 needs K/2, block_k/2, swizzle/2
biondizzle
2026-05-12 06:51:39 +00:00
fbfeb54c9a
Fix fold_global_scale: UE4M3 scales use .to(float32), not shift-by-23
biondizzle
2026-05-12 05:52:33 +00:00
af092fa7ba
fix: double SMEM SF allocation for NVFP4 group=16 + clean stale comments
biondizzle
2026-05-11 23:58:07 +00:00
aa97a3f949
fix: correct TMEM column layout for scale_vec::4X
biondizzle
2026-05-11 23:44:12 +00:00
d6551617c0
fix: 4 kernel compilation fixes for packed FP4
biondizzle
2026-05-11 23:17:51 +00:00
49e5646b42
fix: remove duplicate kInt8 case — kPackedFP4 is already kInt8
biondizzle
2026-05-11 22:55:28 +00:00
80df24a641
fix: add kInt8 dtype support to TMA descriptor + change activation tensors to kInt8
biondizzle
2026-05-11 22:54:47 +00:00
e608a20dec
docs: major README update — packed FP4 SMEM layout, L1 epilogue, TMA descriptors
biondizzle
2026-05-11 22:40:09 +00:00
30d72e7ef5
fix: packed FP4 for mxf4nvf4 — correct SMEM layout, UMMA descriptors, L1 epilogue
biondizzle
2026-05-11 21:59:21 +00:00
0ac73a82f9
fix: L1 output uses unpacked E2M1 (1 byte/element) like FP8
biondizzle
2026-05-11 21:27:35 +00:00
091b974736
fix: L1 epilogue uses STSM with XOR swizzle for E2M1 FP4 output
biondizzle
2026-05-11 20:57:34 +00:00
a554de8b24
fix: dispatch TMA byte counts for FP4 (kHidden/2), rename fp8→fp4 layout refs
biondizzle
2026-05-11 20:47:58 +00:00
b3d1aae038
feat: full FP4 activations for mxf4nvf4 - E2M1 packed A side + UE4M3 scales
biondizzle
2026-05-11 20:29:08 +00:00
2cd86ff5e7
fix: UE8M0→float32 reinterpret in fold_global_scale (Bug #7)
biondizzle
2026-05-11 19:40:01 +00:00
47621bb990
add NVFP4SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe Python wrapper
biondizzle
2026-05-11 16:25:08 +00:00
86a1263f44
fix: gran_k=16 in transform_sf + sm_100a arch for NVFP4 mega_moe
biondizzle
2026-05-11 16:11:11 +00:00
fbdddaccf4
revert: restore mxf4nvf4/block16 code (correct path for sm_100a)
biondizzle
2026-05-11 15:02:47 +00:00
e80fe9af60
docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200)
biondizzle
2026-05-11 14:24:55 +00:00
c2f4a30780
docs: comprehensive README update through build 22
biondizzle
2026-05-11 13:55:17 +00:00
57c629ed1b
fix: cast to int32 before >> 23 (uint32 doesn't support right-shift)
biondizzle
2026-05-11 09:45:54 +00:00
6d7231a50e
fix: reinterpret float32 bits as uint32 before >> 23 for UE8M0
biondizzle
2026-05-11 09:42:03 +00:00
f44ff7f6ca
docs: document SM100 hardware constraint and full debugging log
biondizzle
2026-05-11 09:30:44 +00:00
03b8c99ee1
fix: use mxf8f6f4 (UE8M0) on SM100 — mxf4nvf4 requires SM103+
biondizzle
2026-05-11 09:28:45 +00:00
b856c57ba6
fix: kGranK=32 in C++ binding (was still 16 from old block16 code)
biondizzle
2026-05-11 09:09:32 +00:00
cd7a612175
debug: add shape logging to SF packing
biondizzle
2026-05-11 08:54:14 +00:00
dcebe033e2
fix: use scale_vec::2X (block32) for SM100 B200 compatibility
biondizzle
2026-05-11 08:36:59 +00:00
deff80c9c1
fix: add Python wrapper for NVFP4 SymmBuffer allocation
biondizzle
2026-05-11 08:05:21 +00:00
acbe006498
docs: update debugging log in README
biondizzle
2026-05-11 07:33:02 +00:00
8d02eb38fa
fix: transpose SF to MN-major layout before TMA stride checks
biondizzle
2026-05-11 07:32:10 +00:00
7154500f22
fix: reshape SF to 2D before transform_sf_into_required_layout
biondizzle
2026-05-11 07:30:54 +00:00
f98c1f7fd5
fix: add gran_k=16 (NVFP4) support to transform_sf_into_required_layout
biondizzle
2026-05-11 07:13:00 +00:00
388fd8dcfd
fix: pack UE4M3 into int32 before transform_sf_into_required_layout
biondizzle
2026-05-11 07:05:11 +00:00
acae75e109
fix: use transform_sf_into_required_layout for proper TMA-aligned SF
biondizzle
2026-05-11 06:54:34 +00:00
5cb4fcaef3
fix: cast uint8 weights to int8 (kPackedFP4) for DeepGEMM compatibility
biondizzle
2026-05-11 06:36:32 +00:00
aa9e53d5b2
feat: add build script for in-container compilation
biondizzle
2026-05-11 05:53:07 +00:00
328a352119
feat: add Dockerfile for NVFP4 mega moe build
biondizzle
2026-05-11 05:52:41 +00:00
bbf9a5f46a
feat: fold weight_scale_2 into block scales in NVFP4 transform
biondizzle
2026-05-11 05:42:16 +00:00
42c215d49b
docs: add NVFP4 mega MoE kernel README
biondizzle
2026-05-11 05:41:25 +00:00
36b439ee26
feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales)
biondizzle
2026-05-11 05:41:08 +00:00
891d57b4db
Add various optimizations and Mega MoE benchmarks (#316)
Zhean Xu
2026-04-24 18:41:37 +08:00
7f2a703ed5
[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304)
Chenggang Zhao
2026-04-17 09:45:14 +08:00
d30fc36c8f
Fix sync issue of TMEM alloc/dealloc (#292)
Ray Wang
2026-03-22 16:41:28 +08:00
35c4bc8771
fix: k_grouped_fp8_gemm_nt_contiguous crashes with n = 768 on H100 (#238)
Xin Qiu
2026-02-25 10:13:54 +08:00
477618cd51
Fix a sync issue in SM100 MQA logits (#285)
Ray Wang
2026-02-03 17:29:49 +08:00
0f5f266202
Multiple updates and refactorings (#280)
Zhean Xu
2026-01-16 17:06:52 +08:00
3ccf40c53a
Merge pull request #270 from yurekami/fix/sm90-archspec-bug
Zhean Xu
2026-01-06 09:56:33 +08:00
6be0eb31d9
fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm
yurekami
2026-01-01 05:06:36 +09:00
9b680f4284
Update install.sh
Chenggang Zhao
2025-12-05 17:06:48 +08:00
659a279bbd
Better error handling, code consistency, compile-time safety (#234)
AJ WISE
2025-12-05 08:49:52 +00:00
38f8ef73a4
Multiple updates and refactorings (#231)
Ray Wang
2025-11-21 17:49:47 +08:00
bb4424aad4
Fix sum_k * shape_m overflow
Zhean Xu
2025-11-19 11:51:36 +08:00
8da33d6bd9
Clean up
Chenggang Zhao
2025-11-19 11:00:55 +08:00
f63d7f24d6
fix: prevent int32 overflow in k-grouped GEMM size calculations (#226)
Guoteng
2025-11-19 10:52:08 +08:00
ec5e9ed0b8
Fix SM90 MQA logits (#229)
Ray Wang
2025-11-19 10:50:36 +08:00
2f9d87877e
Use larger MMA shape (#227)
Ray Wang
2025-11-14 11:38:15 +08:00
c9f8b34dcd
Merge pull request #220 from ko3n1g/ko3n1g/chore/revert-name-change
oliver könig
2025-10-15 16:30:23 +02:00
237041a257
revert
oliver könig
2025-10-15 14:29:57 +00:00
f82018273d
chore: Revert name change
oliver könig
2025-10-15 14:29:16 +00:00
737e420fad
chore: Rename project to ds-deem-gemm
oliver könig
2025-10-15 12:44:21 +00:00
2b8a8e24f8
Update publish.yml
oliver könig
2025-10-15 13:00:51 +02:00
9528451969
Ko3n1g/chore/rename to deepgemm (#217)
oliver könig
2025-10-15 12:13:42 +02:00
93b3c28fa8
ci: Fixes for pre-built wheels (#214)
oliver könig
2025-10-14 07:05:47 +02:00
f8f41145da
Use CUDA runtime API to get device prop instead of ATen
Chenggang Zhao
2025-10-11 09:14:00 +08:00
9f196058ae
chore: Build and store bdist wheels (#181)
oliver könig
2025-10-10 12:23:40 +02:00
6e74faad5c
Upgrade to CUTLASS 4.2.1 (#203)
Jun Jiang
2025-10-09 09:09:22 +08:00
239112cb4c
Fix syntax errors and correct the conditional statements (#206)
PGFLMG
2025-10-01 20:31:43 +08:00
c1bf4cae4b
Fix version
Chenggang Zhao
2025-10-01 20:31:27 +08:00
07b82fb8cd
Fix old CUDA compatibility
Chenggang Zhao
2025-10-01 20:29:15 +08:00
594953acce
Update version number
Chenggang Zhao
2025-09-29 17:12:21 +08:00
0ed3b949d0
Update README
Chenggang Zhao
2025-09-29 17:10:12 +08:00
59f2c07cf2
Add SM100 kernels (#201)
Simon Mo
2025-09-29 02:07:28 -07:00
80ceeb2c76
Add SM90 kernels (#200)
Chenggang Zhao
2025-09-29 17:00:23 +08:00
904b721731
Update README
Chenggang Zhao
2025-09-25 16:27:57 +08:00
3f71de7aa9
Make various updates and fixes (#198)
Ray Wang
2025-09-25 16:19:07 +08:00
79f48ee15a
Fix multicast bug and optimize masked GEMM (#193)
yukuai26
2025-09-12 17:12:27 +08:00
ea9c5d9270
Use driver API
Chenggang Zhao
2025-08-19 11:29:15 +08:00
51d1e9cdd3
Support compilation with CUDA 13.0 (#174)
Rain Jiang
2025-08-26 18:30:08 -07:00
0e49c3353b
Refactor compiler version checks and arch flags
Chenggang Zhao
2025-08-27 09:26:02 +08:00
3a93f4eb28
Fix B200 cu128 NVCC compilation failed (#173)
PGFLMG
2025-08-27 09:07:18 +08:00
9c3783beb2
Fix CUBIN symbol name compatibility
Chenggang Zhao
2025-08-26 17:42:11 +08:00
89b4089d24
Update test files in README documentation (#169)
ZiTian Zhao
2025-08-25 09:43:10 +08:00
2da871e304
Fix grouped gemms performance issue. (#168)
zhonghui-J
2025-08-22 17:35:43 +08:00
e38c2e3103
Remove comments
Chenggang Zhao
2025-08-22 17:32:04 +08:00
f20256fd50
Compatible with CUDA 13
Chenggang Zhao
2025-08-22 17:29:10 +08:00
affdb1cd90
Add sm_100f support and make nvcc 13 happy (#157)
xiweny
2025-08-22 17:19:32 +08:00
f85ec649d7
Make various updates and fixes: (#164)
Ray Wang
2025-08-15 18:32:35 +08:00
3254b758e2
Polish get_best_configs modeling. (#158)
zhonghui-J
2025-08-14 16:50:21 +08:00
6d3717d541
Update test_fp8.py (#159)
fzyzcjy
2025-08-14 16:47:57 +08:00
7b6b5563b9
Fix smxx layout assertion (#154)
LJC00118
2025-08-05 10:38:06 +08:00
3979c0576e
Merge pull request #151 from RayWang96/update_jit
Ray Wang
2025-08-03 11:04:02 +08:00
d9c363f86f
Make various updates and fixes:
- Add support for legacy CUDA versions; now compatible with CUDA 12.3 and newer
- Add support for NVRTC compilation
- Other fixes and code refactoring
Ray Wang
2025-08-02 19:52:22 -07:00
aff9da0aba
Fix SM90 GEMM (#149)
yukuai26
2025-08-01 10:36:49 +08:00
c50deed14c
Code lint
Chenggang Zhao
2025-07-30 10:39:30 +08:00