DeepGEMM

biondizzle/DeepGEMM

6a348d543d fix: use raw cudaDeviceSynchronize instead of DG_CUDA_CHECK macro (nvfp4-mega-moe) biondizzle 2026-05-13 12:17:26 +00:00
c08a28888d debug: sync + printf before mega_moe kernel launch biondizzle 2026-05-13 12:15:49 +00:00
ad335c38fb tweax n shit biondizzle 2026-05-12 23:16:44 +00:00
8b27e85ee5 fix: advance TMEM SF start column per UMMA atom for scale_vec::4X biondizzle 2026-05-12 20:56:35 +00:00
74bf612771 NVFP4 mega MoE: sf_id=0 fix for scale_vec::4X + UINT8 TMA + SF pipeline + interleaving biondizzle 2026-05-12 20:26:13 +00:00
26a8ab75a1 NVFP4: fix SF pipeline — 2 K-cols per BLOCK_K for group=16 biondizzle 2026-05-12 08:08:17 +00:00
680874d067 NVFP4 L1 epilogue: group_size=16 SF layout biondizzle 2026-05-12 07:08:08 +00:00
c0850a6859 Fix weight TMA descriptors: packed E2M1 needs K/2, block_k/2, swizzle/2 biondizzle 2026-05-12 06:51:39 +00:00
fbfeb54c9a Fix fold_global_scale: UE4M3 scales use .to(float32), not shift-by-23 biondizzle 2026-05-12 05:52:33 +00:00
af092fa7ba fix: double SMEM SF allocation for NVFP4 group=16 + clean stale comments biondizzle 2026-05-11 23:58:07 +00:00
aa97a3f949 fix: correct TMEM column layout for scale_vec::4X biondizzle 2026-05-11 23:44:12 +00:00
d6551617c0 fix: 4 kernel compilation fixes for packed FP4 biondizzle 2026-05-11 23:17:51 +00:00
49e5646b42 fix: remove duplicate kInt8 case — kPackedFP4 is already kInt8 biondizzle 2026-05-11 22:55:28 +00:00
80df24a641 fix: add kInt8 dtype support to TMA descriptor + change activation tensors to kInt8 biondizzle 2026-05-11 22:54:47 +00:00
e608a20dec docs: major README update — packed FP4 SMEM layout, L1 epilogue, TMA descriptors biondizzle 2026-05-11 22:40:09 +00:00
30d72e7ef5 fix: packed FP4 for mxf4nvf4 — correct SMEM layout, UMMA descriptors, L1 epilogue biondizzle 2026-05-11 21:59:21 +00:00
0ac73a82f9 fix: L1 output uses unpacked E2M1 (1 byte/element) like FP8 biondizzle 2026-05-11 21:27:35 +00:00
091b974736 fix: L1 epilogue uses STSM with XOR swizzle for E2M1 FP4 output biondizzle 2026-05-11 20:57:34 +00:00
a554de8b24 fix: dispatch TMA byte counts for FP4 (kHidden/2), rename fp8→fp4 layout refs biondizzle 2026-05-11 20:47:58 +00:00
b3d1aae038 feat: full FP4 activations for mxf4nvf4 - E2M1 packed A side + UE4M3 scales biondizzle 2026-05-11 20:29:08 +00:00
2cd86ff5e7 fix: UE8M0→float32 reinterpret in fold_global_scale (Bug #7) biondizzle 2026-05-11 19:40:01 +00:00
47621bb990 add NVFP4SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe Python wrapper biondizzle 2026-05-11 16:25:08 +00:00
86a1263f44 fix: gran_k=16 in transform_sf + sm_100a arch for NVFP4 mega_moe biondizzle 2026-05-11 16:11:11 +00:00
fbdddaccf4 revert: restore mxf4nvf4/block16 code (correct path for sm_100a) biondizzle 2026-05-11 15:02:47 +00:00
e80fe9af60 docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200) biondizzle 2026-05-11 14:24:55 +00:00
c2f4a30780 docs: comprehensive README update through build 22 biondizzle 2026-05-11 13:55:17 +00:00
57c629ed1b fix: cast to int32 before >> 23 (uint32 doesn't support right-shift) biondizzle 2026-05-11 09:45:54 +00:00
6d7231a50e fix: reinterpret float32 bits as uint32 before >> 23 for UE8M0 biondizzle 2026-05-11 09:42:03 +00:00
f44ff7f6ca docs: document SM100 hardware constraint and full debugging log biondizzle 2026-05-11 09:30:44 +00:00
03b8c99ee1 fix: use mxf8f6f4 (UE8M0) on SM100 — mxf4nvf4 requires SM103+ biondizzle 2026-05-11 09:28:45 +00:00
b856c57ba6 fix: kGranK=32 in C++ binding (was still 16 from old block16 code) biondizzle 2026-05-11 09:09:32 +00:00
cd7a612175 debug: add shape logging to SF packing biondizzle 2026-05-11 08:54:14 +00:00
dcebe033e2 fix: use scale_vec::2X (block32) for SM100 B200 compatibility biondizzle 2026-05-11 08:36:59 +00:00
deff80c9c1 fix: add Python wrapper for NVFP4 SymmBuffer allocation biondizzle 2026-05-11 08:05:21 +00:00
acbe006498 docs: update debugging log in README biondizzle 2026-05-11 07:33:02 +00:00
8d02eb38fa fix: transpose SF to MN-major layout before TMA stride checks biondizzle 2026-05-11 07:32:10 +00:00
7154500f22 fix: reshape SF to 2D before transform_sf_into_required_layout biondizzle 2026-05-11 07:30:54 +00:00
f98c1f7fd5 fix: add gran_k=16 (NVFP4) support to transform_sf_into_required_layout biondizzle 2026-05-11 07:13:00 +00:00
388fd8dcfd fix: pack UE4M3 into int32 before transform_sf_into_required_layout biondizzle 2026-05-11 07:05:11 +00:00
acae75e109 fix: use transform_sf_into_required_layout for proper TMA-aligned SF biondizzle 2026-05-11 06:54:34 +00:00
5cb4fcaef3 fix: cast uint8 weights to int8 (kPackedFP4) for DeepGEMM compatibility biondizzle 2026-05-11 06:36:32 +00:00
aa9e53d5b2 feat: add build script for in-container compilation biondizzle 2026-05-11 05:53:07 +00:00
328a352119 feat: add Dockerfile for NVFP4 mega moe build biondizzle 2026-05-11 05:52:41 +00:00
bbf9a5f46a feat: fold weight_scale_2 into block scales in NVFP4 transform biondizzle 2026-05-11 05:42:16 +00:00
42c215d49b docs: add NVFP4 mega MoE kernel README biondizzle 2026-05-11 05:41:25 +00:00
36b439ee26 feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales) biondizzle 2026-05-11 05:41:08 +00:00
891d57b4db Add various optimizations and Mega MoE benchmarks (#316) Zhean Xu 2026-04-24 18:41:37 +08:00
7f2a703ed5 [Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304) Chenggang Zhao 2026-04-17 09:45:14 +08:00
d30fc36c8f Fix sync issue of TMEM alloc/dealloc (#292) Ray Wang 2026-03-22 16:41:28 +08:00
35c4bc8771 fix: k_grouped_fp8_gemm_nt_contiguous crashes with n = 768 on H100 (#238) Xin Qiu 2026-02-25 10:13:54 +08:00
477618cd51 Fix a sync issue in SM100 MQA logits (#285) Ray Wang 2026-02-03 17:29:49 +08:00
0f5f266202 Multiple updates and refactorings (#280) Zhean Xu 2026-01-16 17:06:52 +08:00
3ccf40c53a Merge pull request #270 from yurekami/fix/sm90-archspec-bug Zhean Xu 2026-01-06 09:56:33 +08:00
6be0eb31d9 fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm yurekami 2026-01-01 05:06:36 +09:00
9b680f4284 Update install.sh Chenggang Zhao 2025-12-05 17:06:48 +08:00
659a279bbd Better error handling, code consistency, compile-time safety (#234) AJ WISE 2025-12-05 08:49:52 +00:00
38f8ef73a4 Multiple updates and refactorings (#231) Ray Wang 2025-11-21 17:49:47 +08:00
bb4424aad4 Fix sum_k * shape_m overflow Zhean Xu 2025-11-19 11:51:36 +08:00
8da33d6bd9 Clean up Chenggang Zhao 2025-11-19 11:00:55 +08:00
f63d7f24d6 fix: prevent int32 overflow in k-grouped GEMM size calculations (#226) Guoteng 2025-11-19 10:52:08 +08:00
ec5e9ed0b8 Fix SM90 MQA logits (#229) Ray Wang 2025-11-19 10:50:36 +08:00
2f9d87877e Use larger MMA shape (#227) Ray Wang 2025-11-14 11:38:15 +08:00
c9f8b34dcd Merge pull request #220 from ko3n1g/ko3n1g/chore/revert-name-change oliver könig 2025-10-15 16:30:23 +02:00
237041a257 revert oliver könig 2025-10-15 14:29:57 +00:00
f82018273d chore: Revert name change oliver könig 2025-10-15 14:29:16 +00:00
737e420fad chore: Rename project to ds-deem-gemm oliver könig 2025-10-15 12:44:21 +00:00
2b8a8e24f8 Update publish.yml oliver könig 2025-10-15 13:00:51 +02:00
9528451969 Ko3n1g/chore/rename to deepgemm (#217) oliver könig 2025-10-15 12:13:42 +02:00
93b3c28fa8 ci: Fixes for pre-built wheels (#214) oliver könig 2025-10-14 07:05:47 +02:00
f8f41145da Use CUDA runtime API to get device prop instead of ATen Chenggang Zhao 2025-10-11 09:14:00 +08:00
9f196058ae chore: Build and store bdist wheels (#181) oliver könig 2025-10-10 12:23:40 +02:00
6e74faad5c Upgrade to CUTLASS 4.2.1 (#203) Jun Jiang 2025-10-09 09:09:22 +08:00
239112cb4c Fix syntax errors and correct the conditional statements (#206) PGFLMG 2025-10-01 20:31:43 +08:00
c1bf4cae4b Fix version Chenggang Zhao 2025-10-01 20:31:27 +08:00
07b82fb8cd Fix old CUDA compatibility Chenggang Zhao 2025-10-01 20:29:15 +08:00
594953acce Update version number Chenggang Zhao 2025-09-29 17:12:21 +08:00
0ed3b949d0 Update README Chenggang Zhao 2025-09-29 17:10:12 +08:00
59f2c07cf2 Add SM100 kernels (#201) Simon Mo 2025-09-29 02:07:28 -07:00
80ceeb2c76 Add SM90 kernels (#200) Chenggang Zhao 2025-09-29 17:00:23 +08:00
904b721731 Update README Chenggang Zhao 2025-09-25 16:27:57 +08:00
3f71de7aa9 Make various updates and fixes (#198) Ray Wang 2025-09-25 16:19:07 +08:00
79f48ee15a Fix multicast bug and optimize masked GEMM (#193) yukuai26 2025-09-12 17:12:27 +08:00
ea9c5d9270 Use driver API Chenggang Zhao 2025-08-19 11:29:15 +08:00
51d1e9cdd3 Support compilation with CUDA 13.0 (#174) Rain Jiang 2025-08-26 18:30:08 -07:00
0e49c3353b Refactor compiler version checks and arch flags Chenggang Zhao 2025-08-27 09:26:02 +08:00
3a93f4eb28 Fix B200 cu128 NVCC compilation failed (#173) PGFLMG 2025-08-27 09:07:18 +08:00
9c3783beb2 Fix CUBIN symbol name compatibility Chenggang Zhao 2025-08-26 17:42:11 +08:00
89b4089d24 Update test files in README documentation (#169) ZiTian Zhao 2025-08-25 09:43:10 +08:00
2da871e304 Fix grouped gemms performance issue. (#168) zhonghui-J 2025-08-22 17:35:43 +08:00
e38c2e3103 Remove comments Chenggang Zhao 2025-08-22 17:32:04 +08:00
f20256fd50 Compatible with CUDA 13 Chenggang Zhao 2025-08-22 17:29:10 +08:00
affdb1cd90 Add sm_100f support and make nvcc 13 happy (#157) xiweny 2025-08-22 17:19:32 +08:00
f85ec649d7 Make various updates and fixes: (#164) Ray Wang 2025-08-15 18:32:35 +08:00
3254b758e2 Polish get_best_configs modeling. (#158) zhonghui-J 2025-08-14 16:50:21 +08:00
6d3717d541 Update test_fp8.py (#159) fzyzcjy 2025-08-14 16:47:57 +08:00
7b6b5563b9 Fix smxx layout assertion (#154) LJC00118 2025-08-05 10:38:06 +08:00
3979c0576e Merge pull request #151 from RayWang96/update_jit Ray Wang 2025-08-03 11:04:02 +08:00
d9c363f86f Make various updates and fixes: - Add support for legacy CUDA versions; now compatible with CUDA 12.3 and newer - Add support for NVRTC compilation - Other fixes and code refactoring Ray Wang 2025-08-02 19:52:22 -07:00
aff9da0aba Fix SM90 GEMM (#149) yukuai26 2025-08-01 10:36:49 +08:00
c50deed14c Code lint Chenggang Zhao 2025-07-30 10:39:30 +08:00
