DeepGEMM

Author	SHA1	Message	Date
biondizzle	ad335c38fb	tweax n shit	2026-05-12 23:16:44 +00:00
biondizzle	8b27e85ee5	fix: advance TMEM SF start column per UMMA atom for scale_vec::4X	2026-05-12 20:56:35 +00:00
biondizzle	74bf612771	NVFP4 mega MoE: sf_id=0 fix for scale_vec::4X + UINT8 TMA + SF pipeline + interleaving Root cause of ILLEGAL_INSTRUCTION: make_runtime_instr_desc_with_sf_id(instr_desc, k, k) passed sf_id=1 for k=1 (second UMMA atom), but mxf4nvf4 with scale_vec::4X requires sf_id=0 always — the hardware implicitly reads 4 SF positions per atom from a single TMEM region. Non-zero sf_id causes the hardware to access invalid TMEM offsets. Also includes: - UINT8 TMA for packed FP4 (avoids 16U4 driver bugs) - NVFP4 SF pipeline: 2 K-columns per BLOCK_K for group_size=16 - MN-major SF interleaving for gate/up L1 weights - Fix contiguous copy for SF byte view - Preserve MN-major layout in SF interleave - Force contiguous on SF tensors before C++ call - Unpack weight tuples before printing - Single transpose back to MN-major (don't double-transpose)	2026-05-12 20:26:13 +00:00
biondizzle	26a8ab75a1	NVFP4: fix SF pipeline — 2 K-cols per BLOCK_K for group=16 - TMA: issue two tma::copy calls per K-block (K_box=1, 2 SF K-columns) - UTCCP: double loop for 2 K-columns, correct SMEM offsets - TMEM: double SFA/SFB column counts (SF_BLOCK_M/32 * 2) - Heuristic: fix smem_size (2× SF, packed FP4 A/B, packed send buffers, no amax) - Staging kernel: fix double-count bug in packed_k_mask	2026-05-12 08:08:17 +00:00
biondizzle	680874d067	NVFP4 L1 epilogue: group_size=16 SF layout - Single amax per warp (16 N-elements = 1 SF group, no warp-pair reduction) - Single sf_val instead of sf.x/sf.y split - All 4 warps write SF (k_idx = n_block_idx*4 + warp_idx_in_wg) - Remove dead SMEM amax storage, reclaim barrier offset space - Remove dead __syncwarp after register-local amax	2026-05-12 07:08:08 +00:00
biondizzle	af092fa7ba	fix: double SMEM SF allocation for NVFP4 group=16 + clean stale comments - SMEM_SFA/SFB_SIZE_PER_STAGE doubled: group=16 needs 8 SFs per token per BLOCK_K=128 (vs 4 for group=32) - arrive_and_expect_tx updated to use SMEM_SFA/SFB constants - Removed stale comments about 8/16 TMEM columns	2026-05-11 23:58:07 +00:00
biondizzle	aa97a3f949	fix: correct TMEM column layout for scale_vec::4X UTCCP 4x32dp128bit always writes 4 TMEM cols per 128-element group regardless of 1X vs 4X. The 4X only changes MMA interpretation, not UTCCP column count. Reverted from (4, stride i8) to (same as 1X, stride i4): - kNumSFATmemCols: SF_BLOCK_M/32 (was SF_BLOCK_M/324) - kNumSFBTmemCols: SF_BLOCK_N/32 (was SF_BLOCK_N/324) - UTCCP stride: i4 (was i*8)	2026-05-11 23:44:12 +00:00
biondizzle	d6551617c0	fix: 4 kernel compilation fixes for packed FP4 1. sizeof_bits_v→sizeof_bits<T>::value (our CUTLASS lacks C++17 _v form) 2. reinterpret_cast<uint8_t> for TMA copy and UMMA desc calls (smem_a returns float_e2m1_t but templates expect uint8_t*) 3. kNumChunks extended to 4 (packed FP4 halved SMEM, need more chunks) 4. No code changes to PatternVisitor — all fixes at call sites	2026-05-11 23:17:51 +00:00
biondizzle	30d72e7ef5	fix: packed FP4 for mxf4nvf4 — correct SMEM layout, UMMA descriptors, L1 epilogue Key changes: - a_dtype_t/b_dtype_t: float_e2m1_t (packed 4-bit) with sizeof_bits_v==4 assert - kSwizzleAMode/BMode: BLOCK_K/2 (64 bytes packed, not 128 unpacked) - SMEM sizes: LOAD_BLOCK_M * BLOCK_K / 2 (packed byte count) - Token layouts: kHidden/2, kIntermediateHidden/2 (packed bytes) - TMA loads: BLOCK_K/2 inner dim, uint8_t, byte offsets k_block_idx*(BLOCK_K/2) - UMMA descriptors: BLOCK_K/2 template param, uint8_t dtype, UMMA_K/2 advance - L1 epilogue: dropped STSM, direct st.shared.u16 with packed nibbles, no swizzle (v1) - Pybind buffer sizes: hidden/2, intermediate_hidden/2 with packed tensor shapes - Host TMA descriptors: hidden/2 K-dims, block_k/2 inner, fp4_unpacked_smem=false - L1 output TMA: block_n/4 inner, no swizzle (CU_TENSOR_MAP_SWIZZLE_NONE)	2026-05-11 21:59:21 +00:00
biondizzle	0ac73a82f9	fix: L1 output uses unpacked E2M1 (1 byte/element) like FP8 - float_e2m1_unpacksmem_t: sizeof=1, SMEM is 1 byte/element (not packed) - TMA load unpacks 2 E2M1/global-byte → 2 SMEM bytes - UMMA reads unpacked SMEM, packs internally for mxf4nvf4 - L1→L2 handoff: unpacked format (same byte count as FP8) - Epilogue: 4 E2M1 bytes per uint32 STSM atom, same as FP8 - Dispatch TMA: kHidden bytes (unpacked), not kHidden/2 - Added static_assert on sizeof(a_dtype_t) and sizeof(b_dtype_t) - Note: no bandwidth savings at L1→L2 boundary for v1	2026-05-11 21:27:35 +00:00
biondizzle	091b974736	fix: L1 epilogue uses STSM with XOR swizzle for E2M1 FP4 output Keep STSM (not naive SMEM write) so TMA reads correct bank layout. Pack 4 E2M1 nibbles into uint32 per STSM atom with XOR swizzle. Known perf note: 32B swizzle zone for L1 output (land for v1).	2026-05-11 20:57:34 +00:00
biondizzle	a554de8b24	fix: dispatch TMA byte counts for FP4 (kHidden/2), rename fp8→fp4 layout refs	2026-05-11 20:47:58 +00:00
biondizzle	b3d1aae038	feat: full FP4 activations for mxf4nvf4 - E2M1 packed A side + UE4M3 scales mxf4nvf4 requires BOTH A and B to be FP4 (E2M1 packed). Changes: - a_dtype_t: float_e4m3_t → float_e2m1_unpacksmem_t - UMMA_K: 32 → 64 (FP4 MMA atom) - L1 epilogue: FP8 quant → E2M1 FP4 quantization with nearest-neighbor - L1 output SMEM: packed E2M1 (2 per byte), TMA store uint8 - TMA descriptors: adjusted for FP4 packing (K/2 bytes per row) - SymmBuffer: uint8 activations, shape (M, K//2) - Staging kernel: BF16 → E2M1 packed + UE4M3 block16 scales	2026-05-11 20:29:08 +00:00
biondizzle	fbdddaccf4	revert: restore mxf4nvf4/block16 code (correct path for sm_100a) Reverted to commit 36b439e's NVFP4 kernel code: - kGranK=16, mxf4nvf4.block_scale.scale_vec::4X - float_ue4m3_t instruction descriptor - Block16 SF layout (4X TMEM) - UE4M3 L1 epilogue - No UE4M3→UE8M0 conversion, no block16→block32 merge The mxf4nvf4.scale_vec::4X PTX instruction compiles successfully on both sm_100 and sm_100f with CUDA 13.0. The previous build 17 error was likely from a different cause, not the arch flag. Python: reverted transform_nvfp4_weights_for_mega_moe to use pack_ue4m3_to_int32 with gran_k=16, no UE8M0 conversion.	2026-05-11 15:02:47 +00:00
biondizzle	03b8c99ee1	fix: use mxf8f6f4 (UE8M0) on SM100 — mxf4nvf4 requires SM103+ B200 (SM100) does NOT support kind::mxf4nvf4 at all (neither 2X nor 4X). Only mxf8f6f4.block_scale with UE8M0 scales is available on SM100. Strategy: keep NVFP4 E2M1 weights, convert UE4M3 block scales → UE8M0 in the weight transformation. This is a scale format adaptation for hardware compatibility, not a format conversion. Changes: - Kernel: back to mxf8f6F4 instruction + float_ue8m0_t descriptor - L1 epilogue: back to UE8M0 (>> 23) activation scales - Python: merge block16→block32, convert UE4M3→float32→UE8M0 - Packing: uint8 (UE8M0) → int32, same as MXFP4	2026-05-11 09:28:45 +00:00
biondizzle	dcebe033e2	fix: use scale_vec::2X (block32) for SM100 B200 compatibility scale_vec::4X (block16) requires SM103/SM120 (B300/GB300), not SM100 (B200). Revert to block32 with UE4M3 scales. Same TMEM layout as MXFP4 but with UE4M3 scale format instead of UE8M0. Changes: - kGranK: 16 → 32 - PTX: scale_vec::4X → scale_vec::2X - SF layout: same as MXFP4 (K/32, K/128 for int32 packed) - UTCCP: i8 → i4 (2X layout, same as MXFP4) - TMEM columns: same as MXFP4 (SF_BLOCK_M/32, SF_BLOCK_N/32) - Python: merge NVFP4 block16→block32 scales (max of adjacent pairs) - recipe: (1,1,16) → (1,1,32)	2026-05-11 08:36:59 +00:00
biondizzle	36b439ee26	feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales) - New CUDA kernel: sm100_fp8_nvfp4_mega_moe_impl - kGranK=16 (NVFP4 group_size=16, vs MXFP4's 32) - kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction - float_ue4m3_t scale factor type in instruction descriptor - SF layout: scale_vec::4X (4 TMEM sub-columns per UMMA atom) - UTCCP column stride: i8 (vs MXFP4's i4) for 4X layout - L1 epilogue: UE4M3 activation scales (float→cutlass::float_e4m3_t) - SF loading: kNumSFUint32 = kHidden/64 (4 UE4M3 per int32) - New PTX wrappers: SM100_MMA_MXF4NVF4_2x1SM_SS, SM100_MMA_MXF4NVF4_SS - Python API: - fp8_nvfp4_mega_moe() with recipe=(1,1,16) - transform_nvfp4_weights_for_mega_moe() for UE4M3→int32 UTCCP packing - _pack_nvfp4_sf_for_utccp() helper - C++ bindings: - mega_nvfp4.hpp with NVFP4-specific SymmBuffer (SF stride K/16) - JIT kernel header with kGranK=16 TMA descriptors - Registered in python_api.cpp NOTE: Both SFA and SFB must use UE4M3 (scale_format_ is 1-bit, shared). The L1 epilogue converts float→UE4M3 for activation scales.	2026-05-11 05:41:08 +00:00
Zhean Xu	891d57b4db	Add various optimizations and Mega MoE benchmarks (#316 ) * Merge with private repo * Add Mega MoE Benchmark * Minor fix * Update --------- Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>	2026-04-24 18:41:37 +08:00
Chenggang Zhao	7f2a703ed5	[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304 ) * Merge with private repo * Update README * Update README * Update README * Add PyTorch requirements * Fix sync scopes for MQA logits (#256) * Update README	2026-04-17 09:45:14 +08:00
Ray Wang	d30fc36c8f	Fix sync issue of TMEM alloc/dealloc (#292 )	2026-03-22 16:41:28 +08:00
Xin Qiu	35c4bc8771	fix: k_grouped_fp8_gemm_nt_contiguous crashes with n = 768 on H100 (#238 )	2026-02-25 10:13:54 +08:00
Ray Wang	477618cd51	Fix a sync issue in SM100 MQA logits (#285 )	2026-02-03 17:29:49 +08:00
Zhean Xu	0f5f266202	Multiple updates and refactorings (#280 )	2026-01-16 17:06:52 +08:00
Ray Wang	38f8ef73a4	Multiple updates and refactorings (#231 )	2025-11-21 17:49:47 +08:00
Zhean Xu	bb4424aad4	Fix sum_k * shape_m overflow	2025-11-19 11:51:36 +08:00
Ray Wang	ec5e9ed0b8	Fix SM90 MQA logits (#229 )	2025-11-19 10:50:36 +08:00
Ray Wang	2f9d87877e	Use larger MMA shape (#227 )	2025-11-14 11:38:15 +08:00
Chenggang Zhao	c1bf4cae4b	Fix version	2025-10-01 20:31:27 +08:00
Chenggang Zhao	07b82fb8cd	Fix old CUDA compatibility	2025-10-01 20:29:15 +08:00
Simon Mo	59f2c07cf2	Add SM100 kernels (#201 ) Signed-off-by: simon-mo <simon.mo@hey.com>	2025-09-29 17:07:28 +08:00
Chenggang Zhao	80ceeb2c76	Add SM90 kernels (#200 )	2025-09-29 17:00:23 +08:00
Ray Wang	3f71de7aa9	Make various updates and fixes (#198 )	2025-09-25 16:19:07 +08:00
zhonghui-J	2da871e304	Fix grouped gemms performance issue. (#168 )	2025-08-22 17:35:43 +08:00
Chenggang Zhao	e38c2e3103	Remove comments	2025-08-22 17:32:04 +08:00
Chenggang Zhao	f20256fd50	Compatible with CUDA 13	2025-08-22 17:30:47 +08:00
xiweny	affdb1cd90	Add sm_100f support and make nvcc 13 happy (#157 ) Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>	2025-08-22 17:19:32 +08:00
Ray Wang	f85ec649d7	Make various updates and fixes: (#164 ) - Add BF16 support for SM90 and SM100 - Refactor Python APIs - Other fixes and code refactoring	2025-08-15 18:32:35 +08:00
Ray Wang	d9c363f86f	Make various updates and fixes: - Add support for legacy CUDA versions; now compatible with CUDA 12.3 and newer - Add support for NVRTC compilation - Other fixes and code refactoring	2025-08-02 19:52:22 -07:00
yukuai26	aff9da0aba	Fix SM90 GEMM (#149 ) * Fix sm90 GEMM * Fix typo --------- Co-authored-by: Kuai Yu <yukuai@deepseek.com>	2025-08-01 10:36:49 +08:00
Ray Wang	9da4a23561	Add more GPU architectures support (#112 ) * Add more GPU architectures support * Update layout.py * Optimize performance, Add SM90 support, Add 1D2D SM100 support * Add fmtlib submodule at commit 553ec11 --------- Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>	2025-07-18 11:32:22 +08:00
shixianc	0c88cd0139	Fix illegal memory address when skipping -1 m indices (#113 ) Co-authored-by: Shixian Cui <shixian@amazon.com>	2025-06-16 10:44:31 +08:00
yukuai26	8dfa329827	Grouped GEMM skip useless computation for unaligned Ms (#103 ) * Grouped GEMM skip useless computation for unaligned Ms * Update readme.md * small typo * Rename variables * Restore previous indent * Format * Refactor tests * Add `SkipComputation` types * Bug fixed * Format * Fix tests * Add assertions * Minor fix --------- Co-authored-by: yukuai <yukuai@deepseek.com> Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>	2025-05-27 13:43:38 +08:00
Chenggang Zhao	78d8362e7a	Add a missing `#pragma once`	2025-05-15 18:10:05 +08:00
Chenggang Zhao	816b39053a	Refactor launch-related structures	2025-05-15 16:14:21 +08:00
Chenggang Zhao	e2d6a107ef	Cleanup some useless staffs	2025-05-14 15:46:45 +08:00
Zhean Xu	04278f6dee	Weight gradient kernels for dense and MoE models (#95 ) * Init weight gradient kernels. * Support unaligned n,k and gmem stride * Update docs * Several cleanups * Remove restrictions on N * Add stride(0) assertions --------- Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>	2025-05-14 14:47:58 +08:00
Gabriel Wu	bfe983c4c2	Refactor JIT compilation (+NVRTC support) (#94 ) * [wip] refactor: compile to .cubin Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * refactor: compile to .cubin and add NVRTC option Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * fix: compiler version Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * feat: compat for old drivers Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * feat: save kernel name to file Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * feat: fix win compat Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * fix: windows compat Signed-off-by: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com> * feat: make API more general Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * feat: drop support for CUDA<12.3 Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * doc: update README Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * Some lints and refactor * Refactor runtime * Several fixes * Refactor environment variables * Code format * Add a TODO * Compatible with CUDA 12.3 * Fix indent * Fix typing * Drop support for Windows * Add a TODO --------- Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> Signed-off-by: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com> Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>	2025-05-07 11:38:14 +08:00
yukuai26	95e81b3dd6	Indivisible TMA (#90 ) Fix indivisible shapes for TMA multicast --------- Co-authored-by: yukuai <yukuai@deepseek.com> Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>	2025-04-23 14:55:14 +08:00
yukuai26	891f35adf5	Support TMA multicast on B with m_grouped_gemm_contiguous. (#88 )	2025-04-21 09:43:17 +08:00
Chenggang Zhao	83aa960b9b	Fix bugs	2025-04-18 11:55:51 +08:00

1 2

81 Commits