DeepGEMM

Author	SHA1	Message	Date
biondizzle	5ac151d0a5	debug: print tensor dtypes/shapes at C++ call boundary in fp8_nvfp4_mega_moe	2026-05-12 13:10:32 +00:00
biondizzle	26a8ab75a1	NVFP4: fix SF pipeline — 2 K-cols per BLOCK_K for group=16 - TMA: issue two tma::copy calls per K-block (K_box=1, 2 SF K-columns) - UTCCP: double loop for 2 K-columns, correct SMEM offsets - TMEM: double SFA/SFB column counts (SF_BLOCK_M/32 * 2) - Heuristic: fix smem_size (2× SF, packed FP4 A/B, packed send buffers, no amax) - Staging kernel: fix double-count bug in packed_k_mask	2026-05-12 08:08:17 +00:00
biondizzle	680874d067	NVFP4 L1 epilogue: group_size=16 SF layout - Single amax per warp (16 N-elements = 1 SF group, no warp-pair reduction) - Single sf_val instead of sf.x/sf.y split - All 4 warps write SF (k_idx = n_block_idx*4 + warp_idx_in_wg) - Remove dead SMEM amax storage, reclaim barrier offset space - Remove dead __syncwarp after register-local amax	2026-05-12 07:08:08 +00:00
biondizzle	c0850a6859	Fix weight TMA descriptors: packed E2M1 needs K/2, block_k/2, swizzle/2 Weights are packed E2M1 (2 per byte) but TMA descriptors were using unpacked dimensions — K-dim in elements instead of bytes, 128B swizzle instead of 64B, full block_k instead of block_k/2. This caused OOB reads and swizzle mismatch with the UMMA descriptor, producing illegal instruction traps.	2026-05-12 06:51:39 +00:00
biondizzle	fbfeb54c9a	Fix fold_global_scale: UE4M3 scales use .to(float32), not shift-by-23 Checkpoint stores float8_e4m3fn (standard NVFP4), not UE8M0. The shift-by-23 was misinterpreting E4M3 bytes as E8M0 exponents.	2026-05-12 05:52:33 +00:00
biondizzle	af092fa7ba	fix: double SMEM SF allocation for NVFP4 group=16 + clean stale comments - SMEM_SFA/SFB_SIZE_PER_STAGE doubled: group=16 needs 8 SFs per token per BLOCK_K=128 (vs 4 for group=32) - arrive_and_expect_tx updated to use SMEM_SFA/SFB constants - Removed stale comments about 8/16 TMEM columns	2026-05-11 23:58:07 +00:00
biondizzle	aa97a3f949	fix: correct TMEM column layout for scale_vec::4X UTCCP 4x32dp128bit always writes 4 TMEM cols per 128-element group regardless of 1X vs 4X. The 4X only changes MMA interpretation, not UTCCP column count. Reverted from (*4, stride i*8) to (same as 1X, stride i*4): - kNumSFATmemCols: SF_BLOCK_M/32 (was SF_BLOCK_M/32*4) - kNumSFBTmemCols: SF_BLOCK_N/32 (was SF_BLOCK_N/32*4) - UTCCP stride: i*4 (was i*8)	2026-05-11 23:44:12 +00:00
biondizzle	d6551617c0	fix: 4 kernel compilation fixes for packed FP4 1. sizeof_bits_v→sizeof_bits<T>::value (our CUTLASS lacks C++17 _v form) 2. reinterpret_cast<uint8_t> for TMA copy and UMMA desc calls (smem_a returns float_e2m1_t but templates expect uint8_t*) 3. kNumChunks extended to 4 (packed FP4 halved SMEM, need more chunks) 4. No code changes to PatternVisitor — all fixes at call sites	2026-05-11 23:17:51 +00:00
biondizzle	49e5646b42	fix: remove duplicate kInt8 case — kPackedFP4 is already kInt8 kPackedFP4 = torch::kInt8, so the kInt8 case was a duplicate. The real fix was in mega_nvfp4.hpp: changing kUInt8→kInt8 so tensors match the existing kPackedFP4 path in the TMA switch.	2026-05-11 22:55:28 +00:00
biondizzle	80df24a641	fix: add kInt8 dtype support to TMA descriptor + change activation tensors to kInt8 - runtime_utils.hpp: added kInt8 -> CU_TENSOR_MAP_DATA_TYPE_UINT8 mapping - mega_nvfp4.hpp: changed activation tensor dtypes from kUInt8 to kInt8 (same byte layout, but kInt8 is recognized by the TMA dtype switch)	2026-05-11 22:54:47 +00:00
biondizzle	e608a20dec	docs: major README update — packed FP4 SMEM layout, L1 epilogue, TMA descriptors Added detailed documentation of the packed FP4 architecture: - mxf4nvf4 reads packed (2 per byte), NOT unpacked like mxf8f6f4 - SMEM layout: float_e2m1_t, BLOCK_K/2 swizzle, UMMA desc byte math - L1 epilogue: st.shared.u16, no swizzle, kWarpBytesPerRow - Host TMA: hidden/2 K-dim, block_k/2 inner, fp4_unpacked_smem=false - Build history through Build 35	2026-05-11 22:40:09 +00:00
biondizzle	30d72e7ef5	fix: packed FP4 for mxf4nvf4 — correct SMEM layout, UMMA descriptors, L1 epilogue Key changes: - a_dtype_t/b_dtype_t: float_e2m1_t (packed 4-bit) with sizeof_bits_v==4 assert - kSwizzleAMode/BMode: BLOCK_K/2 (64 bytes packed, not 128 unpacked) - SMEM sizes: LOAD_BLOCK_M * BLOCK_K / 2 (packed byte count) - Token layouts: kHidden/2, kIntermediateHidden/2 (packed bytes) - TMA loads: BLOCK_K/2 inner dim, uint8_t, byte offsets k_block_idx*(BLOCK_K/2) - UMMA descriptors: BLOCK_K/2 template param, uint8_t dtype, UMMA_K/2 advance - L1 epilogue: dropped STSM, direct st.shared.u16 with packed nibbles, no swizzle (v1) - Pybind buffer sizes: hidden/2, intermediate_hidden/2 with packed tensor shapes - Host TMA descriptors: hidden/2 K-dims, block_k/2 inner, fp4_unpacked_smem=false - L1 output TMA: block_n/4 inner, no swizzle (CU_TENSOR_MAP_SWIZZLE_NONE)	2026-05-11 21:59:21 +00:00
biondizzle	0ac73a82f9	fix: L1 output uses unpacked E2M1 (1 byte/element) like FP8 - float_e2m1_unpacksmem_t: sizeof=1, SMEM is 1 byte/element (not packed) - TMA load unpacks 2 E2M1/global-byte → 2 SMEM bytes - UMMA reads unpacked SMEM, packs internally for mxf4nvf4 - L1→L2 handoff: unpacked format (same byte count as FP8) - Epilogue: 4 E2M1 bytes per uint32 STSM atom, same as FP8 - Dispatch TMA: kHidden bytes (unpacked), not kHidden/2 - Added static_assert on sizeof(a_dtype_t) and sizeof(b_dtype_t) - Note: no bandwidth savings at L1→L2 boundary for v1	2026-05-11 21:27:35 +00:00
biondizzle	091b974736	fix: L1 epilogue uses STSM with XOR swizzle for E2M1 FP4 output Keep STSM (not naive SMEM write) so TMA reads correct bank layout. Pack 4 E2M1 nibbles into uint32 per STSM atom with XOR swizzle. Known perf note: 32B swizzle zone for L1 output (land for v1).	2026-05-11 20:57:34 +00:00
biondizzle	a554de8b24	fix: dispatch TMA byte counts for FP4 (kHidden/2), rename fp8→fp4 layout refs	2026-05-11 20:47:58 +00:00
biondizzle	b3d1aae038	feat: full FP4 activations for mxf4nvf4 - E2M1 packed A side + UE4M3 scales mxf4nvf4 requires BOTH A and B to be FP4 (E2M1 packed). Changes: - a_dtype_t: float_e4m3_t → float_e2m1_unpacksmem_t - UMMA_K: 32 → 64 (FP4 MMA atom) - L1 epilogue: FP8 quant → E2M1 FP4 quantization with nearest-neighbor - L1 output SMEM: packed E2M1 (2 per byte), TMA store uint8 - TMA descriptors: adjusted for FP4 packing (K/2 bytes per row) - SymmBuffer: uint8 activations, shape (M, K//2) - Staging kernel: BF16 → E2M1 packed + UE4M3 block16 scales	2026-05-11 20:29:08 +00:00
biondizzle	2cd86ff5e7	fix: UE8M0→float32 reinterpret in fold_global_scale (Bug #7)	2026-05-11 19:40:01 +00:00
biondizzle	47621bb990	add NVFP4SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe Python wrapper The C++ binding was registered but there was no Python wrapper. vLLM patch imports get_symm_buffer_for_nvfp4_mega_moe from deep_gemm.mega.	2026-05-11 16:25:08 +00:00
biondizzle	86a1263f44	fix: gran_k=16 in transform_sf + sm_100a arch for NVFP4 mega_moe - transform_sf_into_required_layout: add gran_k=16 branch for NVFP4 UE4M3 scales (4 per int32, group_size=16). Previously only handled 32/128. - get_arch: always return '100a' for SM100, never '100f'. The family variant lacks mxf4nvf4 (NVFP4 block-scaled MMA) support, causing 'scale_vec::4X not supported on sm_100f' errors. - transform_nvfp4_weights_for_mega_moe: fold weight_scale_2 into block scales, pack UE4M3→int32, transpose MN-major, call transform_sf_into_required_layout with gran_k=16.	2026-05-11 16:11:11 +00:00
biondizzle	fbdddaccf4	revert: restore mxf4nvf4/block16 code (correct path for sm_100a) Reverted to commit 36b439e's NVFP4 kernel code: - kGranK=16, mxf4nvf4.block_scale.scale_vec::4X - float_ue4m3_t instruction descriptor - Block16 SF layout (4X TMEM) - UE4M3 L1 epilogue - No UE4M3→UE8M0 conversion, no block16→block32 merge The mxf4nvf4.scale_vec::4X PTX instruction compiles successfully on both sm_100 and sm_100f with CUDA 13.0. The previous build 17 error was likely from a different cause, not the arch flag. Python: reverted transform_nvfp4_weights_for_mega_moe to use pack_ue4m3_to_int32 with gran_k=16, no UE8M0 conversion.	2026-05-11 15:02:47 +00:00
biondizzle	e80fe9af60	docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200) The build 17-18 'scale_vec not supported on sm_100f' error was because we targeted sm_100 instead of sm_100a. The 'a' suffix is required for FP4 block-scaled MMA instructions. Reverting to mxf4nvf4 with correct arch target is the path forward.	2026-05-11 14:24:55 +00:00
biondizzle	c2f4a30780	docs: comprehensive README update through build 22	2026-05-11 13:55:17 +00:00
biondizzle	57c629ed1b	fix: cast to int32 before >> 23 (uint32 doesn't support right-shift)	2026-05-11 09:45:54 +00:00
biondizzle	6d7231a50e	fix: reinterpret float32 bits as uint32 before >> 23 for UE8M0	2026-05-11 09:42:03 +00:00
biondizzle	f44ff7f6ca	docs: document SM100 hardware constraint and full debugging log	2026-05-11 09:30:44 +00:00
biondizzle	03b8c99ee1	fix: use mxf8f6f4 (UE8M0) on SM100 — mxf4nvf4 requires SM103+ B200 (SM100) does NOT support kind::mxf4nvf4 at all (neither 2X nor 4X). Only mxf8f6f4.block_scale with UE8M0 scales is available on SM100. Strategy: keep NVFP4 E2M1 weights, convert UE4M3 block scales → UE8M0 in the weight transformation. This is a scale format adaptation for hardware compatibility, not a format conversion. Changes: - Kernel: back to mxf8f6f4 instruction + float_ue8m0_t descriptor - L1 epilogue: back to UE8M0 (>> 23) activation scales - Python: merge block16→block32, convert UE4M3→float32→UE8M0 - Packing: uint8 (UE8M0) → int32, same as MXFP4	2026-05-11 09:28:45 +00:00
biondizzle	b856c57ba6	fix: kGranK=32 in C++ binding (was still 16 from old block16 code)	2026-05-11 09:09:32 +00:00
biondizzle	cd7a612175	debug: add shape logging to SF packing	2026-05-11 08:54:14 +00:00
biondizzle	dcebe033e2	fix: use scale_vec::2X (block32) for SM100 B200 compatibility scale_vec::4X (block16) requires SM103/SM120 (B300/GB300), not SM100 (B200). Revert to block32 with UE4M3 scales. Same TMEM layout as MXFP4 but with UE4M3 scale format instead of UE8M0. Changes: - kGranK: 16 → 32 - PTX: scale_vec::4X → scale_vec::2X - SF layout: same as MXFP4 (K/32, K/128 for int32 packed) - UTCCP: i*8 → i*4 (2X layout, same as MXFP4) - TMEM columns: same as MXFP4 (SF_BLOCK_M/32, SF_BLOCK_N/32) - Python: merge NVFP4 block16→block32 scales (max of adjacent pairs) - recipe: (1,1,16) → (1,1,32)	2026-05-11 08:36:59 +00:00
biondizzle	deff80c9c1	fix: add Python wrapper for NVFP4 SymmBuffer allocation get_symm_buffer_for_nvfp4_mega_moe uses _C.get_symm_buffer_size_for_nvfp4_mega_moe to allocate the correct buffer size (2x SF entries due to group_size=16). Custom init to avoid SymmBuffer's hardcoded MXFP4 allocation.	2026-05-11 08:05:21 +00:00
biondizzle	acbe006498	docs: update debugging log in README	2026-05-11 07:33:02 +00:00
biondizzle	8d02eb38fa	fix: transpose SF to MN-major layout before TMA stride checks transform_sf_into_required_layout expects MN-major input (stride(-2)=1). Our packed int32 SF is K-major (stride(-1)=1). Transpose the last two dims, make contiguous, then transpose back so data is in MN-major order.	2026-05-11 07:32:10 +00:00
biondizzle	7154500f22	fix: reshape SF to 2D before transform_sf_into_required_layout The C++ check_sf_layout stride assertion fails on 3D (experts, mn, K//64) tensors. Reshape to 2D (experts*mn, K//64) before calling the transform function, matching the expected stride layout.	2026-05-11 07:30:54 +00:00
biondizzle	f98c1f7fd5	fix: add gran_k=16 (NVFP4) support to transform_sf_into_required_layout The C++ function only handled gran_k=32 and 128 (MXFP4/FP8). Added gran_k=16 for NVFP4 group_size=16 support.	2026-05-11 07:13:00 +00:00
biondizzle	388fd8dcfd	fix: pack UE4M3 into int32 before transform_sf_into_required_layout The C++ transform function expects int32 (for kInt type) with 4 UE4M3 bytes packed per int32. We pack first, then transform for TMA alignment and UTCCP transpose with recipe (1, 16).	2026-05-11 07:05:11 +00:00
biondizzle	acae75e109	fix: use transform_sf_into_required_layout for proper TMA-aligned SF Instead of custom _pack_nvfp4_sf_for_utccp, use DeepGEMM's C++ transform_sf_into_required_layout with recipe (1, 1, 16) for NVFP4. This handles TMA alignment and UTCCP transpose correctly.	2026-05-11 06:54:34 +00:00
biondizzle	5cb4fcaef3	fix: cast uint8 weights to int8 (kPackedFP4) for DeepGEMM compatibility	2026-05-11 06:36:32 +00:00
biondizzle	aa9e53d5b2	feat: add build script for in-container compilation	2026-05-11 05:53:07 +00:00
biondizzle	328a352119	feat: add Dockerfile for NVFP4 mega moe build	2026-05-11 05:52:41 +00:00
biondizzle	bbf9a5f46a	feat: fold weight_scale_2 into block scales in NVFP4 transform - transform_nvfp4_weights_for_mega_moe now accepts weight_scale_2 - Folds global scale into block scales: UE4M3 * FP32 -> UE4M3 - Dequantize to f32, multiply by global scale, clamp [0,448], re-quantize - This is needed because the kernel only applies one level of block scaling	2026-05-11 05:42:16 +00:00
biondizzle	42c215d49b	docs: add NVFP4 mega MoE kernel README	2026-05-11 05:41:25 +00:00
biondizzle	36b439ee26	feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales) - New CUDA kernel: sm100_fp8_nvfp4_mega_moe_impl - kGranK=16 (NVFP4 group_size=16, vs MXFP4's 32) - kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction - float_ue4m3_t scale factor type in instruction descriptor - SF layout: scale_vec::4X (4 TMEM sub-columns per UMMA atom) - UTCCP column stride: i*8 (vs MXFP4's i*4) for 4X layout - L1 epilogue: UE4M3 activation scales (float→cutlass::float_e4m3_t) - SF loading: kNumSFUint32 = kHidden/64 (4 UE4M3 per int32) - New PTX wrappers: SM100_MMA_MXF4NVF4_2x1SM_SS, SM100_MMA_MXF4NVF4_SS - Python API: - fp8_nvfp4_mega_moe() with recipe=(1,1,16) - transform_nvfp4_weights_for_mega_moe() for UE4M3→int32 UTCCP packing - _pack_nvfp4_sf_for_utccp() helper - C++ bindings: - mega_nvfp4.hpp with NVFP4-specific SymmBuffer (SF stride K/16) - JIT kernel header with kGranK=16 TMA descriptors - Registered in python_api.cpp NOTE: Both SFA and SFB must use UE4M3 (scale_format_ is 1-bit, shared). The L1 epilogue converts float→UE4M3 for activation scales.	2026-05-11 05:41:08 +00:00
Zhean Xu	891d57b4db	Add various optimizations and Mega MoE benchmarks (#316) * Merge with private repo * Add Mega MoE Benchmark * Minor fix * Update --------- Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>	2026-04-24 18:41:37 +08:00
Chenggang Zhao	7f2a703ed5	[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304) * Merge with private repo * Update README * Update README * Update README * Add PyTorch requirements * Fix sync scopes for MQA logits (#256) * Update README	2026-04-17 09:45:14 +08:00
Ray Wang	d30fc36c8f	Fix sync issue of TMEM alloc/dealloc (#292)	2026-03-22 16:41:28 +08:00
Xin Qiu	35c4bc8771	fix: k_grouped_fp8_gemm_nt_contiguous crashes with n = 768 on H100 (#238)	2026-02-25 10:13:54 +08:00
Ray Wang	477618cd51	Fix a sync issue in SM100 MQA logits (#285)	2026-02-03 17:29:49 +08:00
Zhean Xu	0f5f266202	Multiple updates and refactorings (#280)	2026-01-16 17:06:52 +08:00
Zhean Xu	3ccf40c53a	Merge pull request #270 from yurekami/fix/sm90-archspec-bug fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm	2026-01-06 09:56:33 +08:00
yurekami	6be0eb31d9	fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm The function sm90_bf16_k_grouped_gemm was incorrectly using SM100ArchSpec to calculate TMA descriptor block sizes. Since this file is the SM90 implementation, it should consistently use SM90ArchSpec like the other functions in this file (sm90_bf16_gemm, sm90_m_grouped_bf16_gemm_contiguous, etc.). This fixes a copy-paste error that could cause incorrect block size calculations on SM90 (Hopper) GPUs. Fixes #242 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2026-01-01 05:06:36 +09:00
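
Several of the packed-FP4 commits above (for example c0850a6859 and 30d72e7ef5) turn on one fact: E2M1 values are stored two per byte, so a row of K elements occupies K/2 bytes. The sketch below illustrates that byte math in plain Python. The helper names and the low-nibble-first ordering are assumptions made for illustration; they are not taken from DeepGEMM's code.

```python
# Magnitudes for the low 3 bits of a 4-bit E2M1 code (bit 3 is the sign).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def pack_e2m1(codes):
    """Pack 4-bit codes two per byte.

    Assumption: the even-indexed element goes in the low nibble.
    """
    if len(codes) % 2 != 0:
        raise ValueError("need an even number of 4-bit codes")
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )


def unpack_e2m1(packed):
    """Inverse of pack_e2m1: one byte yields two 4-bit codes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes


def decode_e2m1(code):
    """Convert a 4-bit E2M1 code to its float value."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def packed_row_bytes(k_elements):
    """Bytes per row when K elements are stored packed (two per byte)."""
    return k_elements // 2


codes = [0x1, 0x2, 0xF, 0x0]
packed = pack_e2m1(codes)
values = [decode_e2m1(c) for c in unpack_e2m1(packed)]
```

With a K-block of 128 elements, `packed_row_bytes(128)` gives 64, which matches the "64 bytes packed, not 128 unpacked" figure quoted in commit 30d72e7ef5.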

231 Commits
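
The scale-format commits in this log (fbfeb54c9a, 6d7231a50e, 2cd86ff5e7) describe a confusion between two 8-bit scale encodings: E4M3 (float8_e4m3fn) and UE8M0 (an exponent-only byte). As a rough illustration of why the same byte decodes very differently under each, here is a minimal Python sketch. The function names are hypothetical, and the code follows the standard format definitions rather than anything in `fold_global_scale`.

```python
import struct


def decode_e4m3(byte):
    """Decode a float8_e4m3fn byte: 1 sign, 4 exponent, 3 mantissa bits, bias 7."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")  # the only NaN pattern; the format has no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def decode_ue8m0(byte):
    """Decode a UE8M0 byte: an unsigned power-of-two exponent with bias 127."""
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - 127)


def float32_to_ue8m0(x):
    """Take the exponent field of a float32 by reinterpreting its bits and
    shifting right by 23. Mantissa bits are discarded, so this rounds a
    positive value down to a power of two.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF


same_byte = 0x38
as_e4m3 = decode_e4m3(same_byte)    # 1.0
as_ue8m0 = decode_ue8m0(same_byte)  # 2**-71
```

Reading an E4M3 byte as if it were a UE8M0 exponent therefore shrinks a scale of 1.0 to 2^-71, which is the kind of misinterpretation commit fbfeb54c9a reports fixing.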