DeepGEMM

Author	SHA1	Message	Date
biondizzle	d8ae7a3225	debug: print SF shape/strides before interleave	2026-05-12 14:31:41 +00:00
biondizzle	e498a2c729	fix: single transpose back to MN-major, don't double-transpose The .contiguous().transpose() dance was swapping dims back. A single transpose from (g,k,mn) gives (g,mn,k) with stride(-2)=1, which is exactly the MN-major layout TMA expects.	2026-05-12 14:23:02 +00:00
biondizzle	916f03d528	debug: add transform output shape/stride prints	2026-05-12 14:22:05 +00:00
biondizzle	1f13b24354	debug: add strides to SF debug prints	2026-05-12 14:11:53 +00:00
biondizzle	bfe612969b	fix: preserve MN-major layout when interleaving L1 SF tensors _interleave_l1_weights used empty_like+copy_ which destroyed the MN-major stride layout required by TMA. Added interleave_sf_mn_major that works in K-major, interleaves, then transposes back to MN-major.	2026-05-12 14:01:58 +00:00
biondizzle	76220ac6ee	fix: force contiguous on SF tensors before C++ call	2026-05-12 13:48:45 +00:00
biondizzle	bf5bf8d995	fix: unpack weight tuples before printing debug info	2026-05-12 13:28:32 +00:00
biondizzle	5ac151d0a5	debug: print tensor dtypes/shapes at C++ call boundary in fp8_nvfp4_mega_moe	2026-05-12 13:10:32 +00:00
biondizzle	fbfeb54c9a	Fix fold_global_scale: UE4M3 scales use .to(float32), not shift-by-23 Checkpoint stores float8_e4m3fn (standard NVFP4), not UE8M0. The shift-by-23 was misinterpreting E4M3 bytes as E8M0 exponents.	2026-05-12 05:52:33 +00:00
biondizzle	b3d1aae038	feat: full FP4 activations for mxf4nvf4 - E2M1 packed A side + UE4M3 scales mxf4nvf4 requires BOTH A and B to be FP4 (E2M1 packed). Changes: - a_dtype_t: float_e4m3_t → float_e2m1_unpacksmem_t - UMMA_K: 32 → 64 (FP4 MMA atom) - L1 epilogue: FP8 quant → E2M1 FP4 quantization with nearest-neighbor - L1 output SMEM: packed E2M1 (2 per byte), TMA store uint8 - TMA descriptors: adjusted for FP4 packing (K/2 bytes per row) - SymmBuffer: uint8 activations, shape (M, K//2) - Staging kernel: BF16 → E2M1 packed + UE4M3 block16 scales	2026-05-11 20:29:08 +00:00
biondizzle	2cd86ff5e7	fix: UE8M0→float32 reinterpret in fold_global_scale (Bug #7)	2026-05-11 19:40:01 +00:00
biondizzle	47621bb990	add NVFP4SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe Python wrapper The C++ binding was registered but there was no Python wrapper. vLLM patch imports get_symm_buffer_for_nvfp4_mega_moe from deep_gemm.mega.	2026-05-11 16:25:08 +00:00
biondizzle	86a1263f44	fix: gran_k=16 in transform_sf + sm_100a arch for NVFP4 mega_moe - transform_sf_into_required_layout: add gran_k=16 branch for NVFP4 UE4M3 scales (4 per int32, group_size=16). Previously only handled 32/128. - get_arch: always return '100a' for SM100, never '100f'. The family variant lacks mxf4nvf4 (NVFP4 block-scaled MMA) support, causing 'scale_vec::4X not supported on sm_100f' errors. - transform_nvfp4_weights_for_mega_moe: fold weight_scale_2 into block scales, pack UE4M3→int32, transpose MN-major, call transform_sf_into_required_layout with gran_k=16.	2026-05-11 16:11:11 +00:00
biondizzle	fbdddaccf4	revert: restore mxf4nvf4/block16 code (correct path for sm_100a) Reverted to commit 36b439e's NVFP4 kernel code: - kGranK=16, mxf4nvf4.block_scale.scale_vec::4X - float_ue4m3_t instruction descriptor - Block16 SF layout (4X TMEM) - UE4M3 L1 epilogue - No UE4M3→UE8M0 conversion, no block16→block32 merge The mxf4nvf4.scale_vec::4X PTX instruction compiles successfully on both sm_100 and sm_100f with CUDA 13.0. The previous build 17 error was likely from a different cause, not the arch flag. Python: reverted transform_nvfp4_weights_for_mega_moe to use pack_ue4m3_to_int32 with gran_k=16, no UE8M0 conversion.	2026-05-11 15:02:47 +00:00
biondizzle	57c629ed1b	fix: cast to int32 before >> 23 (uint32 doesn't support right-shift)	2026-05-11 09:45:54 +00:00
biondizzle	6d7231a50e	fix: reinterpret float32 bits as uint32 before >> 23 for UE8M0	2026-05-11 09:42:03 +00:00
biondizzle	03b8c99ee1	fix: use mxf8f6f4 (UE8M0) on SM100 — mxf4nvf4 requires SM103+ B200 (SM100) does NOT support kind::mxf4nvf4 at all (neither 2X nor 4X). Only mxf8f6f4.block_scale with UE8M0 scales is available on SM100. Strategy: keep NVFP4 E2M1 weights, convert UE4M3 block scales → UE8M0 in the weight transformation. This is a scale format adaptation for hardware compatibility, not a format conversion. Changes: - Kernel: back to mxf8f6f4 instruction + float_ue8m0_t descriptor - L1 epilogue: back to UE8M0 (>> 23) activation scales - Python: merge block16→block32, convert UE4M3→float32→UE8M0 - Packing: uint8 (UE8M0) → int32, same as MXFP4	2026-05-11 09:28:45 +00:00
biondizzle	cd7a612175	debug: add shape logging to SF packing	2026-05-11 08:54:14 +00:00
biondizzle	dcebe033e2	fix: use scale_vec::2X (block32) for SM100 B200 compatibility scale_vec::4X (block16) requires SM103/SM120 (B300/GB300), not SM100 (B200). Revert to block32 with UE4M3 scales. Same TMEM layout as MXFP4 but with UE4M3 scale format instead of UE8M0. Changes: - kGranK: 16 → 32 - PTX: scale_vec::4X → scale_vec::2X - SF layout: same as MXFP4 (K/32, K/128 for int32 packed) - UTCCP: i8 → i4 (2X layout, same as MXFP4) - TMEM columns: same as MXFP4 (SF_BLOCK_M/32, SF_BLOCK_N/32) - Python: merge NVFP4 block16→block32 scales (max of adjacent pairs) - recipe: (1,1,16) → (1,1,32)	2026-05-11 08:36:59 +00:00
biondizzle	deff80c9c1	fix: add Python wrapper for NVFP4 SymmBuffer allocation get_symm_buffer_for_nvfp4_mega_moe uses _C.get_symm_buffer_size_for_nvfp4_mega_moe to allocate the correct buffer size (2x SF entries due to group_size=16). Custom init to avoid SymmBuffer's hardcoded MXFP4 allocation.	2026-05-11 08:05:21 +00:00
biondizzle	8d02eb38fa	fix: transpose SF to MN-major layout before TMA stride checks transform_sf_into_required_layout expects MN-major input (stride(-2)=1). Our packed int32 SF is K-major (stride(-1)=1). Transpose the last two dims, make contiguous, then transpose back so data is in MN-major order.	2026-05-11 07:32:10 +00:00
biondizzle	7154500f22	fix: reshape SF to 2D before transform_sf_into_required_layout The C++ check_sf_layout stride assertion fails on 3D (experts, mn, K//64) tensors. Reshape to 2D (experts*mn, K//64) before calling the transform function, matching the expected stride layout.	2026-05-11 07:30:54 +00:00
biondizzle	388fd8dcfd	fix: pack UE4M3 into int32 before transform_sf_into_required_layout The C++ transform function expects int32 (for kInt type) with 4 UE4M3 bytes packed per int32. We pack first, then transform for TMA alignment and UTCCP transpose with recipe (1, 16).	2026-05-11 07:05:11 +00:00
biondizzle	acae75e109	fix: use transform_sf_into_required_layout for proper TMA-aligned SF Instead of custom _pack_nvfp4_sf_for_utccp, use DeepGEMM's C++ transform_sf_into_required_layout with recipe (1, 1, 16) for NVFP4. This handles TMA alignment and UTCCP transpose correctly.	2026-05-11 06:54:34 +00:00
biondizzle	5cb4fcaef3	fix: cast uint8 weights to int8 (kPackedFP4) for DeepGEMM compatibility	2026-05-11 06:36:32 +00:00
biondizzle	bbf9a5f46a	feat: fold weight_scale_2 into block scales in NVFP4 transform - transform_nvfp4_weights_for_mega_moe now accepts weight_scale_2 - Folds global scale into block scales: UE4M3 * FP32 -> UE4M3 - Dequantize to f32, multiply by global scale, clamp [0,448], re-quantize - This is needed because the kernel only applies one level of block scaling	2026-05-11 05:42:16 +00:00
biondizzle	36b439ee26	feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales) - New CUDA kernel: sm100_fp8_nvfp4_mega_moe_impl - kGranK=16 (NVFP4 group_size=16, vs MXFP4's 32) - kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction - float_ue4m3_t scale factor type in instruction descriptor - SF layout: scale_vec::4X (4 TMEM sub-columns per UMMA atom) - UTCCP column stride: i8 (vs MXFP4's i4) for 4X layout - L1 epilogue: UE4M3 activation scales (float→cutlass::float_e4m3_t) - SF loading: kNumSFUint32 = kHidden/64 (4 UE4M3 per int32) - New PTX wrappers: SM100_MMA_MXF4NVF4_2x1SM_SS, SM100_MMA_MXF4NVF4_SS - Python API: - fp8_nvfp4_mega_moe() with recipe=(1,1,16) - transform_nvfp4_weights_for_mega_moe() for UE4M3→int32 UTCCP packing - _pack_nvfp4_sf_for_utccp() helper - C++ bindings: - mega_nvfp4.hpp with NVFP4-specific SymmBuffer (SF stride K/16) - JIT kernel header with kGranK=16 TMA descriptors - Registered in python_api.cpp NOTE: Both SFA and SFB must use UE4M3 (scale_format_ is 1-bit, shared). The L1 epilogue converts float→UE4M3 for activation scales.	2026-05-11 05:41:08 +00:00
Zhean Xu	891d57b4db	Add various optimizations and Mega MoE benchmarks (#316) * Merge with private repo * Add Mega MoE Benchmark * Minor fix * Update --------- Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>	2026-04-24 18:41:37 +08:00
Chenggang Zhao	7f2a703ed5	[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304) * Merge with private repo * Update README * Update README * Update README * Add PyTorch requirements * Fix sync scopes for MQA logits (#256) * Update README	2026-04-17 09:45:14 +08:00
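
Several of the fixes above (fbfeb54c9a, 2cd86ff5e7, 6d7231a50e, bbf9a5f46a) turn on the difference between E4M3 and E8M0 scale bytes. The following is a minimal pure-Python sketch of that difference, not the repository's code: every function name here is hypothetical, the real implementation works on tensors, and the nearest-value re-quantization is an assumed rounding rule (ties go to the lower code).

```python
import struct


def decode_e4m3fn(byte: int) -> float:
    """Decode one float8_e4m3fn byte: 1 sign, 4 exponent, 3 mantissa bits, bias 7."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:
        return float("nan")  # the only NaN encoding; e4m3fn has no infinities
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)


def decode_e8m0(byte: int) -> float:
    """Decode one UE8M0 byte: an unsigned power-of-two exponent with bias 127."""
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - 127)


def float32_to_e8m0(x: float) -> int:
    """Biased exponent of a positive float32, read from its bits (bits >> 23).

    The mantissa is dropped, so the value is rounded down to a power of two.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF


# Non-negative finite E4M3 codes (0x00..0x7E) with their decoded values.
_E4M3_CODES = [(b, decode_e4m3fn(b)) for b in range(0x7F)]


def fold_global_scale_sketch(block_byte: int, global_scale: float) -> int:
    """Dequantize an E4M3 block scale, multiply, clamp to [0, 448], re-quantize."""
    value = decode_e4m3fn(block_byte) * global_scale
    value = min(max(value, 0.0), 448.0)
    # Assumed rounding: pick the code whose decoded value is closest.
    return min(_E4M3_CODES, key=lambda code: abs(code[1] - value))[0]
```

The byte 0x38 illustrates the misread described in fbfeb54c9a: decode_e4m3fn(0x38) is 1.0, while decode_e8m0(0x38) is 2**-71, so treating E4M3 bytes as E8M0 exponents yields scales that are wrong by many orders of magnitude.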

29 Commits