DeepGEMM

Author	SHA1	Message	Date
biondizzle	8d02eb38fa	fix: transpose SF to MN-major layout before TMA stride checks. transform_sf_into_required_layout expects MN-major input (stride(-2)=1), but our packed int32 SF is K-major (stride(-1)=1). Transpose the last two dims, make contiguous, then transpose back so the data is in MN-major order.	2026-05-11 07:32:10 +00:00
biondizzle	7154500f22	fix: reshape SF to 2D before transform_sf_into_required_layout. The C++ check_sf_layout stride assertion fails on 3D (experts, mn, K//64) tensors. Reshape to 2D (experts*mn, K//64) before calling the transform function, matching the expected stride layout.	2026-05-11 07:30:54 +00:00
biondizzle	388fd8dcfd	fix: pack UE4M3 into int32 before transform_sf_into_required_layout. The C++ transform function expects int32 (for the kInt type) with 4 UE4M3 bytes packed per int32. We pack first, then transform for TMA alignment and UTCCP transpose with recipe (1, 16).	2026-05-11 07:05:11 +00:00
biondizzle	acae75e109	fix: use transform_sf_into_required_layout for proper TMA-aligned SF. Instead of the custom _pack_nvfp4_sf_for_utccp, use DeepGEMM's C++ transform_sf_into_required_layout with recipe (1, 1, 16) for NVFP4. This handles TMA alignment and UTCCP transpose correctly.	2026-05-11 06:54:34 +00:00
biondizzle	5cb4fcaef3	fix: cast uint8 weights to int8 (kPackedFP4) for DeepGEMM compatibility	2026-05-11 06:36:32 +00:00
biondizzle	bbf9a5f46a	feat: fold weight_scale_2 into block scales in NVFP4 transform. transform_nvfp4_weights_for_mega_moe now accepts weight_scale_2 and folds the global scale into the block scales (UE4M3 * FP32 -> UE4M3): dequantize to f32, multiply by the global scale, clamp to [0, 448], re-quantize. This is needed because the kernel only applies one level of block scaling.	2026-05-11 05:42:16 +00:00
biondizzle	36b439ee26	feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales). New CUDA kernel: sm100_fp8_nvfp4_mega_moe_impl; kGranK=16 (NVFP4 group_size=16, vs MXFP4's 32); kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction; float_ue4m3_t scale factor type in instruction descriptor; SF layout: scale_vec::4X (4 TMEM sub-columns per UMMA atom); UTCCP column stride: i8 (vs MXFP4's i4) for 4X layout; L1 epilogue: UE4M3 activation scales (float→cutlass::float_e4m3_t); SF loading: kNumSFUint32 = kHidden/64 (4 UE4M3 per int32); new PTX wrappers: SM100_MMA_MXF4NVF4_2x1SM_SS, SM100_MMA_MXF4NVF4_SS. Python API: fp8_nvfp4_mega_moe() with recipe=(1,1,16); transform_nvfp4_weights_for_mega_moe() for UE4M3→int32 UTCCP packing; _pack_nvfp4_sf_for_utccp() helper. C++ bindings: mega_nvfp4.hpp with NVFP4-specific SymmBuffer (SF stride K/16); JIT kernel header with kGranK=16 TMA descriptors; registered in python_api.cpp. NOTE: Both SFA and SFB must use UE4M3 (scale_format_ is 1-bit, shared). The L1 epilogue converts float→UE4M3 for activation scales.	2026-05-11 05:41:08 +00:00
Zhean Xu	891d57b4db	Add various optimizations and Mega MoE benchmarks (#316) * Merge with private repo * Add Mega MoE Benchmark * Minor fix * Update --------- Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>	2026-04-24 18:41:37 +08:00
Chenggang Zhao	7f2a703ed5	[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304) * Merge with private repo * Update README * Update README * Update README * Add PyTorch requirements * Fix sync scopes for MQA logits (#256) * Update README	2026-04-17 09:45:14 +08:00

9 Commits
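
The scale-factor steps described in commits bbf9a5f46a and 388fd8dcfd (fold the FP32 global scale into the UE4M3 block scales, then pack 4 UE4M3 bytes per int32) can be illustrated with a small pure-Python sketch. This is not the repository's code: the helper names below are hypothetical, the nearest-value rounding is a brute-force stand-in for whatever rounding the real transform uses, and little-endian byte order within each int32 is an assumption.

```python
import struct

E4M3_MAX = 448.0  # largest finite E4M3 value; the clamp bound named in the log

def ue4m3_to_float(code: int) -> float:
    """Decode one non-negative E4M3 byte (sign bit assumed 0) to a float."""
    exp, man = (code >> 3) & 0xF, code & 0x7
    if exp == 0:  # subnormal: no implicit leading 1
        return (man / 8.0) * 2.0 ** -6
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# All finite non-negative codes, 0x00..0x7E (0x7F encodes NaN in E4M3).
_CODES = [(ue4m3_to_float(c), c) for c in range(0x7F)]

def float_to_ue4m3(value: float) -> int:
    """Clamp to [0, 448], then pick the nearest representable code."""
    v = min(max(value, 0.0), E4M3_MAX)
    return min(_CODES, key=lambda vc: abs(vc[0] - v))[1]

def fold_global_scale(block_scales, global_scale):
    """Hypothetical helper: dequantize, multiply, clamp, re-quantize."""
    return [float_to_ue4m3(ue4m3_to_float(c) * global_scale)
            for c in block_scales]

def pack_ue4m3_to_int32(codes):
    """Hypothetical helper: pack 4 scale bytes per int32 (little-endian assumed)."""
    if len(codes) % 4:
        raise ValueError("number of scale bytes must be a multiple of 4")
    return [v for (v,) in struct.iter_unpack("<i", bytes(codes))]

if __name__ == "__main__":
    folded = fold_global_scale([0x38, 0x40, 0x48, 0x7E], 2.0)
    print([hex(c) for c in folded])
    print([hex(w) for w in pack_ue4m3_to_int32(folded)])
```

Because each int32 carries 4 scales and each scale covers 16 elements along K (kGranK=16), one int32 covers 64 elements, which matches the K//64 dimension mentioned in the log.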