30d72e7ef5
fix: packed FP4 for mxf4nvf4 — correct SMEM layout, UMMA descriptors, L1 epilogue
...
Key changes:
- a_dtype_t/b_dtype_t: float_e2m1_t (packed 4-bit) with sizeof_bits_v==4 assert
- kSwizzleAMode/BMode: BLOCK_K/2 (64 bytes packed, not 128 unpacked)
- SMEM sizes: LOAD_BLOCK_M * BLOCK_K / 2 (packed byte count)
- Token layouts: kHidden/2, kIntermediateHidden/2 (packed bytes)
- TMA loads: BLOCK_K/2 inner dim, uint8_t, byte offsets k_block_idx*(BLOCK_K/2)
- UMMA descriptors: BLOCK_K/2 template param, uint8_t dtype, UMMA_K/2 advance
- L1 epilogue: dropped STSM, direct st.shared.u16 with packed nibbles, no swizzle (v1)
- Pybind buffer sizes: hidden/2, intermediate_hidden/2 with packed tensor shapes
- Host TMA descriptors: hidden/2 K-dims, block_k/2 inner, fp4_unpacked_smem=false
- L1 output TMA: block_n/4 inner, no swizzle (CU_TENSOR_MAP_SWIZZLE_NONE)
2026-05-11 21:59:21 +00:00
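The `/2` byte counts in the commit above (BLOCK_K/2, kHidden/2) follow from storing two 4-bit E2M1 codes per byte. A minimal Python sketch of that arithmetic, assuming the even-indexed element sits in the low nibble (the nibble order is an assumption, not stated in the commit); `pack_e2m1` and `packed_row_bytes` are hypothetical helper names:

```python
def pack_e2m1(codes):
    """Pack 4-bit codes (0..15) two per byte, low nibble first (assumed order)."""
    if len(codes) % 2 != 0:
        raise ValueError("need an even number of 4-bit codes")
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        if not (0 <= lo <= 15 and 0 <= hi <= 15):
            raise ValueError("codes must fit in 4 bits")
        out.append(lo | (hi << 4))
    return bytes(out)


def packed_row_bytes(k):
    """Bytes needed for a row of k packed FP4 elements."""
    return k // 2
```

With BLOCK_K = 128 this gives the 64 packed bytes the commit mentions, against 128 bytes when each element occupies a full byte.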
0ac73a82f9
fix: L1 output uses unpacked E2M1 (1 byte/element) like FP8
...
- float_e2m1_unpacksmem_t: sizeof=1, SMEM is 1 byte/element (not packed)
- TMA load unpacks 2 E2M1/global-byte → 2 SMEM bytes
- UMMA reads unpacked SMEM, packs internally for mxf4nvf4
- L1→L2 handoff: unpacked format (same byte count as FP8)
- Epilogue: 4 E2M1 bytes per uint32 STSM atom, same as FP8
- Dispatch TMA: kHidden bytes (unpacked), not kHidden/2
- Added static_assert on sizeof(a_dtype_t) and sizeof(b_dtype_t)
- Note: no bandwidth savings at L1→L2 boundary for v1
2026-05-11 21:27:35 +00:00
091b974736
fix: L1 epilogue uses STSM with XOR swizzle for E2M1 FP4 output
...
Keep STSM (not naive SMEM write) so TMA reads correct bank layout.
Pack 4 E2M1 nibbles into uint32 per STSM atom with XOR swizzle.
Known perf note: 32B swizzle zone for L1 output (land for v1).
2026-05-11 20:57:34 +00:00
a554de8b24
fix: dispatch TMA byte counts for FP4 (kHidden/2), rename fp8→fp4 layout refs
2026-05-11 20:47:58 +00:00
b3d1aae038
feat: full FP4 activations for mxf4nvf4 - E2M1 packed A side + UE4M3 scales
...
mxf4nvf4 requires BOTH A and B to be FP4 (E2M1 packed).
Changes:
- a_dtype_t: float_e4m3_t → float_e2m1_unpacksmem_t
- UMMA_K: 32 → 64 (FP4 MMA atom)
- L1 epilogue: FP8 quant → E2M1 FP4 quantization with nearest-neighbor
- L1 output SMEM: packed E2M1 (2 per byte), TMA store uint8
- TMA descriptors: adjusted for FP4 packing (K/2 bytes per row)
- SymmBuffer: uint8 activations, shape (M, K//2)
- Staging kernel: BF16 → E2M1 packed + UE4M3 block16 scales
2026-05-11 20:29:08 +00:00
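The "nearest-neighbor" E2M1 quantization named in the commit above can be illustrated as a lookup over the eight representable magnitudes. This is a hedged sketch, not the kernel's code: it assumes the input has already been divided by its block scale, uses a hypothetical helper name `quantize_e2m1`, and resolves ties toward the smaller magnitude, which may differ from the hardware rounding mode.

```python
# Magnitudes representable by E2M1 (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_e2m1(x):
    """Return the 4-bit code (sign in bit 3) whose value is nearest to x."""
    sign = 0x8 if x < 0 else 0x0
    mag = min(abs(x), 6.0)  # saturate at the largest magnitude
    idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    return sign | idx
```

NaN inputs are not handled here; a real epilogue would need an explicit policy for them.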
2cd86ff5e7
fix: UE8M0→float32 reinterpret in fold_global_scale (Bug #7)
2026-05-11 19:40:01 +00:00
47621bb990
add NVFP4SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe Python wrapper
...
The C++ binding was registered but there was no Python wrapper.
vLLM patch imports get_symm_buffer_for_nvfp4_mega_moe from deep_gemm.mega.
2026-05-11 16:25:08 +00:00
86a1263f44
fix: gran_k=16 in transform_sf + sm_100a arch for NVFP4 mega_moe
...
- transform_sf_into_required_layout: add gran_k=16 branch for NVFP4 UE4M3
scales (4 per int32, group_size=16). Previously only handled 32/128.
- get_arch: always return '100a' for SM100, never '100f'. The family
variant lacks mxf4nvf4 (NVFP4 block-scaled MMA) support, causing
'scale_vec::4X not supported on sm_100f' errors.
- transform_nvfp4_weights_for_mega_moe: fold weight_scale_2 into block
scales, pack UE4M3→int32, transpose MN-major, call
transform_sf_into_required_layout with gran_k=16.
2026-05-11 16:11:11 +00:00
fbdddaccf4
revert: restore mxf4nvf4/block16 code (correct path for sm_100a)
...
Reverted to commit 36b439e's NVFP4 kernel code:
- kGranK=16, mxf4nvf4.block_scale.scale_vec::4X
- float_ue4m3_t instruction descriptor
- Block16 SF layout (4X TMEM)
- UE4M3 L1 epilogue
- No UE4M3→UE8M0 conversion, no block16→block32 merge
The mxf4nvf4.scale_vec::4X PTX instruction compiles successfully
on both sm_100 and sm_100f with CUDA 13.0. The previous build 17
error was likely from a different cause, not the arch flag.
Python: reverted transform_nvfp4_weights_for_mega_moe to use
pack_ue4m3_to_int32 with gran_k=16, no UE8M0 conversion.
2026-05-11 15:02:47 +00:00
57c629ed1b
fix: cast to int32 before >> 23 (uint32 doesn't support right-shift)
2026-05-11 09:45:54 +00:00
6d7231a50e
fix: reinterpret float32 bits as uint32 before >> 23 for UE8M0
2026-05-11 09:42:03 +00:00
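The two commits above both concern extracting a UE8M0 scale by reinterpreting float32 bits and shifting right by 23. A minimal standard-library sketch of that bit manipulation, assuming a positive finite scale; `float_to_ue8m0` is a hypothetical name, and the shift truncates the mantissa, so the result rounds down to a power of two:

```python
import struct


def float_to_ue8m0(x):
    """Return the biased 8-bit exponent of a positive float32 value."""
    if x <= 0.0:
        raise ValueError("UE8M0 only represents positive scales")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Bits 23..30 hold the exponent; the sign bit is zero for positive x.
    return (bits >> 23) & 0xFF
```

For example, 1.0 maps to 127 (the exponent bias) and 3.0 maps to 128, the same code as 2.0.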
03b8c99ee1
fix: use mxf8f6f4 (UE8M0) on SM100 — mxf4nvf4 requires SM103+
...
B200 (SM100) does NOT support kind::mxf4nvf4 at all (neither 2X nor 4X).
Only mxf8f6f4.block_scale with UE8M0 scales is available on SM100.
Strategy: keep NVFP4 E2M1 weights, convert UE4M3 block scales → UE8M0
in the weight transformation. This is a scale format adaptation for
hardware compatibility, not a format conversion.
Changes:
- Kernel: back to mxf8f6f4 instruction + float_ue8m0_t descriptor
- L1 epilogue: back to UE8M0 (>> 23) activation scales
- Python: merge block16→block32, convert UE4M3→float32→UE8M0
- Packing: uint8 (UE8M0) → int32, same as MXFP4
2026-05-11 09:28:45 +00:00
cd7a612175
debug: add shape logging to SF packing
2026-05-11 08:54:14 +00:00
dcebe033e2
fix: use scale_vec::2X (block32) for SM100 B200 compatibility
...
scale_vec::4X (block16) requires SM103/SM120 (B300/GB300), not SM100 (B200).
Revert to block32 with UE4M3 scales. Same TMEM layout as MXFP4 but with
UE4M3 scale format instead of UE8M0.
Changes:
- kGranK: 16 → 32
- PTX: scale_vec::4X → scale_vec::2X
- SF layout: same as MXFP4 (K/32, K/128 for int32 packed)
- UTCCP: i*8 → i*4 (2X layout, same as MXFP4)
- TMEM columns: same as MXFP4 (SF_BLOCK_M/32, SF_BLOCK_N/32)
- Python: merge NVFP4 block16→block32 scales (max of adjacent pairs)
- recipe: (1,1,16) → (1,1,32)
2026-05-11 08:36:59 +00:00
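The block16→block32 merge described in the commit above ("max of adjacent pairs") can be sketched on already-decoded scale values. This is an illustration only, with a hypothetical helper name; the actual code presumably operates on tensors rather than Python lists:

```python
def merge_block16_to_block32(scales):
    """Halve the number of scales by taking the max of each adjacent pair."""
    if len(scales) % 2 != 0:
        raise ValueError("need an even number of block16 scales")
    return [max(a, b) for a, b in zip(scales[0::2], scales[1::2])]
```

Taking the maximum keeps the larger dynamic range of the two blocks, at the cost of precision in the block whose original scale was smaller.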
deff80c9c1
fix: add Python wrapper for NVFP4 SymmBuffer allocation
...
get_symm_buffer_for_nvfp4_mega_moe uses _C.get_symm_buffer_size_for_nvfp4_mega_moe
to allocate the correct buffer size (2x SF entries due to group_size=16).
Custom init to avoid SymmBuffer's hardcoded MXFP4 allocation.
2026-05-11 08:05:21 +00:00
8d02eb38fa
fix: transpose SF to MN-major layout before TMA stride checks
...
transform_sf_into_required_layout expects MN-major input (stride(-2)=1).
Our packed int32 SF is K-major (stride(-1)=1). Transpose the last two
dims, make contiguous, then transpose back so data is in MN-major order.
2026-05-11 07:32:10 +00:00
7154500f22
fix: reshape SF to 2D before transform_sf_into_required_layout
...
The C++ check_sf_layout stride assertion fails on 3D (experts, mn, K//64)
tensors. Reshape to 2D (experts*mn, K//64) before calling the transform
function, matching the expected stride layout.
2026-05-11 07:30:54 +00:00
388fd8dcfd
fix: pack UE4M3 into int32 before transform_sf_into_required_layout
...
The C++ transform function expects int32 (for kInt type) with 4 UE4M3
bytes packed per int32. We pack first, then transform for TMA alignment
and UTCCP transpose with recipe (1, 16).
2026-05-11 07:05:11 +00:00
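Packing four UE4M3 bytes into one int32, as the commit above describes, might look like the following sketch. The little-endian byte order and the helper name `pack_ue4m3_bytes_to_int32` are assumptions for illustration; the commit does not state which byte lands in the low bits.

```python
def pack_ue4m3_bytes_to_int32(sf_bytes):
    """Group scale-factor bytes in fours and view each group as one int32."""
    if len(sf_bytes) % 4 != 0:
        raise ValueError("scale count must be a multiple of 4")
    return [
        int.from_bytes(bytes(sf_bytes[i:i + 4]), "little", signed=True)
        for i in range(0, len(sf_bytes), 4)
    ]
```

Because UE4M3 codes leave the top bit clear, the packed words stay non-negative even when read as signed int32.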
acae75e109
fix: use transform_sf_into_required_layout for proper TMA-aligned SF
...
Instead of custom _pack_nvfp4_sf_for_utccp, use DeepGEMM's C++
transform_sf_into_required_layout with recipe (1, 1, 16) for NVFP4.
This handles TMA alignment and UTCCP transpose correctly.
2026-05-11 06:54:34 +00:00
5cb4fcaef3
fix: cast uint8 weights to int8 (kPackedFP4) for DeepGEMM compatibility
2026-05-11 06:36:32 +00:00
bbf9a5f46a
feat: fold weight_scale_2 into block scales in NVFP4 transform
...
- transform_nvfp4_weights_for_mega_moe now accepts weight_scale_2
- Folds global scale into block scales: UE4M3 * FP32 -> UE4M3
- Dequantize to f32, multiply by global scale, clamp [0,448], re-quantize
- This is needed because the kernel only applies one level of block scaling
2026-05-11 05:42:16 +00:00
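The fold step in the commit above (dequantize, multiply by the global scale, clamp to [0, 448], re-quantize) can be sketched in pure Python. This assumes UE4M3 uses the usual E4M3 bit layout with the sign bit left at zero (bias 7, subnormals at exponent 0, code 0x7F reserved for NaN); the function names are hypothetical and the nearest-value search stands in for whatever rounding the real transform uses.

```python
def ue4m3_decode(code):
    """Decode a UE4M3 code (0x00..0x7E) to a float, assuming E4M3 layout."""
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0:
        return (man / 8.0) * 2.0 ** -6  # subnormal range
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)


# All finite values, indexed by code; 0x7F (NaN) is excluded.
_UE4M3_VALUES = [ue4m3_decode(c) for c in range(0x7F)]


def ue4m3_encode_nearest(value):
    """Clamp to [0, 448] and return the code with the nearest value."""
    clamped = min(max(value, 0.0), 448.0)
    return min(range(0x7F), key=lambda c: abs(_UE4M3_VALUES[c] - clamped))


def fold_scale_sketch(block_code, global_scale):
    """Fold an FP32 global scale into one UE4M3 block scale."""
    return ue4m3_encode_nearest(ue4m3_decode(block_code) * global_scale)
```

Folding a global scale of 2.0 into a block scale of 1.0 (code 0x38) yields code 0x40, while folding it into the maximum 448 (code 0x7E) saturates and returns 0x7E unchanged.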
36b439ee26
feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales)
...
- New CUDA kernel: sm100_fp8_nvfp4_mega_moe_impl
- kGranK=16 (NVFP4 group_size=16, vs MXFP4's 32)
- kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction
- float_ue4m3_t scale factor type in instruction descriptor
- SF layout: scale_vec::4X (4 TMEM sub-columns per UMMA atom)
- UTCCP column stride: i*8 (vs MXFP4's i*4) for 4X layout
- L1 epilogue: UE4M3 activation scales (float→cutlass::float_e4m3_t)
- SF loading: kNumSFUint32 = kHidden/64 (4 UE4M3 per int32)
- New PTX wrappers: SM100_MMA_MXF4NVF4_2x1SM_SS, SM100_MMA_MXF4NVF4_SS
- Python API:
- fp8_nvfp4_mega_moe() with recipe=(1,1,16)
- transform_nvfp4_weights_for_mega_moe() for UE4M3→int32 UTCCP packing
- _pack_nvfp4_sf_for_utccp() helper
- C++ bindings:
- mega_nvfp4.hpp with NVFP4-specific SymmBuffer (SF stride K/16)
- JIT kernel header with kGranK=16 TMA descriptors
- Registered in python_api.cpp
NOTE: Both SFA and SFB must use UE4M3 (scale_format_ is 1-bit, shared).
The L1 epilogue converts float→UE4M3 for activation scales.
2026-05-11 05:41:08 +00:00
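The `kNumSFUint32 = kHidden/64` figure in the commit above follows from one scale per 16 elements and four scales per int32. A small sketch of that arithmetic, with a hypothetical helper name and an example hidden size chosen only for illustration:

```python
def num_sf_uint32(k, gran_k, sf_per_word=4):
    """Number of 32-bit words needed to hold one row of block scales."""
    if k % (gran_k * sf_per_word) != 0:
        raise ValueError("k must be a multiple of gran_k * sf_per_word")
    return k // (gran_k * sf_per_word)
```

For a hidden size of 7168, group size 16 needs 112 words (K/64), twice the 56 words (K/128) needed at group size 32.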
Zhean Xu
891d57b4db
Add various optimizations and Mega MoE benchmarks (#316)
...
* Merge with private repo
* Add Mega MoE Benchmark
* Minor fix
* Update
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
2026-04-24 18:41:37 +08:00
Chenggang Zhao
7f2a703ed5
[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304)
...
* Merge with private repo
* Update README
* Update README
* Update README
* Add PyTorch requirements
* Fix sync scopes for MQA logits (#256)
* Update README
2026-04-17 09:45:14 +08:00
Ray Wang
d30fc36c8f
Fix sync issue of TMEM alloc/dealloc (#292)
2026-03-22 16:41:28 +08:00
Xin Qiu
35c4bc8771
fix: k_grouped_fp8_gemm_nt_contiguous crashes with n = 768 on H100 (#238)
2026-02-25 10:13:54 +08:00
Ray Wang
477618cd51
Fix a sync issue in SM100 MQA logits (#285)
2026-02-03 17:29:49 +08:00
Zhean Xu
0f5f266202
Multiple updates and refactorings (#280)
2026-01-16 17:06:52 +08:00
Ray Wang
38f8ef73a4
Multiple updates and refactorings (#231)
2025-11-21 17:49:47 +08:00
Zhean Xu
bb4424aad4
Fix sum_k * shape_m overflow
2025-11-19 11:51:36 +08:00
Ray Wang
ec5e9ed0b8
Fix SM90 MQA logits (#229)
2025-11-19 10:50:36 +08:00
Ray Wang
2f9d87877e
Use larger MMA shape (#227)
2025-11-14 11:38:15 +08:00
oliver könig
9f196058ae
chore: Build and store bdist wheels (#181)
...
* build: Minor tweaks for wheel build
Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Workflows for wheel build
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix
Signed-off-by: oliver könig <okoenig@nvidia.com>
* build: Add CachedWheel
Signed-off-by: oliver könig <okoenig@nvidia.com>
* add version to init
Signed-off-by: oliver könig <okoenig@nvidia.com>
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com>
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com>
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com>
* v2
Signed-off-by: oliver könig <okoenig@nvidia.com>
* update
Signed-off-by: oliver könig <okoenig@nvidia.com>
* test
Signed-off-by: oliver könig <okoenig@nvidia.com>
* from packaging.version import parse
Signed-off-by: oliver könig <okoenig@nvidia.com>
* local version
Signed-off-by: oliver könig <okoenig@nvidia.com>
* remove file
Signed-off-by: oliver könig <okoenig@nvidia.com>
* revert
Signed-off-by: oliver könig <okoenig@nvidia.com>
* Updates and lint
* revert missing cudaextension args
Signed-off-by: oliver könig <okoenig@nvidia.com>
* Add timeout
* fix DG settings
Signed-off-by: oliver könig <okoenig@nvidia.com>
* DG_USE_LOCAL_VERSION
Signed-off-by: oliver könig <okoenig@nvidia.com>
* Update version
* Detect local changes
* Minor fix
* Revert CUTLASS
* Unify options
---------
Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
2025-10-10 18:23:40 +08:00
Chenggang Zhao
c1bf4cae4b
Fix version
2025-10-01 20:31:27 +08:00
Chenggang Zhao
07b82fb8cd
Fix old CUDA compatibility
2025-10-01 20:29:15 +08:00
Simon Mo
59f2c07cf2
Add SM100 kernels (#201)
...
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-09-29 17:07:28 +08:00
Chenggang Zhao
80ceeb2c76
Add SM90 kernels (#200)
2025-09-29 17:00:23 +08:00
Ray Wang
3f71de7aa9
Make various updates and fixes (#198)
2025-09-25 16:19:07 +08:00
zhonghui-J
2da871e304
Fix grouped gemms performance issue. (#168)
2025-08-22 17:35:43 +08:00
Chenggang Zhao
e38c2e3103
Remove comments
2025-08-22 17:32:04 +08:00
Chenggang Zhao
f20256fd50
Compatible with CUDA 13
2025-08-22 17:30:47 +08:00
xiweny
affdb1cd90
Add sm_100f support and make nvcc 13 happy (#157)
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-08-22 17:19:32 +08:00
Ray Wang
f85ec649d7
Make various updates and fixes: (#164)
...
- Add BF16 support for SM90 and SM100
- Refactor Python APIs
- Other fixes and code refactoring
2025-08-15 18:32:35 +08:00
Ray Wang
d9c363f86f
Make various updates and fixes:
...
- Add support for legacy CUDA versions; now compatible with CUDA 12.3 and newer
- Add support for NVRTC compilation
- Other fixes and code refactoring
2025-08-02 19:52:22 -07:00
yukuai26
aff9da0aba
Fix SM90 GEMM (#149)
...
* Fix sm90 GEMM
* Fix typo
---------
Co-authored-by: Kuai Yu <yukuai@deepseek.com>
2025-08-01 10:36:49 +08:00
Ray Wang
9da4a23561
Add more GPU architectures support (#112)
...
* Add more GPU architectures support
* Update layout.py
* Optimize performance, Add SM90 support, Add 1D2D SM100 support
* Add fmtlib submodule at commit 553ec11
---------
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
2025-07-18 11:32:22 +08:00
Chenggang Zhao
03d0be3d2d
Simplify expression
2025-07-02 14:07:05 +08:00
fy1214
3fc6728dee
[add] fix smem_barrier size in wgrad way (#122)
2025-07-02 14:05:36 +08:00
yukuai
e82c4139da
Revert "Fixed the bug in get_swizzle_mode function related to elem_size setting. (#115)"
...
This reverts commit ac428e25e0.
This PR causes wgrad to hang during testing. Revert it until we resolve the issue.
2025-06-23 17:13:36 +08:00
TherLF
ac428e25e0
Fixed the bug in get_swizzle_mode function related to elem_size setting. (#115)
2025-06-23 09:37:10 +08:00