80df24a641
fix: add kInt8 dtype support to TMA descriptor + change activation tensors to kInt8
...
- runtime_utils.hpp: added kInt8 -> CU_TENSOR_MAP_DATA_TYPE_UINT8 mapping
- mega_nvfp4.hpp: changed activation tensor dtypes from kUInt8 to kInt8
(same byte layout, but kInt8 is recognized by the TMA dtype switch)
2026-05-11 22:54:47 +00:00
30d72e7ef5
fix: packed FP4 for mxf4nvf4 — correct SMEM layout, UMMA descriptors, L1 epilogue
...
Key changes:
- a_dtype_t/b_dtype_t: float_e2m1_t (packed 4-bit) with sizeof_bits_v==4 assert
- kSwizzleAMode/BMode: BLOCK_K/2 (64 bytes packed, not 128 unpacked)
- SMEM sizes: LOAD_BLOCK_M * BLOCK_K / 2 (packed byte count)
- Token layouts: kHidden/2, kIntermediateHidden/2 (packed bytes)
- TMA loads: BLOCK_K/2 inner dim, uint8_t, byte offsets k_block_idx*(BLOCK_K/2)
- UMMA descriptors: BLOCK_K/2 template param, uint8_t dtype, UMMA_K/2 advance
- L1 epilogue: dropped STSM, direct st.shared.u16 with packed nibbles, no swizzle (v1)
- Pybind buffer sizes: hidden/2, intermediate_hidden/2 with packed tensor shapes
- Host TMA descriptors: hidden/2 K-dims, block_k/2 inner, fp4_unpacked_smem=false
- L1 output TMA: block_n/4 inner, no swizzle (CU_TENSOR_MAP_SWIZZLE_NONE)
2026-05-11 21:59:21 +00:00
0ac73a82f9
fix: L1 output uses unpacked E2M1 (1 byte/element) like FP8
...
- float_e2m1_unpacksmem_t: sizeof=1, SMEM is 1 byte/element (not packed)
- TMA load unpacks 2 E2M1/global-byte → 2 SMEM bytes
- UMMA reads unpacked SMEM, packs internally for mxf4nvf4
- L1→L2 handoff: unpacked format (same byte count as FP8)
- Epilogue: 4 E2M1 bytes per uint32 STSM atom, same as FP8
- Dispatch TMA: kHidden bytes (unpacked), not kHidden/2
- Added static_assert on sizeof(a_dtype_t) and sizeof(b_dtype_t)
- Note: no bandwidth savings at L1→L2 boundary for v1
2026-05-11 21:27:35 +00:00
b3d1aae038
feat: full FP4 activations for mxf4nvf4 - E2M1 packed A side + UE4M3 scales
...
mxf4nvf4 requires BOTH A and B to be FP4 (E2M1 packed).
Changes:
- a_dtype_t: float_e4m3_t → float_e2m1_unpacksmem_t
- UMMA_K: 32 → 64 (FP4 MMA atom)
- L1 epilogue: FP8 quant → E2M1 FP4 quantization with nearest-neighbor
- L1 output SMEM: packed E2M1 (2 per byte), TMA store uint8
- TMA descriptors: adjusted for FP4 packing (K/2 bytes per row)
- SymmBuffer: uint8 activations, shape (M, K//2)
- Staging kernel: BF16 → E2M1 packed + UE4M3 block16 scales
2026-05-11 20:29:08 +00:00
86a1263f44
fix: gran_k=16 in transform_sf + sm_100a arch for NVFP4 mega_moe
...
- transform_sf_into_required_layout: add gran_k=16 branch for NVFP4 UE4M3
scales (4 per int32, group_size=16). Previously only handled 32/128.
- get_arch: always return '100a' for SM100, never '100f'. The family
variant lacks mxf4nvf4 (NVFP4 block-scaled MMA) support, causing
'scale_vec::4X not supported on sm_100f' errors.
- transform_nvfp4_weights_for_mega_moe: fold weight_scale_2 into block
scales, pack UE4M3→int32, transpose MN-major, call
transform_sf_into_required_layout with gran_k=16.
2026-05-11 16:11:11 +00:00
fbdddaccf4
revert: restore mxf4nvf4/block16 code (correct path for sm_100a)
...
Reverted to commit 36b439e's NVFP4 kernel code:
- kGranK=16, mxf4nvf4.block_scale.scale_vec::4X
- float_ue4m3_t instruction descriptor
- Block16 SF layout (4X TMEM)
- UE4M3 L1 epilogue
- No UE4M3→UE8M0 conversion, no block16→block32 merge
The mxf4nvf4.scale_vec::4X PTX instruction compiles successfully
on both sm_100 and sm_100f with CUDA 13.0. The previous build 17
error was likely from a different cause, not the arch flag.
Python: reverted transform_nvfp4_weights_for_mega_moe to use
pack_ue4m3_to_int32 with gran_k=16, no UE8M0 conversion.
2026-05-11 15:02:47 +00:00
b856c57ba6
fix: kGranK=32 in C++ binding (was still 16 from old block16 code)
2026-05-11 09:09:32 +00:00
dcebe033e2
fix: use scale_vec::2X (block32) for SM100 B200 compatibility
...
scale_vec::4X (block16) requires SM103/SM120 (B300/GB300), not SM100 (B200).
Revert to block32 with UE4M3 scales. Same TMEM layout as MXFP4 but with
UE4M3 scale format instead of UE8M0.
Changes:
- kGranK: 16 → 32
- PTX: scale_vec::4X → scale_vec::2X
- SF layout: same as MXFP4 (K/32, K/128 for int32 packed)
- UTCCP: i*8 → i*4 (2X layout, same as MXFP4)
- TMEM columns: same as MXFP4 (SF_BLOCK_M/32, SF_BLOCK_N/32)
- Python: merge NVFP4 block16→block32 scales (max of adjacent pairs)
- recipe: (1,1,16) → (1,1,32)
2026-05-11 08:36:59 +00:00
f98c1f7fd5
fix: add gran_k=16 (NVFP4) support to transform_sf_into_required_layout
...
The C++ function only handled gran_k=32 and 128 (MXFP4/FP8).
Added gran_k=16 for NVFP4 group_size=16 support.
2026-05-11 07:13:00 +00:00
36b439ee26
feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales)
...
- New CUDA kernel: sm100_fp8_nvfp4_mega_moe_impl
- kGranK=16 (NVFP4 group_size=16, vs MXFP4's 32)
- kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction
- float_ue4m3_t scale factor type in instruction descriptor
- SF layout: scale_vec::4X (4 TMEM sub-columns per UMMA atom)
- UTCCP column stride: i*8 (vs MXFP4's i*4) for 4X layout
- L1 epilogue: UE4M3 activation scales (float→cutlass::float_e4m3_t)
- SF loading: kNumSFUint32 = kHidden/64 (4 UE4M3 per int32)
- New PTX wrappers: SM100_MMA_MXF4NVF4_2x1SM_SS, SM100_MMA_MXF4NVF4_SS
- Python API:
- fp8_nvfp4_mega_moe() with recipe=(1,1,16)
- transform_nvfp4_weights_for_mega_moe() for UE4M3→int32 UTCCP packing
- _pack_nvfp4_sf_for_utccp() helper
- C++ bindings:
- mega_nvfp4.hpp with NVFP4-specific SymmBuffer (SF stride K/16)
- JIT kernel header with kGranK=16 TMA descriptors
- Registered in python_api.cpp
NOTE: Both SFA and SFB must use UE4M3 (scale_format_ is 1-bit, shared).
The L1 epilogue converts float→UE4M3 for activation scales.
2026-05-11 05:41:08 +00:00
Zhean Xu
891d57b4db
Add various optimizations and Mega MoE benchmarks ( #316 )
...
* Merge with private repo
* Add Mega MoE Benchmark
* Minor fix
* Update
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com >
2026-04-24 18:41:37 +08:00
Chenggang Zhao
7f2a703ed5
[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes ( #304 )
...
* Merge with private repo
* Update README
* Update README
* Update README
* Add PyTorch requirements
* Fix sync scopes for MQA logits (#256 )
* Update README
2026-04-17 09:45:14 +08:00
Zhean Xu
0f5f266202
Multiple updates and refactorings ( #280 )
2026-01-16 17:06:52 +08:00
Ray Wang
38f8ef73a4
Multiple updates and refactorings ( #231 )
2025-11-21 17:49:47 +08:00
Chenggang Zhao
8da33d6bd9
Clean up
2025-11-19 11:00:55 +08:00
Guoteng
f63d7f24d6
fix: prevent int32 overflow in k-grouped GEMM size calculations ( #226 )
2025-11-19 10:52:08 +08:00
Simon Mo
59f2c07cf2
Add SM100 kernels ( #201 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-29 17:07:28 +08:00
Chenggang Zhao
80ceeb2c76
Add SM90 kernels ( #200 )
2025-09-29 17:00:23 +08:00
Ray Wang
3f71de7aa9
Make various updates and fixes ( #198 )
2025-09-25 16:19:07 +08:00
Ray Wang
f85ec649d7
Make various updates and fixes: ( #164 )
...
- Add BF16 support for SM90 and SM100
- Refactor Python APIs
- Other fixes and code refactoring
2025-08-15 18:32:35 +08:00