75f1c8544b
fix: remove smem_inner_dim doubling for packed FP4 TMA — must match MMA row width (BLOCK_K/2)
2026-05-12 17:14:44 +00:00
94b30dc2bc
revert: block_n/4 was correct (SwiGLU halving × FP4 packing)
2026-05-12 15:04:23 +00:00
c71fb97687
fix: L1 output TMA smem_inner_dim was block_n/4, should be block_n/2
...
Packed E2M1 output has 2 elements per byte, so block_n elements = block_n/2 bytes.
block_n/4 was under-sizing the TMA SMEM row by 2x → OOB write → LAUNCH_FAILED.
2026-05-12 14:58:11 +00:00
26a8ab75a1
NVFP4: fix SF pipeline — 2 K-cols per BLOCK_K for group=16
...
- TMA: issue two tma::copy calls per K-block (K_box=1, 2 SF K-columns)
- UTCCP: double loop for 2 K-columns, correct SMEM offsets
- TMEM: double SFA/SFB column counts (SF_BLOCK_M/32 * 2)
- Heuristic: fix smem_size (2× SF, packed FP4 A/B, packed send buffers, no amax)
- Staging kernel: fix double-count bug in packed_k_mask
2026-05-12 08:08:17 +00:00
c0850a6859
Fix weight TMA descriptors: packed E2M1 needs K/2, block_k/2, swizzle/2
...
Weights are packed E2M1 (2 per byte) but TMA descriptors were using
unpacked dimensions — K-dim in elements instead of bytes, 128B swizzle
instead of 64B, full block_k instead of block_k/2. This caused OOB
reads and swizzle mismatch with the UMMA descriptor, producing
illegal instruction traps.
2026-05-12 06:51:39 +00:00
49e5646b42
fix: remove duplicate kInt8 case — kPackedFP4 is already kInt8
...
kPackedFP4 = torch::kInt8, so the kInt8 case was a duplicate.
The real fix was in mega_nvfp4.hpp: changing kUInt8→kInt8 so
tensors match the existing kPackedFP4 path in the TMA switch.
2026-05-11 22:55:28 +00:00
80df24a641
fix: add kInt8 dtype support to TMA descriptor + change activation tensors to kInt8
...
- runtime_utils.hpp: added kInt8 -> CU_TENSOR_MAP_DATA_TYPE_UINT8 mapping
- mega_nvfp4.hpp: changed activation tensor dtypes from kUInt8 to kInt8
(same byte layout, but kInt8 is recognized by the TMA dtype switch)
2026-05-11 22:54:47 +00:00
30d72e7ef5
fix: packed FP4 for mxf4nvf4 — correct SMEM layout, UMMA descriptors, L1 epilogue
...
Key changes:
- a_dtype_t/b_dtype_t: float_e2m1_t (packed 4-bit) with sizeof_bits_v==4 assert
- kSwizzleAMode/BMode: BLOCK_K/2 (64 bytes packed, not 128 unpacked)
- SMEM sizes: LOAD_BLOCK_M * BLOCK_K / 2 (packed byte count)
- Token layouts: kHidden/2, kIntermediateHidden/2 (packed bytes)
- TMA loads: BLOCK_K/2 inner dim, uint8_t, byte offsets k_block_idx*(BLOCK_K/2)
- UMMA descriptors: BLOCK_K/2 template param, uint8_t dtype, UMMA_K/2 advance
- L1 epilogue: dropped STSM, direct st.shared.u16 with packed nibbles, no swizzle (v1)
- Pybind buffer sizes: hidden/2, intermediate_hidden/2 with packed tensor shapes
- Host TMA descriptors: hidden/2 K-dims, block_k/2 inner, fp4_unpacked_smem=false
- L1 output TMA: block_n/4 inner, no swizzle (CU_TENSOR_MAP_SWIZZLE_NONE)
2026-05-11 21:59:21 +00:00
0ac73a82f9
fix: L1 output uses unpacked E2M1 (1 byte/element) like FP8
...
- float_e2m1_unpacksmem_t: sizeof=1, SMEM is 1 byte/element (not packed)
- TMA load unpacks 2 E2M1/global-byte → 2 SMEM bytes
- UMMA reads unpacked SMEM, packs internally for mxf4nvf4
- L1→L2 handoff: unpacked format (same byte count as FP8)
- Epilogue: 4 E2M1 bytes per uint32 STSM atom, same as FP8
- Dispatch TMA: kHidden bytes (unpacked), not kHidden/2
- Added static_assert on sizeof(a_dtype_t) and sizeof(b_dtype_t)
- Note: no bandwidth savings at L1→L2 boundary for v1
2026-05-11 21:27:35 +00:00
b3d1aae038
feat: full FP4 activations for mxf4nvf4 - E2M1 packed A side + UE4M3 scales
...
mxf4nvf4 requires BOTH A and B to be FP4 (E2M1 packed).
Changes:
- a_dtype_t: float_e4m3_t → float_e2m1_unpacksmem_t
- UMMA_K: 32 → 64 (FP4 MMA atom)
- L1 epilogue: FP8 quant → E2M1 FP4 quantization with nearest-neighbor
- L1 output SMEM: packed E2M1 (2 per byte), TMA store uint8
- TMA descriptors: adjusted for FP4 packing (K/2 bytes per row)
- SymmBuffer: uint8 activations, shape (M, K//2)
- Staging kernel: BF16 → E2M1 packed + UE4M3 block16 scales
2026-05-11 20:29:08 +00:00
86a1263f44
fix: gran_k=16 in transform_sf + sm_100a arch for NVFP4 mega_moe
...
- transform_sf_into_required_layout: add gran_k=16 branch for NVFP4 UE4M3
scales (4 per int32, group_size=16). Previously only handled 32/128.
- get_arch: always return '100a' for SM100, never '100f'. The family
variant lacks mxf4nvf4 (NVFP4 block-scaled MMA) support, causing
'scale_vec::4X not supported on sm_100f' errors.
- transform_nvfp4_weights_for_mega_moe: fold weight_scale_2 into block
scales, pack UE4M3→int32, transpose MN-major, call
transform_sf_into_required_layout with gran_k=16.
2026-05-11 16:11:11 +00:00
fbdddaccf4
revert: restore mxf4nvf4/block16 code (correct path for sm_100a)
...
Reverted to commit 36b439e's NVFP4 kernel code:
- kGranK=16, mxf4nvf4.block_scale.scale_vec::4X
- float_ue4m3_t instruction descriptor
- Block16 SF layout (4X TMEM)
- UE4M3 L1 epilogue
- No UE4M3→UE8M0 conversion, no block16→block32 merge
The mxf4nvf4.scale_vec::4X PTX instruction compiles successfully
on both sm_100 and sm_100f with CUDA 13.0. The previous build 17
error was likely from a different cause, not the arch flag.
Python: reverted transform_nvfp4_weights_for_mega_moe to use
pack_ue4m3_to_int32 with gran_k=16, no UE8M0 conversion.
2026-05-11 15:02:47 +00:00
b856c57ba6
fix: kGranK=32 in C++ binding (was still 16 from old block16 code)
2026-05-11 09:09:32 +00:00
dcebe033e2
fix: use scale_vec::2X (block32) for SM100 B200 compatibility
...
scale_vec::4X (block16) requires SM103/SM120 (B300/GB300), not SM100 (B200).
Revert to block32 with UE4M3 scales. Same TMEM layout as MXFP4 but with
UE4M3 scale format instead of UE8M0.
Changes:
- kGranK: 16 → 32
- PTX: scale_vec::4X → scale_vec::2X
- SF layout: same as MXFP4 (K/32, K/128 for int32 packed)
- UTCCP: i*8 → i*4 (2X layout, same as MXFP4)
- TMEM columns: same as MXFP4 (SF_BLOCK_M/32, SF_BLOCK_N/32)
- Python: merge NVFP4 block16→block32 scales (max of adjacent pairs)
- recipe: (1,1,16) → (1,1,32)
2026-05-11 08:36:59 +00:00
f98c1f7fd5
fix: add gran_k=16 (NVFP4) support to transform_sf_into_required_layout
...
The C++ function only handled gran_k=32 and 128 (MXFP4/FP8).
Added gran_k=16 for NVFP4 group_size=16 support.
2026-05-11 07:13:00 +00:00
36b439ee26
feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales)
...
- New CUDA kernel: sm100_fp8_nvfp4_mega_moe_impl
- kGranK=16 (NVFP4 group_size=16, vs MXFP4's 32)
- kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction
- float_ue4m3_t scale factor type in instruction descriptor
- SF layout: scale_vec::4X (4 TMEM sub-columns per UMMA atom)
- UTCCP column stride: i*8 (vs MXFP4's i*4) for 4X layout
- L1 epilogue: UE4M3 activation scales (float→cutlass::float_e4m3_t)
- SF loading: kNumSFUint32 = kHidden/64 (4 UE4M3 per int32)
- New PTX wrappers: SM100_MMA_MXF4NVF4_2x1SM_SS, SM100_MMA_MXF4NVF4_SS
- Python API:
- fp8_nvfp4_mega_moe() with recipe=(1,1,16)
- transform_nvfp4_weights_for_mega_moe() for UE4M3→int32 UTCCP packing
- _pack_nvfp4_sf_for_utccp() helper
- C++ bindings:
- mega_nvfp4.hpp with NVFP4-specific SymmBuffer (SF stride K/16)
- JIT kernel header with kGranK=16 TMA descriptors
- Registered in python_api.cpp
NOTE: Both SFA and SFB must use UE4M3 (scale_format_ is 1-bit, shared).
The L1 epilogue converts float→UE4M3 for activation scales.
2026-05-11 05:41:08 +00:00
Zhean Xu
891d57b4db
Add various optimizations and Mega MoE benchmarks ( #316 )
...
* Merge with private repo
* Add Mega MoE Benchmark
* Minor fix
* Update
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com >
2026-04-24 18:41:37 +08:00
Chenggang Zhao
7f2a703ed5
[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes ( #304 )
...
* Merge with private repo
* Update README
* Update README
* Update README
* Add PyTorch requirements
* Fix sync scopes for MQA logits (#256 )
* Update README
2026-04-17 09:45:14 +08:00
Zhean Xu
0f5f266202
Multiple updates and refactorings ( #280 )
2026-01-16 17:06:52 +08:00
yurekami
6be0eb31d9
fix: use SM90ArchSpec instead of SM100ArchSpec in sm90_bf16_k_grouped_gemm
...
The function sm90_bf16_k_grouped_gemm was incorrectly using SM100ArchSpec
to calculate TMA descriptor block sizes. Since this file is the SM90
implementation, it should consistently use SM90ArchSpec like the other
functions in this file (sm90_bf16_gemm, sm90_m_grouped_bf16_gemm_contiguous,
etc.).
This fixes a copy-paste error that could cause incorrect block size
calculations on SM90 (Hopper) GPUs.
Fixes #242
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2026-01-01 05:06:36 +09:00
Ray Wang
38f8ef73a4
Multiple updates and refactorings ( #231 )
2025-11-21 17:49:47 +08:00
Chenggang Zhao
8da33d6bd9
Clean up
2025-11-19 11:00:55 +08:00
Guoteng
f63d7f24d6
fix: prevent int32 overflow in k-grouped GEMM size calculations ( #226 )
2025-11-19 10:52:08 +08:00
Ray Wang
ec5e9ed0b8
Fix SM90 MQA logits ( #229 )
2025-11-19 10:50:36 +08:00
Ray Wang
2f9d87877e
Use larger MMA shape ( #227 )
2025-11-14 11:38:15 +08:00
Chenggang Zhao
f8f41145da
Use CUDA runtime API to get device prop instead of ATen
2025-10-11 09:16:31 +08:00
Chenggang Zhao
07b82fb8cd
Fix old CUDA compatibility
2025-10-01 20:29:15 +08:00
Simon Mo
59f2c07cf2
Add SM100 kernels ( #201 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-29 17:07:28 +08:00
Chenggang Zhao
80ceeb2c76
Add SM90 kernels ( #200 )
2025-09-29 17:00:23 +08:00
Ray Wang
3f71de7aa9
Make various updates and fixes ( #198 )
2025-09-25 16:19:07 +08:00
yukuai26
79f48ee15a
Fix multicast bug and optimize masked GEMM ( #193 )
...
* Fix multicast bug and profile masked GEMM
* Updates and lint
---------
Co-authored-by: Kuai Yu <yukuai@deepseek.com >
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com >
2025-09-12 17:12:27 +08:00
Chenggang Zhao
ea9c5d9270
Use driver API
2025-08-28 09:40:49 +08:00
Chenggang Zhao
0e49c3353b
Refactor compiler version checks and arch flags
2025-08-27 09:28:21 +08:00
PGFLMG
3a93f4eb28
Fix B200 cu128 NVCC compilation failed ( #173 )
2025-08-27 09:07:18 +08:00
Chenggang Zhao
9c3783beb2
Fix CUBIN symbol name compatibility
2025-08-26 17:43:26 +08:00
Chenggang Zhao
f20256fd50
Compatible with CUDA 13
2025-08-22 17:30:47 +08:00
xiweny
affdb1cd90
Add sm_100f support and make nvcc 13 happy ( #157 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com >
2025-08-22 17:19:32 +08:00
Ray Wang
f85ec649d7
Make various updates and fixes: ( #164 )
...
- Add BF16 support for SM90 and SM100
- Refactor Python APIs
- Other fixes and code refactoring
2025-08-15 18:32:35 +08:00
zhonghui-J
3254b758e2
Polish get_best_configs modeling. ( #158 )
2025-08-14 16:50:21 +08:00
LJC00118
7b6b5563b9
Fix smxx layout assertion ( #154 )
2025-08-05 10:38:06 +08:00
Ray Wang
d9c363f86f
Make various updates and fixes:
...
- Add support for legacy CUDA versions; now compatible with CUDA 12.3 and newer
- Add support for NVRTC compilation
- Other fixes and code refactoring
2025-08-02 19:52:22 -07:00
Chenggang Zhao
c50deed14c
Code lint
2025-07-30 10:39:30 +08:00
LJC00118
6bc75b549e
Fix smxx layout assertion ( #141 )
...
* Fix assertion error in smxx_layout.hpp for mn % 4 != 0 cases
* Fix assertion error in smxx_layout.hpp for mn % 4 != 0 cases
* Align submodule files
* Fix assertion error in smxx_layout.hpp for mn % 4 != 0 cases
* fix(smxx_layout): support mn%4!=0 and num_groups>1 via torch
* fix(smxx_layout): support mn%4!=0 and num_groups>1 via torch
* fix: correct logic for entering get_mn_major_tma_aligned_packed_ue8m0_tensor_torch
2025-07-30 10:36:54 +08:00
dan_the_3rd
4b4e4f20dd
Update system.hpp ( #133 )
2025-07-28 17:01:05 +08:00
Chenggang Zhao
187656694f
Code lint
2025-07-21 11:00:50 +08:00
Ray Wang
436a56314c
Use std::filesystem::directory_iterator instead of std::filesystem::recursive_directory_iterator to avoid an ABI breakage we met ( #131 )
2025-07-21 10:44:20 +08:00
Ray Wang
9da4a23561
Add more GPU architectures support ( #112 )
...
* Add more GPU architectures support
* Update layout.py
* Optimize performance, Add SM90 support, Add 1D2D SM100 support
* Add fmtlib submodule at commit 553ec11
---------
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com >
2025-07-18 11:32:22 +08:00