Commit Graph

12 Commits

Author SHA1 Message Date
cd7a612175 debug: add shape logging to SF packing 2026-05-11 08:54:14 +00:00
dcebe033e2 fix: use scale_vec::2X (block32) for SM100 B200 compatibility
scale_vec::4X (block16) requires SM103/SM120 (B300/GB300), not SM100 (B200).
Revert to block32 with UE4M3 scales. Same TMEM layout as MXFP4 but with
UE4M3 scale format instead of UE8M0.

Changes:
- kGranK: 16 → 32
- PTX: scale_vec::4X → scale_vec::2X
- SF layout: same as MXFP4 (K/32, K/128 for int32 packed)
- UTCCP: i*8 → i*4 (2X layout, same as MXFP4)
- TMEM columns: same as MXFP4 (SF_BLOCK_M/32, SF_BLOCK_N/32)
- Python: merge NVFP4 block16→block32 scales (max of adjacent pairs)
- recipe: (1,1,16) → (1,1,32)
2026-05-11 08:36:59 +00:00
deff80c9c1 fix: add Python wrapper for NVFP4 SymmBuffer allocation
get_symm_buffer_for_nvfp4_mega_moe uses _C.get_symm_buffer_size_for_nvfp4_mega_moe
to allocate the correct buffer size (2x SF entries due to group_size=16).
Custom init to avoid SymmBuffer's hardcoded MXFP4 allocation.
2026-05-11 08:05:21 +00:00
8d02eb38fa fix: transpose SF to MN-major layout before TMA stride checks
transform_sf_into_required_layout expects MN-major input (stride(-2)=1).
Our packed int32 SF is K-major (stride(-1)=1). Transpose the last two
dims, make contiguous, then transpose back so data is in MN-major order.
2026-05-11 07:32:10 +00:00
7154500f22 fix: reshape SF to 2D before transform_sf_into_required_layout
The C++ check_sf_layout stride assertion fails on 3D (experts, mn, K//64)
tensors. Reshape to 2D (experts*mn, K//64) before calling the transform
function, matching the expected stride layout.
2026-05-11 07:30:54 +00:00
388fd8dcfd fix: pack UE4M3 into int32 before transform_sf_into_required_layout
The C++ transform function expects int32 (for kInt type) with 4 UE4M3
bytes packed per int32. We pack first, then transform for TMA alignment
and UTCCP transpose with recipe (1, 16).
2026-05-11 07:05:11 +00:00
acae75e109 fix: use transform_sf_into_required_layout for proper TMA-aligned SF
Instead of custom _pack_nvfp4_sf_for_utccp, use DeepGEMM's C++
transform_sf_into_required_layout with recipe (1, 1, 16) for NVFP4.
This handles TMA alignment and UTCCP transpose correctly.
2026-05-11 06:54:34 +00:00
5cb4fcaef3 fix: cast uint8 weights to int8 (kPackedFP4) for DeepGEMM compatibility 2026-05-11 06:36:32 +00:00
bbf9a5f46a feat: fold weight_scale_2 into block scales in NVFP4 transform
- transform_nvfp4_weights_for_mega_moe now accepts weight_scale_2
- Folds global scale into block scales: UE4M3 * FP32 -> UE4M3
- Dequantize to f32, multiply by global scale, clamp [0,448], re-quantize
- This is needed because the kernel only applies one level of block scaling
2026-05-11 05:42:16 +00:00
36b439ee26 feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales)
- New CUDA kernel: sm100_fp8_nvfp4_mega_moe_impl
  - kGranK=16 (NVFP4 group_size=16, vs MXFP4's 32)
  - kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction
  - float_ue4m3_t scale factor type in instruction descriptor
  - SF layout: scale_vec::4X (4 TMEM sub-columns per UMMA atom)
  - UTCCP column stride: i*8 (vs MXFP4's i*4) for 4X layout
  - L1 epilogue: UE4M3 activation scales (float→cutlass::float_e4m3_t)
  - SF loading: kNumSFUint32 = kHidden/64 (4 UE4M3 per int32)

- New PTX wrappers: SM100_MMA_MXF4NVF4_2x1SM_SS, SM100_MMA_MXF4NVF4_SS

- Python API:
  - fp8_nvfp4_mega_moe() with recipe=(1,1,16)
  - transform_nvfp4_weights_for_mega_moe() for UE4M3→int32 UTCCP packing
  - _pack_nvfp4_sf_for_utccp() helper

- C++ bindings:
  - mega_nvfp4.hpp with NVFP4-specific SymmBuffer (SF stride K/16)
  - JIT kernel header with kGranK=16 TMA descriptors
  - Registered in python_api.cpp

NOTE: Both SFA and SFB must use UE4M3 (scale_format_ is 1-bit, shared).
The L1 epilogue converts float→UE4M3 for activation scales.
2026-05-11 05:41:08 +00:00
Zhean Xu
891d57b4db Add various optimizations and Mega MoE benchmarks (#316)
* Merge with private repo

* Add Mega MoE Benchmark

* Minor fix

* Update

---------

Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
2026-04-24 18:41:37 +08:00
Chenggang Zhao
7f2a703ed5 [Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304)
* Merge with private repo

* Update README

* Update README

* Update README

* Add PyTorch requirements

* Fix sync scopes for MQA logits (#256)

* Update README
2026-04-17 09:45:14 +08:00