f375c80bfe44d071f89564bf31bc8e45fc3bfaf5
- Native NVFP4 block-scaled MMA using CUTLASS MainloopSm100TmaUmmaWarpSpecializedBlockScaled
- Invokes mxf8f6f4.block_scale tensor core instructions (tcgen05.mma)
- E2M1 (packed int8) + UE4M3 (float8_e4m3fn) block-16 scales → BF16 output
- No dequantization: hardware block-scaled MMA avoids costly dequantize+BF16 path
- PyTorch CUDA extension with CollectiveBuilder auto-deduction
- Grouped expert GEMM for MoE dispatch (32 experts/rank, top-6 routing)
- Integrated into nvfp4_mega_moe.py as primary path with TileLang fallback
- Standalone C API (cutlass_nvfp4_gemm.cu) for direct B200 compilation
- Build script, setup.py, and test script for B200 deployment
Files:
cutlass_nvfp4_gemm/ — Kernel source, PyTorch binding, build/test scripts
nvfp4_mega_moe.py — Updated to use CUTLASS kernel when available
NVFP4 Mega MoE Kernel — Mojo Rewrite
Rewrite of the DeepGEMM fp8_nvfp4_mega_moe kernel in Mojo.
Why Mojo?
- Python-like syntax, C-level performance
- Direct GPU programming without inline PTX assembly
- Safer than CUDA C++ (ownership, borrowing)
- Better ergonomics for complex kernel development
Architecture
The kernel performs an NVFP4 matrix multiply (E2M1 values with UE4M3 scales over blocks of 16) for Mixture-of-Experts (MoE) layers, with expert parallelism across NVLink.
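As a rough illustration of what a block-scaled multiply computes, the following plain-Python reference evaluates one output element from 4-bit E2M1 codes and per-group scales. It is a sketch, not the Mojo kernel: the function names are hypothetical, the scales are ordinary Python floats rather than UE4M3 bytes, and the E2M1 code layout (sign in bit 3, magnitude index in bits 0 to 2) is assumed from the standard E2M1 definition.

```python
# Reference sketch of a block-scaled dot product (one output element).
# Assumptions: scales are plain floats, codes are 4-bit E2M1 integers.

# E2M1 magnitudes indexed by the low 3 bits of the code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16  # one scale per 16 consecutive K elements


def decode_e2m1(code):
    """Decode a 4-bit E2M1 code: bit 3 is the sign, bits 0-2 the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_VALUES[code & 0x7]


def block_scaled_dot(a_codes, a_scales, b_codes, b_scales):
    """Sum over groups of a_scale * b_scale * dot(a_group, b_group)."""
    if len(a_codes) != len(b_codes) or len(a_codes) % GROUP_SIZE != 0:
        raise ValueError("K must match and be a multiple of the group size")
    n_groups = len(a_codes) // GROUP_SIZE
    if len(a_scales) != n_groups or len(b_scales) != n_groups:
        raise ValueError("expected one scale per group for each operand")

    acc = 0.0
    for g in range(n_groups):
        partial = 0.0
        for i in range(g * GROUP_SIZE, (g + 1) * GROUP_SIZE):
            partial += decode_e2m1(a_codes[i]) * decode_e2m1(b_codes[i])
        # The scales are applied once per group, not once per element.
        acc += a_scales[g] * b_scales[g] * partial
    return acc
```

Because the scales multiply each group's partial sum, the operands never need to be expanded into a full-precision copy first, which is the property the hardware block-scaled path relies on.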
Key operations:
- Staging — quantize BF16 activations to FP4 (E2M1) with UE4M3 scales
- TMA load — load packed FP4 weights and UE4M3 scales from global memory
- UMMA — mxf4nvf4 matrix multiply with block scaling
- Epilogue — quantize L1 output (BF16 → FP4 + UE4M3 scales for L2)
- NVLink sync — cross-rank barrier and buffer management
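The staging and epilogue steps both quantize higher-precision values into FP4 with one scale per block. A minimal plain-Python sketch of that idea follows. It is not the Mojo implementation: the function names are hypothetical, the scale is kept as a Python float (encoding it into an 8-bit scale format is omitted), and ties are rounded toward the smaller magnitude for simplicity, which may differ from the rounding the real kernel uses.

```python
# Sketch of per-block FP4 (E2M1) quantization with one shared scale.
# Assumptions: float scale (no 8-bit scale encoding), simple nearest rounding.

# E2M1 magnitudes indexed by the low 3 bits of the code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16  # NVFP4 block size (block16)


def quantize_block(block):
    """Return (codes, scale) for one group of 16 values."""
    if len(block) != GROUP_SIZE:
        raise ValueError("expected exactly 16 values")
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest E2M1 value (6.0).
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = abs(x) / scale
        # Nearest representable magnitude; min() keeps the lower index on ties.
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        codes.append((0x8 if x < 0 else 0x0) | idx)
    return codes, scale


def dequantize_block(codes, scale):
    """Inverse mapping, useful for checking quantization error."""
    return [(-1.0 if c & 0x8 else 1.0) * E2M1_VALUES[c & 0x7] * scale
            for c in codes]
```

Comparing `dequantize_block(*quantize_block(x))` against `x` gives a quick estimate of the quantization error for a block.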
NVFP4 specifics (vs MXFP4):
- group_size=16 (UE4M3 block scales), not group_size=32 (UE8M0)
- 2 scale-factor (SF) K-columns per BLOCK_K (128/16/4 = 2), not 1 (128/32/4 = 1)
- Weights are E2M1 packed int8 (2 values per byte)
- mxf4nvf4 UMMA instruction with scale_vec::4X
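The packing and scale-factor arithmetic in this list can be checked with a few lines of plain Python. This is a sketch under stated assumptions: the helper names are hypothetical, the low-nibble-first byte order is an assumption (the kernel's actual layout may differ), and the divisor 4 is taken directly from the 128/16/4 expression and treated here as the number of scale factors held by one SF K-column.

```python
# Sketch: FP4 nibble packing and scale-factor column arithmetic.
# Assumptions: low nibble holds the first value; 4 scale factors per SF column.


def pack_fp4(codes):
    """Pack 4-bit codes two per byte (first code in the low nibble)."""
    if len(codes) % 2 != 0:
        raise ValueError("need an even number of 4-bit codes")
    return bytes((codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
                 for i in range(0, len(codes), 2))


def unpack_fp4(packed):
    """Inverse of pack_fp4: yields two 4-bit codes per byte."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes


def sf_k_columns(block_k, group_size, sf_per_column=4):
    """Scale-factor K-columns needed to cover one BLOCK_K slice."""
    return block_k // group_size // sf_per_column


nvfp4_columns = sf_k_columns(128, 16)  # group_size=16 (UE4M3 scales)
mxfp4_columns = sf_k_columns(128, 32)  # group_size=32 (UE8M0 scales)
```

Python `bytes` are unsigned, whereas the weights are described as packed int8. The bit patterns are the same, and only the interpretation of the top bit differs.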
Structure
src/
mega_moe.mojo — main kernel entry point
staging.mojo — activation quantization (BF16 → FP4)
tma.mojo — TMA descriptor creation and copy
umma.mojo — UMMA descriptor and MMA operations
epilogue.mojo — output quantization and TMA store
barrier.mojo — NVLink cluster sync and symm buffer
layout.mojo — weight transformation and SF layout
utils.mojo — math helpers, UE4M3 packing