biondizzle f375c80bfe feat: CUTLASS NVFP4 block-scaled GEMM kernel (native SM100 Blackwell)

- Native NVFP4 block-scaled MMA using CUTLASS MainloopSm100TmaUmmaWarpSpecializedBlockScaled
- Invokes mxf8f6f4.block_scale tensor core instructions (tcgen05.mma)
- E2M1 (packed int8) + UE4M3 (float8_e4m3fn) block-16 scales → BF16 output
- No dequantization: hardware block-scaled MMA avoids costly dequantize+BF16 path
- PyTorch CUDA extension with CollectiveBuilder auto-deduction
- Grouped expert GEMM for MoE dispatch (32 experts/rank, top-6 routing)
- Integrated into nvfp4_mega_moe.py as primary path with TileLang fallback
- Standalone C API (cutlass_nvfp4_gemm.cu) for direct B200 compilation
- Build script, setup.py, and test script for B200 deployment

Files:
  cutlass_nvfp4_gemm/ — Kernel source, PyTorch binding, build/test scripts
  nvfp4_mega_moe.py — Updated to use CUTLASS kernel when available

2026-05-13 23:11:15 +00:00

File            Last commit                                                            Date
src             feat: CUTLASS NVFP4 block-scaled GEMM kernel (native SM100 Blackwell)  2026-05-13 23:11:15 +00:00
.gitignore      Implement TileLang NVFP4 mega_moe L1/L2 kernels                        2026-05-13 22:36:58 +00:00
pyproject.toml  Initial: TileLang NVFP4 mega_moe kernel package                        2026-05-13 15:44:51 +00:00
README.md       Initial: TileLang NVFP4 mega_moe kernel package                        2026-05-13 15:44:51 +00:00

README.md

NVFP4 Mega MoE Kernel — Mojo Rewrite

Rewrite of the DeepGEMM fp8_nvfp4_mega_moe kernel in Mojo.

Why Mojo?

Python-like syntax, C-level performance
Direct GPU programming without PTX inline asm
Safer than CUDA C++ (ownership, borrowing)
Better ergonomics for complex kernel development

Architecture

The kernel performs an NVFP4 matrix multiply (E2M1 values with one UE4M3 scale per block of 16 elements) for Mixture-of-Experts (MoE) layers, with expert parallelism across NVLink.

Key operations:

Staging — quantize BF16 activation to FP4 (E2M1) with UE4M3 scales
TMA load — load packed FP4 weights and UE4M3 scales from global memory
UMMA — mxf4nvf4 matrix multiply with block scaling
Epilogue — quantize L1 output (BF16 → FP4 + UE4M3 scales for L2)
NVLink sync — cross-rank barrier and buffer management
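
To make the staging and UMMA steps above concrete, here is a minimal pure-Python sketch of block-16 E2M1 quantization and a block-scaled dot product. It is a reference model under stated assumptions, not the kernel's implementation: the function names are hypothetical, the scale is kept as a plain float rather than rounded to UE4M3, and ties are broken toward the smaller magnitude, which may differ from the hardware rounding mode.

```python
# Illustrative reference model only. Names and rounding policy are assumptions,
# not taken from the kernel source.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes for codes 0..7
E2M1_MAX = 6.0
GROUP_SIZE = 16  # NVFP4 block size


def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (bit 3 = sign, bits 0..2 = magnitude index)."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_VALUES[code & 0x7]


def quantize_group(values):
    """Quantize 16 floats to E2M1 codes plus one shared scale.

    The scale maps the group's largest magnitude onto E2M1_MAX. A real kernel
    would additionally round the scale to UE4M3; this sketch keeps it as a float.
    """
    if len(values) != GROUP_SIZE:
        raise ValueError("expected exactly %d values" % GROUP_SIZE)
    amax = max(abs(v) for v in values)
    scale = amax / E2M1_MAX if amax > 0.0 else 1.0
    codes = []
    for v in values:
        mag = abs(v) / scale
        # Nearest representable magnitude; ties go to the smaller index.
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        codes.append((0x8 if v < 0.0 else 0x0) | idx)
    return codes, scale


def block_scaled_dot(a_groups, b_groups):
    """Dot product over (codes, scale) groups.

    Each group's partial sum is formed on the raw FP4 values and multiplied
    once by both scales, which mirrors the idea of a block-scaled MMA.
    """
    total = 0.0
    for (a_codes, a_scale), (b_codes, b_scale) in zip(a_groups, b_groups):
        partial = sum(decode_e2m1(x) * decode_e2m1(y)
                      for x, y in zip(a_codes, b_codes))
        total += a_scale * b_scale * partial
    return total
```

A model like this can serve as a slow oracle when checking kernel output on small inputs, since the scales are applied once per group instead of once per element.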

NVFP4 specifics (vs MXFP4):

group_size=16 (UE4M3 block scales), not group_size=32 (UE8M0)
2 SF K-columns per BLOCK_K (128 / 16 / 4 = 2), not 1 (128 / 32 / 4 = 1)
Weights are E2M1 values packed into int8 (2 values per byte)
mxf4nvf4 UMMA instruction with scale_vec::4X
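
The scale-factor arithmetic and the 2-values-per-byte packing above can be checked with a short sketch. Two things here are assumptions rather than facts from the source: the final divisor of 4 is read as four one-byte scales packed into one 32-bit word, and the low nibble is taken to hold the first element of each pair. The function names are hypothetical.

```python
def sf_k_columns(block_k, group_size, sf_per_word=4):
    """Packed scale-factor columns needed to cover one BLOCK_K slice.

    sf_per_word=4 assumes four 1-byte scales per 32-bit word.
    """
    scales, rem = divmod(block_k, group_size)
    if rem:
        raise ValueError("block_k must be a multiple of group_size")
    words, rem = divmod(scales, sf_per_word)
    if rem:
        raise ValueError("scale count must be a multiple of sf_per_word")
    return words


def pack_e2m1_pair(first, second):
    """Pack two 4-bit E2M1 codes into one byte (assumed order: low nibble first)."""
    return (first & 0xF) | ((second & 0xF) << 4)


def unpack_e2m1_pair(byte):
    """Inverse of pack_e2m1_pair."""
    return byte & 0xF, (byte >> 4) & 0xF


nvfp4_columns = sf_k_columns(128, 16)  # 128 / 16 / 4
mxfp4_columns = sf_k_columns(128, 32)  # 128 / 32 / 4
```

With BLOCK_K = 128, the packed weight row for one K-slice occupies 128 / 2 = 64 bytes under this layout.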

Structure

src/
  mega_moe.mojo    — main kernel entry point
  staging.mojo     — activation quantization (BF16 → FP4)
  tma.mojo         — TMA descriptor creation and copy
  umma.mojo        — UMMA descriptor and MMA operations
  epilogue.mojo    — output quantization and TMA store
  barrier.mojo     — NVLink cluster sync and symm buffer
  layout.mojo      — weight transformation and SF layout
  utils.mojo       — math helpers, UE4M3 packing
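
As a rough illustration of what a UE4M3 helper might do, the sketch below encodes and decodes scales using the float8_e4m3fn bit layout with the sign bit left at zero (4 exponent bits, bias 7, 3 mantissa bits, maximum 448, code 0x7F reserved for NaN). It is written in Python rather than Mojo, the function names are hypothetical, and packing four scale bytes into a little-endian 32-bit word is an assumed layout, not one confirmed by the source.

```python
import math

UE4M3_MAX_CODE = 0x7E  # 0x7F is NaN in float8_e4m3fn
UE4M3_MAX_VALUE = 448.0


def decode_ue4m3(code):
    """Decode a 7-bit unsigned E4M3 code (sign bit assumed to be 0)."""
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan
    if exp == 0:
        return (man / 8.0) * 2.0 ** -6  # subnormal range
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def encode_ue4m3(value):
    """Return the code nearest to value by exhaustive search.

    Values above the maximum are clamped. Ties resolve to the lower code,
    which may differ from the rounding a real kernel uses.
    """
    if math.isnan(value) or value < 0.0:
        raise ValueError("UE4M3 scales must be non-negative numbers")
    value = min(value, UE4M3_MAX_VALUE)
    return min(range(UE4M3_MAX_CODE + 1),
               key=lambda c: abs(decode_ue4m3(c) - value))


def pack_scales(codes):
    """Pack four 1-byte scale codes into a 32-bit word (assumed little-endian)."""
    if len(codes) != 4 or any(not 0 <= c <= 0xFF for c in codes):
        raise ValueError("expected four byte-sized codes")
    return int.from_bytes(bytes(codes), "little")
```

The exhaustive search is only practical because there are 127 valid codes; it is meant as a checkable reference, not a fast path.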