biondizzle bf13665dbe Implement TileLang NVFP4 mega_moe L1/L2 kernels
- nvfp4_mega_moe_l1: L1 GEMM (gate_up_proj) with FP4 dequant → BF16 GEMM
- nvfp4_mega_moe_l2: L2 GEMM (down_proj) with FP4 dequant → BF16 GEMM
- nvfp4_dequant.py: E2M1 packed → BF16 with UE4M3 block16 scales
- tilelang_kernels.py: Grouped expert GEMM with TileLang-compiled BF16 GEMM
- Full pipeline: L1 GEMM → SiLU+Mul → re-quantize → L2 GEMM → output
- MEGA_MOE_STATIC=1 bypass still works for pipeline testing

Current approach: dequantize FP4→BF16 then run BF16 GEMM via TileLang T.gemm
(auto-lowers to tcgen05 on Blackwell). Will be upgraded to native FP4
block-scaled MMA (tcgen05.mma kind::mxf8f6f4.block_scale) once TileLang
adds E2M1+UE4M3 support.
2026-05-13 22:36:58 +00:00

NVFP4 Mega MoE Kernel — Mojo Rewrite

Rewrite of the DeepGEMM fp8_nvfp4_mega_moe kernel in Mojo.

Why Mojo?

  • Python-like syntax, C-level performance
  • Direct GPU programming without PTX inline asm
  • Safer than CUDA C++ (ownership, borrowing)
  • Better ergonomics for complex kernel development

Architecture

The kernel performs an NVFP4 (E2M1 values with UE4M3 block16 scales) matrix multiply for Mixture-of-Experts (MoE) layers, with expert parallelism across NVLink.

Key operations:

  1. Staging — quantize BF16 activation to FP4 (E2M1) with UE8M0 scales
  2. TMA load — load packed FP4 weights and UE4M3 scales from global memory
  3. UMMA — mxf4nvf4 matrix multiply with block scaling
  4. Epilogue — quantize L1 output (BF16 → FP4 + UE4M3 scales for L2)
  5. NVLink sync — cross-rank barrier and buffer management
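
The staging step (operation 1) can be illustrated with a small host-side reference in plain Python. This is a sketch under stated assumptions, not the kernel's implementation: the function names, the low-nibble-first packing order, the tie-breaking rule, and the choice of the smallest power-of-two scale that keeps the block within the E2M1 range are all illustrative.

```python
import math

# Magnitudes representable by E2M1 (2 exponent bits, 1 mantissa bit).
# The index in this list is the low 3 bits of the 4-bit code.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def encode_e2m1(x):
    """Round x to the nearest E2M1 value and return its 4-bit code.

    Bit 3 is the sign. Ties go to the smaller magnitude here, which is
    a simplification (hardware rounding modes may differ).
    """
    sign = 0x8 if x < 0 else 0x0
    mag = min(abs(x), 6.0)  # saturate at the largest E2M1 magnitude
    idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    return sign | idx


def quantize_block(values):
    """Quantize one block to packed E2M1 plus a UE8M0 scale byte.

    Assumes an even number of values. UE8M0 stores only a biased
    exponent, so the scale is 2 ** (byte - 127).
    """
    amax = max(abs(v) for v in values)
    # Smallest power of two such that amax / scale <= 6.0.
    exp = 0 if amax == 0 else math.ceil(math.log2(amax / 6.0))
    scale = 2.0 ** exp
    codes = [encode_e2m1(v / scale) for v in values]
    # Assumed packing: first value in the low nibble, second in the high.
    packed = bytes(codes[i] | (codes[i + 1] << 4)
                   for i in range(0, len(codes), 2))
    return packed, exp + 127
```

For example, `quantize_block([12.0, 1.0])` picks a scale of 2, so the two values are stored as the codes for 6.0 and 0.5.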

NVFP4 specifics (vs MXFP4):

  • group_size=16 (UE4M3 block scales), not group_size=32 (UE8M0)
  • 2 SF K-columns per BLOCK_K (128/16/4=2), not 1
  • Weights are E2M1 packed int8 (2 values per byte)
  • mxf4nvf4 UMMA instruction with scale_vec::4X
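
The storage format described above can be checked with a plain-Python reference dequantizer. This is a minimal sketch with several assumptions: the helper names are hypothetical, the low nibble is taken to hold the first value of each byte, UE4M3 is decoded as an E4M3 byte with bias 7 and no NaN handling, and the divisor 4 in 128/16/4 is read as four scale bytes packed per 32-bit word.

```python
# Magnitudes representable by E2M1; the index is the low 3 bits of the code.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_ue4m3(byte):
    """Decode an unsigned E4M3 scale (4 exponent bits, 3 mantissa bits).

    Assumes exponent bias 7 and ignores the NaN encoding.
    """
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:  # subnormal range
        return (man / 8.0) * 2.0 ** -6
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def dequantize_nvfp4(packed, scales, group_size=16):
    """Expand packed E2M1 bytes and apply one UE4M3 scale per group."""
    values = []
    for byte in packed:
        # Assumed order: low nibble first, then high nibble.
        for code in (byte & 0xF, byte >> 4):
            mag = E2M1_MAGNITUDES[code & 0x7]
            values.append(-mag if code & 0x8 else mag)
    return [v * decode_ue4m3(scales[i // group_size])
            for i, v in enumerate(values)]


def sf_k_columns(block_k, group_size, sf_per_word=4):
    """Number of packed scale-factor columns covering one BLOCK_K."""
    return block_k // group_size // sf_per_word
```

With `block_k=128`, `sf_k_columns` gives 2 for group size 16 (NVFP4) and 1 for group size 32 (MXFP4), matching the counts in the list above.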

Structure

src/
  mega_moe.mojo    — main kernel entry point
  staging.mojo     — activation quantization (BF16 → FP4)
  tma.mojo         — TMA descriptor creation and copy
  umma.mojo        — UMMA descriptor and MMA operations
  epilogue.mojo    — output quantization and TMA store
  barrier.mojo     — NVLink cluster sync and symm buffer
  layout.mojo      — weight transformation and SF layout
  utils.mojo       — math helpers, UE4M3 packing