NVFP4 Mega MoE Kernel — CUTLASS Native Blackwell Implementation
Native NVFP4 block-scaled GEMM kernel for DeepSeek-V4-Pro on NVIDIA B200 (Blackwell SM100).
What This Does
Replaces the broken fp8_nvfp4_mega_moe DeepGEMM kernel with a working CUTLASS-based implementation that uses native Blackwell tensor core instructions (SM100_MMA_MXF4_SS) for E2M1 × E2M1 matrix multiplication with UE4M3 block scaling.
Architecture
DeepSeek-V4-Pro MoE layer (per rank, expert parallel):
- L1 (gate_up_proj): HIDDEN=7168 → 2×INTERMEDIATE=6144
- L2 (down_proj): INTERMEDIATE=3072 → HIDDEN=7168
- 256 experts total, 32-48 per rank (depends on EP config), top-6 routing
- NVFP4 quantization: packed E2M1 (int8, 2 FP4 per byte) + UE4M3 block16 scales
Components
cutlass_nvfp4_gemm/ — The CUTLASS Extension
| File | Purpose |
|---|---|
cutlass_nvfp4_gemm.cu |
CUTLASS GEMM + scale factor remap kernel |
pytorch_binding.cpp |
PyTorch C++ extension binding |
kernel.py |
Python wrapper (cutlass_nvfp4_gemm, cutlass_grouped_nvfp4_gemm) |
setup.py |
Build configuration (SM100a target) |
nvfp4_mega_moe.py — Main Entry Point
Called by the patched deepseek_v4.py. Dispatches to CUTLASS when MEGA_MOE_USE_CUTLASS=1.
weight_transform.py — Weight Transformation
Converts raw NVFP4 checkpoint weights into the format expected by the kernel:
- Folds global scales (float32) into block scales (UE4M3)
- Interleaves L1 gate_up weights for 2CTA UMMA
symm_buffer.py — Symmetric Buffer
Stub for NVLink cross-rank all-reduce. Matches the DeepGEMM API expected by vLLM's deepseek_v4.py.
Critical Quirks & Pitfalls
1. Scale Factor Layout (THE BIG ONE)
CUTLASS's Sm1xxBlockScaledConfig expects scale factors in an interleaved layout, NOT simple row-major. The layout is defined by:
SfAtom = Shape<Shape<32,4>, Shape<16,4>>
with Stride<Stride<16,4>, Stride<0,1>>
layout_SFA = tile_to_shape(SfAtom{}, make_shape(M,K), Step<_2,_1>{})
If you pass row-major scales directly, TMA loads read garbage addresses → NaN output → downstream CUDA illegal memory access.
Fix: GPU-side remap kernel using cute::idx2crd() to convert CUTLASS layout indices to (row, k_group) coordinates, then index into row-major source.
2. CUTLASS Version Matters
TileLang's bundled CUTLASS is too old — missing float_e2m1_t, float_ue4m3_t, block-scaled types. You need the latest from GitHub:
git clone --depth 1 https://github.com/NVIDIA/cutlass.git /root/cutlass
Key files only in the latest version:
include/cutlass/float_subbyte.h—float_e2m1_tandfloat_ue4m3_tinclude/cutlass/detail/sm100_blockscaled_layout.hpp— SFA/SFB layout computationexamples/72_blackwell_narrow_precision_gemm/72b_nvfp4_nvfp4_gemm.cu— reference implementation
3. nixl_ep Breaks CUDA 13 Images
The vllm/vllm-openai:nightly image ships nixl_ep compiled against CUDA 12, but the image is CUDA 13. At import time it tries to dlopen("libcudart.so.12") → crash. Remove it:
RUN pip uninstall -y nixl-ep; rm -rf /usr/local/lib/python3.12/dist-packages/nixl_ep
4. Fabric Manager Required for B200
B200 NVLink clusters need nvidia-fabricmanager running before CUDA runtime can init. Without it:
nvidia-smiworks (kernel module)cudaGetDeviceCount()segfaults (userspace driver)- Error 802: "system not yet initialized"
systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager
5. Docker GPU Access
Must use deploy.resources.reservations.devices in docker-compose, NOT runtime: nvidia in daemon.json:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
6. PyTorch Extension API (nightly vLLM)
- Use
c10::cuda::getCurrentCUDAStream()notat::cuda::getCurrentCUDAStream() - Use
torch::kBFloat16notat::kBF16 CUDAExtensionusesinclude_dirsnotextra_include_pathspython3notpythonin the image
7. CCCL Headers
CUTLASS 3.x depends on libcu++ (CCCL). Found at:
/usr/local/cuda-13.0/targets/x86_64-linux/include/cccl/
8. No Mixing CUDA Versions
Hard rule. If something needs CUDA 12 in a CUDA 13 image, remove the thing that needs CUDA 12. Never symlink libcudart.so.13 → libcudart.so.12.
Building
On the B200 server, inside the vLLM Docker container:
cd /root/nvfp4-megamoe-kernel/src/nvfp4_megamoe_kernel/cutlass_nvfp4_gemm
TORCH_CUDA_ARCH_LIST=10.0 python3 setup.py build_ext --inplace
Requires:
- CUTLASS at
/root/cutlass/include - CCCL at
/usr/local/cuda-13.0/targets/x86_64-linux/include/cccl/
Environment Variables
| Variable | Default | Description |
|---|---|---|
MEGA_MOE_STATIC |
0 |
Set to 1 to bypass kernel (return zeros) |
MEGA_MOE_USE_CUTLASS |
1 |
Use CUTLASS native NVFP4 kernel |
MEGA_MOE_DEBUG |
0 |
Enable debug prints |
VLLM_USE_NIXL |
0 |
Disable NIXL (broken in nightly) |
Current Status (May 14, 2026)
- ✅ CUTLASS NVFP4 GEMM compiles and loads
- ✅ Scale factor remap works (no NaN)
- ✅ vLLM server starts with native kernel
- ✅ L1 and L2 CUTLASS kernels execute
- ⚠️ Output is garbage — shared experts are bypassed (zeros)
- ⚠️ FlashInfer/DeepGEMM TF32 GEMM (shared experts) crashes workers
- ⚠️ MoE dispatch is slow (Python per-expert loop)
Next Steps
- Fix shared experts crash (FlashInfer TF32 GEMM illegal memory access)
- Verify numerical correctness of SF remap (compare against dequantize+BF16 reference)
- Optimize MoE dispatch (batched/grouped GEMM)
- Replace simple
stage_activationwith proper E2M1 quantization - Re-enable shared experts once FlashInfer crash is fixed