NVFP4 Mega MoE Kernel — CUTLASS Native Blackwell Implementation

Native NVFP4 block-scaled GEMM kernel for DeepSeek-V4-Pro on NVIDIA B200 (Blackwell SM100).

What This Does

Replaces the broken fp8_nvfp4_mega_moe DeepGEMM kernel with a working CUTLASS-based implementation that uses native Blackwell tensor core instructions (SM100_MMA_MXF4_SS) for E2M1 × E2M1 matrix multiplication with UE4M3 block scaling.

Architecture

DeepSeek-V4-Pro MoE layer (per rank, expert parallel):

  • L1 (gate_up_proj): HIDDEN=7168 → 2×INTERMEDIATE=6144
  • L2 (down_proj): INTERMEDIATE=3072 → HIDDEN=7168
  • 256 experts total, 32-48 per rank (depends on EP config), top-6 routing
  • NVFP4 quantization: packed E2M1 (int8, 2 FP4 per byte) + UE4M3 block16 scales

Components

cutlass_nvfp4_gemm/ — The CUTLASS Extension

File Purpose
cutlass_nvfp4_gemm.cu CUTLASS GEMM + scale factor remap kernel
pytorch_binding.cpp PyTorch C++ extension binding
kernel.py Python wrapper (cutlass_nvfp4_gemm, cutlass_grouped_nvfp4_gemm)
setup.py Build configuration (SM100a target)

nvfp4_mega_moe.py — Main Entry Point

Called by the patched deepseek_v4.py. Dispatches to CUTLASS when MEGA_MOE_USE_CUTLASS=1.

weight_transform.py — Weight Transformation

Converts raw NVFP4 checkpoint weights into the format expected by the kernel:

  • Folds global scales (float32) into block scales (UE4M3)
  • Interleaves L1 gate_up weights for 2CTA UMMA

symm_buffer.py — Symmetric Buffer

Stub for NVLink cross-rank all-reduce. Matches the DeepGEMM API expected by vLLM's deepseek_v4.py.

Critical Quirks & Pitfalls

1. Scale Factor Layout (THE BIG ONE)

CUTLASS's Sm1xxBlockScaledConfig expects scale factors in an interleaved layout, NOT simple row-major. The layout is defined by:

SfAtom = Shape<Shape<32,4>, Shape<16,4>>
       with Stride<Stride<16,4>, Stride<0,1>>
layout_SFA = tile_to_shape(SfAtom{}, make_shape(M,K), Step<_2,_1>{})

If you pass row-major scales directly, TMA loads read garbage addresses → NaN output → downstream CUDA illegal memory access.

Fix: GPU-side remap kernel using cute::idx2crd() to convert CUTLASS layout indices to (row, k_group) coordinates, then index into row-major source.

2. CUTLASS Version Matters

TileLang's bundled CUTLASS is too old — missing float_e2m1_t, float_ue4m3_t, block-scaled types. You need the latest from GitHub:

git clone --depth 1 https://github.com/NVIDIA/cutlass.git /root/cutlass

Key files only in the latest version:

  • include/cutlass/float_subbyte.hfloat_e2m1_t and float_ue4m3_t
  • include/cutlass/detail/sm100_blockscaled_layout.hpp — SFA/SFB layout computation
  • examples/72_blackwell_narrow_precision_gemm/72b_nvfp4_nvfp4_gemm.cu — reference implementation

3. nixl_ep Breaks CUDA 13 Images

The vllm/vllm-openai:nightly image ships nixl_ep compiled against CUDA 12, but the image is CUDA 13. At import time it tries to dlopen("libcudart.so.12") → crash. Remove it:

RUN pip uninstall -y nixl-ep; rm -rf /usr/local/lib/python3.12/dist-packages/nixl_ep

4. Fabric Manager Required for B200

B200 NVLink clusters need nvidia-fabricmanager running before CUDA runtime can init. Without it:

  • nvidia-smi works (kernel module)
  • cudaGetDeviceCount() segfaults (userspace driver)
  • Error 802: "system not yet initialized"
systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager

5. Docker GPU Access

Must use deploy.resources.reservations.devices in docker-compose, NOT runtime: nvidia in daemon.json:

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]

6. PyTorch Extension API (nightly vLLM)

  • Use c10::cuda::getCurrentCUDAStream() not at::cuda::getCurrentCUDAStream()
  • Use torch::kBFloat16 not at::kBF16
  • CUDAExtension uses include_dirs not extra_include_paths
  • python3 not python in the image

7. CCCL Headers

CUTLASS 3.x depends on libcu++ (CCCL). Found at:

/usr/local/cuda-13.0/targets/x86_64-linux/include/cccl/

8. No Mixing CUDA Versions

Hard rule. If something needs CUDA 12 in a CUDA 13 image, remove the thing that needs CUDA 12. Never symlink libcudart.so.13 → libcudart.so.12.

Building

On the B200 server, inside the vLLM Docker container:

cd /root/nvfp4-megamoe-kernel/src/nvfp4_megamoe_kernel/cutlass_nvfp4_gemm
TORCH_CUDA_ARCH_LIST=10.0 python3 setup.py build_ext --inplace

Requires:

  • CUTLASS at /root/cutlass/include
  • CCCL at /usr/local/cuda-13.0/targets/x86_64-linux/include/cccl/

Environment Variables

Variable Default Description
MEGA_MOE_STATIC 0 Set to 1 to bypass kernel (return zeros)
MEGA_MOE_USE_CUTLASS 1 Use CUTLASS native NVFP4 kernel
MEGA_MOE_DEBUG 0 Enable debug prints
VLLM_USE_NIXL 0 Disable NIXL (broken in nightly)

Current Status (May 14, 2026)

  • CUTLASS NVFP4 GEMM compiles and loads
  • Scale factor remap works (no NaN)
  • vLLM server starts with native kernel
  • L1 and L2 CUTLASS kernels execute
  • ⚠️ Output is garbage — shared experts are bypassed (zeros)
  • ⚠️ FlashInfer/DeepGEMM TF32 GEMM (shared experts) crashes workers
  • ⚠️ MoE dispatch is slow (Python per-expert loop)

Next Steps

  1. Fix shared experts crash (FlashInfer TF32 GEMM illegal memory access)
  2. Verify numerical correctness of SF remap (compare against dequantize+BF16 reference)
  3. Optimize MoE dispatch (batched/grouped GEMM)
  4. Replace simple stage_activation with proper E2M1 quantization
  5. Re-enable shared experts once FlashInfer crash is fixed
Description
No description provided
Readme 13 MiB
Languages
Python 74.9%
Cuda 25%