Replaces the broken fp8_nvfp4_mega_moe DeepGEMM kernel with a working CUTLASS-based implementation that uses native Blackwell tensor core instructions (SM100_MMA_MXF4_SS) for E2M1 × E2M1 matrix multiplication with UE4M3 block scaling.

Architecture

DeepSeek-V4-Pro MoE layer (per rank, expert parallel):

L1 (gate_up_proj): HIDDEN=7168 → 2×INTERMEDIATE=6144
L2 (down_proj): INTERMEDIATE=3072 → HIDDEN=7168
256 experts total, 32-48 per rank (depends on EP config), top-6 routing
NVFP4 quantization: packed E2M1 (int8, 2 FP4 per byte) + UE4M3 block16 scales

Components

`cutlass_nvfp4_gemm/` — The CUTLASS Extension

File	Purpose
`cutlass_nvfp4_gemm.cu`	CUTLASS GEMM + scale factor remap kernel
`pytorch_binding.cpp`	PyTorch C++ extension binding
`kernel.py`	Python wrapper (`cutlass_nvfp4_gemm`, `cutlass_grouped_nvfp4_gemm`)
`setup.py`	Build configuration (SM100a target)

`nvfp4_mega_moe.py` — Main Entry Point

Called by the patched deepseek_v4.py. Dispatches to CUTLASS when MEGA_MOE_USE_CUTLASS=1.

`weight_transform.py` — Weight Transformation

Converts raw NVFP4 checkpoint weights into the format expected by the kernel:

Folds global scales (float32) into block scales (UE4M3)
Interleaves L1 gate_up weights for 2CTA UMMA

`symm_buffer.py` — Symmetric Buffer

Stub for NVLink cross-rank all-reduce. Matches the DeepGEMM API expected by vLLM's deepseek_v4.py.

Critical Quirks & Pitfalls

1. Scale Factor Layout (THE BIG ONE)

CUTLASS's Sm1xxBlockScaledConfig expects scale factors in an interleaved layout, NOT simple row-major. The layout is defined by:

SfAtom = Shape<Shape<32,4>, Shape<16,4>>
       with Stride<Stride<16,4>, Stride<0,1>>
layout_SFA = tile_to_shape(SfAtom{}, make_shape(M,K), Step<_2,_1>{})

If you pass row-major scales directly, TMA loads read garbage addresses → NaN output → downstream CUDA illegal memory access.

Fix: GPU-side remap kernel using cute::idx2crd() to convert CUTLASS layout indices to (row, k_group) coordinates, then index into row-major source.

2. CUTLASS Version Matters

TileLang's bundled CUTLASS is too old — missing float_e2m1_t, float_ue4m3_t, block-scaled types. You need the latest from GitHub:

git clone --depth 1 https://github.com/NVIDIA/cutlass.git /root/cutlass

Key files only in the latest version:

include/cutlass/float_subbyte.h — float_e2m1_t and float_ue4m3_t
include/cutlass/detail/sm100_blockscaled_layout.hpp — SFA/SFB layout computation
examples/72_blackwell_narrow_precision_gemm/72b_nvfp4_nvfp4_gemm.cu — reference implementation

3. nixl_ep Breaks CUDA 13 Images

The vllm/vllm-openai:nightly image ships nixl_ep compiled against CUDA 12, but the image is CUDA 13. At import time it tries to dlopen("libcudart.so.12") → crash. Remove it:

RUN pip uninstall -y nixl-ep; rm -rf /usr/local/lib/python3.12/dist-packages/nixl_ep

4. Fabric Manager Required for B200

B200 NVLink clusters need nvidia-fabricmanager running before CUDA runtime can init. Without it:

nvidia-smi works (kernel module)
cudaGetDeviceCount() segfaults (userspace driver)
Error 802: "system not yet initialized"

systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager

5. Docker GPU Access

Must use deploy.resources.reservations.devices in docker-compose, NOT runtime: nvidia in daemon.json:

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]

6. PyTorch Extension API (nightly vLLM)

Use c10::cuda::getCurrentCUDAStream() not at::cuda::getCurrentCUDAStream()
Use torch::kBFloat16 not at::kBF16
CUDAExtension uses include_dirs not extra_include_paths
python3 not python in the image

7. CCCL Headers

CUTLASS 3.x depends on libcu++ (CCCL). Found at:

/usr/local/cuda-13.0/targets/x86_64-linux/include/cccl/

8. No Mixing CUDA Versions

Hard rule. If something needs CUDA 12 in a CUDA 13 image, remove the thing that needs CUDA 12. Never symlink libcudart.so.13 → libcudart.so.12.

Building

On the B200 server, inside the vLLM Docker container:

cd /root/nvfp4-megamoe-kernel/src/nvfp4_megamoe_kernel/cutlass_nvfp4_gemm
TORCH_CUDA_ARCH_LIST=10.0 python3 setup.py build_ext --inplace

Requires:

CUTLASS at /root/cutlass/include
CCCL at /usr/local/cuda-13.0/targets/x86_64-linux/include/cccl/

Environment Variables

Variable	Default	Description
`MEGA_MOE_STATIC`	`0`	Set to `1` to bypass kernel (return zeros)
`MEGA_MOE_USE_CUTLASS`	`1`	Use CUTLASS native NVFP4 kernel
`MEGA_MOE_DEBUG`	`0`	Enable debug prints
`VLLM_USE_NIXL`	`0`	Disable NIXL (broken in nightly)

Current Status (May 14, 2026)

✅ CUTLASS NVFP4 GEMM compiles and loads
✅ Scale factor remap works (no NaN)
✅ vLLM server starts with native kernel
✅ L1 and L2 CUTLASS kernels execute
⚠️ Output is garbage — shared experts are bypassed (zeros)
⚠️ FlashInfer/DeepGEMM TF32 GEMM (shared experts) crashes workers
⚠️ MoE dispatch is slow (Python per-expert loop)

Next Steps

Fix shared experts crash (FlashInfer TF32 GEMM illegal memory access)
Verify numerical correctness of SF remap (compare against dequantize+BF16 reference)
Optimize MoE dispatch (batched/grouped GEMM)
Replace simple stage_activation with proper E2M1 quantization
Re-enable shared experts once FlashInfer crash is fixed

README.md Unescape Escape

NVFP4 Mega MoE Kernel — CUTLASS Native Blackwell Implementation

What This Does