NVFP4 MegaMoE Kernel
Native NVFP4 inference stack for DeepSeek-V4 on NVIDIA Blackwell (SM100). CuTeDSL kernels for the entire model — MoE experts, shared experts, attention projections — running in native NVFP4 with zero dequantization overhead.
⚠️ THE #1 RULE
WE OWN ALL OUR KERNELS. WE DO NOT PATCH vLLM.
vLLM's internal kernels (FlashMLA, fp8_ds_mla, fused compressor, Triton indexer) are deeply coupled. You cannot swap one piece and expect the rest to work. We build our own CuTeDSL kernels, test standalone, then wire into vLLM as an attention backend.
Repository Layout
This repo (nvfp4-megamoe-kernel): The kernel library — CuTeDSL kernels, bridge layer, standalone tests.
vLLM fork (vllm-deepseekv4-nvfp4): The vLLM integration — model definition, weight loading, attention backend. Lives at /root/dsv4-nvfp4-workspace/vllm on the B200.
Workspace (/root/dsv4-nvfp4-workspace):
- kernel/: clone of this repo
- vllm/: clone of the vLLM fork
Kernel Status
✅ CuTeDSL NVFP4 Grouped GEMM (the building block)
ScaledGroupedGemmKernel in cutedsl/kernel/moe/torch_scaled_grouped_mm.py:
- 2D×3D scenario: A(M,K) × B(E,K,N) → C(M,N)
- Block-scaled: per-16-element FP8 scales on both A and B sides
- Global scales (per-expert) for full dynamic range
- Persistent scheduler, TMA pipelining, SMEM swizzle
- CUDAGraph-safe (workspace pre-allocated, no runtime allocations)
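The 2D×3D scenario listed above can be read as a plain reference loop. The sketch below is a dependency-free illustration of the math only: it ignores NVFP4 quantization, scales, and tiling, and it assumes that rows of A are sorted by expert and delimited by a cumulative `offsets` list. That layout and the function name are assumptions for illustration, not the kernel's confirmed interface.

```python
def grouped_gemm_reference(a, b, offsets):
    """Reference for A(M,K) x B(E,K,N) -> C(M,N).

    a:       M rows of K floats, assumed sorted by expert
    b:       E matrices of shape K x N
    offsets: offsets[e] is the (exclusive) end row of expert e (assumed layout)
    """
    n = len(b[0][0])
    c = [[0.0] * n for _ in range(len(a))]
    start = 0
    for expert, end in enumerate(offsets):
        w = b[expert]
        for row in range(start, end):
            for col in range(n):
                # Each row only ever sees the weights of its own expert.
                c[row][col] = sum(a[row][k] * w[k][col] for k in range(len(w)))
        start = end
    return c


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: 2 x identity
]
offsets = [1, 3]  # row 0 -> expert 0, rows 1-2 -> expert 1
c = grouped_gemm_reference(a, b, offsets)
```

A reference like this is mainly useful as the comparison target for a cosine check against the kernel output.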
✅ Fused SwiGLU GEMM (L1 gate+up with SwiGLU in registers)
FusedSwiGLUScaledGroupedGemmKernel in cutedsl/kernel/moe/fused_swiglu_grouped_mm.py:
- Extends the base GEMM with an in-epilogue SwiGLU
- Weight interleave: interleave_l1_weights() interleaves gate/up at granularity 8 BF16
- epi_tile=(128, 8): each 8-wide subtile is pure gate or pure up
- Subtile-level pairing: even subtiles = gate (SiLU in FP32, save to register buffer), odd subtiles = up (load silu(gate) from buffer, compute silu(gate)*up)
- Output: BF16 with interleaved [silu(gate), silu(gate)*up] at granularity 8
- Cosine 0.988 vs BF16 reference (full MoE pipeline)
✅ Custom CUDA De-interleave + NVFP4 Quantize
cutedsl/kernels/deinterleave_quantize.cu:
- Single GPU kernel: reads fused L1 BF16 output, extracts SwiGLU from odd 8-col groups, quantizes to NVFP4
- Replaces the Python deinterleave_l1_weights() + quantize_activation_nvfp4() path
- 4.3x faster (0.043ms vs 0.184ms for 128 tokens)
- 99.97% cosine match with Python reference, 99.7% FP4 byte match
- Saves ~8.5ms over 60 MoE layers
✅ NVFP4 Linear (cutedsl/nvfp4_linear.py)
CuTeDSLNvfp4Linear — single-expert NVFP4 GEMM for shared experts and attention projections.
✅ GPU-Native SWA Decode Attention (CuTeDSL)
cutedsl/native_swa_decode.py — BlackwellSWADecodeKernel:
- CTA mapping: 1 CTA per (decode_token, q_head_group) — 8 groups × T tokens
- Q loaded into registers, KV streamed in 16-token tiles through smem
- Online softmax (max/exp/rescale/sum) across tiles
- Pre-dequantized bf16 KV (fp8 dequant done on host, fused dequant is future work)
- Cosine 0.9999+ vs PyTorch batched SDPA reference
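The online softmax step above (max/exp/rescale/sum across 16-token tiles) can be sketched in plain Python for one query vector. This is a scalar illustration of the recurrence, not the CuTeDSL kernel; the function names and the 1/sqrt(d) scaling are assumptions made for the example.

```python
import math


def online_softmax_attention(q, keys, values, tile=16):
    """Single-query attention that streams KV in tiles, keeping a running max and sum."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new max (factor is 0 on the first tile).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [x * correction for x in acc]
        for s, v in zip(scores, v_tile):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [x + p * vi for x, vi in zip(acc, v)]
        running_max = new_max
    return [x / running_sum for x in acc]


def reference_attention(q, keys, values):
    """Same result computed with a single full-length softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, values)) / total
            for d in range(len(values[0]))]


q = [0.5, -1.0, 0.25, 2.0]
keys = [[math.sin(3 * i + j) for j in range(4)] for i in range(40)]
values = [[math.cos(2 * i + j) for j in range(3)] for i in range(40)]
tiled = online_softmax_attention(q, keys, values)
full = reference_attention(q, keys, values)
```

The tiled and full results should agree to floating-point rounding, which is the same property the cosine check against SDPA exercises.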
✅ GPU-Native Sparse + SWA Decode Attention (CuTeDSL)
cutedsl/native_sparse_decode.py — BlackwellSparseDecodeKernel:
- Same CTA mapping as SWA kernel
- Concatenated SWA + compressed KV in a single attention pass
- Sink weight merge applied on host side
- Cosine 0.9999+ vs combined SDPA reference
- Supports both CSA (cr=4) and HCA (cr=128) layers
✅ Sparse Topk Metadata Kernels (C128A + C4A)
cutedsl/kernels/sparse_topk_metadata.cu + cutedsl/sparse_topk_metadata.py:
- build_c128a_topk_metadata: position-based compressed KV slot lookup via block table for C128A (cr=128) decode tokens. Maps (position, block_table) → global compressed KV slot IDs + lengths
- compute_c4a_global_topk: local topk index → global KV cache slot mapping via block table for C4A (cr=4) decode tokens
- Both tested: correct block table lookups, proper padding, valid length counts
- No FlashMLA, no vLLM Triton dependency — own CUDA kernels
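The two mappings above can be illustrated with a small host-side sketch. Everything specific here is an assumption made for the example: the paged layout (`slot = physical_block * block_size + offset`), the rule that a token at `position` sees `(position + 1) // compress_ratio` compressed entries, the `-1` padding value, and the function names. The actual CUDA kernels may differ on each of these points.

```python
def build_compressed_slots(position, block_table, compress_ratio=128,
                           block_size=4, pad_to=8):
    """Position -> global compressed KV slot IDs plus valid length (assumed semantics)."""
    num_entries = (position + 1) // compress_ratio
    if num_entries > pad_to:
        raise ValueError("pad_to is smaller than the number of compressed entries")
    slots = []
    for i in range(num_entries):
        physical_block = block_table[i // block_size]  # logical block -> physical block
        slots.append(physical_block * block_size + i % block_size)
    length = len(slots)
    slots += [-1] * (pad_to - length)  # pad to a fixed width
    return slots, length


def local_to_global_topk(local_indices, block_table, block_size=4):
    """Local top-k indices -> global KV cache slots; -1 entries stay as padding."""
    out = []
    for idx in local_indices:
        if idx < 0:
            out.append(-1)
        else:
            out.append(block_table[idx // block_size] * block_size + idx % block_size)
    return out


slots, length = build_compressed_slots(position=767, block_table=[7, 2])
global_topk = local_to_global_topk([5, 0, -1], block_table=[7, 2])
```

With these assumptions, position 767 yields six compressed entries: four in physical block 7 and two in physical block 2.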
✅ Blackwell Attention (standalone tests)
- cutedsl/blackwell_attention.py: KV cache write/read, full attention pipeline
- cutedsl/csa_attention.py: CSA (cr=4) and HCA (cr=128) sparse attention
- All standalone tests pass: KV cache (0.9997), CSA/HCA, prefill+decode (0.9998)
✅ CuTeDSL Warmup Compilation
warmup_compilation() and warmup_fused_swiglu_compilation() in bridge.py:
- Eagerly JIT-compiles GEMM kernels before model forward pass
- Uses quantized random BF16 (via quantize_to_nvfp4) for warmup data
- Zero-filled FP4/FP8 causes cudaErrorIllegalInstruction; random bytes produce NaN in MMA dequant
- All three shapes compile successfully: L1 (48 experts, 3584×3072), L2 (48 experts, 3072×3584), Fused L1
Bridge Layer (cutedsl/bridge.py)
Quantization, layout, kernel launch utilities:
| Function | Purpose |
|---|---|
| quantize_to_nvfp4() | BF16 → NVFP4 with global scale |
| quantize_activation_nvfp4() | CUDAGraph-safe quantize (pre-computed gs) |
| quantize_weight_to_nvfp4() | Weight quantization along K dim |
| interleave_l1_weights() | Gate/up interleave at granularity 8 BF16 |
| deinterleave_l1_weights() | Reverse the interleave |
| deinterleave_quantize_nvfp4_cuda() | Custom CUDA: de-interleave + quantize in one kernel |
| make_b_k_major() | B tensor stride conversion |
| assemble_scales_2d_side() / assemble_scales_3d_side() | Scale assembly + swizzle |
| warmup_compilation() | Eager JIT compilation with quantized random data (base GEMM) |
| warmup_fused_swiglu_compilation() | Eager JIT compilation with quantized random data (fused SwiGLU) |
| run_nvfp4_grouped_gemm() | Base GEMM entry point |
| run_fused_swiglu_grouped_gemm() | Fused SwiGLU GEMM entry point |
MoE Pipeline
Non-Fused Path
CuTeDSLMoERunner / run_nvfp4_moe():
- Quantize input BF16 → NVFP4 (pre-computed gs)
- L1 GEMM: NVFP4 × NVFP4 → BF16 (gate+up interleaved)
- De-interleave, split gate/up
- SiLU(gate) * up → BF16 (PyTorch)
- Re-quantize BF16 → NVFP4
- L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
- Scatter with routing weights
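The final scatter step can be sketched as a weighted accumulation back into token order. The sketch assumes one L2 output row per (token, expert) pair, with parallel `token_ids` and `routing_weights` lists; those names and that layout are illustrative assumptions, not the runner's confirmed data structures.

```python
def scatter_with_routing_weights(expert_out, token_ids, routing_weights, num_tokens):
    """Accumulate weighted expert rows into their owning token rows."""
    hidden = len(expert_out[0])
    y = [[0.0] * hidden for _ in range(num_tokens)]
    for row, tok, w in zip(expert_out, token_ids, routing_weights):
        for d in range(hidden):
            y[tok][d] += w * row[d]
    return y


expert_out = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # rows grouped by expert
token_ids = [0, 1, 0]                               # owning token of each row
routing_weights = [0.5, 1.0, 0.5]
combined = scatter_with_routing_weights(expert_out, token_ids, routing_weights, num_tokens=2)
```

Token 0 receives contributions from two rows here, which is the case that makes this a scatter-add rather than a plain permutation.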
Fused Path
run_nvfp4_moe_fused() / CuTeDSLMoERunner(fused_swiglu=True):
- Quantize input BF16 → NVFP4 (pre-computed gs)
- Fused L1 GEMM + SwiGLU in kernel registers → BF16 TMA store
- Custom CUDA kernel: de-interleave + NVFP4 quantize (0.043ms)
- L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
- Scatter with routing weights
Both paths: cosine 0.988 vs BF16 reference. Fused path is marginally more accurate (FP32 SiLU in registers vs PyTorch BF16 SiLU).
Blackwell Decode Path (vLLM Integration)
The Blackwell decode path in attention.py routes through our own kernels:
SWA-only layers (cr=0): native_swa_decode_attention — CuTeDSL kernel
CSA layers (cr=4): native_sparse_decode_attention with topk indices from compute_c4a_global_topk — our CUDA kernel maps indexer local topk → global KV cache slots
HCA layers (cr=128): native_sparse_decode_attention with topk indices from build_c128a_topk_metadata — our CUDA kernel maps positions → compressed KV slot IDs via block table lookup
Metadata flow:
- DeepseekSparseSWAMetadataBuilder builds SWA indices + C128A buffers
- attention.py detects FlashMLA vs Indexer metadata at runtime
- Blackwell path reads indexer_metadata.decode.block_table for block table access
- No FlashMLA dependency on Blackwell
Correctness Bugs Fixed (May 20, 2026)
| Bug | Issue | Fix |
|---|---|---|
| C128A topk missing | DeepseekSparseSWAMetadataBuilder returned None for C128A topk → SWA-only fallback | build_c128a_topk_metadata CUDA kernel computes global slot IDs from positions + block table |
| C4A topk missing | Relied on vLLM's Triton compute_global_topk_indices_and_lens (not ours) | compute_c4a_global_topk CUDA kernel replaces it on Blackwell |
| Warmup crash | Zero-filled FP4/FP8 → cudaErrorIllegalInstruction in MMA hardware | Quantize random BF16 through quantize_to_nvfp4 for mathematically consistent warmup data |
| Warmup disabled | Was commented out → lazy JIT on first forward → OOM competing with model | Re-enabled in runner.py; L1/L2/fused all compile eagerly |
| _fused_swiglu not initialized | CuTeDSLMoERunner.__init__ missing self._fused_swiglu = False | Added initialization |
| FlashMLA assert crash | assert flashmla_metadata is not None crashes on Blackwell where indexer_metadata is used instead | Fixed assert to accept either |
| _needs_token_refill myth | cute.compile doesn't corrupt GPU memory | Removed hack |
| Zero block FP8 scale | clamp(min=1e-8) gives nonzero scale for zero blocks | Detect zero blocks, force FP8 scale to exact 0 |
| Underflow blocks | amax < 6×2⁻⁹ gets nonzero FP4 | Detect underflow, zero x_norm before division |
| Expert counting | Materializes 18M bool tensor | torch.bincount replaces O(n×E) comparison |
| Dequantize→requantize | "Supposedly lossy" | Verified 100% byte-identical round-trip |
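The zero-block and underflow rows in the table can be illustrated with a scalar model of per-block NVFP4 quantization. This is a simplified sketch under stated assumptions: the block scale is kept as a Python float rather than rounded to FP8, ties are broken toward the smaller magnitude rather than by the hardware rounding mode, the underflow threshold is applied to the block scale after dividing by the global scale, and all names are hypothetical.

```python
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 (E2M1) magnitudes
FP8_MIN_SUBNORMAL = 2.0 ** -9                            # smallest positive E4M3 value


def quantize_block_nvfp4(block, global_scale):
    """Return (block_scale, fp4_values) for one 16-element block."""
    amax = max(abs(x) for x in block)
    block_scale = amax / 6.0 / global_scale
    # Zero or underflowing blocks: force scale and values to exact 0
    # instead of clamping the scale to a tiny nonzero number.
    if amax == 0.0 or block_scale < FP8_MIN_SUBNORMAL:
        return 0.0, [0.0] * len(block)
    values = []
    for x in block:
        x_norm = abs(x) / (block_scale * global_scale)
        mag = min(E2M1_VALUES, key=lambda v: abs(v - x_norm))
        values.append(mag if x >= 0 else -mag)
    return block_scale, values


def dequantize_block(block_scale, values, global_scale):
    return [v * block_scale * global_scale for v in values]


normal_scale, normal_vals = quantize_block_nvfp4([3.0, -1.5, 0.75] + [0.0] * 13, 1.0)
zero_scale, zero_vals = quantize_block_nvfp4([0.0] * 16, 1.0)
tiny_scale, tiny_vals = quantize_block_nvfp4([1e-4] * 16, 1.0)
roundtrip = dequantize_block(normal_scale, normal_vals, 1.0)
```

The point of the explicit branch is that a zero block dequantizes to exact zeros, which a clamped nonzero scale does not guarantee once the scale itself is rounded.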
Fused SwiGLU — How It Works
The Problem
The L1 GEMM produces (M, 2×intermediate) BF16 output with gate and up columns side by side. SwiGLU needs silu(gate)*up, producing (M, intermediate). In the unfused path, this requires:
- ~580MB BF16 write to GMEM (L1 output)
- ~290MB BF16 read back (for gate/up split + SiLU)
- 3 kernel launches + 12 quantize ops
The Solution: Granularity-8 Weight Interleave + Subtile Pairing
Key insight: With interleave_l1_weights(), gate and up weight columns are interleaved at granularity 8 BF16. In the GEMM output, every 8 BF16 columns alternate: [gate₀-₇, up₀-₇, gate₈-₁₅, up₈-₁₅, ...].
With epi_tile_n=8, each epilogue subtile covers exactly 8 BF16 N-columns. So each subtile is pure gate or pure up — no mixing. Even subtile indices = gate, odd = up.
The epilogue loop processes gate/up pairs:
for subtile_idx in range(subtile_cnt):
    acc_vec = load_accumulator(subtile_idx)
    acc_vec_bf16 = acc_vec.to(bf16)        # init before dynamic if
    if even (gate):
        silu_result = silu(acc_vec)        # FP32 math
        silu_gate_buf = silu_result        # save to register buffer
        acc_vec_bf16 = silu_result
    if odd (up):
        gate_vals = silu_gate_buf          # from previous iteration
        acc_vec_bf16 = gate_vals * acc_vec # SwiGLU
    store_to_smem(acc_vec_bf16)
    tma_store_to_gmem()
Both branches produce acc_vec_bf16 of the same BF16 type. No runtime conditional affects tensor structure. The silu_gate_buf is a register buffer initialized before the loop.
The output has interleaved [silu(gate), silu(gate)*up] at granularity 8. The custom CUDA kernel extracts odd 8-col groups (the SwiGLU result) and quantizes to NVFP4 for the L2 GEMM.
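The subtile pairing and the later extraction of odd 8-column groups can be emulated on a single output row in plain Python. This is a host-side sketch of the data flow only: it skips BF16 rounding, SMEM/TMA, and NVFP4 quantization, and the function names are hypothetical.

```python
import math

G = 8  # interleave granularity in BF16 columns


def silu(x):
    return x / (1.0 + math.exp(-x))


def fused_epilogue_row(acc_row):
    """acc_row holds interleaved [gate0-7, up0-7, gate8-15, up8-15, ...] accumulators."""
    out = []
    silu_gate_buf = [0.0] * G  # initialized before the loop, like the register buffer
    for subtile_idx in range(len(acc_row) // G):
        acc_vec = acc_row[subtile_idx * G:(subtile_idx + 1) * G]
        if subtile_idx % 2 == 0:
            # Even subtile: pure gate. Apply SiLU and keep it for the next iteration.
            silu_gate_buf = [silu(x) for x in acc_vec]
            result = silu_gate_buf
        else:
            # Odd subtile: pure up. Pair with the buffered silu(gate).
            result = [g * u for g, u in zip(silu_gate_buf, acc_vec)]
        out.extend(result)
    return out


def extract_swiglu(fused_row):
    """Keep only the odd 8-column groups, which hold silu(gate) * up."""
    return [x for i, x in enumerate(fused_row) if (i // G) % 2 == 1]


gate = [0.1 * i - 0.8 for i in range(16)]
up = [0.05 * i + 0.2 for i in range(16)]
acc_row = gate[0:8] + up[0:8] + gate[8:16] + up[8:16]
swiglu = extract_swiglu(fused_epilogue_row(acc_row))
expected = [silu(g) * u for g, u in zip(gate, up)]
```

Half of the fused output (the even groups) is discarded by the extraction step, which is why the result has `intermediate` columns rather than `2×intermediate`.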
The //2 Bug
interleave_l1_weights had g = granularity_bf16 // 2, correct for K-axis interleave (FP4 packing along K). But we interleave along N, where each N-column = 1 BF16 column. The //2 was a K-axis leftover that silently gave g=4 instead of g=8. Fixed: g = granularity_bf16 (no //2).
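The effect of the `//2` can be reproduced with column labels instead of real weights. The sketch below is not the repo's `interleave_l1_weights()`; it is a hypothetical stand-in that interleaves two label lists along N and checks whether every aligned 8-wide subtile is pure gate or pure up.

```python
def interleave_columns(gate_cols, up_cols, g):
    """Interleave gate/up columns along N in chunks of g columns."""
    if len(gate_cols) != len(up_cols) or len(gate_cols) % g != 0:
        raise ValueError("gate/up must have equal length divisible by g")
    out = []
    for start in range(0, len(gate_cols), g):
        out.extend(gate_cols[start:start + g])
        out.extend(up_cols[start:start + g])
    return out


def subtiles_are_pure(cols, width=8):
    """True if every aligned `width`-wide subtile contains only gate or only up columns."""
    for start in range(0, len(cols), width):
        kinds = {name[0] for name in cols[start:start + width]}
        if len(kinds) != 1:
            return False
    return True


gate_cols = [f"g{i}" for i in range(16)]
up_cols = [f"u{i}" for i in range(16)]

granularity_bf16 = 8
fixed = interleave_columns(gate_cols, up_cols, g=granularity_bf16)       # g = 8
buggy = interleave_columns(gate_cols, up_cols, g=granularity_bf16 // 2)  # g = 4
```

With g=4 each 8-wide epilogue subtile contains four gate and four up columns, so the even/odd pairing in the epilogue would silently multiply the wrong values.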
CuTeDSL Runtime Conditionals
CuTeDSL does support runtime conditionals on register tensors — both branches must produce the same tensor type (shape, layout, dtype). The earlier "blocked by type system" framing was wrong. The real issue: the old code applied SiLU to ALL positions (just SiLU, not SwiGLU) and the mask-blending approach (silu(both)*0.5) is mathematically wrong. With epi_tile_n=8 and subtile-level pairing, the conditional is clean.
The Global Scale Gotcha
The custom CUDA quantize kernel needs the L2 activation global scale (from the SwiGLU output), NOT the L1 input global scale. The L1 gs is based on the input magnitude (~0.1), while the SwiGLU output can be orders of magnitude larger. Passing the wrong gs causes the FP8 block scale to overflow, producing NaN. The runner pre-computes the L2 gs in compute_activation_global_scales() before CUDAGraph capture.
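A short arithmetic sketch shows why the wrong gs overflows. It assumes the common NVFP4 convention in which the global scale is `tensor_amax / (448 × 6)` and the block scale stored in FP8 is `block_amax / 6 / global_scale`; the repo may use the reciprocal convention. The SwiGLU magnitudes (50.0 and 25.0) are made-up illustrative numbers, and only the ~0.1 input magnitude comes from the text above.

```python
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
E2M1_MAX = 6.0    # largest FP4 E2M1 magnitude


def global_scale_from_amax(tensor_amax):
    """Assumed convention: chosen so that block scales fit within E4M3."""
    return tensor_amax / (E4M3_MAX * E2M1_MAX)


def block_scale_in_fp8_units(block_amax, global_scale):
    return block_amax / E2M1_MAX / global_scale


def fits_in_e4m3(scale):
    return scale <= E4M3_MAX


l1_input_gs = global_scale_from_amax(0.1)      # based on the L1 input magnitude
swiglu_output_gs = global_scale_from_amax(50.0)  # hypothetical SwiGLU output amax

block_amax = 25.0  # hypothetical block inside the SwiGLU output
wrong_scale = block_scale_in_fp8_units(block_amax, l1_input_gs)
right_scale = block_scale_in_fp8_units(block_amax, swiglu_output_gs)
```

Under these assumptions the wrong gs asks FP8 to store a block scale of roughly 112000, far above 448. E4M3 has no infinity encoding, so an unsaturated conversion of such a value yields NaN, which then propagates through the L2 GEMM.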
Remaining Work
| What | Status | Notes |
|---|---|---|
| In-epilogue NVFP4 quantize (replace BF16 TMA with FP4 TMA) | 🔨 Future | Saves ~0.14ms/layer; requires register→GMEM mapping for FP4 output |
| Fuse fp8→bf16 dequant into CuTeDSL kernel | 🔨 Future | Currently pre-dequantized on host; need vectorized fp8 loads |
| CSA/HCA sink weight merge in CuTeDSL | 🔨 Future | Applied on host for now; fuse into kernel for perf |
DeepSeek-V4 Architecture Notes
NOT MLA. DeepSeek-V4 uses:
- CSA (Compressed Sparse Attention, cr=4): KV compressed 4x, indexer finds top-k
- HCA (Heavily Compressed Attention, cr=128): KV compressed 128x, pre-computed indices
- SWA: Standard sliding window (window=128, last layer only)
- mHC: Manifold-Constrained Hyper-Connections — replaces residual connections
- 384 experts, top-6, intermediate=3072
Compress ratios by layer: alternating 128/4, layer 60 = 0 (SWA).
File Structure
cutedsl/
├── bridge.py # Quantization, layout, kernel launch
├── nvfp4_linear.py # Single-expert NVFP4 GEMM runner
├── runner.py # MoE grouped GEMM runner (fused + non-fused)
├── blackwell_attention.py # KV cache + attention (standalone)
├── csa_attention.py # CSA/HCA attention
├── custom_ops.py # torch.autograd wrappers
├── moe_pipeline.py # Standalone test pipeline (fused + non-fused)
├── sparse_topk_metadata.py # C128A + C4A topk metadata (Python wrapper)
├── native_swa_decode.py # GPU-native SWA decode (CuTeDSL)
├── native_sparse_decode.py # GPU-native sparse+SWA decode (CuTeDSL)
├── kernels/
│ ├── deinterleave_quantize.cu # Custom CUDA: de-interleave + NVFP4 quantize
│ └── sparse_topk_metadata.cu # Custom CUDA: C128A + C4A topk metadata
└── kernel/moe/
    ├── torch_scaled_grouped_mm.py # ScaledGroupedGemmKernel (the GEMM)
    └── fused_swiglu_grouped_mm.py # FusedSwiGLUScaledGroupedGemmKernel
tests/
├── layertest.py # MoE layer test — fused + non-fused (PASS, 0.988)
├── cudagraph_test.py # CUDAGraph test (PASS)
├── test_full_layer_b200.py # All NVFP4 projections (PASS, 0.994+)
├── test_v4_attention_b200.py # All 3 attention types (PASS)
├── test_kv_cache_b200.py # KV cache (PASS, 0.9997)
├── test_sparse_attn_b200.py # CSA/HCA (PASS)
├── test_decode_attention_b200.py # Prefill+decode (PASS, 0.9998)
└── ...
Key Lessons
- ⛔ NEVER assume CuTeDSL GPU tensors survive JIT compilation. cute.compile zeroes GPU memory. Keep index/mapping tensors on CPU.
- ⛔ NEVER nuke working code without understanding why it exists. CUDAGraph-safe functions exist because vLLM requires CUDAGraph.
- ⛔ NEVER fabricate facts from MEMORY.md. Verify what "works" means before citing it.
- ⛔ NEVER quantize a padded buffer and slice the output. Quantize compact data, scatter into padded layout.
- ⛔ Silent weight drops are deadly. vLLM's if name not in params_dict: continue skips weights with no warning. Replace with hard RuntimeError.
- ⛔ NVFP4 is NOT suitable for attention Q×K^T. Per-element dot products are too sensitive. Keep attention in BF16.
- ⛔ NEVER touch drivers, kernels, firmware, or system packages on the B200. The cluster costs millions. Always confirm with Mike.
- ⛔ CuTeDSL if branches must produce the same tensor type. Both branches must yield identical (shape, layout, dtype). Initialize variables before the if; using values defined only inside a branch is not supported.
- ⛔ The //2 in interleave was a K-axis leftover. FP4 packing is along K, not N. When interleaving along N, g = granularity_bf16 (no //2). The bug silently gave granularity 4 instead of 8.
- ⛔ "SiLU on all positions" is NOT SwiGLU. SwiGLU pairs silu(gate)*up. Applying SiLU to the full (M, 2×intermediate) output is just SiLU. The pairing must be explicit.
- ⛔ The global scale must match the data being quantized. Passing the L1 input gs to the SwiGLU quantize causes FP8 overflow → NaN. The gs must come from the SwiGLU output's magnitude.
- ⛔ NEVER use zero-filled or random-byte data for CuTeDSL warmup. Zeros cause division-by-zero in scale dequant. Random uint8 bytes as FP4 produce NaN/Inf in MMA → cudaErrorIllegalInstruction. Always quantize random BF16 through quantize_to_nvfp4 for mathematically consistent warmup data.
- ⛔ NEVER borrow kernels from vLLM or FlashMLA. We own all our kernels. If we need a kernel that exists in vLLM's Triton or FlashMLA's C++, we build our own CUDA/CuTeDSL equivalent from scratch.