# NVFP4 MegaMoE Kernel
Full NVFP4 inference pipeline for DeepSeek-V4 on NVIDIA Blackwell (SM100). MoE experts, shared experts, and attention projections run in native NVFP4 with no dequantization step; the only exceptions are `wo_a` (converted to FP8) and the compressor (BF16).
## What This Is
A native NVFP4 inference stack for DeepSeek-V4:
**MoE Experts** — CuTeDSL ScaledGroupedGemmKernel (our work):
```
BF16 input → quantize to NVFP4
L1 GEMM: NVFP4 × NVFP4 → BF16 (gate + up)
SiLU(gate) * up → BF16 (only nonlinear — can't avoid BF16 here)
Re-quantize → NVFP4
L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
Scatter with routing weights → BF16 output
```
**Attention Projections** — FlashInferCutlassNvFp4LinearKernel (vLLM built-in):
- `wq_b`, `wo_b`, `fused_wqa_wkv` — native NVFP4, no conversion
- `wo_a` — NVFP4→FP8 for `fp8_einsum` (only attention weight that needs conversion)
- Compressor — BF16 (weight_loader stacking issue, small matmul)
**Shared Experts** — FlashInferCutlassNvFp4LinearKernel (vLLM built-in):
- `gate_up_proj`, `down_proj` — native NVFP4
Both GEMM types use `float4_e2m1fn_x2` for weights, `float8_e4m3fn` for block scales, `float32` for global scales. BF16 is used only for SiLU activation, the final MoE scatter, and the compressor — the minimum possible.
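
To make the format concrete, here is a minimal pure-Python sketch of block quantization in the NVFP4 style: 16 elements share one scale, and each element is rounded to the nearest E2M1 magnitude. It is not the bridge's implementation. The function names are hypothetical, the block scale is kept as a Python float instead of being rounded to `float8_e4m3fn`, the `float32` global scale is omitted, and ties round toward the smaller magnitude.

```python
# Magnitudes representable in E2M1 (FP4); the sign is a separate bit.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # one shared scale per 16 consecutive elements


def quantize_block(block):
    """Return (scale, codes) for one 16-element block.

    codes are (is_negative, index) pairs into E2M1_VALUES. The scale maps
    the largest magnitude in the block onto 6.0, the largest E2M1 value.
    """
    assert len(block) == BLOCK
    amax = max(abs(v) for v in block)
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        target = abs(v) / scale
        # Nearest representable magnitude; min() keeps the lower index on ties.
        idx = min(range(len(E2M1_VALUES)),
                  key=lambda i: abs(E2M1_VALUES[i] - target))
        codes.append((v < 0, idx))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a scale and E2M1 codes."""
    return [(-1.0 if neg else 1.0) * E2M1_VALUES[idx] * scale
            for neg, idx in codes]
```

The widest gap between E2M1 values is 2.0 (between 4.0 and 6.0), so in this sketch the per-element error is at most one scale unit. That rounding is the source of the activation quantization loss reported later in this document.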
## How We Got Here
### The C++ CUTLASS Kernel Was Broken
The original kernel was a C++ `.cu` file using CUTLASS's C++ API directly. It passed all the simple tests (uniform data → exact output, scale-factor (SF) remap verifier → 0 errors) but produced **cosine 0.05** with real random data. After weeks of debugging the SF remap (8+ iterations, all producing the same 0.2 cosine against a wrong reference), we discovered:
1. **The BF16 reference comparison was wrong** — our Python dequantization didn't match CUTLASS's internal FP4 handling. A wrong reference is worse than no reference. We chased ghosts through 8+ SF remap rewrites because the 0.2 cosine was never about the remap.
2. **The C++ CUTLASS kernel misinterpreted FP4 data** — even with SF remap verified correct (0 byte errors), the GEMM produced garbage with non-uniform data. The issue was in how CUTLASS's C++ API handles FP4 packing/tiling internally — something we couldn't easily debug or fix.
3. **The checkpoint `input_scale` was a red herring** — we tried using the checkpoint's calibration scale as the activation normalization scale. It saturated all block scales to 448.0 (max float8). The `input_scale` is a calibration constant for alpha computation, not a normalization scale.
### The CuTeDSL Kernel Works
NVIDIA's CuTeDSL approach (Python-based CUTLASS kernels compiled via MLIR → PTX) is what the CUTLASS team recommends for Blackwell. Their official MoE scaled grouped GEMM example (`torch_scaled_grouped_mm.py`) supports NVFP4 out of the box. We adapted it.
**Results with real DeepSeek-V4 layer 0 weights:**
- L1 GEMM alone: cosine 0.995
- Full MoE pipeline (L1→SiLU→L2→scatter): cosine 0.989
- Weight loading: **0% loss** — direct uint8→float4_e2m1fn_x2 view-cast, bit-identical to checkpoint
- Activation quantization: ~1.1% cosine loss (dynamic BF16→NVFP4 — inherent to the format, unavoidable)
- GEMM kernel: 0% loss (CuTeDSL is correct)
The 0.989 cosine is entirely from activation quantization. The weights are bit-identical to the checkpoint — no BF16 round-trip, no precision loss.
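
The cosine figures above compare a flattened kernel output against a flattened reference output. A minimal sketch of that metric, assuming plain Python lists (the function name is illustrative, and the real tests presumably compute this on tensors):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: undefined cosine reported as 0
    return dot / (norm_a * norm_b)
```

Cosine similarity is invariant to a positive rescaling of either vector, so it cannot detect a wrong global scale on its own. A magnitude check alongside it covers that gap.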
### The Dequant→Requant Anti-Pattern
Early versions dequantized all NVFP4 weights to BF16, then let vLLM's `FlashInferCutlassNvFp4LinearKernel` requantize them back to NVFP4 at inference time. This:
- Wasted 5 minutes on load doing NVFP4→BF16 conversion
- Lost precision on the double round-trip
- Caused vLLM to hang — the NVFP4 attention kernel expects native NVFP4 weights, not BF16 weights with an NVFP4 quant_method attached
The fix: **keep everything in NVFP4**. The checkpoint stores NVFP4. The kernels consume NVFP4. No conversion needed.
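
Why a view-cast loses nothing can be shown with plain bytes: each checkpoint byte already holds two 4-bit codes, so reinterpreting the buffer changes only the dtype label. The sketch below assumes the first element sits in the low nibble, which is an assumption about the packing order, and the helper names are hypothetical.

```python
def pack_fp4_pairs(codes):
    """Pack 4-bit codes two per byte, first element in the low nibble."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    return bytes((codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
                 for i in range(0, len(codes), 2))


def unpack_fp4_pairs(packed):
    """Inverse of pack_fp4_pairs: split each byte into two 4-bit codes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)  # low nibble: first element
        codes.append(byte >> 4)   # high nibble: second element
    return codes
```

Unpacking and repacking reproduces the original bytes exactly. A dequantize-to-BF16 and requantize round trip gives no such guarantee, because the requantization picks new scales.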
### CUDAGraph Compatibility
vLLM uses CUDA graphs to eliminate kernel launch overhead in the decode path. CUDA graphs record the entire forward pass once, then replay it — but they require **fixed tensor shapes, fixed memory addresses, and zero CPU-GPU syncs**.
Our original runner was not cudagraph-safe. We had to fix several classes of issues:
#### 1. CPU↔CUDA Tensor Copies
`torch.tensor([0,1,...], device=x.device)` creates the tensor on **CPU first**, then copies to CUDA. This copy is forbidden during graph capture. The fix: **cache tensors per device** on first use, outside the graph.
```python
# BAD — CPU→CUDA copy inside graph
step_to_idx = torch.tensor([0,1,2,3,4,4,5,5,6,6,6,7,7], device=x.device)
# GOOD — cached on first use, reused in graph
step_to_idx = _get_step_to_idx_lut(x.device) # returns cached CUDA tensor
```
Similarly, `torch.zeros` and `torch.rand` don't support `float4_e2m1fn_x2` or `float8_e4m3fn` dtypes. The fix: create as `uint8` or `float16`, then `.view()` or `.to()` the target dtype.
#### 2. GPU Scalar Slicing
`buf[:gpu_scalar, :]` requires the runtime to query the GPU scalar's value to determine the output shape. This triggers an implicit CPU-GPU sync, which invalidates the graph. The fix: **always use full pre-allocated buffers**. Extra rows are zeros that contribute nothing to the computation.
```python
# BAD — GPU scalar as slice index (implicit sync)
total_padded_rows = padded_expert_offsets[-1] # GPU scalar
padded_scales = buf[:total_padded_rows, :padded_cols] # sync!
# GOOD — full pre-allocated buffer, zero out before use
padded_scales = self._padded_scales_buf # always max size
padded_scales.zero_()
```
**Design decision:** Padding to max size wastes a few rows of compute on zero data, but:
- The extra rows are zeros → zero GEMM output → no accuracy impact
- GEMMs are memory-bandwidth bound → multiplying zeros is nearly free
- VRAM cost is negligible (~350KB for activation intermediates across all MoE layers)
- vLLM already does this everywhere (attention, FFN, etc.)
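
The accuracy claim for padding can be checked with a toy matrix multiply: zero rows appended to the activation matrix produce zero output rows and leave the real rows untouched. This is a pure-Python illustration with hypothetical helper names, not the runner's buffer code.

```python
def matmul(a, b):
    """Naive (rows x inner) @ (inner x cols) product on nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]


def pad_rows(a, max_rows):
    """Append zero rows so the matrix always has max_rows rows."""
    if len(a) > max_rows:
        raise ValueError("matrix exceeds the pre-allocated size")
    width = len(a[0])
    return a + [[0.0] * width for _ in range(max_rows - len(a))]
```

Because the padded shape is always `max_rows`, nothing in the forward pass needs the actual row count on the host, which removes the sync.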
#### 3. Dynamic Output Allocation
`torch.zeros(num_tokens, ...)` inside the forward pass creates a new tensor each call. In cudagraph, new allocations are recorded and replayed — this works, but only if `num_tokens` is fixed (which it is, since vLLM captures at fixed token budgets).
#### Test Harness
`tests/cudagraph_test.py` validates cudagraph compatibility by:
1. Creating a runner with dummy weights
2. Running a warmup forward pass (triggers kernel compilation)
3. Attempting `torch.cuda.graph(g)` capture on the forward pass
4. If capture fails, patching `torch.cuda.synchronize`, `.item()`, `.tolist()`, `.cpu()` to detect exactly which syncs are happening
Run on the B200:
```bash
cd /root/nvfp4-megamoe-kernel
source tests/.venv/bin/activate
python3 tests/cudagraph_test.py
```
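
Step 4 of the harness, finding out which sync calls fire, relies on temporarily wrapping methods and recording who called them. The sketch below shows that patching technique on a stand-in class so it runs without a GPU. `record_calls` and `FakeTensor` are hypothetical names, and this is not the code in `tests/cudagraph_test.py`; in the real harness the patched targets are the calls listed above.

```python
import contextlib
import traceback


@contextlib.contextmanager
def record_calls(cls, method_names, log):
    """Temporarily wrap methods on cls so every call is appended to log."""
    originals = {name: getattr(cls, name) for name in method_names}

    def make_wrapper(name, original):
        def wrapper(self, *args, **kwargs):
            # With limit=2 the stack is [caller, wrapper]; keep the caller.
            caller = traceback.extract_stack(limit=2)[0]
            log.append((name, caller.filename, caller.lineno))
            return original(self, *args, **kwargs)
        return wrapper

    try:
        for name, original in originals.items():
            setattr(cls, name, make_wrapper(name, original))
        yield log
    finally:
        # Always restore the original methods, even if the body raised.
        for name, original in originals.items():
            setattr(cls, name, original)


class FakeTensor:
    """Hypothetical stand-in for a GPU tensor, only to keep the sketch runnable."""

    def __init__(self, value):
        self.value = value

    def item(self):
        return self.value

    def tolist(self):
        return [self.value]
```

Each log entry carries the file and line of the caller, which is the information a bare capture failure does not provide.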
### Key Lessons
1. **A wrong reference is worse than no reference** — the 0.2 cosine against a broken BF16 dequant sent us chasing SF remap bugs for weeks
2. **The C++ CUTLASS API is a footgun for FP4** — CuTeDSL handles tensor layouts, tiling, and SF construction correctly by construction
3. **Test with real data early** — uniform tests pass even with broken kernels; random data reveals real bugs
4. **Separate the GEMM from the pipeline** — our `layertest.py` runs without vLLM, Docker, or tensor parallelism. It caught the kernel bug that vLLM's integration layers masked.
5. **Don't dequant what's already quantized** — if the kernel expects NVFP4 and the checkpoint is NVFP4, leave it alone. No BF16 round-trips.
6. **GPU scalar slicing is a silent cudagraph killer**: nothing points at the offending slice, and capture just fails with `cudaErrorStreamCaptureInvalidated` and no pointer to the cause. The test harness catches it.
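
Lesson 3 has a simple arithmetic explanation: a layout bug reorders elements, and reordering a uniform vector changes nothing. The sketch below simulates such a bug by swapping adjacent elements (for example, a swapped nibble order). The helpers are hypothetical and only illustrate the effect.

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def swap_adjacent(values):
    """Simulate a layout bug: exchange each pair of neighbouring elements."""
    out = list(values)
    for i in range(0, len(out) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


uniform_w = [0.5] * 8
varied_w = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Uniform weights: the scrambled layout gives the same answer, so the test passes.
uniform_ok = dot(swap_adjacent(uniform_w), x) == dot(uniform_w, x)
# Varied weights: the same bug changes the result and the test fails.
varied_ok = dot(swap_adjacent(varied_w), x) == dot(varied_w, x)
```

Here `uniform_ok` is true and `varied_ok` is false, which matches how the broken C++ kernel passed the uniform tests and failed on real data.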
## Project Structure
```
nvfp4-megamoe-kernel/
├── cutedsl/                               # CuTeDSL kernel + bridge layer
│   ├── bridge.py                          # Tensor layout conversion, quantization, kernel launch
│   ├── moe_pipeline.py                    # Full MoE pipeline (L1→SiLU→L2→scatter)
│   └── kernel/moe/                        # NVIDIA's ScaledGroupedGemmKernel (untouched)
│       ├── torch_scaled_grouped_mm.py     # The working kernel (3900 lines)
│       ├── moe_utils.py
│       ├── moe_persistent_scheduler.py
│       └── moe_sched_extension.py
├── vllm/                                  # vLLM integration
│   ├── nvfp4_cutedsl.py                   # CuTeDSLMoERunner: cudagraph-safe MoE kernel interface
│   └── patches/
│       ├── deepseek_v4.py                 # DeepSeek-V4 model patch (NVFP4 native)
│       └── deepseek_v4_attention.py       # Attention patch (NVFP4 native)
├── tests/
│   ├── cudagraph_test.py                  # CUDAGraph compatibility test (✅ PASS)
│   ├── layertest.py                       # Layer 0 comparison: CuTeDSL vs BF16 (✅ cosine 0.989)
│   ├── test_cutedsl.py                    # Small standalone CuTeDSL test (✅ cosine 0.991)
│   ├── test_uniform_fp4.py                # Uniform data GEMM test
│   ├── test_b_layout.py                   # B matrix column layout test
│   └── test_quick_rand.py                 # Quick random GEMM sanity check
└── reference/                             # Reference files for study
```
## The Bridge Layer (`cutedsl/bridge.py`)
Handles all tensor layout conversion from our pipeline to what the CuTeDSL kernel expects:
| Function | What it does |
|----------|--------------|
| `quantize_activation_nvfp4()` | BF16 → float4_e2m1fn_x2 + float8_e4m3fn block scales (cudagraph-safe, no `.max()` sync) |
| `quantize_weight_to_nvfp4()` | Same, but for weight matrices with K as the packed dimension |
| `assemble_scales_2d_side()` | Pad and swizzle activation scale factors (2Dx3D A side) |
| `assemble_scales_3d_side()` | Pad and swizzle weight scale factors (2Dx3D B side) |
| `make_b_k_major()` | Convert B tensor from N-major to K-major strides (required by kernel) |
| `run_nvfp4_grouped_gemm()` | Full kernel launch (compile + run, cudagraph-safe) |
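
The "pad" half of the scale assembly functions comes down to rounding the scale-factor matrix up to whole tiles. The sketch below computes the padded shape assuming one scale per 16 elements along K and a 128-row by 4-column scale tile. Those tile sizes are an assumption taken from commonly documented block-scaled layouts, not from `bridge.py`, and the function names are hypothetical.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def padded_scale_shape(rows, k, block_size=16, tile_rows=128, tile_cols=4):
    """Shape of the scale-factor matrix after padding to whole tiles.

    rows: number of activation or weight rows covered by the scales
    k:    reduction dimension, one scale per `block_size` elements
    tile_rows / tile_cols: assumed tile granularity (see lead-in)
    """
    if k % block_size:
        raise ValueError("k must be a multiple of the block size")
    scale_cols = k // block_size
    return round_up(rows, tile_rows), round_up(scale_cols, tile_cols)
```

For example, 200 rows with `k=7168` gives a padded scale matrix of 256 x 448 under these assumptions.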
## Running Tests
On the B200:
```bash
cd /root/nvfp4-megamoe-kernel
source tests/.venv/bin/activate
# CUDAGraph compatibility test
python3 tests/cudagraph_test.py
# Small standalone test
python3 tests/test_cutedsl.py
# Full layer 0 comparison with real weights
python3 tests/layertest.py
```
## NVFP4 Coverage
| Component | Format | Kernel | Conversion? |
|-----------|--------|--------|-------------|
| MoE experts (L1+L2) | NVFP4 native | CuTeDSL ScaledGroupedGemm | No — direct uint8→float4 view-cast |
| Shared experts | NVFP4 native | FlashInferCutlassNvFp4 | No — stays native |
| wq_b, wo_b, fused_wqa_wkv | NVFP4 native | FlashInferCutlassNvFp4 | No — stays native |
| wo_a | NVFP4 → FP8 | fp8_einsum | Yes — fp8_einsum requires FP8 |
| Compressor | NVFP4 → BF16 | torch.mm | Yes — weight_loader stacking issue |
| KV cache | FP8 | FlashInfer MLA | N/A — FP8 is optimal for KV cache |
## Plan
### Phase 1: Kernel ✅ DONE
- CuTeDSL ScaledGroupedGemmKernel works with NVFP4
- Bridge layer handles all tensor layout conversion
- Full MoE pipeline (L1→SiLU→L2→scatter) produces cosine 0.989 vs BF16
### Phase 2: vLLM Integration ✅ DONE
- CuTeDSLMoERunner wires CuTeDSL kernel into vLLM
- Weight loading: checkpoint uint8 → float4_e2m1fn_x2 view-cast (bit-preserving)
- Block scales (float8_e4m3fn) and global scales (float32) pass through directly
- L1 dual global scale handling: normalize to max(gate_gs, up_gs), fold ratio into block scales
- Attention projections stay native NVFP4 (FlashInferCutlassNvFp4LinearKernel)
- CuTeDSL kernel warmup during model load (prevents RPC timeout)
- Removed all debug prints and env var gates from vLLM serving path
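
The L1 dual global scale handling can be sketched as plain arithmetic. This assumes a dequantized weight equals code × block scale × global scale; a checkpoint may store the reciprocal convention instead. The sketch also ignores the rounding that occurs when folded block scales are stored back as `float8_e4m3fn`, and the function name is hypothetical.

```python
def fold_global_scales(gate_block_scales, up_block_scales, gate_gs, up_gs):
    """Give gate and up a single shared global scale.

    The shared scale is max(gate_gs, up_gs). Each side's block scales are
    multiplied by (own_gs / shared_gs), so block_scale * global_scale is
    unchanged for every block.
    """
    shared_gs = max(gate_gs, up_gs)
    gate_folded = [s * (gate_gs / shared_gs) for s in gate_block_scales]
    up_folded = [s * (up_gs / shared_gs) for s in up_block_scales]
    return gate_folded, up_folded, shared_gs
```

Choosing the maximum keeps both ratios at or below 1, so under this convention the folded block scales can only shrink and cannot overflow the `float8_e4m3fn` range.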
### Phase 2.5: CUDAGraph Compatibility ✅ DONE
- CuTeDSLMoERunner is fully cudagraph-safe
- Zero CPU-GPU syncs, zero dynamic shapes, zero GPU scalar slicing
- All intermediate buffers pre-allocated at max_num_tokens * top_k
- `quantize_activation_nvfp4` uses cached LUT (no CPU→CUDA copy)
- `torch.zeros`/`torch.rand` for float4/float8 tensors replaced by a `uint8` allocation plus `.view()`, or a `float16` allocation plus `.to()`
- Test harness validates capture + replay
- VRAM overhead: ~350KB (negligible)
- Compute overhead: padding rows are zeros passed through the GEMM (memory-bound, nearly free)
### Phase 3: Optimization
- Replace wo_a FP8 conversion with native NVFP4 GEMM (eliminate last dequant)
- Fix compressor weight_loader so it stays NVFP4 native
- Explore larger tile sizes for better occupancy
- Profile end-to-end inference on full model
### Phase 4: Production
- Clean up old C++ kernel code (tagged `the-last-of-cutlass`)
- Add proper error handling and logging
- Benchmark vs BF16 baseline