biondizzle 653e2d7a50 vLLM NVFP4 serving: full end-to-end pipeline working

Bridged the gap between ModelOpt NVFP4 and vLLM DeepSeek V4 attention.
Server loads and serves tokens on 8x B200 with TP=8, EP=8.

Key changes:
- wo_a: NVFP4->BF16->FP8 with DeepGEMM block-scale format for BMM einsum
  Uses deepgemm_post_process_fp8_weight_block for correct scale layout
  weight_scale_inv = DeepGEMM-formatted block scale (NOT per-tensor scalar)
  Block scale filled with fp8_scale (NOT all-ones -- causes garbage output)
- Attention: NVFP4->BF16 dequantization, UnquantizedLinearMethod
- Compressor: reconstruct fused_wkv_wgate from separate kv_proj+gate_proj
  Fixed indexer path: compressor.indexer.kv_proj (was loading main compressor)
- MoE experts: stay NVFP4, FLASHINFER_TRTLLM FusedMoE backend

Bugs fixed:
1. DeepGEMM sf.dim() assertion: weight_scale_inv must be block-scale tensor
2. Block scale dtype: float32 (not float8_e4m3fn)
3. Missing deepgemm_post_process args: quant_block_shape, use_e8m0
4. Compressor indexer shape mismatch: wrong checkpoint key prefix
5. All-ones block scale: DeepGEMM divides by 1.0 instead of actual scale

Updated README with full technical documentation of all fixes.

2026-05-11 02:01:46 +00:00

patches

vLLM NVFP4 serving: full end-to-end pipeline working

2026-05-11 02:01:46 +00:00

scripts

vLLM serving: patched deepseek_v4.py, disabled mega_moe, updated docs

2026-05-10 16:14:17 +00:00

.env

Purge OpenClaw session files, memory dumps, __pycache__. Update .gitignore

2026-05-08 17:09:59 +00:00

.gitignore

Replace shell wrapper with in-process quantize script

2026-05-09 06:07:22 +00:00

docker-compose.yml

vLLM serving: patched deepseek_v4.py, disabled mega_moe, updated docs

2026-05-10 16:14:17 +00:00

index.yaml

Purge INT4 references — expert weights are FP4 (E2M1), not INT4

2026-05-08 02:33:46 +00:00

README.md

vLLM NVFP4 serving: full end-to-end pipeline working

2026-05-11 02:01:46 +00:00

requirements.txt

NVIDIA Model Optimizer branch: nvfp4_experts_only PTQ for DeepSeek V4 Pro

2026-05-07 00:11:31 +00:00

README.md

DeepSeek V4 Pro → NVFP4 Quantization + vLLM Serving

Full NVFP4 quantization of DeepSeek V4 Pro and vLLM serving on 8× NVIDIA B200 GPUs.

Quick Status

Component	Status
NVFP4 Quantization	✅ 881GB (Run 11), modelopt 0.45.0.dev64
Weight Loading	✅ 95 safetensors shards, all 8 TP ranks
NVFP4→FP8 Conversion (wo_a)	✅ DeepGEMM block-scale format
NVFP4→BF16 Dequantization	✅ 305 attn/shared, 91 compressor layers
Compressor Reconstruction	✅ Separate kv_proj/gate_proj → fused_wkv_wgate
MoE Expert Serving	✅ FusedMoE NVFP4 (FLASHINFER_TRTLLM backend)
Profile/Warmup Run	✅ Passes
API Server	✅ Running on port 8000
Output Quality	🔧 Under investigation (FP4 quantization loss + scale tuning)

B200 Node

IP: 45.76.247.107
User: root
Password: see .env
GPUs: 8× NVIDIA B200 (SM100)
RAM: ~2.7 TB
Model weights: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4/
BF16 reference: /root/nvidia-meeting/DeepSeek-V4-Pro-BF16/

Architecture

DeepSeek V4 Pro (1.2T params, 61 layers)
├── MLA Attention (61 layers)
│   ├── fused_wqa_wkv → BF16 (UnquantizedLinearMethod)
│   ├── wo_a → FP8 (DeepGEMM block-scale, BMM einsum)
│   ├── wo_b → BF16 (UnquantizedLinearMethod)
│   └── compressor.fused_wkv_wgate → BF16 (reconstructed from NVFP4)
├── MoE Experts (384 experts, 61 layers)
│   ├── w13_weight → NVFP4 (FusedMoE, FLASHINFER_TRTLLM backend)
│   └── w2_weight → NVFP4 (FusedMoE, FLASHINFER_TRTLLM backend)
└── Shared Expert → BF16 (UnquantizedLinearMethod)

The NVFP4 → vLLM Gap

ModelOpt quantizes to NVFP4 (4-bit FP4 with block scales). vLLM's DeepSeek V4 attention code expects FP8 with DeepGEMM block-scale einsum. We found no existing integration between the two formats, so this work had to bridge the following gaps:

1. wo_a: NVFP4 → FP8 + DeepGEMM Block Scale

Problem: wo_a uses deepseek_v4_fp8_einsum (BMM with DeepGEMM), which expects:

Weight: float8_e4m3fn in 3D shape (g, r, d) for batched matmul
Scale: DeepGEMM-formatted block scale tensor (not a per-tensor scalar)

Our NVFP4 weights are uint8 packed FP4 with separate block/global scales.

Solution (_convert_nvfp4_to_fp8):

Unpack NVFP4 uint8 → BF16 using E2M1 lookup table
Dequantize: weight_bf16 * block_scale * global_scale * input_scale
Re-quantize BF16 → FP8 e4m3 with per-tensor scale (w_amax / fp8_max)
Create block scale tensor filled with fp8_scale (same scale for every 128×128 block)
Call deepgemm_post_process_fp8_weight_block(wq, ws, quant_block_shape=(128,128), use_e8m0=True, is_bmm=True, bmm_batch_size=N)
Store: weight_scale_inv = dg_ws (DeepGEMM-formatted scale), weight = w_fp8 (3D BMM shape)
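The first two steps (unpack and dequantize) can be sketched in plain Python. This is a minimal illustration, not the code in patches/deepseek_v4.py: the function names are hypothetical, the nibble order (low nibble first) is an assumption, and only the block and global scales are applied. The formula above also folds in input_scale, which would be one more multiplicative factor.

```python
# E2M1 magnitudes indexed by the low 3 bits of a nibble; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # NVFP4 stores one block scale per 16 consecutive elements


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 code into a Python float."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def unpack_row(packed: list[int]) -> list[float]:
    """Expand each byte into two FP4 values (assumed order: low nibble first)."""
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F))
        values.append(decode_e2m1(byte >> 4))
    return values


def dequantize_row(packed: list[int], block_scales: list[float],
                   global_scale: float) -> list[float]:
    """Apply the per-block scale and the per-tensor global scale."""
    values = unpack_row(packed)
    return [v * block_scales[i // BLOCK] * global_scale
            for i, v in enumerate(values)]
```

A row of 8 packed bytes expands to 16 values, which is exactly one scale block.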

Why weight_scale_inv? The attention forward reads self.wo_a.weight_scale_inv as b_scale for deepseek_v4_fp8_einsum → DeepGEMM fp8_einsum. This must be the DeepGEMM block-scale tensor, not a per-tensor scalar.

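Why fp8_scale in the block scale (not all-ones)? DeepGEMM applies the block scale at runtime to undo the FP8 quantization. If the block scale is all-ones, that step is a no-op: the einsum output stays in the quantized weight's units and is off by a factor of fp8_scale, which shows up as garbage text. Every block must therefore carry the actual per-tensor scale value.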
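The effect of the block-scale fill value can be checked with a small numeric sketch. It assumes the conventional dequantization w ≈ q × scale, models FP8 only as scaling plus clamping (no mantissa rounding), and uses hypothetical helper names; it does not call DeepGEMM or reproduce its scale layout.

```python
import math

FP8_E4M3_MAX = 448.0      # largest finite float8_e4m3fn magnitude
BLOCK_SHAPE = (128, 128)  # quant_block_shape used in the README


def per_tensor_fp8_scale(weight):
    """Per-tensor scale: w_amax / fp8_max."""
    amax = max(abs(v) for row in weight for v in row)
    return amax / FP8_E4M3_MAX


def quantize(weight, scale):
    """Scale into the FP8 range and clamp (rounding is not modelled)."""
    return [[max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in row]
            for row in weight]


def block_scale_grid(rows, cols, fill):
    """One scale per 128x128 block, all set to the same fill value."""
    n_r = math.ceil(rows / BLOCK_SHAPE[0])
    n_c = math.ceil(cols / BLOCK_SHAPE[1])
    return [[fill] * n_c for _ in range(n_r)]


def dequantize(q, grid):
    """Multiply every element by the scale of the block it falls in."""
    return [[v * grid[r // BLOCK_SHAPE[0]][c // BLOCK_SHAPE[1]]
             for c, v in enumerate(row)]
            for r, row in enumerate(q)]


weight = [[0.5, -2.0], [1.0, 0.25]]
scale = per_tensor_fp8_scale(weight)
q = quantize(weight, scale)
restored = dequantize(q, block_scale_grid(2, 2, scale))  # close to weight
unscaled = dequantize(q, block_scale_grid(2, 2, 1.0))    # off by 1 / scale
```

With the grid filled with scale, restored matches the original weight up to float error; with an all-ones grid every element is larger by a factor of 1 / scale.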

2. Attention Layers: NVFP4 → BF16

Problem: fused_wqa_wkv and wo_b use the standard torch.nn.functional.linear path, which cannot consume packed NVFP4 (uint8) weights directly.

Solution (_convert_nvfp4_to_bf16):

Unpack NVFP4 → BF16
Dequantize with block/global/input scales
Replace mod.weight with BF16 parameter
Set quant_method = UnquantizedLinearMethod()
Remove NVFP4 scale attributes (weight_scale, weight_scale_2, input_scale)
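The bookkeeping in the last three steps can be sketched as follows. This is an illustration on a plain Python object: in vLLM the module is a torch.nn.Module and the weight is a Parameter, and swap_to_unquantized is a hypothetical name, not a function from the patch.

```python
from types import SimpleNamespace

# Scale attributes left over from the NVFP4 linear method.
NVFP4_SCALE_ATTRS = ("weight_scale", "weight_scale_2", "input_scale")


def swap_to_unquantized(mod, weight_bf16, unquantized_method):
    """Install the dequantized weight and drop the stale NVFP4 scales."""
    mod.weight = weight_bf16
    mod.quant_method = unquantized_method
    for name in NVFP4_SCALE_ATTRS:
        if hasattr(mod, name):
            delattr(mod, name)
    return mod


# Stand-in for a loaded layer; values are placeholders.
layer = SimpleNamespace(weight="packed-uint8", quant_method="nvfp4",
                        weight_scale=1.0, weight_scale_2=1.0, input_scale=1.0)
swap_to_unquantized(layer, "bf16-weight", "unquantized")
```

Removing the scale attributes matters because any later code that still finds them may treat the layer as quantized.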

3. Compressor: Reconstructing fused_wkv_wgate from NVFP4

Problem: The compressor's fused_wkv_wgate is a MergedColumnParallelLinear with disable_tp=True. The NVFP4 checkpoint tensors cannot be loaded into its BF16 parameter: packed uint8 stores two FP4 values per byte, so the tensor's input dimension is half the parameter's. The default weight loader silently skips these weights, leaving the parameter uninitialized.

Solution (_reconstruct_compressor_weight):

Read original kv_proj.weight and gate_proj.weight directly from safetensors
Unpack NVFP4 → BF16, dequantize with scales
Concatenate: fused = cat([wkv, wgate], dim=0)
Replace the uninitialized parameter
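The shape arithmetic behind the mismatch and the concatenation can be written out explicitly. The helper names and the example dimensions below are hypothetical; only the two-values-per-byte packing and the dim=0 concatenation come from the text above.

```python
def packed_shape(out_features: int, in_features: int) -> tuple[int, int]:
    """Shape of the uint8 tensor that stores an (out, in) FP4 weight."""
    return (out_features, in_features // 2)  # two FP4 values per byte


def fused_shape(kv_shape: tuple[int, int],
                gate_shape: tuple[int, int]) -> tuple[int, int]:
    """Shape of cat([wkv, wgate], dim=0) after both are unpacked."""
    if kv_shape[1] != gate_shape[1]:
        raise ValueError("kv_proj and gate_proj must share the input dim")
    return (kv_shape[0] + gate_shape[0], kv_shape[1])


# Example with made-up sizes: each projection is 512 x 1024 once unpacked.
on_disk = packed_shape(512, 1024)
fused = fused_shape((512, 1024), (512, 1024))
```

The packed tensor has half the input dimension of the BF16 parameter, which is why the default loader cannot copy it in directly.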

Critical detail: The indexer compressor is at a different checkpoint path:

Main: model.layers.N.self_attn.compressor.{kv_proj,gate_proj}.weight
Indexer: model.layers.N.self_attn.compressor.indexer.{kv_proj,gate_proj}.weight

Using the wrong prefix loads the main compressor weight into the indexer's fused_wkv_wgate, causing a 4× shape mismatch and split_with_sizes crash.
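A minimal sketch of the key construction, using the sub_path idea described in the bug list. The function name compressor_keys is hypothetical; the key patterns are the two listed above.

```python
def compressor_keys(layer: int, sub_path: str = "") -> dict[str, str]:
    """Build the checkpoint keys for one compressor.

    sub_path is "" for the main compressor and ".indexer" for the
    indexer compressor.
    """
    base = f"model.layers.{layer}.self_attn.compressor{sub_path}"
    return {name: f"{base}.{name}.weight" for name in ("kv_proj", "gate_proj")}


main_keys = compressor_keys(3)
indexer_keys = compressor_keys(3, sub_path=".indexer")
```

Keeping the sub-path as an explicit argument makes it hard to read the main compressor's tensors for the indexer by accident.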

4. MoE Experts: NVFP4 FusedMoE

Problem: vLLM's DeepSeek V4 uses DeepseekV4MegaMoEExperts with DeepGEMM grouped GEMM. NVFP4 experts need a different kernel path.

Solution: The existing ModelOptNvFp4LinearMethod + FusedMoE infrastructure handles NVFP4 experts natively. We just need to:

Keep expert weights as NVFP4 uint8 + block/global scales
Use FLASHINFER_TRTLLM MoE backend (auto-selected)
Skip any conversion in process_weights_after_loading

5. BF16 wo_a Layers: BF16 → FP8

Problem: Some wo_a layers were NOT quantized by modelopt (BF16 in checkpoint). The attention forward still reads them as FP8 for the einsum path.

Solution (_convert_bf16_to_fp8): Same as #1 but skip the NVFP4 unpack step. Directly quantize BF16 → FP8 with block scale.

Bugs Found and Fixed

DeepGEMM `sf.dim()` Assertion (layout.hpp:94)

Root cause: weight_scale_inv was a 1D per-tensor scale of shape (g,). DeepGEMM expects a 2D/3D block-scale tensor formatted by transform_sf_into_required_layout.
Fix: Use deepgemm_post_process_fp8_weight_block to produce correctly formatted block scales, store result in weight_scale_inv.

Block Scale dtype (`float8_e4m3fn` vs `float32`)

Root cause: deepgemm_post_process_fp8_weight_block expects float32 or float8_e8m0fnu block scales. We initially used float8_e4m3fn.
Fix: Create block scale as dtype=torch.float32.

Missing `deepgemm_post_process` args

Root cause: Function signature changed to require quant_block_shape and use_e8m0.
Fix: Pass quant_block_shape=(128, 128) and use_e8m0=True.

Compressor Indexer Shape Mismatch

Root cause: _reconstruct_compressor_weight used the same checkpoint prefix for both main and indexer compressors. The indexer's keys have .indexer. in the path.
Fix: Add sub_path parameter; pass ".indexer" for indexer compressors.

All-Ones Block Scale → Garbage Output

Root cause: Block scale was torch.ones(...) (scale=1.0). DeepGEMM applies the block scale at runtime, so a scale of 1.0 left the output in quantized units instead of rescaling it by the actual per-tensor scale, producing incoherent text.
Fix: Use torch.full(..., fp8_scale.item()) to fill the block scale with the correct per-tensor FP8 quantization scale.

Running

# On B200 node
cd /root/nvidia-meeting
docker compose up -d

# Check logs
docker logs -f nvidia-meeting-vllm-1

# Test
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/model", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'
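The same chat request can be sent from Python with only the standard library. This is a sketch that mirrors the curl call above; build_chat_request and chat are hypothetical helper names, and the response parsing assumes the usual OpenAI-compatible shape (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000",
                       model: str = "/model",
                       max_tokens: int = 50) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling chat("Hello") requires the server from docker compose to be up; build_chat_request can be inspected offline.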

Files

File	Purpose
`patches/deepseek_v4.py`	Main patch: NVFP4 post-load conversion, weight reconstruction, DeepGEMM block-scale
`patches/modelopt.py`	ModelOpt FP4 config patches for weight loading
`.env`	B200 node credentials
`docker-compose.yml`	Container config (8 GPU, TP=8, EP=8, NVFP4 quant)

Conversion Flow

Checkpoint (NVFP4 safetensors)
  │
  ├── [weight loader] ──→ vLLM model (NVFP4 uint8 params)
  │
  └── [process_weights_after_loading]
       ├── wo_a (is_bmm=True):
       │     NVFP4→BF16→FP8 + DeepGEMM block scale
       │     weight_scale_inv = dg_ws, weight = 3D FP8
       │
       ├── fused_wqa_wkv, wo_b, shared_expert:
       │     NVFP4→BF16, UnquantizedLinearMethod
       │
       ├── compressor.fused_wkv_wgate:
       │     Read kv_proj+gate_proj from checkpoint
       │     NVFP4→BF16, cat into fused weight
       │
       └── MoE experts: stay NVFP4 (FusedMoE backend)
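The branching in the flow above can be read as a dispatch on the module name. The sketch below is only an illustration of that decision tree: the function, the returned labels, and the example module names (including the mlp and shared_expert path components) are hypothetical and are not taken from patches/deepseek_v4.py.

```python
def conversion_path(module_name: str) -> str:
    """Pick the post-load conversion for a module, following the flow above."""
    parts = module_name.split(".")
    leaf = parts[-1]
    if "experts" in parts:
        return "keep_nvfp4"                        # FusedMoE handles NVFP4
    if leaf == "fused_wkv_wgate" and "compressor" in parts:
        return "reconstruct_bf16_from_checkpoint"  # kv_proj + gate_proj
    if leaf == "wo_a":
        return "nvfp4_to_fp8_block_scale"          # BMM einsum path
    if leaf in ("fused_wqa_wkv", "wo_b") or "shared_expert" in parts:
        return "nvfp4_to_bf16"                     # UnquantizedLinearMethod
    return "unchanged"
```

The expert check comes first and matches the exact path component "experts", so a component named "shared_expert" does not fall into the NVFP4 branch.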

Known Issues

Output quality: FP4 is a very aggressive quantization format. The server produces tokens, but they may be incoherent. Possible causes:
- Normal FP4 quality degradation
- Subtle dequantization bugs (sign handling, scale ordering)
- The per-tensor FP8 requantization of wo_a losing per-block precision
Runtime performance: Not yet benchmarked. The DeepGEMM einsum + FusedMoE path should be efficient on B200, but the BF16 layers go through UnquantizedLinearMethod which may be slower than dedicated kernels.

Quantization Details

Model: DeepSeek V4 Pro (1.2T parameters)
Format: NVIDIA NVFP4 (4-bit E2M1 floating point with one FP8 E4M3 scale per 16-element block, plus a per-tensor FP32 global scale)
Tool: modelopt 0.45.0.dev64 + transformers 5.8.0.dev0
Run: Run 11 (881GB), 8× B200, ~$161/run
Checkpoint: 95 safetensors shards
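As a rough size check, the storage cost of NVFP4 can be estimated from the format alone. The sketch assumes the standard NVFP4 layout (4-bit values, one 8-bit scale per 16-element block) and ignores the per-tensor global scales, which are negligible; the function names are hypothetical.

```python
def nvfp4_bits_per_weight(block_size: int = 16, scale_bits: int = 8) -> float:
    """4 bits per value plus the block scale amortized over the block."""
    return 4 + scale_bits / block_size


def nvfp4_bytes(num_params: float) -> float:
    """Bytes needed if every parameter were stored as NVFP4."""
    return num_params * nvfp4_bits_per_weight() / 8


bits = nvfp4_bits_per_weight()     # 4 + 8/16 = 4.5 bits per weight
all_nvfp4 = nvfp4_bytes(1.2e12)    # about 6.75e11 bytes for 1.2T parameters
```

This is a lower bound for a fully quantized model, not an accounting of the 881GB checkpoint: any tensors kept in a wider format would add to it, and the sketch does not attempt to break that difference down.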
