DeepSeek V4 Pro → NVFP4 Quantization
Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7TB RAM, 13TB NVMe).
Pipeline
Step 1: Dequantize FP8 → BF16
python3 scripts/dequant_fp8_to_bf16.py /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 /root/nvidia-meeting/DeepSeek-V4-Pro-BF16
The original V4 weights use mixed precision (FP8 attention + FP4/E2M1 experts with per-tensor scales). We dequantize everything to pure BF16 so modelopt can run calibration without hitting broken FP8 kernel paths on Blackwell (DeepGEMM unsupported, Triton finegrained FP8 matmul shape mismatches).
This is not a blind upcast — it applies the actual scale factors:
W_bf16 = dequantize_fp4_weight(W_int, S) # per-tensor scale dequant, not .to(bfloat16)
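To make "applies the actual scale factors" concrete, here is a minimal pure-Python sketch of E2M1 dequantization with a single per-tensor scale. It is not the body of dequantize_fp4_weight. The function names, the nibble order (low nibble first), and the example scale are assumptions for illustration only. The E2M1 value table itself follows from the format: 1 sign bit, 2 exponent bits, 1 mantissa bit.

```python
from typing import List

# The eight non-negative magnitudes an E2M1 code can take
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) to a float."""
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def dequantize_packed_fp4(packed: bytes, scale: float) -> List[float]:
    """Unpack two codes per byte and apply one per-tensor scale.

    Assumption: the low nibble holds the first element of each pair.
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F) * scale)
        values.append(decode_e2m1(byte >> 4) * scale)
    return values


if __name__ == "__main__":
    # 0x52 -> low nibble 0x2 (1.0), high nibble 0x5 (3.0)
    # 0xF1 -> low nibble 0x1 (0.5), high nibble 0xF (-6.0)
    print(dequantize_packed_fp4(bytes([0x52, 0xF1]), scale=0.25))
    # prints [0.25, 0.75, 0.125, -1.5]
```

A plain .to(bfloat16) on the packed integers would skip both the table lookup and the scale, which is why the real script dequantizes explicitly.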
We verified byte-exact correctness by dequantizing a single expert and running a matmul against the official inference path:
W_bf16 = dequantize_fp4_weight(W_int, S)
y_ours = W_bf16 @ x.bfloat16()
y_ref = official_expert_forward(W_int, S, x)
print((y_ours - y_ref).abs().max() / y_ref.abs().mean())
Results:
Max abs diff: 0.00000000
Mean abs diff: 0.00000000
Relative error: 0.000000
Matmul max diff: 0.00000000
The dequantized weights are byte-exact against the reference. There is zero drift from BF16 rounding, so the dequantization step is ruled out as a source of error in the final quant.
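A zero difference is plausible rather than lucky. An E2M1 value has at most 2 significant bits and BF16 carries 8, so the product of an E2M1 value and a scale is exactly representable in BF16 whenever the scale itself has at most 6 significant bits (for example a power of two). The sketch below checks that property in pure Python. Whether the V4 scales actually fall in this class is an assumption, not something measured here, and the helper names are hypothetical.

```python
import struct

E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_bf16(x: float) -> float:
    """Round a float to the nearest BF16 value (round-to-nearest-even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the dropped 16 bits
    bits &= 0xFFFF0000                   # keep sign, exponent, top 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def is_exact_in_bf16(scale: float) -> bool:
    """True if every signed E2M1 value times `scale` survives BF16 rounding."""
    return all(
        round_to_bf16(sign * value * scale) == sign * value * scale
        for value in E2M1_VALUES
        for sign in (1.0, -1.0)
    )


if __name__ == "__main__":
    print(is_exact_in_bf16(2.0 ** -7))          # power-of-two scale
    print(is_exact_in_bf16(1.875 * 2.0 ** -3))  # scale with 4 significant bits
    print(is_exact_in_bf16(0.1))                # arbitrary scale, not exact
```

The first two calls return True and the last returns False, which shows that exactness depends on the precision of the scale and not on the FP4 values.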
Step 2: Run ModelOpt NVFP4 Full Quantization
python3 scripts/model_opt_nvfp4_full.py
Runs NVIDIA's official ModelOpt PTQ pipeline (hf_ptq.py) with full nvfp4 quantization (attention + experts + shared MLP). Output target: ~600GB.
Config:
- --quant nvfp4 (full model, not experts-only)
- --calib 128: 128 calibration samples. The B200 node has 2.7TB RAM; the 3TB BF16 model doesn't fit in GPU VRAM (~1.4TB total), so it runs with --use_seq_device_map (CPU offload). 256 calibration samples run out of memory. 128 is the max that fits.
- --kv_cache_quant fp8_cast
- --use_seq_device_map: sequential device mapping, loads model into CPU RAM, moves layers to GPU for forward passes
- --gpu_max_mem_percentage 0.7: VRAM headroom
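As a rough illustration of the config above, the sketch below assembles the flags into an argument list and does the back-of-the-envelope VRAM arithmetic that motivates CPU offload. The flag names and values come from the list above. The helper names are hypothetical, and treating --gpu_max_mem_percentage as a simple fraction of total VRAM is an assumption about how the cap is applied.

```python
import shlex

# Values taken from the config list above.
PTQ_CONFIG = {
    "--quant": "nvfp4",
    "--calib": "128",
    "--kv_cache_quant": "fp8_cast",
    "--gpu_max_mem_percentage": "0.7",
}
PTQ_SWITCHES = ["--use_seq_device_map"]  # flags that take no value


def build_ptq_args(config, switches):
    """Flatten flag/value pairs and bare switches into one argv-style list."""
    args = []
    for flag, value in config.items():
        args += [flag, value]
    return args + list(switches)


def usable_vram_tb(total_vram_tb, max_mem_fraction):
    """VRAM left for the model if usage is capped at the given fraction."""
    return total_vram_tb * max_mem_fraction


if __name__ == "__main__":
    print(shlex.join(build_ptq_args(PTQ_CONFIG, PTQ_SWITCHES)))
    budget = usable_vram_tb(1.4, 0.7)  # ~1.4TB total VRAM, 0.7 cap
    print(f"usable VRAM ~{budget:.2f} TB vs 3 TB model, offload needed: {3.0 > budget}")
```

Under these assumptions the usable VRAM is roughly 1TB, about a third of the BF16 model, so layers have to be streamed from CPU RAM.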
Calibration datasets: abisee/cnn_dailymail + nvidia/Nemotron-Post-Training-Dataset-v2 (gated — requires HF token). The script exports HF_TOKEN and HUGGING_FACE_HUB_TOKEN; the token must also be set via hf auth login on the node.
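Because a multi-hour run that fails on a gated dataset is expensive, a pre-flight check can help. The sketch below only confirms that both environment variables named above are non-empty. It does not verify that the token is valid or that hf auth login has been run, and the helper name is hypothetical.

```python
import os

REQUIRED_TOKEN_VARS = ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN")


def missing_token_vars(env):
    """Return the names of required token variables that are unset or blank."""
    return [name for name in REQUIRED_TOKEN_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_token_vars(os.environ)
    if missing:
        raise SystemExit("Missing Hugging Face token variables: " + ", ".join(missing))
    print("Token variables are set; gated dataset access is still not guaranteed.")
```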
Runtime: Model loading takes ~53 minutes. Quantization + calibration takes several hours. Expect 6-12 hours in total.
Dependencies (pinned versions)
- nvidia-modelopt: 0.45.0.dev64+g579fc6c31 (installed from git, not PyPI)
- transformers: 5.8.0.dev0 (from git, required for DeepSeekV4 support)
- kernels: latest (pip install -U kernels, needed for finegrained FP8 ops)
- Python: 3.10
The quant_module_patched.py fix is for modelopt 0.45.0.dev64 specifically. Later versions may include the fix natively — check before applying. Using a different modelopt version may cause patches to fail or V4 quantization to break.
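Since the patch is tied to one modelopt build, it may be worth checking installed versions before applying it. The following is a sketch using the standard library's importlib.metadata. The prefix match (to tolerate the +g... local version suffix) and the helper name are assumptions, not part of the original scripts.

```python
from importlib import metadata

# Version prefixes taken from the pinned dependency list.
PINNED = {
    "nvidia-modelopt": "0.45.0.dev64",
    "transformers": "5.8.0.dev0",
}


def check_pins(pinned, get_version=metadata.version):
    """Return a list of human-readable problems; empty means all pins match."""
    problems = []
    for package, prefix in pinned.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed")
            continue
        if not installed.startswith(prefix):
            problems.append(f"{package}: found {installed}, expected {prefix}*")
    return problems


if __name__ == "__main__":
    for problem in check_pins(PINNED):
        print("WARNING:", problem)
```

An empty result means the environment matches the versions the patch was written against. Any warning is a reason to re-check whether the patch still applies.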
Key Notes
- V4 is NOT BF16: it ships as mixed-precision FP8/FP4. You MUST dequantize to BF16 first (Step 1). The raw FP8 source has kernel problems on Blackwell; the mixed-precision source causes modelopt errors
- --low_memory_mode causes meta device errors with V4; don't use it
- modelopt has no explicit V4 support; it relies on auto-detection of fused experts
- The quant_module_patched.py patch fixes iter_weights_for_calibration() for V4's nn.ModuleList expert quantizers (already applied in the venv)
Bugs Found (V4 + modelopt)
- QuantDeepseekV4Experts AttributeError: patched iter_weights_for_calibration() for ModuleList quantizers
- --low_memory_mode → meta device error
- Missing kernels package for FP8 ops
- --calib not --calib_size, --quant not --qformat (shell script arg names)