DeepSeek V4 Pro → NVFP4 Quantization

Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7TB RAM, 13TB NVMe).

Pipeline

Step 1: Dequantize FP8 → BF16

python3 scripts/dequant_fp8_to_bf16.py /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 /root/nvidia-meeting/DeepSeek-V4-Pro-BF16

The original V4 weights use mixed precision (FP8 attention + FP4/E2M1 experts with per-tensor scales). We dequantize everything to pure BF16 so modelopt can run calibration without hitting broken FP8 kernel paths on Blackwell (DeepGEMM unsupported, Triton finegrained FP8 matmul shape mismatches).

This is not a blind upcast — it applies the actual scale factors:

W_bf16 = dequantize_fp4_weight(W_int, S)  # per-tensor scale dequant, not .to(bfloat16)
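As a rough illustration of what scale-aware dequantization involves, the sketch below decodes packed FP4 (E2M1) codes through a lookup table and multiplies each value by a scale. It is not the repository's dequantize_fp4_weight: the function names, the packing of two codes per byte with the low nibble first, and the single scalar scale are all assumptions made for illustration.

```python
# Illustrative only: not the repository's dequantize_fp4_weight.
# Assumed layout: two E2M1 codes per byte, low nibble first.

# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit code: bit 3 is the sign, bits 0-2 index the magnitude."""
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def dequantize_packed_fp4(packed: bytes, scale: float) -> list:
    """Unpack two codes per byte and apply the scale to each decoded value."""
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F) * scale)  # low nibble
        values.append(decode_e2m1(byte >> 4) * scale)    # high nibble
    return values


if __name__ == "__main__":
    # 0x21 holds codes 0x1 and 0x2; 0xF7 holds codes 0x7 and 0xF.
    print(dequantize_packed_fp4(bytes([0x21, 0xF7]), scale=2.0))
```

A plain cast of the stored codes would treat them as integers and ignore the scale, which is the difference between a blind upcast and the scale-aware path described here.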

We verified byte-exact correctness by dequantizing a single expert and running a matmul against the official inference path:

W_bf16 = dequantize_fp4_weight(W_int, S)
y_ours = W_bf16 @ x.bfloat16()
y_ref = official_expert_forward(W_int, S, x)
print((y_ours - y_ref).abs().max() / y_ref.abs().mean())

Results:

Max abs diff: 0.00000000
Mean abs diff: 0.00000000
Relative error: 0.000000
Matmul max diff: 0.00000000

Byte-exact: the dequantized weights show zero drift, so BF16 rounding noise is ruled out as a potential issue in the final quant.
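The reported figures can be computed along these lines. This is a sketch on plain Python lists rather than tensors; the function name and the returned keys are illustrative, and the relative error follows the definition used in the verification snippet (max absolute difference divided by the mean absolute reference value).

```python
def comparison_metrics(ours: list, ref: list) -> dict:
    """Compare two equal-length, non-empty outputs elementwise."""
    if not ref or len(ours) != len(ref):
        raise ValueError("outputs must be non-empty and of equal length")

    diffs = [abs(a - b) for a, b in zip(ours, ref)]
    max_abs = max(diffs)
    mean_abs_ref = sum(abs(r) for r in ref) / len(ref)

    if mean_abs_ref > 0.0:
        relative = max_abs / mean_abs_ref
    else:
        # All-zero reference: report 0 for an exact match, infinity otherwise.
        relative = 0.0 if max_abs == 0.0 else float("inf")

    return {
        "max_abs_diff": max_abs,
        "mean_abs_diff": sum(diffs) / len(diffs),
        "relative_error": relative,
    }
```

An exact match yields zero for all three values, which is the outcome listed in the results.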

Step 2: Run ModelOpt NVFP4 Full Quantization

python3 scripts/model_opt_nvfp4_full.py

Runs NVIDIA's official ModelOpt PTQ pipeline (hf_ptq.py) with full nvfp4 quantization (attention + experts + shared MLP). Output target: ~600GB.

Config:

  • --quant nvfp4 (full model, not experts-only)
  • --calib 128: 128 calibration samples. The 3TB BF16 model does not fit in GPU VRAM (~1.4TB total), so the run uses --use_seq_device_map to offload to the node's 2.7TB of CPU RAM. 256 calibration samples cause an OOM; 128 is the largest count that fits.
  • --kv_cache_quant fp8_cast
  • --use_seq_device_map: sequential device mapping; loads the model into CPU RAM and moves layers to the GPU for forward passes
  • --gpu_max_mem_percentage 0.7: leaves VRAM headroom
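For reference, the options in this list can be collected into a single argument list. The sketch below only builds that list. The helper name is hypothetical, and how scripts/model_opt_nvfp4_full.py actually forwards these options is not shown in this README; the flag spellings are the ones listed here.

```python
def build_ptq_flags(calib_samples: int = 128, gpu_max_mem: float = 0.7) -> list:
    """Return the documented quantization flags as argv-style tokens."""
    if calib_samples <= 0:
        raise ValueError("calib_samples must be positive")
    if not 0.0 < gpu_max_mem <= 1.0:
        raise ValueError("gpu_max_mem must be a fraction in (0, 1]")
    return [
        "--quant", "nvfp4",                # full model, not experts-only
        "--calib", str(calib_samples),     # number of calibration samples
        "--kv_cache_quant", "fp8_cast",
        "--use_seq_device_map",            # CPU offload via sequential device map
        "--gpu_max_mem_percentage", str(gpu_max_mem),
    ]


if __name__ == "__main__":
    print(" ".join(build_ptq_flags()))
```

The list deliberately omits --low_memory_mode, which this README reports as causing meta device errors with V4.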

Calibration datasets: abisee/cnn_dailymail + nvidia/Nemotron-Post-Training-Dataset-v2 (gated — requires HF token). The script exports HF_TOKEN and HUGGING_FACE_HUB_TOKEN; the token must also be set via hf auth login on the node.
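Since model loading alone takes a long time, it may be worth confirming the token variables are present before launching. The following is a minimal sketch with a hypothetical helper name; it only inspects environment variables and does not verify the hf auth login state or that the token has access to the gated dataset.

```python
import os

# The two variable names the script is described as exporting.
TOKEN_VARS = ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN")


def missing_token_vars(env=None) -> list:
    """Return the token variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in TOKEN_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_token_vars()
    if missing:
        raise SystemExit("Missing Hugging Face token variables: " + ", ".join(missing))
    print("Token variables are set.")
```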

Runtime: Model loading takes ~53 minutes. Quantization + calibration takes several hours. Expect 6-12 hours in total.

Dependencies (pinned versions)

  • nvidia-modelopt: 0.45.0.dev64+g579fc6c31 (installed from git, not PyPI)
  • transformers: 5.8.0.dev0 (from git, required for DeepSeekV4 support)
  • kernels: latest (pip install -U kernels — needed for finegrained FP8 ops)
  • Python: 3.10

The quant_module_patched.py fix is for modelopt 0.45.0.dev64 specifically. Later versions may include the fix natively — check before applying. Using a different modelopt version may cause patches to fail or V4 quantization to break.
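One way to check the environment against these pins before applying the patch is sketched below, using the standard library's importlib.metadata. The helper name is hypothetical, and the comparison is an exact string match, which assumes a git install reports the same version string as listed here; that may not hold on every machine.

```python
from importlib import metadata

# Pinned versions as listed in this README.
PINNED = {
    "nvidia-modelopt": "0.45.0.dev64+g579fc6c31",
    "transformers": "5.8.0.dev0",
}


def version_mismatches(pinned=None, lookup=metadata.version) -> dict:
    """Map package name -> (expected, installed or None) for every mismatch."""
    pinned = PINNED if pinned is None else pinned
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in version_mismatches().items():
        print(f"{name}: expected {expected}, found {installed}")
```

The lookup parameter exists only so the check can be exercised without the real packages installed.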

Key Notes

  • Use the BF16 source: V4's mixed precision causes issues, and the FP8 source hits kernel problems on Blackwell
  • --low_memory_mode causes meta device errors with V4 — don't use
  • modelopt has no explicit V4 support — relies on auto-detection of fused experts
  • The quant_module_patched.py patch fixes iter_weights_for_calibration() for V4's nn.ModuleList expert quantizers — already applied in the venv

Bugs Found (V4 + modelopt)

  1. QuantDeepseekV4Experts AttributeError — patched iter_weights_for_calibration() for ModuleList quantizers
  2. --low_memory_mode → meta device error
  3. Missing kernels package for FP8 ops
  4. Wrong argument names: the shell script takes --calib (not --calib_size) and --quant (not --qformat)