DeepSeek V4 Pro → NVFP4 Quantization
Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7TB RAM, 13TB NVMe). Target output size: ~600GB.
Pipeline
Step 1: Dequantize FP8 → BF16
python3 scripts/dequant_fp8_to_bf16.py /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 /root/nvidia-meeting/DeepSeek-V4-Pro-BF16
The original V4 weights use mixed precision (FP8 attention + FP4/E2M1 experts with per-tensor scales). We dequantize everything to pure BF16 so modelopt can run calibration without hitting broken FP8 kernel paths on Blackwell (DeepGEMM unsupported, Triton finegrained FP8 matmul shape mismatches).
This is not a blind upcast — it applies the actual scale factors:
W_bf16 = dequantize_fp4_weight(W_int, S) # per-tensor scale dequant, not .to(bfloat16)
Verified byte-exact: the matmul diff against the official inference path is 0.000000.
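To make "applies the actual scale factors" concrete, here is a minimal pure-Python sketch of E2M1 (FP4) decoding followed by a scale multiply. It is not the code in dequant_fp8_to_bf16.py: the function names, the low-nibble-first packing order, and the single scalar scale are assumptions for illustration, and the output is plain Python floats rather than BF16 tensors.

```python
# Magnitudes for the low 3 bits of an E2M1 code (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 index the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequantize_packed_fp4(packed: bytes, scale: float) -> list[float]:
    """Unpack two FP4 codes per byte and multiply each by the scale.

    Assumption: the low nibble holds the first element. The real checkpoint
    layout may differ.
    """
    out = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0x0F) * scale)
        out.append(decode_e2m1(byte >> 4) * scale)
    return out


# One byte 0x21 holds codes 0x1 (0.5) and 0x2 (1.0); with scale 2.0 they become 1.0 and 2.0.
print(dequantize_packed_fp4(bytes([0x21]), 2.0))
```

A plain .to(bfloat16) on the packed integers would skip both the table lookup and the scale, which is why the upcast has to go through a decode step like this one.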
Step 2: Run NVFP4 Quantization
python3 scripts/quantize_nvfp4.py
This script runs the full pipeline in-process (not wrapping the shell script):
- Load BF16 model with sequential device map (3TB model, CPU offload)
- Patch modelopt for V4 compatibility (ModuleList quantizers, GPU tensor safety)
- Quantize + Calibrate (5-6 hours, 128 samples)
- SAVE model state to disk ← insurance against export crashes
- Export to HF safetensors
If the export crashes (and it will — modelopt's export reads stale GPU tensors after hours of calibration):
python3 scripts/quantize_nvfp4.py --export-only
This loads the saved calibration state and retries just the export step.
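The save-then-export flow can be sketched as follows. This is a simplified stand-in, not the logic of quantize_nvfp4.py: run_pipeline, its arguments, and the JSON state file are hypothetical, chosen so the example runs without a GPU.

```python
import json
import os


def run_pipeline(state_path, calibrate, export, export_only=False):
    """Calibrate (unless export_only), persist the state, then export from disk."""
    if not export_only:
        state = calibrate()
        # Write to a temp file and rename, so a crash never leaves a half-written state.
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, state_path)
    # Always export from the saved copy, so a retry sees exactly the same input.
    with open(state_path) as f:
        state = json.load(f)
    return export(state)
```

Because the state is written before export runs, an export failure costs only the export time on retry, not the hours of calibration.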
Config:
- --quant nvfp4 (full model, not experts-only)
- --calib 128: 128 calibration samples. 256 OOMs with 3TB BF16 on CPU offload.
- --kv_cache_quant fp8_cast
- --use_seq_device_map: sequential device mapping (CPU offload)
- --gpu_max_mem_percentage 0.7: VRAM headroom
Calibration datasets: abisee/cnn_dailymail + nvidia/Nemotron-Post-Training-Dataset-v2 (gated — requires HF token).
Runtime: Model loading ~53 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.
Bugs Found (V4 + modelopt)
1. QuantDeepseekV4Experts AttributeError: V4 uses nn.ModuleList for per-expert quantizers, modelopt expected a singular TensorQuantizer. Patched in quantize_nvfp4.py.
2. --low_memory_mode → meta device error. Don't use with V4.
3. Missing kernels package for FP8 ops. Fix: pip install -U kernels.
4. --calib not --calib_size, --quant not --qformat (shell script arg names; no longer relevant, we run in-process).
5. Export crash: stale GPU tensors. After 5+ hours of calibration, modelopt's export step reads quantizer amax tensors that have been sitting in VRAM for hours. CUDA illegal memory access. Fixed by moving quantizer tensors to CPU before export.
6. Export crash: assert torch.all(activation_scaling_factor > 0). Related to #5. The amax values from stale GPU reads are garbage. Fixed by clamping instead of asserting.
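As an illustration of "clamping instead of asserting", the sketch below replaces invalid scaling factors with a small positive floor and reports how many were replaced. It works on plain Python lists; the function name and the floor value of 1e-8 are assumptions, and the actual patch in quantize_nvfp4.py operates on tensors and may differ.

```python
import math


def clamp_scales(scales, floor=1e-8):
    """Replace non-finite or non-positive scaling factors with a positive floor.

    Returns the cleaned list and the number of values that were replaced.
    The floor of 1e-8 is a hypothetical choice, not taken from the patch.
    """
    cleaned = []
    replaced = 0
    for s in scales:
        if not math.isfinite(s) or s <= 0.0:
            cleaned.append(floor)
            replaced += 1
        else:
            cleaned.append(s)
    return cleaned, replaced


cleaned, replaced = clamp_scales([0.5, 0.0, -1.0, float("nan")])
print(f"replaced {replaced} of {len(cleaned)} scaling factors")
```

Returning the replacement count keeps the garbage visible: a large count suggests the amax values were corrupted upstream and the exported checkpoint should be checked for accuracy.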
Dependencies (pinned versions)
- nvidia-modelopt: 0.45.0.dev64+g579fc6c31 (installed from git, not PyPI)
- transformers: 5.8.0.dev0 (from git, required for DeepSeekV4 support)
- kernels: latest (pip install -U kernels; needed for finegrained FP8 ops)
- Python: 3.10
The patches in quantize_nvfp4.py are for modelopt 0.45.0.dev64 specifically. Later versions may include fixes natively.
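Since the patches target one specific modelopt build, a quick pre-flight check of the installed versions can save a multi-hour run. The sketch below uses the standard-library importlib.metadata; the helper names are hypothetical, and the version string reported for a git install may not match the pinned string character for character.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions from the list above. kernels is "latest", so it is not pinned here.
PINS = {
    "nvidia-modelopt": "0.45.0.dev64+g579fc6c31",
    "transformers": "5.8.0.dev0",
}


def installed_versions(names):
    """Return {distribution name: installed version, or None if missing}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def pin_mismatches(installed, pins=PINS):
    """Return {name: (pinned, installed)} for every package that differs from its pin."""
    return {
        name: (want, installed.get(name))
        for name, want in pins.items()
        if installed.get(name) != want
    }


for name, (want, got) in pin_mismatches(installed_versions(PINS)).items():
    print(f"{name}: pinned {want}, installed {got}")
```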
Key Notes
- V4 is NOT BF16 — it ships as mixed-precision FP8/FP4. You MUST dequantize to BF16 first (Step 1).
- --low_memory_mode causes meta device errors with V4. Don't use it.
- modelopt has no explicit V4 support; it relies on auto-detection of fused experts.
- The calibration state save (v4_nvfp4_calibrated_state.pt) is ~1.5TB. It lives on NVMe, not in git.
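Given the size of the state file, it may be worth confirming free space on the NVMe volume before calibration starts. A minimal sketch using shutil.disk_usage follows; the helper name, the 10% safety margin, and reading 1.5TB as 1.5e12 bytes are assumptions.

```python
import shutil

# ~1.5TB calibration state, read here as decimal terabytes (assumption).
STATE_BYTES = int(1.5e12)


def has_room(path, needed_bytes, margin=0.1):
    """Return True if the filesystem holding `path` has needed_bytes plus a margin free."""
    free = shutil.disk_usage(path).free
    return free >= needed_bytes * (1 + margin)


if not has_room(".", STATE_BYTES):
    print("Not enough free space here for the calibration state")
```

Replace "." with the directory where the state file will be written.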