DeepSeek V4 Pro → NVFP4 Quantization
Full NVFP4 quantization of DeepSeek V4 Pro using NVIDIA's Model Optimizer.
Strategy
- Dequantize the original mixed-precision FP8 weights to pure BF16 (`scripts/dequant_fp8_to_bf16.py`)
- Fully quantize BF16 → NVFP4 using NVIDIA's official ModelOpt PTQ pipeline (`scripts/model_opt_nvfp4_full.py`)
Full model quantization (attention + experts + shared MLP) to NVFP4. Target output: ~600GB.
Scripts
| File | Purpose |
|---|---|
| `scripts/dequant_fp8_to_bf16.py` | Dequant FP8 source → pure BF16 (resumable, shard-level) |
| `scripts/upcast_to_bf16.py` | Alternative: upcast mixed-precision to BF16 |
| `scripts/model_opt_nvfp4_full.py` | Run ModelOpt NVFP4 full quantization (calib 128) |
| `patches/quant_module_patched.py` | Patch for modelopt V4 experts ModuleList bug |
| `patches/patch_finegrained_fp8_blackwell.py` | Blackwell FP8 kernel patch |
| `check-ttl.sh` | B200 node TTL watchdog |
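
The "resumable, shard-level" behaviour noted for the dequant script can be sketched as follows. This is a minimal illustration, not the contents of `scripts/dequant_fp8_to_bf16.py`: the function name `process_shards`, the `convert` callback, and the `*.safetensors` file pattern are all assumptions made for the example. A shard is skipped if its output already exists, and each output is written to a temporary file and renamed only on success, so an interrupted run never leaves a half-written shard that looks finished.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, List


def process_shards(src_dir: Path, dst_dir: Path,
                   convert: Callable[[Path, Path], None],
                   pattern: str = "*.safetensors") -> List[str]:
    """Convert every shard in src_dir, skipping shards already present in dst_dir.

    Returns the names of the shards converted during this call.
    (Hypothetical helper; the shard pattern is an assumption.)
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    done = []
    for src in sorted(src_dir.glob(pattern)):
        dst = dst_dir / src.name
        if dst.exists():
            # Finished in an earlier run: skip so a restart resumes where it stopped.
            continue
        # Write to a temporary file in the same directory, then rename into place.
        fd, tmp_name = tempfile.mkstemp(dir=dst_dir, suffix=".partial")
        os.close(fd)
        tmp = Path(tmp_name)
        try:
            convert(src, tmp)
            os.replace(tmp, dst)  # atomic when source and target share a filesystem
        except BaseException:
            # Remove the partial output so the shard is retried on the next run.
            tmp.unlink(missing_ok=True)
            raise
        done.append(src.name)
    return done
```

Calling `process_shards` a second time with the same directories converts nothing, which is what makes a restart after a node timeout cheap.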
B200 Node
- 8× B200, 2.7TB RAM, 13TB NVMe
- See `.env` for access details
Key Notes
- Calib size: 128 (256 OOMs on 2.8TB RAM with the 3TB BF16 model)
- Full quant (`nvfp4`), not experts-only
- Use the BF16 source: V4's mixed precision causes issues, and the FP8 source has kernel problems on Blackwell
- `--use_seq_device_map` is required (the model doesn't fit in GPU VRAM alone)
- `--gpu_max_mem_percentage 0.7` for VRAM headroom
- `--low_memory_mode` causes meta device errors with V4; don't use it
- modelopt has no explicit V4 support; it relies on auto-detection of fused experts
- Calibration dataset `nvidia/Nemotron-Post-Training-Dataset-v2` is gated and requires an HF token
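
The reason the model cannot live in VRAM alone comes down to simple arithmetic, sketched below. The per-GPU figure of 192 GB is an assumed nominal value for a B200 and is not stated in this README, so substitute whatever your node reports; the function names are hypothetical. The sketch also assumes that any weights exceeding the GPU budget are held in host RAM.

```python
def gpu_budget_gb(num_gpus: int, vram_per_gpu_gb: float, max_mem_fraction: float) -> float:
    """Total GPU memory available to the loader under a per-GPU usage cap."""
    return num_gpus * vram_per_gpu_gb * max_mem_fraction


def cpu_spill_gb(model_gb: float, budget_gb: float) -> float:
    """Model weight that does not fit in the GPU budget and must stay in host RAM."""
    return max(0.0, model_gb - budget_gb)


# ASSUMPTION: 192 GB of HBM per B200 (nominal figure, not taken from this README).
budget = gpu_budget_gb(num_gpus=8, vram_per_gpu_gb=192, max_mem_fraction=0.7)
# 3000 GB is the "3TB BF16 model" figure from the notes above.
spill = cpu_spill_gb(model_gb=3000, budget_gb=budget)
print(f"GPU budget: {budget:.0f} GB, host RAM spill: {spill:.0f} GB")
```

Under these assumptions the GPUs can hold only about a third of the BF16 weights, so most of the model sits in host RAM during calibration. That is consistent with a larger calibration size running out of memory.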
Bugs Found (V4 + modelopt)
- `QuantDeepseekV4Experts` AttributeError: patched `iter_weights_for_calibration()` for ModuleList quantizers
- `--low_memory_mode` → meta device error
- Missing `kernels` package for FP8 ops
- Shell script arg names are `--calib`, not `--calib_size`, and `--quant`, not `--qformat`
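
The first bug above involves quantizers held in a ModuleList. One general way to make calibration code tolerate both a single quantizer and a container of quantizers is sketched below. This is not the contents of `patches/quant_module_patched.py` and does not use modelopt's API: `StubQuantizer` and `iter_quantizers` are hypothetical names, and a plain Python list stands in for `nn.ModuleList` so the example runs without PyTorch.

```python
from typing import Iterable, Iterator, Union


class StubQuantizer:
    """Hypothetical stand-in for a weight quantizer; not a modelopt class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.calibrated = False


def iter_quantizers(
    obj: Union[StubQuantizer, Iterable[StubQuantizer]]
) -> Iterator[StubQuantizer]:
    """Yield quantizers whether obj is a single quantizer or a container of them."""
    if isinstance(obj, StubQuantizer):
        yield obj
    else:
        # A list here plays the role of a ModuleList with one quantizer per expert.
        for q in obj:
            yield q


single = StubQuantizer("shared_mlp")
per_expert = [StubQuantizer(f"expert_{i}") for i in range(4)]

# The same loop handles both shapes, so attribute access never hits the container itself.
for container in (single, per_expert):
    for q in iter_quantizers(container):
        q.calibrated = True
```

Normalising at the iteration boundary keeps the rest of the calibration loop unchanged, which is usually less invasive than special-casing the fused-experts module everywhere it is used.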