DeepSeek V4 Pro → NVFP4 via NVIDIA Model Optimizer
Fallback quantization path using NVIDIA's official Model Optimizer (nvidia-modelopt) PTQ pipeline.
Why this branch
Path A (custom streaming FP8→NVFP4) is weight-only W4A16: weights are quantized to 4 bits while activations stay in 16-bit precision. If its accuracy is not good enough, NVIDIA's Model Optimizer provides data-driven calibration with proper activation scales, and is the officially supported path for DeepSeek V3/V4 NVFP4.
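To make the weight-only idea concrete, here is a minimal sketch of NVFP4-style fake quantization for a single block of weights. It assumes the usual NVFP4 layout (16-element blocks, E2M1 values) but keeps the block scale in full precision rather than rounding it to FP8, so it is a simplification and not the code used by Path A or by Model Optimizer. The function name `fake_quantize_block` is hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 value (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 consecutive elements


def fake_quantize_block(block):
    """Round one block to the E2M1 grid; return (scale, dequantized values).

    Simplification: the scale stays a Python float. Real NVFP4 stores the
    block scale in FP8 (E4M3) together with a per-tensor FP32 scale.
    """
    if len(block) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(block)}")
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * BLOCK_SIZE
    # Map the largest magnitude in the block onto the largest E2M1 value.
    scale = amax / E2M1_VALUES[-1]
    dequantized = []
    for x in block:
        nearest = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        dequantized.append(math.copysign(nearest * scale, x))
    return scale, dequantized


scale, values = fake_quantize_block([6.0, -3.0, 1.4] + [0.0] * 13)
print(scale, values[:3])
```

In a weight-only path the scale is derived from the weights alone, as above. Calibration additionally runs sample data through the model to collect activation statistics, which is what the Model Optimizer path adds.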
What's here
| File | Purpose |
|---|---|
| quantize_modelopt.py | PTQ via nvidia-modelopt with NVFP4_EXPERTS_ONLY config |
Quantization config
This branch uses nvfp4_experts_only, NVIDIA's recommended config for MoE models. It quantizes only the expert MLP layers (mlp.experts / block_sparse_moe) and keeps the attention QKV projections in higher precision. Options:
- nvfp4_experts_only: Experts only (recommended for MoE)
- nvfp4_mlp_only: All MLP layers (experts + shared)
- nvfp4: Full model NVFP4 (riskier for attention)
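The difference between the three scopes can be sketched as name-based module selection. The wildcard patterns and module names below are hypothetical, chosen only to mirror the descriptions above; they are not copied from Model Optimizer's config definitions.

```python
from fnmatch import fnmatch

# Hypothetical patterns for illustration; the real configs may differ.
SCOPE_PATTERNS = {
    "nvfp4_experts_only": ["*mlp.experts*", "*block_sparse_moe*"],
    "nvfp4_mlp_only": ["*mlp*", "*block_sparse_moe*"],
    "nvfp4": ["*"],
}

# Hypothetical module names in the style of a HuggingFace MoE model.
EXAMPLE_MODULES = [
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.mlp.experts.0.gate_proj",
    "model.layers.3.mlp.shared_experts.up_proj",
]


def is_quantized(module_name, qformat):
    """Return True if the module name matches any pattern for this scope."""
    return any(fnmatch(module_name, p) for p in SCOPE_PATTERNS[qformat])


for qformat in SCOPE_PATTERNS:
    selected = [m for m in EXAMPLE_MODULES if is_quantized(m, qformat)]
    print(qformat, "->", selected)
```

Under these assumed patterns, only the routed expert is selected for nvfp4_experts_only, the shared expert is added for nvfp4_mlp_only, and the attention projection is included only for nvfp4.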
Prerequisites
# Use the TensorRT-LLM docker if possible:
# docker run --gpus all -it nvcr.io/nvidia/tensorrt-llm/release:1.2.0 bash
# Otherwise pip install:
pip install -U "nvidia-modelopt[hf]"
pip install compressed-tensors fire flash-attn transformers_stream_generator zstandard
# Note: requires transformers<5.0 for modelopt compatibility
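Before starting a long run it may be worth confirming the transformers<5.0 constraint in the active environment. A minimal sketch using only the standard library; the helper names are hypothetical and the check looks at the major version only.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Extract the leading integer from a version such as '4.57.0.dev0'."""
    return int(version_string.split(".")[0])


def transformers_is_compatible(version_string):
    """Apply the transformers<5.0 constraint stated in the prerequisites."""
    return major_version(version_string) < 5


def check_installed_transformers():
    """Describe whether the installed transformers satisfies the constraint."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if transformers_is_compatible(installed):
        return f"transformers {installed} is OK (<5.0)"
    return f"transformers {installed} is too new; install a release below 5.0"


print(check_installed_transformers())
```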
Usage
# On the B200 node (8× B200, 2.7 TB RAM)
cd /root/nvidia-meeting
source venv/bin/activate
# Using BF16 source weights (preferred for modelopt calibration)
python quantize_modelopt.py \
--model /root/nvidia-meeting/DeepSeek-V4-Pro \
--export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt \
--qformat nvfp4_experts_only \
--tp 8 \
--calib_size 256
# Using FP8 source (modelopt handles dequant internally)
python quantize_modelopt.py \
--model /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 \
--export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt-fp8src \
--qformat nvfp4_experts_only \
--tp 8 \
--calib_size 256
Low-memory options
If you hit OOM during calibration:
- --use_seq_device_map: sequential device mapping across GPUs
- --low_memory_mode: compress weights before calibration (FP8/NVFP4 only)
Output
Exports a Unified HuggingFace checkpoint compatible with:
- TensorRT-LLM (PyTorch and C++ backends)
- vLLM
- SGLang
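A quick sanity check on the export directory before handing it to a serving stack might look like the sketch below. It assumes the export contains an hf_quant_config.json file with a "quantization" object holding a "quant_algo" field; that layout is an assumption about Model Optimizer's export and should be verified against an actual checkpoint. The function name `read_quant_algo` is hypothetical.

```python
import json
from pathlib import Path


def read_quant_algo(export_dir):
    """Return the recorded quant_algo string, or None if no metadata is found.

    Assumed layout (verify against a real export):
    {"quantization": {"quant_algo": "NVFP4", ...}}
    """
    config_path = Path(export_dir) / "hf_quant_config.json"
    if not config_path.is_file():
        return None
    with config_path.open() as f:
        data = json.load(f)
    return data.get("quantization", {}).get("quant_algo")
```

For example, `read_quant_algo("/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt")` would be expected to return an NVFP4 identifier if the export succeeded and the assumed layout holds, and None if the metadata file is missing.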
Expected runtime
24-72 hours for full calibration on 8× B200 with 256 calibration samples.