DeepSeek V4 Pro → NVFP4 via NVIDIA Model Optimizer

Fallback quantization path using NVIDIA's official Model Optimizer (nvidia-modelopt) PTQ pipeline.

Why this branch

Path A (custom streaming FP8→NVFP4 conversion) is weight-only W4A16: weights are quantized to 4 bits while activations stay in 16-bit. If its accuracy is not good enough, NVIDIA's Model Optimizer offers data-driven calibration with proper activation scales, and it is the officially supported path for DeepSeek V3/V4 NVFP4.
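
To make the "W4" part concrete, here is a minimal pure-Python sketch of NVFP4-style rounding for weights. It assumes the usual NVFP4 layout (FP4 E2M1 values sharing one scale per block of 16). It is a simplification, not the code used by either path: the real format stores each block scale in FP8 and adds a per-tensor scale, whereas this sketch keeps the scale in full precision. The function names are illustrative.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is stored separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumption: one shared scale per 16 consecutive values


def fake_quantize_block(block):
    """Round one block to the FP4 grid; return (dequantized values, scale)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block), 1.0
    # Map the largest magnitude in the block onto the top grid value (6.0).
    scale = amax / E2M1_GRID[-1]
    out = []
    for x in block:
        # Pick the nearest representable magnitude, then restore the sign.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out, scale


def fake_quantize(weights):
    """Apply block-wise rounding to a flat list of weights."""
    result = []
    for start in range(0, len(weights), BLOCK_SIZE):
        values, _ = fake_quantize_block(weights[start:start + BLOCK_SIZE])
        result.extend(values)
    return result
```

A weight-only path applies this kind of rounding to weights alone. Calibration-based PTQ additionally runs sample data through the model to pick scales for activations.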

What's here

File                  Purpose
quantize_modelopt.py  PTQ via nvidia-modelopt with NVFP4_EXPERTS_ONLY config

Quantization config

This branch uses nvfp4_experts_only, NVIDIA's recommended config for MoE models. It quantizes only the expert MLP layers (mlp.experts / block_sparse_moe) and keeps the attention QKV projections in higher precision. Options:

  • nvfp4_experts_only — Experts only (recommended for MoE)
  • nvfp4_mlp_only — All MLP layers (experts + shared)
  • nvfp4 — Full model NVFP4 (riskier for attention)
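
The difference between the three options comes down to which modules are selected for quantization. The sketch below illustrates that idea with wildcard matching on module names. The patterns and module names are hypothetical examples chosen for illustration; the actual selection rules live inside nvidia-modelopt and may differ (for instance, a full-model config typically still skips some layers).

```python
from fnmatch import fnmatch

# Hypothetical patterns: an illustration, not modelopt's real config.
QFORMAT_PATTERNS = {
    "nvfp4_experts_only": ["*mlp.experts*", "*block_sparse_moe*"],
    "nvfp4_mlp_only": ["*mlp*", "*block_sparse_moe*"],
    "nvfp4": ["*"],
}


def is_quantized(module_name, qformat):
    """Return True if the module would be quantized under this sketch."""
    return any(fnmatch(module_name, p) for p in QFORMAT_PATTERNS[qformat])


if __name__ == "__main__":
    # Made-up module names in the style of a HuggingFace MoE model.
    names = [
        "model.layers.3.mlp.experts.7.gate_proj",
        "model.layers.3.mlp.shared_experts.up_proj",
        "model.layers.3.self_attn.q_proj",
    ]
    for qformat in QFORMAT_PATTERNS:
        selected = [n for n in names if is_quantized(n, qformat)]
        print(qformat, "->", selected)
```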

Prerequisites

# Use the TensorRT-LLM docker if possible:
# docker run --gpus all -it nvcr.io/nvidia/tensorrt-llm/release:1.2.0 bash

# Otherwise pip install:
pip install -U "nvidia-modelopt[hf]"
pip install compressed-tensors fire flash-attn transformers_stream_generator zstandard
# Note: requires transformers<5.0 for modelopt compatibility
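
Because of the transformers<5.0 constraint, it can help to check the installed version before starting a long run. This is a small optional sketch using only the standard library; the helper names are made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Extract the leading major number from a version such as '4.57.1'."""
    return int(version_string.split(".")[0])


def transformers_is_compatible(version_string):
    """True when the version satisfies the transformers<5.0 requirement."""
    return major_version(version_string) < 5


if __name__ == "__main__":
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if transformers_is_compatible(installed) else "too new"
        print(f"transformers {installed}: {status}")
```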

Usage

# On the B200 node (8× B200, 2.7 TB RAM)
cd /root/nvidia-meeting
source venv/bin/activate

# Using BF16 source weights (preferred for modelopt calibration)
python quantize_modelopt.py \
    --model /root/nvidia-meeting/DeepSeek-V4-Pro \
    --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt \
    --qformat nvfp4_experts_only \
    --tp 8 \
    --calib_size 256

# Using FP8 source (modelopt handles dequant internally)
python quantize_modelopt.py \
    --model /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 \
    --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt-fp8src \
    --qformat nvfp4_experts_only \
    --tp 8 \
    --calib_size 256

Low-memory options

If you hit OOM during calibration:

  • --use_seq_device_map — sequential device mapping across GPUs
  • --low_memory_mode — compress weights before calibration (FP8/NVFP4 only)
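
When switching between source checkpoints and memory options, a small launcher helper avoids retyping the command. The sketch below only assembles the argument list from the flags shown in this README. It assumes --use_seq_device_map and --low_memory_mode are plain on/off switches that take no value, which the script may or may not implement that way.

```python
import shlex


def build_command(model, export_dir, qformat="nvfp4_experts_only",
                  tp=8, calib_size=256,
                  use_seq_device_map=False, low_memory_mode=False):
    """Assemble the quantize_modelopt.py invocation as an argument list."""
    cmd = [
        "python", "quantize_modelopt.py",
        "--model", model,
        "--export_dir", export_dir,
        "--qformat", qformat,
        "--tp", str(tp),
        "--calib_size", str(calib_size),
    ]
    # Assumption: both low-memory options are boolean switches.
    if use_seq_device_map:
        cmd.append("--use_seq_device_map")
    if low_memory_mode:
        cmd.append("--low_memory_mode")
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "/root/nvidia-meeting/DeepSeek-V4-Pro-FP8",
        "/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt-fp8src",
        low_memory_mode=True,
    )
    # Print a shell-safe version; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```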

Output

Exports a Unified HuggingFace checkpoint compatible with:

  • TensorRT-LLM (PyTorch and C++ backends)
  • vLLM
  • SGLang
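
Before pointing a serving stack at the export, a quick sanity check on the quantization metadata can catch a failed or partial run. The sketch below assumes the export directory contains an hf_quant_config.json file with a "quantization" section holding quant_algo, group_size, and exclude_modules keys. That layout is an assumption based on typical Model Optimizer exports, so verify it against your own output.

```python
import json
from pathlib import Path


def read_quant_summary(export_dir):
    """Summarize quantization metadata, or return None if it is missing."""
    # Assumed file name; adjust if your export uses a different one.
    config_path = Path(export_dir) / "hf_quant_config.json"
    if not config_path.is_file():
        return None
    with config_path.open() as f:
        data = json.load(f)
    quant = data.get("quantization", {})
    return {
        "quant_algo": quant.get("quant_algo"),
        "group_size": quant.get("group_size"),
        "num_excluded_modules": len(quant.get("exclude_modules") or []),
    }


if __name__ == "__main__":
    summary = read_quant_summary(
        "/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt"
    )
    print(summary if summary is not None else "no quantization metadata found")
```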

Expected runtime

Expect 24 to 72 hours for a full calibration run on 8× B200 with 256 calibration samples.
