# DeepSeek V4 Pro → NVFP4 via NVIDIA Model Optimizer

Fallback quantization path using NVIDIA's official Model Optimizer (`nvidia-modelopt`) PTQ pipeline.

## Why this branch

Path A (custom streaming FP8→NVFP4) is weight-only W4A16. If its accuracy turns out not to be good enough, NVIDIA's Model Optimizer provides data-driven calibration with proper activation scales, and is the officially supported path for DeepSeek V3/V4 NVFP4.

## What's here

| File | Purpose |
| --- | --- |
| `quantize_modelopt.py` | PTQ via `nvidia-modelopt` with `NVFP4_EXPERTS_ONLY` config |

## Quantization config

This branch uses `nvfp4_experts_only`, NVIDIA's recommended config for MoE models. It quantizes only the expert MLP layers (`mlp.experts` / `block_sparse_moe`) and keeps the attention QKV projections in higher precision.

Options:

- `nvfp4_experts_only`: experts only (recommended for MoE)
- `nvfp4_mlp_only`: all MLP layers (experts + shared)
- `nvfp4`: full-model NVFP4 (riskier for attention)

## Prerequisites

```bash
# Use the TensorRT-LLM docker if possible:
#   docker run --gpus all -it nvcr.io/nvidia/tensorrt-llm/release:1.2.0 bash

# Otherwise pip install:
pip install -U "nvidia-modelopt[hf]"
pip install compressed-tensors fire flash-attn transformers_stream_generator zstandard

# Note: requires transformers<5.0 for modelopt compatibility
```

## Usage

```bash
# On the B200 node (8× B200, 2.7 TB RAM)
cd /root/nvidia-meeting
source venv/bin/activate

# Using BF16 source weights (preferred for modelopt calibration)
python quantize_modelopt.py \
    --model /root/nvidia-meeting/DeepSeek-V4-Pro \
    --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt \
    --qformat nvfp4_experts_only \
    --tp 8 \
    --calib_size 256

# Using FP8 source (modelopt handles dequant internally)
python quantize_modelopt.py \
    --model /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 \
    --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt-fp8src \
    --qformat nvfp4_experts_only \
    --tp 8 \
    --calib_size 256
```

## Low-memory options

If you hit OOM during calibration:

- `--use_seq_device_map`: sequential device mapping across GPUs
- `--low_memory_mode`: compress weights before calibration (FP8/NVFP4 only)

## Output

Exports a **Unified HuggingFace checkpoint** compatible with:

- TensorRT-LLM (PyTorch and C++ backends)
- vLLM
- SGLang

## Expected runtime

24-72 hours for full calibration on 8× B200 with 256 calibration samples.
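
## Sketch: what each `--qformat` scope covers

As a rough illustration of how the three scopes listed under "Quantization config" differ, the sketch below matches module names against wildcard patterns. It is not ModelOpt's implementation: the pattern lists are assumptions derived from the name fragments mentioned in this README (`mlp.experts`, `block_sparse_moe`), the module names are hypothetical, and real configs typically carry additional exclusions that are omitted here.

```python
from fnmatch import fnmatch

# Assumed wildcard patterns per scope. These mirror the name fragments
# mentioned in this README; they are NOT copied from ModelOpt's configs.
SCOPE_PATTERNS = {
    "nvfp4_experts_only": ["*mlp.experts*", "*block_sparse_moe*"],
    "nvfp4_mlp_only": ["*mlp*", "*block_sparse_moe*"],
    "nvfp4": ["*"],
}


def is_quantized(module_name: str, qformat: str) -> bool:
    """Return True if the module name matches any pattern of the scope."""
    return any(fnmatch(module_name, pat) for pat in SCOPE_PATTERNS[qformat])


# Hypothetical module names, for illustration only.
EXAMPLE_MODULES = [
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.mlp.experts.17.up_proj",
    "model.layers.3.mlp.shared_experts.down_proj",
]

if __name__ == "__main__":
    for qformat in SCOPE_PATTERNS:
        selected = [m for m in EXAMPLE_MODULES if is_quantized(m, qformat)]
        print(f"{qformat}: {selected}")
```

Under these assumed patterns, the attention projection is selected only by the full `nvfp4` scope, and the shared-expert MLP is selected by `nvfp4_mlp_only` but not by `nvfp4_experts_only`.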