# DeepSeek V4 Pro → NVFP4 via NVIDIA Model Optimizer

Fallback quantization path using NVIDIA's official Model Optimizer (`nvidia-modelopt`) PTQ pipeline.

## Why this branch

Path A (custom streaming FP8→NVFP4) is weight-only (W4A16): weights are quantized to 4 bits while activations stay in 16-bit. If its accuracy is not good enough, NVIDIA's Model Optimizer offers data-driven calibration that also produces proper activation scales, and it is the officially supported path for DeepSeek V3/V4 NVFP4.
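
To make the weight-only versus calibrated distinction concrete, here is a minimal sketch of NVFP4-style block quantization in plain Python. It assumes the usual NVFP4 layout (16-element blocks, values on the FP4 E2M1 grid, one scale per block) and simplifies by keeping the block scale in full precision instead of rounding it to FP8 and by omitting the per-tensor global scale. The function names are illustrative and are not part of `nvidia-modelopt`.

```python
# Magnitudes representable in FP4 E2M1 (the sign is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 consecutive values


def quantize_block(block):
    """Fake-quantize one block: scale, snap to the E2M1 grid, rescale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude to 6.0
    out = []
    for x in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(nearest * scale if x >= 0 else -nearest * scale)
    return out


def quantize_weights(values):
    """Weight-only path: every scale is derived from the weights themselves."""
    out = []
    for i in range(0, len(values), BLOCK_SIZE):
        out.extend(quantize_block(values[i:i + BLOCK_SIZE]))
    return out


def calibrate_activation_amax(batches):
    """Calibrated path: track the largest activation magnitude seen in sample data."""
    return max(abs(x) for batch in batches for x in batch)
```

Weight scales can be computed from the checkpoint alone, which is why Path A needs no data. Activation statistics such as the `amax` above only exist once representative inputs have been run through the model, which is what the calibration pass supplies.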

## What's here

| File | Purpose |
| --- | --- |
| `quantize_modelopt.py` | PTQ via `nvidia-modelopt` with `NVFP4_EXPERTS_ONLY` config |

## Quantization config

This branch uses `nvfp4_experts_only`, NVIDIA's recommended config for MoE models. It quantizes only the expert MLP layers (`mlp.experts` / `block_sparse_moe`) while keeping the attention QKV projections in higher precision. Available options:

- `nvfp4_experts_only`: experts only (recommended for MoE)
- `nvfp4_mlp_only`: all MLP layers (experts + shared)
- `nvfp4`: full-model NVFP4 (riskier for attention)
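
As a rough illustration of what each option covers, the sketch below selects layers by name with wildcard patterns. The module names and patterns are hypothetical (real names depend on the model implementation), and this is not how `nvidia-modelopt` selects layers internally; it only shows the difference in scope.

```python
from fnmatch import fnmatch

# Hypothetical patterns for illustration; real module names depend on the model code.
QFORMAT_PATTERNS = {
    "nvfp4_experts_only": ["*mlp.experts*", "*block_sparse_moe*"],
    "nvfp4_mlp_only": ["*mlp*", "*block_sparse_moe*"],
    "nvfp4": ["*"],
}


def layers_to_quantize(module_names, qformat):
    """Return the module names matched by the chosen qformat's patterns."""
    patterns = QFORMAT_PATTERNS[qformat]
    return [n for n in module_names if any(fnmatch(n, p) for p in patterns)]


# Hypothetical module names for one decoder layer.
names = [
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.mlp.experts.0.up_proj",
    "model.layers.3.mlp.shared_experts.up_proj",
]
print(layers_to_quantize(names, "nvfp4_experts_only"))
```

With these example names, the experts-only option leaves both the attention projection and the shared expert untouched, while `nvfp4_mlp_only` would also pick up the shared expert.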

## Prerequisites

```bash
# Use the TensorRT-LLM docker if possible:
# docker run --gpus all -it nvcr.io/nvidia/tensorrt-llm/release:1.2.0 bash

# Otherwise pip install:
pip install -U "nvidia-modelopt[hf]"
pip install compressed-tensors fire flash-attn transformers_stream_generator zstandard
# Note: requires transformers<5.0 for modelopt compatibility
```
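
Because the `transformers<5.0` constraint is easy to violate with a later `pip install -U`, a small preflight check can catch it before a long run starts. This is a minimal sketch using only the standard library; the helper names are illustrative and not part of this repository.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '4.57.1'."""
    match = re.match(r"\d+", version_string)
    if match is None:
        raise ValueError(f"unrecognized version: {version_string!r}")
    return int(match.group())


def transformers_is_compatible(installed=None):
    """True if the transformers major version is below 5."""
    if installed is None:
        try:
            installed = version("transformers")
        except PackageNotFoundError:
            return False  # not installed at all
    return major_version(installed) < 5


if __name__ == "__main__":
    ok = transformers_is_compatible()
    print("transformers OK" if ok else 'run: pip install "transformers<5.0"')
```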

## Usage

```bash
# On the B200 node (8× B200, 2.7 TB RAM)
cd /root/nvidia-meeting
source venv/bin/activate

# Using BF16 source weights (preferred for modelopt calibration)
python quantize_modelopt.py \
    --model /root/nvidia-meeting/DeepSeek-V4-Pro \
    --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt \
    --qformat nvfp4_experts_only \
    --tp 8 \
    --calib_size 256

# Using FP8 source (modelopt handles dequant internally)
python quantize_modelopt.py \
    --model /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 \
    --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt-fp8src \
    --qformat nvfp4_experts_only \
    --tp 8 \
    --calib_size 256
```

## Low-memory options

If calibration runs out of memory (OOM), try:

- `--use_seq_device_map`: sequential device mapping across GPUs
- `--low_memory_mode`: compress weights before calibration (FP8/NVFP4 only)

## Output

The script exports a **Unified HuggingFace checkpoint** compatible with:

- TensorRT-LLM (PyTorch and C++ backends)
- vLLM
- SGLang
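
Before pointing a serving stack at the export, it can help to confirm which quantization algorithm the checkpoint records. The sketch below assumes the export directory contains a `hf_quant_config.json` file with a `quantization.quant_algo` field; both the filename and the key layout are assumptions here, so check them against your actual export.

```python
import json
from pathlib import Path


def read_quant_algo(export_dir, filename="hf_quant_config.json"):
    """Return the recorded quantization algorithm, or None if the file is missing."""
    path = Path(export_dir) / filename
    if not path.is_file():
        return None
    with path.open() as f:
        config = json.load(f)
    # Assumed layout: {"quantization": {"quant_algo": "..."}}
    return config.get("quantization", {}).get("quant_algo")
```

For example, `read_quant_algo("/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt")` returns `None` when the file is absent, which is a cue to inspect the directory by hand.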

## Expected runtime

24-72 hours for full calibration on 8× B200 with 256 calibration samples.