# DeepSeek V4 Pro → NVFP4 via NVIDIA Model Optimizer
Fallback quantization path using NVIDIA's official Model Optimizer (`nvidia-modelopt`) PTQ pipeline.
## Why this branch
Path A (custom streaming FP8→NVFP4) is weight-only (W4A16): weights are stored in 4 bits while activations stay in 16-bit. If its accuracy is not good enough, this branch falls back to NVIDIA's Model Optimizer, which calibrates activation scales on real data and is the officially supported path for DeepSeek V3/V4 NVFP4.
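
To make the difference concrete, here is a toy sketch of where each scale comes from. It is not ModelOpt's implementation: it uses a single per-tensor scale, whereas real NVFP4 uses per-block scales, and the names `weight_scale` and `ActivationCalibrator` are hypothetical. The only fact relied on is that the largest magnitude in FP4 E2M1 is 6.

```python
E2M1_MAX = 6.0  # largest magnitude representable in FP4 (E2M1)


def weight_scale(weights):
    """Weight-only path: the scale is derived from the tensor itself, no data needed."""
    return max(abs(w) for w in weights) / E2M1_MAX


class ActivationCalibrator:
    """Calibrated path: track the running absolute maximum over calibration batches."""

    def __init__(self):
        self.amax = 0.0

    def observe(self, activations):
        # Each calibration batch can only widen the observed range.
        self.amax = max(self.amax, max(abs(a) for a in activations))

    def scale(self):
        return self.amax / E2M1_MAX


# Hypothetical values, for illustration only.
calibrator = ActivationCalibrator()
for batch in ([1.0, -2.0], [12.0, 0.5]):
    calibrator.observe(batch)
print(weight_scale([0.5, -3.0, 1.5]), calibrator.scale())
```

The activation scale depends on which samples are fed through the model, which is why the calibration set size (`--calib_size`) matters on this path and not on Path A.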
## What's here
| File | Purpose |
| --- | --- |
| `quantize_modelopt.py` | PTQ via `nvidia-modelopt` with `NVFP4_EXPERTS_ONLY` config |
## Quantization config
This branch uses `nvfp4_experts_only`, NVIDIA's recommended config for MoE models. It quantizes only the expert MLP layers (`mlp.experts` / `block_sparse_moe`) and keeps attention QKV projections in higher precision. Available `--qformat` values:
- `nvfp4_experts_only` — Experts only (recommended for MoE)
- `nvfp4_mlp_only` — All MLP layers (experts + shared)
- `nvfp4` — Full model NVFP4 (riskier for attention)
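
As a rough illustration of what "experts only" means, the sketch below selects modules by name pattern. The patterns and module names are assumptions chosen to mirror the `mlp.experts` / `block_sparse_moe` naming above; the actual selection logic lives inside `nvidia-modelopt` and may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for illustration; not copied from nvidia-modelopt.
EXPERT_PATTERNS = ["*mlp.experts*", "*block_sparse_moe*"]


def is_quantized(module_name, patterns=EXPERT_PATTERNS):
    """Return True if the module name matches any expert pattern."""
    return any(fnmatchcase(module_name, p) for p in patterns)


# Hypothetical module names in a DeepSeek-style MoE layer.
names = [
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.mlp.experts.17.up_proj",
    "model.layers.3.mlp.shared_experts.down_proj",
    "model.layers.3.mlp.gate",
]
selected = [n for n in names if is_quantized(n)]
print(selected)
```

Under these assumed patterns, only the routed expert projection is selected; the attention projection, the shared expert, and the router gate stay in higher precision. An `nvfp4_mlp_only`-style config would also need to match the shared expert.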
## Prerequisites
```bash
# Use the TensorRT-LLM docker if possible:
# docker run --gpus all -it nvcr.io/nvidia/tensorrt-llm/release:1.2.0 bash
# Otherwise pip install:
pip install -U "nvidia-modelopt[hf]"
pip install compressed-tensors fire flash-attn transformers_stream_generator zstandard
# Note: requires transformers<5.0 for modelopt compatibility
```
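
Because the install commands above do not pin `transformers`, a quick check before a multi-hour run can save time. This is a minimal sketch using only the standard library; the helper names `major_version` and `check_transformers` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Extract the leading major number from a version string such as '4.57.1'."""
    return int(version_string.split(".")[0])


def check_transformers(max_major_exclusive=5):
    """Report whether the installed transformers satisfies transformers<5.0."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if major_version(installed) >= max_major_exclusive:
        return (
            f"transformers {installed} is too new; "
            f"install transformers<{max_major_exclusive}.0"
        )
    return f"transformers {installed} is OK"


print(check_transformers())
```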
## Usage
```bash
# On the B200 node (8× B200, 2.7 TB RAM)
cd /root/nvidia-meeting
source venv/bin/activate

# Using BF16 source weights (preferred for modelopt calibration)
python quantize_modelopt.py \
    --model /root/nvidia-meeting/DeepSeek-V4-Pro \
    --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt \
    --qformat nvfp4_experts_only \
    --tp 8 \
    --calib_size 256

# Using FP8 source (modelopt handles dequant internally)
python quantize_modelopt.py \
    --model /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 \
    --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt-fp8src \
    --qformat nvfp4_experts_only \
    --tp 8 \
    --calib_size 256
```
## Low-memory options
If you hit OOM during calibration:
- `--use_seq_device_map` — sequential device mapping across GPUs
- `--low_memory_mode` — compress weights before calibration (FP8/NVFP4 only)
## Output
Exports a **Unified HuggingFace checkpoint** compatible with:
- TensorRT-LLM (PyTorch and C++ backends)
- vLLM
- SGLang
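
Before pointing a serving stack at the export, it can help to confirm what was actually written. The sketch below assumes the export directory contains an `hf_quant_config.json` with a top-level `quantization` object holding `quant_algo`, `kv_cache_quant_algo`, and `exclude_modules`; treat the file name and field names as assumptions to verify against your ModelOpt version. The helper names are hypothetical.

```python
import json
from pathlib import Path


def summarize_quantization(config):
    """Pull the key fields out of a parsed quantization config dict."""
    quant = config.get("quantization", {})
    return {
        "quant_algo": quant.get("quant_algo"),
        "kv_cache_quant_algo": quant.get("kv_cache_quant_algo"),
        "num_excluded_modules": len(quant.get("exclude_modules") or []),
    }


def load_quant_summary(export_dir):
    """Read hf_quant_config.json (assumed file name) from an export directory."""
    config_path = Path(export_dir) / "hf_quant_config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"no hf_quant_config.json in {export_dir}")
    return summarize_quantization(json.loads(config_path.read_text()))


# Example with an in-memory config; the values here are illustrative only.
example = {
    "quantization": {
        "quant_algo": "NVFP4",
        "kv_cache_quant_algo": None,
        "exclude_modules": ["lm_head", "model.layers.0.self_attn*"],
    }
}
print(summarize_quantization(example))
```

For an experts-only run, a long `exclude_modules` list is expected, since attention and other non-expert layers are left unquantized.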
## Expected runtime
Expect 24 to 72 hours for full calibration on 8× B200 with 256 calibration samples.