# DeepSeek V4 Pro → NVFP4 via NVIDIA Model Optimizer

Fallback quantization path that uses NVIDIA's official Model Optimizer (`nvidia-modelopt`) post-training quantization (PTQ) pipeline.

## Why this branch

Path A (custom streaming FP8→NVFP4) is weight-only (W4A16): weights are quantized to 4 bits while activations stay in 16-bit. If its accuracy turns out to be insufficient, NVIDIA's Model Optimizer offers data-driven calibration with proper activation scales, and it is the officially supported path for DeepSeek V3/V4 NVFP4.
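
To make the weight-only idea concrete, here is a minimal sketch of block-wise NVFP4-style fake quantization in plain Python. It is not Path A's code or Model Optimizer's implementation. It assumes the usual NVFP4 layout (4-bit E2M1 values sharing one scale per 16-element block) and simplifies the scale to a Python float. The function names are hypothetical. Note that the scale is derived from the weights alone, with no calibration data. Activation scales cannot be derived this way, which is the gap the calibrated path fills.

```python
import math

# Representable magnitudes of a 4-bit E2M1 float (sign is handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumption: one shared scale per 16 consecutive values


def fake_quantize_block(block):
    """Quantize then dequantize one block of weights using only the weights themselves."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    # Simplification: the scale stays a Python float. Real NVFP4 stores the
    # block scale in FP8 (E4M3) and adds a per-tensor FP32 scale on top.
    scale = amax / E2M1_GRID[-1]
    out = []
    for x in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(nearest * scale, x))
    return out


def fake_quantize(weights):
    """Apply block-wise fake quantization to a flat list of weights."""
    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        out.extend(fake_quantize_block(weights[start:start + BLOCK_SIZE]))
    return out
```

Comparing `fake_quantize(w)` against `w` for a few real weight rows gives a rough feel for the rounding error that weight-only quantization introduces.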

## What's here

| File | Purpose |
| --- | --- |
| `quantize_modelopt.py` | PTQ via `nvidia-modelopt` with `NVFP4_EXPERTS_ONLY` config |

## Quantization config

This branch uses `nvfp4_experts_only`, NVIDIA's recommended config for MoE models. It quantizes only the expert MLP layers (`mlp.experts` / `block_sparse_moe`) and keeps the rest of the model, including the attention QKV projections, in higher precision. Available options:

- `nvfp4_experts_only` — Experts only (recommended for MoE)
- `nvfp4_mlp_only` — All MLP layers (experts + shared)
- `nvfp4` — Full model NVFP4 (riskier for attention)
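
The difference between the three options comes down to which modules are selected for quantization. The sketch below illustrates that selection with glob patterns on module names. The patterns and the example module names are assumptions made for illustration and are not Model Optimizer's actual config contents. A real full-model config would typically also exclude layers such as the output head.

```python
from fnmatch import fnmatch

# Hypothetical name patterns per option; the real configs may match differently.
PATTERNS = {
    "nvfp4_experts_only": ["*mlp.experts*", "*block_sparse_moe*"],
    "nvfp4_mlp_only": ["*mlp*", "*block_sparse_moe*"],
    "nvfp4": ["*"],
}


def is_quantized(module_name, qformat):
    """Return True if a module with this name would be quantized under qformat."""
    return any(fnmatch(module_name, pattern) for pattern in PATTERNS[qformat])


if __name__ == "__main__":
    # Illustrative module names, not taken from the actual checkpoint.
    names = [
        "model.layers.3.mlp.experts.7.up_proj",
        "model.layers.3.mlp.shared_experts.up_proj",
        "model.layers.3.self_attn.q_proj",
    ]
    for qformat in PATTERNS:
        selected = [n for n in names if is_quantized(n, qformat)]
        print(qformat, "->", selected)
```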

## Prerequisites

```bash
# Use the TensorRT-LLM docker if possible:
# docker run --gpus all -it nvcr.io/nvidia/tensorrt-llm/release:1.2.0 bash

# Otherwise pip install:
pip install -U "nvidia-modelopt[hf]"
pip install compressed-tensors fire flash-attn transformers_stream_generator zstandard
# Note: modelopt requires transformers<5.0; pin it if pip resolved a newer release:
pip install "transformers<5.0"
```
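
Because the `transformers<5.0` constraint is easy to violate after an unrelated upgrade, a quick pre-flight check before a multi-hour run may be worthwhile. The following is a small sketch using only the standard library. The helper names are hypothetical and the check looks only at the major version.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '4.57.1'."""
    return int(version_string.split(".")[0])


def transformers_is_compatible(installed=None):
    """True if the transformers major version is below 5.

    Pass a version string to check it directly; otherwise the installed
    package is looked up, and a missing package counts as incompatible.
    """
    if installed is None:
        try:
            installed = version("transformers")
        except PackageNotFoundError:
            return False
    return major_version(installed) < 5


if __name__ == "__main__":
    if not transformers_is_compatible():
        raise SystemExit('transformers is missing or >=5.0; run: pip install "transformers<5.0"')
```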

## Usage

```bash
# On the B200 node (8× B200, 2.7 TB RAM)
cd /root/nvidia-meeting
source venv/bin/activate

# Using BF16 source weights (preferred for modelopt calibration)
python quantize_modelopt.py \
    --model /root/nvidia-meeting/DeepSeek-V4-Pro \
    --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt \
    --qformat nvfp4_experts_only \
    --tp 8 \
    --calib_size 256

# Using FP8 source (modelopt handles dequant internally)
python quantize_modelopt.py \
    --model /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 \
    --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt-fp8src \
    --qformat nvfp4_experts_only \
    --tp 8 \
    --calib_size 256
```

## Low-memory options

If you hit OOM during calibration:

- `--use_seq_device_map` — sequential device mapping across GPUs
- `--low_memory_mode` — compress weights before calibration (FP8/NVFP4 only)
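
When trying several combinations of these flags, it can help to build the command in one place. The sketch below is a hypothetical wrapper rather than part of this branch. It assumes the low-memory options are bare boolean flags appended to the invocation shown under Usage.

```python
import shlex


def build_command(model, export_dir, qformat="nvfp4_experts_only",
                  tp=8, calib_size=256, low_memory_flags=()):
    """Assemble the quantize_modelopt.py command line as an argument list."""
    cmd = [
        "python", "quantize_modelopt.py",
        "--model", model,
        "--export_dir", export_dir,
        "--qformat", qformat,
        "--tp", str(tp),
        "--calib_size", str(calib_size),
    ]
    # Assumption: low-memory options take no value and are simply appended.
    cmd.extend(low_memory_flags)
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "/root/nvidia-meeting/DeepSeek-V4-Pro",
        "/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt",
        low_memory_flags=("--use_seq_device_map",),
    )
    print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to launch
```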

## Output

Exports a **Unified HuggingFace checkpoint** compatible with:
- TensorRT-LLM (PyTorch and C++ backends)
- vLLM
- SGLang
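
Before pointing an inference engine at the export directory, a quick sanity check of its contents can catch an incomplete export. The sketch below assumes a conventional Hugging Face layout (`config.json` plus `*.safetensors` shards). Whether `config.json` carries a `quantization_config` entry is also an assumption, so the helper simply reports whatever it finds.

```python
import json
from pathlib import Path


def summarize_checkpoint(export_dir):
    """Report shard count, total shard size, and any quantization metadata found."""
    root = Path(export_dir)
    shards = sorted(root.glob("*.safetensors"))
    config_path = root / "config.json"
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    return {
        "num_shards": len(shards),
        "total_bytes": sum(p.stat().st_size for p in shards),
        "has_config": config_path.exists(),
        # None when the key is absent (or when there is no config.json at all).
        "quantization_config": config.get("quantization_config"),
    }


if __name__ == "__main__":
    summary = summarize_checkpoint("/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt")
    print(json.dumps(summary, indent=2))
```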

## Expected runtime

Full calibration with 256 samples on 8× B200 is expected to take 24–72 hours.