Cleanup: nuke dead scripts and stale docs, rewrite README for full NVFP4 pipeline
93
README.md
@@ -1,75 +1,44 @@
# DeepSeek V4 Pro → NVFP4 via NVIDIA Model Optimizer
# DeepSeek V4 Pro → NVFP4 Quantization

Fallback quantization path using NVIDIA's official Model Optimizer (`nvidia-modelopt`) PTQ pipeline.
Full NVFP4 quantization of DeepSeek V4 Pro using NVIDIA's Model Optimizer.

## Why this branch
## Strategy

Path A (custom streaming FP8→NVFP4) is weight-only W4A16. If it doesn't produce good enough accuracy, NVIDIA's Model Optimizer provides data-driven calibration with proper activation scales, and is the officially supported path for DeepSeek V3/V4 NVFP4.
1. **Dequantize** the original mixed-precision FP8 weights to pure BF16 (`scripts/dequant_fp8_to_bf16.py`)
2. **Full quantize** BF16 → NVFP4 using NVIDIA's official ModelOpt PTQ pipeline (`scripts/model_opt_nvfp4_full.py`)

## What's here
Full model quantization (attention + experts + shared MLP) to NVFP4. Target output: ~600GB.
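
To make the target format concrete, here is a minimal sketch of block-wise NVFP4-style rounding. It is not ModelOpt's implementation. It assumes the usual description of NVFP4: 4-bit E2M1 values with one shared scale per block of 16. For simplicity it keeps that scale as a plain Python float, whereas the real format stores block scales in FP8 plus a per-tensor scale. The function names are hypothetical.

```python
# Illustrative only: not ModelOpt's implementation.
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumption: one scale shared by 16 consecutive values


def fake_quantize_block(block):
    """Round one block to the E2M1 grid; return (scale, dequantized values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0.0, [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for x in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append((nearest if x >= 0 else -nearest) * scale)
    return scale, out


def fake_quantize(values):
    """Apply block-wise fake quantization to a flat list of floats."""
    result = []
    for start in range(0, len(values), BLOCK_SIZE):
        _, dequantized = fake_quantize_block(values[start:start + BLOCK_SIZE])
        result.extend(dequantized)
    return result
```

Under these assumptions each weight costs 4 bits plus a share of its block scale, which is why a calibrated scale per block matters more than in wider formats.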

## Scripts

| File | Purpose |
| --- | --- |
| `quantize_modelopt.py` | PTQ via `nvidia-modelopt` with `NVFP4_EXPERTS_ONLY` config |
| `scripts/dequant_fp8_to_bf16.py` | Dequant FP8 source → pure BF16 (resumable, shard-level) |
| `scripts/upcast_to_bf16.py` | Alternative: upcast mixed-precision to BF16 |
| `scripts/model_opt_nvfp4_full.py` | Run ModelOpt NVFP4 full quantization (calib 128) |
| `patches/quant_module_patched.py` | Patch for modelopt V4 experts ModuleList bug |
| `patches/patch_finegrained_fp8_blackwell.py` | Blackwell FP8 kernel patch |
| `check-ttl.sh` | B200 node TTL watchdog |
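
The "resumable, shard-level" behaviour noted for the dequant script can be sketched as follows. This is not the contents of `scripts/dequant_fp8_to_bf16.py`; `process_shards` and the `convert` callback are hypothetical names used only to show the skip-if-done pattern.

```python
import os
import tempfile


def process_shards(shard_names, out_dir, convert):
    """Convert each shard once; skip shards whose output already exists."""
    os.makedirs(out_dir, exist_ok=True)
    done, skipped = [], []
    for name in shard_names:
        final_path = os.path.join(out_dir, name)
        if os.path.exists(final_path):
            skipped.append(name)  # finished in an earlier run
            continue
        # Write to a temp file first so an interrupted run never leaves a
        # half-written shard that would be mistaken for a finished one.
        fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
        with os.fdopen(fd, "wb") as f:
            f.write(convert(name))
        os.replace(tmp_path, final_path)  # atomic rename on the same filesystem
        done.append(name)
    return done, skipped
```

Re-running the function after an interruption only processes the shards that are still missing, which matters when a single pass over a multi-terabyte checkpoint takes hours.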

## Quantization config
## B200 Node

Using `nvfp4_experts_only` — NVIDIA's recommended config for MoE models. This quantizes only the expert MLP layers (`mlp.experts` / `block_sparse_moe`) while keeping attention QKV projections in higher precision. Options:
- 8× B200, 2.7TB RAM, 13TB NVMe
- See `.env` for access details

- `nvfp4_experts_only` — Experts only (recommended for MoE)
- `nvfp4_mlp_only` — All MLP layers (experts + shared)
- `nvfp4` — Full model NVFP4 (riskier for attention)
## Key Notes

## Prerequisites
- **Calib size: 128** (256 OOMs on the node's 2.7TB RAM with the 3TB BF16 model)
- **Full quant (`nvfp4`)**, not experts-only
- Use the BF16 source: V4's mixed precision causes issues, and the FP8 source has kernel problems on Blackwell
- `--use_seq_device_map` required (model doesn't fit in GPU VRAM alone)
- `--gpu_max_mem_percentage 0.7` for VRAM headroom
- `--low_memory_mode` causes meta device errors with V4; don't use it
- modelopt has no explicit V4 support; it relies on auto-detection of fused experts
- Calibration dataset `nvidia/Nemotron-Post-Training-Dataset-v2` is gated and requires an HF token
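
The notes above can be collected into one command line. The sketch below only assembles an argv list and prints it; nothing is executed. It assumes the invocation still goes through ModelOpt's `scripts/huggingface_example.sh` with the shell flag names used by the scripts removed in this commit (`--quant`, `--calib`, `--kv_cache_quant`), and that the wrapper accepts `--gpu_max_mem_percentage`. `build_ptq_command` is a hypothetical helper, not part of this repository.

```python
import shlex


def build_ptq_command(model_dir, quant="nvfp4", calib=128, tp=8,
                      kv_cache_quant="fp8_cast", gpu_max_mem_percentage=0.7):
    """Assemble the PTQ command line as an argv list (nothing is executed)."""
    return [
        "bash", "scripts/huggingface_example.sh",
        "--model", model_dir,
        "--quant", quant,              # full quantization, not experts-only
        "--tp", str(tp),
        "--calib", str(calib),         # 128: the notes report OOM at 256
        "--kv_cache_quant", kv_cache_quant,
        "--trust_remote_code",
        "--use_seq_device_map",        # model does not fit in GPU VRAM alone
        "--gpu_max_mem_percentage", str(gpu_max_mem_percentage),
        # --low_memory_mode is deliberately omitted (meta device errors with V4)
    ]


argv = build_ptq_command("/root/nvidia-meeting/DeepSeek-V4-Pro-BF16")
print(shlex.join(argv))
```

Keeping the flags in a list rather than a shell string avoids quoting mistakes and makes it easy to assert that a forbidden flag never sneaks in.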

```bash
# Use the TensorRT-LLM docker if possible:
# docker run --gpus all -it nvcr.io/nvidia/tensorrt-llm/release:1.2.0 bash
## Bugs Found (V4 + modelopt)

# Otherwise pip install:
pip install -U "nvidia-modelopt[hf]"
pip install compressed-tensors fire flash-attn transformers_stream_generator zstandard
# Note: requires transformers<5.0 for modelopt compatibility
```

## Usage

```bash
# On the B200 node (8× B200, 2.7 TB RAM)
cd /root/nvidia-meeting
source venv/bin/activate

# Using BF16 source weights (preferred for modelopt calibration)
python quantize_modelopt.py \
  --model /root/nvidia-meeting/DeepSeek-V4-Pro \
  --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt \
  --qformat nvfp4_experts_only \
  --tp 8 \
  --calib_size 256

# Using FP8 source (modelopt handles dequant internally)
python quantize_modelopt.py \
  --model /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 \
  --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt-fp8src \
  --qformat nvfp4_experts_only \
  --tp 8 \
  --calib_size 256
```

## Low-memory options

If you hit OOM during calibration:

- `--use_seq_device_map` — sequential device mapping across GPUs
- `--low_memory_mode` — compress weights before calibration (FP8/NVFP4 only)

## Output

Exports a **Unified HuggingFace checkpoint** compatible with:
- TensorRT-LLM (PyTorch and C++ backends)
- vLLM
- SGLang

## Expected runtime

24-72 hours for full calibration on 8× B200 with 256 calibration samples.
1. `QuantDeepseekV4Experts` AttributeError — patched `iter_weights_for_calibration()` for ModuleList quantizers
2. `--low_memory_mode` → meta device error
3. Missing `kernels` package for FP8 ops
4. `--calib` not `--calib_size`, `--quant` not `--qformat` (shell script arg names)

@@ -1,38 +0,0 @@
# DeepSeek V4 Pro NVFP4 via NVIDIA ModelOpt

## What this does
Quantizes DeepSeek V4 Pro (FP8 weights) to full NVFP4 format using NVIDIA's official ModelOpt pipeline.
Target output: ~600GB (vs 840GB from custom Path A converter).

## Prerequisites
- B200 node (8× B200, 2.7TB RAM) — NVFP4 requires Blackwell GPUs
- modelopt 0.45.0+ from git
- transformers 5.8.0.dev0 (for DeepSeekV4 support)
- kernels package (for FP8 dequantization during calibration)

## Critical Patch
modelopt has a bug with DeepSeekV4Experts — the `iter_weights_for_calibration()` method
doesn't handle ModuleList quantizers (plural `gate_up_proj_weight_quantizers`).
Apply the patch before running:

```bash
cp patches/quant_module_patched.py <venv-path>/lib/python3.10/site-packages/modelopt/torch/quantization/nn/modules/quant_module.py
```
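
The shape of the problem can be illustrated without modelopt. The sketch below is not the contents of `patches/quant_module_patched.py` and does not use modelopt's real classes: `FakeQuantizer`, `iter_weight_quantizer_pairs`, and the idea of pairing each per-expert weight slice with its own quantizer are assumptions made purely for illustration. It shows an iterator that accepts either a single `<name>_weight_quantizer` attribute or a plural `<name>_weight_quantizers` list.

```python
from types import SimpleNamespace


class FakeQuantizer:
    """Stand-in for a weight quantizer (hypothetical, not a modelopt class)."""


def iter_weight_quantizer_pairs(module, weight_name):
    """Yield (weight, quantizer) pairs for singular or plural quantizer attributes."""
    weight = getattr(module, weight_name)
    plural = getattr(module, f"{weight_name}_weight_quantizers", None)
    if plural is not None:
        # Assumed layout: one quantizer per expert, matched to that expert's slice.
        for expert_weight, quantizer in zip(weight, plural):
            yield expert_weight, quantizer
        return
    yield weight, getattr(module, f"{weight_name}_weight_quantizer")


# Plain lists stand in for a fused weight tensor and an nn.ModuleList.
fused = SimpleNamespace(
    gate_up_proj=[[1.0], [2.0]],
    gate_up_proj_weight_quantizers=[FakeQuantizer(), FakeQuantizer()],
)
single = SimpleNamespace(down_proj=[3.0], down_proj_weight_quantizer=FakeQuantizer())

fused_pairs = list(iter_weight_quantizer_pairs(fused, "gate_up_proj"))
single_pairs = list(iter_weight_quantizer_pairs(single, "down_proj"))
```

Code that only looks up the singular attribute name raises `AttributeError` on the fused-experts module, which matches the symptom described above.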

## Do NOT use these flags
- `--low_memory_mode`: causes meta device error with V4
- `--calib_size`: wrong arg name (use `--calib`)
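
A small guard can enforce these rules before a long run starts. The sketch below is a hypothetical helper, not part of this repository: it rejects `--low_memory_mode` and rewrites the Python-style flag names to the shell-script names (`--calib_size` to `--calib`, and likewise `--qformat` to `--quant`, as listed in the main README's bug notes).

```python
RENAMED_FLAGS = {"--calib_size": "--calib", "--qformat": "--quant"}
REJECTED_FLAGS = {"--low_memory_mode": "causes a meta device error with V4"}


def normalize_flags(argv):
    """Return argv with shell-script flag names; raise on flags known to break V4."""
    out = []
    for arg in argv:
        if arg in REJECTED_FLAGS:
            raise ValueError(f"{arg} must not be used: {REJECTED_FLAGS[arg]}")
        out.append(RENAMED_FLAGS.get(arg, arg))
    return out
```

Failing fast here is cheap compared with discovering a wrong flag hours into calibration.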

## Run
```bash
bash scripts/run_modelopt_nvfp4.sh
```

## Output
`/root/nvidia-meeting/modelopt-repo/examples/llm_ptq/saved_models_DeepSeek-V4-Pro-FP8_nvfp4_kv_fp8_cast`

## Notes
- Use FP8 source (`DeepSeek-V4-Pro-FP8`), NOT mixed-precision BF16 (`DeepSeek-V4-Pro`)
- V4's mixed precision causes "wonky shit" — FP8 is clean
- Calibration takes hours with CPU offload (`--use_seq_device_map`)
- Expected calibration time: several hours for 256 samples
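
CPU offload depends on how much memory each device is allowed to hold. The removed `quantize_modelopt.py` hard-coded a `max_memory` mapping of `"100GiB"` per GPU plus `"2500GiB"` for the CPU. As a sketch, the same mapping can be derived from a headroom fraction such as the `0.7` used with `--gpu_max_mem_percentage`. `build_max_memory` is a hypothetical helper, and the 200 GiB per-GPU total in the example is a placeholder, not a measured value for this node.

```python
def build_max_memory(num_gpus, gpu_total_gib, gpu_fraction, cpu_gib):
    """Build a max_memory mapping: GPU index -> cap string, plus a 'cpu' entry."""
    # The small epsilon guards against float products such as 125.99999999999999.
    per_gpu = int(gpu_total_gib * gpu_fraction + 1e-6)
    max_memory = {i: f"{per_gpu}GiB" for i in range(num_gpus)}
    max_memory["cpu"] = f"{cpu_gib}GiB"
    return max_memory


# Placeholder numbers: 8 GPUs, 200 GiB each, 70% usable, 2500 GiB of host RAM.
limits = build_max_memory(8, 200, 0.7, 2500)
```

A mapping of this shape is what Hugging Face `from_pretrained` accepts as `max_memory` when `device_map="auto"` is used; layers that do not fit under the GPU caps spill to the CPU entry.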
@@ -1,166 +0,0 @@
#!/usr/bin/env python3
"""NVIDIA Model Optimizer PTQ for DeepSeek V4 Pro → NVFP4.

Uses nvidia-modelopt's official PTQ pipeline with NVFP4Experts-Only config,
which quantizes only MoE expert layers while keeping attention QKV in higher
precision — the recommended approach for DeepSeek MoE models.

Output is a Unified HuggingFace checkpoint deployable on TRT-LLM / vLLM / SGLang.

Usage:
    python quantize_modelopt.py \
        --model /root/nvidia-meeting/DeepSeek-V4-Pro \
        --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt \
        --qformat nvfp4_experts_only \
        --tp 8 \
        --calib_size 256

For the FP8 source variant, just change --model path. modelopt handles
dequantization internally.
"""

import argparse
import os
import random
import time

import numpy as np
import torch

import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from modelopt.torch.utils.dataset_utils import create_forward_loop

from transformers import AutoModelForCausalLM, AutoTokenizer


mto.enable_huggingface_checkpointing()


QUANT_CONFIGS = {
    "nvfp4": mtq.NVFP4_DEFAULT_CFG,
    "nvfp4_experts_only": mtq.NVFP4_EXPERTS_ONLY_CFG,
    "nvfp4_mlp_only": mtq.NVFP4_MLP_ONLY_CFG,
    "nvfp4_omlp_only": mtq.NVFP4_OMLP_ONLY_CFG,
    "fp8": mtq.FP8_DEFAULT_CFG,
}


def main():
    ap = argparse.ArgumentParser(description="Model Optimizer PTQ for DeepSeek V4 Pro")
    ap.add_argument("--model", required=True, help="Path to HF model (BF16 or FP8)")
    ap.add_argument("--export_dir", required=True, help="Output directory for quantized checkpoint")
    ap.add_argument("--qformat", default="nvfp4_experts_only",
                    choices=list(QUANT_CONFIGS.keys()),
                    help="Quantization format (default: nvfp4_experts_only for MoE)")
    ap.add_argument("--kv_cache_qformat", default="fp8_cast",
                    help="KV cache quantization (default: fp8_cast, fast no-calib)")
    ap.add_argument("--tp", type=int, default=8, help="Tensor parallelism for export")
    ap.add_argument("--calib_size", type=int, nargs="+", default=[256],
                    help="Calibration dataset size (per dataset)")
    ap.add_argument("--batch_size", type=int, default=1, help="Calibration batch size")
    ap.add_argument("--calib_seq", type=int, default=4096, help="Max calibration sequence length")
    ap.add_argument("--trust_remote_code", action="store_true", default=True,
                    help="Trust remote code (required for V4)")
    ap.add_argument("--use_seq_device_map", action="store_true",
                    help="Use sequential device map for low-memory calibration")
    ap.add_argument("--low_memory_mode", action="store_true",
                    help="Compress weights before calibration (FP8/NVFP4 only)")
    args = ap.parse_args()

    print(f"=== Model Optimizer PTQ ===")
    print(f" Model: {args.model}")
    print(f" QFormat: {args.qformat}")
    print(f" KV Cache: {args.kv_cache_qformat}")
    print(f" TP: {args.tp}")
    print(f" Calib: {args.calib_size} samples, seq_len={args.calib_seq}")
    print()

    # Seed everything
    random.seed(1234)
    np.random.seed(1234)
    torch.manual_seed(1234)

    # Load tokenizer
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(
        args.model,
        trust_remote_code=args.trust_remote_code,
        padding_side="left",
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load model
    print("Loading model...")
    model_kwargs = {
        "trust_remote_code": args.trust_remote_code,
        "torch_dtype": torch.bfloat16,
    }
    if args.use_seq_device_map:
        model_kwargs["device_map"] = "auto"
        model_kwargs["offload_folder"] = "offload"
        model_kwargs["offload_state_dict"] = True
        model_kwargs["max_memory"] = {i: "100GiB" for i in range(8)}
        model_kwargs["max_memory"]["cpu"] = "2500GiB"
    elif args.low_memory_mode:
        # Load entirely on CPU, modelopt will handle placement
        model_kwargs["device_map"] = {"": "cpu"}

    model = AutoModelForCausalLM.from_pretrained(args.model, **model_kwargs)

    if not args.use_seq_device_map and not args.low_memory_mode:
        model = model.cuda()

    # Build calibration dataloader
    print("Building calibration dataset...")
    calib_dataloader = get_dataloader(
        tokenizer=tokenizer,
        calib_size=args.calib_size,
        batch_size=args.batch_size,
        calib_seq=args.calib_seq,
    )

    # Build forward loop for calibration
    def forward_loop(model):
        for batch in calib_dataloader:
            model(**batch)

    # Quantize
    quant_cfg = QUANT_CONFIGS[args.qformat]
    print(f"Running PTQ with {args.qformat}...")
    t0 = time.time()

    model = mtq.quantize(model, quant_cfg, forward_loop)

    elapsed = time.time() - t0
    print(f"Quantization complete in {elapsed/60:.1f} min")

    # Export
    print(f"Exporting to {args.export_dir} ...")
    with torch.inference_mode():
        export_hf_checkpoint(
            model,
            args.export_dir,
            tokenizer=tokenizer,
            export_tensorrt_llm_plugins=True,
        )

    print(f"Done. Output at {args.export_dir}")


def get_dataloader(tokenizer, calib_size, batch_size, calib_seq):
    """Create calibration dataloader using modelopt's built-in dataset utils."""
    from modelopt.torch.utils.dataset_utils import get_dataset_dataloader

    return get_dataset_dataloader(
        tokenizer=tokenizer,
        num_samples=calib_size[0],
        batch_size=batch_size,
        max_sample_length=calib_seq,
    )


if __name__ == "__main__":
    main()
@@ -1,65 +0,0 @@
#!/usr/bin/env python3
"""
ModelOpt NVFP4 quantization — experts only.

Quantizes only the MoE expert weights (gate_up_proj, down_proj) to NVFP4,
leaving attention and shared MLP layers untouched. This avoids issues with
FP8 attention kernels on Blackwell (DeepGEMM unsupported, Triton finegrained
FP8 matmul shape mismatches).

Available NVFP4 quantization strategies (from modelopt huggingface_example.sh):
  - nvfp4               : Full model NVFP4 quantization
  - nvfp4_experts_only  : Only MoE expert weights (this script)
  - nvfp4_mlp_only      : Only MLP layers (experts + shared MLP)
  - nvfp4_omlp_only     : Only output + MLP layers
  - nvfp4_awq           : NVFP4 with AWQ calibration
  - nvfp4_mse           : NVFP4 with MSE calibration
  - w4a8_nvfp4_fp8      : W4A8 NVFP4 weights + FP8 activations
  - w4a8_mxfp4_fp8      : W4A8 MXFP4 weights + FP8 activations
  - nvfp4_svdquant      : NVFP4 with SVDQuant
  - nvfp4_local_hessian : NVFP4 with local Hessian calibration

Strategy: Copy this file to model_opt_nvfp4_<strategy>.py and tweak as needed.
By the end, we'll have working quantized weights for each successful strategy.

Output dir naming: DeepSeek-V4-Pro_NVFP4-<strategy>_kv_fp8_cast
"""

import subprocess
import sys
import os

# ── Config ──────────────────────────────────────────────────────────────────
MODEL = "/root/nvidia-meeting/DeepSeek-V4-Pro-BF16"  # Dequantized BF16 (from scripts/dequant_fp8_to_bf16.py)
QUANT = "nvfp4_experts_only"
TP = 8
CALIB = 256
KV_CACHE_QUANT = "fp8_cast"
EXTRA_FLAGS = "--trust_remote_code --use_seq_device_map"

# Output dir follows modelopt convention: <model>_<quant>_kv_<kv_quant>
# We override the model name to make the strategy clear
OUTPUT_NAME = f"DeepSeek-V4-Pro_NVFP4-{QUANT}_kv_{KV_CACHE_QUANT}"

SCRIPT_DIR = "/root/nvidia-meeting/modelopt-repo/examples/llm_ptq"
LOG_FILE = f"/root/nvidia-meeting/modelopt_{QUANT}.log"

# ── Run ─────────────────────────────────────────────────────────────────────
cmd = f"""cd {SCRIPT_DIR} && \\
source /root/nvidia-meeting/venv/bin/activate && \\
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \\
bash scripts/huggingface_example.sh \\
    --model {MODEL} \\
    --quant {QUANT} \\
    --tp {TP} \\
    --calib {CALIB} \\
    --kv_cache_quant {KV_CACHE_QUANT} \\
    {EXTRA_FLAGS} 2>&1 | tee {LOG_FILE}"""

print(f"Running: {QUANT} quantization on {MODEL}")
print(f"Output: {OUTPUT_NAME}")
print(f"Log: {LOG_FILE}")
print(f"Command:\n{cmd}\n")

ret = subprocess.call(cmd, shell=True)
sys.exit(ret)
@@ -1,25 +0,0 @@
#!/bin/bash
# DeepSeek V4 Pro FP8 → NVFP4 via NVIDIA ModelOpt
# Run from: /root/nvidia-meeting/modelopt-repo/examples/llm_ptq
#
# Prerequisites:
# - modelopt 0.45.0+ from git: pip install "nvidia-modelopt[hf] @ git+https://github.com/NVIDIA/Model-Optimizer.git"
# - transformers 5.8.0.dev0: pip install git+https://github.com/huggingface/transformers.git
# - kernels: pip install -U kernels
# - Patch modelopt: cp patches/quant_module_patched.py <venv>/lib/python3.10/site-packages/modelopt/torch/quantization/nn/modules/quant_module.py
#
# Source weights: /root/nvidia-meeting/DeepSeek-V4-Pro-FP8

set -e
cd /root/nvidia-meeting/modelopt-repo/examples/llm_ptq
source /root/nvidia-meeting/venv/bin/activate

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
bash scripts/huggingface_example.sh \
  --model /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 \
  --quant nvfp4 \
  --tp 8 \
  --calib 256 \
  --kv_cache_quant fp8_cast \
  --trust_remote_code \
  --use_seq_device_map