Cleanup: nuke dead scripts and stale docs, rewrite README for full NVFP4 pipeline

2026-05-08 17:02:07 +00:00
parent 075da675dc
commit eeba101cc4
5 changed files with 31 additions and 356 deletions


@@ -1,75 +1,44 @@
# DeepSeek V4 Pro → NVFP4 via NVIDIA Model Optimizer
# DeepSeek V4 Pro → NVFP4 Quantization
Fallback quantization path using NVIDIA's official Model Optimizer (`nvidia-modelopt`) PTQ pipeline.
Full NVFP4 quantization of DeepSeek V4 Pro using NVIDIA's Model Optimizer.
## Why this branch
## Strategy
Path A (custom streaming FP8→NVFP4) is weight-only W4A16. If it doesn't produce good enough accuracy, NVIDIA's Model Optimizer provides data-driven calibration with proper activation scales, and is the officially supported path for DeepSeek V3/V4 NVFP4.
1. **Dequantize** the original mixed-precision FP8 weights to pure BF16 (`scripts/dequant_fp8_to_bf16.py`)
2. **Full quantize** BF16 → NVFP4 using NVIDIA's official ModelOpt PTQ pipeline (`scripts/model_opt_nvfp4_full.py`)
## What's here
Full model quantization (attention + experts + shared MLP) to NVFP4. Target output: ~600GB.
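For readers unfamiliar with the format, here is a minimal, pure-Python sketch of NVFP4-style block quantization. It is an illustration only, not code from this repo or from ModelOpt. It assumes the standard FP4 E2M1 value grid and one shared scale per 16-element block, and it keeps the block scale as a Python float, whereas real NVFP4 stores it in FP8 with an additional per-tensor scale.

```python
# Toy NVFP4-style block quantization in pure Python.
# Simplification: the block scale stays a Python float; real NVFP4 stores it
# in FP8 (E4M3) alongside a per-tensor FP32 scale.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 E2M1 magnitudes
BLOCK_SIZE = 16  # NVFP4 shares one scale per 16 elements


def quantize_block(block):
    """Map one block of floats onto the signed E2M1 grid with a shared scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # 6.0 is the largest E2M1 magnitude
    codes = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable magnitude.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats from the shared scale and FP4 codes."""
    return [scale * c for c in codes]


block = [0.01 * i - 0.07 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.5f} max_abs_error={max_err:.5f}")
```

With a 4-bit code per weight and one 8-bit scale per 16 weights, storage works out to about 4.5 bits per quantized weight before any layers that are left in higher precision.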
## Scripts
| File | Purpose |
| --- | --- |
| `quantize_modelopt.py` | PTQ via `nvidia-modelopt` with `NVFP4_EXPERTS_ONLY` config |
| `scripts/dequant_fp8_to_bf16.py` | Dequant FP8 source → pure BF16 (resumable, shard-level) |
| `scripts/upcast_to_bf16.py` | Alternative: upcast mixed-precision to BF16 |
| `scripts/model_opt_nvfp4_full.py` | Run ModelOpt NVFP4 full quantization (calib 128) |
| `patches/quant_module_patched.py` | Patch for modelopt V4 experts ModuleList bug |
| `patches/patch_finegrained_fp8_blackwell.py` | Blackwell FP8 kernel patch |
| `check-ttl.sh` | B200 node TTL watchdog |
## Quantization config
## B200 Node
Using `nvfp4_experts_only` — NVIDIA's recommended config for MoE models. This quantizes only the expert MLP layers (`mlp.experts` / `block_sparse_moe`) while keeping attention QKV projections in higher precision. Options:
- 8× B200, 2.7TB RAM, 13TB NVMe
- See `.env` for access details
- `nvfp4_experts_only` — Experts only (recommended for MoE)
- `nvfp4_mlp_only` — All MLP layers (experts + shared)
- `nvfp4` — Full model NVFP4 (riskier for attention)
## Key Notes
## Prerequisites
- **Calib size: 128** (256 OOMs on 2.7TB RAM with 3TB BF16 model)
- **Full quant (`nvfp4`)**, not experts-only
- Use the BF16 source: V4's mixed precision causes issues, and the FP8 source has kernel problems on Blackwell
- `--use_seq_device_map` required (model doesn't fit in GPU VRAM alone)
- `--gpu_max_mem_percentage 0.7` for VRAM headroom
- `--low_memory_mode` causes meta device errors with V4 — don't use
- modelopt has no explicit V4 support — relies on auto-detection of fused experts
- Calibration dataset `nvidia/Nemotron-Post-Training-Dataset-v2` is gated — requires HF token
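The memory notes above can be turned into a rough budget. This is a back-of-the-envelope sketch, not part of the pipeline. The 192 GB of HBM per B200 is an assumption that is not stated in this repo, so adjust it for the actual node. The 8 GPUs, the ~3 TB BF16 size, and the 0.7 fraction come from the notes.

```python
# Rough placement budget for calibration. Inputs are illustrative:
# the per-GPU HBM size is an assumption, not a value from this repo.
NUM_GPUS = 8
HBM_PER_GPU_GB = 192          # assumed B200 capacity
GPU_MAX_MEM_FRACTION = 0.7    # mirrors --gpu_max_mem_percentage 0.7
MODEL_BF16_GB = 3000          # ~3 TB BF16 checkpoint, per the notes


def placement_budget(num_gpus, hbm_gb, fraction, model_gb):
    """Return (usable GPU GB, GB that must spill to CPU RAM)."""
    usable_gpu_gb = num_gpus * hbm_gb * fraction
    spill_gb = max(0.0, model_gb - usable_gpu_gb)
    return usable_gpu_gb, spill_gb


usable, spill = placement_budget(
    NUM_GPUS, HBM_PER_GPU_GB, GPU_MAX_MEM_FRACTION, MODEL_BF16_GB
)
print(f"usable GPU memory: {usable:.0f} GB")
print(f"must be offloaded to CPU RAM: {spill:.0f} GB")
```

Under these assumptions well over half of the weights end up in host RAM, which is consistent with `--use_seq_device_map` being required and with host RAM being the resource that runs out at larger calibration sizes.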
```bash
# Use the TensorRT-LLM docker if possible:
# docker run --gpus all -it nvcr.io/nvidia/tensorrt-llm/release:1.2.0 bash
## Bugs Found (V4 + modelopt)
# Otherwise pip install:
pip install -U "nvidia-modelopt[hf]"
pip install compressed-tensors fire flash-attn transformers_stream_generator zstandard
# Note: requires transformers<5.0 for modelopt compatibility
```
## Usage
```bash
# On the B200 node (8× B200, 2.7 TB RAM)
cd /root/nvidia-meeting
source venv/bin/activate
# Using BF16 source weights (preferred for modelopt calibration)
python quantize_modelopt.py \
--model /root/nvidia-meeting/DeepSeek-V4-Pro \
--export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt \
--qformat nvfp4_experts_only \
--tp 8 \
--calib_size 256
# Using FP8 source (modelopt handles dequant internally)
python quantize_modelopt.py \
--model /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 \
--export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt-fp8src \
--qformat nvfp4_experts_only \
--tp 8 \
--calib_size 256
```
## Low-memory options
If you hit OOM during calibration:
- `--use_seq_device_map` — sequential device mapping across GPUs
- `--low_memory_mode` — compress weights before calibration (FP8/NVFP4 only)
## Output
Exports a **Unified HuggingFace checkpoint** compatible with:
- TensorRT-LLM (PyTorch and C++ backends)
- vLLM
- SGLang
## Expected runtime
24-72 hours for full calibration on 8× B200 with 256 calibration samples.
1. `QuantDeepseekV4Experts` AttributeError — patched `iter_weights_for_calibration()` for ModuleList quantizers
2. `--low_memory_mode` → meta device error
3. Missing `kernels` package for FP8 ops
4. Wrong argument names: the shell script takes `--calib` and `--quant`, not `--calib_size` and `--qformat`
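The actual fix for bug 1 lives in `patches/quant_module_patched.py` and is not reproduced here. The sketch below only illustrates the general idea of tolerating both a single quantizer attribute and a plural container of per-expert quantizers. All class names are hypothetical stand-ins, a plain list plays the role of `torch.nn.ModuleList`, and the singular attribute name is assumed by analogy with the plural `gate_up_proj_weight_quantizers`.

```python
# Hypothetical stand-ins; these are NOT modelopt classes. A plain list plays
# the role of torch.nn.ModuleList so the sketch runs without PyTorch.
class FakeQuantizer:
    def __init__(self, name):
        self.name = name


def iter_quantizers(module, base_name):
    """Yield quantizers stored as `<base_name>` (single) or `<base_name>s` (list)."""
    single = getattr(module, base_name, None)
    if single is not None:
        yield single
    many = getattr(module, base_name + "s", None)
    if many is not None:
        yield from many


class FusedExperts:
    """Mimics a fused-experts module holding one quantizer per expert."""

    def __init__(self, num_experts):
        self.gate_up_proj_weight_quantizers = [
            FakeQuantizer(f"expert{i}") for i in range(num_experts)
        ]


class DenseLayer:
    """Mimics an ordinary layer holding a single quantizer."""

    def __init__(self):
        self.gate_up_proj_weight_quantizer = FakeQuantizer("dense")


base = "gate_up_proj_weight_quantizer"
fused_names = [q.name for q in iter_quantizers(FusedExperts(3), base)]
dense_names = [q.name for q in iter_quantizers(DenseLayer(), base)]
print(fused_names, dense_names)
```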


@@ -1,38 +0,0 @@
# DeepSeek V4 Pro NVFP4 via NVIDIA ModelOpt
## What this does
Quantizes DeepSeek V4 Pro (FP8 weights) to full NVFP4 format using NVIDIA's official ModelOpt pipeline.
Target output: ~600GB (vs 840GB from custom Path A converter).
## Prerequisites
- B200 node (8× B200, 2.7TB RAM) — NVFP4 requires Blackwell GPUs
- modelopt 0.45.0+ from git
- transformers 5.8.0.dev0 (for DeepSeekV4 support)
- kernels package (for FP8 dequantization during calibration)
## Critical Patch
modelopt has a bug with DeepSeekV4Experts — the `iter_weights_for_calibration()` method
doesn't handle ModuleList quantizers (plural `gate_up_proj_weight_quantizers`).
Apply the patch before running:
```bash
cp patches/quant_module_patched.py <venv-path>/lib/python3.10/site-packages/modelopt/torch/quantization/nn/modules/quant_module.py
```
## Do NOT use these flags
- `--low_memory_mode`: causes meta device error with V4
- `--calib_size`: wrong arg name (use `--calib`)
## Run
```bash
bash scripts/run_modelopt_nvfp4.sh
```
## Output
`/root/nvidia-meeting/modelopt-repo/examples/llm_ptq/saved_models_DeepSeek-V4-Pro-FP8_nvfp4_kv_fp8_cast`
## Notes
- Use FP8 source (`DeepSeek-V4-Pro-FP8`), NOT mixed-precision BF16 (`DeepSeek-V4-Pro`)
- V4's mixed precision causes "wonky shit" — FP8 is clean
- Calibration takes hours with CPU offload (`--use_seq_device_map`)
- Expected calibration time: several hours for 256 samples


@@ -1,166 +0,0 @@
#!/usr/bin/env python3
"""NVIDIA Model Optimizer PTQ for DeepSeek V4 Pro → NVFP4.
Uses nvidia-modelopt's official PTQ pipeline with NVFP4Experts-Only config,
which quantizes only MoE expert layers while keeping attention QKV in higher
precision — the recommended approach for DeepSeek MoE models.
Output is a Unified HuggingFace checkpoint deployable on TRT-LLM / vLLM / SGLang.
Usage:
python quantize_modelopt.py \
--model /root/nvidia-meeting/DeepSeek-V4-Pro \
--export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt \
--qformat nvfp4_experts_only \
--tp 8 \
--calib_size 256
For the FP8 source variant, just change --model path. modelopt handles
dequantization internally.
"""
import argparse
import os
import random
import time

import numpy as np
import torch

import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from modelopt.torch.utils.dataset_utils import create_forward_loop
from transformers import AutoModelForCausalLM, AutoTokenizer

mto.enable_huggingface_checkpointing()

QUANT_CONFIGS = {
    "nvfp4": mtq.NVFP4_DEFAULT_CFG,
    "nvfp4_experts_only": mtq.NVFP4_EXPERTS_ONLY_CFG,
    "nvfp4_mlp_only": mtq.NVFP4_MLP_ONLY_CFG,
    "nvfp4_omlp_only": mtq.NVFP4_OMLP_ONLY_CFG,
    "fp8": mtq.FP8_DEFAULT_CFG,
}

def main():
    ap = argparse.ArgumentParser(description="Model Optimizer PTQ for DeepSeek V4 Pro")
    ap.add_argument("--model", required=True, help="Path to HF model (BF16 or FP8)")
    ap.add_argument("--export_dir", required=True, help="Output directory for quantized checkpoint")
    ap.add_argument("--qformat", default="nvfp4_experts_only",
                    choices=list(QUANT_CONFIGS.keys()),
                    help="Quantization format (default: nvfp4_experts_only for MoE)")
    ap.add_argument("--kv_cache_qformat", default="fp8_cast",
                    help="KV cache quantization (default: fp8_cast, fast no-calib)")
    ap.add_argument("--tp", type=int, default=8, help="Tensor parallelism for export")
    ap.add_argument("--calib_size", type=int, nargs="+", default=[256],
                    help="Calibration dataset size (per dataset)")
    ap.add_argument("--batch_size", type=int, default=1, help="Calibration batch size")
    ap.add_argument("--calib_seq", type=int, default=4096, help="Max calibration sequence length")
    ap.add_argument("--trust_remote_code", action="store_true", default=True,
                    help="Trust remote code (required for V4)")
    ap.add_argument("--use_seq_device_map", action="store_true",
                    help="Use sequential device map for low-memory calibration")
    ap.add_argument("--low_memory_mode", action="store_true",
                    help="Compress weights before calibration (FP8/NVFP4 only)")
    args = ap.parse_args()

    print(f"=== Model Optimizer PTQ ===")
    print(f" Model: {args.model}")
    print(f" QFormat: {args.qformat}")
    print(f" KV Cache: {args.kv_cache_qformat}")
    print(f" TP: {args.tp}")
    print(f" Calib: {args.calib_size} samples, seq_len={args.calib_seq}")
    print()

    # Seed everything
    random.seed(1234)
    np.random.seed(1234)
    torch.manual_seed(1234)

    # Load tokenizer
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(
        args.model,
        trust_remote_code=args.trust_remote_code,
        padding_side="left",
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load model
    print("Loading model...")
    model_kwargs = {
        "trust_remote_code": args.trust_remote_code,
        "torch_dtype": torch.bfloat16,
    }
    if args.use_seq_device_map:
        model_kwargs["device_map"] = "auto"
        model_kwargs["offload_folder"] = "offload"
        model_kwargs["offload_state_dict"] = True
        model_kwargs["max_memory"] = {i: "100GiB" for i in range(8)}
        model_kwargs["max_memory"]["cpu"] = "2500GiB"
    elif args.low_memory_mode:
        # Load entirely on CPU, modelopt will handle placement
        model_kwargs["device_map"] = {"": "cpu"}
    model = AutoModelForCausalLM.from_pretrained(args.model, **model_kwargs)
    if not args.use_seq_device_map and not args.low_memory_mode:
        model = model.cuda()

    # Build calibration dataloader
    print("Building calibration dataset...")
    calib_dataloader = get_dataloader(
        tokenizer=tokenizer,
        calib_size=args.calib_size,
        batch_size=args.batch_size,
        calib_seq=args.calib_seq,
    )

    # Build forward loop for calibration
    def forward_loop(model):
        for batch in calib_dataloader:
            model(**batch)

    # Quantize
    quant_cfg = QUANT_CONFIGS[args.qformat]
    print(f"Running PTQ with {args.qformat}...")
    t0 = time.time()
    model = mtq.quantize(model, quant_cfg, forward_loop)
    elapsed = time.time() - t0
    print(f"Quantization complete in {elapsed/60:.1f} min")

    # Export
    print(f"Exporting to {args.export_dir} ...")
    with torch.inference_mode():
        export_hf_checkpoint(
            model,
            args.export_dir,
            tokenizer=tokenizer,
            export_tensorrt_llm_plugins=True,
        )
    print(f"Done. Output at {args.export_dir}")

def get_dataloader(tokenizer, calib_size, batch_size, calib_seq):
    """Create calibration dataloader using modelopt's built-in dataset utils."""
    from modelopt.torch.utils.dataset_utils import get_dataset_dataloader

    return get_dataset_dataloader(
        tokenizer=tokenizer,
        num_samples=calib_size[0],
        batch_size=batch_size,
        max_sample_length=calib_seq,
    )


if __name__ == "__main__":
    main()


@@ -1,65 +0,0 @@
#!/usr/bin/env python3
"""
ModelOpt NVFP4 quantization — experts only.
Quantizes only the MoE expert weights (gate_up_proj, down_proj) to NVFP4,
leaving attention and shared MLP layers untouched. This avoids issues with
FP8 attention kernels on Blackwell (DeepGEMM unsupported, Triton finegrained
FP8 matmul shape mismatches).
Available NVFP4 quantization strategies (from modelopt huggingface_example.sh):
- nvfp4 : Full model NVFP4 quantization
- nvfp4_experts_only : Only MoE expert weights (this script)
- nvfp4_mlp_only : Only MLP layers (experts + shared MLP)
- nvfp4_omlp_only : Only output + MLP layers
- nvfp4_awq : NVFP4 with AWQ calibration
- nvfp4_mse : NVFP4 with MSE calibration
- w4a8_nvfp4_fp8 : W4A8 NVFP4 weights + FP8 activations
- w4a8_mxfp4_fp8 : W4A8 MXFP4 weights + FP8 activations
- nvfp4_svdquant : NVFP4 with SVDQuant
- nvfp4_local_hessian : NVFP4 with local Hessian calibration
Strategy: Copy this file to model_opt_nvfp4_<strategy>.py and tweak as needed.
By the end, we'll have working quantized weights for each successful strategy.
Output dir naming: DeepSeek-V4-Pro_NVFP4-<strategy>_kv_fp8_cast
"""
import subprocess
import sys
import os
# ── Config ──────────────────────────────────────────────────────────────────
MODEL = "/root/nvidia-meeting/DeepSeek-V4-Pro-BF16" # Dequantized BF16 (from scripts/dequant_fp8_to_bf16.py)
QUANT = "nvfp4_experts_only"
TP = 8
CALIB = 256
KV_CACHE_QUANT = "fp8_cast"
EXTRA_FLAGS = "--trust_remote_code --use_seq_device_map"
# Output dir follows modelopt convention: <model>_<quant>_kv_<kv_quant>
# We override the model name to make the strategy clear
OUTPUT_NAME = f"DeepSeek-V4-Pro_NVFP4-{QUANT}_kv_{KV_CACHE_QUANT}"
SCRIPT_DIR = "/root/nvidia-meeting/modelopt-repo/examples/llm_ptq"
LOG_FILE = f"/root/nvidia-meeting/modelopt_{QUANT}.log"
# ── Run ─────────────────────────────────────────────────────────────────────
cmd = f"""cd {SCRIPT_DIR} && \\
source /root/nvidia-meeting/venv/bin/activate && \\
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \\
bash scripts/huggingface_example.sh \\
--model {MODEL} \\
--quant {QUANT} \\
--tp {TP} \\
--calib {CALIB} \\
--kv_cache_quant {KV_CACHE_QUANT} \\
{EXTRA_FLAGS} 2>&1 | tee {LOG_FILE}"""
print(f"Running: {QUANT} quantization on {MODEL}")
print(f"Output: {OUTPUT_NAME}")
print(f"Log: {LOG_FILE}")
print(f"Command:\n{cmd}\n")
ret = subprocess.call(cmd, shell=True)
sys.exit(ret)
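
The launcher above assembles one shell string and runs it with `shell=True`. As a hedged alternative sketch (not what this repo does), the same flags can be built as an argument list, which avoids quoting problems and makes the `--calib` / `--quant` names easy to check. The function name `build_ptq_argv` is hypothetical; the flag names are the ones used in the command above, and the values are placeholders.

```python
import shlex


def build_ptq_argv(model, quant, tp, calib, kv_cache_quant, extra_flags=()):
    """Build the argv list for scripts/huggingface_example.sh (hypothetical helper)."""
    argv = [
        "bash", "scripts/huggingface_example.sh",
        "--model", model,
        "--quant", quant,             # not --qformat
        "--tp", str(tp),
        "--calib", str(calib),        # not --calib_size
        "--kv_cache_quant", kv_cache_quant,
    ]
    argv.extend(extra_flags)
    return argv


argv = build_ptq_argv(
    model="/root/nvidia-meeting/DeepSeek-V4-Pro-BF16",
    quant="nvfp4",
    tp=8,
    calib=128,
    kv_cache_quant="fp8_cast",
    extra_flags=["--trust_remote_code", "--use_seq_device_map"],
)
# Print a copy-pasteable form of the command for the log.
print(shlex.join(argv))
```

The list could then be passed to `subprocess.run(argv, cwd=..., check=True)` from the `examples/llm_ptq` directory. Activating the virtualenv, setting `PYTORCH_CUDA_ALLOC_CONF`, and teeing output to a log file are left out of this sketch and would need to be handled separately.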


@@ -1,25 +0,0 @@
#!/bin/bash
# DeepSeek V4 Pro FP8 → NVFP4 via NVIDIA ModelOpt
# Run from: /root/nvidia-meeting/modelopt-repo/examples/llm_ptq
#
# Prerequisites:
# - modelopt 0.45.0+ from git: pip install "nvidia-modelopt[hf] @ git+https://github.com/NVIDIA/Model-Optimizer.git"
# - transformers 5.8.0.dev0: pip install git+https://github.com/huggingface/transformers.git
# - kernels: pip install -U kernels
# - Patch modelopt: cp patches/quant_module_patched.py <venv>/lib/python3.10/site-packages/modelopt/torch/quantization/nn/modules/quant_module.py
#
# Source weights: /root/nvidia-meeting/DeepSeek-V4-Pro-FP8
set -e
cd /root/nvidia-meeting/modelopt-repo/examples/llm_ptq
source /root/nvidia-meeting/venv/bin/activate
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
bash scripts/huggingface_example.sh \
--model /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 \
--quant nvfp4 \
--tp 8 \
--calib 256 \
--kv_cache_quant fp8_cast \
--trust_remote_code \
--use_seq_device_map