diff --git a/README.md b/README.md
index fe0d372..7406cfd 100644
--- a/README.md
+++ b/README.md
@@ -1,75 +1,44 @@
-# DeepSeek V4 Pro → NVFP4 via NVIDIA Model Optimizer
+# DeepSeek V4 Pro → NVFP4 Quantization
 
-Fallback quantization path using NVIDIA's official Model Optimizer (`nvidia-modelopt`) PTQ pipeline.
+Full NVFP4 quantization of DeepSeek V4 Pro using NVIDIA's Model Optimizer.
 
-## Why this branch
+## Strategy
 
-Path A (custom streaming FP8→NVFP4) is weight-only W4A16. If it doesn't produce good enough accuracy, NVIDIA's Model Optimizer provides data-driven calibration with proper activation scales, and is the officially supported path for DeepSeek V3/V4 NVFP4.
+1. **Dequantize** the original mixed-precision FP8 weights to pure BF16 (`scripts/dequant_fp8_to_bf16.py`)
+2. **Full quantize** BF16 → NVFP4 using NVIDIA's official ModelOpt PTQ pipeline (`scripts/model_opt_nvfp4_full.py`)
 
-## What's here
+Full model quantization (attention + experts + shared MLP) to NVFP4. Target output: ~600GB.
+
+## Scripts
 
 | File | Purpose |
 | --- | --- |
-| `quantize_modelopt.py` | PTQ via `nvidia-modelopt` with `NVFP4_EXPERTS_ONLY` config |
+| `scripts/dequant_fp8_to_bf16.py` | Dequant FP8 source → pure BF16 (resumable, shard-level) |
+| `scripts/upcast_to_bf16.py` | Alternative: upcast mixed-precision to BF16 |
+| `scripts/model_opt_nvfp4_full.py` | Run ModelOpt NVFP4 full quantization (calib 128) |
+| `patches/quant_module_patched.py` | Patch for modelopt V4 experts ModuleList bug |
+| `patches/patch_finegrained_fp8_blackwell.py` | Blackwell FP8 kernel patch |
+| `check-ttl.sh` | B200 node TTL watchdog |
 
-## Quantization config
+## B200 Node
 
-Using `nvfp4_experts_only` — NVIDIA's recommended config for MoE models. This quantizes only the expert MLP layers (`mlp.experts` / `block_sparse_moe`) while keeping attention QKV projections in higher precision. Options:
+- 8× B200, 2.7TB RAM, 13TB NVMe
+- See `.env` for access details
 
-- `nvfp4_experts_only` — Experts only (recommended for MoE)
-- `nvfp4_mlp_only` — All MLP layers (experts + shared)
-- `nvfp4` — Full model NVFP4 (riskier for attention)
+## Key Notes
 
-## Prerequisites
+- **Calib size: 128** (256 OOMs on 2.8TB RAM with 3TB BF16 model)
+- **Full quant (`nvfp4`)**, not experts-only
+- Use BF16 source — V4's mixed precision causes issues, FP8 source has kernel problems on Blackwell
+- `--use_seq_device_map` required (model doesn't fit in GPU VRAM alone)
+- `--gpu_max_mem_percentage 0.7` for VRAM headroom
+- `--low_memory_mode` causes meta device errors with V4 — don't use
+- modelopt has no explicit V4 support — relies on auto-detection of fused experts
+- Calibration dataset `nvidia/Nemotron-Post-Training-Dataset-v2` is gated — requires HF token
 
-```bash
-# Use the TensorRT-LLM docker if possible:
-# docker run --gpus all -it nvcr.io/nvidia/tensorrt-llm/release:1.2.0 bash
+## Bugs Found (V4 + modelopt)
 
-# Otherwise pip install:
-pip install -U "nvidia-modelopt[hf]"
-pip install compressed-tensors fire flash-attn transformers_stream_generator zstandard
-# Note: requires transformers<5.0 for modelopt compatibility
-```
-
-## Usage
-
-```bash
-# On the B200 node (8× B200, 2.7 TB RAM)
-cd /root/nvidia-meeting
-source venv/bin/activate
-
-# Using BF16 source weights (preferred for modelopt calibration)
-python quantize_modelopt.py \
-  --model /root/nvidia-meeting/DeepSeek-V4-Pro \
-  --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt \
-  --qformat nvfp4_experts_only \
-  --tp 8 \
-  --calib_size 256
-
-# Using FP8 source (modelopt handles dequant internally)
-python quantize_modelopt.py \
-  --model /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 \
-  --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt-fp8src \
-  --qformat nvfp4_experts_only \
-  --tp 8 \
-  --calib_size 256
-```
-
-## Low-memory options
-
-If you hit OOM during calibration:
-
-- `--use_seq_device_map` — sequential device mapping across GPUs
-- `--low_memory_mode` — compress weights before calibration (FP8/NVFP4 only)
-
-## Output
-
-Exports a **Unified HuggingFace checkpoint** compatible with:
-- TensorRT-LLM (PyTorch and C++ backends)
-- vLLM
-- SGLang
-
-## Expected runtime
-
-24-72 hours for full calibration on 8× B200 with 256 calibration samples.
+1. `QuantDeepseekV4Experts` AttributeError — patched `iter_weights_for_calibration()` for ModuleList quantizers
+2. `--low_memory_mode` → meta device error
+3. Missing `kernels` package for FP8 ops
+4. `--calib` not `--calib_size`, `--quant` not `--qformat` (shell script arg names)
diff --git a/README_modelopt_nvfp4.md b/README_modelopt_nvfp4.md
deleted file mode 100644
index fb7b1d8..0000000
--- a/README_modelopt_nvfp4.md
+++ /dev/null
@@ -1,38 +0,0 @@
-# DeepSeek V4 Pro NVFP4 via NVIDIA ModelOpt
-
-## What this does
-Quantizes DeepSeek V4 Pro (FP8 weights) to full NVFP4 format using NVIDIA's official ModelOpt pipeline.
-Target output: ~600GB (vs 840GB from custom Path A converter).
-
-## Prerequisites
-- B200 node (8× B200, 2.7TB RAM) — NVFP4 requires Blackwell GPUs
-- modelopt 0.45.0+ from git
-- transformers 5.8.0.dev0 (for DeepSeekV4 support)
-- kernels package (for FP8 dequantization during calibration)
-
-## Critical Patch
-modelopt has a bug with DeepSeekV4Experts — the `iter_weights_for_calibration()` method
-doesn't handle ModuleList quantizers (plural `gate_up_proj_weight_quantizers`).
-Apply the patch before running:
-
-```bash
-cp patches/quant_module_patched.py /lib/python3.10/site-packages/modelopt/torch/quantization/nn/modules/quant_module.py
-```
-
-## Do NOT use these flags
-- `--low_memory_mode`: causes meta device error with V4
-- `--calib_size`: wrong arg name (use `--calib`)
-
-## Run
-```bash
-bash scripts/run_modelopt_nvfp4.sh
-```
-
-## Output
-`/root/nvidia-meeting/modelopt-repo/examples/llm_ptq/saved_models_DeepSeek-V4-Pro-FP8_nvfp4_kv_fp8_cast`
-
-## Notes
-- Use FP8 source (`DeepSeek-V4-Pro-FP8`), NOT mixed-precision BF16 (`DeepSeek-V4-Pro`)
-- V4's mixed precision causes "wonky shit" — FP8 is clean
-- Calibration takes hours with CPU offload (`--use_seq_device_map`)
-- Expected calibration time: several hours for 256 samples
diff --git a/quantize_modelopt.py b/quantize_modelopt.py
deleted file mode 100644
index b500b90..0000000
--- a/quantize_modelopt.py
+++ /dev/null
@@ -1,166 +0,0 @@
-#!/usr/bin/env python3
-"""NVIDIA Model Optimizer PTQ for DeepSeek V4 Pro → NVFP4.
-
-Uses nvidia-modelopt's official PTQ pipeline with NVFP4Experts-Only config,
-which quantizes only MoE expert layers while keeping attention QKV in higher
-precision — the recommended approach for DeepSeek MoE models.
-
-Output is a Unified HuggingFace checkpoint deployable on TRT-LLM / vLLM / SGLang.
-
-Usage:
-    python quantize_modelopt.py \
-        --model /root/nvidia-meeting/DeepSeek-V4-Pro \
-        --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt \
-        --qformat nvfp4_experts_only \
-        --tp 8 \
-        --calib_size 256
-
-For the FP8 source variant, just change --model path. modelopt handles
-dequantization internally.
-"""
-
-import argparse
-import os
-import random
-import time
-
-import numpy as np
-import torch
-
-import modelopt.torch.opt as mto
-import modelopt.torch.quantization as mtq
-from modelopt.torch.export import export_hf_checkpoint
-from modelopt.torch.utils.dataset_utils import create_forward_loop
-
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-
-mto.enable_huggingface_checkpointing()
-
-
-QUANT_CONFIGS = {
-    "nvfp4": mtq.NVFP4_DEFAULT_CFG,
-    "nvfp4_experts_only": mtq.NVFP4_EXPERTS_ONLY_CFG,
-    "nvfp4_mlp_only": mtq.NVFP4_MLP_ONLY_CFG,
-    "nvfp4_omlp_only": mtq.NVFP4_OMLP_ONLY_CFG,
-    "fp8": mtq.FP8_DEFAULT_CFG,
-}
-
-
-def main():
-    ap = argparse.ArgumentParser(description="Model Optimizer PTQ for DeepSeek V4 Pro")
-    ap.add_argument("--model", required=True, help="Path to HF model (BF16 or FP8)")
-    ap.add_argument("--export_dir", required=True, help="Output directory for quantized checkpoint")
-    ap.add_argument("--qformat", default="nvfp4_experts_only",
-                    choices=list(QUANT_CONFIGS.keys()),
-                    help="Quantization format (default: nvfp4_experts_only for MoE)")
-    ap.add_argument("--kv_cache_qformat", default="fp8_cast",
-                    help="KV cache quantization (default: fp8_cast, fast no-calib)")
-    ap.add_argument("--tp", type=int, default=8, help="Tensor parallelism for export")
-    ap.add_argument("--calib_size", type=int, nargs="+", default=[256],
-                    help="Calibration dataset size (per dataset)")
-    ap.add_argument("--batch_size", type=int, default=1, help="Calibration batch size")
-    ap.add_argument("--calib_seq", type=int, default=4096, help="Max calibration sequence length")
-    ap.add_argument("--trust_remote_code", action="store_true", default=True,
-                    help="Trust remote code (required for V4)")
-    ap.add_argument("--use_seq_device_map", action="store_true",
-                    help="Use sequential device map for low-memory calibration")
-    ap.add_argument("--low_memory_mode", action="store_true",
-                    help="Compress weights before calibration (FP8/NVFP4 only)")
-    args = ap.parse_args()
-
-    print(f"=== Model Optimizer PTQ ===")
-    print(f" Model: {args.model}")
-    print(f" QFormat: {args.qformat}")
-    print(f" KV Cache: {args.kv_cache_qformat}")
-    print(f" TP: {args.tp}")
-    print(f" Calib: {args.calib_size} samples, seq_len={args.calib_seq}")
-    print()
-
-    # Seed everything
-    random.seed(1234)
-    np.random.seed(1234)
-    torch.manual_seed(1234)
-
-    # Load tokenizer
-    print("Loading tokenizer...")
-    tokenizer = AutoTokenizer.from_pretrained(
-        args.model,
-        trust_remote_code=args.trust_remote_code,
-        padding_side="left",
-    )
-    if tokenizer.pad_token is None:
-        tokenizer.pad_token = tokenizer.eos_token
-
-    # Load model
-    print("Loading model...")
-    model_kwargs = {
-        "trust_remote_code": args.trust_remote_code,
-        "torch_dtype": torch.bfloat16,
-    }
-    if args.use_seq_device_map:
-        model_kwargs["device_map"] = "auto"
-        model_kwargs["offload_folder"] = "offload"
-        model_kwargs["offload_state_dict"] = True
-        model_kwargs["max_memory"] = {i: "100GiB" for i in range(8)}
-        model_kwargs["max_memory"]["cpu"] = "2500GiB"
-    elif args.low_memory_mode:
-        # Load entirely on CPU, modelopt will handle placement
-        model_kwargs["device_map"] = {"": "cpu"}
-
-    model = AutoModelForCausalLM.from_pretrained(args.model, **model_kwargs)
-
-    if not args.use_seq_device_map and not args.low_memory_mode:
-        model = model.cuda()
-
-    # Build calibration dataloader
-    print("Building calibration dataset...")
-    calib_dataloader = get_dataloader(
-        tokenizer=tokenizer,
-        calib_size=args.calib_size,
-        batch_size=args.batch_size,
-        calib_seq=args.calib_seq,
-    )
-
-    # Build forward loop for calibration
-    def forward_loop(model):
-        for batch in calib_dataloader:
-            model(**batch)
-
-    # Quantize
-    quant_cfg = QUANT_CONFIGS[args.qformat]
-    print(f"Running PTQ with {args.qformat}...")
-    t0 = time.time()
-
-    model = mtq.quantize(model, quant_cfg, forward_loop)
-
-    elapsed = time.time() - t0
-    print(f"Quantization complete in {elapsed/60:.1f} min")
-
-    # Export
-    print(f"Exporting to {args.export_dir} ...")
-    with torch.inference_mode():
-        export_hf_checkpoint(
-            model,
-            args.export_dir,
-            tokenizer=tokenizer,
-            export_tensorrt_llm_plugins=True,
-        )
-
-    print(f"Done. Output at {args.export_dir}")
-
-
-def get_dataloader(tokenizer, calib_size, batch_size, calib_seq):
-    """Create calibration dataloader using modelopt's built-in dataset utils."""
-    from modelopt.torch.utils.dataset_utils import get_dataset_dataloader
-
-    return get_dataset_dataloader(
-        tokenizer=tokenizer,
-        num_samples=calib_size[0],
-        batch_size=batch_size,
-        max_sample_length=calib_seq,
-    )
-
-
-if __name__ == "__main__":
-    main()
diff --git a/scripts/model_opt_nvfp4_experts_only.py b/scripts/model_opt_nvfp4_experts_only.py
deleted file mode 100644
index 4558251..0000000
--- a/scripts/model_opt_nvfp4_experts_only.py
+++ /dev/null
@@ -1,65 +0,0 @@
-#!/usr/bin/env python3
-"""
-ModelOpt NVFP4 quantization — experts only.
-
-Quantizes only the MoE expert weights (gate_up_proj, down_proj) to NVFP4,
-leaving attention and shared MLP layers untouched. This avoids issues with
-FP8 attention kernels on Blackwell (DeepGEMM unsupported, Triton finegrained
-FP8 matmul shape mismatches).
-
-Available NVFP4 quantization strategies (from modelopt huggingface_example.sh):
-  - nvfp4 : Full model NVFP4 quantization
-  - nvfp4_experts_only : Only MoE expert weights (this script)
-  - nvfp4_mlp_only : Only MLP layers (experts + shared MLP)
-  - nvfp4_omlp_only : Only output + MLP layers
-  - nvfp4_awq : NVFP4 with AWQ calibration
-  - nvfp4_mse : NVFP4 with MSE calibration
-  - w4a8_nvfp4_fp8 : W4A8 NVFP4 weights + FP8 activations
-  - w4a8_mxfp4_fp8 : W4A8 MXFP4 weights + FP8 activations
-  - nvfp4_svdquant : NVFP4 with SVDQuant
-  - nvfp4_local_hessian : NVFP4 with local Hessian calibration
-
-Strategy: Copy this file to model_opt_nvfp4_.py and tweak as needed.
-By the end, we'll have working quantized weights for each successful strategy.
-
-Output dir naming: DeepSeek-V4-Pro_NVFP4-_kv_fp8_cast
-"""
-
-import subprocess
-import sys
-import os
-
-# ── Config ──────────────────────────────────────────────────────────────────
-MODEL = "/root/nvidia-meeting/DeepSeek-V4-Pro-BF16" # Dequantized BF16 (from scripts/dequant_fp8_to_bf16.py)
-QUANT = "nvfp4_experts_only"
-TP = 8
-CALIB = 256
-KV_CACHE_QUANT = "fp8_cast"
-EXTRA_FLAGS = "--trust_remote_code --use_seq_device_map"
-
-# Output dir follows modelopt convention: __kv_
-# We override the model name to make the strategy clear
-OUTPUT_NAME = f"DeepSeek-V4-Pro_NVFP4-{QUANT}_kv_{KV_CACHE_QUANT}"
-
-SCRIPT_DIR = "/root/nvidia-meeting/modelopt-repo/examples/llm_ptq"
-LOG_FILE = f"/root/nvidia-meeting/modelopt_{QUANT}.log"
-
-# ── Run ─────────────────────────────────────────────────────────────────────
-cmd = f"""cd {SCRIPT_DIR} && \\
-source /root/nvidia-meeting/venv/bin/activate && \\
-PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \\
-bash scripts/huggingface_example.sh \\
-    --model {MODEL} \\
-    --quant {QUANT} \\
-    --tp {TP} \\
-    --calib {CALIB} \\
-    --kv_cache_quant {KV_CACHE_QUANT} \\
-    {EXTRA_FLAGS} 2>&1 | tee {LOG_FILE}"""
-
-print(f"Running: {QUANT} quantization on {MODEL}")
-print(f"Output: {OUTPUT_NAME}")
-print(f"Log: {LOG_FILE}")
-print(f"Command:\n{cmd}\n")
-
-ret = subprocess.call(cmd, shell=True)
-sys.exit(ret)
diff --git a/scripts/run_modelopt_nvfp4.sh b/scripts/run_modelopt_nvfp4.sh
deleted file mode 100755
index c947767..0000000
--- a/scripts/run_modelopt_nvfp4.sh
+++ /dev/null
@@ -1,25 +0,0 @@
-#!/bin/bash
-# DeepSeek V4 Pro FP8 → NVFP4 via NVIDIA ModelOpt
-# Run from: /root/nvidia-meeting/modelopt-repo/examples/llm_ptq
-#
-# Prerequisites:
-# - modelopt 0.45.0+ from git: pip install "nvidia-modelopt[hf] @ git+https://github.com/NVIDIA/Model-Optimizer.git"
-# - transformers 5.8.0.dev0: pip install git+https://github.com/huggingface/transformers.git
-# - kernels: pip install -U kernels
-# - Patch modelopt: cp patches/quant_module_patched.py /lib/python3.10/site-packages/modelopt/torch/quantization/nn/modules/quant_module.py
-#
-# Source weights: /root/nvidia-meeting/DeepSeek-V4-Pro-FP8
-
-set -e
-cd /root/nvidia-meeting/modelopt-repo/examples/llm_ptq
-source /root/nvidia-meeting/venv/bin/activate
-
-PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-bash scripts/huggingface_example.sh \
-    --model /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 \
-    --quant nvfp4 \
-    --tp 8 \
-    --calib 256 \
-    --kv_cache_quant fp8_cast \
-    --trust_remote_code \
-    --use_seq_device_map