Cleanup: nuke dead scripts and stale docs, rewrite README for full NVFP4 pipeline

2026-05-08 17:02:07 +00:00
parent 075da675dc
commit eeba101cc4
5 changed files with 31 additions and 356 deletions
--- a/README.md
+++ b/README.md
@@ -1,75 +1,44 @@
-# DeepSeek V4 Pro → NVFP4 via NVIDIA Model Optimizer
+# DeepSeek V4 Pro → NVFP4 Quantization

-Fallback quantization path using NVIDIA's official Model Optimizer (`nvidia-modelopt`) PTQ pipeline.
+Full NVFP4 quantization of DeepSeek V4 Pro using NVIDIA's Model Optimizer.

-## Why this branch
+## Strategy

-Path A (custom streaming FP8→NVFP4) is weight-only W4A16. If it doesn't produce good enough accuracy, NVIDIA's Model Optimizer provides data-driven calibration with proper activation scales, and is the officially supported path for DeepSeek V3/V4 NVFP4.
+1. **Dequantize** the original mixed-precision FP8 weights to pure BF16 (`scripts/dequant_fp8_to_bf16.py`)
+2. **Full quantize** BF16 → NVFP4 using NVIDIA's official ModelOpt PTQ pipeline (`scripts/model_opt_nvfp4_full.py`)

-## What's here
+This quantizes the full model (attention, experts, and shared MLP) to NVFP4. Target output size: ~600GB.
+
+## Scripts

 | File | Purpose |
 | --- | --- |
-| `quantize_modelopt.py` | PTQ via `nvidia-modelopt` with `NVFP4_EXPERTS_ONLY` config |
+| `scripts/dequant_fp8_to_bf16.py` | Dequant FP8 source → pure BF16 (resumable, shard-level) |
+| `scripts/upcast_to_bf16.py` | Alternative: upcast mixed-precision to BF16 |
+| `scripts/model_opt_nvfp4_full.py` | Run ModelOpt NVFP4 full quantization (calib 128) |
+| `patches/quant_module_patched.py` | Patch for modelopt V4 experts ModuleList bug |
+| `patches/patch_finegrained_fp8_blackwell.py` | Blackwell FP8 kernel patch |
+| `check-ttl.sh` | B200 node TTL watchdog |

-## Quantization config
+## B200 Node

-Using `nvfp4_experts_only` — NVIDIA's recommended config for MoE models. This quantizes only the expert MLP layers (`mlp.experts` / `block_sparse_moe`) while keeping attention QKV projections in higher precision. Options:
+- 8× B200, 2.7TB RAM, 13TB NVMe
+- See `.env` for access details

- `nvfp4_experts_only` — Experts only (recommended for MoE)
- `nvfp4_mlp_only` — All MLP layers (experts + shared)
- `nvfp4` — Full model NVFP4 (riskier for attention)
+## Key Notes

-## Prerequisites
+- **Calib size: 128** (256 OOMs: the 3TB BF16 model exceeds the node's 2.7TB of RAM)
+- **Full quant (`nvfp4`)**, not experts-only
+- Use the dequantized BF16 source: V4's original mixed-precision checkpoint causes issues, and the FP8 source has kernel problems on Blackwell
+- `--use_seq_device_map` required (model doesn't fit in GPU VRAM alone)
+- `--gpu_max_mem_percentage 0.7` for VRAM headroom
+- `--low_memory_mode` causes meta device errors with V4 — don't use
+- modelopt has no explicit V4 support — relies on auto-detection of fused experts
+- Calibration dataset `nvidia/Nemotron-Post-Training-Dataset-v2` is gated — requires HF token

-```bash
-# Use the TensorRT-LLM docker if possible:
-# docker run --gpus all -it nvcr.io/nvidia/tensorrt-llm/release:1.2.0 bash
+## Bugs Found (V4 + modelopt)

-# Otherwise pip install:
-pip install -U "nvidia-modelopt[hf]"
-pip install compressed-tensors fire flash-attn transformers_stream_generator zstandard
-# Note: requires transformers<5.0 for modelopt compatibility
-```
-
-## Usage
-
-```bash
-# On the B200 node (8× B200, 2.7 TB RAM)
-cd /root/nvidia-meeting
-source venv/bin/activate
-
-# Using BF16 source weights (preferred for modelopt calibration)
-python quantize_modelopt.py \
-    --model /root/nvidia-meeting/DeepSeek-V4-Pro \
-    --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt \
-    --qformat nvfp4_experts_only \
-    --tp 8 \
-    --calib_size 256
-
-# Using FP8 source (modelopt handles dequant internally)
-python quantize_modelopt.py \
-    --model /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 \
-    --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt-fp8src \
-    --qformat nvfp4_experts_only \
-    --tp 8 \
-    --calib_size 256
-```
-
-## Low-memory options
-
-If you hit OOM during calibration:
-
- `--use_seq_device_map` — sequential device mapping across GPUs
- `--low_memory_mode` — compress weights before calibration (FP8/NVFP4 only)
-
-## Output
-
-Exports a **Unified HuggingFace checkpoint** compatible with:
- TensorRT-LLM (PyTorch and C++ backends)
- vLLM
- SGLang
-
-## Expected runtime
-
-24-72 hours for full calibration on 8× B200 with 256 calibration samples.
+1. `QuantDeepseekV4Experts` AttributeError — patched `iter_weights_for_calibration()` for ModuleList quantizers
+2. `--low_memory_mode` → meta device error
+3. Missing `kernels` package for FP8 ops
+4. Wrong argument names: `scripts/huggingface_example.sh` takes `--calib` and `--quant`, not `--calib_size` and `--qformat`
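
The memory figures under Key Notes can be sanity-checked with a rough budget. The sketch below is illustrative only: the GPU count, 2.7TB of host RAM, 0.7 VRAM fraction, and 3TB BF16 size come from the README, while the 192GB per-GPU figure is an assumed nominal B200 capacity and `NodeBudget` is a hypothetical helper that is not part of this repo. It ignores activations, calibration batches, and allocator overhead, so it does not explain why 256 samples fails; it only shows why CPU offload via `--use_seq_device_map` is needed at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeBudget:
    """Rough memory budget for one node, in decimal GB (1TB = 1000GB)."""

    num_gpus: int = 8                # from the README: 8x B200
    vram_per_gpu_gb: float = 192.0   # ASSUMPTION: nominal B200 capacity
    gpu_mem_fraction: float = 0.7    # from --gpu_max_mem_percentage 0.7
    host_ram_gb: float = 2700.0      # from the README: 2.7TB RAM

    @property
    def usable_vram_gb(self) -> float:
        # VRAM the loader is allowed to fill across all GPUs.
        return self.num_gpus * self.vram_per_gpu_gb * self.gpu_mem_fraction

    def fits_in_vram_alone(self, model_gb: float) -> bool:
        # Compare against total VRAM, ignoring the headroom fraction.
        return model_gb <= self.num_gpus * self.vram_per_gpu_gb

    def headroom_gb(self, model_gb: float) -> float:
        # What is left in usable VRAM plus host RAM once the weights are placed.
        return self.usable_vram_gb + self.host_ram_gb - model_gb


budget = NodeBudget()
bf16_model_gb = 3000.0  # from the README: ~3TB BF16 model
print(f"usable VRAM: {budget.usable_vram_gb:.0f} GB")
print(f"fits in VRAM alone: {budget.fits_in_vram_alone(bf16_model_gb)}")
print(f"headroom with CPU offload: {budget.headroom_gb(bf16_model_gb):.0f} GB")
```

Under these assumptions the weights need both VRAM and host RAM, and whatever headroom remains has to cover calibration activations, which is consistent with the smaller calibration size being the one that completes.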
--- a/README_modelopt_nvfp4.md
+++ b/README_modelopt_nvfp4.md
@@ -1,38 +0,0 @@
-# DeepSeek V4 Pro NVFP4 via NVIDIA ModelOpt
-
-## What this does
-Quantizes DeepSeek V4 Pro (FP8 weights) to full NVFP4 format using NVIDIA's official ModelOpt pipeline.
-Target output: ~600GB (vs 840GB from custom Path A converter).
-
-## Prerequisites
- B200 node (8× B200, 2.7TB RAM) — NVFP4 requires Blackwell GPUs
- modelopt 0.45.0+ from git
- transformers 5.8.0.dev0 (for DeepSeekV4 support)
- kernels package (for FP8 dequantization during calibration)
-
-## Critical Patch
-modelopt has a bug with DeepSeekV4Experts — the `iter_weights_for_calibration()` method
-doesn't handle ModuleList quantizers (plural `gate_up_proj_weight_quantizers`).
-Apply the patch before running:
-
-```bash
-cp patches/quant_module_patched.py <venv-path>/lib/python3.10/site-packages/modelopt/torch/quantization/nn/modules/quant_module.py
-```
-
-## Do NOT use these flags
- `--low_memory_mode`: causes meta device error with V4
- `--calib_size`: wrong arg name (use `--calib`)
-
-## Run
-```bash
-bash scripts/run_modelopt_nvfp4.sh
-```
-
-## Output
-`/root/nvidia-meeting/modelopt-repo/examples/llm_ptq/saved_models_DeepSeek-V4-Pro-FP8_nvfp4_kv_fp8_cast`
-
-## Notes
- Use FP8 source (`DeepSeek-V4-Pro-FP8`), NOT mixed-precision BF16 (`DeepSeek-V4-Pro`)
- V4's mixed precision causes "wonky shit" — FP8 is clean
- Calibration takes hours with CPU offload (`--use_seq_device_map`)
- Expected calibration time: several hours for 256 samples
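
The copy command in the removed README hardcodes a `python3.10` site-packages path, which breaks silently on a venv built with a different interpreter. A minimal sketch of resolving the target from the active environment instead is shown here. The dotted module name is derived from the path in that README; `installed_module_path` and `apply_patch` are hypothetical helpers, not part of modelopt or this repo.

```python
import importlib.util
import shutil
from pathlib import Path

# Dotted name derived from the site-packages path in the removed README.
QUANT_MODULE = "modelopt.torch.quantization.nn.modules.quant_module"


def installed_module_path(module_name: str) -> Path:
    """Return the source file of a module in the active environment.

    Note: find_spec imports the parent packages of a dotted name.
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None or not spec.has_location:
        raise ModuleNotFoundError(f"cannot locate a source file for {module_name!r}")
    return Path(spec.origin)


def apply_patch(patch_file: Path, target: Path) -> Path:
    """Copy patch_file over target, keeping the original as <name>.orig."""
    backup = target.with_name(target.name + ".orig")
    if not backup.exists():
        # Only back up once so a second run does not overwrite the pristine copy.
        shutil.copy2(target, backup)
    shutil.copy2(patch_file, target)
    return backup


# Example (requires modelopt to be installed in the active venv):
# apply_patch(Path("patches/quant_module_patched.py"),
#             installed_module_path(QUANT_MODULE))
```

Keeping a `.orig` backup makes it easy to revert the patch once upstream handles the `ModuleList` quantizers.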
--- a/quantize_modelopt.py
+++ b/quantize_modelopt.py
@@ -1,166 +0,0 @@
-#!/usr/bin/env python3
-"""NVIDIA Model Optimizer PTQ for DeepSeek V4 Pro → NVFP4.
-
-Uses nvidia-modelopt's official PTQ pipeline with NVFP4Experts-Only config,
-which quantizes only MoE expert layers while keeping attention QKV in higher
-precision — the recommended approach for DeepSeek MoE models.
-
-Output is a Unified HuggingFace checkpoint deployable on TRT-LLM / vLLM / SGLang.
-
-Usage:
-    python quantize_modelopt.py \
-        --model /root/nvidia-meeting/DeepSeek-V4-Pro \
-        --export_dir /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4-modelopt \
-        --qformat nvfp4_experts_only \
-        --tp 8 \
-        --calib_size 256
-
-For the FP8 source variant, just change --model path. modelopt handles
-dequantization internally.
-"""
-
-import argparse
-import os
-import random
-import time
-
-import numpy as np
-import torch
-
-import modelopt.torch.opt as mto
-import modelopt.torch.quantization as mtq
-from modelopt.torch.export import export_hf_checkpoint
-from modelopt.torch.utils.dataset_utils import create_forward_loop
-
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-
-mto.enable_huggingface_checkpointing()
-
-
-QUANT_CONFIGS = {
-    "nvfp4": mtq.NVFP4_DEFAULT_CFG,
-    "nvfp4_experts_only": mtq.NVFP4_EXPERTS_ONLY_CFG,
-    "nvfp4_mlp_only": mtq.NVFP4_MLP_ONLY_CFG,
-    "nvfp4_omlp_only": mtq.NVFP4_OMLP_ONLY_CFG,
-    "fp8": mtq.FP8_DEFAULT_CFG,
-}
-
-
-def main():
-    ap = argparse.ArgumentParser(description="Model Optimizer PTQ for DeepSeek V4 Pro")
-    ap.add_argument("--model", required=True, help="Path to HF model (BF16 or FP8)")
-    ap.add_argument("--export_dir", required=True, help="Output directory for quantized checkpoint")
-    ap.add_argument("--qformat", default="nvfp4_experts_only",
-                    choices=list(QUANT_CONFIGS.keys()),
-                    help="Quantization format (default: nvfp4_experts_only for MoE)")
-    ap.add_argument("--kv_cache_qformat", default="fp8_cast",
-                    help="KV cache quantization (default: fp8_cast, fast no-calib)")
-    ap.add_argument("--tp", type=int, default=8, help="Tensor parallelism for export")
-    ap.add_argument("--calib_size", type=int, nargs="+", default=[256],
-                    help="Calibration dataset size (per dataset)")
-    ap.add_argument("--batch_size", type=int, default=1, help="Calibration batch size")
-    ap.add_argument("--calib_seq", type=int, default=4096, help="Max calibration sequence length")
-    ap.add_argument("--trust_remote_code", action="store_true", default=True,
-                    help="Trust remote code (required for V4)")
-    ap.add_argument("--use_seq_device_map", action="store_true",
-                    help="Use sequential device map for low-memory calibration")
-    ap.add_argument("--low_memory_mode", action="store_true",
-                    help="Compress weights before calibration (FP8/NVFP4 only)")
-    args = ap.parse_args()
-
-    print(f"=== Model Optimizer PTQ ===")
-    print(f"  Model:    {args.model}")
-    print(f"  QFormat:  {args.qformat}")
-    print(f"  KV Cache: {args.kv_cache_qformat}")
-    print(f"  TP:       {args.tp}")
-    print(f"  Calib:    {args.calib_size} samples, seq_len={args.calib_seq}")
-    print()
-
-    # Seed everything
-    random.seed(1234)
-    np.random.seed(1234)
-    torch.manual_seed(1234)
-
-    # Load tokenizer
-    print("Loading tokenizer...")
-    tokenizer = AutoTokenizer.from_pretrained(
-        args.model,
-        trust_remote_code=args.trust_remote_code,
-        padding_side="left",
-    )
-    if tokenizer.pad_token is None:
-        tokenizer.pad_token = tokenizer.eos_token
-
-    # Load model
-    print("Loading model...")
-    model_kwargs = {
-        "trust_remote_code": args.trust_remote_code,
-        "torch_dtype": torch.bfloat16,
-    }
-    if args.use_seq_device_map:
-        model_kwargs["device_map"] = "auto"
-        model_kwargs["offload_folder"] = "offload"
-        model_kwargs["offload_state_dict"] = True
-        model_kwargs["max_memory"] = {i: "100GiB" for i in range(8)}
-        model_kwargs["max_memory"]["cpu"] = "2500GiB"
-    elif args.low_memory_mode:
-        # Load entirely on CPU, modelopt will handle placement
-        model_kwargs["device_map"] = {"": "cpu"}
-
-    model = AutoModelForCausalLM.from_pretrained(args.model, **model_kwargs)
-
-    if not args.use_seq_device_map and not args.low_memory_mode:
-        model = model.cuda()
-
-    # Build calibration dataloader
-    print("Building calibration dataset...")
-    calib_dataloader = get_dataloader(
-        tokenizer=tokenizer,
-        calib_size=args.calib_size,
-        batch_size=args.batch_size,
-        calib_seq=args.calib_seq,
-    )
-
-    # Build forward loop for calibration
-    def forward_loop(model):
-        for batch in calib_dataloader:
-            model(**batch)
-
-    # Quantize
-    quant_cfg = QUANT_CONFIGS[args.qformat]
-    print(f"Running PTQ with {args.qformat}...")
-    t0 = time.time()
-
-    model = mtq.quantize(model, quant_cfg, forward_loop)
-
-    elapsed = time.time() - t0
-    print(f"Quantization complete in {elapsed/60:.1f} min")
-
-    # Export
-    print(f"Exporting to {args.export_dir} ...")
-    with torch.inference_mode():
-        export_hf_checkpoint(
-            model,
-            args.export_dir,
-            tokenizer=tokenizer,
-            export_tensorrt_llm_plugins=True,
-        )
-
-    print(f"Done. Output at {args.export_dir}")
-
-
-def get_dataloader(tokenizer, calib_size, batch_size, calib_seq):
-    """Create calibration dataloader using modelopt's built-in dataset utils."""
-    from modelopt.torch.utils.dataset_utils import get_dataset_dataloader
-
-    return get_dataset_dataloader(
-        tokenizer=tokenizer,
-        num_samples=calib_size[0],
-        batch_size=batch_size,
-        max_sample_length=calib_seq,
-    )
-
-
-if __name__ == "__main__":
-    main()
--- a/scripts/model_opt_nvfp4_experts_only.py
+++ b/scripts/model_opt_nvfp4_experts_only.py
@@ -1,65 +0,0 @@
-#!/usr/bin/env python3
-"""
-ModelOpt NVFP4 quantization — experts only.
-
-Quantizes only the MoE expert weights (gate_up_proj, down_proj) to NVFP4,
-leaving attention and shared MLP layers untouched. This avoids issues with
-FP8 attention kernels on Blackwell (DeepGEMM unsupported, Triton finegrained
-FP8 matmul shape mismatches).
-
-Available NVFP4 quantization strategies (from modelopt huggingface_example.sh):
-  - nvfp4               : Full model NVFP4 quantization
-  - nvfp4_experts_only  : Only MoE expert weights (this script)
-  - nvfp4_mlp_only      : Only MLP layers (experts + shared MLP)
-  - nvfp4_omlp_only     : Only output + MLP layers
-  - nvfp4_awq           : NVFP4 with AWQ calibration
-  - nvfp4_mse           : NVFP4 with MSE calibration
-  - w4a8_nvfp4_fp8      : W4A8 NVFP4 weights + FP8 activations
-  - w4a8_mxfp4_fp8      : W4A8 MXFP4 weights + FP8 activations
-  - nvfp4_svdquant      : NVFP4 with SVDQuant
-  - nvfp4_local_hessian : NVFP4 with local Hessian calibration
-
-Strategy: Copy this file to model_opt_nvfp4_<strategy>.py and tweak as needed.
-By the end, we'll have working quantized weights for each successful strategy.
-
-Output dir naming: DeepSeek-V4-Pro_NVFP4-<strategy>_kv_fp8_cast
-"""
-
-import subprocess
-import sys
-import os
-
-# ── Config ──────────────────────────────────────────────────────────────────
-MODEL = "/root/nvidia-meeting/DeepSeek-V4-Pro-BF16"  # Dequantized BF16 (from scripts/dequant_fp8_to_bf16.py)
-QUANT = "nvfp4_experts_only"
-TP = 8
-CALIB = 256
-KV_CACHE_QUANT = "fp8_cast"
-EXTRA_FLAGS = "--trust_remote_code --use_seq_device_map"
-
-# Output dir follows modelopt convention: <model>_<quant>_kv_<kv_quant>
-# We override the model name to make the strategy clear
-OUTPUT_NAME = f"DeepSeek-V4-Pro_NVFP4-{QUANT}_kv_{KV_CACHE_QUANT}"
-
-SCRIPT_DIR = "/root/nvidia-meeting/modelopt-repo/examples/llm_ptq"
-LOG_FILE = f"/root/nvidia-meeting/modelopt_{QUANT}.log"
-
-# ── Run ─────────────────────────────────────────────────────────────────────
-cmd = f"""cd {SCRIPT_DIR} && \\
-source /root/nvidia-meeting/venv/bin/activate && \\
-PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \\
-bash scripts/huggingface_example.sh \\
-    --model {MODEL} \\
-    --quant {QUANT} \\
-    --tp {TP} \\
-    --calib {CALIB} \\
-    --kv_cache_quant {KV_CACHE_QUANT} \\
-    {EXTRA_FLAGS} 2>&1 | tee {LOG_FILE}"""
-
-print(f"Running: {QUANT} quantization on {MODEL}")
-print(f"Output: {OUTPUT_NAME}")
-print(f"Log: {LOG_FILE}")
-print(f"Command:\n{cmd}\n")
-
-ret = subprocess.call(cmd, shell=True)
-sys.exit(ret)
--- a/scripts/run_modelopt_nvfp4.sh
+++ b/scripts/run_modelopt_nvfp4.sh
@@ -1,25 +0,0 @@
-#!/bin/bash
-# DeepSeek V4 Pro FP8 → NVFP4 via NVIDIA ModelOpt
-# Run from: /root/nvidia-meeting/modelopt-repo/examples/llm_ptq
-#
-# Prerequisites:
-#   - modelopt 0.45.0+ from git: pip install "nvidia-modelopt[hf] @ git+https://github.com/NVIDIA/Model-Optimizer.git"
-#   - transformers 5.8.0.dev0: pip install git+https://github.com/huggingface/transformers.git
-#   - kernels: pip install -U kernels
-#   - Patch modelopt: cp patches/quant_module_patched.py <venv>/lib/python3.10/site-packages/modelopt/torch/quantization/nn/modules/quant_module.py
-#
-# Source weights: /root/nvidia-meeting/DeepSeek-V4-Pro-FP8
-
-set -e
-cd /root/nvidia-meeting/modelopt-repo/examples/llm_ptq
-source /root/nvidia-meeting/venv/bin/activate
-
-PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-bash scripts/huggingface_example.sh \
-    --model /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 \
-    --quant nvfp4 \
-    --tp 8 \
-    --calib 256 \
-    --kv_cache_quant fp8_cast \
-    --trust_remote_code \
-    --use_seq_device_map
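
The wrapper scripts removed in this commit shelled out to `scripts/huggingface_example.sh`, and the flag-name mistakes recorded under Bugs Found could be caught before a long run starts. A minimal sketch of such a check follows, assuming the flag names seen in this diff; `build_ptq_command` and both lookup tables are hypothetical, and the defaults follow the new README's Key Notes (calib 128, full `nvfp4`). This diff does not show whether the shell wrapper accepts `--gpu_max_mem_percentage`, so that flag is left to the caller via `extra_flags`.

```python
import shlex

# Flags the README says not to use with V4.
FORBIDDEN_FLAGS = {"--low_memory_mode": "causes meta device errors with V4"}
# Wrong names mapped to the names the shell script takes.
RENAMED_FLAGS = {"--calib_size": "--calib", "--qformat": "--quant"}


def build_ptq_command(
    model: str,
    quant: str = "nvfp4",
    tp: int = 8,
    calib: int = 128,
    kv_cache_quant: str = "fp8_cast",
    extra_flags: tuple = ("--trust_remote_code", "--use_seq_device_map"),
) -> list:
    """Build the argv list for the PTQ shell wrapper, rejecting known-bad flags."""
    for flag in extra_flags:
        name = flag.split("=", 1)[0]
        if name in FORBIDDEN_FLAGS:
            raise ValueError(f"{name}: {FORBIDDEN_FLAGS[name]}")
        if name in RENAMED_FLAGS:
            raise ValueError(f"{name} is not accepted; use {RENAMED_FLAGS[name]}")
    return [
        "bash", "scripts/huggingface_example.sh",
        "--model", model,
        "--quant", quant,
        "--tp", str(tp),
        "--calib", str(calib),
        "--kv_cache_quant", kv_cache_quant,
        *extra_flags,
    ]


cmd = build_ptq_command("/root/nvidia-meeting/DeepSeek-V4-Pro-BF16")
print(shlex.join(cmd))
```

Returning a list rather than a shell string lets the caller pass it to `subprocess.run` without `shell=True`, which avoids the quoting issues of the f-string command used in the removed wrapper.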