README overhaul: reflect current architecture (hf_main, run history through Run 10)
- Architecture section: call hf_main() directly, not rewrite the pipeline
- Run history: all 10 runs with root causes and fixes
- Key lessons: stale GPU tensors, expert OOM, pipeline rewriting trap, __main__ gap
- Runtime patches: 3 monkey-patches + 3 post-calibration hook steps
- Do NOT repeat: 8 specific mistakes with run references
- File layout with legacy patches note
README.md
@@ -4,6 +4,16 @@ Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7T

**Cost:** ~$161/run at $23/hr (7 hours each). Don't waste runs.

## Architecture

We call modelopt's `hf_ptq.main()` directly, the same entry point the shell script uses. We don't rewrite the pipeline. We just:

1. **Patch** modelopt at runtime (GPU tensor safety, before anything runs)
2. **Hook** `export_quantized` to snapshot amax + save state before export
3. **Call** `hf_main(args)` with properly parsed args

This avoids the cascade of missing-arg bugs from manually constructing `argparse.Namespace` (Runs 4–8).

## Pipeline

### Step 1: Dequantize FP8 → BF16
@@ -31,13 +41,11 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py

Must run from the modelopt example directory (relative imports).

Pipeline steps:
1. **Load** BF16 model via modelopt's `get_model()` (handles memory limits properly for 3TB model)
2. **Patch** modelopt at runtime (`load_calib_amax` forces CPU, `export_amax` CPU fallback, graceful clamp)
3. **Quantize + Calibrate** (5-6 hours, 128 samples)
4. **Snapshot amax to CPU**: copies all quantizer state to CPU and saves to disk (~50MB)
5. **Save model state**: full state dict to disk (insurance against export crashes)
6. **Export** to HF safetensors

What happens inside:
1. **Apply patches**: 3 runtime monkey-patches for GPU tensor safety (see below)
2. **Parse args**: uses `hf_ptq.parse_args()` with our config via `sys.argv` replacement, then applies the same post-parse conversions (`dataset` split, `calib_size` int list) that `hf_ptq.__main__` normally does
3. **Hook export**: monkey-patch `export_quantized` to snapshot amax + save state before export
4. **Call `hf_main(args)`**: the exact same pipeline the shell script uses
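
The argument-parsing step can be sketched as follows. This is a minimal illustration, not the actual `quantize_nvfp4.py`: `build_parser()` is a hypothetical stand-in for the parser behind `hf_ptq.parse_args()`, it declares only a handful of flags with made-up defaults, and the comma-separated input format for `--dataset` and `--calib_size` is an assumption.

```python
import argparse
import sys
from contextlib import contextmanager


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the parser behind hf_ptq.parse_args()."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--qformat", default="fp8")
    parser.add_argument("--calib_size", default="512")  # arrives as a string
    parser.add_argument("--dataset", default=None)      # string or None
    parser.add_argument("--kv_cache_qformat", default="fp8")
    parser.add_argument("--inference_tensor_parallel", type=int, default=1)
    return parser


@contextmanager
def replaced_argv(argv):
    """Temporarily swap sys.argv so an argparse-based parser sees our config."""
    saved = sys.argv
    sys.argv = ["hf_ptq.py", *argv]
    try:
        yield
    finally:
        sys.argv = saved


def parse_like_main(argv):
    """Parse argv, then apply the post-parse conversions a __main__ block would do."""
    with replaced_argv(argv):
        args = build_parser().parse_args()  # reads sys.argv[1:]
    # Assumed conversions: comma-separated strings become lists.
    args.dataset = args.dataset.split(",") if args.dataset else None
    args.calib_size = [int(n) for n in args.calib_size.split(",")]
    return args


args = parse_like_main([
    "--qformat", "nvfp4",
    "--calib_size", "128",
    "--kv_cache_qformat", "fp8_cast",
    "--inference_tensor_parallel", "8",
])
print(args.calib_size, args.dataset)
```

In the real script the resulting `args` object would then be handed to `hf_main(args)`. Restoring `sys.argv` in a `finally` block keeps the swap from leaking into any later code that inspects the command line.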

If the export crashes:

@@ -51,56 +59,70 @@ To validate saved state without running anything:

```
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --validate-only
```

**Config:** `nvfp4`, 128 calib samples, calib_seq=512, kv_fp8_cast, gpu_mem_pct=0.7

**Config:** `nvfp4`, 128 calib samples, `calib_seq=512`, `kv_cache_qformat=fp8_cast`, `gpu_max_mem_percentage=0.7`, `use_seq_device_map`, `inference_tensor_parallel=8`

**Calibration datasets:** `abisee/cnn_dailymail` + `nvidia/Nemotron-Post-Training-Dataset-v2` (gated; requires HF token).

**Calibration datasets:** `abisee/cnn_dailymail` + `nvidia/Nemotron-Post-Training-Dataset-v2` (default when no `--dataset` specified).

**Runtime:** Model loading ~53 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.

**Runtime:** Model loading ~50 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.

## Run History (forward progression)

## Run History

| Run | Date | Commit | Calib | Result | Root Cause | Fix |
|-----|------|--------|-------|--------|------------|-----|
| 1 | May 7 | shell wrapper (pre-repo) | 256 | ❌ Crashed at batch probing | `o_b_proj` shape mismatch: finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
| 2 | May 8-9 | shell wrapper (pre-repo) | 128 | ❌ Crashed at export (128/128 calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU immediately after calibration |
| 3 | May 9 06:10 | `3907838` | 128 | ❌ Crashed at model loading | `AutoModelForCausalLM.from_pretrained` OOM during expert weight `torch.cat` (31.5GB alloc, 25.9GB free) | Use modelopt `get_model()` which handles max_memory properly |
| 4 | May 9 ~07:00 | `86dd8df` | 128 | ❌ Crashed at quantize config setup | `mtq.KV_QUANT_CFG_CHOICES` doesn't exist; it's `hf_ptq.KV_QUANT_CFG_CHOICES` | Import `KV_QUANT_CFG_CHOICES` from `hf_ptq`, not `mtq` |
| 5 | May 9 ~08:05 | `f9bbef8` | 128 | ❌ Same as Run 4 | Same import bug, wasn't synced properly | Same fix, properly synced |
| 6 | May 9 ~09:25 | `6c1bff6` | 128 | ❌ Crashed at dataloader setup | `make_calib_dataloader` AttributeError: missing args (`dataset`, `calib_with_images`, etc.) | Add all required args to argparse.Namespace |
| 7 | May 9 ~13:40 | `25b4d8d` | 128 | ❌ Crashed at dataloader setup | Same: `dataset=None`, `len()` on None | Provide actual dataset list |
| 8 | May 9 ~14:00 | `b2849a8` | 128 | ❌ Crashed at argparse | Wrong flag names (`--quant`, `--calib`, `--kv_cache_quant`, `--tp`); these are shell script names, not `hf_ptq.py` names | Use `hf_ptq.py` flag names (`--qformat`, `--calib_size`, `--kv_cache_qformat`, `--inference_tensor_parallel`) |
| 9 | May 9 ~14:30 | `a300302` | 128 | 🔄 Running | n/a | Calls `hf_main(args)` directly, no more pipeline rewriting |
| 1 | May 7 | shell wrapper | 256 | ❌ Batch probing crash | `o_b_proj` shape mismatch: finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
| 2 | May 8-9 | shell wrapper | 128 | ❌ Export crash (calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU after calibration |
| 3 | May 9 06:10 | `3907838` | 128 | ❌ Model loading OOM | `AutoModelForCausalLM.from_pretrained` OOM during expert weight `torch.cat` | Use modelopt `get_model()` with `max_memory` |
| 4 | May 9 ~07:00 | `86dd8df` | 128 | ❌ Import error | `mtq.KV_QUANT_CFG_CHOICES` doesn't exist; it's `hf_ptq.KV_QUANT_CFG_CHOICES` | Import from `hf_ptq`, not `mtq` |
| 5 | May 9 ~08:05 | `f9bbef8` | 128 | ❌ Same as Run 4 | Fix wasn't synced properly | Properly synced |
| 6 | May 9 ~09:25 | `6c1bff6` | 128 | ❌ Dataloader crash | `make_calib_dataloader` AttributeError: missing args | Added args to Namespace |
| 7 | May 9 ~13:40 | `25b4d8d` | 128 | ❌ Dataloader crash | `dataset=None`, `len()` on None | Provided dataset list |
| 8 | May 9 ~14:00 | `b2849a8` | 128 | ❌ Argparse crash | Wrong flag names (shell script names vs `hf_ptq.py` names) | Use `hf_ptq.py` flag names |
| 9 | May 9 ~14:30 | `a300302` | 128 | ❌ TypeError | Skipped `__main__` post-parse conversions (`calib_size` still string, not int list) | Apply same conversions after `parse_args()` |
| 10 | May 9 ~15:30 | `5a72da7` | 128 | 🔄 Running | n/a | Calls `hf_main(args)` directly |

**If Run 4 succeeds**, current code is good. No further changes needed.
**If Run 4 fails**, check the log, identify the crash point, add it to this table.

### Key Lessons

**Key lesson from Run 2:** The `use_seq_device_map` mode shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled, so reading it triggers `cudaErrorIllegalAddress`. The fix is to copy amax to CPU *immediately* after calibration, before any further GPU operations.

**Run 2 (stale GPU tensors):** `use_seq_device_map` shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled, so reading it triggers `cudaErrorIllegalAddress`. Fix: copy amax to CPU immediately after calibration.
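
The fix amounts to walking every module and copying its `_amax` off the GPU as soon as calibration ends. A minimal sketch, assuming quantizers expose `_amax` as a tensor-like attribute (as the patch descriptions in this README suggest). `snapshot_amax` is an illustrative name, not modelopt API, and it relies only on `detach()`, `cpu()`, and `clone()`, which PyTorch tensors provide.

```python
def snapshot_amax(named_modules):
    """Return {module_name: CPU copy of _amax} for every module that has one.

    `named_modules` is any iterable of (name, module) pairs, for example
    `model.named_modules()` on a PyTorch model.
    """
    snapshot = {}
    for name, module in named_modules:
        amax = getattr(module, "_amax", None)
        if amax is None:
            continue
        # detach() drops autograd history, cpu() copies device memory to host,
        # clone() guarantees the snapshot does not alias the original storage.
        snapshot[name] = amax.detach().cpu().clone()
    return snapshot
```

With PyTorch, `torch.save(snapshot_amax(model.named_modules()), path)` would then persist the snapshot to a location of your choosing.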

**Key lesson from Run 3:** Don't use `AutoModelForCausalLM.from_pretrained` directly for models with large expert weights. The weight conversion step does `torch.cat` on GPU which can OOM. Use modelopt's `get_model()` which sets `max_memory` per GPU before loading.

**Run 3 (expert weight OOM):** `AutoModelForCausalLM.from_pretrained` does `torch.cat` on GPU for expert `gate_up_proj` (31.5GB alloc, 25.9GB free). Fix: use modelopt's `get_model()` which sets `max_memory` per GPU before loading. (Note: Run 10 uses `hf_main()` which calls `get_model()` internally.)

**Runs 4–8 (pipeline rewriting trap):** Trying to reconstruct hf_ptq's pipeline by importing individual functions and building a fake `argparse.Namespace` causes an endless stream of missing-attribute and type errors. Each fix reveals the next bug. Fix: call `hf_main(args)` directly with a properly parsed args object.
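
The trap is easy to reproduce in isolation. In this sketch, `consumer` stands in for any pipeline function that reads attributes from `args`, and the toy parser declares `--calib_with_images` purely for illustration. A hand-built `Namespace` carries only the attributes you remembered to set, while a parsed one carries every declared default.

```python
import argparse


def consumer(args):
    """Stand-in for a pipeline function that expects fully populated args."""
    return (args.qformat, args.calib_with_images)


# Hand-built namespace: only what we remembered to set.
manual = argparse.Namespace(qformat="nvfp4")
try:
    consumer(manual)
    manual_error = None
except AttributeError as exc:
    manual_error = str(exc)

# Parsed namespace: the parser fills in every declared default.
parser = argparse.ArgumentParser()
parser.add_argument("--qformat", default="fp8")
parser.add_argument("--calib_with_images", action="store_true")
parsed = parser.parse_args(["--qformat", "nvfp4"])

print(manual_error)
print(consumer(parsed))
```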

**Run 9 (`__main__` gap):** `hf_ptq.py` does critical type conversions in its `__main__` block (string → list for `dataset`, string → int list for `calib_size`). When calling `main()` directly, these are skipped. Fix: apply the same conversions after `parse_args()`.

### Do NOT Repeat These Mistakes

**Do NOT repeat these mistakes:**
- Don't use FP8 source model: kernel issues on Blackwell (Run 1)
- Don't use `--low_memory_mode` with V4: meta device errors
- Don't use calib_size=256: OOMs with 3TB BF16 on CPU offload
- Don't use `calib_size=256`: OOMs with 3TB BF16 on CPU offload
- Don't use `AutoModelForCausalLM.from_pretrained` directly: OOM during expert weight concat (Run 3)
- Don't assume GPU tensor integrity after 5+ hours of sequential calibration
- Don't assume GPU tensor integrity after 5+ hours of sequential calibration (Run 2)
- Don't rewrite the hf_ptq pipeline: call `hf_main()` directly (Runs 4–8)
- Don't skip the `__main__` post-parse conversions: `calib_size` must be int list, `dataset` must be list (Run 9)
- Don't use shell script arg names (`--quant`, `--calib`, `--kv_cache_quant`, `--tp`); use `hf_ptq.py` names (`--qformat`, `--calib_size`, `--kv_cache_qformat`, `--inference_tensor_parallel`)
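
A small guard can catch the last mistake before a seven-hour run starts. The mapping below is taken from the bullet above; `check_flags` is a hypothetical helper, not part of `quantize_nvfp4.py`.

```python
# Shell-script flag -> hf_ptq.py flag, as listed above.
SHELL_TO_HF_PTQ = {
    "--quant": "--qformat",
    "--calib": "--calib_size",
    "--kv_cache_quant": "--kv_cache_qformat",
    "--tp": "--inference_tensor_parallel",
}


def check_flags(argv):
    """Raise ValueError if argv contains a shell-script flag name."""
    for token in argv:
        flag = token.split("=", 1)[0]  # also handles --flag=value
        if flag in SHELL_TO_HF_PTQ:
            raise ValueError(
                f"{flag} is a shell-script name; use {SHELL_TO_HF_PTQ[flag]}"
            )
    return argv
```

Calling `check_flags(argv)` before parsing turns a late argparse crash into an immediate, self-explanatory error.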

## Runtime Patches Applied by quantize_nvfp4.py

These are monkey-patches applied at runtime; no modelopt source files are modified.

1. **`load_calib_amax`**: After calibration writes `_amax` to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
2. **`export_amax`**: If any `_amax` is still on GPU at export time, moves to CPU before reading. Safety net.
3. **`get_activation_scaling_factor`**: Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
4. **`snapshot_amax_to_cpu()`**: After calibration, walks all quantizers, copies `_amax` to CPU, saves to disk (~50MB). Insurance policy.
1. **`TensorQuantizer.load_calib_amax`**: After calibration writes `_amax` to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
2. **`TensorQuantizer.export_amax`**: If `_amax` is still on GPU at export time, moves to CPU before reading. Safety net.
3. **`NVFP4QTensor.get_activation_scaling_factor`**: Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
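
All three patches follow the same wrap-and-replace pattern. The sketch below shows it for the first one on a stand-in class. `FakeQuantizer` and `to_cpu` are hypothetical, and the real `TensorQuantizer.load_calib_amax` signature may differ, which is why the wrapper forwards `*args, **kwargs` untouched.

```python
import functools


class FakeQuantizer:
    """Stand-in for a quantizer whose calibration step stores _amax on the GPU."""

    def __init__(self):
        self._amax = None

    def load_calib_amax(self, *args, **kwargs):
        self._amax = {"device": "cuda", "value": 3.5}


def to_cpu(amax):
    """Stand-in for tensor.cpu(); returns a host-side copy."""
    return {**amax, "device": "cpu"}


def patch_load_calib_amax(cls):
    """Replace cls.load_calib_amax with a wrapper that moves _amax to CPU."""
    original = cls.load_calib_amax

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        if getattr(self, "_amax", None) is not None:
            self._amax = to_cpu(self._amax)
        return result

    cls.load_calib_amax = wrapper
    return original  # keep a handle so the patch can be undone


original = patch_load_calib_amax(FakeQuantizer)
q = FakeQuantizer()
q.load_calib_amax()
print(q._amax["device"])
```

Patching the class attribute rather than individual instances means every quantizer created later picks up the wrapper automatically.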

### Post-Calibration Hook

`export_quantized` is monkey-patched to run these steps before the real export:

4. **`snapshot_amax_to_cpu()`**: Walks all quantizers, copies `_amax` to CPU, saves to disk (~50MB). Insurance policy.
5. **`force_all_amax_to_cpu()`**: Moves `_pre_quant_scale`, `_global_amax` to CPU too. Nuclear option.
6. **`save_calibrated_state()`**: Saves full model state dict to disk (~1.5TB). Enables `--export-only` recovery if export crashes.
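
The hook itself can be sketched as a wrapper that runs the pre-export steps in order and then delegates to the real export. The names `make_export_hook` and `pre_export_steps` are illustrative, and the assumption that the model is the first positional argument of `export_quantized` is not confirmed by this README; check the actual signature before reusing this.

```python
import functools


def make_export_hook(real_export, pre_export_steps):
    """Wrap real_export so each step in pre_export_steps runs on the model first."""

    @functools.wraps(real_export)
    def hooked(model, *args, **kwargs):
        for step in pre_export_steps:
            step(model)  # an exception here aborts before the export starts
        return real_export(model, *args, **kwargs)

    return hooked


# Toy demonstration with stand-in steps that just record their order.
calls = []
steps = [
    lambda m: calls.append("snapshot_amax_to_cpu"),
    lambda m: calls.append("force_all_amax_to_cpu"),
    lambda m: calls.append("save_calibrated_state"),
]


def fake_export(model, out_dir):
    calls.append(f"export:{out_dir}")
    return out_dir


export = make_export_hook(fake_export, steps)
export(object(), "out")
print(calls)
```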

## Bugs Found (V4 + modelopt 0.45.0.dev64)

1. ~~`QuantDeepseekV4Experts` AttributeError~~: **Already fixed** in modelopt 0.45.0.dev64 (handles `nn.ModuleList` quantizers natively).
2. `--low_memory_mode` → meta device error. Don't use with V4.
3. Missing `kernels` package for FP8 ops. `pip install -U kernels`.
4. ~~Shell script arg names~~: No longer relevant (in-process script).
4. ~~Shell script arg names~~: Resolved by calling `hf_main()` directly.
5. **Export crash: stale GPU tensors in `export_amax()`.** After hours of calibration, quantizer `_amax` on GPU becomes unreadable. Fixed by patching `export_amax` to move `_amax` to CPU before reading.
6. **Export crash: `assert torch.all(activation_scaling_factor > 0)`.** Amax values from stale GPU reads are garbage (zeros, negatives, NaN). Fixed by clamping instead of asserting, plus snapshotting valid amax to CPU before corruption can occur.
7. **Model loading OOM during expert weight conversion.** `AutoModelForCausalLM.from_pretrained` does `torch.cat` on GPU for expert `gate_up_proj` (31.5GB alloc), but only 25.9GB free with `device_map="sequential"`. Fixed by using modelopt's `get_model()` which sets `max_memory` per GPU before loading.
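
The clamp-instead-of-assert fix from bug 6 can be illustrated without any GPU. This sketch works on plain floats; the floor value `1e-8` and the helper name `sanitize_scales` are assumptions, and the real patch operates on tensors.

```python
import math


def sanitize_scales(scales, floor=1e-8):
    """Replace NaN, infinite, zero, or negative scales with a small positive floor."""
    cleaned = []
    for s in scales:
        if not math.isfinite(s) or s <= 0:
            cleaned.append(floor)
        else:
            cleaned.append(s)
    return cleaned


print(sanitize_scales([0.5, 0.0, -2.0, float("nan"), 3.0]))
```

Clamping keeps the export alive but hides corruption, so it works best as a last line of defense behind the CPU snapshot, which preserves the valid values.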
@@ -121,4 +143,19 @@ The patches in `quantize_nvfp4.py` are for **modelopt 0.45.0.dev64** specificall

- modelopt has no explicit V4 support; it relies on auto-detection of fused experts.
- The calibration state save (`v4_nvfp4_calibrated_state.pt`) is ~1.5TB. It lives on NVMe, not in git.
- The amax snapshot (`v4_nvfp4_amax_snapshots.pt`) is ~50MB. Small, critical, cheap insurance.
- Only divergence from modelopt example path: `get_model()` instead of `AutoModelForCausalLM.from_pretrained` (avoids OOM). Everything else uses the same `hf_ptq` functions.
- The script calls `hf_main(args)`, the exact same entry point as the shell script. No pipeline divergence.
- Must run from `/root/nvidia-meeting/modelopt-repo/examples/llm_ptq` (relative imports).

## File Layout

```
scripts/
  dequant_fp8_to_bf16.py              # Step 1: FP8/FP4 → BF16 dequantization
  quantize_nvfp4.py                   # Step 2: NVFP4 quantization (patches + hf_main)

patches/
  patch_finegrained_fp8_blackwell.py  # (legacy) FP8 kernel patches for Blackwell
  quant_module_patched.py             # (legacy) quant module patches
```

The `patches/` directory contains earlier approaches that modified modelopt source files directly. The current approach (`quantize_nvfp4.py`) uses runtime monkey-patching instead, so no source files are modified.