From ce9056d2592c4ccca7e2091f34d2720c712b2bb3 Mon Sep 17 00:00:00 2001
From: biondizzle
Date: Sat, 9 May 2026 16:09:09 +0000
Subject: [PATCH] README overhaul: reflect current architecture (hf_main, run history through Run 10)

- Architecture section: call hf_main() directly, not rewrite the pipeline
- Run history: all 10 runs with root causes and fixes
- Key lessons: stale GPU tensors, expert OOM, pipeline rewriting trap, __main__ gap
- Runtime patches: 3 monkey-patches + 3 post-calibration hook steps
- Do NOT repeat: 8 specific mistakes with run references
- File layout with legacy patches note
---
 README.md | 103 +++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 70 insertions(+), 33 deletions(-)

diff --git a/README.md b/README.md
index 542cfe8..793c772 100644
--- a/README.md
+++ b/README.md
@@ -4,6 +4,16 @@ Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7T
 
 **Cost:** ~$161/run at $23/hr (7 hours each). Don't waste runs.
 
+## Architecture
+
+We call modelopt's `hf_ptq.main()` directly — the same entry point the shell script uses. We don't rewrite the pipeline. We just:
+
+1. **Patch** modelopt at runtime (GPU tensor safety, before anything runs)
+2. **Hook** `export_quantized` to snapshot amax + save state before export
+3. **Call** `hf_main(args)` with properly parsed args
+
+This avoids the cascade of missing-arg bugs from manually constructing `argparse.Namespace` (Runs 4–8).
+
 ## Pipeline
 
 ### Step 1: Dequantize FP8 → BF16
@@ -31,13 +41,11 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py
 
 Must run from the modelopt example directory (relative imports).
 
-Pipeline steps:
-1. **Load** BF16 model via modelopt's `get_model()` (handles memory limits properly for 3TB model)
-2. **Patch** modelopt at runtime (load_calib_amax forces CPU, export_amax CPU fallback, graceful clamp)
-3. **Quantize + Calibrate** (5-6 hours, 128 samples)
-4. **Snapshot amax to CPU** — copies all quantizer state to CPU and saves to disk (~50MB)
-5. **Save model state** — full state dict to disk (insurance against export crashes)
-6. **Export** to HF safetensors
+What happens inside:
+1. **Apply patches** — 3 runtime monkey-patches for GPU tensor safety (see below)
+2. **Parse args** — uses `hf_ptq.parse_args()` with our config via `sys.argv` replacement, then applies the same post-parse conversions (`dataset` split, `calib_size` int list) that `hf_ptq.__main__` normally does
+3. **Hook export** — monkey-patch `export_quantized` to snapshot amax + save state before export
+4. **Call `hf_main(args)`** — the exact same pipeline the shell script uses
 
 If the export crashes:
 
@@ -51,56 +59,70 @@ To validate saved state without running anything:
 python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --validate-only
 ```
 
-**Config:** `nvfp4`, 128 calib samples, calib_seq=512, kv_fp8_cast, gpu_mem_pct=0.7
+**Config:** `nvfp4`, 128 calib samples, `calib_seq=512`, `kv_cache_qformat=fp8_cast`, `gpu_max_mem_percentage=0.7`, `use_seq_device_map`, `inference_tensor_parallel=8`
 
-**Calibration datasets:** `abisee/cnn_dailymail` + `nvidia/Nemotron-Post-Training-Dataset-v2` (gated — requires HF token).
+**Calibration datasets:** `abisee/cnn_dailymail` + `nvidia/Nemotron-Post-Training-Dataset-v2` (default when no `--dataset` specified).
 
-**Runtime:** Model loading ~53 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.
+**Runtime:** Model loading ~50 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.
 
-## Run History (forward progression)
+## Run History
 
 | Run | Date | Commit | Calib | Result | Root Cause | Fix |
 |-----|------|--------|-------|--------|------------|-----|
-| 1 | May 7 | shell wrapper (pre-repo) | 256 | ❌ Crashed at batch probing | `o_b_proj` shape mismatch — finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
-| 2 | May 8-9 | shell wrapper (pre-repo) | 128 | ❌ Crashed at export (128/128 calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU immediately after calibration |
-| 3 | May 9 06:10 | `3907838` | 128 | ❌ Crashed at model loading | `AutoModelForCausalLM.from_pretrained` OOM during expert weight `torch.cat` (31.5GB alloc, 25.9GB free) | Use modelopt `get_model()` which handles max_memory properly |
-| 4 | May 9 ~07:00 | `86dd8df` | 128 | ❌ Crashed at quantize config setup | `mtq.KV_QUANT_CFG_CHOICES` doesn't exist — it's `hf_ptq.KV_QUANT_CFG_CHOICES` | Import `KV_QUANT_CFG_CHOICES` from `hf_ptq`, not `mtq` |
-| 5 | May 9 ~08:05 | `f9bbef8` | 128 | ❌ Same as Run 4 | Same import bug, wasn't synced properly | Same fix, properly synced |
-| 6 | May 9 ~09:25 | `6c1bff6` | 128 | ❌ Crashed at dataloader setup | `make_calib_dataloader` AttributeError — missing args (`dataset`, `calib_with_images`, etc.) | Add all required args to argparse.Namespace |
-| 7 | May 9 ~13:40 | `25b4d8d` | 128 | ❌ Crashed at dataloader setup | Same — `dataset=None`, `len()` on None | Provide actual dataset list |
-| 8 | May 9 ~14:00 | `b2849a8` | 128 | ❌ Crashed at argparse | Wrong flag names (`--quant`, `--calib`, `--kv_cache_quant`, `--tp`) — these are shell script names, not `hf_ptq.py` names | Use `hf_ptq.py` flag names (`--qformat`, `--calib_size`, `--kv_cache_qformat`, `--inference_tensor_parallel`) |
-| 9 | May 9 ~14:30 | `a300302` | 128 | 🔄 Running | — | Calls `hf_main(args)` directly, no more pipeline rewriting |
+| 1 | May 7 | shell wrapper | 256 | ❌ Batch probing crash | `o_b_proj` shape mismatch — finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
+| 2 | May 8-9 | shell wrapper | 128 | ❌ Export crash (calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU after calibration |
+| 3 | May 9 06:10 | `3907838` | 128 | ❌ Model loading OOM | `AutoModelForCausalLM.from_pretrained` OOM during expert weight `torch.cat` | Use modelopt `get_model()` with `max_memory` |
+| 4 | May 9 ~07:00 | `86dd8df` | 128 | ❌ Import error | `mtq.KV_QUANT_CFG_CHOICES` doesn't exist — it's `hf_ptq.KV_QUANT_CFG_CHOICES` | Import from `hf_ptq`, not `mtq` |
+| 5 | May 9 ~08:05 | `f9bbef8` | 128 | ❌ Same as Run 4 | Fix wasn't synced properly | Properly synced |
+| 6 | May 9 ~09:25 | `6c1bff6` | 128 | ❌ Dataloader crash | `make_calib_dataloader` AttributeError — missing args | Added args to Namespace |
+| 7 | May 9 ~13:40 | `25b4d8d` | 128 | ❌ Dataloader crash | `dataset=None`, `len()` on None | Provided dataset list |
+| 8 | May 9 ~14:00 | `b2849a8` | 128 | ❌ Argparse crash | Wrong flag names (shell script names vs `hf_ptq.py` names) | Use `hf_ptq.py` flag names |
+| 9 | May 9 ~14:30 | `a300302` | 128 | ❌ TypeError | Skipped `__main__` post-parse conversions (`calib_size` still string, not int list) | Apply same conversions after `parse_args()` |
+| 10 | May 9 ~15:30 | `5a72da7` | 128 | 🔄 Running | — | Calls `hf_main(args)` directly |
 
-**If Run 4 succeeds**, current code is good. No further changes needed.
-**If Run 4 fails**, check the log, identify the crash point, add it to this table.
+### Key Lessons
 
-**Key lesson from Run 2:** The `use_seq_device_map` mode shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled — reading it triggers `cudaErrorIllegalAddress`. The fix is to copy amax to CPU *immediately* after calibration, before any further GPU operations.
+**Run 2 — Stale GPU tensors:** `use_seq_device_map` shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled — reading it triggers `cudaErrorIllegalAddress`. Fix: copy amax to CPU immediately after calibration.
 
-**Key lesson from Run 3:** Don't use `AutoModelForCausalLM.from_pretrained` directly for models with large expert weights. The weight conversion step does `torch.cat` on GPU which can OOM. Use modelopt's `get_model()` which sets `max_memory` per GPU before loading.
+**Run 3 — Expert weight OOM:** `AutoModelForCausalLM.from_pretrained` does `torch.cat` on GPU for expert `gate_up_proj` (31.5GB alloc, 25.9GB free). Fix: use modelopt's `get_model()` which sets `max_memory` per GPU before loading. (Note: Run 10 uses `hf_main()` which calls `get_model()` internally.)
+
+**Runs 4–8 — Pipeline rewriting trap:** Trying to reconstruct hf_ptq's pipeline by importing individual functions and building a fake `argparse.Namespace` causes an endless stream of missing-attribute and type errors. Each fix reveals the next bug. Fix: call `hf_main(args)` directly with a properly parsed args object.
+
+**Run 9 — `__main__` gap:** `hf_ptq.py` does critical type conversions in its `__main__` block (string → list for `dataset`, string → int list for `calib_size`). When calling `main()` directly, these are skipped. Fix: apply the same conversions after `parse_args()`.
+
+### Do NOT Repeat These Mistakes
 
-**Do NOT repeat these mistakes:**
 - Don't use FP8 source model — kernel issues on Blackwell (Run 1)
 - Don't use `--low_memory_mode` with V4 — meta device errors
-- Don't use calib_size=256 — OOMs with 3TB BF16 on CPU offload
+- Don't use `calib_size=256` — OOMs with 3TB BF16 on CPU offload
 - Don't use `AutoModelForCausalLM.from_pretrained` directly — OOM during expert weight concat (Run 3)
-- Don't assume GPU tensor integrity after 5+ hours of sequential calibration
+- Don't assume GPU tensor integrity after 5+ hours of sequential calibration (Run 2)
+- Don't rewrite the hf_ptq pipeline — call `hf_main()` directly (Runs 4–8)
+- Don't skip the `__main__` post-parse conversions — `calib_size` must be int list, `dataset` must be list (Run 9)
+- Don't use shell script arg names (`--quant`, `--calib`, `--kv_cache_quant`, `--tp`) — use `hf_ptq.py` names (`--qformat`, `--calib_size`, `--kv_cache_qformat`, `--inference_tensor_parallel`)
 
 ## Runtime Patches Applied by quantize_nvfp4.py
 
 These are monkey-patches applied at runtime — no modelopt source files are modified.
 
-1. **`load_calib_amax`** — After calibration writes `_amax` to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
-2. **`export_amax`** — If any `_amax` is still on GPU at export time, moves to CPU before reading. Safety net.
-3. **`get_activation_scaling_factor`** — Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
-4. **`snapshot_amax_to_cpu()`** — After calibration, walks all quantizers, copies `_amax` to CPU, saves to disk (~50MB). Insurance policy.
+1. **`TensorQuantizer.load_calib_amax`** — After calibration writes `_amax` to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
+2. **`TensorQuantizer.export_amax`** — If `_amax` is still on GPU at export time, moves to CPU before reading. Safety net.
+3. **`NVFP4QTensor.get_activation_scaling_factor`** — Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
+
+### Post-Calibration Hook
+
+`export_quantized` is monkey-patched to run these steps before the real export:
+
+4. **`snapshot_amax_to_cpu()`** — Walks all quantizers, copies `_amax` to CPU, saves to disk (~50MB). Insurance policy.
 5. **`force_all_amax_to_cpu()`** — Moves `_pre_quant_scale`, `_global_amax` to CPU too. Nuclear option.
+6. **`save_calibrated_state()`** — Saves full model state dict to disk (~1.5TB). Enables `--export-only` recovery if export crashes.
 
 ## Bugs Found (V4 + modelopt 0.45.0.dev64)
 
 1. ~~`QuantDeepseekV4Experts` AttributeError~~ — **Already fixed** in modelopt 0.45.0.dev64 (handles `nn.ModuleList` quantizers natively).
 2. `--low_memory_mode` → meta device error. Don't use with V4.
 3. Missing `kernels` package for FP8 ops. `pip install -U kernels`.
-4. ~~Shell script arg names~~ — No longer relevant (in-process script).
+4. ~~Shell script arg names~~ — Resolved by calling `hf_main()` directly.
 5. **Export crash — stale GPU tensors in `export_amax()`.** After hours of calibration, quantizer `_amax` on GPU becomes unreadable. Fixed by patching `export_amax` to move `_amax` to CPU before reading.
 6. **Export crash — `assert torch.all(activation_scaling_factor > 0)`.** Amax values from stale GPU reads are garbage (zeros, negatives, NaN). Fixed by clamping instead of asserting, plus snapshotting valid amax to CPU before corruption can occur.
 7. **Model loading OOM during expert weight conversion.** `AutoModelForCausalLM.from_pretrained` does `torch.cat` on GPU for expert `gate_up_proj` (31.5GB alloc), but only 25.9GB free with `device_map="sequential"`. Fixed by using modelopt's `get_model()` which sets `max_memory` per GPU before loading.
@@ -121,4 +143,19 @@ The patches in `quantize_nvfp4.py` are for **modelopt 0.45.0.dev64** specificall
 - modelopt has no explicit V4 support — relies on auto-detection of fused experts.
 - The calibration state save (`v4_nvfp4_calibrated_state.pt`) is ~1.5TB. It lives on NVMe, not in git.
 - The amax snapshot (`v4_nvfp4_amax_snapshots.pt`) is ~50MB. Small, critical, cheap insurance.
-- Only divergence from modelopt example path: `get_model()` instead of `AutoModelForCausalLM.from_pretrained` (avoids OOM). Everything else uses the same `hf_ptq` functions.
+- The script calls `hf_main(args)` — the exact same entry point as the shell script. No pipeline divergence.
+- Must run from `/root/nvidia-meeting/modelopt-repo/examples/llm_ptq` (relative imports).
+
+## File Layout
+
+```
+scripts/
+  dequant_fp8_to_bf16.py — Step 1: FP8/FP4 → BF16 dequantization
+  quantize_nvfp4.py — Step 2: NVFP4 quantization (patches + hf_main)
+
+patches/
+  patch_finegrained_fp8_blackwell.py — (legacy) FP8 kernel patches for Blackwell
+  quant_module_patched.py — (legacy) quant module patches
+```
+
+The `patches/` directory contains earlier approaches that modified modelopt source files directly. The current approach (`quantize_nvfp4.py`) uses runtime monkey-patching instead — no source files are modified.
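
Reviewer note: the patch describes a three-part pattern (parse args through a `sys.argv` replacement, redo the `__main__` post-parse conversions, and hook `export_quantized` before calling the main entry point). Below is a minimal, self-contained sketch of that pattern. Everything in it is a stand-in: `fake_ptq`, `patched_argv`, `build_args`, `hook_before`, and `run` are hypothetical names, the flag set is reduced to three options, and the exact conversions in the real `hf_ptq.py` are assumed from the README text rather than copied from modelopt.

```python
import argparse
import functools
import sys
from contextlib import contextmanager

events = []  # records call order so the demo can be checked


class fake_ptq:
    """Hypothetical stand-in for the hf_ptq module; not the real modelopt code."""

    @staticmethod
    def parse_args():
        parser = argparse.ArgumentParser()
        parser.add_argument("--qformat", default="nvfp4")
        parser.add_argument("--calib_size", default="512")  # still a string here
        parser.add_argument("--dataset", default=None)  # comma-separated string
        return parser.parse_args()  # reads sys.argv[1:]

    @staticmethod
    def export_quantized(args):
        events.append("export")

    @staticmethod
    def main(args):
        events.append("calibrate")
        fake_ptq.export_quantized(args)  # attribute is looked up at call time


@contextmanager
def patched_argv(argv):
    """Temporarily replace sys.argv so the module's own parser sees our config."""
    saved = sys.argv
    sys.argv = ["hf_ptq.py", *argv]
    try:
        yield
    finally:
        sys.argv = saved


def build_args(argv):
    with patched_argv(argv):
        args = fake_ptq.parse_args()
    # Assumed equivalents of the conversions the __main__ block performs.
    args.dataset = args.dataset.split(",") if args.dataset else None
    args.calib_size = [int(n) for n in args.calib_size.split(",")]
    return args


def hook_before(module, name, *pre_steps):
    """Replace module.<name> with a wrapper that runs pre_steps first."""
    original = getattr(module, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        for step in pre_steps:
            step(*args, **kwargs)
        return original(*args, **kwargs)

    setattr(module, name, wrapper)
    return original  # returned so the caller can restore it


def run(argv):
    events.clear()
    args = build_args(argv)
    original = hook_before(
        fake_ptq,
        "export_quantized",
        lambda a: events.append("snapshot_amax"),
        lambda a: events.append("save_state"),
    )
    try:
        fake_ptq.main(args)
    finally:
        fake_ptq.export_quantized = original
    return args, list(events)
```

Calling `run(["--calib_size", "128", "--dataset", "a,b"])` exercises all three parts. The hook only takes effect because `main` looks up `export_quantized` on the module at call time; a caller that had already bound the original function to a local name would bypass the wrapper.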