README overhaul: reflect current architecture (hf_main, run history through Run 10)

- Architecture section: call hf_main() directly, not rewrite the pipeline
- Run history: all 10 runs with root causes and fixes
- Key lessons: stale GPU tensors, expert OOM, pipeline rewriting trap, __main__ gap
- Runtime patches: 3 monkey-patches + 3 post-calibration hook steps
- Do NOT repeat: 8 specific mistakes with run references
- File layout with legacy patches note
2026-05-09 16:09:09 +00:00
parent 5a72da7193
commit ce9056d259
1 changed files with 70 additions and 33 deletions
--- a/README.md
+++ b/README.md
@@ -4,6 +4,16 @@ Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7T

 **Cost:** ~$161/run at $23/hr (7 hours each). Don't waste runs.

+## Architecture
+
+We call modelopt's `hf_ptq.main()` directly — the same entry point the shell script uses. We don't rewrite the pipeline. We just:
+
+1. **Patch** modelopt at runtime (GPU tensor safety, before anything runs)
+2. **Hook** `export_quantized` to snapshot amax + save state before export
+3. **Call** `hf_main(args)` with properly parsed args
+
+This avoids the cascade of missing-arg bugs from manually constructing `argparse.Namespace` (Runs 4–8).
+
 ## Pipeline

 ### Step 1: Dequantize FP8 → BF16
@@ -31,13 +41,11 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py

 Must run from the modelopt example directory (relative imports).

-Pipeline steps:
-1. **Load** BF16 model via modelopt's `get_model()` (handles memory limits properly for 3TB model)
-2. **Patch** modelopt at runtime (load_calib_amax forces CPU, export_amax CPU fallback, graceful clamp)
-3. **Quantize + Calibrate** (5-6 hours, 128 samples)
-4. **Snapshot amax to CPU** — copies all quantizer state to CPU and saves to disk (~50MB)
-5. **Save model state** — full state dict to disk (insurance against export crashes)
-6. **Export** to HF safetensors
+What happens inside:
+1. **Apply patches** — 3 runtime monkey-patches for GPU tensor safety (see below)
+2. **Parse args** — uses `hf_ptq.parse_args()` with our config via `sys.argv` replacement, then applies the same post-parse conversions (`dataset` split, `calib_size` int list) that `hf_ptq.__main__` normally does
+3. **Hook export** — monkey-patch `export_quantized` to snapshot amax + save state before export
+4. **Call `hf_main(args)`** — the exact same pipeline the shell script uses
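The "Parse args" step above relies on temporarily replacing `sys.argv` so that a parser which reads the process arguments implicitly sees our configuration instead. A minimal sketch of that technique follows. The `parse_args()` here is a hypothetical stand-in built with `argparse`, not the real `hf_ptq.parse_args()`, and its `None` defaults are placeholders rather than modelopt's defaults. Only the flag names and values come from this README.

```python
import argparse
import sys
from contextlib import contextmanager


@contextmanager
def patched_argv(argv):
    """Temporarily replace sys.argv, restoring it even if parsing fails."""
    saved = sys.argv
    sys.argv = list(argv)
    try:
        yield
    finally:
        sys.argv = saved


def parse_args():
    """Hypothetical stand-in for hf_ptq.parse_args(); reads sys.argv implicitly."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--qformat", default=None)
    parser.add_argument("--calib_size", default=None)
    parser.add_argument("--kv_cache_qformat", default=None)
    parser.add_argument("--inference_tensor_parallel", type=int, default=None)
    return parser.parse_args()


# Values taken from the Config line of this README; argv[0] is the program name.
CONFIG_ARGV = [
    "quantize_nvfp4.py",
    "--qformat", "nvfp4",
    "--calib_size", "128",
    "--kv_cache_qformat", "fp8_cast",
    "--inference_tensor_parallel", "8",
]

with patched_argv(CONFIG_ARGV):
    args = parse_args()
```

Note that `args.calib_size` is still the string `"128"` at this point, which is the gap that the post-parse conversions close.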

 If the export crashes:

@@ -51,56 +59,70 @@ To validate saved state without running anything:
 python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --validate-only
 ```

-**Config:** `nvfp4`, 128 calib samples, calib_seq=512, kv_fp8_cast, gpu_mem_pct=0.7
+**Config:** `nvfp4`, 128 calib samples, `calib_seq=512`, `kv_cache_qformat=fp8_cast`, `gpu_max_mem_percentage=0.7`, `use_seq_device_map`, `inference_tensor_parallel=8`

-**Calibration datasets:** `abisee/cnn_dailymail` + `nvidia/Nemotron-Post-Training-Dataset-v2` (gated — requires HF token).
+**Calibration datasets:** `abisee/cnn_dailymail` + `nvidia/Nemotron-Post-Training-Dataset-v2` (the default when no `--dataset` is specified).

-**Runtime:** Model loading ~53 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.
+**Runtime:** Model loading ~50 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.

-## Run History (forward progression)
+## Run History

 | Run | Date | Commit | Calib | Result | Root Cause | Fix |
 |-----|------|--------|-------|--------|------------|-----|
-| 1 | May 7 | shell wrapper (pre-repo) | 256 | ❌ Crashed at batch probing | `o_b_proj` shape mismatch — finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
-| 2 | May 8-9 | shell wrapper (pre-repo) | 128 | ❌ Crashed at export (128/128 calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU immediately after calibration |
-| 3 | May 9 06:10 | `3907838` | 128 | ❌ Crashed at model loading | `AutoModelForCausalLM.from_pretrained` OOM during expert weight `torch.cat` (31.5GB alloc, 25.9GB free) | Use modelopt `get_model()` which handles max_memory properly |
-| 4 | May 9 ~07:00 | `86dd8df` | 128 | ❌ Crashed at quantize config setup | `mtq.KV_QUANT_CFG_CHOICES` doesn't exist — it's `hf_ptq.KV_QUANT_CFG_CHOICES` | Import `KV_QUANT_CFG_CHOICES` from `hf_ptq`, not `mtq` |
-| 5 | May 9 ~08:05 | `f9bbef8` | 128 | ❌ Same as Run 4 | Same import bug, wasn't synced properly | Same fix, properly synced |
-| 6 | May 9 ~09:25 | `6c1bff6` | 128 | ❌ Crashed at dataloader setup | `make_calib_dataloader` AttributeError — missing args (`dataset`, `calib_with_images`, etc.) | Add all required args to argparse.Namespace |
-| 7 | May 9 ~13:40 | `25b4d8d` | 128 | ❌ Crashed at dataloader setup | Same — `dataset=None`, `len()` on None | Provide actual dataset list |
-| 8 | May 9 ~14:00 | `b2849a8` | 128 | ❌ Crashed at argparse | Wrong flag names (`--quant`, `--calib`, `--kv_cache_quant`, `--tp`) — these are shell script names, not `hf_ptq.py` names | Use `hf_ptq.py` flag names (`--qformat`, `--calib_size`, `--kv_cache_qformat`, `--inference_tensor_parallel`) |
-| 9 | May 9 ~14:30 | `a300302` | 128 | 🔄 Running | — | Calls `hf_main(args)` directly, no more pipeline rewriting |
+| 1 | May 7 | shell wrapper | 256 | ❌ Batch probing crash | `o_b_proj` shape mismatch — finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
+| 2 | May 8-9 | shell wrapper | 128 | ❌ Export crash (calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU after calibration |
+| 3 | May 9 06:10 | `3907838` | 128 | ❌ Model loading OOM | `AutoModelForCausalLM.from_pretrained` OOM during expert weight `torch.cat` | Use modelopt `get_model()` with `max_memory` |
+| 4 | May 9 ~07:00 | `86dd8df` | 128 | ❌ Import error | `mtq.KV_QUANT_CFG_CHOICES` doesn't exist — it's `hf_ptq.KV_QUANT_CFG_CHOICES` | Import from `hf_ptq`, not `mtq` |
+| 5 | May 9 ~08:05 | `f9bbef8` | 128 | ❌ Same as Run 4 | Fix wasn't synced properly | Properly synced |
+| 6 | May 9 ~09:25 | `6c1bff6` | 128 | ❌ Dataloader crash | `make_calib_dataloader` AttributeError — missing args | Added args to Namespace |
+| 7 | May 9 ~13:40 | `25b4d8d` | 128 | ❌ Dataloader crash | `dataset=None`, `len()` on None | Provided dataset list |
+| 8 | May 9 ~14:00 | `b2849a8` | 128 | ❌ Argparse crash | Wrong flag names (shell script names vs `hf_ptq.py` names) | Use `hf_ptq.py` flag names |
+| 9 | May 9 ~14:30 | `a300302` | 128 | ❌ TypeError | Skipped `__main__` post-parse conversions (`calib_size` still string, not int list) | Apply same conversions after `parse_args()` |
+| 10 | May 9 ~15:30 | `5a72da7` | 128 | 🔄 Running | — | Calls `hf_main(args)` directly |

-**If Run 4 succeeds**, current code is good. No further changes needed.
-**If Run 4 fails**, check the log, identify the crash point, add it to this table.
+### Key Lessons

-**Key lesson from Run 2:** The `use_seq_device_map` mode shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled — reading it triggers `cudaErrorIllegalAddress`. The fix is to copy amax to CPU *immediately* after calibration, before any further GPU operations.
+**Run 2 — Stale GPU tensors:** `use_seq_device_map` shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled — reading it triggers `cudaErrorIllegalAddress`. Fix: copy amax to CPU immediately after calibration.

-**Key lesson from Run 3:** Don't use `AutoModelForCausalLM.from_pretrained` directly for models with large expert weights. The weight conversion step does `torch.cat` on GPU which can OOM. Use modelopt's `get_model()` which sets `max_memory` per GPU before loading.
+**Run 3 — Expert weight OOM:** `AutoModelForCausalLM.from_pretrained` does `torch.cat` on GPU for expert `gate_up_proj` (31.5GB alloc, 25.9GB free). Fix: use modelopt's `get_model()` which sets `max_memory` per GPU before loading. (Note: Run 10 uses `hf_main()` which calls `get_model()` internally.)
+
+**Runs 4–8 — Pipeline rewriting trap:** Trying to reconstruct hf_ptq's pipeline by importing individual functions and building a fake `argparse.Namespace` causes an endless stream of missing-attribute and type errors. Each fix reveals the next bug. Fix: call `hf_main(args)` directly with a properly parsed args object.
+
+**Run 9 — `__main__` gap:** `hf_ptq.py` does critical type conversions in its `__main__` block (string → list for `dataset`, string → int list for `calib_size`). When calling `main()` directly, these are skipped. Fix: apply the same conversions after `parse_args()`.
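The Run 9 conversions can be sketched as a small helper that runs between `parse_args()` and `hf_main(args)`. This is an illustration under assumptions, not a copy of the `__main__` block: the function name `apply_post_parse_conversions` is hypothetical, and the comma delimiter is assumed, so check both against the installed `hf_ptq.py`.

```python
from argparse import Namespace


def apply_post_parse_conversions(args):
    """Turn delimited strings into the list types the pipeline expects.

    Assumption: values are comma-separated. Non-string values (None, or a
    list from an earlier call) are left untouched, so the helper is safe
    to call more than once.
    """
    if isinstance(args.dataset, str):
        args.dataset = [name.strip() for name in args.dataset.split(",") if name.strip()]
    if isinstance(args.calib_size, str):
        args.calib_size = [int(size) for size in args.calib_size.split(",")]
    return args


# dataset=None models the case where no --dataset flag was passed.
args = apply_post_parse_conversions(Namespace(dataset=None, calib_size="128"))
```

Leaving `dataset=None` untouched keeps whatever default-dataset handling the pipeline applies downstream. Only the string forms are rewritten.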
+
+### Do NOT Repeat These Mistakes

-**Do NOT repeat these mistakes:**
 - Don't use FP8 source model — kernel issues on Blackwell (Run 1)
 - Don't use `--low_memory_mode` with V4 — meta device errors
-- Don't use calib_size=256 — OOMs with 3TB BF16 on CPU offload
+- Don't use `calib_size=256` — OOMs with 3TB BF16 on CPU offload
 - Don't use `AutoModelForCausalLM.from_pretrained` directly — OOM during expert weight concat (Run 3)
-- Don't assume GPU tensor integrity after 5+ hours of sequential calibration
+- Don't assume GPU tensor integrity after 5+ hours of sequential calibration (Run 2)
+- Don't rewrite the hf_ptq pipeline — call `hf_main()` directly (Runs 4–8)
+- Don't skip the `__main__` post-parse conversions — `calib_size` must be int list, `dataset` must be list (Run 9)
+- Don't use shell script arg names (`--quant`, `--calib`, `--kv_cache_quant`, `--tp`); use `hf_ptq.py` names (`--qformat`, `--calib_size`, `--kv_cache_qformat`, `--inference_tensor_parallel`) (Run 8)
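The flag-name mistake in the last item is cheap to catch before any model loading starts. The sketch below is a hypothetical pre-flight check, not part of `quantize_nvfp4.py`. It assumes the two flag lists above pair up in the order they are written.

```python
# Shell-script flag -> hf_ptq.py flag, paired in the order the README lists them.
SHELL_TO_HF_PTQ = {
    "--quant": "--qformat",
    "--calib": "--calib_size",
    "--kv_cache_quant": "--kv_cache_qformat",
    "--tp": "--inference_tensor_parallel",
}


def check_flag_names(argv):
    """Raise early if argv contains a shell-script flag name."""
    for token in argv:
        # Handle both "--flag value" and "--flag=value" spellings.
        name = token.split("=", 1)[0]
        if name in SHELL_TO_HF_PTQ:
            raise ValueError(
                f"{name} is a shell script flag; use {SHELL_TO_HF_PTQ[name]}"
            )
    return argv
```

Calling `check_flag_names(sys.argv[1:])` before parsing would turn this mistake into an immediate error that names the correct flag.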

 ## Runtime Patches Applied by quantize_nvfp4.py

 These are monkey-patches applied at runtime — no modelopt source files are modified.

-1. **`load_calib_amax`** — After calibration writes `_amax` to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
-2. **`export_amax`** — If any `_amax` is still on GPU at export time, moves to CPU before reading. Safety net.
-3. **`get_activation_scaling_factor`** — Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
-4. **`snapshot_amax_to_cpu()`** — After calibration, walks all quantizers, copies `_amax` to CPU, saves to disk (~50MB). Insurance policy.
+1. **`TensorQuantizer.load_calib_amax`** — After calibration writes `_amax` to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
+2. **`TensorQuantizer.export_amax`** — If `_amax` is still on GPU at export time, moves to CPU before reading. Safety net.
+3. **`NVFP4QTensor.get_activation_scaling_factor`** — Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
+
+### Post-Calibration Hook
+
+`export_quantized` is monkey-patched to run these steps before the real export:
+
+4. **`snapshot_amax_to_cpu()`** — Walks all quantizers, copies `_amax` to CPU, saves to disk (~50MB). Insurance policy.
 5. **`force_all_amax_to_cpu()`** — Moves `_pre_quant_scale`, `_global_amax` to CPU too. Nuclear option.
+6. **`save_calibrated_state()`** — Saves full model state dict to disk (~1.5TB). Enables `--export-only` recovery if export crashes.
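The hook described above follows the usual wrap-and-delegate pattern: keep a reference to the original function, install a wrapper that runs the extra steps, then call the original. A minimal sketch follows, using a `SimpleNamespace` as a stand-in for the `hf_ptq` module. The helper name `hook_before` is hypothetical, the real `export_quantized` signature is not assumed, and the three steps are placeholders that only record their names.

```python
import functools
import types


def hook_before(module, name, pre_steps):
    """Replace module.<name> with a wrapper that runs pre_steps, then the original."""
    original = getattr(module, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        # Each pre-step receives the same arguments as the wrapped function.
        for step in pre_steps:
            step(*args, **kwargs)
        return original(*args, **kwargs)

    setattr(module, name, wrapper)
    return original  # returned so the caller can restore it later


# Stand-in for the hf_ptq module; records calls instead of exporting anything.
calls = []
fake_hf_ptq = types.SimpleNamespace(
    export_quantized=lambda *a, **k: calls.append("export") or "exported"
)

hook_before(
    fake_hf_ptq,
    "export_quantized",
    [
        lambda *a, **k: calls.append("snapshot_amax_to_cpu"),
        lambda *a, **k: calls.append("force_all_amax_to_cpu"),
        lambda *a, **k: calls.append("save_calibrated_state"),
    ],
)
result = fake_hf_ptq.export_quantized("args", "model")
```

With a real module, replacing the attribute only affects callers that look the name up on the module at call time. Code that imported the function by name beforehand keeps the unpatched version.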

 ## Bugs Found (V4 + modelopt 0.45.0.dev64)

 1. ~~`QuantDeepseekV4Experts` AttributeError~~ — **Already fixed** in modelopt 0.45.0.dev64 (handles `nn.ModuleList` quantizers natively).
 2. `--low_memory_mode` → meta device error. Don't use with V4.
 3. Missing `kernels` package for FP8 ops. `pip install -U kernels`.
-4. ~~Shell script arg names~~ — No longer relevant (in-process script).
+4. ~~Shell script arg names~~ — Resolved by calling `hf_main()` directly.
 5. **Export crash — stale GPU tensors in `export_amax()`.** After hours of calibration, quantizer `_amax` on GPU becomes unreadable. Fixed by patching `export_amax` to move `_amax` to CPU before reading.
 6. **Export crash — `assert torch.all(activation_scaling_factor > 0)`.** Amax values from stale GPU reads are garbage (zeros, negatives, NaN). Fixed by clamping instead of asserting, plus snapshotting valid amax to CPU before corruption can occur.
 7. **Model loading OOM during expert weight conversion.** `AutoModelForCausalLM.from_pretrained` does `torch.cat` on GPU for expert `gate_up_proj` (31.5GB alloc), but only 25.9GB free with `device_map="sequential"`. Fixed by using modelopt's `get_model()` which sets `max_memory` per GPU before loading.
@@ -121,4 +143,19 @@ The patches in `quantize_nvfp4.py` are for **modelopt 0.45.0.dev64** specificall
 - modelopt has no explicit V4 support — relies on auto-detection of fused experts.
 - The calibration state save (`v4_nvfp4_calibrated_state.pt`) is ~1.5TB. It lives on NVMe, not in git.
 - The amax snapshot (`v4_nvfp4_amax_snapshots.pt`) is ~50MB. Small, critical, cheap insurance.
-- Only divergence from modelopt example path: `get_model()` instead of `AutoModelForCausalLM.from_pretrained` (avoids OOM). Everything else uses the same `hf_ptq` functions.
+- The script calls `hf_main(args)` — the exact same entry point as the shell script. No pipeline divergence.
+- Must run from `/root/nvidia-meeting/modelopt-repo/examples/llm_ptq` (relative imports).
+
+## File Layout
+
+```
+scripts/
+  dequant_fp8_to_bf16.py   — Step 1: FP8/FP4 → BF16 dequantization
+  quantize_nvfp4.py         — Step 2: NVFP4 quantization (patches + hf_main)
+
+patches/
+  patch_finegrained_fp8_blackwell.py  — (legacy) FP8 kernel patches for Blackwell
+  quant_module_patched.py             — (legacy) quant module patches
+```
+
+The `patches/` directory contains earlier approaches that modified modelopt source files directly. The current approach (`quantize_nvfp4.py`) uses runtime monkey-patching instead — no source files are modified.