README overhaul: reflect current architecture (hf_main, run history through Run 10)

- Architecture section: call hf_main() directly instead of rewriting the pipeline
- Run history: all 10 runs with root causes and fixes
- Key lessons: stale GPU tensors, expert OOM, pipeline rewriting trap, __main__ gap
- Runtime patches: 3 monkey-patches + 3 post-calibration hook steps
- Do NOT repeat: 8 specific mistakes with run references
- File layout with legacy patches note
2026-05-09 16:09:09 +00:00
parent 5a72da7193
commit ce9056d259

README.md

@@ -4,6 +4,16 @@ Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7T
**Cost:** ~$161/run at $23/hr (7 hours each). Don't waste runs.
## Architecture
We call modelopt's `hf_ptq.main()` directly — the same entry point the shell script uses. We don't rewrite the pipeline. We just:
1. **Patch** modelopt at runtime (GPU tensor safety, before anything runs)
2. **Hook** `export_quantized` to snapshot amax + save state before export
3. **Call** `hf_main(args)` with properly parsed args
This avoids the cascade of missing-arg bugs from manually constructing `argparse.Namespace` (Runs 4-8).
## Pipeline
### Step 1: Dequantize FP8 → BF16
@@ -31,13 +41,11 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py
Must run from the modelopt example directory (relative imports).
Pipeline steps:
1. **Load** BF16 model via modelopt's `get_model()` (handles memory limits properly for 3TB model)
2. **Patch** modelopt at runtime (load_calib_amax forces CPU, export_amax CPU fallback, graceful clamp)
3. **Quantize + Calibrate** (5-6 hours, 128 samples)
4. **Snapshot amax to CPU** — copies all quantizer state to CPU and saves to disk (~50MB)
5. **Save model state** — full state dict to disk (insurance against export crashes)
6. **Export** to HF safetensors
What happens inside:
1. **Apply patches** — 3 runtime monkey-patches for GPU tensor safety (see below)
2. **Parse args** — uses `hf_ptq.parse_args()` with our config via `sys.argv` replacement, then applies the same post-parse conversions (`dataset` split, `calib_size` int list) that `hf_ptq.__main__` normally does
3. **Hook export** — monkey-patch `export_quantized` to snapshot amax + save state before export
4. **Call `hf_main(args)`** — the exact same pipeline the shell script uses
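A minimal sketch of the "Parse args" step above, assuming a stand-in parser rather than the real `hf_ptq.parse_args()`. The names `build_parser`, `patched_argv`, and `parse_with_conversions` are hypothetical, and the comma delimiter used for `dataset` and `calib_size` is an assumption of this sketch, not something confirmed from `hf_ptq.py`.

```python
import argparse
import sys
from contextlib import contextmanager


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for hf_ptq's parser; only three flags are modeled."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--qformat", default="nvfp4")
    parser.add_argument("--calib_size", default="512")  # arrives as a string
    parser.add_argument("--dataset", default=None)      # string or None
    return parser


@contextmanager
def patched_argv(argv):
    """Temporarily replace sys.argv so parse_args() sees our config."""
    saved = sys.argv
    sys.argv = ["hf_ptq.py", *argv]
    try:
        yield
    finally:
        sys.argv = saved  # always restore, even if parsing fails


def parse_with_conversions(argv):
    with patched_argv(argv):
        args = build_parser().parse_args()
    # Post-parse conversions that a __main__ block would otherwise apply.
    # Assumption: both values are comma-separated strings.
    if args.dataset is not None:
        args.dataset = args.dataset.split(",")
    args.calib_size = [int(n) for n in args.calib_size.split(",")]
    return args


args = parse_with_conversions(["--qformat", "nvfp4", "--calib_size", "128"])
print(args.calib_size, args.dataset)
```

Restoring `sys.argv` in a `finally` block keeps the replacement from leaking into anything that reads the command line later in the same process.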
If the export crashes:
@@ -51,56 +59,70 @@ To validate saved state without running anything:
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --validate-only
```
**Config:** `nvfp4`, 128 calib samples, calib_seq=512, kv_fp8_cast, gpu_mem_pct=0.7
**Config:** `nvfp4`, 128 calib samples, `calib_seq=512`, `kv_cache_qformat=fp8_cast`, `gpu_max_mem_percentage=0.7`, `use_seq_device_map`, `inference_tensor_parallel=8`
**Calibration datasets:** `abisee/cnn_dailymail` + `nvidia/Nemotron-Post-Training-Dataset-v2` (gated — requires HF token).
**Calibration datasets:** `abisee/cnn_dailymail` + `nvidia/Nemotron-Post-Training-Dataset-v2` (default when no `--dataset` specified).
**Runtime:** Model loading ~53 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.
**Runtime:** Model loading ~50 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.
## Run History (forward progression)
## Run History
| Run | Date | Commit | Calib | Result | Root Cause | Fix |
|-----|------|--------|-------|--------|------------|-----|
| 1 | May 7 | shell wrapper (pre-repo) | 256 | ❌ Crashed at batch probing | `o_b_proj` shape mismatch — finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
| 2 | May 8-9 | shell wrapper (pre-repo) | 128 | ❌ Crashed at export (128/128 calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU immediately after calibration |
| 3 | May 9 06:10 | `3907838` | 128 | ❌ Crashed at model loading | `AutoModelForCausalLM.from_pretrained` OOM during expert weight `torch.cat` (31.5GB alloc, 25.9GB free) | Use modelopt `get_model()` which handles max_memory properly |
| 4 | May 9 ~07:00 | `86dd8df` | 128 | ❌ Crashed at quantize config setup | `mtq.KV_QUANT_CFG_CHOICES` doesn't exist — it's `hf_ptq.KV_QUANT_CFG_CHOICES` | Import `KV_QUANT_CFG_CHOICES` from `hf_ptq`, not `mtq` |
| 5 | May 9 ~08:05 | `f9bbef8` | 128 | ❌ Same as Run 4 | Same import bug, wasn't synced properly | Same fix, properly synced |
| 6 | May 9 ~09:25 | `6c1bff6` | 128 | ❌ Crashed at dataloader setup | `make_calib_dataloader` AttributeError — missing args (`dataset`, `calib_with_images`, etc.) | Add all required args to argparse.Namespace |
| 7 | May 9 ~13:40 | `25b4d8d` | 128 | ❌ Crashed at dataloader setup | Same — `dataset=None`, `len()` on None | Provide actual dataset list |
| 8 | May 9 ~14:00 | `b2849a8` | 128 | ❌ Crashed at argparse | Wrong flag names (`--quant`, `--calib`, `--kv_cache_quant`, `--tp`) — these are shell script names, not `hf_ptq.py` names | Use `hf_ptq.py` flag names (`--qformat`, `--calib_size`, `--kv_cache_qformat`, `--inference_tensor_parallel`) |
| 9 | May 9 ~14:30 | `a300302` | 128 | 🔄 Running | — | Calls `hf_main(args)` directly, no more pipeline rewriting |
| 1 | May 7 | shell wrapper | 256 | ❌ Batch probing crash | `o_b_proj` shape mismatch — finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
| 2 | May 8-9 | shell wrapper | 128 | ❌ Export crash (calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU after calibration |
| 3 | May 9 06:10 | `3907838` | 128 | ❌ Model loading OOM | `AutoModelForCausalLM.from_pretrained` OOM during expert weight `torch.cat` | Use modelopt `get_model()` with `max_memory` |
| 4 | May 9 ~07:00 | `86dd8df` | 128 | ❌ Import error | `mtq.KV_QUANT_CFG_CHOICES` doesn't exist — it's `hf_ptq.KV_QUANT_CFG_CHOICES` | Import from `hf_ptq`, not `mtq` |
| 5 | May 9 ~08:05 | `f9bbef8` | 128 | ❌ Same as Run 4 | Fix wasn't synced properly | Properly synced |
| 6 | May 9 ~09:25 | `6c1bff6` | 128 | ❌ Dataloader crash | `make_calib_dataloader` AttributeError — missing args | Added args to Namespace |
| 7 | May 9 ~13:40 | `25b4d8d` | 128 | ❌ Dataloader crash | `dataset=None`, `len()` on None | Provided dataset list |
| 8 | May 9 ~14:00 | `b2849a8` | 128 | ❌ Argparse crash | Wrong flag names (shell script names vs `hf_ptq.py` names) | Use `hf_ptq.py` flag names |
| 9 | May 9 ~14:30 | `a300302` | 128 | ❌ TypeError | Skipped `__main__` post-parse conversions (`calib_size` still string, not int list) | Apply same conversions after `parse_args()` |
| 10 | May 9 ~15:30 | `5a72da7` | 128 | 🔄 Running | — | Calls `hf_main(args)` directly |
**If Run 10 succeeds**, current code is good. No further changes needed.
**If Run 10 fails**, check the log, identify the crash point, add it to this table.
### Key Lessons
**Key lesson from Run 2:** The `use_seq_device_map` mode shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled — reading it triggers `cudaErrorIllegalAddress`. The fix is to copy amax to CPU *immediately* after calibration, before any further GPU operations.
**Run 2 — Stale GPU tensors:** `use_seq_device_map` shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled — reading it triggers `cudaErrorIllegalAddress`. Fix: copy amax to CPU immediately after calibration.
**Key lesson from Run 3:** Don't use `AutoModelForCausalLM.from_pretrained` directly for models with large expert weights. The weight conversion step does `torch.cat` on GPU which can OOM. Use modelopt's `get_model()` which sets `max_memory` per GPU before loading.
**Run 3 — Expert weight OOM:** `AutoModelForCausalLM.from_pretrained` does `torch.cat` on GPU for expert `gate_up_proj` (31.5GB alloc, 25.9GB free). Fix: use modelopt's `get_model()` which sets `max_memory` per GPU before loading. (Note: Run 10 uses `hf_main()` which calls `get_model()` internally.)
**Runs 4-8 — Pipeline rewriting trap:** Trying to reconstruct hf_ptq's pipeline by importing individual functions and building a fake `argparse.Namespace` causes an endless stream of missing-attribute and type errors. Each fix reveals the next bug. Fix: call `hf_main(args)` directly with a properly parsed args object.
**Run 9 — `__main__` gap:** `hf_ptq.py` does critical type conversions in its `__main__` block (string → list for `dataset`, string → int list for `calib_size`). When calling `main()` directly, these are skipped. Fix: apply the same conversions after `parse_args()`.
### Do NOT Repeat These Mistakes
**Do NOT repeat these mistakes:**
- Don't use FP8 source model — kernel issues on Blackwell (Run 1)
- Don't use `--low_memory_mode` with V4 — meta device errors
- Don't use calib_size=256 — OOMs with 3TB BF16 on CPU offload
- Don't use `calib_size=256` — OOMs with 3TB BF16 on CPU offload
- Don't use `AutoModelForCausalLM.from_pretrained` directly — OOM during expert weight concat (Run 3)
- Don't assume GPU tensor integrity after 5+ hours of sequential calibration
- Don't assume GPU tensor integrity after 5+ hours of sequential calibration (Run 2)
- Don't rewrite the hf_ptq pipeline — call `hf_main()` directly (Runs 4-8)
- Don't skip the `__main__` post-parse conversions — `calib_size` must be int list, `dataset` must be list (Run 9)
- Don't use shell script arg names (`--quant`, `--calib`, `--kv_cache_quant`, `--tp`) — use `hf_ptq.py` names (`--qformat`, `--calib_size`, `--kv_cache_qformat`, `--inference_tensor_parallel`)
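
One way to catch the flag-name mistake before a run starts is a small guard over the argument list. This is a sketch only: `SHELL_TO_HF_PTQ` and `check_flag_names` are hypothetical names, and pairing the shell names with the `hf_ptq.py` names by their order in the bullet above is an assumption.

```python
# Mapping copied from the bullet above; pairing by position is an assumption.
SHELL_TO_HF_PTQ = {
    "--quant": "--qformat",
    "--calib": "--calib_size",
    "--kv_cache_quant": "--kv_cache_qformat",
    "--tp": "--inference_tensor_parallel",
}


def check_flag_names(argv):
    """Raise early if argv contains a shell-script flag name."""
    # Handle both "--flag value" and "--flag=value" spellings.
    names = [arg.split("=", 1)[0] for arg in argv]
    bad = [name for name in names if name in SHELL_TO_HF_PTQ]
    if bad:
        hints = ", ".join(f"{b} -> {SHELL_TO_HF_PTQ[b]}" for b in bad)
        raise ValueError(f"shell-script flag names found: {hints}")
    return argv


check_flag_names(["--qformat", "nvfp4", "--calib_size", "128"])  # passes silently
```

Failing in the first second of a run is much cheaper than failing after model loading.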
## Runtime Patches Applied by quantize_nvfp4.py
These are monkey-patches applied at runtime — no modelopt source files are modified.
1. **`load_calib_amax`** — After calibration writes `_amax` to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
2. **`export_amax`** — If any `_amax` is still on GPU at export time, moves to CPU before reading. Safety net.
3. **`get_activation_scaling_factor`** — Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
4. **`snapshot_amax_to_cpu()`** — After calibration, walks all quantizers, copies `_amax` to CPU, saves to disk (~50MB). Insurance policy.
1. **`TensorQuantizer.load_calib_amax`** — After calibration writes `_amax` to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
2. **`TensorQuantizer.export_amax`** — If `_amax` is still on GPU at export time, moves to CPU before reading. Safety net.
3. **`NVFP4QTensor.get_activation_scaling_factor`** — Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
### Post-Calibration Hook
`export_quantized` is monkey-patched to run these steps before the real export:
4. **`snapshot_amax_to_cpu()`** — Walks all quantizers, copies `_amax` to CPU, saves to disk (~50MB). Insurance policy.
5. **`force_all_amax_to_cpu()`** — Moves `_pre_quant_scale`, `_global_amax` to CPU too. Nuclear option.
6. **`save_calibrated_state()`** — Saves full model state dict to disk (~1.5TB). Enables `--export-only` recovery if export crashes.
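
The hook itself can be sketched as a wrapper that runs the pre-export steps and then calls the original function. The example below uses a stand-in namespace instead of the real `hf_ptq` module, and `hook_before` is a hypothetical helper; the three step names are taken from the list above but are replaced here by placeholders that only record the call order.

```python
import functools
import types


def hook_before(module, name, pre_steps):
    """Replace module.<name> with a wrapper that runs pre_steps first."""
    original = getattr(module, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        for step in pre_steps:
            step(*args, **kwargs)
        return original(*args, **kwargs)

    setattr(module, name, wrapper)
    return original  # returned so the patch can be undone later


# Hypothetical stand-in for the module that owns export_quantized.
fake_hf_ptq = types.SimpleNamespace()
calls = []
fake_hf_ptq.export_quantized = lambda model: calls.append("export") or "exported"

hook_before(
    fake_hf_ptq,
    "export_quantized",
    [
        lambda model: calls.append("snapshot_amax_to_cpu"),
        lambda model: calls.append("force_all_amax_to_cpu"),
        lambda model: calls.append("save_calibrated_state"),
    ],
)

result = fake_hf_ptq.export_quantized("model")
print(calls)
```

This pattern only takes effect if the caller looks up `export_quantized` on the module at call time. A name bound earlier with `from module import export_quantized` would still point at the unpatched function.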
## Bugs Found (V4 + modelopt 0.45.0.dev64)
1. ~~`QuantDeepseekV4Experts` AttributeError~~ **Already fixed** in modelopt 0.45.0.dev64 (handles `nn.ModuleList` quantizers natively).
2. `--low_memory_mode` → meta device error. Don't use with V4.
3. Missing `kernels` package for FP8 ops. `pip install -U kernels`.
4. ~~Shell script arg names~~ No longer relevant (in-process script).
4. ~~Shell script arg names~~ Resolved by calling `hf_main()` directly.
5. **Export crash — stale GPU tensors in `export_amax()`.** After hours of calibration, quantizer `_amax` on GPU becomes unreadable. Fixed by patching `export_amax` to move `_amax` to CPU before reading.
6. **Export crash — `assert torch.all(activation_scaling_factor > 0)`.** Amax values from stale GPU reads are garbage (zeros, negatives, NaN). Fixed by clamping instead of asserting, plus snapshotting valid amax to CPU before corruption can occur.
7. **Model loading OOM during expert weight conversion.** `AutoModelForCausalLM.from_pretrained` does `torch.cat` on GPU for expert `gate_up_proj` (31.5GB alloc), but only 25.9GB free with `device_map="sequential"`. Fixed by using modelopt's `get_model()` which sets `max_memory` per GPU before loading.
@@ -121,4 +143,19 @@ The patches in `quantize_nvfp4.py` are for **modelopt 0.45.0.dev64** specificall
- modelopt has no explicit V4 support — relies on auto-detection of fused experts.
- The calibration state save (`v4_nvfp4_calibrated_state.pt`) is ~1.5TB. It lives on NVMe, not in git.
- The amax snapshot (`v4_nvfp4_amax_snapshots.pt`) is ~50MB. Small, critical, cheap insurance.
- Only divergence from modelopt example path: `get_model()` instead of `AutoModelForCausalLM.from_pretrained` (avoids OOM). Everything else uses the same `hf_ptq` functions.
- The script calls `hf_main(args)` — the exact same entry point as the shell script. No pipeline divergence.
- Must run from `/root/nvidia-meeting/modelopt-repo/examples/llm_ptq` (relative imports).
## File Layout
```
scripts/
dequant_fp8_to_bf16.py — Step 1: FP8/FP4 → BF16 dequantization
quantize_nvfp4.py — Step 2: NVFP4 quantization (patches + hf_main)
patches/
patch_finegrained_fp8_blackwell.py — (legacy) FP8 kernel patches for Blackwell
quant_module_patched.py — (legacy) quant module patches
```
The `patches/` directory contains earlier approaches that modified modelopt source files directly. The current approach (`quantize_nvfp4.py`) uses runtime monkey-patching instead — no source files are modified.