Update README and memory: Run 3 OOM crash, Run 4 running on f9bbef8

- Added Run 3 to table (model loading OOM, fixed with get_model())
- Added Run 4 (current, commit f9bbef8)
- Added bug #7 (model loading OOM during expert weight concat)
- Added 'do NOT repeat' for AutoModelForCausalLM.from_pretrained
- Documented all 5 runtime patches
- Noted only divergence from modelopt example: get_model()
2026-05-09 08:10:04 +00:00
parent f9bbef8e91
commit 99f861f48a
1 changed files with 21 additions and 5 deletions
--- a/README.md
+++ b/README.md
@@ -32,8 +32,8 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py
 Must run from the modelopt example directory (relative imports).

 Pipeline steps:
-1. **Load** BF16 model with sequential device map (3TB model, CPU offload)
-2. **Patch** modelopt at runtime (GPU tensor safety, graceful degradation)
+1. **Load** BF16 model via modelopt's `get_model()` (handles memory limits properly for 3TB model)
+2. **Patch** modelopt at runtime (load_calib_amax forces CPU, export_amax CPU fallback, graceful clamp)
 3. **Quantize + Calibrate** (5-6 hours, 128 samples)
 4. **Snapshot amax to CPU** — copies all quantizer state to CPU and saves to disk (~50MB)
 5. **Save model state** — full state dict to disk (insurance against export crashes)
@@ -63,19 +63,33 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --valid
 |-----|------|--------|-------|--------|------------|-----|
 | 1 | May 7 | shell wrapper (pre-repo) | 256 | ❌ Crashed at batch probing | `o_b_proj` shape mismatch — finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
 | 2 | May 8-9 | shell wrapper (pre-repo) | 128 | ❌ Crashed at export (128/128 calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU immediately after calibration |
-| 3 | May 9 | `3907838` | 128 | 🔄 Running | — | — |
+| 3 | May 9 06:10 | `3907838` | 128 | ❌ Crashed at model loading | `AutoModelForCausalLM.from_pretrained` OOM during expert weight `torch.cat` (31.5GB alloc, 25.9GB free) | Use modelopt `get_model()` which handles max_memory properly |
+| 4 | May 9 ~08:05 | `f9bbef8` | 128 | 🔄 Running | — | — |

-**If Run 3 succeeds**, the v2 script (`6eeba26`) is reference only — no need to adopt it.
-**If Run 3 fails at export**, use commit `6eaba26` (adds amax snapshot + force CPU + validate mode).
+**If Run 4 succeeds**, current code is good. No further changes needed.
+**If Run 4 fails**, check the log, identify the crash point, add it to this table.

 **Key lesson from Run 2:** The `use_seq_device_map` mode shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled — reading it triggers `cudaErrorIllegalAddress`. The fix is to copy amax to CPU *immediately* after calibration, before any further GPU operations.

+**Key lesson from Run 3:** Don't use `AutoModelForCausalLM.from_pretrained` directly for models with large expert weights. The weight conversion step does `torch.cat` on GPU which can OOM. Use modelopt's `get_model()` which sets `max_memory` per GPU before loading.
+
 **Do NOT repeat these mistakes:**
 - Don't use FP8 source model — kernel issues on Blackwell (Run 1)
 - Don't use `--low_memory_mode` with V4 — meta device errors
 - Don't use calib_size=256 — OOMs with 3TB BF16 on CPU offload
+- Don't use `AutoModelForCausalLM.from_pretrained` directly — OOM during expert weight concat (Run 3)
 - Don't assume GPU tensor integrity after 5+ hours of sequential calibration

+## Runtime Patches Applied by quantize_nvfp4.py
+
+These are monkey-patches applied at runtime — no modelopt source files are modified.
+
+1. **`load_calib_amax`** — After calibration writes `_amax` to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
+2. **`export_amax`** — If any `_amax` is still on GPU at export time, moves to CPU before reading. Safety net.
+3. **`get_activation_scaling_factor`** — Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
+4. **`snapshot_amax_to_cpu()`** — After calibration, walks all quantizers, copies `_amax` to CPU, saves to disk (~50MB). Insurance policy.
+5. **`force_all_amax_to_cpu()`** — Moves `_pre_quant_scale`, `_global_amax` to CPU too. Nuclear option.
+
 ## Bugs Found (V4 + modelopt 0.45.0.dev64)

 1. ~~`QuantDeepseekV4Experts` AttributeError~~ — **Already fixed** in modelopt 0.45.0.dev64 (handles `nn.ModuleList` quantizers natively).
@@ -84,6 +98,7 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --valid
 4. ~~Shell script arg names~~ — No longer relevant (in-process script).
 5. **Export crash — stale GPU tensors in `export_amax()`.** After hours of calibration, quantizer `_amax` on GPU becomes unreadable. Fixed by patching `export_amax` to move `_amax` to CPU before reading.
 6. **Export crash — `assert torch.all(activation_scaling_factor > 0)`.** Amax values from stale GPU reads are garbage (zeros, negatives, NaN). Fixed by clamping instead of asserting, plus snapshotting valid amax to CPU before corruption can occur.
+7. **Model loading OOM during expert weight conversion.** `AutoModelForCausalLM.from_pretrained` does `torch.cat` on GPU for expert `gate_up_proj` (31.5GB alloc), but only 25.9GB free with `device_map="sequential"`. Fixed by using modelopt's `get_model()` which sets `max_memory` per GPU before loading.

 ## Dependencies (pinned versions)

@@ -101,3 +116,4 @@ The patches in `quantize_nvfp4.py` are for **modelopt 0.45.0.dev64** specificall
 - modelopt has no explicit V4 support — relies on auto-detection of fused experts.
 - The calibration state save (`v4_nvfp4_calibrated_state.pt`) is ~1.5TB. It lives on NVMe, not in git.
 - The amax snapshot (`v4_nvfp4_amax_snapshots.pt`) is ~50MB. Small, critical, cheap insurance.
+- Only divergence from modelopt example path: `get_model()` instead of `AutoModelForCausalLM.from_pretrained` (avoids OOM). Everything else uses the same `hf_ptq` functions.