diff --git a/README.md b/README.md
index bb748bd..b126c96 100644
--- a/README.md
+++ b/README.md
@@ -32,8 +32,8 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py
 Must run from the modelopt example directory (relative imports).
 
 Pipeline steps:
-1. **Load** BF16 model with sequential device map (3TB model, CPU offload)
-2. **Patch** modelopt at runtime (GPU tensor safety, graceful degradation)
+1. **Load** BF16 model via modelopt's `get_model()` (handles memory limits properly for 3TB model)
+2. **Patch** modelopt at runtime (load_calib_amax forces CPU, export_amax CPU fallback, graceful clamp)
 3. **Quantize + Calibrate** (5-6 hours, 128 samples)
 4. **Snapshot amax to CPU** — copies all quantizer state to CPU and saves to disk (~50MB)
 5. **Save model state** — full state dict to disk (insurance against export crashes)
@@ -63,19 +63,33 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --valid
 |-----|------|--------|-------|--------|------------|-----|
 | 1 | May 7 | shell wrapper (pre-repo) | 256 | ❌ Crashed at batch probing | `o_b_proj` shape mismatch — finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
 | 2 | May 8-9 | shell wrapper (pre-repo) | 128 | ❌ Crashed at export (128/128 calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU immediately after calibration |
-| 3 | May 9 | `3907838` | 128 | 🔄 Running | — | — |
+| 3 | May 9 06:10 | `3907838` | 128 | ❌ Crashed at model loading | `AutoModelForCausalLM.from_pretrained` OOM during expert weight `torch.cat` (31.5GB alloc, 25.9GB free) | Use modelopt `get_model()` which handles max_memory properly |
+| 4 | May 9 ~08:05 | `f9bbef8` | 128 | 🔄 Running | — | — |
 
-**If Run 3 succeeds**, the v2 script (`6eeba26`) is reference only — no need to adopt it.
-**If Run 3 fails at export**, use commit `6eaba26` (adds amax snapshot + force CPU + validate mode).
+**If Run 4 succeeds**, current code is good. No further changes needed.
+**If Run 4 fails**, check the log, identify the crash point, add it to this table.
 
 **Key lesson from Run 2:** The `use_seq_device_map` mode shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled — reading it triggers `cudaErrorIllegalAddress`. The fix is to copy amax to CPU *immediately* after calibration, before any further GPU operations.
 
+**Key lesson from Run 3:** Don't use `AutoModelForCausalLM.from_pretrained` directly for models with large expert weights. The weight conversion step does `torch.cat` on GPU which can OOM. Use modelopt's `get_model()` which sets `max_memory` per GPU before loading.
+
 **Do NOT repeat these mistakes:**
 - Don't use FP8 source model — kernel issues on Blackwell (Run 1)
 - Don't use `--low_memory_mode` with V4 — meta device errors
 - Don't use calib_size=256 — OOMs with 3TB BF16 on CPU offload
+- Don't use `AutoModelForCausalLM.from_pretrained` directly — OOM during expert weight concat (Run 3)
 - Don't assume GPU tensor integrity after 5+ hours of sequential calibration
 
+## Runtime Patches Applied by quantize_nvfp4.py
+
+These are monkey-patches applied at runtime — no modelopt source files are modified.
+
+1. **`load_calib_amax`** — After calibration writes `_amax` to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
+2. **`export_amax`** — If any `_amax` is still on GPU at export time, moves to CPU before reading. Safety net.
+3. **`get_activation_scaling_factor`** — Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
+4. **`snapshot_amax_to_cpu()`** — After calibration, walks all quantizers, copies `_amax` to CPU, saves to disk (~50MB). Insurance policy.
+5. **`force_all_amax_to_cpu()`** — Moves `_pre_quant_scale`, `_global_amax` to CPU too. Nuclear option.
+
 ## Bugs Found (V4 + modelopt 0.45.0.dev64)
 
 1. ~~`QuantDeepseekV4Experts` AttributeError~~ — **Already fixed** in modelopt 0.45.0.dev64 (handles `nn.ModuleList` quantizers natively).
@@ -84,6 +98,7 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --valid
 4. ~~Shell script arg names~~ — No longer relevant (in-process script).
 5. **Export crash — stale GPU tensors in `export_amax()`.** After hours of calibration, quantizer `_amax` on GPU becomes unreadable. Fixed by patching `export_amax` to move `_amax` to CPU before reading.
 6. **Export crash — `assert torch.all(activation_scaling_factor > 0)`.** Amax values from stale GPU reads are garbage (zeros, negatives, NaN). Fixed by clamping instead of asserting, plus snapshotting valid amax to CPU before corruption can occur.
+7. **Model loading OOM during expert weight conversion.** `AutoModelForCausalLM.from_pretrained` does `torch.cat` on GPU for expert `gate_up_proj` (31.5GB alloc), but only 25.9GB free with `device_map="sequential"`. Fixed by using modelopt's `get_model()` which sets `max_memory` per GPU before loading.
 
 ## Dependencies (pinned versions)
 
@@ -101,3 +116,4 @@ The patches in `quantize_nvfp4.py` are for **modelopt 0.45.0.dev64** specificall
 - modelopt has no explicit V4 support — relies on auto-detection of fused experts.
 - The calibration state save (`v4_nvfp4_calibrated_state.pt`) is ~1.5TB. It lives on NVMe, not in git.
 - The amax snapshot (`v4_nvfp4_amax_snapshots.pt`) is ~50MB. Small, critical, cheap insurance.
+- Only divergence from modelopt example path: `get_model()` instead of `AutoModelForCausalLM.from_pretrained` (avoids OOM). Everything else uses the same `hf_ptq` functions.