Update README and memory: Run 3 OOM crash, Run 4 running on f9bbef8
- Added Run 3 to table (model loading OOM, fixed with get_model())
- Added Run 4 (current, commit f9bbef8)
- Added bug #7 (model loading OOM during expert weight concat)
- Added 'do NOT repeat' for AutoModelForCausalLM.from_pretrained
- Documented all 5 runtime patches
- Noted only divergence from modelopt example: get_model()
26
README.md
@@ -32,8 +32,8 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py
 Must run from the modelopt example directory (relative imports).
 
 Pipeline steps:
-1. **Load** BF16 model with sequential device map (3TB model, CPU offload)
-2. **Patch** modelopt at runtime (GPU tensor safety, graceful degradation)
+1. **Load** BF16 model via modelopt's `get_model()` (handles memory limits properly for 3TB model)
+2. **Patch** modelopt at runtime (load_calib_amax forces CPU, export_amax CPU fallback, graceful clamp)
 3. **Quantize + Calibrate** (5-6 hours, 128 samples)
 4. **Snapshot amax to CPU** — copies all quantizer state to CPU and saves to disk (~50MB)
 5. **Save model state** — full state dict to disk (insurance against export crashes)
@@ -63,19 +63,33 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --valid
 |-----|------|--------|-------|--------|------------|-----|
 | 1 | May 7 | shell wrapper (pre-repo) | 256 | ❌ Crashed at batch probing | `o_b_proj` shape mismatch — finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
 | 2 | May 8-9 | shell wrapper (pre-repo) | 128 | ❌ Crashed at export (128/128 calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU immediately after calibration |
-| 3 | May 9 | `3907838` | 128 | 🔄 Running | — | — |
+| 3 | May 9 06:10 | `3907838` | 128 | ❌ Crashed at model loading | `AutoModelForCausalLM.from_pretrained` OOM during expert weight `torch.cat` (31.5GB alloc, 25.9GB free) | Use modelopt `get_model()` which handles max_memory properly |
+| 4 | May 9 ~08:05 | `f9bbef8` | 128 | 🔄 Running | — | — |
 
-**If Run 3 succeeds**, the v2 script (`6eeba26`) is reference only — no need to adopt it.
-**If Run 3 fails at export**, use commit `6eaba26` (adds amax snapshot + force CPU + validate mode).
+**If Run 4 succeeds**, current code is good. No further changes needed.
+**If Run 4 fails**, check the log, identify the crash point, add it to this table.
 
 **Key lesson from Run 2:** The `use_seq_device_map` mode shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled — reading it triggers `cudaErrorIllegalAddress`. The fix is to copy amax to CPU *immediately* after calibration, before any further GPU operations.
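
The Run 2 fix amounts to copying quantizer state off the GPU while it is still readable. Below is a minimal sketch of that idea, assuming each quantizer exposes its calibration result as an `_amax` tensor attribute. `ToyQuantizer`, `snapshot_amax`, and the output path are hypothetical names used for illustration; this is not the script's actual code.

```python
import torch
from torch import nn


class ToyQuantizer(nn.Module):
    """Hypothetical stand-in for a quantizer that stores its range in `_amax`."""

    def __init__(self, value: float):
        super().__init__()
        self.register_buffer("_amax", torch.tensor(value))


def snapshot_amax(model: nn.Module, path: str) -> dict:
    """Copy every `_amax` tensor found on the model's submodules to CPU and save them."""
    snapshot = {}
    for name, module in model.named_modules():
        amax = getattr(module, "_amax", None)
        if isinstance(amax, torch.Tensor):
            # clone() guarantees the CPU copy shares no storage with the original tensor
            snapshot[name] = amax.detach().cpu().clone()
    torch.save(snapshot, path)
    return snapshot
```

Calling this once, right after calibration returns, leaves an independent CPU copy on disk that later GPU activity cannot affect.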
 
+**Key lesson from Run 3:** Don't use `AutoModelForCausalLM.from_pretrained` directly for models with large expert weights. The weight conversion step does `torch.cat` on GPU which can OOM. Use modelopt's `get_model()` which sets `max_memory` per GPU before loading.
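
To illustrate why a per-GPU `max_memory` cap helps: if the loader may only fill part of each GPU with weights, the remainder stays free for transient allocations such as the expert `torch.cat`. The sketch below builds such a budget. How `get_model()` actually computes its limits is not shown in this README, so `build_max_memory`, the reserve size, and the free-memory figures are all assumptions.

```python
GIB = 1024 ** 3


def build_max_memory(free_bytes_per_gpu, reserve_bytes):
    """Return a {gpu_index: bytes} budget that leaves `reserve_bytes` unassigned on each GPU."""
    budget = {}
    for index, free in enumerate(free_bytes_per_gpu):
        # never go negative if a GPU has less free memory than the reserve
        budget[index] = max(free - reserve_bytes, 0)
    return budget


# Hypothetical numbers: two GPUs with 100 GiB and 20 GiB free, 32 GiB kept back on each.
example_budget = build_max_memory([100 * GIB, 20 * GIB], 32 * GIB)
```

A dictionary of this shape (device index mapped to a byte limit) is the kind of value a `max_memory` argument takes. The reserve should exceed the largest temporary allocation expected during loading, which in Run 3 was 31.5GB.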
+
 **Do NOT repeat these mistakes:**
 - Don't use FP8 source model — kernel issues on Blackwell (Run 1)
 - Don't use `--low_memory_mode` with V4 — meta device errors
 - Don't use calib_size=256 — OOMs with 3TB BF16 on CPU offload
+- Don't use `AutoModelForCausalLM.from_pretrained` directly — OOM during expert weight concat (Run 3)
 - Don't assume GPU tensor integrity after 5+ hours of sequential calibration
 
+## Runtime Patches Applied by quantize_nvfp4.py
+
+These are monkey-patches applied at runtime — no modelopt source files are modified.
+
+1. **`load_calib_amax`** — After calibration writes `_amax` to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
+2. **`export_amax`** — If any `_amax` is still on GPU at export time, moves to CPU before reading. Safety net.
+3. **`get_activation_scaling_factor`** — Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
+4. **`snapshot_amax_to_cpu()`** — After calibration, walks all quantizers, copies `_amax` to CPU, saves to disk (~50MB). Insurance policy.
+5. **`force_all_amax_to_cpu()`** — Moves `_pre_quant_scale`, `_global_amax` to CPU too. Nuclear option.
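
The general shape of such a runtime patch can be sketched in plain Python: wrap an existing method so that extra work runs after the original returns, without editing the library's source. Here `ToyQuantizer` is a hypothetical stand-in class, and `patch_method` and `move_to_cpu` are illustrative helpers, not functions from modelopt or from `quantize_nvfp4.py`.

```python
import functools


class ToyQuantizer:
    """Hypothetical stand-in for a library class we cannot edit."""

    def __init__(self):
        self.device = "cuda"
        self.amax = None

    def load_calib_amax(self):
        self.amax = 1.0
        return self.amax


def patch_method(cls, name, after):
    """Replace cls.<name> with a wrapper that calls `after(self)` once the original returns."""
    original = getattr(cls, name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        after(self)
        return result

    setattr(cls, name, wrapper)
    return original  # keep a handle so the patch can be undone


def move_to_cpu(quantizer):
    # stands in for moving the real tensor; here we only flip a label
    quantizer.device = "cpu"


original_load = patch_method(ToyQuantizer, "load_calib_amax", move_to_cpu)
```

Returning the original function makes the patch reversible with a single `setattr`, which is convenient when the patch targets one specific library version.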
+
 ## Bugs Found (V4 + modelopt 0.45.0.dev64)
 
 1. ~~`QuantDeepseekV4Experts` AttributeError~~ — **Already fixed** in modelopt 0.45.0.dev64 (handles `nn.ModuleList` quantizers natively).
@@ -84,6 +98,7 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --valid
 4. ~~Shell script arg names~~ — No longer relevant (in-process script).
 5. **Export crash — stale GPU tensors in `export_amax()`.** After hours of calibration, quantizer `_amax` on GPU becomes unreadable. Fixed by patching `export_amax` to move `_amax` to CPU before reading.
 6. **Export crash — `assert torch.all(activation_scaling_factor > 0)`.** Amax values from stale GPU reads are garbage (zeros, negatives, NaN). Fixed by clamping instead of asserting, plus snapshotting valid amax to CPU before corruption can occur.
+7. **Model loading OOM during expert weight conversion.** `AutoModelForCausalLM.from_pretrained` does `torch.cat` on GPU for expert `gate_up_proj` (31.5GB alloc), but only 25.9GB free with `device_map="sequential"`. Fixed by using modelopt's `get_model()` which sets `max_memory` per GPU before loading.
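
The "clamp instead of assert" approach from item 6 can be illustrated with plain floats. This is a sketch only: the floor value and the name `sanitize_scales` are assumptions, and the patched function operates on tensors rather than Python lists.

```python
import math


def sanitize_scales(scales, floor=1e-6):
    """Replace NaN, zero, and negative scaling factors with `floor`; report how many changed."""
    cleaned = []
    replaced = 0
    for value in scales:
        # NaN needs an explicit check because every comparison against NaN is False
        if math.isnan(value) or value <= 0:
            cleaned.append(floor)
            replaced += 1
        else:
            cleaned.append(value)
    return cleaned, replaced
```

Returning the replacement count keeps the degradation visible: export can proceed, and a non-zero count still signals that some amax values were corrupted and the snapshot should be consulted.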
 
 ## Dependencies (pinned versions)
 
@@ -101,3 +116,4 @@ The patches in `quantize_nvfp4.py` are for **modelopt 0.45.0.dev64** specificall
 - modelopt has no explicit V4 support — relies on auto-detection of fused experts.
 - The calibration state save (`v4_nvfp4_calibrated_state.pt`) is ~1.5TB. It lives on NVMe, not in git.
 - The amax snapshot (`v4_nvfp4_amax_snapshots.pt`) is ~50MB. Small, critical, cheap insurance.
+- Only divergence from modelopt example path: `get_model()` instead of `AutoModelForCausalLM.from_pretrained` (avoids OOM). Everything else uses the same `hf_ptq` functions.