From eb80bd6f80eb5f00f600e6d73d01d7494c92a3fc Mon Sep 17 00:00:00 2001
From: biondizzle
Date: Sat, 9 May 2026 23:00:17 +0000
Subject: [PATCH] README + memory: Run 10 result (export crash in get_weight_scaling_factor), Run 11 running

- Run 10: calibration succeeded but export crashed in get_weight_scaling_factor (stale GPU weight, not just amax). Patch 4 forces weight to CPU at _export_quantized_weight entry point, covering the entire export chain.
- Updated Key Lessons with Run 10 analysis
- Updated Runtime Patches section to document all 8 patches
- Added Bug #8 (stale GPU weight tensors)
- Updated Do NOT Repeat list
---
 README.md | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 793c772..da40ce1 100644
--- a/README.md
+++ b/README.md
@@ -78,7 +78,8 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --valid
 | 7 | May 9 ~13:40 | `25b4d8d` | 128 | ❌ Dataloader crash | `dataset=None`, `len()` on None | Provided dataset list |
 | 8 | May 9 ~14:00 | `b2849a8` | 128 | ❌ Argparse crash | Wrong flag names (shell script names vs `hf_ptq.py` names) | Use `hf_ptq.py` flag names |
 | 9 | May 9 ~14:30 | `a300302` | 128 | ❌ TypeError | Skipped `__main__` post-parse conversions (`calib_size` still string, not int list) | Apply same conversions after `parse_args()` |
-| 10 | May 9 ~15:30 | `5a72da7` | 128 | 🔄 Running | — | Calls `hf_main(args)` directly |
+| 10 | May 9 ~15:30 | `5a72da7` | 128 | ❌ Export crash (calib ✅) | `get_weight_scaling_factor` reads stale GPU weight → `cudaErrorIllegalAddress` | Patch `_export_quantized_weight` to force weight to CPU at entry point |
+| 11 | May 9 ~22:50 | `07cd50e` | 128 | 🔄 Running | — | 8 patches covering full export chain |
 
 ### Key Lessons
 
@@ -90,25 +91,40 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --valid
 
 **Run 9 — `__main__` gap:** `hf_ptq.py` does critical type conversions in its `__main__` block (string → list for `dataset`, string → int list for `calib_size`). When calling `main()` directly, these are skipped. Fix: apply the same conversions after `parse_args()`.
 
+**Run 10 — Stale GPU weight tensors in export:** The amax patches (Patch 1-3) only cover quantizer state. The model *weights* themselves are also on stale GPU. `get_weight_scaling_factor` does `weight_scaling_factor_2.to(weight.device)` which triggers `cudaErrorIllegalAddress` because `weight` is on stale GPU. Fix: patch `_export_quantized_weight` (the entry point for each module's export) to force `weight` to CPU before any downstream code reads it. This covers the entire chain: `get_weight_scaling_factor`, `get_weights_scaling_factor_from_quantizer`, `to_quantized_weight`, `weight.to(dtype)` — all resolve to CPU because `weight.device` is CPU.
+
 ### Do NOT Repeat These Mistakes
 
 - Don't use FP8 source model — kernel issues on Blackwell (Run 1)
 - Don't use `--low_memory_mode` with V4 — meta device errors
 - Don't use `calib_size=256` — OOMs with 3TB BF16 on CPU offload
 - Don't use `AutoModelForCausalLM.from_pretrained` directly — OOM during expert weight concat (Run 3)
-- Don't assume GPU tensor integrity after 5+ hours of sequential calibration (Run 2)
+- Don't assume GPU tensor integrity after 5+ hours of sequential calibration (Run 2, Run 10)
 - Don't rewrite the hf_ptq pipeline — call `hf_main()` directly (Runs 4–8)
 - Don't skip the `__main__` post-parse conversions — `calib_size` must be int list, `dataset` must be list (Run 9)
 - Don't use shell script arg names (`--quant`, `--calib`, `--kv_cache_quant`, `--tp`) — use `hf_ptq.py` names (`--qformat`, `--calib_size`, `--kv_cache_qformat`, `--inference_tensor_parallel`)
+- Don't patch individual export functions one at a time — patch the entry point (`_export_quantized_weight`) so weight is on CPU for the entire chain (Run 10)
 
 ## Runtime Patches Applied by quantize_nvfp4.py
 
 These are monkey-patches applied at runtime — no modelopt source files are modified.
 
+### Calibration-time patches (applied before pipeline runs)
+
 1. **`TensorQuantizer.load_calib_amax`** — After calibration writes `_amax` to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
 2. **`TensorQuantizer.export_amax`** — If `_amax` is still on GPU at export time, moves to CPU before reading. Safety net.
 3. **`NVFP4QTensor.get_activation_scaling_factor`** — Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
 
+### Export-time patches (force stale GPU tensors to CPU at entry points)
+
+4. **`_export_quantized_weight`** (KEY PATCH) — Forces weight + all quantizer state to CPU *before* any downstream code reads them. This is the entry point for exporting each linear layer. By forcing weight to CPU here, every downstream `.to(weight.device)` resolves to CPU, covering the entire chain: `get_weight_scaling_factor`, `get_weights_scaling_factor_from_quantizer`, `to_quantized_weight`, `weight.to(dtype)`.
+5. **`_export_fused_experts`** — Same treatment for MoE expert weights (DeepseekV4Experts go through this path). Forces expert weights, buffers, and quantizer state to CPU.
+6. **`to_quantized_weight`** — Forces weight and scaling factors to CPU. Redundant if Patch 4 works, but catches any code path that reaches this function without going through `_export_quantized_weight`.
+7. **`get_weight_scaling_factor`** — Forces weight + quantizer to CPU. Redundant if Patch 4 works.
+8. **`get_weight_scaling_factor_2`** — Forces quantizer state to CPU. Redundant if Patch 4 works.
+
+Patches 6-8 are belt-and-suspenders. Patch 4 is the one that matters — it moves weight to CPU at the earliest possible point in the export chain, making all downstream stale GPU reads impossible.
+
 ### Post-Calibration Hook
 
 `export_quantized` is monkey-patched to run these steps before the real export:
@@ -126,6 +142,7 @@ These are monkey-patches applied at runtime — no modelopt source files are mod
 5. **Export crash — stale GPU tensors in `export_amax()`.** After hours of calibration, quantizer `_amax` on GPU becomes unreadable. Fixed by patching `export_amax` to move `_amax` to CPU before reading.
 6. **Export crash — `assert torch.all(activation_scaling_factor > 0)`.** Amax values from stale GPU reads are garbage (zeros, negatives, NaN). Fixed by clamping instead of asserting, plus snapshotting valid amax to CPU before corruption can occur.
 7. **Model loading OOM during expert weight conversion.** `AutoModelForCausalLM.from_pretrained` does `torch.cat` on GPU for expert `gate_up_proj` (31.5GB alloc), but only 25.9GB free with `device_map="sequential"`. Fixed by using modelopt's `get_model()` which sets `max_memory` per GPU before loading.
+8. **Export crash — stale GPU weight tensors in `get_weight_scaling_factor`.** Patches 1-3 only covered quantizer amax. The model weights themselves are also on stale GPU. `weight_scaling_factor_2.to(weight.device)` triggers `cudaErrorIllegalAddress`. Fixed by patching `_export_quantized_weight` to force weight to CPU at the entry point, covering the entire export chain.
 
 ## Dependencies (pinned versions)
 
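Note on Patch 4: the entry-point idea described in the Runtime Patches hunk can be sketched without torch or modelopt. This is only an illustration under stated assumptions: `FakeTensor`, `export_mod`, `force_weight_to_cpu`, and both function bodies are hypothetical stand-ins, and the real `_export_quantized_weight` signature and the actual patch in `quantize_nvfp4.py` may differ. It shows why wrapping the entry point is enough for downstream code that follows `weight.device`.

```python
import functools
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; it only tracks a device name."""
    device: str

    def to(self, device):
        return FakeTensor(device=device)


seen_devices = []  # devices observed by downstream code, for inspection


def get_weight_scaling_factor(weight):
    # Downstream code follows weight.device, as in `.to(weight.device)`.
    seen_devices.append(weight.device)
    return FakeTensor("cpu").to(weight.device)


def _export_quantized_weight(module):
    # Invented body: the entry point hands the weight to downstream helpers.
    return export_mod.get_weight_scaling_factor(module.weight)


# Hypothetical namespace standing in for the library's export module.
export_mod = SimpleNamespace(
    get_weight_scaling_factor=get_weight_scaling_factor,
    _export_quantized_weight=_export_quantized_weight,
)


def force_weight_to_cpu(export_fn):
    """Wrap an export entry point so the weight is on CPU before it runs."""
    @functools.wraps(export_fn)
    def wrapper(module, *args, **kwargs):
        if module.weight.device != "cpu":
            module.weight = module.weight.to("cpu")
        return export_fn(module, *args, **kwargs)
    return wrapper


# Monkey-patch at runtime; no library source file is edited.
export_mod._export_quantized_weight = force_weight_to_cpu(
    export_mod._export_quantized_weight
)

layer = SimpleNamespace(weight=FakeTensor("cuda:0"))
scale = export_mod._export_quantized_weight(layer)
```

Because the wrapper rewrites `module.weight` before the original function runs, every helper further down sees a CPU weight without being patched itself. Whether moving a genuinely stale GPU tensor to CPU succeeds on real hardware is outside what this sketch demonstrates.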