From eb80bd6f80eb5f00f600e6d73d01d7494c92a3fc Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Sat, 9 May 2026 23:00:17 +0000
Subject: [PATCH] README + memory: Run 10 result (export crash in
 get_weight_scaling_factor), Run 11 running

- Run 10: calibration succeeded but export crashed in get_weight_scaling_factor
  (stale GPU weight, not just amax). Patch 4 forces weight to CPU at
  _export_quantized_weight entry point, covering the entire export chain.
- Updated Key Lessons with Run 10 analysis
- Updated Runtime Patches section to document all 8 patches
- Added Bug #8 (stale GPU weight tensors)
- Updated Do NOT Repeat list
---
 README.md | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 793c772..da40ce1 100644
--- a/README.md
+++ b/README.md
@@ -78,7 +78,8 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --valid
 | 7 | May 9 ~13:40 | `25b4d8d` | 128 | ❌ Dataloader crash | `dataset=None`, `len()` on None | Provided dataset list |
 | 8 | May 9 ~14:00 | `b2849a8` | 128 | ❌ Argparse crash | Wrong flag names (shell script names vs `hf_ptq.py` names) | Use `hf_ptq.py` flag names |
 | 9 | May 9 ~14:30 | `a300302` | 128 | ❌ TypeError | Skipped `__main__` post-parse conversions (`calib_size` still string, not int list) | Apply same conversions after `parse_args()` |
-| 10 | May 9 ~15:30 | `5a72da7` | 128 | 🔄 Running | — | Calls `hf_main(args)` directly |
+| 10 | May 9 ~15:30 | `5a72da7` | 128 | ❌ Export crash (calib ✅) | `get_weight_scaling_factor` reads stale GPU weight → `cudaErrorIllegalAddress` | Patch `_export_quantized_weight` to force weight to CPU at entry point |
+| 11 | May 9 ~22:50 | `07cd50e` | 128 | 🔄 Running | — | 8 patches covering full export chain |
 
 ### Key Lessons
 
@@ -90,25 +91,40 @@ python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --valid
 
 **Run 9 — `__main__` gap:** `hf_ptq.py` does critical type conversions in its `__main__` block (string → list for `dataset`, string → int list for `calib_size`). When calling `main()` directly, these are skipped. Fix: apply the same conversions after `parse_args()`.
 
+**Run 10 — Stale GPU weight tensors in export:** The amax patches (Patch 1-3) only cover quantizer state. The model *weights* themselves are also on stale GPU. `get_weight_scaling_factor` does `weight_scaling_factor_2.to(weight.device)` which triggers `cudaErrorIllegalAddress` because `weight` is on stale GPU. Fix: patch `_export_quantized_weight` (the entry point for each module's export) to force `weight` to CPU before any downstream code reads it. This covers the entire chain: `get_weight_scaling_factor`, `get_weights_scaling_factor_from_quantizer`, `to_quantized_weight`, `weight.to(dtype)` — all resolve to CPU because `weight.device` is CPU.
+
 ### Do NOT Repeat These Mistakes
 
 - Don't use FP8 source model — kernel issues on Blackwell (Run 1)
 - Don't use `--low_memory_mode` with V4 — meta device errors
 - Don't use `calib_size=256` — OOMs with 3TB BF16 on CPU offload
 - Don't use `AutoModelForCausalLM.from_pretrained` directly — OOM during expert weight concat (Run 3)
-- Don't assume GPU tensor integrity after 5+ hours of sequential calibration (Run 2)
+- Don't assume GPU tensor integrity after 5+ hours of sequential calibration (Run 2, Run 10)
 - Don't rewrite the hf_ptq pipeline — call `hf_main()` directly (Runs 4–8)
 - Don't skip the `__main__` post-parse conversions — `calib_size` must be int list, `dataset` must be list (Run 9)
 - Don't use shell script arg names (`--quant`, `--calib`, `--kv_cache_quant`, `--tp`) — use `hf_ptq.py` names (`--qformat`, `--calib_size`, `--kv_cache_qformat`, `--inference_tensor_parallel`)
+- Don't patch individual export functions one at a time — patch the entry point (`_export_quantized_weight`) so weight is on CPU for the entire chain (Run 10)
 
 ## Runtime Patches Applied by quantize_nvfp4.py
 
 These are monkey-patches applied at runtime — no modelopt source files are modified.
 
+### Calibration-time patches (applied before pipeline runs)
+
 1. **`TensorQuantizer.load_calib_amax`** — After calibration writes `_amax` to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
 2. **`TensorQuantizer.export_amax`** — If `_amax` is still on GPU at export time, moves to CPU before reading. Safety net.
 3. **`NVFP4QTensor.get_activation_scaling_factor`** — Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
 
+### Export-time patches (force stale GPU tensors to CPU at entry points)
+
+4. **`_export_quantized_weight`** (KEY PATCH) — Forces weight + all quantizer state to CPU *before* any downstream code reads them. This is the entry point for exporting each linear layer. By forcing weight to CPU here, every downstream `.to(weight.device)` resolves to CPU, covering the entire chain: `get_weight_scaling_factor`, `get_weights_scaling_factor_from_quantizer`, `to_quantized_weight`, `weight.to(dtype)`.
+5. **`_export_fused_experts`** — Same treatment for MoE expert weights (DeepseekV4Experts go through this path). Forces expert weights, buffers, and quantizer state to CPU.
+6. **`to_quantized_weight`** — Forces weight and scaling factors to CPU. Redundant if Patch 4 works, but catches any code path that reaches this function without going through `_export_quantized_weight`.
+7. **`get_weight_scaling_factor`** — Forces weight + quantizer to CPU. Redundant if Patch 4 works.
+8. **`get_weight_scaling_factor_2`** — Forces quantizer state to CPU. Redundant if Patch 4 works.
+
+Patches 6-8 are belt-and-suspenders. Patch 4 is the one that matters — it moves weight to CPU at the earliest possible point in the export chain, making all downstream stale GPU reads impossible.
+
 ### Post-Calibration Hook
 
 `export_quantized` is monkey-patched to run these steps before the real export:
@@ -126,6 +142,7 @@ These are monkey-patches applied at runtime — no modelopt source files are mod
 5. **Export crash — stale GPU tensors in `export_amax()`.** After hours of calibration, quantizer `_amax` on GPU becomes unreadable. Fixed by patching `export_amax` to move `_amax` to CPU before reading.
 6. **Export crash — `assert torch.all(activation_scaling_factor > 0)`.** Amax values from stale GPU reads are garbage (zeros, negatives, NaN). Fixed by clamping instead of asserting, plus snapshotting valid amax to CPU before corruption can occur.
 7. **Model loading OOM during expert weight conversion.** `AutoModelForCausalLM.from_pretrained` does `torch.cat` on GPU for expert `gate_up_proj` (31.5GB alloc), but only 25.9GB free with `device_map="sequential"`. Fixed by using modelopt's `get_model()` which sets `max_memory` per GPU before loading.
+8. **Export crash — stale GPU weight tensors in `get_weight_scaling_factor`.** Patches 1-3 only covered quantizer amax. The model weights themselves are also on stale GPU. `weight_scaling_factor_2.to(weight.device)` triggers `cudaErrorIllegalAddress`. Fixed by patching `_export_quantized_weight` to force weight to CPU at the entry point, covering the entire export chain.
 
 ## Dependencies (pinned versions)