diff --git a/README.md b/README.md
index b9c297b..0e18c2d 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,8 @@
 
 Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7TB RAM, 13TB NVMe). Target: ~600GB.
 
+**Cost:** ~$161/run at $23/hr (7 hours each). Don't waste runs.
+
 ## Pipeline
 
 ### Step 1: Dequantize FP8 → BF16
@@ -23,44 +25,62 @@ W_bf16 = dequantize_fp4_weight(W_int, S) # per-tensor scale dequant, not .to(bf
 ### Step 2: Run NVFP4 Quantization
 
 ```bash
-python3 scripts/quantize_nvfp4.py
+cd /root/nvidia-meeting/modelopt-repo/examples/llm_ptq
+python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py
 ```
 
-This script runs the full pipeline in-process (not wrapping the shell script):
+Must run from the modelopt example directory (relative imports).
+Pipeline steps:
 
 1. **Load** BF16 model with sequential device map (3TB model, CPU offload)
-2. **Patch** modelopt for V4 compatibility (ModuleList quantizers, GPU tensor safety)
+2. **Patch** modelopt at runtime (GPU tensor safety, graceful degradation)
 3. **Quantize + Calibrate** (5-6 hours, 128 samples)
-4. **SAVE** model state to disk ← insurance against export crashes
-5. **Export** to HF safetensors
+4. **Snapshot amax to CPU** — copies all quantizer state to CPU and saves to disk (~50MB)
+5. **Save model state** — full state dict to disk (insurance against export crashes)
+6. **Export** to HF safetensors
 
-If the export crashes (and it will — modelopt's export reads stale GPU tensors after hours of calibration):
+If the export crashes:
 
 ```bash
-python3 scripts/quantize_nvfp4.py --export-only
+python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --export-only
 ```
 
-This loads the saved calibration state and retries just the export step.
+To validate saved state without running anything:
 
-**Config:**
-- `--quant nvfp4` (full model, not experts-only)
-- `--calib 128` — 128 calibration samples. 256 OOMs with 3TB BF16 on CPU offload.
-- `--kv_cache_quant fp8_cast`
-- `--use_seq_device_map` — sequential device mapping (CPU offload)
-- `--gpu_max_mem_percentage 0.7` — VRAM headroom
+```bash
+python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --validate-only
+```
+
+**Config:** `nvfp4`, 128 calib samples, calib_seq=512, kv_fp8_cast, gpu_mem_pct=0.7
 
 **Calibration datasets:** `abisee/cnn_dailymail` + `nvidia/Nemotron-Post-Training-Dataset-v2` (gated — requires HF token).
 
 **Runtime:** Model loading ~53 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.
 
-## Bugs Found (V4 + modelopt)
+## Run History (forward progression)
 
-1. `QuantDeepseekV4Experts` AttributeError — V4 uses `nn.ModuleList` for per-expert quantizers, modelopt expected singular `TensorQuantizer`. Patched in `quantize_nvfp4.py`.
+| Run | Date | Script | Calib | Result | Root Cause | Fix |
+|-----|------|--------|-------|--------|------------|-----|
+| 1 | May 7 | shell wrapper, FP8 source | 256 | ❌ Crashed at batch probing | `o_b_proj` shape mismatch — finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
+| 2 | May 8-9 | shell wrapper, BF16 source | 128 | ❌ Crashed at export (128/128 calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU immediately after calibration |
+| 3 | May 9 | `quantize_nvfp4.py` v1 | 128 | 🔄 Running | — | — |
+
+**Key lesson from Run 2:** The `use_seq_device_map` mode shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled — reading it triggers `cudaErrorIllegalAddress`. The fix is to copy amax to CPU *immediately* after calibration, before any further GPU operations.
+
+**Do NOT repeat these mistakes:**
+- Don't use FP8 source model — kernel issues on Blackwell (Run 1)
+- Don't use `--low_memory_mode` with V4 — meta device errors
+- Don't use calib_size=256 — OOMs with 3TB BF16 on CPU offload
+- Don't assume GPU tensor integrity after 5+ hours of sequential calibration
+
+## Bugs Found (V4 + modelopt 0.45.0.dev64)
+
+1. ~~`QuantDeepseekV4Experts` AttributeError~~ — **Already fixed** in modelopt 0.45.0.dev64 (handles `nn.ModuleList` quantizers natively).
 2. `--low_memory_mode` → meta device error. Don't use with V4.
 3. Missing `kernels` package for FP8 ops. `pip install -U kernels`.
-4. `--calib` not `--calib_size`, `--quant` not `--qformat` (shell script arg names — no longer relevant, we run in-process).
-5. **Export crash — stale GPU tensors.** After 5+ hours of calibration, modelopt's export step reads quantizer amax tensors that have been sitting in VRAM for hours. CUDA illegal memory access. Fixed by moving quantizer tensors to CPU before export.
-6. **Export crash — `assert torch.all(activation_scaling_factor > 0)`.** Related to #5. The amax values from stale GPU reads are garbage. Fixed by clamping instead of asserting.
+4. ~~Shell script arg names~~ — No longer relevant (in-process script).
+5. **Export crash — stale GPU tensors in `export_amax()`.** After hours of calibration, quantizer `_amax` on GPU becomes unreadable. Fixed by patching `export_amax` to move `_amax` to CPU before reading.
+6. **Export crash — `assert torch.all(activation_scaling_factor > 0)`.** Amax values from stale GPU reads are garbage (zeros, negatives, NaN). Fixed by clamping instead of asserting, plus snapshotting valid amax to CPU before corruption can occur.
 
 ## Dependencies (pinned versions)
 
@@ -69,7 +89,7 @@ This loads the saved calibration state and retries just the export step.
 - **kernels:** latest (`pip install -U kernels` — needed for finegrained FP8 ops)
 - **Python:** 3.10
 
-The patches in `quantize_nvfp4.py` are for **modelopt 0.45.0.dev64** specifically. Later versions may include fixes natively.
+The patches in `quantize_nvfp4.py` are for **modelopt 0.45.0.dev64** specifically. Later versions may include fixes natively — check before applying.
 
 ## Key Notes
 
@@ -77,3 +97,4 @@ The patches in `quantize_nvfp4.py` are for **modelopt 0.45.0.dev64** specificall
 - `--low_memory_mode` causes meta device errors with V4 — don't use.
 - modelopt has no explicit V4 support — relies on auto-detection of fused experts.
 - The calibration state save (`v4_nvfp4_calibrated_state.pt`) is ~1.5TB. It lives on NVMe, not in git.
+- The amax snapshot (`v4_nvfp4_amax_snapshots.pt`) is ~50MB. Small, critical, cheap insurance.
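
The Run 2 lesson (copy quantizer amax to CPU right after calibration, then clamp rather than assert at export) can be sketched roughly as below. This is a minimal illustration, not the code in `quantize_nvfp4.py`: it assumes quantizers expose their calibration maximum as an `_amax` tensor attribute (the name used in the bug list), and `FakeQuantizer`, `snapshot_amax`, and `sanitize_scale` are hypothetical names introduced only for this example.

```python
import torch
from torch import nn


class FakeQuantizer(nn.Module):
    """Hypothetical stand-in for a calibrated quantizer holding an `_amax` buffer."""

    def __init__(self, amax: float) -> None:
        super().__init__()
        self.register_buffer("_amax", torch.tensor(amax))


def snapshot_amax(model: nn.Module) -> dict[str, torch.Tensor]:
    """Return independent CPU copies of every `_amax` tensor found in the model."""
    snapshot = {}
    for name, module in model.named_modules():
        amax = getattr(module, "_amax", None)
        if isinstance(amax, torch.Tensor):
            # detach().cpu().clone() yields a host copy that does not share
            # storage with the original (possibly GPU-resident) tensor.
            snapshot[name] = amax.detach().cpu().clone()
    return snapshot


def sanitize_scale(scale: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Force a scaling factor to be strictly positive instead of asserting on it."""
    # Replace NaN first, then clamp zeros and negatives up to eps.
    return torch.nan_to_num(scale.float(), nan=eps).clamp(min=eps)


# Tiny CPU-only demo: two fake quantizers around an ordinary layer.
model = nn.Sequential(FakeQuantizer(1.5), nn.Linear(2, 2), FakeQuantizer(4.0))
snapshot = snapshot_amax(model)   # keys are module names, here "0" and "2"
model[0]._amax.fill_(0.0)         # simulate later corruption of the live tensor
cleaned = sanitize_scale(torch.tensor([0.0, -1.0, float("nan"), 2.0]))
```

In a real run, the returned dict could be written with `torch.save(snapshot, path)` immediately after calibration and reloaded with `torch.load(path, map_location="cpu")` before an export retry. The point is that the copy is taken while the values are still known to be good.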
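
As a rough cross-check of the **Cost** line against the **Runtime** line, the per-run cost can be recomputed from the stage durations. The rate and durations are the README's own figures. The stage names and the `run_cost` helper are illustrative only.

```python
HOURLY_RATE_USD = 23.0  # README: $23/hr for the node

# Stage durations in minutes as (low, high), read off the Runtime line.
STAGES_MIN = {
    "model_load": (53, 53),     # ~53 min
    "calibration": (330, 330),  # ~5.5 hours
    "export": (30, 60),         # ~30-60 min
}


def run_cost(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Cost in USD of a run lasting `hours`, rounded to cents."""
    return round(hours * rate, 2)


low_hours = sum(low for low, _ in STAGES_MIN.values()) / 60
high_hours = sum(high for _, high in STAGES_MIN.values()) / 60

print(f"stage sum: {low_hours:.2f}-{high_hours:.2f} h "
      f"-> ${run_cost(low_hours):.2f}-${run_cost(high_hours):.2f}")
print(f"7 h run: ${run_cost(7):.2f}, 8 h run: ${run_cost(8):.2f}")
```

With these inputs, a 7-hour run matches the stated ~$161, while a run at the 8-hour end of the stated total would cost $184.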