- Added Run 3 to table (model loading OOM, fixed with get_model())
- Added Run 4 (current, commit f9bbef8)
- Added bug #7 (model loading OOM during expert weight concat)
- Added 'do NOT repeat' for AutoModelForCausalLM.from_pretrained
- Documented all 5 runtime patches
- Noted only divergence from modelopt example: get_model()
# DeepSeek V4 Pro → NVFP4 Quantization

Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7TB RAM, 13TB NVMe). Target: ~600GB.

**Cost:** ~$161/run at $23/hr (7 hours each). Don't waste runs.
## Pipeline

### Step 1: Dequantize FP8 → BF16

```bash
python3 scripts/dequant_fp8_to_bf16.py /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 /root/nvidia-meeting/DeepSeek-V4-Pro-BF16
```
The original V4 weights use mixed precision (FP8 attention + FP4/E2M1 experts with per-tensor scales). We dequantize everything to pure BF16 so modelopt can run calibration without hitting broken FP8 kernel paths on Blackwell (DeepGEMM is unsupported, and the Triton finegrained FP8 matmul hits shape mismatches).

This is not a blind upcast. It applies the actual scale factors:
```
W_bf16 = dequantize_fp4_weight(W_int, S)  # per-tensor scale dequant, not .to(bfloat16)
```
**Byte-exact verified:** the matmul diff against the official inference path is 0.000000.
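To make "applies the actual scale factors" concrete, here is a minimal pure-Python sketch of E2M1 decoding followed by a per-tensor scale. It is not the code in `dequant_fp8_to_bf16.py`, which is not shown here and works on tensors. The function names and the low-nibble-first packing order are assumptions made for illustration.

```python
# Magnitudes encoded by the 3 low bits of an E2M1 value
# (2 exponent bits, 1 mantissa bit, exponent bias 1).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code; bit 3 is the sign bit."""
    if not 0 <= code <= 0xF:
        raise ValueError(f"not a 4-bit code: {code}")
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def unpack_nibbles(packed: bytes) -> list[int]:
    """Split each byte into two 4-bit codes.

    Low nibble first is an ASSUMPTION; the real checkpoint layout may differ.
    """
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def dequantize_fp4(packed: bytes, scale: float) -> list[float]:
    """Decode packed E2M1 codes, then multiply by a single per-tensor scale."""
    return [decode_e2m1(code) * scale for code in unpack_nibbles(packed)]
```

A plain `.to(bfloat16)` on the stored integers would skip both the E2M1 lookup and the multiplication by the scale, which is why the upcast has to go through a dequantization routine.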
### Step 2: Run NVFP4 Quantization

```bash
cd /root/nvidia-meeting/modelopt-repo/examples/llm_ptq
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py
```
The script must be run from the modelopt example directory because it relies on relative imports.

Pipeline steps:

1. **Load** the BF16 model via modelopt's `get_model()` (handles memory limits properly for the 3TB model)
2. **Patch** modelopt at runtime (`load_calib_amax` forces CPU, `export_amax` CPU fallback, graceful clamp)
3. **Quantize + Calibrate** (5-6 hours, 128 samples)
4. **Snapshot amax to CPU:** copies all quantizer state to CPU and saves it to disk (~50MB)
5. **Save model state:** full state dict to disk (insurance against export crashes)
6. **Export** to HF safetensors

If the export crashes, rerun with `--export-only`:
```bash
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --export-only
```

To validate saved state without running anything:

```bash
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --validate-only
```
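As a rough illustration of how the two flags could map onto the six pipeline steps, the sketch below uses `argparse` to choose a stage list. Only the flag names `--export-only` and `--validate-only` come from this README. The stage names and the choice of stages for each mode are assumptions, not the actual internals of `quantize_nvfp4.py`.

```python
import argparse

# Stage names are hypothetical labels for the six pipeline steps listed above.
ALL_STAGES = ["load", "patch", "calibrate", "snapshot_amax", "save_state", "export"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Staged NVFP4 quantization (sketch)")
    # The two modes are mutually exclusive: argparse rejects passing both.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--export-only", action="store_true",
                      help="skip calibration and export from saved state")
    mode.add_argument("--validate-only", action="store_true",
                      help="only check the saved state, run nothing else")
    return parser


def plan_stages(argv: list[str]) -> list[str]:
    """Return the ordered stages to run for the given command-line arguments."""
    args = build_parser().parse_args(argv)
    if args.validate_only:
        return ["validate_saved_state"]
    if args.export_only:
        # ASSUMPTION: export-only still applies the runtime patches first.
        return ["patch", "load_saved_state", "export"]
    return list(ALL_STAGES)
```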
**Config:** `nvfp4`, 128 calib samples, calib_seq=512, kv_fp8_cast, gpu_mem_pct=0.7

**Calibration datasets:** `abisee/cnn_dailymail` + `nvidia/Nemotron-Post-Training-Dataset-v2` (gated; requires an HF token).

**Runtime:** Model loading ~53 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.
## Run History (forward progression)

| Run | Date | Commit | Calib | Result | Root Cause | Fix |
|-----|------|--------|-------|--------|------------|-----|
| 1 | May 7 | shell wrapper (pre-repo) | 256 | ❌ Crashed at batch probing | `o_b_proj` shape mismatch: finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
| 2 | May 8-9 | shell wrapper (pre-repo) | 128 | ❌ Crashed at export (128/128 calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU immediately after calibration |
| 3 | May 9 06:10 | `3907838` | 128 | ❌ Crashed at model loading | `AutoModelForCausalLM.from_pretrained` OOM during expert weight `torch.cat` (31.5GB alloc, 25.9GB free) | Use modelopt `get_model()` which handles max_memory properly |
| 4 | May 9 ~08:05 | `f9bbef8` | 128 | 🔄 Running | - | - |

**If Run 4 succeeds**, the current code is good and no further changes are needed.

**If Run 4 fails**, check the log, identify the crash point, and add it to this table.

**Key lesson from Run 2:** The `use_seq_device_map` mode shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled, so reading it triggers `cudaErrorIllegalAddress`. The fix is to copy amax to CPU *immediately* after calibration, before any further GPU operations.

**Key lesson from Run 3:** Don't use `AutoModelForCausalLM.from_pretrained` directly for models with large expert weights. The weight conversion step does `torch.cat` on GPU, which can OOM. Use modelopt's `get_model()`, which sets `max_memory` per GPU before loading.

**Do NOT repeat these mistakes:**

- Don't use the FP8 source model: kernel issues on Blackwell (Run 1)
- Don't use `--low_memory_mode` with V4: meta device errors
- Don't use calib_size=256: OOMs with 3TB BF16 on CPU offload
- Don't use `AutoModelForCausalLM.from_pretrained` directly: OOM during expert weight concat (Run 3)
- Don't assume GPU tensor integrity after 5+ hours of sequential calibration
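The Run 3 lesson comes down to leaving headroom on each GPU before weights are placed. The sketch below builds a per-device `max_memory` mapping of the kind that Hugging Face loading accepts, using the `gpu_mem_pct=0.7` fraction from the config. It is an illustration only: modelopt's `get_model()` internals are not shown in this README, and the helper name and arguments are hypothetical.

```python
def build_max_memory(free_bytes_per_gpu, gpu_mem_fraction=0.7, cpu_bytes=None):
    """Cap each GPU at a fraction of its free memory.

    Returns a dict such as {0: 123, 1: 456, "cpu": 789} (values in bytes).
    Hypothetical helper; not modelopt's actual implementation.
    """
    if not 0.0 < gpu_mem_fraction <= 1.0:
        raise ValueError("gpu_mem_fraction must be in (0, 1]")
    max_memory = {
        index: int(free_bytes * gpu_mem_fraction)
        for index, free_bytes in enumerate(free_bytes_per_gpu)
    }
    if cpu_bytes is not None:
        # Whatever does not fit under the GPU caps is offloaded to CPU RAM.
        max_memory["cpu"] = int(cpu_bytes)
    return max_memory
```

With a cap like this, the loader stops filling a GPU before it is full, so a large temporary allocation such as the 31.5GB expert `torch.cat` from Run 3 has room to succeed.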
## Runtime Patches Applied by quantize_nvfp4.py

These are monkey-patches applied at runtime; no modelopt source files are modified.

1. **`load_calib_amax`:** After calibration writes `_amax` to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
2. **`export_amax`:** If any `_amax` is still on GPU at export time, moves it to CPU before reading. Safety net.
3. **`get_activation_scaling_factor`:** Moves amax to CPU and clamps bad values instead of hitting a hard assert. Prevents a crash on garbage from GPU corruption.
4. **`snapshot_amax_to_cpu()`:** After calibration, walks all quantizers, copies `_amax` to CPU, and saves to disk (~50MB). Insurance policy.
5. **`force_all_amax_to_cpu()`:** Moves `_pre_quant_scale` and `_global_amax` to CPU too. Nuclear option.
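The snapshot idea in patches 4 and 5 can be sketched as follows. This is a hedged re-implementation for illustration, not the function in `quantize_nvfp4.py`. The helper name `snapshot_attrs_to_cpu` is hypothetical, and it assumes only that the model exposes `named_modules()` (as `torch.nn.Module` does) and that the attributes are tensors with `detach()`, `cpu()`, and `clone()`.

```python
def snapshot_attrs_to_cpu(model, attr_names=("_amax", "_pre_quant_scale", "_global_amax")):
    """Copy quantizer state off the GPU into a plain dict.

    Keys are "<module name>.<attribute>"; values are CPU copies.
    Hypothetical helper written for illustration.
    """
    snapshot = {}
    for module_name, module in model.named_modules():
        for attr in attr_names:
            value = getattr(module, attr, None)
            if value is None:
                continue  # this module does not carry that attribute
            # detach(): drop autograd history; cpu(): copy to host memory;
            # clone(): make sure the snapshot owns its own storage.
            snapshot[f"{module_name}.{attr}"] = value.detach().cpu().clone()
    return snapshot
```

The returned dict could then be persisted with `torch.save(snapshot, path)` right after calibration, before any further GPU work touches the allocator.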
## Bugs Found (V4 + modelopt 0.45.0.dev64)

1. ~~`QuantDeepseekV4Experts` AttributeError~~: **Already fixed** in modelopt 0.45.0.dev64 (handles `nn.ModuleList` quantizers natively).
2. `--low_memory_mode` → meta device error. Don't use with V4.
3. Missing `kernels` package for FP8 ops. `pip install -U kernels`.
4. ~~Shell script arg names~~: No longer relevant (in-process script).
5. **Export crash: stale GPU tensors in `export_amax()`.** After hours of calibration, quantizer `_amax` on GPU becomes unreadable. Fixed by patching `export_amax` to move `_amax` to CPU before reading.
6. **Export crash: `assert torch.all(activation_scaling_factor > 0)`.** Amax values from stale GPU reads are garbage (zeros, negatives, NaN). Fixed by clamping instead of asserting, plus snapshotting valid amax to CPU before corruption can occur.
7. **Model loading OOM during expert weight conversion.** `AutoModelForCausalLM.from_pretrained` does `torch.cat` on GPU for expert `gate_up_proj` (31.5GB alloc), but only 25.9GB is free with `device_map="sequential"`. Fixed by using modelopt's `get_model()`, which sets `max_memory` per GPU before loading.
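The fix for bug 6 combines two ideas: clamp bad values instead of asserting, and install that behaviour by monkey-patching at runtime. The sketch below shows both on plain Python lists, with a `types.SimpleNamespace` standing in for the patched module. The helper names and the `1e-6` floor are assumptions; the real patch operates on modelopt's tensors and is not reproduced here.

```python
import functools
import math
import types


def sanitize_scales(values, floor=1e-6):
    """Replace NaN, infinite, zero, or negative entries with a small positive floor."""
    return [v if math.isfinite(v) and v > 0 else floor for v in values]


def patch_with_sanitizer(namespace, func_name, floor=1e-6):
    """Wrap namespace.func_name so its result is sanitized; return the original."""
    original = getattr(namespace, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        return sanitize_scales(original(*args, **kwargs), floor)

    # Rebinding the attribute is the monkey-patch: no source file is edited.
    setattr(namespace, func_name, wrapper)
    return original


# Stand-in for a library module whose function returns corrupted scales.
fake_module = types.SimpleNamespace(
    get_scales=lambda: [0.5, 0.0, -1.0, float("nan")]
)
patch_with_sanitizer(fake_module, "get_scales")
cleaned = fake_module.get_scales()
```

Clamping keeps the export alive, but it hides corruption rather than curing it, which is why the snapshot of known-good amax values remains the primary defence.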
## Dependencies (pinned versions)

- **nvidia-modelopt:** `0.45.0.dev64+g579fc6c31` (installed from git, not PyPI)
- **transformers:** `5.8.0.dev0` (from git, required for DeepSeekV4 support)
- **kernels:** latest (`pip install -U kernels`, needed for finegrained FP8 ops)
- **Python:** 3.10

The patches in `quantize_nvfp4.py` are written for **modelopt 0.45.0.dev64** specifically. Later versions may include these fixes natively, so check before applying.
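One way to "check before applying" is to compare installed versions against the pins above before the patches are installed. The sketch below uses the standard-library `importlib.metadata`. The helper names and the prefix-matching rule are assumptions for illustration, and nothing here is part of `quantize_nvfp4.py`.

```python
from importlib import metadata

# Version prefixes taken from the dependency list above.
PINNED_PREFIXES = {
    "nvidia-modelopt": "0.45.0.dev64",
    "transformers": "5.8.0.dev0",
}


def matches_pin(installed_version: str, pinned_prefix: str) -> bool:
    """Prefix match, so local suffixes like '+g579fc6c31' are accepted."""
    return installed_version.startswith(pinned_prefix)


def check_pins(pins=PINNED_PREFIXES, get_version=metadata.version):
    """Return a list of human-readable problems; an empty list means all pins match."""
    problems = []
    for package, prefix in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed")
            continue
        if not matches_pin(installed, prefix):
            problems.append(f"{package}: found {installed}, expected {prefix}*")
    return problems
```

Calling `check_pins()` at the top of a run and aborting on a non-empty result is cheaper than discovering a version mismatch several hours into calibration.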
## Key Notes

- V4 is NOT BF16. It ships as mixed-precision FP8/FP4, so you MUST dequantize to BF16 first (Step 1).
- `--low_memory_mode` causes meta device errors with V4. Don't use it.
- modelopt has no explicit V4 support. It relies on auto-detection of fused experts.
- The calibration state save (`v4_nvfp4_calibrated_state.pt`) is ~1.5TB. It lives on NVMe, not in git.
- The amax snapshot (`v4_nvfp4_amax_snapshots.pt`) is ~50MB. Small, critical, cheap insurance.
- Only divergence from the modelopt example path: `get_model()` instead of `AutoModelForCausalLM.from_pretrained` (avoids OOM). Everything else uses the same `hf_ptq` functions.