
# DeepSeek V4 Pro → NVFP4 Quantization

Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7TB RAM, 13TB NVMe). Target: ~600GB.

**Cost:** ~$161-184 per run at $23/hr (7-8 hours each). Don't waste runs.

## Pipeline

### Step 1: Dequantize FP8 → BF16

```bash
python3 scripts/dequant_fp8_to_bf16.py /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 /root/nvidia-meeting/DeepSeek-V4-Pro-BF16
```

The original V4 weights use mixed precision (FP8 attention + FP4/E2M1 experts with per-tensor scales). We dequantize everything to pure BF16 so modelopt can run calibration without hitting broken FP8 kernel paths on Blackwell (DeepGEMM unsupported, Triton finegrained FP8 matmul shape mismatches).

This is not a blind upcast — it applies the actual scale factors:

```
W_bf16 = dequantize_fp4_weight(W_int, S)  # per-tensor scale dequant, not .to(bfloat16)
```

**Byte-exact verified** — matmul diff is 0.000000 against the official inference path.
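
To make the difference from a blind cast concrete, here is a minimal pure-Python sketch of per-tensor FP4 (E2M1) dequantization. It is not the repository's `dequantize_fp4_weight`: the function names and the nibble order (low nibble first) are assumptions for illustration, and the real script works on tensors rather than Python lists.

```python
# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit),
# indexed by the low three bits of the 4-bit code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code; bit 3 is the sign bit."""
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def dequantize_packed_fp4(packed: bytes, scale: float) -> list[float]:
    """Unpack two FP4 codes per byte and apply a single per-tensor scale.

    Assumption: the low nibble holds the first element of each pair.
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F) * scale)
        values.append(decode_e2m1(byte >> 4) * scale)
    return values
```

A plain `.to(bfloat16)` on the packed bytes would treat the codes as integers and skip both the E2M1 lookup and the scale, which is why the dequantization has to be explicit.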

### Step 2: Run NVFP4 Quantization

```bash
cd /root/nvidia-meeting/modelopt-repo/examples/llm_ptq
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py
```

The script must be run from the modelopt example directory because it relies on relative imports.

Pipeline steps:
1. **Load** BF16 model via modelopt's `get_model()` (handles memory limits properly for 3TB model)
2. **Patch** modelopt at runtime (load_calib_amax forces CPU, export_amax CPU fallback, graceful clamp)
3. **Quantize + Calibrate** (5-6 hours, 128 samples)
4. **Snapshot amax to CPU** — copies all quantizer state to CPU and saves to disk (~50MB)
5. **Save model state** — full state dict to disk (insurance against export crashes)
6. **Export** to HF safetensors

If the export crashes:

```bash
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --export-only
```

To validate saved state without running anything:

```bash
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --validate-only
```
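
The three invocations above map onto different subsets of the pipeline. The sketch below shows one way such mode flags could be wired with `argparse`. Only the flag names come from this README; the stage names, the mutual exclusion of the two flags, and the helper functions are assumptions, not the script's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="NVFP4 quantization driver (sketch)")
    # Assumption: the two modes cannot be combined.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--export-only", action="store_true",
                      help="skip calibration and export from the saved state")
    mode.add_argument("--validate-only", action="store_true",
                      help="check the saved state without running anything")
    return parser


def select_stages(args: argparse.Namespace) -> list[str]:
    """Return the (hypothetical) stage names to run for the chosen mode."""
    if args.validate_only:
        return ["validate"]
    if args.export_only:
        return ["load_saved_state", "export"]
    # Default: the six pipeline steps listed earlier in this section.
    return ["load", "patch", "calibrate", "snapshot_amax", "save_state", "export"]
```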

**Config:** `nvfp4`, 128 calib samples, calib_seq=512, kv_fp8_cast, gpu_mem_pct=0.7

**Calibration datasets:** `abisee/cnn_dailymail` + `nvidia/Nemotron-Post-Training-Dataset-v2` (gated — requires HF token).

**Runtime:** Model loading ~53 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.
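
As a sanity check on the runtime and cost figures, the stage durations can be summed directly. This is a small worked calculation using only the numbers quoted in this README; the variable and function names are illustrative.

```python
HOURLY_RATE_USD = 23.0  # node price quoted in the Cost line

# (low, high) duration of each stage in hours, taken from the Runtime line.
STAGE_HOURS = {
    "model loading": (53 / 60, 53 / 60),
    "calibration": (5.5, 5.5),
    "export": (0.5, 1.0),
}


def total_hours(stage_hours):
    """Sum the low and high estimates across all stages."""
    low = sum(lo for lo, _ in stage_hours.values())
    high = sum(hi for _, hi in stage_hours.values())
    return low, high


def run_cost(hours, hourly_rate=HOURLY_RATE_USD):
    """Cost in USD of keeping the node up for the given number of hours."""
    return hours * hourly_rate


if __name__ == "__main__":
    low, high = total_hours(STAGE_HOURS)
    print(f"stages alone: {low:.2f} to {high:.2f} hours")
    print(f"cost: ${run_cost(low):.2f} to ${run_cost(high):.2f}")
```

The stages alone come to roughly 6.9 to 7.4 hours, so the quoted total presumably includes some setup overhead between stages.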

## Run History (forward progression)

| Run | Date | Commit | Calib | Result | Root Cause | Fix |
|-----|------|--------|-------|--------|------------|-----|
| 1 | May 7 | shell wrapper (pre-repo) | 256 | ❌ Crashed at batch probing | `o_b_proj` shape mismatch — finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
| 2 | May 8-9 | shell wrapper (pre-repo) | 128 | ❌ Crashed at export (128/128 calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU immediately after calibration |
| 3 | May 9 06:10 | `3907838` | 128 | ❌ Crashed at model loading | `AutoModelForCausalLM.from_pretrained` OOM during expert weight `torch.cat` (31.5GB alloc, 25.9GB free) | Use modelopt `get_model()` which handles max_memory properly |
| 4 | May 9 ~08:05 | `f9bbef8` | 128 | 🔄 Running | — | — |

- **If Run 4 succeeds**, current code is good. No further changes needed.
- **If Run 4 fails**, check the log, identify the crash point, add it to this table.

**Key lesson from Run 2:** The `use_seq_device_map` mode shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled — reading it triggers `cudaErrorIllegalAddress`. The fix is to copy amax to CPU *immediately* after calibration, before any further GPU operations.
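
The fix described above can be sketched as follows. This is an illustrative version, not the script's actual `snapshot_amax_to_cpu()`: it assumes a PyTorch model whose quantizer modules expose an `_amax` tensor attribute, and it only collects the copies.

```python
from typing import Any, Dict


def snapshot_amax_to_cpu(model: Any) -> Dict[str, Any]:
    """Return {module_name: CPU copy of module._amax} for every module that has one.

    `model` is expected to behave like a torch.nn.Module (named_modules()).
    """
    snapshot = {}
    for name, module in model.named_modules():
        amax = getattr(module, "_amax", None)
        if amax is None:
            continue
        # detach() drops autograd history, .to("cpu") copies the data out of VRAM,
        # and clone() guarantees an independent buffer even if amax was already on CPU.
        snapshot[name] = amax.detach().to("cpu").clone()
    return snapshot
```

With a real model, the returned dict can be written to disk with `torch.save(snapshot, path)` right after calibration, where `path` is wherever the snapshot file should live.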

**Key lesson from Run 3:** Don't use `AutoModelForCausalLM.from_pretrained` directly for models with large expert weights. The weight conversion step does `torch.cat` on GPU which can OOM. Use modelopt's `get_model()` which sets `max_memory` per GPU before loading.
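
To illustrate what a per-GPU memory cap looks like, here is a minimal sketch that builds a `max_memory` mapping in the format accepted by Hugging Face `from_pretrained` (device index or `"cpu"` mapped to a size string). It is not modelopt's `get_model()` code; the function name and parameters are assumptions, and the per-GPU capacity must be supplied by the caller.

```python
import math
from typing import Dict, Optional, Union


def build_max_memory(
    num_gpus: int,
    gpu_total_gib: float,
    gpu_fraction: float,
    cpu_gib: Optional[int] = None,
) -> Dict[Union[int, str], str]:
    """Cap each GPU at a fraction of its capacity, leaving headroom for temporaries."""
    if not 0 < gpu_fraction <= 1:
        raise ValueError("gpu_fraction must be in (0, 1]")
    # The small epsilon guards against float artifacts such as 69.999999 -> 69.
    per_gpu_gib = math.floor(gpu_total_gib * gpu_fraction + 1e-6)
    max_memory: Dict[Union[int, str], str] = {
        index: f"{per_gpu_gib}GiB" for index in range(num_gpus)
    }
    if cpu_gib is not None:
        max_memory["cpu"] = f"{cpu_gib}GiB"
    return max_memory
```

With the `gpu_mem_pct=0.7` setting used here, roughly 30% of each GPU stays free, which is the kind of headroom a large `torch.cat` during weight conversion needs.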

**Do NOT repeat these mistakes:**
- Don't use FP8 source model — kernel issues on Blackwell (Run 1)
- Don't use `--low_memory_mode` with V4 — meta device errors
- Don't use calib_size=256 — OOMs with 3TB BF16 on CPU offload
- Don't use `AutoModelForCausalLM.from_pretrained` directly — OOM during expert weight concat (Run 3)
- Don't assume GPU tensor integrity after 5+ hours of sequential calibration
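
Given the cost of a failed run, the checklist above could be encoded as a preflight check that runs before the model is loaded. The sketch below is hypothetical: the `RunSettings` fields and their values are made-up names for illustration and do not correspond to actual script arguments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunSettings:
    source_dtype: str       # dtype of the source checkpoint, e.g. "bf16" or "fp8"
    calib_size: int         # number of calibration samples
    low_memory_mode: bool   # whether --low_memory_mode would be passed
    loader: str             # "get_model" or "from_pretrained"


def preflight_problems(settings: RunSettings) -> List[str]:
    """Return one message per known mistake; an empty list means good to go."""
    problems = []
    if settings.source_dtype.lower() != "bf16":
        problems.append("source must be the dequantized BF16 checkpoint (Run 1)")
    if settings.low_memory_mode:
        problems.append("low_memory_mode causes meta device errors with V4")
    if settings.calib_size > 128:
        problems.append("calib_size above 128 is not known to fit; 256 ran out of memory")
    if settings.loader != "get_model":
        problems.append("load via modelopt get_model() to avoid the Run 3 OOM")
    return problems
```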

## Runtime Patches Applied by quantize_nvfp4.py

Items 1-3 are monkey-patches applied at runtime; items 4-5 are helper functions the script calls after calibration. No modelopt source files are modified.

1. **`load_calib_amax`** — After calibration writes `_amax` to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
2. **`export_amax`** — If any `_amax` is still on GPU at export time, moves to CPU before reading. Safety net.
3. **`get_activation_scaling_factor`** — Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
4. **`snapshot_amax_to_cpu()`** — After calibration, walks all quantizers, copies `_amax` to CPU, saves to disk (~50MB). Insurance policy.
5. **`force_all_amax_to_cpu()`** — Moves `_pre_quant_scale`, `_global_amax` to CPU too. Nuclear option.
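
The general shape of such a runtime patch is shown below, using the `export_amax` safety net as the example. This is a sketch of the monkey-patching pattern only: the quantizer class is passed in by the caller, and the wrapper is an assumption about how the patch could look rather than the code in `quantize_nvfp4.py`.

```python
import functools


def patch_export_amax(quantizer_cls):
    """Wrap quantizer_cls.export_amax so `_amax` is on CPU before it is read.

    Returns the original method so the patch can be undone.
    """
    original = quantizer_cls.export_amax

    @functools.wraps(original)
    def patched(self, *args, **kwargs):
        amax = getattr(self, "_amax", None)
        # is_cuda is a standard torch.Tensor attribute; absent means nothing to move.
        if amax is not None and getattr(amax, "is_cuda", False):
            self._amax = amax.detach().cpu()
        return original(self, *args, **kwargs)

    quantizer_cls.export_amax = patched
    return original
```

Because the class attribute is replaced in the running process, every existing and future instance picks up the wrapper, and restoring the returned `original` reverts it.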

## Bugs Found (V4 + modelopt 0.45.0.dev64)

1. ~~`QuantDeepseekV4Experts` AttributeError~~ — **Already fixed** in modelopt 0.45.0.dev64 (handles `nn.ModuleList` quantizers natively).
2. `--low_memory_mode` → meta device error. Don't use with V4.
3. Missing `kernels` package for FP8 ops. `pip install -U kernels`.
4. ~~Shell script arg names~~ — No longer relevant (in-process script).
5. **Export crash — stale GPU tensors in `export_amax()`.** After hours of calibration, quantizer `_amax` on GPU becomes unreadable. Fixed by patching `export_amax` to move `_amax` to CPU before reading.
6. **Export crash — `assert torch.all(activation_scaling_factor > 0)`.** Amax values from stale GPU reads are garbage (zeros, negatives, NaN). Fixed by clamping instead of asserting, plus snapshotting valid amax to CPU before corruption can occur.
7. **Model loading OOM during expert weight conversion.** `AutoModelForCausalLM.from_pretrained` does `torch.cat` on GPU for expert `gate_up_proj` (31.5GB alloc), but only 25.9GB free with `device_map="sequential"`. Fixed by using modelopt's `get_model()` which sets `max_memory` per GPU before loading.
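
The clamp-instead-of-assert idea from bug 6 can be sketched in plain Python. The function name and the floor value are assumptions for illustration; the real patch operates on tensors inside `get_activation_scaling_factor`.

```python
import math
from typing import Iterable, List, Tuple


def sanitize_scaling_factors(
    values: Iterable[float], floor: float = 1e-8
) -> Tuple[List[float], int]:
    """Replace NaN, infinite, zero, or negative factors with a small positive floor.

    Returns the cleaned list and the number of replaced entries, so the caller
    can log how much garbage was found instead of crashing on an assert.
    """
    cleaned = []
    replaced = 0
    for value in values:
        if not math.isfinite(value) or value < floor:
            cleaned.append(floor)
            replaced += 1
        else:
            cleaned.append(value)
    return cleaned, replaced
```

Clamping only keeps the export alive. A non-zero replacement count still means the amax values were corrupted, so the CPU snapshot remains the source of truth.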

## Dependencies (pinned versions)

- **nvidia-modelopt:** `0.45.0.dev64+g579fc6c31` (installed from git, not PyPI)
- **transformers:** `5.8.0.dev0` (from git, required for DeepSeekV4 support)
- **kernels:** latest, not pinned (`pip install -U kernels`; needed for finegrained FP8 ops)
- **Python:** 3.10

The patches in `quantize_nvfp4.py` are for **modelopt 0.45.0.dev64** specifically. Later versions may include fixes natively — check before applying.
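
One way to enforce that check is to compare the installed version against the one the patches were written for before applying them. This is a minimal sketch using the standard library; the helper names are illustrative, and only the package name and version string come from the list above.

```python
from importlib import metadata
from typing import Optional

PATCHED_VERSION = "0.45.0.dev64"


def patches_match(installed_version: str, expected: str = PATCHED_VERSION) -> bool:
    """True if the installed version is the expected one, ignoring a +local suffix."""
    return installed_version == expected or installed_version.startswith(expected + "+")


def installed_modelopt_version() -> Optional[str]:
    """Return the installed nvidia-modelopt version, or None if it is not installed."""
    try:
        return metadata.version("nvidia-modelopt")
    except metadata.PackageNotFoundError:
        return None
```

A caller could refuse to start, or skip the patches with a warning, when `patches_match(installed_modelopt_version() or "")` is false.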

## Key Notes

- V4 is NOT BF16 — it ships as mixed-precision FP8/FP4. You MUST dequantize to BF16 first (Step 1).
- `--low_memory_mode` causes meta device errors with V4 — don't use.
- modelopt has no explicit V4 support — relies on auto-detection of fused experts.
- The calibration state save (`v4_nvfp4_calibrated_state.pt`) is ~1.5TB. It lives on NVMe, not in git.
- The amax snapshot (`v4_nvfp4_amax_snapshots.pt`) is ~50MB. Small, critical, cheap insurance.
- Only divergence from modelopt example path: `get_model()` instead of `AutoModelForCausalLM.from_pretrained` (avoids OOM). Everything else uses the same `hf_ptq` functions.
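
Since the calibrated state alone is about 1.5TB, it may be worth confirming free disk space before the save step rather than discovering a full disk hours into a run. The sketch below uses `shutil.disk_usage`; the helper name and the decimal definition of TB are assumptions.

```python
import shutil

TB = 10**12  # decimal terabyte; an assumption about how sizes are quoted here


def free_space_shortfall(path: str, required_bytes: int) -> int:
    """Return how many bytes are missing at `path`, or 0 if there is enough room."""
    free_bytes = shutil.disk_usage(path).free
    return max(0, required_bytes - free_bytes)
```

For example, `free_space_shortfall(output_dir, int(1.5 * TB))` returning a non-zero value before calibration starts is a signal to free space first, where `output_dir` is whatever directory the state file is written to.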