deepseek-v4-quant/README.md

# DeepSeek V4 Pro → NVFP4 conversion kit

Two paths for converting `sgl-project/DeepSeek-V4-Pro-FP8` (the uniform-FP8 repackage of the original mixed-precision V4 Pro) into NVFP4 for Blackwell inference.

## What's here

| File | Purpose |
| --- | --- |
| `inspect_model.py` | Run this first. Prints tensor name patterns, dtypes, FP8 scaling block sizes, and counts of MoE expert/router/norm tensors so you know exactly what you're dealing with before any conversion. |
| `fp8_to_nvfp4_streaming.py` | **Path A.** Pure tensor-level streaming FP8 → NVFP4 conversion. No model loading, no calibration, weight-only. Low memory, fast, deterministic. Recommended for first run. |
| `quantize_llmcompressor.py` | **Path B.** `llm-compressor` oneshot with sequential pipeline + activation calibration. Produces W4A4 with calibrated activation scales. Higher quality on activation-sensitive ops but riskier given V4 is two weeks old. |
| `verify_nvfp4.py` | Loads the produced NVFP4 checkpoint, runs a basic forward pass through one block, checks for NaN/Inf, and dumps a few generated tokens via vLLM. |

## Hardware assumptions

- 8× B200 baremetal, 1.5 TB HBM total
- 2.7 TB system RAM
- ≥10 TB free NVMe at `~/nvidia-meeting/`

## Prereqs

```bash
source ~/nvidia-meeting/venv/bin/activate
pip install --upgrade torch safetensors transformers tqdm
pip install --upgrade llmcompressor compressed-tensors  # only needed for Path B
pip install --upgrade vllm                              # only needed for verify
```

You'll likely need `transformers` from source for V4 architecture support, and `trust_remote_code=True` everywhere. Stock pip versions may not load V4 yet.

## Recommended order tonight

```bash
cd ~/nvidia-meeting

# 1. Inspect the FP8 source — 30 seconds, no GPU needed.
python inspect_model.py DeepSeek-V4-Pro-FP8 | tee inspect.log

# 2. Path A streaming conversion — should run in 2-6 hours dominated by NVMe I/O.
python fp8_to_nvfp4_streaming.py \
    --src DeepSeek-V4-Pro-FP8 \
    --dst DeepSeek-V4-Pro-NVFP4-streaming \
    --workers 8 \
    2>&1 | tee path_a.log

# 3. Quick sanity check — does it load and forward-pass?
python verify_nvfp4.py DeepSeek-V4-Pro-NVFP4-streaming

# 4. Path B (overnight). Run only after Path A succeeds. 24-72 hours.
python quantize_llmcompressor.py \
    --src DeepSeek-V4-Pro-FP8 \
    --dst DeepSeek-V4-Pro-NVFP4-llmcompressor \
    --num-samples 256 \
    --max-seq-len 4096 \
    2>&1 | tee path_b.log
```

## Path A — what it does

1. Reads `model.safetensors.index.json` to map every tensor to its shard.
2. Classifies every tensor:
   - **Preserve** (copied bit-for-bit): `lm_head`, `embed_tokens`, MoE router gates (`*.mlp.gate`), all norms, V4-specific attention indexer/scoring tensors, mHC residual mixing weights.
   - **Quantize**: any FP8 weight that has a corresponding `*.weight_scale_inv` companion (i.e. real GEMM weights).
3. For every quantizable weight:
   - Dequantizes FP8 E4M3 → FP32 using the source's per-block scales (auto-detects 128×128 blocks).
   - Computes NVFP4 dual scales: per-tensor `weight_scale_2 = amax / (6.0 * 448.0)` and per-16-element-block `weight_scale = block_amax / (6.0 * weight_scale_2)` cast to FP8 E4M3.
   - Quantizes FP32 → E2M1 representable values `{0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}`.
   - Packs two 4-bit values per `uint8` byte.
4. **MoE pair handling**: detects `gate_proj` (w1) + `up_proj` (w3) of each expert and computes a joint `weight_scale_2` across both, since vLLM's fused MoE kernel requires them to share that global scale.
5. Streams output to new shards (~5 GB each) with a fresh `model.safetensors.index.json` and copies all non-tensor files (config, tokenizer, etc.) verbatim.

**This is weight-only NVFP4.** Activation quantization is not done here — you get W4A16 effective behavior at runtime unless your inference engine generates dynamic per-group activation scales. vLLM does generate per-group activation scales dynamically at inference, so this is fine for most use cases.

## Path B — what it does

1. Loads the FP8 model via `transformers` with `device_map="auto"` and the offload folder pointing at NVMe. With 2.7 TB RAM, the FP8 weights (~865 GB) sit in RAM; activations and per-layer BF16 promotion happen on the B200s.
2. Loads a calibration set (default 256 samples of `HuggingFaceH4/ultrachat_200k`).
3. Runs `llm-compressor` `oneshot` with `pipeline="sequential"` so only one transformer block is materialized in BF16 on GPU at a time.
4. `moe_calibrate_all_experts=True` ensures every routed expert gets calibration signal even when natural routing wouldn't pick it.
5. The recipe targets `Linear` with NVFP4 and the same ignore list as Path A (lm_head, embed, router gates, norms, indexer, mHC).
6. Saves with `save_compressed=True` in `compressed-tensors` format.

**The known risks for Path B on V4 specifically:**

- V4 architecture is brand new. `llm-compressor` may not have a registered MoE wrapper for V4 — you may need to call `replace_modules_for_calibration` with the actual V4 MoE class name (the script has a TODO and a fallback path).
- Sequential pipeline may not handle CSA/HCA hybrid attention if the attention forward isn't a simple linear chain. If you see weird offload errors during calibration, the indexer/scoring tensors are likely the culprit.
- Calibration cache for 256 routed experts × all V4 layers can be hundreds of GB. Watch `nvidia-smi` and `free -h` during the first 30 minutes.

## Things to discuss with the NVIDIA engineer

1. **NVFP4 packing convention.** My converter packs as `byte = elem0 | (elem1 << 4)` (low nibble first). Verify this matches what TensorRT-LLM / cutlass NVFP4 kernels expect. If reversed, just flip in `pack_fp4()`.
2. **Joint scaling extension.** I implement joint `weight_scale_2` for `gate_proj`/`up_proj` pairs. Ask whether `down_proj` also benefits, or whether all three experts in a fused MoE block should share — recipes have varied.
3. **mHC residual weights.** I preserve them in FP8/BF16 conservatively. If NVIDIA has actually quantized these somewhere internally, drop them out of the ignore list to recover memory.
4. **CSA + HCA indexer/scoring tensors.** I preserve these blindly based on the V3.2 DSA precedent. Ask whether V4's compressed-sparse / heavily-compressed attention has analogous "cannot quantize" tensors and what the canonical regex is.
5. **W4A4 vs W4A16 for V4 Pro.** Path A is W4A16-equivalent; Path B is W4A4. For a 1.6T MoE with extreme long-context, ask which is internally recommended for first deployment.
6. **`modelopt` vs `llm-compressor` for V4.** RedHat shipped V4-*Flash* NVFP4 via `llm-compressor`. Why not Pro yet? Find out if there's a known-bad layer or just compute time.

## Output sizes to expect

- FP8 source: ~865 GB
- Path A NVFP4 output: ~430–470 GB (about 2× compression vs FP8 source; experts dominate, norms/embeds add a bit back)
- Path B NVFP4 output: similar, plus activation scale metadata

## Resumability

Path A is checkpoint-resumable per shard — if it dies mid-run, re-running picks up from the next unwritten output shard. Path B is **not** resumable mid-calibration; if it crashes you restart.