deepseek-v4-quant/README.md
2026-05-06 23:47:07 +00:00

# DeepSeek V4 Pro → NVFP4 conversion kit
Two paths for converting `sgl-project/DeepSeek-V4-Pro-FP8` (the uniform-FP8 repackage of the original mixed-precision V4 Pro) into NVFP4 for Blackwell inference.
## What's here
| File | Purpose |
| --- | --- |
| `inspect_model.py` | Run this first. Prints tensor name patterns, dtypes, FP8 scaling block sizes, and counts of MoE expert/router/norm tensors so you know exactly what you're dealing with before any conversion. |
| `fp8_to_nvfp4_streaming.py` | **Path A.** Pure tensor-level streaming FP8 → NVFP4 conversion. No model loading, no calibration, weight-only. Low memory, fast, deterministic. Recommended for first run. |
| `quantize_llmcompressor.py` | **Path B.** `llm-compressor` oneshot with sequential pipeline + activation calibration. Produces W4A4 with calibrated activation scales. Higher quality on activation-sensitive ops but riskier given V4 is two weeks old. |
| `verify_nvfp4.py` | Loads the produced NVFP4 checkpoint, runs a basic forward pass through one block, checks for NaN/Inf, and dumps a few generated tokens via vLLM. |
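The name-counting part of the inspection step can be sketched from the shard index alone. This is a minimal sketch, not the contents of `inspect_model.py`: the `categorize` helper and its substring patterns are hypothetical (modeled on the `*.mlp.gate` and `*.weight_scale_inv` names used later in this README), and reporting dtypes or block sizes would additionally require reading the shard headers.

```python
import json
from collections import Counter
from pathlib import Path


def load_weight_map(model_dir):
    """Return the {tensor_name: shard_file} mapping from the shard index."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    return json.loads(index_path.read_text())["weight_map"]


def categorize(name):
    """Hypothetical coarse buckets; adjust the patterns to the real V4 names."""
    if name.endswith(".weight_scale_inv"):
        return "fp8_block_scale"
    if ".experts." in name:
        return "moe_expert"
    if name.endswith(".mlp.gate.weight"):
        return "router_gate"
    if "norm" in name:
        return "norm"
    return "other"


def summarize(weight_map):
    """Count tensors per bucket."""
    return Counter(categorize(name) for name in weight_map)
```

Calling `summarize(load_weight_map("DeepSeek-V4-Pro-FP8"))` gives a quick count per bucket; a large `other` bucket is a hint that the patterns need updating before any conversion.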
## Hardware assumptions
- 8× B200 baremetal, 1.5 TB HBM total
- 2.7 TB system RAM
- ≥10 TB free NVMe at `~/nvidia-meeting/`
## Prereqs
```bash
source ~/nvidia-meeting/venv/bin/activate
pip install --upgrade torch safetensors transformers tqdm
pip install --upgrade llmcompressor compressed-tensors # only needed for Path B
pip install --upgrade vllm # only needed for verify
```
You'll likely need `transformers` from source for V4 architecture support, and `trust_remote_code=True` everywhere. Stock pip versions may not load V4 yet.
## Recommended order tonight
```bash
cd ~/nvidia-meeting
# 1. Inspect the FP8 source — 30 seconds, no GPU needed.
python inspect_model.py DeepSeek-V4-Pro-FP8 | tee inspect.log
# 2. Path A streaming conversion — should run in 2-6 hours dominated by NVMe I/O.
python fp8_to_nvfp4_streaming.py \
--src DeepSeek-V4-Pro-FP8 \
--dst DeepSeek-V4-Pro-NVFP4-streaming \
--workers 8 \
2>&1 | tee path_a.log
# 3. Quick sanity check — does it load and forward-pass?
python verify_nvfp4.py DeepSeek-V4-Pro-NVFP4-streaming
# 4. Path B (overnight). Run only after Path A succeeds. 24-72 hours.
python quantize_llmcompressor.py \
--src DeepSeek-V4-Pro-FP8 \
--dst DeepSeek-V4-Pro-NVFP4-llmcompressor \
--num-samples 256 \
--max-seq-len 4096 \
2>&1 | tee path_b.log
```
## Path A — what it does
1. Reads `model.safetensors.index.json` to map every tensor to its shard.
2. Classifies every tensor:
- **Preserve** (copied bit-for-bit): `lm_head`, `embed_tokens`, MoE router gates (`*.mlp.gate`), all norms, V4-specific attention indexer/scoring tensors, mHC residual mixing weights.
- **Quantize**: any FP8 weight that has a corresponding `*.weight_scale_inv` companion (i.e. real GEMM weights).
3. For every quantizable weight:
- Dequantizes FP8 E4M3 → FP32 using the source's per-block scales (auto-detects 128×128 blocks).
- Computes NVFP4 dual scales: per-tensor `weight_scale_2 = amax / (6.0 * 448.0)` and per-16-element-block `weight_scale = block_amax / (6.0 * weight_scale_2)` cast to FP8 E4M3.
- Quantizes FP32 → E2M1 representable values `{0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}`.
- Packs two 4-bit values per `uint8` byte.
4. **MoE pair handling**: detects `gate_proj` (w1) + `up_proj` (w3) of each expert and computes a joint `weight_scale_2` across both, since vLLM's fused MoE kernel requires them to share that global scale.
5. Streams output to new shards (~5 GB each) with a fresh `model.safetensors.index.json` and copies all non-tensor files (config, tokenizer, etc.) verbatim.
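The dequantization in step 3 can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `fp8_to_nvfp4_streaming.py`: it assumes the E4M3 "FN" variant (bias 7, maximum 448, no infinities) and that dequantization multiplies each element by its block's `weight_scale_inv` entry. The function names are hypothetical, and a real converter would use vectorized tensor ops rather than loops.

```python
import math


def e4m3_to_float(byte):
    """Decode one FP8 E4M3 (FN variant) byte to a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return math.nan  # the only NaN encoding; E4M3FN has no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize_blockwise(codes, scale_inv, block=128):
    """codes: 2-D list of FP8 bytes; scale_inv: one float per block x block tile.

    Assumption: dequantized value = decoded FP8 value * scale_inv of its tile.
    """
    return [
        [e4m3_to_float(b) * scale_inv[r // block][c // block]
         for c, b in enumerate(row)]
        for r, row in enumerate(codes)
    ]
```

With `block=128`, element `(r, c)` uses `scale_inv[r // 128][c // 128]`, which is what "128×128 blocks" means in practice.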
**This is weight-only NVFP4.** Activation quantization is not done here, so you get W4A16-equivalent behavior at runtime unless your inference engine generates dynamic per-group activation scales. vLLM does generate them dynamically at inference, so this is fine for most use cases.
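The weight-side scale arithmetic from step 3 and the joint scale from step 4 can be sketched on a single 16-element block. This is a plain-Python illustration, not the converter itself: the function names are hypothetical, the block scale is left as a float instead of being cast to FP8 E4M3, and ties are rounded toward the smaller magnitude, which may differ from the rounding rule a real kernel uses.

```python
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # index = 3-bit code


def global_scale(amax):
    """Per-tensor scale: weight_scale_2 = amax / (6.0 * 448.0)."""
    return amax / (6.0 * 448.0)


def joint_global_scale(gate_amax, up_amax):
    """Shared weight_scale_2 for a gate_proj/up_proj pair (step 4)."""
    return global_scale(max(gate_amax, up_amax))


def quantize_block(values, weight_scale_2):
    """Return (weight_scale, 4-bit codes) for one block of 16 floats."""
    block_amax = max(abs(v) for v in values)
    # An all-zero tensor gives weight_scale_2 == 0; fall back to a zero scale.
    weight_scale = block_amax / (6.0 * weight_scale_2) if weight_scale_2 > 0 else 0.0
    step = weight_scale * weight_scale_2  # real value of E2M1 magnitude 1.0
    codes = []
    for v in values:
        target = abs(v) / step if step > 0 else 0.0
        code = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - target))
        codes.append(code | (0x8 if v < 0 else 0x0))  # bit 3 is the sign
    return weight_scale, codes


def dequantize_block(codes, weight_scale, weight_scale_2):
    """Inverse mapping, useful for measuring round-trip error."""
    step = weight_scale * weight_scale_2
    return [(-1.0 if c & 0x8 else 1.0) * E2M1_MAGNITUDES[c & 0x7] * step
            for c in codes]
```

Because `weight_scale` works out to `448 * block_amax / amax`, the block holding the tensor-wide maximum gets a scale of 448 (the FP8 E4M3 maximum) and every other block gets something smaller, which is why the 448 factor appears in `weight_scale_2`.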
## Path B — what it does
1. Loads the FP8 model via `transformers` with `device_map="auto"` and the offload folder pointing at NVMe. With 2.7 TB RAM, the FP8 weights (~865 GB) sit in RAM; activations and per-layer BF16 promotion happen on the B200s.
2. Loads a calibration set (default 256 samples of `HuggingFaceH4/ultrachat_200k`).
3. Runs `llm-compressor` `oneshot` with `pipeline="sequential"` so only one transformer block is materialized in BF16 on GPU at a time.
4. `moe_calibrate_all_experts=True` ensures every routed expert gets calibration signal even when natural routing wouldn't pick it.
5. The recipe targets `Linear` with NVFP4 and the same ignore list as Path A (lm_head, embed, router gates, norms, indexer, mHC).
6. Saves with `save_compressed=True` in `compressed-tensors` format.
**The known risks for Path B on V4 specifically:**
- V4 architecture is brand new. `llm-compressor` may not have a registered MoE wrapper for V4 — you may need to call `replace_modules_for_calibration` with the actual V4 MoE class name (the script has a TODO and a fallback path).
- Sequential pipeline may not handle CSA/HCA hybrid attention if the attention forward isn't a simple linear chain. If you see weird offload errors during calibration, the indexer/scoring tensors are likely the culprit.
- Calibration cache for 256 routed experts × all V4 layers can be hundreds of GB. Watch `nvidia-smi` and `free -h` during the first 30 minutes.
## Things to discuss with the NVIDIA engineer
1. **NVFP4 packing convention.** My converter packs as `byte = elem0 | (elem1 << 4)` (low nibble first). Verify this matches what the TensorRT-LLM / CUTLASS NVFP4 kernels expect. If it is reversed, flip the nibble order in `pack_fp4()`.
2. **Joint scaling extension.** I implement joint `weight_scale_2` for `gate_proj`/`up_proj` pairs. Ask whether `down_proj` also benefits, or whether all three projections of an expert in a fused MoE block should share one scale; recipes have varied.
3. **mHC residual weights.** I preserve them in FP8/BF16 conservatively. If NVIDIA has actually quantized these somewhere internally, drop them out of the ignore list to recover memory.
4. **CSA + HCA indexer/scoring tensors.** I preserve these blindly based on the V3.2 DSA precedent. Ask whether V4's compressed-sparse / heavily-compressed attention has analogous "cannot quantize" tensors and what the canonical regex is.
5. **W4A4 vs W4A16 for V4 Pro.** Path A is W4A16-equivalent; Path B is W4A4. For a 1.6T MoE serving extremely long contexts, ask which is recommended internally for first deployment.
6. **`modelopt` vs `llm-compressor` for V4.** Red Hat shipped V4-*Flash* NVFP4 via `llm-compressor`. Why not Pro yet? Find out whether the blocker is a known-bad layer or just compute time.
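For item 1, the two candidate nibble orders can be compared directly. This is a small sketch with hypothetical helper names (the real `pack_fp4()` may differ), and it says nothing about which order any particular kernel expects; that is still the question to ask.

```python
def pack_low_first(codes):
    """byte = elem0 | (elem1 << 4): even-indexed element in the low nibble."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    return bytes((codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
                 for i in range(0, len(codes), 2))


def pack_high_first(codes):
    """byte = (elem0 << 4) | elem1: even-indexed element in the high nibble."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    return bytes(((codes[i] & 0xF) << 4) | (codes[i + 1] & 0xF)
                 for i in range(0, len(codes), 2))


def unpack_low_first(data):
    """Inverse of pack_low_first: yields elem0, elem1 for each byte."""
    out = []
    for byte in data:
        out.extend((byte & 0xF, byte >> 4))
    return out
```

Packing the codes `[1, 2, 3, 4]` gives bytes `0x21 0x43` low-nibble-first and `0x12 0x34` high-nibble-first, so a mismatch swaps every adjacent pair of weights rather than failing loudly.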
## Output sizes to expect
- FP8 source: ~865 GB
- Path A NVFP4 output: ~430-470 GB (about 2× compression vs FP8 source; experts dominate, norms/embeds add a bit back)
- Path B NVFP4 output: similar, plus activation scale metadata
## Resumability
Path A is checkpoint-resumable per shard — if it dies mid-run, re-running picks up from the next unwritten output shard. Path B is **not** resumable mid-calibration; if it crashes you restart.
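One way per-shard resumption like this can work is sketched below. It is an assumption about the mechanism, not a description of what `fp8_to_nvfp4_streaming.py` actually does, and both helper names are hypothetical: finished shards are detected by their final filename, and each shard is written to a temporary name and renamed only once complete.

```python
import os
from pathlib import Path


def pending_shards(shard_names, dst_dir):
    """Shards still to be written: those with no finished file in dst_dir."""
    dst = Path(dst_dir)
    return [name for name in shard_names if not (dst / name).exists()]


def write_shard_atomically(dst_dir, name, payload):
    """Write to a temp file, then rename, so a crash never leaves a
    partial shard under its final name."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    tmp = dst / (name + ".tmp")
    tmp.write_bytes(payload)
    os.replace(tmp, dst / name)  # rename within one directory
```

This scheme only works if output shard names and their tensor assignment are deterministic across runs; otherwise a resumed run could skip a shard whose contents would now differ.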