# DeepSeek V4 Pro → NVFP4 conversion kit

Two paths for converting `sgl-project/DeepSeek-V4-Pro-FP8` (the uniform-FP8 repackage of the original mixed-precision V4 Pro) into NVFP4 for Blackwell inference.

## What's here

| File | Purpose |
| --- | --- |
| `inspect_model.py` | Run this first. Prints tensor name patterns, dtypes, FP8 scaling block sizes, and counts of MoE expert/router/norm tensors so you know exactly what you're dealing with before any conversion. |
| `fp8_to_nvfp4_streaming.py` | **Path A.** Pure tensor-level streaming FP8 → NVFP4 conversion. No model loading, no calibration, weight-only. Low memory, fast, deterministic. Recommended for first run. |
| `quantize_llmcompressor.py` | **Path B.** `llm-compressor` oneshot with sequential pipeline + activation calibration. Produces W4A4 with calibrated activation scales. Higher quality on activation-sensitive ops but riskier given V4 is two weeks old. |
| `verify_nvfp4.py` | Loads the produced NVFP4 checkpoint, runs a basic forward pass through one block, checks for NaN/Inf, and dumps a few generated tokens via vLLM. |

## Hardware assumptions

- 8× B200 bare metal, 1.5 TB HBM total
- 2.7 TB system RAM
- ≥10 TB free NVMe at `~/nvidia-meeting/`

## Prereqs

```bash
source ~/nvidia-meeting/venv/bin/activate
pip install --upgrade torch safetensors transformers tqdm
pip install --upgrade llmcompressor compressed-tensors  # only needed for Path B
pip install --upgrade vllm  # only needed for verify
```

You'll likely need `transformers` from source for V4 architecture support, and `trust_remote_code=True` everywhere. Stock pip versions may not load V4 yet.
## Recommended order tonight

```bash
cd ~/nvidia-meeting

# 1. Inspect the FP8 source (about 30 seconds, no GPU needed).
python inspect_model.py DeepSeek-V4-Pro-FP8 | tee inspect.log

# 2. Path A streaming conversion. Expect 2-6 hours, dominated by NVMe I/O.
python fp8_to_nvfp4_streaming.py \
    --src DeepSeek-V4-Pro-FP8 \
    --dst DeepSeek-V4-Pro-NVFP4-streaming \
    --workers 8 \
    2>&1 | tee path_a.log

# 3. Quick sanity check: does it load and run a forward pass?
python verify_nvfp4.py DeepSeek-V4-Pro-NVFP4-streaming

# 4. Path B (overnight). Run only after Path A succeeds. Expect 24-72 hours.
python quantize_llmcompressor.py \
    --src DeepSeek-V4-Pro-FP8 \
    --dst DeepSeek-V4-Pro-NVFP4-llmcompressor \
    --num-samples 256 \
    --max-seq-len 4096 \
    2>&1 | tee path_b.log
```

## Path A — what it does
1. Reads `model.safetensors.index.json` to map every tensor to its shard.
2. Classifies every tensor:
   - **Preserve** (copied bit-for-bit): `lm_head`, `embed_tokens`, MoE router gates (`*.mlp.gate`), all norms, V4-specific attention indexer/scoring tensors, mHC residual mixing weights.
   - **Quantize**: any FP8 weight that has a corresponding `*.weight_scale_inv` companion (i.e. real GEMM weights).
3. For every quantizable weight:
   - Dequantizes FP8 E4M3 → FP32 using the source's per-block scales (auto-detects 128×128 blocks).
   - Computes the two NVFP4 scales: per-tensor `weight_scale_2 = amax / (6.0 * 448.0)` and per-16-element-block `weight_scale = block_amax / (6.0 * weight_scale_2)`, the latter cast to FP8 E4M3.
   - Quantizes FP32 → E2M1 by rounding each scaled value to the nearest representable value in `{0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}`.
   - Packs two 4-bit values per `uint8` byte.
4. **MoE pair handling**: detects `gate_proj` (w1) + `up_proj` (w3) of each expert and computes a joint `weight_scale_2` across both, since vLLM's fused MoE kernel requires them to share that global scale.
5. Streams output to new shards (~5 GB each) with a fresh `model.safetensors.index.json` and copies all non-tensor files (config, tokenizer, etc.) verbatim.

**This is weight-only NVFP4.** Activation quantization is not done here, so you get W4A16-equivalent behavior at runtime unless your inference engine generates dynamic per-group activation scales. vLLM does generate them at inference time, so this is fine for most use cases.
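The scale computation and rounding in step 3, together with the joint scale from step 4, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `fp8_to_nvfp4_streaming.py`: the input is taken to be already dequantized to floats, block scales stay as Python floats instead of being cast to FP8 E4M3, ties round toward the smaller magnitude, and all function names are hypothetical.

```python
# Magnitudes representable in FP4 E2M1, indexed by the low 3 bits of the code.
# Bit 3 of the code is the sign.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX, FP8_MAX, BLOCK = 6.0, 448.0, 16


def global_scale(*tensors):
    """Per-tensor weight_scale_2. Pass several tensors to get one joint scale."""
    amax = max(abs(x) for t in tensors for x in t)
    return amax / (FP4_MAX * FP8_MAX) if amax > 0 else 1.0


def quantize_nvfp4(values, weight_scale_2):
    """Quantize a flat list of floats (length a multiple of 16) to 4-bit codes.

    Returns (codes, block_scales): one code per value, one scale per block.
    """
    codes, block_scales = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        block_amax = max(abs(x) for x in block)
        # A real converter would cast this scale to FP8 E4M3 (assumption: kept as float).
        weight_scale = block_amax / (FP4_MAX * weight_scale_2)
        block_scales.append(weight_scale)
        step = weight_scale * weight_scale_2 or 1.0  # all-zero block: avoid 0 division
        for x in block:
            mag = abs(x) / step  # lands in [0, 6] up to float error
            # Nearest grid point; min() keeps the first (smaller) value on ties.
            idx = min(range(8), key=lambda i: abs(E2M1_GRID[i] - mag))
            codes.append(idx | (8 if x < 0 else 0))
    return codes, block_scales


def dequantize_nvfp4(codes, block_scales, weight_scale_2):
    """Inverse mapping, useful for checking the round-trip error."""
    out = []
    for i, code in enumerate(codes):
        sign = -1.0 if code & 8 else 1.0
        out.append(sign * E2M1_GRID[code & 7] * block_scales[i // BLOCK] * weight_scale_2)
    return out
```

Passing both the `gate_proj` and `up_proj` values to `global_scale` yields the shared `weight_scale_2`; each tensor is then quantized separately with that value.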
## Path B — what it does
1. Loads the FP8 model via `transformers` with `device_map="auto"` and the offload folder pointing at NVMe. With 2.7 TB RAM, the FP8 weights (~865 GB) sit in RAM; activations and per-layer BF16 promotion happen on the B200s.
2. Loads a calibration set (default 256 samples of `HuggingFaceH4/ultrachat_200k`).
3. Runs `llm-compressor` `oneshot` with `pipeline="sequential"` so only one transformer block is materialized in BF16 on GPU at a time.
4. Sets `moe_calibrate_all_experts=True` so every routed expert gets calibration signal even when natural routing wouldn't pick it.
5. Targets `Linear` with NVFP4 in the recipe, using the same ignore list as Path A (`lm_head`, embeddings, router gates, norms, indexer, mHC).
6. Saves with `save_compressed=True` in `compressed-tensors` format.

**The known risks for Path B on V4 specifically:**

- V4 architecture is brand new. `llm-compressor` may not have a registered MoE wrapper for V4, so you may need to call `replace_modules_for_calibration` with the actual V4 MoE class name (the script has a TODO and a fallback path).
- Sequential pipeline may not handle CSA/HCA hybrid attention if the attention forward isn't a simple linear chain. If you see weird offload errors during calibration, the indexer/scoring tensors are likely the culprit.
- Calibration cache for 256 routed experts × all V4 layers can be hundreds of GB. Watch `nvidia-smi` and `free -h` during the first 30 minutes.
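Both paths depend on the same preserve/ignore list, so it may help to see how such a list could be written as tensor-name patterns. The sketch below is an assumption-heavy illustration: `lm_head`, `embed_tokens`, `.mlp.gate` and the `weight_scale_inv` suffix come from this README, while the `indexer` and `mhc` patterns are placeholders that should be replaced with whatever `inspect_model.py` actually reports. The `classify` helper is hypothetical.

```python
import re

# Name fragments for tensors that must be copied unchanged.
PRESERVE_PATTERNS = [
    r"(^|\.)lm_head\.",
    r"(^|\.)embed_tokens\.",
    r"\.mlp\.gate\.",   # MoE router gate; does not match gate_proj
    r"norm",            # layernorm / RMSNorm weights
    r"indexer",         # ASSUMPTION: placeholder for V4 indexer/scoring tensors
    r"(^|\.)mhc\.",     # ASSUMPTION: placeholder for mHC residual mixing weights
]
_PRESERVE = re.compile("|".join(PRESERVE_PATTERNS))


def classify(name, all_names):
    """Return 'preserve', 'quantize', or 'scale' for one tensor name."""
    if name.endswith(".weight_scale_inv"):
        # FP8 block scales are handled together with the weight they belong to.
        return "scale"
    if _PRESERVE.search(name):
        return "preserve"
    if name.endswith(".weight") and name + "_scale_inv" in all_names:
        return "quantize"
    # Anything unrecognized is kept as-is, which is the conservative choice.
    return "preserve"
```

Defaulting unknown tensors to `preserve` costs some memory but avoids silently quantizing a tensor the kernels expect in higher precision.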
## Things to discuss with the NVIDIA engineer

1. **NVFP4 packing convention.** My converter packs as `byte = elem0 | (elem1 << 4)` (low nibble first). Verify this matches what TensorRT-LLM / CUTLASS NVFP4 kernels expect. If reversed, just flip in `pack_fp4()`.
2. **Joint scaling extension.** I implement joint `weight_scale_2` for `gate_proj`/`up_proj` pairs. Ask whether `down_proj` also benefits, or whether all three projections of an expert in a fused MoE block should share one scale. Recipes have varied.
3. **mHC residual weights.** I preserve them in FP8/BF16 conservatively. If NVIDIA has actually quantized these somewhere internally, drop them out of the ignore list to recover memory.
4. **CSA + HCA indexer/scoring tensors.** I preserve these blindly based on the V3.2 DSA precedent. Ask whether V4's compressed-sparse / heavily-compressed attention has analogous "cannot quantize" tensors and what the canonical regex is.
5. **W4A4 vs W4A16 for V4 Pro.** Path A is W4A16-equivalent; Path B is W4A4. For a 1.6T MoE with extreme long context, ask which is internally recommended for first deployment.
6. **`modelopt` vs `llm-compressor` for V4.** Red Hat shipped V4-*Flash* NVFP4 via `llm-compressor`. Why not Pro yet? Find out if there's a known-bad layer or just compute time.
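For item 1, the two nibble orders can be compared directly. The sketch below is an illustration, not the converter's actual `pack_fp4()`: the `low_nibble_first` parameter and the `unpack_fp4` helper are hypothetical, added here so that both conventions can be tested against a kernel's expectations.

```python
def pack_fp4(codes, low_nibble_first=True):
    """Pack an even-length sequence of 4-bit codes (0..15) into bytes."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    out = bytearray()
    for elem0, elem1 in zip(codes[0::2], codes[1::2]):
        if low_nibble_first:
            # Convention described above: byte = elem0 | (elem1 << 4)
            out.append((elem0 & 0xF) | ((elem1 & 0xF) << 4))
        else:
            # Flipped convention: first element in the high nibble
            out.append(((elem0 & 0xF) << 4) | (elem1 & 0xF))
    return bytes(out)


def unpack_fp4(packed, low_nibble_first=True):
    """Inverse of pack_fp4 for the same nibble order."""
    codes = []
    for byte in packed:
        lo, hi = byte & 0xF, byte >> 4
        codes.extend((lo, hi) if low_nibble_first else (hi, lo))
    return codes
```

Packing a short known sequence with both settings and comparing the bytes against a reference checkpoint is a cheap way to confirm which order a given kernel expects.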
## Output sizes to expect

- FP8 source: ~865 GB
- Path A NVFP4 output: ~430–470 GB (about 2× compression vs FP8 source; experts dominate, norms/embeds add a bit back)
- Path B NVFP4 output: similar, plus activation scale metadata
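A back-of-envelope check of these figures, as a sketch under simplifying assumptions: every quantized weight costs 4 bits plus one 8-bit block scale per 16 elements, FP8 weights cost about 8 bits each (the 128×128 block scales are treated as negligible), and `quantized_fraction` is a hypothetical knob for the share of source bytes that actually gets quantized.

```python
def nvfp4_bits_per_weight(block_size=16, scale_bits=8):
    """4-bit payload plus the amortized per-block FP8 scale."""
    return 4 + scale_bits / block_size


def estimate_nvfp4_gb(fp8_gb, quantized_fraction=1.0):
    """Estimate output size; preserved tensors keep their original size."""
    ratio = nvfp4_bits_per_weight() / 8.0
    return fp8_gb * (quantized_fraction * ratio + (1.0 - quantized_fraction))


if __name__ == "__main__":
    print(nvfp4_bits_per_weight())            # effective bits per quantized weight
    print(round(estimate_nvfp4_gb(865), 1))   # everything quantized
    print(round(estimate_nvfp4_gb(865, 0.98), 1))
```

Under these assumptions a fully quantized 865 GB source comes out near 487 GB, a little above the 430–470 GB range listed, because the per-block scales add roughly 54 GB on top of the 4-bit payload. Treat the listed range as approximate and leave headroom.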
## Resumability

Path A is resumable at shard granularity: if it dies mid-run, re-running picks up from the next unwritten output shard. Path B is **not** resumable mid-calibration; if it crashes, you restart from the beginning.
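One way per-shard resumption can be made safe is to write each shard under a temporary name and rename it only once it is complete, so a crash never leaves a partial file that looks finished. The sketch below illustrates that idea; it is an assumption about a reasonable design, not a description of what `fp8_to_nvfp4_streaming.py` does, and the function names and shard file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def pending_shards(dst_dir, shard_names):
    """Return the shard names that do not yet exist in dst_dir, in order."""
    dst = Path(dst_dir)
    return [name for name in shard_names if not (dst / name).exists()]


def write_shard_atomically(dst_dir, name, payload):
    """Write payload to a temp file, then rename it to its final name."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    tmp = dst / (name + ".tmp")
    with open(tmp, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes reach disk before the rename
    os.replace(tmp, dst / name)  # atomic on the same filesystem


if __name__ == "__main__":
    names = ["model-00001.safetensors", "model-00002.safetensors"]
    with tempfile.TemporaryDirectory() as out_dir:
        write_shard_atomically(out_dir, names[0], b"placeholder bytes")
        # Only the second shard is still pending after the simulated partial run.
        print(pending_shards(out_dir, names))
```

A leftover `.tmp` file after a crash is simply overwritten on the next run, since only the final names are consulted when deciding what to skip.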