DeepSeek V4 Pro → NVFP4 conversion kit
Two paths for converting sgl-project/DeepSeek-V4-Pro-FP8 (the uniform-FP8 repackage of the original mixed-precision V4 Pro) into NVFP4 for Blackwell inference.
What's here
| File | Purpose |
|---|---|
| `inspect_model.py` | Run this first. Prints tensor name patterns, dtypes, FP8 scaling block sizes, and counts of MoE expert/router/norm tensors so you know exactly what you're dealing with before any conversion. |
| `fp8_to_nvfp4_streaming.py` | Path A. Pure tensor-level streaming FP8 → NVFP4 conversion. No model loading, no calibration, weight-only. Low memory, fast, deterministic. Recommended for first run. |
| `quantize_llmcompressor.py` | Path B. llm-compressor oneshot with sequential pipeline + activation calibration. Produces W4A4 with calibrated activation scales. Higher quality on activation-sensitive ops but riskier given V4 is two weeks old. |
| `verify_nvfp4.py` | Loads the produced NVFP4 checkpoint, runs a basic forward pass through one block, checks for NaN/Inf, and dumps a few generated tokens via vLLM. |
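For a feel of what the inspection step looks at, here is a minimal sketch that reads one shard's safetensors header without touching tensor data. It is not the code in `inspect_model.py`; the function names `read_safetensors_header` and `summarize` are made up for illustration. It relies only on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header).

```python
import json
import struct
from collections import Counter

def read_safetensors_header(path):
    """Return the JSON header of one .safetensors shard without loading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian uint64 giving the header length in bytes.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return header

def summarize(header):
    """Count tensors per dtype and list names that look like FP8 block scales."""
    dtypes = Counter(info["dtype"] for info in header.values())
    scales = sorted(n for n in header if n.endswith(".weight_scale_inv"))
    return dtypes, scales
```

Calling `summarize(read_safetensors_header(shard_path))` on each shard and merging the counters gives a dtype census in seconds, since only the headers are read.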
Hardware assumptions
- 8× B200 baremetal, 1.5 TB HBM total
- 2.7 TB system RAM
- ≥10 TB free NVMe at `~/nvidia-meeting/`
Prereqs
source ~/nvidia-meeting/venv/bin/activate
pip install --upgrade torch safetensors transformers tqdm
pip install --upgrade llmcompressor compressed-tensors # only needed for Path B
pip install --upgrade vllm # only needed for verify
You'll likely need transformers from source for V4 architecture support, and trust_remote_code=True everywhere. Stock pip versions may not load V4 yet.
Recommended order tonight
cd ~/nvidia-meeting
# 1. Inspect the FP8 source — 30 seconds, no GPU needed.
python inspect_model.py DeepSeek-V4-Pro-FP8 | tee inspect.log
# 2. Path A streaming conversion — should run in 2-6 hours dominated by NVMe I/O.
python fp8_to_nvfp4_streaming.py \
--src DeepSeek-V4-Pro-FP8 \
--dst DeepSeek-V4-Pro-NVFP4-streaming \
--workers 8 \
2>&1 | tee path_a.log
# 3. Quick sanity check — does it load and forward-pass?
python verify_nvfp4.py DeepSeek-V4-Pro-NVFP4-streaming
# 4. Path B (overnight). Run only after Path A succeeds. 24-72 hours.
python quantize_llmcompressor.py \
--src DeepSeek-V4-Pro-FP8 \
--dst DeepSeek-V4-Pro-NVFP4-llmcompressor \
--num-samples 256 \
--max-seq-len 4096 \
2>&1 | tee path_b.log
Path A — what it does
- Reads `model.safetensors.index.json` to map every tensor to its shard.
- Classifies every tensor:
  - Preserve (copied bit-for-bit): `lm_head`, `embed_tokens`, MoE router gates (`*.mlp.gate`), all norms, V4-specific attention indexer/scoring tensors, mHC residual mixing weights.
  - Quantize: any FP8 weight that has a corresponding `*.weight_scale_inv` companion (i.e. real GEMM weights).
- For every quantizable weight:
  - Dequantizes FP8 E4M3 → FP32 using the source's per-block scales (auto-detects 128×128 blocks).
  - Computes NVFP4 dual scales: per-tensor `weight_scale_2 = amax / (6.0 * 448.0)` and per-16-element-block `weight_scale = block_amax / (6.0 * weight_scale_2)` cast to FP8 E4M3.
  - Quantizes FP32 → E2M1 representable values `{0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}`.
  - Packs two 4-bit values per `uint8` byte.
- MoE pair handling: detects `gate_proj` (w1) + `up_proj` (w3) of each expert and computes a joint `weight_scale_2` across both, since vLLM's fused MoE kernel requires them to share that global scale.
- Streams output to new shards (~5 GB each) with a fresh `model.safetensors.index.json` and copies all non-tensor files (config, tokenizer, etc.) verbatim.
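The scale arithmetic in the list above can be sketched in plain Python on flat lists of floats. This is an illustration under stated assumptions, not the code in `fp8_to_nvfp4_streaming.py`: the names `to_e4m3`, `global_scale`, and `nvfp4_quantize` are invented here, the FP8 cast is a hand-rolled round-to-nearest, and ties in the E2M1 rounding go to the smaller magnitude, which may differ from what a real kernel does.

```python
import math

# E2M1 magnitudes; the list index is the 3-bit code, bit 3 carries the sign.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def to_e4m3(x):
    """Round a non-negative float to the nearest FP8 E4M3 value (max 448)."""
    if x <= 0.0:
        return 0.0
    x = min(x, 448.0)
    e = max(math.floor(math.log2(x)), -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (e - 3)                  # 3 mantissa bits -> 8 steps per octave
    return min(round(x / step) * step, 448.0)

def global_scale(*tensors):
    """Joint weight_scale_2 over one or more tensors (e.g. gate_proj + up_proj)."""
    amax = max(abs(v) for t in tensors for v in t)
    return amax / (6.0 * 448.0) if amax > 0 else 1.0

def nvfp4_quantize(values, weight_scale_2, block=16):
    """Return (per-block FP8 scales, packed bytes) for a flat list of floats."""
    assert len(values) % block == 0
    scales, packed = [], bytearray()
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        block_amax = max(abs(v) for v in chunk)
        ws = to_e4m3(block_amax / (6.0 * weight_scale_2))
        eff = ws * weight_scale_2  # value that one E2M1 unit represents
        codes = []
        for v in chunk:
            mag = abs(v) / eff if eff > 0 else 0.0
            idx = min(range(8), key=lambda k: abs(E2M1[k] - mag))  # nearest; ties go low
            codes.append(idx | (0x8 if v < 0 else 0))
        for lo, hi in zip(codes[0::2], codes[1::2]):
            packed.append(lo | (hi << 4))  # low nibble first
        scales.append(ws)
    return scales, bytes(packed)

gate = [0.01 * k for k in range(-8, 8)]
up = [0.5] * 16
ws2 = global_scale(gate, up)  # shared by both projections
gate_scales, gate_bytes = nvfp4_quantize(gate, ws2)
up_scales, up_bytes = nvfp4_quantize(up, ws2)
```

Because `ws2` is computed over both tensors, the projection with the smaller range (`gate` here) ends up with smaller per-block FP8 scales rather than its own global scale, which is the trade-off the joint-scale requirement imposes.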
This is weight-only NVFP4. Activation quantization is not done here — you get W4A16 effective behavior at runtime unless your inference engine generates dynamic per-group activation scales. vLLM does generate per-group activation scales dynamically at inference, so this is fine for most use cases.
Path B — what it does
- Loads the FP8 model via `transformers` with `device_map="auto"` and the offload folder pointing at NVMe. With 2.7 TB RAM, the FP8 weights (~865 GB) sit in RAM; activations and per-layer BF16 promotion happen on the B200s.
- Loads a calibration set (default 256 samples of `HuggingFaceH4/ultrachat_200k`).
- Runs `llm-compressor` `oneshot` with `pipeline="sequential"` so only one transformer block is materialized in BF16 on GPU at a time.
- `moe_calibrate_all_experts=True` ensures every routed expert gets calibration signal even when natural routing wouldn't pick it.
- The recipe targets `Linear` with NVFP4 and the same ignore list as Path A (`lm_head`, embed, router gates, norms, indexer, mHC).
- Saves with `save_compressed=True` in `compressed-tensors` format.
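The ignore list is the part most likely to need editing for V4, so it helps to test patterns against real module names before a long run. The sketch below assumes the `re:` prefix convention for regex entries in llm-compressor ignore lists (plain entries match a name exactly). Every pattern in `IGNORE` is hypothetical, as is the helper `is_ignored`; the actual V4 module names have to come from the `inspect_model.py` output.

```python
import re

# Hypothetical patterns: replace with names seen in inspect.log.
IGNORE = [
    "lm_head",
    "re:.*embed_tokens$",
    "re:.*\\.mlp\\.gate$",   # router gate only, not gate_proj
    "re:.*norm.*",
    "re:.*indexer.*",
    "re:.*mhc.*",            # placeholder for the mHC residual mixing weights
]

def is_ignored(module_name, patterns=IGNORE):
    """True if a module name matches any exact or 're:'-prefixed pattern."""
    for p in patterns:
        if p.startswith("re:"):
            if re.match(p[3:], module_name):
                return True
        elif module_name == p:
            return True
    return False
```

Running `is_ignored` over every module name in the model and printing the matches is a cheap way to confirm that router gates are skipped while expert `gate_proj` layers are still quantized.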
The known risks for Path B on V4 specifically:
- V4 architecture is brand new. `llm-compressor` may not have a registered MoE wrapper for V4; you may need to call `replace_modules_for_calibration` with the actual V4 MoE class name (the script has a TODO and a fallback path).
- Sequential pipeline may not handle CSA/HCA hybrid attention if the attention forward isn't a simple linear chain. If you see weird offload errors during calibration, the indexer/scoring tensors are likely the culprit.
- Calibration cache for 256 routed experts × all V4 layers can be hundreds of GB. Watch `nvidia-smi` and `free -h` during the first 30 minutes.
Things to discuss with the NVIDIA engineer
- NVFP4 packing convention. My converter packs as `byte = elem0 | (elem1 << 4)` (low nibble first). Verify this matches what TensorRT-LLM / CUTLASS NVFP4 kernels expect. If reversed, just flip in `pack_fp4()`.
- Joint scaling extension. I implement joint `weight_scale_2` for `gate_proj`/`up_proj` pairs. Ask whether `down_proj` also benefits, or whether all three projections of an expert in a fused MoE block should share one global scale; recipes have varied.
- mHC residual weights. I preserve them in FP8/BF16 conservatively. If NVIDIA has actually quantized these somewhere internally, drop them out of the ignore list to recover memory.
- CSA + HCA indexer/scoring tensors. I preserve these blindly based on the V3.2 DSA precedent. Ask whether V4's compressed-sparse / heavily-compressed attention has analogous "cannot quantize" tensors and what the canonical regex is.
- W4A4 vs W4A16 for V4 Pro. Path A is W4A16-equivalent; Path B is W4A4. For a 1.6T MoE with extreme long-context, ask which is internally recommended for first deployment.
- `modelopt` vs `llm-compressor` for V4. RedHat shipped V4-Flash NVFP4 via `llm-compressor`. Why not Pro yet? Find out if there's a known-bad layer or just compute time.
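On the packing question, the two nibble orders differ only by a swap within each byte, so an already-converted checkpoint could in principle be fixed without requantizing. A small sketch follows; the `pack_fp4` signature here, including the `low_first` flag, is an assumption for illustration and not necessarily what the converter's `pack_fp4()` looks like.

```python
def pack_fp4(codes, low_first=True):
    """Pack an even-length sequence of 4-bit codes into bytes."""
    out = bytearray()
    for a, b in zip(codes[0::2], codes[1::2]):
        out.append((a | (b << 4)) if low_first else (b | (a << 4)))
    return bytes(out)

def swap_nibbles(packed):
    """Convert between low-nibble-first and high-nibble-first packing."""
    return bytes(((x & 0x0F) << 4) | (x >> 4) for x in packed)

codes = [1, 2, 3, 4]
low = pack_fp4(codes)                    # elem0 in the low nibble
high = pack_fp4(codes, low_first=False)  # elem0 in the high nibble
```

Here `swap_nibbles(low)` equals `high`, so the same helper works in both directions.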
Output sizes to expect
- FP8 source: ~865 GB
- Path A NVFP4 output: ~430–470 GB (about 2× compression vs FP8 source; experts dominate, norms/embeds add a bit back)
- Path B NVFP4 output: similar, plus activation scale metadata
Resumability
Path A is checkpoint-resumable per shard — if it dies mid-run, re-running picks up from the next unwritten output shard. Path B is not resumable mid-calibration; if it crashes you restart.
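One way per-shard resumability can be made safe is to write each shard under a temporary name and rename it only once it is complete, so an interrupted run never leaves a partial file that looks finished. The sketch below shows that pattern; it is an assumption about how such a scheme could work, not a description of the script's internals, and both function names are hypothetical.

```python
import os

def pending_shards(dst_dir, shard_names):
    """Return the shards that still need to be written, in order."""
    return [n for n in shard_names
            if not os.path.exists(os.path.join(dst_dir, n))]

def write_shard_atomically(dst_dir, name, payload):
    """Write to a temporary name, then rename, so a crash cannot leave
    a half-written shard under its final name."""
    os.makedirs(dst_dir, exist_ok=True)
    tmp = os.path.join(dst_dir, name + ".tmp")
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before renaming
    os.replace(tmp, os.path.join(dst_dir, name))
```

On restart, looping over `pending_shards(dst, all_names)` skips everything already renamed into place; leftover `.tmp` files are simply overwritten.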