DeepSeek V4 Pro → NVFP4 conversion kit

Two paths for converting sgl-project/DeepSeek-V4-Pro-FP8 (the uniform-FP8 repackage of the original mixed-precision V4 Pro) into NVFP4 for Blackwell inference.

What's here

| File | Purpose |
| --- | --- |
| inspect_model.py | Run this first. Prints tensor name patterns, dtypes, FP8 scaling block sizes, and counts of MoE expert/router/norm tensors so you know exactly what you're dealing with before any conversion. |
| fp8_to_nvfp4_streaming.py | Path A. Pure tensor-level streaming FP8 → NVFP4 conversion. No model loading, no calibration, weight-only. Low memory, fast, deterministic. Recommended for first run. |
| quantize_llmcompressor.py | Path B. llm-compressor oneshot with sequential pipeline + activation calibration. Produces W4A4 with calibrated activation scales. Higher quality on activation-sensitive ops but riskier given V4 is two weeks old. |
| verify_nvfp4.py | Loads the produced NVFP4 checkpoint, runs a basic forward pass through one block, checks for NaN/Inf, and dumps a few generated tokens via vLLM. |
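
For orientation, here is a minimal sketch of the kind of header-only inspection inspect_model.py describes. It is not the script itself. It relies only on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header), so it needs no GPU and never touches tensor data. The name checks (`.experts.`, `.mlp.gate.weight`, `norm`) are assumptions for illustration; the real V4 names should come from your own inspect.log.

```python
import json
import struct
from collections import Counter
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian u64 giving the JSON header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def summarize(model_dir):
    """Count tensors by dtype and by a few illustrative name categories."""
    dtypes, kinds = Counter(), Counter()
    for shard in sorted(Path(model_dir).glob("*.safetensors")):
        for name, info in read_safetensors_header(shard).items():
            dtypes[info["dtype"]] += 1
            # Assumed name patterns; adjust to what the checkpoint really uses.
            if name.endswith(".weight_scale_inv"):
                kinds["fp8_block_scale"] += 1
            elif ".experts." in name:
                kinds["expert"] += 1
            elif name.endswith(".mlp.gate.weight"):
                kinds["router"] += 1
            elif "norm" in name:
                kinds["norm"] += 1
    return dtypes, kinds
```

Calling `summarize("DeepSeek-V4-Pro-FP8")` would return two counters you can print or diff against inspect.log.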

Hardware assumptions

  • 8× B200 baremetal, 1.5 TB HBM total
  • 2.7 TB system RAM
  • ≥10 TB free NVMe at ~/nvidia-meeting/

Prereqs

source ~/nvidia-meeting/venv/bin/activate
pip install --upgrade torch safetensors transformers tqdm
pip install --upgrade llmcompressor compressed-tensors  # only needed for Path B
pip install --upgrade vllm                              # only needed for verify

You'll likely need transformers from source for V4 architecture support, and trust_remote_code=True everywhere. Stock pip versions may not load V4 yet.

cd ~/nvidia-meeting

# 1. Inspect the FP8 source — 30 seconds, no GPU needed.
python inspect_model.py DeepSeek-V4-Pro-FP8 | tee inspect.log

# 2. Path A streaming conversion: expect 2-6 hours, dominated by NVMe I/O.
python fp8_to_nvfp4_streaming.py \
    --src DeepSeek-V4-Pro-FP8 \
    --dst DeepSeek-V4-Pro-NVFP4-streaming \
    --workers 8 \
    2>&1 | tee path_a.log

# 3. Quick sanity check — does it load and forward-pass?
python verify_nvfp4.py DeepSeek-V4-Pro-NVFP4-streaming

# 4. Path B (long-running, 24-72 hours). Run only after Path A succeeds.
python quantize_llmcompressor.py \
    --src DeepSeek-V4-Pro-FP8 \
    --dst DeepSeek-V4-Pro-NVFP4-llmcompressor \
    --num-samples 256 \
    --max-seq-len 4096 \
    2>&1 | tee path_b.log

Path A — what it does

  1. Reads model.safetensors.index.json to map every tensor to its shard.
  2. Classifies every tensor:
    • Preserve (copied bit-for-bit): lm_head, embed_tokens, MoE router gates (*.mlp.gate), all norms, V4-specific attention indexer/scoring tensors, mHC residual mixing weights.
    • Quantize: any FP8 weight that has a corresponding *.weight_scale_inv companion (i.e. real GEMM weights).
  3. For every quantizable weight:
    • Dequantizes FP8 E4M3 → FP32 using the source's per-block scales (auto-detects 128×128 blocks).
    • Computes NVFP4 dual scales: per-tensor weight_scale_2 = amax / (6.0 * 448.0) and per-16-element-block weight_scale = block_amax / (6.0 * weight_scale_2) cast to FP8 E4M3.
    • Quantizes FP32 → E2M1 representable values {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}.
    • Packs two 4-bit values per uint8 byte.
  4. MoE pair handling: detects gate_proj (w1) + up_proj (w3) of each expert and computes a joint weight_scale_2 across both, since vLLM's fused MoE kernel requires them to share that global scale.
  5. Streams output to new shards (~5 GB each) with a fresh model.safetensors.index.json and copies all non-tensor files (config, tokenizer, etc.) verbatim.
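
The per-weight math in step 3 can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the code in fp8_to_nvfp4_streaming.py: it works on a flat list of already-dequantized floats, keeps block scales as Python floats instead of casting them to FP8 E4M3, and breaks rounding ties toward the smaller magnitude. The function names and the `low_nibble_first` flag are made up for this example.

```python
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes; index = 3-bit code
FP4_MAX, FP8_E4M3_MAX, BLOCK = 6.0, 448.0, 16


def global_scale(values):
    """Per-tensor weight_scale_2 = amax / (6 * 448)."""
    amax = max((abs(v) for v in values), default=0.0)
    return amax / (FP4_MAX * FP8_E4M3_MAX)


def encode_e2m1(x):
    """4-bit code: sign in bit 3, index of the nearest magnitude in bits 0-2."""
    mag = min(abs(x), FP4_MAX)
    # min() keeps the first (smaller) candidate on ties; real kernels may differ.
    idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
    return (0x8 if x < 0 else 0x0) | idx


def quantize_nvfp4(values, scale_2=None, low_nibble_first=True):
    """Return (packed_bytes, block_scales, scale_2) for a list whose length is a multiple of 16."""
    if len(values) % BLOCK:
        raise ValueError("length must be a multiple of 16")
    if scale_2 is None:
        scale_2 = global_scale(values)
    codes, block_scales = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        block_amax = max(abs(v) for v in block)
        # weight_scale = block_amax / (6 * weight_scale_2); FP8 cast omitted here.
        s = block_amax / (FP4_MAX * scale_2) if scale_2 > 0 else 0.0
        block_scales.append(s)
        denom = s * scale_2
        codes.extend(encode_e2m1(v / denom if denom > 0 else 0.0) for v in block)
    packed = bytearray()
    for a, b in zip(codes[0::2], codes[1::2]):
        packed.append(a | (b << 4) if low_nibble_first else (a << 4) | b)
    return bytes(packed), block_scales, scale_2


def dequantize_nvfp4(packed, block_scales, scale_2, low_nibble_first=True):
    """Inverse of quantize_nvfp4, useful for measuring round-trip error."""
    vals = []
    for byte in packed:
        lo, hi = byte & 0xF, byte >> 4
        for code in ((lo, hi) if low_nibble_first else (hi, lo)):
            mag = E2M1_VALUES[code & 0x7]
            vals.append(-mag if code & 0x8 else mag)
    return [v * block_scales[i // BLOCK] * scale_2 for i, v in enumerate(vals)]
```

Passing an explicit `scale_2` is how a shared global scale (as in step 4) would be fed in; leaving it as `None` computes the per-tensor value.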

This is weight-only NVFP4. No activation scales are calibrated or stored, so on its own the checkpoint gives W4A16-equivalent behavior at runtime. The exception is an inference engine that generates per-group activation scales dynamically. vLLM does this at inference time, so the weight-only checkpoint is fine for most use cases.

Path B — what it does

  1. Loads the FP8 model via transformers with device_map="auto" and the offload folder pointing at NVMe. With 2.7 TB RAM, the FP8 weights (~865 GB) sit in RAM; activations and per-layer BF16 promotion happen on the B200s.
  2. Loads a calibration set (default 256 samples of HuggingFaceH4/ultrachat_200k).
  3. Runs llm-compressor oneshot with pipeline="sequential" so only one transformer block is materialized in BF16 on GPU at a time.
  4. moe_calibrate_all_experts=True ensures every routed expert gets calibration signal even when natural routing wouldn't pick it.
  5. The recipe targets Linear with NVFP4 and the same ignore list as Path A (lm_head, embed, router gates, norms, indexer, mHC).
  6. Saves with save_compressed=True in compressed-tensors format.
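
Since both paths rely on the same preserve/ignore list, it helps to have it as one testable function. Below is a minimal sketch. Every regex is a guess at V4 tensor naming (especially the `indexer` and `mhc` substrings) and should be checked against inspect.log before use; `classify` is a hypothetical helper, not a function from either script.

```python
import re

# Hypothetical patterns; confirm the real tensor names in inspect.log.
PRESERVE_PATTERNS = [
    r"(^|\.)lm_head\.",
    r"(^|\.)embed_tokens\.",
    r"\.mlp\.gate\.",   # MoE router gate (does not match gate_proj)
    r"norm",            # all norm layers
    r"indexer",         # assumed substring for attention indexer/scoring tensors
    r"mhc",             # assumed substring for mHC residual mixing weights
]
_PRESERVE = [re.compile(p, re.IGNORECASE) for p in PRESERVE_PATTERNS]


def classify(name, all_names):
    """Return 'preserve', 'quantize', or 'scale' for one tensor name.

    all_names is the set of every tensor name in the checkpoint, used to
    look for the *.weight_scale_inv companion that marks a real GEMM weight.
    """
    if any(p.search(name) for p in _PRESERVE):
        return "preserve"
    if name.endswith(".weight_scale_inv"):
        return "scale"  # consumed during dequantization, not copied
    if name.endswith(".weight") and f"{name}_scale_inv" in all_names:
        return "quantize"
    return "preserve"  # anything unrecognized is left untouched
```

Defaulting unknown tensors to "preserve" is the conservative choice: a missed quantization costs memory, while a wrong one can break the model.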

The known risks for Path B on V4 specifically:

  • V4 architecture is brand new. llm-compressor may not have a registered MoE wrapper for V4 — you may need to call replace_modules_for_calibration with the actual V4 MoE class name (the script has a TODO and a fallback path).
  • Sequential pipeline may not handle CSA/HCA hybrid attention if the attention forward isn't a simple linear chain. If you see weird offload errors during calibration, the indexer/scoring tensors are likely the culprit.
  • Calibration cache for 256 routed experts × all V4 layers can be hundreds of GB. Watch nvidia-smi and free -h during the first 30 minutes.

Things to discuss with the NVIDIA engineer

  1. NVFP4 packing convention. My converter packs as byte = elem0 | (elem1 << 4) (low nibble first). Verify this matches what TensorRT-LLM / cutlass NVFP4 kernels expect. If reversed, just flip in pack_fp4().
  2. Joint scaling extension. I implement joint weight_scale_2 for gate_proj/up_proj pairs. Ask whether down_proj also benefits, or whether all three projections of an expert in a fused MoE block should share one scale. Recipes have varied.
  3. mHC residual weights. I preserve them in FP8/BF16 conservatively. If NVIDIA has actually quantized these somewhere internally, drop them out of the ignore list to recover memory.
  4. CSA + HCA indexer/scoring tensors. I preserve these blindly based on the V3.2 DSA precedent. Ask whether V4's compressed-sparse / heavily-compressed attention has analogous "cannot quantize" tensors and what the canonical regex is.
  5. W4A4 vs W4A16 for V4 Pro. Path A is W4A16-equivalent; Path B is W4A4. For a 1.6T MoE with extreme long-context, ask which is internally recommended for first deployment.
  6. modelopt vs llm-compressor for V4. Red Hat shipped V4-Flash NVFP4 via llm-compressor. Why not Pro yet? Find out if there's a known-bad layer or just compute time.
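
To make the joint-scaling question (item 2) concrete, here is a small sketch of the two options. It assumes weights are flat lists of floats, and `share_down` is a made-up flag for this example; neither function is taken from the converter.

```python
FP4_MAX, FP8_E4M3_MAX = 6.0, 448.0


def amax(values):
    """Largest absolute value in a flat list of floats (0.0 if empty)."""
    return max((abs(v) for v in values), default=0.0)


def weight_scale_2(*tensors):
    """Global NVFP4 scale shared by every tensor passed in: max amax / (6 * 448)."""
    return max(amax(t) for t in tensors) / (FP4_MAX * FP8_E4M3_MAX)


def expert_scales(gate_proj, up_proj, down_proj, share_down=False):
    """Per-projection weight_scale_2 for one expert.

    gate_proj and up_proj always share a scale (the pairing described for
    the fused MoE kernel). share_down=True extends sharing to down_proj,
    which is the open question rather than a known requirement.
    """
    if share_down:
        joint = weight_scale_2(gate_proj, up_proj, down_proj)
        return {"gate_proj": joint, "up_proj": joint, "down_proj": joint}
    joint = weight_scale_2(gate_proj, up_proj)
    return {"gate_proj": joint, "up_proj": joint,
            "down_proj": weight_scale_2(down_proj)}
```

The trade-off is visible in the arithmetic: a shared scale is set by the largest amax in the group, so a tensor with a much smaller range loses resolution in its per-block FP8 scales.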

Output sizes to expect

  • FP8 source: ~865 GB
  • Path A NVFP4 output: ~430-470 GB (about 2× compression vs FP8 source; experts dominate, norms/embeds add a bit back)
  • Path B NVFP4 output: similar, plus activation scale metadata

Resumability

Path A is checkpoint-resumable per shard — if it dies mid-run, re-running picks up from the next unwritten output shard. Path B is not resumable mid-calibration; if it crashes you restart.
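
One way to get per-shard resumability is sketched below. It is an illustration, not the converter's actual mechanism. It assumes output shards follow the usual `model-XXXXX-of-YYYYY.safetensors` naming with a total known up front, and that each shard is written to a temporary file and renamed only when complete. The helper names are hypothetical.

```python
from pathlib import Path


def shard_name(i, total):
    """Assumed output naming, e.g. model-00001-of-00090.safetensors."""
    return f"model-{i:05d}-of-{total:05d}.safetensors"


def next_unwritten_shard(dst_dir, total):
    """Return the 1-based index of the first missing output shard, or None if all exist."""
    dst = Path(dst_dir)
    for i in range(1, total + 1):
        if not (dst / shard_name(i, total)).exists():
            return i
    return None


def write_shard_atomically(dst_dir, i, total, payload):
    """Write to a .tmp file, then rename, so a crash never leaves a half-written shard."""
    final = Path(dst_dir) / shard_name(i, total)
    tmp = final.with_suffix(".tmp")
    tmp.write_bytes(payload)
    tmp.replace(final)  # rename within one directory; atomic on POSIX filesystems
    return final
```

With this scheme a leftover `.tmp` file from a crashed run is ignored, and the next run overwrites it.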
