# DeepSeek V4 Pro → NVFP4 conversion kit

Two paths for converting `sgl-project/DeepSeek-V4-Pro-FP8` (the uniform-FP8 repackage of the original mixed-precision V4 Pro) into NVFP4 for Blackwell inference.

## What's here

| File | Purpose |
| --- | --- |
| `inspect_model.py` | Run this first. Prints tensor name patterns, dtypes, FP8 scaling block sizes, and counts of MoE expert/router/norm tensors so you know exactly what you're dealing with before any conversion. |
| `fp8_to_nvfp4_streaming.py` | **Path A.** Pure tensor-level streaming FP8 → NVFP4 conversion. No model loading, no calibration, weight-only. Low memory, fast, deterministic. Recommended for first run. |
| `quantize_llmcompressor.py` | **Path B.** `llm-compressor` oneshot with sequential pipeline + activation calibration. Produces W4A4 with calibrated activation scales. Higher quality on activation-sensitive ops but riskier given V4 is two weeks old. |
| `verify_nvfp4.py` | Loads the produced NVFP4 checkpoint, runs a basic forward pass through one block, checks for NaN/Inf, and dumps a few generated tokens via vLLM. |

## Hardware assumptions

- 8× B200 baremetal, 1.5 TB HBM total
- 2.7 TB system RAM
- ≥10 TB free NVMe at `~/nvidia-meeting/`

## Prereqs

```bash
source ~/nvidia-meeting/venv/bin/activate
pip install --upgrade torch safetensors transformers tqdm
pip install --upgrade llmcompressor compressed-tensors  # only needed for Path B
pip install --upgrade vllm                               # only needed for verify
```

You'll likely need `transformers` from source for V4 architecture support, and `trust_remote_code=True` everywhere. Stock pip versions may not load V4 yet.

## Recommended order tonight

```bash
cd ~/nvidia-meeting

# 1. Inspect the FP8 source (about 30 seconds, no GPU needed).
python inspect_model.py DeepSeek-V4-Pro-FP8 | tee inspect.log

# 2. Path A streaming conversion. Should run in 2-6 hours, dominated by NVMe I/O.
python fp8_to_nvfp4_streaming.py \
    --src DeepSeek-V4-Pro-FP8 \
    --dst DeepSeek-V4-Pro-NVFP4-streaming \
    --workers 8 \
    2>&1 | tee path_a.log

# 3. Quick sanity check: does it load and forward-pass?
python verify_nvfp4.py DeepSeek-V4-Pro-NVFP4-streaming

# 4. Path B (overnight). Run only after Path A succeeds. 24-72 hours.
python quantize_llmcompressor.py \
    --src DeepSeek-V4-Pro-FP8 \
    --dst DeepSeek-V4-Pro-NVFP4-llmcompressor \
    --num-samples 256 \
    --max-seq-len 4096 \
    2>&1 | tee path_b.log
```

## Path A: what it does

1. Reads `model.safetensors.index.json` to map every tensor to its shard.
2. Classifies every tensor:
   - **Preserve** (copied bit-for-bit): `lm_head`, `embed_tokens`, MoE router gates (`*.mlp.gate`), all norms, V4-specific attention indexer/scoring tensors, mHC residual mixing weights.
   - **Quantize**: any FP8 weight that has a corresponding `*.weight_scale_inv` companion (i.e. real GEMM weights).
3. For every quantizable weight:
   - Dequantizes FP8 E4M3 → FP32 using the source's per-block scales (auto-detects 128×128 blocks).
   - Computes NVFP4 dual scales: per-tensor `weight_scale_2 = amax / (6.0 * 448.0)` and per-16-element-block `weight_scale = block_amax / (6.0 * weight_scale_2)` cast to FP8 E4M3.
   - Quantizes FP32 → E2M1 representable values `{0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}`.
   - Packs two 4-bit values per `uint8` byte.
4. **MoE pair handling**: detects `gate_proj` (w1) + `up_proj` (w3) of each expert and computes a joint `weight_scale_2` across both, since vLLM's fused MoE kernel requires them to share that global scale.
5. Streams output to new shards (~5 GB each) with a fresh `model.safetensors.index.json` and copies all non-tensor files (config, tokenizer, etc.) verbatim.

**This is weight-only NVFP4.** Activation quantization is not done here, so you get W4A16 effective behavior at runtime unless your inference engine generates dynamic per-group activation scales.
vLLM does generate per-group activation scales dynamically at inference, so this is fine for most use cases.

## Path B: what it does

1. Loads the FP8 model via `transformers` with `device_map="auto"` and the offload folder pointing at NVMe. With 2.7 TB RAM, the FP8 weights (~865 GB) sit in RAM; activations and per-layer BF16 promotion happen on the B200s.
2. Loads a calibration set (default 256 samples of `HuggingFaceH4/ultrachat_200k`).
3. Runs `llm-compressor` `oneshot` with `pipeline="sequential"` so only one transformer block is materialized in BF16 on GPU at a time.
4. `moe_calibrate_all_experts=True` ensures every routed expert gets calibration signal even when natural routing wouldn't pick it.
5. The recipe targets `Linear` with NVFP4 and the same ignore list as Path A (lm_head, embed, router gates, norms, indexer, mHC).
6. Saves with `save_compressed=True` in `compressed-tensors` format.

**The known risks for Path B on V4 specifically:**

- V4 architecture is brand new. `llm-compressor` may not have a registered MoE wrapper for V4, so you may need to call `replace_modules_for_calibration` with the actual V4 MoE class name (the script has a TODO and a fallback path).
- Sequential pipeline may not handle CSA/HCA hybrid attention if the attention forward isn't a simple linear chain. If you see weird offload errors during calibration, the indexer/scoring tensors are likely the culprit.
- Calibration cache for 256 routed experts × all V4 layers can be hundreds of GB. Watch `nvidia-smi` and `free -h` during the first 30 minutes.

## Things to discuss with the NVIDIA engineer

1. **NVFP4 packing convention.** My converter packs as `byte = elem0 | (elem1 << 4)` (low nibble first). Verify this matches what TensorRT-LLM / CUTLASS NVFP4 kernels expect. If reversed, just flip in `pack_fp4()`.
2. **Joint scaling extension.** I implement joint `weight_scale_2` for `gate_proj`/`up_proj` pairs.
Ask whether `down_proj` also benefits, or whether all three projections of an expert in a fused MoE block should share one scale; recipes have varied.
3. **mHC residual weights.** I preserve them in FP8/BF16 conservatively. If NVIDIA has actually quantized these somewhere internally, drop them out of the ignore list to recover memory.
4. **CSA + HCA indexer/scoring tensors.** I preserve these blindly based on the V3.2 DSA precedent. Ask whether V4's compressed-sparse / heavily-compressed attention has analogous "cannot quantize" tensors and what the canonical regex is.
5. **W4A4 vs W4A16 for V4 Pro.** Path A is W4A16-equivalent; Path B is W4A4. For a 1.6T MoE with extreme long-context, ask which is internally recommended for first deployment.
6. **`modelopt` vs `llm-compressor` for V4.** Red Hat shipped V4-*Flash* NVFP4 via `llm-compressor`. Why not Pro yet? Find out if there's a known-bad layer or just compute time.

## Output sizes to expect

- FP8 source: ~865 GB
- Path A NVFP4 output: ~430–470 GB (about 2× compression vs FP8 source; experts dominate, norms/embeds add a bit back)
- Path B NVFP4 output: similar, plus activation scale metadata

## Resumability

Path A is checkpoint-resumable per shard: if it dies mid-run, re-running picks up from the next unwritten output shard. Path B is **not** resumable mid-calibration; if it crashes, you restart from the beginning.
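
For reference when reading the Path A log or checking the packing convention, the quantization arithmetic from step 3 of Path A can be sketched in pure Python. This is a minimal illustration, not the code in `fp8_to_nvfp4_streaming.py`: the function names are hypothetical (only `pack_fp4` is named in this README, and its real body may differ), the block scale is left as a Python float instead of being cast to FP8 E4M3, and ties are rounded to the even E2M1 code, which is an assumption about what the kernels expect.

```python
BLOCK = 16  # NVFP4 block size: one scale per 16 elements
# E2M1 magnitudes; the list index equals the 3-bit magnitude code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def e2m1_code(x):
    """Return the 4-bit E2M1 code for x: sign in bit 3, magnitude in bits 0-2."""
    mag = min(abs(x), 6.0)  # clamp to the largest representable magnitude
    # Nearest value; on an exact tie prefer the even code (assumed tie rule).
    idx = min(range(8), key=lambda i: (abs(E2M1_VALUES[i] - mag), i & 1))
    return (8 if x < 0 else 0) | idx


def pack_fp4(codes):
    """Pack pairs of 4-bit codes low nibble first: byte = elem0 | (elem1 << 4)."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_fp4(packed):
    """Inverse of pack_fp4."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def quantize_flat(values):
    """Quantize a flat list of floats; returns (packed, block_scales, weight_scale_2)."""
    if len(values) % BLOCK:
        raise ValueError("length must be a multiple of 16")
    amax = max(abs(v) for v in values)
    weight_scale_2 = amax / (6.0 * 448.0) if amax > 0 else 1.0
    codes, block_scales = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        block_amax = max(abs(v) for v in block)
        # The real converter would cast this scale to FP8 E4M3; omitted here.
        scale = block_amax / (6.0 * weight_scale_2)
        block_scales.append(scale)
        denom = scale * weight_scale_2
        codes.extend(e2m1_code(v / denom) if denom > 0 else 0 for v in block)
    return pack_fp4(codes), block_scales, weight_scale_2


def dequantize_flat(packed, block_scales, weight_scale_2):
    """Reconstruct floats: E2M1 value * block scale * per-tensor scale."""
    out = []
    for i, code in enumerate(unpack_fp4(packed)):
        sign = -1.0 if code & 0x8 else 1.0
        out.append(sign * E2M1_VALUES[code & 0x7] * block_scales[i // BLOCK] * weight_scale_2)
    return out
```

Flipping the nibble order for a kernel that expects high nibble first only requires swapping the two operands in `pack_fp4` and the two `append` calls in `unpack_fp4`. For a joint `weight_scale_2` across a `gate_proj`/`up_proj` pair, the same arithmetic applies with `amax` taken over both tensors.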