DeepSeek V4 Pro → NVFP4 Quantization
Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7TB RAM, 13TB NVMe). Result: 881GB NVFP4 (Run 11).
Cost: ~$161/run at $23/hr (7 hours each). Don't waste runs.
✅ Final Result (Run 11)
- Output: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4 (881GB, 95 safetensors)
- Config: nvfp4 full quantization, 128 calib samples, kv_cache_qformat=fp8_cast
- Total runtime: ~7,783s (~2h10m end-to-end)
- Peak GPU mem: ~163GB per B200
- Amax snapshots: 47,696 quantizers, 15.4MB
- Calibrated state: 721.4GB (insurance, can re-export with --export-only)
- A few experts (11, 83, 100, 112, 254) had uncalibrated amax; the weight-derived fallback was used (normal for sparse MoE with 256 experts)
⚠️ Model Config Patches (post-export)
modelopt 0.45.0.dev64's export produces configs that don't match what vllm expects at runtime. NVIDIA's own published NVFP4 exports have the same gaps — we compared against nvidia/DeepSeek-V3.2-NVFP4 and nvidia/MiniMax-M2.7-NVFP4 on HuggingFace. Neither includes compress_ratios or scale_fmt either. This is a modelopt ↔ vllm integration gap, not a problem with our quantization.
All patches below are to DeepSeek-V4-Pro-NVFP4/config.json unless noted.
| # | Field | modelopt export (original) | vllm requires | Patch applied | Why modelopt doesn't export it |
|---|---|---|---|---|---|
| 1 | compress_ratios | Missing (transformers 5.8.0 renamed to compress_rates dict) | List of ints indexed by layer_id | Copied from BF16 source model's compress_ratios (62 items) | modelopt doesn't add fields the source config lacks; transformers 5.8.0 renamed the field |
| 2 | quantization_config.scale_fmt | Missing | "ue8m0" string | Added | modelopt doesn't include vllm-specific runtime fields |
| 3 | rope_parameters | Nested dict {'main': {...}, 'compress': {...}} (transformers 5.8.0 format) | Flat dict {'rope_theta': ..., 'rope_type': ..., ...} | Flattened to main sub-dict | transformers 5.8.0 changed rope_parameters from flat → nested per-component |
| 4 | rope_scaling | Nested dict {'main': {...}, 'compress': {...}} (same as above) | Flat dict | Flattened to main sub-dict | Same transformers 5.8.0 schema change |
NVIDIA's own NVFP4 exports also lack the fields added by patches 1 and 2. We checked:
- nvidia/DeepSeek-V3.2-NVFP4: no compress_ratios, no scale_fmt, no quantization_config in config.json at all (V3.2 doesn't use MLA compression so it sidesteps the issue)
- nvidia/MiniMax-M2.7-NVFP4: has quantization_config in config.json (same schema as ours) but no scale_fmt
The compress_ratios → compress_rates rename and the rope_parameters nesting are transformers 5.8.0 schema changes that modelopt doesn't account for. scale_fmt is a vllm runtime field that modelopt has never exported.
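The four patches in the table can be sketched as one small function over the two config dicts. This is a minimal illustration, not the script actually used for Run 11: the function names and the backup file name `config.json.orig` are hypothetical, and it assumes the BF16 source config still carries `compress_ratios` as a list.

```python
import copy
import json
from pathlib import Path


def patch_config(nvfp4_cfg: dict, bf16_cfg: dict) -> dict:
    """Return a patched copy of the exported config; inputs are left untouched."""
    cfg = copy.deepcopy(nvfp4_cfg)

    # Patch 1: restore the per-layer list from the BF16 source config.
    if "compress_ratios" not in cfg:
        cfg["compress_ratios"] = list(bf16_cfg["compress_ratios"])

    # Patch 2: add the scale format string under quantization_config.
    cfg.setdefault("quantization_config", {}).setdefault("scale_fmt", "ue8m0")

    # Patches 3 and 4: flatten {'main': {...}, 'compress': {...}} to 'main'.
    for key in ("rope_parameters", "rope_scaling"):
        value = cfg.get(key)
        if isinstance(value, dict) and isinstance(value.get("main"), dict):
            cfg[key] = dict(value["main"])
    return cfg


def patch_config_file(nvfp4_dir: str, bf16_dir: str) -> None:
    """Patch config.json in place, keeping a backup of the original export."""
    path = Path(nvfp4_dir) / "config.json"
    exported = json.loads(path.read_text())
    source = json.loads((Path(bf16_dir) / "config.json").read_text())
    # Hypothetical backup name; keeps the unpatched export recoverable.
    path.with_name("config.json.orig").write_text(json.dumps(exported, indent=2))
    path.write_text(json.dumps(patch_config(exported, source), indent=2))
```

Keeping the pure dict transform separate from the file I/O makes it easy to dry-run the patch against a copy of the config before touching the 881GB output directory.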
Architecture
We call modelopt's hf_ptq.main() directly — the same entry point the shell script uses. We don't rewrite the pipeline. We just:
- Patch modelopt at runtime (GPU tensor safety, before anything runs)
- Hook export_quantized to snapshot amax + save state before export
- Call hf_main(args) with properly parsed args
This avoids the cascade of missing-arg bugs from manually constructing argparse.Namespace (Runs 4–8).
Pipeline
Step 1: Dequantize FP8 → BF16
python3 scripts/dequant_fp8_to_bf16.py /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 /root/nvidia-meeting/DeepSeek-V4-Pro-BF16
The original V4 weights use mixed precision (FP8 attention + FP4/E2M1 experts with per-tensor scales). We dequantize everything to pure BF16 so modelopt can run calibration without hitting broken FP8 kernel paths on Blackwell (DeepGEMM unsupported, Triton finegrained FP8 matmul shape mismatches).
This is not a blind upcast — it applies the actual scale factors:
W_bf16 = dequantize_fp4_weight(W_int, S) # per-tensor scale dequant, not .to(bfloat16)
Byte-exact verified — matmul diff is 0.000000 against the official inference path.
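To see why a plain cast is wrong, here is a minimal pure-Python sketch of E2M1 decoding with a per-tensor scale. It is not the code in dequant_fp8_to_bf16.py: the function names are hypothetical, and the packing order (two codes per byte, low nibble first) is an assumption that may differ from the checkpoint layout. The magnitude table itself follows from the E2M1 format (1 sign bit, 2 exponent bits, 1 mantissa bit).

```python
# Magnitudes for the 3 low bits of an E2M1 code (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 index the magnitude."""
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def dequantize_packed_fp4(packed: bytes, scale: float) -> list[float]:
    """Unpack two codes per byte and apply the per-tensor scale.

    ASSUMPTION: the low nibble holds the first element of each pair.
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F) * scale)
        values.append(decode_e2m1(byte >> 4) * scale)
    return values
```

Casting the raw storage to bfloat16 would yield the integer codes themselves rather than the table values times the scale, which is the "blind upcast" this step avoids.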
Step 2: Run NVFP4 Quantization
cd /root/nvidia-meeting/modelopt-repo/examples/llm_ptq
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py
Must run from the modelopt example directory (relative imports).
What happens inside:
- Apply patches: 8 runtime monkey-patches for GPU tensor safety (3 calibration-time, 5 export-time; see below)
- Parse args: uses hf_ptq.parse_args() with our config via sys.argv replacement, then applies the same post-parse conversions (dataset split, calib_size int list) that hf_ptq.__main__ normally does
- Hook export: monkey-patch export_quantized to snapshot amax + save state before export
- Call hf_main(args): the exact same pipeline the shell script uses
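The parse step can be sketched as follows. Since hf_ptq is not importable outside the modelopt example directory, this uses a stand-in parser: `build_parser` and `parse_with_argv` are hypothetical names, the defaults are assumptions, and only flags named in this README are modeled. The point is the pattern: swap sys.argv, parse, restore, then apply the conversions that a `__main__` block would have done.

```python
import argparse
import sys


def build_parser() -> argparse.ArgumentParser:
    """Stand-in for hf_ptq's parser; defaults here are assumptions."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--qformat", default="nvfp4")
    parser.add_argument("--calib_size", default="512")  # arrives as a string
    parser.add_argument("--dataset", default=None)  # comma-separated string
    parser.add_argument("--kv_cache_qformat", default="fp8_cast")
    parser.add_argument("--inference_tensor_parallel", type=int, default=1)
    return parser


def parse_with_argv(argv: list[str]) -> argparse.Namespace:
    """Parse through sys.argv, as a zero-argument parse_args() call would."""
    saved = sys.argv
    sys.argv = ["hf_ptq.py", *argv]
    try:
        args = build_parser().parse_args()
    finally:
        sys.argv = saved  # always restore, even if parsing exits

    # Post-parse conversions that a __main__ block would normally perform.
    args.dataset = args.dataset.split(",") if args.dataset else None
    args.calib_size = [int(n) for n in str(args.calib_size).split(",")]
    return args
```

Going through the real parser means every attribute the pipeline reads exists with its default, which is what hand-built Namespace objects kept missing.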
If the export crashes:
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --export-only
To validate saved state without running anything:
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --validate-only
Config: nvfp4, 128 calib samples, calib_seq=512, kv_cache_qformat=fp8_cast, gpu_max_mem_percentage=0.7, use_seq_device_map, inference_tensor_parallel=8
Calibration datasets: abisee/cnn_dailymail + nvidia/Nemotron-Post-Training-Dataset-v2 (default when no --dataset specified).
Runtime: Model loading ~50 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.
Run History
| Run | Date | Commit | Calib | Result | Root Cause | Fix |
|---|---|---|---|---|---|---|
| 1 | May 7 | shell wrapper | 256 | ❌ Batch probing crash | o_b_proj shape mismatch: finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
| 2 | May 8-9 | shell wrapper | 128 | ❌ Export crash (calib ✅) | get_activation_scaling_factor reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU after calibration |
| 3 | May 9 06:10 | 3907838 | 128 | ❌ Model loading OOM | AutoModelForCausalLM.from_pretrained OOM during expert weight torch.cat | Use modelopt get_model() with max_memory |
| 4 | May 9 ~07:00 | 86dd8df | 128 | ❌ Import error | mtq.KV_QUANT_CFG_CHOICES doesn't exist; it's hf_ptq.KV_QUANT_CFG_CHOICES | Import from hf_ptq, not mtq |
| 5 | May 9 ~08:05 | f9bbef8 | 128 | ❌ Same as Run 4 | Fix wasn't synced properly | Properly synced |
| 6 | May 9 ~09:25 | 6c1bff6 | 128 | ❌ Dataloader crash | make_calib_dataloader AttributeError: missing args | Added args to Namespace |
| 7 | May 9 ~13:40 | 25b4d8d | 128 | ❌ Dataloader crash | dataset=None, len() on None | Provided dataset list |
| 8 | May 9 ~14:00 | b2849a8 | 128 | ❌ Argparse crash | Wrong flag names (shell script names vs hf_ptq.py names) | Use hf_ptq.py flag names |
| 9 | May 9 ~14:30 | a300302 | 128 | ❌ TypeError | Skipped __main__ post-parse conversions (calib_size still string, not int list) | Apply same conversions after parse_args() |
| 10 | May 9 ~15:30 | 5a72da7 | 128 | ❌ Export crash (calib ✅) | get_weight_scaling_factor reads stale GPU weight → cudaErrorIllegalAddress | Patch _export_quantized_weight to force weight to CPU at entry point |
| 11 | May 9 ~22:50 | 07cd50e | 128 | ✅ SUCCESS | n/a | 8 patches covering full export chain |
Key Lessons
Run 2 — Stale GPU tensors: use_seq_device_map shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled — reading it triggers cudaErrorIllegalAddress. Fix: copy amax to CPU immediately after calibration.
Run 3 — Expert weight OOM: AutoModelForCausalLM.from_pretrained does torch.cat on GPU for expert gate_up_proj (31.5GB alloc, 25.9GB free). Fix: use modelopt's get_model() which sets max_memory per GPU before loading. (Note: Run 10 uses hf_main() which calls get_model() internally.)
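The idea of capping per-GPU memory before loading can be sketched like this. It is an illustration of the general technique, not modelopt's get_model() internals, which this README does not show: `build_max_memory` is a hypothetical helper, and the 0.7 default mirrors the gpu_max_mem_percentage=0.7 setting listed in the config.

```python
def build_max_memory(total_bytes_per_gpu: list[int], fraction: float = 0.7) -> dict[int, int]:
    """Map each GPU index to a byte budget that leaves headroom for temporaries.

    The unreserved share (1 - fraction) is what remains available for
    transient allocations such as concatenating expert weights.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return {
        index: int(total * fraction)
        for index, total in enumerate(total_bytes_per_gpu)
    }
```

A dict of this shape (GPU index to byte budget) is the form that Hugging Face loaders accept through their max_memory argument, so the weights placed on each device stop short of the limit and a 31.5GB temporary no longer has to fit into 25.9GB of leftover space.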
Runs 4–8 — Pipeline rewriting trap: Trying to reconstruct hf_ptq's pipeline by importing individual functions and building a fake argparse.Namespace causes an endless stream of missing-attribute and type errors. Each fix reveals the next bug. Fix: call hf_main(args) directly with a properly parsed args object.
Run 9 — __main__ gap: hf_ptq.py does critical type conversions in its __main__ block (string → list for dataset, string → int list for calib_size). When calling main() directly, these are skipped. Fix: apply the same conversions after parse_args().
Run 10 — Stale GPU weight tensors in export: The amax patches (Patch 1-3) only cover quantizer state. The model weights themselves are also on stale GPU. get_weight_scaling_factor does weight_scaling_factor_2.to(weight.device) which triggers cudaErrorIllegalAddress because weight is on stale GPU. Fix: patch _export_quantized_weight (the entry point for each module's export) to force weight to CPU before any downstream code reads it. This covers the entire chain: get_weight_scaling_factor, get_weights_scaling_factor_from_quantizer, to_quantized_weight, weight.to(dtype) — all resolve to CPU because weight.device is CPU.
Do NOT Repeat These Mistakes
- Don't use FP8 source model: kernel issues on Blackwell (Run 1)
- Don't use --low_memory_mode with V4: meta device errors
- Don't use calib_size=256: OOMs with 3TB BF16 on CPU offload
- Don't use AutoModelForCausalLM.from_pretrained directly: OOM during expert weight concat (Run 3)
- Don't assume GPU tensor integrity after 5+ hours of sequential calibration (Run 2, Run 10)
- Don't rewrite the hf_ptq pipeline; call hf_main() directly (Runs 4–8)
- Don't skip the __main__ post-parse conversions: calib_size must be int list, dataset must be list (Run 9)
- Don't use shell script arg names (--quant, --calib, --kv_cache_quant, --tp); use hf_ptq.py names (--qformat, --calib_size, --kv_cache_qformat, --inference_tensor_parallel)
- Don't patch individual export functions one at a time; patch the entry point (_export_quantized_weight) so weight is on CPU for the entire chain (Run 10)
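A cheap guard against the flag-name mistake is to reject shell-script names before a 7-hour run starts. The sketch below is hypothetical (neither the mapping constant nor the function exists in quantize_nvfp4.py as far as this README shows), and it assumes the two flag lists above pair up in the order they are written.

```python
# ASSUMPTION: shell-script flags and hf_ptq.py flags correspond in listed order.
SHELL_TO_HF_PTQ = {
    "--quant": "--qformat",
    "--calib": "--calib_size",
    "--kv_cache_quant": "--kv_cache_qformat",
    "--tp": "--inference_tensor_parallel",
}


def reject_shell_flags(argv: list[str]) -> None:
    """Raise early if argv contains a shell-script flag name."""
    for token in argv:
        name = token.split("=", 1)[0]  # handles both "--tp 8" and "--tp=8"
        if name in SHELL_TO_HF_PTQ:
            raise ValueError(
                f"{name} is a shell-script flag; use {SHELL_TO_HF_PTQ[name]} instead"
            )
```

Failing with a hint is deliberate here: silently translating the names would hide the fact that two naming schemes are in play.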
Runtime Patches Applied by quantize_nvfp4.py
These are monkey-patches applied at runtime — no modelopt source files are modified.
Calibration-time patches (applied before pipeline runs)
1. TensorQuantizer.load_calib_amax: After calibration writes _amax to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
2. TensorQuantizer.export_amax: If _amax is still on GPU at export time, moves to CPU before reading. Safety net.
3. NVFP4QTensor.get_activation_scaling_factor: Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
Export-time patches (force stale GPU tensors to CPU at entry points)
4. _export_quantized_weight (KEY PATCH): Forces weight + all quantizer state to CPU before any downstream code reads them. This is the entry point for exporting each linear layer. By forcing weight to CPU here, every downstream .to(weight.device) resolves to CPU, covering the entire chain: get_weight_scaling_factor, get_weights_scaling_factor_from_quantizer, to_quantized_weight, weight.to(dtype).
5. _export_fused_experts: Same treatment for MoE expert weights (DeepseekV4Experts go through this path). Forces expert weights, buffers, and quantizer state to CPU.
6. to_quantized_weight: Forces weight and scaling factors to CPU. Redundant if Patch 4 works, but catches any code path that reaches this function without going through _export_quantized_weight.
7. get_weight_scaling_factor: Forces weight + quantizer to CPU. Redundant if Patch 4 works.
8. get_weight_scaling_factor_2: Forces quantizer state to CPU. Redundant if Patch 4 works.
Patches 6-8 are belt-and-suspenders. Patch 4 is the one that matters — it moves weight to CPU at the earliest possible point in the export chain, making all downstream stale GPU reads impossible.
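The entry-point idea behind Patch 4 can be sketched as a wrapper that moves the module to CPU before the original function runs. This is an illustration of the pattern only: `force_first_arg_to_cpu` is a hypothetical name, and it assumes the first positional argument is a module exposing `.to(device)` the way torch.nn.Module does. The real signature of _export_quantized_weight is not shown in this README.

```python
import functools


def force_first_arg_to_cpu(export_fn):
    """Wrap an export entry point so its module argument is on CPU first.

    ASSUMPTION: the first positional argument has a .to(device) method that
    moves its parameters and buffers in place, as torch.nn.Module.to does.
    """

    @functools.wraps(export_fn)
    def wrapper(module, *args, **kwargs):
        module.to("cpu")
        # Everything downstream that reads weight.device now sees CPU.
        return export_fn(module, *args, **kwargs)

    return wrapper


# Hypothetical usage, given some module object `export_mod` that owns the function:
# export_mod._export_quantized_weight = force_first_arg_to_cpu(
#     export_mod._export_quantized_weight
# )
```

Because downstream helpers derive their target device from the weight, one wrapper at the top covers callers that were never patched individually.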
Post-Calibration Hook
export_quantized is monkey-patched to run these steps before the real export:
1. snapshot_amax_to_cpu(): Walks all quantizers, copies _amax to CPU, saves to disk (~50MB). Insurance policy.
2. force_all_amax_to_cpu(): Moves _pre_quant_scale, _global_amax to CPU too. Nuclear option.
3. save_calibrated_state(): Saves full model state dict to disk (~1.5TB). Enables --export-only recovery if export crashes.
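A minimal sketch of the hook, under stated assumptions: the function bodies below are illustrative rather than the ones in quantize_nvfp4.py, `with_pre_export_steps` is a hypothetical helper name, and the model is assumed to expose `named_modules()` with tensor-like `_amax` attributes as in PyTorch. Each pre-export step receives the same arguments as the export call, so the sketch does not depend on export_quantized's actual signature.

```python
import functools


def snapshot_amax_to_cpu(model) -> dict:
    """Collect a CPU copy of every quantizer's _amax, keyed by module name.

    ASSUMPTION: model.named_modules() yields (name, module) pairs and _amax
    supports detach()/cpu()/clone(), as PyTorch tensors do.
    """
    snapshot = {}
    for name, module in model.named_modules():
        amax = getattr(module, "_amax", None)
        if amax is not None:
            snapshot[name] = amax.detach().cpu().clone()
    return snapshot


def with_pre_export_steps(export_fn, *steps):
    """Return export_fn wrapped so every step runs, in order, before it."""

    @functools.wraps(export_fn)
    def wrapper(*args, **kwargs):
        for step in steps:
            step(*args, **kwargs)  # same arguments as the export call
        return export_fn(*args, **kwargs)

    return wrapper
```

Running the snapshot first matters: it is the smallest artifact and the one most worth having if a later step or the export itself fails.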
Bugs Found (V4 + modelopt 0.45.0.dev64)
- QuantDeepseekV4Experts AttributeError: already fixed in modelopt 0.45.0.dev64 (handles nn.ModuleList quantizers natively).
- --low_memory_mode → meta device error. Don't use with V4.
- Missing kernels package for FP8 ops. pip install -U kernels.
- Shell script arg names: resolved by calling hf_main() directly.
- Export crash: stale GPU tensors in export_amax(). After hours of calibration, quantizer _amax on GPU becomes unreadable. Fixed by patching export_amax to move _amax to CPU before reading.
- Export crash: assert torch.all(activation_scaling_factor > 0). Amax values from stale GPU reads are garbage (zeros, negatives, NaN). Fixed by clamping instead of asserting, plus snapshotting valid amax to CPU before corruption can occur.
- Model loading OOM during expert weight conversion. AutoModelForCausalLM.from_pretrained does torch.cat on GPU for expert gate_up_proj (31.5GB alloc), but only 25.9GB free with device_map="sequential". Fixed by using modelopt's get_model() which sets max_memory per GPU before loading.
- Export crash: stale GPU weight tensors in get_weight_scaling_factor. Patches 1-3 only covered quantizer amax. The model weights themselves are also on stale GPU. weight_scaling_factor_2.to(weight.device) triggers cudaErrorIllegalAddress. Fixed by patching _export_quantized_weight to force weight to CPU at the entry point, covering the entire export chain.
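The "clamp instead of assert" fix can be illustrated on plain Python floats. This is a sketch of the idea only, not the patched NVFP4QTensor code: the function name is hypothetical, it works on a list rather than a tensor, and the floor of 1e-8 is an assumed value that this README does not specify.

```python
import math


def clamp_scaling_factors(values: list[float], floor: float = 1e-8):
    """Replace zero, negative, NaN, or infinite entries with a positive floor.

    ASSUMPTION: floor=1e-8 is illustrative only.
    Returns (cleaned_values, number_replaced) so corruption stays visible.
    """
    cleaned, replaced = [], 0
    for value in values:
        if math.isfinite(value) and value > 0:
            cleaned.append(value)
        else:
            cleaned.append(floor)
            replaced += 1
    return cleaned, replaced
```

Returning the replacement count is a design choice for the sketch: clamping keeps the export alive, but a large count signals that the amax values were corrupted and the CPU snapshot should be trusted instead.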
Dependencies (pinned versions)
- nvidia-modelopt: 0.45.0.dev64+g579fc6c31 (installed from git, not PyPI)
- transformers: 5.8.0.dev0 (from git, required for DeepSeekV4 support)
- kernels: latest (pip install -U kernels; needed for finegrained FP8 ops)
- Python: 3.10
The patches in quantize_nvfp4.py are for modelopt 0.45.0.dev64 specifically. Later versions may include fixes natively — check before applying.
Key Notes
- V4 is NOT BF16: it ships as mixed-precision FP8/FP4. You MUST dequantize to BF16 first (Step 1).
- --low_memory_mode causes meta device errors with V4; don't use it.
- modelopt has no explicit V4 support; it relies on auto-detection of fused experts.
- The calibration state save (v4_nvfp4_calibrated_state.pt) is ~1.5TB. It lives on NVMe, not in git.
- The amax snapshot (v4_nvfp4_amax_snapshots.pt) is ~50MB. Small, critical, cheap insurance.
- The script calls hf_main(args), the exact same entry point as the shell script. No pipeline divergence.
- Must run from /root/nvidia-meeting/modelopt-repo/examples/llm_ptq (relative imports).
File Layout
scripts/
dequant_fp8_to_bf16.py — Step 1: FP8/FP4 → BF16 dequantization
quantize_nvfp4.py — Step 2: NVFP4 quantization (patches + hf_main)
patches/
patch_finegrained_fp8_blackwell.py — (legacy) FP8 kernel patches for Blackwell
quant_module_patched.py — (legacy) quant module patches
The patches/ directory contains earlier approaches that modified modelopt source files directly. The current approach (quantize_nvfp4.py) uses runtime monkey-patching instead — no source files are modified.