DeepSeek V4 Pro → NVFP4 Quantization

Full NVFP4 quantization of DeepSeek V4 Pro using NVIDIA's Model Optimizer.

Strategy

1. Dequantize the original mixed-precision FP8 weights to pure BF16 (scripts/dequant_fp8_to_bf16.py).
2. Quantize the full BF16 model to NVFP4 using NVIDIA's official ModelOpt PTQ pipeline (scripts/model_opt_nvfp4_full.py).

The whole model (attention, experts, and shared MLP) is quantized to NVFP4. Target output size: ~600 GB.
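For intuition about what the second step does to each weight, here is a minimal sketch of NVFP4-style block quantization in plain Python. It assumes the commonly described layout: FP4 E2M1 values with one scale shared by each 16-element block. It is a simplification, not ModelOpt's implementation: real NVFP4 stores the block scale in FP8 and adds a per-tensor scale, and tie-breaking in rounding is not modelled here. The names `quantize_block` and `quantize` are hypothetical.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign bit is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumption: one shared scale per 16 consecutive values


def quantize_block(block):
    """Fake-quantize one block; returns (scale, dequantized values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1]
    out = []
    for x in block:
        target = abs(x) / scale
        # Snap to the nearest representable magnitude (simplified ties).
        q = min(E2M1_GRID, key=lambda g: abs(g - target))
        out.append(math.copysign(q * scale, x))
    return scale, out


def quantize(values):
    """Apply quantize_block to consecutive BLOCK_SIZE-element blocks."""
    result = []
    for i in range(0, len(values), BLOCK_SIZE):
        _, deq = quantize_block(values[i:i + BLOCK_SIZE])
        result.extend(deq)
    return result
```

Because the scale is derived from the largest value in each block, a single outlier coarsens the grid for its 15 neighbours, which is one reason calibration data matters for PTQ quality.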

File	Purpose
`scripts/dequant_fp8_to_bf16.py`	Dequant FP8 source → pure BF16 (resumable, shard-level)
`scripts/upcast_to_bf16.py`	Alternative: upcast mixed-precision to BF16
`scripts/model_opt_nvfp4_full.py`	Run ModelOpt NVFP4 full quantization (calib 128)
`patches/quant_module_patched.py`	Patch for modelopt V4 experts ModuleList bug
`patches/patch_finegrained_fp8_blackwell.py`	Blackwell FP8 kernel patch
`check-ttl.sh`	B200 node TTL watchdog
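The "resumable, shard-level" behaviour noted for the dequant script can be sketched as follows. This is an illustration of the general pattern, not the contents of `scripts/dequant_fp8_to_bf16.py`: the function `process_shards`, the `convert` callback, and the `*.safetensors` pattern are all assumptions.

```python
import os
from pathlib import Path


def process_shards(src_dir, dst_dir, convert, pattern="*.safetensors"):
    """Convert each shard once; a rerun skips shards that already finished.

    `convert(src_path, dst_path)` is a hypothetical callback that does the
    actual per-shard work (for example, FP8 -> BF16 dequantization).
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    done, skipped = [], []
    for shard in sorted(src_dir.glob(pattern)):
        out = dst_dir / shard.name
        if out.exists():
            skipped.append(shard.name)
            continue
        # Write under a temporary name first so an interrupted run never
        # leaves a half-written file that looks complete.
        tmp = out.with_name(out.name + ".tmp")
        convert(shard, tmp)
        os.replace(tmp, out)  # atomic rename on the same filesystem
        done.append(shard.name)
    return done, skipped
```

Working one shard at a time keeps peak memory bounded by the largest shard, and the rename step means a node that expires mid-run loses at most the shard in flight.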

Calibration size: 128. A size of 256 runs out of memory on 2.8 TB of RAM with the 3 TB BF16 model.
Quantization format: full nvfp4, not experts-only.
Use the BF16 source: V4's mixed precision causes issues, and the FP8 source has kernel problems on Blackwell.
--use_seq_device_map is required because the model does not fit in GPU VRAM alone.
--gpu_max_mem_percentage 0.7 leaves VRAM headroom.
Do not use --low_memory_mode: it causes meta device errors with V4.
modelopt has no explicit V4 support and relies on auto-detection of fused experts.
The calibration dataset nvidia/Nemotron-Post-Training-Dataset-v2 is gated and requires an HF token.
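The flag and token notes above can be captured in a small pre-flight helper, sketched here under assumptions. `build_ptq_flags` is a hypothetical function, not part of ModelOpt or of this repo's scripts. It only checks the `HF_TOKEN` and `HUGGING_FACE_HUB_TOKEN` environment variables, so a token saved on disk by a CLI login would not be detected.

```python
import os


def build_ptq_flags(env=None, gpu_max_mem_percentage=0.7):
    """Return the memory-related flags from the notes, after basic checks."""
    env = os.environ if env is None else env
    # The calibration dataset is gated, so fail early instead of after
    # hours of model loading.
    if not (env.get("HF_TOKEN") or env.get("HUGGING_FACE_HUB_TOKEN")):
        raise RuntimeError(
            "No Hugging Face token found in the environment; "
            "the gated calibration dataset cannot be downloaded."
        )
    if not 0.0 < gpu_max_mem_percentage <= 1.0:
        raise ValueError("gpu_max_mem_percentage must be in (0, 1]")
    # --low_memory_mode is deliberately never emitted: it triggers
    # meta device errors with V4.
    return [
        "--use_seq_device_map",
        "--gpu_max_mem_percentage",
        str(gpu_max_mem_percentage),
    ]
```

The returned list can be appended to whatever command launches the quantization run.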

QuantDeepseekV4Experts AttributeError: fixed by patching iter_weights_for_calibration() to handle ModuleList quantizers.
--low_memory_mode: meta device error.
Missing kernels package for FP8 ops.
The shell script takes --calib, not --calib_size, and --quant, not --qformat.
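To avoid tripping over the argument-name mismatch again, the renaming can be written down once. The sketch below is a hypothetical helper (`to_shell_args` is not part of ModelOpt); it only encodes the two renames listed above and passes every other token through unchanged.

```python
# Python-script argument name -> shell-script argument name, per the note above.
SHELL_ARG_NAMES = {
    "--calib_size": "--calib",
    "--qformat": "--quant",
}


def to_shell_args(args):
    """Rename known flags; keep values and unknown flags as they are."""
    out = []
    for arg in args:
        # Split off an inline "=value" so that form is preserved as given.
        name, sep, value = arg.partition("=")
        out.append(SHELL_ARG_NAMES.get(name, name) + sep + value)
    return out
```

For example, `to_shell_args(["--qformat", "nvfp4", "--calib_size", "128"])` yields `["--quant", "nvfp4", "--calib", "128"]`.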