DeepSeek V4 Pro → NVFP4 Quantization

Full NVFP4 quantization of DeepSeek V4 Pro using NVIDIA's Model Optimizer.

Strategy

  1. Dequantize the original mixed-precision FP8 weights to pure BF16 (scripts/dequant_fp8_to_bf16.py)
  2. Quantize the BF16 weights to NVFP4 using NVIDIA's official ModelOpt PTQ pipeline (scripts/model_opt_nvfp4_full.py)

The whole model (attention, experts, and shared MLP) is quantized to NVFP4. Target output size: ~600GB.
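Step 1 can be pictured with a small pure-Python sketch of block-wise dequantization. It assumes the FP8 checkpoint stores one scale per square tile of each weight matrix (128×128 in the DeepSeek-V3 release; whether V4 uses the same layout is not confirmed here), and that dequantizing means multiplying each value by its tile's scale. The function name and the list-based representation are illustrative only; the real script presumably works on tensors.

```python
def dequant_blockwise(q, scales, block=128):
    """Multiply each element of q by the scale of the tile it falls in.

    q      : 2-D list of already-decoded FP8 values (plain Python floats)
    scales : 2-D list holding one scale per (block x block) tile
    block  : tile edge length (assumed 128 for real checkpoints)
    """
    cols = len(q[0])
    out = []
    for r, row in enumerate(q):
        scale_row = scales[r // block]  # scales for this band of rows
        out.append([row[c] * scale_row[c // block] for c in range(cols)])
    return out


# Tiny demo with 2x2 tiles instead of 128x128; the last tile row is partial.
q = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 1.0, 1.0, 1.0],
]
scales = [
    [0.5, 2.0],
    [10.0, 0.25],
]
w = dequant_blockwise(q, scales, block=2)
```

The result would then be stored as BF16, which is what the ModelOpt step consumes.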

Scripts

| File | Purpose |
|------|---------|
| scripts/dequant_fp8_to_bf16.py | Dequant FP8 source → pure BF16 (resumable, shard-level) |
| scripts/upcast_to_bf16.py | Alternative: upcast mixed-precision to BF16 |
| scripts/model_opt_nvfp4_full.py | Run ModelOpt NVFP4 full quantization (calib 128) |
| patches/quant_module_patched.py | Patch for modelopt V4 experts ModuleList bug |
| patches/patch_finegrained_fp8_blackwell.py | Blackwell FP8 kernel patch |
| check-ttl.sh | B200 node TTL watchdog |
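"Resumable, shard-level" can be read as: process one shard file at a time and skip any shard whose output already exists, so a run interrupted by a node TTL picks up where it left off. The sketch below shows one way to get that behavior. It is not the contents of scripts/dequant_fp8_to_bf16.py; `process_shards`, the `convert` callback, and the `.tmp` suffix are hypothetical names chosen for illustration.

```python
import os
import shutil
import tempfile
from pathlib import Path


def process_shards(src_dir, dst_dir, convert, pattern="*.safetensors"):
    """Convert every shard in src_dir, skipping shards already in dst_dir.

    convert(src_path, dst_path) is a caller-supplied function. Output is
    written to a temporary name and renamed afterwards, so a crash never
    leaves a half-written file that looks finished.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    done, skipped = [], []
    for shard in sorted(src_dir.glob(pattern)):
        target = dst_dir / shard.name
        if target.exists():
            skipped.append(shard.name)  # finished in an earlier run
            continue
        tmp = target.with_name(target.name + ".tmp")
        convert(shard, tmp)
        os.replace(tmp, target)  # atomic rename on the same filesystem
        done.append(shard.name)
    return done, skipped


# Demo: a file copy stands in for the real FP8 -> BF16 conversion.
with tempfile.TemporaryDirectory() as root:
    src, dst = Path(root, "fp8"), Path(root, "bf16")
    src.mkdir()
    for i in range(3):
        (src / f"shard-{i}.safetensors").write_bytes(b"x")
    first = process_shards(src, dst, shutil.copyfile)
    second = process_shards(src, dst, shutil.copyfile)  # nothing left to do
```

On the second call every shard is reported as skipped, which is the property that matters when a job is restarted on a fresh node.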

B200 Node

  • 8× B200, 2.7TB RAM, 13TB NVMe
  • See .env for access details

Key Notes

  • Calib size: 128 (256 OOMs: the BF16 model is ~3TB and the node has 2.7TB of RAM)
  • Full quant (nvfp4), not experts-only
  • Use the BF16 source: V4's mixed precision causes issues, and the FP8 source has kernel problems on Blackwell
  • --use_seq_device_map required (model doesn't fit in GPU VRAM alone)
  • --gpu_max_mem_percentage 0.7 for VRAM headroom
  • --low_memory_mode causes meta device errors with V4 — don't use
  • modelopt has no explicit V4 support — relies on auto-detection of fused experts
  • Calibration dataset nvidia/Nemotron-Post-Training-Dataset-v2 is gated — requires HF token
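The notes above can be collected into a small pre-flight helper. This is a sketch under assumptions: it only assembles the flags named in this README and assumes they are all accepted by the same entry point, which is not stated here. `build_ptq_flags` and `has_hf_token` are hypothetical helpers, and `HF_TOKEN` is the environment variable that Hugging Face tooling reads for authentication.

```python
import os

# Flag this README says breaks on V4 (meta device errors).
FORBIDDEN_FLAGS = {"--low_memory_mode"}


def build_ptq_flags(calib=128, gpu_max_mem_percentage=0.7, extra=()):
    """Return the flag list described in Key Notes (paths not included)."""
    bad = FORBIDDEN_FLAGS.intersection(extra)
    if bad:
        raise ValueError(f"do not use with V4: {sorted(bad)}")
    flags = [
        "--quant", "nvfp4",           # full NVFP4, not experts-only
        "--calib", str(calib),        # 256 ran out of host RAM
        "--use_seq_device_map",       # model does not fit in VRAM alone
        "--gpu_max_mem_percentage", str(gpu_max_mem_percentage),
    ]
    return flags + list(extra)


def has_hf_token(env=None):
    """True if a Hugging Face token is set (needed for the gated dataset)."""
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN"))


flags = build_ptq_flags()
```

The caller would prepend the actual entry point and model/output paths, and check `has_hf_token()` before launching so the run does not fail at dataset download after the model has already been loaded.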

Bugs Found (V4 + modelopt)

  1. QuantDeepseekV4Experts AttributeError — patched iter_weights_for_calibration() for ModuleList quantizers
  2. --low_memory_mode → meta device error
  3. Missing kernels package for FP8 ops
  4. Argument names: the shell script expects --calib and --quant, not --calib_size and --qformat