biondizzle/deepseek-v4-quant

biondizzle 50348989b2 Clarify: V4 is NOT BF16, dequantize first

2026-05-08 17:31:35 +00:00

patches

Add BF16 upcast script and Blackwell DeepGEMM patch

2026-05-07 14:25:30 +00:00

scripts

Remove upcast_to_bf16.py — superseded by dequant_fp8_to_bf16.py

2026-05-08 17:13:39 +00:00

.env

Purge OpenClaw session files, memory dumps, __pycache__. Update .gitignore

2026-05-08 17:09:59 +00:00

.gitignore

Purge OpenClaw session files, memory dumps, __pycache__. Update .gitignore

2026-05-08 17:09:59 +00:00

index.yaml

Purge INT4 references — expert weights are FP4 (E2M1), not INT4

2026-05-08 02:33:46 +00:00

README.md

Clarify: V4 is NOT BF16, dequantize first

2026-05-08 17:31:35 +00:00

requirements.txt

NVIDIA Model Optimizer branch: nvfp4_experts_only PTQ for DeepSeek V4 Pro

2026-05-07 00:11:31 +00:00

README.md

DeepSeek V4 Pro → NVFP4 Quantization

Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7TB RAM, 13TB NVMe).

Pipeline

Step 1: Dequantize FP8 → BF16

python3 scripts/dequant_fp8_to_bf16.py /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 /root/nvidia-meeting/DeepSeek-V4-Pro-BF16

The original V4 weights are mixed precision: FP8 attention plus FP4 (E2M1) experts, with per-tensor scales. We dequantize everything to pure BF16 so modelopt can run calibration without hitting the broken FP8 kernel paths on Blackwell (DeepGEMM is unsupported, and the Triton fine-grained FP8 matmul fails with shape mismatches).

This is not a blind upcast; it applies the actual scale factors:

W_bf16 = dequantize_fp4_weight(W_int, S)  # per-tensor scale dequant, not .to(bfloat16)

We verified that the dequantization is exact by dequantizing a single expert, running a matmul, and comparing the result against the official inference path:

W_bf16 = dequantize_fp4_weight(W_int, S)
y_ours = W_bf16 @ x.bfloat16()
y_ref = official_expert_forward(W_int, S, x)
print((y_ours - y_ref).abs().max() / y_ref.abs().mean())

Results:

Max abs diff: 0.00000000
Mean abs diff: 0.00000000
Relative error: 0.000000
Matmul max diff: 0.00000000

Exact match. BF16 rounding introduces no drift here, so it is ruled out as a source of error in the final quant.
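To make "applies the actual scale factors" concrete, here is a minimal pure-Python sketch of FP4 (E2M1) dequantization with a single per-tensor scale. It is not the repo's dequantize_fp4_weight: the helper names are hypothetical, and the packing order (low nibble first) is an assumption that may not match the V4 checkpoint layout.

```python
# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) to a float."""
    if not 0 <= code <= 15:
        raise ValueError(f"not a 4-bit code: {code}")
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]


def dequantize_packed_fp4(packed: bytes, scale: float) -> list[float]:
    """Unpack two FP4 codes per byte and multiply each by the scale.

    Assumption: the low nibble holds the first element of each pair.
    """
    out = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0x0F) * scale)
        out.append(decode_e2m1(byte >> 4) * scale)
    return out


# 0x21 packs code 1 (0.5) in the low nibble and code 2 (1.0) in the high nibble.
print(dequantize_packed_fp4(bytes([0x21]), 2.0))  # [1.0, 2.0]
```

A plain cast of the stored integer codes to BF16 would skip both the lookup table and the scale, which is why a blind upcast gives wrong weights.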

Step 2: Run ModelOpt NVFP4 Full Quantization

python3 scripts/model_opt_nvfp4_full.py

Runs NVIDIA's official ModelOpt PTQ pipeline (hf_ptq.py) with full nvfp4 quantization (attention + experts + shared MLP). Output target: ~600GB.

Config:

--quant nvfp4: full model, not experts-only
--calib 128: 128 calibration samples. The B200 node has 2.7TB RAM; the 3TB BF16 model doesn't fit in GPU VRAM (~1.4TB total), so it runs with --use_seq_device_map (CPU offload). 256 calibration samples OOM; 128 is the most that fits.
--kv_cache_quant fp8_cast
--use_seq_device_map: sequential device mapping; loads the model into CPU RAM and moves layers to the GPU for forward passes
--gpu_max_mem_percentage 0.7: leaves VRAM headroom
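The memory reasoning behind these flags can be checked with a short calculation. This is a rough sketch using only the figures quoted above (3TB model, ~1.4TB total VRAM, 0.7 fraction). It assumes --gpu_max_mem_percentage acts as a simple multiplier on total VRAM and ignores activations and calibration buffers; the function names are hypothetical.

```python
def usable_vram_tb(total_vram_tb: float, max_mem_fraction: float) -> float:
    """VRAM the device map may use after reserving headroom."""
    return total_vram_tb * max_mem_fraction


def needs_cpu_offload(model_tb: float, total_vram_tb: float, max_mem_fraction: float) -> bool:
    """True when the weights alone exceed the usable VRAM budget."""
    return model_tb > usable_vram_tb(total_vram_tb, max_mem_fraction)


budget = usable_vram_tb(1.4, 0.7)
print(f"usable VRAM: {budget:.2f} TB")
print("CPU offload needed:", needs_cpu_offload(3.0, 1.4, 0.7))
```

With roughly 1TB of usable VRAM against 3TB of BF16 weights, the model cannot be resident on the GPUs, which is consistent with running under --use_seq_device_map.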

Calibration datasets: abisee/cnn_dailymail + nvidia/Nemotron-Post-Training-Dataset-v2 (gated — requires HF token). The script exports HF_TOKEN and HUGGING_FACE_HUB_TOKEN; the token must also be set via hf auth login on the node.
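Because the gated dataset fails late if the token is missing, a pre-flight check before the long model load may save time. The sketch below only inspects the two environment variables named above; it does not verify the hf auth login state, and the helper name is hypothetical.

```python
import os

# The two variables the quantization script is described as exporting.
TOKEN_VARS = ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN")


def missing_token_vars(env=None):
    """Return the token variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in TOKEN_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_token_vars()
    if missing:
        raise SystemExit(f"missing Hugging Face token variables: {', '.join(missing)}")
    print("Hugging Face token variables are set")
```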

Runtime: Model loading takes ~53 minutes. Quantization + calibration takes several hours. Expect 6-12 hours in total.

Dependencies (pinned versions)

nvidia-modelopt: 0.45.0.dev64+g579fc6c31 (installed from git, not PyPI)
transformers: 5.8.0.dev0 (from git, required for DeepSeekV4 support)
kernels: latest (pip install -U kernels — needed for finegrained FP8 ops)
Python: 3.10

The quant_module_patched.py fix is for modelopt 0.45.0.dev64 specifically. Later versions may include the fix natively — check before applying. Using a different modelopt version may cause patches to fail or V4 quantization to break.
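Since the patch targets one specific modelopt build, it may be worth checking the installed versions before applying it. The following is a minimal sketch using the standard library's importlib.metadata; the pin table copies the versions listed above, and check_pins is a hypothetical helper, not part of this repo.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the dependency list above.
PINNED = {
    "nvidia-modelopt": "0.45.0.dev64+g579fc6c31",
    "transformers": "5.8.0.dev0",
}


def check_pins(pins, get_version=version):
    """Return {package: problem} for every package that is missing or off-pin."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if installed != expected:
            problems[name] = f"installed {installed}, expected {expected}"
    return problems


if __name__ == "__main__":
    for name, problem in check_pins(PINNED).items():
        print(f"{name}: {problem}")
```

An empty result means both packages match the pins exactly; any mismatch is a cue to check whether the fix is already upstream before patching.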

Key Notes

V4 is NOT BF16: it ships as mixed-precision FP8/FP4. You MUST dequantize to BF16 first (Step 1). The raw FP8 source hits kernel problems on Blackwell, and the mixed-precision source causes modelopt errors.
--low_memory_mode causes meta device errors with V4; do not use it.
modelopt has no explicit V4 support; it relies on auto-detection of fused experts.
The quant_module_patched.py patch fixes iter_weights_for_calibration() for V4's nn.ModuleList expert quantizers. It is already applied in the venv.

Bugs Found (V4 + modelopt)

QuantDeepseekV4Experts AttributeError — patched iter_weights_for_calibration() for ModuleList quantizers
--low_memory_mode → meta device error
Missing kernels package for FP8 ops
Wrong argument names: the shell script takes --calib (not --calib_size) and --quant (not --qformat)
