biondizzle/deepseek-v4-quant

biondizzle d7593fc1dd Update README: run history table, bug #1 already fixed, cost note, don't-repeat mistakes

2026-05-09 06:44:17 +00:00

patches

Add BF16 upcast script and Blackwell DeepGEMM patch

2026-05-07 14:25:30 +00:00

scripts

Defensive quantization: snapshot amax to CPU immediately after calibration

2026-05-09 06:31:08 +00:00

.env

Purge OpenClaw session files, memory dumps, __pycache__. Update .gitignore

2026-05-08 17:09:59 +00:00

.gitignore

Replace shell wrapper with in-process quantize script

2026-05-09 06:07:22 +00:00

index.yaml

Purge INT4 references — expert weights are FP4 (E2M1), not INT4

2026-05-08 02:33:46 +00:00

README.md

Update README: run history table, bug #1 already fixed, cost note, don't-repeat mistakes

2026-05-09 06:44:17 +00:00

requirements.txt

NVIDIA Model Optimizer branch: nvfp4_experts_only PTQ for DeepSeek V4 Pro

2026-05-07 00:11:31 +00:00

README.md

DeepSeek V4 Pro → NVFP4 Quantization

Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7TB RAM, 13TB NVMe). Target: ~600GB.

Cost: ~$161/run at $23/hr for a 7-hour run (an 8-hour run is ~$184). Don't waste runs.

Pipeline

Step 1: Dequantize FP8 → BF16

python3 scripts/dequant_fp8_to_bf16.py /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 /root/nvidia-meeting/DeepSeek-V4-Pro-BF16

The original V4 weights use mixed precision (FP8 attention + FP4/E2M1 experts with per-tensor scales). We dequantize everything to pure BF16 so modelopt can run calibration without hitting broken FP8 kernel paths on Blackwell (DeepGEMM unsupported, Triton finegrained FP8 matmul shape mismatches).

This is not a blind upcast — it applies the actual scale factors:

W_bf16 = dequantize_fp4_weight(W_fp4, S)  # per-tensor scale dequant, not .to(bfloat16)

Byte-exact verified — matmul diff is 0.000000 against the official inference path.
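To make the difference between a blind cast and a scale-aware dequantization concrete, here is a minimal pure-Python sketch of decoding E2M1 values and applying a per-tensor scale. It is not the repository's `dequantize_fp4_weight`. The function names, the two-values-per-byte packing, and the low-nibble-first order are assumptions made for illustration.

```python
# E2M1 layout: 1 sign bit, 2 exponent bits, 1 mantissa bit.
# The eight representable magnitudes, indexed by the low three bits.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 code into a Python float."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def dequantize_packed_fp4(packed: bytes, scale: float) -> list[float]:
    """Unpack two FP4 codes per byte and multiply each by the per-tensor scale.

    Assumption: the low nibble holds the first element of each pair.
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F) * scale)
        values.append(decode_e2m1(byte >> 4) * scale)
    return values
```

With a scale of 0.5, the bytes `0x21 0xF7` decode to 0.25, 0.5, 3.0 and -3.0. A plain cast of the packed bytes would treat them as integers and ignore both the encoding and the scale.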

Step 2: Run NVFP4 Quantization

cd /root/nvidia-meeting/modelopt-repo/examples/llm_ptq
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py

Run this from the modelopt example directory; the script depends on relative imports that only resolve there.

Pipeline steps:

1. Load BF16 model with sequential device map (3TB model, CPU offload)
2. Patch modelopt at runtime (GPU tensor safety, graceful degradation)
3. Quantize + Calibrate (5-6 hours, 128 samples)
4. Snapshot amax to CPU: copies all quantizer state to CPU and saves it to disk (~50MB)
5. Save model state: full state dict to disk (insurance against export crashes)
6. Export to HF safetensors

If the export crashes:

python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --export-only

To validate saved state without running anything:

python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --validate-only
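The two recovery flags above can be pictured as choosing which pipeline stages run. The sketch below shows one way such a stage selector might look. The stage names, the helper names, and the choice to make the flags mutually exclusive are all assumptions, and the real `quantize_nvfp4.py` may be organized differently.

```python
import argparse

# Hypothetical stage names mirroring the six pipeline steps.
FULL_PIPELINE = [
    "load_model",
    "patch_modelopt",
    "calibrate",
    "snapshot_amax",
    "save_state",
    "export",
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Stage selection sketch")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--export-only", action="store_true",
                      help="skip calibration and export from saved state")
    mode.add_argument("--validate-only", action="store_true",
                      help="inspect saved state without loading the model")
    return parser


def plan_stages(args: argparse.Namespace) -> list[str]:
    """Return the ordered list of stages for the selected mode."""
    if args.validate_only:
        return ["validate_saved_state"]
    if args.export_only:
        return ["load_saved_state", "export"]
    return list(FULL_PIPELINE)
```

Keeping stage selection separate from the stage bodies means a crashed export can be retried without paying for another calibration pass.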

Config: nvfp4, 128 calib samples, calib_seq=512, kv_fp8_cast, gpu_mem_pct=0.7

Calibration datasets: abisee/cnn_dailymail + nvidia/Nemotron-Post-Training-Dataset-v2 (gated — requires HF token).

Runtime: Model loading ~53 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.
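As a quick sanity check on the runtime and cost figures, the stage estimates can be summed and priced. This is only arithmetic on the numbers quoted in this README; the dictionary layout and function names are illustrative.

```python
HOURLY_RATE_USD = 23.0

# (low, high) estimates in minutes, taken from the runtime line above.
STAGE_MINUTES = {
    "load": (53, 53),
    "calibrate": (330, 330),   # ~5.5 hours
    "export": (30, 60),
}


def total_hours(stages: dict) -> tuple[float, float]:
    """Sum the low and high estimates and convert minutes to hours."""
    low = sum(lo for lo, _ in stages.values()) / 60
    high = sum(hi for _, hi in stages.values()) / 60
    return low, high


def run_cost(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Cost of a run of the given length at the hourly node rate."""
    return hours * rate
```

The stage estimates sum to roughly 6.9 to 7.4 hours, or about $158 to $170 at $23/hr. A flat 7-hour run comes to $161.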

Run History (forward progression)

| Run | Date | Script | Calib | Result | Root Cause | Fix |
|---|---|---|---|---|---|---|
| 1 | May 7 | shell wrapper, FP8 source | 256 | ❌ Crashed at batch probing | `o_b_proj` shape mismatch: finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
| 2 | May 8-9 | shell wrapper, BF16 source | 128 | ❌ Crashed at export (128/128 calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU immediately after calibration |
| 3 | May 9 | `quantize_nvfp4.py` v1 | 128 | 🔄 Running | - | - |

Key lesson from Run 2: The use_seq_device_map mode shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled — reading it triggers cudaErrorIllegalAddress. The fix is to copy amax to CPU immediately after calibration, before any further GPU operations.
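The snapshot step described above might look roughly like the following. This is a sketch, not the code in `quantize_nvfp4.py`: the function name is hypothetical, and it is written against duck-typed objects so it makes no assumption beyond each quantizer exposing an `_amax` attribute with the standard tensor methods `detach()`, `cpu()` and `clone()`.

```python
from typing import Any, Dict, Iterable, Tuple


def snapshot_amax(named_modules: Iterable[Tuple[str, Any]]) -> Dict[str, Any]:
    """Copy every available `_amax` to host memory, keyed by module name."""
    snapshot: Dict[str, Any] = {}
    for name, module in named_modules:
        amax = getattr(module, "_amax", None)
        if amax is None:
            continue  # not a quantizer, or no calibration statistics recorded
        # detach() drops autograd history, cpu() copies to host memory, and
        # clone() ensures the snapshot owns its storage even if the tensor
        # was already on the CPU.
        snapshot[name] = amax.detach().cpu().clone()
    return snapshot
```

With a real model this would be called as `snapshot_amax(model.named_modules())` immediately after calibration, and the resulting dictionary written to disk with `torch.save` before any further GPU work.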

Do NOT repeat these mistakes:

Don't use FP8 source model — kernel issues on Blackwell (Run 1)
Don't use --low_memory_mode with V4 — meta device errors
Don't use calib_size=256 — OOMs with 3TB BF16 on CPU offload
Don't assume GPU tensor integrity after 5+ hours of sequential calibration

Bugs Found (V4 + modelopt 0.45.0.dev64)

1. ~~QuantDeepseekV4Experts AttributeError~~: already fixed in modelopt 0.45.0.dev64 (handles nn.ModuleList quantizers natively).
2. --low_memory_mode → meta device error. Don't use with V4.
3. Missing kernels package for FP8 ops. pip install -U kernels.
4. ~~Shell script arg names~~: no longer relevant (in-process script).
5. Export crash: stale GPU tensors in export_amax(). After hours of calibration, quantizer _amax on GPU becomes unreadable. Fixed by patching export_amax to move _amax to CPU before reading.
6. Export crash: assert torch.all(activation_scaling_factor > 0). Amax values from stale GPU reads are garbage (zeros, negatives, NaN). Fixed by clamping instead of asserting, plus snapshotting valid amax to CPU before corruption can occur.
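The clamp-instead-of-assert idea from the last bug can be illustrated on plain Python floats. This is a sketch of the logic only, not the actual patch: the function name and the `floor` value of 1e-8 are assumptions, and the real code operates on tensors.

```python
import math


def sanitize_scaling_factors(values, floor=1e-8):
    """Clamp scaling factors to a small positive floor instead of asserting.

    NaN, zero and negative entries become `floor`; anything already at or
    above the floor is returned unchanged. The second return value counts
    how many entries were altered so the caller can log it.
    """
    cleaned = []
    replaced = 0
    for value in values:
        if math.isnan(value) or value < floor:
            cleaned.append(floor)
            replaced += 1
        else:
            cleaned.append(value)
    return cleaned, replaced
```

Clamping keeps the export alive but hides corrupted statistics, so the replacement count is worth logging. A nonzero count means the CPU snapshot, not the clamped values, should be treated as the source of truth.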

Dependencies (pinned versions)

nvidia-modelopt: 0.45.0.dev64+g579fc6c31 (installed from git, not PyPI)
transformers: 5.8.0.dev0 (from git, required for DeepSeekV4 support)
kernels: latest, not pinned (pip install -U kernels; needed for finegrained FP8 ops)
Python: 3.10

The patches in quantize_nvfp4.py are for modelopt 0.45.0.dev64 specifically. Later versions may include fixes natively — check before applying.
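Because the patches target one specific modelopt build, a small preflight check on installed versions can catch a mismatch before a multi-hour run. The sketch below uses the standard library's `importlib.metadata`. The prefix table and function names are illustrative, and prefix matching is an assumption about how strict the check should be.

```python
from importlib import metadata

# Version prefixes taken from the dependency list above.
EXPECTED_PREFIXES = {
    "nvidia-modelopt": "0.45.0.dev64",
    "transformers": "5.8.0.dev0",
}


def installed_version(dist_name: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected: dict, lookup=installed_version) -> dict:
    """Map each package whose version does not start with the expected prefix
    to the version that was actually found (None if not installed)."""
    problems = {}
    for name, prefix in expected.items():
        found = lookup(name)
        if found is None or not found.startswith(prefix):
            problems[name] = found
    return problems
```

An empty result from `find_mismatches(EXPECTED_PREFIXES)` means the environment matches the versions the patches were written for.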

Key Notes

V4 is NOT BF16 — it ships as mixed-precision FP8/FP4. You MUST dequantize to BF16 first (Step 1).
--low_memory_mode causes meta device errors with V4 — don't use.
modelopt has no explicit V4 support — relies on auto-detection of fused experts.
The calibration state save (v4_nvfp4_calibrated_state.pt) is ~1.5TB. It lives on NVMe, not in git.
The amax snapshot (v4_nvfp4_amax_snapshots.pt) is ~50MB. Small, critical, cheap insurance.
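Given the artifact sizes quoted above, checking free space on the NVMe volume before starting is cheap. The following sketch uses `shutil.disk_usage`. The artifact table reuses the sizes from this README (including the ~600GB export target), while the decimal units, the 10% headroom, and the function names are assumptions.

```python
import shutil

MB = 1000 ** 2
GB = 1000 ** 3
TB = 1000 ** 4

# Approximate on-disk sizes quoted in this README.
REQUIRED = {
    "calibrated_state": int(1.5 * TB),
    "amax_snapshot": 50 * MB,
    "export": 600 * GB,
}


def required_bytes(artifacts: dict) -> int:
    """Total bytes needed for all listed artifacts."""
    return sum(artifacts.values())


def has_room(free: int, artifacts: dict, headroom: float = 0.10) -> bool:
    """True if `free` covers the artifacts plus a safety margin."""
    return free >= required_bytes(artifacts) * (1 + headroom)


def free_bytes(path: str) -> int:
    """Free space in bytes on the filesystem containing `path`."""
    return shutil.disk_usage(path).free
```

Calling `has_room(free_bytes(output_dir), REQUIRED)` before model loading, where `output_dir` is wherever the state files will be written, avoids discovering a full disk after calibration has already finished.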
