biondizzle 6eaba26914 Defensive quantization: snapshot amax to CPU immediately after calibration

Key changes:
- snapshot_amax_to_cpu(): copies all quantizer _amax to CPU and saves
  to disk (~50MB) right after mtq.quantize() returns, before any other
  GPU operation can corrupt them
- force_all_amax_to_cpu(): nuclear option, moves _pre_quant_scale and
  _global_amax to CPU too
- _FORCE_AMAX_CPU flag + patched amax setter: after calibration, any
  future amax writes go to CPU instead of GPU
- --validate-only mode to check saved state without running anything
- restore_amax_from_snapshot() for --export-only recovery
- torch.cuda.empty_cache() + gc.collect() between steps
- Patches: export_amax CPU fallback, get_activation_scaling_factor
  clamp instead of assert

2026-05-09 06:31:08 +00:00

patches

Add BF16 upcast script and Blackwell DeepGEMM patch

2026-05-07 14:25:30 +00:00

scripts

Defensive quantization: snapshot amax to CPU immediately after calibration

2026-05-09 06:31:08 +00:00

.env

Purge OpenClaw session files, memory dumps, __pycache__. Update .gitignore

2026-05-08 17:09:59 +00:00

.gitignore

Replace shell wrapper with in-process quantize script

2026-05-09 06:07:22 +00:00

index.yaml

Purge INT4 references — expert weights are FP4 (E2M1), not INT4

2026-05-08 02:33:46 +00:00

README.md

Replace shell wrapper with in-process quantize script

2026-05-09 06:07:22 +00:00

requirements.txt

NVIDIA Model Optimizer branch: nvfp4_experts_only PTQ for DeepSeek V4 Pro

2026-05-07 00:11:31 +00:00

README.md

DeepSeek V4 Pro → NVFP4 Quantization

Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7TB RAM, 13TB NVMe). Target: ~600GB.

Pipeline

Step 1: Dequantize FP8 → BF16

python3 scripts/dequant_fp8_to_bf16.py /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 /root/nvidia-meeting/DeepSeek-V4-Pro-BF16

The original V4 weights use mixed precision (FP8 attention + FP4/E2M1 experts with per-tensor scales). We dequantize everything to pure BF16 so modelopt can run calibration without hitting broken FP8 kernel paths on Blackwell (DeepGEMM unsupported, Triton finegrained FP8 matmul shape mismatches).

This is not a blind upcast — it applies the actual scale factors:

W_bf16 = dequantize_fp4_weight(W_int, S)  # per-tensor scale dequant, not .to(bfloat16)

Byte-exact verified — matmul diff is 0.000000 against the official inference path.
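
To make the difference between a scale-aware dequant and a plain cast concrete, here is a minimal sketch in plain Python. It is not the repo's dequantize_fp4_weight (that function is not shown here); the name dequantize_packed_fp4 and the low-nibble-first packing order are assumptions for illustration. Only the E2M1 value table (1 sign bit, 2 exponent bits, 1 mantissa bit) is standard.

```python
# Magnitudes representable by the 3 low bits of an E2M1 (FP4) code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 index the table."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequantize_packed_fp4(packed: bytes, scale: float) -> list[float]:
    """Unpack two FP4 values per byte and apply a per-tensor scale.

    ASSUMPTION: the low nibble holds the first element. The real
    checkpoint layout may differ.
    """
    out = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0x0F) * scale)
        out.append(decode_e2m1(byte >> 4) * scale)
    return out


if __name__ == "__main__":
    # 0x21 -> low nibble 0x1 (0.5), high nibble 0x2 (1.0); scale 2.0
    print(dequantize_packed_fp4(bytes([0x21]), 2.0))
```

A cast such as .to(bfloat16) on the packed storage would reinterpret the raw codes as numbers and skip both the table lookup and the scale, which is why the script applies the scale factors explicitly.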

Step 2: Run NVFP4 Quantization

python3 scripts/quantize_nvfp4.py

This script runs the full pipeline in-process (not wrapping the shell script):

1. Load BF16 model with sequential device map (3TB model, CPU offload)
2. Patch modelopt for V4 compatibility (ModuleList quantizers, GPU tensor safety)
3. Quantize + Calibrate (5-6 hours, 128 samples)
4. Save model state to disk (insurance against export crashes)
5. Export to HF safetensors

If the export crashes (and it will — modelopt's export reads stale GPU tensors after hours of calibration):

python3 scripts/quantize_nvfp4.py --export-only

This loads the saved calibration state and retries just the export step.
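
The save-then-export flow and the --export-only retry can be sketched as below. This is an illustration of the pattern only, not the code in quantize_nvfp4.py: run_pipeline, calibrate, and export are hypothetical names, and pickle stands in for whatever serialization the real script uses for its .pt state file.

```python
import pickle
from pathlib import Path


def run_pipeline(calibrate, export, state_path, export_only=False):
    """Checkpoint the calibration result before attempting the export.

    calibrate: callable returning the calibrated state (hypothetical).
    export:    callable taking that state (hypothetical); may raise.
    """
    state_path = Path(state_path)
    if export_only:
        # Retry path: skip the expensive calibration and reuse the saved state.
        with state_path.open("rb") as fh:
            state = pickle.load(fh)
    else:
        state = calibrate()
        # Write to a temp file first so a crash mid-write cannot leave a
        # truncated state file under the final name.
        tmp = state_path.with_suffix(state_path.suffix + ".tmp")
        with tmp.open("wb") as fh:
            pickle.dump(state, fh)
        tmp.replace(state_path)
    return export(state)
```

The point of the ordering is that the state file is already on disk when export runs, so a failed export costs a retry of the last step rather than another multi-hour calibration.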

Config:

--quant nvfp4 (full model, not experts-only)
--calib 128 — 128 calibration samples. 256 OOMs with 3TB BF16 on CPU offload.
--kv_cache_quant fp8_cast
--use_seq_device_map — sequential device mapping (CPU offload)
--gpu_max_mem_percentage 0.7 — VRAM headroom
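
As a rough illustration of what a VRAM fraction such as 0.7 means for placement, the sketch below builds a per-device memory budget in the dictionary shape that Hugging Face Accelerate accepts as max_memory (integer GPU indices plus a "cpu" key). The helper build_max_memory and its parameters are hypothetical, and whether modelopt derives its budget exactly this way is an assumption.

```python
def build_max_memory(num_gpus, gpu_total_gib, gpu_fraction, cpu_gib):
    """Return a max_memory-style budget: a capped share of each GPU plus CPU RAM.

    All arguments are illustrative inputs; query the real device sizes
    on the target machine rather than hard-coding them.
    """
    per_gpu = int(gpu_total_gib * gpu_fraction)  # round down to whole GiB
    budget = {i: f"{per_gpu}GiB" for i in range(num_gpus)}
    budget["cpu"] = f"{cpu_gib}GiB"
    return budget
```

Layers that do not fit within the GPU entries spill to the "cpu" entry, which is the CPU offload behaviour the sequential device map relies on.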

Calibration datasets: abisee/cnn_dailymail + nvidia/Nemotron-Post-Training-Dataset-v2 (gated — requires HF token).

Runtime: Model loading ~53 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.

Bugs Found (V4 + modelopt)

1. QuantDeepseekV4Experts AttributeError: V4 uses nn.ModuleList for per-expert quantizers, but modelopt expected a single TensorQuantizer. Patched in quantize_nvfp4.py.
2. --low_memory_mode triggers a meta device error. Don't use it with V4.
3. Missing kernels package for FP8 ops. Fix: pip install -U kernels.
4. The shell script takes --calib, not --calib_size, and --quant, not --qformat (no longer relevant, since we run in-process).
5. Export crash, stale GPU tensors: after 5+ hours of calibration, modelopt's export step reads quantizer amax tensors that have been sitting in VRAM for hours and fails with a CUDA illegal memory access. Fixed by moving quantizer tensors to CPU before export.
6. Export crash, assert torch.all(activation_scaling_factor > 0): related to #5. The amax values from stale GPU reads are garbage. Fixed by clamping instead of asserting.
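
The clamp-instead-of-assert idea can be sketched on plain Python floats. The actual patch operates on torch tensors inside modelopt; the function name clamp_scaling_factors and the floor value here are assumptions for illustration.

```python
def clamp_scaling_factors(factors, floor=1e-8):
    """Replace non-positive, tiny, or NaN scaling factors with a small floor.

    Returns the cleaned list and how many entries were replaced, so the
    caller can log the damage instead of aborting the export.
    """
    cleaned = []
    replaced = 0
    for f in factors:
        # The comparison is False for NaN, so NaN also falls through to the floor.
        if f >= floor:
            cleaned.append(f)
        else:
            cleaned.append(floor)
            replaced += 1
    return cleaned, replaced
```

Clamping lets the export finish, but a nonzero replaced count still means some calibration statistics were lost, so it is worth logging and checking against the saved CPU snapshot.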

Dependencies (pinned versions)

nvidia-modelopt: 0.45.0.dev64+g579fc6c31 (installed from git, not PyPI)
transformers: 5.8.0.dev0 (from git, required for DeepSeekV4 support)
kernels: latest (pip install -U kernels — needed for finegrained FP8 ops)
Python: 3.10

The patches in quantize_nvfp4.py are for modelopt 0.45.0.dev64 specifically. Later versions may include fixes natively.
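
Because the patches target one specific modelopt build, a start-up check against the pins above can catch a mismatched environment early. The following is a minimal sketch using the standard library's importlib.metadata; check_pins and the PINNED table are hypothetical helpers, not part of this repo.

```python
from importlib import metadata

# Version prefixes taken from the list above. An empty prefix means
# "any installed version" (kernels is tracked as latest).
PINNED = {
    "nvidia-modelopt": "0.45.0.dev64",
    "transformers": "5.8.0.dev0",
    "kernels": "",
}


def check_pins(pins=PINNED, get_version=metadata.version):
    """Return a list of human-readable problems; an empty list means all pins match."""
    problems = []
    for pkg, prefix in pins.items():
        try:
            installed = get_version(pkg)
        except metadata.PackageNotFoundError:
            problems.append(f"{pkg}: not installed")
            continue
        # Prefix match so a local build suffix such as +g579fc6c31 still passes.
        if not installed.startswith(prefix):
            problems.append(f"{pkg}: found {installed}, expected {prefix}*")
    return problems
```

Calling check_pins() before loading the model and aborting on a non-empty result costs almost nothing compared with discovering a mismatch after calibration.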

Key Notes

V4 is NOT BF16 — it ships as mixed-precision FP8/FP4. You MUST dequantize to BF16 first (Step 1).
--low_memory_mode causes meta device errors with V4 — don't use.
modelopt has no explicit V4 support — relies on auto-detection of fused experts.
The calibration state save (v4_nvfp4_calibrated_state.pt) is ~1.5TB. It lives on NVMe, not in git.
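
Given the size of the state file, checking free space on the target volume before the save is a cheap safeguard. A minimal sketch with the standard library follows; has_room, the 10% margin, and the example path are assumptions.

```python
import shutil


def has_room(path, needed_bytes, margin=0.1, disk_usage=shutil.disk_usage):
    """Return True if the volume holding `path` has needed_bytes plus a safety margin free."""
    free = disk_usage(path).free
    return free >= needed_bytes * (1 + margin)


if __name__ == "__main__":
    # ~1.5 TB state file; the directory is a placeholder for the NVMe mount.
    print(has_room(".", int(1.5e12)))
```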
