biondizzle ce9056d259 README overhaul: reflect current architecture (hf_main, run history through Run 10)

- Architecture section: call hf_main() directly, not rewrite the pipeline
- Run history: all 10 runs with root causes and fixes
- Key lessons: stale GPU tensors, expert OOM, pipeline rewriting trap, __main__ gap
- Runtime patches: 3 monkey-patches + 3 post-calibration hook steps
- Do NOT repeat: 8 specific mistakes with run references
- File layout with legacy patches note

2026-05-09 16:09:09 +00:00

patches

Add BF16 upcast script and Blackwell DeepGEMM patch

2026-05-07 14:25:30 +00:00

scripts

Fix: apply hf_ptq __main__ post-parse conversions (dataset split, calib_size int list)

2026-05-09 15:58:36 +00:00

.env

Purge OpenClaw session files, memory dumps, __pycache__. Update .gitignore

2026-05-08 17:09:59 +00:00

.gitignore

Replace shell wrapper with in-process quantize script

2026-05-09 06:07:22 +00:00

index.yaml

Purge INT4 references — expert weights are FP4 (E2M1), not INT4

2026-05-08 02:33:46 +00:00

README.md

README overhaul: reflect current architecture (hf_main, run history through Run 10)

2026-05-09 16:09:09 +00:00

requirements.txt

NVIDIA Model Optimizer branch: nvfp4_experts_only PTQ for DeepSeek V4 Pro

2026-05-07 00:11:31 +00:00

README.md

DeepSeek V4 Pro → NVFP4 Quantization

Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7TB RAM, 13TB NVMe). Target: ~600GB.

Cost: ~$161/run at $23/hr for a 7-hour run (up to ~$184 if a run takes 8 hours). Don't waste runs.

Architecture

We call modelopt's hf_ptq.main() directly (referred to below as hf_main()). It is the same entry point the shell script uses, so we don't rewrite the pipeline. We just:

Patch modelopt at runtime (GPU tensor safety, before anything runs)
Hook export_quantized to snapshot amax + save state before export
Call hf_main(args) with properly parsed args

This avoids the cascade of missing-arg bugs from manually constructing argparse.Namespace (Runs 4–8).

Pipeline

Step 1: Dequantize FP8 → BF16

python3 scripts/dequant_fp8_to_bf16.py /root/nvidia-meeting/DeepSeek-V4-Pro-FP8 /root/nvidia-meeting/DeepSeek-V4-Pro-BF16

The original V4 weights use mixed precision (FP8 attention + FP4/E2M1 experts with per-tensor scales). We dequantize everything to pure BF16 so modelopt can run calibration without hitting broken FP8 kernel paths on Blackwell (DeepGEMM unsupported, Triton finegrained FP8 matmul shape mismatches).

This is not a blind upcast — it applies the actual scale factors:

W_bf16 = dequantize_fp4_weight(W_int, S)  # per-tensor scale dequant, not .to(bfloat16)

Byte-exact verified — matmul diff is 0.000000 against the official inference path.
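To make "applies the actual scale factors" concrete, here is a minimal pure-Python sketch of E2M1 decoding with a single per-tensor scale. The E2M1 value table is the standard one for this format. The nibble packing order (low nibble first) and the function names are assumptions for illustration, not necessarily what dequant_fp8_to_bf16.py does.

```python
# E2M1 magnitudes indexed by the low three bits (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 select the magnitude."""
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def dequantize_packed_fp4(packed: bytes, scale: float) -> list[float]:
    """Unpack two FP4 codes per byte (low nibble first, an assumed layout) and apply the scale."""
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F) * scale)
        values.append(decode_e2m1(byte >> 4) * scale)
    return values
```

A plain cast of the stored codes would skip both the table lookup and the multiplication by the scale, which is why a blind .to(bfloat16) gives wrong weights.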

Step 2: Run NVFP4 Quantization

cd /root/nvidia-meeting/modelopt-repo/examples/llm_ptq
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py

Must run from the modelopt example directory (relative imports).

What happens inside:

Apply patches — 3 runtime monkey-patches for GPU tensor safety (see below)
Parse args — uses hf_ptq.parse_args() with our config via sys.argv replacement, then applies the same post-parse conversions (dataset split, calib_size int list) that hf_ptq.__main__ normally does
Hook export — monkey-patch export_quantized to snapshot amax + save state before export
Call hf_main(args) — the exact same pipeline the shell script uses
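The "parse args via sys.argv replacement" step above can be sketched as follows. stand_in_parse_args is a hypothetical placeholder for hf_ptq.parse_args() (the real parser defines many more flags, and the defaults shown here are invented), so treat this as an illustration of the technique rather than the script's actual code.

```python
import argparse
import sys
from contextlib import contextmanager


@contextmanager
def replaced_argv(argv):
    """Temporarily swap sys.argv so a library's parse_args() reads our config."""
    saved = sys.argv
    sys.argv = list(argv)
    try:
        yield
    finally:
        sys.argv = saved  # always restore, even if parsing raises


def stand_in_parse_args():
    """Stand-in for hf_ptq.parse_args(); flags and defaults here are illustrative only."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--qformat", default="fp8")
    parser.add_argument("--calib_size", default="512")
    return parser.parse_args()  # reads sys.argv[1:]


def parse_with_config(parse_args, flags):
    """Run an argv-reading parser against our flags instead of the real command line."""
    with replaced_argv(["quantize_nvfp4.py", *flags]):
        return parse_args()
```

Because the library's own parser builds the args object, every default it defines is present, which is the property the hand-built Namespace in Runs 4–8 lacked.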

If the export crashes:

python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --export-only

To validate saved state without running anything:

python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --validate-only

Config: nvfp4, 128 calib samples, calib_seq=512, kv_cache_qformat=fp8_cast, gpu_max_mem_percentage=0.7, use_seq_device_map, inference_tensor_parallel=8
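One way to keep this config in a single place is a dict that is expanded into CLI tokens. This is a sketch: it assumes every key above is accepted by hf_ptq.py as a flag of the same name (the README confirms --qformat, --calib_size, --kv_cache_qformat and --inference_tensor_parallel; the others are assumed), and config_to_argv is a hypothetical helper.

```python
CONFIG = {
    "qformat": "nvfp4",
    "calib_size": 128,
    "calib_seq": 512,
    "kv_cache_qformat": "fp8_cast",
    "gpu_max_mem_percentage": 0.7,
    "use_seq_device_map": True,
    "inference_tensor_parallel": 8,
}


def config_to_argv(config):
    """Turn a config dict into CLI tokens; True becomes a bare flag, False/None is omitted."""
    argv = []
    for key, value in config.items():
        if value is True:
            argv.append(f"--{key}")
        elif value is False or value is None:
            continue
        else:
            argv.extend([f"--{key}", str(value)])
    return argv
```

The resulting list is what would be handed to the parser in place of the real command line.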

Calibration datasets: abisee/cnn_dailymail + nvidia/Nemotron-Post-Training-Dataset-v2 (default when no --dataset specified).

Runtime: model loading ~50 min, calibration ~5.5 hours, export ~30–60 min. The phases sum to roughly 7 hours; budget 7–8 hours per run.
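The runtime and cost figures can be cross-checked with a few lines of arithmetic. The phase durations and the $23/hr rate come from this README; the helper name is hypothetical.

```python
HOURLY_RATE_USD = 23.0

# Phase durations in minutes, as (low, high) estimates.
PHASES_MIN = {
    "model_loading": (50, 50),
    "calibration": (330, 330),
    "export": (30, 60),
}


def run_estimate(phases, hourly_rate):
    """Return ((low_hours, high_hours), (low_cost, high_cost)) for one run."""
    low_hours = sum(low for low, _ in phases.values()) / 60
    high_hours = sum(high for _, high in phases.values()) / 60
    return (low_hours, high_hours), (low_hours * hourly_rate, high_hours * hourly_rate)
```

With these inputs the phases alone come to about 6.8 to 7.3 hours, or roughly $157 to $169 per run, before any setup or teardown overhead.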

Run History

| Run | Date | Commit | Calib | Result | Root Cause | Fix |
|---|---|---|---|---|---|---|
| 1 | May 7 | shell wrapper | 256 | ❌ Batch probing crash | `o_b_proj` shape mismatch: finegrained_fp8 wraps MLA projections incorrectly with FP8 source | Use BF16 source (dequantized) |
| 2 | May 8-9 | shell wrapper | 128 | ❌ Export crash (calib ✅) | `get_activation_scaling_factor` reads stale GPU amax → CUDA illegal memory access | Snapshot amax to CPU after calibration |
| 3 | May 9 06:10 | `3907838` | 128 | ❌ Model loading OOM | `AutoModelForCausalLM.from_pretrained` OOM during expert weight `torch.cat` | Use modelopt `get_model()` with `max_memory` |
| 4 | May 9 ~07:00 | `86dd8df` | 128 | ❌ Import error | `mtq.KV_QUANT_CFG_CHOICES` doesn't exist; it's `hf_ptq.KV_QUANT_CFG_CHOICES` | Import from `hf_ptq`, not `mtq` |
| 5 | May 9 ~08:05 | `f9bbef8` | 128 | ❌ Same as Run 4 | Fix wasn't synced properly | Properly synced |
| 6 | May 9 ~09:25 | `6c1bff6` | 128 | ❌ Dataloader crash | `make_calib_dataloader` AttributeError (missing args) | Added args to Namespace |
| 7 | May 9 ~13:40 | `25b4d8d` | 128 | ❌ Dataloader crash | `dataset=None`, `len()` on None | Provided dataset list |
| 8 | May 9 ~14:00 | `b2849a8` | 128 | ❌ Argparse crash | Wrong flag names (shell script names vs `hf_ptq.py` names) | Use `hf_ptq.py` flag names |
| 9 | May 9 ~14:30 | `a300302` | 128 | ❌ TypeError | Skipped `__main__` post-parse conversions (`calib_size` still string, not int list) | Apply same conversions after `parse_args()` |
| 10 | May 9 ~15:30 | `5a72da7` | 128 | 🔄 Running | N/A | Calls `hf_main(args)` directly |

Key Lessons

Run 2 — Stale GPU tensors: use_seq_device_map shuffles layers through GPU for calibration. Quantizer amax tensors sit in VRAM for 5+ hours while CUDA's allocator churns memory. By export time, the GPU tensor metadata is valid but the underlying memory has been recycled — reading it triggers cudaErrorIllegalAddress. Fix: copy amax to CPU immediately after calibration.

Run 3 — Expert weight OOM: AutoModelForCausalLM.from_pretrained does torch.cat on GPU for expert gate_up_proj (31.5GB alloc, 25.9GB free). Fix: use modelopt's get_model() which sets max_memory per GPU before loading. (Note: Run 10 uses hf_main() which calls get_model() internally.)
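The idea of capping per-GPU memory before loading can be sketched like this. The dict shape ({device_index: "NGiB"}) follows the max_memory convention used by Hugging Face loading; the 180 GiB per-GPU total in the usage note is an assumed figure, and whether get_model() computes its caps this way is also an assumption.

```python
def build_max_memory(num_gpus, gpu_total_gib, fraction):
    """Cap each GPU at a fraction of its memory, leaving headroom for load-time temporaries."""
    cap_gib = round(gpu_total_gib * fraction)
    return {gpu_index: f"{cap_gib}GiB" for gpu_index in range(num_gpus)}


def fits(alloc_gb, free_gb):
    """True if a single allocation fits in the currently free memory."""
    return alloc_gb <= free_gb
```

For example, build_max_memory(8, 180, 0.7) caps each of the 8 GPUs at "126GiB", and fits(31.5, 25.9) is False, which is the Run 3 failure in one line: the concat needed more than what was left on the device.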

Runs 4–8 — Pipeline rewriting trap: Trying to reconstruct hf_ptq's pipeline by importing individual functions and building a fake argparse.Namespace causes an endless stream of missing-attribute and type errors. Each fix reveals the next bug. Fix: call hf_main(args) directly with a properly parsed args object.
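A toy reproduction of the trap, using stand-ins rather than real hf_ptq code: a hand-built Namespace only contains the attributes we remembered, while a parser-produced one carries every default the library defines. The flag names and defaults below are illustrative.

```python
import argparse


def stand_in_pipeline_step(args):
    """Stand-in for a library function that reads attributes the caller may not know about."""
    return f"{args.qformat}:{args.batch_size}"


def build_parser():
    """Stand-in parser; its defaults play the role of hf_ptq's own defaults."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--qformat", default="fp8")
    parser.add_argument("--batch_size", type=int, default=0)
    return parser


# Looks complete, but batch_size was never set.
hand_built = argparse.Namespace(qformat="nvfp4")

# Same intent, but the parser fills in every default.
parsed = build_parser().parse_args(["--qformat", "nvfp4"])
```

Calling stand_in_pipeline_step(hand_built) raises AttributeError, while stand_in_pipeline_step(parsed) works. In the real pipeline each such error only surfaced after the previous one was fixed.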

Run 9 — __main__ gap: hf_ptq.py does critical type conversions in its __main__ block (string → list for dataset, string → int list for calib_size). When calling main() directly, these are skipped. Fix: apply the same conversions after parse_args().
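A minimal sketch of the conversions described above, assuming both values arrive as comma-separated strings. The exact logic in hf_ptq.py's __main__ block may differ, and apply_post_parse_conversions is a hypothetical name.

```python
import argparse


def apply_post_parse_conversions(args):
    """Convert comma-separated strings into the list types the pipeline expects."""
    if isinstance(args.dataset, str):
        args.dataset = [name.strip() for name in args.dataset.split(",") if name.strip()]
    if isinstance(args.calib_size, str):
        args.calib_size = [int(size) for size in args.calib_size.split(",")]
    return args
```

For example, a Namespace with dataset="a,b" and calib_size="128" becomes dataset=["a", "b"] and calib_size=[128]. A dataset of None is left untouched so the library's default handling still applies.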

Do NOT Repeat These Mistakes

Don't use FP8 source model — kernel issues on Blackwell (Run 1)
Don't use --low_memory_mode with V4 — meta device errors
Don't use calib_size=256 — OOMs with 3TB BF16 on CPU offload
Don't use AutoModelForCausalLM.from_pretrained directly — OOM during expert weight concat (Run 3)
Don't assume GPU tensor integrity after 5+ hours of sequential calibration (Run 2)
Don't rewrite the hf_ptq pipeline — call hf_main() directly (Runs 4–8)
Don't skip the __main__ post-parse conversions — calib_size must be int list, dataset must be list (Run 9)
Don't use shell script arg names (--quant, --calib, --kv_cache_quant, --tp) — use hf_ptq.py names (--qformat, --calib_size, --kv_cache_qformat, --inference_tensor_parallel)
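If old shell-style invocations are still lying around, a small translation table avoids the Run 8 failure. The pairing below is inferred from the order of the two lists in the last item above; translate_flags is a hypothetical helper.

```python
SHELL_TO_HF_PTQ = {
    "--quant": "--qformat",
    "--calib": "--calib_size",
    "--kv_cache_quant": "--kv_cache_qformat",
    "--tp": "--inference_tensor_parallel",
}


def translate_flags(argv):
    """Rename shell-wrapper flags to hf_ptq.py names; handles both '--f v' and '--f=v'."""
    translated = []
    for token in argv:
        name, sep, value = token.partition("=")
        translated.append(SHELL_TO_HF_PTQ.get(name, name) + sep + value)
    return translated
```

Tokens that are not in the table, including flag values, pass through unchanged.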

Runtime Patches Applied by quantize_nvfp4.py

These are monkey-patches applied at runtime — no modelopt source files are modified.

TensorQuantizer.load_calib_amax — After calibration writes _amax to GPU, immediately moves it to CPU. Prevents stale GPU tensors.
TensorQuantizer.export_amax — If _amax is still on GPU at export time, moves to CPU before reading. Safety net.
NVFP4QTensor.get_activation_scaling_factor — Moves amax to CPU, clamps bad values instead of hard assert. Prevents crash on garbage from GPU corruption.
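The patching pattern behind the first entry (run the original method, then immediately move the result off the GPU) can be sketched without modelopt or torch. StandInQuantizer is a hypothetical class with a string in place of a real tensor device; the real patch targets TensorQuantizer and real tensors.

```python
import functools


class StandInQuantizer:
    """Stand-in for a quantizer; `device` is a plain string instead of a tensor device."""

    def __init__(self):
        self.amax = None
        self.device = None

    def load_calib_amax(self):
        self.amax, self.device = 1.5, "cuda:0"


def patch_with_post_step(cls, method_name, post_step):
    """Replace cls.method_name with a wrapper that runs post_step(self) after the original."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        post_step(self)
        return result

    setattr(cls, method_name, wrapper)
    return original  # keep a handle so the patch can be undone


def move_to_cpu(quantizer):
    """Stand-in for moving the amax tensor to CPU."""
    quantizer.device = "cpu"
```

Patching the class rather than individual instances means every quantizer created later is covered too, and no source file on disk changes.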

Post-Calibration Hook

export_quantized is monkey-patched to run these steps before the real export:

snapshot_amax_to_cpu() — Walks all quantizers, copies _amax to CPU, saves to disk (~50MB). Insurance policy.
force_all_amax_to_cpu() — Moves _pre_quant_scale, _global_amax to CPU too. Nuclear option.
save_calibrated_state() — Saves full model state dict to disk (~1.5TB). Enables --export-only recovery if export crashes.
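The ordering guarantee matters more than the wrapping itself: all three steps must finish before the real export starts, so that a crash during export still leaves recoverable state on disk. A minimal sketch, with a hypothetical stand-in module in place of the one that owns export_quantized:

```python
import types


def hook_before(module, func_name, pre_steps):
    """Wrap module.func_name so each pre-step runs, in order, before the real function."""
    original = getattr(module, func_name)

    def hooked(*args, **kwargs):
        for step in pre_steps:
            step(*args, **kwargs)  # a failing step aborts before export begins
        return original(*args, **kwargs)

    setattr(module, func_name, hooked)
    return original


# Stand-in for the module that owns export_quantized (hypothetical).
stand_in_module = types.SimpleNamespace(
    export_quantized=lambda model: f"exported {model}"
)
```

After hook_before(stand_in_module, "export_quantized", [snapshot, force_cpu, save_state]), any caller that looks up export_quantized on that module gets the hooked version. Code that imported the function by name before the hook was installed would still hold the original, so the hook has to be installed first.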

Bugs Found (V4 + modelopt 0.45.0.dev64)

~~QuantDeepseekV4Experts AttributeError~~ — Already fixed in modelopt 0.45.0.dev64 (handles nn.ModuleList quantizers natively).
--low_memory_mode → meta device error. Don't use with V4.
Missing kernels package for FP8 ops. pip install -U kernels.
~~Shell script arg names~~ — Resolved by calling hf_main() directly.
Export crash — stale GPU tensors in export_amax(). After hours of calibration, quantizer _amax on GPU becomes unreadable. Fixed by patching export_amax to move _amax to CPU before reading.
Export crash — assert torch.all(activation_scaling_factor > 0). Amax values from stale GPU reads are garbage (zeros, negatives, NaN). Fixed by clamping instead of asserting, plus snapshotting valid amax to CPU before corruption can occur.
Model loading OOM during expert weight conversion. AutoModelForCausalLM.from_pretrained does torch.cat on GPU for expert gate_up_proj (31.5GB alloc), but only 25.9GB free with device_map="sequential". Fixed by using modelopt's get_model() which sets max_memory per GPU before loading.
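The "clamp instead of assert" fix can be illustrated on plain floats. This is a sketch of the idea only: the real patch operates on tensors inside NVFP4QTensor.get_activation_scaling_factor, and the floor value here is an assumed placeholder.

```python
import math


def sanitize_amax(values, floor=1e-6):
    """Replace non-finite or non-positive amax entries with a small positive floor."""
    cleaned, replaced = [], 0
    for value in values:
        if math.isfinite(value) and value > 0:
            cleaned.append(value)
        else:
            cleaned.append(floor)
            replaced += 1
    return cleaned, replaced
```

Returning the replacement count makes corruption visible: a nonzero count means the export finished on clamped values and the CPU snapshot should be treated as the source of truth.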

Dependencies (pinned versions)

nvidia-modelopt: 0.45.0.dev64+g579fc6c31 (installed from git, not PyPI)
transformers: 5.8.0.dev0 (from git, required for DeepSeekV4 support)
kernels: latest (pip install -U kernels — needed for finegrained FP8 ops)
Python: 3.10

The patches in quantize_nvfp4.py are for modelopt 0.45.0.dev64 specifically. Later versions may include fixes natively — check before applying.
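A guard like the following could refuse to apply the patches when the installed version is not the one they were written for. It uses the standard importlib.metadata lookup; the function names are hypothetical, and comparing only the part before the "+" local suffix is a design assumption.

```python
from importlib import metadata

PATCH_TARGET = "0.45.0.dev64"


def patches_match(installed_version, target=PATCH_TARGET):
    """True if the installed version is the target, ignoring any '+local' build suffix."""
    return installed_version.split("+", 1)[0] == target


def installed_modelopt_version():
    """Return the installed nvidia-modelopt version string, or None if it is not installed."""
    try:
        return metadata.version("nvidia-modelopt")
    except metadata.PackageNotFoundError:
        return None
```

For the pinned build, patches_match("0.45.0.dev64+g579fc6c31") is True. On any other version the script could stop and ask for a manual check rather than patch blindly.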

Key Notes

V4 is NOT BF16 — it ships as mixed-precision FP8/FP4. You MUST dequantize to BF16 first (Step 1).
--low_memory_mode causes meta device errors with V4 — don't use.
modelopt has no explicit V4 support — relies on auto-detection of fused experts.
The calibration state save (v4_nvfp4_calibrated_state.pt) is ~1.5TB. It lives on NVMe, not in git.
The amax snapshot (v4_nvfp4_amax_snapshots.pt) is ~50MB. Small, critical, cheap insurance.
The script calls hf_main(args) — the exact same entry point as the shell script. No pipeline divergence.
Must run from /root/nvidia-meeting/modelopt-repo/examples/llm_ptq (relative imports).

File Layout

scripts/
  dequant_fp8_to_bf16.py   — Step 1: FP8/FP4 → BF16 dequantization
  quantize_nvfp4.py         — Step 2: NVFP4 quantization (patches + hf_main)

patches/
  patch_finegrained_fp8_blackwell.py  — (legacy) FP8 kernel patches for Blackwell
  quant_module_patched.py             — (legacy) quant module patches

The patches/ directory contains earlier approaches that modified modelopt source files directly. The current approach (quantize_nvfp4.py) uses runtime monkey-patching instead — no source files are modified.
