deepseek-v4-quant

Author	SHA1	Message	Date
biondizzle	07cd50e823	8 patches covering full export chain: no more whack-a-mole. Traced the full execution chain from _process_quantized_modules through every function that reads stale GPU tensors: _process_quantized_modules → _export_quantized_weight (Patch 4: force weight to CPU at entry point) → get_weight_scaling_factor (Patch 7: belt-and-suspenders) → get_weights_scaling_factor_from_quantizer (safe: weight now CPU) → NVFP4QTensor.get_weights_scaling_factor (safe: input is CPU) → get_weight_scaling_factor_2 (Patch 8: force quantizer to CPU) → get_activation_scaling_factor (Patch 3: CPU + clamp) → to_quantized_weight (Patch 6: force all tensors to CPU) → weight.to(dtype) (safe: weight is CPU) → _export_fused_experts (Patch 5: force expert weights + quantizer to CPU). Patch 4 is the key: it moves weight to CPU at the earliest possible point, so ALL downstream .to(weight.device) calls resolve to CPU. Patches 5-8 are belt-and-suspenders for alternative code paths.	2026-05-09 22:50:58 +00:00
biondizzle	efc111a11f	Add Patch 4+5: get_weight_scaling_factor and get_weight_scaling_factor_2 CPU safety. Run 10 completed calibration (128/128) but crashed at export in get_weight_scaling_factor: the weight tensor on GPU was stale after 5+ hours of calibration, and weight_scaling_factor_2.to(weight.device) triggered cudaErrorIllegalAddress. Patches 4+5 force weight and quantizer state to CPU before computing scaling factors. This mirrors the same pattern as Patch 3 (get_activation_scaling_factor). Calibrated state saved successfully (721.4 GB, 47,696 amax tensors). Amax snapshot saved (15.4 MB). Re-running with new patches.	2026-05-09 22:43:48 +00:00
biondizzle	5a72da7193	Fix: apply hf_ptq __main__ post-parse conversions (dataset split, calib_size int list). When calling hf_main(args) directly, the __main__ block conversions that run between parse_args() and main() are skipped. calib_size stays as string '128' instead of [128], causing TypeError on list concatenation.	2026-05-09 15:58:36 +00:00
biondizzle	a300302486	Fix: use hf_ptq.py arg names (--pyt_ckpt_path, --qformat, --inference_tensor_parallel)	2026-05-09 14:57:28 +00:00
biondizzle	1a36a655ea	Fix: use full argparse flag names (--calib_size, --kv_cache_qformat)	2026-05-09 14:54:51 +00:00
biondizzle	b2849a8944	Fundamental rewrite: call hf_main() instead of rewriting the pipeline. The previous approach tried to reconstruct hf_ptq's pipeline by importing individual functions and building a fake argparse.Namespace. This caused repeated crashes from missing args (KV_QUANT_CFG_CHOICES, dataset, calib_with_images, etc.). New approach: call hf_ptq.parse_args() with sys.argv replaced, which gets ALL defaults; call hf_main(args), the exact same entry point the shell script uses; hook export_quantized to add amax snapshot + state save before export; no more missing args. No more diverging from the example script. The only changes from the stock pipeline: 1. Runtime patches (load_calib_amax CPU, export_amax CPU, clamp); 2. Post-calibration hook (snapshot amax, save state, force CPU).	2026-05-09 14:52:02 +00:00
biondizzle	25b4d8da06	Fix: add missing args for make_calib_dataloader (dataset, calib_with_images, auto_quantize, specdec)	2026-05-09 13:37:24 +00:00
biondizzle	6c1bff6997	Clean rewrite: verified all imports against runtime, removed dead code. get_model/get_tokenizer imported from example_utils (not hf_ptq); KV_QUANT_CFG_CHOICES imported from hf_ptq (not mtq); removed dead _FORCE_AMAX_CPU global and global reference in run_export_only; fixed stale comments; all 16 imports and references verified against the actual B200 runtime; zero divergences from modelopt example path except get_model().	2026-05-09 09:26:23 +00:00
biondizzle	86dd8df302	Fix: KV_QUANT_CFG_CHOICES is in hf_ptq, not mtq	2026-05-09 09:17:12 +00:00
biondizzle	f9bbef8e91	Fix: patch load_calib_amax instead of amax property setter (can't patch readonly descriptor). Also remove _FORCE_AMAX_CPU global; the load_calib_amax patch handles it.	2026-05-09 08:04:03 +00:00
biondizzle	94179ed9d0	Fix typo: store_only → store_true	2026-05-09 08:02:09 +00:00
biondizzle	03c10ab3b6	Fix model loading: use modelopt get_model() instead of raw AutoModelForCausalLM. Raw from_pretrained OOMs during weight conversion: torch.cat on expert gate_up_proj tries to allocate 31.5GB on a GPU with only 25.9GB free. modelopt's get_model() handles max_memory/device_map properly for models that need sequential device mapping.	2026-05-09 08:00:50 +00:00
biondizzle	6eaba26914	Defensive quantization: snapshot amax to CPU immediately after calibration. Key changes: snapshot_amax_to_cpu(): copies all quantizer _amax to CPU and saves to disk (~50MB) right after mtq.quantize() returns, before any other GPU operation can corrupt them; force_all_amax_to_cpu(): nuclear option, moves _pre_quant_scale and _global_amax to CPU too; _FORCE_AMAX_CPU flag + patched amax setter: after calibration, any future amax writes go to CPU instead of GPU; --validate-only mode to check saved state without running anything; restore_amax_from_snapshot() for --export-only recovery; torch.cuda.empty_cache() + gc.collect() between steps; patches: export_amax CPU fallback, get_activation_scaling_factor clamp instead of assert.	2026-05-09 06:31:08 +00:00
biondizzle	3907838409	Remove ModuleList patch (already fixed in modelopt 0.45), fix numbering	2026-05-09 06:10:18 +00:00
biondizzle	382c1d872f	Fix quant_module import path	2026-05-09 06:09:17 +00:00
biondizzle	9291165ba0	Fix imports: QUANT_CFG_CHOICES is in hf_ptq, not modelopt config	2026-05-09 06:08:35 +00:00
biondizzle	a0bacb3cf6	Replace shell wrapper with in-process quantize script. New scripts/quantize_nvfp4.py: runs full ModelOpt pipeline in-process; saves calibrated state after calibration (insurance against export crashes); patches modelopt for V4: ModuleList quantizers, stale GPU tensor safety; --export-only flag to retry export from saved calibration state; removed old model_opt_nvfp4_full.py (shell wrapper); updated README with new pipeline docs and bug #5/#6.	2026-05-09 06:07:22 +00:00
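
The parse_args()/hf_main() approach from b2849a8944, together with the post-parse fix from 5a72da7193, can be sketched with the standard library alone. Everything named below (build_parser, replaced_argv, apply_post_parse, parse_like_main, and the default values) is a hypothetical stand-in, not the real hf_ptq code. The only detail taken from the log is that calib_size arrives as the string '128' and has to become [128].

```python
import argparse
import sys
from contextlib import contextmanager


def build_parser():
    # Hypothetical stand-in for the example script's parser; the defaults
    # here are invented for illustration.
    parser = argparse.ArgumentParser()
    parser.add_argument("--calib_size", type=str, default="512")
    parser.add_argument("--qformat", type=str, default="fp8")
    return parser


@contextmanager
def replaced_argv(argv):
    """Temporarily swap sys.argv so a parser that reads it sees our flags."""
    saved = sys.argv
    sys.argv = ["prog", *argv]
    try:
        yield
    finally:
        sys.argv = saved  # always restore, even if parsing raises


def apply_post_parse(args):
    """Redo the conversion a __main__ block would normally perform."""
    if isinstance(args.calib_size, str):
        args.calib_size = [int(n) for n in args.calib_size.split(",")]
    return args


def parse_like_main(argv):
    # parse_args() with no arguments reads sys.argv[1:], so every flag we
    # do not pass still gets the parser's own default.
    with replaced_argv(argv):
        args = build_parser().parse_args()
    return apply_post_parse(args)
```

With this in place, parse_like_main(["--calib_size", "128"]).calib_size is a list of ints, so list concatenation on it no longer raises TypeError.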

17 Commits
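
The idea behind Patch 4 in 07cd50e823 (move the weight to CPU at the entry point so that later .to(weight.device) calls resolve to CPU) can be illustrated as a small wrapper. This is a sketch under assumptions, not the patch from the repository: force_cpu_args, patch_attr, and StubTensor are hypothetical names, and StubTensor only imitates the .device attribute and .cpu() method that torch tensors provide, so the example runs without a GPU or torch installed.

```python
import functools


class StubTensor:
    """Hypothetical stand-in for a tensor: just a device tag and .cpu()."""

    def __init__(self, device="cuda:0"):
        self.device = device

    def cpu(self):
        return StubTensor("cpu")


def _to_cpu(value):
    # Duck-typed: anything exposing a callable .cpu() is moved, the rest
    # (ints, strings, None, ...) passes through untouched.
    cpu = getattr(value, "cpu", None)
    return cpu() if callable(cpu) else value


def force_cpu_args(func):
    """Wrap func so tensor-like arguments are moved to CPU before the call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        args = tuple(_to_cpu(a) for a in args)
        kwargs = {k: _to_cpu(v) for k, v in kwargs.items()}
        return func(*args, **kwargs)

    return wrapper


def patch_attr(owner, name):
    """Replace owner.<name> with a CPU-forcing wrapper; return the original."""
    original = getattr(owner, name)
    setattr(owner, name, force_cpu_args(original))
    return original
```

Because the wrapped function only ever sees CPU-resident arguments, any device it derives from them is CPU as well. A real patch would likely be more selective about which arguments it moves, since this sketch calls .cpu() on every argument that has one.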