Commit Graph

14 Commits

Author SHA1 Message Date
a300302486 Fix: use hf_ptq.py arg names (--pyt_ckpt_path, --qformat, --inference_tensor_parallel) 2026-05-09 14:57:28 +00:00
1a36a655ea Fix: use full argparse flag names (--calib_size, --kv_cache_qformat) 2026-05-09 14:54:51 +00:00
b2849a8944 Fundamental rewrite: call hf_main() instead of rewriting the pipeline
The previous approach tried to reconstruct hf_ptq's pipeline by importing
individual functions and building a fake argparse.Namespace. This caused
repeated crashes from missing args (KV_QUANT_CFG_CHOICES, dataset,
calib_with_images, etc.).

New approach:
- Call hf_ptq.parse_args() with sys.argv replaced, so every argument gets its default
- Call hf_main(args) — the exact same entry point the shell script uses
- Hook export_quantized to add amax snapshot + state save before export
- No more missing args. No more diverging from the example script.

The only changes from the stock pipeline:
1. Runtime patches (load_calib_amax on CPU, export_amax CPU fallback, clamp instead of assert in get_activation_scaling_factor)
2. Post-calibration hook (snapshot amax to CPU, save state, force amax to CPU)
2026-05-09 14:52:02 +00:00
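
The approach described in b2849a8944 can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual script: `parse_args`, `hf_main`, and `export_quantized` below are local stand-ins named after the functions in the commit message, and their signatures, defaults, and the `patched_argv` / `install_pre_export_hook` helpers are assumptions made for the example.

```python
import argparse
import sys
from contextlib import contextmanager
from functools import wraps


@contextmanager
def patched_argv(argv):
    """Temporarily replace sys.argv so an existing parser sees our flags."""
    saved = sys.argv
    sys.argv = list(argv)
    try:
        yield
    finally:
        sys.argv = saved


# --- Stand-ins for the example script (hypothetical; defaults are made up) ---
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--pyt_ckpt_path", required=True)
    parser.add_argument("--qformat", default="fp8")
    parser.add_argument("--calib_size", type=int, default=512)
    return parser.parse_args()  # reads sys.argv[1:]


calls = []  # records the order in which the hook and the export run


def export_quantized(args, model):
    calls.append(("export", args.qformat))


def hf_main(args):
    model = object()  # placeholder for the calibrated model
    export_quantized(args, model)  # resolved from module globals at call time


# --- Hook: run extra work just before the original export ---
def install_pre_export_hook(namespace, hook):
    """Wrap namespace['export_quantized'] so hook(model) runs first."""
    original = namespace["export_quantized"]

    @wraps(original)
    def wrapper(args, model, *rest, **kwargs):
        hook(model)
        return original(args, model, *rest, **kwargs)

    namespace["export_quantized"] = wrapper


# For an imported module, the namespace would be vars(that_module) (assumption).
install_pre_export_hook(globals(), lambda model: calls.append(("snapshot", None)))

with patched_argv(["quantize_nvfp4.py", "--pyt_ckpt_path", "/path/to/ckpt",
                   "--qformat", "nvfp4"]):
    args = parse_args()  # unspecified flags keep the parser's own defaults

hf_main(args)
```

Because the namespace is built by the real parser, any flag not passed explicitly still carries its default, which is what avoids the missing-attribute crashes of a hand-built `argparse.Namespace`.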
25b4d8da06 Fix: add missing args for make_calib_dataloader (dataset, calib_with_images, auto_quantize, specdec) 2026-05-09 13:37:24 +00:00
6c1bff6997 Clean rewrite: verified all imports against runtime, removed dead code
- get_model/get_tokenizer imported from example_utils (not hf_ptq)
- KV_QUANT_CFG_CHOICES imported from hf_ptq (not mtq)
- Removed dead _FORCE_AMAX_CPU global and global reference in run_export_only
- Fixed stale comments
- All 16 imports and references verified against the actual B200 runtime
- Zero divergences from modelopt example path except get_model()
2026-05-09 09:26:23 +00:00
86dd8df302 Fix: KV_QUANT_CFG_CHOICES is in hf_ptq, not mtq 2026-05-09 09:17:12 +00:00
f9bbef8e91 Fix: patch load_calib_amax instead of amax property setter (can't patch readonly descriptor)
Also remove _FORCE_AMAX_CPU global — load_calib_amax patch handles it.
2026-05-09 08:04:03 +00:00
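
One plausible reading of f9bbef8e91, shown with plain Python rather than modelopt itself: the accessor functions of a `property` object are read-only, so a setter cannot be swapped in place, whereas an ordinary method on the class can be replaced. The `Quantizer` class below is a hypothetical stand-in, and devices are modelled as strings purely for illustration.

```python
class Quantizer:
    """Hypothetical stand-in for a quantizer exposing amax as a property."""

    def __init__(self):
        self._amax = None
        self.device = "cuda"

    @property
    def amax(self):
        return self._amax

    def load_calib_amax(self, value):
        # Stores the calibrated value on the quantizer's current device.
        self._amax = (value, self.device)


# Attempt 1: replace the property's setter function. property.fset is a
# read-only attribute, so this raises AttributeError.
try:
    Quantizer.amax.fset = lambda self, value: None
    patch_error = None
except AttributeError as exc:
    patch_error = exc

# Attempt 2: wrap the method that writes amax. Methods are plain class
# attributes, so this works.
_original_load = Quantizer.load_calib_amax


def load_calib_amax_cpu(self, value):
    _original_load(self, value)
    stored_value, _device = self._amax
    self._amax = (stored_value, "cpu")  # force the stored amax onto CPU


Quantizer.load_calib_amax = load_calib_amax_cpu

q = Quantizer()
q.load_calib_amax(2.5)
```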
94179ed9d0 Fix typo: store_only → store_true 2026-05-09 08:02:09 +00:00
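
For context on the typo fixed in 94179ed9d0: `argparse` rejects an unknown action name when the argument is registered, not when it is parsed. A small sketch, reusing the `--export-only` and `--validate-only` flag names from the other commits (how the real script declares them is an assumption):

```python
import argparse

parser = argparse.ArgumentParser()

# "store_only" is not a registered argparse action, so add_argument fails.
try:
    parser.add_argument("--export-only", action="store_only")
    typo_error = None
except ValueError as exc:
    typo_error = exc

# "store_true" makes the flag a boolean that defaults to False.
parser.add_argument("--export-only", action="store_true")
parser.add_argument("--validate-only", action="store_true")

args = parser.parse_args(["--export-only"])
```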
03c10ab3b6 Fix model loading: use modelopt get_model() instead of raw AutoModelForCausalLM
Raw from_pretrained OOMs during weight conversion — torch.cat on expert
gate_up_proj tries to allocate 31.5GB on a GPU with only 25.9GB free.
modelopt's get_model() handles max_memory/device_map properly for models
that need sequential device mapping.
2026-05-09 08:00:50 +00:00
6eaba26914 Defensive quantization: snapshot amax to CPU immediately after calibration
Key changes:
- snapshot_amax_to_cpu(): copies all quantizer _amax to CPU and saves
  to disk (~50MB) right after mtq.quantize() returns, before any other
  GPU operation can corrupt them
- force_all_amax_to_cpu(): nuclear option, moves _pre_quant_scale and
  _global_amax to CPU too
- _FORCE_AMAX_CPU flag + patched amax setter: after calibration, any
  future amax writes go to CPU instead of GPU
- --validate-only mode to check saved state without running anything
- restore_amax_from_snapshot() for --export-only recovery
- torch.cuda.empty_cache() + gc.collect() between steps
- Patches: export_amax CPU fallback, get_activation_scaling_factor
  clamp instead of assert
2026-05-09 06:31:08 +00:00
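
The snapshot and restore idea from 6eaba26914 can be sketched as below. This is an assumption-laden illustration, not the script's code: the function bodies are guesses based on the names in the commit message, saving to disk is omitted, and `FakeTensor` / `FakeQuantizer` are tiny stand-ins so the example runs without PyTorch or a GPU. With a real model, the `(name, module)` pairs would presumably come from `model.named_modules()`.

```python
class FakeTensor:
    """Minimal stand-in for torch.Tensor (only what the sketch needs)."""

    def __init__(self, value, device="cuda"):
        self.value, self.device = value, device

    def detach(self):
        return self

    def cpu(self):
        return FakeTensor(self.value, "cpu")

    def clone(self):
        return FakeTensor(self.value, self.device)


class FakeQuantizer:
    """Stand-in for a quantizer module holding a calibrated _amax."""

    def __init__(self, amax=None):
        self._amax = amax


def snapshot_amax_to_cpu(named_modules):
    """Copy every non-empty _amax to CPU, keyed by module name."""
    snapshot = {}
    for name, module in named_modules:
        amax = getattr(module, "_amax", None)
        if amax is not None:
            snapshot[name] = amax.detach().cpu().clone()
    return snapshot


def restore_amax_from_snapshot(named_modules, snapshot):
    """Write snapshotted values back; returns how many were restored."""
    restored = 0
    for name, module in named_modules:
        if name in snapshot:
            module._amax = snapshot[name].clone()
            restored += 1
    return restored


modules = [
    ("layers.0.input_quantizer", FakeQuantizer(FakeTensor(3.0))),
    ("layers.0.linear", object()),  # no _amax attribute, so it is skipped
]

snapshot = snapshot_amax_to_cpu(modules)

# Simulate a later GPU-side corruption, then recover from the snapshot.
modules[0][1]._amax = FakeTensor(float("nan"))
restored_count = restore_amax_from_snapshot(modules, snapshot)
```

Taking the snapshot immediately after calibration means a later export failure only costs a retry of the export step, not a rerun of calibration.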
3907838409 Remove ModuleList patch (already fixed in modelopt 0.45), fix numbering 2026-05-09 06:10:18 +00:00
382c1d872f Fix quant_module import path 2026-05-09 06:09:17 +00:00
9291165ba0 Fix imports: QUANT_CFG_CHOICES is in hf_ptq, not modelopt config 2026-05-09 06:08:35 +00:00
a0bacb3cf6 Replace shell wrapper with in-process quantize script
- New scripts/quantize_nvfp4.py: runs full ModelOpt pipeline in-process
- Saves calibrated state after calibration (insurance against export crashes)
- Patches modelopt for V4: ModuleList quantizers, stale GPU tensor safety
- --export-only flag to retry export from saved calibration state
- Removed old model_opt_nvfp4_full.py (shell wrapper)
- Updated README with new pipeline docs and bug #5/#6
2026-05-09 06:07:22 +00:00