Key changes:
- snapshot_amax_to_cpu(): copies all quantizer _amax to CPU and saves to disk (~50MB) right after mtq.quantize() returns, before any other GPU operation can corrupt them
- force_all_amax_to_cpu(): nuclear option, moves _pre_quant_scale and _global_amax to CPU too
- _FORCE_AMAX_CPU flag + patched amax setter: after calibration, any future amax writes go to CPU instead of GPU
- --validate-only mode to check saved state without running anything
- restore_amax_from_snapshot() for --export-only recovery
- torch.cuda.empty_cache() + gc.collect() between steps
- Patches: export_amax CPU fallback, get_activation_scaling_factor clamp instead of assert
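
For reviewers, a rough sketch of the snapshot/restore idea, not the code in this change. It assumes each quantizer module keeps its calibration maximum in a tensor attribute named _amax. ToyQuantizer is a hypothetical stand-in for a real quantizer module, and the function bodies and file layout are illustrative only.

```python
import os
import tempfile

import torch
import torch.nn as nn


class ToyQuantizer(nn.Module):
    """Hypothetical stand-in: holds a calibration max in a buffer named `_amax`."""

    def __init__(self, amax: float):
        super().__init__()
        self.register_buffer("_amax", torch.tensor(amax))


def snapshot_amax_to_cpu(model: nn.Module, path: str) -> dict:
    """Copy every `_amax` tensor to CPU and save the result to `path`."""
    snapshot = {}
    for name, module in model.named_modules():
        amax = getattr(module, "_amax", None)
        if isinstance(amax, torch.Tensor):
            # clone() so the snapshot never aliases the live tensor's storage
            snapshot[name] = amax.detach().cpu().clone()
    torch.save(snapshot, path)
    return snapshot


def restore_amax_from_snapshot(model: nn.Module, path: str) -> int:
    """Write saved values back into matching modules; return how many were restored."""
    snapshot = torch.load(path, map_location="cpu")
    restored = 0
    with torch.no_grad():
        for name, module in model.named_modules():
            amax = getattr(module, "_amax", None)
            if name in snapshot and isinstance(amax, torch.Tensor):
                # copy_ handles the CPU -> device transfer if the module is on GPU
                amax.copy_(snapshot[name])
                restored += 1
    return restored


if __name__ == "__main__":
    model = nn.Sequential(ToyQuantizer(1.5), ToyQuantizer(3.0))
    path = os.path.join(tempfile.mkdtemp(), "amax_snapshot.pt")
    snapshot_amax_to_cpu(model, path)
    with torch.no_grad():
        model[0]._amax.fill_(float("nan"))  # simulate a corrupted value
    print("restored:", restore_amax_from_snapshot(model, path))
```

Cloning on CPU before saving means the snapshot stays valid even if the GPU copies are overwritten later, which is the property the --export-only recovery path relies on.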