Commit Graph

27 Commits

Author SHA1 Message Date
25b4d8da06 Fix: add missing args for make_calib_dataloader (dataset, calib_with_images, auto_quantize, specdec) 2026-05-09 13:37:24 +00:00
6c1bff6997 Clean rewrite: verified all imports against runtime, removed dead code
- get_model/get_tokenizer imported from example_utils (not hf_ptq)
- KV_QUANT_CFG_CHOICES imported from hf_ptq (not mtq)
- Removed dead _FORCE_AMAX_CPU global and global reference in run_export_only
- Fixed stale comments
- All 16 imports and references verified against the actual B200 runtime
- Zero divergences from modelopt example path except get_model()
2026-05-09 09:26:23 +00:00
86dd8df302 Fix: KV_QUANT_CFG_CHOICES is in hf_ptq, not mtq 2026-05-09 09:17:12 +00:00
f9bbef8e91 Fix: patch load_calib_amax instead of amax property setter (can't patch readonly descriptor)
Also remove _FORCE_AMAX_CPU global — load_calib_amax patch handles it.
2026-05-09 08:04:03 +00:00
94179ed9d0 Fix typo: store_only → store_true 2026-05-09 08:02:09 +00:00
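The typo fixed in 94179ed9d0 fails fast, because argparse rejects unknown action names when the argument is registered rather than when it is parsed. The sketch below is a minimal illustration, not the script's real parser: the `--export-only` and `--validate-only` flag names come from the commit messages, while `build_parser` and `misspelled_action_is_rejected` are hypothetical helpers.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the two boolean switches from the log."""
    parser = argparse.ArgumentParser(description="quantize script (sketch)")
    # store_true: the attribute is False when the flag is absent, True when present.
    parser.add_argument("--export-only", action="store_true")
    parser.add_argument("--validate-only", action="store_true")
    return parser


def misspelled_action_is_rejected() -> bool:
    """Return True if argparse refuses the misspelled action name."""
    parser = argparse.ArgumentParser()
    try:
        parser.add_argument("--export-only", action="store_only")
    except ValueError:
        # argparse raises at registration time, before any argv is parsed.
        return True
    return False
```

Calling `build_parser().parse_args(["--export-only"])` yields a namespace whose `export_only` attribute is `True` and whose `validate_only` attribute is `False`.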
03c10ab3b6 Fix model loading: use modelopt get_model() instead of raw AutoModelForCausalLM
Raw from_pretrained OOMs during weight conversion — torch.cat on expert
gate_up_proj tries to allocate 31.5GB on a GPU with only 25.9GB free.
modelopt's get_model() handles max_memory/device_map properly for models
that need sequential device mapping.
2026-05-09 08:00:50 +00:00
6eaba26914 Defensive quantization: snapshot amax to CPU immediately after calibration
Key changes:
- snapshot_amax_to_cpu(): copies all quantizer _amax to CPU and saves
  to disk (~50MB) right after mtq.quantize() returns, before any other
  GPU operation can corrupt them
- force_all_amax_to_cpu(): nuclear option, moves _pre_quant_scale and
  _global_amax to CPU too
- _FORCE_AMAX_CPU flag + patched amax setter: after calibration, any
  future amax writes go to CPU instead of GPU
- --validate-only mode to check saved state without running anything
- restore_amax_from_snapshot() for --export-only recovery
- torch.cuda.empty_cache() + gc.collect() between steps
- Patches: export_amax CPU fallback, get_activation_scaling_factor
  clamp instead of assert
2026-05-09 06:31:08 +00:00
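The snapshot idea in 6eaba26914 can be sketched roughly as follows. This is an assumption-laden illustration, not the script's code: only the `_amax` attribute name and the function name come from the commit message, and the signature (an iterable of `(name, module)` pairs, such as the result of `model.named_modules()`) is a guess. The sketch is duck-typed and does not import torch; writing the snapshot to disk is left out.

```python
def snapshot_amax_to_cpu(named_modules):
    """Collect a CPU copy of every quantizer's _amax, keyed by module name.

    `named_modules` is assumed to yield (name, module) pairs. Modules
    without an `_amax` attribute, or with `_amax` set to None, are skipped.
    """
    snapshot = {}
    for name, module in named_modules:
        amax = getattr(module, "_amax", None)
        if amax is None:
            continue
        # detach() drops autograd history, cpu() moves the data off the GPU,
        # clone() makes sure the copy does not alias the original storage.
        snapshot[name] = amax.detach().cpu().clone()
    return snapshot
```

Taking the copy immediately after calibration means a later GPU fault can no longer invalidate the values, which is the property the commit is after.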
3907838409 Remove ModuleList patch (already fixed in modelopt 0.45), fix numbering 2026-05-09 06:10:18 +00:00
382c1d872f Fix quant_module import path 2026-05-09 06:09:17 +00:00
9291165ba0 Fix imports: QUANT_CFG_CHOICES is in hf_ptq, not modelopt config 2026-05-09 06:08:35 +00:00
a0bacb3cf6 Replace shell wrapper with in-process quantize script
- New scripts/quantize_nvfp4.py: runs full ModelOpt pipeline in-process
- Saves calibrated state after calibration (insurance against export crashes)
- Patches modelopt for V4: ModuleList quantizers, stale GPU tensor safety
- --export-only flag to retry export from saved calibration state
- Removed old model_opt_nvfp4_full.py (shell wrapper)
- Updated README with new pipeline docs and bug #5/#6
2026-05-09 06:07:22 +00:00
f1d21900ea Remove upcast_to_bf16.py — superseded by dequant_fp8_to_bf16.py 2026-05-08 17:13:39 +00:00
eeba101cc4 Cleanup: nuke dead scripts and stale docs, rewrite README for full NVFP4 pipeline 2026-05-08 17:02:07 +00:00
075da675dc fix: update HF token, echo it at runtime, export both HF_TOKEN and HUGGING_FACE_HUB_TOKEN 2026-05-08 16:57:32 +00:00
36e1342270 nvfp4_full: pass HF_TOKEN env var for gated calibration dataset 2026-05-08 13:33:45 +00:00
3d38e1d5cd nvfp4_full: drop calib to 128, gpu_max_mem to 0.7 for VRAM headroom 2026-05-08 06:24:45 +00:00
d0fc5338fe model_opt_nvfp4_full: add use_seq_device_map, fix source for /bin/sh 2026-05-08 05:50:16 +00:00
b70a04696e Add resume capability to dequant script (skip already-done shards)
Verified that our FP4 dequant is byte-identical to the official transformers
MXFP4 implementation: max diff = 0.0 across all values.
2026-05-08 02:58:24 +00:00
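The resume behaviour added in b70a04696e amounts to skipping shards whose output already exists. A minimal sketch follows, assuming one output file per input shard with the same file name. `pending_shards` and `write_shard_atomically` are hypothetical names, and the temp-file rename is an added safeguard that the commit does not mention.

```python
import os
import tempfile


def pending_shards(shard_names, out_dir):
    """Return the shards that still need processing, in their original order."""
    todo = []
    for name in shard_names:
        out_path = os.path.join(out_dir, name)
        # A non-empty output file counts as "already done".
        if os.path.isfile(out_path) and os.path.getsize(out_path) > 0:
            continue
        todo.append(name)
    return todo


def write_shard_atomically(out_dir, name, payload: bytes):
    """Write to a temp file, then rename, so a crash never leaves a partial shard."""
    tmp_path = os.path.join(out_dir, name + ".tmp")
    with open(tmp_path, "wb") as f:
        f.write(payload)
    os.replace(tmp_path, os.path.join(out_dir, name))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        shards = ["shard-1.safetensors", "shard-2.safetensors"]
        write_shard_atomically(out_dir, shards[0], b"done")
        print(pending_shards(shards, out_dir))
```

Without the rename step, an existence check alone would treat a shard that was interrupted mid-write as finished.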
f63eed5cfd Purge INT4 references — expert weights are FP4 (E2M1), not INT4
All docs and scripts updated. Historical memory entries annotated.
2026-05-08 02:33:46 +00:00
f8533197f2 Fix: expert weights are FP4 (E2M1), not INT4 - verified with nibble analysis
The count ratio of nibble value 0 to nibble value 8 is 0.996: in FP4, 0x8 encodes -0.0 and occurs about as often as +0.0, whereas in INT4 it would encode -8 and be rare.
FP4 dequant uses E2M1 LUT lookup × E8M0 scale (MXFP4 microscaling).
Also adds model_opt_nvfp4_full.py for full model NVFP4 quantization.
2026-05-08 02:25:43 +00:00
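The nibble analysis and the LUT-times-scale dequant described in f8533197f2 can be illustrated in plain Python. This is a sketch under stated assumptions, not the repository's script: the function names are hypothetical, the table is the standard E2M1 value set, and an E8M0 byte `e` is read as the power-of-two scale `2**(e - 127)` with `0xFF` reserved for NaN.

```python
from collections import Counter

# E2M1 code -> value. Codes 0-7 are non-negative; codes 8-15 set the sign bit,
# so code 8 is -0.0 rather than an integer -8.
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]


def e8m0_to_float(scale_byte: int) -> float:
    """Decode an E8M0 scale: 8 exponent bits, no mantissa, bias 127."""
    if scale_byte == 0xFF:
        return float("nan")
    return 2.0 ** (scale_byte - 127)


def decode_fp4(code: int, scale_byte: int) -> float:
    """Dequantize one 4-bit code with its block's shared E8M0 scale."""
    return E2M1_LUT[code & 0x0F] * e8m0_to_float(scale_byte)


def nibble_histogram(packed: bytes) -> Counter:
    """Count how often each 4-bit value occurs across both nibbles of every byte."""
    counts = Counter()
    for byte in packed:
        counts[byte & 0x0F] += 1
        counts[byte >> 4] += 1
    return counts


def zero_to_eight_ratio(packed: bytes) -> float:
    """Ratio of nibble 0 to nibble 8; near 1.0 suggests a sign-magnitude format."""
    counts = nibble_histogram(packed)
    return counts[0] / counts[8] if counts[8] else float("inf")
```

A ratio close to 1.0 fits FP4, where +0.0 and -0.0 are equally likely encodings of small weights. A two's-complement INT4 tensor would instead show nibble 8 (the value -8) far less often than nibble 0.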
b5d569218c Add full nvfp4 quantization script + complete dequant script
- model_opt_nvfp4_full.py: Full NVFP4 quantization (not experts-only)
  Uses --gpu_max_mem_percentage 0.9 instead of --use_seq_device_map
- dequant_fp8_to_bf16.py: Now handles INT4-packed experts + FP8 shared
  experts + FP8 attention. Complete dequant to pure BF16.
2026-05-08 01:50:53 +00:00
db6beb5b76 Complete dequant script: handles INT4 experts, FP8 attention, FP8 shared experts
INT4 expert weights are packed 2-per-byte into int8 with float8_e8m0fnu
per-row 32-column block scales. Unpacking: lower nibble first, upper second.
Output dimensions are 2x the stored dimensions (e.g. [3072,3584] → [3072,7168]).

Also adds progress output with ETA per shard so screen sessions stay alive.
2026-05-08 01:39:50 +00:00
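The unpacking order stated in db6beb5b76 (lower nibble first, upper nibble second) is why the column count doubles. A minimal sketch with hypothetical function names, using nested lists in place of tensors; it only separates the 4-bit codes and does not interpret them as numbers.

```python
def unpack_row(packed_row):
    """Split each byte into two 4-bit codes: low nibble first, then high nibble."""
    codes = []
    for byte in packed_row:
        # Masking after the shift also handles signed int8 inputs (e.g. -128),
        # because Python's >> on negative ints is an arithmetic shift.
        codes.append(byte & 0x0F)
        codes.append((byte >> 4) & 0x0F)
    return codes


def unpack_matrix(packed):
    """Unpack a [rows, cols] matrix of bytes into [rows, 2 * cols] 4-bit codes."""
    return [unpack_row(row) for row in packed]


if __name__ == "__main__":
    packed = [[0x21, 0x43], [-128, -1]]  # stored shape [2, 2]
    unpacked = unpack_matrix(packed)     # logical shape [2, 4]
    print(unpacked)
```

The same rule maps the stored `[3072,3584]` shape from the commit message to `[3072,7168]`.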
cbfc5a9afb Update nvfp4_experts_only to use dequantized BF16 model 2026-05-07 16:34:37 +00:00
b5d14aa8b8 Add proper FP8→BF16 dequantization script
Unlike the naive upcast, this properly dequantizes FP8 block-wise weights:
bf16 = fp8_weight * scale_expanded (128x128 blocks).

Also removes the now-unnecessary scale tensors and updates config.
FP8Linear.forward() sees element_size() > 1 and falls back to F.linear().
2026-05-07 15:45:46 +00:00
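The block-wise rule from b5d14aa8b8, `bf16 = fp8_weight * scale_expanded`, means that every weight in a 128x128 tile shares one scale. The sketch below shows that indexing with nested lists of already-decoded floats. `dequant_blockwise` is a hypothetical name, and decoding FP8 bit patterns and casting the result to BF16 are out of scope here.

```python
def dequant_blockwise(weight, scale, block=128):
    """Multiply each weight by the scale of the (block x block) tile it sits in.

    weight: rows x cols matrix of floats (FP8 values already decoded).
    scale:  ceil(rows / block) x ceil(cols / block) matrix of per-tile scales.
    """
    rows, cols = len(weight), len(weight[0])
    # Ceiling division, so partial tiles at the edges still get a scale.
    expected = (-(-rows // block), -(-cols // block))
    if (len(scale), len(scale[0])) != expected:
        raise ValueError(f"scale shape must be {expected}")
    # Indexing scale[i // block][j // block] is equivalent to expanding the
    # scale grid to the full weight shape and multiplying element-wise.
    return [
        [weight[i][j] * scale[i // block][j // block] for j in range(cols)]
        for i in range(rows)
    ]


if __name__ == "__main__":
    w = [[1.0] * 4 for _ in range(3)]
    s = [[2.0, 3.0], [4.0, 5.0]]
    for row in dequant_blockwise(w, s, block=2):
        print(row)
```

The tiny `block=2` call is only for readability; the commit message specifies 128x128 blocks, which is the default here.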
6008cf128d Add model_opt_nvfp4_experts_only.py
Quantizes only MoE expert weights to NVFP4, leaving attention untouched.
Includes comments documenting all available NVFP4 strategies.
Copy to model_opt_nvfp4_<strategy>.py for each new strategy.
2026-05-07 15:16:08 +00:00
7a3b81e833 Add BF16 upcast script and Blackwell DeepGEMM patch
- scripts/upcast_to_bf16.py: Converts mixed-precision V4 Pro to pure BF16
  by upcasting all FP8 tensors (float8_e8m0fnu etc.) to bfloat16.
  Needed because modelopt PTQ calibration crashes on Blackwell with FP8
  kernels (DeepGEMM unsupported, Triton finegrained-fp8 has K mismatches).

- patches/patch_finegrained_fp8_blackwell.py: Patches transformers to
  reject DeepGEMM on SM100+ (Blackwell), letting it fall back to Triton.
  Note: the Triton fallback also fails during modelopt calibration on
  quantized weights, so upcasting to BF16 is the working solution.
2026-05-07 14:25:30 +00:00
ef89ceffbd Add ModelOpt NVFP4 pipeline: patch, run script, README
- Patch fixes iter_weights_for_calibration() for DeepseekV4Experts
  (ModuleList quantizers vs singular)
- Run script uses official NVIDIA hf_ptq.py with FP8 source
- Documents flags to avoid (--low_memory_mode, wrong arg names)
2026-05-07 07:22:54 +00:00