deepseek-v4-quant

Author	SHA1	Message	Date
biondizzle	25b4d8da06	Fix: add missing args for make_calib_dataloader (dataset, calib_with_images, auto_quantize, specdec)	2026-05-09 13:37:24 +00:00
biondizzle	6c1bff6997	Clean rewrite: verified all imports against runtime, removed dead code. Changes: get_model/get_tokenizer imported from example_utils (not hf_ptq); KV_QUANT_CFG_CHOICES imported from hf_ptq (not mtq); removed dead _FORCE_AMAX_CPU global and global reference in run_export_only; fixed stale comments; all 16 imports and references verified against the actual B200 runtime; zero divergences from modelopt example path except get_model().	2026-05-09 09:26:23 +00:00
biondizzle	86dd8df302	Fix: KV_QUANT_CFG_CHOICES is in hf_ptq, not mtq	2026-05-09 09:17:12 +00:00
biondizzle	f9bbef8e91	Fix: patch load_calib_amax instead of amax property setter (can't patch readonly descriptor). Also remove _FORCE_AMAX_CPU global; the load_calib_amax patch handles it.	2026-05-09 08:04:03 +00:00
biondizzle	94179ed9d0	Fix typo: store_only → store_true	2026-05-09 08:02:09 +00:00
biondizzle	03c10ab3b6	Fix model loading: use modelopt get_model() instead of raw AutoModelForCausalLM. Raw from_pretrained OOMs during weight conversion: torch.cat on expert gate_up_proj tries to allocate 31.5GB on a GPU with only 25.9GB free. modelopt's get_model() handles max_memory/device_map properly for models that need sequential device mapping.	2026-05-09 08:00:50 +00:00
biondizzle	6eaba26914	Defensive quantization: snapshot amax to CPU immediately after calibration. Key changes: snapshot_amax_to_cpu() copies all quantizer _amax to CPU and saves to disk (~50MB) right after mtq.quantize() returns, before any other GPU operation can corrupt them; force_all_amax_to_cpu() is the nuclear option and moves _pre_quant_scale and _global_amax to CPU too; _FORCE_AMAX_CPU flag + patched amax setter, so that after calibration any future amax writes go to CPU instead of GPU; --validate-only mode to check saved state without running anything; restore_amax_from_snapshot() for --export-only recovery; torch.cuda.empty_cache() + gc.collect() between steps; patches: export_amax CPU fallback, get_activation_scaling_factor clamp instead of assert.	2026-05-09 06:31:08 +00:00
biondizzle	3907838409	Remove ModuleList patch (already fixed in modelopt 0.45), fix numbering	2026-05-09 06:10:18 +00:00
biondizzle	382c1d872f	Fix quant_module import path	2026-05-09 06:09:17 +00:00
biondizzle	9291165ba0	Fix imports: QUANT_CFG_CHOICES is in hf_ptq, not modelopt config	2026-05-09 06:08:35 +00:00
biondizzle	a0bacb3cf6	Replace shell wrapper with in-process quantize script. Changes: new scripts/quantize_nvfp4.py runs full ModelOpt pipeline in-process; saves calibrated state after calibration (insurance against export crashes); patches modelopt for V4 (ModuleList quantizers, stale GPU tensor safety); --export-only flag to retry export from saved calibration state; removed old model_opt_nvfp4_full.py (shell wrapper); updated README with new pipeline docs and bug #5/#6.	2026-05-09 06:07:22 +00:00
biondizzle	f1d21900ea	Remove upcast_to_bf16.py — superseded by dequant_fp8_to_bf16.py	2026-05-08 17:13:39 +00:00
biondizzle	eeba101cc4	Cleanup: nuke dead scripts and stale docs, rewrite README for full NVFP4 pipeline	2026-05-08 17:02:07 +00:00
biondizzle	075da675dc	fix: update HF token, echo it at runtime, export both HF_TOKEN and HUGGING_FACE_HUB_TOKEN	2026-05-08 16:57:32 +00:00
biondizzle	36e1342270	nvfp4_full: pass HF_TOKEN env var for gated calibration dataset	2026-05-08 13:33:45 +00:00
biondizzle	3d38e1d5cd	nvfp4_full: drop calib to 128, gpu_max_mem to 0.7 for VRAM headroom	2026-05-08 06:24:45 +00:00
biondizzle	d0fc5338fe	model_opt_nvfp4_full: add use_seq_device_map, fix source for /bin/sh	2026-05-08 05:50:16 +00:00
biondizzle	b70a04696e	Add resume capability to dequant script (skip already-done shards). Verified our FP4 dequant is byte-identical to official transformers MXFP4 implementation. Max diff = 0.0 across all values.	2026-05-08 02:58:24 +00:00
biondizzle	f63eed5cfd	Purge INT4 references: expert weights are FP4 (E2M1), not INT4. All docs and scripts updated. Historical memory entries annotated.	2026-05-08 02:33:46 +00:00
biondizzle	f8533197f2	Fix: expert weights are FP4 (E2M1), not INT4 (verified with nibble analysis). Nibble index 0 vs 8 ratio = 0.996 (FP4 -0.0 ≈ +0.0), NOT INT4 where -8 would be rare. FP4 dequant uses E2M1 LUT lookup × E8M0 scale (MXFP4 microscaling). Also adds model_opt_nvfp4_full.py for full model NVFP4 quantization.	2026-05-08 02:25:43 +00:00
biondizzle	b5d569218c	Add full nvfp4 quantization script + complete dequant script. model_opt_nvfp4_full.py: full NVFP4 quantization (not experts-only); uses --gpu_max_mem_percentage 0.9 instead of --use_seq_device_map. dequant_fp8_to_bf16.py: now handles INT4-packed experts + FP8 shared experts + FP8 attention; complete dequant to pure BF16.	2026-05-08 01:50:53 +00:00
biondizzle	db6beb5b76	Complete dequant script: handles INT4 experts, FP8 attention, FP8 shared experts. INT4 expert weights are packed 2-per-byte into int8 with float8_e8m0fnu per-row 32-column block scales. Unpacking: lower nibble first, upper second. Output dimensions are 2x the stored dimensions (e.g. [3072,3584] → [3072,7168]). Also adds progress output with ETA per shard so screen sessions stay alive.	2026-05-08 01:39:50 +00:00
biondizzle	cbfc5a9afb	Update nvfp4_experts_only to use dequantized BF16 model	2026-05-07 16:34:37 +00:00
biondizzle	b5d14aa8b8	Add proper FP8→BF16 dequantization script. Unlike the naive upcast, this properly dequantizes FP8 block-wise weights: bf16 = fp8_weight * scale_expanded (128x128 blocks). Also removes the now-unnecessary scale tensors and updates config. FP8Linear.forward() sees element_size() > 1 and falls back to F.linear().	2026-05-07 15:45:46 +00:00
biondizzle	6008cf128d	Add model_opt_nvfp4_experts_only.py. Quantizes only MoE expert weights to NVFP4, leaving attention untouched. Includes comments documenting all available NVFP4 strategies. Copy to model_opt_nvfp4_<strategy>.py for each new strategy.	2026-05-07 15:16:08 +00:00
biondizzle	7a3b81e833	Add BF16 upcast script and Blackwell DeepGEMM patch. scripts/upcast_to_bf16.py: converts mixed-precision V4 Pro to pure BF16 by upcasting all FP8 tensors (float8_e8m0fnu etc.) to bfloat16. Needed because modelopt PTQ calibration crashes on Blackwell with FP8 kernels (DeepGEMM unsupported, Triton finegrained-fp8 has K mismatches). patches/patch_finegrained_fp8_blackwell.py: patches transformers to reject DeepGEMM on SM100+ (Blackwell), letting it fall back to Triton. Note: the Triton fallback also fails during modelopt calibration on quantized weights, so upcasting to BF16 is the working solution.	2026-05-07 14:25:30 +00:00
biondizzle	ef89ceffbd	Add ModelOpt NVFP4 pipeline: patch, run script, README. Patch fixes iter_weights_for_calibration() for DeepseekV4Experts (ModuleList quantizers vs singular); run script uses official NVIDIA hf_ptq.py with FP8 source; documents flags to avoid (--low_memory_mode, wrong arg names).	2026-05-07 07:22:54 +00:00
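
Commits db6beb5b76 and f8533197f2 describe the expert-weight layout: two 4-bit codes per byte (lower nibble first), decoded through an E2M1 lookup table and multiplied by an E8M0 block scale. The following is a minimal pure-Python sketch of that scheme, not the repository's dequant_fp8_to_bf16.py. The function names are hypothetical, and it assumes one scale byte covers 32 unpacked values in a row, which is one reading of "per-row 32-column block scales".

```python
# E2M1 code -> value: 1 sign bit, 2 exponent bits, 1 mantissa bit.
# Codes 8..15 mirror codes 0..7 with the sign bit set (code 8 is -0.0).
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

BLOCK = 32  # assumption: one E8M0 scale per 32 unpacked values


def e8m0_to_float(e):
    """Decode an E8M0 scale byte: a pure power of two, 2**(e - 127)."""
    if e == 255:  # reserved NaN encoding
        return float("nan")
    return 2.0 ** (e - 127)


def unpack_nibbles(packed):
    """Split each byte into two 4-bit codes, lower nibble first."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append((byte >> 4) & 0x0F)
    return codes


def dequant_row(packed, scales):
    """Dequantize one packed row; the output has 2x as many values as bytes."""
    codes = unpack_nibbles(packed)
    if len(codes) != len(scales) * BLOCK:
        raise ValueError("expected one scale byte per 32 unpacked values")
    return [E2M1_LUT[c] * e8m0_to_float(scales[i // BLOCK])
            for i, c in enumerate(codes)]
```

Because the unpacked row is twice as long as the stored row, a stored shape of [3072,3584] becomes [3072,7168], matching the example in the commit message.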

27 Commits
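
Commit b5d14aa8b8 summarizes block-wise FP8 dequantization as bf16 = fp8_weight * scale_expanded with 128x128 blocks. The sketch below illustrates only the scale-expansion step in pure Python; it is not the repository's script. It assumes the FP8 values have already been decoded to Python floats and that scales[i][j] applies to the tile whose rows start at i*block and whose columns start at j*block. Both function names are hypothetical.

```python
def expand_block_scales(scales, rows, cols, block=128):
    """Repeat each per-tile scale over its block x block tile, cropped to (rows, cols)."""
    return [[scales[r // block][c // block] for c in range(cols)]
            for r in range(rows)]


def dequant_blockwise(weight, scales, block=128):
    """Multiply every weight element by the scale of the tile it falls in."""
    rows = len(weight)
    cols = len(weight[0]) if rows else 0
    expanded = expand_block_scales(scales, rows, cols, block)
    return [[w * s for w, s in zip(w_row, s_row)]
            for w_row, s_row in zip(weight, expanded)]
```

A real implementation would do the same with tensor operations and cast the product to bfloat16; the nested lists here only make the indexing explicit. The cropping matters when a dimension is not a multiple of the block size, since the last tile is then partial.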