Commit Graph

29 Commits

SHA1 Message Date
b70a04696e Add resume capability to dequant script (skip already-done shards)
Verified our FP4 dequant is byte-identical to official transformers
MXFP4 implementation. Max diff = 0.0 across all values.
2026-05-08 02:58:24 +00:00
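The resume behaviour named in this commit could be sketched as below. This is a minimal illustration, not the script's actual code: it assumes one output file per input shard, treats a shard as done once its output file exists, and the names `pending_shards` and `write_shard_atomically` are hypothetical.

```python
import os
from pathlib import Path


def pending_shards(src_dir: Path, dst_dir: Path) -> list[str]:
    """Return shard file names present in src_dir but not yet in dst_dir."""
    done = {p.name for p in dst_dir.glob("*.safetensors")}
    return sorted(
        p.name for p in src_dir.glob("*.safetensors") if p.name not in done
    )


def write_shard_atomically(dst: Path, payload: bytes) -> None:
    """Write to a temporary file, then rename it into place.

    An interrupted run therefore never leaves a half-written shard that
    a later run would mistake for a finished one.
    """
    tmp = dst.with_suffix(dst.suffix + ".tmp")
    tmp.write_bytes(payload)
    os.replace(tmp, dst)  # atomic on the same filesystem
```

The temporary-file rename is a design choice of this sketch: without it, an existence check alone would skip a shard that was only partially written before a crash.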
f63eed5cfd Purge INT4 references — expert weights are FP4 (E2M1), not INT4
All docs and scripts updated. Historical memory entries annotated.
2026-05-08 02:33:46 +00:00
f8533197f2 Fix: expert weights are FP4 (E2M1), not INT4 - verified with nibble analysis
Nibble value 0 vs 8 count ratio = 0.996: in FP4, 0x8 is -0.0 and is about as common as +0.0; in INT4, 0x8 would be -8 and rare.
FP4 dequant uses E2M1 LUT lookup × E8M0 scale (MXFP4 microscaling).
Also adds model_opt_nvfp4_full.py for full model NVFP4 quantization.
2026-05-08 02:25:43 +00:00
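The "E2M1 LUT lookup × E8M0 scale" step and the nibble check from this commit can be illustrated as follows. This is a sketch in plain Python, not the repository's script: the function names are hypothetical, and codes are assumed to be already unpacked to integers in 0..15. The table follows the standard E2M1 encoding (1 sign bit, 2 exponent bits, 1 mantissa bit), and E8M0 is decoded as a power of two with bias 127.

```python
import math
from collections import Counter

# E2M1 code -> value. Codes 8..15 mirror 0..7 with the sign bit set,
# so code 8 is -0.0 rather than a large negative integer.
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]


def e8m0_to_float(scale_byte: int) -> float:
    """Decode an E8M0 scale: 2**(byte - 127), with 0xFF reserved for NaN."""
    if scale_byte == 0xFF:
        return math.nan
    return 2.0 ** (scale_byte - 127)


def dequant_block(codes: list[int], scale_byte: int) -> list[float]:
    """Dequantize one block of 4-bit codes that share a single scale."""
    scale = e8m0_to_float(scale_byte)
    return [E2M1_LUT[c] * scale for c in codes]


def zero_to_eight_ratio(codes: list[int]) -> float:
    """Ratio of code-0 count to code-8 count; near 1.0 suggests FP4."""
    counts = Counter(codes)
    return counts[0] / counts[8] if counts[8] else math.inf
```

For example, `dequant_block([1, 7, 9, 15], 128)` uses a scale of 2.0 and yields `[1.0, 12.0, -1.0, -12.0]`.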
b5d569218c Add full nvfp4 quantization script + complete dequant script
- model_opt_nvfp4_full.py: Full NVFP4 quantization (not experts-only)
  Uses --gpu_max_mem_percentage 0.9 instead of --use_seq_device_map
- dequant_fp8_to_bf16.py: Now handles INT4-packed experts + FP8 shared
  experts + FP8 attention. Complete dequant to pure BF16.
2026-05-08 01:50:53 +00:00
db6beb5b76 Complete dequant script: handles INT4 experts, FP8 attention, FP8 shared experts
INT4 expert weights are packed 2-per-byte into int8 with float8_e8m0fnu
per-row 32-column block scales. Unpacking: lower nibble first, upper second.
Output dimensions are 2x the stored dimensions (e.g. [3072,3584] → [3072,7168]).

Also adds progress output with ETA per shard so screen sessions stay alive.
2026-05-08 01:39:50 +00:00
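The unpacking order described in this commit (lower nibble first, upper second) and the resulting doubling of the column count can be sketched as below. Plain `bytes` stand in for the packed int8 tensor and the function names are hypothetical. The later commits above reclassify these 4-bit values as FP4 (E2M1); the unpacking order itself does not depend on that interpretation.

```python
def unpack_nibbles(row: bytes) -> list[int]:
    """Split each byte into two 4-bit codes: low nibble first, then high.

    Bytes are treated as unsigned here; a signed int8 tensor would need
    to be reinterpreted as uint8 before shifting.
    """
    out = []
    for b in row:
        out.append(b & 0x0F)  # lower nibble -> even output column
        out.append(b >> 4)    # upper nibble -> odd output column
    return out


def unpack_matrix(packed: list[bytes]) -> list[list[int]]:
    """Unpack every row; an [R, C] packed matrix becomes [R, 2*C]."""
    return [unpack_nibbles(row) for row in packed]
```

For example, `unpack_nibbles(bytes([0x21, 0xF0]))` gives `[1, 2, 0, 15]`, and a row of 3584 packed bytes unpacks to 7168 codes, matching the shape change quoted in the commit.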
cbfc5a9afb Update nvfp4_experts_only to use dequantized BF16 model 2026-05-07 16:34:37 +00:00
b5d14aa8b8 Add proper FP8→BF16 dequantization script
Unlike the naive upcast, this properly dequantizes FP8 block-wise weights:
bf16 = fp8_weight * scale_expanded (128x128 blocks).

Also removes the now-unnecessary scale tensors and updates config.
FP8Linear.forward() sees element_size() > 1 and falls back to F.linear().
2026-05-07 15:45:46 +00:00
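The formula `bf16 = fp8_weight * scale_expanded` from this commit can be sketched as below. This is an illustration only: nested lists of Python floats stand in for tensors (the values are assumed to be already decoded from FP8), the function names are hypothetical, and the final cast to BF16 is omitted.

```python
def expand_scale(scale: list[list[float]], block: int,
                 rows: int, cols: int) -> list[list[float]]:
    """Repeat each per-block scale over its block x block tile.

    The result is cropped to (rows, cols), so a weight whose shape is
    not a multiple of the block size still gets a matching scale grid.
    """
    return [[scale[r // block][c // block] for c in range(cols)]
            for r in range(rows)]


def dequant_blockwise(weight: list[list[float]],
                      scale: list[list[float]],
                      block: int = 128) -> list[list[float]]:
    """Multiply every weight element by the scale of its block."""
    rows, cols = len(weight), len(weight[0])
    expanded = expand_scale(scale, block, rows, cols)
    return [[w * s for w, s in zip(w_row, s_row)]
            for w_row, s_row in zip(weight, expanded)]
```

A plain upcast would skip the multiplication and leave every block off by its scale factor, which is the difference this commit draws against the naive upcast.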
6008cf128d Add model_opt_nvfp4_experts_only.py
Quantizes only MoE expert weights to NVFP4, leaving attention untouched.
Includes comments documenting all available NVFP4 strategies.
Copy to model_opt_nvfp4_<strategy>.py for each new strategy.
2026-05-07 15:16:08 +00:00
a7664aee7d Add BF16 upcast script and Blackwell DeepGEMM patch 2026-05-07 14:29:50 +00:00
7a3b81e833 Add BF16 upcast script and Blackwell DeepGEMM patch
- scripts/upcast_to_bf16.py: Converts mixed-precision V4 Pro to pure BF16
  by upcasting all FP8 tensors (float8_e8m0fnu etc.) to bfloat16.
  Needed because modelopt PTQ calibration crashes on Blackwell with FP8
  kernels (DeepGEMM unsupported, Triton finegrained-fp8 has K mismatches).

- patches/patch_finegrained_fp8_blackwell.py: Patches transformers to
  reject DeepGEMM on SM100+ (Blackwell), letting it fall back to Triton.
  Note: the Triton fallback also fails during modelopt calibration on
  quantized weights, so upcasting to BF16 is the working solution.
2026-05-07 14:25:30 +00:00
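The "reject DeepGEMM on SM100+" rule in the patch description amounts to a compute-capability check, which could look roughly like this. The function name `deepgemm_allowed` is hypothetical and this is not the patch's actual code; it assumes the capability arrives as a `(major, minor)` tuple, the form returned by `torch.cuda.get_device_capability()`.

```python
def deepgemm_allowed(capability: tuple[int, int]) -> bool:
    """Return False on SM100 and newer, so the caller picks a fallback.

    SM numbers are major * 10 + minor, e.g. (9, 0) -> SM90, (10, 0) -> SM100.
    """
    major, minor = capability
    return major * 10 + minor < 100
```

As the commit notes, the Triton fallback selected by such a check still failed during modelopt calibration, so the check alone was not sufficient and the BF16 upcast remained necessary.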
ef89ceffbd Add ModelOpt NVFP4 pipeline: patch, run script, README
- Patch fixes iter_weights_for_calibration() for DeepseekV4Experts
  (ModuleList quantizers vs singular)
- Run script uses official NVIDIA hf_ptq.py with FP8 source
- Documents flags to avoid (--low_memory_mode, wrong arg names)
2026-05-07 07:22:54 +00:00
116933dcf6 Fix: skip .cuda() when low_memory_mode; switch default to nvfp4 2026-05-07 03:06:33 +00:00
b8bdd00d19 Lower GPU max_memory to 100GiB, add CPU-only fallback for low_memory_mode 2026-05-07 02:49:24 +00:00
717151b98c Add CPU offloading and max_memory caps for FP8 model loading 2026-05-07 02:40:48 +00:00
aff12c6951 Fix forward_loop: pass as callable, not via create_forward_loop 2026-05-07 02:08:09 +00:00
492e44c0f6 Fix dataloader API: max_sample_length not seq_len, proper create_forward_loop 2026-05-07 02:04:54 +00:00
b32bb2e84d NVIDIA Model Optimizer branch: nvfp4_experts_only PTQ for DeepSeek V4 Pro 2026-05-07 00:11:31 +00:00
c40607053b Fix remaining gate_proj/up_proj -> w1/w3 references in paired_names 2026-05-07 00:05:55 +00:00
771e42cef3 Fix expert pair dict keys: w1/w3 not gate_proj/up_proj 2026-05-07 00:05:25 +00:00
5f35a5d2b3 Gracefully handle missing scale tensors (BF16 weights with stale index entries) 2026-05-07 00:04:29 +00:00
4470653e15 Fix V4 tensor naming: .scale companions, w1/w3 expert pairs, ffn.gate, hc_* preserve 2026-05-07 00:03:20 +00:00
2b7f063e39 7 commit 2026-05-06 23:51:54 +00:00
be16bd023e sixth commit 2026-05-06 23:50:51 +00:00
97e7638abc sixth commit 2026-05-06 23:49:34 +00:00
75503a1190 fifth commit 2026-05-06 23:49:02 +00:00
2eeeefcf8f fourth commit 2026-05-06 23:48:38 +00:00
31a4302ab6 third commit 2026-05-06 23:48:25 +00:00
18ba8e057f second commit 2026-05-06 23:47:38 +00:00
4708cdebb2 init commit 2026-05-06 23:47:07 +00:00