Commit Graph

5 Commits

Author SHA1 Message Date
cbfc5a9afb Update nvfp4_experts_only to use dequantized BF16 model 2026-05-07 16:34:37 +00:00
b5d14aa8b8 Add proper FP8→BF16 dequantization script
Unlike the naive upcast, this properly dequantizes FP8 block-wise weights:
bf16 = fp8_weight * scale_expanded (128x128 blocks).

Also removes the now-unnecessary scale tensors and updates config.
FP8Linear.forward() sees element_size() > 1 and falls back to F.linear().
2026-05-07 15:45:46 +00:00
6008cf128d Add model_opt_nvfp4_experts_only.py
Quantizes only MoE expert weights to NVFP4, leaving attention untouched.
Includes comments documenting all available NVFP4 strategies.
Copy to model_opt_nvfp4_<strategy>.py for each new strategy.
2026-05-07 15:16:08 +00:00
7a3b81e833 Add BF16 upcast script and Blackwell DeepGEMM patch
- scripts/upcast_to_bf16.py: Converts mixed-precision V4 Pro to pure BF16
  by upcasting all FP8 tensors (float8_e8m0fnu etc.) to bfloat16.
  Needed because modelopt PTQ calibration crashes on Blackwell with FP8
  kernels (DeepGEMM unsupported, Triton finegrained-fp8 has K mismatches).

- patches/patch_finegrained_fp8_blackwell.py: Patches transformers to
  reject DeepGEMM on SM100+ (Blackwell), letting it fall back to Triton.
  Note: the Triton fallback also fails during modelopt calibration on
  quantized weights, so upcasting to BF16 is the working solution.
2026-05-07 14:25:30 +00:00
ef89ceffbd Add ModelOpt NVFP4 pipeline: patch, run script, README
- Patch fixes iter_weights_for_calibration() for DeepseekV4Experts
  (ModuleList quantizers vs singular)
- Run script uses official NVIDIA hf_ptq.py with FP8 source
- Documents flags to avoid (--low_memory_mode, wrong arg names)
2026-05-07 07:22:54 +00:00