- New scripts/quantize_nvfp4.py: runs full ModelOpt pipeline in-process
- Saves calibrated state after calibration (insurance against export crashes)
- Patches modelopt for V4: ModuleList quantizers, stale GPU tensor safety
- --export-only flag to retry export from saved calibration state
- Removed old model_opt_nvfp4_full.py (shell wrapper)
- Updated README with new pipeline docs and bug #5/#6
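
A minimal sketch of how a flag like --export-only could gate the pipeline stages. The stage names and the plan_stages helper are hypothetical illustrations, not the actual contents of scripts/quantize_nvfp4.py:

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; only the --export-only flag is modeled here."""
    parser = argparse.ArgumentParser(description="NVFP4 quantization (sketch)")
    parser.add_argument(
        "--export-only",
        action="store_true",
        help="skip calibration and retry export from the saved calibration state",
    )
    return parser.parse_args(argv)


def plan_stages(export_only):
    """Return the (hypothetical) stage names to run, in order."""
    if export_only:
        # Calibration already succeeded earlier; reuse its saved state.
        return ["load_calibrated_state", "export"]
    # Save right after calibration so an export crash does not lose the work.
    return ["calibrate", "save_calibrated_state", "export"]


if __name__ == "__main__":
    args = parse_args()
    print(plan_stages(args.export_only))
```

Saving the state before export means a failed export can be retried without repeating calibration.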
The nibble index 0 vs index 8 frequency ratio is 0.996, which matches FP4 (index 8 is -0.0, about as common as +0.0) and rules out INT4, where index 8 (-8) would be rare.
FP4 dequant: E2M1 LUT value × E8M0 scale (MXFP4 microscaling).
Also adds model_opt_nvfp4_full.py for full model NVFP4 quantization.
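
The dequant rule above can be sketched in pure Python. This assumes the standard E2M1 value table and an E8M0 scale with bias 127; the function names are illustrative and not taken from the scripts:

```python
# E2M1 magnitudes for the low 3 bits of a code; bit 3 is the sign bit.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Codes 0..7 are non-negative, codes 8..15 are their negatives (code 8 is -0.0).
E2M1_LUT = E2M1_MAGNITUDES + [-m for m in E2M1_MAGNITUDES]


def e8m0_to_float(exponent_byte):
    """Decode an E8M0 scale: a pure power of two with bias 127.

    The NaN encoding (0xFF) is not handled in this sketch.
    """
    return 2.0 ** (exponent_byte - 127)


def dequant_block(codes, scale_byte):
    """Dequantize one block of 4-bit codes that share a single E8M0 scale."""
    scale = e8m0_to_float(scale_byte)
    return [E2M1_LUT[code] * scale for code in codes]
```

Because code 8 decodes to -0.0, a weight distribution centered on zero produces codes 0 and 8 at nearly the same rate, which is the signal behind the 0.996 ratio.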
INT4 expert weights are packed two per byte into int8, with float8_e8m0fnu
block scales (one per row per 32-column block). Unpacking: lower nibble first, upper second.
The unpacked last dimension is 2x the stored one (e.g. [3072,3584] → [3072,7168]).
Also adds progress output with ETA per shard so screen sessions stay alive.
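
A small sketch of the unpacking order described above, in pure Python for clarity. The helper name is hypothetical, and the 4-bit codes are returned uninterpreted:

```python
def unpack_nibbles(packed_row):
    """Split each stored byte into two 4-bit codes: lower nibble, then upper."""
    codes = []
    for byte in packed_row:
        byte &= 0xFF  # int8 storage may be negative; view it as unsigned
        codes.append(byte & 0x0F)  # lower nibble comes first
        codes.append(byte >> 4)    # upper nibble comes second
    return codes
```

Each stored row of 3584 bytes yields 7168 codes, which is where the doubled last dimension comes from.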
Unlike the naive upcast, this properly dequantizes FP8 block-wise weights:
bf16 = fp8_weight * scale_expanded (128x128 blocks).
Also removes the now-unnecessary scale tensors and updates config.
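
The block-wise formula above can be illustrated with a pure-Python sketch. It assumes the FP8 values are already decoded to floats and that scales holds one entry per block; the function name is hypothetical, and a tensor implementation would more likely expand the scales with something like torch.repeat_interleave:

```python
def dequant_blockwise(weight, scales, block=128):
    """Multiply each element by the scale of the block x block tile it falls in.

    weight: list of rows (floats already decoded from FP8).
    scales: scales[i][j] applies to rows i*block..(i+1)*block-1 and
            columns j*block..(j+1)*block-1.
    """
    return [
        [value * scales[r // block][c // block] for c, value in enumerate(row)]
        for r, row in enumerate(weight)
    ]


if __name__ == "__main__":
    # Tiny example with 2x2 blocks instead of 128x128.
    ones = [[1.0] * 4 for _ in range(4)]
    for row in dequant_blockwise(ones, [[2.0, 3.0], [4.0, 5.0]], block=2):
        print(row)
```

A plain dtype cast would skip this multiplication and leave every weight off by its block scale, which is why the scale tensors can only be dropped after this step.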
FP8Linear.forward() sees element_size() > 1 and falls back to F.linear().
Quantizes only MoE expert weights to NVFP4, leaving attention untouched.
Includes comments documenting all available NVFP4 strategies.
Copy to model_opt_nvfp4_<strategy>.py for each new strategy.
- scripts/upcast_to_bf16.py: Converts mixed-precision V4 Pro to pure BF16
  by upcasting all FP8 tensors (float8_e8m0fnu etc.) to bfloat16.
  Needed because modelopt PTQ calibration crashes on Blackwell with FP8
  kernels (DeepGEMM unsupported, Triton finegrained-fp8 has K mismatches).
- patches/patch_finegrained_fp8_blackwell.py: Patches transformers to
  reject DeepGEMM on SM100+ (Blackwell), letting it fall back to Triton.
  Note: the Triton fallback also fails during modelopt calibration on
  quantized weights, so upcasting to BF16 is the working solution.
- Patch fixes iter_weights_for_calibration() for DeepseekV4Experts
  (ModuleList quantizers vs singular)
- Run script uses official NVIDIA hf_ptq.py with FP8 source
- Documents flags to avoid (--low_memory_mode, wrong arg names)