deepseek-v4-quant

biondizzle/deepseek-v4-quant

Commit Graph

Author

SHA1

Message

Date

biondizzle

7a3b81e833

Add BF16 upcast script and Blackwell DeepGEMM patch

- scripts/upcast_to_bf16.py: Converts mixed-precision V4 Pro to pure BF16
  by upcasting all FP8 tensors (float8_e8m0fnu etc.) to bfloat16.
  Needed because modelopt PTQ calibration crashes on Blackwell with FP8
  kernels (DeepGEMM unsupported, Triton finegrained-fp8 has K mismatches).

- patches/patch_finegrained_fp8_blackwell.py: Patches transformers to
  reject DeepGEMM on SM100+ (Blackwell), letting it fall back to Triton.
  Note: the Triton fallback also fails during modelopt calibration on
  quantized weights, so upcasting to BF16 is the working solution.
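
To illustrate what upcasting an exponent-only FP8 value involves numerically, here is a minimal pure-Python sketch. It is not the code in scripts/upcast_to_bf16.py; the helper names (decode_e8m0, to_bf16_bits, from_bf16_bits) are hypothetical. It assumes the standard E8M0 encoding (8 exponent bits with a bias of 127, no sign or mantissa, 0xFF reserved for NaN) and checks that every finite E8M0 value survives a round trip through bfloat16 unchanged.

```python
import math
import struct


def decode_e8m0(byte: int) -> float:
    """Decode one E8M0 byte (hypothetical helper): value = 2 ** (byte - 127)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte in [0, 255]")
    if byte == 0xFF:
        return math.nan  # 0xFF is the NaN encoding; E8M0 has no infinities
    return math.ldexp(1.0, byte - 127)


def to_bf16_bits(value: float) -> int:
    """Round a finite float to bfloat16 bits via float32, round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF


def from_bf16_bits(bits16: int) -> float:
    """Expand bfloat16 bits back to a Python float by padding with 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


# Every finite E8M0 value is a power of two between 2**-127 and 2**127,
# so bfloat16 (which also has 8 exponent bits) can hold each one exactly.
lossless = all(
    from_bf16_bits(to_bf16_bits(decode_e8m0(b))) == decode_e8m0(b)
    for b in range(0xFF)
)
```

The exact round trip applies to the exponent-only format only. FP8 formats that carry mantissa bits and a separate scale need the scale applied as well, which this sketch does not cover.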

2026-05-07 14:25:30 +00:00

biondizzle

ef89ceffbd

Add ModelOpt NVFP4 pipeline: patch, run script, README

- Patch fixes iter_weights_for_calibration() for DeepseekV4Experts
  (ModuleList quantizers vs singular)
- Run script uses official NVIDIA hf_ptq.py with FP8 source
- Documents flags to avoid (--low_memory_mode, wrong arg names)
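
The "ModuleList quantizers vs singular" mismatch can be pictured with a small sketch. This is not the patch itself and does not reproduce the ModelOpt signature of iter_weights_for_calibration(); the function name pair_weights_with_quantizers and its arguments are hypothetical. It only shows the general idea of accepting either one shared quantizer or one quantizer per expert, using plain Python sequences in place of torch containers.

```python
from typing import Any, Iterator, Sequence, Tuple


def pair_weights_with_quantizers(
    weights: Sequence[Any], quantizers: Any
) -> Iterator[Tuple[Any, Any]]:
    """Yield (weight, quantizer) pairs for calibration (hypothetical helper).

    `quantizers` is either a single object shared by all weights or a
    list/tuple holding one quantizer per weight. With torch, the container
    check would presumably target torch.nn.ModuleList instead (assumption).
    """
    if isinstance(quantizers, (list, tuple)):
        if len(quantizers) != len(weights):
            raise ValueError(
                f"got {len(quantizers)} quantizers for {len(weights)} weights"
            )
        # Per-expert case: pair each weight with its own quantizer.
        yield from zip(weights, quantizers)
    else:
        # Singular case: the same quantizer is reused for every weight.
        for weight in weights:
            yield weight, quantizers


shared = list(pair_weights_with_quantizers(["w0", "w1"], "q"))
per_expert = list(pair_weights_with_quantizers(["w0", "w1"], ["q0", "q1"]))
```

Raising on a length mismatch is a design choice in this sketch: silently truncating with zip would leave some experts uncalibrated without any signal.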

2026-05-07 07:22:54 +00:00

2 Commits