deepseek-v4-quant

Author	SHA1	Message	Date
biondizzle	cbfc5a9afb	Update nvfp4_experts_only to use dequantized BF16 model	2026-05-07 16:34:37 +00:00
biondizzle	b5d14aa8b8	Add proper FP8→BF16 dequantization script. Unlike the naive upcast, this properly dequantizes FP8 block-wise weights: bf16 = fp8_weight * scale_expanded (128x128 blocks). Also removes the now-unnecessary scale tensors and updates config. FP8Linear.forward() sees element_size() > 1 and falls back to F.linear().	2026-05-07 15:45:46 +00:00
biondizzle	6008cf128d	Add model_opt_nvfp4_experts_only.py. Quantizes only MoE expert weights to NVFP4, leaving attention untouched. Includes comments documenting all available NVFP4 strategies. Copy to model_opt_nvfp4_<strategy>.py for each new strategy.	2026-05-07 15:16:08 +00:00
biondizzle	a7664aee7d	Add BF16 upcast script and Blackwell DeepGEMM patch	2026-05-07 14:29:50 +00:00
biondizzle	7a3b81e833	Add BF16 upcast script and Blackwell DeepGEMM patch - scripts/upcast_to_bf16.py: Converts mixed-precision V4 Pro to pure BF16 by upcasting all FP8 tensors (float8_e8m0fnu etc.) to bfloat16. Needed because modelopt PTQ calibration crashes on Blackwell with FP8 kernels (DeepGEMM unsupported, Triton finegrained-fp8 has K mismatches). - patches/patch_finegrained_fp8_blackwell.py: Patches transformers to reject DeepGEMM on SM100+ (Blackwell), letting it fall back to Triton. Note: the Triton fallback also fails during modelopt calibration on quantized weights, so upcasting to BF16 is the working solution.	2026-05-07 14:25:30 +00:00
biondizzle	ef89ceffbd	Add ModelOpt NVFP4 pipeline: patch, run script, README - Patch fixes iter_weights_for_calibration() for DeepseekV4Experts (ModuleList quantizers vs singular) - Run script uses official NVIDIA hf_ptq.py with FP8 source - Documents flags to avoid (--low_memory_mode, wrong arg names)	2026-05-07 07:22:54 +00:00
biondizzle	116933dcf6	Fix: skip .cuda() when low_memory_mode; switch default to nvfp4	2026-05-07 03:06:33 +00:00
biondizzle	b8bdd00d19	Lower GPU max_memory to 100GiB, add CPU-only fallback for low_memory_mode	2026-05-07 02:49:24 +00:00
biondizzle	717151b98c	Add CPU offloading and max_memory caps for FP8 model loading	2026-05-07 02:40:48 +00:00
biondizzle	aff12c6951	Fix forward_loop: pass as callable, not via create_forward_loop	2026-05-07 02:08:09 +00:00
biondizzle	492e44c0f6	Fix dataloader API: max_sample_length not seq_len, proper create_forward_loop	2026-05-07 02:04:54 +00:00
biondizzle	b32bb2e84d	NVIDIA Model Optimizer branch: nvfp4_experts_only PTQ for DeepSeek V4 Pro	2026-05-07 00:11:31 +00:00
biondizzle	c40607053b	Fix remaining gate_proj/up_proj -> w1/w3 references in paired_names	2026-05-07 00:05:55 +00:00
biondizzle	771e42cef3	Fix expert pair dict keys: w1/w3 not gate_proj/up_proj	2026-05-07 00:05:25 +00:00
biondizzle	5f35a5d2b3	Gracefully handle missing scale tensors (BF16 weights with stale index entries)	2026-05-07 00:04:29 +00:00
biondizzle	4470653e15	Fix V4 tensor naming: .scale companions, w1/w3 expert pairs, ffn.gate, hc_* preserve	2026-05-07 00:03:20 +00:00
biondizzle	2b7f063e39	7 commit	2026-05-06 23:51:54 +00:00
biondizzle	be16bd023e	sixth commit	2026-05-06 23:50:51 +00:00
biondizzle	97e7638abc	sixth commit	2026-05-06 23:49:34 +00:00
biondizzle	75503a1190	fifth commit	2026-05-06 23:49:02 +00:00
biondizzle	2eeeefcf8f	fourth commit	2026-05-06 23:48:38 +00:00
biondizzle	31a4302ab6	third commit	2026-05-06 23:48:25 +00:00
biondizzle	18ba8e057f	second commit	2026-05-06 23:47:38 +00:00
biondizzle	4708cdebb2	init commit	2026-05-06 23:47:07 +00:00

24 Commits
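
Note on commit b5d14aa8b8: the block-wise rule quoted in that message, bf16 = fp8_weight * scale_expanded with 128x128 blocks, can be illustrated with a minimal sketch. This is not the repository's script. The function name dequantize_blockwise is hypothetical, NumPy float32 arrays stand in for both the FP8 input and the BF16 output (NumPy has neither dtype), and the scale layout (one scale per 128x128 tile, row-major, with ragged edge tiles allowed) is an assumption.

```python
import numpy as np

BLOCK = 128  # tile edge length, taken from the commit message (128x128 blocks)


def dequantize_blockwise(weight, scale, block=BLOCK):
    """Multiply every block x block tile of `weight` by its per-tile scale.

    weight: 2-D array of quantized values (float32 stand-in for FP8).
    scale:  2-D array with one entry per tile, shape
            (ceil(rows / block), ceil(cols / block)). Assumed layout.
    """
    rows, cols = weight.shape
    expected = (-(-rows // block), -(-cols // block))  # ceiling division
    if scale.shape != expected:
        raise ValueError(f"scale shape {scale.shape} != expected {expected}")

    # Expand each per-tile scale so it covers its whole tile, then crop the
    # result to the weight shape to handle tiles that hang over the edge.
    scale_expanded = np.repeat(np.repeat(scale, block, axis=0), block, axis=1)
    scale_expanded = scale_expanded[:rows, :cols]

    # Element-wise product: dequantized = quantized * scale_expanded.
    return weight.astype(np.float32) * scale_expanded


if __name__ == "__main__":
    w = np.ones((256, 200), dtype=np.float32)
    s = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
    out = dequantize_blockwise(w, s)
    print(out.shape, out[0, 0], out[0, 199], out[255, 0], out[255, 199])
```

A real script working on safetensors shards would presumably do the same multiplication with torch tensors and cast the result to torch.bfloat16. The point of the sketch is that each scale applies to a whole tile, which is why a plain dtype upcast without the multiplication gives wrong weights.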