deepseek-v4-quant

Author	SHA1	Message	Date
biondizzle	b70a04696e	Add resume capability to dequant script (skip already-done shards). Verified that our FP4 dequant is byte-identical to the official transformers MXFP4 implementation: max diff = 0.0 across all values.	2026-05-08 02:58:24 +00:00
biondizzle	f8533197f2	Fix: expert weights are FP4 (E2M1), not INT4 - verified with nibble analysis. Nibble index 0 vs 8 ratio = 0.996 (FP4 -0.0 ≈ +0.0), NOT INT4, where -8 would be rare. FP4 dequant uses E2M1 LUT lookup × E8M0 scale (MXFP4 microscaling). Also adds model_opt_nvfp4_full.py for full model NVFP4 quantization.	2026-05-08 02:25:43 +00:00
biondizzle	db6beb5b76	Complete dequant script: handles INT4 experts, FP8 attention, FP8 shared experts. INT4 expert weights are packed 2-per-byte into int8 with float8_e8m0fnu per-row 32-column block scales. Unpacking: lower nibble first, upper second. Output dimensions are 2x the stored dimensions (e.g. [3072,3584] → [3072,7168]). Also adds progress output with ETA per shard so screen sessions stay alive.	2026-05-08 01:39:50 +00:00
biondizzle	b5d14aa8b8	Add proper FP8→BF16 dequantization script. Unlike the naive upcast, this properly dequantizes FP8 block-wise weights: bf16 = fp8_weight * scale_expanded (128x128 blocks). Also removes the now-unnecessary scale tensors and updates config. FP8Linear.forward() sees element_size() > 1 and falls back to F.linear().	2026-05-07 15:45:46 +00:00
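
The expert-weight layout described in db6beb5b76 and f8533197f2 (two 4-bit codes per byte, lower nibble first, E2M1 lookup, one E8M0 scale per 32 output columns) can be sketched in plain Python. This is an illustration only, not the repository's script: the names E2M1_LUT, e8m0_to_float and dequant_row are hypothetical, and the sketch assumes the standard MXFP4 value table and the usual E8M0 decoding of 2^(e - 127).

```python
import math

# E2M1 code -> value. Bit 3 is the sign, so codes 8..15 mirror codes 0..7
# (code 8 is -0.0, which is why nibbles 0 and 8 appear about equally often).
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

BLOCK = 32  # output columns that share one scale (per the commit message)


def e8m0_to_float(e):
    """Decode an E8M0 scale byte: an exponent-only value, 2**(e - 127)."""
    if e == 255:
        return math.nan  # 0xFF is reserved for NaN in E8M0
    return 2.0 ** (e - 127)


def dequant_row(packed, scales):
    """Dequantize one row.

    packed: bytes as ints (signed int8 or unsigned both work), two codes each.
    scales: E8M0 bytes as ints in 0..255, one per BLOCK output columns.
    Returns a list of floats twice as long as `packed`.
    """
    values = []
    for b in packed:
        values.append(E2M1_LUT[b & 0x0F])         # lower nibble first
        values.append(E2M1_LUT[(b >> 4) & 0x0F])  # upper nibble second
    if len(values) != len(scales) * BLOCK:
        raise ValueError("expected one scale per 32 output columns")
    return [v * e8m0_to_float(scales[i // BLOCK]) for i, v in enumerate(values)]
```

For example, 16 packed bytes of 0x21 with a single scale byte of 128 decode to 32 values alternating 1.0 and 2.0 (codes 1 and 2, i.e. 0.5 and 1.0, times a scale of 2.0).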
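
The block-wise rule from b5d14aa8b8, bf16 = fp8_weight * scale_expanded, amounts to multiplying each weight by the scale of the block it falls in. The sketch below shows only that indexing, under stated assumptions: weights are already decoded to Python floats, the scale grid has one entry per block (with partial blocks at the edges), and the function name dequant_blockwise is hypothetical. A real script would do this on tensors and cast the result to BF16.

```python
def dequant_blockwise(weight, scales, block=128):
    """Multiply each weight by the scale of its (block x block) tile.

    weight: list of rows of floats (already decoded from FP8).
    scales: list of rows of floats, shape ceil(rows/block) x ceil(cols/block).
    """
    rows, cols = len(weight), len(weight[0])
    need_r = -(-rows // block)  # ceiling division
    need_c = -(-cols // block)
    if len(scales) != need_r or any(len(s) != need_c for s in scales):
        raise ValueError("scale grid does not match weight shape and block size")
    # Indexing scales[r // block][c // block] is the "scale_expanded" step:
    # every element in a tile sees the same scale.
    return [
        [weight[r][c] * scales[r // block][c // block] for c in range(cols)]
        for r in range(rows)
    ]
```

The block size is a parameter so the indexing can be checked on tiny inputs; the commit message gives 128x128 as the size actually used.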
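
The resume behaviour added in b70a04696e (skip already-done shards) can be approximated by comparing input and output directories. How the actual script decides that a shard is done is not stated in the log, so this is a guess at one reasonable approach: a shard counts as done when a file of the same name exists in the output directory, and outputs are written to a temporary name and renamed so an interrupted run never leaves a half-written file that looks finished. The names pending_shards and write_atomically, and the "*.safetensors" pattern, are assumptions.

```python
import os
from pathlib import Path


def pending_shards(src_dir, dst_dir, pattern="*.safetensors"):
    """Return input shards that have no same-named file in dst_dir yet."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    return [p for p in sorted(src_dir.glob(pattern))
            if not (dst_dir / p.name).exists()]


def write_atomically(dst, data):
    """Write bytes to dst via a temporary file and an atomic rename."""
    dst = Path(dst)
    tmp = dst.with_name(dst.name + ".tmp")
    tmp.write_bytes(data)
    os.replace(tmp, dst)  # the final name only appears once the write is complete
```

A driver loop would then iterate over pending_shards(src, dst), dequantize each shard, and save the result with write_atomically, so re-running after an interruption continues from the first unfinished shard.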