deepseek-v4-quant

Author	SHA1	Message	Date
biondizzle	c8564caf9d	fix: patch vLLM deepseek_v4.py directly in image	2026-05-11 06:09:40 +00:00
biondizzle	7c8c6cd67f	fix: add PYTHONPATH for deep_gemm import	2026-05-11 06:06:52 +00:00
biondizzle	cffb373759	fix: symlink NVRTC lib into cuda/lib64 for linker	2026-05-11 06:04:24 +00:00
biondizzle	983ba02c5b	fix: add CUDA/NVRTC lib paths to Dockerfile	2026-05-11 06:02:13 +00:00
biondizzle	f0471ed1c2	fix: correct CR URL to atl.vultrcr.com	2026-05-11 05:59:06 +00:00
biondizzle	c234190a80	feat: add Dockerfile + build/push script for NVFP4 container. Extends dream-build with DeepGEMM nvfp4-mega-moe kernel; build_push.sh builds, logs into Vultr CR, pushes, and updates docker-compose; CACHE_BUSTER parameter forces fresh clones.	2026-05-11 05:57:49 +00:00
biondizzle	e963325b61	WIP: MegaMoE NVFP4 kernel + diagnostics. Force use_mega_moe=True for NVFP4 pipeline; DeepseekV4MegaMoEExperts: load NVFP4 params (float8 block scales, float32 global/input scales), convert NVFP4→BF16→MXFP4 in finalize_weights for the DeepGEMM mega_moe kernel; add _nvfp4_to_bf16 and _bf16_to_mxfp4 conversion methods; remove expert_dtype check blocking mega_moe; add diagnostics for wo_a and bf16 layer conversion. Still WIP: attention layer bugs under investigation.	2026-05-11 05:19:49 +00:00
biondizzle	7e2f219259	fix: banner uses _os instead of os (not yet imported)	2026-05-11 04:57:24 +00:00
biondizzle	cf54b4755a	fix CRITICAL #7: UE8M0 block scale misinterpreted as E4M3. scale_fmt=ue8m0 means weight_scale bytes are E8M0 format (power-of-2 only). A simple .to(float32) misinterprets them as E4M3 (which has mantissa bits), producing completely wrong block scale values and garbled output. Fix: add _ue8m0_to_float32() that reinterprets raw uint8 bits as IEEE 754 exponent field: (raw_byte << 23).view(float32) = 2^(raw-127). Applied to: _dequant_nvfp4_to_bf16 (BF16 layers: fused_wqa_wkv, wq_b, wo_b); _convert_nvfp4_to_fp8 (wo_a FP8 conversion); _reconstruct_compressor_weight (compressor fused_wkv_wgate); BF16->FP4 quantization path (stores as UE8M0, reads back correctly).	2026-05-11 04:37:33 +00:00
biondizzle	7febeaeb71	README: document bugs #5 (input_scale) and #6 (fused_skip_regex), add version banner section, update status	2026-05-11 04:28:38 +00:00
biondizzle	26aaaba4a2	Add version banner to patch: prints commit, arch, bugs fixed at startup. Ensures we can always verify what's running inside the container from the docker logs. No functional changes.	2026-05-11 04:28:10 +00:00
biondizzle	67f9086a26	Fix critical dequantization bug: remove input_scale from weight dequant. input_scale is for ACTIVATIONS, not weights. The correct NVFP4 weight dequantization formula is: weight_bf16 = e2m1_value * block_scale * global_scale. Including input_scale made weights ~5000x too small, causing completely garbled output (multilingual gibberish with repeating patterns).	2026-05-11 02:23:26 +00:00
biondizzle	02b8ea536f	Update MEMORY.md and memory files with vLLM NVFP4 serving progress. Server running on B200 port 8000 with full NVFP4→vLLM bridge. All critical bugs fixed: DeepGEMM scale format, compressor shapes, block scale values.	2026-05-11 02:02:49 +00:00
biondizzle	653e2d7a50	vLLM NVFP4 serving: full end-to-end pipeline working. Bridged the gap between ModelOpt NVFP4 and vLLM DeepSeek V4 attention. Server loads and serves tokens on 8x B200 with TP=8, EP=8. Key changes: wo_a: NVFP4->BF16->FP8 with DeepGEMM block-scale format for BMM einsum; uses deepgemm_post_process_fp8_weight_block for correct scale layout; weight_scale_inv = DeepGEMM-formatted block scale (NOT per-tensor scalar); block scale filled with fp8_scale (NOT all-ones, which causes garbage output). Attention: NVFP4->BF16 dequantization, UnquantizedLinearMethod. Compressor: reconstruct fused_wkv_wgate from separate kv_proj+gate_proj; fixed indexer path: compressor.indexer.kv_proj (was loading main compressor). MoE experts: stay NVFP4, FLASHINFER_TRTLLM FusedMoE backend. Bugs fixed: 1. DeepGEMM sf.dim() assertion: weight_scale_inv must be block-scale tensor; 2. Block scale dtype: float32 (not float8_e4m3fn); 3. Missing deepgemm_post_process args: quant_block_shape, use_e8m0; 4. Compressor indexer shape mismatch: wrong checkpoint key prefix; 5. All-ones block scale: DeepGEMM divides by 1.0 instead of actual scale. Updated README with full technical documentation of all fixes.	2026-05-11 02:01:46 +00:00
biondizzle	db16be8e5d	S11: Fixed substr mapping, stacking, suffix, and o_a_proj - loads weights but attention forward uses FP8 einsum incompatible with NVFP4	2026-05-10 17:45:53 +00:00
biondizzle	6fd03a0aa0	vLLM serving: patched deepseek_v4.py, disabled mega_moe, updated docs. Add patches/deepseek_v4.py: patched vllm source file with modelopt NVFP4 weight name mappings (expert gate_proj→w1, mlp→ffn, self_attn→attn.mla_attn, compressor.kv_proj→wkv, etc.), E2M1 FP4→BF16 unpacking for stacked params, skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear, and resilient loading for unknown params. Update docker-compose.yml: copy patched deepseek_v4.py over original at container startup, remove --moe-backend=deep_gemm_mega_moe (no NVFP4 kernel). Update patches/patch_vllm_weights.py: legacy runtime monkey-patch approach (doesn't work with worker processes), kept for reference. Update README.md: added vLLM serving run history table (S1-S10), documented all open issues (MergedColumnParallelLinear+NVFP4, no mega_moe kernel, resilient loading), added vLLM-specific bug list and key notes. Update scripts/serve_vllm.py: add WARN comment on mega_moe flag.	2026-05-10 16:14:17 +00:00
biondizzle	d88793dee6	Add vllm weight mapper patch and docker-compose	2026-05-10 09:33:48 +00:00
biondizzle	30608e3834	Config patches: document modelopt↔vllm gaps with NVIDIA reference	2026-05-10 08:59:28 +00:00
biondizzle	0d74b97fb2	Config patches doc + compress_ratios runtime patch in serve script	2026-05-10 08:23:11 +00:00
biondizzle	f65d4ab99f	Run 11 SUCCESS: 881GB NVFP4 exported, add vLLM serve script	2026-05-10 07:54:34 +00:00
biondizzle	eb80bd6f80	README + memory: Run 10 result (export crash in get_weight_scaling_factor), Run 11 running. Run 10: calibration succeeded but export crashed in get_weight_scaling_factor (stale GPU weight, not just amax). Patch 4 forces weight to CPU at _export_quantized_weight entry point, covering the entire export chain. Updated Key Lessons with Run 10 analysis; updated Runtime Patches section to document all 8 patches; added Bug #8 (stale GPU weight tensors); updated Do NOT Repeat list.	2026-05-09 23:00:17 +00:00
biondizzle	07cd50e823	8 patches covering full export chain: no more whack-a-mole. Traced the full execution chain from _process_quantized_modules through every function that reads stale GPU tensors: _process_quantized_modules → _export_quantized_weight (Patch 4: force weight to CPU at entry point) → get_weight_scaling_factor (Patch 7: belt-and-suspenders) → get_weights_scaling_factor_from_quantizer (safe: weight now CPU) → NVFP4QTensor.get_weights_scaling_factor (safe: input is CPU) → get_weight_scaling_factor_2 (Patch 8: force quantizer to CPU) → get_activation_scaling_factor (Patch 3: CPU + clamp) → to_quantized_weight (Patch 6: force all tensors to CPU) → weight.to(dtype) (safe: weight is CPU) → _export_fused_experts (Patch 5: force expert weights + quantizer to CPU). Patch 4 is the key: it moves weight to CPU at the earliest possible point, so ALL downstream .to(weight.device) calls resolve to CPU. Patches 5-8 are belt-and-suspenders for alternative code paths.	2026-05-09 22:50:58 +00:00
biondizzle	efc111a11f	Add Patch 4+5: get_weight_scaling_factor and get_weight_scaling_factor_2 CPU safety. Run 10 completed calibration (128/128) but crashed at export in get_weight_scaling_factor: the weight tensor on GPU was stale after 5+ hours of calibration, and weight_scaling_factor_2.to(weight.device) triggered cudaErrorIllegalAddress. Patches 4+5 force weight and quantizer state to CPU before computing scaling factors. This mirrors the same pattern as Patch 3 (get_activation_scaling_factor). Calibrated state saved successfully (721.4 GB, 47,696 amax tensors). Amax snapshot saved (15.4 MB). Re-running with new patches.	2026-05-09 22:43:48 +00:00
biondizzle	ce9056d259	README overhaul: reflect current architecture (hf_main, run history through Run 10). Architecture section: call hf_main() directly, not rewrite the pipeline; Run history: all 10 runs with root causes and fixes; Key lessons: stale GPU tensors, expert OOM, pipeline rewriting trap, __main__ gap; Runtime patches: 3 monkey-patches + 3 post-calibration hook steps; Do NOT repeat: 8 specific mistakes with run references; File layout with legacy patches note.	2026-05-09 16:09:09 +00:00
biondizzle	5a72da7193	Fix: apply hf_ptq __main__ post-parse conversions (dataset split, calib_size int list). When calling hf_main(args) directly, the __main__ block conversions that run between parse_args() and main() are skipped. calib_size stays as string '128' instead of [128], causing TypeError on list concatenation.	2026-05-09 15:58:36 +00:00
biondizzle	8612914169	Update run history: Runs 7-8, Run 9 running on `a300302`	2026-05-09 15:00:23 +00:00
biondizzle	a300302486	Fix: use hf_ptq.py arg names (--pyt_ckpt_path, --qformat, --inference_tensor_parallel)	2026-05-09 14:57:28 +00:00
biondizzle	1a36a655ea	Fix: use full argparse flag names (--calib_size, --kv_cache_qformat)	2026-05-09 14:54:51 +00:00
biondizzle	b2849a8944	Fundamental rewrite: call hf_main() instead of rewriting the pipeline. The previous approach tried to reconstruct hf_ptq's pipeline by importing individual functions and building a fake argparse.Namespace. This caused repeated crashes from missing args (KV_QUANT_CFG_CHOICES, dataset, calib_with_images, etc.). New approach: call hf_ptq.parse_args() with sys.argv replaced (gets ALL defaults); call hf_main(args), the exact same entry point the shell script uses; hook export_quantized to add amax snapshot + state save before export. No more missing args. No more diverging from the example script. The only changes from the stock pipeline: 1. Runtime patches (load_calib_amax CPU, export_amax CPU, clamp); 2. Post-calibration hook (snapshot amax, save state, force CPU).	2026-05-09 14:52:02 +00:00
biondizzle	a70593d886	Update run history: Run 6 (dataloader crash), Run 7 running on `25b4d8d`	2026-05-09 13:40:00 +00:00
biondizzle	25b4d8da06	Fix: add missing args for make_calib_dataloader (dataset, calib_with_images, auto_quantize, specdec)	2026-05-09 13:37:24 +00:00
biondizzle	d1e15178b2	Update run history: Runs 4-5 (import bugs), Run 6 running on `6c1bff6`	2026-05-09 09:29:20 +00:00
biondizzle	6c1bff6997	Clean rewrite: verified all imports against runtime, removed dead code. get_model/get_tokenizer imported from example_utils (not hf_ptq); KV_QUANT_CFG_CHOICES imported from hf_ptq (not mtq); removed dead _FORCE_AMAX_CPU global and global reference in run_export_only; fixed stale comments; all 16 imports and references verified against the actual B200 runtime; zero divergences from modelopt example path except get_model().	2026-05-09 09:26:23 +00:00
biondizzle	86dd8df302	Fix: KV_QUANT_CFG_CHOICES is in hf_ptq, not mtq	2026-05-09 09:17:12 +00:00
biondizzle	99f861f48a	Update README and memory: Run 3 OOM crash, Run 4 running on `f9bbef8`. Added Run 3 to table (model loading OOM, fixed with get_model()); added Run 4 (current, commit `f9bbef8`); added bug #7 (model loading OOM during expert weight concat); added 'do NOT repeat' for AutoModelForCausalLM.from_pretrained; documented all 5 runtime patches; noted only divergence from modelopt example: get_model().	2026-05-09 08:10:04 +00:00
biondizzle	f9bbef8e91	Fix: patch load_calib_amax instead of amax property setter (can't patch readonly descriptor). Also remove _FORCE_AMAX_CPU global; load_calib_amax patch handles it.	2026-05-09 08:04:03 +00:00
biondizzle	94179ed9d0	Fix typo: store_only → store_true	2026-05-09 08:02:09 +00:00
biondizzle	03c10ab3b6	Fix model loading: use modelopt get_model() instead of raw AutoModelForCausalLM. Raw from_pretrained OOMs during weight conversion: torch.cat on expert gate_up_proj tries to allocate 31.5GB on a GPU with only 25.9GB free. modelopt's get_model() handles max_memory/device_map properly for models that need sequential device mapping.	2026-05-09 08:00:50 +00:00
biondizzle	9438af5a8c	Add commit hashes to run history table	2026-05-09 06:47:26 +00:00
biondizzle	d7593fc1dd	Update README: run history table, bug #1 already fixed, cost note, don't-repeat mistakes	2026-05-09 06:44:17 +00:00
biondizzle	6eaba26914	Defensive quantization: snapshot amax to CPU immediately after calibration. Key changes: snapshot_amax_to_cpu(): copies all quantizer _amax to CPU and saves to disk (~50MB) right after mtq.quantize() returns, before any other GPU operation can corrupt them; force_all_amax_to_cpu(): nuclear option, moves _pre_quant_scale and _global_amax to CPU too; _FORCE_AMAX_CPU flag + patched amax setter: after calibration, any future amax writes go to CPU instead of GPU; --validate-only mode to check saved state without running anything; restore_amax_from_snapshot() for --export-only recovery; torch.cuda.empty_cache() + gc.collect() between steps; Patches: export_amax CPU fallback, get_activation_scaling_factor clamp instead of assert.	2026-05-09 06:31:08 +00:00
biondizzle	3907838409	Remove ModuleList patch (already fixed in modelopt 0.45), fix numbering	2026-05-09 06:10:18 +00:00
biondizzle	382c1d872f	Fix quant_module import path	2026-05-09 06:09:17 +00:00
biondizzle	9291165ba0	Fix imports: QUANT_CFG_CHOICES is in hf_ptq, not modelopt config	2026-05-09 06:08:35 +00:00
biondizzle	a0bacb3cf6	Replace shell wrapper with in-process quantize script. New scripts/quantize_nvfp4.py: runs full ModelOpt pipeline in-process; saves calibrated state after calibration (insurance against export crashes); patches modelopt for V4: ModuleList quantizers, stale GPU tensor safety; --export-only flag to retry export from saved calibration state; removed old model_opt_nvfp4_full.py (shell wrapper); updated README with new pipeline docs and bug #5/#6.	2026-05-09 06:07:22 +00:00
biondizzle	04304fdae6	Add export crash fix patches, update README with bug #5 (repr CUDA crash)	2026-05-08 23:28:32 +00:00
biondizzle	50348989b2	Clarify: V4 is NOT BF16, dequantize first	2026-05-08 17:31:35 +00:00
biondizzle	24e3b3745d	Pin modelopt and transformers versions in README	2026-05-08 17:23:10 +00:00
biondizzle	b08afea425	remove weird session dump crap	2026-05-08 17:21:18 +00:00
biondizzle	a2370006f7	Update README: document full pipeline, BF16 verification, calib 128 constraint	2026-05-08 17:17:48 +00:00
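
The two dequantization fixes in the log (commits cf54b4755a and 67f9086a26) can be illustrated with a small standalone sketch. It is not the repository's code: the function names below are hypothetical, it uses only the Python standard library rather than torch, and it assumes the low nibble of each packed byte holds the first FP4 element. It shows the UE8M0 byte being placed into the float32 exponent field, and the weight formula e2m1_value * block_scale * global_scale with no input_scale term.

```python
import struct

# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def ue8m0_to_float32(raw_byte: int) -> float:
    """Reinterpret a UE8M0 byte as a float32 exponent field (bits = raw << 23).

    For 1 <= raw_byte <= 254 this equals 2 ** (raw_byte - 127).
    With this bit trick, 0 maps to 0.0 and 255 maps to +inf.
    """
    if not 0 <= raw_byte <= 255:
        raise ValueError("raw_byte must fit in one byte")
    return struct.unpack("<f", struct.pack("<I", raw_byte << 23))[0]


def e2m1_to_float(code: int) -> float:
    """Decode one 4-bit E2M1 code (sign bit plus 3-bit magnitude index)."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def unpack_fp4(packed: bytes) -> list:
    """Unpack two E2M1 values per byte.

    Assumption for this sketch: the low nibble comes first.
    """
    values = []
    for byte in packed:
        values.append(e2m1_to_float(byte & 0xF))
        values.append(e2m1_to_float(byte >> 4))
    return values


def dequant_block(packed: bytes, block_scale_byte: int, global_scale: float) -> list:
    """Dequantize one block: e2m1_value * block_scale * global_scale.

    input_scale is deliberately absent: it applies to activations, not weights.
    """
    block_scale = ue8m0_to_float32(block_scale_byte)
    return [v * block_scale * global_scale for v in unpack_fp4(packed)]


if __name__ == "__main__":
    print(ue8m0_to_float32(127))                   # 1.0, i.e. 2 ** 0
    print(dequant_block(bytes([0x21]), 128, 0.5))  # [0.5, 1.0]
```

Reading the scale byte through an E4M3 cast instead would treat some of its bits as mantissa, which is the misinterpretation the cf54b4755a message describes.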


86 Commits