deepseek-v4-quant

Author	SHA1	Message	Date
biondizzle	7febeaeb71	README: document bugs #5 (input_scale) and #6 (fused_skip_regex), add version banner section, update status	2026-05-11 04:28:38 +00:00
biondizzle	653e2d7a50	vLLM NVFP4 serving: full end-to-end pipeline working. Bridged the gap between ModelOpt NVFP4 and vLLM DeepSeek V4 attention. Server loads and serves tokens on 8x B200 with TP=8, EP=8. Key changes: - wo_a: NVFP4->BF16->FP8 with DeepGEMM block-scale format for BMM einsum; uses deepgemm_post_process_fp8_weight_block for correct scale layout; weight_scale_inv = DeepGEMM-formatted block scale (NOT per-tensor scalar); block scale filled with fp8_scale (NOT all-ones -- causes garbage output) - Attention: NVFP4->BF16 dequantization, UnquantizedLinearMethod - Compressor: reconstruct fused_wkv_wgate from separate kv_proj+gate_proj; fixed indexer path: compressor.indexer.kv_proj (was loading main compressor) - MoE experts: stay NVFP4, FLASHINFER_TRTLLM FusedMoE backend. Bugs fixed: 1. DeepGEMM sf.dim() assertion: weight_scale_inv must be block-scale tensor 2. Block scale dtype: float32 (not float8_e4m3fn) 3. Missing deepgemm_post_process args: quant_block_shape, use_e8m0 4. Compressor indexer shape mismatch: wrong checkpoint key prefix 5. All-ones block scale: DeepGEMM divides by 1.0 instead of actual scale. Updated README with full technical documentation of all fixes.	2026-05-11 02:01:46 +00:00
biondizzle	6fd03a0aa0	vLLM serving: patched deepseek_v4.py, disabled mega_moe, updated docs - Add patches/deepseek_v4.py: patched vllm source file with modelopt NVFP4 weight name mappings (expert gate_proj→w1, mlp→ffn, self_attn→attn.mla_attn, compressor.kv_proj→wkv, etc.), E2M1 FP4→BF16 unpacking for stacked params, skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear, and resilient loading for unknown params. - Update docker-compose.yml: copy patched deepseek_v4.py over original at container startup, remove --moe-backend=deep_gemm_mega_moe (no NVFP4 kernel). - Update patches/patch_vllm_weights.py: legacy runtime monkey-patch approach (doesn't work with worker processes), kept for reference. - Update README.md: added vLLM serving run history table (S1-S10), documented all open issues (MergedColumnParallelLinear+NVFP4, no mega_moe kernel, resilient loading), added vLLM-specific bug list and key notes. - Update scripts/serve_vllm.py: add WARN comment on mega_moe flag.	2026-05-10 16:14:17 +00:00
biondizzle	30608e3834	Config patches: document modelopt↔vllm gaps with NVIDIA reference	2026-05-10 08:59:28 +00:00
biondizzle	0d74b97fb2	Config patches doc + compress_ratios runtime patch in serve script	2026-05-10 08:23:11 +00:00
biondizzle	f65d4ab99f	Run 11 SUCCESS: 881GB NVFP4 exported, add vLLM serve script	2026-05-10 07:54:34 +00:00
biondizzle	eb80bd6f80	README + memory: Run 10 result (export crash in get_weight_scaling_factor), Run 11 running - Run 10: calibration succeeded but export crashed in get_weight_scaling_factor (stale GPU weight, not just amax). Patch 4 forces weight to CPU at _export_quantized_weight entry point, covering the entire export chain. - Updated Key Lessons with Run 10 analysis - Updated Runtime Patches section to document all 8 patches - Added Bug #8 (stale GPU weight tensors) - Updated Do NOT Repeat list	2026-05-09 23:00:17 +00:00
biondizzle	ce9056d259	README overhaul: reflect current architecture (hf_main, run history through Run 10) - Architecture section: call hf_main() directly, not rewrite the pipeline - Run history: all 10 runs with root causes and fixes - Key lessons: stale GPU tensors, expert OOM, pipeline rewriting trap, __main__ gap - Runtime patches: 3 monkey-patches + 3 post-calibration hook steps - Do NOT repeat: 8 specific mistakes with run references - File layout with legacy patches note	2026-05-09 16:09:09 +00:00
biondizzle	8612914169	Update run history: Runs 7-8, Run 9 running on `a300302`	2026-05-09 15:00:23 +00:00
biondizzle	a70593d886	Update run history: Run 6 (dataloader crash), Run 7 running on `25b4d8d`	2026-05-09 13:40:00 +00:00
biondizzle	d1e15178b2	Update run history: Runs 4-5 (import bugs), Run 6 running on `6c1bff6`	2026-05-09 09:29:20 +00:00
biondizzle	99f861f48a	Update README and memory: Run 3 OOM crash, Run 4 running on `f9bbef8` - Added Run 3 to table (model loading OOM, fixed with get_model()) - Added Run 4 (current, commit `f9bbef8`) - Added bug #7 (model loading OOM during expert weight concat) - Added 'do NOT repeat' for AutoModelForCausalLM.from_pretrained - Documented all 5 runtime patches - Noted only divergence from modelopt example: get_model()	2026-05-09 08:10:04 +00:00
biondizzle	9438af5a8c	Add commit hashes to run history table	2026-05-09 06:47:26 +00:00
biondizzle	d7593fc1dd	Update README: run history table, bug #1 already fixed, cost note, don't-repeat mistakes	2026-05-09 06:44:17 +00:00
biondizzle	a0bacb3cf6	Replace shell wrapper with in-process quantize script - New scripts/quantize_nvfp4.py: runs full ModelOpt pipeline in-process - Saves calibrated state after calibration (insurance against export crashes) - Patches modelopt for V4: ModuleList quantizers, stale GPU tensor safety - --export-only flag to retry export from saved calibration state - Removed old model_opt_nvfp4_full.py (shell wrapper) - Updated README with new pipeline docs and bug #5/#6	2026-05-09 06:07:22 +00:00
biondizzle	04304fdae6	Add export crash fix patches, update README with bug #5 (repr CUDA crash)	2026-05-08 23:28:32 +00:00
biondizzle	50348989b2	Clarify: V4 is NOT BF16, dequantize first	2026-05-08 17:31:35 +00:00
biondizzle	24e3b3745d	Pin modelopt and transformers versions in README	2026-05-08 17:23:10 +00:00
biondizzle	a2370006f7	Update README: document full pipeline, BF16 verification, calib 128 constraint	2026-05-08 17:17:48 +00:00
biondizzle	eeba101cc4	Cleanup: nuke dead scripts and stale docs, rewrite README for full NVFP4 pipeline	2026-05-08 17:02:07 +00:00
biondizzle	b32bb2e84d	NVIDIA Model Optimizer branch: nvfp4_experts_only PTQ for DeepSeek V4 Pro	2026-05-07 00:11:31 +00:00
biondizzle	4708cdebb2	init commit	2026-05-06 23:47:07 +00:00

22 Commits