Commit Graph

22 Commits

Author SHA1 Message Date
7febeaeb71 README: document bugs #5 (input_scale) and #6 (fused_skip_regex), add version banner section, update status 2026-05-11 04:28:38 +00:00
653e2d7a50 vLLM NVFP4 serving: full end-to-end pipeline working
Bridged the gap between ModelOpt NVFP4 and vLLM DeepSeek V4 attention.
Server loads and serves tokens on 8x B200 with TP=8, EP=8.

Key changes:
- wo_a: NVFP4->BF16->FP8 with DeepGEMM block-scale format for BMM einsum
  Uses deepgemm_post_process_fp8_weight_block for correct scale layout
  weight_scale_inv = DeepGEMM-formatted block scale (NOT per-tensor scalar)
  Block scale filled with fp8_scale (NOT all-ones -- causes garbage output)
- Attention: NVFP4->BF16 dequantization, UnquantizedLinearMethod
- Compressor: reconstruct fused_wkv_wgate from separate kv_proj+gate_proj
  Fixed indexer path: compressor.indexer.kv_proj (was loading main compressor)
- MoE experts: stay NVFP4, FLASHINFER_TRTLLM FusedMoE backend

Bugs fixed:
1. DeepGEMM sf.dim() assertion: weight_scale_inv must be block-scale tensor
2. Block scale dtype: float32 (not float8_e4m3fn)
3. Missing deepgemm_post_process args: quant_block_shape, use_e8m0
4. Compressor indexer shape mismatch: wrong checkpoint key prefix
5. All-ones block scale: DeepGEMM divides by 1.0 instead of actual scale

Updated README with full technical documentation of all fixes.
2026-05-11 02:01:46 +00:00
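
Note on bug 5 above (all-ones block scale): the following is a minimal sketch, not the code from this commit. It uses plain Python lists in place of tensors and assumes the "multiply by the block scale to dequantize" convention; the 128x128 default tile size and the helper names `block_scale_grid` and `dequantize` are illustrative assumptions. It shows why a block-scale grid filled with the per-tensor `fp8_scale` recovers the weights, while a grid of ones returns the still-scaled quantized values.

```python
import math

def block_scale_grid(rows, cols, fp8_scale, block=(128, 128)):
    """Return one float scale per tile, every tile set to the per-tensor scale."""
    n_r = math.ceil(rows / block[0])
    n_c = math.ceil(cols / block[1])
    return [[float(fp8_scale)] * n_c for _ in range(n_r)]

def dequantize(q, scales, block=(128, 128)):
    """Multiply each element by the scale of the tile it falls into."""
    return [
        [q[r][c] * scales[r // block[0]][c // block[1]] for c in range(len(q[r]))]
        for r in range(len(q))
    ]

# Toy 2x4 weight with 2x2 tiles so the example stays readable.
weights = [[0.5, -1.0, 2.0, 0.25], [1.5, 0.75, -0.5, 1.0]]
fp8_scale = 0.25
q = [[w / fp8_scale for w in row] for row in weights]  # stand-in for FP8 values

good = dequantize(q, block_scale_grid(2, 4, fp8_scale, block=(2, 2)), block=(2, 2))
bad = dequantize(q, block_scale_grid(2, 4, 1.0, block=(2, 2)), block=(2, 2))
```

Here `good` equals `weights`, while `bad` equals `q`, i.e. every value is off by a factor of `1 / fp8_scale`. If the kernel divides rather than multiplies, the same argument holds with the inverse scale.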
6fd03a0aa0 vLLM serving: patched deepseek_v4.py, disabled mega_moe, updated docs
- Add patches/deepseek_v4.py: patched vllm source file with modelopt NVFP4
  weight name mappings (expert gate_proj→w1, mlp→ffn, self_attn→attn.mla_attn,
  compressor.kv_proj→wkv, etc.), E2M1 FP4→BF16 unpacking for stacked params,
  skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear, and
  resilient loading for unknown params.

- Update docker-compose.yml: copy patched deepseek_v4.py over original at
  container startup, remove --moe-backend=deep_gemm_mega_moe (no NVFP4 kernel).

- Update patches/patch_vllm_weights.py: legacy runtime monkey-patch approach
  (doesn't work with worker processes), kept for reference.

- Update README.md: added vLLM serving run history table (S1-S10), documented
  all open issues (MergedColumnParallelLinear+NVFP4, no mega_moe kernel,
  resilient loading), added vLLM-specific bug list and key notes.

- Update scripts/serve_vllm.py: add WARN comment on mega_moe flag.
2026-05-10 16:14:17 +00:00
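
Note on the E2M1 FP4→BF16 unpacking mentioned above: the sketch below decodes packed 4-bit E2M1 codes (1 sign bit, 2 exponent bits, 1 mantissa bit) into Python floats. It is not the patched `deepseek_v4.py` code. The low-nibble-first packing order is an assumption, the function names are illustrative, and the per-block and global NVFP4 scales that a real dequantization would apply afterwards are left out.

```python
# Magnitudes representable by E2M1, indexed by the low 3 bits of the code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 index the magnitude."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]

def unpack_fp4(packed, low_nibble_first=True):
    """Expand each byte into two decoded values (packing order is an assumption)."""
    out = []
    for byte in packed:
        lo, hi = byte & 0xF, byte >> 4
        pair = (lo, hi) if low_nibble_first else (hi, lo)
        out.extend(decode_e2m1(n) for n in pair)
    return out

values = unpack_fp4(bytes([0x21, 0xF7]))
```

With the assumed order, `0x21` decodes to 0.5 and 1.0, and `0xF7` decodes to 6.0 and -6.0. All E2M1 values are exactly representable in BF16, so the cast itself loses nothing; any error comes from the scales.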
30608e3834 Config patches: document modelopt↔vllm gaps with NVIDIA reference 2026-05-10 08:59:28 +00:00
0d74b97fb2 Config patches doc + compress_ratios runtime patch in serve script 2026-05-10 08:23:11 +00:00
f65d4ab99f Run 11 SUCCESS: 881GB NVFP4 exported, add vLLM serve script 2026-05-10 07:54:34 +00:00
eb80bd6f80 README + memory: Run 10 result (export crash in get_weight_scaling_factor), Run 11 running
- Run 10: calibration succeeded but export crashed in get_weight_scaling_factor
  (stale GPU weight, not just amax). Patch 4 forces weight to CPU at
  _export_quantized_weight entry point, covering the entire export chain.
- Updated Key Lessons with Run 10 analysis
- Updated Runtime Patches section to document all 8 patches
- Added Bug #8 (stale GPU weight tensors)
- Updated Do NOT Repeat list
2026-05-09 23:00:17 +00:00
ce9056d259 README overhaul: reflect current architecture (hf_main, run history through Run 10)
- Architecture section: call hf_main() directly instead of rewriting the pipeline
- Run history: all 10 runs with root causes and fixes
- Key lessons: stale GPU tensors, expert OOM, pipeline rewriting trap, __main__ gap
- Runtime patches: 3 monkey-patches + 3 post-calibration hook steps
- Do NOT repeat: 8 specific mistakes with run references
- File layout with legacy patches note
2026-05-09 16:09:09 +00:00
8612914169 Update run history: Runs 7-8, Run 9 running on a300302 2026-05-09 15:00:23 +00:00
a70593d886 Update run history: Run 6 (dataloader crash), Run 7 running on 25b4d8d 2026-05-09 13:40:00 +00:00
d1e15178b2 Update run history: Runs 4-5 (import bugs), Run 6 running on 6c1bff6 2026-05-09 09:29:20 +00:00
99f861f48a Update README and memory: Run 3 OOM crash, Run 4 running on f9bbef8
- Added Run 3 to table (model loading OOM, fixed with get_model())
- Added Run 4 (current, commit f9bbef8)
- Added bug #7 (model loading OOM during expert weight concat)
- Added 'do NOT repeat' for AutoModelForCausalLM.from_pretrained
- Documented all 5 runtime patches
- Noted only divergence from modelopt example: get_model()
2026-05-09 08:10:04 +00:00
9438af5a8c Add commit hashes to run history table 2026-05-09 06:47:26 +00:00
d7593fc1dd Update README: run history table, bug #1 already fixed, cost note, don't-repeat mistakes 2026-05-09 06:44:17 +00:00
a0bacb3cf6 Replace shell wrapper with in-process quantize script
- New scripts/quantize_nvfp4.py: runs full ModelOpt pipeline in-process
- Saves calibrated state after calibration (insurance against export crashes)
- Patches modelopt for V4: ModuleList quantizers, stale GPU tensor safety
- --export-only flag to retry export from saved calibration state
- Removed old model_opt_nvfp4_full.py (shell wrapper)
- Updated README with new pipeline docs and bug #5/#6
2026-05-09 06:07:22 +00:00
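
Note on the `--export-only` flow described above: a minimal sketch of the save-then-export pattern, assuming a JSON file stands in for the calibrated state. The functions `calibrate`, `export`, and `run` are hypothetical placeholders and do not reflect the contents of scripts/quantize_nvfp4.py or any ModelOpt API.

```python
import argparse
import json
import os
import tempfile

def calibrate():
    # Placeholder for the expensive calibration pass (hypothetical).
    return {"amax": {"layer0": 1.5}}

def export(state):
    # Placeholder for the export step that may crash (hypothetical).
    return sorted(state["amax"])

def run(argv, state_path):
    parser = argparse.ArgumentParser()
    parser.add_argument("--export-only", action="store_true")
    args = parser.parse_args(argv)
    if args.export_only:
        # Retry path: reuse the saved state and skip calibration.
        with open(state_path) as f:
            state = json.load(f)
    else:
        state = calibrate()
        # Persist before exporting so an export crash does not lose calibration.
        with open(state_path, "w") as f:
            json.dump(state, f)
    return export(state)

state_file = os.path.join(tempfile.mkdtemp(), "calib.json")
first = run([], state_file)
retry = run(["--export-only"], state_file)
```

The point of the design is ordering: the state is written to disk before the export starts, so a failed export costs only the export time on the next attempt.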
04304fdae6 Add export crash fix patches, update README with bug #5 (repr CUDA crash) 2026-05-08 23:28:32 +00:00
50348989b2 Clarify: V4 is NOT BF16, dequantize first 2026-05-08 17:31:35 +00:00
24e3b3745d Pin modelopt and transformers versions in README 2026-05-08 17:23:10 +00:00
a2370006f7 Update README: document full pipeline, BF16 verification, calib 128 constraint 2026-05-08 17:17:48 +00:00
eeba101cc4 Cleanup: nuke dead scripts and stale docs, rewrite README for full NVFP4 pipeline 2026-05-08 17:02:07 +00:00
b32bb2e84d NVIDIA Model Optimizer branch: nvfp4_experts_only PTQ for DeepSeek V4 Pro 2026-05-07 00:11:31 +00:00
4708cdebb2 init commit 2026-05-06 23:47:07 +00:00