- Add patches/deepseek_v4.py: patched vLLM source file with ModelOpt NVFP4
weight-name mappings (expert gate_proj→w1, mlp→ffn, self_attn→attn.mla_attn,
compressor.kv_proj→wkv, etc.), E2M1 FP4→BF16 unpacking for stacked params,
skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear, and
resilient loading that tolerates unknown params.
- Update docker-compose.yml: copy patched deepseek_v4.py over original at
container startup, remove --moe-backend=deep_gemm_mega_moe (no NVFP4 kernel).
- Update patches/patch_vllm_weights.py: mark as the legacy runtime monkey-patch
approach (does not work with worker processes); kept for reference only.
- Update README.md: add vLLM serving run history table (S1-S10), document
all open issues (MergedColumnParallelLinear+NVFP4, no mega_moe kernel,
resilient loading), add vLLM-specific bug list and key notes.
- Update scripts/serve_vllm.py: add WARN comment on mega_moe flag.
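
The E2M1 FP4 unpacking mentioned above can be illustrated with a minimal, dependency-free sketch. This is not the code in patches/deepseek_v4.py: the function names are hypothetical, the nibble order (low nibble first) is an assumption, the output is plain Python floats rather than BF16 tensors, and NVFP4 scaling is reduced to a single scalar for brevity.

```python
# E2M1 magnitudes indexed by the low 3 bits (2 exponent bits, 1 mantissa bit).
# Every value here is exactly representable in BF16.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 code; bit 3 is the sign bit."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def unpack_fp4(packed: bytes, scale: float = 1.0) -> list[float]:
    """Unpack two FP4 values per byte.

    ASSUMPTION: the low nibble holds the first element of each pair.
    ASSUMPTION: one scalar scale stands in for the real scale tensors.
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F) * scale)  # first element
        values.append(decode_e2m1(byte >> 4) * scale)    # second element
    return values
```

For example, `unpack_fp4(bytes([0x21]))` decodes the low nibble `0x1` and the high nibble `0x2` of one packed byte into two values.
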
- Run 10: calibration succeeded, but export crashed in get_weight_scaling_factor
because the weight tensor itself was stale on the GPU (not just amax). Patch 4
forces the weight to CPU at the _export_quantized_weight entry point, which
covers the entire export chain.
- Updated Key Lessons with Run 10 analysis
- Updated Runtime Patches section to document all 8 patches
- Added Bug #8 (stale GPU weight tensors)
- Updated Do NOT Repeat list
- Architecture section: call hf_main() directly instead of rewriting the pipeline
- Run history: all 10 runs with root causes and fixes
- Key lessons: stale GPU tensors, expert OOM, pipeline rewriting trap, __main__ gap
- Runtime patches: 3 monkey-patches + 3 post-calibration hook steps
- Do NOT repeat: 8 specific mistakes with run references
- File layout with legacy patches note
- New scripts/quantize_nvfp4.py: runs full ModelOpt pipeline in-process
- Saves calibrated state after calibration (insurance against export crashes)
- Patches modelopt for V4: ModuleList quantizers, stale GPU tensor safety
- --export-only flag to retry export from saved calibration state
- Removed old model_opt_nvfp4_full.py (shell wrapper)
- Updated README with new pipeline docs and bug #5/#6
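
The save-then-retry flow behind --export-only can be sketched roughly as follows. Only the `--export-only` flag comes from this change; the `--state-path` option, its default, the step names, and the helper functions are hypothetical and do not reflect the actual structure of scripts/quantize_nvfp4.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="NVFP4 quantization (sketch)")
    parser.add_argument(
        "--export-only",
        action="store_true",
        help="skip calibration and retry export from saved calibration state",
    )
    # Hypothetical option: where the calibrated state is written and read.
    parser.add_argument("--state-path", default="calibrated_state.pt")
    return parser


def plan_steps(args: argparse.Namespace) -> list[str]:
    """Return the ordered pipeline steps for the given flags."""
    if args.export_only:
        # Calibration already ran; reuse its saved result.
        return ["load_state", "export"]
    # Save right after calibration so an export crash does not lose it.
    return ["calibrate", "save_state", "export"]


if __name__ == "__main__":
    cli_args = build_parser().parse_args()
    for step in plan_steps(cli_args):
        print(f"{step} (state: {cli_args.state_path})")
```

Saving the state before export means a crash in the export chain costs only the export step on the next attempt, not another full calibration run.
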