- Add patches/deepseek_v4.py: patched vLLM source file with modelopt NVFP4
weight name mappings (expert gate_proj→w1, mlp→ffn, self_attn→attn.mla_attn,
compressor.kv_proj→wkv, etc.), E2M1 FP4→BF16 unpacking for stacked params,
skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear, and
resilient loading for unknown params.
- Update docker-compose.yml: copy patched deepseek_v4.py over the original at
container startup; remove --moe-backend=deep_gemm_mega_moe (no NVFP4 kernel).
- Update patches/patch_vllm_weights.py: legacy runtime monkey-patch approach
(doesn't work with worker processes), kept for reference.
- Update README.md: add vLLM serving run history table (S1-S10), document
all open issues (MergedColumnParallelLinear+NVFP4, no mega_moe kernel,
resilient loading), add vLLM-specific bug list and key notes.
- Update scripts/serve_vllm.py: add WARN comment on mega_moe flag.
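
For reviewers unfamiliar with the E2M1 FP4→BF16 unpacking mentioned above, here
is a minimal pure-Python sketch of the decode step. It is not the code in
patches/deepseek_v4.py. The function names, the low-nibble-first packing order,
and the single scalar scale are assumptions made for illustration; real NVFP4
checkpoints carry per-block scale tensors, and the final cast to BF16 is
omitted here (every E2M1 value is exactly representable in BF16).

```python
# E2M1 magnitudes indexed by the low 3 bits of a nibble
# (2 exponent bits followed by 1 mantissa bit); bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def unpack_fp4_bytes(packed: bytes, scale: float = 1.0) -> list:
    """Unpack two FP4 values per byte and apply a scalar scale.

    Assumption: the low nibble holds the first element of each pair.
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F) * scale)  # low nibble first
        values.append(decode_e2m1(byte >> 4) * scale)    # then high nibble
    return values
```

Unpacking doubles the last dimension, so a stacked parameter that stores N
packed bytes per row yields 2N values per row after decoding.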