Commit Graph

3 Commits

Author SHA1 Message Date
b8f95ffad3 docker: add OMP_NUM_THREADS=64, remove --tool initcheck, mount cubin cache 2026-05-12 11:15:06 +00:00
6fd03a0aa0 vLLM serving: patched deepseek_v4.py, disabled mega_moe, updated docs
- Add patches/deepseek_v4.py: patched vllm source file with modelopt NVFP4
  weight name mappings (expert gate_proj→w1, mlp→ffn, self_attn→attn.mla_attn,
  compressor.kv_proj→wkv, etc.), E2M1 FP4→BF16 unpacking for stacked params,
  skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear, and
  resilient loading for unknown params.

- Update docker-compose.yml: copy patched deepseek_v4.py over original at
  container startup, remove --moe-backend=deep_gemm_mega_moe (no NVFP4 kernel).

- Update patches/patch_vllm_weights.py: legacy runtime monkey-patch approach
  (doesn't work with worker processes), kept for reference.

- Update README.md: added vLLM serving run history table (S1-S10), documented
  all open issues (MergedColumnParallelLinear+NVFP4, no mega_moe kernel,
  resilient loading), added vLLM-specific bug list and key notes.

- Update scripts/serve_vllm.py: add WARN comment on mega_moe flag.
2026-05-10 16:14:17 +00:00
d88793dee6 Add vllm weight mapper patch and docker-compose 2026-05-10 09:33:48 +00:00