fix: increase vLLM RPC timeout to 10 min for first-request JIT

The first inference request triggers Triton/TileLang kernel JIT compilation, which takes 2-3 minutes. The default 5-minute RPC timeout kills the engine, so this raises the timeout to 10 minutes (VLLM_RPC_TIMEOUT_MS=600000) to let the first request survive compilation.

This is a workaround. Warming up the kernels during startup would be preferable, but CUDA graphs don't work well with grouped GEMMs and variable expert counts. Follow-up: investigate vLLM's warmup shape configuration.
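
One possible way to keep the JIT cost away from real traffic, until proper startup warm-up is sorted out, is to send a throwaway one-token request right after the container comes up. The sketch below is not part of this change. It assumes the server exposes vLLM's OpenAI-compatible /v1/completions endpoint at a hypothetical http://localhost:8000 and that the model is served under the name "/model" (the positional argument in the compose command). The helper names are made up for illustration, and a single request only compiles kernels for the shapes it happens to hit.

```python
import json
import urllib.request

# 10 minutes in milliseconds, the same value as VLLM_RPC_TIMEOUT_MS=600000.
RPC_TIMEOUT_MS = 10 * 60 * 1000


def build_warmup_request(base_url: str, model: str) -> urllib.request.Request:
    """Build a one-token completion request (hypothetical helper)."""
    payload = {"model": model, "prompt": "warmup", "max_tokens": 1}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(base_url: str, model: str,
            timeout_s: float = RPC_TIMEOUT_MS / 1000) -> int:
    """Send the warm-up request and return the HTTP status code.

    The client-side timeout defaults to the same 10 minutes so the caller
    does not give up before kernel compilation finishes.
    """
    request = build_warmup_request(base_url, model)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status
```

Calling warm_up("http://localhost:8000", "/model") from a post-start script would block until the first completion returns; both arguments are assumptions and need adjusting to the actual deployment.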
2026-05-16 06:02:11 +00:00
parent a569612df5
commit 2e4ff6b8d4
1 changed file with 1 addition and 0 deletions
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -9,6 +9,7 @@ services:
      - OMP_NUM_THREADS=128
      - CUDA_LAUNCH_BLOCKING=0
      - PYTHONUNBUFFERED=1
+      - VLLM_RPC_TIMEOUT_MS=600000
    command:
      - /model
      - --trust-remote-code