fix: increase vLLM RPC timeout to 10 min for first-request JIT

First inference triggers Triton/TileLang kernel JIT compilation (2-3 min).
The default 5-min RPC timeout kills the engine. Bumped to 10 min via
VLLM_RPC_TIMEOUT_MS so the first request survives compilation.

Not ideal: warming up the kernels during startup would be better,
but CUDA graphs don't work well with grouped GEMMs and variable
expert counts. Will investigate vLLM warmup shape config later.
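
One way to move the JIT cost off the first user request is to send a throwaway request right after the server starts. The sketch below is only an illustration, not part of this commit. It assumes the container serves vLLM's OpenAI-compatible `/v1/completions` endpoint, that the served model name is `/model` (the path passed in `command:`), and that the server is reachable at a hypothetical `http://localhost:8000`. A single short prompt may not trigger every shape-specialized kernel, so treat it as a partial warmup at best.

```python
import json
import urllib.request

# Assumptions (not from the commit): address, endpoint, and model name.
BASE_URL = "http://localhost:8000"  # hypothetical server address
MODEL_NAME = "/model"               # assumed served model name
WARMUP_TIMEOUT_S = 600              # client-side wait, same 10 min as the RPC timeout


def build_warmup_request(base_url: str, model: str) -> urllib.request.Request:
    """Build a one-token completion request whose only job is to trigger JIT."""
    payload = {"model": model, "prompt": "warmup", "max_tokens": 1}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(base_url: str = BASE_URL, model: str = MODEL_NAME,
            timeout_s: float = WARMUP_TIMEOUT_S) -> int:
    """Send the warmup request and return the HTTP status code."""
    request = build_warmup_request(base_url, model)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        response.read()  # drain the body so the request fully completes
        return response.status


if __name__ == "__main__":
    print("warmup status:", warm_up())
```

Run once from an entrypoint or healthcheck script after the server reports ready; the long client timeout keeps the script from giving up while compilation is still in progress.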
2026-05-16 06:02:11 +00:00
parent a569612df5
commit 2e4ff6b8d4

@@ -9,6 +9,7 @@ services:
 - OMP_NUM_THREADS=128
 - CUDA_LAUNCH_BLOCKING=0
 - PYTHONUNBUFFERED=1
+- VLLM_RPC_TIMEOUT_MS=600000
 command:
 - /model
 - --trust-remote-code