fix: increase vLLM RPC timeout to 10 min for first-request JIT
The first inference request triggers Triton/TileLang kernel JIT compilation (2-3 min), and the default 5-min RPC timeout kills the engine. Raise the timeout to 10 min (VLLM_RPC_TIMEOUT_MS=600000) so the first request survives compilation.

This is a workaround: warming up the kernels during startup would be preferable, but CUDA graphs don't work well with grouped GEMMs and variable expert counts. Will investigate vLLM's warmup shape config later.
@@ -9,6 +9,7 @@ services:
       - OMP_NUM_THREADS=128
       - CUDA_LAUNCH_BLOCKING=0
       - PYTHONUNBUFFERED=1
+      - VLLM_RPC_TIMEOUT_MS=600000
     command:
       - /model
       - --trust-remote-code
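As a possible follow-up to the startup warmup mentioned in the commit message, here is a minimal sketch of a script that sends one small request right after the server comes up, so JIT compilation happens before real traffic arrives. It assumes the container exposes vLLM's OpenAI-compatible `/v1/completions` endpoint on `http://localhost:8000` and that the served model name is `/model` (taken from the `command:` entry). The helper names `build_warmup_request` and `warm_up` are hypothetical and not part of this repository or of vLLM. A single one-token request may not exercise every kernel shape, so treat it as a starting point rather than a complete warmup.

```python
import json
import time
import urllib.request


def build_warmup_request(base_url, model, prompt="warmup", max_tokens=1):
    """Build a tiny completion request; one token keeps generation cheap."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(base_url, model, timeout_s=600.0, opener=urllib.request.urlopen):
    """Send one request and return how long it took, in seconds.

    timeout_s mirrors the 10-minute budget from this commit (assumption:
    the client should wait at least as long as the server-side timeout).
    opener is injectable so the function can be exercised without a server.
    """
    request = build_warmup_request(base_url, model)
    start = time.monotonic()
    with opener(request, timeout=timeout_s) as response:
        response.read()  # drain the body so the request fully completes
    return time.monotonic() - start


# Example call (needs a running server; URL and model name are assumptions):
#   seconds = warm_up("http://localhost:8000", "/model")
#   print(f"first request took {seconds:.0f}s")
```

Running this once from an entrypoint wrapper or a healthcheck would move the compilation cost out of the first user request, though the longer RPC timeout is still needed for the warmup request itself.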