nvfp4-megamoe-kernel/docker-compose.yml
biondizzle 2e4ff6b8d4 fix: increase vLLM RPC timeout to 10 min for first-request JIT
First inference triggers Triton/TileLang kernel JIT compilation (2-3 min).
The default 5-min RPC timeout kills the engine. Bumped to 10 min via
VLLM_RPC_TIMEOUT_MS so the first request survives compilation.

Not ideal; would prefer to warm up the kernels during startup,
but CUDA graphs don't work well with grouped GEMMs and variable
expert counts. Will investigate vLLM warmup shape config later.
2026-05-16 06:02:11 +00:00


services:
  vllm:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    environment:
      - OMP_NUM_THREADS=128
      - CUDA_LAUNCH_BLOCKING=0
      - PYTHONUNBUFFERED=1
      - VLLM_RPC_TIMEOUT_MS=600000
    command:
      - /model
      - --trust-remote-code
      - --enable-expert-parallel
      - --tensor-parallel-size=8
      - --enforce-eager
      - --tokenizer-mode=deepseek_v4
      - --host=0.0.0.0
      - --port=8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4:/model:ro
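
The commit message notes that warming up the kernels at startup would be preferable to a long timeout. Until that exists, one possible stopgap is to send a single throwaway request once the container is up, so that JIT compilation is paid before real traffic arrives. The sketch below is illustrative and not part of this repository. It assumes the server exposes vLLM's OpenAI-compatible `/v1/completions` endpoint on the published port 8000 and that the served model name defaults to the `/model` path passed in `command`. The helper names `build_warmup_request` and `warm_up` are hypothetical.

```python
import json
import urllib.request

# Mirrors VLLM_RPC_TIMEOUT_MS from the compose file so the client waits
# as long as the engine is allowed to spend on first-request JIT.
RPC_TIMEOUT_MS = 600_000


def build_warmup_request(base_url="http://localhost:8000", model="/model"):
    """Build a minimal one-token completion request (hypothetical helper).

    Assumption: the model name is the path given on the command line,
    because no --served-model-name flag is set in the compose file.
    """
    payload = {"model": model, "prompt": "Hello", "max_tokens": 1}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(base_url="http://localhost:8000", model="/model"):
    """Send the warmup request and return the HTTP status code."""
    req = build_warmup_request(base_url, model)
    # urlopen takes its timeout in seconds, so convert from milliseconds.
    with urllib.request.urlopen(req, timeout=RPC_TIMEOUT_MS / 1000) as resp:
        return resp.status
```

Calling `warm_up()` from the host after `docker compose up` should block until the first compilation finishes. This only moves the JIT cost ahead of user traffic; it does not replace the longer RPC timeout.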