First inference triggers Triton/TileLang kernel JIT compilation (2-3 min).
The default 5-min RPC timeout kills the engine. Bumped to 10 min via
VLLM_RPC_TIMEOUT_MS so the first request survives compilation.
Not ideal — would prefer to warm up the kernels during startup.
But CUDA graphs don't work well with grouped GEMMs and variable
expert counts. Will investigate vLLM warmup shape config later.
Python buffers stdout by default. Docker only sees the buffer dumps,
so all progress bars appear at once when the step completes.
PYTHONUNBUFFERED=1 disables buffering — prints flush immediately.