CUDA graph capture needs extra memory on top of the model weights.
With 175GB model on 178GB GPUs, there's no room.
Going back to --enforce-eager with 10-min RPC timeout. The first
inference request will be slow (2-3 min JIT compilation) but won't
crash. Subsequent requests are fast.
CUDA graph mode requires either more GPU memory or a smaller model.
First inference triggers Triton/TileLang kernel JIT compilation (2-3 min).
The default 5-min RPC timeout kills the engine. Bumped to 10 min via
VLLM_RPC_TIMEOUT_MS so the first request survives compilation.
Not ideal — would prefer to warm up the kernels during startup.
But CUDA graphs don't work well with grouped GEMMs and variable
expert counts. Will investigate vLLM warmup shape config later.
Python buffers stdout by default. Docker only sees the buffer dumps,
so all progress bars appear at once when the step completes.
PYTHONUNBUFFERED=1 disables buffering — prints flush immediately.