diff --git a/docs/configuration/optimization.md b/docs/configuration/optimization.md
index 556d9f8b9..80b12ae33 100644
--- a/docs/configuration/optimization.md
+++ b/docs/configuration/optimization.md
@@ -47,6 +47,10 @@ You can tune the performance by adjusting `max_num_batched_tokens`:
 - For optimal throughput, we recommend setting `max_num_batched_tokens > 8192`, especially for smaller models on large GPUs.
 - If `max_num_batched_tokens` is the same as `max_model_len`, that is almost equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).
 
+!!! warning
+    When chunked prefill is disabled, `max_num_batched_tokens` must be at least `max_model_len`.
+    If `max_num_batched_tokens < max_model_len`, vLLM may crash at server start-up.
+
 ```python
 from vllm import LLM
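
A minimal sketch of the constraint the warning describes, assuming the standard vLLM engine arguments `enable_chunked_prefill`, `max_model_len`, and `max_num_batched_tokens` (the model name is a placeholder):

```python
from vllm import LLM

# With chunked prefill disabled, a prefill cannot be split across
# scheduler steps, so a single batch must fit an entire prompt:
# max_num_batched_tokens must be at least max_model_len.
llm = LLM(
    model="facebook/opt-125m",  # placeholder model
    enable_chunked_prefill=False,
    max_model_len=2048,
    max_num_batched_tokens=2048,  # >= max_model_len, or start-up may fail
)
```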