[DOC]: Add warning about max_num_batched_tokens and max_model_len when chunked prefill is disabled (#33109)
Signed-off-by: Vincent Gimenes <147169146+VincentG1234@users.noreply.github.com>
@@ -47,6 +47,10 @@ You can tune the performance by adjusting `max_num_batched_tokens`:
- For optimal throughput, we recommend setting `max_num_batched_tokens > 8192`, especially for smaller models on large GPUs.
- If `max_num_batched_tokens` is the same as `max_model_len`, that is almost equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).
!!! warning
    When chunked prefill is disabled, `max_num_batched_tokens` must be greater than or equal to `max_model_len`.
    If `max_num_batched_tokens < max_model_len`, vLLM may crash at server startup.
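The rule above can be restated as a small check. This is a hypothetical helper for illustration only, not vLLM's actual validation code:

```python
def check_scheduler_config(
    max_num_batched_tokens: int,
    max_model_len: int,
    enable_chunked_prefill: bool,
) -> None:
    """Illustrative restatement of the rule above (hypothetical helper,
    not vLLM's real validator)."""
    # With chunked prefill disabled, every prompt must fit into a single
    # batch, so the token budget cannot be smaller than the model length.
    if not enable_chunked_prefill and max_num_batched_tokens < max_model_len:
        raise ValueError(
            f"max_num_batched_tokens ({max_num_batched_tokens}) must be >= "
            f"max_model_len ({max_model_len}) when chunked prefill is disabled"
        )

# A compliant configuration passes silently.
check_scheduler_config(4096, 4096, enable_chunked_prefill=False)
```

With chunked prefill enabled, a smaller token budget is fine, since long prompts are split across several scheduling steps.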
A minimal configuration sketch (the model name and sizes are illustrative, not required values):

```python
from vllm import LLM

# With chunked prefill disabled, keep max_num_batched_tokens >= max_model_len.
llm = LLM(
    model="facebook/opt-125m",  # illustrative model
    enable_chunked_prefill=False,
    max_model_len=2048,
    max_num_batched_tokens=2048,
)
```