[DOC]: Add warning about max_num_batched_tokens and max_model_len when chunked prefill is disabled (#33109)

Signed-off-by: Vincent Gimenes <147169146+VincentG1234@users.noreply.github.com>
Vincent Gimenes
2026-01-27 04:05:02 +01:00
committed by GitHub
parent c568581ff3
commit 0b53bec60b


@@ -47,6 +47,10 @@ You can tune the performance by adjusting `max_num_batched_tokens`:
- For optimal throughput, we recommend setting `max_num_batched_tokens > 8192`, especially for smaller models on large GPUs.
- If `max_num_batched_tokens` is the same as `max_model_len`, that is almost equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).
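As a rough intuition for why a larger `max_num_batched_tokens` helps throughput: with chunked prefill, a long prompt is split across scheduler steps into chunks of at most `max_num_batched_tokens` tokens, so a larger budget means fewer prefill steps. A minimal sketch (the helper below is illustrative, not vLLM code):

```python
import math

def prefill_steps(prompt_len: int, max_num_batched_tokens: int) -> int:
    # Illustrative: number of scheduler iterations needed to prefill one
    # prompt when it is split into chunks of at most
    # max_num_batched_tokens tokens each.
    return math.ceil(prompt_len / max_num_batched_tokens)

print(prefill_steps(20000, 2048))   # 10 steps with a small budget
print(prefill_steps(20000, 16384))  # 2 steps with a large budget
```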
!!! warning
    When chunked prefill is disabled, `max_num_batched_tokens` must be at least `max_model_len`. If `max_num_batched_tokens < max_model_len`, vLLM may crash at server startup.
```python
from vllm import LLM

# Illustrative values: with chunked prefill disabled, max_num_batched_tokens
# must be at least max_model_len, or the server may fail to start.
llm = LLM(
    model="facebook/opt-125m",
    enable_chunked_prefill=False,
    max_model_len=2048,
    max_num_batched_tokens=2048,
)
```