[DOC]: Add warning about max_num_batched_tokens and max_model_len when chunked prefill is disabled (#33109)
Signed-off-by: Vincent Gimenes <147169146+VincentG1234@users.noreply.github.com>
@@ -47,6 +47,10 @@ You can tune the performance by adjusting `max_num_batched_tokens`:
- For optimal throughput, we recommend setting `max_num_batched_tokens > 8192`, especially for smaller models on large GPUs.
- If `max_num_batched_tokens` is the same as `max_model_len`, that is almost equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).
!!! warning
    When chunked prefill is disabled, `max_num_batched_tokens` must be greater than or equal to `max_model_len`.
    If `max_num_batched_tokens < max_model_len`, vLLM may crash at server startup.
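The rule above can be restated as a small check. This is a hypothetical helper for illustration only, not vLLM's actual validation code:

```python
def check_scheduler_config(
    max_num_batched_tokens: int,
    max_model_len: int,
    enable_chunked_prefill: bool,
) -> None:
    """Illustrative restatement of the rule above (hypothetical helper,
    not vLLM's real validator)."""
    # With chunked prefill disabled, every prompt must fit into a single
    # batch, so the token budget cannot be smaller than the model length.
    if not enable_chunked_prefill and max_num_batched_tokens < max_model_len:
        raise ValueError(
            f"max_num_batched_tokens ({max_num_batched_tokens}) must be >= "
            f"max_model_len ({max_model_len}) when chunked prefill is disabled"
        )

# A compliant configuration passes silently.
check_scheduler_config(4096, 4096, enable_chunked_prefill=False)
```

With chunked prefill enabled, a smaller token budget is fine, since long prompts are split across several scheduling steps.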
A minimal configuration sketch (the model name and sizes are illustrative, not required values):

```python
from vllm import LLM

# With chunked prefill disabled, keep max_num_batched_tokens >= max_model_len.
llm = LLM(
    model="facebook/opt-125m",  # illustrative model
    enable_chunked_prefill=False,
    max_model_len=2048,
    max_num_batched_tokens=2048,
)
```