[Feature] limit thinking tokens (hard limit) (#20859)

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>
Signed-off-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

@@ -240,6 +240,81 @@ response = client.chat.completions.create(
)
```

## Thinking Budget Control

Some models, such as [Qwen3](https://qwen.readthedocs.io/en/latest/getting_started/quickstart.html#thinking-budget), [DeepSeek](https://www.alibabacloud.com/help/en/model-studio/deep-thinking), and [Nemotron3](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16), support a thinking budget that limits the maximum number of tokens used for reasoning.

Token counting starts from `think_start_str`. Once the reasoning token count reaches the configured `thinking_token_budget`, vLLM forces the model to produce `think_end_str`, effectively terminating the reasoning block.
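
The following sketch illustrates the idea. It is illustrative only, not vLLM's implementation; `stream`, `think_start_ids`, and `think_end_ids` are hypothetical stand-ins for the model's token stream and the tokenized boundary strings:

```python
# Illustrative sketch only -- not vLLM's actual implementation.
# `stream` is a hypothetical iterable of generated token IDs;
# `think_start_ids`/`think_end_ids` are lists of token IDs for the
# tokenized think_start_str/think_end_str boundary strings.
def enforce_thinking_budget(stream, think_start_ids, think_end_ids, budget):
    output: list[int] = []
    in_thinking = False
    used = 0
    for tok in stream:
        output.append(tok)
        if not in_thinking and output[-len(think_start_ids):] == think_start_ids:
            # Counting starts once think_start_str has been emitted.
            in_thinking = True
            used = 0
        elif in_thinking:
            used += 1
            if used >= budget:
                # Budget exhausted: force the think_end_str tokens so the
                # model exits the reasoning block and answers directly.
                output.extend(think_end_ids)
                in_thinking = False
    return output
```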

To use this feature:

- `--reasoning-parser` enables reasoning extraction.
- `--reasoning-config` defines the reasoning boundary tokens (e.g., `think_start_str`, `think_end_str`).
- `thinking_token_budget` (a sampling parameter) sets the per-request reasoning token limit.

If `thinking_token_budget` is not specified, no explicit reasoning limit is applied beyond normal generation constraints such as `max_tokens`.
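
For example, a quick sketch of both cases (`max_tokens` is the usual overall generation cap):

```python
from vllm import SamplingParams

# Reasoning capped at 256 tokens; the full response is still bounded by max_tokens.
capped = SamplingParams(max_tokens=1024, thinking_token_budget=256)

# No thinking_token_budget: reasoning is bounded only by max_tokens.
default = SamplingParams(max_tokens=1024)
```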

`--reasoning-config` accepts a JSON object corresponding to
[ReasoningConfig][vllm.config.ReasoningConfig] with the following fields:

| Field             | Type          | Description                                      |
|-------------------|---------------|--------------------------------------------------|
| `think_start_str` | `str \| null` | String that marks the start of reasoning content |
| `think_end_str`   | `str \| null` | String that marks the end of reasoning content   |

!!! note
    `think_end_str` can include a transition phrase before the think end token. For example, setting `think_end_str` to `"I have to give the solution based on the thinking directly now.</think>"` instructs the model to emit that phrase when the budget is exhausted, making the reasoning termination more natural.

### Online Serving

```bash
vllm serve Qwen/Qwen3-0.6B \
    --reasoning-parser qwen3 \
    --reasoning-config '{"think_start_str": "<think>", "think_end_str": "I have to give the solution based on the thinking directly now.</think>"}'
```

Then make a request with `thinking_token_budget` to limit the reasoning tokens. Note that the parameter goes at the top level of the request body; the `extra_body` wrapper is an `openai` SDK convention, not part of the HTTP API:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [
            {"role": "user", "content": "9.11 and 9.8, which is greater?"}
        ],
        "thinking_token_budget": 10
    }'
```
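
Equivalently, with the `openai` Python client the parameter is passed through `extra_body`, which the SDK merges into the request body before sending:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "9.11 and 9.8, which is greater?"}],
    # The SDK merges extra_body into the top-level JSON request body.
    extra_body={"thinking_token_budget": 10},
)
print(response.choices[0].message.content)
```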

### Offline Inference

```python
from vllm import LLM, SamplingParams
from vllm.config import ReasoningConfig

llm = LLM(
    model="Qwen/Qwen3-0.6B",
    reasoning_config=ReasoningConfig(
        think_start_str="<think>",
        think_end_str="I have to give the solution based on the thinking directly now.</think>",
    ),
)

sampling_params = SamplingParams(thinking_token_budget=10)

messages = [
    {"role": "user", "content": "9.11 and 9.8, which is greater?"},
]

outputs = llm.chat(messages, sampling_params=sampling_params)
for output in outputs:
    print("text:", output.outputs[0].text)
```
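
As a quick, model-dependent check that the budget was applied, you can look for the forced end string in the output:

```python
# Model-dependent sanity check: the forced think_end_str (ending in "</think>")
# should appear in the text once the reasoning budget is exhausted.
final_text = outputs[0].outputs[0].text
print("budget enforced:", "</think>" in final_text)
```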

## Limitations

- Extracted reasoning content is only available through the online serving chat completions endpoint (`/v1/chat/completions`).