Fix is_reasoning_end to handle multi-turn prompt tokens

Instead of always returning False (which broke tool-call streaming),
use a heuristic: if think-end appears in the token IDs but is
followed by more than 3 tokens (chat-template wrapping such as
<|im_end|>, user markers, etc.), it comes from a prior turn's prompt
and reasoning has not started in the current generation, so return
False. If think-end is the last token or within a few tokens of the
end, it comes from generated tokens and reasoning has ended, so
return True.
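The heuristic can be sketched as a minimal standalone function (the token IDs and the `max_trailing` threshold below are illustrative stand-ins, not the parser's real values, which come from the tokenizer):

```python
from typing import Sequence

# Illustrative token IDs; the real parser resolves these from the tokenizer.
THINK_START, THINK_END = 1001, 1002

def reasoning_ended(input_ids: Sequence[int], max_trailing: int = 3) -> bool:
    """Sketch of the heuristic: scan backward from the end of the sequence."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] == THINK_START:
            # think-start with no later think-end: reasoning is in progress.
            return False
        if input_ids[i] == THINK_END:
            # think-end found. If more than max_trailing tokens follow it,
            # they are chat-template wrapping from a prior turn's prompt;
            # otherwise the think-end was just generated.
            return (len(input_ids) - 1 - i) <= max_trailing
    return False  # no think tokens at all
```

A prompt ending with a prior turn's think-end plus template wrapping (end-of-message token, user markers, the next assistant header) leaves well over three trailing tokens, so the sketch returns False; a freshly generated think-end sits at or near the end and returns True.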
2026-04-14 08:39:18 +00:00
parent f5266646eb
commit 778f1bfe66

@@ -144,29 +144,89 @@ class KimiK2ReasoningParser(ReasoningParser):
     # ------------------------------------------------------------------
     def is_reasoning_end(self, input_ids: Sequence[int]) -> bool:
-        """Check if reasoning has ended.
-
-        IMPORTANT: Always returns False for this parser. The reasoning
-        state is tracked internally by ``_reasoning_ended``, which is
-        updated only when ``extract_reasoning_streaming`` detects
-        think-end or a tool-section marker in the model's *generated*
-        text.
-
-        The vLLM serving layer calls this with ``prompt_token_ids`` to
-        pre-compute whether reasoning has ended. On multi-turn
-        conversations, the prompt contains think-end tokens from prior
-        assistant messages, which would cause a false positive — the
-        serving layer would skip ``extract_reasoning_streaming`` entirely
-        and route all thinking text to content.
-
-        Returning False ensures the serving layer always calls
-        ``extract_reasoning_streaming``, which correctly handles the
-        transition using generated text only.
-        """
+        """Check if reasoning has ended based on the token ID sequence.
+
+        Scans backward to find the last think-start or think-end token.
+        Returns True only if the last relevant token is a think-end or
+        a tool-section-start, AND there is no think-start after it.
+
+        CRITICAL: When called with ``prompt_token_ids`` (as the vLLM
+        serving layer does), the input contains the full chat history.
+        On multi-turn conversations, the prompt ends with tokens from
+        the prior assistant message, which may include think-end.
+        However, this think-end belongs to the PRIOR generation — the
+        new generation will start its own reasoning with think-start.
+
+        To handle this correctly, we check whether the input ends with
+        a complete reasoning block (think-start ... think-end). If the
+        last think token is think-end AND it's followed by non-reasoning
+        tokens (like tool_call tokens or end-of-sequence), we return
+        True. But if the input is just the prompt with no generated
+        tokens yet, we return False because the new generation hasn't
+        started reasoning yet.
+
+        The key insight: in the chat template for multi-turn, after the
+        last assistant message's think-end, the template adds
+        <|im_end|> followed by new user/assistant markers. The
+        assistant generation prompt ends with <|im_assistant|> and
+        <|im_middle|> — no think tokens. So if we scan backward and
+        find think-end but then find prompt-end tokens (not think-start)
+        after it, we know reasoning ended in a PRIOR turn, not the
+        current one. We return False to let the new generation start
+        fresh.
+        """
         if self._identity_parser is not None:
             return self._identity_parser.is_reasoning_end(input_ids)
-        return False
+        # Scan backward to find the last think-start, think-end, or
+        # tool-section-start token.
+        last_start = -1
+        last_end = -1
+        last_tool_section = -1
+        for i in range(len(input_ids) - 1, -1, -1):
+            if input_ids[i] == self._start_token_id and last_start == -1:
+                last_start = i
+            if input_ids[i] == self._end_token_id and last_end == -1:
+                last_end = i
+            if (input_ids[i] in self._tool_section_start_token_ids
+                    and last_tool_section == -1):
+                last_tool_section = i
+            # Stop early if we found think-start — it's the boundary
+            if last_start != -1:
+                break
+        # No think tokens at all — not a reasoning model output
+        if last_start == -1 and last_end == -1 and last_tool_section == -1:
+            return False
+        # think-start is the last relevant token — reasoning is in progress
+        if last_start != -1 and (last_end == -1 or last_start > last_end):
+            return False
+        # think-end or tool-section is the last relevant token.
+        # This could be from the prompt (prior turn) or from generated
+        # tokens. For prompt tokens on multi-turn, the think-end is
+        # from a prior assistant message and the new generation hasn't
+        # started yet — we should return False.
+        #
+        # Heuristic: if think-end appears but is followed by more tokens
+        # (like <|im_end|>, user markers, etc.), it's from the prompt
+        # and reasoning hasn't started in the current generation yet.
+        # Return False.
+        #
+        # If think-end is the very last token or near the end, it's
+        # from generated tokens and reasoning has ended. Return True.
+        last_relevant = max(last_end, last_tool_section)
+        tokens_after = len(input_ids) - 1 - last_relevant
+        # If there are more than a few tokens after the last think-end,
+        # those are prompt tokens (chat template wrapping), meaning
+        # the think-end is from a prior turn. Return False.
+        if tokens_after > 3:
+            return False
+        return True
+
     def is_reasoning_end_streaming(
         self, input_ids: Sequence[int], delta_ids: Iterable[int]
     ) -> bool:
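The trailing-token distinction the diff relies on can be illustrated end to end. The special-token names follow the chat template described in the docstring; the numeric IDs and the `tokens_after_last` helper are hypothetical, for illustration only:

```python
# Hypothetical IDs for the special tokens named in the docstring.
THINK_START, THINK_END = 1001, 1002
IM_END, IM_USER, IM_ASSISTANT, IM_MIDDLE = 2001, 2002, 2003, 2004

def tokens_after_last(ids: list[int], tok: int) -> int:
    """Count how many tokens follow the last occurrence of ``tok``."""
    return len(ids) - 1 - max(i for i, t in enumerate(ids) if t == tok)

# Multi-turn prompt: the prior assistant turn closed its reasoning, then the
# template appended <|im_end|>, a new user message, and the next assistant
# generation header (<|im_assistant|><|im_middle|>).
prompt = [IM_USER, 10, IM_MIDDLE, IM_ASSISTANT, IM_MIDDLE,
          THINK_START, 42, THINK_END, 43, IM_END,
          IM_USER, 44, IM_MIDDLE, IM_ASSISTANT, IM_MIDDLE]
assert tokens_after_last(prompt, THINK_END) > 3   # prior-turn think-end

# Same prompt after the current turn generates its own reasoning block:
# the last think-end now sits at the very end of the sequence.
generated = prompt + [THINK_START, 45, THINK_END]
assert tokens_after_last(generated, THINK_END) <= 3  # reasoning just ended
```

With the `> 3` threshold from the diff, the first sequence yields False (the think-end is chat-template history) and the second yields True, which is exactly the multi-turn behavior the commit message describes.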