Fix is_reasoning_end to handle multi-turn prompt tokens

Instead of always returning False (which broke tool-call streaming),
use a heuristic: if a think-end appears in the token IDs but is
followed by more than 3 tokens (chat-template wrapping such as
<|im_end|> and user markers), it came from a prior turn's prompt and
reasoning has not started in the current generation, so return False.
If the think-end is at or near the end of the sequence, it came from
generated tokens and reasoning has ended, so return True.
2026-04-14 08:39:18 +00:00
parent f5266646eb
commit 778f1bfe66


@@ -144,29 +144,89 @@ class KimiK2ReasoningParser(ReasoningParser):
     # ------------------------------------------------------------------
     def is_reasoning_end(self, input_ids: Sequence[int]) -> bool:
-        """Check if reasoning has ended.
-
-        IMPORTANT: Always returns False for this parser. The reasoning
-        state is tracked internally by ``_reasoning_ended`` which is
-        updated only when ``extract_reasoning_streaming`` detects a
-        think-end or a tool-section marker in the model's *generated*
-        text.
-
-        The vLLM serving layer calls this with ``prompt_token_ids`` to
-        pre-compute whether reasoning has ended. On multi-turn
-        conversations, the prompt contains think-end tokens from prior
-        assistant messages, which would cause a false positive — the
-        serving layer would skip ``extract_reasoning_streaming`` entirely
-        and route all thinking text to content.
-
-        Returning False ensures the serving layer always calls
-        ``extract_reasoning_streaming``, which correctly handles the
-        transition using generated text only.
+        """Check if reasoning has ended based on the token ID sequence.
+
+        Scans backward to find the last think-start or think-end token.
+        Returns True only if the last relevant token is a think-end or
+        tool-section-start, AND there is no think-start after it.
+
+        CRITICAL: When called with prompt_token_ids (as the vLLM serving
+        layer does), the input contains the full chat history. On
+        multi-turn conversations, the prompt ends with tokens from the
+        prior assistant message, which may include think-end. However,
+        this think-end belongs to the PRIOR generation — the new
+        generation will start its own reasoning with think-start.
+
+        To handle this correctly, we check whether the input ends with
+        a complete reasoning block (think-start ... think-end). If the
+        last think token is think-end AND it's followed by non-reasoning
+        tokens (like tool_call tokens or end-of-sequence), we return
+        True. But if the input is just the prompt with no generated
+        tokens yet, we return False because the new generation hasn't
+        started reasoning yet.
+
+        The key insight: in the chat template for multi-turn, after the
+        last assistant message's think-end, the template adds
+        <|im_end|> followed by new user/assistant markers. The
+        assistant generation prompt ends with <|im_assistant|> and
+        <|im_middle|> — no think tokens. So if we scan backward and
+        find think-end but then find prompt-end tokens (not think-start)
+        after it, we know reasoning ended in a PRIOR turn, not the
+        current one. We return False to let the new generation start
+        fresh.
         """
         if self._identity_parser is not None:
             return self._identity_parser.is_reasoning_end(input_ids)
-        return False
+        # Scan backward to find the last think-start, think-end,
+        # or tool-section-start token.
+        last_start = -1
+        last_end = -1
+        last_tool_section = -1
+        for i in range(len(input_ids) - 1, -1, -1):
+            if input_ids[i] == self._start_token_id and last_start == -1:
+                last_start = i
+            if input_ids[i] == self._end_token_id and last_end == -1:
+                last_end = i
+            if input_ids[i] in self._tool_section_start_token_ids and last_tool_section == -1:
+                last_tool_section = i
+            # Stop early if we found think-start — it's the boundary.
+            if last_start != -1:
+                break
+        # No think tokens at all — not a reasoning model output.
+        if last_start == -1 and last_end == -1 and last_tool_section == -1:
+            return False
+        # think-start is the last relevant token — reasoning is in progress.
+        if last_start != -1 and (last_end == -1 or last_start > last_end):
+            return False
+        # think-end or tool-section is the last relevant token.
+        # This could be from the prompt (prior turn) or from generated
+        # tokens. For prompt tokens on multi-turn, the think-end is
+        # from a prior assistant message and the new generation hasn't
+        # started yet — we should return False.
+        #
+        # Heuristic: if think-end appears but is followed by more tokens
+        # (like <|im_end|>, user markers, etc.), it's from the prompt
+        # and reasoning hasn't started in the current generation yet.
+        # Return False.
+        #
+        # If think-end is the very last token or near the end, it's
+        # from generated tokens and reasoning has ended. Return True.
+        last_relevant = max(last_end, last_tool_section)
+        tokens_after = len(input_ids) - 1 - last_relevant
+        # If there are more than a few tokens after the last think-end,
+        # those are prompt tokens (chat template wrapping), meaning
+        # the think-end is from a prior turn. Return False.
+        if tokens_after > 3:
+            return False
+        return True

     def is_reasoning_end_streaming(
         self, input_ids: Sequence[int], delta_ids: Iterable[int]
     ) -> bool:
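The boundary cases of the `tokens_after > 3` cutoff can be exercised with a stripped-down sketch (made-up token IDs; the think-start ordering check that the real method performs first is skipped here):

```python
# Hypothetical IDs for the think-end and tool-section-start tokens.
THINK_END, TOOL_SECTION = 200, 201

def past_reasoning(ids: list[int], slack: int = 3) -> bool:
    """True if the last think-end/tool-section token is within `slack`
    positions of the end of the sequence."""
    last = max((i for i, t in enumerate(ids) if t in (THINK_END, TOOL_SECTION)),
               default=-1)
    return last != -1 and (len(ids) - 1 - last) <= slack

print(past_reasoning([1, THINK_END]))                   # -> True  (last token)
print(past_reasoning([1, THINK_END, TOOL_SECTION, 2]))  # -> True  (tool section follows)
print(past_reasoning([1, THINK_END, 2, 3, 4, 5]))       # -> False (4 trailing prompt tokens)
```

Note how a tool-section token right after think-end still counts as "reasoning ended", because the cutoff is measured from the later of the two markers.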