21 Commits

3ee933951c Tool parser: fallback to <|tool_call_begin|> when no section marker
Some Kimi K2.5 model variants (nvidia/Kimi-K2.5-NVFP4) omit
<|tool_calls_section_begin|> and go directly to <|tool_call_begin|>.
The tool parser was only looking for section-level markers, so these
tool calls were forwarded as raw content text instead of being parsed.

Fix: _find_section_start and _find_section_start_end now fall back to
<|tool_call_begin|> as a section start when no section-level marker
is found. The section end falls back to <|tool_call_end|>.
2026-04-14 11:25:11 +00:00
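The fallback described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual vLLM parser code: the function names (`find_section_start`, `find_section_end`) and the "last complete call" end-fallback are assumptions based only on the commit message.

```python
SECTION_BEGIN = "<|tool_calls_section_begin|>"
SECTION_END = "<|tool_calls_section_end|>"
CALL_BEGIN = "<|tool_call_begin|>"
CALL_END = "<|tool_call_end|>"

def find_section_start(text: str) -> int:
    """Index where the tool-call section starts, or -1.

    Falls back to the first <|tool_call_begin|> for model variants
    (e.g. nvidia/Kimi-K2.5-NVFP4) that omit the section-level marker.
    """
    idx = text.find(SECTION_BEGIN)
    if idx != -1:
        return idx
    return text.find(CALL_BEGIN)  # fallback: call-level marker only

def find_section_end(text: str) -> int:
    """Index just past the section end, or -1 if the section is still open."""
    idx = text.find(SECTION_END)
    if idx != -1:
        return idx + len(SECTION_END)
    # Fallback: treat the end of the last complete tool call as the section end.
    idx = text.rfind(CALL_END)
    if idx != -1:
        return idx + len(CALL_END)
    return -1
```

Without the fallback, text containing only call-level markers never matches, so the whole span is forwarded as raw content instead of being parsed.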
120c8d9d8d keep everything .py 2026-04-14 10:16:59 +00:00
6999ed8a3a Fix finish_reason_ variable name in non-streaming path
Use output.finish_reason instead of finish_reason_, which doesn't exist
in the non-streaming code path.
2026-04-14 10:12:46 +00:00
3f2708a095 keep everything .py 2026-04-14 09:51:51 +00:00
043f51322f Patch vLLM serving layer to flush reasoning on finish_reason=length
When the model runs out of tokens while still reasoning (no think-end
emitted), all text goes to the reasoning field with zero content — the
model appears silent to the client.

Streaming fix: yield an extra content delta with the extracted reasoning
text before the finish chunk, so the client can see the output.

Non-streaming fix: move reasoning to content when finish_reason=length
and content is None.

Also adds the patched serving.py to the Dockerfile.
2026-04-14 09:49:45 +00:00
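A minimal sketch of the non-streaming half of this fix. `ChatMessage` and its field names are simplified stand-ins for vLLM's response types, assumed here for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChatMessage:
    content: Optional[str]
    reasoning_content: Optional[str]

def flush_reasoning_on_length(msg: ChatMessage, finish_reason: str) -> ChatMessage:
    """If the model hit the token limit while still reasoning (no think-end
    emitted), move the reasoning text into content so the client does not
    receive an apparently empty reply."""
    if finish_reason == "length" and msg.content is None and msg.reasoning_content:
        return ChatMessage(content=msg.reasoning_content, reasoning_content=None)
    return msg
```

The streaming half follows the same idea: before emitting the finish chunk, yield one extra content delta carrying the extracted reasoning text.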
778f1bfe66 Fix is_reasoning_end to handle multi-turn prompt tokens
Instead of always returning False (which broke tool call streaming),
use a heuristic: if think-end appears in the token IDs but is
followed by more than 3 tokens (chat template wrapping like
<|im_end|>, user markers, etc.), it's from a prior turn's prompt
and reasoning hasn't started in the current generation, so return
False. If think-end is at or near the end, it's from generated
tokens and reasoning has ended, so return True.
2026-04-14 08:39:18 +00:00
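The heuristic can be sketched as below. The think-end token ID is a made-up placeholder; the trailing-token threshold of 3 follows the commit message, and the function shape is an assumption:

```python
THINK_END_ID = 151668  # placeholder; the real think-end token id is model-specific

def is_reasoning_end(token_ids: list[int], think_end_id: int = THINK_END_ID,
                     max_trailing: int = 3) -> bool:
    """Heuristic: a think-end followed by more than `max_trailing` tokens is
    chat-template wrapping from a prior turn's prompt (<|im_end|>, user
    markers, ...), so reasoning has NOT ended in this generation. A
    think-end at or near the end came from generated tokens."""
    try:
        # Position of the last occurrence of think-end.
        last = len(token_ids) - 1 - token_ids[::-1].index(think_end_id)
    except ValueError:
        return False  # no think-end anywhere
    trailing = len(token_ids) - 1 - last
    return trailing <= max_trailing
```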
f5266646eb Make is_reasoning_end() always return False
The vLLM serving layer calls is_reasoning_end() with prompt_token_ids
to pre-compute whether reasoning has ended before streaming starts. On
multi-turn conversations, prompt_token_ids contains think-end tokens
from prior assistant messages in the chat history. This causes a false
positive — the serving layer sets reasoning_end_arr[i] = True, skips
extract_reasoning_streaming entirely, and routes all thinking text to
content.

By returning False, the serving layer always calls
extract_reasoning_streaming, which correctly tracks reasoning state
via _reasoning_ended based only on the model's generated text.
2026-04-14 08:21:14 +00:00
055b14cb67 fix reasoning parser 2026-04-14 07:49:05 +00:00
9051c610d2 Fix reasoning parser for multi-turn conversations
The streaming path was using is_reasoning_end(previous_token_ids) to
check if reasoning had ended. On multi-turn conversations,
previous_token_ids includes the entire chat history, including
think-end tokens from prior assistant messages. This caused the parser
to incorrectly think reasoning was already over before the model
generated anything, routing all thinking text to content instead of
reasoning.

Fix: Replace the token-ID-based check with a text-based state variable
(_reasoning_ended) that tracks reasoning end based solely on what the
model has generated in the current turn. Reset on each new generation.
Also includes the chat template for reference.
2026-04-14 07:46:33 +00:00
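The text-based state tracking can be sketched like this. The class shape, the `</think>` marker string, and the return convention are illustrative assumptions, not the actual parser interface:

```python
THINK_END = "</think>"  # assumed think-end marker in generated text

class ReasoningParser:
    def __init__(self):
        # Tracks reasoning end based only on this turn's generated text,
        # so think-end tokens in the multi-turn prompt cannot trip it.
        self._reasoning_ended = False  # reset on each new generation

    def extract_reasoning_streaming(self, delta_text: str) -> tuple[str, str]:
        """Split a streaming delta into (reasoning, content)."""
        if self._reasoning_ended:
            return "", delta_text
        if THINK_END in delta_text:
            before, after = delta_text.split(THINK_END, 1)
            self._reasoning_ended = True
            return before, after
        return delta_text, ""
```

Because the flag starts False on every generation and flips only on generated text, think-end tokens from prior assistant messages in the chat history are simply never seen.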
c5e6414daf Fix last empty content delta in Case 5 (post-section-close)
Return None instead of DeltaMessage(content='') when delta_text
exists but there's no new content after the section end marker.
2026-04-14 07:08:45 +00:00
a404735b2d Fix empty content deltas and leaked section markers in streaming
Tool parser:
- Case 3/4: return None instead of DeltaMessage(content='') when
  inside an open tool section with no parseable content yet.
  Empty-string content deltas pollute the response and break the
  content=null vs content='' contract with non-streaming.

Reasoning parser:
- Suppress tool-calls section markers from content forwarding.
  The tool parser detects them via current_text re-parsing; forwarding
  them as content causes double-handling.
- Already-past-reasoning path: strip section markers from content
  for the same reason.
2026-04-14 06:46:19 +00:00
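The empty-delta rule shared by the two commits above reduces to one guard. `DeltaMessage` here is a simplified stand-in for vLLM's streaming delta type:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeltaMessage:
    content: Optional[str] = None

def make_content_delta(text: str) -> Optional[DeltaMessage]:
    """Return None for empty text rather than DeltaMessage(content=''),
    preserving the content=null vs content='' contract with the
    non-streaming path and keeping empty deltas out of the stream."""
    return DeltaMessage(content=text) if text else None
```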
fcf8fd134e we need to forward the context using the old way 2026-04-14 06:18:32 +00:00
d0c9c5c482 we actually need the empty deltas to keep the stream going 2026-04-14 05:48:04 +00:00
d4568f1d80 more speculative decoding fixes 2026-04-14 05:06:30 +00:00
d4813de98f fix empty content deltas 2026-04-14 03:49:39 +00:00
72339bfe20 Document speculative decoding tool parser bug and re-parse-and-diff fix 2026-04-14 03:29:20 +00:00
9be82d3574 add the tool call parser fixes for eagle decode 2026-04-14 03:13:24 +00:00
4de7496f5b Add README 2026-04-13 17:28:30 +00:00
ad76c78630 Install unzip for Eagle3 extraction, then remove it 2026-04-13 15:52:29 +00:00
7d2f02a7cf Fix empty TAG fallback in Jenkinsfile 2026-04-13 15:39:13 +00:00
3293502548 init commit 2026-04-13 15:24:48 +00:00