Commit Graph

6 Commits

778f1bfe66 Fix is_reasoning_end to handle multi-turn prompt tokens
Instead of always returning False (which broke tool call streaming),
use a heuristic: if think-end appears in the token IDs but is
followed by more than 3 tokens (chat template wrapping like
<|im_end|>, user markers, etc.), it's from a prior turn's prompt
and reasoning hasn't started in the current generation. Return False.
If think-end is at or near the end, it's from generated tokens and
reasoning has ended. Return True.
2026-04-14 08:39:18 +00:00
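The heuristic this commit describes could be sketched as follows. This is a minimal illustration, not vLLM's actual code: the token ID, the function signature, and the standalone shape are all assumptions.

```python
# Illustrative sketch of the commit's heuristic; THINK_END_TOKEN_ID is an
# assumed placeholder, not the real ID used by any particular model.
THINK_END_TOKEN_ID = 151668


def is_reasoning_end(token_ids: list[int]) -> bool:
    """Decide whether a think-end token in `token_ids` came from the
    current generation (reasoning ended) or from a prior turn's prompt."""
    if THINK_END_TOKEN_ID not in token_ids:
        return False  # no think-end seen: reasoning has not ended
    # Find the last occurrence of the think-end token.
    last = len(token_ids) - 1 - token_ids[::-1].index(THINK_END_TOKEN_ID)
    trailing = len(token_ids) - 1 - last
    # More than 3 trailing tokens means chat-template wrapping
    # (<|im_end|>, user markers, etc.) follows: it is prompt history,
    # so reasoning has not ended in the current generation.
    return trailing <= 3
```

With this shape, a think-end followed by a long tail of wrapping tokens reads as prompt history, while one at or near the end of the sequence reads as freshly generated.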
f5266646eb Make is_reasoning_end() always return False
The vLLM serving layer calls is_reasoning_end() with prompt_token_ids
to pre-compute whether reasoning has ended before streaming starts. On
multi-turn conversations, prompt_token_ids contains think-end tokens
from prior assistant messages in the chat history. This causes a false
positive — the serving layer sets reasoning_end_arr[i] = True, skips
extract_reasoning_streaming entirely, and routes all thinking text to
content.

By returning False, the serving layer always calls
extract_reasoning_streaming, which correctly tracks reasoning state
via _reasoning_ended based only on the model's generated text.
2026-04-14 08:21:14 +00:00
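The false positive this commit removes can be illustrated with a small sketch. The names and token ID below are assumptions for illustration; they are not vLLM's actual serving-layer code.

```python
# Assumed placeholder for a model's think-end token ID.
THINK_END_TOKEN_ID = 151668


def token_based_is_reasoning_end(token_ids: list[int]) -> bool:
    # The pre-fix behavior: any think-end token counts, including
    # ones from prior assistant turns embedded in the prompt.
    return THINK_END_TOKEN_ID in token_ids


def always_false_is_reasoning_end(token_ids: list[int]) -> bool:
    # This commit's approach: never pre-compute "reasoning ended" from
    # prompt tokens; let extract_reasoning_streaming track state from
    # generated text alone.
    return False


# Multi-turn prompt: the chat history contains a prior turn's
# think-end token followed by template wrapping and a user message.
prompt_token_ids = [10, 11, THINK_END_TOKEN_ID, 12, 13, 14]
```

On this prompt, the token-based check reports reasoning as already over before the model generates anything, which is exactly the false positive that routed thinking text to content; returning False unconditionally keeps the streaming path in `extract_reasoning_streaming`.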
055b14cb67 fix reasoning parser 2026-04-14 07:49:05 +00:00
9051c610d2 Fix reasoning parser for multi-turn conversations
The streaming path was using is_reasoning_end(previous_token_ids) to
check if reasoning had ended. On multi-turn conversations,
previous_token_ids includes the entire chat history, including
think-end tokens from prior assistant messages. This caused the parser
to incorrectly think reasoning was already over before the model
generated anything, routing all thinking text to content instead of
reasoning.

Fix: Replace the token-ID-based check with a text-based state variable
(_reasoning_ended) that tracks reasoning end based solely on what the
model has generated in the current turn. Reset on each new generation.
Also includes the chat template for reference.
2026-04-14 07:46:33 +00:00
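The text-based state variable described above could look like this. A minimal sketch under assumptions: the marker string, class name, and return shape are illustrative, not vLLM's actual parser API.

```python
END_THINK = "</think>"  # assumed think-end marker text


class TextStateReasoningParser:
    """Illustrative sketch: track reasoning end from generated text only."""

    def __init__(self):
        # Reset on each new generation, so prior turns in the chat
        # history can never mark reasoning as already ended.
        self._reasoning_ended = False

    def extract_reasoning_streaming(self, delta_text: str) -> dict:
        """Split a streamed delta into reasoning and content parts."""
        if self._reasoning_ended:
            return {"reasoning": "", "content": delta_text}
        reasoning, sep, content = delta_text.partition(END_THINK)
        if sep:
            # End marker appeared in the model's own output: flip state.
            self._reasoning_ended = True
        return {"reasoning": reasoning, "content": content}
```

Because the state lives on the parser instance and starts as False for every generation, think-end markers sitting in the prompt history have no way to influence routing.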
a404735b2d Fix empty content deltas and leaked section markers in streaming
Tool parser:
- Case 3/4: return None instead of DeltaMessage(content='') when
  inside an open tool section with no parseable content yet.
  Empty-string content deltas pollute the response and break the
  content=null vs content='' contract with non-streaming.

Reasoning parser:
- Suppress tool-calls section markers from content forwarding.
  The tool parser detects them via current_text re-parsing; forwarding
  them as content causes double-handling.
- Already-past-reasoning path: strip section markers from content
  for the same reason.
2026-04-14 06:46:19 +00:00
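The None-versus-empty-string contract in the tool parser fix can be sketched like this. The `DeltaMessage` stand-in and the helper name are assumptions for illustration, not vLLM's real classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeltaMessage:  # stand-in for the real streaming delta type
    content: Optional[str] = None


def tool_section_delta(parsed_fragment: str) -> Optional[DeltaMessage]:
    """Sketch of the fix: inside an open tool section with nothing
    parseable yet, emit no delta at all rather than content=''."""
    if not parsed_fragment:
        # Returning DeltaMessage(content='') would pollute the stream
        # and break the content=None vs content='' contract that the
        # non-streaming path relies on.
        return None
    return DeltaMessage(content=parsed_fragment)
```

Skipping the delta entirely keeps streamed output byte-identical to the non-streaming response, where an in-progress tool section contributes no content field at all.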
d4568f1d80 more speculative decoding fixes 2026-04-14 05:06:30 +00:00