Some Kimi K2.5 model variants (nvidia/Kimi-K2.5-NVFP4) omit
<|tool_calls_section_begin|> and go directly to <|tool_call_begin|>.
The tool parser was only looking for section-level markers, so these
tool calls were forwarded as raw content text instead of being parsed.
Fix: _find_section_start and _find_section_start_end now fall back to
<|tool_call_begin|> as a section start when no section-level marker
is found. The section end falls back to <|tool_call_end|>.
When the model runs out of tokens while still reasoning (no think-end
emitted), all generated text lands in the reasoning field and content
stays empty, so the model appears silent to the client.
Streaming fix: yield an extra content delta with the extracted reasoning
text before the finish chunk, so the client can see the output.
Non-streaming fix: move reasoning to content when finish_reason=length
and content is None.
Also adds the patched serving.py to the Dockerfile.
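A hedged sketch of the non-streaming fallback; the field names mirror the OpenAI-style response shape rather than vLLM's exact internals:

```python
from typing import Optional, Tuple

def finalize_message(
    content: Optional[str],
    reasoning: Optional[str],
    finish_reason: str,
) -> Tuple[Optional[str], Optional[str]]:
    """When generation is cut off mid-reasoning, promote the reasoning
    text to content so the client does not receive an empty message."""
    if finish_reason == "length" and content is None and reasoning:
        return reasoning, None
    return content, reasoning
```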
Instead of always returning False (which broke tool call streaming),
use a heuristic: if think-end appears in the token IDs but is
followed by more than 3 tokens (chat template wrapping like
<|im_end|>, user markers, etc.), it's from a prior turn's prompt
and reasoning hasn't started in the current generation. Return False.
If think-end is at or near the end, it's from generated tokens and
reasoning has ended. Return True.
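The heuristic can be sketched as follows; the token id and the tolerance constant name are hypothetical, only the 3-token threshold comes from the change itself:

```python
THINK_END_ID = 163586   # hypothetical token id for the think-end marker
TAIL_TOLERANCE = 3      # tokens allowed after think-end before we treat
                        # it as chat-template wrapping from the prompt

def is_reasoning_end(token_ids: list[int]) -> bool:
    """Heuristic: a think-end at or near the tail belongs to the current
    generation; one buried deeper came from a prior turn's prompt."""
    try:
        # Position of the last occurrence of the think-end token.
        last = len(token_ids) - 1 - token_ids[::-1].index(THINK_END_ID)
    except ValueError:
        return False  # no think-end anywhere
    tokens_after = len(token_ids) - 1 - last
    return tokens_after <= TAIL_TOLERANCE
```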
The vLLM serving layer calls is_reasoning_end() with prompt_token_ids
to pre-compute whether reasoning has ended before streaming starts. On
multi-turn conversations, prompt_token_ids contains think-end tokens
from prior assistant messages in the chat history. This causes a false
positive — the serving layer sets reasoning_end_arr[i] = True, skips
extract_reasoning_streaming entirely, and routes all thinking text to
content.
By returning False, the serving layer always calls
extract_reasoning_streaming, which correctly tracks reasoning state
via _reasoning_ended based only on the model's generated text.
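The always-False behavior described here amounted to the following stub; with it, the parser's text-based _reasoning_ended state was the single source of truth:

```python
def is_reasoning_end(token_ids: list[int]) -> bool:
    # Always defer to extract_reasoning_streaming, whose _reasoning_ended
    # state tracks only the current turn's generated text. This sidesteps
    # false positives from think-end tokens in multi-turn prompts.
    return False
```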
The streaming path was using is_reasoning_end(previous_token_ids) to
check if reasoning had ended. On multi-turn conversations,
previous_token_ids includes the entire chat history, including
think-end tokens from prior assistant messages. This caused the parser
to incorrectly think reasoning was already over before the model
generated anything, routing all thinking text to content instead of
reasoning.
Fix: Replace the token-ID-based check with a text-based state variable
(_reasoning_ended) that tracks reasoning end based solely on what the
model has generated in the current turn. Reset on each new generation.
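The text-based state variable can be sketched like this; the textual marker string is an assumption for illustration (Kimi's actual think-end marker differs), and the class is a stand-in for the parser that holds _reasoning_ended:

```python
THINK_END = "</think>"  # assumed textual marker, not Kimi's real one

class ReasoningTracker:
    """Sketch of the text-based state: reasoning end is decided from the
    current turn's generated text, never from prompt token IDs."""

    def __init__(self) -> None:
        self._buffer = ""
        self._reasoning_ended = False  # reset on each new generation

    def feed(self, delta_text: str) -> None:
        # Buffer deltas so a marker split across chunks is still found.
        self._buffer += delta_text
        if THINK_END in self._buffer:
            self._reasoning_ended = True

    @property
    def reasoning_ended(self) -> bool:
        return self._reasoning_ended
```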
Also includes the chat template for reference.
Tool parser:
- Case 3/4: return None instead of DeltaMessage(content='') when
  inside an open tool section with no parseable content yet.
  Empty-string content deltas pollute the response and break the
  content=null vs content='' contract with the non-streaming path.
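A sketch of the case-3/4 guard, with a stand-in for vLLM's DeltaMessage class; the helper name is hypothetical:

```python
from typing import Optional

class DeltaMessage:
    """Stand-in for vLLM's DeltaMessage, for illustration only."""
    def __init__(self, content: Optional[str] = None) -> None:
        self.content = content

def open_section_delta(pending: str) -> Optional[DeltaMessage]:
    """Inside an open tool section with nothing parseable yet, emit no
    chunk at all instead of an empty-string content delta."""
    if not pending:
        return None  # was: DeltaMessage(content='')
    return DeltaMessage(content=pending)
```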
Reasoning parser:
- Suppress tool-calls section markers from content forwarding.
The tool parser detects them via current_text re-parsing; forwarding
them as content causes double-handling.
- Already-past-reasoning path: strip section markers from content
for the same reason.
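The marker suppression can be sketched as a simple strip before forwarding; the marker strings are assumptions, and the real parser may match token IDs instead:

```python
import re

# Assumed marker strings for the tool-calls section boundaries.
_SECTION_MARKERS = re.compile(r"<\|tool_calls_section_(?:begin|end)\|>")

def strip_section_markers(text: str) -> str:
    """Drop tool-calls section markers before forwarding text as content;
    the tool parser re-detects them when it re-parses current_text, so
    forwarding them here would double-handle them."""
    return _SECTION_MARKERS.sub("", text)
```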