Some Kimi K2.5 model variants (nvidia/Kimi-K2.5-NVFP4) omit <|tool_calls_section_begin|> and go directly to <|tool_call_begin|>. The tool parser only looked for section-level markers, so these tool calls were forwarded as raw content text instead of being parsed. Fix: _find_section_start and _find_section_start_end now fall back to <|tool_call_begin|> as the section start when no section-level marker is found; the section end likewise falls back to <|tool_call_end|>.
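The fallback described above can be sketched roughly as follows (function and constant names are illustrative, not the exact parser code):

```python
# Illustrative sketch of the section-start fallback; names and structure
# are assumptions, not the actual kimi_k2_tool_parser.py code.
SECTION_BEGIN = "<|tool_calls_section_begin|>"
TOOL_CALL_BEGIN = "<|tool_call_begin|>"

def find_section_start(text: str) -> int:
    """Index where the tool-call section starts, or -1 if absent."""
    idx = text.find(SECTION_BEGIN)
    if idx != -1:
        return idx
    # Fallback for variants that omit the section-level marker and
    # open directly with <|tool_call_begin|>.
    return text.find(TOOL_CALL_BEGIN)
```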
vLLM Kimi-K2.5-Thinking Eagle3 Drafter
A convenience Docker image that bundles the Eagle3 drafter model into the vLLM container, so you can deploy speculative decoding without a separate model download step. Also includes a patched tool-call parser that fixes streaming failures caused by speculative decoding.
What's Inside
- Base image: vllm/vllm-openai:v0.19.0
- Drafter model: nvidia/Kimi-K2.5-Thinking-Eagle3 (Eagle3 speculator layers), extracted to /opt/
- Patched tool parser: kimi_k2_tool_parser.py, a re-parse-and-diff replacement for the upstream parser
Note: this image only works with nvidia/Kimi-K2-Thinking-NVFP4, the text-generation model. It is not compatible with the multimodal Kimi 2.5.
The Problem: Speculative Decoding Breaks Tool Call Parsing
The upstream kimi_k2 tool parser uses a token-count state machine to track streaming state — it counts how many <|tool_call_begin|> and <|tool_call_end|> tokens have arrived and uses those counts to decide whether the model is generating text or inside a tool call.
This works fine with standard autoregressive decoding, where tokens arrive one at a time. But Eagle3 speculative decoding is non-deterministic about how many tokens arrive in each streaming chunk — it can emit anywhere from 1 to num_speculative_tokens + 1 tokens per step. When multiple structural tokens land in the same delta, the state machine breaks.
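As a toy illustration (a paraphrase of the counting assumption, not the actual upstream parser), a per-delta state check that expects structural tokens to arrive one at a time fails when speculative decoding merges two of them into a single delta:

```python
# Toy model of a per-delta state machine that assumes at most one
# structural token per delta (a paraphrase, not the upstream code).
def step(state: str, delta: str) -> str:
    if delta == "<|tool_call_begin|>":
        return "in_tool_call"
    if delta == "<|tool_call_end|>":
        return "text"
    return state  # anything else is treated as plain content

# One token per delta (standard decoding): the transition fires.
state = "text"
for delta in ["<|tool_calls_section_begin|>", "<|tool_call_begin|>"]:
    state = step(state, delta)
assert state == "in_tool_call"

# Speculative decoding can merge both markers into one delta:
# neither branch matches, so the markers leak out as plain content.
state = step("text", "<|tool_calls_section_begin|><|tool_call_begin|>")
assert state == "text"
```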
Symptom 1: Tool calls never fire
<|tool_calls_section_begin|> and <|tool_call_begin|> arrive together in one delta. The parser checks the <|tool_call_begin|> count, but only <|tool_calls_section_begin|> has been seen so far — cur_tool_start_count == cur_tool_end_count == 0, so the parser thinks it's still "generating text" and forwards the section-begin token as plain content. The model says it wants to make a tool call, but the parser never enters the tool-call path.
Symptom 2: Model goes silent after a tool call
<|tool_call_end|> and <|tool_calls_section_end|> arrive in the same delta. The same count mismatch prevents the parser from transitioning out of the tool-call state. The model completes the tool call but never resumes generating text.
The Fix: Re-parse-and-diff
The patched parser replaces the token-count state machine with a re-parse-and-diff approach. On every streaming call it re-scans the entire current_text, finds all tool-call regions (complete and in-progress), extracts JSON arguments, and diffs against what was previously sent. Because the parser doesn't rely on counting tokens incrementally, it's correct regardless of how many tokens arrive per step — whether the speculative decoder emits 1 token or 5, the parser handles it.
This is the same approach used in the vllm-deepseek-v32-mtp parser for DeepSeek-V3.2, adapted for the Kimi-K2 tool call format.
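A condensed sketch of the re-parse-and-diff idea (the regex and helper names here are illustrative; the real patched parser additionally extracts the function name and JSON arguments per the Kimi-K2 tool call format):

```python
import re

# Illustrative sketch only: match complete tool-call bodies, or an
# in-progress body that runs to the end of the accumulated text.
TOOL_CALL_RE = re.compile(
    r"<\|tool_call_begin\|>(.*?)(?:<\|tool_call_end\|>|\Z)", re.DOTALL
)

def parse_tool_calls(current_text: str) -> list[str]:
    """Re-scan the full accumulated text on every streaming call."""
    return TOOL_CALL_RE.findall(current_text)

def diff_deltas(current_text: str, already_sent: list[str]) -> list[str]:
    """Diff freshly parsed bodies against what was previously streamed."""
    deltas = []
    for i, body in enumerate(parse_tool_calls(current_text)):
        prev = already_sent[i] if i < len(already_sent) else ""
        deltas.append(body[len(prev):])  # emit only the new suffix
    return deltas
```

Because each call starts from the full text rather than per-delta counters, a delta carrying two structural tokens is indistinguishable from two deltas carrying one each.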
Pull
docker pull atl.vultrcr.com/vllm/vllm-kimi25-eagle:v0.19.0
Usage
Add the speculative decoding config to your vLLM launch args. Here's a known-working Kubernetes deployment snippet:
- "--tensor-parallel-size=8"
- "--trust-remote-code"
- "--gpu-memory-utilization=0.92"
- "--enable-auto-tool-choice"
- "--tool-call-parser=kimi_k2"
- "--reasoning-parser=kimi_k2"
- "--speculative_config"
- '{"model": "/opt/nvidia-Kimi-K2.5-Thinking-Eagle3/models--nvidia--Kimi-K2.5-Thinking-Eagle3/snapshots/13dab2a34d650a93196d37f2af91f74b8b855bab", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 3, "method": "eagle3"}'
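The JSON string passed to --speculative_config is easy to mangle inside YAML quoting, so a quick sanity check that it parses and carries the expected fields can save a failed rollout (values copied verbatim from the snippet above):

```python
import json

# Values copied verbatim from the deployment snippet above.
cfg = json.loads(
    '{"model": "/opt/nvidia-Kimi-K2.5-Thinking-Eagle3/'
    'models--nvidia--Kimi-K2.5-Thinking-Eagle3/snapshots/'
    '13dab2a34d650a93196d37f2af91f74b8b855bab", '
    '"draft_tensor_parallel_size": 1, "num_speculative_tokens": 3, '
    '"method": "eagle3"}'
)
assert cfg["method"] == "eagle3"
assert cfg["num_speculative_tokens"] == 3
```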
Speculative Config Breakdown
| Parameter | Value | Notes |
|---|---|---|
| model | /opt/nvidia-Kimi-K2.5-Thinking-Eagle3/... | Path to the drafter inside the container |
| draft_tensor_parallel_size | 1 | TP size for the drafter |
| num_speculative_tokens | 3 | Number of tokens to speculate per step |
| method | eagle3 | Speculative decoding method |
Building
The Jenkins pipeline builds and pushes this image. Trigger a build with a specific tag:
curl -X POST "https://jenkins.sweetapi.com/job/vllm-kimi25-eagle/buildWithParameters" \
-u "$JENKINS_USER:$JENKINS_PASS" \
-d "TAG=v0.19.0"
To build locally:
docker build -t atl.vultrcr.com/vllm/vllm-kimi25-eagle:v0.19.0 .