Document speculative decoding tool parser bug and re-parse-and-diff fix
# vLLM Kimi-K2.5-Thinking Eagle3 Drafter
A convenience Docker image that bundles the [Eagle3 drafter model](https://huggingface.co/nvidia/Kimi-K2.5-Thinking-Eagle3) into the vLLM container, so you can deploy speculative decoding without a separate model download step. Also includes a patched tool-call parser that fixes streaming failures caused by speculative decoding.
## What's Inside
- **Base image:** `vllm/vllm-openai:v0.19.0`
- **Drafter model:** `nvidia/Kimi-K2.5-Thinking-Eagle3` (Eagle3 speculator layers) extracted to `/opt/`
- **Patched tool parser:** `kimi_k2_tool_parser.py` — re-parse-and-diff replacement for the upstream parser
> **Note:** This only works with `nvidia/Kimi-K2-Thinking-NVFP4` — the text generation model. It is **not** compatible with the multimodal Kimi 2.5.
## The Problem: Speculative Decoding Breaks Tool Call Parsing
The upstream `kimi_k2` tool parser uses a **token-count state machine** to track streaming state — it counts how many `<|tool_call_begin|>` and `<|tool_call_end|>` tokens have arrived and uses those counts to decide whether the model is generating text or inside a tool call.
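A token-count state machine of this kind can be sketched as follows. This is a minimal, hypothetical illustration of the pattern described above (the class and method names are invented for this sketch, not the upstream vLLM code):

```python
# Illustrative sketch of a token-count state machine for streaming
# tool-call detection. Names are hypothetical, not the upstream parser.
TOOL_START = "<|tool_call_begin|>"
TOOL_END = "<|tool_call_end|>"

class CountingParser:
    """Tracks streaming state by counting structural tokens seen so far."""

    def __init__(self) -> None:
        self.start_count = 0
        self.end_count = 0

    def feed(self, delta: str) -> str:
        """Classify one streaming delta as 'text' or 'tool_call'."""
        # The mode is decided from counts accumulated *before* this
        # delta is absorbed, so the decision lags the structural tokens
        # contained in the delta itself.
        in_tool_call = self.start_count > self.end_count
        self.start_count += delta.count(TOOL_START)
        self.end_count += delta.count(TOOL_END)
        return "tool_call" if in_tool_call else "text"
```

With one token per delta, the counts stay in lockstep with the stream and the classification is correct; the symptoms below arise only when several structural tokens land in a single delta.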
This works fine with standard autoregressive decoding, where tokens arrive one at a time. But Eagle3 speculative decoding is non-deterministic about how many tokens arrive in each streaming chunk — it can emit anywhere from 1 to `num_speculative_tokens + 1` tokens per step. When multiple structural tokens land in the same delta, the state machine breaks.
### Symptom 1: Tool calls never fire
`<|tool_calls_section_begin|>` and `<|tool_call_begin|>` arrive together in one delta. When the parser runs its `<|tool_call_begin|>` count check, only `<|tool_calls_section_begin|>` has been accounted for — `cur_tool_start_count == cur_tool_end_count == 0` — so the parser concludes the model is still "generating text" and forwards the section-begin token as plain content. The model signals that it wants to make a tool call, but the parser never enters the tool-call path.
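The failure can be reproduced with a one-line version of the count check. This is a hypothetical simplification (the function and its signature are invented for illustration):

```python
# Hypothetical one-line version of the count check, for illustration
# only. The mode for a delta is chosen from counts that reflect earlier
# text, so a delta carrying both structural tokens is misclassified.
def classify(delta: str, start_count: int, end_count: int) -> str:
    """Route a delta to the text path or the tool-call path."""
    return "tool_call" if start_count > end_count else "text"

# Speculative decoding packs both markers into a single delta:
delta = "<|tool_calls_section_begin|><|tool_call_begin|>"
mode = classify(delta, start_count=0, end_count=0)
# mode is "text": the markers are forwarded as plain content and the
# tool-call path is never entered.
```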
### Symptom 2: Model goes silent after a tool call
`<|tool_call_end|>` and `<|tool_calls_section_end|>` arrive in the same delta. The same count mismatch prevents the parser from transitioning out of the tool-call state. The model completes the tool call but never resumes generating text.
### The Fix: Re-parse-and-diff
The patched parser replaces the token-count state machine with a **re-parse-and-diff** approach. On every streaming call it re-scans the entire `current_text`, finds all tool-call regions (complete and in-progress), extracts JSON arguments, and diffs against what was previously sent. Because the parser doesn't rely on counting tokens incrementally, it's correct regardless of how many tokens arrive per step — whether the speculative decoder emits 1 token or 5, the parser handles it.
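The core of the idea can be sketched in a few lines. This is a minimal, hypothetical sketch under simplifying assumptions — the regex and function names are invented here, and the actual patched parser additionally handles in-progress regions, JSON argument extraction, and the Kimi-K2 function-id format:

```python
import re

# Minimal sketch of re-parse-and-diff (illustrative, not the patched
# parser itself): re-scan the full text on every call, then emit only
# the tool calls that were not sent on a previous call.
TOOL_CALL_RE = re.compile(
    r"<\|tool_call_begin\|>(.*?)<\|tool_call_end\|>", re.DOTALL
)

def stream_step(current_text: str, sent_calls: list[str]) -> list[str]:
    """Return the tool-call bodies in current_text not yet streamed."""
    # 1. Re-parse: find every complete tool-call region from scratch.
    found = TOOL_CALL_RE.findall(current_text)
    # 2. Diff: emit only the suffix beyond what was already sent.
    new_calls = found[len(sent_calls):]
    sent_calls.extend(new_calls)
    return new_calls
```

Because no incremental counts are kept, the result is identical whether the structural tokens arrived one per delta or all in a single speculative burst.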
This is the same approach used in the [vllm-deepseek-v32-mtp](https://sweetapi.com/biondizzle/vllm-deepseek-v32-mtp) parser for DeepSeek-V3.2, adapted for the Kimi-K2 tool call format.
## Pull
```bash