Patch vLLM serving layer to flush reasoning on finish_reason=length

When the model runs out of tokens while still reasoning (no think-end
token emitted), all output lands in the reasoning field and content stays
empty, so the model appears silent to the client.

Streaming fix: yield an extra content delta with the extracted reasoning
text before the finish chunk, so the client can see the output.
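
The streaming flush can be sketched roughly as below. This is a simplified stand-in, not the actual vLLM serving code: the `Delta` dataclass and the generator shape are hypothetical, chosen only to illustrate accumulating reasoning deltas and emitting one extra content delta before the finish chunk.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

@dataclass
class Delta:
    # Hypothetical stand-in for an OpenAI-style streaming chunk delta.
    content: Optional[str] = None
    reasoning_content: Optional[str] = None
    finish_reason: Optional[str] = None

def stream_with_reasoning_flush(
    deltas: Iterable[Delta], finish_reason: str
) -> Iterator[Delta]:
    """Pass deltas through unchanged; if the stream is cut off by the
    token limit and only reasoning was ever emitted, yield one extra
    content delta with the accumulated reasoning before the finish chunk."""
    reasoning_parts: list[str] = []
    saw_content = False
    for d in deltas:
        if d.reasoning_content:
            reasoning_parts.append(d.reasoning_content)
        if d.content:
            saw_content = True
        yield d
    # Flush only in the pathological case: length stop, no visible content.
    if finish_reason == "length" and not saw_content and reasoning_parts:
        yield Delta(content="".join(reasoning_parts))
    yield Delta(finish_reason=finish_reason)
```

With this shape, a client that ignores `reasoning_content` still receives the text it would otherwise have lost, at the cost of one extra chunk at the end of the stream.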

Non-streaming fix: move reasoning to content when finish_reason=length
and content is None.
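
The non-streaming fix reduces to a small post-processing step. A minimal sketch, assuming a simplified message object (the real vLLM response types differ; `ChatMessage` here is hypothetical):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChatMessage:
    # Hypothetical subset of a chat completion message.
    content: Optional[str] = None
    reasoning_content: Optional[str] = None

def flush_reasoning_on_length(msg: ChatMessage, finish_reason: str) -> ChatMessage:
    """If generation hit the token limit mid-reasoning (finish_reason ==
    "length" and content is None), move the reasoning text into content
    so the client sees output instead of an empty message."""
    if finish_reason == "length" and msg.content is None and msg.reasoning_content:
        msg.content = msg.reasoning_content
        msg.reasoning_content = None
    return msg
```

Gating on `content is None` keeps normal length-truncated responses (which already produced visible content) untouched.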

Also adds the patched serving.py to the Dockerfile.
commit 043f51322f (parent 778f1bfe66)
Date: 2026-04-14 09:49:45 +00:00
2 changed files with 1867 additions and 1 deletion


@@ -11,4 +11,7 @@ RUN unzip /tmp/eagle3.zip -d /opt/nvidia-Kimi-K2.5-Thinking-Eagle3 && \
# Patch tool and reasoning parsers for Eagle
COPY kimi_k2_tool_parser.py /usr/local/lib/python3.12/dist-packages/vllm/tool_parsers/kimi_k2_tool_parser.py
COPY kimi_k2_reasoning_parser.py /usr/local/lib/python3.12/dist-packages/vllm/reasoning/kimi_k2_reasoning_parser.py
# Patch serving layer: flush reasoning→content on finish_reason=length
COPY serving.py.patch /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py

serving.py.patch (new file, 1863 lines)

File diff suppressed because it is too large.