Patch vLLM serving layer to flush reasoning on finish_reason=length
When the model runs out of tokens while still reasoning (no think-end token emitted), all text goes to the reasoning field and content stays empty, so the model appears silent to the client.

Streaming fix: yield an extra content delta carrying the extracted reasoning text before the finish chunk, so the client can still see the output.
Non-streaming fix: move reasoning to content when finish_reason=length and content is None.

Also adds the patched serving.py to the Dockerfile.
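A minimal sketch of the two fixes described above. This is not vLLM's actual serving code: `ChatMessage`, the delta dicts, and the function names are simplified stand-ins for illustration only.

```python
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class ChatMessage:
    # Simplified stand-in for the response message object.
    reasoning_content: Optional[str] = None
    content: Optional[str] = None

def flush_reasoning_non_streaming(message: ChatMessage,
                                  finish_reason: str) -> ChatMessage:
    """Non-streaming case: if generation hit the token limit while still
    inside the think block, the parser put everything into reasoning_content
    and left content empty. Move it so the client sees the output."""
    if (finish_reason == "length"
            and message.content is None
            and message.reasoning_content):
        message.content = message.reasoning_content
        message.reasoning_content = None
    return message

def finish_stream(reasoning_so_far: str, content_so_far: str,
                  finish_reason: str) -> Iterator[dict]:
    """Streaming case: before emitting the finish chunk, yield one extra
    content delta carrying the accumulated reasoning text."""
    if finish_reason == "length" and not content_so_far and reasoning_so_far:
        yield {"delta": {"content": reasoning_so_far}}
    yield {"delta": {}, "finish_reason": finish_reason}
```

With a normal `finish_reason="stop"` both helpers are no-ops; only the truncated-while-reasoning case is rewritten.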
@@ -11,4 +11,7 @@ RUN unzip /tmp/eagle3.zip -d /opt/nvidia-Kimi-K2.5-Thinking-Eagle3 && \
 # Patch tool and reasoning parsers for Eagle
 COPY kimi_k2_tool_parser.py /usr/local/lib/python3.12/dist-packages/vllm/tool_parsers/kimi_k2_tool_parser.py
 COPY kimi_k2_reasoning_parser.py /usr/local/lib/python3.12/dist-packages/vllm/reasoning/kimi_k2_reasoning_parser.py
+
+# Patch serving layer: flush reasoning→content on finish_reason=length
+COPY serving.py.patch /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py
serving.py.patch: new file, 1863 lines (diff suppressed because it is too large)