Patch vLLM serving layer to flush reasoning on finish_reason=length
When the model runs out of tokens while still reasoning (no think-end token emitted), all text goes to the reasoning field and content stays empty, so the model appears silent to the client.

Streaming fix: yield an extra content delta carrying the extracted reasoning text before the finish chunk, so the client can still see the output.
Non-streaming fix: move reasoning to content when finish_reason=length and content is None.

Also adds the patched serving.py to the Dockerfile.
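A minimal sketch of the two fixes described above. This is not vLLM's actual serving code: `ChatMessage`, the delta dicts, and the function names are simplified stand-ins for illustration only.

```python
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class ChatMessage:
    # Simplified stand-in for the response message object.
    reasoning_content: Optional[str] = None
    content: Optional[str] = None

def flush_reasoning_non_streaming(message: ChatMessage,
                                  finish_reason: str) -> ChatMessage:
    """Non-streaming case: if generation hit the token limit while still
    inside the think block, the parser put everything into reasoning_content
    and left content empty. Move it so the client sees the output."""
    if (finish_reason == "length"
            and message.content is None
            and message.reasoning_content):
        message.content = message.reasoning_content
        message.reasoning_content = None
    return message

def finish_stream(reasoning_so_far: str, content_so_far: str,
                  finish_reason: str) -> Iterator[dict]:
    """Streaming case: before emitting the finish chunk, yield one extra
    content delta carrying the accumulated reasoning text."""
    if finish_reason == "length" and not content_so_far and reasoning_so_far:
        yield {"delta": {"content": reasoning_so_far}}
    yield {"delta": {}, "finish_reason": finish_reason}
```

With a normal `finish_reason="stop"` both helpers are no-ops; only the truncated-while-reasoning case is rewritten.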
@@ -11,4 +11,7 @@ RUN unzip /tmp/eagle3.zip -d /opt/nvidia-Kimi-K2.5-Thinking-Eagle3 && \
 # Patch tool and reasoning parsers for Eagle
 COPY kimi_k2_tool_parser.py /usr/local/lib/python3.12/dist-packages/vllm/tool_parsers/kimi_k2_tool_parser.py
 COPY kimi_k2_reasoning_parser.py /usr/local/lib/python3.12/dist-packages/vllm/reasoning/kimi_k2_reasoning_parser.py
+
+# Patch serving layer: flush reasoning→content on finish_reason=length
+COPY serving.py.patch /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py
serving.py.patch: new file, 1863 lines (diff suppressed because it is too large)