patch parser

2026-04-09 04:28:22 +00:00
parent 40159e865e
commit 8d5da5750d
8 changed files with 1239 additions and 108 deletions

.gitignore vendored Normal file

@@ -0,0 +1 @@
/.venv


@@ -1,55 +1,91 @@
# vLLM GLM Tool Parser Patch
## Purpose
Patches vLLM's GLM-4/GLM-5.1 tool parser to fix multiple issues with tool call handling, including a streaming issue where long string parameters are buffered entirely before being emitted, causing multi-second delays.
## Issues Fixed
### Issue 1: Tool Response Content Ignored (CRITICAL)
**Symptom:** When the model makes a tool call and receives a response, it acts as if the response were empty ("The function returned no output") even though valid content was provided.
**Root Cause:** The `func_detail_regex` required a newline between the function name and the first argument tag, but GLM-5.1's chat template does NOT include that newline. The regex silently failed to match, tool call extraction failed, and the tool response content was lost in that failure path.
**Model output format (no newline after name):**
```
<tool_call>tool_name<arg_key>param_name</arg_key><arg_value>param_value</arg_value></tool_call>
```
**Old regex (broken):**
```python
r"<tool_call>([^\n]*)\n(.*)</tool_call>"  # requires \n after the name
```
**Fixed regex:**
```python
r"<tool_call>\s*([\w.\-]+)\s*((?:<arg_key>.*)?)\s*</tool_call>"
```
The fix:
- Uses `\s*` instead of a mandatory `\n`
- Makes the arguments group optional for zero-argument calls
- Accepts word chars, dots, and hyphens in function names
### Issue 2: Zero-Argument Tool Calls Crash
**Symptom:** `TypeError: 'NoneType' object is not iterable` when a tool has no arguments.
**Fix:** `tc_args_raw` now defaults to an empty string: `tc_args_raw = tc_detail.group(2) or ""`
### Issue 3: Streaming Path vs Non-Streaming Path Inconsistency
Both paths now use the same robust extraction helpers, so streaming and non-streaming requests parse identically.
### Issue 4: Long String Parameters Buffered During Streaming
GLM models emit tool calls in the XML-like format shown above, and the upstream parser (as of vLLM issue #32829) buffered each string value until its closing tag arrived. For long strings (e.g., 4000+ characters of code), users saw nothing until the entire value was complete: not true streaming. The fix, pulled from https://github.com/vllm-project/vllm/pull/39253, makes `glm4_moe_tool_parser.py` stream string values incrementally:
- Re-parses `<tool_call>` regions on each streaming call
- Diffs against previously sent content
- Emits only new characters as they arrive, so string values stream character-by-character
## Files
| File | Description |
|------|-------------|
| `glm4_moe_tool_parser.py` | Fixed tool parser with incremental streaming |
| `utils.py` | Utility functions for partial JSON/tag handling |
| `Dockerfile` | Overlays patched files onto base image |
| `Jenkinsfile` | CI/CD pipeline for building and pushing |
| `tests/` | Test suite for tool call validation |
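One utility worth illustrating: while streaming, the parser must hold back any trailing text that could be the beginning of a tag such as `<tool_call>`. A sketch of what a `partial_tag_overlap` helper does (hypothetical reimplementation; the real one lives in `utils.py`):

```python
def partial_tag_overlap(text: str, tag: str) -> int:
    """Length of the longest suffix of `text` that is a proper prefix of `tag`."""
    for k in range(min(len(text), len(tag) - 1), 0, -1):
        if text.endswith(tag[:k]):
            return k
    return 0

# "...<tool_c" could still grow into "<tool_call>", so 7 chars are held back.
assert partial_tag_overlap("some text <tool_c", "<tool_call>") == 7
# Plain text has no overlap and can be emitted immediately.
assert partial_tag_overlap("some text", "<tool_call>") == 0
```

The overlap length tells the parser how many trailing characters to withhold until the next chunk resolves the ambiguity.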
## Testing
### Requirements
```bash
pip install httpx
```
### Running Tests
```bash
export VLLM_API_BASE="https://api.vultrinference.com/v1"
export VLLM_API_KEY="your-api-key"
export VLLM_MODEL="zai-org/GLM-5.1-FP8"
python tests/test_tool_diagnosis.py
```
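Independent of the live-API tests, the regex change itself can be sanity-checked offline (patterns copied from the parser; the sample outputs are illustrative):

```python
import re

# Old pattern: required a literal newline between the tool name and the body.
old_detail = re.compile(r"<tool_call>([^\n]*)\n(.*)</tool_call>", re.DOTALL)
# Fixed pattern: optional whitespace, optional argument body.
new_detail = re.compile(
    r"<tool_call>\s*([\w.\-]+)\s*((?:<arg_key>.*)?)\s*</tool_call>",
    re.DOTALL,
)

# GLM-5.1 emits no newline after the name, so the old pattern never matched.
call = ("<tool_call>get_weather<arg_key>city</arg_key>"
        "<arg_value>Paris</arg_value></tool_call>")
assert old_detail.search(call) is None
assert new_detail.search(call).group(1) == "get_weather"

# Zero-argument calls also match; the body group is simply empty.
assert new_detail.search("<tool_call>ping</tool_call>").group(2) == ""
```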
### Test Cases
| Test | Description |
|------|-------------|
| `test_simple_tool_response` | Verifies model can see tool response content |
| `test_without_tools_param` | Tests behavior without tools param in follow-up |
| `test_different_content_formats` | String vs array content formats |
## Deployment
### Jenkins Pipeline
Build via Jenkins:
```bash
curl -X POST "https://jenkins.sweetapi.com/job/vllm-glm-build/buildWithParameters" \
-u "admin:TOKEN" \
-d "IMAGE_TAG=latest"
```
Parameters:
- `IMAGE_TAG` - Docker image tag (default: `latest`)
- `GIT_REPO` - Git repository URL (optional, uses workspace if empty)
- `GIT_BRANCH` - Git branch to build (default: `master`)
### Manual Build
```bash
docker push atl.vultrcr.com/vllm/vllm-glm51-patched:latest
```
## Related
- vLLM Issue #32829 (streaming long string parameters)
- GLM-5.1 chat template: https://huggingface.co/zai-org/GLM-5.1-FP8/raw/main/chat_template.jinja

glm4_moe_tool_parser.py

@@ -1,14 +1,26 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
GLM-4/5 Tool Call Parser — fixed version, with incremental string streaming.
This parser fixes the streaming issue reported in Issue #32829, where long string
parameters (e.g., file content with 4000+ characters of code) were buffered until
complete, causing multi-second delays before the user saw any content; string
values are now streamed incrementally as they arrive.
Fixes applied over the upstream vLLM + sweetapi patch:
1. **func_detail_regex no longer requires a newline** between tool name and
first <arg_key>. The model's chat template instructs:
<tool_call>{name}<arg_key>…</arg_key><arg_value>…</arg_value>…</tool_call>
with NO mandatory newline, but the original regex used ``[^\\n]*\\n`` which
silently failed when the model omitted it.
2. **Zero-argument tool calls no longer crash** (TypeError on NoneType).
3. **extract_tool_calls uses the same robust extraction helpers** as the
streaming path, so both paths parse identically.
4. **_extract_tool_name_from_region** is more tolerant of whitespace /
formatting variants the model may produce.
Drop this file into your vLLM install as a --tool-parser-plugin, or replace
the built-in glm4_moe_tool_parser.py.
"""
import ast
@@ -43,7 +55,7 @@ logger = init_logger(__name__)
class Glm4MoeModelToolParser(ToolParser):
"""Tool parser for GLM-4 models with incremental string streaming.
"""Tool parser for GLM-4/5 models with incremental string streaming.
On every streaming call the parser re-parses ``current_text`` to find
``<tool_call>`` regions, builds the JSON arguments string for each tool
@@ -67,10 +79,25 @@ class Glm4MoeModelToolParser(ToolParser):
self.tool_calls_start_token = self.tool_call_start_token
# ---- FIXED regexes ------------------------------------------------
# Match the whole <tool_call>…</tool_call> block (unchanged).
self.func_call_regex = re.compile(
r"<tool_call>.*?</tool_call>", re.DOTALL
)
# FIX 1: The original regex required a literal \n between tool name
# and the body. The model often omits it. We now accept any
# whitespace (including none) before the first <arg_key>, and we
# make the body group optional so zero-argument calls don't fail.
self.func_detail_regex = re.compile(
r"<tool_call>\s*" # opening tag + optional whitespace
r"([\w.\-]+)" # group 1: tool/function name (word chars, dots, hyphens)
r"\s*" # optional whitespace / newline
r"((?:<arg_key>.*)?)" # group 2: everything from first <arg_key> onward (may be empty)
r"\s*</tool_call>", # closing tag
re.DOTALL,
)
self.func_arg_regex = re.compile(
r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
)
@@ -95,27 +122,25 @@ class Glm4MoeModelToolParser(ToolParser):
self._sent_content_idx: int = 0
self._tool_call_ids: list[str] = []
# ------------------------------------------------------------------
# Static helpers
# ------------------------------------------------------------------
@staticmethod
def _deserialize(value: str) -> Any:
try:
return json.loads(value)
except json.JSONDecodeError:
pass
try:
return ast.literal_eval(value)
except (ValueError, SyntaxError):
pass
return value
@staticmethod
def _json_escape_string_content(s: str) -> str:
"""JSON-escape string content for incremental streaming.
This escapes the content that goes INSIDE a JSON string (between quotes),
not including the surrounding quotes themselves.
"""
"""JSON-escape string content (without surrounding quotes)."""
if not s:
return ""
return json.dumps(s, ensure_ascii=False)[1:-1]
@@ -144,7 +169,6 @@ class Glm4MoeModelToolParser(ToolParser):
@staticmethod
def _tools_enabled(request: ChatCompletionRequest) -> bool:
"""Return whether tool parsing should be applied for this request."""
try:
tools = getattr(request, "tools", None)
tool_choice = getattr(request, "tool_choice", None)
@@ -153,19 +177,22 @@ class Glm4MoeModelToolParser(ToolParser):
logger.exception("Failed to determine if tools are enabled.")
return False
# ------------------------------------------------------------------
# Request adjustment
# ------------------------------------------------------------------
def adjust_request(
self, request: ChatCompletionRequest | ResponsesRequest
) -> ChatCompletionRequest | ResponsesRequest:
"""Adjust request parameters for tool call token handling."""
request = super().adjust_request(request)
if request.tools and request.tool_choice != "none":
# Ensure tool call tokens (<tool_call>, </tool_call>) are not skipped
# during decoding. Even though they are not marked as special tokens,
# setting skip_special_tokens=False ensures proper handling in
# transformers 5.x where decoding behavior may have changed.
request.skip_special_tokens = False
return request
# ------------------------------------------------------------------
# Non-streaming extraction
# ------------------------------------------------------------------
def extract_tool_calls(
self,
model_output: str,
@@ -173,19 +200,20 @@ class Glm4MoeModelToolParser(ToolParser):
) -> ExtractedToolCallInformation:
matched_tool_calls = self.func_call_regex.findall(model_output)
logger.debug("model_output: %s", model_output)
try:
tool_calls: list[ToolCall] = []
for match in matched_tool_calls:
tc_detail = self.func_detail_regex.search(match)
if not tc_detail:
logger.warning(
"Failed to parse tool call details from: %s",
match,
"Failed to parse tool call details from: %s", match
)
continue
tc_name = tc_detail.group(1).strip()
tc_args_raw = tc_detail.group(2) or "" # FIX 2: default to ""
pairs = self.func_arg_regex.findall(tc_args_raw) if tc_args_raw else []
arg_dct: dict[str, Any] = {}
for key, value in pairs:
arg_key = key.strip()
@@ -208,38 +236,31 @@ class Glm4MoeModelToolParser(ToolParser):
return ExtractedToolCallInformation(
tools_called=False, tool_calls=[], content=model_output
)
else:
if tool_calls:
content: str | None = model_output[
: model_output.find(self.tool_calls_start_token)
]
# Normalize empty/whitespace-only content to None
if not content or not content.strip():
content = None
return ExtractedToolCallInformation(
tools_called=True, tool_calls=tool_calls, content=content
)
return ExtractedToolCallInformation(
tools_called=False, tool_calls=[], content=model_output
)
# ------------------------------------------------------------------
# Streaming helpers
# ------------------------------------------------------------------
def _extract_content(self, current_text: str) -> str | None:
"""Return unsent non-tool-call text, or None.
Collects all text outside ``<tool_call>...</tool_call>`` regions,
including text between consecutive tool calls. Holds back any
suffix that could be a partial ``<tool_call>`` tag.
"""
# Build the "sendable index" — the furthest point we can send
# content up to. We scan through the text collecting segments
# that are outside tool-call regions.
content_segments: list[str] = []
pos = self._sent_content_idx
while pos < len(current_text):
start = current_text.find(self.tool_call_start_token, pos)
if start == -1:
# No more tool calls — send up to (len - partial-tag overlap)
tail = current_text[pos:]
overlap = partial_tag_overlap(tail, self.tool_call_start_token)
sendable = tail[: len(tail) - overlap] if overlap else tail
@@ -248,29 +269,24 @@ class Glm4MoeModelToolParser(ToolParser):
pos = len(current_text) - overlap
break
# Text before this <tool_call>
if start > pos:
content_segments.append(current_text[pos:start])
# Skip past the </tool_call> (or to end if incomplete)
end = current_text.find(self.tool_call_end_token, start)
if end != -1:
pos = end + len(self.tool_call_end_token)
else:
# Incomplete tool call — nothing more to send
pos = start
break
if content_segments:
self._sent_content_idx = pos
return "".join(content_segments)
# Even if no content, advance past completed tool-call regions
if pos > self._sent_content_idx:
self._sent_content_idx = pos
return None
def _extract_tool_call_regions(self, text: str) -> list[tuple[str, bool]]:
"""Extract ``(inner_text, is_complete)`` for each ``<tool_call>`` region."""
results: list[tuple[str, bool]] = []
pos = 0
while True:
@@ -283,7 +299,6 @@ class Glm4MoeModelToolParser(ToolParser):
results.append((text[inner_start:end], True))
pos = end + len(self.tool_call_end_token)
else:
# Incomplete tool call — strip partial </tool_call> suffix
raw = text[inner_start:]
overlap = partial_tag_overlap(raw, self.tool_call_end_token)
if overlap:
@@ -295,16 +310,31 @@ class Glm4MoeModelToolParser(ToolParser):
def _extract_tool_name_from_region(self, inner_text: str) -> str | None:
"""Extract the tool name from the beginning of a tool-call region.
The name is everything before the first ``\\n``, ``<arg_key>``, or
``</tool_call>``. We also accept the name being the only content
(for zero-argument calls that are still in-flight).
"""
# Strip leading whitespace — model may emit \n after <tool_call>
stripped = inner_text.lstrip()
if not stripped:
return None
nl = stripped.find("\n")
ak = stripped.find(self.arg_key_start)
candidates = [i for i in [nl, ak] if i != -1]
if not candidates:
# No delimiter yet — if the text looks like a partial name
# (only word chars / dots / hyphens), return None to wait.
# If it's a complete name with no args (zero-arg call, complete),
# it will be handled when is_complete is True.
candidate_name = stripped.strip()
if re.fullmatch(r'[\w.\-]+', candidate_name):
# Could be a complete name or still arriving — return it
# so zero-arg complete calls work; the caller checks is_complete.
return candidate_name
return None
cut = min(candidates)
name = stripped[:cut].strip()
return name if name else None
def _build_args_json_so_far(
@@ -313,17 +343,6 @@ class Glm4MoeModelToolParser(ToolParser):
inner_text: str,
is_complete: bool,
) -> str:
"""Build the JSON arguments string from the XML pairs seen so far.
For complete ``<arg_key>/<arg_value>`` pairs the value is fully
formatted. For the last argument whose ``<arg_value>`` has been
opened but not closed, the partial string content is included
(JSON-escaped, with an opening ``"`` but no closing ``"``).
The closing ``}`` is only appended when ``is_complete`` is True
(i.e. the ``</tool_call>`` tag has arrived).
"""
# Find all complete arg pairs
pairs = self.func_arg_regex.findall(inner_text)
parts: list[str] = []
@@ -331,8 +350,6 @@ class Glm4MoeModelToolParser(ToolParser):
key = key.strip()
key_json = json.dumps(key, ensure_ascii=False)
if self._is_string_type(tool_name, key, self.tools):
# Don't strip string values — whitespace is significant
# and must match the partial-value path for diffing.
val_json = json.dumps(value, ensure_ascii=False)
else:
val_json = json.dumps(
@@ -341,7 +358,6 @@ class Glm4MoeModelToolParser(ToolParser):
parts.append(f"{key_json}: {val_json}")
# Check for a partial (incomplete) arg value
# Find the last <arg_value> that isn't closed
last_val_start = inner_text.rfind(self.arg_val_start)
last_val_end = inner_text.rfind(self.arg_val_end)
has_partial_value = last_val_start != -1 and (
@@ -349,8 +365,6 @@ class Glm4MoeModelToolParser(ToolParser):
)
if has_partial_value:
# Find the key for this partial value
# Look for the last <arg_key>...</arg_key> before this <arg_value>
last_key_match = None
for m in self._arg_key_pattern.finditer(inner_text[:last_val_start]):
last_key_match = m
@@ -360,16 +374,12 @@ class Glm4MoeModelToolParser(ToolParser):
partial_content_start = last_val_start + len(self.arg_val_start)
partial_content = inner_text[partial_content_start:]
# Hold back any partial </arg_value> suffix
overlap = partial_tag_overlap(partial_content, self.arg_val_end)
if overlap:
partial_content = partial_content[:-overlap]
key_json = json.dumps(partial_key, ensure_ascii=False)
if is_complete:
# Tool call finished but </arg_value> is missing
# (malformed output). Treat partial as complete value
# so the diff naturally closes any open quotes.
if self._is_string_type(tool_name, partial_key, self.tools):
val_json = json.dumps(partial_content, ensure_ascii=False)
else:
@@ -380,10 +390,8 @@ class Glm4MoeModelToolParser(ToolParser):
parts.append(f"{key_json}: {val_json}")
elif self._is_string_type(tool_name, partial_key, self.tools):
escaped = self._json_escape_string_content(partial_content)
# Open quote but no close — more content may arrive
parts.append(f'{key_json}: "{escaped}')
else:
# Non-string partial: include raw content, no wrapping
parts.append(f"{key_json}: {partial_content}")
if not parts:
@@ -395,7 +403,6 @@ class Glm4MoeModelToolParser(ToolParser):
return joined
def _compute_args_diff(self, index: int, args_so_far: str) -> str | None:
"""Return new argument text not yet sent for tool *index*, or None."""
if not args_so_far or len(args_so_far) <= len(
self.streamed_args_for_tool[index]
):
@@ -406,7 +413,6 @@ class Glm4MoeModelToolParser(ToolParser):
return diff
def _ensure_tool_state_for(self, index: int) -> None:
"""Grow state arrays so that *index* is valid."""
while len(self._tool_call_ids) <= index:
self._tool_call_ids.append(
make_tool_call_id(id_type="random", func_name=None, idx=None)
@@ -416,6 +422,10 @@ class Glm4MoeModelToolParser(ToolParser):
while len(self.prev_tool_call_arr) <= index:
self.prev_tool_call_arr.append({})
# ------------------------------------------------------------------
# Main streaming entry point
# ------------------------------------------------------------------
def extract_tool_calls_streaming(
self,
previous_text: str,
@@ -436,7 +446,6 @@ class Glm4MoeModelToolParser(ToolParser):
for i, (inner_text, is_complete) in enumerate(regions):
self._ensure_tool_state_for(i)
# Extract tool name
tool_name = self._extract_tool_name_from_region(inner_text)
if not tool_name:
break
@@ -471,7 +480,6 @@ class Glm4MoeModelToolParser(ToolParser):
)
)
# Update current_tool_id for serving layer compatibility
if regions:
self.current_tool_id = len(regions) - 1
@@ -480,4 +488,4 @@ class Glm4MoeModelToolParser(ToolParser):
content=content,
tool_calls=tool_call_deltas,
)
return None

tests/requirements.txt Normal file

@@ -0,0 +1 @@
httpx>=0.25.0

tests/run_tests.sh Executable file

@@ -0,0 +1,19 @@
#!/bin/bash
# Run the streaming tool call tests
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Default values
export VLLM_API_BASE="${VLLM_API_BASE:-http://localhost:8000/v1}"
export VLLM_API_KEY="${VLLM_API_KEY:-none}"
export VLLM_MODEL="${VLLM_MODEL:-zai-org/GLM-5.1-FP8}"
echo "Configuration:"
echo " API_BASE: $VLLM_API_BASE"
echo " MODEL: $VLLM_MODEL"
echo ""
# Run the test
python3 "$SCRIPT_DIR/test_streaming_tool_calls.py"

tests/test_streaming_tool_calls.py Normal file

@@ -0,0 +1,386 @@
#!/usr/bin/env python3
"""
Test suite for vLLM GLM-5.1 streaming tool calls.
Reproduces the issue where long string parameters in tool calls
are buffered entirely before being emitted during streaming.
"""
import os
import time
import json
import httpx
from datetime import datetime
# Configuration - will be set via environment or direct assignment
API_BASE = os.environ.get("VLLM_API_BASE", "http://localhost:8000/v1")
API_KEY = os.environ.get("VLLM_API_KEY", "none")
MODEL = os.environ.get("VLLM_MODEL", "zai-org/GLM-5.1-FP8")
def timestamp():
return datetime.now().strftime("%H:%M:%S.%f")[:-3]
def test_streaming_tool_call_with_code():
"""
Test streaming a tool call with a long string parameter.
This prompts the model to generate code via a tool call,
which should stream incrementally if the patch works correctly.
"""
tools = [
{
"type": "function",
"function": {
"name": "write_file",
"description": "Write content to a file. Use this to save code, text, or other content.",
"parameters": {
"type": "object",
"properties": {
"filename": {
"type": "string",
"description": "Name of the file to write"
},
"content": {
"type": "string",
"description": "The content to write to the file"
}
},
"required": ["filename", "content"]
}
}
}
]
messages = [
{
"role": "user",
"content": "Write a Python implementation of a binary search tree with insert, search, and delete methods. Include docstrings and type hints. Save it to bst.py using the write_file tool."
}
]
print(f"\n{'='*60}")
print(f"TEST: Streaming tool call with long string parameter")
print(f"API: {API_BASE}")
print(f"Model: {MODEL}")
print(f"{'='*60}\n")
# Track streaming events
chunks_received = []
first_chunk_time = None
last_chunk_time = None
tool_call_chunks = []
accumulated_content = ""
start_time = time.time()
with httpx.Client(timeout=120.0) as client:
with client.stream(
"POST",
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": True,
"max_tokens": 4096
}
) as response:
print(f"[{timestamp()}] Response status: {response.status_code}")
for line in response.iter_lines():
if not line or line == "data: [DONE]":
continue
if line.startswith("data: "):
chunk_data = line[6:]
try:
chunk = json.loads(chunk_data)
if first_chunk_time is None:
first_chunk_time = time.time()
print(f"\n[{timestamp()}] FIRST CHUNK RECEIVED ({first_chunk_time - start_time:.3f}s)")
last_chunk_time = time.time()
chunks_received.append(chunk)
# Extract delta content
if chunk.get("choices"):
delta = chunk["choices"][0].get("delta", {})
# Check for tool calls in delta
if delta.get("tool_calls"):
for tc in delta["tool_calls"]:
tc_index = tc.get("index", 0)
tc_function = tc.get("function", {})
if tc_function.get("name"):
print(f"\n[{timestamp()}] Tool call name: {tc_function['name']}")
if tc_function.get("arguments"):
args_chunk = tc_function["arguments"]
tool_call_chunks.append(args_chunk)
accumulated_content += args_chunk
# Print progress every ~500 chars
if len(accumulated_content) % 500 < len(args_chunk):
print(f"[{timestamp()}] Accumulated {len(accumulated_content)} chars...")
# Regular content
if delta.get("content"):
print(f"[{timestamp()}] Content chunk: {delta['content'][:50]}...")
except json.JSONDecodeError as e:
print(f"[{timestamp()}] JSON decode error: {e}")
end_time = time.time()
# Summary
print(f"\n{'='*60}")
print("SUMMARY")
print(f"{'='*60}")
print(f"Total chunks received: {len(chunks_received)}")
print(f"Total time: {end_time - start_time:.3f}s")
if first_chunk_time:
print(f"Time to first chunk: {first_chunk_time - start_time:.3f}s")
if tool_call_chunks:
print(f"Tool call chunks: {len(tool_call_chunks)}")
print(f"Total tool call content: {len(accumulated_content)} chars")
# Try to parse the accumulated arguments
print(f"\nAttempting to parse tool call arguments...")
try:
args = json.loads(accumulated_content)
print(f"Successfully parsed!")
print(f" - filename: {args.get('filename', 'N/A')}")
print(f" - content length: {len(args.get('content', ''))} chars")
except json.JSONDecodeError as e:
print(f"Failed to parse: {e}")
print(f"Raw accumulated content (first 500 chars):\n{accumulated_content[:500]}")
# Verdict
print(f"\n{'='*60}")
if len(tool_call_chunks) > 1:
print("✓ PASS: Tool call arguments arrived in multiple chunks")
print(f" Chunks: {len(tool_call_chunks)}, indicating incremental streaming")
elif len(tool_call_chunks) == 1 and len(accumulated_content) > 1000:
print("✗ FAIL: Tool call arguments arrived in a single chunk")
print(" This indicates buffering, not true streaming")
else:
print("? INCONCLUSIVE: Not enough data or no tool call occurred")
print(f"{'='*60}\n")
return {
"chunks_received": len(chunks_received),
"tool_call_chunks": len(tool_call_chunks),
"accumulated_length": len(accumulated_content),
"total_time": end_time - start_time
}
def test_streaming_tool_call_with_json():
"""
Test streaming a tool call that returns structured JSON data.
"""
tools = [
{
"type": "function",
"function": {
"name": "save_config",
"description": "Save a configuration object",
"parameters": {
"type": "object",
"properties": {
"config": {
"type": "object",
"description": "Configuration object with many fields"
}
},
"required": ["config"]
}
}
}
]
messages = [
{
"role": "user",
"content": "Create a detailed configuration for a web server with the following sections: server (host, port, ssl), logging (level, format, outputs), cache (enabled, ttl, max_size), rate_limiting (enabled, requests_per_minute, burst), cors (enabled, origins, methods, headers), security (headers, csp, hsts). Use the save_config tool."
}
]
print(f"\n{'='*60}")
print(f"TEST: Streaming tool call with nested JSON")
print(f"{'='*60}\n")
tool_call_chunks = []
accumulated_content = ""
start_time = time.time()
with httpx.Client(timeout=120.0) as client:
with client.stream(
"POST",
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": True,
"max_tokens": 2048
}
) as response:
for line in response.iter_lines():
if not line or line == "data: [DONE]":
continue
if line.startswith("data: "):
try:
chunk = json.loads(line[6:])
if chunk.get("choices"):
delta = chunk["choices"][0].get("delta", {})
if delta.get("tool_calls"):
for tc in delta["tool_calls"]:
if tc.get("function", {}).get("arguments"):
args_chunk = tc["function"]["arguments"]
tool_call_chunks.append(args_chunk)
accumulated_content += args_chunk
print(f"[{timestamp()}] Chunk {len(tool_call_chunks)}: +{len(args_chunk)} chars (total: {len(accumulated_content)})")
except json.JSONDecodeError:
pass
end_time = time.time()
print(f"\n{'='*60}")
print(f"Total chunks: {len(tool_call_chunks)}, Total content: {len(accumulated_content)} chars")
print(f"Time: {end_time - start_time:.3f}s")
if len(tool_call_chunks) > 1:
print("✓ PASS: Arguments streamed in multiple chunks")
elif len(tool_call_chunks) == 1:
print("✗ FAIL: Arguments arrived in single chunk (buffered)")
else:
print("? No tool call occurred")
print(f"{'='*60}\n")
def test_non_streaming_tool_call():
"""
Baseline test: non-streaming tool call for comparison.
"""
tools = [
{
"type": "function",
"function": {
"name": "write_file",
"description": "Write content to a file",
"parameters": {
"type": "object",
"properties": {
"filename": {"type": "string"},
"content": {"type": "string"}
},
"required": ["filename", "content"]
}
}
}
]
messages = [
{
"role": "user",
"content": "Write a simple Python hello world and save it using the write_file tool."
}
]
print(f"\n{'='*60}")
print(f"TEST: Non-streaming tool call (baseline)")
print(f"{'='*60}\n")
start_time = time.time()
with httpx.Client(timeout=120.0) as client:
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": False,
"max_tokens": 1024
}
)
result = response.json()
end_time = time.time()
print(f"Status: {response.status_code}")
print(f"Time: {end_time - start_time:.3f}s")
if result.get("choices"):
message = result["choices"][0].get("message", {})
if message.get("tool_calls"):
for tc in message["tool_calls"]:
print(f"Tool: {tc['function']['name']}")
args = json.loads(tc["function"]["arguments"])
print(f"Arguments parsed successfully")
print(f" - filename: {args.get('filename')}")
print(f" - content length: {len(args.get('content', ''))}")
else:
print("No tool call in response")
print(f"{'='*60}\n")
def main():
print("\n" + "="*60)
print("vLLM GLM-5.1 Streaming Tool Call Tests")
print("="*60)
# Check API connectivity
print(f"\nChecking API at {API_BASE}...")
try:
with httpx.Client(timeout=10.0) as client:
response = client.get(f"{API_BASE.replace('/v1', '')}/health")
print(f"Health check: {response.status_code}")
except Exception as e:
print(f"Warning: Could not reach API - {e}")
# Run tests
print("\nRunning tests...\n")
# Test 1: Non-streaming baseline
test_non_streaming_tool_call()
# Test 2: Streaming with nested JSON
test_streaming_tool_call_with_json()
# Test 3: Main test - streaming with long code
result = test_streaming_tool_call_with_code()
print("\nAll tests complete.")
if __name__ == "__main__":
main()

tests/test_tool_diagnosis.py Normal file

@@ -0,0 +1,234 @@
#!/usr/bin/env python3
"""
Focused test to diagnose GLM-5.1 tool response issue.
The issue: Model sees tool response as blank.
"""
import os
import httpx
import json
# Read connection settings from the environment (see README); never hardcode API keys.
API_BASE = os.environ.get("VLLM_API_BASE", "https://api.vultrinference.com/v1")
API_KEY = os.environ.get("VLLM_API_KEY", "")
MODEL = os.environ.get("VLLM_MODEL", "zai-org/GLM-5.1-FP8")
def test_simple_tool_response():
"""
Minimal test: Send a tool response and see if the model can use it.
"""
# Simulate a conversation where a tool was called
messages = [
{"role": "user", "content": "Call the test function"},
{
"role": "assistant",
"tool_calls": [{
"id": "call_123",
"type": "function",
"function": {"name": "test_func", "arguments": "{}"}
}]
},
{
"role": "tool",
"tool_call_id": "call_123",
"content": "SUCCESS: The function returned value 42"
}
]
tools = [{
"type": "function",
"function": {
"name": "test_func",
"description": "A test function",
"parameters": {"type": "object", "properties": {}}
}
}]
print("=" * 60)
print("Request messages:")
print(json.dumps(messages, indent=2))
print("=" * 60)
with httpx.Client(timeout=60.0) as client:
# Non-streaming to get full response
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"stream": False,
"max_tokens": 256
}
)
result = response.json()
print("\nFull response:")
print(json.dumps(result, indent=2))
if result.get("choices"):
content = result["choices"][0].get("message", {}).get("content", "")
print("\n" + "=" * 60)
print("Model response content:")
print(content)
print("=" * 60)
# Check if the tool result is referenced
if "42" in content:
print("\n✓ PASS: Model referenced the tool result (42)")
else:
print("\n✗ FAIL: Model did NOT reference the tool result (42)")
# Check for signs the model didn't see the result
if "don't have" in content.lower() or "cannot access" in content.lower():
print("✗ Model indicates it cannot see tool result")
def test_without_tools_param():
"""
Test what happens if we don't pass tools in the follow-up request.
Some APIs need tools to be passed on every request.
"""
messages = [
{"role": "user", "content": "Call the test function"},
{
"role": "assistant",
"tool_calls": [{
"id": "call_123",
"type": "function",
"function": {"name": "test_func", "arguments": "{}"}
}]
},
{
"role": "tool",
"tool_call_id": "call_123",
"content": "SUCCESS: The function returned value 42"
}
]
print("\n" + "=" * 60)
print("Test WITHOUT tools param in follow-up")
print("=" * 60)
with httpx.Client(timeout=60.0) as client:
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
# No tools param
"stream": False,
"max_tokens": 256
}
)
result = response.json()
if result.get("choices"):
content = result["choices"][0].get("message", {}).get("content", "")
print("Model response:", content[:200])
if "42" in content:
print("✓ Model referenced the tool result")
def test_different_content_formats():
"""
Test if the issue is with how content is formatted.
"""
# Test 1: String content (standard)
messages_string = [
{"role": "user", "content": "What is 2+2?"},
{
"role": "assistant",
"tool_calls": [{
"id": "call_123",
"type": "function",
"function": {"name": "calc", "arguments": "{}"}
}]
},
{
"role": "tool",
"tool_call_id": "call_123",
"content": "The answer is 4"
}
]
# Test 2: Content as array (OpenAI format)
messages_array = [
{"role": "user", "content": "What is 2+2?"},
{
"role": "assistant",
"tool_calls": [{
"id": "call_123",
"type": "function",
"function": {"name": "calc", "arguments": "{}"}
}]
},
{
"role": "tool",
"tool_call_id": "call_123",
"content": [{"type": "text", "text": "The answer is 4"}]
}
]
tools = [{
"type": "function",
"function": {
"name": "calc",
"description": "Calculator",
"parameters": {"type": "object", "properties": {}}
}
}]
print("\n" + "=" * 60)
print("Test: String content vs Array content")
print("=" * 60)
with httpx.Client(timeout=60.0) as client:
for name, msgs in [("String content", messages_string), ("Array content", messages_array)]:
print(f"\n--- {name} ---")
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": msgs,
"tools": tools,
"stream": False,
"max_tokens": 128
}
)
result = response.json()
if result.get("choices"):
content = result["choices"][0].get("message", {}).get("content", "")
print(f"Response: {content[:150]}")
if "4" in content:
print("✓ Referenced tool result")
else:
print("✗ Did NOT reference tool result")
if __name__ == "__main__":
print("GLM-5.1 Tool Response Diagnosis")
print("=" * 60)
test_simple_tool_response()
test_without_tools_param()
test_different_content_formats()

tests/test_tool_response.py (new file, 445 lines):
#!/usr/bin/env python3
"""
Test for tool call response handling in GLM-5.1.
Tests the multi-turn flow:
1. Send a prompt that triggers a tool call
2. Send back the tool result
3. Verify the model can see and use the tool response
This reproduces the issue where tool responses appear blank to the model.
"""
import os
import json
import httpx
from datetime import datetime
API_BASE = os.environ.get("VLLM_API_BASE", "http://localhost:8000/v1")
API_KEY = os.environ.get("VLLM_API_KEY", "none")
MODEL = os.environ.get("VLLM_MODEL", "zai-org/GLM-5.1-FP8")
def timestamp():
return datetime.now().strftime("%H:%M:%S.%f")[:-3]
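# The two streaming tests below each re-implement the same SSE parsing loop.
# A small helper along these lines (hypothetical -- not part of the original
# test files) could deduplicate that logic: it yields each decoded JSON chunk
# from an SSE line iterator such as response.iter_lines(), skipping blank
# keep-alive lines and the terminal "[DONE]" sentinel.
import json as _json  # aliased to avoid shadowing the module-level import

def iter_sse_chunks(lines):
    """Yield parsed JSON objects from 'data: ...' SSE lines."""
    for line in lines:
        if not line or line == "data: [DONE]":
            continue
        if line.startswith("data: "):
            try:
                yield _json.loads(line[len("data: "):])
            except _json.JSONDecodeError:
                continue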
def test_tool_call_response_flow(streaming: bool = True):
"""
Test the full tool call -> response -> follow-up flow.
This simulates:
1. User asks for weather
2. Model calls get_weather tool
3. We send back the weather data
4. Model should see and use that data
"""
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and state, e.g. 'New York, NY'"
}
},
"required": ["location"]
}
}
}
]
# Initial request that should trigger a tool call
messages = [
{
"role": "user",
"content": "What's the weather like in Tokyo right now?"
}
]
mode = "STREAMING" if streaming else "NON-STREAMING"
print(f"\n{'='*60}")
print(f"TEST: Tool call response flow ({mode})")
print(f"API: {API_BASE}")
print(f"Model: {MODEL}")
print(f"{'='*60}\n")
with httpx.Client(timeout=120.0) as client:
# Step 1: Send initial request, expect tool call
print(f"[{timestamp()}] Step 1: Sending initial request...")
if streaming:
tool_calls = []
tool_call_id = None
tool_call_name = None
accumulated_args = ""
with client.stream(
"POST",
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": True,
"max_tokens": 512
}
) as response:
print(f"[{timestamp()}] Response status: {response.status_code}")
for line in response.iter_lines():
if not line or line == "data: [DONE]":
continue
if line.startswith("data: "):
try:
chunk = json.loads(line[6:])
if chunk.get("choices"):
delta = chunk["choices"][0].get("delta", {})
if delta.get("tool_calls"):
for tc in delta["tool_calls"]:
idx = tc.get("index", 0)
if tc.get("id"):
tool_call_id = tc["id"]
if tc.get("function", {}).get("name"):
tool_call_name = tc["function"]["name"]
print(f"[{timestamp()}] Tool call: {tool_call_name}")
if tc.get("function", {}).get("arguments"):
accumulated_args += tc["function"]["arguments"]
if delta.get("content"):
print(f"[{timestamp()}] Content: {delta['content'][:100]}")
except json.JSONDecodeError as e:
print(f"[{timestamp()}] JSON error: {e}")
if tool_call_name:
tool_calls.append({
"id": tool_call_id or "call_0",
"type": "function",
"function": {
"name": tool_call_name,
"arguments": accumulated_args
}
})
else:
# Non-streaming
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": False,
"max_tokens": 512
}
)
result = response.json()
print(f"[{timestamp()}] Response status: {response.status_code}")
tool_calls = []
if result.get("choices"):
message = result["choices"][0].get("message", {})
if message.get("tool_calls"):
tool_calls = message["tool_calls"]
for tc in tool_calls:
print(f"[{timestamp()}] Tool call: {tc['function']['name']}")
print(f"[{timestamp()}] Args: {tc['function']['arguments']}")
# Check if we got a tool call
if not tool_calls:
print(f"\n[{timestamp()}] No tool call received - model didn't call the tool")
return {"success": False, "reason": "no_tool_call"}
# Step 2: Parse tool call and prepare response
tc = tool_calls[0]
tc_id = tc.get("id", "call_0")
tc_name = tc["function"]["name"]
tc_args = json.loads(tc["function"]["arguments"])
print(f"\n[{timestamp()}] Step 2: Tool call received")
print(f" Name: {tc_name}")
print(f" Args: {tc_args}")
# Simulate tool execution
tool_result = {
"location": tc_args.get("location", "Unknown"),
"temperature": "22°C",
"condition": "Partly cloudy",
"humidity": "65%",
"wind": "15 km/h NE"
}
# Step 3: Send the tool response back
messages.append({
"role": "assistant",
"tool_calls": tool_calls
})
messages.append({
"role": "tool",
"tool_call_id": tc_id,
"content": json.dumps(tool_result)
})
print(f"\n[{timestamp()}] Step 3: Sending tool response...")
print(f" Tool call ID: {tc_id}")
print(f" Tool result: {json.dumps(tool_result, indent=2)}")
# Step 4: Get the model's follow-up response
if streaming:
final_response = ""
print(f"\n[{timestamp()}] Step 4: Receiving model's follow-up (streaming)...")
with client.stream(
"POST",
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"stream": True,
"max_tokens": 512
}
) as response:
for line in response.iter_lines():
if not line or line == "data: [DONE]":
continue
if line.startswith("data: "):
try:
chunk = json.loads(line[6:])
if chunk.get("choices"):
delta = chunk["choices"][0].get("delta", {})
if delta.get("content"):
content = delta["content"]
final_response += content
print(f"[{timestamp()}] Content: {content}", end="", flush=True)
except json.JSONDecodeError:
pass
print() # newline after streaming output
else:
print(f"\n[{timestamp()}] Step 4: Receiving model's follow-up (non-streaming)...")
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"stream": False,
"max_tokens": 512
}
)
result = response.json()
final_response = ""
if result.get("choices"):
final_response = result["choices"][0].get("message", {}).get("content", "")
print(f"\n[{timestamp()}] Final response:\n{final_response}")
# Check if the model used the tool data
success = True
issues = []
# The response should mention the weather data
if "22" not in final_response:  # "22" also covers "22°C"
issues.append("Temperature (22°C) not mentioned in response")
success = False
if "cloudy" not in final_response.lower():  # also covers "partly cloudy"
issues.append("Condition (Partly cloudy) not mentioned in response")
success = False
# Check for signs the model didn't see the data
blank_indicators = [
"i don't have",
"i cannot access",
"i'm unable to",
"i am unable to",
"don't have access",
"don't have real-time",
"cannot provide real-time"
]
for indicator in blank_indicators:
if indicator in final_response.lower():
issues.append(f"Model seems unaware of tool result (found: '{indicator}')")
success = False
break
print(f"\n{'='*60}")
if success:
print("✓ PASS: Model correctly used tool response data")
else:
print("✗ FAIL: Model did not use tool response correctly")
for issue in issues:
print(f" - {issue}")
print(f"{'='*60}\n")
return {
"success": success,
"issues": issues,
"final_response": final_response
}
def test_tool_response_with_debug_info():
"""
Test with detailed logging to capture exactly what the model sees.
"""
tools = [
{
"type": "function",
"function": {
"name": "get_time",
"description": "Get the current time",
"parameters": {
"type": "object",
"properties": {},
"required": []
}
}
}
]
print(f"\n{'='*60}")
print(f"TEST: Tool response with debug info (non-streaming)")
print(f"{'='*60}\n")
messages = [
{"role": "user", "content": "What time is it?"}
]
with httpx.Client(timeout=120.0) as client:
# Get tool call
print(f"[{timestamp()}] Sending initial request...")
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": False,
"max_tokens": 256
}
)
result = response.json()
if not result.get("choices") or not result["choices"][0].get("message", {}).get("tool_calls"):
print("No tool call - skipping test")
return
tool_call = result["choices"][0]["message"]["tool_calls"][0]
tc_id = tool_call["id"]
print(f"[{timestamp()}] Tool call: {tool_call['function']['name']}")
print(f"[{timestamp()}] Tool call ID: {tc_id}")
# Add tool response
messages.append({
"role": "assistant",
"tool_calls": [tool_call]
})
messages.append({
"role": "tool",
"tool_call_id": tc_id,
"content": "The current time is 3:45 PM on Thursday, April 9, 2026."
})
# Debug: print the full messages array we're about to send
print(f"\n[{timestamp()}] Sending follow-up with these messages:")
print(json.dumps(messages, indent=2))
# Get follow-up
response2 = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"stream": False,
"max_tokens": 256
}
)
result2 = response2.json()
print(f"\n[{timestamp()}] Full response:")
print(json.dumps(result2, indent=2))
if result2.get("choices"):
content = result2["choices"][0].get("message", {}).get("content", "")
print(f"\n[{timestamp()}] Model response content: {content}")
# Check if time is mentioned
if "3:45" in content:  # also covers "3:45 PM"
print("\n✓ Model used the tool response (time mentioned)")
else:
print("\n✗ Model may not have seen the tool response (time not mentioned)")
def main():
print("\n" + "="*60)
print("GLM-5.1 Tool Call Response Tests")
print("="*60)
# Test non-streaming first (simpler to debug)
print("\n--- Test 1: Non-streaming tool response flow ---")
test_tool_call_response_flow(streaming=False)
# Test streaming
print("\n--- Test 2: Streaming tool response flow ---")
test_tool_call_response_flow(streaming=True)
# Debug test
print("\n--- Test 3: Debug info test ---")
test_tool_response_with_debug_info()
print("\nAll tests complete.")
if __name__ == "__main__":
main()