patch parser

2026-04-09 04:28:22 +00:00
parent 40159e865e
commit 8d5da5750d
8 changed files with 1239 additions and 108 deletions

.gitignore vendored Normal file

@@ -0,0 +1 @@
/.venv


@@ -1,55 +1,91 @@
 # vLLM GLM Tool Parser Patch
-## Purpose
-Patches vLLM's GLM-4/GLM-5.1 tool parser to fix a streaming issue where long string parameters are buffered entirely before being emitted, causing multi-second delays.
-## The Problem
-GLM models emit tool calls in a special XML-like format:
-```
-<tool_call>tool_name
-<arg_key>param_name</arg_key><arg_value>param_value</arg_value>
-</tool_call>
-```
-The upstream parser (as of vLLM issue #32829) buffers string values until the closing tag arrives. For long strings (e.g., 4000+ characters of code), users see nothing until the entire value is complete — not true streaming.
-## The Fix (Pulled from https://github.com/vllm-project/vllm/pull/39253)
-`glm4_moe_tool_parser.py` implements incremental string streaming:
-- Re-parses `<tool_call>` regions on each streaming call
-- Diffs against previously sent content
-- Emits only new characters as they arrive
-- String values now stream character-by-character
+Patches vLLM's GLM-4/GLM-5.1 tool parser to fix multiple issues with tool call handling.
+## Issues Fixed
+### Issue 1: Tool Response Content Ignored (CRITICAL)
+**Symptom:** When the model makes a tool call and receives a response, it acts as if the response were empty ("The function returned no output") even though valid content was provided.
+**Root Cause:** The `func_detail_regex` required a newline between the function name and the first argument tag, but GLM-5.1's chat template does NOT include that newline. The regex silently failed to match, tool call extraction failed, and somewhere in that failure path the tool response content got lost.
+**Model output format (no newline after name):**
+```
+<tool_call>function_name<arg_key>key</arg_key><arg_value>value</arg_value>...</tool_call>
+```
+**Old regex (broken):**
+```python
+r"<tool_call>([^\n]*)\n(.*)</tool_call>"  # requires \n after the name
+```
+**Fixed regex:**
+```python
+r"<tool_call>\s*([\w.\-]+)\s*((?:<arg_key>.*)?)\s*</tool_call>"
+```
+The fix:
+- Uses `\s*` instead of a mandatory `\n`
+- Makes the arguments group optional for zero-argument calls
+- Accepts word chars, dots, and hyphens in function names
+### Issue 2: Zero-Argument Tool Calls Crash
+**Symptom:** `TypeError: 'NoneType' object is not iterable` when a tool has no arguments.
+**Fix:** `tc_args_raw` now defaults to an empty string: `tc_args_raw = tc_detail.group(2) or ""`
+### Issue 3: Streaming Path vs Non-Streaming Path Inconsistency
+Both paths now use the same robust extraction helpers, so they parse identically.
 ## Files
 | File | Description |
 |------|-------------|
-| `glm4_moe_tool_parser.py` | Fixed tool parser with incremental streaming |
+| `glm4_moe_tool_parser.py` | Fixed tool parser |
 | `utils.py` | Utility functions for partial JSON/tag handling |
 | `Dockerfile` | Overlays patched files onto base image |
 | `Jenkinsfile` | CI/CD pipeline for building and pushing |
+| `tests/` | Test suite for tool call validation |
+## Testing
+### Requirements
+```bash
+pip install httpx regex
+```
+### Running Tests
+```bash
+export VLLM_API_BASE="https://api.vultrinference.com/v1"
+export VLLM_API_KEY="your-api-key"
+export VLLM_MODEL="zai-org/GLM-5.1-FP8"
+python tests/test_tool_diagnosis.py
+```
+### Test Cases
+| Test | Description |
+|------|-------------|
+| `test_simple_tool_response` | Verifies the model can see tool response content |
+| `test_without_tools_param` | Tests behavior without the tools param in the follow-up |
+| `test_different_content_formats` | String vs. array content formats |
 ## Deployment
 ### Jenkins Pipeline
+Build via Jenkins:
 ```bash
 curl -X POST "https://jenkins.sweetapi.com/job/vllm-glm-build/buildWithParameters" \
   -u "admin:TOKEN" \
   -d "IMAGE_TAG=latest"
 ```
+Parameters:
+- `IMAGE_TAG` - Docker image tag (default: `latest`)
+- `GIT_REPO` - Git repository URL (optional, uses workspace if empty)
+- `GIT_BRANCH` - Git branch to build (default: `master`)
 ### Manual Build
 ```bash
@@ -65,3 +101,4 @@ docker push atl.vultrcr.com/vllm/vllm-glm51-patched:latest
 ## Related
 - vLLM Issue #32829 (streaming long string parameters)
+- GLM-5.1 chat template: https://huggingface.co/zai-org/GLM-5.1-FP8/raw/main/chat_template.jinja
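As an editor's sanity check (not part of the commit), the regex change described above can be exercised against a newline-free tool call; the sample model output below is hypothetical:

```python
import re

# Old pattern: requires a literal "\n" between the name and the arguments.
old = re.compile(r"<tool_call>([^\n]*)\n(.*)</tool_call>", re.DOTALL)
# Fixed pattern: any (or no) whitespace after the name; argument body optional.
new = re.compile(
    r"<tool_call>\s*([\w.\-]+)\s*((?:<arg_key>.*)?)\s*</tool_call>", re.DOTALL
)

# GLM-5.1 emits no newline after the function name.
out = "<tool_call>get_weather<arg_key>city</arg_key><arg_value>Paris</arg_value></tool_call>"

assert old.search(out) is None             # old regex silently fails
m = new.search(out)
assert m.group(1) == "get_weather"         # name captured
assert m.group(2).startswith("<arg_key>")  # argument body captured

# Zero-argument call: group 2 is simply empty.
m0 = new.search("<tool_call>list_files</tool_call>")
assert m0 and m0.group(2) == ""
```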


@@ -1,14 +1,26 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 """
-GLM-4 Tool Call Parser with incremental string streaming support.
-
-This parser fixes the streaming issue reported in Issue #32829 where long string
-parameters (e.g., file content with 4000+ characters of code) are buffered until
-complete, causing multi-second delays before the user sees any content.
-
-The fix streams string values incrementally as they arrive, providing a true
-streaming experience for long content.
+GLM-4/5 Tool Call Parser — fixed version.
+
+Fixes applied over the upstream vLLM + sweetapi patch:
+
+1. **func_detail_regex no longer requires a newline** between tool name and
+   first <arg_key>. The model's chat template instructs:
+       <tool_call>{name}<arg_key>…</arg_key><arg_value>…</arg_value>…</tool_call>
+   with NO mandatory newline, but the original regex used ``[^\\n]*\\n`` which
+   silently failed when the model omitted it.
+2. **Zero-argument tool calls no longer crash** (TypeError on NoneType).
+3. **extract_tool_calls uses the same robust extraction helpers** as the
+   streaming path, so both paths parse identically.
+4. **_extract_tool_name_from_region** is more tolerant of whitespace /
+   formatting variants the model may produce.
+
+Drop this file into your vLLM install as a --tool-parser-plugin, or replace
+the built-in glm4_moe_tool_parser.py.
 """

 import ast
@@ -43,7 +55,7 @@ logger = init_logger(__name__)
 class Glm4MoeModelToolParser(ToolParser):
-    """Tool parser for GLM-4 models with incremental string streaming.
+    """Tool parser for GLM-4/5 models with incremental string streaming.

     On every streaming call the parser re-parses ``current_text`` to find
     ``<tool_call>`` regions, builds the JSON arguments string for each tool
@@ -67,10 +79,25 @@ class Glm4MoeModelToolParser(ToolParser):
         self.tool_calls_start_token = self.tool_call_start_token

-        self.func_call_regex = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)
-        self.func_detail_regex = re.compile(
-            r"<tool_call>([^\n]*)\n(.*)</tool_call>", re.DOTALL
-        )
+        # ---- FIXED regexes ------------------------------------------------
+        # Match the whole <tool_call>…</tool_call> block (unchanged).
+        self.func_call_regex = re.compile(
+            r"<tool_call>.*?</tool_call>", re.DOTALL
+        )
+        # FIX 1: The original regex required a literal \n between tool name
+        # and the body. The model often omits it. We now accept any
+        # whitespace (including none) before the first <arg_key>, and we
+        # make the body group optional so zero-argument calls don't fail.
+        self.func_detail_regex = re.compile(
+            r"<tool_call>\s*"      # opening tag + optional whitespace
+            r"([\w.\-]+)"          # group 1: tool/function name (word chars, dots, hyphens)
+            r"\s*"                 # optional whitespace / newline
+            r"((?:<arg_key>.*)?)"  # group 2: everything from first <arg_key> onward (may be empty)
+            r"\s*</tool_call>",    # closing tag
+            re.DOTALL,
+        )
         self.func_arg_regex = re.compile(
             r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
         )
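For illustration (editor's sketch, not part of the patch), the two fixed regexes compose like this when extracting a call's name and arguments; the `write_file` block below is a made-up example:

```python
import re

func_detail_regex = re.compile(
    r"<tool_call>\s*([\w.\-]+)\s*((?:<arg_key>.*)?)\s*</tool_call>", re.DOTALL
)
func_arg_regex = re.compile(
    r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
)

block = (
    "<tool_call>write_file"
    "<arg_key>filename</arg_key><arg_value>bst.py</arg_value>"
    "<arg_key>content</arg_key><arg_value>print('hi')</arg_value>"
    "</tool_call>"
)

detail = func_detail_regex.search(block)
name = detail.group(1).strip()
args_raw = detail.group(2) or ""  # FIX 2: "" instead of None for zero-arg calls
args = {k.strip(): v for k, v in func_arg_regex.findall(args_raw)}

assert name == "write_file"
assert args == {"filename": "bst.py", "content": "print('hi')"}
```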
@@ -95,27 +122,25 @@ class Glm4MoeModelToolParser(ToolParser):
         self._sent_content_idx: int = 0
         self._tool_call_ids: list[str] = []

+    # ------------------------------------------------------------------
+    # Static helpers
+    # ------------------------------------------------------------------
     @staticmethod
     def _deserialize(value: str) -> Any:
         try:
             return json.loads(value)
         except json.JSONDecodeError:
             pass
         try:
             return ast.literal_eval(value)
         except (ValueError, SyntaxError):
             pass
         return value

     @staticmethod
     def _json_escape_string_content(s: str) -> str:
-        """JSON-escape string content for incremental streaming.
-
-        This escapes the content that goes INSIDE a JSON string (between quotes),
-        not including the surrounding quotes themselves.
-        """
+        """JSON-escape string content (without surrounding quotes)."""
         if not s:
             return ""
         return json.dumps(s, ensure_ascii=False)[1:-1]
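The `json.dumps(...)[1:-1]` trick strips the surrounding quotes, leaving only the escaped interior. That is what lets the parser emit an opening quote once and then stream escaped content after it. A quick standalone check (editor's example):

```python
import json

def json_escape_string_content(s: str) -> str:
    """Escape s as the interior of a JSON string (no surrounding quotes)."""
    if not s:
        return ""
    return json.dumps(s, ensure_ascii=False)[1:-1]

chunk = 'line1\n"quoted"\ttab'
escaped = json_escape_string_content(chunk)
assert escaped == 'line1\\n\\"quoted\\"\\ttab'
# Re-wrapping in quotes round-trips through json.loads:
assert json.loads('"' + escaped + '"') == chunk
```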
@@ -144,7 +169,6 @@ class Glm4MoeModelToolParser(ToolParser):
     @staticmethod
     def _tools_enabled(request: ChatCompletionRequest) -> bool:
-        """Return whether tool parsing should be applied for this request."""
         try:
             tools = getattr(request, "tools", None)
             tool_choice = getattr(request, "tool_choice", None)
@@ -153,19 +177,22 @@ class Glm4MoeModelToolParser(ToolParser):
             logger.exception("Failed to determine if tools are enabled.")
             return False

+    # ------------------------------------------------------------------
+    # Request adjustment
+    # ------------------------------------------------------------------
     def adjust_request(
         self, request: ChatCompletionRequest | ResponsesRequest
     ) -> ChatCompletionRequest | ResponsesRequest:
-        """Adjust request parameters for tool call token handling."""
         request = super().adjust_request(request)
         if request.tools and request.tool_choice != "none":
-            # Ensure tool call tokens (<tool_call>, </tool_call>) are not skipped
-            # during decoding. Even though they are not marked as special tokens,
-            # setting skip_special_tokens=False ensures proper handling in
-            # transformers 5.x where decoding behavior may have changed.
             request.skip_special_tokens = False
         return request

+    # ------------------------------------------------------------------
+    # Non-streaming extraction
+    # ------------------------------------------------------------------
     def extract_tool_calls(
         self,
         model_output: str,
@@ -173,19 +200,20 @@ class Glm4MoeModelToolParser(ToolParser):
     ) -> ExtractedToolCallInformation:
         matched_tool_calls = self.func_call_regex.findall(model_output)
         logger.debug("model_output: %s", model_output)
         try:
             tool_calls: list[ToolCall] = []
             for match in matched_tool_calls:
                 tc_detail = self.func_detail_regex.search(match)
                 if not tc_detail:
-                    logger.warning(
-                        "Failed to parse tool call details from: %s",
-                        match,
-                    )
+                    logger.warning(
+                        "Failed to parse tool call details from: %s", match
+                    )
                     continue
                 tc_name = tc_detail.group(1).strip()
-                tc_args = tc_detail.group(2)
-                pairs = self.func_arg_regex.findall(tc_args) if tc_args else []
+                tc_args_raw = tc_detail.group(2) or ""  # FIX 2: default to ""
+                pairs = self.func_arg_regex.findall(tc_args_raw) if tc_args_raw else []
                 arg_dct: dict[str, Any] = {}
                 for key, value in pairs:
                     arg_key = key.strip()
@@ -208,38 +236,31 @@ class Glm4MoeModelToolParser(ToolParser):
             return ExtractedToolCallInformation(
                 tools_called=False, tool_calls=[], content=model_output
             )
-        else:
-            if len(tool_calls) > 0:
-                content: str | None = model_output[
-                    : model_output.find(self.tool_calls_start_token)
-                ]
-                # Normalize empty/whitespace-only content to None
-                if not content or not content.strip():
-                    content = None
-                return ExtractedToolCallInformation(
-                    tools_called=True, tool_calls=tool_calls, content=content
-                )
-            return ExtractedToolCallInformation(
-                tools_called=False, tool_calls=[], content=model_output
-            )
+        if tool_calls:
+            content: str | None = model_output[
+                : model_output.find(self.tool_calls_start_token)
+            ]
+            if not content or not content.strip():
+                content = None
+            return ExtractedToolCallInformation(
+                tools_called=True, tool_calls=tool_calls, content=content
+            )
+        return ExtractedToolCallInformation(
+            tools_called=False, tool_calls=[], content=model_output
+        )

+    # ------------------------------------------------------------------
+    # Streaming helpers
+    # ------------------------------------------------------------------
     def _extract_content(self, current_text: str) -> str | None:
-        """Return unsent non-tool-call text, or None.
-
-        Collects all text outside ``<tool_call>...</tool_call>`` regions,
-        including text between consecutive tool calls. Holds back any
-        suffix that could be a partial ``<tool_call>`` tag.
-        """
-        # Build the "sendable index" — the furthest point we can send
-        # content up to. We scan through the text collecting segments
-        # that are outside tool-call regions.
         content_segments: list[str] = []
         pos = self._sent_content_idx
         while pos < len(current_text):
             start = current_text.find(self.tool_call_start_token, pos)
             if start == -1:
-                # No more tool calls — send up to (len - partial-tag overlap)
                 tail = current_text[pos:]
                 overlap = partial_tag_overlap(tail, self.tool_call_start_token)
                 sendable = tail[: len(tail) - overlap] if overlap else tail
@@ -248,29 +269,24 @@ class Glm4MoeModelToolParser(ToolParser):
                 pos = len(current_text) - overlap
                 break

-            # Text before this <tool_call>
             if start > pos:
                 content_segments.append(current_text[pos:start])

-            # Skip past the </tool_call> (or to end if incomplete)
             end = current_text.find(self.tool_call_end_token, start)
             if end != -1:
                 pos = end + len(self.tool_call_end_token)
             else:
-                # Incomplete tool call — nothing more to send
                 pos = start
                 break

         if content_segments:
             self._sent_content_idx = pos
             return "".join(content_segments)

-        # Even if no content, advance past completed tool-call regions
         if pos > self._sent_content_idx:
             self._sent_content_idx = pos
         return None

     def _extract_tool_call_regions(self, text: str) -> list[tuple[str, bool]]:
-        """Extract ``(inner_text, is_complete)`` for each ``<tool_call>`` region."""
         results: list[tuple[str, bool]] = []
         pos = 0
         while True:
@@ -283,7 +299,6 @@ class Glm4MoeModelToolParser(ToolParser):
                 results.append((text[inner_start:end], True))
                 pos = end + len(self.tool_call_end_token)
             else:
-                # Incomplete tool call — strip partial </tool_call> suffix
                 raw = text[inner_start:]
                 overlap = partial_tag_overlap(raw, self.tool_call_end_token)
                 if overlap:
@@ -295,16 +310,31 @@ class Glm4MoeModelToolParser(ToolParser):
     def _extract_tool_name_from_region(self, inner_text: str) -> str | None:
         """Extract the tool name from the beginning of a tool-call region.

-        The name is everything before the first ``\\n`` or ``<arg_key>``.
-        Returns ``None`` if the name hasn't fully arrived yet.
+        The name is everything before the first ``\\n``, ``<arg_key>``, or
+        ``</tool_call>``. We also accept the name being the only content
+        (for zero-argument calls that are still in-flight).
         """
-        nl = inner_text.find("\n")
-        ak = inner_text.find(self.arg_key_start)
+        # Strip leading whitespace — model may emit \n after <tool_call>
+        stripped = inner_text.lstrip()
+        if not stripped:
+            return None
+        nl = stripped.find("\n")
+        ak = stripped.find(self.arg_key_start)
         candidates = [i for i in [nl, ak] if i != -1]
         if not candidates:
+            # No delimiter yet — if the text looks like a partial name
+            # (only word chars / dots / hyphens), return None to wait.
+            # If it's a complete name with no args (zero-arg call, complete),
+            # it will be handled when is_complete is True.
+            candidate_name = stripped.strip()
+            if re.fullmatch(r"[\w.\-]+", candidate_name):
+                # Could be a complete name or still arriving — return it
+                # so zero-arg complete calls work; the caller checks is_complete.
+                return candidate_name
             return None
         cut = min(candidates)
-        name = inner_text[:cut].strip()
+        name = stripped[:cut].strip()
         return name if name else None
     def _build_args_json_so_far(
@@ -313,17 +343,6 @@ class Glm4MoeModelToolParser(ToolParser):
         inner_text: str,
         is_complete: bool,
     ) -> str:
-        """Build the JSON arguments string from the XML pairs seen so far.
-
-        For complete ``<arg_key>/<arg_value>`` pairs the value is fully
-        formatted. For the last argument whose ``<arg_value>`` has been
-        opened but not closed, the partial string content is included
-        (JSON-escaped, with an opening ``"`` but no closing ``"``).
-
-        The closing ``}`` is only appended when ``is_complete`` is True
-        (i.e. the ``</tool_call>`` tag has arrived).
-        """
-        # Find all complete arg pairs
         pairs = self.func_arg_regex.findall(inner_text)
         parts: list[str] = []
@@ -331,8 +350,6 @@ class Glm4MoeModelToolParser(ToolParser):
             key = key.strip()
             key_json = json.dumps(key, ensure_ascii=False)
             if self._is_string_type(tool_name, key, self.tools):
-                # Don't strip string values — whitespace is significant
-                # and must match the partial-value path for diffing.
                 val_json = json.dumps(value, ensure_ascii=False)
             else:
                 val_json = json.dumps(
@@ -341,7 +358,6 @@ class Glm4MoeModelToolParser(ToolParser):
             parts.append(f"{key_json}: {val_json}")

         # Check for a partial (incomplete) arg value
-        # Find the last <arg_value> that isn't closed
         last_val_start = inner_text.rfind(self.arg_val_start)
         last_val_end = inner_text.rfind(self.arg_val_end)
         has_partial_value = last_val_start != -1 and (
@@ -349,8 +365,6 @@ class Glm4MoeModelToolParser(ToolParser):
         )
         if has_partial_value:
-            # Find the key for this partial value
-            # Look for the last <arg_key>...</arg_key> before this <arg_value>
             last_key_match = None
             for m in self._arg_key_pattern.finditer(inner_text[:last_val_start]):
                 last_key_match = m
@@ -360,16 +374,12 @@ class Glm4MoeModelToolParser(ToolParser):
             partial_content_start = last_val_start + len(self.arg_val_start)
             partial_content = inner_text[partial_content_start:]
-            # Hold back any partial </arg_value> suffix
             overlap = partial_tag_overlap(partial_content, self.arg_val_end)
             if overlap:
                 partial_content = partial_content[:-overlap]

             key_json = json.dumps(partial_key, ensure_ascii=False)
             if is_complete:
-                # Tool call finished but </arg_value> is missing
-                # (malformed output). Treat partial as complete value
-                # so the diff naturally closes any open quotes.
                 if self._is_string_type(tool_name, partial_key, self.tools):
                     val_json = json.dumps(partial_content, ensure_ascii=False)
                 else:
@@ -380,10 +390,8 @@ class Glm4MoeModelToolParser(ToolParser):
                 parts.append(f"{key_json}: {val_json}")
             elif self._is_string_type(tool_name, partial_key, self.tools):
                 escaped = self._json_escape_string_content(partial_content)
-                # Open quote but no close — more content may arrive
                 parts.append(f'{key_json}: "{escaped}')
             else:
-                # Non-string partial: include raw content, no wrapping
                 parts.append(f"{key_json}: {partial_content}")

         if not parts:
@@ -395,7 +403,6 @@ class Glm4MoeModelToolParser(ToolParser):
         return joined
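To make the partial-value behavior concrete, here is an editor's simplified sketch of the logic above, treating every argument as a string (the real method also handles non-string types and the malformed-output case):

```python
import json
import re

ARG_RE = re.compile(r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL)
VAL_START, VAL_END = "<arg_value>", "</arg_value>"

def args_json_so_far(inner_text: str, is_complete: bool) -> str:
    # Complete key/value pairs become fully formatted JSON members.
    parts = [
        f"{json.dumps(k.strip())}: {json.dumps(v)}"
        for k, v in ARG_RE.findall(inner_text)
    ]
    last_start = inner_text.rfind(VAL_START)
    last_end = inner_text.rfind(VAL_END)
    if last_start != -1 and last_end < last_start:  # an <arg_value> is still open
        key_m = list(re.finditer(r"<arg_key>(.*?)</arg_key>", inner_text[:last_start]))[-1]
        partial = inner_text[last_start + len(VAL_START):]
        escaped = json.dumps(partial)[1:-1]
        # Opening quote but no closing quote — more content may arrive.
        parts.append(f'{json.dumps(key_m.group(1).strip())}: "{escaped}')
    joined = "{" + ", ".join(parts)
    return joined + "}" if is_complete else joined

text = ("<arg_key>filename</arg_key><arg_value>bst.py</arg_value>"
        "<arg_key>content</arg_key><arg_value>def f():")
assert args_json_so_far(text, False) == '{"filename": "bst.py", "content": "def f():'
```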
     def _compute_args_diff(self, index: int, args_so_far: str) -> str | None:
-        """Return new argument text not yet sent for tool *index*, or None."""
         if not args_so_far or len(args_so_far) <= len(
             self.streamed_args_for_tool[index]
         ):
@@ -406,7 +413,6 @@ class Glm4MoeModelToolParser(ToolParser):
         return diff
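The streaming contract is that each delta is simply the suffix of the rebuilt arguments string that has not yet been sent, so concatenating the deltas reproduces the full string. A quick check (editor's example; names are illustrative):

```python
streamed = ""  # what has already been sent for this tool

def compute_diff(args_so_far: str):
    """Return the unsent suffix of args_so_far, or None if nothing is new."""
    global streamed
    if not args_so_far or len(args_so_far) <= len(streamed):
        return None
    diff = args_so_far[len(streamed):]
    streamed = args_so_far
    return diff

# Snapshots of the arguments JSON as it is rebuilt on successive calls.
snapshots = [
    '{"content": "def',
    '{"content": "def f():',
    '{"content": "def f():\\n    pass"}',
]
deltas = [d for d in (compute_diff(s) for s in snapshots) if d]
assert "".join(deltas) == snapshots[-1]  # deltas reassemble the full string
```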
     def _ensure_tool_state_for(self, index: int) -> None:
-        """Grow state arrays so that *index* is valid."""
         while len(self._tool_call_ids) <= index:
             self._tool_call_ids.append(
                 make_tool_call_id(id_type="random", func_name=None, idx=None)
@@ -416,6 +422,10 @@ class Glm4MoeModelToolParser(ToolParser):
         while len(self.prev_tool_call_arr) <= index:
             self.prev_tool_call_arr.append({})

+    # ------------------------------------------------------------------
+    # Main streaming entry point
+    # ------------------------------------------------------------------
     def extract_tool_calls_streaming(
         self,
         previous_text: str,
@@ -436,7 +446,6 @@ class Glm4MoeModelToolParser(ToolParser):
         for i, (inner_text, is_complete) in enumerate(regions):
             self._ensure_tool_state_for(i)

-            # Extract tool name
             tool_name = self._extract_tool_name_from_region(inner_text)
             if not tool_name:
                 break
@@ -471,7 +480,6 @@ class Glm4MoeModelToolParser(ToolParser):
                 )
             )

-        # Update current_tool_id for serving layer compatibility
         if regions:
             self.current_tool_id = len(regions) - 1
@@ -480,4 +488,4 @@ class Glm4MoeModelToolParser(ToolParser):
             content=content,
             tool_calls=tool_call_deltas,
         )

         return None
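`utils.py` itself is not shown in this commit view. Based on how `partial_tag_overlap` is called above, a minimal implementation consistent with that usage might look like this (assumption: it returns the length of the longest proper prefix of `tag` that `text` ends with, so a suffix like `"</tool_ca"` is held back until it either grows into the full tag or turns out to be ordinary text):

```python
def partial_tag_overlap(text: str, tag: str) -> int:
    """Length of the longest proper prefix of `tag` that `text` ends with."""
    max_len = min(len(text), len(tag) - 1)  # a full tag is not "partial"
    for n in range(max_len, 0, -1):
        if text.endswith(tag[:n]):
            return n
    return 0

assert partial_tag_overlap("hello </tool_ca", "</tool_call>") == 9
assert partial_tag_overlap("hello world", "</tool_call>") == 0
assert partial_tag_overlap("x<", "</tool_call>") == 1
```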

tests/requirements.txt Normal file

@@ -0,0 +1 @@
httpx>=0.25.0

tests/run_tests.sh Executable file

@@ -0,0 +1,19 @@
#!/bin/bash
# Run the streaming tool call tests
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Default values
export VLLM_API_BASE="${VLLM_API_BASE:-http://localhost:8000/v1}"
export VLLM_API_KEY="${VLLM_API_KEY:-none}"
export VLLM_MODEL="${VLLM_MODEL:-zai-org/GLM-5.1-FP8}"
echo "Configuration:"
echo " API_BASE: $VLLM_API_BASE"
echo " MODEL: $VLLM_MODEL"
echo ""
# Run the test
python3 "$SCRIPT_DIR/test_streaming_tool_calls.py"


@@ -0,0 +1,386 @@
#!/usr/bin/env python3
"""
Test suite for vLLM GLM-5.1 streaming tool calls.
Reproduces the issue where long string parameters in tool calls
are buffered entirely before being emitted during streaming.
"""
import os
import time
import json
import httpx
from datetime import datetime
# Configuration - will be set via environment or direct assignment
API_BASE = os.environ.get("VLLM_API_BASE", "http://localhost:8000/v1")
API_KEY = os.environ.get("VLLM_API_KEY", "none")
MODEL = os.environ.get("VLLM_MODEL", "zai-org/GLM-5.1-FP8")
def timestamp():
return datetime.now().strftime("%H:%M:%S.%f")[:-3]
def test_streaming_tool_call_with_code():
"""
Test streaming a tool call with a long string parameter.
This prompts the model to generate code via a tool call,
which should stream incrementally if the patch works correctly.
"""
tools = [
{
"type": "function",
"function": {
"name": "write_file",
"description": "Write content to a file. Use this to save code, text, or other content.",
"parameters": {
"type": "object",
"properties": {
"filename": {
"type": "string",
"description": "Name of the file to write"
},
"content": {
"type": "string",
"description": "The content to write to the file"
}
},
"required": ["filename", "content"]
}
}
}
]
messages = [
{
"role": "user",
"content": "Write a Python implementation of a binary search tree with insert, search, and delete methods. Include docstrings and type hints. Save it to bst.py using the write_file tool."
}
]
print(f"\n{'='*60}")
print(f"TEST: Streaming tool call with long string parameter")
print(f"API: {API_BASE}")
print(f"Model: {MODEL}")
print(f"{'='*60}\n")
# Track streaming events
chunks_received = []
first_chunk_time = None
last_chunk_time = None
tool_call_chunks = []
accumulated_content = ""
start_time = time.time()
with httpx.Client(timeout=120.0) as client:
with client.stream(
"POST",
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": True,
"max_tokens": 4096
}
) as response:
print(f"[{timestamp()}] Response status: {response.status_code}")
for line in response.iter_lines():
if not line or line == "data: [DONE]":
continue
if line.startswith("data: "):
chunk_data = line[6:]
try:
chunk = json.loads(chunk_data)
if first_chunk_time is None:
first_chunk_time = time.time()
print(f"\n[{timestamp()}] FIRST CHUNK RECEIVED ({first_chunk_time - start_time:.3f}s)")
last_chunk_time = time.time()
chunks_received.append(chunk)
# Extract delta content
if chunk.get("choices"):
delta = chunk["choices"][0].get("delta", {})
# Check for tool calls in delta
if delta.get("tool_calls"):
for tc in delta["tool_calls"]:
tc_index = tc.get("index", 0)
tc_function = tc.get("function", {})
if tc_function.get("name"):
print(f"\n[{timestamp()}] Tool call name: {tc_function['name']}")
if tc_function.get("arguments"):
args_chunk = tc_function["arguments"]
tool_call_chunks.append(args_chunk)
accumulated_content += args_chunk
# Print progress every ~500 chars
if len(accumulated_content) % 500 < len(args_chunk):
print(f"[{timestamp()}] Accumulated {len(accumulated_content)} chars...")
# Regular content
if delta.get("content"):
print(f"[{timestamp()}] Content chunk: {delta['content'][:50]}...")
except json.JSONDecodeError as e:
print(f"[{timestamp()}] JSON decode error: {e}")
end_time = time.time()
# Summary
print(f"\n{'='*60}")
print("SUMMARY")
print(f"{'='*60}")
print(f"Total chunks received: {len(chunks_received)}")
print(f"Total time: {end_time - start_time:.3f}s")
if first_chunk_time:
print(f"Time to first chunk: {first_chunk_time - start_time:.3f}s")
if tool_call_chunks:
print(f"Tool call chunks: {len(tool_call_chunks)}")
print(f"Total tool call content: {len(accumulated_content)} chars")
# Try to parse the accumulated arguments
print(f"\nAttempting to parse tool call arguments...")
try:
args = json.loads(accumulated_content)
print(f"Successfully parsed!")
print(f" - filename: {args.get('filename', 'N/A')}")
print(f" - content length: {len(args.get('content', ''))} chars")
except json.JSONDecodeError as e:
print(f"Failed to parse: {e}")
print(f"Raw accumulated content (first 500 chars):\n{accumulated_content[:500]}")
# Verdict
print(f"\n{'='*60}")
if len(tool_call_chunks) > 1:
print("✓ PASS: Tool call arguments arrived in multiple chunks")
print(f" Chunks: {len(tool_call_chunks)}, indicating incremental streaming")
elif len(tool_call_chunks) == 1 and len(accumulated_content) > 1000:
print("✗ FAIL: Tool call arguments arrived in a single chunk")
print(" This indicates buffering, not true streaming")
else:
print("? INCONCLUSIVE: Not enough data or no tool call occurred")
print(f"{'='*60}\n")
return {
"chunks_received": len(chunks_received),
"tool_call_chunks": len(tool_call_chunks),
"accumulated_length": len(accumulated_content),
"total_time": end_time - start_time
}
def test_streaming_tool_call_with_json():
"""
Test streaming a tool call that returns structured JSON data.
"""
tools = [
{
"type": "function",
"function": {
"name": "save_config",
"description": "Save a configuration object",
"parameters": {
"type": "object",
"properties": {
"config": {
"type": "object",
"description": "Configuration object with many fields"
}
},
"required": ["config"]
}
}
}
]
messages = [
{
"role": "user",
"content": "Create a detailed configuration for a web server with the following sections: server (host, port, ssl), logging (level, format, outputs), cache (enabled, ttl, max_size), rate_limiting (enabled, requests_per_minute, burst), cors (enabled, origins, methods, headers), security (headers, csp, hsts). Use the save_config tool."
}
]
print(f"\n{'='*60}")
print(f"TEST: Streaming tool call with nested JSON")
print(f"{'='*60}\n")
tool_call_chunks = []
accumulated_content = ""
start_time = time.time()
with httpx.Client(timeout=120.0) as client:
with client.stream(
"POST",
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": True,
"max_tokens": 2048
}
) as response:
for line in response.iter_lines():
if not line or line == "data: [DONE]":
continue
if line.startswith("data: "):
try:
chunk = json.loads(line[6:])
if chunk.get("choices"):
delta = chunk["choices"][0].get("delta", {})
if delta.get("tool_calls"):
for tc in delta["tool_calls"]:
if tc.get("function", {}).get("arguments"):
args_chunk = tc["function"]["arguments"]
tool_call_chunks.append(args_chunk)
accumulated_content += args_chunk
print(f"[{timestamp()}] Chunk {len(tool_call_chunks)}: +{len(args_chunk)} chars (total: {len(accumulated_content)})")
except json.JSONDecodeError:
pass
end_time = time.time()
print(f"\n{'='*60}")
print(f"Total chunks: {len(tool_call_chunks)}, Total content: {len(accumulated_content)} chars")
print(f"Time: {end_time - start_time:.3f}s")
if len(tool_call_chunks) > 1:
print("✓ PASS: Arguments streamed in multiple chunks")
elif len(tool_call_chunks) == 1:
print("✗ FAIL: Arguments arrived in single chunk (buffered)")
else:
print("? No tool call occurred")
print(f"{'='*60}\n")
def test_non_streaming_tool_call():
"""
Baseline test: non-streaming tool call for comparison.
"""
tools = [
{
"type": "function",
"function": {
"name": "write_file",
"description": "Write content to a file",
"parameters": {
"type": "object",
"properties": {
"filename": {"type": "string"},
"content": {"type": "string"}
},
"required": ["filename", "content"]
}
}
}
]
messages = [
{
"role": "user",
"content": "Write a simple Python hello world and save it using the write_file tool."
}
]
print(f"\n{'='*60}")
print(f"TEST: Non-streaming tool call (baseline)")
print(f"{'='*60}\n")
start_time = time.time()
with httpx.Client(timeout=120.0) as client:
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": False,
"max_tokens": 1024
}
)
result = response.json()
end_time = time.time()
print(f"Status: {response.status_code}")
print(f"Time: {end_time - start_time:.3f}s")
if result.get("choices"):
message = result["choices"][0].get("message", {})
if message.get("tool_calls"):
for tc in message["tool_calls"]:
print(f"Tool: {tc['function']['name']}")
args = json.loads(tc["function"]["arguments"])
print(f"Arguments parsed successfully")
print(f" - filename: {args.get('filename')}")
print(f" - content length: {len(args.get('content', ''))}")
else:
print("No tool call in response")
print(f"{'='*60}\n")
def main():
print("\n" + "="*60)
print("vLLM GLM-5.1 Streaming Tool Call Tests")
print("="*60)
# Check API connectivity
print(f"\nChecking API at {API_BASE}...")
try:
with httpx.Client(timeout=10.0) as client:
response = client.get(f"{API_BASE.replace('/v1', '')}/health")
print(f"Health check: {response.status_code}")
except Exception as e:
print(f"Warning: Could not reach API - {e}")
# Run tests
print("\nRunning tests...\n")
# Test 1: Non-streaming baseline
test_non_streaming_tool_call()
# Test 2: Streaming with nested JSON
test_streaming_tool_call_with_json()
# Test 3: Main test - streaming with long code
result = test_streaming_tool_call_with_code()
print("\nAll tests complete.")
if __name__ == "__main__":
main()
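One detail worth noting for every response-parsing path in these tests: `dict.get("content", "")` still returns `None` when the API sends an explicit JSON `null` (as it does for assistant messages that carry only `tool_calls`), so substring checks like `"42" in content` can raise `TypeError`. A minimal illustration:

```python
import json

# An assistant message whose content is JSON null, as returned alongside tool_calls
msg = json.loads('{"role": "assistant", "content": null, "tool_calls": []}')

# .get's default only applies when the key is absent, not when its value is null
assert msg.get("content", "") is None

# Coalescing with `or` yields a safe string either way
content = msg.get("content") or ""
assert content == ""
```

Using `.get("content") or ""` in the test scripts avoids this crash without changing any passing behavior.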


@@ -0,0 +1,234 @@
#!/usr/bin/env python3
"""
Focused test to diagnose the GLM-5.1 tool response issue.
The issue: the model sees tool responses as blank.
"""
import os
import httpx
import json
API_BASE = os.environ.get("VLLM_API_BASE", "https://api.vultrinference.com/v1")
API_KEY = os.environ.get("VLLM_API_KEY", "")  # set via environment; never hardcode credentials
MODEL = os.environ.get("VLLM_MODEL", "zai-org/GLM-5.1-FP8")
def test_simple_tool_response():
"""
Minimal test: Send a tool response and see if the model can use it.
"""
# Simulate a conversation where a tool was called
messages = [
{"role": "user", "content": "Call the test function"},
{
"role": "assistant",
"tool_calls": [{
"id": "call_123",
"type": "function",
"function": {"name": "test_func", "arguments": "{}"}
}]
},
{
"role": "tool",
"tool_call_id": "call_123",
"content": "SUCCESS: The function returned value 42"
}
]
tools = [{
"type": "function",
"function": {
"name": "test_func",
"description": "A test function",
"parameters": {"type": "object", "properties": {}}
}
}]
print("=" * 60)
print("Request messages:")
print(json.dumps(messages, indent=2))
print("=" * 60)
with httpx.Client(timeout=60.0) as client:
# Non-streaming to get full response
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"stream": False,
"max_tokens": 256
}
)
result = response.json()
print("\nFull response:")
print(json.dumps(result, indent=2))
if result.get("choices"):
            content = result["choices"][0].get("message", {}).get("content") or ""  # content may be JSON null
print("\n" + "=" * 60)
print("Model response content:")
print(content)
print("=" * 60)
# Check if the tool result is referenced
if "42" in content:
print("\n✓ PASS: Model referenced the tool result (42)")
else:
print("\n✗ FAIL: Model did NOT reference the tool result (42)")
# Check for signs the model didn't see the result
if "don't have" in content.lower() or "cannot access" in content.lower():
print("✗ Model indicates it cannot see tool result")
def test_without_tools_param():
"""
Test what happens if we don't pass tools in the follow-up request.
Some APIs need tools to be passed on every request.
"""
messages = [
{"role": "user", "content": "Call the test function"},
{
"role": "assistant",
"tool_calls": [{
"id": "call_123",
"type": "function",
"function": {"name": "test_func", "arguments": "{}"}
}]
},
{
"role": "tool",
"tool_call_id": "call_123",
"content": "SUCCESS: The function returned value 42"
}
]
print("\n" + "=" * 60)
print("Test WITHOUT tools param in follow-up")
print("=" * 60)
with httpx.Client(timeout=60.0) as client:
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
# No tools param
"stream": False,
"max_tokens": 256
}
)
result = response.json()
if result.get("choices"):
            content = result["choices"][0].get("message", {}).get("content") or ""  # content may be JSON null
print("Model response:", content[:200])
if "42" in content:
print("✓ Model referenced the tool result")
def test_different_content_formats():
"""
Test if the issue is with how content is formatted.
"""
# Test 1: String content (standard)
messages_string = [
{"role": "user", "content": "What is 2+2?"},
{
"role": "assistant",
"tool_calls": [{
"id": "call_123",
"type": "function",
"function": {"name": "calc", "arguments": "{}"}
}]
},
{
"role": "tool",
"tool_call_id": "call_123",
"content": "The answer is 4"
}
]
# Test 2: Content as array (OpenAI format)
messages_array = [
{"role": "user", "content": "What is 2+2?"},
{
"role": "assistant",
"tool_calls": [{
"id": "call_123",
"type": "function",
"function": {"name": "calc", "arguments": "{}"}
}]
},
{
"role": "tool",
"tool_call_id": "call_123",
"content": [{"type": "text", "text": "The answer is 4"}]
}
]
tools = [{
"type": "function",
"function": {
"name": "calc",
"description": "Calculator",
"parameters": {"type": "object", "properties": {}}
}
}]
print("\n" + "=" * 60)
print("Test: String content vs Array content")
print("=" * 60)
with httpx.Client(timeout=60.0) as client:
for name, msgs in [("String content", messages_string), ("Array content", messages_array)]:
print(f"\n--- {name} ---")
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": msgs,
"tools": tools,
"stream": False,
"max_tokens": 128
}
)
result = response.json()
if result.get("choices"):
                content = result["choices"][0].get("message", {}).get("content") or ""  # content may be JSON null
print(f"Response: {content[:150]}")
if "4" in content:
print("✓ Referenced tool result")
else:
print("✗ Did NOT reference tool result")
if __name__ == "__main__":
print("GLM-5.1 Tool Response Diagnosis")
print("=" * 60)
test_simple_tool_response()
test_without_tools_param()
test_different_content_formats()
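The newline sensitivity described in the patch notes can also be reproduced offline, with no server involved. A sketch using the two regex strings quoted in the README above; the bracketed token literals stand in for GLM's special tokens, and the sample payload is illustrative:

```python
import re

# Old pattern: demands a newline between the function name and the arguments
old_regex = re.compile(r"\[TOOL_CALL_START\]([^\n]*)\n(.*)\[TOOL_CALL_END\]", re.DOTALL)
# Fixed pattern: tolerates any (or no) whitespace after the name
new_regex = re.compile(
    r"\[TOOL_CALL_START\]\s*([\w.\-]+)\s*((?:\[ARG_KEY\].*)?)\s*\[TOOL_CALL_END\]",
    re.DOTALL,
)

# GLM-5.1's chat template emits no newline after the function name
sample = "[TOOL_CALL_START]get_weather[ARG_KEY]location[ARG_END][TOOL_CALL_END]"

print("old regex matches:", old_regex.search(sample) is not None)  # False
print("new regex matches:", new_regex.search(sample) is not None)  # True
```

The old pattern silently fails on such output, which is the root of the "blank tool response" behavior these scripts diagnose.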

tests/test_tool_response.py

@@ -0,0 +1,445 @@
#!/usr/bin/env python3
"""
Test for tool call response handling in GLM-5.1.
Tests the multi-turn flow:
1. Send a prompt that triggers a tool call
2. Send back the tool result
3. Verify the model can see and use the tool response
This reproduces the issue where tool responses appear blank to the model.
"""
import os
import json
import httpx
from datetime import datetime
API_BASE = os.environ.get("VLLM_API_BASE", "http://localhost:8000/v1")
API_KEY = os.environ.get("VLLM_API_KEY", "none")
MODEL = os.environ.get("VLLM_MODEL", "zai-org/GLM-5.1-FP8")
def timestamp():
return datetime.now().strftime("%H:%M:%S.%f")[:-3]
def test_tool_call_response_flow(streaming: bool = True):
"""
Test the full tool call -> response -> follow-up flow.
This simulates:
1. User asks for weather
2. Model calls get_weather tool
3. We send back the weather data
4. Model should see and use that data
"""
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and state, e.g. 'New York, NY'"
}
},
"required": ["location"]
}
}
}
]
# Initial request that should trigger a tool call
messages = [
{
"role": "user",
"content": "What's the weather like in Tokyo right now?"
}
]
mode = "STREAMING" if streaming else "NON-STREAMING"
print(f"\n{'='*60}")
print(f"TEST: Tool call response flow ({mode})")
print(f"API: {API_BASE}")
print(f"Model: {MODEL}")
print(f"{'='*60}\n")
with httpx.Client(timeout=120.0) as client:
# Step 1: Send initial request, expect tool call
print(f"[{timestamp()}] Step 1: Sending initial request...")
if streaming:
tool_calls = []
tool_call_id = None
tool_call_name = None
accumulated_args = ""
with client.stream(
"POST",
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": True,
"max_tokens": 512
}
) as response:
print(f"[{timestamp()}] Response status: {response.status_code}")
for line in response.iter_lines():
if not line or line == "data: [DONE]":
continue
if line.startswith("data: "):
try:
chunk = json.loads(line[6:])
if chunk.get("choices"):
delta = chunk["choices"][0].get("delta", {})
if delta.get("tool_calls"):
for tc in delta["tool_calls"]:
idx = tc.get("index", 0)
if tc.get("id"):
tool_call_id = tc["id"]
if tc.get("function", {}).get("name"):
tool_call_name = tc["function"]["name"]
print(f"[{timestamp()}] Tool call: {tool_call_name}")
if tc.get("function", {}).get("arguments"):
accumulated_args += tc["function"]["arguments"]
if delta.get("content"):
print(f"[{timestamp()}] Content: {delta['content'][:100]}")
except json.JSONDecodeError as e:
print(f"[{timestamp()}] JSON error: {e}")
if tool_call_name:
tool_calls.append({
"id": tool_call_id or "call_0",
"type": "function",
"function": {
"name": tool_call_name,
"arguments": accumulated_args
}
})
else:
# Non-streaming
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": False,
"max_tokens": 512
}
)
result = response.json()
print(f"[{timestamp()}] Response status: {response.status_code}")
tool_calls = []
if result.get("choices"):
message = result["choices"][0].get("message", {})
if message.get("tool_calls"):
tool_calls = message["tool_calls"]
for tc in tool_calls:
print(f"[{timestamp()}] Tool call: {tc['function']['name']}")
print(f"[{timestamp()}] Args: {tc['function']['arguments']}")
# Check if we got a tool call
if not tool_calls:
print(f"\n[{timestamp()}] No tool call received - model didn't call the tool")
return {"success": False, "reason": "no_tool_call"}
# Step 2: Parse tool call and prepare response
tc = tool_calls[0]
tc_id = tc.get("id", "call_0")
tc_name = tc["function"]["name"]
tc_args = json.loads(tc["function"]["arguments"])
print(f"\n[{timestamp()}] Step 2: Tool call received")
print(f" Name: {tc_name}")
print(f" Args: {tc_args}")
# Simulate tool execution
tool_result = {
"location": tc_args.get("location", "Unknown"),
"temperature": "22°C",
"condition": "Partly cloudy",
"humidity": "65%",
"wind": "15 km/h NE"
}
# Step 3: Send the tool response back
messages.append({
"role": "assistant",
"tool_calls": tool_calls
})
messages.append({
"role": "tool",
"tool_call_id": tc_id,
"content": json.dumps(tool_result)
})
print(f"\n[{timestamp()}] Step 3: Sending tool response...")
print(f" Tool call ID: {tc_id}")
print(f" Tool result: {json.dumps(tool_result, indent=2)}")
# Step 4: Get the model's follow-up response
if streaming:
final_response = ""
print(f"\n[{timestamp()}] Step 4: Receiving model's follow-up (streaming)...")
with client.stream(
"POST",
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"stream": True,
"max_tokens": 512
}
) as response:
for line in response.iter_lines():
if not line or line == "data: [DONE]":
continue
if line.startswith("data: "):
try:
chunk = json.loads(line[6:])
if chunk.get("choices"):
delta = chunk["choices"][0].get("delta", {})
if delta.get("content"):
content = delta["content"]
final_response += content
print(f"[{timestamp()}] Content: {content}", end="", flush=True)
except json.JSONDecodeError:
pass
print() # newline after streaming output
else:
print(f"\n[{timestamp()}] Step 4: Receiving model's follow-up (non-streaming)...")
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"stream": False,
"max_tokens": 512
}
)
result = response.json()
final_response = ""
if result.get("choices"):
                final_response = result["choices"][0].get("message", {}).get("content") or ""  # content may be JSON null
print(f"\n[{timestamp()}] Final response:\n{final_response}")
# Check if the model used the tool data
success = True
issues = []
# The response should mention the weather data
    if "22" not in final_response:  # "22°C" contains "22", so one check suffices
        issues.append("Temperature (22°C) not mentioned in response")
        success = False
    if "cloudy" not in final_response.lower():  # covers "partly cloudy" too
        issues.append("Condition (Partly cloudy) not mentioned in response")
        success = False
# Check for signs the model didn't see the data
blank_indicators = [
"i don't have",
"i cannot access",
"i'm unable to",
"i am unable to",
"don't have access",
"don't have real-time",
"cannot provide real-time"
]
for indicator in blank_indicators:
if indicator in final_response.lower():
issues.append(f"Model seems unaware of tool result (found: '{indicator}')")
success = False
break
print(f"\n{'='*60}")
if success:
print("✓ PASS: Model correctly used tool response data")
else:
print("✗ FAIL: Model did not use tool response correctly")
for issue in issues:
print(f" - {issue}")
print(f"{'='*60}\n")
return {
"success": success,
"issues": issues,
"final_response": final_response
}
def test_tool_response_with_debug_info():
"""
Test with detailed logging to capture exactly what the model sees.
"""
tools = [
{
"type": "function",
"function": {
"name": "get_time",
"description": "Get the current time",
"parameters": {
"type": "object",
"properties": {},
"required": []
}
}
}
]
print(f"\n{'='*60}")
print(f"TEST: Tool response with debug info (non-streaming)")
print(f"{'='*60}\n")
messages = [
{"role": "user", "content": "What time is it?"}
]
with httpx.Client(timeout=120.0) as client:
# Get tool call
print(f"[{timestamp()}] Sending initial request...")
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": False,
"max_tokens": 256
}
)
result = response.json()
if not result.get("choices") or not result["choices"][0].get("message", {}).get("tool_calls"):
print("No tool call - skipping test")
return
tool_call = result["choices"][0]["message"]["tool_calls"][0]
tc_id = tool_call["id"]
print(f"[{timestamp()}] Tool call: {tool_call['function']['name']}")
print(f"[{timestamp()}] Tool call ID: {tc_id}")
# Add tool response
messages.append({
"role": "assistant",
"tool_calls": [tool_call]
})
messages.append({
"role": "tool",
"tool_call_id": tc_id,
"content": "The current time is 3:45 PM on Thursday, April 9, 2026."
})
# Debug: print the full messages array we're about to send
print(f"\n[{timestamp()}] Sending follow-up with these messages:")
print(json.dumps(messages, indent=2))
# Get follow-up
response2 = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"stream": False,
"max_tokens": 256
}
)
result2 = response2.json()
print(f"\n[{timestamp()}] Full response:")
print(json.dumps(result2, indent=2))
if result2.get("choices"):
            content = result2["choices"][0].get("message", {}).get("content") or ""  # content may be JSON null
print(f"\n[{timestamp()}] Model response content: {content}")
# Check if time is mentioned
            if "3:45" in content:  # covers "3:45 PM" as well
print("\n✓ Model used the tool response (time mentioned)")
else:
print("\n✗ Model may not have seen the tool response (time not mentioned)")
def main():
print("\n" + "="*60)
print("GLM-5.1 Tool Call Response Tests")
print("="*60)
# Test non-streaming first (simpler to debug)
print("\n--- Test 1: Non-streaming tool response flow ---")
test_tool_call_response_flow(streaming=False)
# Test streaming
print("\n--- Test 2: Streaming tool response flow ---")
test_tool_call_response_flow(streaming=True)
# Debug test
print("\n--- Test 3: Debug info test ---")
test_tool_response_with_debug_info()
print("\nAll tests complete.")
if __name__ == "__main__":
main()
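Each streaming test above repeats the same SSE line-parsing loop. It can be factored into a small generator; a sketch under the same assumptions as the tests (`iter_sse_deltas` is a hypothetical helper name, and the input is any iterable of SSE lines such as `response.iter_lines()`):

```python
import json
from typing import Iterable, Iterator

def iter_sse_deltas(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the `delta` dict from each OpenAI-style streaming chunk."""
    for line in lines:
        if not line or line == "data: [DONE]":
            continue
        if not line.startswith("data: "):
            continue
        try:
            chunk = json.loads(line[len("data: "):])
        except json.JSONDecodeError:
            continue  # skip malformed or partial lines
        choices = chunk.get("choices") or []
        if choices:
            yield choices[0].get("delta", {})

# Example with canned lines, shaped like iter_lines() output
lines = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    "",
    "data: [DONE]",
]
print(list(iter_sse_deltas(lines)))  # [{'content': 'Hello'}]
```

With this helper, each test loop reduces to iterating deltas and inspecting `delta.get("tool_calls")` or `delta.get("content")`.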