patch parser

2026-04-09 04:28:22 +00:00
parent 40159e865e
commit 8d5da5750d
8 changed files with 1239 additions and 108 deletions

.gitignore vendored Normal file

@@ -0,0 +1 @@
/.venv


@@ -1,55 +1,91 @@
 # vLLM GLM Tool Parser Patch
-## Purpose
-Patches vLLM's GLM-4/GLM-5.1 tool parser to fix a streaming issue where long string parameters are buffered entirely before being emitted, causing multi-second delays.
-## The Problem
-GLM models emit tool calls in a special XML-like format:
-```
-<tool_call>tool_name
-<arg_key>param_name</arg_key><arg_value>param_value</arg_value>
-</tool_call>
-```
-The upstream parser (as of vLLM issue #32829) buffers string values until the closing tag arrives. For long strings (e.g., 4000+ characters of code), users see nothing until the entire value is complete — not true streaming.
-## The Fix (Pulled from https://github.com/vllm-project/vllm/pull/39253)
-`glm4_moe_tool_parser.py` implements incremental string streaming:
-- Re-parses `<tool_call>` regions on each streaming call
-- Diffs against previously sent content
-- Emits only new characters as they arrive
-- String values now stream character-by-character
+Patches vLLM's GLM-4/GLM-5.1 tool parser to fix multiple issues with tool call handling.
+## Issues Fixed
+### Issue 1: Tool Response Content Ignored (CRITICAL)
+**Symptom:** When the model makes a tool call and receives a response, it acts as if the response were empty ("The function returned no output") even though valid content was provided.
+**Root Cause:** The `func_detail_regex` required a newline between the function name and the first argument tag, but GLM-5.1's chat template does NOT include that newline. The regex silently failed to match, tool call extraction failed, and somewhere in that failure path the tool response content got lost.
+**Model output format (no newline after name):**
+```
+<tool_call>function_name<arg_key>key</arg_key><arg_value>value</arg_value>...</tool_call>
+```
+**Old regex (broken):**
+```python
+r"<tool_call>([^\n]*)\n(.*)</tool_call>"  # requires \n after the name
+```
+**Fixed regex:**
+```python
+r"<tool_call>\s*([\w.\-]+)\s*((?:<arg_key>.*)?)\s*</tool_call>"
+```
+The fix:
+- Uses `\s*` instead of a mandatory `\n`
+- Makes the arguments group optional for zero-argument calls
+- Accepts word chars, dots, and hyphens in function names
+### Issue 2: Zero-Argument Tool Calls Crash
+**Symptom:** `TypeError: 'NoneType' object is not iterable` when a tool has no arguments.
+**Fix:** `tc_args_raw` now defaults to an empty string: `tc_args_raw = tc_detail.group(2) or ""`
+### Issue 3: Streaming Path vs Non-Streaming Path Inconsistency
+Both paths now use the same robust extraction helpers, so they parse identically.
 ## Files
 | File | Description |
 |------|-------------|
-| `glm4_moe_tool_parser.py` | Fixed tool parser with incremental streaming |
+| `glm4_moe_tool_parser.py` | Fixed tool parser |
 | `utils.py` | Utility functions for partial JSON/tag handling |
 | `Dockerfile` | Overlays patched files onto base image |
 | `Jenkinsfile` | CI/CD pipeline for building and pushing |
+| `tests/` | Test suite for tool call validation |
+## Testing
+### Requirements
+```bash
+pip install httpx regex
+```
+### Running Tests
+```bash
+export VLLM_API_BASE="https://api.vultrinference.com/v1"
+export VLLM_API_KEY="your-api-key"
+export VLLM_MODEL="zai-org/GLM-5.1-FP8"
+python tests/test_tool_diagnosis.py
+```
+### Test Cases
+| Test | Description |
+|------|-------------|
+| `test_simple_tool_response` | Verifies the model can see tool response content |
+| `test_without_tools_param` | Tests behavior without the tools param in the follow-up |
+| `test_different_content_formats` | String vs. array content formats |
 ## Deployment
 ### Jenkins Pipeline
+Build via Jenkins:
 ```bash
 curl -X POST "https://jenkins.sweetapi.com/job/vllm-glm-build/buildWithParameters" \
   -u "admin:TOKEN" \
   -d "IMAGE_TAG=latest"
 ```
+Parameters:
+- `IMAGE_TAG` - Docker image tag (default: `latest`)
+- `GIT_REPO` - Git repository URL (optional, uses workspace if empty)
+- `GIT_BRANCH` - Git branch to build (default: `master`)
 ### Manual Build
 ```bash
@@ -65,3 +101,4 @@ docker push atl.vultrcr.com/vllm/vllm-glm51-patched:latest
 ## Related
 - vLLM Issue #32829 (streaming long string parameters)
+- GLM-5.1 chat template: https://huggingface.co/zai-org/GLM-5.1-FP8/raw/main/chat_template.jinja
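As an editor's sanity check (not part of the commit), the regex change described above can be exercised against a newline-free tool call; the sample model output below is hypothetical:

```python
import re

# Old pattern: requires a literal "\n" between the name and the arguments.
old = re.compile(r"<tool_call>([^\n]*)\n(.*)</tool_call>", re.DOTALL)
# Fixed pattern: any (or no) whitespace after the name; argument body optional.
new = re.compile(
    r"<tool_call>\s*([\w.\-]+)\s*((?:<arg_key>.*)?)\s*</tool_call>", re.DOTALL
)

# GLM-5.1 emits no newline after the function name.
out = "<tool_call>get_weather<arg_key>city</arg_key><arg_value>Paris</arg_value></tool_call>"

assert old.search(out) is None             # old regex silently fails
m = new.search(out)
assert m.group(1) == "get_weather"         # name captured
assert m.group(2).startswith("<arg_key>")  # argument body captured

# Zero-argument call: group 2 is simply empty.
m0 = new.search("<tool_call>list_files</tool_call>")
assert m0 and m0.group(2) == ""
```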


@@ -1,14 +1,26 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 """
-GLM-4 Tool Call Parser with incremental string streaming support.
-
-This parser fixes the streaming issue reported in Issue #32829 where long string
-parameters (e.g., file content with 4000+ characters of code) are buffered until
-complete, causing multi-second delays before the user sees any content.
-
-The fix streams string values incrementally as they arrive, providing a true
-streaming experience for long content.
+GLM-4/5 Tool Call Parser — fixed version.
+
+Fixes applied over the upstream vLLM + sweetapi patch:
+
+1. **func_detail_regex no longer requires a newline** between tool name and
+   first <arg_key>. The model's chat template instructs:
+       <tool_call>{name}<arg_key>…</arg_key><arg_value>…</arg_value>…</tool_call>
+   with NO mandatory newline, but the original regex used ``[^\\n]*\\n`` which
+   silently failed when the model omitted it.
+2. **Zero-argument tool calls no longer crash** (TypeError on NoneType).
+3. **extract_tool_calls uses the same robust extraction helpers** as the
+   streaming path, so both paths parse identically.
+4. **_extract_tool_name_from_region** is more tolerant of whitespace /
+   formatting variants the model may produce.
+
+Drop this file into your vLLM install as a --tool-parser-plugin, or replace
+the built-in glm4_moe_tool_parser.py.
 """

 import ast
@@ -43,7 +55,7 @@ logger = init_logger(__name__)
 class Glm4MoeModelToolParser(ToolParser):
-    """Tool parser for GLM-4 models with incremental string streaming.
+    """Tool parser for GLM-4/5 models with incremental string streaming.

     On every streaming call the parser re-parses ``current_text`` to find
     ``<tool_call>`` regions, builds the JSON arguments string for each tool
@@ -67,10 +79,25 @@ class Glm4MoeModelToolParser(ToolParser):
         self.tool_calls_start_token = self.tool_call_start_token

-        self.func_call_regex = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)
-        self.func_detail_regex = re.compile(
-            r"<tool_call>([^\n]*)\n(.*)</tool_call>", re.DOTALL
-        )
+        # ---- FIXED regexes ------------------------------------------------
+        # Match the whole <tool_call>…</tool_call> block (unchanged).
+        self.func_call_regex = re.compile(
+            r"<tool_call>.*?</tool_call>", re.DOTALL
+        )
+        # FIX 1: The original regex required a literal \n between tool name
+        # and the body. The model often omits it. We now accept any
+        # whitespace (including none) before the first <arg_key>, and we
+        # make the body group optional so zero-argument calls don't fail.
+        self.func_detail_regex = re.compile(
+            r"<tool_call>\s*"      # opening tag + optional whitespace
+            r"([\w.\-]+)"          # group 1: tool/function name (word chars, dots, hyphens)
+            r"\s*"                 # optional whitespace / newline
+            r"((?:<arg_key>.*)?)"  # group 2: everything from first <arg_key> onward (may be empty)
+            r"\s*</tool_call>",    # closing tag
+            re.DOTALL,
+        )
         self.func_arg_regex = re.compile(
             r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
         )
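For illustration (editor's sketch, not part of the patch), the two fixed regexes compose like this when extracting a call's name and arguments; the `write_file` block below is a made-up example:

```python
import re

func_detail_regex = re.compile(
    r"<tool_call>\s*([\w.\-]+)\s*((?:<arg_key>.*)?)\s*</tool_call>", re.DOTALL
)
func_arg_regex = re.compile(
    r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
)

block = (
    "<tool_call>write_file"
    "<arg_key>filename</arg_key><arg_value>bst.py</arg_value>"
    "<arg_key>content</arg_key><arg_value>print('hi')</arg_value>"
    "</tool_call>"
)

detail = func_detail_regex.search(block)
name = detail.group(1).strip()
args_raw = detail.group(2) or ""  # FIX 2: "" instead of None for zero-arg calls
args = {k.strip(): v for k, v in func_arg_regex.findall(args_raw)}

assert name == "write_file"
assert args == {"filename": "bst.py", "content": "print('hi')"}
```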
@@ -95,27 +122,25 @@ class Glm4MoeModelToolParser(ToolParser):
         self._sent_content_idx: int = 0
         self._tool_call_ids: list[str] = []

+    # ------------------------------------------------------------------
+    # Static helpers
+    # ------------------------------------------------------------------
     @staticmethod
     def _deserialize(value: str) -> Any:
         try:
             return json.loads(value)
         except json.JSONDecodeError:
             pass
         try:
             return ast.literal_eval(value)
         except (ValueError, SyntaxError):
             pass
         return value

     @staticmethod
     def _json_escape_string_content(s: str) -> str:
-        """JSON-escape string content for incremental streaming.
-
-        This escapes the content that goes INSIDE a JSON string (between quotes),
-        not including the surrounding quotes themselves.
-        """
+        """JSON-escape string content (without surrounding quotes)."""
         if not s:
             return ""
         return json.dumps(s, ensure_ascii=False)[1:-1]
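The `json.dumps(...)[1:-1]` trick strips the surrounding quotes, leaving only the escaped interior. That is what lets the parser emit an opening quote once and then stream escaped content after it. A quick standalone check (editor's example):

```python
import json

def json_escape_string_content(s: str) -> str:
    """Escape s as the interior of a JSON string (no surrounding quotes)."""
    if not s:
        return ""
    return json.dumps(s, ensure_ascii=False)[1:-1]

chunk = 'line1\n"quoted"\ttab'
escaped = json_escape_string_content(chunk)
assert escaped == 'line1\\n\\"quoted\\"\\ttab'
# Re-wrapping in quotes round-trips through json.loads:
assert json.loads('"' + escaped + '"') == chunk
```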
@@ -144,7 +169,6 @@ class Glm4MoeModelToolParser(ToolParser):
     @staticmethod
     def _tools_enabled(request: ChatCompletionRequest) -> bool:
-        """Return whether tool parsing should be applied for this request."""
         try:
             tools = getattr(request, "tools", None)
             tool_choice = getattr(request, "tool_choice", None)
@@ -153,19 +177,22 @@ class Glm4MoeModelToolParser(ToolParser):
             logger.exception("Failed to determine if tools are enabled.")
             return False

+    # ------------------------------------------------------------------
+    # Request adjustment
+    # ------------------------------------------------------------------
     def adjust_request(
         self, request: ChatCompletionRequest | ResponsesRequest
     ) -> ChatCompletionRequest | ResponsesRequest:
-        """Adjust request parameters for tool call token handling."""
         request = super().adjust_request(request)
         if request.tools and request.tool_choice != "none":
-            # Ensure tool call tokens (<tool_call>, </tool_call>) are not skipped
-            # during decoding. Even though they are not marked as special tokens,
-            # setting skip_special_tokens=False ensures proper handling in
-            # transformers 5.x where decoding behavior may have changed.
             request.skip_special_tokens = False
         return request

+    # ------------------------------------------------------------------
+    # Non-streaming extraction
+    # ------------------------------------------------------------------
     def extract_tool_calls(
         self,
         model_output: str,
@@ -173,19 +200,20 @@ class Glm4MoeModelToolParser(ToolParser):
     ) -> ExtractedToolCallInformation:
         matched_tool_calls = self.func_call_regex.findall(model_output)
         logger.debug("model_output: %s", model_output)
         try:
             tool_calls: list[ToolCall] = []
             for match in matched_tool_calls:
                 tc_detail = self.func_detail_regex.search(match)
                 if not tc_detail:
-                    logger.warning(
-                        "Failed to parse tool call details from: %s",
-                        match,
-                    )
+                    logger.warning(
+                        "Failed to parse tool call details from: %s", match
+                    )
                     continue
                 tc_name = tc_detail.group(1).strip()
-                tc_args = tc_detail.group(2)
-                pairs = self.func_arg_regex.findall(tc_args) if tc_args else []
+                tc_args_raw = tc_detail.group(2) or ""  # FIX 2: default to ""
+                pairs = self.func_arg_regex.findall(tc_args_raw) if tc_args_raw else []
                 arg_dct: dict[str, Any] = {}
                 for key, value in pairs:
                     arg_key = key.strip()
@@ -208,38 +236,31 @@ class Glm4MoeModelToolParser(ToolParser):
             return ExtractedToolCallInformation(
                 tools_called=False, tool_calls=[], content=model_output
             )
-        else:
-            if len(tool_calls) > 0:
-                content: str | None = model_output[
-                    : model_output.find(self.tool_calls_start_token)
-                ]
-                # Normalize empty/whitespace-only content to None
-                if not content or not content.strip():
-                    content = None
-                return ExtractedToolCallInformation(
-                    tools_called=True, tool_calls=tool_calls, content=content
-                )
-            return ExtractedToolCallInformation(
-                tools_called=False, tool_calls=[], content=model_output
-            )
+        if tool_calls:
+            content: str | None = model_output[
+                : model_output.find(self.tool_calls_start_token)
+            ]
+            if not content or not content.strip():
+                content = None
+            return ExtractedToolCallInformation(
+                tools_called=True, tool_calls=tool_calls, content=content
+            )
+        return ExtractedToolCallInformation(
+            tools_called=False, tool_calls=[], content=model_output
+        )

+    # ------------------------------------------------------------------
+    # Streaming helpers
+    # ------------------------------------------------------------------
     def _extract_content(self, current_text: str) -> str | None:
-        """Return unsent non-tool-call text, or None.
-
-        Collects all text outside ``<tool_call>...</tool_call>`` regions,
-        including text between consecutive tool calls. Holds back any
-        suffix that could be a partial ``<tool_call>`` tag.
-        """
-        # Build the "sendable index" — the furthest point we can send
-        # content up to. We scan through the text collecting segments
-        # that are outside tool-call regions.
         content_segments: list[str] = []
         pos = self._sent_content_idx
         while pos < len(current_text):
             start = current_text.find(self.tool_call_start_token, pos)
             if start == -1:
-                # No more tool calls — send up to (len - partial-tag overlap)
                 tail = current_text[pos:]
                 overlap = partial_tag_overlap(tail, self.tool_call_start_token)
                 sendable = tail[: len(tail) - overlap] if overlap else tail
@@ -248,29 +269,24 @@ class Glm4MoeModelToolParser(ToolParser):
                 pos = len(current_text) - overlap
                 break

-            # Text before this <tool_call>
             if start > pos:
                 content_segments.append(current_text[pos:start])

-            # Skip past the </tool_call> (or to end if incomplete)
             end = current_text.find(self.tool_call_end_token, start)
             if end != -1:
                 pos = end + len(self.tool_call_end_token)
             else:
-                # Incomplete tool call — nothing more to send
                 pos = start
                 break

         if content_segments:
             self._sent_content_idx = pos
             return "".join(content_segments)

-        # Even if no content, advance past completed tool-call regions
         if pos > self._sent_content_idx:
             self._sent_content_idx = pos
         return None

     def _extract_tool_call_regions(self, text: str) -> list[tuple[str, bool]]:
-        """Extract ``(inner_text, is_complete)`` for each ``<tool_call>`` region."""
         results: list[tuple[str, bool]] = []
         pos = 0
         while True:
@@ -283,7 +299,6 @@ class Glm4MoeModelToolParser(ToolParser):
                 results.append((text[inner_start:end], True))
                 pos = end + len(self.tool_call_end_token)
             else:
-                # Incomplete tool call — strip partial </tool_call> suffix
                 raw = text[inner_start:]
                 overlap = partial_tag_overlap(raw, self.tool_call_end_token)
                 if overlap:
@@ -295,16 +310,31 @@ class Glm4MoeModelToolParser(ToolParser):
     def _extract_tool_name_from_region(self, inner_text: str) -> str | None:
         """Extract the tool name from the beginning of a tool-call region.

-        The name is everything before the first ``\\n`` or ``<arg_key>``.
-        Returns ``None`` if the name hasn't fully arrived yet.
+        The name is everything before the first ``\\n``, ``<arg_key>``, or
+        ``</tool_call>``. We also accept the name being the only content
+        (for zero-argument calls that are still in-flight).
         """
-        nl = inner_text.find("\n")
-        ak = inner_text.find(self.arg_key_start)
+        # Strip leading whitespace — model may emit \n after <tool_call>
+        stripped = inner_text.lstrip()
+        if not stripped:
+            return None
+        nl = stripped.find("\n")
+        ak = stripped.find(self.arg_key_start)
         candidates = [i for i in [nl, ak] if i != -1]
         if not candidates:
+            # No delimiter yet — if the text looks like a partial name
+            # (only word chars / dots / hyphens), return None to wait.
+            # If it's a complete name with no args (zero-arg call, complete),
+            # it will be handled when is_complete is True.
+            candidate_name = stripped.strip()
+            if re.fullmatch(r"[\w.\-]+", candidate_name):
+                # Could be a complete name or still arriving — return it
+                # so zero-arg complete calls work; the caller checks is_complete.
+                return candidate_name
             return None
         cut = min(candidates)
-        name = inner_text[:cut].strip()
+        name = stripped[:cut].strip()
         return name if name else None
     def _build_args_json_so_far(
@@ -313,17 +343,6 @@ class Glm4MoeModelToolParser(ToolParser):
         inner_text: str,
         is_complete: bool,
     ) -> str:
-        """Build the JSON arguments string from the XML pairs seen so far.
-
-        For complete ``<arg_key>/<arg_value>`` pairs the value is fully
-        formatted. For the last argument whose ``<arg_value>`` has been
-        opened but not closed, the partial string content is included
-        (JSON-escaped, with an opening ``"`` but no closing ``"``).
-
-        The closing ``}`` is only appended when ``is_complete`` is True
-        (i.e. the ``</tool_call>`` tag has arrived).
-        """
-        # Find all complete arg pairs
         pairs = self.func_arg_regex.findall(inner_text)
         parts: list[str] = []
@@ -331,8 +350,6 @@ class Glm4MoeModelToolParser(ToolParser):
             key = key.strip()
             key_json = json.dumps(key, ensure_ascii=False)
             if self._is_string_type(tool_name, key, self.tools):
-                # Don't strip string values — whitespace is significant
-                # and must match the partial-value path for diffing.
                 val_json = json.dumps(value, ensure_ascii=False)
             else:
                 val_json = json.dumps(
@@ -341,7 +358,6 @@ class Glm4MoeModelToolParser(ToolParser):
             parts.append(f"{key_json}: {val_json}")

         # Check for a partial (incomplete) arg value
-        # Find the last <arg_value> that isn't closed
         last_val_start = inner_text.rfind(self.arg_val_start)
         last_val_end = inner_text.rfind(self.arg_val_end)
         has_partial_value = last_val_start != -1 and (
@@ -349,8 +365,6 @@ class Glm4MoeModelToolParser(ToolParser):
         )
         if has_partial_value:
-            # Find the key for this partial value
-            # Look for the last <arg_key>...</arg_key> before this <arg_value>
             last_key_match = None
             for m in self._arg_key_pattern.finditer(inner_text[:last_val_start]):
                 last_key_match = m
@@ -360,16 +374,12 @@ class Glm4MoeModelToolParser(ToolParser):
             partial_content_start = last_val_start + len(self.arg_val_start)
             partial_content = inner_text[partial_content_start:]
-            # Hold back any partial </arg_value> suffix
             overlap = partial_tag_overlap(partial_content, self.arg_val_end)
             if overlap:
                 partial_content = partial_content[:-overlap]

             key_json = json.dumps(partial_key, ensure_ascii=False)
             if is_complete:
-                # Tool call finished but </arg_value> is missing
-                # (malformed output). Treat partial as complete value
-                # so the diff naturally closes any open quotes.
                 if self._is_string_type(tool_name, partial_key, self.tools):
                     val_json = json.dumps(partial_content, ensure_ascii=False)
                 else:
@@ -380,10 +390,8 @@ class Glm4MoeModelToolParser(ToolParser):
                 parts.append(f"{key_json}: {val_json}")
             elif self._is_string_type(tool_name, partial_key, self.tools):
                 escaped = self._json_escape_string_content(partial_content)
-                # Open quote but no close — more content may arrive
                 parts.append(f'{key_json}: "{escaped}')
             else:
-                # Non-string partial: include raw content, no wrapping
                 parts.append(f"{key_json}: {partial_content}")

         if not parts:
@@ -395,7 +403,6 @@ class Glm4MoeModelToolParser(ToolParser):
         return joined
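To make the partial-value behavior concrete, here is an editor's simplified sketch of the logic above, treating every argument as a string (the real method also handles non-string types and the malformed-output case):

```python
import json
import re

ARG_RE = re.compile(r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL)
VAL_START, VAL_END = "<arg_value>", "</arg_value>"

def args_json_so_far(inner_text: str, is_complete: bool) -> str:
    # Complete key/value pairs become fully formatted JSON members.
    parts = [
        f"{json.dumps(k.strip())}: {json.dumps(v)}"
        for k, v in ARG_RE.findall(inner_text)
    ]
    last_start = inner_text.rfind(VAL_START)
    last_end = inner_text.rfind(VAL_END)
    if last_start != -1 and last_end < last_start:  # an <arg_value> is still open
        key_m = list(re.finditer(r"<arg_key>(.*?)</arg_key>", inner_text[:last_start]))[-1]
        partial = inner_text[last_start + len(VAL_START):]
        escaped = json.dumps(partial)[1:-1]
        # Opening quote but no closing quote — more content may arrive.
        parts.append(f'{json.dumps(key_m.group(1).strip())}: "{escaped}')
    joined = "{" + ", ".join(parts)
    return joined + "}" if is_complete else joined

text = ("<arg_key>filename</arg_key><arg_value>bst.py</arg_value>"
        "<arg_key>content</arg_key><arg_value>def f():")
assert args_json_so_far(text, False) == '{"filename": "bst.py", "content": "def f():'
```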
     def _compute_args_diff(self, index: int, args_so_far: str) -> str | None:
-        """Return new argument text not yet sent for tool *index*, or None."""
         if not args_so_far or len(args_so_far) <= len(
             self.streamed_args_for_tool[index]
         ):
@@ -406,7 +413,6 @@ class Glm4MoeModelToolParser(ToolParser):
         return diff
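The streaming contract is that each delta is simply the suffix of the rebuilt arguments string that has not yet been sent, so concatenating the deltas reproduces the full string. A quick check (editor's example; names are illustrative):

```python
streamed = ""  # what has already been sent for this tool

def compute_diff(args_so_far: str):
    """Return the unsent suffix of args_so_far, or None if nothing is new."""
    global streamed
    if not args_so_far or len(args_so_far) <= len(streamed):
        return None
    diff = args_so_far[len(streamed):]
    streamed = args_so_far
    return diff

# Snapshots of the arguments JSON as it is rebuilt on successive calls.
snapshots = [
    '{"content": "def',
    '{"content": "def f():',
    '{"content": "def f():\\n    pass"}',
]
deltas = [d for d in (compute_diff(s) for s in snapshots) if d]
assert "".join(deltas) == snapshots[-1]  # deltas reassemble the full string
```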
     def _ensure_tool_state_for(self, index: int) -> None:
-        """Grow state arrays so that *index* is valid."""
         while len(self._tool_call_ids) <= index:
             self._tool_call_ids.append(
                 make_tool_call_id(id_type="random", func_name=None, idx=None)
@@ -416,6 +422,10 @@ class Glm4MoeModelToolParser(ToolParser):
         while len(self.prev_tool_call_arr) <= index:
             self.prev_tool_call_arr.append({})

+    # ------------------------------------------------------------------
+    # Main streaming entry point
+    # ------------------------------------------------------------------
     def extract_tool_calls_streaming(
         self,
         previous_text: str,
@@ -436,7 +446,6 @@ class Glm4MoeModelToolParser(ToolParser):
         for i, (inner_text, is_complete) in enumerate(regions):
             self._ensure_tool_state_for(i)

-            # Extract tool name
             tool_name = self._extract_tool_name_from_region(inner_text)
             if not tool_name:
                 break
@@ -471,7 +480,6 @@ class Glm4MoeModelToolParser(ToolParser):
                 )
             )

-        # Update current_tool_id for serving layer compatibility
         if regions:
             self.current_tool_id = len(regions) - 1
@@ -480,4 +488,4 @@ class Glm4MoeModelToolParser(ToolParser):
             content=content,
             tool_calls=tool_call_deltas,
         )

         return None
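`utils.py` itself is not shown in this commit view. Based on how `partial_tag_overlap` is called above, a minimal implementation consistent with that usage might look like this (assumption: it returns the length of the longest proper prefix of `tag` that `text` ends with, so a suffix like `"</tool_ca"` is held back until it either grows into the full tag or turns out to be ordinary text):

```python
def partial_tag_overlap(text: str, tag: str) -> int:
    """Length of the longest proper prefix of `tag` that `text` ends with."""
    max_len = min(len(text), len(tag) - 1)  # a full tag is not "partial"
    for n in range(max_len, 0, -1):
        if text.endswith(tag[:n]):
            return n
    return 0

assert partial_tag_overlap("hello </tool_ca", "</tool_call>") == 9
assert partial_tag_overlap("hello world", "</tool_call>") == 0
assert partial_tag_overlap("x<", "</tool_call>") == 1
```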

tests/requirements.txt Normal file

@@ -0,0 +1 @@
httpx>=0.25.0

tests/run_tests.sh Executable file

@@ -0,0 +1,19 @@
#!/bin/bash
# Run the streaming tool call tests
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Default values
export VLLM_API_BASE="${VLLM_API_BASE:-http://localhost:8000/v1}"
export VLLM_API_KEY="${VLLM_API_KEY:-none}"
export VLLM_MODEL="${VLLM_MODEL:-zai-org/GLM-5.1-FP8}"
echo "Configuration:"
echo " API_BASE: $VLLM_API_BASE"
echo " MODEL: $VLLM_MODEL"
echo ""
# Run the test
python3 "$SCRIPT_DIR/test_streaming_tool_calls.py"


@@ -0,0 +1,386 @@
#!/usr/bin/env python3
"""
Test suite for vLLM GLM-5.1 streaming tool calls.
Reproduces the issue where long string parameters in tool calls
are buffered entirely before being emitted during streaming.
"""
import os
import time
import json
import httpx
from datetime import datetime
# Configuration - will be set via environment or direct assignment
API_BASE = os.environ.get("VLLM_API_BASE", "http://localhost:8000/v1")
API_KEY = os.environ.get("VLLM_API_KEY", "none")
MODEL = os.environ.get("VLLM_MODEL", "zai-org/GLM-5.1-FP8")
def timestamp():
return datetime.now().strftime("%H:%M:%S.%f")[:-3]
def test_streaming_tool_call_with_code():
"""
Test streaming a tool call with a long string parameter.
This prompts the model to generate code via a tool call,
which should stream incrementally if the patch works correctly.
"""
tools = [
{
"type": "function",
"function": {
"name": "write_file",
"description": "Write content to a file. Use this to save code, text, or other content.",
"parameters": {
"type": "object",
"properties": {
"filename": {
"type": "string",
"description": "Name of the file to write"
},
"content": {
"type": "string",
"description": "The content to write to the file"
}
},
"required": ["filename", "content"]
}
}
}
]
messages = [
{
"role": "user",
"content": "Write a Python implementation of a binary search tree with insert, search, and delete methods. Include docstrings and type hints. Save it to bst.py using the write_file tool."
}
]
print(f"\n{'='*60}")
print(f"TEST: Streaming tool call with long string parameter")
print(f"API: {API_BASE}")
print(f"Model: {MODEL}")
print(f"{'='*60}\n")
# Track streaming events
chunks_received = []
first_chunk_time = None
last_chunk_time = None
tool_call_chunks = []
accumulated_content = ""
start_time = time.time()
with httpx.Client(timeout=120.0) as client:
with client.stream(
"POST",
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": True,
"max_tokens": 4096
}
) as response:
print(f"[{timestamp()}] Response status: {response.status_code}")
for line in response.iter_lines():
if not line or line == "data: [DONE]":
continue
if line.startswith("data: "):
chunk_data = line[6:]
try:
chunk = json.loads(chunk_data)
if first_chunk_time is None:
first_chunk_time = time.time()
print(f"\n[{timestamp()}] FIRST CHUNK RECEIVED ({first_chunk_time - start_time:.3f}s)")
last_chunk_time = time.time()
chunks_received.append(chunk)
# Extract delta content
if chunk.get("choices"):
delta = chunk["choices"][0].get("delta", {})
# Check for tool calls in delta
if delta.get("tool_calls"):
for tc in delta["tool_calls"]:
tc_index = tc.get("index", 0)
tc_function = tc.get("function", {})
if tc_function.get("name"):
print(f"\n[{timestamp()}] Tool call name: {tc_function['name']}")
if tc_function.get("arguments"):
args_chunk = tc_function["arguments"]
tool_call_chunks.append(args_chunk)
accumulated_content += args_chunk
# Print progress every ~500 chars
if len(accumulated_content) % 500 < len(args_chunk):
print(f"[{timestamp()}] Accumulated {len(accumulated_content)} chars...")
# Regular content
if delta.get("content"):
print(f"[{timestamp()}] Content chunk: {delta['content'][:50]}...")
except json.JSONDecodeError as e:
print(f"[{timestamp()}] JSON decode error: {e}")
end_time = time.time()
# Summary
print(f"\n{'='*60}")
print("SUMMARY")
print(f"{'='*60}")
print(f"Total chunks received: {len(chunks_received)}")
print(f"Total time: {end_time - start_time:.3f}s")
if first_chunk_time:
print(f"Time to first chunk: {first_chunk_time - start_time:.3f}s")
if tool_call_chunks:
print(f"Tool call chunks: {len(tool_call_chunks)}")
print(f"Total tool call content: {len(accumulated_content)} chars")
# Try to parse the accumulated arguments
print(f"\nAttempting to parse tool call arguments...")
try:
args = json.loads(accumulated_content)
print(f"Successfully parsed!")
print(f" - filename: {args.get('filename', 'N/A')}")
print(f" - content length: {len(args.get('content', ''))} chars")
except json.JSONDecodeError as e:
print(f"Failed to parse: {e}")
print(f"Raw accumulated content (first 500 chars):\n{accumulated_content[:500]}")
# Verdict
print(f"\n{'='*60}")
if len(tool_call_chunks) > 1:
print("✓ PASS: Tool call arguments arrived in multiple chunks")
print(f" Chunks: {len(tool_call_chunks)}, indicating incremental streaming")
elif len(tool_call_chunks) == 1 and len(accumulated_content) > 1000:
print("✗ FAIL: Tool call arguments arrived in a single chunk")
print(" This indicates buffering, not true streaming")
else:
print("? INCONCLUSIVE: Not enough data or no tool call occurred")
print(f"{'='*60}\n")
return {
"chunks_received": len(chunks_received),
"tool_call_chunks": len(tool_call_chunks),
"accumulated_length": len(accumulated_content),
"total_time": end_time - start_time
}
def test_streaming_tool_call_with_json():
"""
Test streaming a tool call that returns structured JSON data.
"""
tools = [
{
"type": "function",
"function": {
"name": "save_config",
"description": "Save a configuration object",
"parameters": {
"type": "object",
"properties": {
"config": {
"type": "object",
"description": "Configuration object with many fields"
}
},
"required": ["config"]
}
}
}
]
messages = [
{
"role": "user",
"content": "Create a detailed configuration for a web server with the following sections: server (host, port, ssl), logging (level, format, outputs), cache (enabled, ttl, max_size), rate_limiting (enabled, requests_per_minute, burst), cors (enabled, origins, methods, headers), security (headers, csp, hsts). Use the save_config tool."
}
]
print(f"\n{'='*60}")
print(f"TEST: Streaming tool call with nested JSON")
print(f"{'='*60}\n")
tool_call_chunks = []
accumulated_content = ""
start_time = time.time()
with httpx.Client(timeout=120.0) as client:
with client.stream(
"POST",
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": True,
"max_tokens": 2048
}
) as response:
for line in response.iter_lines():
if not line or line == "data: [DONE]":
continue
if line.startswith("data: "):
try:
chunk = json.loads(line[6:])
if chunk.get("choices"):
delta = chunk["choices"][0].get("delta", {})
if delta.get("tool_calls"):
for tc in delta["tool_calls"]:
if tc.get("function", {}).get("arguments"):
args_chunk = tc["function"]["arguments"]
tool_call_chunks.append(args_chunk)
accumulated_content += args_chunk
print(f"[{timestamp()}] Chunk {len(tool_call_chunks)}: +{len(args_chunk)} chars (total: {len(accumulated_content)})")
except json.JSONDecodeError:
pass
end_time = time.time()
print(f"\n{'='*60}")
print(f"Total chunks: {len(tool_call_chunks)}, Total content: {len(accumulated_content)} chars")
print(f"Time: {end_time - start_time:.3f}s")
if len(tool_call_chunks) > 1:
print("✓ PASS: Arguments streamed in multiple chunks")
elif len(tool_call_chunks) == 1:
print("✗ FAIL: Arguments arrived in single chunk (buffered)")
else:
print("? No tool call occurred")
print(f"{'='*60}\n")
def test_non_streaming_tool_call():
"""
Baseline test: non-streaming tool call for comparison.
"""
tools = [
{
"type": "function",
"function": {
"name": "write_file",
"description": "Write content to a file",
"parameters": {
"type": "object",
"properties": {
"filename": {"type": "string"},
"content": {"type": "string"}
},
"required": ["filename", "content"]
}
}
}
]
messages = [
{
"role": "user",
"content": "Write a simple Python hello world and save it using the write_file tool."
}
]
print(f"\n{'='*60}")
print(f"TEST: Non-streaming tool call (baseline)")
print(f"{'='*60}\n")
start_time = time.time()
with httpx.Client(timeout=120.0) as client:
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": False,
"max_tokens": 1024
}
)
result = response.json()
end_time = time.time()
print(f"Status: {response.status_code}")
print(f"Time: {end_time - start_time:.3f}s")
if result.get("choices"):
message = result["choices"][0].get("message", {})
if message.get("tool_calls"):
for tc in message["tool_calls"]:
print(f"Tool: {tc['function']['name']}")
args = json.loads(tc["function"]["arguments"])
print(f"Arguments parsed successfully")
print(f" - filename: {args.get('filename')}")
print(f" - content length: {len(args.get('content', ''))}")
else:
print("No tool call in response")
print(f"{'='*60}\n")
def main():
print("\n" + "="*60)
print("vLLM GLM-5.1 Streaming Tool Call Tests")
print("="*60)
# Check API connectivity
print(f"\nChecking API at {API_BASE}...")
try:
with httpx.Client(timeout=10.0) as client:
response = client.get(f"{API_BASE.replace('/v1', '')}/health")
print(f"Health check: {response.status_code}")
except Exception as e:
print(f"Warning: Could not reach API - {e}")
# Run tests
print("\nRunning tests...\n")
# Test 1: Non-streaming baseline
test_non_streaming_tool_call()
# Test 2: Streaming with nested JSON
test_streaming_tool_call_with_json()
# Test 3: Main test - streaming with long code
result = test_streaming_tool_call_with_code()
print("\nAll tests complete.")
if __name__ == "__main__":
main()
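One detail worth noting for every response-parsing path in these tests: `dict.get("content", "")` still returns `None` when the API sends an explicit JSON `null` (as it does for assistant messages that carry only `tool_calls`), so substring checks like `"42" in content` can raise `TypeError`. A minimal illustration:

```python
import json

# An assistant message whose content is JSON null, as returned alongside tool_calls
msg = json.loads('{"role": "assistant", "content": null, "tool_calls": []}')

# .get's default only applies when the key is absent, not when its value is null
assert msg.get("content", "") is None

# Coalescing with `or` yields a safe string either way
content = msg.get("content") or ""
assert content == ""
```

Using `.get("content") or ""` in the test scripts avoids this crash without changing any passing behavior.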


@@ -0,0 +1,234 @@
#!/usr/bin/env python3
"""
Focused test to diagnose the GLM-5.1 tool response issue.
The issue: the model sees tool responses as blank.
"""
import os
import httpx
import json
API_BASE = os.environ.get("VLLM_API_BASE", "https://api.vultrinference.com/v1")
API_KEY = os.environ.get("VLLM_API_KEY", "")  # set via environment; never hardcode credentials
MODEL = os.environ.get("VLLM_MODEL", "zai-org/GLM-5.1-FP8")
def test_simple_tool_response():
"""
Minimal test: Send a tool response and see if the model can use it.
"""
# Simulate a conversation where a tool was called
messages = [
{"role": "user", "content": "Call the test function"},
{
"role": "assistant",
"tool_calls": [{
"id": "call_123",
"type": "function",
"function": {"name": "test_func", "arguments": "{}"}
}]
},
{
"role": "tool",
"tool_call_id": "call_123",
"content": "SUCCESS: The function returned value 42"
}
]
tools = [{
"type": "function",
"function": {
"name": "test_func",
"description": "A test function",
"parameters": {"type": "object", "properties": {}}
}
}]
print("=" * 60)
print("Request messages:")
print(json.dumps(messages, indent=2))
print("=" * 60)
with httpx.Client(timeout=60.0) as client:
# Non-streaming to get full response
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"stream": False,
"max_tokens": 256
}
)
result = response.json()
print("\nFull response:")
print(json.dumps(result, indent=2))
if result.get("choices"):
            content = result["choices"][0].get("message", {}).get("content") or ""  # content may be JSON null
print("\n" + "=" * 60)
print("Model response content:")
print(content)
print("=" * 60)
# Check if the tool result is referenced
if "42" in content:
print("\n✓ PASS: Model referenced the tool result (42)")
else:
print("\n✗ FAIL: Model did NOT reference the tool result (42)")
# Check for signs the model didn't see the result
if "don't have" in content.lower() or "cannot access" in content.lower():
print("✗ Model indicates it cannot see tool result")
def test_without_tools_param():
"""
Test what happens if we don't pass tools in the follow-up request.
Some APIs need tools to be passed on every request.
"""
messages = [
{"role": "user", "content": "Call the test function"},
{
"role": "assistant",
"tool_calls": [{
"id": "call_123",
"type": "function",
"function": {"name": "test_func", "arguments": "{}"}
}]
},
{
"role": "tool",
"tool_call_id": "call_123",
"content": "SUCCESS: The function returned value 42"
}
]
print("\n" + "=" * 60)
print("Test WITHOUT tools param in follow-up")
print("=" * 60)
with httpx.Client(timeout=60.0) as client:
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
# No tools param
"stream": False,
"max_tokens": 256
}
)
result = response.json()
if result.get("choices"):
            content = result["choices"][0].get("message", {}).get("content") or ""  # content may be JSON null
print("Model response:", content[:200])
if "42" in content:
print("✓ Model referenced the tool result")
def test_different_content_formats():
"""
Test if the issue is with how content is formatted.
"""
# Test 1: String content (standard)
messages_string = [
{"role": "user", "content": "What is 2+2?"},
{
"role": "assistant",
"tool_calls": [{
"id": "call_123",
"type": "function",
"function": {"name": "calc", "arguments": "{}"}
}]
},
{
"role": "tool",
"tool_call_id": "call_123",
"content": "The answer is 4"
}
]
# Test 2: Content as array (OpenAI format)
messages_array = [
{"role": "user", "content": "What is 2+2?"},
{
"role": "assistant",
"tool_calls": [{
"id": "call_123",
"type": "function",
"function": {"name": "calc", "arguments": "{}"}
}]
},
{
"role": "tool",
"tool_call_id": "call_123",
"content": [{"type": "text", "text": "The answer is 4"}]
}
]
tools = [{
"type": "function",
"function": {
"name": "calc",
"description": "Calculator",
"parameters": {"type": "object", "properties": {}}
}
}]
print("\n" + "=" * 60)
print("Test: String content vs Array content")
print("=" * 60)
with httpx.Client(timeout=60.0) as client:
for name, msgs in [("String content", messages_string), ("Array content", messages_array)]:
print(f"\n--- {name} ---")
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": msgs,
"tools": tools,
"stream": False,
"max_tokens": 128
}
)
result = response.json()
if result.get("choices"):
                content = result["choices"][0].get("message", {}).get("content") or ""  # content may be JSON null
print(f"Response: {content[:150]}")
if "4" in content:
print("✓ Referenced tool result")
else:
print("✗ Did NOT reference tool result")
if __name__ == "__main__":
print("GLM-5.1 Tool Response Diagnosis")
print("=" * 60)
test_simple_tool_response()
test_without_tools_param()
test_different_content_formats()
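The newline sensitivity described in the patch notes can also be reproduced offline, with no server involved. A sketch using the two regex strings quoted in the README above; the bracketed token literals stand in for GLM's special tokens, and the sample payload is illustrative:

```python
import re

# Old pattern: demands a newline between the function name and the arguments
old_regex = re.compile(r"\[TOOL_CALL_START\]([^\n]*)\n(.*)\[TOOL_CALL_END\]", re.DOTALL)
# Fixed pattern: tolerates any (or no) whitespace after the name
new_regex = re.compile(
    r"\[TOOL_CALL_START\]\s*([\w.\-]+)\s*((?:\[ARG_KEY\].*)?)\s*\[TOOL_CALL_END\]",
    re.DOTALL,
)

# GLM-5.1's chat template emits no newline after the function name
sample = "[TOOL_CALL_START]get_weather[ARG_KEY]location[ARG_END][TOOL_CALL_END]"

print("old regex matches:", old_regex.search(sample) is not None)  # False
print("new regex matches:", new_regex.search(sample) is not None)  # True
```

The old pattern silently fails on such output, which is the root of the "blank tool response" behavior these scripts diagnose.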

tests/test_tool_response.py

@@ -0,0 +1,445 @@
#!/usr/bin/env python3
"""
Test for tool call response handling in GLM-5.1.
Tests the multi-turn flow:
1. Send a prompt that triggers a tool call
2. Send back the tool result
3. Verify the model can see and use the tool response
This reproduces the issue where tool responses appear blank to the model.
"""
import os
import json
import httpx
from datetime import datetime
API_BASE = os.environ.get("VLLM_API_BASE", "http://localhost:8000/v1")
API_KEY = os.environ.get("VLLM_API_KEY", "none")
MODEL = os.environ.get("VLLM_MODEL", "zai-org/GLM-5.1-FP8")
def timestamp():
return datetime.now().strftime("%H:%M:%S.%f")[:-3]
def test_tool_call_response_flow(streaming: bool = True):
"""
Test the full tool call -> response -> follow-up flow.
This simulates:
1. User asks for weather
2. Model calls get_weather tool
3. We send back the weather data
4. Model should see and use that data
"""
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and state, e.g. 'New York, NY'"
}
},
"required": ["location"]
}
}
}
]
# Initial request that should trigger a tool call
messages = [
{
"role": "user",
"content": "What's the weather like in Tokyo right now?"
}
]
mode = "STREAMING" if streaming else "NON-STREAMING"
print(f"\n{'='*60}")
print(f"TEST: Tool call response flow ({mode})")
print(f"API: {API_BASE}")
print(f"Model: {MODEL}")
print(f"{'='*60}\n")
with httpx.Client(timeout=120.0) as client:
# Step 1: Send initial request, expect tool call
print(f"[{timestamp()}] Step 1: Sending initial request...")
if streaming:
tool_calls = []
tool_call_id = None
tool_call_name = None
accumulated_args = ""
with client.stream(
"POST",
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": True,
"max_tokens": 512
}
) as response:
print(f"[{timestamp()}] Response status: {response.status_code}")
for line in response.iter_lines():
if not line or line == "data: [DONE]":
continue
if line.startswith("data: "):
try:
chunk = json.loads(line[6:])
if chunk.get("choices"):
delta = chunk["choices"][0].get("delta", {})
if delta.get("tool_calls"):
for tc in delta["tool_calls"]:
idx = tc.get("index", 0)
if tc.get("id"):
tool_call_id = tc["id"]
if tc.get("function", {}).get("name"):
tool_call_name = tc["function"]["name"]
print(f"[{timestamp()}] Tool call: {tool_call_name}")
if tc.get("function", {}).get("arguments"):
accumulated_args += tc["function"]["arguments"]
if delta.get("content"):
print(f"[{timestamp()}] Content: {delta['content'][:100]}")
except json.JSONDecodeError as e:
print(f"[{timestamp()}] JSON error: {e}")
if tool_call_name:
tool_calls.append({
"id": tool_call_id or "call_0",
"type": "function",
"function": {
"name": tool_call_name,
"arguments": accumulated_args
}
})
else:
# Non-streaming
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": False,
"max_tokens": 512
}
)
result = response.json()
print(f"[{timestamp()}] Response status: {response.status_code}")
tool_calls = []
if result.get("choices"):
message = result["choices"][0].get("message", {})
if message.get("tool_calls"):
tool_calls = message["tool_calls"]
for tc in tool_calls:
print(f"[{timestamp()}] Tool call: {tc['function']['name']}")
print(f"[{timestamp()}] Args: {tc['function']['arguments']}")
# Check if we got a tool call
if not tool_calls:
print(f"\n[{timestamp()}] No tool call received - model didn't call the tool")
return {"success": False, "reason": "no_tool_call"}
# Step 2: Parse tool call and prepare response
tc = tool_calls[0]
tc_id = tc.get("id", "call_0")
tc_name = tc["function"]["name"]
tc_args = json.loads(tc["function"]["arguments"])
print(f"\n[{timestamp()}] Step 2: Tool call received")
print(f" Name: {tc_name}")
print(f" Args: {tc_args}")
# Simulate tool execution
tool_result = {
"location": tc_args.get("location", "Unknown"),
"temperature": "22°C",
"condition": "Partly cloudy",
"humidity": "65%",
"wind": "15 km/h NE"
}
# Step 3: Send the tool response back
messages.append({
"role": "assistant",
"tool_calls": tool_calls
})
messages.append({
"role": "tool",
"tool_call_id": tc_id,
"content": json.dumps(tool_result)
})
print(f"\n[{timestamp()}] Step 3: Sending tool response...")
print(f" Tool call ID: {tc_id}")
print(f" Tool result: {json.dumps(tool_result, indent=2)}")
# Step 4: Get the model's follow-up response
if streaming:
final_response = ""
print(f"\n[{timestamp()}] Step 4: Receiving model's follow-up (streaming)...")
with client.stream(
"POST",
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"stream": True,
"max_tokens": 512
}
) as response:
for line in response.iter_lines():
if not line or line == "data: [DONE]":
continue
if line.startswith("data: "):
try:
chunk = json.loads(line[6:])
if chunk.get("choices"):
delta = chunk["choices"][0].get("delta", {})
if delta.get("content"):
content = delta["content"]
final_response += content
print(f"[{timestamp()}] Content: {content}", end="", flush=True)
except json.JSONDecodeError:
pass
print() # newline after streaming output
else:
print(f"\n[{timestamp()}] Step 4: Receiving model's follow-up (non-streaming)...")
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"stream": False,
"max_tokens": 512
}
)
result = response.json()
final_response = ""
if result.get("choices"):
                final_response = result["choices"][0].get("message", {}).get("content") or ""  # content may be JSON null
print(f"\n[{timestamp()}] Final response:\n{final_response}")
# Check if the model used the tool data
success = True
issues = []
# The response should mention the weather data
    if "22" not in final_response:  # "22°C" contains "22", so one check suffices
        issues.append("Temperature (22°C) not mentioned in response")
        success = False
    if "cloudy" not in final_response.lower():  # covers "partly cloudy" too
        issues.append("Condition (Partly cloudy) not mentioned in response")
        success = False
# Check for signs the model didn't see the data
blank_indicators = [
"i don't have",
"i cannot access",
"i'm unable to",
"i am unable to",
"don't have access",
"don't have real-time",
"cannot provide real-time"
]
for indicator in blank_indicators:
if indicator in final_response.lower():
issues.append(f"Model seems unaware of tool result (found: '{indicator}')")
success = False
break
print(f"\n{'='*60}")
if success:
print("✓ PASS: Model correctly used tool response data")
else:
print("✗ FAIL: Model did not use tool response correctly")
for issue in issues:
print(f" - {issue}")
print(f"{'='*60}\n")
return {
"success": success,
"issues": issues,
"final_response": final_response
}
def test_tool_response_with_debug_info():
"""
Test with detailed logging to capture exactly what the model sees.
"""
tools = [
{
"type": "function",
"function": {
"name": "get_time",
"description": "Get the current time",
"parameters": {
"type": "object",
"properties": {},
"required": []
}
}
}
]
print(f"\n{'='*60}")
print(f"TEST: Tool response with debug info (non-streaming)")
print(f"{'='*60}\n")
messages = [
{"role": "user", "content": "What time is it?"}
]
with httpx.Client(timeout=120.0) as client:
# Get tool call
print(f"[{timestamp()}] Sending initial request...")
response = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"tool_choice": "auto",
"stream": False,
"max_tokens": 256
}
)
result = response.json()
if not result.get("choices") or not result["choices"][0].get("message", {}).get("tool_calls"):
print("No tool call - skipping test")
return
tool_call = result["choices"][0]["message"]["tool_calls"][0]
tc_id = tool_call["id"]
print(f"[{timestamp()}] Tool call: {tool_call['function']['name']}")
print(f"[{timestamp()}] Tool call ID: {tc_id}")
# Add tool response
messages.append({
"role": "assistant",
"tool_calls": [tool_call]
})
messages.append({
"role": "tool",
"tool_call_id": tc_id,
"content": "The current time is 3:45 PM on Thursday, April 9, 2026."
})
# Debug: print the full messages array we're about to send
print(f"\n[{timestamp()}] Sending follow-up with these messages:")
print(json.dumps(messages, indent=2))
# Get follow-up
response2 = client.post(
f"{API_BASE}/chat/completions",
headers={
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
},
json={
"model": MODEL,
"messages": messages,
"tools": tools,
"stream": False,
"max_tokens": 256
}
)
result2 = response2.json()
print(f"\n[{timestamp()}] Full response:")
print(json.dumps(result2, indent=2))
if result2.get("choices"):
            content = result2["choices"][0].get("message", {}).get("content") or ""  # content may be JSON null
print(f"\n[{timestamp()}] Model response content: {content}")
# Check if time is mentioned
            if "3:45" in content:  # covers "3:45 PM" as well
print("\n✓ Model used the tool response (time mentioned)")
else:
print("\n✗ Model may not have seen the tool response (time not mentioned)")
def main():
print("\n" + "="*60)
print("GLM-5.1 Tool Call Response Tests")
print("="*60)
# Test non-streaming first (simpler to debug)
print("\n--- Test 1: Non-streaming tool response flow ---")
test_tool_call_response_flow(streaming=False)
# Test streaming
print("\n--- Test 2: Streaming tool response flow ---")
test_tool_call_response_flow(streaming=True)
# Debug test
print("\n--- Test 3: Debug info test ---")
test_tool_response_with_debug_info()
print("\nAll tests complete.")
if __name__ == "__main__":
main()
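Each streaming test above repeats the same SSE line-parsing loop. It can be factored into a small generator; a sketch under the same assumptions as the tests (`iter_sse_deltas` is a hypothetical helper name, and the input is any iterable of SSE lines such as `response.iter_lines()`):

```python
import json
from typing import Iterable, Iterator

def iter_sse_deltas(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the `delta` dict from each OpenAI-style streaming chunk."""
    for line in lines:
        if not line or line == "data: [DONE]":
            continue
        if not line.startswith("data: "):
            continue
        try:
            chunk = json.loads(line[len("data: "):])
        except json.JSONDecodeError:
            continue  # skip malformed or partial lines
        choices = chunk.get("choices") or []
        if choices:
            yield choices[0].get("delta", {})

# Example with canned lines, shaped like iter_lines() output
lines = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    "",
    "data: [DONE]",
]
print(list(iter_sse_deltas(lines)))  # [{'content': 'Hello'}]
```

With this helper, each test loop reduces to iterating deltas and inspecting `delta.get("tool_calls")` or `delta.get("content")`.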