
# vLLM GLM Tool Parser Patch

Patches vLLM's GLM-4/GLM-5.1 tool parser to fix multiple issues with tool call handling.

## Issues Fixed

### Issue 1: Tool Response Content Ignored (CRITICAL)

**Symptom:** When the model makes a tool call and receives a response, it acts as if the response were empty ("The function returned no output") even though valid content was provided.

**Root cause:** Two bugs working together:

1. **Tool parser regex mismatch** (`glm4_moe_tool_parser.py`): the `func_detail_regex` required a newline between the function name and the first argument tag, but GLM-5.1's chat template doesn't emit that newline, so the regex silently failed to match.

2. **Wrong content format detection** (`vllm/renderers/hf.py`): vLLM detected the "openai" content format because the GLM template contains `{% for tr in m.content %}` for tool responses. But the template then checks `m.content is string`, which is `False` for OpenAI-format arrays, so the content was dropped.

Model output format (no newline after the name):

```text
[TOOL_CALL_START]function_name[ARG_KEY]value[ARG_END]...[TOOL_CALL_END]
```

Old regex (broken):

```python
r"\[TOOL_CALL_START\]([^\n]*)\n(.*)\[TOOL_CALL_END\]"  # requires \n after the name
```

Fixed regex:

```python
r"\[TOOL_CALL_START\]\s*([\w.\-]+)\s*((?:\[ARG_KEY\].*)?)\s*\[TOOL_CALL_END\]"
```
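The difference can be checked in isolation with a standalone `re` snippet (the sample model output is illustrative):

```python
import re

OLD = re.compile(r"\[TOOL_CALL_START\]([^\n]*)\n(.*)\[TOOL_CALL_END\]")
NEW = re.compile(r"\[TOOL_CALL_START\]\s*([\w.\-]+)\s*((?:\[ARG_KEY\].*)?)\s*\[TOOL_CALL_END\]")

# GLM-5.1 emits no newline between the function name and the first [ARG_KEY] tag.
output = "[TOOL_CALL_START]get_weather[ARG_KEY]city[ARG_END]Paris[TOOL_CALL_END]"

print(OLD.search(output))  # None -- the old regex silently fails to match
m = NEW.search(output)
print(m.group(1))          # get_weather
print(m.group(2))          # [ARG_KEY]city[ARG_END]Paris
```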

**Content format fix:** Added `_is_glm_model()` detection to force the "string" content format for GLM models, bypassing the incorrect auto-detection.

### Issue 2: Zero-Argument Tool Calls Crash

**Symptom:** `TypeError: 'NoneType' object is not iterable` when a tool call has no arguments.

**Fix:** `tc_args_raw` now defaults to an empty string: `tc_args_raw = tc_detail.group(2) or ""`.
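In isolation the fix is just a `None` guard (illustrative snippet; the surrounding parser code is elided):

```python
tc_args_raw = None               # re.Match.group(2) returns None for an unmatched group
tc_args_raw = tc_args_raw or ""  # the fix: fall back to an empty string
print(list(tc_args_raw))         # [] -- iterating yields nothing instead of raising TypeError
```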

### Issue 3: Streaming vs. Non-Streaming Path Inconsistency

Both code paths now use the same robust extraction helpers, so streaming and non-streaming requests parse tool calls identically.
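A rough sketch of the shared-helper idea, with a hypothetical function name (the actual helpers live in `utils.py`):

```python
import re

# Non-greedy argument match so multiple tool calls in one buffer don't merge.
_TOOL_CALL = re.compile(
    r"\[TOOL_CALL_START\]\s*([\w.\-]+)\s*((?:\[ARG_KEY\].*?)?)\s*\[TOOL_CALL_END\]"
)

def extract_tool_calls(buffer: str) -> list[tuple[str, str]]:
    """Hypothetical shared helper: return (name, raw_args) pairs found in buffer.

    The non-streaming path calls this once on the full completion; the
    streaming path calls it on the accumulated text, so only complete
    [TOOL_CALL_START]...[TOOL_CALL_END] spans are ever emitted.
    """
    return [(m.group(1), m.group(2)) for m in _TOOL_CALL.finditer(buffer)]
```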

## Files

| File | Description |
| --- | --- |
| `glm4_moe_tool_parser.py` | Fixed tool parser (regex fix) |
| `utils.py` | Utility functions for partial JSON/tag handling |
| `vllm_patches/hf.py` | Patched renderer (content format fix) |
| `Dockerfile` | Overlays patched files onto the base image |
| `Jenkinsfile` | CI/CD pipeline for building and pushing |
| `tests/` | Test suite for tool call validation |
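The overlay approach amounts to copying the patched modules over the installed package in the base image. A sketch (the `site-packages` path and Python version are assumptions; verify them in the base image before relying on this):

```dockerfile
FROM vllm/vllm-openai:glm51-cu130

# Overwrite the installed modules with the patched copies.
# The dist-packages path below is an assumption; check it in the base image.
COPY glm4_moe_tool_parser.py /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/tool_parsers/glm4_moe_tool_parser.py
COPY vllm_patches/hf.py      /usr/local/lib/python3.12/dist-packages/vllm/renderers/hf.py
```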

## Testing

### Requirements

```shell
pip install httpx regex
```

### Running Tests

```shell
export VLLM_API_BASE="https://api.vultrinference.com/v1"
export VLLM_API_KEY="your-api-key"
export VLLM_MODEL="zai-org/GLM-5.1-FP8"

python tests/test_tool_diagnosis.py
```

### Test Cases

| Test | Description |
| --- | --- |
| `test_simple_tool_response` | Verifies the model can see tool response content |
| `test_without_tools_param` | Tests behavior without the `tools` param in the follow-up request |
| `test_different_content_formats` | Compares string vs. array content formats |
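The two content shapes under test look like this (illustrative payload fragments; the string form is what the patched renderer forces for GLM):

```python
# "string" content format: the tool result is a plain string.
string_form = {"role": "tool", "tool_call_id": "call_1", "content": '{"temp_c": 21}'}

# "openai" content format: the same result as a list of typed parts.
# The GLM chat template checks `m.content is string`, which is False here,
# which is why this form caused the tool output to be silently dropped.
array_form = {
    "role": "tool",
    "tool_call_id": "call_1",
    "content": [{"type": "text", "text": '{"temp_c": 21}'}],
}
```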

## Deployment

### Jenkins Pipeline

```shell
curl -X POST "https://jenkins.sweetapi.com/job/vllm-glm-build/buildWithParameters" \
  -u "admin:TOKEN" \
  -d "IMAGE_TAG=latest"
```

### Manual Build

```shell
docker build -t atl.vultrcr.com/vllm/vllm-glm51-patched:latest .
docker push atl.vultrcr.com/vllm/vllm-glm51-patched:latest
```

### Images

- Base: `vllm/vllm-openai:glm51-cu130`
- Output: `atl.vultrcr.com/vllm/vllm-glm51-patched:<tag>`