# vLLM GLM Tool Parser Patch

## Purpose

Patches vLLM's GLM-4/GLM-5.1 tool parser to fix a streaming issue where long string parameters are buffered in full before being emitted, causing multi-second delays.
## The Problem

GLM models emit tool calls in a special XML-like format:

```text
<tool_call>tool_name
<arg_key>param_name</arg_key>
<arg_value>param_value</arg_value>
</tool_call>
```

The upstream parser (as of vLLM issue #32829) buffers string values until the closing tag arrives. For long strings (e.g., 4000+ characters of code), users see nothing until the entire value is complete, which is not true streaming.
## The Fix

`glm4_moe_tool_parser.py` implements incremental string streaming:

- Re-parses `<tool_call>` regions on each streaming call
- Diffs against previously sent content
- Emits only the new characters as they arrive
- String values now stream character-by-character
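The diff-and-emit idea above can be sketched as follows (the class and method names here are hypothetical illustrations, not the actual patch code):

```python
class IncrementalStringStreamer:
    """Sketch of the diff-and-emit approach: track what has already been
    sent for the current string argument and emit only the new suffix."""

    def __init__(self) -> None:
        self.sent = ""  # characters already emitted for this argument value

    def emit_delta(self, partial_value: str) -> str:
        # partial_value is the argument value re-parsed from the buffer on
        # this streaming call; it grows monotonically as chunks arrive.
        delta = partial_value[len(self.sent):]
        self.sent = partial_value
        return delta
```

Each streaming call re-parses the argument value seen so far and passes it in; the returned delta goes straight into the next response chunk, so the client sees the string grow in real time instead of all at once.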
## Files

| File | Description |
|---|---|
| `glm4_moe_tool_parser.py` | Fixed tool parser with incremental streaming |
| `utils.py` | Utility functions for partial JSON/tag handling |
| `Dockerfile` | Overlays patched files onto the base image |
| `Jenkinsfile` | CI/CD pipeline for building and pushing the image |
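One subtlety partial-tag handling has to cover is a chunk boundary falling inside a tag: a trailing fragment like `<arg_` must be held back rather than streamed to the user as text. A minimal sketch of such a helper (hypothetical name and tag list, not the actual `utils.py` code):

```python
# Tags assumed for the GLM tool-call format described above.
TAGS = ("<tool_call>", "</tool_call>", "<arg_key>", "</arg_key>",
        "<arg_value>", "</arg_value>")

def split_trailing_partial_tag(text, tags=TAGS):
    """Split text into (safe_to_emit, held_back), where held_back is the
    longest proper suffix that could still be the start of a known tag."""
    longest = max(len(t) for t in tags)
    for n in range(min(longest - 1, len(text)), 0, -1):
        frag = text[-n:]
        if any(tag.startswith(frag) for tag in tags):
            return text[:-n], frag
    return text, ""
```

The held-back fragment is prepended to the next chunk, so a tag split across chunks is still recognized without ever leaking `<arg_` into the streamed output.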
## Deployment

### Jenkins Pipeline

Trigger a build via Jenkins:

```bash
curl -X POST "https://jenkins.sweetapi.com/job/vllm-glm-build/buildWithParameters" \
  -u "admin:TOKEN" \
  -d "IMAGE_TAG=latest"
```
Parameters:

- `IMAGE_TAG` - Docker image tag (default: `latest`)
- `GIT_REPO` - Git repository URL (optional; uses the workspace if empty)
- `GIT_BRANCH` - Git branch to build (default: `master`)
### Manual Build

```bash
docker build -t atl.vultrcr.com/vllm/vllm-glm51-patched:latest .
docker push atl.vultrcr.com/vllm/vllm-glm51-patched:latest
```
## Images

- Base: `vllm/vllm-openai:glm51-cu130`
- Output: `atl.vultrcr.com/vllm/vllm-glm51-patched:<tag>`
## Related

- vLLM Issue #32829 (streaming long string parameters)