# vLLM GLM Tool Parser Patch

## Purpose

Patches vLLM's GLM-4/GLM-5.1 tool parser to fix a streaming issue where long string parameters are buffered in full before being emitted, causing multi-second delays.
## The Problem

GLM models emit tool calls in a special XML-like format:

```text
<tool_call>tool_name
<arg_key>param_name</arg_key>
<arg_value>param_value</arg_value>
</tool_call>
```

The upstream parser (as of vLLM issue #32829) buffers string values until the closing tag arrives. For long strings (e.g., 4000+ characters of code), users see nothing until the entire value is complete, which is not true streaming.
## The Fix

`glm4_moe_tool_parser.py` implements incremental string streaming:

- Re-parses `<tool_call>` regions on each streaming call
- Diffs against previously sent content
- Emits only the new characters as they arrive
- String values now stream character-by-character
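The diff-and-emit idea above can be sketched as follows (the class and method names here are hypothetical illustrations, not the actual patch code):

```python
class IncrementalStringStreamer:
    """Sketch of the diff-and-emit approach: track what has already been
    sent for the current string argument and emit only the new suffix."""

    def __init__(self) -> None:
        self.sent = ""  # characters already emitted for this argument value

    def emit_delta(self, partial_value: str) -> str:
        # partial_value is the argument value re-parsed from the buffer on
        # this streaming call; it grows monotonically as chunks arrive.
        delta = partial_value[len(self.sent):]
        self.sent = partial_value
        return delta
```

Each streaming call re-parses the argument value seen so far and passes it in; the returned delta goes straight into the next response chunk, so the client sees the string grow in real time instead of all at once.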
## Files

| File | Description |
|---|---|
| `glm4_moe_tool_parser.py` | Fixed tool parser with incremental streaming |
| `utils.py` | Utility functions for partial JSON/tag handling |
| `Dockerfile` | Overlays patched files onto the base image |
| `Jenkinsfile` | CI/CD pipeline for building and pushing the image |
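One subtlety partial-tag handling has to cover is a chunk boundary falling inside a tag: a trailing fragment like `<arg_` must be held back rather than streamed to the user as text. A minimal sketch of such a helper (hypothetical name and tag list, not the actual `utils.py` code):

```python
# Tags assumed for the GLM tool-call format described above.
TAGS = ("<tool_call>", "</tool_call>", "<arg_key>", "</arg_key>",
        "<arg_value>", "</arg_value>")

def split_trailing_partial_tag(text, tags=TAGS):
    """Split text into (safe_to_emit, held_back), where held_back is the
    longest proper suffix that could still be the start of a known tag."""
    longest = max(len(t) for t in tags)
    for n in range(min(longest - 1, len(text)), 0, -1):
        frag = text[-n:]
        if any(tag.startswith(frag) for tag in tags):
            return text[:-n], frag
    return text, ""
```

The held-back fragment is prepended to the next chunk, so a tag split across chunks is still recognized without ever leaking `<arg_` into the streamed output.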
## Deployment

### Jenkins Pipeline

Trigger a build via Jenkins:

```bash
curl -X POST "https://jenkins.sweetapi.com/job/vllm-glm-build/buildWithParameters" \
  -u "admin:TOKEN" \
  -d "IMAGE_TAG=latest"
```
Parameters:

- `IMAGE_TAG` - Docker image tag (default: `latest`)
- `GIT_REPO` - Git repository URL (optional; uses the workspace if empty)
- `GIT_BRANCH` - Git branch to build (default: `master`)
### Manual Build

```bash
docker build -t atl.vultrcr.com/vllm/vllm-glm51-patched:latest .
docker push atl.vultrcr.com/vllm/vllm-glm51-patched:latest
```
## Images

- Base: `vllm/vllm-openai:glm51-cu130`
- Output: `atl.vultrcr.com/vllm/vllm-glm51-patched:<tag>`
## Related

- vLLM Issue #32829 (streaming long string parameters)