# vLLM GLM Tool Parser Patch

## Purpose

Patches vLLM's GLM-4/GLM-5.1 tool parser to fix a streaming issue where long string parameters are buffered in full before being emitted, causing multi-second delays.

## The Problem

GLM models emit tool calls in a special XML-like format:

```
<tool_call>tool_name
<arg_key>param_name</arg_key>
<arg_value>param_value</arg_value>
</tool_call>
```

The upstream parser (as of vLLM issue #32829) buffers string values until the closing tag arrives. For long strings (e.g., 4000+ characters of code), users see nothing until the entire value is complete: not true streaming.

## The Fix

(Pulled from https://github.com/vllm-project/vllm/pull/39253)

`glm4_moe_tool_parser.py` implements incremental string streaming:

- Re-parses `<arg_value>` regions on each streaming call
- Diffs against previously sent content
- Emits only new characters as they arrive
- String values now stream character by character

## Files

| File | Description |
|------|-------------|
| `glm4_moe_tool_parser.py` | Fixed tool parser with incremental streaming |
| `utils.py` | Utility functions for partial JSON/tag handling |
| `Dockerfile` | Overlays the patched files onto the base image |
| `Jenkinsfile` | CI/CD pipeline for building and pushing the image |

## Deployment

### Jenkins Pipeline

Trigger a build via Jenkins:

```bash
curl -X POST "https://jenkins.sweetapi.com/job/vllm-glm-build/buildWithParameters" \
  -u "admin:TOKEN" \
  -d "IMAGE_TAG=latest"
```

Parameters:

- `IMAGE_TAG` - Docker image tag (default: `latest`)
- `GIT_REPO` - Git repository URL (optional; uses the workspace if empty)
- `GIT_BRANCH` - Git branch to build (default: `master`)

### Manual Build

```bash
docker build -t atl.vultrcr.com/vllm/vllm-glm51-patched:latest .
docker push atl.vultrcr.com/vllm/vllm-glm51-patched:latest
```

### Images

- Base: `vllm/vllm-openai:glm51-cu130`
- Output: `atl.vultrcr.com/vllm/vllm-glm51-patched:<IMAGE_TAG>`

## Related

- vLLM Issue #32829 (streaming long string parameters)
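## Appendix: How Incremental Streaming Works

The re-parse/diff/emit loop described in "The Fix" can be sketched as follows. This is a minimal illustration of the technique, not the actual vLLM implementation: the `IncrementalArgStreamer` class and its `feed` method are hypothetical names, and the real parser additionally handles multiple arguments, escaping, and tool-call boundaries.

```python
# Hypothetical sketch of incremental <arg_value> streaming: on each call we
# re-scan the accumulated model output for the latest <arg_value> region,
# diff the (possibly partial) value against what was already emitted, and
# return only the characters the client has not seen yet.

ARG_VALUE_OPEN = "<arg_value>"
ARG_VALUE_CLOSE = "</arg_value>"

class IncrementalArgStreamer:
    def __init__(self) -> None:
        self.buffer = ""    # full model output accumulated so far
        self.sent_len = 0   # chars of the current value already emitted

    def feed(self, delta: str) -> str:
        """Append a new model chunk; return only the unseen part of the value."""
        self.buffer += delta
        start = self.buffer.rfind(ARG_VALUE_OPEN)
        if start == -1:
            return ""  # no value region has opened yet
        value_start = start + len(ARG_VALUE_OPEN)
        end = self.buffer.find(ARG_VALUE_CLOSE, value_start)
        # Take the partial value if the closing tag has not arrived yet.
        value = self.buffer[value_start:] if end == -1 else self.buffer[value_start:end]
        new = value[self.sent_len:]
        self.sent_len = len(value)
        return new

# Usage: a long value arrives in several chunks but streams out immediately,
# instead of being withheld until </arg_value> is seen.
streamer = IncrementalArgStreamer()
out = []
for chunk in ["<arg_value>hel", "lo wor", "ld</arg_value>"]:
    out.append(streamer.feed(chunk))
print("".join(out))  # -> hello world
```

The key design point is that the parser never waits for `</arg_value>`: tracking `sent_len` lets each streaming call emit exactly the new suffix, so a 4000-character code argument reaches the client as it is generated.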