
# vLLM GLM Tool Parser Patch

## Purpose

Patches vLLM's GLM-4/GLM-5.1 tool parser to fix a streaming issue where long string parameters are buffered entirely before being emitted, causing multi-second delays.

## The Problem

GLM models emit tool calls in a special XML-like format:

```
<tool_call>tool_name
<arg_key>param_name</arg_key>
<arg_value>param_value</arg_value>
</tool_call>
```

The upstream parser (as of vLLM issue #32829) buffers each string value until its closing `</arg_value>` tag arrives. For long strings (e.g., 4,000+ characters of generated code), the client sees nothing until the entire value is complete, which is not true streaming.
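The buffered behavior can be sketched in a few lines of Python (hypothetical logic for illustration, not the upstream implementation): a parser that only extracts a value once the closing tag is present yields nothing while a long value is still streaming in.

```python
import re

def extract_complete_value(buffer: str):
    """Buffered extraction: returns the argument value only once the
    closing </arg_value> tag has arrived, and None before that."""
    m = re.search(r"<arg_value>(.*?)</arg_value>", buffer, re.DOTALL)
    return m.group(1) if m else None

# Mid-stream: the closing tag has not arrived yet, so nothing is emitted.
assert extract_complete_value("<arg_value>def long_function(") is None
# Only when the tag closes does the whole value appear, all at once.
assert extract_complete_value("<arg_value>x</arg_value>") == "x"
```

For a 4,000-character code argument, every call returns `None` until the final chunk arrives, which is exactly the multi-second stall users observe.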

## The Fix

`glm4_moe_tool_parser.py` implements incremental string streaming:

- Re-parses `<arg_value>` regions on each streaming call
- Diffs the current value against previously sent content
- Emits only the new characters as they arrive
- String values now stream character-by-character
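The steps above can be sketched as follows (a minimal illustration, not the actual patch; the name `IncrementalArgStreamer` is invented for the example): re-parse the partial `<arg_value>` region on every call and emit only the suffix that has not been sent yet.

```python
import re

class IncrementalArgStreamer:
    """Sketch of diff-based streaming: each call re-parses the buffer
    and returns only the characters not yet emitted to the client."""

    def __init__(self):
        self.sent = ""  # prefix already emitted

    def step(self, buffer: str) -> str:
        # Match the content after <arg_value>, whether or not the
        # closing tag has arrived yet.
        m = re.search(r"<arg_value>(.*?)(?:</arg_value>|$)", buffer, re.DOTALL)
        if not m:
            return ""
        current = m.group(1)
        delta = current[len(self.sent):]  # only the new characters
        self.sent = current
        return delta

streamer = IncrementalArgStreamer()
assert streamer.step("<arg_value>hel") == "hel"
assert streamer.step("<arg_value>hello wo") == "lo wo"
assert streamer.step("<arg_value>hello world</arg_value>") == "rld"
```

Because only the delta is emitted, the client receives characters as soon as the model produces them instead of waiting for the closing tag.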

## Files

| File | Description |
| --- | --- |
| `glm4_moe_tool_parser.py` | Fixed tool parser with incremental streaming |
| `utils.py` | Utility functions for partial JSON/tag handling |
| `Dockerfile` | Overlays patched files onto the base image |
| `Jenkinsfile` | CI/CD pipeline for building and pushing |
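`utils.py` is described as handling partial JSON; one common approach to that problem (a hypothetical sketch, not necessarily what `utils.py` does) is to close any unterminated strings and containers so a mid-stream fragment can be parsed:

```python
import json

def close_partial_json(fragment: str) -> str:
    """Append the closing quotes/brackets needed to make a truncated
    JSON fragment parseable (hypothetical helper for illustration)."""
    stack = []          # closers for currently open objects/arrays
    in_string = False   # are we inside a string literal?
    escape = False      # was the previous character a backslash?
    for ch in fragment:
        if escape:
            escape = False
            continue
        if ch == "\\" and in_string:
            escape = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string and ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif not in_string and ch in "}]":
            stack.pop()
    return fragment + ('"' if in_string else "") + "".join(reversed(stack))

# A string argument truncated mid-value becomes valid JSON again.
assert json.loads(close_partial_json('{"code": "print(')) == {"code": "print("}
```

This simple closer assumes the fragment was cut off mid-value; fragments truncated right after a key or a comma would need extra handling.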

## Deployment

### Jenkins Pipeline

Build via Jenkins:

```shell
curl -X POST "https://jenkins.sweetapi.com/job/vllm-glm-build/buildWithParameters" \
  -u "admin:TOKEN" \
  -d "IMAGE_TAG=latest"
```

Parameters:

- `IMAGE_TAG` - Docker image tag (default: `latest`)
- `GIT_REPO` - Git repository URL (optional; uses the workspace if empty)
- `GIT_BRANCH` - Git branch to build (default: `master`)

### Manual Build

```shell
docker build -t atl.vultrcr.com/vllm/vllm-glm51-patched:latest .
docker push atl.vultrcr.com/vllm/vllm-glm51-patched:latest
```

## Images

- Base: `vllm/vllm-openai:glm51-cu130`
- Output: `atl.vultrcr.com/vllm/vllm-glm51-patched:<tag>`

## References

- vLLM Issue #32829 (streaming long string parameters)