# vLLM GLM Tool Parser Patch
## Purpose
Patches vLLM's GLM-4/GLM-5.1 tool parser to fix a streaming issue where long string parameters are buffered entirely before being emitted, causing multi-second delays.
## The Problem
GLM models emit tool calls in a special XML-like format:
```
<tool_call>tool_name
<arg_key>param_name</arg_key><arg_value>param_value</arg_value>
</tool_call>
```
The upstream parser (as of vLLM issue #32829) buffers string values until the closing tag arrives. For long strings (e.g., 4000+ characters of code), users see nothing until the entire value is complete — not true streaming.
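To make the failure mode concrete, here is a hypothetical mid-stream buffer (the tool name, argument name, and payload are illustrative, not from the source):

```python
# A hypothetical partial model output mid-stream. The string value of
# the "content" argument is still growing; no closing tag has arrived.
partial = (
    "<tool_call>write_file\n"
    "<arg_key>content</arg_key>\n"
    "<arg_value>def main():\n    print('hello')"
)

# A parser that waits for the closing tag has nothing to emit at this
# point, even though most of the string value is already in the buffer.
value_complete = "</arg_value>" in partial
print(value_complete)  # False
```

With a 4000-character value, the client sees nothing until the final chunk closes the tag.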
## The Fix
`glm4_moe_tool_parser.py` implements incremental string streaming:
- Re-parses `<arg_value>` regions on each streaming call
- Diffs against previously sent content
- Emits only new characters as they arrive
- String values now stream character-by-character
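The diffing approach can be sketched as follows. This is a simplified illustration covering a single `<arg_value>` region; the actual parser in `glm4_moe_tool_parser.py` also tracks multiple arguments and tool calls:

```python
import re

# Match an <arg_value> region whether or not its closing tag has
# arrived yet. DOTALL lets the value span newlines (e.g. code blocks).
ARG_VALUE_RE = re.compile(r"<arg_value>(.*?)(?:</arg_value>|$)", re.DOTALL)

class IncrementalValueStreamer:
    """Emits only the unseen tail of a string value on each call."""

    def __init__(self) -> None:
        self.sent = 0  # characters of the current value already emitted

    def feed(self, buffer: str) -> str:
        """Re-parse the growing buffer and return the new characters."""
        match = ARG_VALUE_RE.search(buffer)
        if match is None:
            return ""
        value = match.group(1)
        delta = value[self.sent:]
        self.sent = len(value)
        return delta
```

Each streaming call re-parses the whole buffered region, then diffs against the count of characters already sent, so the client receives the value character-by-character instead of in one final chunk:

```python
s = IncrementalValueStreamer()
s.feed("<arg_value>hel")                      # -> "hel"
s.feed("<arg_value>hello wor")                # -> "lo wor"
s.feed("<arg_value>hello world</arg_value>")  # -> "ld"
```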
## Files
| File | Description |
|------|-------------|
| `glm4_moe_tool_parser.py` | Fixed tool parser with incremental streaming |
| `utils.py` | Utility functions for partial JSON/tag handling |
| `Dockerfile` | Overlays patched files onto base image |
| `Jenkinsfile` | CI/CD pipeline for building and pushing |
## Deployment
### Jenkins Pipeline
Build via Jenkins:
```bash
curl -X POST "https://jenkins.sweetapi.com/job/vllm-glm-build/buildWithParameters" \
  -u "admin:TOKEN" \
  -d "IMAGE_TAG=latest"
```
Parameters:
- `IMAGE_TAG` - Docker image tag (default: `latest`)
- `GIT_REPO` - Git repository URL (optional, uses workspace if empty)
- `GIT_BRANCH` - Git branch to build (default: `master`)
### Manual Build
```bash
docker build -t atl.vultrcr.com/vllm/vllm-glm51-patched:latest .
docker push atl.vultrcr.com/vllm/vllm-glm51-patched:latest
```
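The overlay Dockerfile follows this pattern (a minimal sketch; the `site-packages`/`dist-packages` path is an assumption and must match the Python install in the base image):

```dockerfile
FROM vllm/vllm-openai:glm51-cu130

# Copy the patched parser and its helpers over the installed vLLM package.
# NOTE: the target path below is an assumption; verify it inside the base
# image, e.g. with `python3 -c "import vllm; print(vllm.__file__)"`.
COPY glm4_moe_tool_parser.py \
     /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/tool_parsers/glm4_moe_tool_parser.py
COPY utils.py \
     /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/tool_parsers/utils.py
```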
### Images
- Base: `vllm/vllm-openai:glm51-cu130`
- Output: `atl.vultrcr.com/vllm/vllm-glm51-patched:<tag>`
## Related
- vLLM Issue #32829 (streaming long string parameters)