# vLLM GLM Tool Parser Patch
## Purpose
Patches vLLM's GLM-4/GLM-5.1 tool parser to fix a streaming issue where long string parameters are buffered entirely before being emitted, causing multi-second delays.
## The Problem
GLM models emit tool calls in a special XML-like format:
```
<tool_call>tool_name
<arg_key>param_name</arg_key><arg_value>param_value</arg_value>
</tool_call>
```
The upstream parser (as of vLLM issue #32829) buffers string values until the closing tag arrives. For long strings (e.g., 4000+ characters of code), users see nothing until the entire value is complete — not true streaming.
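To make the failure mode concrete, here is a hypothetical mid-stream buffer (the tool name, argument name, and payload are illustrative, not from the source):

```python
# A hypothetical partial model output mid-stream. The string value of
# the "content" argument is still growing; no closing tag has arrived.
partial = (
    "<tool_call>write_file\n"
    "<arg_key>content</arg_key>\n"
    "<arg_value>def main():\n    print('hello')"
)

# A parser that waits for the closing tag has nothing to emit at this
# point, even though most of the string value is already in the buffer.
value_complete = "</arg_value>" in partial
print(value_complete)  # False
```

With a 4000-character value, the client sees nothing until the final chunk closes the tag.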
## The Fix
`glm4_moe_tool_parser.py` implements incremental string streaming:
- Re-parses `<arg_value>` regions on each streaming call
- Diffs against previously sent content
- Emits only new characters as they arrive
- String values now stream character-by-character
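The diffing approach can be sketched as follows. This is a simplified illustration covering a single `<arg_value>` region; the actual parser in `glm4_moe_tool_parser.py` also tracks multiple arguments and tool calls:

```python
import re

# Match an <arg_value> region whether or not its closing tag has
# arrived yet. DOTALL lets the value span newlines (e.g. code blocks).
ARG_VALUE_RE = re.compile(r"<arg_value>(.*?)(?:</arg_value>|$)", re.DOTALL)

class IncrementalValueStreamer:
    """Emits only the unseen tail of a string value on each call."""

    def __init__(self) -> None:
        self.sent = 0  # characters of the current value already emitted

    def feed(self, buffer: str) -> str:
        """Re-parse the growing buffer and return the new characters."""
        match = ARG_VALUE_RE.search(buffer)
        if match is None:
            return ""
        value = match.group(1)
        delta = value[self.sent:]
        self.sent = len(value)
        return delta
```

Each streaming call re-parses the whole buffered region, then diffs against the count of characters already sent, so the client receives the value character-by-character instead of in one final chunk:

```python
s = IncrementalValueStreamer()
s.feed("<arg_value>hel")                      # -> "hel"
s.feed("<arg_value>hello wor")                # -> "lo wor"
s.feed("<arg_value>hello world</arg_value>")  # -> "ld"
```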
## Files
| File | Description |
|------|-------------|
| `glm4_moe_tool_parser.py` | Fixed tool parser with incremental streaming |
| `utils.py` | Utility functions for partial JSON/tag handling |
| `Dockerfile` | Overlays patched files onto base image |
| `Jenkinsfile` | CI/CD pipeline for building and pushing |
## Deployment
### Jenkins Pipeline
Build via Jenkins:
```bash
curl -X POST "https://jenkins.sweetapi.com/job/vllm-glm-build/buildWithParameters" \
  -u "admin:TOKEN" \
  -d "IMAGE_TAG=latest"
```
Parameters:
- `IMAGE_TAG` - Docker image tag (default: `latest`)
- `GIT_REPO` - Git repository URL (optional, uses workspace if empty)
- `GIT_BRANCH` - Git branch to build (default: `master`)
### Manual Build
```bash
docker build -t atl.vultrcr.com/vllm/vllm-glm51-patched:latest .
docker push atl.vultrcr.com/vllm/vllm-glm51-patched:latest
```
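The overlay Dockerfile follows this pattern (a minimal sketch; the `site-packages`/`dist-packages` path is an assumption and must match the Python install in the base image):

```dockerfile
FROM vllm/vllm-openai:glm51-cu130

# Copy the patched parser and its helpers over the installed vLLM package.
# NOTE: the target path below is an assumption; verify it inside the base
# image, e.g. with `python3 -c "import vllm; print(vllm.__file__)"`.
COPY glm4_moe_tool_parser.py \
     /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/tool_parsers/glm4_moe_tool_parser.py
COPY utils.py \
     /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/tool_parsers/utils.py
```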
### Images
- Base: `vllm/vllm-openai:glm51-cu130`
- Output: `atl.vultrcr.com/vllm/vllm-glm51-patched:<tag>`
## Related
- vLLM Issue #32829 (streaming long string parameters)