biondizzle bbe40ac8c0 Add middleware to strip vLLM-only params (logprobs/top_logprobs) before forwarding to SGLang

SGLang's Mistral tool-call parser rejects logprobs/top_logprobs with 422,
while vLLM accepts them. Clients like OpenClaw send these by default.

New architecture: haproxy (port N) → middleware (port N+2) → SGLang (port N+1)
The middleware is a thin FastAPI app that strips incompatible params from
chat completion request bodies and passes everything else through unchanged.

2026-04-12 18:58:37 +00:00

| File | Last commit | Date |
| --- | --- | --- |
| Dockerfile | Add middleware to strip vLLM-only params (logprobs/top_logprobs) before forwarding to SGLang | 2026-04-12 18:58:37 +00:00 |
| entrypoint.sh | init commit | 2026-04-11 23:39:36 +00:00 |
| Jenkinsfile | Fix Jenkinsfile: agent any, nightly default, proper quoting | 2026-04-12 00:22:29 +00:00 |
| README.md | Add middleware to strip vLLM-only params (logprobs/top_logprobs) before forwarding to SGLang | 2026-04-12 18:58:37 +00:00 |
| vllm_middleware.py | Add middleware to strip vLLM-only params (logprobs/top_logprobs) before forwarding to SGLang | 2026-04-12 18:58:37 +00:00 |
| vllm_shim_module.py | Add middleware to strip vLLM-only params (logprobs/top_logprobs) before forwarding to SGLang | 2026-04-12 18:58:37 +00:00 |
| vllm-shim.sh | Add middleware to strip vLLM-only params (logprobs/top_logprobs) before forwarding to SGLang | 2026-04-12 18:58:37 +00:00 |

README.md

vLLM → SGLang Shim

Drop-in replacement that makes a vLLM production stack (e.g. the k8s operator) actually run SGLang instead.

Why?

The vLLM production stack handles model lifecycle, scaling, and routing — but some models work better (or only work) on SGLang. Rather than rewriting your deployment infra, this shim intercepts every vLLM invocation and launches SGLang with equivalent arguments.

How It Works

Invocation interception

Two interception paths catch however the vLLM stack tries to start the server:

| What the stack calls | What happens |
| --- | --- |
| `vllm serve <model> [flags]` | Shell shim (`vllm-shim.sh`) replaces the `vllm` binary |
| `python -m vllm.entrypoints.openai.api_server` | Python shim (shadow module on `PYTHONPATH`) intercepts the import |

Both extract --host and --port from whatever the stack sends.
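As a rough illustration of that extraction step, here is a minimal sketch that pulls `--host` and `--port` out of an arbitrary vLLM-style argument list and derives the three ports this README describes (haproxy on the original port, SGLang on port+1, middleware on port+2). The names `PortPlan` and `plan_from_argv` and the fallback defaults are assumptions for illustration; the actual shims may parse arguments differently.

```python
import argparse
from typing import NamedTuple, Sequence


class PortPlan(NamedTuple):
    """Hypothetical container for the port layout described in this README."""
    host: str
    proxy_port: int       # haproxy takes the port the vLLM stack asked for
    sglang_port: int      # port + 1
    middleware_port: int  # port + 2


def plan_from_argv(argv: Sequence[str]) -> PortPlan:
    """Extract --host/--port and ignore every other vLLM argument."""
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    # Fallback values are assumptions, used only when the stack sends nothing.
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--port", type=int, default=8000)
    # parse_known_args leaves unrecognized flags and positionals untouched.
    args, _ignored = parser.parse_known_args(list(argv))
    return PortPlan(args.host, args.port, args.port + 1, args.port + 2)
```

Using `parse_known_args` keeps the sketch tolerant of whatever else the stack passes (subcommand, model name, vLLM-specific flags), which matches the "whatever the stack sends" behavior described above.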

haproxy proxy layer

Rather than launching SGLang directly on the vLLM port, the shim runs haproxy on the original port and SGLang on port+1. This solves two critical problems:

/metrics stub: The vLLM stack expects a Prometheus metrics endpoint at /metrics. SGLang doesn't serve one. haproxy intercepts /metrics and returns an empty 200 response instantly.
/health probe timing: SGLang's /health endpoint takes ~1.001s to respond, which races the 1s k8s probe timeout and causes repeated `Startup probe failed: context deadline exceeded` errors. haproxy health-checks SGLang in the background (every 5s, with a 3s timeout) and answers /health probes instantly: 200 if the backend is up, 503 if it's not. No more timeout roulette.
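The idea behind the /health fix (check the backend on a timer, answer probes from the cached result) can be sketched in plain Python. This is not how the shim does it, since haproxy handles this natively; the class name `CachedHealth` and its methods are hypothetical and exist only to show the pattern.

```python
import threading
from typing import Callable


class CachedHealth:
    """Answer health probes from a cached result instead of calling the backend."""

    def __init__(self, check: Callable[[], bool]) -> None:
        self._check = check      # slow call, e.g. an HTTP GET with its own timeout
        self._healthy = False    # report 503 until a check has succeeded
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def poll_once(self) -> None:
        """Run one backend check and cache the outcome; errors count as down."""
        try:
            ok = bool(self._check())
        except Exception:
            ok = False
        with self._lock:
            self._healthy = ok

    def start(self, interval: float = 5.0) -> threading.Thread:
        """Poll in a daemon thread every `interval` seconds until stop() is called."""
        def loop() -> None:
            while not self._stop.is_set():
                self.poll_once()
                self._stop.wait(interval)

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()

    def status(self) -> int:
        """Return immediately: 200 if the last check passed, 503 otherwise."""
        with self._lock:
            return 200 if self._healthy else 503
```

The probe path (`status`) never waits on the backend, so a slow /health response can only delay the background check, not the k8s probe.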

middleware layer

A Python middleware (FastAPI) listens on port+2, between haproxy and SGLang. It strips vLLM-only request parameters that SGLang rejects with 422 errors:

logprobs / top_logprobs — vLLM accepts these on chat completion requests; SGLang's Mistral tool-call parser rejects them. OpenClaw and other vLLM clients send them by default.

The middleware only touches POST /v1/chat/completions request bodies and passes everything else through unchanged. To strip additional params, add them to the STRIP_PARAMS set in vllm_middleware.py.
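The stripping rule itself is small enough to sketch as a pure function. The version below is an assumption about the shape of the logic, not a copy of `vllm_middleware.py`: only the `STRIP_PARAMS` name comes from this README, while `strip_body` and `CHAT_PATH` are hypothetical, and the FastAPI wiring and forwarding are left out.

```python
import json
from typing import Any

# Name taken from this README; the contents match the params it lists.
STRIP_PARAMS = {"logprobs", "top_logprobs"}
CHAT_PATH = "/v1/chat/completions"


def strip_body(method: str, path: str, raw: bytes) -> bytes:
    """Drop STRIP_PARAMS from chat completion bodies; pass everything else through."""
    if method.upper() != "POST" or path != CHAT_PATH:
        return raw
    try:
        body: Any = json.loads(raw)
    except ValueError:
        # Not valid JSON (or not valid UTF-8): let the backend report the error.
        return raw
    if not isinstance(body, dict):
        return raw
    cleaned = {k: v for k, v in body.items() if k not in STRIP_PARAMS}
    if len(cleaned) == len(body):
        return raw  # nothing to strip, keep the original bytes
    return json.dumps(cleaned).encode("utf-8")
```

Any proxy built around a function like this has to recompute `Content-Length` when the body changes, since the stripped payload is shorter than what the client sent.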

┌─────────────────────────────────────────────┐
│  k8s probes / vLLM stack                    │
│         │                                   │
│         ▼                                   │
│  haproxy (port 8000)                        │
│    /metrics ──► 200 empty (stub)            │
│    /health  ──► 200/503 instant (backend    │
│                 health-checked in bg)       │
│    /*       ──► proxy to middleware         │
│                       │                     │
│                       ▼                     │
│  middleware (port 8002)                     │
│    strips logprobs/top_logprobs             │
│    forwards to SGLang                       │
│                       │                     │
│                       ▼                     │
│              SGLang (port 8001)             │
└─────────────────────────────────────────────┘

haproxy 2.4 compat: uses errorfile + http-request deny deny_status for stub responses (the http-request return payload syntax requires haproxy 2.8+).

Current State

Running in production — mistralai/Devstral-2-123B-Instruct-2512 on 8× MI300X.

Model path, --tp 8, and --tool-call-parser mistral are baked into both shims
The Dockerfile builds on lmsysorg/sglang-rocm and patches a broken aiter build from the base image
MI300X tuning env vars are set (HIP_FORCE_DEV_KERNARG, NCCL_MIN_NCHANNELS, etc.)
All received args are logged to /tmp/vllm-shim.log (configurable via VLLM_SHIM_LOG env var)
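For the argument logging in the last item, a minimal sketch of the Python side might look like the following. The log path and the `VLLM_SHIM_LOG` variable come from this README; the function name `log_args` and the line format are assumptions.

```python
import os
import shlex
from datetime import datetime, timezone
from typing import Sequence

DEFAULT_LOG = "/tmp/vllm-shim.log"  # default path stated in this README


def log_args(argv: Sequence[str]) -> str:
    """Append the received arguments to the shim log and return the log path."""
    path = os.environ.get("VLLM_SHIM_LOG", DEFAULT_LOG)
    stamp = datetime.now(timezone.utc).isoformat()
    # shlex.join quotes arguments so the line can be pasted back into a shell.
    line = f"{stamp} args: {shlex.join(argv)}\n"
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line)
    return path
```

Appending rather than overwriting keeps the arguments from every restart, which helps when the stack relaunches the server with different flags.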

Building

docker build -t vllm-to-sglang .

Or use the Jenkins pipeline:

curl -X POST "https://jenkins.sweetapi.com/job/vllm-to-sglang/buildWithParameters" \
  -u "${JENKINS_USER}:${JENKINS_PASS}" \
  -d "BRANCH=metrics" \
  -d "TAG=nightly3"

Then use this image anywhere the vLLM stack expects its server image.

Making It Work For Other Models

Right now the model config is hardcoded in three places:

vllm-shim.sh — the python -m sglang.launch_server line
vllm_shim_module.py — the subprocess.Popen() call
Dockerfile — base image and ROCm-specific patches

To adapt for a different model, change --model-path, --tp, and --tool-call-parser in both shim files. A future pass will make this configurable via env vars or args so you don't have to edit source.
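One possible shape for that future pass is to build the SGLang command from environment variables, falling back to the values currently baked in. This is only a sketch: the variable names (`SGLANG_MODEL_PATH`, `SGLANG_TP`, `SGLANG_TOOL_CALL_PARSER`) and the function `build_sglang_cmd` are hypothetical, and the default model path assumes the Hugging Face ID named under Current State is what the shims pass to `--model-path`.

```python
import os
import sys
from typing import List, Mapping

# Hypothetical env var names; defaults mirror the values this README says are hardcoded.
DEFAULTS = {
    "SGLANG_MODEL_PATH": "mistralai/Devstral-2-123B-Instruct-2512",
    "SGLANG_TP": "8",
    "SGLANG_TOOL_CALL_PARSER": "mistral",
}


def build_sglang_cmd(host: str, port: int,
                     env: Mapping[str, str] = os.environ) -> List[str]:
    """Assemble the sglang.launch_server argv, letting env vars override defaults."""
    cfg = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", cfg["SGLANG_MODEL_PATH"],
        "--tp", cfg["SGLANG_TP"],
        "--tool-call-parser", cfg["SGLANG_TOOL_CALL_PARSER"],
        "--host", host,
        "--port", str(port),
    ]
```

The returned list can be handed to `subprocess.Popen()` as-is, so the Python shim would only need to swap its hardcoded argument list for a call like `build_sglang_cmd(host, port + 1)`.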

Files

| File | Purpose |
| --- | --- |
| `Dockerfile` | Builds the image: ROCm SGLang base + haproxy + shims + MI300X env |
| `vllm-shim.sh` | Shell shim: replaces the `vllm` binary, launches SGLang + middleware + haproxy |
| `vllm_shim_module.py` | Python shim: shadows `vllm.*` module imports, launches SGLang + middleware + haproxy |
| `vllm_middleware.py` | FastAPI middleware: strips vLLM-only params (logprobs/top_logprobs) before forwarding to SGLang |
