Compare commits

11 Commits

Author SHA1 Message Date
7d9c4da2ee not sure why we have a default tool parser 2026-04-13 17:49:44 +00:00
efc9dc33e7 dynamic arg translation, remove entrypoint.sh, update README 2026-04-12 21:23:26 +00:00
7c1ed0408b fix: recursive _fix_schema to handle nested properties=[] at any depth 2026-04-12 20:52:44 +00:00
a9911386e0 strip guided_json, guided_regex too; fix parameters.properties array 2026-04-12 20:27:44 +00:00
ccedd3ecee fix: add chat_template_kwargs to STRIP_PARAMS, fix parameters.properties array 2026-04-12 20:23:10 +00:00
c66511e16f fix: handle parameters.properties being array, not just parameters itself 2026-04-12 20:17:06 +00:00
e03e41eb4f fix vLLM/SGLang schema mismatch 2026-04-12 19:57:47 +00:00
7ecbac2dc0 Fix UnboundLocalError in health(), switch from on_event to lifespan 2026-04-12 19:41:08 +00:00
774964a4db Add error dump logging: capture full request+response on 4xx/5xx from SGLang 2026-04-12 19:28:04 +00:00
db9231f796 Fix middleware: handle SGLang startup lag gracefully
- Add /health endpoint that returns 503 until SGLang is ready
- Background task polls SGLang until it accepts connections
- Catch ConnectError/TimeoutException instead of crashing
- Return 503 JSON error when SGLang backend is unavailable
- haproxy health-checks middleware /health, which reflects SGLang state
2026-04-12 19:06:38 +00:00
bbe40ac8c0 Add middleware to strip vLLM-only params (logprobs/top_logprobs) before forwarding to SGLang
SGLang's Mistral tool-call parser rejects logprobs/top_logprobs with 422,
while vLLM accepts them. Clients like OpenClaw send these by default.

New architecture: haproxy (port N) → middleware (port N+2) → SGLang (port N+1)
The middleware is a thin FastAPI app that strips incompatible params from
chat completion request bodies and passes everything else through unchanged.
2026-04-12 18:58:37 +00:00
6 changed files with 705 additions and 196 deletions

Dockerfile

@@ -18,6 +18,7 @@ RUN mkdir -p /opt/vllm-shim/vllm/entrypoints/openai \
COPY vllm_shim_module.py /opt/vllm-shim/vllm/__main__.py
COPY vllm_shim_module.py /opt/vllm-shim/vllm/entrypoints/openai/api_server.py
COPY vllm_shim_module.py /opt/vllm-shim/vllm/entrypoints/cli/main.py
COPY vllm_middleware.py /opt/vllm-shim/vllm_middleware.py
RUN touch /opt/vllm-shim/vllm/__init__.py \
/opt/vllm-shim/vllm/entrypoints/__init__.py \
/opt/vllm-shim/vllm/entrypoints/openai/__init__.py \

README.md

@@ -1,90 +1,115 @@
# vllm-to-sglang
Drop-in replacement that makes a vLLM production stack (e.g. the [k8s operator](https://github.com/vllm-project/production-stack)) actually run [SGLang](https://github.com/sgl-project/sglang) instead.
## Why?
The vLLM production stack handles model lifecycle, scaling, and routing — but some models work better (or only work) on SGLang. Rather than rewriting your deployment infra, this shim intercepts every vLLM invocation and launches SGLang with equivalent arguments.
## How It Works
### Invocation interception
Two interception paths catch however the vLLM stack tries to start the server:
| What the stack calls | What happens |
|---|---|
| `vllm serve <model> [flags]` | Shell shim (`vllm-shim.sh`) replaces the `vllm` binary |
| `python -m vllm.entrypoints.openai.api_server` | Python shim (shadow module on `PYTHONPATH`) intercepts the import |
Both extract `--host` and `--port` from whatever the stack sends.
### haproxy proxy layer
Rather than launching SGLang directly on the vLLM port, the shim runs **haproxy** on the original port and **SGLang on port+1**. This solves two critical problems:
1. **`/metrics` stub** — The vLLM stack expects a Prometheus metrics endpoint at `/metrics`. SGLang doesn't serve one. haproxy intercepts `/metrics` and returns an empty 200 response instantly.
2. **`/health` probe timing** — SGLang's `/health` endpoint takes ~1.001s to respond, which races the 1s k8s probe timeout and causes repeated `Startup probe failed: context deadline exceeded`. haproxy health-checks SGLang in the background (every 5s, with a 3s timeout) and responds to `/health` probes **instantly** — 200 if the backend is up, 503 if it's not. No more timeout roulette.
```
k8s vLLM stack
   │  vllm serve mistralai/Devstral-2-123B-Instruct-2512 \
   │    --host 0.0.0.0 --port 8000 --tensor-parallel-size 8 ...
   ▼
┌─────────────────────────────────────────────────────────┐
│ vllm-shim.sh (replaces the `vllm` binary)               │
│ or vllm_shim_module.py (shadows python -m vllm.*)       │
│ Parses vLLM args, translates to SGLang equivalents,     │
│ then launches three processes:                          │
│                                                         │
│  ┌──────────────────────────────────────────────────┐   │
│  │ haproxy :8000  (front door)                      │   │
│  │   /metrics → 200 empty (stub)                    │   │
│  │   /health  → 200/503 based on backend state      │   │
│  │   /*       → proxy to middleware :8002           │   │
│  └──────────────────────────────────────────────────┘   │
│                          │                              │
│                          ▼                              │
│  ┌──────────────────────────────────────────────────┐   │
│  │ middleware :8002  (FastAPI)                      │   │
│  │   Strips vLLM-only params from request bodies    │   │
│  │   Recursively fixes tool JSON schemas            │   │
│  │   Forwards to SGLang :8001                       │   │
│  └──────────────────────────────────────────────────┘   │
│                          │                              │
│                          ▼                              │
│  ┌──────────────────────────────────────────────────┐   │
│  │ SGLang :8001  (internal)                         │   │
│  │   The actual inference server                    │   │
│  └──────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘
```
haproxy 2.4 compat: uses `errorfile` + `http-request deny deny_status` for stub responses (the `http-request return` payload syntax requires haproxy 2.8+).
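The 2.4-safe stub boils down to this pattern (a sketch with assumed paths and ports; the shim writes the real config to `/tmp/haproxy-shim.cfg` at startup):

```
frontend proxy
    bind 0.0.0.0:8000
    # instant empty 200 for /metrics, served from an errorfile
    acl is_metrics path /metrics
    http-request deny deny_status 200 if is_metrics
    errorfile 200 /tmp/haproxy-errors/200-empty.http
```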
## Argument translation
The shim dynamically translates vLLM CLI args to SGLang equivalents — no hardcoded model names or tensor-parallel sizes. Currently running in production with `mistralai/Devstral-2-123B-Instruct-2512` on 8× MI300X.
| vLLM flag | SGLang equivalent | Notes |
|-----------|-------------------|-------|
| `serve` | *(skipped)* | Subcommand only |
| `<model>` (positional) | `--model-path <model>` | |
| `--host` | Used for all three processes | |
| `--port` | haproxy binds this port | SGLang gets +1, middleware +2 |
| `--tensor-parallel-size` | `--tp` | |
| `--gpu_memory_utilization` | `--mem-fraction-static` | |
| `--trust-remote-code` | `--trust-remote-code` | |
| `--no-enable-prefix-caching` | *(skipped)* | No SGLang equivalent |
| `--enable-chunked-prefill` | *(skipped)* | No SGLang equivalent |
| `--tool-call-parser` | `--tool-call-parser` | Defaults to `mistral` |
- The Dockerfile builds on `lmsysorg/sglang-rocm` and patches a broken `aiter` build from the base image
- MI300X tuning env vars are set (`HIP_FORCE_DEV_KERNARG`, `NCCL_MIN_NCHANNELS`, etc.)
- All received args are logged to `/tmp/vllm-shim.log` (configurable via the `VLLM_SHIM_LOG` env var)

Unknown flags are passed through as-is — they may be valid SGLang args.
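A minimal sketch of this translation (illustrative only, handling boolean skips; the real shims also cover `--flag=value` forms, value-taking skipped flags, and the positional model argument):

```python
# Simplified vLLM -> SGLang flag translation (not the shipped code).
ARG_MAP = {
    "--tensor-parallel-size": "--tp",
    "--gpu_memory_utilization": "--mem-fraction-static",
}
SKIP = {"--no-enable-prefix-caching", "--enable-chunked-prefill"}  # boolean flags only

def translate(vllm_args):
    out = []
    for arg in vllm_args:
        if arg in SKIP:
            continue  # no SGLang equivalent
        out.append(ARG_MAP.get(arg, arg))  # rename if known, else pass through
    return out

translate(["--tensor-parallel-size", "8", "--no-enable-prefix-caching"])
# → ["--tp", "8"]
```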
## Environment variables
| Variable | Default | Purpose |
|----------|---------|---------|
| `SGLANG_TOOL_CALL_PARSER` | `mistral` | Override the tool-call-parser |
| `VLLM_SHIM_LOG` | `/tmp/vllm-shim.log` | Log file path |
## Middleware: request body fixes
SGLang rejects certain parameters and schemas that vLLM (and OpenClaw) send. The middleware fixes these automatically:
### Stripped parameters
These vLLM-only parameters are removed from request bodies before forwarding to SGLang:
- `logprobs` / `top_logprobs` — SGLang's Mistral tool-call parser rejects these
- `chat_template_kwargs` — OpenClaw sends this for reasoning models; SGLang doesn't support it
- `guided_json` / `guided_regex` — vLLM-only guided decoding params
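In effect the middleware does the equivalent of the following (a sketch; the shipped code in `vllm_middleware.py` mutates the parsed body in place):

```python
# Top-level parameters SGLang would reject with a 422.
STRIP_PARAMS = {"logprobs", "top_logprobs", "chat_template_kwargs",
                "guided_json", "guided_regex"}

def strip_vllm_only(body: dict) -> dict:
    # Drop any offending top-level key before forwarding to SGLang
    return {k: v for k, v in body.items() if k not in STRIP_PARAMS}

strip_vllm_only({"model": "m", "messages": [], "logprobs": True, "top_logprobs": 5})
# → {"model": "m", "messages": []}
```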
### Schema fixes
OpenClaw (and some vLLM configurations) send tool schemas with `properties: []` instead of `properties: {}`. SGLang requires `properties` to be an object at **every level** of the schema, including nested `items` and sub-objects.
The middleware recursively walks the entire JSON Schema tree and fixes:
- `properties: []` → `properties: {}` (at any depth)
- `required: <non-list>` → removed
- `parameters: <non-object>` → `{"type": "object", "properties": {}}`
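A simplified version of the repair (the shipped `_fix_schema` also recurses into `anyOf`/`allOf`/`oneOf` and `additionalProperties`):

```python
def fix_schema(schema: dict) -> None:
    # properties must be an object at every level
    if "properties" in schema and not isinstance(schema["properties"], dict):
        schema["properties"] = {}
    # required must be a list, or absent
    if "required" in schema and not isinstance(schema["required"], list):
        del schema["required"]
    for sub in schema.get("properties", {}).values():
        if isinstance(sub, dict):
            fix_schema(sub)
    if isinstance(schema.get("items"), dict):
        fix_schema(schema["items"])

s = {"type": "object", "properties": [],
     "required": "name",
     "items": {"type": "object", "properties": []}}
fix_schema(s)
# s is now {"type": "object", "properties": {},
#           "items": {"type": "object", "properties": {}}}
```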
## Files
| File | Purpose |
|------|---------|
| `Dockerfile` | Builds on `lmsysorg/sglang-rocm`, installs haproxy, copies shim files |
| `Jenkinsfile` | CI/CD: builds and pushes to Vultr container registry |
| `vllm-shim.sh` | Shell shim — replaces the `vllm` binary, translates args |
| `vllm_shim_module.py` | Python shim — shadows `vllm.*` module imports, translates args |
| `vllm_middleware.py` | FastAPI middleware — strips bad params, fixes tool schemas |
| `README.md` | This file |
## Deploy
```bash
docker build -t vllm-to-sglang .
```
Or via Jenkins:
```bash
curl -X POST "https://jenkins.sweetapi.com/job/vllm-to-sglang/buildWithParameters" \
-u "${JENKINS_USER}:${JENKINS_PASS}" \
-d "BRANCH=metrics" \
-d "TAG=nightly3"
```
Then use this image anywhere the vLLM stack expects its server image.

entrypoint.sh (deleted)

@@ -1,42 +0,0 @@
#!/bin/bash
set -euo pipefail
# Defaults matching vLLM production stack defaults
HOST="0.0.0.0"
PORT="8000"
# Save original args before parsing eats them
ALL_ARGS="$*"
# Parse only host and port from whatever args the vLLM stack sends.
# Everything else is ignored.
while [[ $# -gt 0 ]]; do
case "$1" in
--host) HOST="$2"; shift 2 ;;
--host=*) HOST="${1#*=}"; shift ;;
--port) PORT="$2"; shift 2 ;;
--port=*) PORT="${1#*=}"; shift ;;
*) shift ;; # ignore everything else
esac
done
echo "=== vLLM production stack args received ==="
echo "Raw args: $ALL_ARGS"
echo ""
i=1
for arg in $ALL_ARGS; do
echo " [$i] $arg"
i=$((i + 1))
done
echo "============================================"
echo ""
echo "=== SGLang shim ==="
echo "Ignoring vLLM args. Launching SGLang on ${HOST}:${PORT}"
echo "==================="
exec python -m sglang.launch_server \
--model-path mistralai/Devstral-2-123B-Instruct-2512 \
--host "$HOST" \
--port "$PORT" \
--tp 8 \
--tool-call-parser mistral

vllm-shim.sh

@@ -1,20 +1,21 @@
#!/bin/bash
#!/usr/bin/env bash
set -euo pipefail
# ============================================================
# vLLM -> SGLang shim (shell version)
# This script replaces the vllm binary. The k8s production stack
# calls `vllm serve <model> [flags]`, and we intercept everything.
#
# Dynamically translates vLLM args to SGLang equivalents.
# No hardcoded model or tensor-parallel size.
#
# Architecture:
# haproxy on the vLLM port (front door)
# /metrics → 200 empty (stub)
# /health → 200 if SGLang backend is up, 503 if not
# /* → proxy to middleware on port+2
# middleware on port+2 (strips vLLM-only params, fixes schemas)
# SGLang on port+1 (internal)
#
# haproxy 2.4 compat: uses errorfile for stub responses instead
# of http-request return (which needs 2.8+ for payload syntax).
# ============================================================
echo ""
@@ -46,36 +47,107 @@ LOG_PATH="${VLLM_SHIM_LOG:-/tmp/vllm-shim.log}"
echo ""
} >> "$LOG_PATH"
# ── Parse vLLM args → extract model, host, port, translate the rest ──
MODEL=""
HOST="0.0.0.0"
PORT="8000"
SGLANG_ARGS=()
SKIPPED_ARGS=()
# Default tool-call-parser; override with SGLANG_TOOL_CALL_PARSER env var
TOOL_CALL_PARSER="${SGLANG_TOOL_CALL_PARSER:-mistral}"
while [[ $# -gt 0 ]]; do
case "$1" in
# Skip 'serve' subcommand
serve) shift ;;
# ── Extracted for infrastructure (not passed to SGLang) ──
--host) HOST="$2"; shift 2 ;;
--host=*) HOST="${1#*=}"; shift ;;
--port) PORT="$2"; shift 2 ;;
--port=*) PORT="${1#*=}"; shift ;;
# ── Positional model name ──
--model|--model-name)
MODEL="$2"; shift 2 ;;
--model=*|--model-name=*)
MODEL="${1#*=}"; shift ;;
# ── Direct renames (vLLM → SGLang) ──
--tensor-parallel-size)
SGLANG_ARGS+=("--tp" "$2"); shift 2 ;;
--tensor-parallel-size=*)
SGLANG_ARGS+=("--tp" "${1#*=}"); shift ;;
--gpu_memory_utilization)
SGLANG_ARGS+=("--mem-fraction-static" "$2"); shift 2 ;;
--gpu_memory_utilization=*)
SGLANG_ARGS+=("--mem-fraction-static" "${1#*=}"); shift ;;
--trust_remote_code|--trust-remote-code)
SGLANG_ARGS+=("--trust-remote-code"); shift ;;
# ── vLLM flags with no SGLang equivalent → skip ──
--no-enable-prefix-caching|--enable-prefix-caching)
SKIPPED_ARGS+=("$1"); shift ;;
--enable-chunked-prefill|--no-enable-chunked-prefill)
SKIPPED_ARGS+=("$1"); shift ;;
--disable-log-requests|--disable-log-stats)
SKIPPED_ARGS+=("$1"); shift ;;
--swap-space|--block-size|--max-num-seqs|--max-num-batched-tokens)
SKIPPED_ARGS+=("$1" "$2"); shift 2 ;;
--swap-space=*|--block-size=*|--max-num-seqs=*|--max-num-batched-tokens=*)
SKIPPED_ARGS+=("$1"); shift ;;
--distributed-executor-backend|--pipeline-parallel-size|--data-parallel-size)
SKIPPED_ARGS+=("$1" "$2"); shift 2 ;;
--quantization|--dtype|--revision|--tokenizer-revision|--tokenizer-mode)
SKIPPED_ARGS+=("$1" "$2"); shift 2 ;;
--quantization=*|--dtype=*|--revision=*|--tokenizer-revision=*|--tokenizer-mode=*)
SKIPPED_ARGS+=("$1"); shift ;;
# ── Pass through to SGLang as-is ──
--tool-call-parser)
TOOL_CALL_PARSER="$2"; shift 2 ;;
--tool-call-parser=*)
TOOL_CALL_PARSER="${1#*=}"; shift ;;
*)
# Positional arg = model name (first non-flag)
if [[ ! "$1" =~ ^- ]] && [[ -z "$MODEL" ]]; then
MODEL="$1"; shift
else
# Unknown — pass through, might be valid for SGLang
SGLANG_ARGS+=("$1"); shift
fi ;;
esac
done
if [[ -z "$MODEL" ]]; then
echo "ERROR: No model specified in vLLM args!"
exit 1
fi
# ── Port scheme: haproxy=original, SGLang=+1, middleware=+2 ──
SGLANG_PORT=$((PORT + 1))
MIDDLEWARE_PORT=$((PORT + 2))
echo "Model: ${MODEL}"
echo "SGLang: ${HOST}:${SGLANG_PORT}"
echo "Middleware: ${HOST}:${MIDDLEWARE_PORT}"
echo "haproxy: ${HOST}:${PORT}"
if [[ ${#SGLANG_ARGS[@]} -gt 0 ]]; then
echo "Translated args: ${SGLANG_ARGS[*]}"
fi
if [[ ${#SKIPPED_ARGS[@]} -gt 0 ]]; then
echo "Skipped (no SGLang equivalent): ${SKIPPED_ARGS[*]}"
fi
echo ""
# Prepare error files for haproxy stub responses
# haproxy errorfile format: HTTP/1.x status_code reason\r\nheaders\r\n\r\nbody
# ── haproxy setup ───────────────────────────────────────────
mkdir -p /tmp/haproxy-errors
printf "HTTP/1.0 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n" > /tmp/haproxy-errors/200-empty.http
printf "HTTP/1.0 503 Service Unavailable\r\nContent-Length: 16\r\nConnection: close\r\nContent-Type: text/plain\r\n\r\nSGLang not ready" > /tmp/haproxy-errors/503-sglang.http
# Write haproxy config (compatible with haproxy 2.4)
HAPROXY_CFG="/tmp/haproxy-shim.cfg"
cat > "$HAPROXY_CFG" <<EOF
global
@@ -90,14 +162,10 @@ defaults
frontend proxy
bind ${HOST}:${PORT}
# /metrics stub — instant 200 empty (vLLM stack expects this)
acl is_metrics path /metrics
http-request deny deny_status 200 if is_metrics
errorfile 200 /tmp/haproxy-errors/200-empty.http
# /health — instant response based on SGLang backend state
# haproxy health-checks SGLang in the background; this avoids
# the 1s k8s probe timeout racing SGLang's ~1.001s /health response
acl is_health path /health
acl sglang_up nbsrv(sglang) gt 0
http-request deny deny_status 200 if is_health sglang_up
@@ -109,34 +177,54 @@ frontend proxy
backend sglang
option httpchk GET /health
http-check expect status 200
server s1 127.0.0.1:${SGLANG_PORT} check inter 5s fall 3 rise 2
server s1 127.0.0.1:${MIDDLEWARE_PORT} check inter 5s fall 3 rise 2
EOF
# ── Build and launch SGLang ─────────────────────────────────
SGLANG_CMD=(
python -m sglang.launch_server
--model-path "$MODEL"
--host "$HOST"
--port "$SGLANG_PORT"
)
if [[ -n "$TOOL_CALL_PARSER" ]]; then
SGLANG_CMD+=(--tool-call-parser "$TOOL_CALL_PARSER")
fi
SGLANG_CMD+=("${SGLANG_ARGS[@]}")
echo "SGLang command: ${SGLANG_CMD[*]}"
echo ""
{
echo "haproxy config written to ${HAPROXY_CFG}"
echo "Model: ${MODEL}, SGLang port: ${SGLANG_PORT}, middleware port: ${MIDDLEWARE_PORT}, haproxy port: ${PORT}"
echo "SGLang command: ${SGLANG_CMD[*]}"
if [[ ${#SKIPPED_ARGS[@]} -gt 0 ]]; then
echo "Skipped vLLM args: ${SKIPPED_ARGS[*]}"
fi
} >> "$LOG_PATH"
# Launch SGLang
"${SGLANG_CMD[@]}" &
SGLANG_PID=$!
# Launch middleware
SGLANG_HOST="$HOST" SGLANG_PORT="$SGLANG_PORT" MIDDLEWARE_PORT="$MIDDLEWARE_PORT" \
python /opt/vllm-shim/vllm_middleware.py &
MIDDLEWARE_PID=$!
sleep 2
# Launch haproxy (front door on the original port)
haproxy -f "$HAPROXY_CFG" &
HAPROXY_PID=$!
echo "SGLang PID: ${SGLANG_PID}, middleware PID: ${MIDDLEWARE_PID}, haproxy PID: ${HAPROXY_PID}" >> "$LOG_PATH"
# Wait for whichever dies first
wait -n "$SGLANG_PID" "$MIDDLEWARE_PID" "$HAPROXY_PID"
EXIT_CODE=$?
echo "A process exited (code ${EXIT_CODE}), shutting down" >> "$LOG_PATH"
kill "$SGLANG_PID" "$MIDDLEWARE_PID" "$HAPROXY_PID" 2>/dev/null || true
exit $EXIT_CODE

vllm_middleware.py (new file)

@@ -0,0 +1,260 @@
"""
vLLM → SGLang request middleware.
Sits between haproxy and SGLang to strip vLLM-only parameters
that cause SGLang to return 422/400 errors.
Currently strips: logprobs, top_logprobs, chat_template_kwargs,
guided_json, guided_regex (SGLang rejects these; vLLM accepts them).
Architecture:
haproxy (port N) → middleware (port N+2) → SGLang (port N+1)
haproxy still handles /metrics stub and /health instant responses.
This middleware only touches the proxied request bodies.
"""
import json
import os
import asyncio
import httpx
from datetime import datetime
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse, Response
import uvicorn
SGLANG_HOST = os.environ.get("SGLANG_HOST", "127.0.0.1")
SGLANG_PORT = int(os.environ.get("SGLANG_PORT", "8001"))
LISTEN_PORT = int(os.environ.get("MIDDLEWARE_PORT", "8002"))
# Params that vLLM accepts but SGLang rejects.
# Extend this set as more incompatibilities are discovered.
STRIP_PARAMS = {"logprobs", "top_logprobs", "chat_template_kwargs", "guided_json", "guided_regex"}
client: httpx.AsyncClient | None = None
_sglang_ready = False
from contextlib import asynccontextmanager

@asynccontextmanager
async def _lifespan(app_instance):
    global client
    client = httpx.AsyncClient(
        timeout=httpx.Timeout(300.0, connect=10.0),
    )
    # Background task: wait for SGLang to become available
    asyncio.create_task(_wait_for_sglang())
    yield
    await client.aclose()
async def _wait_for_sglang():
    """Poll SGLang until it's accepting connections, then mark ready."""
    global _sglang_ready
    while True:
        try:
            resp = await client.get(
                f"http://{SGLANG_HOST}:{SGLANG_PORT}/health",
                timeout=httpx.Timeout(5.0, connect=2.0),
            )
            if resp.status_code == 200:
                _sglang_ready = True
                print(f"Middleware: SGLang is ready at {SGLANG_HOST}:{SGLANG_PORT}")
                return
        except (httpx.ConnectError, httpx.TimeoutException):
            pass
        await asyncio.sleep(2)
app = FastAPI(lifespan=_lifespan)
@app.get("/health")
async def health():
    """Health check — haproxy polls this. Returns 200 only if SGLang is up."""
    global _sglang_ready
    if not _sglang_ready:
        return Response(content="SGLang not ready", status_code=503)
    try:
        resp = await client.get(
            f"http://{SGLANG_HOST}:{SGLANG_PORT}/health",
            timeout=httpx.Timeout(5.0, connect=2.0),
        )
        return Response(content=resp.content, status_code=resp.status_code,
                        media_type=resp.headers.get("content-type"))
    except (httpx.ConnectError, httpx.TimeoutException):
        _sglang_ready = False
        # Re-trigger background wait
        asyncio.create_task(_wait_for_sglang())
        return Response(content="SGLang not ready", status_code=503)
ERROR_LOG = os.environ.get("VLLM_SHIM_LOG", "/tmp/vllm-shim.log")
def _fix_schema(schema: dict) -> bool:
    """Recursively fix a JSON Schema dict: properties must be an object, required must be a list of strings."""
    fixed = False
    # Fix 'properties' — must be dict, not array/null
    if "properties" in schema and not isinstance(schema["properties"], dict):
        schema["properties"] = {}
        fixed = True
    # Fix 'required' — must be list of strings or absent
    if "required" in schema and not isinstance(schema["required"], list):
        del schema["required"]
        fixed = True
    # Recurse into every property value
    if isinstance(schema.get("properties"), dict):
        for val in schema["properties"].values():
            if isinstance(val, dict):
                if _fix_schema(val):
                    fixed = True
    # Recurse into items (for array-of-objects)
    if isinstance(schema.get("items"), dict):
        if _fix_schema(schema["items"]):
            fixed = True
    # Recurse into anyOf, allOf, oneOf
    for key in ("anyOf", "allOf", "oneOf"):
        if isinstance(schema.get(key), list):
            for item in schema[key]:
                if isinstance(item, dict):
                    if _fix_schema(item):
                        fixed = True
    # Recurse into additionalProperties if it's a schema
    if isinstance(schema.get("additionalProperties"), dict):
        if _fix_schema(schema["additionalProperties"]):
            fixed = True
    return fixed
def _dump_error(request_body: bytes, status_code: int, resp_headers: dict, resp_body_raw: bytes, path: str = ""):
    """Log full request + response payload when SGLang returns an error (4xx/5xx)."""
    try:
        ts = datetime.now().isoformat()
        req_json = None
        try:
            req_json = json.loads(request_body)
        except (json.JSONDecodeError, UnicodeDecodeError):
            pass
        resp_text = resp_body_raw.decode("utf-8", errors="replace")[:4000]
        resp_json = None
        try:
            resp_json = json.loads(resp_text)
        except (json.JSONDecodeError, UnicodeDecodeError):
            pass
        with open(ERROR_LOG, "a") as f:
            f.write(f"\n{'='*60}\n")
            f.write(f"[{ts}] ERROR DUMP — SGLang returned HTTP {status_code}\n")
            f.write(f"Path: {path}\n")
            f.write("--- Request Body ---\n")
            if req_json:
                f.write(json.dumps(req_json, indent=2, ensure_ascii=False)[:8000])
            else:
                f.write(request_body.decode("utf-8", errors="replace")[:8000])
            f.write(f"\n--- Response (HTTP {status_code}) ---\n")
            if resp_json:
                f.write(json.dumps(resp_json, indent=2, ensure_ascii=False)[:4000])
            else:
                f.write(resp_text)
            f.write(f"\n{'='*60}\n")
        print(f"[{ts}] ERROR DUMP: HTTP {status_code} on {path} — full payload written to {ERROR_LOG}")
    except Exception as e:
        print(f"_dump_error failed: {e}")
@app.api_route("/{path:path}", methods=["GET", "POST", "PUT", "DELETE", "PATCH", "OPTIONS"])
async def proxy(path: str, request: Request):
    body = await request.body()
    is_streaming = False
    # Strip incompatible params from chat completion POST requests
    if request.method == "POST" and "chat/completions" in path and body:
        try:
            data = json.loads(body)
            is_streaming = data.get("stream", False)
            stripped_any = False
            for key in STRIP_PARAMS:
                if key in data:
                    del data[key]
                    stripped_any = True
            # Fix tool function parameters: recurse to fix ALL bad properties/required
            tools = data.get("tools")
            if isinstance(tools, list):
                for tool in tools:
                    func = tool.get("function") if isinstance(tool, dict) else None
                    if not isinstance(func, dict):
                        continue
                    if not isinstance(func.get("parameters"), dict):
                        func["parameters"] = {"type": "object", "properties": {}}
                        stripped_any = True
                    if _fix_schema(func["parameters"]):
                        stripped_any = True
            if stripped_any:
                body = json.dumps(data).encode()
        except (json.JSONDecodeError, UnicodeDecodeError):
            pass
    # Forward headers (skip hop-by-hop and ones we're replacing)
    fwd_headers = {
        k: v for k, v in request.headers.items()
        if k.lower() not in ("host", "content-length", "transfer-encoding")
    }
    fwd_headers["content-length"] = str(len(body))
    url = f"http://{SGLANG_HOST}:{SGLANG_PORT}/{path}"
    if request.query_params:
        url += f"?{request.query_params}"
    try:
        if is_streaming:
            req = client.build_request(request.method, url, content=body, headers=fwd_headers)
            resp = await client.send(req, stream=True)
            # Dump on error for streaming responses
            if resp.status_code >= 400:
                error_body = await resp.aread()
                _dump_error(body, resp.status_code, resp_headers=dict(resp.headers), resp_body_raw=error_body, path=path)
                await resp.aclose()
                return Response(
                    content=error_body,
                    status_code=resp.status_code,
                    media_type=resp.headers.get("content-type"),
                )

            async def stream_body():
                try:
                    async for chunk in resp.aiter_bytes():
                        yield chunk
                finally:
                    await resp.aclose()

            return StreamingResponse(
                stream_body(),
                status_code=resp.status_code,
                headers={"content-type": resp.headers.get("content-type", "text/event-stream")},
            )
        else:
            resp = await client.request(request.method, url, content=body, headers=fwd_headers)
            # Dump on error
            if resp.status_code >= 400:
                _dump_error(body, resp.status_code, resp_headers=dict(resp.headers), resp_body_raw=resp.content, path=path)
            return Response(
                content=resp.content,
                status_code=resp.status_code,
                media_type=resp.headers.get("content-type"),
            )
    except (httpx.ConnectError, httpx.TimeoutException) as e:
        return Response(
            content=json.dumps({"error": {"message": f"SGLang backend unavailable: {e}", "type": "backend_error"}}),
            status_code=503,
            media_type="application/json",
        )


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=LISTEN_PORT, log_level="warning")

vllm_shim_module.py

@@ -1,28 +1,187 @@
#!/usr/bin/env python3
"""
vLLM -> SGLang Python shim.
Catches `python -m vllm.entrypoints.openai.api_server` (and similar)
and launches SGLang behind haproxy + middleware instead.
Dynamically translates vLLM CLI args to SGLang equivalents.
No hardcoded model name or tensor-parallel size.
Architecture:
haproxy on the vLLM port (front door)
/metrics → 200 empty (stub)
/health → 200 if SGLang backend is up, 503 if not (instant)
/* → proxy to middleware on port+2
middleware on port+2 (strips vLLM-only params, fixes tool schemas)
SGLang on port+1 (internal)
haproxy 2.4 compat: uses errorfile for stub responses instead
of http-request return (which needs 2.8+ for payload syntax).
"""
import os
import sys
import subprocess
import time
import datetime
# ── vLLM → SGLang argument mapping ──────────────────────────
# Key = vLLM flag, value = (sglang_flag, has_value)
# has_value=True means the flag takes an argument (e.g. --port 8000)
# has_value=False means it's a boolean flag (e.g. --no-enable-prefix-caching)
ARG_MAP = {
# Direct renames (vLLM name → SGLang name)
"--tensor-parallel-size": ("--tp", True),
"--gpu_memory_utilization": ("--mem-fraction-static", True),
"--max_model_len": ("--context-length", True),
"--max-model-len": ("--context-length", True), # kebab variant
"--enforce_eager": (None, False), # no SGLang equivalent; skip (--enable-torch-compile would be the opposite)
"--trust_remote_code": ("--trust-remote-code", False),
"--trust-remote-code": ("--trust-remote-code", False),
# vLLM flags with no SGLang equivalent → skip
"--no-enable-prefix-caching": (None, False),
"--enable-prefix-caching": (None, False),
"--enable-chunked-prefill": (None, False),
"--no-enable-chunked-prefill":(None, False),
"--disable-log-requests": (None, False),
"--disable-log-stats": (None, False),
"--swap-space": (None, True),
"--block-size": (None, True),
"--num-gpu-blocks-override": (None, True),
"--num-cpu-blocks-override": (None, True),
"--max-num-seqs": (None, True),
"--max-num-batched-tokens": (None, True),
"--distributed-executor-backend": (None, True),
"--pipeline-parallel-size": (None, True),
"--data-parallel-size": (None, True),
"--revision": (None, True),
"--code-revision": (None, True),
"--tokenizer-revision": (None, True),
"--tokenizer-mode": (None, True),
"--quantization": (None, True),
"--dtype": (None, True),
"--max-seq-len-to-capture": (None, True),
"--enable-lora": (None, False),
"--max-lora-rank": (None, True),
"--max-cpu-loras": (None, True),
"--lora-dtype": (None, True),
"--enable-prompt-adapter": (None, False),
"--scheduler-delay-factor": (None, True),
"--enable-multi-modal": (None, False),
"--limit-mm-per-prompt": (None, True),
}
# Default tool-call-parser; override with SGLANG_TOOL_CALL_PARSER env var
DEFAULT_TOOL_CALL_PARSER = "mistral"
def parse_vllm_args(args):
    """
    Parse vLLM CLI args and extract model, host, port,
    plus any args we should translate to SGLang.
    Returns (model, host, port, sglang_extra_args, skipped_args).
    """
    model = None
    host = "0.0.0.0"
    port = "8000"
    sglang_extra = []  # translated args for SGLang
    skipped = []       # vLLM args we're ignoring
    i = 0
    while i < len(args):
        arg = args[i]
        # 'serve' subcommand — skip
        if arg == "serve":
            i += 1
            continue
        # Positional model argument (first non-flag after serve, or standalone)
        if not arg.startswith("-") and model is None:
            model = arg
            i += 1
            continue
        # --flag=value form
        if "=" in arg and arg.startswith("--"):
            flag, val = arg.split("=", 1)
            if flag == "--host":
                host = val
            elif flag == "--port":
                port = val
            elif flag in ARG_MAP:
                sglang_flag, has_val = ARG_MAP[flag]
                if sglang_flag is None:
                    skipped.append(arg)
                elif has_val:
                    sglang_extra.extend([sglang_flag, val])
                else:
                    # boolean flag with =value (unusual but valid)
                    sglang_extra.append(sglang_flag)
            else:
                # Unknown flag — pass through as-is (might be a SGLang flag too)
                sglang_extra.append(arg)
            i += 1
            continue
        # --flag value form
        if arg in ("--host",):
            if i + 1 < len(args):
                host = args[i + 1]
            i += 2
            continue
        if arg in ("--port",):
            if i + 1 < len(args):
                port = args[i + 1]
            i += 2
            continue
        if arg in ARG_MAP:
            sglang_flag, has_val = ARG_MAP[arg]
            if sglang_flag is None:
                skipped.append(arg)
                if has_val and i + 1 < len(args) and not args[i + 1].startswith("-"):
                    skipped.append(args[i + 1])
                    i += 2
                else:
                    i += 1
            elif has_val:
                if i + 1 < len(args):
                    sglang_extra.extend([sglang_flag, args[i + 1]])
                    i += 2
                else:
                    i += 1
            else:
                sglang_extra.append(sglang_flag)
                i += 1
            continue
        # --tool-call-parser — pass through to SGLang
        if arg == "--tool-call-parser":
            if i + 1 < len(args):
                sglang_extra.extend(["--tool-call-parser", args[i + 1]])
                i += 2
            else:
                i += 1
            continue
        # Unknown flag — pass through if it takes a value, might be valid for SGLang
        if arg.startswith("--") and i + 1 < len(args) and not args[i + 1].startswith("-"):
            sglang_extra.extend([arg, args[i + 1]])
            i += 2
        elif arg.startswith("--"):
            sglang_extra.append(arg)
            i += 1
        else:
            # Unknown positional — probably the model if we don't have it yet
            if model is None:
                model = arg
            i += 1
    return model, host, port, sglang_extra, skipped
def main():
    args = sys.argv[1:]
    log_path = os.environ.get("VLLM_SHIM_LOG", "/tmp/vllm-shim.log")
    with open(log_path, "a") as f:
        f.write(f"\n{datetime.datetime.now().isoformat()} vLLM -> SGLang Shim (Python module)\n")
        f.write(f"  Invoked as: python -m {__name__} {' '.join(args)}\n")
print("==========================================")
print()
model, host, port, sglang_extra, skipped = parse_vllm_args(args)
# Fall back to vLLM's defaults if the CLI did not supply host/port
host = host or "0.0.0.0"
port = port or "8000"
if not model:
print("ERROR: No model specified in vLLM args!")
os._exit(1)
# Port scheme: haproxy binds the original port (front door);
# SGLang = original+1, middleware = original+2
sglang_port = str(int(port) + 1)
middleware_port = str(int(port) + 2)
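# Example: vLLM invoked with --port 8000 yields
#   haproxy    :8000  (front door, the port clients already use)
#   SGLang     :8001  (internal)
#   middleware :8002  (strips vLLM-only params, forwards to SGLang)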
# Build SGLang command
sglang_cmd = [
sys.executable, "-m", "sglang.launch_server",
"--model-path", model,
"--host", host,
"--port", sglang_port,
]
# Add tool-call-parser (env override or default)
tcp = os.environ.get("SGLANG_TOOL_CALL_PARSER", DEFAULT_TOOL_CALL_PARSER)
if tcp:
sglang_cmd.extend(["--tool-call-parser", tcp])
# Add translated/forwarded args
sglang_cmd.extend(sglang_extra)
print(f"Model: {model}")
print(f"SGLang host: {host}:{sglang_port}")
print(f"Middleware: {host}:{middleware_port}")
print(f"haproxy: {host}:{port}")
if sglang_extra:
print(f"Translated args: {' '.join(sglang_extra)}")
if skipped:
print(f"Skipped (no SGLang equivalent): {' '.join(skipped)}")
print()
print(f"SGLang command: {' '.join(sglang_cmd)}")
print()
# ── haproxy setup ────────────────────────────────────────
# Prepare error files for haproxy stub responses.
# errorfile format: HTTP/1.x status_code reason\r\nheaders\r\n\r\nbody
os.makedirs("/tmp/haproxy-errors", exist_ok=True)
with open("/tmp/haproxy-errors/200-empty.http", "w") as f:
f.write("HTTP/1.0 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n")
with open("/tmp/haproxy-errors/503-sglang.http", "w") as f:
f.write("HTTP/1.0 503 Service Unavailable\r\nContent-Length: 16\r\nConnection: close\r\nContent-Type: text/plain\r\n\r\nSGLang not ready")
# Write haproxy config (compatible with haproxy 2.4)
haproxy_cfg = "/tmp/haproxy-shim.cfg"
with open(haproxy_cfg, "w") as f:
f.write(f"""global
errorfile 200 /tmp/haproxy-errors/200-empty.http
# /health — instant response based on SGLang backend state
# haproxy health-checks SGLang in the background; this avoids
# the 1s k8s probe timeout racing SGLang's ~1.001s /health response
acl is_health path /health
acl sglang_up nbsrv(sglang) gt 0
http-request deny deny_status 200 if is_health sglang_up
backend sglang
option httpchk GET /health
http-check expect status 200
server s1 127.0.0.1:{middleware_port} check inter 5s fall 3 rise 2
""")
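# The stub responses above make /health cheap to probe, e.g.:
#   curl -s http://127.0.0.1:<port>/health
# answers 200 (empty body) once haproxy sees the backend up, and the
# 503 "SGLang not ready" stub while it is still loading.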
with open(log_path, "a") as f:
f.write(f"haproxy config written to {haproxy_cfg}\n")
f.write(f"Model: {model}, SGLang port: {sglang_port}, middleware port: {middleware_port}, haproxy port: {port}\n")
f.write(f"SGLang command: {' '.join(sglang_cmd)}\n")
if skipped:
f.write(f"Skipped vLLM args: {' '.join(skipped)}\n")
# ── Launch processes ─────────────────────────────────────
sglang_proc = subprocess.Popen(sglang_cmd)
middleware_env = os.environ.copy()
middleware_env["SGLANG_HOST"] = host
middleware_env["SGLANG_PORT"] = sglang_port
middleware_env["MIDDLEWARE_PORT"] = middleware_port
middleware_proc = subprocess.Popen(
[sys.executable, "/opt/vllm-shim/vllm_middleware.py"],
env=middleware_env,
)
# Give SGLang a moment before haproxy starts routing
time.sleep(2)
# Start haproxy in the background
haproxy_proc = subprocess.Popen(["haproxy", "-f", haproxy_cfg])
with open(log_path, "a") as f:
f.write(f"SGLang PID: {sglang_proc.pid}, middleware PID: {middleware_proc.pid}, haproxy PID: {haproxy_proc.pid}\n")
# Wait for whichever dies first
while True:
sglang_ret = sglang_proc.poll()
middleware_ret = middleware_proc.poll()
haproxy_ret = haproxy_proc.poll()
if sglang_ret is not None:
print(f"SGLang exited (code {sglang_ret}), shutting down")
middleware_proc.terminate()
haproxy_proc.terminate()
os._exit(sglang_ret)
if middleware_ret is not None:
print(f"Middleware exited (code {middleware_ret}), shutting down")
sglang_proc.terminate()
haproxy_proc.terminate()
os._exit(middleware_ret)
if haproxy_ret is not None:
print(f"haproxy exited (code {haproxy_ret}), shutting down")
sglang_proc.terminate()
middleware_proc.terminate()
os._exit(haproxy_ret)
time.sleep(1)


if __name__ == "__main__":
    main()