Compare commits

17 Commits

Author SHA1 Message Date
7d9c4da2ee not sure why we have a default tool parser 2026-04-13 17:49:44 +00:00
efc9dc33e7 dynamic arg translation, remove entrypoint.sh, update README 2026-04-12 21:23:26 +00:00
7c1ed0408b fix: recursive _fix_schema to handle nested properties=[] at any depth 2026-04-12 20:52:44 +00:00
a9911386e0 strip guided_json, guided_regex too; fix parameters.properties array 2026-04-12 20:27:44 +00:00
ccedd3ecee fix: add chat_template_kwargs to STRIP_PARAMS, fix parameters.properties array 2026-04-12 20:23:10 +00:00
c66511e16f fix: handle parameters.properties being array, not just parameters itself 2026-04-12 20:17:06 +00:00
e03e41eb4f fix vLLM/SGLang schema mismatch 2026-04-12 19:57:47 +00:00
7ecbac2dc0 Fix UnboundLocalError in health(), switch from on_event to lifespan 2026-04-12 19:41:08 +00:00
774964a4db Add error dump logging: capture full request+response on 4xx/5xx from SGLang 2026-04-12 19:28:04 +00:00
db9231f796 Fix middleware: handle SGLang startup lag gracefully
- Add /health endpoint that returns 503 until SGLang is ready
- Background task polls SGLang until it accepts connections
- Catch ConnectError/TimeoutException instead of crashing
- Return 503 JSON error when SGLang backend is unavailable
- haproxy health-checks middleware /health, which reflects SGLang state
2026-04-12 19:06:38 +00:00
bbe40ac8c0 Add middleware to strip vLLM-only params (logprobs/top_logprobs) before forwarding to SGLang
SGLang's Mistral tool-call parser rejects logprobs/top_logprobs with 422,
while vLLM accepts them. Clients like OpenClaw send these by default.

New architecture: haproxy (port N) → middleware (port N+2) → SGLang (port N+1)
The middleware is a thin FastAPI app that strips incompatible params from
chat completion request bodies and passes everything else through unchanged.
2026-04-12 18:58:37 +00:00
359aa94337 Update README: haproxy proxy layer, /health probe fix, current state 2026-04-12 18:27:06 +00:00
6476c9c12a fix: content-length 16 not 15, remove 'timeout check' (not valid in haproxy 2.4 server line) 2026-04-12 17:29:08 +00:00
725e61d792 fix: haproxy 2.4 compat — use errorfile instead of http-request return
haproxy 2.4 (Ubuntu 22.04) doesn't support http-request return with
payload/content-type syntax (that's 2.8+). Switch to errorfile-based
stub responses: http-request deny deny_status N + errorfile N path.
2026-04-12 17:26:45 +00:00
1ddc08c88b haproxy: intercept /health too — instant response based on backend state
SGLang's /health takes ~1.001s, racing the 1s k8s probe timeout.
Now haproxy health-checks SGLang in the background (5s interval, 3s check timeout)
and responds to /health probes instantly: 200 if backend is up, 503 if not.
2026-04-12 17:21:04 +00:00
7fb373fdfc Add haproxy proxy: /metrics returns 200 empty, everything else proxies to SGLang
SGLang now runs on port+1, haproxy binds the original vLLM port.
haproxy serves a stub /metrics endpoint (200, empty body) and
passes all other traffic through to SGLang via raw TCP proxy.
2026-04-12 17:09:58 +00:00
dd3a981497 Log all received args to /tmp/vllm-shim.log 2026-04-12 04:37:24 +00:00
6 changed files with 863 additions and 120 deletions


@@ -1,5 +1,11 @@
FROM lmsysorg/sglang-rocm:v0.5.10rc0-rocm700-mi30x-20260411 FROM lmsysorg/sglang-rocm:v0.5.10rc0-rocm700-mi30x-20260411
# ---------------------------------------------------------------
# haproxy: routes /metrics stub, proxies everything else to SGLang
# ---------------------------------------------------------------
RUN apt-get update && apt-get install -y --no-install-recommends haproxy \
&& rm -rf /var/lib/apt/lists/*
# --------------------------------------------------------------- # ---------------------------------------------------------------
# Replace the vllm binary with our shim # Replace the vllm binary with our shim
# --------------------------------------------------------------- # ---------------------------------------------------------------
@@ -12,6 +18,7 @@ RUN mkdir -p /opt/vllm-shim/vllm/entrypoints/openai \
COPY vllm_shim_module.py /opt/vllm-shim/vllm/__main__.py COPY vllm_shim_module.py /opt/vllm-shim/vllm/__main__.py
COPY vllm_shim_module.py /opt/vllm-shim/vllm/entrypoints/openai/api_server.py COPY vllm_shim_module.py /opt/vllm-shim/vllm/entrypoints/openai/api_server.py
COPY vllm_shim_module.py /opt/vllm-shim/vllm/entrypoints/cli/main.py COPY vllm_shim_module.py /opt/vllm-shim/vllm/entrypoints/cli/main.py
COPY vllm_middleware.py /opt/vllm-shim/vllm_middleware.py
RUN touch /opt/vllm-shim/vllm/__init__.py \ RUN touch /opt/vllm-shim/vllm/__init__.py \
/opt/vllm-shim/vllm/entrypoints/__init__.py \ /opt/vllm-shim/vllm/entrypoints/__init__.py \
/opt/vllm-shim/vllm/entrypoints/openai/__init__.py \ /opt/vllm-shim/vllm/entrypoints/openai/__init__.py \

131
README.md

@@ -1,52 +1,115 @@
# vLLM → SGLang Shim # vllm-to-sglang
Drop-in replacement that makes a vLLM production stack (e.g. the [k8s operator](https://github.com/vllm-project/production-stack)) actually run [SGLang](https://github.com/sgl-project/sglang) instead. Drop-in replacement that makes a vLLM production stack (e.g. the [k8s operator](https://github.com/vllm-project/production-stack)) actually run [SGLang](https://github.com/sgl-project/sglang) instead.
## Why? ## How it works
The vLLM production stack handles model lifecycle, scaling, and routing — but some models work better (or only work) on SGLang. Rather than rewriting your deployment infra, this shim intercepts every vLLM invocation and launches SGLang with equivalent arguments. The k8s vLLM production stack calls `vllm serve <model> [flags]`. This project intercepts that call and instead launches SGLang behind haproxy + a middleware layer.
## How It Works ```
k8s vLLM stack
│ vllm serve mistralai/Devstral-2-123B-Instruct-2512 \
│ --host 0.0.0.0 --port 8000 --tensor-parallel-size 8 ...
┌─────────────────────────────────────────────────────────┐
│ vllm-shim.sh (replaces the `vllm` binary) │
│ or vllm_shim_module.py (shadows python -m vllm.*) │
│ │
│ Parses vLLM args, translates to SGLang equivalents, │
│ then launches three processes: │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ haproxy :8000 (front door) │ │
│ │ /metrics → 200 empty (stub) │ │
│ │ /health → 200/503 based on backend state │ │
│ │ /* → proxy to middleware :8002 │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ middleware :8002 (FastAPI) │ │
│ │ Strips vLLM-only params from request bodies │ │
│ │ Recursively fixes tool JSON schemas │ │
│ │ Forwards to SGLang :8001 │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ SGLang :8001 (internal) │ │
│ │ The actual inference server │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
```
Two interception paths: ## Argument translation
| What the stack calls | What happens | The shim dynamically translates vLLM CLI args to SGLang equivalents — no hardcoded model names or tensor-parallel sizes.
|---|---|
| `vllm serve <model> [flags]` | Shell shim (`vllm-shim.sh`) parses args, execs `python -m sglang.launch_server` |
| `python -m vllm.entrypoints.openai.api_server` | Python shim (shadow module on `PYTHONPATH`) does the same |
Both extract `--host` and `--port` from whatever the stack sends and forward them to SGLang. Everything else is currently hardcoded for the target model. | vLLM flag | SGLang equivalent | Notes |
|-----------|-------------------|-------|
| `serve` | *(skipped)* | Subcommand only |
| `<model>` (positional) | `--model-path <model>` | |
| `--host` | Used for all three processes | |
| `--port` | haproxy binds this port | SGLang gets +1, middleware +2 |
| `--tensor-parallel-size` | `--tp` | |
| `--gpu_memory_utilization` | `--mem-fraction-static` | |
| `--trust-remote-code` | `--trust-remote-code` | |
| `--no-enable-prefix-caching` | *(skipped)* | No SGLang equivalent |
| `--enable-chunked-prefill` | *(skipped)* | No SGLang equivalent |
| `--tool-call-parser` | `--tool-call-parser` | Defaults to `mistral` |
## Current State Unknown flags are passed through as-is — they may be valid SGLang args.
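
The translation table above can be sketched in a few lines of Python. This is a trimmed-down illustration, not the shim's actual code: the `translate` helper, the `RENAMES` and `SKIPPED_BOOLEAN` names, and the `org/model` placeholder are assumptions made for the example, and only a handful of the flags from the table are covered.

```python
# Hypothetical, reduced translation table; the real shims handle more flags.
RENAMES = {
    "--tensor-parallel-size": "--tp",
    "--gpu_memory_utilization": "--mem-fraction-static",
}
SKIPPED_BOOLEAN = {"--no-enable-prefix-caching", "--enable-chunked-prefill"}


def translate(argv):
    """Return (model, port, sglang_args) for a `vllm serve ...` argument list."""
    model, port, out = None, 8000, []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "serve" or arg in SKIPPED_BOOLEAN:
            i += 1                      # subcommand or flag with no SGLang equivalent
        elif arg == "--port":
            port = int(argv[i + 1])     # kept for haproxy; SGLang gets port + 1
            i += 2
        elif arg in RENAMES:
            out += [RENAMES[arg], argv[i + 1]]
            i += 2
        elif not arg.startswith("-") and model is None:
            model = arg                 # first positional argument is the model
            i += 1
        else:
            out.append(arg)             # unknown: pass through unchanged
            i += 1
    return model, port, out


model, port, sglang_args = translate(
    ["serve", "org/model", "--port", "8000", "--tensor-parallel-size", "8",
     "--enable-chunked-prefill", "--trust-remote-code"]
)
command = ["python", "-m", "sglang.launch_server",
           "--model-path", model, "--port", str(port + 1), *sglang_args]
print(command)
```

With the port scheme from the table, haproxy would keep `8000`, SGLang would listen on `8001`, and the middleware on `8002`.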
**PoC — hardcoded for `mistralai/Devstral-2-123B-Instruct-2512` on 8× MI300X.** ### Environment variables
- Model path, `--tp 8`, and `--tool-call-parser mistral` are baked into both shims | Variable | Default | Purpose |
- The Dockerfile builds on `lmsysorg/sglang-rocm` and patches a broken `aiter` build from the base image |----------|---------|---------|
- MI300X tuning env vars are set (`HIP_FORCE_DEV_KERNARG`, `NCCL_MIN_NCHANNELS`, etc.) | `SGLANG_TOOL_CALL_PARSER` | `mistral` | Override the tool-call-parser |
| `VLLM_SHIM_LOG` | `/tmp/vllm-shim.log` | Log file path |
## Building ## Middleware: request body fixes
SGLang rejects certain parameters and schemas that vLLM (and OpenClaw) send. The middleware fixes these automatically:
### Stripped parameters
These vLLM-only parameters are removed from request bodies before forwarding to SGLang:
- `logprobs` / `top_logprobs` — SGLang's Mistral tool-call parser rejects these
- `chat_template_kwargs` — OpenClaw sends this for reasoning models; SGLang doesn't support it
- `guided_json` / `guided_regex` — vLLM-only guided decoding params
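
As a rough illustration of the stripping step, the sketch below removes the listed keys from a chat completion body. The `strip_vllm_params` helper is a hypothetical name for this example; the parameter set mirrors the list above.

```python
import json

STRIP_PARAMS = {"logprobs", "top_logprobs", "chat_template_kwargs",
                "guided_json", "guided_regex"}


def strip_vllm_params(body: bytes) -> bytes:
    """Drop the listed top-level keys; return the body untouched if nothing changes."""
    try:
        data = json.loads(body)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return body                     # not JSON: forward as-is
    if not isinstance(data, dict):
        return body
    cleaned = {k: v for k, v in data.items() if k not in STRIP_PARAMS}
    return json.dumps(cleaned).encode() if len(cleaned) != len(data) else body


request = json.dumps(
    {"model": "m", "messages": [], "logprobs": True, "top_logprobs": 5}
).encode()
print(json.loads(strip_vllm_params(request)))
```

Returning the original bytes when nothing was removed avoids re-serializing (and possibly reordering or re-escaping) requests that were already acceptable.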
### Schema fixes
OpenClaw (and some vLLM configurations) send tool schemas with `properties: []` instead of `properties: {}`. SGLang requires `properties` to be an object at **every level** of the schema, including nested `items` and sub-objects.
The middleware recursively walks the entire JSON Schema tree and fixes:
- `properties: []` → `properties: {}` (at any depth)
- `required: <non-list>` → removed
- `parameters: <non-object>` → `{"type": "object", "properties": {}}`
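
The three rules can be tried out on a small tool definition. The sketch below is a simplified stand-in for the middleware's fixer, assuming hypothetical helper names `fix_schema` and `fix_tool` and an invented `list_files` tool; it only recurses into `properties` and `items`, not `anyOf`/`allOf`/`oneOf` or `additionalProperties`.

```python
def fix_schema(schema: dict) -> dict:
    """Return a copy with bad `properties`/`required` repaired at every depth."""
    out = dict(schema)
    if "properties" in out and not isinstance(out["properties"], dict):
        out["properties"] = {}          # e.g. properties: [] or null
    if "required" in out and not isinstance(out["required"], list):
        del out["required"]             # required must be a list or absent
    props = out.get("properties")
    if isinstance(props, dict):
        out["properties"] = {
            name: fix_schema(sub) if isinstance(sub, dict) else sub
            for name, sub in props.items()
        }
    if isinstance(out.get("items"), dict):
        out["items"] = fix_schema(out["items"])
    return out


def fix_tool(tool: dict) -> dict:
    """Apply fix_schema to a tool's parameters, replacing non-object parameters."""
    func = dict(tool["function"])
    params = func.get("parameters")
    func["parameters"] = (
        fix_schema(params) if isinstance(params, dict)
        else {"type": "object", "properties": {}}
    )
    return {**tool, "function": func}


tool = {"type": "function", "function": {"name": "list_files", "parameters": {
    "type": "object",
    "properties": {
        "filters": {"type": "array", "items": {"type": "object", "properties": []}},
    },
    "required": None,
}}}
fixed = fix_tool(tool)
print(fixed["function"]["parameters"])
```

The nested `properties: []` under `items` is the case a non-recursive fix would miss.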
## Files
| File | Purpose |
|------|---------|
| `Dockerfile` | Builds on `lmsysorg/sglang-rocm`, installs haproxy, copies shim files |
| `Jenkinsfile` | CI/CD: builds and pushes to Vultr container registry |
| `vllm-shim.sh` | Shell shim — replaces the `vllm` binary, translates args |
| `vllm_shim_module.py` | Python shim — shadows `vllm.*` module imports, translates args |
| `vllm_middleware.py` | FastAPI middleware — strips bad params, fixes tool schemas |
| `README.md` | This file |
## Deploy
```bash ```bash
docker build -t vllm-to-sglang . docker build -t vllm-to-sglang .
``` ```
Then use this image anywhere the vLLM stack expects its server image. Or via Jenkins:
## Making It Work For Other Models ```bash
curl -X POST "https://jenkins.sweetapi.com/job/vllm-to-sglang/buildWithParameters" \
Right now the model config is hardcoded in three places: -d TAG=nightly
```
- `vllm-shim.sh` — the `exec python -m sglang.launch_server` line
- `vllm_shim_module.py` — the `os.execvp()` call
- `Dockerfile` — base image and ROCm-specific patches
To adapt for a different model, change `--model-path`, `--tp`, and `--tool-call-parser` in both shim files. A future pass will make this configurable via env vars or args so you don't have to edit source.
## Files
| File | Purpose |
|---|---|
| `Dockerfile` | Builds the image: ROCm SGLang base + aiter fix + shims + MI300X env |
| `vllm-shim.sh` | Shell shim — replaces the `vllm` binary |
| `vllm_shim_module.py` | Python shim — shadows `vllm.*` module imports |


@@ -1,42 +0,0 @@
#!/bin/bash
set -euo pipefail
# Defaults matching vLLM production stack defaults
HOST="0.0.0.0"
PORT="8000"
# Save original args before parsing eats them
ALL_ARGS="$*"
# Parse only host and port from whatever args the vLLM stack sends.
# Everything else is ignored.
while [[ $# -gt 0 ]]; do
case "$1" in
--host) HOST="$2"; shift 2 ;;
--host=*) HOST="${1#*=}"; shift ;;
--port) PORT="$2"; shift 2 ;;
--port=*) PORT="${1#*=}"; shift ;;
*) shift ;; # ignore everything else
esac
done
echo "=== vLLM production stack args received ==="
echo "Raw args: $ALL_ARGS"
echo ""
i=1
for arg in $ALL_ARGS; do
echo " [$i] $arg"
i=$((i + 1))
done
echo "============================================"
echo ""
echo "=== SGLang shim ==="
echo "Ignoring vLLM args. Launching SGLang on ${HOST}:${PORT}"
echo "==================="
exec python -m sglang.launch_server \
--model-path mistralai/Devstral-2-123B-Instruct-2512 \
--host "$HOST" \
--port "$PORT" \
--tp 8 \
--tool-call-parser mistral


@@ -1,10 +1,21 @@
#!/bin/bash #!/usr/bin/env bash
set -euo pipefail set -euo pipefail
# ============================================================ # ============================================================
# vLLM -> SGLang shim # vLLM -> SGLang shim (shell version)
# This script replaces the vllm binary. The k8s production stack # This script replaces the vllm binary. The k8s production stack
# calls `vllm serve <model> [flags]`, and we intercept everything. # calls `vllm serve <model> [flags]`, and we intercept everything.
#
# Dynamically translates vLLM args to SGLang equivalents.
# No hardcoded model or tensor-parallel size.
#
# Architecture:
# haproxy on the vLLM port (front door)
# /metrics → 200 empty (stub)
# /health → 200 if SGLang backend is up, 503 if not
# /* → proxy to middleware on port+2
# middleware on port+2 (strips vLLM-only params, fixes schemas)
# SGLang on port+1 (internal)
# ============================================================ # ============================================================
echo "" echo ""
@@ -22,28 +33,198 @@ done
echo "==========================================" echo "=========================================="
echo "" echo ""
# Defaults # Log to file
LOG_PATH="${VLLM_SHIM_LOG:-/tmp/vllm-shim.log}"
{
echo "$(date -Iseconds) vLLM -> SGLang Shim (shell)"
echo " Invoked as: vllm $*"
echo " All arguments received:"
i=1
for arg in "$@"; do
echo " [$i] $arg"
i=$((i + 1))
done
echo ""
} >> "$LOG_PATH"
# ── Parse vLLM args → extract model, host, port, translate the rest ──
MODEL=""
HOST="0.0.0.0" HOST="0.0.0.0"
PORT="8000" PORT="8000"
SGLANG_ARGS=()
SKIPPED_ARGS=()
# Default tool-call-parser; override with SGLANG_TOOL_CALL_PARSER env var
TOOL_CALL_PARSER="${SGLANG_TOOL_CALL_PARSER:-mistral}"
# Parse host and port from whatever the stack sends
while [[ $# -gt 0 ]]; do while [[ $# -gt 0 ]]; do
case "$1" in case "$1" in
serve) shift ;; # skip the 'serve' subcommand # Skip 'serve' subcommand
serve) shift ;;
# ── Extracted for infrastructure (not passed to SGLang) ──
--host) HOST="$2"; shift 2 ;; --host) HOST="$2"; shift 2 ;;
--host=*) HOST="${1#*=}"; shift ;; --host=*) HOST="${1#*=}"; shift ;;
--port) PORT="$2"; shift 2 ;; --port) PORT="$2"; shift 2 ;;
--port=*) PORT="${1#*=}"; shift ;; --port=*) PORT="${1#*=}"; shift ;;
*) shift ;; # ignore everything else
# ── Positional model name ──
--model|--model-name)
MODEL="$2"; shift 2 ;;
--model=*|--model-name=*)
MODEL="${1#*=}"; shift ;;
# ── Direct renames (vLLM → SGLang) ──
--tensor-parallel-size)
SGLANG_ARGS+=("--tp" "$2"); shift 2 ;;
--tensor-parallel-size=*)
SGLANG_ARGS+=("--tp" "${1#*=}"); shift ;;
--gpu_memory_utilization)
SGLANG_ARGS+=("--mem-fraction-static" "$2"); shift 2 ;;
--gpu_memory_utilization=*)
SGLANG_ARGS+=("--mem-fraction-static" "${1#*=}"); shift ;;
--trust_remote_code|--trust-remote-code)
SGLANG_ARGS+=("--trust-remote-code"); shift ;;
# ── vLLM flags with no SGLang equivalent → skip ──
--no-enable-prefix-caching|--enable-prefix-caching)
SKIPPED_ARGS+=("$1"); shift ;;
--enable-chunked-prefill|--no-enable-chunked-prefill)
SKIPPED_ARGS+=("$1"); shift ;;
--disable-log-requests|--disable-log-stats)
SKIPPED_ARGS+=("$1"); shift ;;
--swap-space|--block-size|--max-num-seqs|--max-num-batched-tokens)
SKIPPED_ARGS+=("$1" "$2"); shift 2 ;;
--swap-space=*|--block-size=*|--max-num-seqs=*|--max-num-batched-tokens=*)
SKIPPED_ARGS+=("$1"); shift ;;
--distributed-executor-backend|--pipeline-parallel-size|--data-parallel-size)
SKIPPED_ARGS+=("$1" "$2"); shift 2 ;;
--quantization|--dtype|--revision|--tokenizer-revision|--tokenizer-mode)
SKIPPED_ARGS+=("$1" "$2"); shift 2 ;;
--quantization=*|--dtype=*|--revision=*|--tokenizer-revision=*|--tokenizer-mode=*)
SKIPPED_ARGS+=("$1"); shift ;;
# ── Pass through to SGLang as-is ──
--tool-call-parser)
TOOL_CALL_PARSER="$2"; shift 2 ;;
--tool-call-parser=*)
TOOL_CALL_PARSER="${1#*=}"; shift ;;
*)
# Positional arg = model name (first non-flag)
if [[ ! "$1" =~ ^- ]] && [[ -z "$MODEL" ]]; then
MODEL="$1"; shift
else
# Unknown — pass through, might be valid for SGLang
SGLANG_ARGS+=("$1"); shift
fi ;;
esac esac
done done
echo "Launching SGLang on ${HOST}:${PORT}" if [[ -z "$MODEL" ]]; then
echo "ERROR: No model specified in vLLM args!"
exit 1
fi
# ── Port scheme: haproxy=original, SGLang=+1, middleware=+2 ──
SGLANG_PORT=$((PORT + 1))
MIDDLEWARE_PORT=$((PORT + 2))
echo "Model: ${MODEL}"
echo "SGLang: ${HOST}:${SGLANG_PORT}"
echo "Middleware: ${HOST}:${MIDDLEWARE_PORT}"
echo "haproxy: ${HOST}:${PORT}"
if [[ ${#SGLANG_ARGS[@]} -gt 0 ]]; then
echo "Translated args: ${SGLANG_ARGS[*]}"
fi
if [[ ${#SKIPPED_ARGS[@]} -gt 0 ]]; then
echo "Skipped (no SGLang equivalent): ${SKIPPED_ARGS[*]}"
fi
echo "" echo ""
exec python -m sglang.launch_server \ # ── haproxy setup ───────────────────────────────────────────
--model-path mistralai/Devstral-2-123B-Instruct-2512 \
--host "$HOST" \ mkdir -p /tmp/haproxy-errors
--port "$PORT" \ printf "HTTP/1.0 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n" > /tmp/haproxy-errors/200-empty.http
--tp 8 \ printf "HTTP/1.0 503 Service Unavailable\r\nContent-Length: 16\r\nConnection: close\r\nContent-Type: text/plain\r\n\r\nSGLang not ready" > /tmp/haproxy-errors/503-sglang.http
--tool-call-parser mistral
HAPROXY_CFG="/tmp/haproxy-shim.cfg"
cat > "$HAPROXY_CFG" <<EOF
global
maxconn 4096
defaults
mode http
timeout connect 5s
timeout client 300s
timeout server 300s
frontend proxy
bind ${HOST}:${PORT}
acl is_metrics path /metrics
http-request deny deny_status 200 if is_metrics
errorfile 200 /tmp/haproxy-errors/200-empty.http
acl is_health path /health
acl sglang_up nbsrv(sglang) gt 0
http-request deny deny_status 200 if is_health sglang_up
http-request deny deny_status 503 if is_health
errorfile 503 /tmp/haproxy-errors/503-sglang.http
default_backend sglang
backend sglang
option httpchk GET /health
http-check expect status 200
server s1 127.0.0.1:${MIDDLEWARE_PORT} check inter 5s fall 3 rise 2
EOF
# ── Build and launch SGLang ─────────────────────────────────
SGLANG_CMD=(
python -m sglang.launch_server
--model-path "$MODEL"
--host "$HOST"
--port "$SGLANG_PORT"
)
if [[ -n "$TOOL_CALL_PARSER" ]]; then
SGLANG_CMD+=(--tool-call-parser "$TOOL_CALL_PARSER")
fi
SGLANG_CMD+=("${SGLANG_ARGS[@]}")
echo "SGLang command: ${SGLANG_CMD[*]}"
echo ""
{
echo "haproxy config written to ${HAPROXY_CFG}"
echo "Model: ${MODEL}, SGLang port: ${SGLANG_PORT}, middleware port: ${MIDDLEWARE_PORT}, haproxy port: ${PORT}"
echo "SGLang command: ${SGLANG_CMD[*]}"
if [[ ${#SKIPPED_ARGS[@]} -gt 0 ]]; then
echo "Skipped vLLM args: ${SKIPPED_ARGS[*]}"
fi
} >> "$LOG_PATH"
# Launch SGLang
"${SGLANG_CMD[@]}" &
SGLANG_PID=$!
# Launch middleware
SGLANG_HOST="$HOST" SGLANG_PORT="$SGLANG_PORT" MIDDLEWARE_PORT="$MIDDLEWARE_PORT" \
python /opt/vllm-shim/vllm_middleware.py &
MIDDLEWARE_PID=$!
sleep 2
# Launch haproxy (front door on the original port)
haproxy -f "$HAPROXY_CFG" &
HAPROXY_PID=$!
echo "SGLang PID: ${SGLANG_PID}, middleware PID: ${MIDDLEWARE_PID}, haproxy PID: ${HAPROXY_PID}" >> "$LOG_PATH"
# Wait for whichever dies first
wait -n "$SGLANG_PID" "$MIDDLEWARE_PID" "$HAPROXY_PID"
EXIT_CODE=$?
echo "A process exited (code ${EXIT_CODE}), shutting down" >> "$LOG_PATH"
kill "$SGLANG_PID" "$MIDDLEWARE_PID" "$HAPROXY_PID" 2>/dev/null || true
exit $EXIT_CODE
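
The haproxy errorfiles written by this script are raw HTTP responses, so the `Content-Length` header has to match the body byte-for-byte (the commit list notes an earlier off-by-one, 15 instead of 16). A small Python sketch, using a hypothetical `build_errorfile` helper, shows one way to derive the header from the body instead of hand-counting:

```python
def build_errorfile(status: str, body: bytes = b"", content_type: str = "") -> bytes:
    """Assemble a raw HTTP/1.0 response whose Content-Length matches the body."""
    headers = [
        f"HTTP/1.0 {status}",
        f"Content-Length: {len(body)}",
        "Connection: close",
    ]
    if content_type:
        headers.append(f"Content-Type: {content_type}")
    return "\r\n".join(headers).encode() + b"\r\n\r\n" + body


stub = build_errorfile("503 Service Unavailable", b"SGLang not ready", "text/plain")
print(stub.decode())
```

For the body `SGLang not ready` this yields `Content-Length: 16`, matching the `printf` line in the script.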

260
vllm_middleware.py Normal file

@@ -0,0 +1,260 @@
"""
vLLM → SGLang request middleware.
Sits between haproxy and SGLang to strip vLLM-only parameters
that cause SGLang to return 422/400 errors.
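Currently strips: logprobs, top_logprobs, chat_template_kwargs, guided_json,
guided_regex (see STRIP_PARAMS), and repairs malformed tool JSON schemas.
(SGLang's Mistral tool-call parser rejects logprobs/top_logprobs; vLLM accepts them.)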
Architecture:
haproxy (port N) → middleware (port N+2) → SGLang (port N+1)
haproxy still handles /metrics stub and /health instant responses.
This middleware only touches the proxied request bodies.
"""
import json
import os
import asyncio
from contextlib import asynccontextmanager

import httpx
from datetime import datetime
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse, Response
import uvicorn

SGLANG_HOST = os.environ.get("SGLANG_HOST", "127.0.0.1")
SGLANG_PORT = int(os.environ.get("SGLANG_PORT", "8001"))
LISTEN_PORT = int(os.environ.get("MIDDLEWARE_PORT", "8002"))

# Params that vLLM accepts but SGLang rejects.
# Extend this set as more incompatibilities are discovered.
STRIP_PARAMS = {"logprobs", "top_logprobs", "chat_template_kwargs", "guided_json", "guided_regex"}

client: httpx.AsyncClient | None = None
_sglang_ready = False


# FastAPI expects the lifespan argument to be an async context manager,
# so wrap the generator explicitly rather than passing it bare.
@asynccontextmanager
async def _lifespan(app_instance):
    global client
    client = httpx.AsyncClient(
        timeout=httpx.Timeout(300.0, connect=10.0),
    )
    # Background task: wait for SGLang to become available
    asyncio.create_task(_wait_for_sglang())
    yield
    await client.aclose()


async def _wait_for_sglang():
    """Poll SGLang until it's accepting connections, then mark ready."""
    global _sglang_ready
    while True:
        try:
            resp = await client.get(
                f"http://{SGLANG_HOST}:{SGLANG_PORT}/health",
                timeout=httpx.Timeout(5.0, connect=2.0),
            )
            if resp.status_code == 200:
                _sglang_ready = True
                print(f"Middleware: SGLang is ready at {SGLANG_HOST}:{SGLANG_PORT}")
                return
        except (httpx.ConnectError, httpx.TimeoutException):
            pass
        await asyncio.sleep(2)

app = FastAPI(lifespan=_lifespan)

@app.get("/health")
async def health():
    """Health check — haproxy polls this. Returns 200 only if SGLang is up."""
    global _sglang_ready
    if not _sglang_ready:
        return Response(content="SGLang not ready", status_code=503)
    try:
        resp = await client.get(
            f"http://{SGLANG_HOST}:{SGLANG_PORT}/health",
            timeout=httpx.Timeout(5.0, connect=2.0),
        )
        return Response(content=resp.content, status_code=resp.status_code,
                        media_type=resp.headers.get("content-type"))
    except (httpx.ConnectError, httpx.TimeoutException):
        _sglang_ready = False
        # Re-trigger background wait
        asyncio.create_task(_wait_for_sglang())
        return Response(content="SGLang not ready", status_code=503)

ERROR_LOG = os.environ.get("VLLM_SHIM_LOG", "/tmp/vllm-shim.log")

def _fix_schema(schema: dict) -> bool:
    """Recursively fix a JSON Schema dict: properties must be object, required must be list of strings."""
    fixed = False

    # Fix 'properties' — must be dict, not array/null
    if "properties" in schema and not isinstance(schema["properties"], dict):
        schema["properties"] = {}
        fixed = True

    # Fix 'required' — must be list of strings or absent
    if "required" in schema and not isinstance(schema["required"], list):
        del schema["required"]
        fixed = True

    # Recurse into every property value
    if isinstance(schema.get("properties"), dict):
        for val in schema["properties"].values():
            if isinstance(val, dict):
                if _fix_schema(val):
                    fixed = True

    # Recurse into items (for array-of-objects)
    if isinstance(schema.get("items"), dict):
        if _fix_schema(schema["items"]):
            fixed = True

    # Recurse into anyOf, allOf, oneOf
    for key in ("anyOf", "allOf", "oneOf"):
        if isinstance(schema.get(key), list):
            for item in schema[key]:
                if isinstance(item, dict):
                    if _fix_schema(item):
                        fixed = True

    # Recurse into additionalProperties if it's a schema
    if isinstance(schema.get("additionalProperties"), dict):
        if _fix_schema(schema["additionalProperties"]):
            fixed = True

    return fixed


def _dump_error(request_body: bytes, status_code: int, resp_headers: dict, resp_body_raw: bytes, path: str = ""):
    """Log full request + response payload when SGLang returns an error (4xx/5xx)."""
    try:
        ts = datetime.now().isoformat()
        req_json = None
        try:
            req_json = json.loads(request_body)
        except (json.JSONDecodeError, UnicodeDecodeError):
            pass
        resp_text = resp_body_raw.decode("utf-8", errors="replace")[:4000]
        resp_json = None
        try:
            resp_json = json.loads(resp_text)
        except (json.JSONDecodeError, UnicodeDecodeError):
            pass
        with open(ERROR_LOG, "a") as f:
            f.write(f"\n{'='*60}\n")
            f.write(f"[{ts}] ERROR DUMP — SGLang returned HTTP {status_code}\n")
            f.write(f"Path: {path}\n")
            f.write(f"--- Request Body ---\n")
            if req_json:
                f.write(json.dumps(req_json, indent=2, ensure_ascii=False)[:8000])
            else:
                f.write(request_body.decode("utf-8", errors="replace")[:8000])
            f.write(f"\n--- Response (HTTP {status_code}) ---\n")
            if resp_json:
                f.write(json.dumps(resp_json, indent=2, ensure_ascii=False)[:4000])
            else:
                f.write(resp_text)
            f.write(f"\n{'='*60}\n")
        print(f"[{ts}] ERROR DUMP: HTTP {status_code} on {path} — full payload written to {ERROR_LOG}")
    except Exception as e:
        print(f"_dump_error failed: {e}")


@app.api_route("/{path:path}", methods=["GET", "POST", "PUT", "DELETE", "PATCH", "OPTIONS"])
async def proxy(path: str, request: Request):
    body = await request.body()
    is_streaming = False

    # Strip incompatible params from chat completion POST requests
    if request.method == "POST" and "chat/completions" in path and body:
        try:
            data = json.loads(body)
            is_streaming = data.get("stream", False)
            stripped_any = False
            for key in STRIP_PARAMS:
                if key in data:
                    del data[key]
                    stripped_any = True

            # Fix tool function parameters: recurse to fix ALL bad properties/required
            tools = data.get("tools")
            if isinstance(tools, list):
                for tool in tools:
                    func = tool.get("function") if isinstance(tool, dict) else None
                    if not isinstance(func, dict):
                        continue
                    if not isinstance(func.get("parameters"), dict):
                        func["parameters"] = {"type": "object", "properties": {}}
                        stripped_any = True
                    if _fix_schema(func["parameters"]):
                        stripped_any = True

            if stripped_any:
                body = json.dumps(data).encode()
        except (json.JSONDecodeError, UnicodeDecodeError):
            pass

    # Forward headers (skip hop-by-hop and ones we're replacing)
    fwd_headers = {
        k: v for k, v in request.headers.items()
        if k.lower() not in ("host", "content-length", "transfer-encoding")
    }
    fwd_headers["content-length"] = str(len(body))

    url = f"http://{SGLANG_HOST}:{SGLANG_PORT}/{path}"
    if request.query_params:
        url += f"?{request.query_params}"

    try:
        if is_streaming:
            req = client.build_request(request.method, url, content=body, headers=fwd_headers)
            resp = await client.send(req, stream=True)

            # Dump on error for streaming responses
            if resp.status_code >= 400:
                error_body = await resp.aread()
                _dump_error(body, resp.status_code, resp_headers=dict(resp.headers), resp_body_raw=error_body, path=path)
                await resp.aclose()
                return Response(
                    content=error_body,
                    status_code=resp.status_code,
                    media_type=resp.headers.get("content-type"),
                )

            async def stream_body():
                try:
                    async for chunk in resp.aiter_bytes():
                        yield chunk
                finally:
                    await resp.aclose()

            return StreamingResponse(
                stream_body(),
                status_code=resp.status_code,
                headers={"content-type": resp.headers.get("content-type", "text/event-stream")},
            )
        else:
            resp = await client.request(request.method, url, content=body, headers=fwd_headers)

            # Dump on error
            if resp.status_code >= 400:
                _dump_error(body, resp.status_code, resp_headers=dict(resp.headers), resp_body_raw=resp.content, path=path)

            return Response(
                content=resp.content,
                status_code=resp.status_code,
                media_type=resp.headers.get("content-type"),
            )
    except (httpx.ConnectError, httpx.TimeoutException) as e:
        return Response(
            content=json.dumps({"error": {"message": f"SGLang backend unavailable: {e}", "type": "backend_error"}}),
            status_code=503,
            media_type="application/json",
        )


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=LISTEN_PORT, log_level="warning")
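
One detail of the proxy handler above is easy to overlook: because the body may shrink after parameters are stripped, the forwarded `content-length` has to be recomputed rather than copied. A minimal sketch of that header handling, using a hypothetical `forward_headers` helper and plain dicts in place of the framework's header objects:

```python
def forward_headers(headers: dict, body: bytes) -> dict:
    """Copy request headers, dropping those the proxy must set itself."""
    drop = {"host", "content-length", "transfer-encoding"}
    fwd = {k: v for k, v in headers.items() if k.lower() not in drop}
    # The body may have been rewritten, so the length comes from the new bytes.
    fwd["content-length"] = str(len(body))
    return fwd


fwd = forward_headers(
    {"Host": "x", "Content-Length": "99", "Authorization": "Bearer t"},
    b'{"a": 1}',
)
print(fwd)
```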


@@ -1,15 +1,195 @@
#!/usr/bin/env python3
""" """
vLLM -> SGLang Python shim. vLLM -> SGLang Python shim.
Catches `python -m vllm.entrypoints.openai.api_server` (and similar) Catches `python -m vllm.entrypoints.openai.api_server` (and similar)
and launches SGLang instead. and launches SGLang behind haproxy + middleware instead.
Dynamically translates vLLM CLI args to SGLang equivalents.
No hardcoded model name or tensor-parallel size.
Architecture:
haproxy on the vLLM port (front door)
/metrics → 200 empty (stub)
/health → 200 if SGLang backend is up, 503 if not (instant)
/* → proxy to middleware on port+2
middleware on port+2 (strips vLLM-only params, fixes tool schemas)
SGLang on port+1 (internal)
""" """
import os import os
import sys import sys
import subprocess import subprocess
import time
import datetime
# ── vLLM → SGLang argument mapping ──────────────────────────
# Key = vLLM flag, value = (sglang_flag, has_value)
# has_value=True means the flag takes an argument (e.g. --port 8000)
# has_value=False means it's a boolean flag (e.g. --no-enable-prefix-caching)
ARG_MAP = {
# Direct renames (vLLM name → SGLang name)
"--tensor-parallel-size": ("--tp", True),
"--gpu_memory_utilization": ("--mem-fraction-static", True),
"--max_model_len": ("--context-length", True), # context window, not request concurrency
"--max-model-len": ("--context-length", True), # kebab variant
"--enforce_eager": (None, False), # --enable-torch-compile has the opposite intent, so skip
"--trust_remote_code": ("--trust-remote-code", False),
"--trust-remote-code": ("--trust-remote-code", False),
# vLLM flags with no SGLang equivalent → skip
"--no-enable-prefix-caching": (None, False),
"--enable-prefix-caching": (None, False),
"--enable-chunked-prefill": (None, False),
"--no-enable-chunked-prefill":(None, False),
"--disable-log-requests": (None, False),
"--disable-log-stats": (None, False),
"--swap-space": (None, True),
"--block-size": (None, True),
"--num-gpu-blocks-override": (None, True),
"--num-cpu-blocks-override": (None, True),
"--max-num-seqs": (None, True),
"--max-num-batched-tokens": (None, True),
"--distributed-executor-backend": (None, True),
"--pipeline-parallel-size": (None, True),
"--data-parallel-size": (None, True),
"--revision": (None, True),
"--code-revision": (None, True),
"--tokenizer-revision": (None, True),
"--tokenizer-mode": (None, True),
"--quantization": (None, True),
"--dtype": (None, True),
"--max-seq-len-to-capture": (None, True),
"--enable-lora": (None, False),
"--max-lora-rank": (None, True),
"--max-cpu-loras": (None, True),
"--lora-dtype": (None, True),
"--enable-prompt-adapter": (None, False),
"--scheduler-delay-factor": (None, True),
"--enable-multi-modal": (None, False),
"--limit-mm-per-prompt": (None, True),
}
# Default tool-call-parser; override with SGLANG_TOOL_CALL_PARSER env var
DEFAULT_TOOL_CALL_PARSER = "qwen3_coder"


def parse_vllm_args(args):
    """
    Parse vLLM CLI args and extract model, host, port,
    plus any args we should translate to SGLang.
    Returns (model, host, port, sglang_extra_args, skipped_args).
    """
    model = None
    host = "0.0.0.0"
    port = "8000"
    sglang_extra = []  # translated args for SGLang
    skipped = []  # vLLM args we're ignoring

    i = 0
    while i < len(args):
        arg = args[i]

        # 'serve' subcommand: skip
        if arg == "serve":
            i += 1
            continue

        # Positional model argument (first non-flag after serve, or standalone)
        if not arg.startswith("-") and model is None:
            model = arg
            i += 1
            continue

        # --flag=value form
        if "=" in arg and arg.startswith("--"):
            flag, val = arg.split("=", 1)
            if flag == "--host":
                host = val
            elif flag == "--port":
                port = val
            elif flag in ARG_MAP:
                sglang_flag, has_val = ARG_MAP[flag]
                if sglang_flag is None:
                    skipped.append(arg)
                elif has_val:
                    sglang_extra.extend([sglang_flag, val])
                else:
                    # boolean flag with =value (unusual but valid)
                    sglang_extra.append(sglang_flag)
            else:
                # Unknown flag: pass through as-is (might be a SGLang flag too)
                sglang_extra.append(arg)
            i += 1
            continue

        # --flag value form
        if arg in ("--host",):
            if i + 1 < len(args):
                host = args[i + 1]
            i += 2
            continue
        if arg in ("--port",):
            if i + 1 < len(args):
                port = args[i + 1]
            i += 2
            continue

        if arg in ARG_MAP:
            sglang_flag, has_val = ARG_MAP[arg]
            if sglang_flag is None:
                skipped.append(arg)
                if has_val and i + 1 < len(args) and not args[i + 1].startswith("-"):
                    skipped.append(args[i + 1])
                    i += 2
                else:
                    i += 1
            elif has_val:
                if i + 1 < len(args):
                    sglang_extra.extend([sglang_flag, args[i + 1]])
                    i += 2
                else:
                    i += 1
            else:
                sglang_extra.append(sglang_flag)
                i += 1
            continue

        # --tool-call-parser: pass through to SGLang
        if arg == "--tool-call-parser":
            if i + 1 < len(args):
                sglang_extra.extend(["--tool-call-parser", args[i + 1]])
                i += 2
            else:
                i += 1
            continue

        # Unknown flag: pass through if it takes a value, might be valid for SGLang
        if arg.startswith("--") and i + 1 < len(args) and not args[i + 1].startswith("-"):
            sglang_extra.extend([arg, args[i + 1]])
            i += 2
        elif arg.startswith("--"):
            sglang_extra.append(arg)
            i += 1
        else:
            # Unknown positional: probably the model if we don't have it yet
            if model is None:
                model = arg
            i += 1

    return model, host, port, sglang_extra, skipped
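
Review note: the table-driven translation in `parse_vllm_args` can be exercised in isolation. The snippet below is a simplified stand-in, not the shim's code. `DEMO_ARG_MAP` and `translate` are hypothetical names, the three mappings are illustrative assumptions rather than entries confirmed by this diff, and the model, host, and port handling is left out.

```python
# Hypothetical table in the same (sglang_flag, has_value) shape as ARG_MAP.
# The concrete mappings here are assumptions made for illustration only.
DEMO_ARG_MAP = {
    "--tensor-parallel-size": ("--tp", True),               # renamed, takes a value
    "--trust-remote-code": ("--trust-remote-code", False),  # boolean flag
    "--enable-lora": (None, False),                         # no equivalent: skipped
}


def translate(args, arg_map=DEMO_ARG_MAP):
    """Split vLLM-style args into (translated, skipped) lists using arg_map."""
    translated, skipped = [], []
    i = 0
    while i < len(args):
        raw = args[i]
        flag, eq, inline_val = raw.partition("=")
        i += 1
        if flag not in arg_map:
            translated.append(raw)  # unknown: pass through unchanged
            continue
        sglang_flag, has_val = arg_map[flag]
        val = None
        if has_val:
            if eq:
                val = inline_val      # --flag=value form
            elif i < len(args):
                val = args[i]         # --flag value form
                i += 1
        if sglang_flag is None:
            skipped.append(flag if val is None else f"{flag}={val}")
        elif val is not None:
            translated.extend([sglang_flag, val])
        else:
            translated.append(sglang_flag)
    return translated, skipped
```

Calling `translate(["--tensor-parallel-size", "8", "--enable-lora", "--trust-remote-code"])` yields the translated list `["--tp", "8", "--trust-remote-code"]` and the skipped list `["--enable-lora"]`. A table like this keeps the per-flag policy (rename, forward, or drop) in data, so adding a flag should not require touching the loop.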


def main():
    args = sys.argv[1:]

    log_path = os.environ.get("VLLM_SHIM_LOG", "/tmp/vllm-shim.log")
    with open(log_path, "a") as f:
        f.write(f"\n{datetime.datetime.now().isoformat()} vLLM -> SGLang Shim (Python module)\n")
        f.write(f" Invoked as: python -m {__name__} {' '.join(args)}\n")
        f.write(" All arguments received:\n")
        for i, arg in enumerate(args, 1):
            f.write(f" [{i}] {arg}\n")
        f.write("\n")

    print()
    print("==========================================")
    print(" vLLM -> SGLang Shim (Python module)")
@@ -22,43 +202,137 @@ def main():
    print("==========================================")
    print()
-    host = "0.0.0.0"
-    port = "8000"
-    i = 0
-    while i < len(args):
-        if args[i] == "--host" and i + 1 < len(args):
-            host = args[i + 1]
-            i += 2
-        elif args[i].startswith("--host="):
-            host = args[i].split("=", 1)[1]
-            i += 1
-        elif args[i] == "--port" and i + 1 < len(args):
-            port = args[i + 1]
-            i += 2
-        elif args[i].startswith("--port="):
-            port = args[i].split("=", 1)[1]
-            i += 1
-        else:
-            i += 1
-    print(f"Launching SGLang on {host}:{port}")
    model, host, port, sglang_extra, skipped = parse_vllm_args(args)

    if not model:
        print("ERROR: No model specified in vLLM args!")
        os._exit(1)

    # SGLang port scheme: original+1 = SGLang, original+2 = middleware
    sglang_port = str(int(port) + 1)
    middleware_port = str(int(port) + 2)

    # Build SGLang command
    sglang_cmd = [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", model,
        "--host", host,
        "--port", sglang_port,
    ]

    # Add tool-call-parser (env override or default)
    tcp = os.environ.get("SGLANG_TOOL_CALL_PARSER", DEFAULT_TOOL_CALL_PARSER)
    if tcp:
        sglang_cmd.extend(["--tool-call-parser", tcp])

    # Add translated/forwarded args
    sglang_cmd.extend(sglang_extra)

    print(f"Model: {model}")
    print(f"SGLang host: {host}:{sglang_port}")
    print(f"Middleware: {host}:{middleware_port}")
    print(f"haproxy: {host}:{port}")
    if sglang_extra:
        print(f"Translated args: {' '.join(sglang_extra)}")
    if skipped:
        print(f"Skipped (no SGLang equivalent): {' '.join(skipped)}")
    print()
    print(f"SGLang command: {' '.join(sglang_cmd)}")
    print()
-    os.execvp(
-        sys.executable,
-        [
-            sys.executable, "-m", "sglang.launch_server",
-            "--model-path", "mistralai/Devstral-2-123B-Instruct-2512",
-            "--host", host,
-            "--port", port,
-            "--tp", "8",
-            "--tool-call-parser", "mistral",
-        ],
-    )
    # ── haproxy setup ────────────────────────────────────────
    os.makedirs("/tmp/haproxy-errors", exist_ok=True)
    with open("/tmp/haproxy-errors/200-empty.http", "w") as f:
        f.write("HTTP/1.0 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n")
    with open("/tmp/haproxy-errors/503-sglang.http", "w") as f:
        f.write("HTTP/1.0 503 Service Unavailable\r\nContent-Length: 16\r\nConnection: close\r\nContent-Type: text/plain\r\n\r\nSGLang not ready")

    haproxy_cfg = "/tmp/haproxy-shim.cfg"
    with open(haproxy_cfg, "w") as f:
        f.write(f"""global
    maxconn 4096

defaults
    mode http
    timeout connect 5s
    timeout client 300s
    timeout server 300s

frontend proxy
    bind {host}:{port}

    # /metrics stub: instant 200 empty (vLLM stack expects this)
    acl is_metrics path /metrics
    http-request deny deny_status 200 if is_metrics
    errorfile 200 /tmp/haproxy-errors/200-empty.http

    # /health: instant response based on SGLang backend state
    acl is_health path /health
    acl sglang_up nbsrv(sglang) gt 0
    http-request deny deny_status 200 if is_health sglang_up
    http-request deny deny_status 503 if is_health
    errorfile 503 /tmp/haproxy-errors/503-sglang.http

    default_backend sglang

backend sglang
    option httpchk GET /health
    http-check expect status 200
    server s1 127.0.0.1:{middleware_port} check inter 5s fall 3 rise 2
""")

    with open(log_path, "a") as f:
        f.write(f"haproxy config written to {haproxy_cfg}\n")
        f.write(f"Model: {model}, SGLang port: {sglang_port}, middleware port: {middleware_port}, haproxy port: {port}\n")
        f.write(f"SGLang command: {' '.join(sglang_cmd)}\n")
        if skipped:
            f.write(f"Skipped vLLM args: {' '.join(skipped)}\n")

    # ── Launch processes ─────────────────────────────────────
    sglang_proc = subprocess.Popen(sglang_cmd)

    middleware_env = os.environ.copy()
    middleware_env["SGLANG_HOST"] = host
    middleware_env["SGLANG_PORT"] = sglang_port
    middleware_env["MIDDLEWARE_PORT"] = middleware_port
    middleware_proc = subprocess.Popen(
        [sys.executable, "/opt/vllm-shim/vllm_middleware.py"],
        env=middleware_env,
    )

    time.sleep(2)
    haproxy_proc = subprocess.Popen(["haproxy", "-f", haproxy_cfg])

    with open(log_path, "a") as f:
        f.write(f"SGLang PID: {sglang_proc.pid}, middleware PID: {middleware_proc.pid}, haproxy PID: {haproxy_proc.pid}\n")

    # Wait for whichever dies first
    while True:
        sglang_ret = sglang_proc.poll()
        middleware_ret = middleware_proc.poll()
        haproxy_ret = haproxy_proc.poll()
        if sglang_ret is not None:
            print(f"SGLang exited (code {sglang_ret}), shutting down")
            middleware_proc.terminate()
            haproxy_proc.terminate()
            os._exit(sglang_ret)
        if middleware_ret is not None:
            print(f"Middleware exited (code {middleware_ret}), shutting down")
            sglang_proc.terminate()
            haproxy_proc.terminate()
            os._exit(middleware_ret)
        if haproxy_ret is not None:
            print(f"haproxy exited (code {haproxy_ret}), shutting down")
            sglang_proc.terminate()
            middleware_proc.terminate()
            os._exit(haproxy_ret)
        time.sleep(1)


if __name__ == "__main__":
    main()

# Also run if imported as a module (some invocation paths just import the file)
main()
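
Review note: the "wait for whichever dies first" loop in `main()` repeats the same branch for each of the three children. A possible generalization is sketched below. `supervise` and `demo` are hypothetical names that do not appear in this diff, and the children are throwaway Python one-liners standing in for SGLang, the middleware, and haproxy.

```python
import subprocess
import sys
import time


def supervise(procs, poll_interval=0.1):
    """Block until any child exits, stop the others, return (name, returncode)."""
    while True:
        for name, proc in procs.items():
            ret = proc.poll()
            if ret is None:
                continue  # still running
            # First child to exit: take the remaining ones down with it.
            for other_name, other in procs.items():
                if other_name != name and other.poll() is None:
                    other.terminate()
                    try:
                        other.wait(timeout=5)
                    except subprocess.TimeoutExpired:
                        other.kill()  # escalate if terminate() was not enough
                        other.wait()
            return name, ret
        time.sleep(poll_interval)


def demo():
    """Run one short-lived and one long-lived child under supervise()."""
    procs = {
        "short": subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"]),
        "long": subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"]),
    }
    name, ret = supervise(procs)
    return name, ret, procs["long"].poll() is not None
```

Unlike the loop in this diff, the sketch waits for the terminated children before returning and leaves the exit call to the caller, for example `os._exit(ret)` once `supervise` returns. Whether that extra wait is wanted in the shim is a judgment call for the author.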