dynamic arg translation, remove entrypoint.sh, update README

2026-04-12 21:23:26 +00:00
parent 7c1ed0408b
commit efc9dc33e7
4 changed files with 417 additions and 211 deletions

README.md

@@ -1,104 +1,115 @@
# vllm-to-sglang

Drop-in replacement that makes a vLLM production stack (e.g. the [k8s operator](https://github.com/vllm-project/production-stack)) actually run [SGLang](https://github.com/sgl-project/sglang) instead.

## How it works

The k8s vLLM production stack calls `vllm serve <model> [flags]`. This project intercepts that call and instead launches SGLang behind haproxy + a middleware layer.
```
k8s vLLM stack
  │ vllm serve mistralai/Devstral-2-123B-Instruct-2512 \
  │   --host 0.0.0.0 --port 8000 --tensor-parallel-size 8 ...
  ▼
┌─────────────────────────────────────────────────────────┐
│ vllm-shim.sh (replaces the `vllm` binary)               │
│ or vllm_shim_module.py (shadows python -m vllm.*)       │
│ Parses vLLM args, translates to SGLang equivalents,     │
│ then launches three processes:                          │
│                                                         │
│  ┌──────────────────────────────────────────────────┐   │
│  │ haproxy :8000 (front door)                       │   │
│  │   /metrics → 200 empty (stub)                    │   │
│  │   /health  → 200/503 based on backend state      │   │
│  │   /*       → proxy to middleware :8002           │   │
│  └──────────────────────────────────────────────────┘   │
│                        │                                │
│                        ▼                                │
│  ┌──────────────────────────────────────────────────┐   │
│  │ middleware :8002 (FastAPI)                       │   │
│  │   Strips vLLM-only params from request bodies    │   │
│  │   Recursively fixes tool JSON schemas            │   │
│  │   Forwards to SGLang :8001                       │   │
│  └──────────────────────────────────────────────────┘   │
│                        │                                │
│                        ▼                                │
│  ┌──────────────────────────────────────────────────┐   │
│  │ SGLang :8001 (internal)                          │   │
│  │ The actual inference server                      │   │
│  └──────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘
```

Why haproxy answers `/health` itself: SGLang's `/health` endpoint takes ~1.001s to respond, which races the 1s k8s probe timeout and causes repeated `Startup probe failed: context deadline exceeded`. haproxy health-checks SGLang in the background (every 5s, with a 3s timeout) and answers probes instantly — 200 if the backend is up, 503 if not. The `/metrics` stub exists because the vLLM stack expects a Prometheus endpoint that SGLang doesn't serve.
haproxy 2.4 compat note: the stubs use `errorfile` + `http-request deny deny_status` (the `http-request return` payload syntax requires haproxy 2.8+).

## Argument translation

The shim dynamically translates vLLM CLI args to SGLang equivalents — no hardcoded model names or tensor-parallel sizes.

| vLLM flag | SGLang equivalent | Notes |
|-----------|-------------------|-------|
| `serve` | *(skipped)* | Subcommand only |
| `<model>` (positional) | `--model-path <model>` | |
| `--host` | Used for all three processes | |
| `--port` | haproxy binds this port | SGLang gets +1, middleware +2 |
| `--tensor-parallel-size` | `--tp` | |
| `--gpu_memory_utilization` | `--mem-fraction-static` | |
| `--trust-remote-code` | `--trust-remote-code` | |
| `--no-enable-prefix-caching` | *(skipped)* | No SGLang equivalent |
| `--enable-chunked-prefill` | *(skipped)* | No SGLang equivalent |
| `--tool-call-parser` | `--tool-call-parser` | Defaults to `mistral` |

Unknown flags are passed through as-is — they may be valid SGLang args. All received args are logged to `/tmp/vllm-shim.log` (configurable via the `VLLM_SHIM_LOG` env var).
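As an illustration of the table above, the translation step behaves roughly like this stripped-down sketch (hypothetical helper names; the real logic lives in the shim files, and `--host`/`--port` are extracted separately for the proxy layers):

```python
# Minimal sketch of the vLLM → SGLang flag translation described above.
RENAMES = {"--tensor-parallel-size": "--tp",
           "--gpu_memory_utilization": "--mem-fraction-static"}
SKIP = {"--no-enable-prefix-caching", "--enable-chunked-prefill"}

def translate(argv):
    """Return the SGLang argument list for a vLLM `serve` invocation."""
    out, model, i = [], None, 0
    while i < len(argv):
        a = argv[i]
        if a == "serve":                       # subcommand only, drop it
            i += 1
        elif not a.startswith("-") and model is None:
            model = a; i += 1                  # first positional = model
        elif a in RENAMES:
            out += [RENAMES[a], argv[i + 1]]; i += 2
        elif a in SKIP:
            i += 1                             # no SGLang equivalent
        else:
            out.append(a); i += 1              # unknown: pass through as-is
    return ["--model-path", model] + out

print(translate(["serve", "org/model", "--tensor-parallel-size", "8",
                 "--no-enable-prefix-caching", "--trust-remote-code"]))
# → ['--model-path', 'org/model', '--tp', '8', '--trust-remote-code']
```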
### Environment variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `SGLANG_TOOL_CALL_PARSER` | `mistral` | Override the tool-call parser |
| `VLLM_SHIM_LOG` | `/tmp/vllm-shim.log` | Log file path |
## Middleware: request body fixes
SGLang rejects certain parameters and schemas that vLLM (and OpenClaw) send. The middleware fixes these automatically:
### Stripped parameters
These vLLM-only parameters are removed from request bodies before forwarding to SGLang:
- `logprobs` / `top_logprobs` — SGLang's Mistral tool-call parser rejects these
- `chat_template_kwargs` — OpenClaw sends this for reasoning models; SGLang doesn't support it
- `guided_json` / `guided_regex` — vLLM-only guided decoding params
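The stripping step itself is simple — a minimal sketch (illustrative only; the real set is `STRIP_PARAMS` in `vllm_middleware.py`):

```python
# Parameters SGLang rejects; remove them before forwarding (names from the list above).
STRIP_PARAMS = {"logprobs", "top_logprobs", "chat_template_kwargs",
                "guided_json", "guided_regex"}

def strip_vllm_params(body: dict) -> dict:
    """Return a copy of a /v1/chat/completions body without vLLM-only keys."""
    return {k: v for k, v in body.items() if k not in STRIP_PARAMS}

cleaned = strip_vllm_params({"model": "m", "messages": [], "logprobs": True})
# "logprobs" is gone; every other key is untouched
```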
### Schema fixes
OpenClaw (and some vLLM configurations) send tool schemas with `properties: []` instead of `properties: {}`. SGLang requires `properties` to be an object at **every level** of the schema, including nested `items` and sub-objects.
The middleware recursively walks the entire JSON Schema tree and fixes:
- `properties: []` → `properties: {}` (at any depth)
- `required: <non-list>` → removed
- `parameters: <non-object>` → `{"type": "object", "properties": {}}`
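The recursive walk can be sketched like this (an illustration of the three fixes above, not the middleware's exact code):

```python
def fix_schema(node):
    """Recursively normalize a JSON Schema fragment in place (sketch)."""
    if not isinstance(node, dict):
        return node
    # properties: [] → properties: {} (at any depth)
    if node.get("properties") == []:
        node["properties"] = {}
    # required must be a list; otherwise drop it
    if "required" in node and not isinstance(node["required"], list):
        del node["required"]
    # parameters must be an object
    if "parameters" in node and not isinstance(node["parameters"], dict):
        node["parameters"] = {"type": "object", "properties": {}}
    # descend into nested objects and arrays (items, sub-objects, anyOf, ...)
    for value in node.values():
        if isinstance(value, dict):
            fix_schema(value)
        elif isinstance(value, list):
            for item in value:
                fix_schema(item)
    return node

broken = {"properties": [], "required": "name"}
fix_schema(broken)  # broken is now {"properties": {}}
```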
## Files
| File | Purpose |
|------|---------|
| `Dockerfile` | Builds on `lmsysorg/sglang-rocm`, installs haproxy, copies shim files |
| `Jenkinsfile` | CI/CD: builds and pushes to Vultr container registry |
| `vllm-shim.sh` | Shell shim — replaces the `vllm` binary, translates args |
| `vllm_shim_module.py` | Python shim — shadows `vllm.*` module imports, translates args |
| `vllm_middleware.py` | FastAPI middleware — strips bad params, fixes tool schemas |
| `README.md` | This file |
## Deploy

```bash
docker build -t vllm-to-sglang .
```

Or via Jenkins:

```bash
curl -X POST "https://jenkins.sweetapi.com/job/vllm-to-sglang/buildWithParameters" \
  -d TAG=nightly
```

Then use this image anywhere the vLLM stack expects its server image.

entrypoint.sh (deleted)

@@ -1,42 +0,0 @@
#!/bin/bash
set -euo pipefail
# Defaults matching vLLM production stack defaults
HOST="0.0.0.0"
PORT="8000"
# Save original args before parsing eats them
ALL_ARGS="$*"
# Parse only host and port from whatever args the vLLM stack sends.
# Everything else is ignored.
while [[ $# -gt 0 ]]; do
  case "$1" in
    --host) HOST="$2"; shift 2 ;;
    --host=*) HOST="${1#*=}"; shift ;;
    --port) PORT="$2"; shift 2 ;;
    --port=*) PORT="${1#*=}"; shift ;;
    *) shift ;;  # ignore everything else
  esac
done
echo "=== vLLM production stack args received ==="
echo "Raw args: $ALL_ARGS"
echo ""
i=1
for arg in $ALL_ARGS; do
  echo "  [$i] $arg"
  i=$((i + 1))
done
echo "============================================"
echo ""
echo "=== SGLang shim ==="
echo "Ignoring vLLM args. Launching SGLang on ${HOST}:${PORT}"
echo "==================="
exec python -m sglang.launch_server \
  --model-path mistralai/Devstral-2-123B-Instruct-2512 \
  --host "$HOST" \
  --port "$PORT" \
  --tp 8 \
  --tool-call-parser mistral

vllm-shim.sh

@@ -1,20 +1,21 @@
#!/usr/bin/env bash
set -euo pipefail
# ============================================================
# vLLM -> SGLang shim (shell version)
# This script replaces the vllm binary. The k8s production stack
# calls `vllm serve <model> [flags]`, and we intercept everything.
#
# Dynamically translates vLLM args to SGLang equivalents.
# No hardcoded model or tensor-parallel size.
#
# Architecture:
#   haproxy on the vLLM port (front door)
#     /metrics → 200 empty (stub)
#     /health  → 200 if SGLang backend is up, 503 if not
#     /*       → proxy to middleware on port+2
#   middleware on port+2 (strips vLLM-only params, fixes schemas)
#   SGLang on port+1 (internal)
# ============================================================
echo ""
@@ -46,39 +47,107 @@ LOG_PATH="${VLLM_SHIM_LOG:-/tmp/vllm-shim.log}"
  echo ""
} >> "$LOG_PATH"

# ── Parse vLLM args → extract model, host, port, translate the rest ──
MODEL=""
HOST="0.0.0.0"
PORT="8000"
SGLANG_ARGS=()
SKIPPED_ARGS=()
# Default tool-call-parser; override with SGLANG_TOOL_CALL_PARSER env var
TOOL_CALL_PARSER="${SGLANG_TOOL_CALL_PARSER:-mistral}"

while [[ $# -gt 0 ]]; do
  case "$1" in
    # Skip 'serve' subcommand
    serve) shift ;;

    # ── Extracted for infrastructure (not passed to SGLang) ──
    --host) HOST="$2"; shift 2 ;;
    --host=*) HOST="${1#*=}"; shift ;;
    --port) PORT="$2"; shift 2 ;;
    --port=*) PORT="${1#*=}"; shift ;;

    # ── Model via flag (positional form handled below) ──
    --model|--model-name)
      MODEL="$2"; shift 2 ;;
    --model=*|--model-name=*)
      MODEL="${1#*=}"; shift ;;

    # ── Direct renames (vLLM → SGLang) ──
    --tensor-parallel-size)
      SGLANG_ARGS+=("--tp" "$2"); shift 2 ;;
    --tensor-parallel-size=*)
      SGLANG_ARGS+=("--tp" "${1#*=}"); shift ;;
    --gpu_memory_utilization)
      SGLANG_ARGS+=("--mem-fraction-static" "$2"); shift 2 ;;
    --gpu_memory_utilization=*)
      SGLANG_ARGS+=("--mem-fraction-static" "${1#*=}"); shift ;;
    --trust_remote_code|--trust-remote-code)
      SGLANG_ARGS+=("--trust-remote-code"); shift ;;

    # ── vLLM flags with no SGLang equivalent → skip ──
    --no-enable-prefix-caching|--enable-prefix-caching)
      SKIPPED_ARGS+=("$1"); shift ;;
    --enable-chunked-prefill|--no-enable-chunked-prefill)
      SKIPPED_ARGS+=("$1"); shift ;;
    --disable-log-requests|--disable-log-stats)
      SKIPPED_ARGS+=("$1"); shift ;;
    --swap-space|--block-size|--max-num-seqs|--max-num-batched-tokens)
      SKIPPED_ARGS+=("$1" "$2"); shift 2 ;;
    --swap-space=*|--block-size=*|--max-num-seqs=*|--max-num-batched-tokens=*)
      SKIPPED_ARGS+=("$1"); shift ;;
    --distributed-executor-backend|--pipeline-parallel-size|--data-parallel-size)
      SKIPPED_ARGS+=("$1" "$2"); shift 2 ;;
    --quantization|--dtype|--revision|--tokenizer-revision|--tokenizer-mode)
      SKIPPED_ARGS+=("$1" "$2"); shift 2 ;;
    --quantization=*|--dtype=*|--revision=*|--tokenizer-revision=*|--tokenizer-mode=*)
      SKIPPED_ARGS+=("$1"); shift ;;

    # ── Pass through to SGLang as-is ──
    --tool-call-parser)
      TOOL_CALL_PARSER="$2"; shift 2 ;;
    --tool-call-parser=*)
      TOOL_CALL_PARSER="${1#*=}"; shift ;;
    *)
      # Positional arg = model name (first non-flag)
      if [[ ! "$1" =~ ^- ]] && [[ -z "$MODEL" ]]; then
        MODEL="$1"; shift
      else
        # Unknown — pass through, might be valid for SGLang
        SGLANG_ARGS+=("$1"); shift
      fi ;;
  esac
done

if [[ -z "$MODEL" ]]; then
  echo "ERROR: No model specified in vLLM args!"
  exit 1
fi

# ── Port scheme: haproxy=original, SGLang=+1, middleware=+2 ──
SGLANG_PORT=$((PORT + 1))
MIDDLEWARE_PORT=$((PORT + 2))

echo "Model: ${MODEL}"
echo "SGLang: ${HOST}:${SGLANG_PORT}"
echo "Middleware: ${HOST}:${MIDDLEWARE_PORT}"
echo "haproxy: ${HOST}:${PORT}"
if [[ ${#SGLANG_ARGS[@]} -gt 0 ]]; then
  echo "Translated args: ${SGLANG_ARGS[*]}"
fi
if [[ ${#SKIPPED_ARGS[@]} -gt 0 ]]; then
  echo "Skipped (no SGLang equivalent): ${SKIPPED_ARGS[*]}"
fi
echo ""

# ── haproxy setup ───────────────────────────────────────────
mkdir -p /tmp/haproxy-errors
printf "HTTP/1.0 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n" > /tmp/haproxy-errors/200-empty.http
printf "HTTP/1.0 503 Service Unavailable\r\nContent-Length: 16\r\nConnection: close\r\nContent-Type: text/plain\r\n\r\nSGLang not ready" > /tmp/haproxy-errors/503-sglang.http

HAPROXY_CFG="/tmp/haproxy-shim.cfg"
cat > "$HAPROXY_CFG" <<EOF
global
@@ -93,14 +162,10 @@ defaults
frontend proxy
    bind ${HOST}:${PORT}
    acl is_metrics path /metrics
    http-request deny deny_status 200 if is_metrics
    errorfile 200 /tmp/haproxy-errors/200-empty.http
    acl is_health path /health
    acl sglang_up nbsrv(sglang) gt 0
    http-request deny deny_status 200 if is_health sglang_up
@@ -115,35 +180,49 @@ backend sglang
    server s1 127.0.0.1:${MIDDLEWARE_PORT} check inter 5s fall 3 rise 2
EOF

# ── Build and launch SGLang ─────────────────────────────────
SGLANG_CMD=(
  python -m sglang.launch_server
  --model-path "$MODEL"
  --host "$HOST"
  --port "$SGLANG_PORT"
)
if [[ -n "$TOOL_CALL_PARSER" ]]; then
  SGLANG_CMD+=(--tool-call-parser "$TOOL_CALL_PARSER")
fi
# Guarded append: an empty array trips `set -u` on bash < 4.4
if [[ ${#SGLANG_ARGS[@]} -gt 0 ]]; then
  SGLANG_CMD+=("${SGLANG_ARGS[@]}")
fi
echo "SGLang command: ${SGLANG_CMD[*]}"
echo ""

{
  echo "haproxy config written to ${HAPROXY_CFG}"
  echo "Model: ${MODEL}, SGLang port: ${SGLANG_PORT}, middleware port: ${MIDDLEWARE_PORT}, haproxy port: ${PORT}"
  echo "SGLang command: ${SGLANG_CMD[*]}"
  if [[ ${#SKIPPED_ARGS[@]} -gt 0 ]]; then
    echo "Skipped vLLM args: ${SKIPPED_ARGS[*]}"
  fi
} >> "$LOG_PATH"

# Launch SGLang
"${SGLANG_CMD[@]}" &
SGLANG_PID=$!

# Launch middleware
SGLANG_HOST="$HOST" SGLANG_PORT="$SGLANG_PORT" MIDDLEWARE_PORT="$MIDDLEWARE_PORT" \
  python /opt/vllm-shim/vllm_middleware.py &
MIDDLEWARE_PID=$!

sleep 2

# Launch haproxy (front door on the original port)
haproxy -f "$HAPROXY_CFG" &
HAPROXY_PID=$!
echo "SGLang PID: ${SGLANG_PID}, middleware PID: ${MIDDLEWARE_PID}, haproxy PID: ${HAPROXY_PID}" >> "$LOG_PATH"

# Wait for whichever dies first
wait -n "$SGLANG_PID" "$MIDDLEWARE_PID" "$HAPROXY_PID"
EXIT_CODE=$?
echo "A process exited (code ${EXIT_CODE}), shutting down" >> "$LOG_PATH"
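For reference, the generated haproxy config has roughly this shape (a hand-assembled sketch from the fragments shown in the hunks above; the `defaults` section, timeouts, and the 503 health branch are assumptions, since those lines fall outside the diff context):

```
frontend proxy
    bind 0.0.0.0:8000
    # /metrics stub: deny with status 200 + errorfile body = instant empty 200
    acl is_metrics path /metrics
    http-request deny deny_status 200 if is_metrics
    errorfile 200 /tmp/haproxy-errors/200-empty.http
    # /health: answered from tracked backend state, never proxied to SGLang
    acl is_health path /health
    acl sglang_up nbsrv(sglang) gt 0
    http-request deny deny_status 200 if is_health sglang_up
    http-request deny deny_status 503 if is_health !sglang_up
    default_backend sglang

backend sglang
    errorfile 503 /tmp/haproxy-errors/503-sglang.http
    server s1 127.0.0.1:8002 check inter 5s fall 3 rise 2
```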

vllm_shim_module.py

@@ -1,28 +1,187 @@
#!/usr/bin/env python3
"""
vLLM -> SGLang Python shim.

Catches `python -m vllm.entrypoints.openai.api_server` (and similar)
and launches SGLang behind haproxy + middleware instead.

Dynamically translates vLLM CLI args to SGLang equivalents.
No hardcoded model name or tensor-parallel size.

Architecture:
    haproxy on the vLLM port (front door)
        /metrics → 200 empty (stub)
        /health  → 200 if SGLang backend is up, 503 if not (instant)
        /*       → proxy to middleware on port+2
    middleware on port+2 (strips vLLM-only params, fixes tool schemas)
    SGLang on port+1 (internal)
"""
import os
import sys
import subprocess
import time
import datetime

# ── vLLM → SGLang argument mapping ──────────────────────────
# Key = vLLM flag, value = (sglang_flag, has_value)
# has_value=True  means the flag takes an argument (e.g. --port 8000)
# has_value=False means it's a boolean flag (e.g. --no-enable-prefix-caching)
ARG_MAP = {
    # Direct renames (vLLM name → SGLang name)
    "--tensor-parallel-size":     ("--tp", True),
    "--gpu_memory_utilization":   ("--mem-fraction-static", True),
    "--max_model_len":            ("--context-length", True),
    "--max-model-len":            ("--context-length", True),  # kebab variant
    "--enforce_eager":            (None, False),  # no direct SGLang equivalent; skip
    "--trust_remote_code":        ("--trust-remote-code", False),
    "--trust-remote-code":        ("--trust-remote-code", False),
    # vLLM flags with no SGLang equivalent → skip
    "--no-enable-prefix-caching": (None, False),
    "--enable-prefix-caching":    (None, False),
    "--enable-chunked-prefill":   (None, False),
    "--no-enable-chunked-prefill": (None, False),
    "--disable-log-requests":     (None, False),
    "--disable-log-stats":        (None, False),
    "--swap-space":               (None, True),
    "--block-size":               (None, True),
    "--num-gpu-blocks-override":  (None, True),
    "--num-cpu-blocks-override":  (None, True),
    "--max-num-seqs":             (None, True),
    "--max-num-batched-tokens":   (None, True),
    "--distributed-executor-backend": (None, True),
    "--pipeline-parallel-size":   (None, True),
    "--data-parallel-size":       (None, True),
    "--revision":                 (None, True),
    "--code-revision":            (None, True),
    "--tokenizer-revision":       (None, True),
    "--tokenizer-mode":           (None, True),
    "--quantization":             (None, True),
    "--dtype":                    (None, True),
    "--max-seq-len-to-capture":   (None, True),
    "--enable-lora":              (None, False),
    "--max-lora-rank":            (None, True),
    "--max-cpu-loras":            (None, True),
    "--lora-dtype":               (None, True),
    "--enable-prompt-adapter":    (None, False),
    "--scheduler-delay-factor":   (None, True),
    "--enable-multi-modal":       (None, False),
    "--limit-mm-per-prompt":      (None, True),
}

# Default tool-call-parser; override with SGLANG_TOOL_CALL_PARSER env var
DEFAULT_TOOL_CALL_PARSER = "mistral"


def parse_vllm_args(args):
    """
    Parse vLLM CLI args and extract model, host, port,
    plus any args we should translate to SGLang.
    Returns (model, host, port, sglang_extra_args, skipped_args).
    """
    model = None
    host = "0.0.0.0"
    port = "8000"
    sglang_extra = []  # translated args for SGLang
    skipped = []       # vLLM args we're ignoring

    i = 0
    while i < len(args):
        arg = args[i]

        # 'serve' subcommand — skip
        if arg == "serve":
            i += 1
            continue

        # Positional model argument (first non-flag after serve, or standalone)
        if not arg.startswith("-") and model is None:
            model = arg
            i += 1
            continue

        # --flag=value form
        if "=" in arg and arg.startswith("--"):
            flag, val = arg.split("=", 1)
            if flag == "--host":
                host = val
            elif flag == "--port":
                port = val
            elif flag in ARG_MAP:
                sglang_flag, has_val = ARG_MAP[flag]
                if sglang_flag is None:
                    skipped.append(arg)
                elif has_val:
                    sglang_extra.extend([sglang_flag, val])
                else:
                    # boolean flag with =value (unusual but valid)
                    sglang_extra.append(sglang_flag)
            else:
                # Unknown flag — pass through as-is (might be a SGLang flag too)
                sglang_extra.append(arg)
            i += 1
            continue

        # --flag value form
        if arg in ("--host",):
            if i + 1 < len(args):
                host = args[i + 1]
            i += 2
            continue
        if arg in ("--port",):
            if i + 1 < len(args):
                port = args[i + 1]
            i += 2
            continue

        if arg in ARG_MAP:
            sglang_flag, has_val = ARG_MAP[arg]
            if sglang_flag is None:
                skipped.append(arg)
                if has_val and i + 1 < len(args) and not args[i + 1].startswith("-"):
                    skipped.append(args[i + 1])
                    i += 2
                else:
                    i += 1
            elif has_val:
                if i + 1 < len(args):
                    sglang_extra.extend([sglang_flag, args[i + 1]])
                    i += 2
                else:
                    i += 1
            else:
                sglang_extra.append(sglang_flag)
                i += 1
            continue

        # --tool-call-parser — pass through to SGLang
        if arg == "--tool-call-parser":
            if i + 1 < len(args):
                sglang_extra.extend(["--tool-call-parser", args[i + 1]])
                i += 2
            else:
                i += 1
            continue

        # Unknown flag — pass through; if it takes a value, keep the pair together
        if arg.startswith("--") and i + 1 < len(args) and not args[i + 1].startswith("-"):
            sglang_extra.extend([arg, args[i + 1]])
            i += 2
        elif arg.startswith("--"):
            sglang_extra.append(arg)
            i += 1
        else:
            # Unknown positional — probably the model if we don't have it yet
            if model is None:
                model = arg
            i += 1

    return model, host, port, sglang_extra, skipped


def main():
    args = sys.argv[1:]
    log_path = os.environ.get("VLLM_SHIM_LOG", "/tmp/vllm-shim.log")
    with open(log_path, "a") as f:
        f.write(f"\n{datetime.datetime.now().isoformat()} vLLM -> SGLang Shim (Python module)\n")
        f.write(f"  Invoked as: python -m {__name__} {' '.join(args)}\n")
@@ -43,44 +202,52 @@ def main():
    print("==========================================")
    print()

    model, host, port, sglang_extra, skipped = parse_vllm_args(args)

    if not model:
        print("ERROR: No model specified in vLLM args!")
        os._exit(1)

    # Port scheme: original = haproxy, original+1 = SGLang, original+2 = middleware
    sglang_port = str(int(port) + 1)
    middleware_port = str(int(port) + 2)

    # Build SGLang command
    sglang_cmd = [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", model,
        "--host", host,
        "--port", sglang_port,
    ]
    # Add tool-call-parser (env override or default)
    tcp = os.environ.get("SGLANG_TOOL_CALL_PARSER", DEFAULT_TOOL_CALL_PARSER)
    if tcp:
        sglang_cmd.extend(["--tool-call-parser", tcp])
    # Add translated/forwarded args
    sglang_cmd.extend(sglang_extra)

    print(f"Model: {model}")
    print(f"SGLang host: {host}:{sglang_port}")
    print(f"Middleware: {host}:{middleware_port}")
    print(f"haproxy: {host}:{port}")
    if sglang_extra:
        print(f"Translated args: {' '.join(sglang_extra)}")
    if skipped:
        print(f"Skipped (no SGLang equivalent): {' '.join(skipped)}")
    print()
    print(f"SGLang command: {' '.join(sglang_cmd)}")
    print()

    # ── haproxy setup ────────────────────────────────────────
    os.makedirs("/tmp/haproxy-errors", exist_ok=True)
    with open("/tmp/haproxy-errors/200-empty.http", "w") as f:
        f.write("HTTP/1.0 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n")
    with open("/tmp/haproxy-errors/503-sglang.http", "w") as f:
        f.write("HTTP/1.0 503 Service Unavailable\r\nContent-Length: 16\r\nConnection: close\r\nContent-Type: text/plain\r\n\r\nSGLang not ready")

    haproxy_cfg = "/tmp/haproxy-shim.cfg"
    with open(haproxy_cfg, "w") as f:
        f.write(f"""global
@@ -101,8 +268,6 @@ frontend proxy
    errorfile 200 /tmp/haproxy-errors/200-empty.http
    # /health — instant response based on SGLang backend state
    acl is_health path /health
    acl sglang_up nbsrv(sglang) gt 0
    http-request deny deny_status 200 if is_health sglang_up
@@ -119,22 +284,17 @@ backend sglang
    with open(log_path, "a") as f:
        f.write(f"haproxy config written to {haproxy_cfg}\n")
        f.write(f"Model: {model}, SGLang port: {sglang_port}, middleware port: {middleware_port}, haproxy port: {port}\n")
        f.write(f"SGLang command: {' '.join(sglang_cmd)}\n")
        if skipped:
            f.write(f"Skipped vLLM args: {' '.join(skipped)}\n")

    # ── Launch processes ─────────────────────────────────────
    sglang_proc = subprocess.Popen(sglang_cmd)

    middleware_env = os.environ.copy()
    middleware_env["SGLANG_HOST"] = host
    middleware_env["SGLANG_PORT"] = sglang_port
    middleware_env["MIDDLEWARE_PORT"] = middleware_port
    middleware_proc = subprocess.Popen(
@@ -142,10 +302,8 @@ backend sglang
        env=middleware_env,
    )

    time.sleep(2)
    haproxy_proc = subprocess.Popen(["haproxy", "-f", haproxy_cfg])
    with open(log_path, "a") as f:
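Both shims end with the same supervision pattern: start the three processes, then shut down as soon as the first one dies. A self-contained sketch of that pattern (a hypothetical helper, not the shim's own code, which uses `wait -n` in bash):

```python
import subprocess
import sys
import time

def run_until_first_exit(cmds):
    """Start every command; block until any child exits; return its exit code."""
    procs = [subprocess.Popen(c) for c in cmds]
    try:
        while True:
            for p in procs:
                code = p.poll()
                if code is not None:
                    return code  # first process to die decides the exit code
            time.sleep(0.5)      # poll loop; avoids busy-waiting
    finally:
        # tear down whatever is still running
        for p in procs:
            if p.poll() is None:
                p.terminate()

# e.g. stand-ins for SGLang / middleware / haproxy:
# run_until_first_exit([[sys.executable, "-c", "import time; time.sleep(60)"], ...])
```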