# vllm-to-sglang

Drop-in replacement that makes a vLLM production stack (e.g. the k8s operator) actually run SGLang instead.

## How it works

The k8s vLLM production stack calls `vllm serve <model> [flags]`. This project intercepts that call and instead launches SGLang behind haproxy plus a middleware layer.

```
k8s vLLM stack
  │
  │  vllm serve mistralai/Devstral-2-123B-Instruct-2512 \
  │    --host 0.0.0.0 --port 8000 --tensor-parallel-size 8 ...
  │
  ▼
┌─────────────────────────────────────────────────────────┐
│  vllm-shim.sh (replaces the `vllm` binary)              │
│  or vllm_shim_module.py (shadows python -m vllm.*)      │
│                                                         │
│  Parses vLLM args, translates to SGLang equivalents,    │
│  then launches three processes:                         │
│                                                         │
│  ┌──────────────────────────────────────────────────┐   │
│  │ haproxy :8000 (front door)                       │   │
│  │   /metrics → 200 empty (stub)                    │   │
│  │   /health  → 200/503 based on backend state      │   │
│  │   /*       → proxy to middleware :8002           │   │
│  └──────────────────────────────────────────────────┘   │
│                        │                                │
│                        ▼                                │
│  ┌──────────────────────────────────────────────────┐   │
│  │ middleware :8002 (FastAPI)                       │   │
│  │   Strips vLLM-only params from request bodies    │   │
│  │   Recursively fixes tool JSON schemas            │   │
│  │   Forwards to SGLang :8001                       │   │
│  └──────────────────────────────────────────────────┘   │
│                        │                                │
│                        ▼                                │
│  ┌──────────────────────────────────────────────────┐   │
│  │ SGLang :8001 (internal)                          │   │
│  │   The actual inference server                    │   │
│  └──────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘
```
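The front-door routing in the diagram above can be sketched as a haproxy config. This is a hedged illustration, not the repo's actual `haproxy.cfg`; the backend name and health-check details are assumptions:

```haproxy
frontend front_door
    bind *:8000
    mode http
    # /metrics -> empty 200 stub so Prometheus scrapes never reach SGLang
    http-request return status 200 content-type "text/plain" string "" if { path /metrics }
    # /health -> 200 while the middleware backend has a live server, 503 otherwise
    monitor-uri /health
    monitor fail if { nbsrv(middleware) lt 1 }
    default_backend middleware

backend middleware
    mode http
    server mw 127.0.0.1:8002 check
```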

## Argument translation

The shim dynamically translates vLLM CLI args to SGLang equivalents — no hardcoded model names or tensor-parallel sizes.

| vLLM flag | SGLang equivalent | Notes |
|---|---|---|
| `serve` | (skipped) | Subcommand only |
| `<model>` (positional) | `--model-path <model>` | |
| `--host` | Used for all three processes | |
| `--port` | haproxy binds this port | SGLang gets +1, middleware +2 |
| `--tensor-parallel-size` | `--tp` | |
| `--gpu_memory_utilization` | `--mem-fraction-static` | |
| `--trust-remote-code` | `--trust-remote-code` | |
| `--no-enable-prefix-caching` | (skipped) | No SGLang equivalent |
| `--enable-chunked-prefill` | (skipped) | No SGLang equivalent |
| `--tool-call-parser` | `--tool-call-parser` | Defaults to `mistral` |

Unknown flags are passed through as-is — they may be valid SGLang args.
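The translation step can be sketched in Python. This mirrors the table above; `FLAG_MAP`, `SKIPPED_FLAGS`, and `translate()` are illustrative names, not the shim's actual code:

```python
# Illustrative sketch of the shim's arg translation; names are hypothetical.
FLAG_MAP = {
    "--tensor-parallel-size": "--tp",
    "--gpu_memory_utilization": "--mem-fraction-static",
}
SKIPPED_FLAGS = {"--no-enable-prefix-caching", "--enable-chunked-prefill"}

def translate(vllm_args):
    """Turn `vllm serve <model> [flags]` into SGLang server args."""
    args = list(vllm_args)
    if args and args[0] == "serve":
        args.pop(0)                       # subcommand only, drop it
    out, model_seen, i = [], False, 0
    while i < len(args):
        arg = args[i]
        if not arg.startswith("-") and not model_seen:
            out += ["--model-path", arg]  # positional model name
            model_seen = True
        elif arg == "--port":
            # haproxy binds the public port; SGLang listens on port + 1
            out += ["--port", str(int(args[i + 1]) + 1)]
            i += 1
        elif arg in FLAG_MAP:
            out.append(FLAG_MAP[arg])
        elif arg not in SKIPPED_FLAGS:
            out.append(arg)               # unknown flags pass through as-is
        i += 1
    return out
```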

## Environment variables

| Variable | Default | Purpose |
|---|---|---|
| `SGLANG_TOOL_CALL_PARSER` | `mistral` | Override the tool-call parser |
| `VLLM_SHIM_LOG` | `/tmp/vllm-shim.log` | Log file path |

## Middleware: request body fixes

SGLang rejects certain parameters and schemas that vLLM (and OpenClaw) send. The middleware fixes these automatically:

### Stripped parameters

These vLLM-only parameters are removed from request bodies before forwarding to SGLang:

- `logprobs` / `top_logprobs` — SGLang's Mistral tool-call parser rejects these
- `chat_template_kwargs` — OpenClaw sends this for reasoning models; SGLang doesn't support it
- `guided_json` / `guided_regex` — vLLM-only guided-decoding params
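The stripping step amounts to deleting a fixed set of keys from the request body. A minimal sketch, with the key list taken from this README (the function name is illustrative, not the middleware's actual code):

```python
# vLLM-only keys SGLang rejects; removed from the JSON body before forwarding.
STRIPPED_KEYS = {
    "logprobs", "top_logprobs",
    "chat_template_kwargs",
    "guided_json", "guided_regex",
}

def strip_vllm_params(body: dict) -> dict:
    """Return a copy of the request body without vLLM-only parameters."""
    return {k: v for k, v in body.items() if k not in STRIPPED_KEYS}
```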

### Schema fixes

OpenClaw (and some vLLM configurations) send tool schemas with `properties: []` instead of `properties: {}`. SGLang requires `properties` to be an object at every level of the schema, including nested `items` and sub-objects.

The middleware recursively walks the entire JSON Schema tree and fixes:

- `properties: []` → `properties: {}` (at any depth)
- `required: <non-list>` → removed
- `parameters: <non-object>` → `{"type": "object", "properties": {}}`
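The recursive walk described above can be sketched as follows. This is an illustration of the three fixes, not the middleware's actual implementation:

```python
def fix_tool_schema(node):
    """Recursively repair JSON Schemas that SGLang would reject.

    Illustrative sketch; applies the three fixes at every depth.
    """
    if isinstance(node, list):
        return [fix_tool_schema(item) for item in node]
    if not isinstance(node, dict):
        return node
    fixed = {}
    for key, value in node.items():
        if key == "properties" and not isinstance(value, dict):
            value = {}                    # properties: [] -> properties: {}
        elif key == "required" and not isinstance(value, list):
            continue                      # drop malformed `required`
        elif key == "parameters" and not isinstance(value, dict):
            value = {"type": "object", "properties": {}}
        fixed[key] = fix_tool_schema(value)
    return fixed
```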

## Files

| File | Purpose |
|---|---|
| `Dockerfile` | Builds on `lmsysorg/sglang-rocm`, installs haproxy, copies shim files |
| `Jenkinsfile` | CI/CD: builds and pushes to the Vultr container registry |
| `vllm-shim.sh` | Shell shim — replaces the `vllm` binary, translates args |
| `vllm_shim_module.py` | Python shim — shadows `vllm.*` module imports, translates args |
| `vllm_middleware.py` | FastAPI middleware — strips bad params, fixes tool schemas |
| `README.md` | This file |

## Deploy

```shell
docker build -t vllm-to-sglang .
```

Or via Jenkins:

```shell
curl -X POST "https://jenkins.sweetapi.com/job/vllm-to-sglang/buildWithParameters" \
  -d TAG=nightly
```