# vllm-to-sglang
Drop-in replacement that makes a vLLM production stack (e.g. the [k8s operator](https://github.com/vllm-project/production-stack)) actually run [SGLang](https://github.com/sgl-project/sglang) instead.
## How it works
The k8s vLLM production stack calls `vllm serve <model> [flags]`. This project intercepts that call and instead launches SGLang behind haproxy + a middleware layer.
```
k8s vLLM stack
│ vllm serve mistralai/Devstral-2-123B-Instruct-2512 \
│ --host 0.0.0.0 --port 8000 --tensor-parallel-size 8 ...
┌─────────────────────────────────────────────────────────┐
│ vllm-shim.sh (replaces the `vllm` binary) │
│ or vllm_shim_module.py (shadows python -m vllm.*) │
│ │
│ Parses vLLM args, translates to SGLang equivalents, │
│ then launches three processes: │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ haproxy :8000 (front door) │ │
│ │ /metrics → 200 empty (stub) │ │
│ │ /health → 200/503 based on backend state │ │
│ │ /* → proxy to middleware :8002 │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ middleware :8002 (FastAPI) │ │
│ │ Strips vLLM-only params from request bodies │ │
│ │ Recursively fixes tool JSON schemas │ │
│ │ Forwards to SGLang :8001 │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ SGLang :8001 (internal) │ │
│ │ The actual inference server │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
```
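The haproxy front door's routing might look roughly like the fragment below. This is a hedged sketch only: the backend names, the empty `/metrics` stub, and the exact health-check wiring are assumptions, not the config the shim actually generates.

```
# Sketch of the front-door routing described above (names are illustrative).
frontend front_door
    bind *:8000
    # Stub Prometheus scrapes with an empty 200 response
    http-request return status 200 content-type text/plain string "" if { path /metrics }
    acl is_health path /health
    use_backend health if is_health
    default_backend middleware

backend health
    # Returns 503 automatically when the SGLang server fails its check
    option httpchk GET /health
    server sglang 127.0.0.1:8001 check

backend middleware
    server mw 127.0.0.1:8002
```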
## Argument translation
The shim dynamically translates vLLM CLI args to SGLang equivalents — no hardcoded model names or tensor-parallel sizes.
| vLLM flag | SGLang equivalent | Notes |
|-----------|-------------------|-------|
| `serve` | *(skipped)* | Subcommand only |
| `<model>` (positional) | `--model-path <model>` | |
| `--host` | Used for all three processes | |
| `--port` | haproxy binds this port | SGLang gets +1, middleware +2 |
| `--tensor-parallel-size` | `--tp` | |
| `--gpu_memory_utilization` | `--mem-fraction-static` | |
| `--trust-remote-code` | `--trust-remote-code` | |
| `--no-enable-prefix-caching` | *(skipped)* | No SGLang equivalent |
| `--enable-chunked-prefill` | *(skipped)* | No SGLang equivalent |
| `--tool-call-parser` | `--tool-call-parser` | Defaults to `mistral` |
Unknown flags are passed through as-is — they may be valid SGLang args.
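The table above can be sketched as a small translation function. This is a minimal illustration under the README's stated rules, not the shim's actual code; the function and mapping names are hypothetical.

```python
# Hypothetical sketch of the vLLM -> SGLang argument translation.
RENAMES = {
    "--tensor-parallel-size": "--tp",
    "--gpu_memory_utilization": "--mem-fraction-static",
}
SKIPPED = {"--no-enable-prefix-caching", "--enable-chunked-prefill"}

def translate(vllm_args):
    """Turn `vllm serve <model> [flags]` into an SGLang argument list."""
    args = [a for a in vllm_args if a != "serve"]  # drop the subcommand
    out, port, seen_model = [], 8000, False
    i = 0
    while i < len(args):
        arg = args[i]
        if arg in SKIPPED:                          # no SGLang equivalent
            i += 1
        elif arg == "--port":
            port = int(args[i + 1]); i += 2         # haproxy binds this port
        elif arg in RENAMES:
            out += [RENAMES[arg], args[i + 1]]; i += 2
        elif not arg.startswith("--") and not seen_model:
            out += ["--model-path", arg]            # positional model name
            seen_model = True; i += 1
        else:
            out.append(arg); i += 1  # unknown flags (and values) pass through
    # haproxy keeps `port`; SGLang listens on port+1, the middleware on port+2
    return out + ["--port", str(port + 1)]
```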
### Environment variables
| Variable | Default | Purpose |
|----------|---------|---------|
| `SGLANG_TOOL_CALL_PARSER` | `mistral` | Override the tool-call-parser |
| `VLLM_SHIM_LOG` | `/tmp/vllm-shim.log` | Log file path |
## Middleware: request body fixes
SGLang rejects certain parameters and schemas that vLLM (and OpenClaw) send. The middleware fixes these automatically:
### Stripped parameters
These vLLM-only parameters are removed from request bodies before forwarding to SGLang:
- `logprobs` / `top_logprobs` — SGLang's Mistral tool-call parser rejects these
- `chat_template_kwargs` — OpenClaw sends this for reasoning models; SGLang doesn't support it
- `guided_json` / `guided_regex` — vLLM-only guided decoding params
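The stripping step amounts to a key filter on the request body. A minimal sketch (the set and function names here are illustrative, not the middleware's actual identifiers):

```python
# Hypothetical sketch of the middleware's body scrubbing.
VLLM_ONLY = {"logprobs", "top_logprobs", "chat_template_kwargs",
             "guided_json", "guided_regex"}

def strip_vllm_params(body: dict) -> dict:
    """Drop vLLM-only parameters SGLang would reject before forwarding."""
    return {k: v for k, v in body.items() if k not in VLLM_ONLY}
```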
### Schema fixes
OpenClaw (and some vLLM configurations) send tool schemas with `properties: []` instead of `properties: {}`. SGLang requires `properties` to be an object at **every level** of the schema, including nested `items` and sub-objects.
The middleware recursively walks the entire JSON Schema tree and fixes:
- `properties: []` → `properties: {}` (at any depth)
- `required: <non-list>` → removed
- `parameters: <non-object>` → `{"type": "object", "properties": {}}`
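The recursive walk can be sketched as follows. This is an assumption-laden illustration of the three rules above, not the middleware's actual code; the function names and the tool-object shape (`function.parameters`, as in the OpenAI tools format) are hypothetical.

```python
# Hypothetical sketch of the recursive schema repair.
def fix_schema(node):
    """Walk a JSON Schema tree, normalizing the quirks SGLang rejects."""
    if not isinstance(node, dict):
        return node
    if node.get("properties") == []:           # properties: [] -> properties: {}
        node["properties"] = {}
    if "required" in node and not isinstance(node["required"], list):
        del node["required"]                   # required must be a list
    for key, value in list(node.items()):
        if isinstance(value, dict):
            node[key] = fix_schema(value)      # nested items / sub-objects
        elif isinstance(value, list):
            node[key] = [fix_schema(v) for v in value]
    return node

def fix_tool(tool):
    """Ensure `parameters` is an object schema, then repair it recursively."""
    params = tool.get("function", {}).get("parameters")
    if not isinstance(params, dict):
        tool.setdefault("function", {})["parameters"] = {
            "type": "object", "properties": {}}
    else:
        tool["function"]["parameters"] = fix_schema(params)
    return tool
```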
## Files
| File | Purpose |
|------|---------|
| `Dockerfile` | Builds on `lmsysorg/sglang-rocm`, installs haproxy, copies shim files |
| `Jenkinsfile` | CI/CD: builds and pushes to Vultr container registry |
| `vllm-shim.sh` | Shell shim — replaces the `vllm` binary, translates args |
| `vllm_shim_module.py` | Python shim — shadows `vllm.*` module imports, translates args |
| `vllm_middleware.py` | FastAPI middleware — strips bad params, fixes tool schemas |
| `README.md` | This file |
## Deploy
```bash
docker build -t vllm-to-sglang .
```
Or via Jenkins:
```bash
curl -X POST "https://jenkins.sweetapi.com/job/vllm-to-sglang/buildWithParameters" \
-d TAG=nightly
```