not sure why we have a default tool parser

dynamic arg translation, remove entrypoint.sh, update README
fix: recursive _fix_schema to handle nested properties=[] at any depth
2026-04-13 17:49:44 +00:00 · 2026-04-12 21:23:26 +00:00 · 2026-04-12 20:52:44 +00:00 · 2026-04-12 20:27:44 +00:00 · 2026-04-12 20:23:10 +00:00 · 2026-04-12 20:17:06 +00:00
6 changed files with 863 additions and 120 deletions
--- a/7
+++ b/7
@@ -1,5 +1,11 @@
 FROM lmsysorg/sglang-rocm:v0.5.10rc0-rocm700-mi30x-20260411
 # ---------------------------------------------------------------
 # haproxy: routes /metrics stub, proxies everything else to SGLang
 # ---------------------------------------------------------------
 RUN apt-get update && apt-get install -y --no-install-recommends haproxy \
    && rm -rf /var/lib/apt/lists/*
 # ---------------------------------------------------------------
 # Replace the vllm binary with our shim
 # ---------------------------------------------------------------
@@ -12,6 +18,7 @@ RUN mkdir -p /opt/vllm-shim/vllm/entrypoints/openai \
 COPY vllm_shim_module.py /opt/vllm-shim/vllm/__main__.py
 COPY vllm_shim_module.py /opt/vllm-shim/vllm/entrypoints/openai/api_server.py
 COPY vllm_shim_module.py /opt/vllm-shim/vllm/entrypoints/cli/main.py
 COPY vllm_middleware.py /opt/vllm-shim/vllm_middleware.py
 RUN touch /opt/vllm-shim/vllm/__init__.py \
          /opt/vllm-shim/vllm/entrypoints/__init__.py \
          /opt/vllm-shim/vllm/entrypoints/openai/__init__.py \
--- a/README.md
+++ b/README.md
@@ -1,52 +1,115 @@
-# vLLM → SGLang Shim
+# vllm-to-sglang
 Drop-in replacement that makes a vLLM production stack (e.g. the [k8s operator](https://github.com/vllm-project/production-stack)) actually run [SGLang](https://github.com/sgl-project/sglang) instead.
-## Why?
+## How it works
-The vLLM production stack handles model lifecycle, scaling, and routing — but some models work better (or only work) on SGLang. Rather than rewriting your deployment infra, this shim intercepts every vLLM invocation and launches SGLang with equivalent arguments.
+The k8s vLLM production stack calls `vllm serve <model> [flags]`. This project intercepts that call and instead launches SGLang behind haproxy + a middleware layer.
-## How It Works
+```
 k8s vLLM stack
  │
  │  vllm serve mistralai/Devstral-2-123B-Instruct-2512 \
  │    --host 0.0.0.0 --port 8000 --tensor-parallel-size 8 ...
  │
  ▼
 ┌─────────────────────────────────────────────────────────┐
 │  vllm-shim.sh (replaces the `vllm` binary)             │
 │  or vllm_shim_module.py (shadows python -m vllm.*)     │
 │                                                         │
 │  Parses vLLM args, translates to SGLang equivalents,   │
 │  then launches three processes:                         │
 │                                                         │
 │  ┌──────────────────────────────────────────────────┐  │
 │  │ haproxy :8000 (front door)                       │  │
 │  │   /metrics → 200 empty (stub)                    │  │
 │  │   /health  → 200/503 based on backend state      │  │
 │  │   /*       → proxy to middleware :8002            │  │
 │  └──────────────────────────────────────────────────┘  │
 │                        │                                │
 │                        ▼                                │
 │  ┌──────────────────────────────────────────────────┐  │
 │  │ middleware :8002 (FastAPI)                        │  │
 │  │   Strips vLLM-only params from request bodies    │  │
 │  │   Recursively fixes tool JSON schemas            │  │
 │  │   Forwards to SGLang :8001                       │  │
 │  └──────────────────────────────────────────────────┘  │
 │                        │                                │
 │                        ▼                                │
 │  ┌──────────────────────────────────────────────────┐  │
 │  │ SGLang :8001 (internal)                          │  │
 │  │   The actual inference server                    │  │
 │  └──────────────────────────────────────────────────┘  │
 └─────────────────────────────────────────────────────────┘
 ```
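The port layout in the diagram can be written down as a tiny helper. This is only an illustrative sketch: `derive_ports` is a hypothetical name, not a function in the shim, and it assumes the "+1 / +2" scheme described in this README.

```python
def derive_ports(vllm_port: int) -> dict[str, int]:
    """Return the port each process listens on, given the port the vLLM stack requested."""
    return {
        "haproxy": vllm_port,         # front door keeps the port the stack asked for
        "sglang": vllm_port + 1,      # internal inference server
        "middleware": vllm_port + 2,  # request-rewriting proxy in front of SGLang
    }


if __name__ == "__main__":
    print(derive_ports(8000))
```

With the stack's default `--port 8000` this gives haproxy on 8000, SGLang on 8001, and the middleware on 8002, which matches the diagram.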
-Two interception paths:
-| What the stack calls | What happens |
 |---|---|
 | `vllm serve <model> [flags]` | Shell shim (`vllm-shim.sh`) parses args, execs `python -m sglang.launch_server` |
 | `python -m vllm.entrypoints.openai.api_server` | Python shim (shadow module on `PYTHONPATH`) does the same |
-Both extract `--host` and `--port` from whatever the stack sends and forward them to SGLang. Everything else is currently hardcoded for the target model.
+## Argument translation
+The shim dynamically translates vLLM CLI args to SGLang equivalents — no hardcoded model names or tensor-parallel sizes.
+| vLLM flag | SGLang equivalent | Notes |
 |-----------|-------------------|-------|
 | `serve` | *(skipped)* | Subcommand only |
 | `<model>` (positional) | `--model-path <model>` | |
 | `--host` | Used for all three processes | |
 | `--port` | haproxy binds this port | SGLang gets +1, middleware +2 |
 | `--tensor-parallel-size` | `--tp` | |
 | `--gpu_memory_utilization` | `--mem-fraction-static` | |
 | `--trust-remote-code` | `--trust-remote-code` | |
 | `--no-enable-prefix-caching` | *(skipped)* | No SGLang equivalent |
 | `--enable-chunked-prefill` | *(skipped)* | No SGLang equivalent |
 | `--tool-call-parser` | `--tool-call-parser` | Defaults to `mistral` |
-## Current State
-**PoC — hardcoded for `mistralai/Devstral-2-123B-Instruct-2512` on 8× MI300X.**
- Model path, `--tp 8`, and `--tool-call-parser mistral` are baked into both shims
- The Dockerfile builds on `lmsysorg/sglang-rocm` and patches a broken `aiter` build from the base image
- MI300X tuning env vars are set (`HIP_FORCE_DEV_KERNARG`, `NCCL_MIN_NCHANNELS`, etc.)
+Unknown flags are passed through as-is — they may be valid SGLang args.
+### Environment variables
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `SGLANG_TOOL_CALL_PARSER` | `mistral` | Override the tool-call-parser |
 | `VLLM_SHIM_LOG` | `/tmp/vllm-shim.log` | Log file path |
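The translation rules and the `SGLANG_TOOL_CALL_PARSER` default can be sketched in a few lines of Python. This is a simplified illustration, not the shim's actual parser: `translate` and the trimmed `ARG_MAP` are hypothetical, only the `--flag value` form is handled (not `--flag=value`), and the input is assumed to be well formed.

```python
import os

# Trimmed, illustrative mapping: vLLM flag -> (SGLang flag or None, takes_value).
# None means "no SGLang equivalent, drop it".
ARG_MAP = {
    "--tensor-parallel-size": ("--tp", True),
    "--gpu_memory_utilization": ("--mem-fraction-static", True),
    "--trust-remote-code": ("--trust-remote-code", False),
    "--no-enable-prefix-caching": (None, False),
    "--enable-chunked-prefill": (None, False),
}


def translate(argv, env=None):
    """Turn a `vllm serve ...` argument list into SGLang launch arguments."""
    env = os.environ if env is None else env
    parser = env.get("SGLANG_TOOL_CALL_PARSER", "mistral")
    model = None
    out = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "serve":
            i += 1                      # subcommand only
        elif arg in ("--host", "--port"):
            i += 2                      # used for process wiring, not passed on here
        elif arg == "--tool-call-parser":
            parser = argv[i + 1]        # an explicit flag beats the env default
            i += 2
        elif arg in ARG_MAP:
            target, takes_value = ARG_MAP[arg]
            if target is not None:
                out.append(target)
                if takes_value:
                    out.append(argv[i + 1])
            i += 2 if takes_value else 1
        elif not arg.startswith("-") and model is None:
            model = arg                 # first positional is the model
            i += 1
        else:
            out.append(arg)             # unknown: pass through unchanged
            i += 1
    if model is None:
        raise ValueError("no model specified in vLLM args")
    return ["--model-path", model, "--tool-call-parser", parser, *out]


if __name__ == "__main__":
    print(translate(["serve", "org/model", "--port", "8000",
                     "--tensor-parallel-size", "8"], env={}))
```

Passing `env` explicitly keeps the sketch easy to test; the real shims read the process environment directly.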
-## Building
+## Middleware: request body fixes
 SGLang rejects certain parameters and schemas that vLLM (and OpenClaw) send. The middleware fixes these automatically:
 ### Stripped parameters
 These vLLM-only parameters are removed from request bodies before forwarding to SGLang:
 - `logprobs` / `top_logprobs` — SGLang's Mistral tool-call parser rejects these
 - `chat_template_kwargs` — OpenClaw sends this for reasoning models; SGLang doesn't support it
 - `guided_json` / `guided_regex` — vLLM-only guided decoding params
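As a rough illustration of the stripping step, the sketch below removes the listed keys from a chat-completion request body. `strip_vllm_params` is a hypothetical helper name and the sample request is made up; the key set mirrors the list above.

```python
import json

# Keys listed above as vLLM-only.
STRIP_PARAMS = {"logprobs", "top_logprobs", "chat_template_kwargs",
                "guided_json", "guided_regex"}


def strip_vllm_params(raw_body: bytes) -> bytes:
    """Return the request body with vLLM-only top-level keys removed."""
    data = json.loads(raw_body)
    if not isinstance(data, dict):
        return raw_body  # not a JSON object, nothing to strip
    cleaned = {k: v for k, v in data.items() if k not in STRIP_PARAMS}
    return json.dumps(cleaned).encode()


if __name__ == "__main__":
    body = json.dumps({"model": "m", "messages": [], "logprobs": True}).encode()
    print(strip_vllm_params(body).decode())
```

Only top-level keys are removed; nested occurrences (for example inside `messages`) are left alone, which matches the description above.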
 ### Schema fixes
 OpenClaw (and some vLLM configurations) send tool schemas with `properties: []` instead of `properties: {}`. SGLang requires `properties` to be an object at **every level** of the schema, including nested `items` and sub-objects.
 The middleware recursively walks the entire JSON Schema tree and fixes:
 - `properties: []` → `properties: {}` (at any depth)
 - `required: <non-list>` → removed
 - `parameters: <non-object>` → `{"type": "object", "properties": {}}`
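The three fixes can be seen on a small, made-up tool definition. The sketch below is a reduced version of the idea (it recurses into `properties` and `items` only); `fix_schema`, `fix_tool`, and the `list_files` tool are illustrative names, not the middleware's exact code.

```python
def fix_schema(schema: dict) -> dict:
    """Repair a JSON Schema dict in place and return it."""
    # properties must be an object
    if "properties" in schema and not isinstance(schema["properties"], dict):
        schema["properties"] = {}
    # required must be a list, otherwise drop it
    if "required" in schema and not isinstance(schema["required"], list):
        del schema["required"]
    # recurse into nested property schemas
    for child in schema.get("properties", {}).values():
        if isinstance(child, dict):
            fix_schema(child)
    # recurse into array item schemas
    if isinstance(schema.get("items"), dict):
        fix_schema(schema["items"])
    return schema


def fix_tool(tool: dict) -> dict:
    """Ensure a tool's function.parameters is a valid object schema."""
    func = tool["function"]
    if not isinstance(func.get("parameters"), dict):
        func["parameters"] = {"type": "object", "properties": {}}
    fix_schema(func["parameters"])
    return tool


if __name__ == "__main__":
    tool = {
        "type": "function",
        "function": {
            "name": "list_files",  # hypothetical tool
            "parameters": {
                "type": "object",
                "properties": {
                    "filters": {
                        "type": "array",
                        "items": {"type": "object", "properties": []},
                    }
                },
                "required": "path",
            },
        },
    }
    print(fix_tool(tool)["function"]["parameters"])
```

In this example the nested `properties: []` under `items` becomes `{}` and the string-valued `required` is dropped, while the rest of the schema is left untouched.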
 ## Files
 | File | Purpose |
 |------|---------|
 | `Dockerfile` | Builds on `lmsysorg/sglang-rocm`, installs haproxy, copies shim files |
 | `Jenkinsfile` | CI/CD: builds and pushes to Vultr container registry |
 | `vllm-shim.sh` | Shell shim — replaces the `vllm` binary, translates args |
 | `vllm_shim_module.py` | Python shim — shadows `vllm.*` module imports, translates args |
 | `vllm_middleware.py` | FastAPI middleware — strips bad params, fixes tool schemas |
 | `README.md` | This file |
 ## Deploy
 ```bash
 docker build -t vllm-to-sglang .
 ```
-Then use this image anywhere the vLLM stack expects its server image.
-## Making It Work For Other Models
-
-Right now the model config is hardcoded in three places:
-
+Or via Jenkins:
+```bash
+curl -X POST "https://jenkins.sweetapi.com/job/vllm-to-sglang/buildWithParameters" \
+  -d TAG=nightly
+```
 - `vllm-shim.sh` — the `exec python -m sglang.launch_server` line
 - `vllm_shim_module.py` — the `os.execvp()` call
 - `Dockerfile` — base image and ROCm-specific patches
 To adapt for a different model, change `--model-path`, `--tp`, and `--tool-call-parser` in both shim files. A future pass will make this configurable via env vars or args so you don't have to edit source.
 ## Files
 | File | Purpose |
 |---|---|
 | `Dockerfile` | Builds the image: ROCm SGLang base + aiter fix + shims + MI300X env |
 | `vllm-shim.sh` | Shell shim — replaces the `vllm` binary |
 | `vllm_shim_module.py` | Python shim — shadows `vllm.*` module imports |
--- a/entrypoint.sh
+++ b/entrypoint.sh
@@ -1,42 +0,0 @@
 #!/bin/bash
 set -euo pipefail
 # Defaults matching vLLM production stack defaults
 HOST="0.0.0.0"
 PORT="8000"
 # Save original args before parsing eats them
 ALL_ARGS="$*"
 # Parse only host and port from whatever args the vLLM stack sends.
 # Everything else is ignored.
 while [[ $# -gt 0 ]]; do
  case "$1" in
    --host)       HOST="$2"; shift 2 ;;
    --host=*)     HOST="${1#*=}"; shift ;;
    --port)       PORT="$2"; shift 2 ;;
    --port=*)     PORT="${1#*=}"; shift ;;
    *)            shift ;;  # ignore everything else
  esac
 done
 echo "=== vLLM production stack args received ==="
 echo "Raw args: $ALL_ARGS"
 echo ""
 i=1
 for arg in $ALL_ARGS; do
  echo "  [$i] $arg"
  i=$((i + 1))
 done
 echo "============================================"
 echo ""
 echo "=== SGLang shim ==="
 echo "Ignoring vLLM args. Launching SGLang on ${HOST}:${PORT}"
 echo "==================="
 exec python -m sglang.launch_server \
  --model-path mistralai/Devstral-2-123B-Instruct-2512 \
  --host "$HOST" \
  --port "$PORT" \
  --tp 8 \
  --tool-call-parser mistral
--- a/vllm-shim.sh
+++ b/vllm-shim.sh
@@ -1,10 +1,21 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail
 # ============================================================
-# vLLM -> SGLang shim
+# vLLM -> SGLang shim (shell version)
 # This script replaces the vllm binary. The k8s production stack
 # calls `vllm serve <model> [flags]`, and we intercept everything.
 #
 # Dynamically translates vLLM args to SGLang equivalents.
 # No hardcoded model or tensor-parallel size.
 #
 # Architecture:
 #   haproxy on the vLLM port (front door)
 #     /metrics → 200 empty (stub)
 #     /health  → 200 if SGLang backend is up, 503 if not
 #     /*       → proxy to middleware on port+2
 #   middleware on port+2 (strips vLLM-only params, fixes schemas)
 #   SGLang on port+1 (internal)
 # ============================================================
 echo ""
@@ -22,28 +33,198 @@ done
 echo "=========================================="
 echo ""
-# Defaults
+# Log to file
 LOG_PATH="${VLLM_SHIM_LOG:-/tmp/vllm-shim.log}"
 {
  echo "$(date -Iseconds) vLLM -> SGLang Shim (shell)"
  echo "  Invoked as: vllm $*"
  echo "  All arguments received:"
  i=1
  for arg in "$@"; do
    echo "    [$i] $arg"
    i=$((i + 1))
  done
  echo ""
 } >> "$LOG_PATH"
 # ── Parse vLLM args → extract model, host, port, translate the rest ──
 MODEL=""
 HOST="0.0.0.0"
 PORT="8000"
 SGLANG_ARGS=()
 SKIPPED_ARGS=()
 # Default tool-call-parser; override with SGLANG_TOOL_CALL_PARSER env var
 TOOL_CALL_PARSER="${SGLANG_TOOL_CALL_PARSER:-mistral}"
 # Walk the vLLM args: extract host, port, and model; translate or skip the rest
 while [[ $# -gt 0 ]]; do
  case "$1" in
-    serve)        shift ;;  # skip the 'serve' subcommand
+    # Skip 'serve' subcommand
    serve)        shift ;;
    # ── Extracted for infrastructure (not passed to SGLang) ──
    --host)       HOST="$2"; shift 2 ;;
    --host=*)     HOST="${1#*=}"; shift ;;
    --port)       PORT="$2"; shift 2 ;;
    --port=*)     PORT="${1#*=}"; shift ;;
-    *)            shift ;;  # ignore everything else
+
    # ── Explicit model flags (the positional form is handled in the catch-all below) ──
    --model|--model-name)
      MODEL="$2"; shift 2 ;;
    --model=*|--model-name=*)
      MODEL="${1#*=}"; shift ;;
    # ── Direct renames (vLLM → SGLang) ──
    --tensor-parallel-size)
      SGLANG_ARGS+=("--tp" "$2"); shift 2 ;;
    --tensor-parallel-size=*)
      SGLANG_ARGS+=("--tp" "${1#*=}"); shift ;;
    --gpu_memory_utilization|--gpu-memory-utilization)
      SGLANG_ARGS+=("--mem-fraction-static" "$2"); shift 2 ;;
    --gpu_memory_utilization=*|--gpu-memory-utilization=*)
      SGLANG_ARGS+=("--mem-fraction-static" "${1#*=}"); shift ;;
    --trust_remote_code|--trust-remote-code)
      SGLANG_ARGS+=("--trust-remote-code"); shift ;;
    # ── vLLM flags with no SGLang equivalent → skip ──
    --no-enable-prefix-caching|--enable-prefix-caching)
      SKIPPED_ARGS+=("$1"); shift ;;
    --enable-chunked-prefill|--no-enable-chunked-prefill)
      SKIPPED_ARGS+=("$1"); shift ;;
    --disable-log-requests|--disable-log-stats)
      SKIPPED_ARGS+=("$1"); shift ;;
    --swap-space|--block-size|--max-num-seqs|--max-num-batched-tokens)
      SKIPPED_ARGS+=("$1" "$2"); shift 2 ;;
    --swap-space=*|--block-size=*|--max-num-seqs=*|--max-num-batched-tokens=*)
      SKIPPED_ARGS+=("$1"); shift ;;
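    --distributed-executor-backend|--pipeline-parallel-size|--data-parallel-size)
      SKIPPED_ARGS+=("$1" "$2"); shift 2 ;;
    --distributed-executor-backend=*|--pipeline-parallel-size=*|--data-parallel-size=*)
      SKIPPED_ARGS+=("$1"); shift ;;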
    --quantization|--dtype|--revision|--tokenizer-revision|--tokenizer-mode)
      SKIPPED_ARGS+=("$1" "$2"); shift 2 ;;
    --quantization=*|--dtype=*|--revision=*|--tokenizer-revision=*|--tokenizer-mode=*)
      SKIPPED_ARGS+=("$1"); shift ;;
    # ── Pass through to SGLang as-is ──
    --tool-call-parser)
      TOOL_CALL_PARSER="$2"; shift 2 ;;
    --tool-call-parser=*)
      TOOL_CALL_PARSER="${1#*=}"; shift ;;
    *)
      # Positional arg = model name (first non-flag)
      if [[ ! "$1" =~ ^- ]] && [[ -z "$MODEL" ]]; then
        MODEL="$1"; shift
      else
        # Unknown — pass through, might be valid for SGLang
        SGLANG_ARGS+=("$1"); shift
      fi ;;
  esac
 done
-echo "Launching SGLang on ${HOST}:${PORT}"
+if [[ -z "$MODEL" ]]; then
  echo "ERROR: No model specified in vLLM args!"
  exit 1
 fi
 # ── Port scheme: haproxy=original, SGLang=+1, middleware=+2 ──
 SGLANG_PORT=$((PORT + 1))
 MIDDLEWARE_PORT=$((PORT + 2))
 echo "Model: ${MODEL}"
 echo "SGLang:  ${HOST}:${SGLANG_PORT}"
 echo "Middleware: ${HOST}:${MIDDLEWARE_PORT}"
 echo "haproxy: ${HOST}:${PORT}"
 if [[ ${#SGLANG_ARGS[@]} -gt 0 ]]; then
  echo "Translated args: ${SGLANG_ARGS[*]}"
 fi
 if [[ ${#SKIPPED_ARGS[@]} -gt 0 ]]; then
  echo "Skipped (no SGLang equivalent): ${SKIPPED_ARGS[*]}"
 fi
 echo ""
-exec python -m sglang.launch_server \
-  --model-path mistralai/Devstral-2-123B-Instruct-2512 \
-  --host "$HOST" \
-  --port "$PORT" \
-  --tp 8 \
-  --tool-call-parser mistral
+# ── haproxy setup ───────────────────────────────────────────
+
+mkdir -p /tmp/haproxy-errors
+printf "HTTP/1.0 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n" > /tmp/haproxy-errors/200-empty.http
+printf "HTTP/1.0 503 Service Unavailable\r\nContent-Length: 16\r\nConnection: close\r\nContent-Type: text/plain\r\n\r\nSGLang not ready" > /tmp/haproxy-errors/503-sglang.http
+
 HAPROXY_CFG="/tmp/haproxy-shim.cfg"
 cat > "$HAPROXY_CFG" <<EOF
 global
  maxconn 4096
 defaults
  mode http
  timeout connect 5s
  timeout client 300s
  timeout server 300s
 frontend proxy
  bind ${HOST}:${PORT}
  acl is_metrics path /metrics
  http-request deny deny_status 200 if is_metrics
  errorfile 200 /tmp/haproxy-errors/200-empty.http
  acl is_health path /health
  acl sglang_up nbsrv(sglang) gt 0
  http-request deny deny_status 200 if is_health sglang_up
  http-request deny deny_status 503 if is_health
  errorfile 503 /tmp/haproxy-errors/503-sglang.http
  default_backend sglang
 backend sglang
  option httpchk GET /health
  http-check expect status 200
  server s1 127.0.0.1:${MIDDLEWARE_PORT} check inter 5s fall 3 rise 2
 EOF
 # ── Build and launch SGLang ─────────────────────────────────
 SGLANG_CMD=(
  python -m sglang.launch_server
  --model-path "$MODEL"
  --host "$HOST"
  --port "$SGLANG_PORT"
 )
 if [[ -n "$TOOL_CALL_PARSER" ]]; then
  SGLANG_CMD+=(--tool-call-parser "$TOOL_CALL_PARSER")
 fi
 SGLANG_CMD+=("${SGLANG_ARGS[@]}")
 echo "SGLang command: ${SGLANG_CMD[*]}"
 echo ""
 {
  echo "haproxy config written to ${HAPROXY_CFG}"
  echo "Model: ${MODEL}, SGLang port: ${SGLANG_PORT}, middleware port: ${MIDDLEWARE_PORT}, haproxy port: ${PORT}"
  echo "SGLang command: ${SGLANG_CMD[*]}"
  if [[ ${#SKIPPED_ARGS[@]} -gt 0 ]]; then
    echo "Skipped vLLM args: ${SKIPPED_ARGS[*]}"
  fi
 } >> "$LOG_PATH"
 # Launch SGLang
 "${SGLANG_CMD[@]}" &
 SGLANG_PID=$!
 # Launch middleware
 SGLANG_HOST="$HOST" SGLANG_PORT="$SGLANG_PORT" MIDDLEWARE_PORT="$MIDDLEWARE_PORT" \
  python /opt/vllm-shim/vllm_middleware.py &
 MIDDLEWARE_PID=$!
 sleep 2
 # Launch haproxy (front door on the original port)
 haproxy -f "$HAPROXY_CFG" &
 HAPROXY_PID=$!
 echo "SGLang PID: ${SGLANG_PID}, middleware PID: ${MIDDLEWARE_PID}, haproxy PID: ${HAPROXY_PID}" >> "$LOG_PATH"
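 # Wait for whichever dies first. The `|| EXIT_CODE=$?` form keeps `set -e`
 # from aborting the script before the cleanup below runs.
 EXIT_CODE=0
 wait -n "$SGLANG_PID" "$MIDDLEWARE_PID" "$HAPROXY_PID" || EXIT_CODE=$?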
 echo "A process exited (code ${EXIT_CODE}), shutting down" >> "$LOG_PATH"
 kill "$SGLANG_PID" "$MIDDLEWARE_PID" "$HAPROXY_PID" 2>/dev/null || true
 exit $EXIT_CODE
--- a/vllm_middleware.py
+++ b/vllm_middleware.py
@@ -0,0 +1,260 @@
 """
 vLLM → SGLang request middleware.
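 Sits between haproxy and SGLang to strip vLLM-only parameters
 that cause SGLang to return 422/400 errors, and to repair tool schemas.
 Currently strips: logprobs, top_logprobs, chat_template_kwargs,
 guided_json, guided_regex (see STRIP_PARAMS below).
 (SGLang's Mistral tool-call parser rejects logprobs/top_logprobs; vLLM accepts them.)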
 Architecture:
  haproxy (port N) → middleware (port N+2) → SGLang (port N+1)
 haproxy still handles /metrics stub and /health instant responses.
 This middleware only touches the proxied request bodies.
 """
 import json
 import os
 import asyncio
 import httpx
 from contextlib import asynccontextmanager
 from datetime import datetime
 from fastapi import FastAPI, Request
 from fastapi.responses import StreamingResponse, Response
 import uvicorn
 SGLANG_HOST = os.environ.get("SGLANG_HOST", "127.0.0.1")
 SGLANG_PORT = int(os.environ.get("SGLANG_PORT", "8001"))
 LISTEN_PORT = int(os.environ.get("MIDDLEWARE_PORT", "8002"))
 # Params that vLLM accepts but SGLang rejects.
 # Extend this set as more incompatibilities are discovered.
 STRIP_PARAMS = {"logprobs", "top_logprobs", "chat_template_kwargs", "guided_json", "guided_regex"}
 client: httpx.AsyncClient | None = None
 _sglang_ready = False
 # FastAPI expects the lifespan handler to be an async context manager.
 @asynccontextmanager
 async def _lifespan(app_instance):
    global client
    client = httpx.AsyncClient(
        timeout=httpx.Timeout(300.0, connect=10.0),
    )
    # Background task: wait for SGLang to become available
    asyncio.create_task(_wait_for_sglang())
    yield
    await client.aclose()
 async def _wait_for_sglang():
    """Poll SGLang until it's accepting connections, then mark ready."""
    global _sglang_ready
    while True:
        try:
            resp = await client.get(
                f"http://{SGLANG_HOST}:{SGLANG_PORT}/health",
                timeout=httpx.Timeout(5.0, connect=2.0),
            )
            if resp.status_code == 200:
                _sglang_ready = True
                print(f"Middleware: SGLang is ready at {SGLANG_HOST}:{SGLANG_PORT}")
                return
        except (httpx.ConnectError, httpx.TimeoutException):
            pass
        await asyncio.sleep(2)
 app = FastAPI(lifespan=_lifespan)
@app.get("/health")
 async def health():
    """Health check — haproxy polls this. Returns 200 only if SGLang is up."""
    global _sglang_ready
    if not _sglang_ready:
        return Response(content="SGLang not ready", status_code=503)
    try:
        resp = await client.get(
            f"http://{SGLANG_HOST}:{SGLANG_PORT}/health",
            timeout=httpx.Timeout(5.0, connect=2.0),
        )
        return Response(content=resp.content, status_code=resp.status_code,
                        media_type=resp.headers.get("content-type"))
    except (httpx.ConnectError, httpx.TimeoutException):
        _sglang_ready = False
        # Re-trigger background wait
        asyncio.create_task(_wait_for_sglang())
        return Response(content="SGLang not ready", status_code=503)
 ERROR_LOG = os.environ.get("VLLM_SHIM_LOG", "/tmp/vllm-shim.log")
 def _fix_schema(schema: dict) -> bool:
    """Recursively fix a JSON Schema dict: properties must be object, required must be list of strings."""
    fixed = False
    # Fix 'properties' — must be dict, not array/null
    if "properties" in schema and not isinstance(schema["properties"], dict):
        schema["properties"] = {}
        fixed = True
    # Fix 'required' — must be list of strings or absent
    if "required" in schema and not isinstance(schema["required"], list):
        del schema["required"]
        fixed = True
    # Recurse into every property value
    if isinstance(schema.get("properties"), dict):
        for val in schema["properties"].values():
            if isinstance(val, dict):
                if _fix_schema(val):
                    fixed = True
    # Recurse into items (for array-of-objects)
    if isinstance(schema.get("items"), dict):
        if _fix_schema(schema["items"]):
            fixed = True
    # Recurse into anyOf, allOf, oneOf
    for key in ("anyOf", "allOf", "oneOf"):
        if isinstance(schema.get(key), list):
            for item in schema[key]:
                if isinstance(item, dict):
                    if _fix_schema(item):
                        fixed = True
    # Recurse into additionalProperties if it's a schema
    if isinstance(schema.get("additionalProperties"), dict):
        if _fix_schema(schema["additionalProperties"]):
            fixed = True
    return fixed
 def _dump_error(request_body: bytes, status_code: int, resp_headers: dict, resp_body_raw: bytes, path: str = ""):
    """Log full request + response payload when SGLang returns an error (4xx/5xx)."""
    try:
        ts = datetime.now().isoformat()
        req_json = None
        try:
            req_json = json.loads(request_body)
        except (json.JSONDecodeError, UnicodeDecodeError):
            pass
        resp_text = resp_body_raw.decode("utf-8", errors="replace")[:4000]
        resp_json = None
        try:
            resp_json = json.loads(resp_text)
        except (json.JSONDecodeError, UnicodeDecodeError):
            pass
        with open(ERROR_LOG, "a") as f:
            f.write(f"\n{'='*60}\n")
            f.write(f"[{ts}] ERROR DUMP — SGLang returned HTTP {status_code}\n")
            f.write(f"Path: {path}\n")
            f.write(f"--- Request Body ---\n")
            if req_json:
                f.write(json.dumps(req_json, indent=2, ensure_ascii=False)[:8000])
            else:
                f.write(request_body.decode("utf-8", errors="replace")[:8000])
            f.write(f"\n--- Response (HTTP {status_code}) ---\n")
            if resp_json:
                f.write(json.dumps(resp_json, indent=2, ensure_ascii=False)[:4000])
            else:
                f.write(resp_text)
            f.write(f"\n{'='*60}\n")
        print(f"[{ts}] ERROR DUMP: HTTP {status_code} on {path} — full payload written to {ERROR_LOG}")
    except Exception as e:
        print(f"_dump_error failed: {e}")
@app.api_route("/{path:path}", methods=["GET", "POST", "PUT", "DELETE", "PATCH", "OPTIONS"])
 async def proxy(path: str, request: Request):
    body = await request.body()
    is_streaming = False
    # Strip incompatible params from chat completion POST requests
    if request.method == "POST" and "chat/completions" in path and body:
        try:
            data = json.loads(body)
            is_streaming = data.get("stream", False)
            stripped_any = False
            for key in STRIP_PARAMS:
                if key in data:
                    del data[key]
                    stripped_any = True
            # Fix tool function parameters: recurse to fix ALL bad properties/required
            tools = data.get("tools")
            if isinstance(tools, list):
                for tool in tools:
                    func = tool.get("function") if isinstance(tool, dict) else None
                    if not isinstance(func, dict):
                        continue
                    if not isinstance(func.get("parameters"), dict):
                        func["parameters"] = {"type": "object", "properties": {}}
                        stripped_any = True
                    if _fix_schema(func["parameters"]):
                        stripped_any = True
            if stripped_any:
                body = json.dumps(data).encode()
        except (json.JSONDecodeError, UnicodeDecodeError):
            pass
    # Forward headers (skip hop-by-hop and ones we're replacing)
    fwd_headers = {
        k: v for k, v in request.headers.items()
        if k.lower() not in ("host", "content-length", "transfer-encoding")
    }
    fwd_headers["content-length"] = str(len(body))
    url = f"http://{SGLANG_HOST}:{SGLANG_PORT}/{path}"
    if request.query_params:
        url += f"?{request.query_params}"
    try:
        if is_streaming:
            req = client.build_request(request.method, url, content=body, headers=fwd_headers)
            resp = await client.send(req, stream=True)
            # Dump on error for streaming responses
            if resp.status_code >= 400:
                error_body = await resp.aread()
                _dump_error(body, resp.status_code, resp_headers=dict(resp.headers), resp_body_raw=error_body, path=path)
                await resp.aclose()
                return Response(
                    content=error_body,
                    status_code=resp.status_code,
                    media_type=resp.headers.get("content-type"),
                )
            async def stream_body():
                try:
                    async for chunk in resp.aiter_bytes():
                        yield chunk
                finally:
                    await resp.aclose()
            return StreamingResponse(
                stream_body(),
                status_code=resp.status_code,
                headers={"content-type": resp.headers.get("content-type", "text/event-stream")},
            )
        else:
            resp = await client.request(request.method, url, content=body, headers=fwd_headers)
            # Dump on error
            if resp.status_code >= 400:
                _dump_error(body, resp.status_code, resp_headers=dict(resp.headers), resp_body_raw=resp.content, path=path)
            return Response(
                content=resp.content,
                status_code=resp.status_code,
                media_type=resp.headers.get("content-type"),
            )
    except (httpx.ConnectError, httpx.TimeoutException) as e:
        return Response(
            content=json.dumps({"error": {"message": f"SGLang backend unavailable: {e}", "type": "backend_error"}}),
            status_code=503,
            media_type="application/json",
        )
 if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=LISTEN_PORT, log_level="warning")
--- a/vllm_shim_module.py
+++ b/vllm_shim_module.py
@@ -1,15 +1,195 @@
 #!/usr/bin/env python3
 """
 vLLM -> SGLang Python shim.
 Catches `python -m vllm.entrypoints.openai.api_server` (and similar)
-and launches SGLang instead.
+and launches SGLang behind haproxy + middleware instead.
 Dynamically translates vLLM CLI args to SGLang equivalents.
 No hardcoded model name or tensor-parallel size.
 Architecture:
  haproxy on the vLLM port (front door)
    /metrics → 200 empty (stub)
    /health  → 200 if SGLang backend is up, 503 if not (instant)
    /*       → proxy to middleware on port+2
  middleware on port+2 (strips vLLM-only params, fixes tool schemas)
  SGLang on port+1 (internal)
 """
 import os
 import sys
 import subprocess
 import time
 import datetime
 # ── vLLM → SGLang argument mapping ──────────────────────────
 # Key = vLLM flag, value = (sglang_flag, has_value)
 # has_value=True means the flag takes an argument (e.g. --port 8000)
 # has_value=False means it's a boolean flag (e.g. --no-enable-prefix-caching)
 ARG_MAP = {
    # Direct renames (vLLM name → SGLang name)
    "--tensor-parallel-size":    ("--tp",                  True),
    "--gpu_memory_utilization":  ("--mem-fraction-static", True),
    "--gpu-memory-utilization":  ("--mem-fraction-static", True),  # kebab variant
    "--max_model_len":           ("--context-length",      True),
    "--max-model-len":           ("--context-length",      True),  # kebab variant
    "--enforce_eager":           (None, False),  # no direct SGLang equivalent; skip instead of inverting
    "--enforce-eager":           (None, False),  # kebab variant
    "--trust_remote_code":       ("--trust-remote-code",   False),
    "--trust-remote-code":       ("--trust-remote-code",   False),
    # vLLM flags with no SGLang equivalent → skip
    "--no-enable-prefix-caching": (None, False),
    "--enable-prefix-caching":    (None, False),
    "--enable-chunked-prefill":   (None, False),
    "--no-enable-chunked-prefill":(None, False),
    "--disable-log-requests":     (None, False),
    "--disable-log-stats":        (None, False),
    "--swap-space":               (None, True),
    "--block-size":               (None, True),
    "--num-gpu-blocks-override":  (None, True),
    "--num-cpu-blocks-override":  (None, True),
    "--max-num-seqs":             (None, True),
    "--max-num-batched-tokens":   (None, True),
    "--distributed-executor-backend": (None, True),
    "--pipeline-parallel-size":   (None, True),
    "--data-parallel-size":       (None, True),
    "--revision":                 (None, True),
    "--code-revision":            (None, True),
    "--tokenizer-revision":       (None, True),
    "--tokenizer-mode":           (None, True),
    "--quantization":             (None, True),
    "--dtype":                    (None, True),
    "--max-seq-len-to-capture":   (None, True),
    "--enable-lora":              (None, False),
    "--max-lora-rank":            (None, True),
    "--max-cpu-loras":            (None, True),
    "--lora-dtype":               (None, True),
    "--enable-prompt-adapter":    (None, False),
    "--scheduler-delay-factor":   (None, True),
    "--enable-multi-modal":       (None, False),
    "--limit-mm-per-prompt":      (None, True),
 }
 # Default tool-call-parser; override with SGLANG_TOOL_CALL_PARSER env var
 DEFAULT_TOOL_CALL_PARSER = "mistral"
 def parse_vllm_args(args):
    """
    Parse vLLM CLI args and extract model, host, port,
    plus any args we should translate to SGLang.
    Returns (model, host, port, sglang_extra_args, skipped_args).
    """
    model = None
    host = "0.0.0.0"
    port = "8000"
    sglang_extra = []  # translated args for SGLang
    skipped = []       # vLLM args we're ignoring
    i = 0
    while i < len(args):
        arg = args[i]
        # 'serve' subcommand — skip
        if arg == "serve":
            i += 1
            continue
        # Positional model argument (first non-flag after serve, or standalone)
        if not arg.startswith("-") and model is None:
            model = arg
            i += 1
            continue
        # --flag=value form
        if "=" in arg and arg.startswith("--"):
            flag, val = arg.split("=", 1)
            if flag == "--host":
                host = val
            elif flag == "--port":
                port = val
            elif flag in ARG_MAP:
                sglang_flag, has_val = ARG_MAP[flag]
                if sglang_flag is None:
                    skipped.append(arg)
                elif has_val:
                    sglang_extra.extend([sglang_flag, val])
                else:
                    # boolean flag with =value (unusual but valid)
                    sglang_extra.append(sglang_flag)
            else:
                # Unknown flag — pass through as-is (might be a SGLang flag too)
                sglang_extra.append(arg)
            i += 1
            continue
        # --flag value form
        if arg in ("--host",):
            if i + 1 < len(args):
                host = args[i + 1]
            i += 2
            continue
        if arg in ("--port",):
            if i + 1 < len(args):
                port = args[i + 1]
            i += 2
            continue
        if arg in ARG_MAP:
            sglang_flag, has_val = ARG_MAP[arg]
            if sglang_flag is None:
                skipped.append(arg)
                if has_val and i + 1 < len(args) and not args[i + 1].startswith("-"):
                    skipped.append(args[i + 1])
                    i += 2
                else:
                    i += 1
            elif has_val:
                if i + 1 < len(args):
                    sglang_extra.extend([sglang_flag, args[i + 1]])
                    i += 2
                else:
                    i += 1
            else:
                sglang_extra.append(sglang_flag)
                i += 1
            continue
        # --tool-call-parser — pass through to SGLang
        if arg == "--tool-call-parser":
            if i + 1 < len(args):
                sglang_extra.extend(["--tool-call-parser", args[i + 1]])
                i += 2
            else:
                i += 1
            continue
        # Unknown flag — pass through if it takes a value, might be valid for SGLang
        if arg.startswith("--") and i + 1 < len(args) and not args[i + 1].startswith("-"):
            sglang_extra.extend([arg, args[i + 1]])
            i += 2
        elif arg.startswith("--"):
            sglang_extra.append(arg)
            i += 1
        else:
            # Unknown positional — probably the model if we don't have it yet
            if model is None:
                model = arg
            i += 1
    return model, host, port, sglang_extra, skipped
 def main():
    args = sys.argv[1:]
    log_path = os.environ.get("VLLM_SHIM_LOG", "/tmp/vllm-shim.log")
    with open(log_path, "a") as f:
        f.write(f"\n{datetime.datetime.now().isoformat()} vLLM -> SGLang Shim (Python module)\n")
        f.write(f"  Invoked as: python -m {__name__} {' '.join(args)}\n")
        f.write("  All arguments received:\n")
        for i, arg in enumerate(args, 1):
            f.write(f"    [{i}] {arg}\n")
        f.write("\n")
    print()
    print("==========================================")
    print("  vLLM -> SGLang Shim (Python module)")
@@ -22,41 +202,135 @@ def main():
    print("==========================================")
    print()
-    host = "0.0.0.0"
-    port = "8000"
-    i = 0
-    while i < len(args):
-        if args[i] == "--host" and i + 1 < len(args):
-            host = args[i + 1]
-            i += 2
-        elif args[i].startswith("--host="):
-            host = args[i].split("=", 1)[1]
-            i += 1
-        elif args[i] == "--port" and i + 1 < len(args):
-            port = args[i + 1]
-            i += 2
-        elif args[i].startswith("--port="):
-            port = args[i].split("=", 1)[1]
-            i += 1
-        else:
-            i += 1
-    print(f"Launching SGLang on {host}:{port}")
+    model, host, port, sglang_extra, skipped = parse_vllm_args(args)
+    if not model:
+        print("ERROR: No model specified in vLLM args!")
+        os._exit(1)
+    # SGLang port scheme: original+1 = SGLang, original+2 = middleware
    sglang_port = str(int(port) + 1)
    middleware_port = str(int(port) + 2)
    # Build SGLang command
    sglang_cmd = [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", model,
        "--host", host,
        "--port", sglang_port,
    ]
    # Add tool-call-parser (env override or default)
    tcp = os.environ.get("SGLANG_TOOL_CALL_PARSER", DEFAULT_TOOL_CALL_PARSER)
    if tcp:
        sglang_cmd.extend(["--tool-call-parser", tcp])
    # Add translated/forwarded args
    sglang_cmd.extend(sglang_extra)
    print(f"Model: {model}")
    print(f"SGLang host: {host}:{sglang_port}")
    print(f"Middleware:  {host}:{middleware_port}")
    print(f"haproxy:    {host}:{port}")
    if sglang_extra:
        print(f"Translated args: {' '.join(sglang_extra)}")
    if skipped:
        print(f"Skipped (no SGLang equivalent): {' '.join(skipped)}")
    print()
    print(f"SGLang command: {' '.join(sglang_cmd)}")
    print()
-    os.execvp(
-        sys.executable,
-        [
-            sys.executable, "-m", "sglang.launch_server",
-            "--model-path", "mistralai/Devstral-2-123B-Instruct-2512",
-            "--host", host,
-            "--port", port,
-            "--tp", "8",
-            "--tool-call-parser", "mistral",
-        ],
+    # ── haproxy setup ────────────────────────────────────────
+
+    os.makedirs("/tmp/haproxy-errors", exist_ok=True)
+    with open("/tmp/haproxy-errors/200-empty.http", "w") as f:
+        f.write("HTTP/1.0 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n")
+    with open("/tmp/haproxy-errors/503-sglang.http", "w") as f:
+        f.write("HTTP/1.0 503 Service Unavailable\r\nContent-Length: 16\r\nConnection: close\r\nContent-Type: text/plain\r\n\r\nSGLang not ready")
+
+    haproxy_cfg = "/tmp/haproxy-shim.cfg"
+    with open(haproxy_cfg, "w") as f:
        f.write(f"""global
  maxconn 4096
 defaults
  mode http
  timeout connect 5s
  timeout client 300s
  timeout server 300s
 frontend proxy
  bind {host}:{port}
  # /metrics stub — instant 200 empty (vLLM stack expects this)
  acl is_metrics path /metrics
  http-request deny deny_status 200 if is_metrics
  errorfile 200 /tmp/haproxy-errors/200-empty.http
  # /health — instant response based on SGLang backend state
  acl is_health path /health
  acl sglang_up nbsrv(sglang) gt 0
  http-request deny deny_status 200 if is_health sglang_up
  http-request deny deny_status 503 if is_health
  errorfile 503 /tmp/haproxy-errors/503-sglang.http
  default_backend sglang
 backend sglang
  option httpchk GET /health
  http-check expect status 200
  server s1 127.0.0.1:{middleware_port} check inter 5s fall 3 rise 2
 """)
    with open(log_path, "a") as f:
        f.write(f"haproxy config written to {haproxy_cfg}\n")
        f.write(f"Model: {model}, SGLang port: {sglang_port}, middleware port: {middleware_port}, haproxy port: {port}\n")
        f.write(f"SGLang command: {' '.join(sglang_cmd)}\n")
        if skipped:
            f.write(f"Skipped vLLM args: {' '.join(skipped)}\n")
    # ── Launch processes ─────────────────────────────────────
    sglang_proc = subprocess.Popen(sglang_cmd)
    middleware_env = os.environ.copy()
    middleware_env["SGLANG_HOST"] = host
    middleware_env["SGLANG_PORT"] = sglang_port
    middleware_env["MIDDLEWARE_PORT"] = middleware_port
    middleware_proc = subprocess.Popen(
        [sys.executable, "/opt/vllm-shim/vllm_middleware.py"],
        env=middleware_env,
    )
    time.sleep(2)
    haproxy_proc = subprocess.Popen(["haproxy", "-f", haproxy_cfg])
    with open(log_path, "a") as f:
        f.write(f"SGLang PID: {sglang_proc.pid}, middleware PID: {middleware_proc.pid}, haproxy PID: {haproxy_proc.pid}\n")
    # Wait for whichever dies first
    while True:
        sglang_ret = sglang_proc.poll()
        middleware_ret = middleware_proc.poll()
        haproxy_ret = haproxy_proc.poll()
        if sglang_ret is not None:
            print(f"SGLang exited (code {sglang_ret}), shutting down")
            middleware_proc.terminate()
            haproxy_proc.terminate()
            os._exit(sglang_ret)
        if middleware_ret is not None:
            print(f"Middleware exited (code {middleware_ret}), shutting down")
            sglang_proc.terminate()
            haproxy_proc.terminate()
            os._exit(middleware_ret)
        if haproxy_ret is not None:
            print(f"haproxy exited (code {haproxy_ret}), shutting down")
            sglang_proc.terminate()
            middleware_proc.terminate()
            os._exit(haproxy_ret)
        time.sleep(1)
 if __name__ == "__main__":
    main()
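
The translation loop in `parse_vllm_args` depends on an `ARG_MAP` table that is not shown in this hunk. As a rough sketch of the idea only, the snippet below uses a hypothetical `EXAMPLE_ARG_MAP` and a simplified `translate` helper (both names are invented here, and the map entries are illustrative guesses rather than the shim's real table). It shows how `--flag value` and `--flag=value` forms can be translated, skipped, or passed through.

```python
# Hypothetical mapping: vLLM flag -> (SGLang flag or None, takes_value).
# Illustrative assumptions only; the shim's real ARG_MAP is not shown here.
EXAMPLE_ARG_MAP = {
    "--tensor-parallel-size": ("--tp", True),
    "--max-model-len": ("--context-length", True),
    "--trust-remote-code": ("--trust-remote-code", False),
    "--enable-prefix-caching": (None, False),  # treated as "no equivalent"
}


def translate(args, arg_map):
    """Return (translated, skipped) lists for a vLLM-style argument list."""
    translated, skipped = [], []
    i = 0
    while i < len(args):
        arg = args[i]
        # Split "--flag=value"; eq is "" when there is no "=" in the token.
        flag, eq, value = arg.partition("=")
        if flag not in arg_map:
            translated.append(arg)  # unknown token: forward unchanged
            i += 1
            continue
        target, takes_value = arg_map[flag]
        if takes_value and not eq:
            if i + 1 >= len(args):
                i += 1
                continue  # value missing: drop the dangling flag
            value = args[i + 1]  # "--flag value" form: consume next token
            i += 1
        i += 1
        if target is None:
            skipped.append(arg)
        elif takes_value:
            translated.extend([target, value])
        else:
            translated.append(target)
    return translated, skipped


translated, skipped = translate(
    ["--tensor-parallel-size", "8", "--max-model-len=4096",
     "--enable-prefix-caching", "--trust-remote-code", "--foo"],
    EXAMPLE_ARG_MAP,
)
print(translated)
print(skipped)
```

Unlike the real function, this sketch does not extract the model, host, or port; it only isolates the table-driven translation step.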
Author	SHA1	Message	Date
biondizzle	7d9c4da2ee	not sure why we have a default tool parser	2026-04-13 17:49:44 +00:00
biondizzle	efc9dc33e7	dynamic arg translation, remove entrypoint.sh, update README	2026-04-12 21:23:26 +00:00
biondizzle	7c1ed0408b	fix: recursive _fix_schema to handle nested properties=[] at any depth	2026-04-12 20:52:44 +00:00
biondizzle	a9911386e0	strip guided_json, guided_regex too; fix parameters.properties array	2026-04-12 20:27:44 +00:00
biondizzle	ccedd3ecee	fix: add chat_template_kwargs to STRIP_PARAMS, fix parameters.properties array	2026-04-12 20:23:10 +00:00
biondizzle	c66511e16f	fix: handle parameters.properties being array, not just parameters itself	2026-04-12 20:17:06 +00:00
biondizzle	e03e41eb4f	fix vLLM/SGLang schema mismatch	2026-04-12 19:57:47 +00:00
biondizzle	7ecbac2dc0	Fix UnboundLocalError in health(), switch from on_event to lifespan	2026-04-12 19:41:08 +00:00
biondizzle	774964a4db	Add error dump logging: capture full request+response on 4xx/5xx from SGLang	2026-04-12 19:28:04 +00:00
biondizzle	db9231f796	Fix middleware: handle SGLang startup lag gracefully - Add /health endpoint that returns 503 until SGLang is ready - Background task polls SGLang until it accepts connections - Catch ConnectError/TimeoutException instead of crashing - Return 503 JSON error when SGLang backend is unavailable - haproxy health-checks middleware /health, which reflects SGLang state	2026-04-12 19:06:38 +00:00
biondizzle	bbe40ac8c0	Add middleware to strip vLLM-only params (logprobs/top_logprobs) before forwarding to SGLang SGLang's Mistral tool-call parser rejects logprobs/top_logprobs with 422, while vLLM accepts them. Clients like OpenClaw send these by default. New architecture: haproxy (port N) → middleware (port N+2) → SGLang (port N+1) The middleware is a thin FastAPI app that strips incompatible params from chat completion request bodies and passes everything else through unchanged.	2026-04-12 18:58:37 +00:00
biondizzle	359aa94337	Update README: haproxy proxy layer, /health probe fix, current state	2026-04-12 18:27:06 +00:00
biondizzle	6476c9c12a	fix: content-length 16 not 15, remove 'timeout check' (not valid in haproxy 2.4 server line)	2026-04-12 17:29:08 +00:00
biondizzle	725e61d792	fix: haproxy 2.4 compat — use errorfile instead of http-request return haproxy 2.4 (Ubuntu 22.04) doesn't support http-request return with payload/content-type syntax (that's 2.8+). Switch to errorfile-based stub responses: http-request deny deny_status N + errorfile N path.	2026-04-12 17:26:45 +00:00
biondizzle	1ddc08c88b	haproxy: intercept /health too — instant response based on backend state SGLang's /health takes ~1.001s, racing the 1s k8s probe timeout. Now haproxy health-checks SGLang in the background (5s interval, 3s check timeout) and responds to /health probes instantly: 200 if backend is up, 503 if not.	2026-04-12 17:21:04 +00:00
biondizzle	7fb373fdfc	Add haproxy proxy: /metrics returns 200 empty, everything else proxies to SGLang SGLang now runs on port+1, haproxy binds the original vLLM port. haproxy serves a stub /metrics endpoint (200, empty body) and passes all other traffic through to SGLang via raw TCP proxy.	2026-04-12 17:09:58 +00:00
biondizzle	dd3a981497	Log all received args to /tmp/vllm-shim.log	2026-04-12 04:37:24 +00:00
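
Several of the commits above describe what the middleware does to a chat completion request before forwarding it to SGLang: it strips vLLM-only parameters and repairs tool schemas where `properties` arrives as an array. The middleware source is not part of this section, so the following is only a minimal sketch of that behaviour under stated assumptions. The parameter names come from the commit messages; `fix_schema` and `clean_request` are hypothetical names, and treating the fix as "replace an empty `properties` array with an empty object" is an assumption.

```python
# Parameter names taken from the commit messages in the list above.
STRIP_PARAMS = {
    "logprobs", "top_logprobs", "chat_template_kwargs",
    "guided_json", "guided_regex",
}


def fix_schema(node):
    """Recursively replace `"properties": []` with `"properties": {}`.

    Assumption: the schema mismatch is an empty array where an object
    is expected. Returns a new structure; the input is not mutated.
    """
    if isinstance(node, dict):
        return {
            key: {} if key == "properties" and value == [] else fix_schema(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [fix_schema(item) for item in node]
    return node


def clean_request(body):
    """Drop stripped params and repair tool schemas in a request body."""
    cleaned = {k: v for k, v in body.items() if k not in STRIP_PARAMS}
    if "tools" in cleaned:
        cleaned["tools"] = fix_schema(cleaned["tools"])
    return cleaned


request = {
    "model": "example-model",  # placeholder name
    "logprobs": True,
    "top_logprobs": 5,
    "tools": [{
        "type": "function",
        "function": {
            "name": "noop",
            "parameters": {"type": "object", "properties": []},
        },
    }],
}
print(clean_request(request))
```

Building a new dictionary instead of editing the body in place keeps the original request available, for example for the error-dump logging mentioned in commit 774964a4db.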