[Docs] Add RunPod GPU deployment guide for vLLM (#34531)

Signed-off-by: lisperz <zhuchen200245@163.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-04 12:11:34 -06:00
parent 2f2c1d73a7
commit 138c5fa186
1 changed file with 87 additions and 0 deletions
--- a/docs/deployment/frameworks/runpod.md
+++ b/docs/deployment/frameworks/runpod.md
@@ -0,0 +1,87 @@
+# RunPod
+
+vLLM can be deployed on [RunPod](https://www.runpod.io/), a cloud GPU platform that provides on-demand and serverless GPU instances for AI inference workloads.
+
+## Prerequisites
+
+- A RunPod account with GPU pod access
+- A GPU pod running a CUDA-compatible template (e.g., `runpod/pytorch`)
+
+## Starting the Server
+
+SSH into your RunPod pod, install vLLM if the template does not already include it (for example, `pip install vllm`), and launch the vLLM OpenAI-compatible server:
+
+```bash
+python -m vllm.entrypoints.openai.api_server \
+    --model <model-name> \
+    --host 0.0.0.0 \
+    --port 8000
+```
+
+!!! note
+
+    Use `--host 0.0.0.0` to bind to all interfaces so the server is reachable from outside the container.
+
+## Exposing Port 8000
+
+RunPod exposes HTTP services through its proxy. To make port 8000 accessible:
+
+1. In the RunPod dashboard, navigate to your pod settings.
+2. Add `8000` to the list of exposed HTTP ports.
+3. After the pod restarts, RunPod provides a public URL in the format:
+
+    ```text
+    https://<pod-id>-8000.proxy.runpod.net
+    ```
+
+## Troubleshooting 502 Bad Gateway
+
+A `502 Bad Gateway` error from the RunPod proxy typically means the proxy cannot reach a server listening on the exposed port. Common causes:
+
+- **Model still loading**: Large models take time to download and load into GPU memory. Check the pod logs for progress.
+- **Wrong host binding**: Ensure you passed `--host 0.0.0.0`. A server bound only to a loopback address such as `127.0.0.1` is unreachable from the proxy.
+- **Port mismatch**: Verify the `--port` value matches the port exposed in the RunPod dashboard.
+- **Out of GPU memory**: The model may be too large for the allocated GPU. Check the logs for CUDA OOM errors and consider using a larger instance or, on multi-GPU pods, sharding the model with `--tensor-parallel-size`.
+
+## Verifying the Deployment
+
+Once the server is running, test it with a curl request:
+
+!!! console "Command"
+
+    ```bash
+    curl https://<pod-id>-8000.proxy.runpod.net/v1/chat/completions \
+        -H "Content-Type: application/json" \
+        -d '{
+            "model": "<model-name>",
+            "messages": [
+                {"role": "user", "content": "Hello, how are you?"}
+            ],
+            "max_tokens": 50
+        }'
+    ```
+
+!!! console "Response"
+
+    ```json
+    {
+        "id": "chat-abc123",
+        "object": "chat.completion",
+        "choices": [
+            {
+                "message": {
+                    "role": "assistant",
+                    "content": "I'm doing well, thank you for asking! How can I help you today?"
+                },
+                "index": 0,
+                "finish_reason": "stop"
+            }
+        ]
+    }
+    ```
+
+You can also check the server health endpoint:
+
+```bash
+curl https://<pod-id>-8000.proxy.runpod.net/health
+```
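
The health check at the end of the guide can also be scripted, which helps tell "model still loading" apart from a persistent `502`. The following is a minimal sketch, not part of the commit: it assumes the proxy URL format shown in the guide and that `/health` answers with HTTP 200 once the server is ready. The helper names (`proxy_url`, `http_status`, `wait_until_healthy`) and the retry defaults are hypothetical and chosen for illustration only.

```python
import time
import urllib.error
import urllib.request


def proxy_url(pod_id: str, port: int = 8000) -> str:
    """Build the RunPod proxy URL in the format shown in the guide."""
    return f"https://{pod_id}-{port}.proxy.runpod.net"


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET request, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The proxy answered with an error status, e.g. 502 while loading.
        return err.code
    except OSError:
        # Covers URLError, refused connections, DNS failures, and timeouts.
        return 0


def wait_until_healthy(base_url, attempts=60, delay=10.0,
                       probe=http_status, sleep=time.sleep) -> bool:
    """Poll <base_url>/health until it returns 200 or attempts run out.

    `probe` and `sleep` are injectable so the loop can be exercised
    without network access or real waiting.
    """
    for attempt in range(1, attempts + 1):
        if probe(f"{base_url}/health") == 200:
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

A call such as `wait_until_healthy(proxy_url("<pod-id>"))` would then block until the server reports healthy or roughly ten minutes have passed with the assumed defaults. If it returns `False`, the troubleshooting list in the guide is the next place to look.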