# RunPod
vLLM can be deployed on [RunPod](https://www.runpod.io/), a cloud GPU platform that provides on-demand and serverless GPU instances for AI inference workloads.
## Prerequisites
- A RunPod account with GPU pod access
- A GPU pod running a CUDA-compatible template (e.g., `runpod/pytorch`)
## Starting the Server
SSH into your RunPod pod and launch the vLLM OpenAI-compatible server:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model <model-name> \
    --host 0.0.0.0 \
    --port 8000
```
!!! note
    Use `--host 0.0.0.0` to bind to all interfaces so the server is reachable from outside the container.
## Exposing Port 8000
RunPod exposes HTTP services through its proxy. To make port 8000 accessible:
1. In the RunPod dashboard, navigate to your pod settings.
2. Add `8000` to the list of exposed HTTP ports.
3. After the pod restarts, RunPod provides a public URL in the format `https://<pod-id>-8000.proxy.runpod.net`.
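If you script against the pod, the proxy URL can be derived from the pod ID and port. A minimal sketch (the helper name `proxy_url` is hypothetical, not part of any RunPod SDK):

```python
def proxy_url(pod_id: str, port: int = 8000, path: str = "") -> str:
    """Build a RunPod proxy URL for an exposed HTTP port.

    Follows the https://<pod-id>-<port>.proxy.runpod.net pattern
    described above.
    """
    return f"https://{pod_id}-{port}.proxy.runpod.net/{path.lstrip('/')}"

# Example: chat completions endpoint for a pod with ID "abc123"
url = proxy_url("abc123", 8000, "/v1/chat/completions")
print(url)  # https://abc123-8000.proxy.runpod.net/v1/chat/completions
```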
## Troubleshooting 502 Bad Gateway
A 502 Bad Gateway error from the RunPod proxy typically means the server is not yet listening. Common causes:
- **Model still loading**: Large models take time to download and load into GPU memory. Check the pod logs for progress.
- **Wrong host binding**: Ensure you passed `--host 0.0.0.0`. Binding to `127.0.0.1` (the default) makes the server unreachable from the proxy.
- **Port mismatch**: Verify the `--port` value matches the port exposed in the RunPod dashboard.
- **Out of GPU memory**: The model may be too large for the allocated GPU. Check the logs for CUDA OOM errors and consider using a larger instance or adding `--tensor-parallel-size` on multi-GPU pods.
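Since 502s are expected while the model loads, a readiness loop that polls `/health` until the server answers can save manual retries. A minimal sketch with the HTTP probe injected as a callable so the loop itself needs no live pod (function names here are illustrative, not vLLM or RunPod APIs; in practice the probe would GET `https://<pod-id>-8000.proxy.runpod.net/health` and return `True` on a 200 response):

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     attempts: int = 30,
                     delay: float = 2.0) -> bool:
    """Call `probe` until it returns True or `attempts` run out.

    A 502 from the RunPod proxy maps to `probe` returning False,
    i.e. "keep waiting".
    """
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```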
## Verifying the Deployment
Once the server is running, test it with a curl request:
!!! console "Command"
    ```bash
    curl https://<pod-id>-8000.proxy.runpod.net/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "<model-name>",
            "messages": [
                {"role": "user", "content": "Hello, how are you?"}
            ],
            "max_tokens": 50
        }'
    ```
!!! console "Response"
    ```json
    {
      "id": "chat-abc123",
      "object": "chat.completion",
      "choices": [
        {
          "message": {
            "role": "assistant",
            "content": "I'm doing well, thank you for asking! How can I help you today?"
          },
          "index": 0,
          "finish_reason": "stop"
        }
      ]
    }
    ```
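The assistant's reply can be pulled out of a response like the one above with standard JSON parsing; a small sketch assuming the OpenAI-style chat-completion shape (the `extract_reply` helper is illustrative, not a vLLM API):

```python
import json


def extract_reply(response_body: str) -> str:
    """Return the assistant message from an OpenAI-style chat completion."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


# Using the sample response shown above:
sample = '''{"id": "chat-abc123", "object": "chat.completion",
 "choices": [{"message": {"role": "assistant",
  "content": "I'm doing well, thank you for asking! How can I help you today?"},
  "index": 0, "finish_reason": "stop"}]}'''
print(extract_reply(sample))
```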
You can also check the server health endpoint:
```bash
curl https://<pod-id>-8000.proxy.runpod.net/health
```