All communications between nodes in a multi-node vLLM deployment are **insecure by default** and must be protected by placing the nodes on an isolated network. This includes:
1. PyTorch Distributed communications
2. KV cache transfer communications
3. Tensor, Pipeline, and Data parallel communications
### Scope of API Key Authentication
The `--api-key` flag (or `VLLM_API_KEY` environment variable) provides authentication for vLLM's HTTP server, but **only for OpenAI-compatible API endpoints under the `/v1` path prefix**. Many other sensitive endpoints are exposed on the same HTTP server without any authentication enforcement.
**Important:** Do not rely exclusively on `--api-key` for securing access to vLLM. Additional security measures are required for production deployments.
### Protected Endpoints (Require API Key)
When `--api-key` is configured, the following `/v1` endpoints require Bearer token authentication:
- `/v1/models` - List available models
- `/v1/chat/completions` - Chat completions
- `/v1/completions` - Text completions
- `/v1/embeddings` - Generate embeddings
- `/v1/audio/transcriptions` - Audio transcription
- `/v1/audio/translations` - Audio translation
- `/v1/messages` - Anthropic-compatible messages API
- `/v1/responses` - Responses API
- `/v1/score` - Scoring API
- `/v1/rerank` - Reranking API
### Unprotected Endpoints (No API Key Required)
The following endpoints **do not require authentication** even when `--api-key` is configured:
**Inference endpoints:**
- `/invocations` - SageMaker-compatible endpoint (routes to the same inference functions as `/v1` endpoints)
- `/inference/v1/generate` - Generate completions
- `/pooling` - Pooling API
- `/classify` - Classification API
- `/score` - Scoring API (non-`/v1` variant)
- `/rerank` - Reranking API (non-`/v1` variant)
**Operational control endpoints (always enabled):**
- `/pause` - Pause generation (causes denial of service)
- `/resume` - Resume generation
- `/scale_elastic_ep` - Trigger scaling operations
**Utility endpoints:**
- `/tokenize` - Tokenize text
- `/detokenize` - Detokenize tokens
- `/health` - Health check
- `/ping` - SageMaker health check
- `/version` - Version information
- `/load` - Server load metrics
**Tokenizer information endpoint (only when `--enable-tokenizer-info-endpoint` is set):**
This endpoint is only available when the `--enable-tokenizer-info-endpoint` flag is set. It may expose sensitive information such as chat templates and tokenizer configuration:
- `/tokenizer_info` - Get comprehensive tokenizer information, including chat templates and configuration
**Development endpoints (only when `VLLM_SERVER_DEV_MODE=1`):**
These endpoints are **only available when the environment variable `VLLM_SERVER_DEV_MODE` is set to `1`**. They are intended for development and debugging purposes and should never be enabled in production:
- `/server_info` - Get detailed server configuration
- `/sleep` - Put engine to sleep (causes denial of service)
- `/wake_up` - Wake engine from sleep
- `/is_sleeping` - Check if engine is sleeping
- `/collective_rpc` - Execute arbitrary RPC methods on the engine (extremely dangerous)
**Profiler endpoints (only when `VLLM_TORCH_PROFILER_DIR` or `VLLM_TORCH_CUDA_PROFILE` are set):**
These endpoints are only available when profiling is enabled and should only be used for local development:
- `/start_profile` - Start PyTorch profiler
- `/stop_profile` - Stop PyTorch profiler
**Note:** The `/invocations` endpoint is particularly concerning as it provides unauthenticated access to the same inference capabilities as the protected `/v1` endpoints.
### Security Implications
An attacker who can reach the vLLM HTTP server can:
1. **Bypass authentication** by using non-`/v1` endpoints like `/invocations`, `/inference/v1/generate`, `/pooling`, `/classify`, `/score`, or `/rerank` to run arbitrary inference without credentials
2. **Cause denial of service** by calling `/pause` or `/scale_elastic_ep` without a token
3. **Access operational controls** to manipulate server state (e.g., pausing generation)
4. **If `--enable-tokenizer-info-endpoint` is set:** Access sensitive tokenizer configuration including chat templates, which may reveal prompt engineering strategies or other implementation details
5. **If `VLLM_SERVER_DEV_MODE=1` is set:** Execute arbitrary RPC commands via `/collective_rpc`, reset caches, put the engine to sleep, and access detailed server configuration
### Recommended Security Practices
#### 1. Minimize Exposed Endpoints
**CRITICAL:** Never set `VLLM_SERVER_DEV_MODE=1` in production environments. Development endpoints expose extremely dangerous functionality including:
- Arbitrary RPC execution via `/collective_rpc`
- Cache manipulation that can disrupt service
- Detailed server configuration disclosure
Similarly, never enable profiler endpoints (`VLLM_TORCH_PROFILER_DIR` or `VLLM_TORCH_CUDA_PROFILE`) in production.
**Be cautious with `--enable-tokenizer-info-endpoint`:** Only enable the `/tokenizer_info` endpoint if you need to expose tokenizer configuration information. This endpoint reveals chat templates and tokenizer settings that may contain sensitive implementation details or prompt engineering strategies.
#### 2. Deploy Behind a Reverse Proxy
The most effective approach is to deploy vLLM behind a reverse proxy (such as nginx, Envoy, or a Kubernetes Gateway) that:
- Explicitly allowlists only the endpoints you want to expose to end users
- Blocks all other endpoints, including the unauthenticated inference and operational control endpoints
- Implements additional authentication, rate limiting, and logging at the proxy layer
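The routing decision such a proxy makes can be sketched as a deny-by-default predicate. The endpoint names come from this document; the function itself is hypothetical, and a real deployment would enforce this in the proxy's own configuration (e.g., nginx `location` blocks or Envoy route rules) rather than in application code:

```python
# Hypothetical allowlist predicate for a reverse proxy in front of vLLM.
# Only endpoints intended for end users are forwarded; everything else,
# including unauthenticated inference and control endpoints, is rejected.
ALLOWED_PATHS = {
    "/v1/models",
    "/v1/chat/completions",
    "/v1/completions",
    "/health",
}

def should_forward(path: str) -> bool:
    """Deny by default: forward only explicitly allowlisted paths."""
    return path in ALLOWED_PATHS

assert should_forward("/v1/chat/completions")
assert not should_forward("/invocations")     # auth-bypass endpoint blocked
assert not should_forward("/pause")           # control endpoint blocked
assert not should_forward("/collective_rpc")  # dev endpoint blocked
```

Allowlisting (rather than blocklisting known-dangerous paths) is the safer design choice here, because new unauthenticated endpoints added in future vLLM releases stay blocked by default.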
If you believe you have found a security vulnerability in vLLM, please report it following the project's security policy. For details on how to report security issues, see the [vLLM Security Policy](https://github.com/vllm-project/vllm/blob/main/SECURITY.md).