Migrate docs from Sphinx to MkDocs (#18145)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
114
docs/serving/distributed_serving.md
Normal file
@@ -0,0 +1,114 @@
|
||||
---
|
||||
title: Distributed Inference and Serving
|
||||
---
|
||||
[](){ #distributed-serving }
|
||||
|
||||
## How to decide the distributed inference strategy?
|
||||
|
||||
Before going into the details of distributed inference and serving, let's first clarify when to use distributed inference and which strategies are available. The common practice is:
|
||||
|
||||
- **Single GPU (no distributed inference)**: If your model fits in a single GPU, you probably don't need to use distributed inference. Just use the single GPU to run the inference.
|
||||
- **Single-Node Multi-GPU (tensor parallel inference)**: If your model is too large to fit in a single GPU, but it can fit in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, you can set the tensor parallel size to 4.
|
||||
- **Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallel together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2.
|
||||
|
||||
In short, you should increase the number of GPUs and the number of nodes until you have enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes.
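The rule of thumb above can be written down as a small helper. This is only a sketch: `suggest_parallel_sizes` is a hypothetical name (not part of vLLM), and it assumes every node has the same number of GPUs.

```python
def suggest_parallel_sizes(gpus_per_node: int, num_nodes: int) -> dict:
    """Return the parallel sizes suggested by the rule of thumb.

    Hypothetical helper (not a vLLM API); assumes identical nodes.
    """
    if gpus_per_node < 1 or num_nodes < 1:
        raise ValueError("gpus_per_node and num_nodes must be positive")
    return {
        "tensor_parallel_size": gpus_per_node,  # one tensor-parallel rank per GPU in a node
        "pipeline_parallel_size": num_nodes,    # one pipeline stage per node
    }


# 16 GPUs in 2 nodes (8 GPUs per node), as in the example above
print(suggest_parallel_sizes(gpus_per_node=8, num_nodes=2))
```

The returned values correspond to the `--tensor-parallel-size` and `--pipeline-parallel-size` arguments of `vllm serve`.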
|
||||
|
||||
After adding enough GPUs and nodes to hold the model, run vLLM first; it will print logs such as `# GPU blocks: 790`. Multiply that number by `16` (the block size) to get a rough estimate of the maximum number of tokens that can be served with the current configuration. If this number is not satisfactory, e.g. you want higher throughput, add more GPUs or nodes until the number of blocks is sufficient.
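As a quick worked example of this arithmetic, the sketch below multiplies the reported block count by the block size. The function name `estimate_max_tokens` is made up for illustration; the numbers are the ones quoted above.

```python
BLOCK_SIZE = 16  # tokens per block, as stated above


def estimate_max_tokens(num_gpu_blocks: int, block_size: int = BLOCK_SIZE) -> int:
    """Rough estimate of how many tokens fit in the available GPU blocks."""
    return num_gpu_blocks * block_size


# "# GPU blocks: 790" from the log line above
print(estimate_max_tokens(790))  # 790 * 16 = 12640
```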
|
||||
|
||||
!!! note
    There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
|
||||
|
||||
## Running vLLM on a single node
|
||||
|
||||
vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf). We manage the distributed runtime with either [Ray](https://github.com/ray-project/ray) or Python-native multiprocessing. Multiprocessing can be used when deploying on a single node; multi-node inference currently requires Ray.
|
||||
|
||||
Multiprocessing is used by default when vLLM is not running in a Ray placement group and enough GPUs are available on the same node for the configured `tensor_parallel_size`; otherwise Ray is used. This default can be overridden via the `distributed_executor_backend` argument of the `LLM` class or the `--distributed-executor-backend` API server argument. Set it to `mp` for multiprocessing or `ray` for Ray. Ray does not need to be installed for the multiprocessing case.
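A minimal sketch of overriding the backend explicitly is shown below. The helper `distributed_kwargs` is hypothetical and only assembles keyword arguments; the argument names and the `mp`/`ray` values are the ones described above. The actual `LLM(...)` call is left commented out because it needs vLLM and the GPUs to be present.

```python
VALID_BACKENDS = ("mp", "ray")  # values named in the paragraph above


def distributed_kwargs(tensor_parallel_size: int, backend: str) -> dict:
    """Build keyword arguments for `vllm.LLM` with an explicit executor backend."""
    if backend not in VALID_BACKENDS:
        raise ValueError(f"backend must be one of {VALID_BACKENDS}, got {backend!r}")
    return {
        "tensor_parallel_size": tensor_parallel_size,
        "distributed_executor_backend": backend,
    }


kwargs = distributed_kwargs(tensor_parallel_size=4, backend="mp")
print(kwargs)

# Usage (requires vLLM and 4 GPUs on this node):
# from vllm import LLM
# llm = LLM("facebook/opt-13b", **kwargs)
```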
|
||||
|
||||
To run multi-GPU inference with the `LLM` class, set the `tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:
|
||||
|
||||
```python
from vllm import LLM

llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
```
|
||||
|
||||
To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs:
|
||||
|
||||
```console
vllm serve facebook/opt-13b \
    --tensor-parallel-size 4
```
|
||||
|
||||
You can additionally specify `--pipeline-parallel-size` to enable pipeline parallelism. For example, to run the API server on 8 GPUs with pipeline parallelism and tensor parallelism:
|
||||
|
||||
```console
vllm serve gpt2 \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 2
```
|
||||
|
||||
## Running vLLM on multiple nodes
|
||||
|
||||
If a single node does not have enough GPUs to hold the model, you can run the model using multiple nodes. It is important to make sure the execution environment is the same on all nodes, including the model path and the Python environment. The recommended way is to use docker images to ensure the same environment, and to hide the heterogeneity of the host machines by mapping them into the same docker configuration.
|
||||
|
||||
The first step is to start containers and organize them into a cluster. We provide the helper script <gh-file:examples/online_serving/run_cluster.sh> to start the cluster. Note that this script launches docker without the administrative privileges that would be required to access GPU performance counters when running profiling and tracing tools. For that purpose, you can add `CAP_SYS_ADMIN` to the docker container by using the `--cap-add` option in the docker run command.
|
||||
|
||||
Pick a node as the head node, and run the following command:
|
||||
|
||||
```console
bash run_cluster.sh \
    vllm/vllm-openai \
    ip_of_head_node \
    --head \
    /path/to/the/huggingface/home/in/this/node \
    -e VLLM_HOST_IP=ip_of_this_node
```
|
||||
|
||||
On the rest of the worker nodes, run the following command:
|
||||
|
||||
```console
bash run_cluster.sh \
    vllm/vllm-openai \
    ip_of_head_node \
    --worker \
    /path/to/the/huggingface/home/in/this/node \
    -e VLLM_HOST_IP=ip_of_this_node
```
|
||||
|
||||
Then you get a Ray cluster of **containers**. Note that you need to keep the shells running these commands alive to hold the cluster; any shell disconnect will terminate the cluster. In addition, the argument `ip_of_head_node` should be the IP address of the head node, which must be accessible by all the worker nodes. The IP address of each worker node should be specified in the `VLLM_HOST_IP` environment variable and should be different for each worker node. Please check the network configuration of your cluster to make sure the nodes can communicate with each other through the specified IP addresses.
|
||||
|
||||
!!! warning
    It is considered best practice to set `VLLM_HOST_IP` to an address on a private network segment for the vLLM cluster. The traffic sent here is not encrypted. The endpoints are also exchanging data in a format that could be exploited to execute arbitrary code should a malicious party gain access to the network. Please ensure that this network is not reachable by any untrusted parties.
|
||||
|
||||
!!! warning
    Since this is a Ray cluster of **containers**, all the following commands should be executed in the **containers**; otherwise you are executing the commands on the host machine, which is not connected to the Ray cluster. To enter the container, you can use `docker exec -it node /bin/bash`.
|
||||
|
||||
Then, on any node, use `docker exec -it node /bin/bash` to enter the container, execute `ray status` and `ray list nodes` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
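The same check can be scripted from Python inside the container. The sketch below is an assumption-laden illustration: `summarize_cluster` and `fetch_cluster_summary` are hypothetical helpers, and they assume that `ray.nodes()` returns dictionaries with an `Alive` key and that `ray.cluster_resources()` reports a `GPU` entry. The demo at the bottom uses hand-written data, so it runs without Ray.

```python
def summarize_cluster(nodes: list, resources: dict) -> dict:
    """Summarize data shaped like `ray.nodes()` and `ray.cluster_resources()`."""
    alive = [node for node in nodes if node.get("Alive")]
    return {
        "alive_nodes": len(alive),
        "gpus": int(resources.get("GPU", 0)),
    }


def fetch_cluster_summary() -> dict:
    """Query the running Ray cluster (call this inside the container)."""
    import ray  # imported lazily so summarize_cluster works without Ray installed

    ray.init(address="auto")  # connect to the existing cluster
    return summarize_cluster(ray.nodes(), ray.cluster_resources())


# Hand-written example data for 2 nodes with 8 GPUs each
example_nodes = [{"Alive": True}, {"Alive": True}]
example_resources = {"GPU": 16.0, "CPU": 128.0}
print(summarize_cluster(example_nodes, example_resources))
```

If the reported numbers do not match your hardware, inspect the output of `ray status` and `ray list nodes` before starting vLLM.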
|
||||
|
||||
After that, on any node, use `docker exec -it node /bin/bash` to enter the container again. **In the container**, you can use vLLM as usual, just as if all the GPUs were on one node: vLLM will be able to leverage the GPU resources of all nodes in the Ray cluster, so you only need to run the `vllm` command on this node and not on the other nodes. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
|
||||
|
||||
```console
vllm serve /path/to/the/model/in/the/container \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2
```
|
||||
|
||||
You can also use tensor parallelism without pipeline parallelism: just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:
|
||||
|
||||
```console
vllm serve /path/to/the/model/in/the/container \
    --tensor-parallel-size 16
```
|
||||
|
||||
To make tensor parallelism performant, you should make sure the communication between nodes is efficient, e.g. by using high-speed network cards like InfiniBand. To correctly set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm that InfiniBand is working is to run vLLM with the `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...`, and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, NCCL is using raw TCP sockets, which is not efficient for cross-node tensor parallelism. If you find `[send] via NET/IB/GDRDMA` in the logs, NCCL is using InfiniBand with GPU-Direct RDMA, which is efficient.
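When the trace log is long, a small script can look for these markers. The following is a minimal sketch: `nccl_transports` is a hypothetical helper, and it only searches for the two substrings quoted above rather than parsing the full NCCL log format.

```python
def nccl_transports(log_text: str) -> set:
    """Return which network transports appear in an NCCL debug log."""
    found = set()
    for line in log_text.splitlines():
        if "via NET/IB" in line:
            found.add("infiniband")  # e.g. "[send] via NET/IB/GDRDMA"
        elif "via NET/Socket" in line:
            found.add("socket")      # raw TCP sockets, slow across nodes
    return found


# Shortened sample lines containing only the markers quoted above
sample_log = "[send] via NET/Socket\n[send] via NET/Socket\n"
print(nccl_transports(sample_log))
```

A result containing `socket` suggests revisiting the InfiniBand-related arguments passed to `run_cluster.sh`.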
|
||||
|
||||
!!! warning
    After you start the Ray cluster, you should also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script][troubleshooting-incorrect-hardware-driver] for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes on the same node, not for the processes on the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
|
||||
|
||||
!!! warning
    Please make sure you have downloaded the model to all the nodes (with the same path), or that the model is downloaded to a distributed file system that is accessible by all nodes.

    When you use a Hugging Face repo ID to refer to the model, you should append your Hugging Face token to the `run_cluster.sh` script, e.g. `-e HF_TOKEN=`. The recommended way is to download the model first, and then use the path to refer to the model.
|
||||
|
||||
!!! warning
    If you keep receiving the error message `Error: No available node types can fulfill resource request` but you have enough GPUs in the cluster, chances are your nodes have multiple IP addresses and vLLM cannot find the right one, especially when you are using multi-node inference. Please make sure vLLM and Ray use the same IP address. You can set the `VLLM_HOST_IP` environment variable to the right IP address in the `run_cluster.sh` script (different for each node!), and check `ray status` and `ray list nodes` to see the IP address used by Ray. See <gh-issue:7815> for more information.
|
||||
18
docs/serving/engine_args.md
Normal file
@@ -0,0 +1,18 @@
|
||||
---
|
||||
title: Engine Arguments
|
||||
---
|
||||
[](){ #engine-args }
|
||||
|
||||
Engine arguments control the behavior of the vLLM engine.
|
||||
|
||||
- For [offline inference][offline-inference], they are part of the arguments to the [LLM][vllm.LLM] class.
|
||||
- For [online serving][openai-compatible-server], they are part of the arguments to `vllm serve`.
|
||||
|
||||
You can look at [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs] to see the available engine arguments.
|
||||
|
||||
However, these classes are a combination of the configuration classes defined in [vllm.config][]. Therefore, we would recommend you read about them there where they are best documented.
|
||||
|
||||
For offline inference you will have access to these configuration classes and for online serving you can cross-reference the configs with `vllm serve --help`, which has its arguments grouped by config.
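To illustrate how the same option shows up in both places, the sketch below turns offline keyword arguments into command-line flags. `to_cli_flags` is a hypothetical helper, and the underscore-to-hyphen mapping is an assumption based on pairs such as `tensor_parallel_size` / `--tensor-parallel-size`; boolean and structured options are not handled, so always confirm against `vllm serve --help`.

```python
def to_cli_flags(engine_kwargs: dict) -> list:
    """Convert simple keyword arguments into `--flag value` pairs.

    Assumes each option maps to a flag by replacing underscores with hyphens.
    """
    flags = []
    for name, value in engine_kwargs.items():
        flags.append("--" + name.replace("_", "-"))
        flags.append(str(value))
    return flags


offline_kwargs = {"tensor_parallel_size": 4, "distributed_executor_backend": "mp"}
command = ["vllm", "serve", "facebook/opt-13b", *to_cli_flags(offline_kwargs)]
print(" ".join(command))
```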
|
||||
|
||||
!!! note
    Additional arguments are available to the [AsyncLLMEngine][vllm.engine.async_llm_engine.AsyncLLMEngine], which is used for online serving. These can be found by running `vllm serve --help`.
|
||||
12
docs/serving/env_vars.md
Normal file
@@ -0,0 +1,12 @@
|
||||
# Environment Variables
|
||||
|
||||
vLLM uses the following environment variables to configure the system:
|
||||
|
||||
!!! warning
    Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and IP address for vLLM's **internal usage**. They are not the port and IP address for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.
|
||||
|
||||
All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
|
||||
|
||||
```python
|
||||
--8<-- "vllm/envs.py:env-vars-definition"
|
||||
```
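For Kubernetes users, a quick way to spot the naming conflict described above is to scan the environment for `VLLM_`-prefixed variables that look like they were injected for a Service. This is a heuristic sketch, not a vLLM feature: `find_kubernetes_collisions` is a hypothetical helper, and it assumes the usual Service variables (`<NAME>_SERVICE_HOST`, `<NAME>_SERVICE_PORT`, and `<NAME>_PORT=tcp://...`).

```python
import os


def find_kubernetes_collisions(environ) -> list:
    """List VLLM_* variables that look like Kubernetes Service variables."""
    suspicious = []
    for name, value in environ.items():
        if not name.startswith("VLLM_"):
            continue
        # A Service named "vllm" yields VLLM_SERVICE_HOST, VLLM_SERVICE_PORT,
        # and link-style values such as VLLM_PORT=tcp://10.0.0.1:8000.
        if name.startswith("VLLM_SERVICE_") or value.startswith("tcp://"):
            suspicious.append(name)
    return sorted(suspicious)


print(find_kubernetes_collisions(os.environ))
```

An empty list does not prove there is no conflict; it only means none of these patterns were found.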
|
||||
31
docs/serving/integrations/langchain.md
Normal file
@@ -0,0 +1,31 @@
|
||||
---
|
||||
title: LangChain
|
||||
---
|
||||
[](){ #serving-langchain }
|
||||
|
||||
vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain).
|
||||
|
||||
To install LangChain, run
|
||||
|
||||
```console
|
||||
pip install langchain langchain_community -q
|
||||
```
|
||||
|
||||
To run inference on a single or multiple GPUs, use the `VLLM` class from `langchain_community`.
|
||||
|
||||
```python
from langchain_community.llms import VLLM

llm = VLLM(model="mosaicml/mpt-7b",
           trust_remote_code=True,  # mandatory for hf models
           max_new_tokens=128,
           top_k=10,
           top_p=0.95,
           temperature=0.8,
           # tensor_parallel_size=... # for distributed inference
)

# `invoke` is the current LangChain entry point; calling `llm(...)` directly is deprecated
print(llm.invoke("What is the capital of France ?"))
```
|
||||
|
||||
Please refer to this [Tutorial](https://python.langchain.com/docs/integrations/llms/vllm) for more details.
|
||||
27
docs/serving/integrations/llamaindex.md
Normal file
@@ -0,0 +1,27 @@
|
||||
---
|
||||
title: LlamaIndex
|
||||
---
|
||||
[](){ #serving-llamaindex }
|
||||
|
||||
vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index).
|
||||
|
||||
To install LlamaIndex, run
|
||||
|
||||
```console
|
||||
pip install llama-index-llms-vllm -q
|
||||
```
|
||||
|
||||
To run inference on a single or multiple GPUs, use the `Vllm` class from `llama_index`.
|
||||
|
||||
```python
from llama_index.llms.vllm import Vllm

llm = Vllm(
    model="microsoft/Orca-2-7b",
    tensor_parallel_size=4,
    max_new_tokens=100,
    vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},
)
```
|
||||
|
||||
Please refer to this [Tutorial](https://docs.llamaindex.ai/en/latest/examples/llm/vllm/) for more details.
|
||||
53
docs/serving/metrics.md
Normal file
@@ -0,0 +1,53 @@
|
||||
# Production Metrics
|
||||
|
||||
vLLM exposes a number of metrics that can be used to monitor the health of the
system. These metrics are exposed via the `/metrics` endpoint on the vLLM
OpenAI compatible API server.
|
||||
|
||||
You can start the server using Python, or using [Docker][deployment-docker]:
|
||||
|
||||
```console
|
||||
vllm serve unsloth/Llama-3.2-1B-Instruct
|
||||
```
|
||||
|
||||
Then query the endpoint to get the latest metrics from the server:
|
||||
|
||||
```console
$ curl http://0.0.0.0:8000/metrics

# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
# TYPE vllm:iteration_tokens_total histogram
vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
...
```
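The response is plain text, so it can also be read from a script. The sketch below is a simplified parser written for illustration: `parse_metrics` and `fetch_metrics` are hypothetical helpers, the parser assumes one `name{labels} value` sample per line without timestamps, and the default URL assumes the server is reachable on `localhost:8000`. For production use, a Prometheus client library is the safer choice.

```python
import urllib.request


def parse_metrics(text: str) -> dict:
    """Parse metric lines into {"name{labels}": value}, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and "# HELP" / "# TYPE" comments
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # not a sample line (e.g. "...")
    return samples


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> dict:
    """Download and parse the metrics of a running server."""
    with urllib.request.urlopen(url) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Two sample lines copied from the output above
sample = """# TYPE vllm:iteration_tokens_total histogram
vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
...
"""
for name, value in parse_metrics(sample).items():
    print(name, value)
```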
|
||||
|
||||
The following metrics are exposed:
|
||||
|
||||
```python
|
||||
--8<-- "vllm/engine/metrics.py:metrics-definitions"
|
||||
```
|
||||
|
||||
The following metrics are deprecated and due to be removed in a future version:
|
||||
|
||||
- `vllm:num_requests_swapped`, `vllm:cpu_cache_usage_perc`, and
  `vllm:cpu_prefix_cache_hit_rate` because KV cache offloading is not
  used in V1.
- `vllm:gpu_prefix_cache_hit_rate` is replaced by queries+hits
  counters in V1.
- `vllm:time_in_queue_requests` because it duplicates
  `vllm:request_queue_time_seconds`.
- `vllm:model_forward_time_milliseconds` and
  `vllm:model_execute_time_milliseconds` because
  prefill/decode/inference time metrics should be used instead.
|
||||
|
||||
Note: when metrics are deprecated in version `X.Y`, they are hidden in version `X.Y+1`
but can be re-enabled using the `--show-hidden-metrics-for-version=X.Y` escape hatch,
and are then removed in version `X.Y+2`.
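The schedule in this note can be expressed as a small function. This is a sketch for illustration only: `metric_status` is a hypothetical helper, versions are given as `(major, minor)` tuples, and the version numbers in the demo are made up.

```python
def metric_status(deprecated_in: tuple, current: tuple) -> str:
    """Classify a metric under the X.Y / X.Y+1 / X.Y+2 policy described above."""
    if deprecated_in[0] != current[0]:
        raise ValueError("this sketch only compares versions with the same major number")
    age = current[1] - deprecated_in[1]
    if age < 0:
        return "active"
    if age == 0:
        return "deprecated"
    if age == 1:
        return "hidden"  # re-enable with --show-hidden-metrics-for-version=X.Y
    return "removed"


# Illustrative version numbers only
for minor in (7, 8, 9):
    print((0, minor), metric_status(deprecated_in=(0, 7), current=(0, minor)))
```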
|
||||
214
docs/serving/offline_inference.md
Normal file
@@ -0,0 +1,214 @@
|
||||
---
|
||||
title: Offline Inference
|
||||
---
|
||||
[](){ #offline-inference }
|
||||
|
||||
You can run vLLM in your own code on a list of prompts.
|
||||
|
||||
The offline API is based on the [LLM][vllm.LLM] class.
|
||||
To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.
|
||||
|
||||
For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
|
||||
and runs it in vLLM using the default configuration.
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
|
||||
llm = LLM(model="facebook/opt-125m")
|
||||
```
|
||||
|
||||
After initializing the `LLM` instance, you can perform model inference using various APIs.
|
||||
The available APIs depend on the type of model that is being run:
|
||||
|
||||
- [Generative models][generative-models] output logprobs, which are sampled to obtain the final output text.
|
||||
- [Pooling models][pooling-models] output their hidden states directly.
|
||||
|
||||
Please refer to the above pages for more details about each API.
|
||||
|
||||
!!! info
    [API Reference][offline-inference-api]
|
||||
|
||||
[](){ #configuration-options }
|
||||
|
||||
## Configuration Options
|
||||
|
||||
This section lists the most common options for running the vLLM engine.
|
||||
For a full list, refer to the [configuration][configuration] page.
|
||||
|
||||
[](){ #model-resolution }
|
||||
|
||||
### Model resolution
|
||||
|
||||
vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
|
||||
and finding the corresponding implementation that is registered to vLLM.
|
||||
Nevertheless, our model resolution may fail for the following reasons:
|
||||
|
||||
- The `config.json` of the model repository lacks the `architectures` field.
|
||||
- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
|
||||
- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.
|
||||
|
||||
To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
|
||||
For example:
|
||||
|
||||
```python
from vllm import LLM

model = LLM(
    model="cerebras/Cerebras-GPT-1.3B",
    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
)
```
|
||||
|
||||
Our [list of supported models][supported-models] shows the model architectures that are recognized by vLLM.
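Before overriding anything, it can help to check what a local copy of `config.json` actually declares. The sketch below is an illustration with a hypothetical helper, `read_architectures`; the demo writes a made-up `config.json` to a temporary directory so that it runs without downloading a model.

```python
import json
import tempfile
from pathlib import Path


def read_architectures(config_path) -> list:
    """Return the `architectures` list from a config.json, or [] if it is absent."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return list(config.get("architectures") or [])


# Demo with a throwaway config.json (contents are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"
    path.write_text(json.dumps({"model_type": "gpt2"}), encoding="utf-8")
    print(read_architectures(path))  # empty list: the field is missing
```

An empty result corresponds to the first failure case listed above, where `hf_overrides` is needed to name the architecture.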
|
||||
|
||||
[](){ #reducing-memory-usage }
|
||||
|
||||
### Reducing memory usage
|
||||
|
||||
Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.
|
||||
|
||||
#### Tensor Parallelism (TP)
|
||||
|
||||
Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.
|
||||
|
||||
The following code splits the model across 2 GPUs.
|
||||
|
||||
```python
from vllm import LLM

llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)
```
|
||||
|
||||
!!! warning
    To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.cuda.set_device][])
    before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.

    To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
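A minimal sketch of setting the variable from Python is shown below. The device indices `0,1` are only an example, and the `LLM(...)` call is commented out because it requires vLLM and the GPUs; the important part is that the variable is set before anything initializes CUDA.

```python
import os

# Set this before CUDA is initialized, i.e. before any torch.cuda call
# and before the vLLM engine is created.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # example: expose only GPUs 0 and 1

print(os.environ["CUDA_VISIBLE_DEVICES"])

# from vllm import LLM
# llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)
```

Setting the variable in the shell that launches the script has the same effect and avoids ordering issues inside the program.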
|
||||
|
||||
!!! note
    With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).

    You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
|
||||
|
||||
#### Quantization
|
||||
|
||||
Quantized models take less memory at the cost of lower precision.
|
||||
|
||||
Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
|
||||
and used directly without extra configuration.
|
||||
|
||||
Dynamic quantization is also supported via the `quantization` option -- see [here][quantization-index] for more details.
|
||||
|
||||
#### Context length and batch size
|
||||
|
||||
You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
|
||||
and the maximum batch size (`max_num_seqs` option).
|
||||
|
||||
```python
from vllm import LLM

llm = LLM(model="adept/fuyu-8b",
          max_model_len=2048,
          max_num_seqs=2)
```
|
||||
|
||||
#### Reduce CUDA Graphs
|
||||
|
||||
By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU.
|
||||
|
||||
!!! warning
    CUDA graph capture takes up more memory in V1 than in V0.
|
||||
|
||||
You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
|
||||
|
||||
```python
from vllm import LLM
from vllm.config import CompilationConfig, CompilationLevel

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    compilation_config=CompilationConfig(
        level=CompilationLevel.PIECEWISE,
        # By default, it goes up to max_num_seqs
        cudagraph_capture_sizes=[1, 2, 4, 8, 16],
    ),
)
```
|
||||
|
||||
You can disable graph capturing completely via the `enforce_eager` flag:
|
||||
|
||||
```python
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          enforce_eager=True)
```
|
||||
|
||||
#### Adjust cache size
|
||||
|
||||
If you run out of CPU RAM, try the following options:
|
||||
|
||||
- (Multi-modal models only) you can set the size of the multi-modal input cache using the `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
- (CPU backend only) you can set the size of the KV cache using the `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
|
||||
|
||||
#### Multi-modal input limits
|
||||
|
||||
You can allow a smaller number of multi-modal items per prompt to reduce the memory footprint of the model:
|
||||
|
||||
```python
from vllm import LLM

# Accept up to 3 images and 1 video per prompt
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          limit_mm_per_prompt={"image": 3, "video": 1})
```
|
||||
|
||||
You can go a step further and disable unused modalities completely by setting their limits to zero.
|
||||
For example, if your application only accepts image input, there is no need to allocate any memory for videos.
|
||||
|
||||
```python
from vllm import LLM

# Accept any number of images but no videos
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          limit_mm_per_prompt={"video": 0})
```
|
||||
|
||||
You can even run a multi-modal model for text-only inference:
|
||||
|
||||
```python
from vllm import LLM

# Don't accept images. Just text.
llm = LLM(model="google/gemma-3-27b-it",
          limit_mm_per_prompt={"image": 0})
```
|
||||
|
||||
#### Multi-modal processor arguments
|
||||
|
||||
For certain models, you can adjust the multi-modal processor arguments to
|
||||
reduce the size of the processed multi-modal inputs, which in turn saves memory.
|
||||
|
||||
Here are some examples:
|
||||
|
||||
```python
from vllm import LLM

# Available for Qwen2-VL series models
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          mm_processor_kwargs={
              "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
          })

# Available for InternVL series models
llm = LLM(model="OpenGVLab/InternVL2-2B",
          mm_processor_kwargs={
              "max_dynamic_patch": 4,  # Default is 12
          })
```
|
||||
|
||||
### Performance optimization and tuning
|
||||
|
||||
You can potentially improve the performance of vLLM by tuning various options.
|
||||
Please refer to [this guide][optimization-and-tuning] for more details.
|
||||
730
docs/serving/openai_compatible_server.md
Normal file
@@ -0,0 +1,730 @@
|
||||
---
|
||||
title: OpenAI-Compatible Server
|
||||
---
|
||||
[](){ #openai-compatible-server }
|
||||
|
||||
vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.
|
||||
|
||||
In your terminal, you can [install](../getting_started/installation.md) vLLM, then start the server with the [`vllm serve`][serve-args] command. (You can also use our [Docker][deployment-docker] image.)
|
||||
|
||||
```bash
|
||||
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
|
||||
```
|
||||
|
||||
To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).
|
||||
|
||||
```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message)
```
|
||||
|
||||
!!! tip
    vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
    You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, i.e. `extra_body={"top_k": 50}` for `top_k`.
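To make the effect of `extra_body` concrete, the sketch below builds the JSON body by hand. It assumes that the OpenAI client merges the `extra_body` fields into the top level of the request body; `build_chat_body` is a hypothetical helper used only to show the resulting shape, not part of either library.

```python
import json


def build_chat_body(model: str, messages: list, extra_body: dict = None) -> dict:
    """Assemble a chat completion request body, merging any extra fields."""
    body = {"model": model, "messages": messages}
    body.update(extra_body or {})  # extra fields sit next to the standard ones
    return body


body = build_chat_body(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"top_k": 50},
)
print(json.dumps(body, indent=2))

# Equivalent call with the OpenAI client (requires a running server):
# client.chat.completions.create(
#     model="NousResearch/Meta-Llama-3-8B-Instruct",
#     messages=[{"role": "user", "content": "Hello!"}],
#     extra_body={"top_k": 50},
# )
```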
|
||||
|
||||
!!! warning
    By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.

    To disable this behavior, please pass `--generation-config vllm` when launching the server.
|
||||
|
||||
## Supported APIs
|
||||
|
||||
We currently support the following OpenAI APIs:
|
||||
|
||||
- [Completions API][completions-api] (`/v1/completions`)
    - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`).
    - *Note: `suffix` parameter is not supported.*
- [Chat Completions API][chat-api] (`/v1/chat/completions`)
    - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`) with a [chat template][chat-template].
    - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
- [Embeddings API][embeddings-api] (`/v1/embeddings`)
    - Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
- [Transcriptions API][transcriptions-api] (`/v1/audio/transcriptions`)
    - Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
|
||||
|
||||
In addition, we have the following custom APIs:
|
||||
|
||||
- [Tokenizer API][tokenizer-api] (`/tokenize`, `/detokenize`)
    - Applicable to any model with a tokenizer.
- [Pooling API][pooling-api] (`/pooling`)
    - Applicable to all [pooling models](../models/pooling_models.md).
- [Classification API][classification-api] (`/classify`)
    - Only applicable to [classification models](../models/pooling_models.md) (`--task classify`).
- [Score API][score-api] (`/score`)
    - Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
- [Re-rank API][rerank-api] (`/rerank`, `/v1/rerank`, `/v2/rerank`)
    - Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
    - Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
    - Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
    - Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
|
||||
|
||||
[](){ #chat-template }
|
||||
|
||||
## Chat Template
|
||||
|
||||
In order for the language model to support the chat protocol, vLLM requires the model to include
a chat template in its tokenizer configuration. The chat template is a Jinja2 template that
specifies how roles, messages, and other chat-specific tokens are encoded in the input.
|
||||
|
||||
An example chat template for `NousResearch/Meta-Llama-3-8B-Instruct` can be found [here](https://github.com/meta-llama/llama3?tab=readme-ov-file#instruction-tuned-models).
|
||||
|
||||
Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models,
you can manually specify the chat template with the `--chat-template` parameter, passing either the file path to the chat
template or the template in string form. Without a chat template, the server cannot process chat requests,
and all chat requests will return an error.
|
||||
|
||||
```bash
|
||||
vllm serve <model> --chat-template ./path-to-chat-template.jinja
|
||||
```
|
||||
|
||||
The vLLM community provides a set of chat templates for popular models. You can find them under the <gh-dir:examples> directory.
|
||||
|
||||
With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
|
||||
both a `type` and a `text` field. An example is provided below:
|
||||
|
||||
```python
|
||||
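from openai import OpenAI

# Assumes a vLLM server on localhost:8000; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
    ]
)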
|
||||
```
|
||||
|
||||
Most chat templates for LLMs expect the `content` field to be a string, but there are some newer models like
|
||||
`meta-llama/Llama-Guard-3-1B` that expect the content to be formatted according to the OpenAI schema in the
|
||||
request. vLLM provides best-effort support to detect this automatically, which is logged as a string like
|
||||
*"Detected the chat template content format to be..."*, and internally converts incoming requests to match
|
||||
the detected format, which can be one of:
|
||||
|
||||
- `"string"`: A string.
    - Example: `"Hello world"`
- `"openai"`: A list of dictionaries, similar to OpenAI schema.
    - Example: `[{"type": "text", "text": "Hello world!"}]`
|
||||
|
||||
If the result is not what you expect, you can set the `--chat-template-content-format` CLI argument
|
||||
to override which format to use.
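The two content formats can be converted into each other mechanically. The sketch below is illustrative only: `to_openai_content` and `to_string_content` are hypothetical helper names rather than vLLM functions, and joining multiple text parts with a newline is an assumption, not documented vLLM behavior.

```python
from typing import Union

Content = Union[str, list]


def to_openai_content(content: Content) -> list:
    """Return content in the "openai" format (a list of typed parts)."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content


def to_string_content(content: Content) -> str:
    """Return content in the "string" format.

    Only text parts are kept; joining them with a newline is an assumption.
    """
    if isinstance(content, str):
        return content
    return "\n".join(p["text"] for p in content if p.get("type") == "text")


print(to_openai_content("Hello world!"))
print(to_string_content([{"type": "text", "text": "Hello world!"}]))
```

Non-text parts (such as images) have no string equivalent, which is one reason the content format a template expects matters.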
|
||||
|
||||
## Extra Parameters
|
||||
|
||||
vLLM supports a set of parameters that are not part of the OpenAI API.
To use them, you can pass them as extra parameters in the OpenAI client,
or merge them directly into the JSON payload if you are making the HTTP call yourself.
|
||||
|
||||
```python
|
||||
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_body={
        "guided_choice": ["positive", "negative"]
    }
)
|
||||
```
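When calling the server over HTTP without the OpenAI client, the extra parameters sit at the top level of the JSON body, next to the standard fields. The following is a minimal sketch using only the standard library; `build_chat_payload` is a hypothetical helper, the URL assumes a server on `localhost:8000`, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_chat_payload(model, messages, **extra_params):
    """Merge vLLM-specific parameters into the top level of the request body."""
    payload = {"model": model, "messages": messages}
    payload.update(extra_params)
    return payload


payload = build_chat_payload(
    "NousResearch/Meta-Llama-3-8B-Instruct",
    [{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}],
    guided_choice=["positive", "negative"],
)

# Assumed server address; pass `request` to urllib.request.urlopen() to send it.
request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(json.dumps(payload, indent=2))
```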
|
||||
|
||||
## Extra HTTP Headers
|
||||
|
||||
Only the `X-Request-Id` HTTP request header is supported for now. It can be enabled
with `--enable-request-id-headers`.
|
||||
|
||||
> Note that enabling these headers can significantly impact performance at high QPS.
> For this reason, we recommend implementing HTTP headers at the router level (e.g. via Istio)
> rather than within the vLLM layer.
> See [this PR](https://github.com/vllm-project/vllm/pull/11529) for more details.
|
||||
|
||||
```python
|
||||
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    }
)
print(completion._request_id)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    }
)
print(completion._request_id)
|
||||
```
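The same header can be attached to a plain HTTP request. The sketch below only builds the request with the standard library and does not send it; `build_request` is a hypothetical helper, the server address is an assumption, and generating a random ID when none is given is a design choice of this example rather than vLLM behavior.

```python
import json
import urllib.request
import uuid


def build_request(url, payload, request_id=None):
    """Build a POST request carrying an X-Request-Id header."""
    request_id = request_id or uuid.uuid4().hex  # generate an ID if none is given
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Request-Id": request_id},
        method="POST",
    )


req = build_request(
    "http://localhost:8000/v1/completions",  # assumed server address
    {"model": "NousResearch/Meta-Llama-3-8B-Instruct",
     "prompt": "A robot may not injure a human being"},
    request_id="completion-test",
)
# urllib normalizes header names with str.capitalize(), hence the lowercase letters.
print(req.get_header("X-request-id"))
```

HTTP header names are case-insensitive, so the spelling urllib stores internally does not change what the server receives.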
|
||||
|
||||
## API Reference
|
||||
|
||||
[](){ #completions-api }
|
||||
|
||||
### Completions API
|
||||
|
||||
Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
|
||||
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
|
||||
|
||||
Code example: <gh-file:examples/online_serving/openai_completion_client.py>
|
||||
|
||||
#### Extra parameters
|
||||
|
||||
The following [sampling parameters][sampling-params] are supported.
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params"
|
||||
```
|
||||
|
||||
The following extra parameters are supported:
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
|
||||
```
|
||||
|
||||
[](){ #chat-api }
|
||||
|
||||
### Chat API
|
||||
|
||||
Our Chat API is compatible with [OpenAI's Chat Completions API](https://platform.openai.com/docs/api-reference/chat);
|
||||
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
|
||||
|
||||
We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
|
||||
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
|
||||
see our [Multimodal Inputs][multimodal-inputs] guide for more information.
|
||||
- *Note: `image_url.detail` parameter is not supported.*
|
||||
|
||||
Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
|
||||
|
||||
#### Extra parameters
|
||||
|
||||
The following [sampling parameters][sampling-params] are supported.
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params"
|
||||
```
|
||||
|
||||
The following extra parameters are supported:
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
|
||||
```
|
||||
|
||||
[](){ #embeddings-api }
|
||||
|
||||
### Embeddings API
|
||||
|
||||
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
|
||||
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
|
||||
|
||||
If the model has a [chat template][chat-template], you can replace `input` with a list of `messages` (same schema as [Chat API][chat-api]),
which will be treated as a single prompt to the model.
|
||||
|
||||
Code example: <gh-file:examples/online_serving/openai_embedding_client.py>
|
||||
|
||||
#### Multi-modal inputs
|
||||
|
||||
You can pass multi-modal inputs to embedding models by defining a custom chat template for the server
|
||||
and passing a list of `messages` in the request. Refer to the examples below for illustration.
|
||||
|
||||
=== "VLM2Vec"
|
||||
|
||||
To serve the model:
|
||||
|
||||
```bash
|
||||
vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
|
||||
--trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
|
||||
```
|
||||
|
||||
!!! warning
|
||||
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
|
||||
to run this model in embedding mode instead of text generation mode.
|
||||
|
||||
The custom chat template is completely different from the original one for this model,
|
||||
and can be found here: <gh-file:examples/template_vlm2vec.jinja>
|
||||
|
||||
Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level `requests` library:
|
||||
|
||||
```python
|
||||
import requests

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Represent the given image."},
            ],
        }],
        "encoding_format": "float",
    },
)
response.raise_for_status()
response_json = response.json()
print("Embedding output:", response_json["data"][0]["embedding"])
|
||||
```
|
||||
|
||||
=== "DSE-Qwen2-MRL"
|
||||
|
||||
To serve the model:
|
||||
|
||||
```bash
|
||||
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
|
||||
--trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
|
||||
```
|
||||
|
||||
!!! warning
|
||||
Like with VLM2Vec, we have to explicitly pass `--task embed`.
|
||||
|
||||
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
|
||||
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
|
||||
|
||||
!!! warning
|
||||
`MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
|
||||
example below for details.
|
||||
|
||||
Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py>
|
||||
|
||||
#### Extra parameters
|
||||
|
||||
The following [pooling parameters][pooling-params] are supported.
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:embedding-pooling-params"
|
||||
```
|
||||
|
||||
The following extra parameters are supported by default:
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:embedding-extra-params"
|
||||
```
|
||||
|
||||
For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params"
|
||||
```
|
||||
|
||||
[](){ #transcriptions-api }
|
||||
|
||||
### Transcriptions API
|
||||
|
||||
Our Transcriptions API is compatible with [OpenAI's Transcriptions API](https://platform.openai.com/docs/api-reference/audio/createTranscription);
|
||||
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
|
||||
|
||||
!!! note
    To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.
|
||||
|
||||
Code example: <gh-file:examples/online_serving/openai_transcription_client.py>
|
||||
<!-- TODO: api enforced limits + uploading audios -->
|
||||
|
||||
#### Extra Parameters
|
||||
|
||||
The following [sampling parameters][sampling-params] are supported.
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params"
|
||||
```
|
||||
|
||||
The following extra parameters are supported:
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
|
||||
```
|
||||
|
||||
[](){ #tokenizer-api }
|
||||
|
||||
### Tokenizer API
|
||||
|
||||
Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
|
||||
It consists of two endpoints:
|
||||
|
||||
- `/tokenize` corresponds to calling `tokenizer.encode()`.
|
||||
- `/detokenize` corresponds to calling `tokenizer.decode()`.
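As a rough illustration of how the two endpoints might be called, the sketch below builds request bodies and defines a small POST helper using only the standard library. The field names (`prompt`, `tokens`), the helper names, and the server address are assumptions made for this example; consult the server's OpenAPI schema for the authoritative request format.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed server address


def tokenize_payload(model, prompt):
    # Field names are assumptions; check the server's OpenAPI schema.
    return {"model": model, "prompt": prompt}


def detokenize_payload(model, tokens):
    # Field names are assumptions; check the server's OpenAPI schema.
    return {"model": model, "tokens": list(tokens)}


def post_json(path, payload):
    """POST a JSON body to the server and decode the JSON reply."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


print(tokenize_payload("NousResearch/Meta-Llama-3-8B-Instruct", "Hello world"))
```

With a server running, `post_json("/tokenize", tokenize_payload(model, text))` would send the request; nothing is sent when the block above is run as written.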
|
||||
|
||||
[](){ #pooling-api }
|
||||
|
||||
### Pooling API
|
||||
|
||||
Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.
|
||||
|
||||
The input format is the same as [Embeddings API][embeddings-api], but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
|
||||
|
||||
Code example: <gh-file:examples/online_serving/openai_pooling_client.py>
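Because the output may be nested to any depth, client code should not assume a flat list of floats. A small sketch for inspecting the shape of a returned value follows; `nested_shape` is a hypothetical helper and assumes the nesting is regular (every sub-list at a given depth has the same length).

```python
def nested_shape(data):
    """Return the shape of a regularly nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        if not data:
            break
        data = data[0]
    return tuple(shape)


print(nested_shape([0.1, 0.2, 0.3]))           # a 1-D list, as in the Embeddings API
print(nested_shape([[0.1, 0.2], [0.3, 0.4]]))  # a 2-D list
```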
|
||||
|
||||
[](){ #classification-api }
|
||||
|
||||
### Classification API
|
||||
|
||||
Our Classification API directly supports Hugging Face sequence-classification models such as [ai21labs/Jamba-tiny-reward-dev](https://huggingface.co/ai21labs/Jamba-tiny-reward-dev) and [jason9693/Qwen2.5-1.5B-apeach](https://huggingface.co/jason9693/Qwen2.5-1.5B-apeach).
|
||||
|
||||
We automatically wrap any other transformer via `as_classification_model()`, which pools on the last token, attaches a `RowParallelLinear` head, and applies a softmax to produce per-class probabilities.
|
||||
|
||||
Code example: <gh-file:examples/online_serving/openai_classification_client.py>
|
||||
|
||||
#### Example Requests
|
||||
|
||||
You can classify multiple texts by passing an array of strings:
|
||||
|
||||
Request:
|
||||
|
||||
```bash
|
||||
curl -v "http://127.0.0.1:8000/classify" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "jason9693/Qwen2.5-1.5B-apeach",
|
||||
"input": [
|
||||
"Loved the new café—coffee was great.",
|
||||
"This update broke everything. Frustrating."
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
Response:
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
|
||||
"object": "list",
|
||||
"created": 1745383065,
|
||||
"model": "jason9693/Qwen2.5-1.5B-apeach",
|
||||
"data": [
|
||||
{
|
||||
"index": 0,
|
||||
"label": "Default",
|
||||
"probs": [
|
||||
0.565970778465271,
|
||||
0.4340292513370514
|
||||
],
|
||||
"num_classes": 2
|
||||
},
|
||||
{
|
||||
"index": 1,
|
||||
"label": "Spoiled",
|
||||
"probs": [
|
||||
0.26448777318000793,
|
||||
0.7355121970176697
|
||||
],
|
||||
"num_classes": 2
|
||||
}
|
||||
],
|
||||
"usage": {
|
||||
"prompt_tokens": 20,
|
||||
"total_tokens": 20,
|
||||
"completion_tokens": 0,
|
||||
"prompt_tokens_details": null
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
You can also pass a string directly to the `input` field:
|
||||
|
||||
Request:
|
||||
|
||||
```bash
|
||||
curl -v "http://127.0.0.1:8000/classify" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "jason9693/Qwen2.5-1.5B-apeach",
|
||||
"input": "Loved the new café—coffee was great."
|
||||
}'
|
||||
```
|
||||
|
||||
Response:
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
|
||||
"object": "list",
|
||||
"created": 1745383213,
|
||||
"model": "jason9693/Qwen2.5-1.5B-apeach",
|
||||
"data": [
|
||||
{
|
||||
"index": 0,
|
||||
"label": "Default",
|
||||
"probs": [
|
||||
0.565970778465271,
|
||||
0.4340292513370514
|
||||
],
|
||||
"num_classes": 2
|
||||
}
|
||||
],
|
||||
"usage": {
|
||||
"prompt_tokens": 10,
|
||||
"total_tokens": 10,
|
||||
"completion_tokens": 0,
|
||||
"prompt_tokens_details": null
|
||||
}
|
||||
}
|
||||
```
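A response like the ones above can be post-processed on the client side. The sketch below picks the most probable class for each input, using the values from the first example response; `summarize` is a hypothetical helper, and only the fields shown in the responses above are used.

```python
response = {
    "data": [
        {"index": 0, "label": "Default",
         "probs": [0.565970778465271, 0.4340292513370514], "num_classes": 2},
        {"index": 1, "label": "Spoiled",
         "probs": [0.26448777318000793, 0.7355121970176697], "num_classes": 2},
    ]
}


def summarize(item):
    """Return (input index, label, index of the most probable class, its probability)."""
    probs = item["probs"]
    best = max(range(len(probs)), key=probs.__getitem__)
    return item["index"], item["label"], best, probs[best]


for item in response["data"]:
    print(summarize(item))
```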
|
||||
|
||||
#### Extra parameters
|
||||
|
||||
The following [pooling parameters][pooling-params] are supported.
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:classification-pooling-params"
|
||||
```
|
||||
|
||||
The following extra parameters are supported:
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:classification-extra-params"
|
||||
```
|
||||
|
||||
[](){ #score-api }
|
||||
|
||||
### Score API
|
||||
|
||||
Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence pairs. When using an embedding model, the score corresponds to the cosine similarity between each embedding pair.
Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.
|
||||
|
||||
You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
|
||||
|
||||
Code example: <gh-file:examples/online_serving/openai_cross_encoder_score.py>
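For reference, the cosine similarity mentioned above can be computed as follows. This is a plain-Python sketch of the formula, not the code vLLM runs internally, and the two-dimensional vectors are toy inputs rather than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Mathematically, cosine similarity lies between -1 and 1, so the 0-to-1 scale is the usual case rather than a guarantee.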
|
||||
|
||||
#### Single inference
|
||||
|
||||
You can pass a string to both `text_1` and `text_2`, forming a single sentence pair.
|
||||
|
||||
Request:
|
||||
|
||||
```bash
|
||||
curl -X 'POST' \
|
||||
'http://127.0.0.1:8000/score' \
|
||||
-H 'accept: application/json' \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{
|
||||
"model": "BAAI/bge-reranker-v2-m3",
|
||||
"encoding_format": "float",
|
||||
"text_1": "What is the capital of France?",
|
||||
"text_2": "The capital of France is Paris."
|
||||
}'
|
||||
```
|
||||
|
||||
Response:
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "score-request-id",
|
||||
"object": "list",
|
||||
"created": 693447,
|
||||
"model": "BAAI/bge-reranker-v2-m3",
|
||||
"data": [
|
||||
{
|
||||
"index": 0,
|
||||
"object": "score",
|
||||
"score": 1
|
||||
}
|
||||
],
|
||||
"usage": {}
|
||||
}
|
||||
```
|
||||
|
||||
#### Batch inference
|
||||
|
||||
You can pass a string to `text_1` and a list to `text_2`, forming multiple sentence pairs
|
||||
where each pair is built from `text_1` and a string in `text_2`.
|
||||
The total number of pairs is `len(text_2)`.
|
||||
|
||||
Request:
|
||||
|
||||
```bash
|
||||
curl -X 'POST' \
|
||||
'http://127.0.0.1:8000/score' \
|
||||
-H 'accept: application/json' \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{
|
||||
"model": "BAAI/bge-reranker-v2-m3",
|
||||
"text_1": "What is the capital of France?",
|
||||
"text_2": [
|
||||
"The capital of Brazil is Brasilia.",
|
||||
"The capital of France is Paris."
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
Response:
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "score-request-id",
|
||||
"object": "list",
|
||||
"created": 693570,
|
||||
"model": "BAAI/bge-reranker-v2-m3",
|
||||
"data": [
|
||||
{
|
||||
"index": 0,
|
||||
"object": "score",
|
||||
"score": 0.001094818115234375
|
||||
},
|
||||
{
|
||||
"index": 1,
|
||||
"object": "score",
|
||||
"score": 1
|
||||
}
|
||||
],
|
||||
"usage": {}
|
||||
}
|
||||
```
|
||||
|
||||
You can pass a list to both `text_1` and `text_2`, forming multiple sentence pairs
|
||||
where each pair is built from a string in `text_1` and the corresponding string in `text_2` (similar to `zip()`).
|
||||
The total number of pairs is `len(text_2)`.
|
||||
|
||||
Request:
|
||||
|
||||
```bash
|
||||
curl -X 'POST' \
|
||||
'http://127.0.0.1:8000/score' \
|
||||
-H 'accept: application/json' \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{
|
||||
"model": "BAAI/bge-reranker-v2-m3",
|
||||
"encoding_format": "float",
|
||||
"text_1": [
|
||||
"What is the capital of Brazil?",
|
||||
"What is the capital of France?"
|
||||
],
|
||||
"text_2": [
|
||||
"The capital of Brazil is Brasilia.",
|
||||
"The capital of France is Paris."
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
Response:
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "score-request-id",
|
||||
"object": "list",
|
||||
"created": 693447,
|
||||
"model": "BAAI/bge-reranker-v2-m3",
|
||||
"data": [
|
||||
{
|
||||
"index": 0,
|
||||
"object": "score",
|
||||
"score": 1
|
||||
},
|
||||
{
|
||||
"index": 1,
|
||||
"object": "score",
|
||||
"score": 1
|
||||
}
|
||||
],
|
||||
"usage": {}
|
||||
}
|
||||
```
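The pairing rules for the three request shapes above can be summarized in a few lines. `build_pairs` is a hypothetical helper written for illustration; rejecting combinations that are not described above (a list for `text_1` with a single string for `text_2`, or lists of different lengths) is an assumption of this sketch, not documented server behavior.

```python
def build_pairs(text_1, text_2):
    """Pair up inputs following the three request shapes described above."""
    if isinstance(text_1, str):
        # One string against one string, or one string against each list entry.
        text_2_list = [text_2] if isinstance(text_2, str) else list(text_2)
        return [(text_1, t2) for t2 in text_2_list]
    if isinstance(text_2, str) or len(text_1) != len(text_2):
        # Assumption: shapes not described above are rejected in this sketch.
        raise ValueError("unsupported combination of text_1 and text_2")
    return list(zip(text_1, text_2))  # element-wise pairing


print(build_pairs("What is the capital of France?",
                  ["The capital of Brazil is Brasilia.",
                   "The capital of France is Paris."]))
```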
|
||||
|
||||
#### Extra parameters
|
||||
|
||||
The following [pooling parameters][pooling-params] are supported.
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:score-pooling-params"
|
||||
```
|
||||
|
||||
The following extra parameters are supported:
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:score-extra-params"
|
||||
```
|
||||
|
||||
[](){ #rerank-api }
|
||||
|
||||
### Re-rank API
|
||||
|
||||
Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevance scores between a single query and
each document in a list. Usually, the score for a sentence pair refers to the similarity between two sentences, on
a scale of 0 to 1.
|
||||
|
||||
You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
|
||||
|
||||
The rerank endpoints support popular re-rank models such as `BAAI/bge-reranker-base` and other models supporting the
`score` task. Additionally, the `/rerank`, `/v1/rerank`, and `/v2/rerank`
endpoints are compatible with both [Jina AI's re-rank API interface](https://jina.ai/reranker/) and
[Cohere's re-rank API interface](https://docs.cohere.com/v2/reference/rerank), so they work with
popular open-source tools.
|
||||
|
||||
Code example: <gh-file:examples/online_serving/jinaai_rerank_client.py>
|
||||
|
||||
#### Example Request
|
||||
|
||||
Note that the `top_n` request parameter is optional and will default to the length of the `documents` field.
|
||||
Result documents will be sorted by relevance, and the `index` property can be used to determine original order.
|
||||
|
||||
Request:
|
||||
|
||||
```bash
|
||||
curl -X 'POST' \
|
||||
'http://127.0.0.1:8000/v1/rerank' \
|
||||
-H 'accept: application/json' \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{
|
||||
"model": "BAAI/bge-reranker-base",
|
||||
"query": "What is the capital of France?",
|
||||
"documents": [
|
||||
"The capital of Brazil is Brasilia.",
|
||||
"The capital of France is Paris.",
|
||||
"Horses and cows are both animals"
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
Response:
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "rerank-fae51b2b664d4ed38f5969b612edff77",
|
||||
"model": "BAAI/bge-reranker-base",
|
||||
"usage": {
|
||||
"total_tokens": 56
|
||||
},
|
||||
"results": [
|
||||
{
|
||||
"index": 1,
|
||||
"document": {
|
||||
"text": "The capital of France is Paris."
|
||||
},
|
||||
"relevance_score": 0.99853515625
|
||||
},
|
||||
{
|
||||
"index": 0,
|
||||
"document": {
|
||||
"text": "The capital of Brazil is Brasilia."
|
||||
},
|
||||
"relevance_score": 0.0005860328674316406
|
||||
}
|
||||
]
|
||||
}
|
||||
```
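Since results come back sorted by relevance, the `index` field is what ties each score to the document that was submitted. A minimal sketch of mapping scores back to the original order, using the values from the response above; `scores_in_original_order` is a hypothetical helper.

```python
results = [
    {"index": 1, "document": {"text": "The capital of France is Paris."},
     "relevance_score": 0.99853515625},
    {"index": 0, "document": {"text": "The capital of Brazil is Brasilia."},
     "relevance_score": 0.0005860328674316406},
]


def scores_in_original_order(results, num_documents):
    """Map relevance scores back to the positions of the submitted documents.

    Documents missing from `results` (for example, cut off by `top_n`) stay None.
    """
    scores = [None] * num_documents
    for item in results:
        scores[item["index"]] = item["relevance_score"]
    return scores


print(scores_in_original_order(results, num_documents=3))
```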
|
||||
|
||||
#### Extra parameters
|
||||
|
||||
The following [pooling parameters][pooling-params] are supported.
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:rerank-pooling-params"
|
||||
```
|
||||
|
||||
The following extra parameters are supported:
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:rerank-extra-params"
|
||||
```
|
||||
38
docs/serving/serve_args.md
Normal file
@@ -0,0 +1,38 @@
|
||||
---
|
||||
title: Server Arguments
|
||||
---
|
||||
[](){ #serve-args }
|
||||
|
||||
The `vllm serve` command is used to launch the OpenAI-compatible server.
|
||||
|
||||
## CLI Arguments
|
||||
|
||||
To see the available CLI arguments, run `vllm serve --help`.
|
||||
|
||||
## Configuration file
|
||||
|
||||
You can load CLI arguments via a [YAML](https://yaml.org/) config file.
|
||||
The argument names must be the long form of those outlined [above][serve-args].
|
||||
|
||||
For example:
|
||||
|
||||
```yaml
|
||||
# config.yaml
|
||||
|
||||
model: meta-llama/Llama-3.1-8B-Instruct
|
||||
host: "127.0.0.1"
|
||||
port: 6379
|
||||
uvicorn-log-level: "info"
|
||||
```
|
||||
|
||||
To use the above config file:
|
||||
|
||||
```bash
|
||||
vllm serve --config config.yaml
|
||||
```
|
||||
|
||||
!!! note
    If an argument is supplied both on the command line and in the config file, the value from the command line takes precedence.
    The order of priorities is `command line > config file values > defaults`.
    For example, with `vllm serve SOME_MODEL --config config.yaml`, `SOME_MODEL` takes precedence over `model` in the config file.
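The precedence order can be pictured as successive dictionary updates, where later sources overwrite earlier ones. This is only a conceptual sketch: `resolve_args` is a hypothetical function, not how vLLM parses its arguments, and the default values are made up for illustration.

```python
def resolve_args(defaults, config_file, command_line):
    """Later sources override earlier ones: defaults < config file < command line."""
    resolved = dict(defaults)
    resolved.update(config_file)
    resolved.update(command_line)
    return resolved


defaults = {"host": "0.0.0.0", "port": 8000}  # illustrative defaults, not vLLM's actual ones
config_file = {"model": "meta-llama/Llama-3.1-8B-Instruct", "host": "127.0.0.1", "port": 6379}
command_line = {"model": "SOME_MODEL"}

print(resolve_args(defaults, config_file, command_line))
```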
|
||||
59
docs/serving/usage_stats.md
Normal file
@@ -0,0 +1,59 @@
|
||||
# Usage Stats Collection
|
||||
|
||||
vLLM collects anonymous usage data by default to help the engineering team better understand which hardware and model configurations are widely used. This data allows them to prioritize their efforts on the most common workloads. The collected data is transparent and does not contain any sensitive information.
|
||||
|
||||
A subset of the data, after cleaning and aggregation, will be publicly released for the community's benefit. For example, you can see the 2024 usage report [here](https://2024.vllm.ai).
|
||||
|
||||
## What data is collected?
|
||||
|
||||
The list of data collected by the latest version of vLLM can be found here: <gh-file:vllm/usage/usage_lib.py>
|
||||
|
||||
Here is an example as of v0.4.0:
|
||||
|
||||
```json
|
||||
{
|
||||
"uuid": "fbe880e9-084d-4cab-a395-8984c50f1109",
|
||||
"provider": "GCP",
|
||||
"num_cpu": 24,
|
||||
"cpu_type": "Intel(R) Xeon(R) CPU @ 2.20GHz",
|
||||
"cpu_family_model_stepping": "6,85,7",
|
||||
"total_memory": 101261135872,
|
||||
"architecture": "x86_64",
|
||||
"platform": "Linux-5.10.0-28-cloud-amd64-x86_64-with-glibc2.31",
|
||||
"gpu_count": 2,
|
||||
"gpu_type": "NVIDIA L4",
|
||||
"gpu_memory_per_device": 23580639232,
|
||||
"model_architecture": "OPTForCausalLM",
|
||||
"vllm_version": "0.3.2+cu123",
|
||||
"context": "LLM_CLASS",
|
||||
"log_time": 1711663373492490000,
|
||||
"source": "production",
|
||||
"dtype": "torch.float16",
|
||||
"tensor_parallel_size": 1,
|
||||
"block_size": 16,
|
||||
"gpu_memory_utilization": 0.9,
|
||||
"quantization": null,
|
||||
"kv_cache_dtype": "auto",
|
||||
"enable_lora": false,
|
||||
"enable_prefix_caching": false,
|
||||
"enforce_eager": false,
|
||||
"disable_custom_all_reduce": true
|
||||
}
|
||||
```
|
||||
|
||||
You can preview the collected data by running the following command:
|
||||
|
||||
```bash
|
||||
tail ~/.config/vllm/usage_stats.json
|
||||
```
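To inspect the most recent record programmatically, something like the following could be used. It assumes the file holds one JSON object per line, which this page does not state; `last_usage_record` is a hypothetical helper and the two sample records are made up for illustration.

```python
import json


def last_usage_record(text):
    """Parse the last non-empty line of the file contents as JSON."""
    lines = [line for line in text.splitlines() if line.strip()]
    return json.loads(lines[-1]) if lines else None


# Two made-up records standing in for the contents of usage_stats.json.
sample = '{"gpu_type": "NVIDIA L4", "gpu_count": 2}\n{"gpu_type": "NVIDIA L4", "gpu_count": 4}\n'
print(last_usage_record(sample))
```

For the real file, pass its contents instead of `sample`, for example `Path("~/.config/vllm/usage_stats.json").expanduser().read_text()` with `Path` imported from `pathlib`.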
|
||||
|
||||
## Opting out
|
||||
|
||||
You can opt out of usage stats collection by setting the `VLLM_NO_USAGE_STATS` or `DO_NOT_TRACK` environment variable, or by creating a `~/.config/vllm/do_not_track` file:
|
||||
|
||||
```bash
|
||||
# Any of the following methods can disable usage stats collection
|
||||
export VLLM_NO_USAGE_STATS=1
|
||||
export DO_NOT_TRACK=1
|
||||
mkdir -p ~/.config/vllm && touch ~/.config/vllm/do_not_track
|
||||
```
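A script can check whether any of these opt-out mechanisms is active before launching vLLM. The sketch below is not vLLM's own logic: `usage_stats_disabled` is a hypothetical helper, and treating only the value `1` as "set" is an assumption based on the commands above.

```python
import os
from pathlib import Path


def usage_stats_disabled(env=None, config_dir="~/.config/vllm"):
    """Return True if any of the three opt-out mechanisms is active."""
    env = os.environ if env is None else env
    # Assumption: only the value "1" counts as opting out.
    if env.get("VLLM_NO_USAGE_STATS") == "1" or env.get("DO_NOT_TRACK") == "1":
        return True
    return (Path(config_dir).expanduser() / "do_not_track").exists()


print(usage_stats_disabled({"DO_NOT_TRACK": "1"}))
```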
|
||||