Migrate docs from Sphinx to MkDocs (#18145)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
This commit is contained in:
Harry Mellor
2025-05-23 11:09:53 +02:00
committed by GitHub
parent d0bc2f810b
commit a1fe24d961
218 changed files with 4126 additions and 6790 deletions

View File

@@ -0,0 +1,38 @@
---
title: Frequently Asked Questions
---
[](){ #faq }
> Q: How can I serve multiple models on a single port using the OpenAI API?
A: Assuming that you're referring to using the OpenAI-compatible server to serve multiple models at once, this is not currently supported. You can instead run multiple instances of the server (each serving a different model) at the same time, and add another layer to route each incoming request to the correct server.
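For example (a sketch only; the model names, ports and routing layer are placeholders), you could run two servers side by side and route between them:
```console
# Serve two models on different ports
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 &
vllm serve mistralai/Mistral-7B-Instruct-v0.3 --port 8001 &
# A reverse proxy (e.g. nginx) can then dispatch requests by model name
```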
---
> Q: Which model should I use for offline inference embedding?
A: You can try [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) and [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5);
more are listed [here][supported-models].
By extracting hidden states, vLLM can automatically convert text generation models like [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B),
[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) into embedding models,
but they are expected to be inferior to models that are specifically trained on embedding tasks.
---
> Q: Can the output of a prompt vary across runs in vLLM?
A: Yes, it can. vLLM does not guarantee stable log probabilities (logprobs) for the output tokens. Variations in logprobs may occur due to
numerical instability in Torch operations or non-deterministic behavior in batched Torch operations when batching changes. For more details,
see the [Numerical Accuracy section](https://pytorch.org/docs/stable/notes/numerical_accuracy.html#batched-computations-or-slice-computations).
In vLLM, the same requests might be batched differently due to factors such as other concurrent requests,
changes in batch size, or batch expansion in speculative decoding. These batching variations, combined with numerical instability of Torch operations,
can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in
different tokens being sampled. Once a different token is sampled, further divergence is likely.
## Mitigation Strategies
- For improved stability and reduced variance, use `float32`. Note that this will require more memory.
- If using `bfloat16`, switching to `float16` can also help.
- Using request seeds can aid in achieving more stable generation for temperature > 0, but discrepancies due to precision differences may still occur.
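For example (a hedged sketch against a locally running OpenAI-compatible server; the model name is a placeholder), a per-request seed can be supplied in the request body:
```console
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "San Francisco is a", "temperature": 0.8, "seed": 42}'
```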

View File

@@ -0,0 +1,5 @@
nav:
- README.md
- gpu.md
- cpu.md
- ai_accelerator.md

View File

@@ -0,0 +1,20 @@
---
title: Installation
---
[](){ #installation-index }
vLLM supports the following hardware platforms:
- [GPU](gpu.md)
- [NVIDIA CUDA](gpu.md#nvidia-cuda)
- [AMD ROCm](gpu.md#amd-rocm)
- [Intel XPU](gpu.md#intel-xpu)
- [CPU](cpu.md)
- [Intel/AMD x86](cpu.md#intelamd-x86)
- [ARM AArch64](cpu.md#arm-aarch64)
- [Apple silicon](cpu.md#apple-silicon)
- [IBM Z (S390X)](cpu.md#ibm-z-s390x)
- [Other AI accelerators](ai_accelerator.md)
- [Google TPU](ai_accelerator.md#google-tpu)
- [Intel Gaudi](ai_accelerator.md#intel-gaudi)
- [AWS Neuron](ai_accelerator.md#aws-neuron)

View File

@@ -0,0 +1,117 @@
# Other AI accelerators
vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor-specific instructions:
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:installation"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:installation"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:installation"
## Requirements
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:requirements"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:requirements"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:requirements"
## Configure a new environment
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:configure-a-new-environment"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:configure-a-new-environment"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:configure-a-new-environment"
## Set up using Python
### Pre-built wheels
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-wheels"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-wheels"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-wheels"
### Build wheel from source
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-wheel-from-source"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-wheel-from-source"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-wheel-from-source"
## Set up using Docker
### Pre-built images
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-images"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-images"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-images"
### Build image from source
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-image-from-source"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-image-from-source"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-image-from-source"
## Extra information
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:extra-information"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:extra-information"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:extra-information"

View File

@@ -0,0 +1,389 @@
# --8<-- [start:installation]
This tab provides instructions on running vLLM with Intel Gaudi devices.
!!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
# --8<-- [end:installation]
# --8<-- [start:requirements]
- OS: Ubuntu 22.04 LTS
- Python: 3.10
- Intel Gaudi accelerator
- Intel Gaudi software version 1.18.0
Please follow the instructions provided in the
[Gaudi Installation Guide](https://docs.habana.ai/en/latest/Installation_Guide/index.html)
to set up the execution environment. To achieve the best performance,
please follow the methods outlined in the
[Optimizing Training Platform Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html).
## Configure a new environment
### Environment verification
To verify that the Intel Gaudi software was correctly installed, run:
```console
hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
pip list | grep neural # verify that neural_compressor is installed
```
Refer to [Intel Gaudi Software Stack Verification](https://docs.habana.ai/en/latest/Installation_Guide/SW_Verification.html#platform-upgrade)
for more details.
### Run Docker Image
It is highly recommended to use the latest Docker image from Intel Gaudi
vault. Refer to the [Intel Gaudi documentation](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#pull-prebuilt-containers)
for more details.
Use the following commands to run a Docker image:
```console
docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
```
# --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built Intel Gaudi wheels.
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
To build and install vLLM from source, run:
```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/hpu.txt
python setup.py develop
```
Currently, the latest features and performance optimizations are developed in Gaudi's [vLLM-fork](https://github.com/HabanaAI/vllm-fork) and are periodically upstreamed to the main vLLM repository. To install the latest [HabanaAI/vLLM-fork](https://github.com/HabanaAI/vllm-fork), run the following:
```console
git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork
git checkout habana_main
pip install -r requirements/hpu.txt
python setup.py develop
```
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
Currently, there are no pre-built Intel Gaudi images.
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
```console
docker build -f docker/Dockerfile.hpu -t vllm-hpu-env .
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
```
!!! tip
If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to the "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have the `habana-container-runtime` package installed and that the `habana` container runtime is registered.
# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
## Supported features
- [Offline inference][offline-inference]
- Online serving via [OpenAI-Compatible Server][openai-compatible-server]
- HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
prefill attention, Root Mean Square Layer Normalization, Rotary
Positional Encoding
- Tensor parallelism support for multi-card inference
- Inference with [HPU Graphs](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html)
for accelerating low-batch latency and throughput
- Attention with Linear Biases (ALiBi)
## Unsupported features
- Beam search
- LoRA adapters
- Quantization
- Prefill chunking (mixed-batch inferencing)
## Supported configurations
The following configurations have been validated to function with
Gaudi2 devices. Configurations that are not listed may or may not work.
- [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b)
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
datatype with random or greedy sampling
- [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B)
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B)
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
datatype with random or greedy sampling
- [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b)
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
- [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B)
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3.1-70B](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B)
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
## Performance tuning
### Execution modes
Currently, vLLM for HPU supports four execution modes, depending on the selected HPU PyTorch Bridge backend (via the `PT_HPU_LAZY_MODE` environment variable) and the `--enforce-eager` flag.
| `PT_HPU_LAZY_MODE` | `enforce_eager` | execution mode     |
|--------------------|-----------------|--------------------|
| 0                  | 0               | torch.compile      |
| 0                  | 1               | PyTorch eager mode |
| 1                  | 0               | HPU Graphs         |
| 1                  | 1               | PyTorch lazy mode  |
<figcaption>vLLM execution modes</figcaption>
!!! warning
In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
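As a hedged illustration (the model name is a placeholder), the execution mode can be selected when launching the server:
```console
# HPU Graphs: lazy bridge backend with graphs enabled
PT_HPU_LAZY_MODE=1 vllm serve meta-llama/Llama-2-7b-chat-hf
# PyTorch eager mode: eager bridge backend with eager execution enforced
PT_HPU_LAZY_MODE=0 vllm serve meta-llama/Llama-2-7b-chat-hf --enforce-eager
```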
[](){ #gaudi-bucketing-mechanism }
### Bucketing mechanism
Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. [Intel Gaudi Graph Compiler](https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime) is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently, this is achieved by "bucketing" the model's forward pass across two dimensions - `batch_size` and `sequence_length`.
!!! note
Bucketing significantly reduces the number of required graphs, but it does not handle graph compilation or device code generation - that happens during the warmup and HPUGraph capture phase.
Bucketing ranges are determined by 3 parameters - `min`, `step` and `max`. They can be set separately for the prompt and decode phases, and for the batch size and sequence length dimensions. These parameters can be observed in the logs during vLLM startup:
```text
INFO 08-01 21:37:59 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
INFO 08-01 21:37:59 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
INFO 08-01 21:37:59 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048]
INFO 08-01 21:37:59 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
```
`min` determines the lowest value of the bucket. `step` determines the interval between buckets, and `max` determines the upper bound of the bucket. Furthermore, the interval between `min` and `step` has special handling: `min` is multiplied by consecutive powers of two until `step` is reached. We call this the ramp-up phase, and it is used for handling lower batch sizes with minimal wastage, while allowing larger padding on larger batch sizes.
Example (with ramp-up)
```text
min = 2, step = 32, max = 64
=> ramp_up = (2, 4, 8, 16)
=> stable = (32, 64)
=> buckets = ramp_up + stable => (2, 4, 8, 16, 32, 64)
```
Example (without ramp-up)
```text
min = 128, step = 128, max = 512
=> ramp_up = ()
=> stable = (128, 256, 384, 512)
=> buckets = ramp_up + stable => (128, 256, 384, 512)
```
In the logged scenario, 24 buckets were generated for prompt (prefill) runs, and 48 buckets for decode runs. Each bucket corresponds to a separate optimized device binary for a given model with specified tensor shapes. Whenever a batch of requests is processed, it is padded across batch and sequence length dimension to the smallest possible bucket.
!!! warning
If a request exceeds the maximum bucket size in any dimension, it will be processed without padding, and its processing may require graph compilation, potentially significantly increasing end-to-end latency. The boundaries of the buckets are user-configurable via environment variables, and upper bucket boundaries can be increased to avoid such a scenario.
As an example, if a request with 3 sequences and a maximum sequence length of 412 arrives at an idle vLLM server, it will be padded and executed as a `(4, 512)` prefill bucket: `batch_size` (the number of sequences) will be padded to 4 (the closest batch size dimension higher than 3), and the maximum sequence length will be padded to 512 (the closest sequence length dimension higher than 412). After the prefill stage, it will be executed as a `(4, 512)` decode bucket and will continue as that bucket until either the batch dimension changes (e.g. when a request finishes), in which case it becomes a `(2, 512)` bucket, or the context length increases above 512 tokens, in which case it becomes a `(4, 640)` bucket.
!!! note
Bucketing is transparent to a client -- padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.
### Warmup
Warmup is an optional but highly recommended step that occurs before the vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs so that no graph compilation overhead is incurred within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:
```text
INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][2/24] batch_size:4 seq_len:896 free_mem:55.43 GiB
INFO 08-01 22:26:48 hpu_model_runner.py:1066] [Warmup][Prompt][3/24] batch_size:4 seq_len:768 free_mem:55.43 GiB
...
INFO 08-01 22:26:59 hpu_model_runner.py:1066] [Warmup][Prompt][24/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][1/48] batch_size:4 seq_len:2048 free_mem:55.43 GiB
INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][2/48] batch_size:4 seq_len:1920 free_mem:55.43 GiB
INFO 08-01 22:27:01 hpu_model_runner.py:1066] [Warmup][Decode][3/48] batch_size:4 seq_len:1792 free_mem:55.43 GiB
...
INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size:2 seq_len:128 free_mem:55.43 GiB
INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
```
This example uses the same buckets as in the [Bucketing Mechanism][gaudi-bucketing-mechanism] section. Each output line corresponds to the execution of a single bucket. When a bucket is executed for the first time, its graph is compiled and can be reused later, skipping further graph compilations.
!!! tip
Compiling all the buckets may take some time and can be turned off with the `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do so, you may face graph compilations when a given bucket is executed for the first time. Disabling warmup is fine for development, but enabling it is highly recommended in deployment.
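For example (a sketch; the model name is a placeholder):
```console
# Development only: skip warmup and accept runtime graph compilations
VLLM_SKIP_WARMUP=true vllm serve meta-llama/Llama-2-7b-chat-hf
```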
### HPU Graph capture
[HPU Graphs](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html) are currently the most performant execution method of vLLM on Intel Gaudi. When HPU Graphs are enabled, execution graphs will be traced (recorded) ahead of time (after performing warmup), to be later replayed during inference, significantly reducing host overheads. Recording can take large amounts of memory, which needs to be taken into account when allocating KV cache. Enabling HPU Graphs will impact the number of available KV cache blocks, but vLLM provides user-configurable variables to control memory management.
When HPU Graphs are used, they share a common memory pool ("usable memory") with the KV cache, as determined by the `gpu_memory_utilization` flag (`0.9` by default).
Before the KV cache is allocated, model weights are loaded onto the device, and a forward pass of the model is executed on dummy data to estimate memory usage.
Only then is the `gpu_memory_utilization` flag applied: at its default value, it marks 90% of the free device memory at that point as usable.
Next, the KV cache is allocated, the model is warmed up, and HPU Graphs are captured.
The environment variable `VLLM_GRAPH_RESERVED_MEM` defines the ratio of memory reserved for HPU Graphs capture.
With its default value (`VLLM_GRAPH_RESERVED_MEM=0.1`), 10% of usable memory will be reserved for graph capture (later referred to as "usable graph memory"), and the remaining 90% will be used for the KV cache.
The environment variable `VLLM_GRAPH_PROMPT_RATIO` determines the ratio of usable graph memory reserved for prefill graphs versus decode graphs (`0.3` by default).
A lower value corresponds to less usable graph memory reserved for the prefill stage; e.g. `VLLM_GRAPH_PROMPT_RATIO=0.2` reserves 20% of usable graph memory for prefill graphs and 80% for decode graphs.
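A minimal sketch of tuning both knobs together (the values and model name are illustrative only):
```console
# Reserve 20% of usable memory for graph capture, 40% of which goes to prompt graphs
VLLM_GRAPH_RESERVED_MEM=0.2 VLLM_GRAPH_PROMPT_RATIO=0.4 \
    vllm serve meta-llama/Llama-2-7b-chat-hf
```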
!!! note
`gpu_memory_utilization` does not correspond to the absolute memory usage across the HPU. Rather, it specifies the memory margin after loading the model and performing a profile run. If the device has 100 GiB of total memory and 50 GiB of free memory after loading model weights and executing the profiling run, `gpu_memory_utilization` at its default value will mark 90% of that 50 GiB as usable, leaving a 5 GiB margin, regardless of total device memory.
The user can also configure the strategy for capturing HPU Graphs separately for the prompt and decode stages. The strategy affects the order in which graphs are captured. Two strategies are implemented:
- `max_bs` - the graph capture queue is sorted in descending order by batch size. Buckets with equal batch sizes are sorted by sequence length in ascending order (e.g. `(64, 128)`, `(64, 256)`, `(32, 128)`, `(32, 256)`, `(1, 128)`, `(1, 256)`); this is the default strategy for decode
- `min_tokens` - the graph capture queue is sorted in ascending order by the number of tokens each graph processes (`batch_size*sequence_length`); this is the default strategy for prompt
When there is a large number of requests pending, the vLLM scheduler attempts to fill the maximum decode batch size as soon as possible. When a request finishes, the decode batch size decreases. vLLM then attempts to schedule a prefill iteration for requests in the waiting queue, to restore the decode batch size to its previous state. This means that in a full-load scenario, the decode batch size is often at its maximum, which makes capturing large-batch-size HPU Graphs crucial, as reflected by the `max_bs` strategy. On the other hand, prefill is executed most frequently with very low batch sizes (1-4), which is reflected in the `min_tokens` strategy.
!!! note
`VLLM_GRAPH_PROMPT_RATIO` does not set a hard limit on the memory taken by graphs for each stage (prefill and decode). vLLM will first attempt to use the entirety of usable prefill graph memory (usable graph memory * `VLLM_GRAPH_PROMPT_RATIO`) for capturing prefill HPU Graphs; then it will attempt to do the same for decode graphs within the usable decode graph memory pool. If one stage is fully captured and there is unused memory left within the usable graph memory pool, vLLM will attempt further graph capture for the other stage, until no more HPU Graphs can be captured without exceeding the reserved memory pool. The behavior of this mechanism can be observed in the example below.
Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):
```text
INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
INFO 08-02 17:37:44 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
INFO 08-02 17:37:44 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048]
INFO 08-02 17:37:44 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
INFO 08-02 17:37:52 hpu_model_runner.py:430] Pre-loading model weights on hpu:0 took 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
INFO 08-02 17:37:52 hpu_model_runner.py:438] Wrapping in HPU Graph took 0 B of device memory (14.97 GiB/94.62 GiB used) and -252 KiB of host memory (475.2 GiB/1007 GiB used)
INFO 08-02 17:37:52 hpu_model_runner.py:442] Loading model weights took in total 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
INFO 08-02 17:37:54 hpu_worker.py:134] Model profiling run took 504 MiB of device memory (15.46 GiB/94.62 GiB used) and 180.9 MiB of host memory (475.4 GiB/1007 GiB used)
INFO 08-02 17:37:54 hpu_worker.py:158] Free device memory: 79.16 GiB, 39.58 GiB usable (gpu_memory_utilization=0.5), 15.83 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.4), 23.75 GiB reserved for KV cache
INFO 08-02 17:37:54 hpu_executor.py:85] # HPU blocks: 1519, # CPU blocks: 0
INFO 08-02 17:37:54 hpu_worker.py:190] Initializing cache engine took 23.73 GiB of device memory (39.2 GiB/94.62 GiB used) and -1.238 MiB of host memory (475.4 GiB/1007 GiB used)
INFO 08-02 17:37:54 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:55.43 GiB
...
INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
INFO 08-02 17:38:22 hpu_model_runner.py:1159] Using 15.85 GiB/55.43 GiB of free device memory for HPUGraphs, 7.923 GiB for prompt and 7.923 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3)
INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][1/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
...
INFO 08-02 17:38:26 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][11/24] batch_size:1 seq_len:896 free_mem:48.77 GiB
INFO 08-02 17:38:27 hpu_model_runner.py:1066] [Warmup][Graph/Decode][1/48] batch_size:4 seq_len:128 free_mem:47.51 GiB
...
INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Decode][48/48] batch_size:1 seq_len:2048 free_mem:47.35 GiB
INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][12/24] batch_size:4 seq_len:256 free_mem:47.35 GiB
INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][13/24] batch_size:2 seq_len:512 free_mem:45.91 GiB
INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][14/24] batch_size:1 seq_len:1024 free_mem:44.48 GiB
INFO 08-02 17:38:43 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][15/24] batch_size:2 seq_len:640 free_mem:43.03 GiB
INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Prompt captured:15 (62.5%) used_mem:14.03 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (4, 128), (4, 256)]
INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Decode captured:48 (100.0%) used_mem:161.9 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
INFO 08-02 17:38:43 hpu_model_runner.py:1206] Warmup finished in 49 secs, allocated 14.19 GiB of device memory
INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of device memory (53.39 GiB/94.62 GiB used) and 57.86 MiB of host memory (475.4 GiB/1007 GiB used)
```
### Recommended vLLM Parameters
- We recommend running inference on Gaudi 2 with `block_size` of 128
for the BF16 data type. Using the default values (16, 32) may lead to
sub-optimal performance due to Matrix Multiplication Engine
under-utilization (see [Gaudi Architecture](https://docs.habana.ai/en/latest/Gaudi_Overview/Gaudi_Architecture.html)).
- For maximum throughput on Llama 7B, we recommend running with a batch size
of 128 or 256 and a maximum context length of 2048 with HPU Graphs enabled.
If you encounter out-of-memory issues, see the troubleshooting section below.
### Environment variables
**Diagnostic and profiling knobs:**
- `VLLM_PROFILER_ENABLED`: If `true`, enable the high level profiler. Resulting JSON traces can be viewed in [perfetto.habana.ai](https://perfetto.habana.ai/#!/viewer). `false` by default.
- `VLLM_HPU_LOG_STEP_GRAPH_COMPILATION`: If `true`, log graph compilations for each vLLM engine step when any occurs. Highly recommended to use with `PT_HPU_METRICS_GC_DETAILS=1`. `false` by default.
- `VLLM_HPU_LOG_STEP_GRAPH_COMPILATION_ALL`: If `true`, always log graph compilations for each vLLM engine step even if none occurred. `false` by default.
- `VLLM_HPU_LOG_STEP_CPU_FALLBACKS`: If `true`, log CPU fallbacks for each vLLM engine step when any occurs. `false` by default.
- `VLLM_HPU_LOG_STEP_CPU_FALLBACKS_ALL`: If `true`, always log CPU fallbacks for each vLLM engine step even if none occurred. `false` by default.
**Performance tuning knobs:**
- `VLLM_SKIP_WARMUP`: if `true`, warmup will be skipped, `false` by default
- `VLLM_GRAPH_RESERVED_MEM`: percentage of memory dedicated for HPUGraph capture, `0.1` by default
- `VLLM_GRAPH_PROMPT_RATIO`: percentage of reserved graph memory dedicated for prompt graphs, `0.3` by default
- `VLLM_GRAPH_PROMPT_STRATEGY`: strategy determining order of prompt graph capture, `min_tokens` or `max_bs`, `min_tokens` by default
- `VLLM_GRAPH_DECODE_STRATEGY`: strategy determining order of decode graph capture, `min_tokens` or `max_bs`, `max_bs` by default
- `VLLM_{phase}_{dim}_BUCKET_{param}` - a collection of 12 environment variables configuring the ranges of the bucketing mechanism (see the example after this list)
* `{phase}` is either `PROMPT` or `DECODE`
* `{dim}` is either `BS`, `SEQ` or `BLOCK`
* `{param}` is either `MIN`, `STEP` or `MAX`
* Default values:
- Prompt:
- batch size min (`VLLM_PROMPT_BS_BUCKET_MIN`): `1`
- batch size step (`VLLM_PROMPT_BS_BUCKET_STEP`): `min(max_num_seqs, 32)`
- batch size max (`VLLM_PROMPT_BS_BUCKET_MAX`): `min(max_num_seqs, 64)`
- sequence length min (`VLLM_PROMPT_SEQ_BUCKET_MIN`): `block_size`
- sequence length step (`VLLM_PROMPT_SEQ_BUCKET_STEP`): `block_size`
- sequence length max (`VLLM_PROMPT_SEQ_BUCKET_MAX`): `max_model_len`
- Decode:
- batch size min (`VLLM_DECODE_BS_BUCKET_MIN`): `1`
- batch size step (`VLLM_DECODE_BS_BUCKET_STEP`): `min(max_num_seqs, 32)`
- batch size max (`VLLM_DECODE_BS_BUCKET_MAX`): `max_num_seqs`
- block size min (`VLLM_DECODE_BLOCK_BUCKET_MIN`): `block_size`
- block size step (`VLLM_DECODE_BLOCK_BUCKET_STEP`): `block_size`
- block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`): `max(128, (max_num_seqs*max_model_len)/block_size)`
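As a hedged example (the values and model name are illustrative only), the defaults can be overridden by exporting the variables before launch:
```console
# Widen the prompt sequence-length bucketing range
export VLLM_PROMPT_SEQ_BUCKET_MIN=128
export VLLM_PROMPT_SEQ_BUCKET_STEP=256
export VLLM_PROMPT_SEQ_BUCKET_MAX=2048
vllm serve meta-llama/Llama-2-7b-chat-hf
```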
Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM execution:
- `PT_HPU_LAZY_MODE`: if `0`, PyTorch Eager backend for Gaudi will be used; if `1`, PyTorch Lazy backend for Gaudi will be used. `1` is default.
- `PT_HPU_ENABLE_LAZY_COLLECTIVES`: required to be `true` for tensor parallel inference with HPU Graphs
## Troubleshooting: tweaking HPU graphs
If you experience device out-of-memory issues or want to attempt
inference at higher batch sizes, try tweaking HPU Graphs as follows:
- Tweak the `gpu_memory_utilization` knob. This decreases the
allocation of KV cache, leaving some headroom for capturing graphs
with larger batch sizes. By default, `gpu_memory_utilization` is set
to 0.9, which attempts to allocate ~90% of the HBM left for KV cache after the
short profiling run. Note that decreasing it reduces the number of KV
cache blocks available, and therefore the effective maximum number of
tokens that can be handled at a given time.
- If this method is not sufficient, you can disable HPU Graphs
entirely. With HPU Graphs disabled, you trade latency and
throughput at lower batch sizes for potentially higher throughput at
higher batch sizes. You can do this by adding the `--enforce-eager` flag to the
server (for online serving), or by passing `enforce_eager=True`
to the `LLM` constructor (for offline inference).
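Combining both adjustments in a single launch command might look like this (a sketch; the values and model name are placeholders):
```console
# Smaller KV cache allocation plus eager execution instead of HPU Graphs
vllm serve meta-llama/Llama-2-7b-chat-hf --gpu-memory-utilization 0.8 --enforce-eager
```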
# --8<-- [end:extra-information]

View File

@@ -0,0 +1,146 @@
# --8<-- [start:installation]
vLLM 0.3.3 and later supports model inference and serving on AWS Trainium/Inferentia with the Neuron SDK, including continuous batching.
Paged Attention and Chunked Prefill are currently in development and will be available soon.
The data types currently supported by the Neuron SDK are FP16 and BF16.
!!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
# --8<-- [end:installation]
# --8<-- [start:requirements]
- OS: Linux
- Python: 3.9 -- 3.11
- Accelerator: NeuronCore_v2 (in trn1/inf2 instances)
- PyTorch 2.0.1/2.1.1
- AWS Neuron SDK 2.16/2.17 (verified on Python 3.8)
## Configure a new environment
### Launch Trn1/Inf2 instances
Here are the steps to launch Trn1/Inf2 instances and set up [PyTorch Neuron ("torch-neuronx") on Ubuntu 22.04 LTS](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu22.html).
- Please follow the instructions at [launch an Amazon EC2 Instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html#ec2-launch-instance) to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type.
- To get more information about instances sizes and pricing see: [Trn1 web page](https://aws.amazon.com/ec2/instance-types/trn1/), [Inf2 web page](https://aws.amazon.com/ec2/instance-types/inf2/)
- Select the Ubuntu Server 22.04 LTS AMI.
- When launching a Trn1/Inf2 instance, please adjust your primary EBS volume size to a minimum of 512 GB.
- After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance.
### Install drivers and tools
Installing drivers and tools is not necessary if the [Deep Learning AMI Neuron](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html) is used. If the drivers and tools are not installed on the operating system, follow the steps below:
```console
# Configure Linux for Neuron repository updates
. /etc/os-release
sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
EOF
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -
# Update OS packages
sudo apt-get update -y
# Install OS headers
sudo apt-get install linux-headers-$(uname -r) -y
# Install git
sudo apt-get install git -y
# Install Neuron Driver
sudo apt-get install aws-neuronx-dkms=2.* -y
# Install Neuron Runtime
sudo apt-get install aws-neuronx-collectives=2.* -y
sudo apt-get install aws-neuronx-runtime-lib=2.* -y
# Install Neuron Tools
sudo apt-get install aws-neuronx-tools=2.* -y
# Add PATH
export PATH=/opt/aws/neuron/bin:$PATH
```
# --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built Neuron wheels.
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
!!! note
The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
The following instructions are applicable to Neuron SDK 2.16 and beyond.
#### Install transformers-neuronx and its dependencies
[transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx) will be the backend that supports inference on Trn1/Inf2 instances.
Follow the steps below to install the transformers-neuronx package and its dependencies.
```console
# Install Python venv
sudo apt-get install -y python3.10-venv g++
# Create Python venv
python3.10 -m venv aws_neuron_venv_pytorch
# Activate Python venv
source aws_neuron_venv_pytorch/bin/activate
# Install Jupyter notebook kernel
pip install ipykernel
python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
pip install jupyter notebook
pip install environment_kernels
# Set pip repository pointing to the Neuron repository
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
# Install wget, awscli
python -m pip install wget
python -m pip install awscli
# Update Neuron Compiler and Framework
python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torchvision transformers-neuronx
```
#### Install vLLM from source
Once the neuronx-cc and transformers-neuronx packages are installed, vLLM can be installed as follows:
```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -U -r requirements/neuron.txt
VLLM_TARGET_DEVICE="neuron" pip install .
```
If the Neuron packages are detected correctly during the installation process, `vllm-0.3.0+neuron212` will be installed.
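As a quick sanity check (a sketch, not part of the official instructions), you can confirm that the installed package imports and reports its version:
```console
python -c "import vllm; print(vllm.__version__)"
```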
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
Currently, there are no pre-built Neuron images.
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.
Make sure to use <gh-file:docker/Dockerfile.neuron> in place of the default Dockerfile.
# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
There is no extra information for this device.
# --8<-- [end:extra-information]

View File

@@ -0,0 +1,198 @@
# --8<-- [start:installation]
Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
are available in different versions, each with different hardware specifications.
For more information about TPUs, see [TPU System Architecture](https://cloud.google.com/tpu/docs/system-architecture-tpu-vm).
For more information on the TPU versions supported with vLLM, see:
- [TPU v6e](https://cloud.google.com/tpu/docs/v6e)
- [TPU v5e](https://cloud.google.com/tpu/docs/v5e)
- [TPU v5p](https://cloud.google.com/tpu/docs/v5p)
- [TPU v4](https://cloud.google.com/tpu/docs/v4)
These TPU versions allow you to configure the physical arrangements of the TPU
chips. This can improve throughput and networking performance. For more
information see:
- [TPU v6e topologies](https://cloud.google.com/tpu/docs/v6e#configurations)
- [TPU v5e topologies](https://cloud.google.com/tpu/docs/v5e#tpu-v5e-config)
- [TPU v5p topologies](https://cloud.google.com/tpu/docs/v5p#tpu-v5p-config)
- [TPU v4 topologies](https://cloud.google.com/tpu/docs/v4#tpu-v4-config)
In order to use Cloud TPUs, you need to have TPU quota granted to your
Google Cloud Platform project. TPU quotas specify how many TPUs you can use in a
GCP project and are specified in terms of TPU version, the number of TPUs you
want to use, and quota type. For more information, see [TPU quota](https://cloud.google.com/tpu/docs/quota#tpu_quota).
For TPU pricing information, see [Cloud TPU pricing](https://cloud.google.com/tpu/pricing).
You may need additional persistent storage for your TPU VMs. For more
information, see [Storage options for Cloud TPU data](https://cloud.google.com/tpu/docs/storage-options).
!!! warning
There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
# --8<-- [end:installation]
# --8<-- [start:requirements]
- Google Cloud TPU VM
- TPU versions: v6e, v5e, v5p, v4
- Python: 3.10 or newer
### Provision Cloud TPUs
You can provision Cloud TPUs using the [Cloud TPU API](https://cloud.google.com/tpu/docs/reference/rest)
or the [queued resources](https://cloud.google.com/tpu/docs/queued-resources)
API (preferred). This section shows how to create TPUs using the queued resource API. For
more information about using the Cloud TPU API, see [Create a Cloud TPU using the Create Node API](https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm#create-node-api).
Queued resources enable you to request Cloud TPU resources in a queued manner.
When you request queued resources, the request is added to a queue maintained by
the Cloud TPU service. When the requested resource becomes available, it's
assigned to your Google Cloud project for your immediate exclusive use.
!!! note
In all of the following commands, replace the ALL CAPS parameter names with
appropriate values. See the parameter descriptions table for more information.
### Provision Cloud TPUs with GKE
For more information about using TPUs with GKE, see:
- <https://cloud.google.com/kubernetes-engine/docs/how-to/tpus>
- <https://cloud.google.com/kubernetes-engine/docs/concepts/tpus>
- <https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus>
## Configure a new environment
### Provision a Cloud TPU with the queued resource API
Create a TPU v5e with 4 TPU chips:
```console
gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
--node-id TPU_NAME \
--project PROJECT_ID \
--zone ZONE \
--accelerator-type ACCELERATOR_TYPE \
--runtime-version RUNTIME_VERSION \
--service-account SERVICE_ACCOUNT
```
| Parameter name     | Description |
|--------------------|-------------|
| QUEUED_RESOURCE_ID | The user-assigned ID of the queued resource request. |
| TPU_NAME           | The user-assigned name of the TPU, which is created when the queued resource request is allocated. |
| PROJECT_ID         | Your Google Cloud project. |
| ZONE               | The GCP zone where you want to create your Cloud TPU. The value depends on the TPU version you are using. For more information, see [TPU regions and zones](https://cloud.google.com/tpu/docs/regions-zones). |
| ACCELERATOR_TYPE   | The TPU version and size you want to use, for example `v5litepod-4` for a v5e TPU with 4 cores. |
| RUNTIME_VERSION    | The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPUs. For more information, see [TPU VM images](https://cloud.google.com/tpu/docs/runtimes). |
<figcaption>Parameter descriptions</figcaption>
Connect to your TPU using SSH:
```bash
gcloud compute tpus tpu-vm ssh TPU_NAME --zone ZONE
```
# --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built TPU wheels.
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
Install Miniconda:
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
```
Create and activate a Conda environment for vLLM:
```bash
conda create -n vllm python=3.10 -y
conda activate vllm
```
Clone the vLLM repository and go to the vLLM directory:
```bash
git clone https://github.com/vllm-project/vllm.git && cd vllm
```
Uninstall the existing `torch` and `torch_xla` packages:
```bash
pip uninstall torch torch-xla -y
```
Install build dependencies:
```bash
pip install -r requirements/tpu.txt
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
```
Run the setup script:
```bash
VLLM_TARGET_DEVICE="tpu" python -m pip install -e .
```
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.
```console
docker build -f docker/Dockerfile.tpu -t vllm-tpu .
```
Run the Docker image with the following command:
```console
# Make sure to add `--privileged --net host --shm-size=16G`.
docker run --privileged --net host --shm-size=16G -it vllm-tpu
```
!!! note
Since TPU relies on XLA which requires static shapes, vLLM bucketizes the
possible input shapes and compiles an XLA graph for each shape. The
compilation time may take 20~30 minutes in the first run. However, the
compilation time reduces to ~5 minutes afterwards because the XLA graphs are
cached in the disk (in `VLLM_XLA_CACHE_PATH` or `~/.cache/vllm/xla_cache` by default).
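For example (a sketch; the path is a placeholder), you can point the cache at persistent storage so it survives VM or container restarts:
```console
export VLLM_XLA_CACHE_PATH=/mnt/disks/persist/xla_cache
```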
!!! tip
If you encounter the following error:
```console
from torch._C import * # noqa: F403
ImportError: libopenblas.so.0: cannot open shared object file: No such
file or directory
```
Install OpenBLAS with the following command:
```console
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
```
# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
There is no extra information for this device.
# --8<-- [end:extra-information]

View File

@@ -0,0 +1,183 @@
# CPU
vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor-specific instructions:
=== "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:installation"
=== "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:installation"
=== "Apple silicon"
--8<-- "docs/getting_started/installation/cpu/apple.inc.md:installation"
=== "IBM Z (S390X)"
--8<-- "docs/getting_started/installation/cpu/s390x.inc.md:installation"
## Requirements
- Python: 3.9 -- 3.12
=== "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:requirements"
=== "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:requirements"
=== "Apple silicon"
--8<-- "docs/getting_started/installation/cpu/apple.inc.md:requirements"
=== "IBM Z (S390X)"
--8<-- "docs/getting_started/installation/cpu/s390x.inc.md:requirements"
## Set up using Python
### Create a new Python environment
--8<-- "docs/getting_started/installation/python_env_setup.inc.md"
### Pre-built wheels
Currently, there are no pre-built CPU wheels.
### Build wheel from source
=== "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-wheel-from-source"
=== "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-wheel-from-source"
=== "Apple silicon"
--8<-- "docs/getting_started/installation/cpu/apple.inc.md:build-wheel-from-source"
=== "IBM Z (s390x)"
--8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-wheel-from-source"
## Set up using Docker
### Pre-built images
=== "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:pre-built-images"
### Build image from source
```console
$ docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env --target vllm-openai .
# Launching OpenAI server
$ docker run --rm \
--privileged=true \
--shm-size=4g \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
-e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
vllm-cpu-env \
--model=meta-llama/Llama-3.2-1B-Instruct \
--dtype=bfloat16 \
other vLLM OpenAI server arguments
```
!!! tip
For ARM or Apple silicon, use `docker/Dockerfile.arm`
!!! tip
For IBM Z (s390x), use `docker/Dockerfile.s390x` and in `docker run` use flag `--dtype float`
## Supported features
The vLLM CPU backend supports the following vLLM features:
- Tensor Parallel
- Model Quantization (`INT8 W8A8, AWQ, GPTQ`)
- Chunked-prefill
- Prefix-caching
- FP8-E5M2 KV cache
## Related runtime environment variables
- `VLLM_CPU_KVCACHE_SPACE`: specifies the KV cache size (e.g., `VLLM_CPU_KVCACHE_SPACE=40` means 40 GiB of space for the KV cache); a larger setting allows vLLM to run more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
- `VLLM_CPU_OMP_THREADS_BIND`: specifies the CPU cores dedicated to the OpenMP threads. For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound to CPU cores 0-31. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes: the 32 OpenMP threads of rank 0 are bound to CPU cores 0-31, and the OpenMP threads of rank 1 are bound to CPU cores 32-63.
- `VLLM_CPU_MOE_PREPACK`: whether to use prepack for MoE layers. This is passed to `ipex.llm.modules.GatedMLPMOE`. The default is `1` (True). On unsupported CPUs, you might need to set this to `0` (False).
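For example (a sketch; the model name is a placeholder), a typical launch combining these variables:
```console
VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND=0-31 \
    vllm serve meta-llama/Llama-3.2-1B-Instruct --dtype bfloat16
```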
## Performance tips
- We highly recommend using TCMalloc for high-performance memory allocation and better cache locality. For example, on Ubuntu 22.04, you can run:
```console
sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
find / -name "*libtcmalloc*" # find the dynamic link library path
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
python examples/offline_inference/basic/basic.py # run vLLM
```
- When using online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserve CPUs 30 and 31 for the framework and use CPUs 0-29 for OpenMP:
```console
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND=0-29
vllm serve facebook/opt-125m
```
- If using the vLLM CPU backend on a machine with hyper-threading, it is recommended to bind only one OpenMP thread to each physical CPU core using `VLLM_CPU_OMP_THREADS_BIND`. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
```console
$ lscpu -e # check the mapping between logical CPU cores and physical CPU cores
# The "CPU" column means the logical CPU core IDs, and the "CORE" column means the physical core IDs. On this platform, two logical cores are sharing one physical core.
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
0 0 0 0 0:0:0:0 yes 2401.0000 800.0000 800.000
1 0 0 1 1:1:1:0 yes 2401.0000 800.0000 800.000
2 0 0 2 2:2:2:0 yes 2401.0000 800.0000 800.000
3 0 0 3 3:3:3:0 yes 2401.0000 800.0000 800.000
4 0 0 4 4:4:4:0 yes 2401.0000 800.0000 800.000
5 0 0 5 5:5:5:0 yes 2401.0000 800.0000 800.000
6 0 0 6 6:6:6:0 yes 2401.0000 800.0000 800.000
7 0 0 7 7:7:7:0 yes 2401.0000 800.0000 800.000
8 0 0 0 0:0:0:0 yes 2401.0000 800.0000 800.000
9 0 0 1 1:1:1:0 yes 2401.0000 800.0000 800.000
10 0 0 2 2:2:2:0 yes 2401.0000 800.0000 800.000
11 0 0 3 3:3:3:0 yes 2401.0000 800.0000 800.000
12 0 0 4 4:4:4:0 yes 2401.0000 800.0000 800.000
13 0 0 5 5:5:5:0 yes 2401.0000 800.0000 800.000
14 0 0 6 6:6:6:0 yes 2401.0000 800.0000 800.000
15 0 0 7 7:7:7:0 yes 2401.0000 800.0000 800.000
# On this platform, it is recommended to bind OpenMP threads only to logical CPU cores 0-7 or 8-15
$ export VLLM_CPU_OMP_THREADS_BIND=0-7
$ python examples/offline_inference/basic/basic.py
```
- If using the vLLM CPU backend on a multi-socket machine with NUMA, take care to set the CPU cores via `VLLM_CPU_OMP_THREADS_BIND` so as to avoid cross-NUMA-node memory access.
## Other considerations
- The CPU backend significantly differs from the GPU backend since the vLLM architecture was originally optimized for GPU use. A number of optimizations are needed to enhance its performance.
- Decouple the HTTP serving components from the inference components. In a GPU backend configuration, the HTTP serving and tokenization tasks operate on the CPU, while inference runs on the GPU, which typically does not pose a problem. However, in a CPU-based setup, the HTTP serving and tokenization can cause significant context switching and reduced cache efficiency. Therefore, it is strongly recommended to segregate these two components for improved performance.
- On CPU-based setups with NUMA enabled, memory access performance may be largely impacted by the [topology](https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.md#non-uniform-memory-access-numa). For NUMA architectures, Tensor Parallel is an option for better performance.
- Tensor Parallel is supported for serving and offline inference. In general, each NUMA node is treated as one GPU card. Below is an example script enabling Tensor Parallel = 2 for serving:
```console
VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
```
- For each thread ID list in `VLLM_CPU_OMP_THREADS_BIND`, users should guarantee that the threads in the list belong to the same NUMA node.
- Users should also mind the memory capacity of each NUMA node. The memory usage of each TP rank is the sum of the weight shard size and `VLLM_CPU_KVCACHE_SPACE`; if it exceeds the capacity of a single NUMA node, the TP worker will be killed due to out-of-memory.

View File

@@ -0,0 +1,67 @@
# --8<-- [start:installation]
vLLM has experimental support for macOS with Apple silicon. For now, users must build vLLM from source to run it natively on macOS.
Currently, the CPU implementation for macOS supports the FP32 and FP16 data types.
!!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
# --8<-- [end:installation]
# --8<-- [start:requirements]
- OS: `macOS Sonoma` or later
- SDK: `Xcode 15.4` or later with Command Line Tools
- Compiler: `Apple Clang >= 15.0.0`
# --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
After installing Xcode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from source.
```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/cpu.txt
pip install -e .
```
!!! note
On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.
#### Troubleshooting
If the build fails with errors like the following snippet, where standard C++ headers cannot be found, try removing and reinstalling your
[Command Line Tools for Xcode](https://developer.apple.com/download/all/).
```text
[...] fatal error: 'map' file not found
1 | #include <map>
| ^~~~~
1 error generated.
[2/8] Building CXX object CMakeFiles/_C.dir/csrc/cpu/pos_encoding.cpp.o
[...] fatal error: 'cstddef' file not found
10 | #include <cstddef>
| ^~~~~~~~~
1 error generated.
```
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
# --8<-- [end:extra-information]

View File

@@ -0,0 +1,41 @@
# --8<-- [start:installation]
vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform.
The ARM CPU backend currently supports the FP32, FP16 and BF16 data types.
!!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
# --8<-- [end:installation]
# --8<-- [start:requirements]
- OS: Linux
- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
- Instruction Set Architecture (ISA): NEON support is required
# --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
--8<-- "docs/getting_started/installation/cpu/cpu/build.inc.md"
Testing has been conducted on AWS Graviton3 instances for compatibility.
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
# --8<-- [end:extra-information]

View File

@@ -0,0 +1,36 @@
First, install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:
```console
sudo apt-get update -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
```
Second, clone vLLM project:
```console
git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
```
Third, install the Python packages required to build the vLLM CPU backend:
```console
pip install --upgrade pip
pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
```
Finally, build and install vLLM CPU backend:
```console
VLLM_TARGET_DEVICE=cpu python setup.py install
```
If you want to develop vLLM, install it in editable mode instead.
```console
VLLM_TARGET_DEVICE=cpu python setup.py develop
```
# --8<-- [end:extra-information]

View File

@@ -0,0 +1,69 @@
# --8<-- [start:installation]
vLLM has experimental support for the s390x architecture on the IBM Z platform. For now, users must build vLLM from source to run it natively on IBM Z.
Currently, the CPU implementation for the s390x architecture supports the FP32 data type only.
!!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
# --8<-- [end:installation]
# --8<-- [start:requirements]
- OS: `Linux`
- Compiler: `gcc/g++ >= 12.3.0`
- Instruction Set Architecture (ISA): VXE support is required. Works with Z14 and above.
- Python packages that must be built and installed from source: `pyarrow`, `torch` and `torchvision`
# --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
Install the following packages from the package manager before building vLLM. For example, on RHEL 9.4:
```console
dnf install -y \
which procps findutils tar vim git gcc g++ make patch make cython zlib-devel \
libjpeg-turbo-devel libtiff-devel libpng-devel libwebp-devel freetype-devel harfbuzz-devel \
openssl-devel openblas openblas-devel wget autoconf automake libtool cmake numactl-devel
```
Install Rust >= 1.80, which is needed to install the `outlines-core` and `uvloop` Python packages.
```console
curl https://sh.rustup.rs -sSf | sh -s -- -y && \
. "$HOME/.cargo/env"
```
Execute the following commands to build and install vLLM from source.
!!! tip
Please build the following dependencies from source before building vLLM: `torchvision` and `pyarrow`.
```console
sed -i '/^torch/d' requirements-build.txt # remove torch from requirements-build.txt since we use nightly builds
pip install -v \
    --extra-index-url https://download.pytorch.org/whl/nightly/cpu \
    -r requirements-build.txt \
    -r requirements-cpu.txt && \
VLLM_TARGET_DEVICE=cpu python setup.py bdist_wheel && \
pip install dist/*.whl
```
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
# --8<-- [end:extra-information]

View File

@@ -0,0 +1,46 @@
# --8<-- [start:installation]
vLLM initially supports basic model inference and serving on the x86 CPU platform, with data types FP32, FP16 and BF16.
!!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
# --8<-- [end:installation]
# --8<-- [start:requirements]
- OS: Linux
- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
- Instruction Set Architecture (ISA): AVX512 (optional, recommended)
!!! tip
[Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) extends PyTorch with up-to-date features and optimizations for an extra performance boost on Intel hardware.
# --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
--8<-- "docs/getting_started/installation/cpu/cpu/build.inc.md"
!!! note
- AVX512_BF16 is an ISA extension that provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
- If you want to force-enable AVX512_BF16 for cross-compilation, please set the environment variable `VLLM_CPU_AVX512BF16=1` before building.
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
See [https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo)
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
# --8<-- [end:extra-information]

View File

@@ -0,0 +1,17 @@
# Installation
## Requirements
## Set up using Python
### Pre-built wheels
### Build wheel from source
## Set up using Docker
### Pre-built images
### Build image from source
## Extra information

View File

@@ -0,0 +1,124 @@
# GPU
vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor specific instructions:
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:installation"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:installation"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:installation"
## Requirements
- OS: Linux
- Python: 3.9 -- 3.12
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:requirements"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:requirements"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:requirements"
## Set up using Python
### Create a new Python environment
--8<-- "docs/getting_started/installation/python_env_setup.inc.md"
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:create-a-new-python-environment"
=== "AMD ROCm"
There is no extra information on creating a new Python environment for this device.
=== "Intel XPU"
There is no extra information on creating a new Python environment for this device.
### Pre-built wheels
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:pre-built-wheels"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:pre-built-wheels"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:pre-built-wheels"
[](){ #build-from-source }
### Build wheel from source
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:build-wheel-from-source"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:build-wheel-from-source"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:build-wheel-from-source"
## Set up using Docker
### Pre-built images
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:pre-built-images"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:pre-built-images"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:pre-built-images"
### Build image from source
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:build-image-from-source"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:build-image-from-source"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:build-image-from-source"
## Supported features
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:supported-features"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:supported-features"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:supported-features"

View File

@@ -0,0 +1,249 @@
# --8<-- [start:installation]
vLLM contains pre-compiled C++ and CUDA (12.8) binaries.
# --8<-- [end:installation]
# --8<-- [start:requirements]
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
# --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
### Create a new Python environment
!!! note
PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details.
In order to be performant, vLLM has to compile many CUDA kernels. The compilation unfortunately introduces binary incompatibility with other CUDA and PyTorch versions, even for the same PyTorch version with different build configurations.
Therefore, it is recommended to install vLLM in a **fresh** environment. If you have a different CUDA version or want to use an existing PyTorch installation, you need to build vLLM from source. See [below][build-from-source] for more details.
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
You can install vLLM using either `pip` or `uv pip`:
```console
# Install vLLM with CUDA 12.8.
# If you are using pip.
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128
# If you are using uv.
uv pip install vllm --torch-backend=auto
```
We recommend leveraging `uv` to [automatically select the appropriate PyTorch index at runtime](https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection) by inspecting the installed CUDA driver version via `--torch-backend=auto` (or `UV_TORCH_BACKEND=auto`). To select a specific backend (e.g., `cu126`), set `--torch-backend=cu126` (or `UV_TORCH_BACKEND=cu126`). If this doesn't work, try running `uv self update` to update `uv` first.
!!! note
NVIDIA Blackwell GPUs (B200, GB200) require a minimum of CUDA 12.8, so make sure you are installing PyTorch wheels with at least that version. PyTorch itself offers a [dedicated interface](https://pytorch.org/get-started/locally/) to determine the appropriate pip command to run for a given target configuration.
As of now, vLLM's binaries are compiled with CUDA 12.8 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.6, 11.8, and public PyTorch release versions:
```console
# Install vLLM with CUDA 11.8.
export VLLM_VERSION=0.6.1.post1
export PYTHON_VERSION=312
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```
[](){ #install-the-latest-code }
#### Install the latest code
LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for Linux running on an x86 platform with CUDA 12 for every commit since `v0.5.3`.
##### Install the latest code using `pip`
```console
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```
`--pre` is required for `pip` to consider pre-release versions.
Another way to install the latest code is to use `uv`:
```console
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
```
##### Install specific revisions using `pip`
If you want to access the wheels for previous commits (e.g., to bisect a behavior change or performance regression), then due to a limitation of `pip`, you have to specify the full URL of the wheel file by embedding the commit hash in the URL:
```console
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
```
Note that the wheels are built with the Python 3.8 ABI (see [PEP 425](https://peps.python.org/pep-0425/) for more details about ABI), so **they are compatible with Python 3.8 and later**. The version string in the wheel file name (`1.0.0.dev`) is just a placeholder to provide a unified URL for the wheels; the actual versions are contained in the wheel metadata (the wheels listed in the extra index URL have correct versions). Although we don't support Python 3.8 anymore (because PyTorch 2.5 dropped support for it), the wheels are still built with the Python 3.8 ABI to keep the same wheel names as before.
##### Install specific revisions using `uv`
If you want to access the wheels for previous commits (e.g., to bisect a behavior change or performance regression), you can specify the commit hash in the URL:
```console
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}
```
The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-remember command. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
#### Set up using Python-only build (without compilation)
If you only need to change Python code, you can build and install vLLM without compilation. Using `pip`'s [`--editable` flag](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs), changes you make to the code will be reflected when you run vLLM:
```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .
```
This command will do the following:
1. Look for the current branch in your vLLM clone.
2. Identify the corresponding base commit in the main branch.
3. Download the pre-built wheel of the base commit.
4. Use its compiled libraries in the installation.
!!! note
1. If you change C++ or kernel code, you cannot use the Python-only build; otherwise you will see an import error about a library not being found or an undefined symbol.
2. If you rebase your dev branch, it is recommended to uninstall vLLM and re-run the above command to make sure your libraries are up to date.
In case you see an error about the wheel not being found when running the above command, it might be because the commit you are based on in the main branch was just merged and the wheel is still being built. In that case, you can wait for around an hour and try again, or manually specify a previous commit for the installation using the `VLLM_PRECOMPILED_WHEEL_LOCATION` environment variable.
```console
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
pip install --editable .
```
You can find more information about vLLM's wheels in [install-the-latest-code][install-the-latest-code].
!!! note
There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to [install-the-latest-code][install-the-latest-code] for instructions on how to install a specified wheel.
#### Full build (with compilation)
If you want to modify C++ or CUDA code, you'll need to build vLLM from source. This can take several minutes:
```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```
!!! tip
Building from source requires a lot of compilation. If you are building from source repeatedly, it's more efficient to cache the compilation results.
For example, you can install [ccache](https://github.com/ccache/ccache) using `conda install ccache` or `apt install ccache`.
As long as the `which ccache` command can find the `ccache` binary, it will be used automatically by the build system. After the first build, subsequent builds will be much faster.
When using `ccache` with `pip install -e .`, you should run `CCACHE_NOHASHDIR="true" pip install --no-build-isolation -e .`. This is because `pip` creates a new folder with a random name for each build, preventing `ccache` from recognizing that the same files are being built.
[sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments.
The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`.
##### Use an existing PyTorch installation
There are scenarios where the PyTorch dependency cannot be easily installed via pip, e.g.:
- Building vLLM with PyTorch nightly or a custom PyTorch build.
- Building vLLM with aarch64 and CUDA (GH200), where the PyTorch wheels are not available on PyPI. Currently, only the PyTorch nightly has wheels for aarch64 with CUDA. You can run `pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124` to [install PyTorch nightly](https://pytorch.org/get-started/locally/), and then build vLLM on top of it.
To build vLLM using an existing PyTorch installation:
```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
python use_existing_torch.py
pip install -r requirements/build.txt
pip install --no-build-isolation -e .
```
##### Use the local cutlass for compilation
Currently, before starting the build process, vLLM fetches cutlass code from GitHub. However, there may be scenarios where you want to use a local version of cutlass instead.
To achieve this, you can set the environment variable `VLLM_CUTLASS_SRC_DIR` to point to your local cutlass directory.
```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e .
```
##### Troubleshooting
To avoid overloading your system, you can limit the number of compilation jobs
run simultaneously via the environment variable `MAX_JOBS`. For example:
```console
export MAX_JOBS=6
pip install -e .
```
This is especially useful when you are building on less powerful machines. For example, WSL only [assigns 50% of the total memory by default](https://learn.microsoft.com/en-us/windows/wsl/wsl-config#main-wsl-settings), so using `export MAX_JOBS=1` avoids compiling multiple files simultaneously and running out of memory.
The side effect is a much slower build process.
Additionally, if you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.
```console
# Use `--ipc=host` to make sure the shared memory is large enough.
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
```
If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.:
```console
export CUDA_HOME=/usr/local/cuda
export PATH="${CUDA_HOME}/bin:$PATH"
```
Here is a sanity check to verify that the CUDA Toolkit is correctly installed:
```console
nvcc --version # verify that nvcc is in your PATH
${CUDA_HOME}/bin/nvcc --version # verify that nvcc is in your CUDA_HOME
```
#### Unsupported OS build
vLLM can fully run only on Linux, but for development purposes you can still build it on other systems (for example, macOS), allowing for imports and a more convenient development environment. The binaries will not be compiled and won't work on non-Linux systems.
Simply set the `VLLM_TARGET_DEVICE` environment variable to `empty` before installing:
```console
export VLLM_TARGET_DEVICE=empty
pip install -e .
```
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image.
Another way to access the latest code is to use the docker images:
```console
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT}
```
These docker images are used for CI and testing only and are not intended for production use. They expire after several days.
The latest code can contain bugs and may not be stable, so please use it with caution.
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.
## Supported features
See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.
# --8<-- [end:extra-information]

View File

@@ -0,0 +1,216 @@
# --8<-- [start:installation]
vLLM supports AMD GPUs with ROCm 6.3.
!!! warning
There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
# --8<-- [end:installation]
# --8<-- [start:requirements]
- GPU: MI200s (gfx90a), MI300 (gfx942), Radeon RX 7900 series (gfx1100/1101), Radeon RX 9000 series (gfx1200/1201)
- ROCm 6.3
# --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built ROCm wheels.
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
0. Install prerequisites (skip if you are already in an environment/docker with the following installed):
- [ROCm](https://rocm.docs.amd.com/en/latest/deploy/linux/index.html)
- [PyTorch](https://pytorch.org/)
To install PyTorch, you can start from a fresh docker image, e.g., `rocm/pytorch:rocm6.3_ubuntu24.04_py3.12_pytorch_release_2.4.0` or `rocm/pytorch-nightly`. If you are using a docker image, you can skip to Step 3.
Alternatively, you can install PyTorch using PyTorch wheels. You can check the PyTorch installation guide in PyTorch [Getting Started](https://pytorch.org/get-started/locally/). Example:
```console
# Install PyTorch
$ pip uninstall torch -y
$ pip install --no-cache-dir --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.3
```
1. Install [Triton flash attention for ROCm](https://github.com/ROCm/triton)
Install ROCm's Triton flash attention (the default triton-mlir branch) following the instructions from [ROCm/triton](https://github.com/ROCm/triton/blob/triton-mlir/README.md)
```console
python3 -m pip install ninja cmake wheel pybind11
pip uninstall -y triton
git clone https://github.com/OpenAI/triton.git
cd triton
git checkout e5be006
cd python
pip3 install .
cd ../..
```
!!! note
If you see an HTTP issue related to downloading packages while building triton, please try again, as the HTTP error is intermittent.
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention)
Install ROCm's flash attention (v2.7.2) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention#amd-rocm-support)
Alternatively, wheels intended for vLLM use can be accessed under the releases.
For example, for ROCm 6.3, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo | grep gfx`.
```console
git clone https://github.com/ROCm/flash-attention.git
cd flash-attention
git checkout b7d29fb
git submodule update --init
GPU_ARCHS="gfx90a" python3 setup.py install
cd ..
```
!!! note
You might need to downgrade the "ninja" version to 1.10 as it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
3. If you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps:
```console
python3 -m pip uninstall -y aiter
git clone --recursive https://github.com/ROCm/aiter.git
cd aiter
git checkout $AITER_BRANCH_OR_COMMIT
git submodule sync; git submodule update --init --recursive
python3 setup.py develop
```
!!! note
You will need to set `$AITER_BRANCH_OR_COMMIT` to the branch or commit you want to use.
4. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps:
```bash
$ pip install --upgrade pip
# Build & install AMD SMI
$ pip install /opt/rocm/share/amd_smi
# Install dependencies
$ pip install --upgrade numba scipy huggingface-hub[cli,hf_transfer] setuptools_scm
$ pip install "numpy<2"
$ pip install -r requirements/rocm.txt
# Build vLLM for MI210/MI250/MI300.
$ export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
$ python3 setup.py develop
```
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
!!! tip
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm-up step before collecting performance numbers.
- Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
- To use CK flash-attention or PyTorch naive attention, please set `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off Triton flash attention.
- The ROCm version of PyTorch, ideally, should match the ROCm driver version.
!!! tip
- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
## Set up using Docker (Recommended)
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
The [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator.
!!! tip
Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
for instructions on how to use this prebuilt docker image.
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
Building the Docker image from source is the recommended way to use vLLM with ROCm.
#### (Optional) Build an image with ROCm software stack
Build a docker image from <gh-file:docker/Dockerfile.rocm_base>, which sets up the ROCm software stack needed by vLLM.
**This step is optional, as the rocm_base image is usually prebuilt and stored at [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev) under the tag `rocm/vllm-dev:base` to speed up the user experience.**
If you choose to build this rocm_base image yourself, the steps are as follows.
It is important to kick off the docker build using BuildKit. Either set `DOCKER_BUILDKIT=1` as an environment variable when calling the `docker build` command, or set up BuildKit in the docker daemon configuration `/etc/docker/daemon.json` as follows and restart the daemon:
```console
{
"features": {
"buildkit": true
}
}
```
To build vLLM on ROCm 6.3 for the MI200 and MI300 series, you can use the default:
```console
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm_base -t rocm/vllm-dev:base .
```
#### Build an image with vLLM
First, build a docker image from <gh-file:docker/Dockerfile.rocm> and launch a docker container from the image.
It is important to kick off the docker build using BuildKit. Either set `DOCKER_BUILDKIT=1` as an environment variable when calling the `docker build` command, or set up BuildKit in the docker daemon configuration `/etc/docker/daemon.json` as follows and restart the daemon:
```console
{
"features": {
"buildkit": true
}
}
```
<gh-file:docker/Dockerfile.rocm> uses ROCm 6.3 by default, but also supports ROCm 5.7, 6.0, 6.1, and 6.2, in older vLLM branches.
It provides flexibility to customize the build of docker image using the following arguments:
- `BASE_IMAGE`: specifies the base image used when running `docker build`. The default value `rocm/vllm-dev:base` is an image published and maintained by AMD, built using <gh-file:docker/Dockerfile.rocm_base>
- `USE_CYTHON`: an option to run Cython compilation on a subset of Python files upon docker build
- `BUILD_RPD`: include the RocmProfileData profiling tool in the image
- `ARG_PYTORCH_ROCM_ARCH`: allows overriding the gfx architecture values from the base docker image
Their values can be passed in when running `docker build` with `--build-arg` options.
To build vLLM on ROCm 6.3 for the MI200 and MI300 series, you can use the default:
```console
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm-rocm .
```
To build vLLM on ROCm 6.3 for the Radeon RX 7900 series (gfx1100), you should pick the alternative base image:
```console
DOCKER_BUILDKIT=1 docker build --build-arg BASE_IMAGE="rocm/vllm-dev:navi_base" -f docker/Dockerfile.rocm -t vllm-rocm .
```
To run the above docker image `vllm-rocm`, use the following command:
```console
docker run -it \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/model>:/app/model \
vllm-rocm \
bash
```
where `<path/to/model>` is the location where the model is stored, for example, the weights for Llama 2 or Llama 3 models.
## Supported features
See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.
# --8<-- [end:extra-information]

View File

@@ -0,0 +1,81 @@
# --8<-- [start:installation]
vLLM initially supports basic model inference and serving on the Intel GPU platform.
!!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
# --8<-- [end:installation]
# --8<-- [start:requirements]
- Supported Hardware: Intel Data Center GPU, Intel ARC GPU
- oneAPI requirements: oneAPI 2025.0
# --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built XPU wheels.
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
- First, install the required drivers and Intel oneAPI 2025.0 or later.
- Second, install the Python packages required to build the vLLM XPU backend:
```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install --upgrade pip
pip install -v -r requirements/xpu.txt
```
- Then, build and install vLLM XPU backend:
```console
VLLM_TARGET_DEVICE=xpu python setup.py install
```
!!! note
- FP16 is the default data type in the current XPU backend. The BF16 data type is supported on Intel Data Center GPU but is not yet supported on Intel Arc GPU.
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
Currently, there are no pre-built XPU images.
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
```console
$ docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
$ docker run -it \
--rm \
--network=host \
--device /dev/dri \
-v /dev/dri/by-path:/dev/dri/by-path \
vllm-xpu-env
```
## Supported features
The XPU platform supports **tensor parallel** inference/serving and also supports **pipeline parallel** as a beta feature for online serving. Ray is required as the distributed runtime backend. A reference execution looks like the following:
```console
python -m vllm.entrypoints.openai.api_server \
--model=facebook/opt-13b \
--dtype=bfloat16 \
--max_model_len=1024 \
--distributed-executor-backend=ray \
--pipeline-parallel-size=2 \
-tp=8
```
By default, a Ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equal to `parallel_config.world_size`. We recommend properly starting a Ray cluster before execution; refer to the <gh-file:examples/online_serving/run_cluster.sh> helper script.
# --8<-- [end:extra-information]

View File

@@ -0,0 +1,6 @@
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
```console
uv venv --python 3.12 --seed
source .venv/bin/activate
```

View File

@@ -0,0 +1,214 @@
---
title: Quickstart
---
[](){ #quickstart }
This guide will help you quickly get started with vLLM to perform:
- [Offline batched inference][quickstart-offline]
- [Online serving using OpenAI-compatible server][quickstart-online]
## Prerequisites
- OS: Linux
- Python: 3.9 -- 3.12
## Installation
If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/project/vllm/) directly.
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
```console
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
```
`uv` can [automatically select the appropriate PyTorch index at runtime](https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection) by inspecting the installed CUDA driver version via `--torch-backend=auto` (or `UV_TORCH_BACKEND=auto`). To select a specific backend (e.g., `cu126`), set `--torch-backend=cu126` (or `UV_TORCH_BACKEND=cu126`).
Another delightful way is to use `uv run` with the `--with [dependency]` option, which allows you to run commands such as `vllm serve` without creating any permanent environment:
```console
uv run --with vllm vllm --help
```
You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments. You can install `uv` into the conda environment through `pip` if you want to manage it within the environment.
```console
conda create -n myenv python=3.12 -y
conda activate myenv
pip install --upgrade uv
uv pip install vllm --torch-backend=auto
```
!!! note
For more details, and for non-CUDA platforms, please refer to the [installation index][installation-index] for specific instructions on how to install vLLM.
[](){ #quickstart-offline }
## Offline Batched Inference
With vLLM installed, you can start generating text for a list of input prompts (i.e., offline batch inference). See the example script: <gh-file:examples/offline_inference/basic/basic.py>
The first line of this example imports the classes [LLM][vllm.LLM] and [SamplingParams][vllm.SamplingParams]:
- [LLM][vllm.LLM] is the main class for running offline inference with the vLLM engine.
- [SamplingParams][vllm.SamplingParams] specifies the parameters for the sampling process.
```python
from vllm import LLM, SamplingParams
```
The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here][sampling-params].
!!! warning
By default, vLLM uses the sampling parameters recommended by the model creator, applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results when [SamplingParams][vllm.SamplingParams] is not specified.
However, if vLLM's default sampling parameters are preferred, please set `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
```python
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
```
The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here][supported-models].
```python
llm = LLM(model="facebook/opt-125m")
```
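Building on the warning above, if you prefer vLLM's neutral sampling defaults over the model repository's `generation_config.json`, a minimal sketch:
```python
from vllm import LLM

# Opt out of the model repo's generation_config.json defaults and
# use vLLM's neutral sampling defaults instead.
llm = LLM(model="facebook/opt-125m", generation_config="vllm")
```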
!!! note
By default, vLLM downloads models from [Hugging Face](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
```shell
export VLLM_USE_MODELSCOPE=True
```
Now, the fun part! The outputs are generated using `llm.generate`. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of `RequestOutput` objects, which include all of the output tokens.
```python
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
[](){ #quickstart-online }
## OpenAI-Compatible Server
vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using the OpenAI API.
By default, it starts the server at `http://localhost:8000`. You can specify the address with the `--host` and `--port` arguments. The server currently hosts one model at a time and implements endpoints such as [list models](https://platform.openai.com/docs/api-reference/models/list), [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create), and [create completion](https://platform.openai.com/docs/api-reference/completions/create).
Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model:
```console
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```
!!! note
By default, the server uses a predefined chat template stored in the tokenizer.
You can learn about overriding it [here][chat-template].
!!! warning
By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
To disable this behavior, please pass `--generation-config vllm` when launching the server.
This server can be queried in the same format as OpenAI API. For example, to list the models:
```console
curl http://localhost:8000/v1/models
```
You can pass the argument `--api-key` or set the environment variable `VLLM_API_KEY` to make the server check for an API key in the request header.
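For example, if the server was launched with an API key, clients must supply the same key; a minimal sketch using the `openai` package (the key value below is a placeholder):
```python
from openai import OpenAI

# The api_key must match the --api-key (or VLLM_API_KEY) the server was started with.
client = OpenAI(api_key="your-secret-key", base_url="http://localhost:8000/v1")
print(client.models.list())
```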
### OpenAI Completions API with vLLM
Once your server is started, you can query the model with input prompts:
```console
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
```
Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for any application using the OpenAI API. For example, another way to query the server is via the `openai` Python package:
```python
from openai import OpenAI
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
prompt="San Francisco is a")
print("Completion result:", completion)
```
A more detailed client example can be found here: <gh-file:examples/online_serving/openai_completion_client.py>
### OpenAI Chat Completions API with vLLM
vLLM is designed to also support the OpenAI Chat Completions API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
You can use the [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create) endpoint to interact with the model:
```console
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"}
]
}'
```
Alternatively, you can use the `openai` Python package:
```python
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
model="Qwen/Qwen2.5-1.5B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a joke."},
]
)
print("Chat response:", chat_response)
```
## On Attention Backends
Currently, vLLM supports multiple backends for efficient attention computation across different platforms and accelerator architectures. It automatically selects the most performant backend compatible with your system and model specifications.
If desired, you can also manually set the backend of your choice by configuring the environment variable `VLLM_ATTENTION_BACKEND` to one of the following options: `FLASH_ATTN`, `FLASHINFER` or `XFORMERS`.
!!! warning
There are no pre-built vLLM wheels containing FlashInfer, so you must install it in your environment first. Refer to the [FlashInfer official docs](https://docs.flashinfer.ai/) or see <gh-file:docker/Dockerfile> for instructions on how to install it.
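For example, a minimal sketch that forces the `FLASH_ATTN` backend for an offline run (the variable must be set before the engine is created, so it is exported before vLLM is used):
```python
import os

# Must be set before the vLLM engine is initialized.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
```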

View File

@@ -0,0 +1,270 @@
---
title: Troubleshooting
---
[](){ #troubleshooting }
This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
!!! note
Once you've debugged a problem, remember to turn off any debugging environment variables you defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might run slowly with debugging functionality left activated.
## Hangs downloading a model
If the model isn't already downloaded to disk, vLLM will download it from the internet, which can take time and depends on your internet connection.
It's recommended to download the model first using the [huggingface-cli](https://huggingface.co/docs/huggingface_hub/en/guides/cli) and pass the local path to the model to vLLM. This way, you can isolate the issue.
## Hangs loading a model from disk
If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g., a distributed filesystem or a network filesystem, which can be slow.
It's better to store the model on a local disk. Additionally, keep an eye on CPU memory usage: when the model is too large, it might consume a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.
!!! note
To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
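For example, a minimal offline sketch (using a small placeholder model) that exercises only the loading path with random weights:
```python
from vllm import LLM

# load_format="dummy" skips reading real weights and fills tensors with random
# values, isolating download/IO problems from the rest of engine startup.
llm = LLM(model="facebook/opt-125m", load_format="dummy")
```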
## Out of memory
If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider adopting [these options][reducing-memory-usage] to reduce the memory consumption.
## Generation quality changed
In v0.8.0, the source of default sampling parameters was changed in <gh-pr:12622>. Prior to v0.8.0, the default sampling parameters came from vLLM's set of neutral defaults. From v0.8.0 onwards, the default sampling parameters come from the `generation_config.json` provided by the model creator.
In most cases, this should lead to higher quality responses, because the model creator is likely to know which sampling parameters are best for their model. However, in some cases the defaults provided by the model creator can lead to degraded performance.
You can check if this is happening by trying the old defaults with `--generation-config vllm` for online serving and `generation_config="vllm"` for offline inference. If your generation quality improves after trying this, we recommend continuing to use the vLLM defaults and petitioning the model creator on <https://huggingface.co> to update their default `generation_config.json` so that it produces better quality generations.
## Enable more logging
If other strategies don't solve the problem, it's likely that the vLLM instance is stuck somewhere. You can use the following environment variables to help debug the issue:
- `export VLLM_LOGGING_LEVEL=DEBUG` to turn on more logging.
- `export CUDA_LAUNCH_BLOCKING=1` to identify which CUDA kernel is causing the problem.
- `export NCCL_DEBUG=TRACE` to turn on more logging for NCCL.
- `export VLLM_TRACE_FUNCTION=1` to record all function calls for inspection in the log files to tell which function crashes or hangs.
## Incorrect network setup
If you have a complicated network config, the vLLM instance might not get the correct IP address. You can find a log line such as `DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl`; check that the IP address in it is the correct one.
If it's not, override the IP address using the environment variable `export VLLM_HOST_IP=<your_ip_address>`.
You might also need to set `export NCCL_SOCKET_IFNAME=<your_network_interface>` and `export GLOO_SOCKET_IFNAME=<your_network_interface>` to specify the network interface for the IP address.
## Error near `self.graph.replay()`
If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph.
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or pass `enforce_eager=True` to the [LLM][vllm.LLM] class, to disable the CUDAGraph optimization and isolate the exact failing CUDA operation.
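For example, a minimal offline sketch (with a placeholder model name):
```python
from vllm import LLM

# Disable CUDAGraph capture so the failing CUDA op surfaces at its real call site.
llm = LLM(model="facebook/opt-125m", enforce_eager=True)
```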
[](){ #troubleshooting-incorrect-hardware-driver }
## Incorrect hardware/driver
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
```python
# Test PyTorch NCCL
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
data = torch.FloatTensor([1,] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"
print("PyTorch NCCL is successful!")
# Test PyTorch GLOO
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
cpu_data = torch.FloatTensor([1,] * 128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
value = cpu_data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"
print("PyTorch GLOO is successful!")
if world_size <= 1:
exit()
# Test vLLM NCCL, with cuda graph
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
# pynccl is enabled by default for 0.6.5+,
# but for 0.6.4 and below, we need to enable it manually.
# keep the code for backward compatibility, because people
# prefer to read the latest documentation.
pynccl.disabled = False
s = torch.cuda.Stream()
with torch.cuda.stream(s):
data.fill_(1)
out = pynccl.all_reduce(data, stream=s)
value = out.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"
print("vLLM NCCL is successful!")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(cuda_graph=g, stream=s):
out = pynccl.all_reduce(data, stream=torch.cuda.current_stream())
data.fill_(1)
g.replay()
torch.cuda.current_stream().synchronize()
value = out.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"
print("vLLM NCCL with cuda graph is successful!")
dist.destroy_process_group(gloo_group)
dist.destroy_process_group()
```
If you are testing with a single node, adjust `--nproc-per-node` to the number of GPUs you want to use:
```console
NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
```
If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
```console
NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py
```
If the script runs successfully, you should see all of the `... is successful!` messages printed by the script.
If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.
!!! note
A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
- In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
- In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
[](){ #troubleshooting-python-multiprocessing }
## Python multiprocessing
### `RuntimeError` Exception
If you have seen a warning in your logs like this:
```console
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
initialized. We must use the `spawn` multiprocessing start method. Setting
VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing
for more information.
```
or an error from Python that looks like this:
```console
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
To fix this issue, refer to the "Safe importing of main module"
section in https://docs.python.org/3/library/multiprocessing.html
```
then you must update your Python code to guard the usage of `vllm` behind an
`if __name__ == '__main__':` block. For example, instead of this:
```python
import vllm
llm = vllm.LLM(...)
```
try this instead:
```python
if __name__ == '__main__':
import vllm
llm = vllm.LLM(...)
```
## `torch.compile` Error
vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces a dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:
```python
import torch
@torch.compile
def f(x):
# a simple function to test torch.compile
x = x + 1
x = x * 2
x = x.sin()
return x
x = torch.randn(4, 4).cuda()
print(f(x))
```
If it raises errors from the `torch/_inductor` directory, it usually means you have a custom `triton` library that is not compatible with your version of PyTorch. See [this issue](https://github.com/vllm-project/vllm/issues/12219) for an example.
## Model failed to be inspected
If you see an error like:
```text
File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
raise ValueError(
ValueError: Model architectures ['<arch>'] failed to be inspected. Please check the logs for more details.
```
It means that vLLM failed to import the model file.
Usually, it is related to missing dependencies or outdated binaries in the vLLM build.
Please read the logs carefully to determine the root cause of the error.
## Model not supported
If you see an error like:
```text
Traceback (most recent call last):
...
File "vllm/model_executor/models/registry.py", line xxx, in inspect_model_cls
for arch in architectures:
TypeError: 'NoneType' object is not iterable
```
or:
```text
File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
raise ValueError(
ValueError: Model architectures ['<arch>'] are not supported for now. Supported architectures: [...]
```
But you are sure that the model is in the [list of supported models][supported-models], there may be some issue with vLLM's model resolution. In that case, please follow [these steps][model-resolution] to explicitly specify the vLLM implementation for the model.
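One of those steps is to pass the intended architecture explicitly via `hf_overrides`. A minimal sketch, where the model and architecture names are just examples:

```python
from vllm import LLM

# Explicitly tell vLLM which architecture to use for this checkpoint,
# instead of relying on automatic inspection of the Hugging Face config.
llm = LLM(
    model="cerebras/Cerebras-GPT-1.3B",
    hf_overrides={"architectures": ["GPT2LMHeadModel"]},
)
```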
## Failed to infer device type
If you see an error like `RuntimeError: Failed to infer device type`, it means that vLLM failed to infer the device type of the runtime environment. You can check [the code](gh-file:vllm/platforms/__init__.py) to see how vLLM infers the device type and why it is not working as expected. After [this PR](gh-pr:14195), you can also set the environment variable `VLLM_LOGGING_LEVEL=DEBUG` to see more detailed logs to help debug the issue.
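For example, assuming you launch through the `vllm` CLI:

```console
VLLM_LOGGING_LEVEL=DEBUG vllm serve <your-model>
```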
## Known Issues
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000), which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).
- To circumvent an NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234), all vLLM processes set the environment variable `NCCL_CUMEM_ENABLE=0` to disable NCCL's `cuMem` allocator. Disabling it does not affect performance; the allocator only provides memory benefits. When external processes want to set up an NCCL connection with vLLM's processes, they should also set this environment variable; otherwise, an inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](gh-issue:5723#issuecomment-2554389656).
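For example, an external process joining an NCCL group with vLLM workers could set the variable before initializing the process group. A minimal sketch using `torch.distributed`, assuming rank and world size come from the usual environment variables:

```python
import os

# Match vLLM's setting before NCCL is initialized anywhere in this process.
os.environ["NCCL_CUMEM_ENABLE"] = "0"

import torch.distributed as dist

# Assumes MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are set in the env.
dist.init_process_group(backend="nccl")
```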
@@ -0,0 +1,153 @@
# vLLM V1 User Guide
V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).
To disable V1, set the environment variable `VLLM_USE_V1=0` and send us a GitHub issue sharing the reason!
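For example, in an offline inference script (the model name is just a placeholder):

```python
import os

# Set before the engine is created so vLLM falls back to the V0 code path.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

llm = LLM(model="<your-model>")
```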
## Why vLLM V1?
vLLM V0 successfully supported a wide range of models and hardware, but as new features were developed independently, the system grew increasingly complex. This complexity made it harder to integrate new capabilities and introduced technical debt, revealing the need for a more streamlined and unified design.
Building on V0's success, vLLM V1 retains the stable and proven components from V0
(such as the models, GPU kernels, and utilities). At the same time, it significantly
re-architects the core systems, covering the scheduler, KV cache manager, worker,
sampler, and API server, to provide a cohesive, maintainable framework that better
accommodates continued growth and innovation.
Specifically, V1 aims to:
- Provide a **simple, modular, and easy-to-hack codebase**.
- Ensure **high performance** with near-zero CPU overhead.
- **Combine key optimizations** into a unified architecture.
- Require **zero configs** by enabling features/optimizations by default.
We see significant performance improvements from upgrading to the V1 core engine, in
particular for long-context scenarios. Please see the performance benchmark (to be
added).
For more details, check out the vLLM V1 blog post [vLLM V1: A Major
Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) (published Jan 27, 2025).
This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to make V1 the default engine; therefore, this guide will be updated continually as more features become supported on vLLM V1.
### Support Overview
#### Hardware
| Hardware | Status |
|----------|------------------------------------------|
| **NVIDIA** | <nobr>🚀 Natively Supported</nobr> |
| **AMD** | <nobr>🚧 WIP</nobr> |
| **TPU** | <nobr>🚧 WIP</nobr> |
#### Feature / Model
| Feature / Model | Status |
|-----------------|-----------------------------------------------------------------------------------|
| **Prefix Caching** | <nobr>🚀 Optimized</nobr> |
| **Chunked Prefill** | <nobr>🚀 Optimized</nobr> |
| **LoRA** | <nobr>🚀 Optimized</nobr> |
| **Logprobs Calculation** | <nobr>🟢 Functional</nobr> |
| **Multimodal Models** | <nobr>🟢 Functional</nobr> |
| **FP8 KV Cache** | <nobr>🟢 Functional on Hopper devices ([PR #15191](https://github.com/vllm-project/vllm/pull/15191))</nobr>|
| **Spec Decode** | <nobr>🚧 WIP ([PR #13933](https://github.com/vllm-project/vllm/pull/13933))</nobr>|
| **Prompt Logprobs with Prefix Caching** | <nobr>🟡 Planned ([RFC #13414](https://github.com/vllm-project/vllm/issues/13414))</nobr>|
| **Structured Output Alternative Backends** | <nobr>🟡 Planned</nobr> |
| **Embedding Models** | <nobr>🟡 Planned ([RFC #12249](https://github.com/vllm-project/vllm/issues/12249))</nobr> |
| **Mamba Models** | <nobr>🟡 Planned</nobr> |
| **Encoder-Decoder Models** | <nobr>🟡 Planned</nobr> |
| **Request-level Structured Output Backend** | <nobr>🔴 Deprecated</nobr> |
| **best_of** | <nobr>🔴 Deprecated ([RFC #13361](https://github.com/vllm-project/vllm/issues/13361))</nobr>|
| **Per-Request Logits Processors** | <nobr>🔴 Deprecated ([RFC #13360](https://github.com/vllm-project/vllm/pull/13360))</nobr> |
| **GPU <> CPU KV Cache Swapping** | <nobr>🔴 Deprecated</nobr> |
- **🚀 Optimized**: Nearly fully optimized, with no further work currently planned.
- **🟢 Functional**: Fully operational, with ongoing optimizations.
- **🚧 WIP**: Under active development.
- **🟡 Planned**: Scheduled for future implementation (some may have open PRs/RFCs).
- **🔴 Deprecated**: Not planned for v1 unless there is strong demand.
**Note**: vLLM V1's unified scheduler treats both prompt and output tokens the same
way by using a simple dictionary (e.g., `{request_id: num_tokens}`) to dynamically
allocate a fixed token budget per request, enabling features like chunked prefills,
prefix caching, and speculative decoding without a strict separation between prefill
and decode phases.
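To make the idea concrete, here is an illustrative toy version of such a token-budget scheduler. This is not vLLM's actual scheduler code, just a sketch of the scheduling principle:

```python
# Illustrative only: a toy token-budget scheduler in the spirit of V1's design,
# not vLLM's actual implementation.
def schedule_step(waiting: dict[str, int], token_budget: int) -> dict[str, int]:
    """Grant each request a share of this step's token budget.

    `waiting` maps request_id -> tokens still to be processed; prompt and
    output tokens are treated uniformly, so a long prefill is simply chunked.
    """
    allocation: dict[str, int] = {}
    for request_id, num_tokens in waiting.items():
        if token_budget == 0:
            break
        granted = min(num_tokens, token_budget)
        allocation[request_id] = granted
        token_budget -= granted
    return allocation


# One decoding request (1 token) and one long prefill share a 512-token budget.
print(schedule_step({"decode-req": 1, "prefill-req": 1000}, token_budget=512))
# -> {'decode-req': 1, 'prefill-req': 511}
```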
### Semantic Changes and Deprecated Features
#### Logprobs
vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic
differences compared to V0:
**Logprobs Calculation**
Logprobs in V1 are now returned immediately once computed from the model's raw output (i.e.
before applying any logits post-processing such as temperature scaling or penalty
adjustments). As a result, the returned logprobs do not reflect the final adjusted
probabilities used during sampling.
Support for logprobs with post-sampling adjustments is in progress and will be added in future updates.
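Requesting logprobs itself is unchanged; what differs in V1 is only the meaning of the returned values. A minimal sketch with the offline API (the model name is a placeholder):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="<your-model>")
params = SamplingParams(temperature=0.8, logprobs=5)  # top-5 logprobs per output token

output = llm.generate("The capital of France is", params)[0]
# In V1 these are raw logprobs, computed before temperature scaling or
# penalty adjustments are applied.
print(output.outputs[0].logprobs)
```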
**Prompt Logprobs with Prefix Caching**
Currently, prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](https://github.com/vllm-project/vllm/issues/13414).
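For example, with the offline API you can disable prefix caching and request prompt logprobs as follows (a minimal sketch; the model name is a placeholder):

```python
from vllm import LLM, SamplingParams

# Prompt logprobs currently require prefix caching to be disabled in V1.
llm = LLM(model="<your-model>", enable_prefix_caching=False)
params = SamplingParams(prompt_logprobs=1)  # logprob of each prompt token

output = llm.generate("The quick brown fox", params)[0]
print(output.prompt_logprobs)
```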
#### Deprecated Features
As part of the major architectural rework in vLLM V1, several legacy features have been deprecated.
**Sampling features**
- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
- **Per-Request Logits Processors**: In V0, users could pass custom
processing functions to adjust logits on a per-request basis. In vLLM V1, this
feature has been deprecated. Instead, the design is moving toward supporting **global logits
processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](https://github.com/vllm-project/vllm/pull/13360).
**KV Cache features**
- **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
to handle request preemptions.
**Structured Output features**
- **Request-level Structured Output Backend**: Deprecated; work on alternative backends
(outlines, guidance) with fallbacks is in progress.
### Feature & Model Support in Progress
Although we have re-implemented and partially optimized many features and models from V0 in vLLM V1, optimization work is still ongoing for some, and others remain unsupported.
#### Features to Be Optimized
These features are already supported in vLLM V1, but their optimization is still
in progress.
- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There
will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)). We will prioritize support for Eagle and MTP over draft-model-based spec decode.
- **Multimodal Models**: V1 is almost fully compatible with V0 except that interleaved modality input is not supported yet.
See [here](https://github.com/orgs/vllm-project/projects/8) for the status of upcoming features and optimizations.
#### Features to Be Supported
- **Structured Output Alternative Backends**: Support for alternative structured output backends (outlines, guidance) is planned. V1 currently
supports only the `xgrammar:no_fallback` mode, meaning that it will error out if the output schema is not supported by xgrammar.
Details about the structured outputs can be found
[here](https://docs.vllm.ai/en/latest/features/structured_outputs.html).
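If you want to select this mode explicitly, it can typically be done via the guided decoding backend engine argument when launching the server. A sketch (the flag spelling may differ between releases, so check the linked docs for the authoritative form):

```console
vllm serve <your-model> --guided-decoding-backend xgrammar:no_fallback
```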
#### Models to Be Supported
vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol,
and the majority fall into the following categories. V1 support for these models will be added eventually.
**Embedding Models**
Instead of having a separate model runner, a hidden states processor ([RFC #12249](https://github.com/vllm-project/vllm/issues/12249)), which is based on the global logits processor ([RFC #13360](https://github.com/vllm-project/vllm/pull/13360)), has been proposed to enable simultaneous generation and embedding using the same engine instance in V1. It is still in the planning stage.
**Mamba Models**
Models using selective state-space mechanisms (instead of standard transformer attention)
are not yet supported (e.g., `MambaForCausalLM`, `JambaForCausalLM`).
**Encoder-Decoder Models**
vLLM V1 is currently optimized for decoder-only transformers. Models requiring
cross-attention between separate encoder and decoder are not yet supported (e.g., `BartForConditionalGeneration`, `MllamaForConditionalGeneration`).
For a complete list of supported models, see the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html).