[Doc] Convert docs to use colon fences (#12471)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Author: Harry Mellor
Date: 2025-01-29 03:38:29 +00:00
Committed by: GitHub
Parent: a7e3eba66f
Commit: dd6a3a02cb
68 changed files with 2352 additions and 2341 deletions


@@ -86,9 +86,9 @@ docker build -f Dockerfile.hpu -t vllm-hpu-env .
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
```
```{tip}
:::{tip}
If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to the "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have the `habana-container-runtime` package installed and that the `habana` container runtime is registered.
```
:::
## Extra information
@@ -155,30 +155,30 @@ Gaudi2 devices. Configurations that are not listed may or may not work.
Currently, vLLM for HPU supports four execution modes, depending on the selected HPU PyTorch Bridge backend (set via the `PT_HPU_LAZY_MODE` environment variable) and the `--enforce-eager` flag.
```{list-table} vLLM execution modes
:::{list-table} vLLM execution modes
:widths: 25 25 50
:header-rows: 1
* - `PT_HPU_LAZY_MODE`
- `enforce_eager`
- execution mode
* - 0
- 0
- torch.compile
* - 0
- 1
- PyTorch eager mode
* - 1
- 0
- HPU Graphs
* - 1
- 1
- PyTorch lazy mode
```
- * `PT_HPU_LAZY_MODE`
* `enforce_eager`
* execution mode
- * 0
* 0
* torch.compile
- * 0
* 1
* PyTorch eager mode
- * 1
* 0
* HPU Graphs
- * 1
* 1
* PyTorch lazy mode
:::
```{warning}
:::{warning}
In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should only be used for validating functional correctness. Their performance will be improved in upcoming releases. For the best performance in 1.18.0, please use HPU Graphs or PyTorch lazy mode.
```
:::
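The execution-mode table above can be read as a simple lookup; a minimal sketch (the helper name is illustrative, not part of vLLM's API):

```python
# Illustrative lookup mirroring the execution-mode table above.
# Keys: (PT_HPU_LAZY_MODE, enforce_eager) -> execution mode.
EXECUTION_MODES = {
    (0, 0): "torch.compile",
    (0, 1): "PyTorch eager mode",
    (1, 0): "HPU Graphs",
    (1, 1): "PyTorch lazy mode",
}

def resolve_execution_mode(lazy_mode: int, enforce_eager: bool) -> str:
    """Map the env var / flag combination to the resulting execution mode."""
    return EXECUTION_MODES[(lazy_mode, int(enforce_eager))]
```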
(gaudi-bucketing-mechanism)=
@@ -187,9 +187,9 @@ In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and
Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. [Intel Gaudi Graph Compiler](https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime) is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently, this is achieved by "bucketing" the model's forward pass across two dimensions: `batch_size` and `sequence_length`.
```{note}
:::{note}
Bucketing allows us to reduce the number of required graphs significantly, but it does not handle graph compilation or device code generation - that is done in the warmup and HPUGraph capture phases.
```
:::
Bucketing ranges are determined with 3 parameters - `min`, `step` and `max`. They can be set separately for the prompt and decode phases, and for the batch size and sequence length dimensions. These parameters can be observed in logs during vLLM startup:
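As a simplified illustration, the three parameters can be sketched in Python (the real vLLM HPU warmup also applies an exponential ramp-up below `step`, which this sketch omits):

```python
def bucket_range(min_v: int, step: int, max_v: int) -> list[int]:
    """Simplified sketch of a bucketing range: the min value, then
    multiples of step up to max. The actual vLLM HPU logic also has a
    ramp-up phase between min and step that this sketch omits."""
    return sorted({min_v} | {b for b in range(step, max_v + 1, step) if b >= min_v})

# e.g. min=128, step=128, max=512 -> [128, 256, 384, 512]
```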
@@ -222,15 +222,15 @@ min = 128, step = 128, max = 512
In the logged scenario, 24 buckets were generated for prompt (prefill) runs, and 48 buckets for decode runs. Each bucket corresponds to a separate optimized device binary for a given model with specified tensor shapes. Whenever a batch of requests is processed, it is padded across the batch and sequence length dimensions to the smallest possible bucket.
```{warning}
:::{warning}
If a request exceeds the maximum bucket size in any dimension, it will be processed without padding, and its processing may require a graph compilation, potentially significantly increasing end-to-end latency. The boundaries of the buckets are user-configurable via environment variables, and upper bucket boundaries can be increased to avoid such a scenario.
```
:::
As an example, if a request of 3 sequences with a max sequence length of 412 comes into an idle vLLM server, it will be padded and executed as a `(4, 512)` prefill bucket: `batch_size` (the number of sequences) is padded to 4 (the closest batch size bucket higher than 3), and the max sequence length is padded to 512 (the closest sequence length bucket higher than 412). After the prefill stage, it will be executed as a `(4, 512)` decode bucket and will continue as that bucket until either the batch dimension changes (due to a request finishing), in which case it becomes a `(2, 512)` bucket, or the context length increases above 512 tokens, in which case it becomes a `(4, 640)` bucket.
```{note}
:::{note}
Bucketing is transparent to a client -- padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.
```
:::
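The padding to the smallest possible bucket described above can be sketched as a small helper (illustrative only; the real implementation also handles the ramp-up region, and requests exceeding `max` are processed without padding rather than clamped):

```python
import math

def next_bucket(value: int, min_v: int, step: int, max_v: int) -> int:
    """Pad `value` up to the smallest bucket boundary >= value,
    assuming buckets at `min` and at multiples of `step` up to `max`.
    For simplicity this sketch clamps to `max`."""
    if value <= min_v:
        return min_v
    return min(max_v, math.ceil(value / step) * step)

# Sequence length 412 with the logged range min=128, step=128, max=512
# pads to the 512 bucket, matching the example above.
```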
### Warmup
@@ -252,9 +252,9 @@ INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size
This example uses the same buckets as in the [Bucketing Mechanism](#gaudi-bucketing-mechanism) section. Each output line corresponds to the execution of a single bucket. When a bucket is executed for the first time, its graph is compiled and can be reused later, skipping further graph compilations.
```{tip}
:::{tip}
Compiling all the buckets might take some time and can be turned off with the `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do so, you may face graph compilations when executing a given bucket for the first time. Disabling warmup is fine for development, but enabling it is highly recommended in deployment.
```
:::
### HPU Graph capture
@@ -269,9 +269,9 @@ With its default value (`VLLM_GRAPH_RESERVED_MEM=0.1`), 10% of usable memory wil
Environment variable `VLLM_GRAPH_PROMPT_RATIO` determines the ratio of usable graph memory reserved for prefill and decode graphs. By default (`VLLM_GRAPH_PROMPT_RATIO=0.3`), 30% of usable graph memory is reserved for prefill graphs and 70% for decode graphs.
A lower value reserves less usable graph memory for the prefill stage, e.g. `VLLM_GRAPH_PROMPT_RATIO=0.2` will reserve 20% of usable graph memory for prefill graphs and 80% of usable graph memory for decode graphs.
```{note}
:::{note}
`gpu_memory_utilization` does not correspond to the absolute memory usage across the HPU. It specifies the memory margin after loading the model and performing a profile run. If a device has 100 GiB of total memory, and 50 GiB of free memory after loading model weights and executing the profiling run, `gpu_memory_utilization` at its default value will mark 90% of that 50 GiB as usable, leaving a 5 GiB margin, regardless of total device memory.
```
:::
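The note's arithmetic can be made concrete with a short illustrative sketch (the function name is hypothetical, not part of vLLM's API):

```python
def usable_and_margin(free_after_profile_gib: float,
                      gpu_memory_utilization: float = 0.9) -> tuple[float, float]:
    """Illustrative arithmetic for the note above: the utilization
    fraction applies to memory free *after* loading weights and running
    the profiling pass, not to total device memory."""
    usable = free_after_profile_gib * gpu_memory_utilization
    margin = free_after_profile_gib - usable
    return usable, margin

# 100 GiB device with 50 GiB free after load + profiling:
# usable_and_margin(50.0) marks 45 GiB as usable, leaving a 5 GiB margin.
```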
Users can also configure the strategy for capturing HPU Graphs separately for the prompt and decode stages. The strategy affects the order in which graphs are captured. There are two strategies implemented:
\- `max_bs` - the graph capture queue is sorted in descending order by batch size. Buckets with equal batch sizes are sorted by sequence length in ascending order (e.g. `(64, 128)`, `(64, 256)`, `(32, 128)`, `(32, 256)`, `(1, 128)`, `(1, 256)`); this is the default strategy for decode
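As a sketch, the `max_bs` capture order from the example above corresponds to sorting the buckets on the key `(-batch_size, seq_len)`:

```python
# (batch_size, seq_len) buckets, initially unordered.
buckets = [(32, 128), (1, 128), (64, 256), (64, 128), (1, 256), (32, 256)]

# max_bs strategy: descending batch size, ascending sequence length.
capture_order = sorted(buckets, key=lambda b: (-b[0], b[1]))
# -> [(64, 128), (64, 256), (32, 128), (32, 256), (1, 128), (1, 256)]
```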
@@ -279,9 +279,9 @@ User can also configure the strategy for capturing HPU Graphs for prompt and dec
When there is a large number of requests pending, the vLLM scheduler will attempt to fill the maximum decode batch size as soon as possible. When a request finishes, the decode batch size decreases, and vLLM will attempt to schedule a prefill iteration for requests in the waiting queue, to fill the decode batch size back to its previous state. This means that under full load the decode batch size is often at its maximum, which makes capturing large batch size HPU Graphs crucial, as reflected by the `max_bs` strategy. On the other hand, prefills are executed most frequently with very low batch sizes (1-4), which is reflected in the `min_tokens` strategy.
```{note}
:::{note}
`VLLM_GRAPH_PROMPT_RATIO` does not set a hard limit on the memory taken by graphs for each stage (prefill and decode). vLLM will first attempt to use up the entirety of usable prefill graph memory (usable graph memory * `VLLM_GRAPH_PROMPT_RATIO`) for capturing prefill HPU Graphs; next, it will attempt to do the same for decode graphs and the usable decode graph memory pool. If one stage is fully captured and there is unused memory left within the usable graph memory pool, vLLM will attempt further graph capture for the other stage, until no more HPU Graphs can be captured without exceeding the reserved memory pool. The behavior of this mechanism can be observed in the example below.
```
:::
Each described step is logged by the vLLM server, as follows (negative values correspond to memory being released):
@@ -352,13 +352,13 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi
- `VLLM_{phase}_{dim}_BUCKET_{param}` - a collection of 12 environment variables configuring the ranges of the bucketing mechanism
- `{phase}` is either `PROMPT` or `DECODE`
* `{phase}` is either `PROMPT` or `DECODE`
- `{dim}` is either `BS`, `SEQ` or `BLOCK`
* `{dim}` is either `BS`, `SEQ` or `BLOCK`
- `{param}` is either `MIN`, `STEP` or `MAX`
* `{param}` is either `MIN`, `STEP` or `MAX`
- Default values:
* Default values:
- Prompt:
- batch size min (`VLLM_PROMPT_BS_BUCKET_MIN`): `1`
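As a sketch, the 12 variable names can be enumerated as follows, assuming the prompt phase uses the `BS` and `SEQ` dimensions and decode uses `BS` and `BLOCK` (this pairing is an assumption, inferred from the defaults listed above, not stated explicitly):

```python
# Assumed phase -> dimensions pairing (2 phases x 2 dims x 3 params = 12 vars).
PHASE_DIMS = {"PROMPT": ("BS", "SEQ"), "DECODE": ("BS", "BLOCK")}

names = [
    f"VLLM_{phase}_{dim}_BUCKET_{param}"
    for phase, dims in PHASE_DIMS.items()
    for dim in dims
    for param in ("MIN", "STEP", "MAX")
]
```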


@@ -2,374 +2,374 @@
vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor-specific instructions:
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} TPU
::::{tab-item} TPU
:sync: tpu
```{include} tpu.inc.md
:::{include} tpu.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
```
:::
:::{tab-item} Intel Gaudi
:sync: hpu-gaudi
```{include} hpu-gaudi.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
```
:::
:::{tab-item} Neuron
:sync: neuron
```{include} neuron.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
```
:::
:::{tab-item} OpenVINO
:sync: openvino
```{include} openvino.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
```
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
::::{tab-item} Neuron
:sync: neuron
:::{include} neuron.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} openvino.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
:::::
## Requirements
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} TPU
::::{tab-item} TPU
:sync: tpu
```{include} tpu.inc.md
:::{include} tpu.inc.md
:start-after: "## Requirements"
:end-before: "## Configure a new environment"
```
:::
:::{tab-item} Intel Gaudi
:sync: hpu-gaudi
```{include} hpu-gaudi.inc.md
:start-after: "## Requirements"
:end-before: "## Configure a new environment"
```
:::
:::{tab-item} Neuron
:sync: neuron
```{include} neuron.inc.md
:start-after: "## Requirements"
:end-before: "## Configure a new environment"
```
:::
:::{tab-item} OpenVINO
:sync: openvino
```{include} openvino.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
```
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:start-after: "## Requirements"
:end-before: "## Configure a new environment"
:::
::::
::::{tab-item} Neuron
:sync: neuron
:::{include} neuron.inc.md
:start-after: "## Requirements"
:end-before: "## Configure a new environment"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} openvino.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
::::
:::::
## Configure a new environment
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} TPU
::::{tab-item} TPU
:sync: tpu
```{include} tpu.inc.md
:::{include} tpu.inc.md
:start-after: "## Configure a new environment"
:end-before: "## Set up using Python"
```
:::
:::{tab-item} Intel Gaudi
:sync: hpu-gaudi
```{include} hpu-gaudi.inc.md
:start-after: "## Configure a new environment"
:end-before: "## Set up using Python"
```
:::
:::{tab-item} Neuron
:sync: neuron
```{include} neuron.inc.md
:start-after: "## Configure a new environment"
:end-before: "## Set up using Python"
```
:::
:::{tab-item} OpenVINO
:sync: openvino
```{include} ../python_env_setup.inc.md
```
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:start-after: "## Configure a new environment"
:end-before: "## Set up using Python"
:::
::::
::::{tab-item} Neuron
:sync: neuron
:::{include} neuron.inc.md
:start-after: "## Configure a new environment"
:end-before: "## Set up using Python"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} ../python_env_setup.inc.md
:::
::::
:::::
## Set up using Python
### Pre-built wheels
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} TPU
::::{tab-item} TPU
:sync: tpu
```{include} tpu.inc.md
:::{include} tpu.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
```
:::
:::{tab-item} Intel Gaudi
:sync: hpu-gaudi
```{include} hpu-gaudi.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
```
:::
:::{tab-item} Neuron
:sync: neuron
```{include} neuron.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
```
:::
:::{tab-item} OpenVINO
:sync: openvino
```{include} openvino.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
```
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
::::
::::{tab-item} Neuron
:sync: neuron
:::{include} neuron.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} openvino.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
::::
:::::
### Build wheel from source
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} TPU
::::{tab-item} TPU
:sync: tpu
```{include} tpu.inc.md
:::{include} tpu.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
```
:::
:::{tab-item} Intel Gaudi
:sync: hpu-gaudi
```{include} hpu-gaudi.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
```
:::
:::{tab-item} Neuron
:sync: neuron
```{include} neuron.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
```
:::
:::{tab-item} OpenVINO
:sync: openvino
```{include} openvino.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
```
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
::::{tab-item} Neuron
:sync: neuron
:::{include} neuron.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} openvino.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
:::::
## Set up using Docker
### Pre-built images
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} TPU
::::{tab-item} TPU
:sync: tpu
```{include} tpu.inc.md
:::{include} tpu.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
```
:::
:::{tab-item} Intel Gaudi
:sync: hpu-gaudi
```{include} hpu-gaudi.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
```
:::
:::{tab-item} Neuron
:sync: neuron
```{include} neuron.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
```
:::
:::{tab-item} OpenVINO
:sync: openvino
```{include} openvino.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
```
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
::::
::::{tab-item} Neuron
:sync: neuron
:::{include} neuron.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} openvino.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
::::
:::::
### Build image from source
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} TPU
::::{tab-item} TPU
:sync: tpu
```{include} tpu.inc.md
:::{include} tpu.inc.md
:start-after: "### Build image from source"
:end-before: "## Extra information"
```
:::
:::{tab-item} Intel Gaudi
:sync: hpu-gaudi
```{include} hpu-gaudi.inc.md
:start-after: "### Build image from source"
:end-before: "## Extra information"
```
:::
:::{tab-item} Neuron
:sync: neuron
```{include} neuron.inc.md
:start-after: "### Build image from source"
:end-before: "## Extra information"
```
:::
:::{tab-item} OpenVINO
:sync: openvino
```{include} openvino.inc.md
:start-after: "### Build image from source"
:end-before: "## Extra information"
```
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:start-after: "### Build image from source"
:end-before: "## Extra information"
:::
::::
::::{tab-item} Neuron
:sync: neuron
:::{include} neuron.inc.md
:start-after: "### Build image from source"
:end-before: "## Extra information"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} openvino.inc.md
:start-after: "### Build image from source"
:end-before: "## Extra information"
:::
::::
:::::
## Extra information
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} TPU
::::{tab-item} TPU
:sync: tpu
```{include} tpu.inc.md
:::{include} tpu.inc.md
:start-after: "## Extra information"
```
:::
:::{tab-item} Intel Gaudi
:sync: hpu-gaudi
```{include} hpu-gaudi.inc.md
:start-after: "## Extra information"
```
:::
:::{tab-item} Neuron
:sync: neuron
```{include} neuron.inc.md
:start-after: "## Extra information"
```
:::
:::{tab-item} OpenVINO
:sync: openvino
```{include} openvino.inc.md
:start-after: "## Extra information"
```
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:start-after: "## Extra information"
:::
::::
::::{tab-item} Neuron
:sync: neuron
:::{include} neuron.inc.md
:start-after: "## Extra information"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} openvino.inc.md
:start-after: "## Extra information"
:::
::::
:::::


@@ -67,9 +67,9 @@ Currently, there are no pre-built Neuron wheels.
### Build wheel from source
```{note}
:::{note}
The currently supported version of PyTorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
```
:::
The following instructions are applicable to Neuron SDK 2.16 and beyond.


@@ -47,10 +47,10 @@ When you request queued resources, the request is added to a queue maintained by
the Cloud TPU service. When the requested resource becomes available, it's
assigned to your Google Cloud project for your immediate exclusive use.
```{note}
:::{note}
In all of the following commands, replace the ALL CAPS parameter names with
appropriate values. See the parameter descriptions table for more information.
```
:::
### Provision Cloud TPUs with GKE
@@ -75,33 +75,33 @@ gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
--service-account SERVICE_ACCOUNT
```
```{list-table} Parameter descriptions
:::{list-table} Parameter descriptions
:header-rows: 1
* - Parameter name
- Description
* - QUEUED_RESOURCE_ID
- The user-assigned ID of the queued resource request.
* - TPU_NAME
- The user-assigned name of the TPU which is created when the queued
- * Parameter name
* Description
- * QUEUED_RESOURCE_ID
* The user-assigned ID of the queued resource request.
- * TPU_NAME
* The user-assigned name of the TPU which is created when the queued
resource request is allocated.
* - PROJECT_ID
- Your Google Cloud project
* - ZONE
- The GCP zone where you want to create your Cloud TPU. The value you use
- * PROJECT_ID
* Your Google Cloud project
- * ZONE
* The GCP zone where you want to create your Cloud TPU. The value you use
depends on the version of TPUs you are using. For more information, see
`TPU regions and zones <https://cloud.google.com/tpu/docs/regions-zones>`_
* - ACCELERATOR_TYPE
- The TPU version you want to use. Specify the TPU version, for example
- * ACCELERATOR_TYPE
* The TPU version you want to use. Specify the TPU version, for example
`v5litepod-4` specifies a v5e TPU with 4 cores. For more information,
see `TPU versions <https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions>`_.
* - RUNTIME_VERSION
- The TPU VM runtime version to use. For more information see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
* - SERVICE_ACCOUNT
- The email address for your service account. You can find it in the IAM
- * RUNTIME_VERSION
* The TPU VM runtime version to use. For more information see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
- * SERVICE_ACCOUNT
* The email address for your service account. You can find it in the IAM
Cloud Console under *Service Accounts*. For example:
`tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`
```
:::
Connect to your TPU using SSH:
@@ -178,15 +178,15 @@ Run the Docker image with the following command:
docker run --privileged --net host --shm-size=16G -it vllm-tpu
```
```{note}
:::{note}
Since TPU relies on XLA which requires static shapes, vLLM bucketizes the
possible input shapes and compiles an XLA graph for each shape. The
compilation time may take 20~30 minutes in the first run. However, the
compilation time reduces to ~5 minutes afterwards because the XLA graphs are
cached on disk (in {code}`VLLM_XLA_CACHE_PATH` or {code}`~/.cache/vllm/xla_cache` by default).
```
:::
````{tip}
:::{tip}
If you encounter the following error:
```console
@@ -198,9 +198,10 @@ file or directory
Install OpenBLAS with the following command:
```console
$ sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
```
````
:::
## Extra information


@@ -25,9 +25,9 @@ pip install -r requirements-cpu.txt
pip install -e .
```
```{note}
:::{note}
On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.
```
:::
#### Troubleshooting


@@ -2,86 +2,86 @@
vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor-specific instructions:
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} x86
::::{tab-item} x86
:sync: x86
```{include} x86.inc.md
:::{include} x86.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
```
:::
:::{tab-item} ARM
:sync: arm
```{include} arm.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
```
:::
:::{tab-item} Apple silicon
:sync: apple
```{include} apple.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
```
:::
::::
::::{tab-item} ARM
:sync: arm
:::{include} arm.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
::::{tab-item} Apple silicon
:sync: apple
:::{include} apple.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
:::::
## Requirements
- Python: 3.9 -- 3.12
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} x86
::::{tab-item} x86
:sync: x86
```{include} x86.inc.md
:::{include} x86.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
```
:::
:::{tab-item} ARM
:sync: arm
```{include} arm.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
```
:::
:::{tab-item} Apple silicon
:sync: apple
```{include} apple.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
```
:::
::::
::::{tab-item} ARM
:sync: arm
:::{include} arm.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
::::
::::{tab-item} Apple silicon
:sync: apple
:::{include} apple.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
::::
:::::
## Set up using Python
### Create a new Python environment
```{include} ../python_env_setup.inc.md
```
:::{include} ../python_env_setup.inc.md
:::
### Pre-built wheels
@@ -89,41 +89,41 @@ Currently, there are no pre-built CPU wheels.
### Build wheel from source
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} x86
::::{tab-item} x86
:sync: x86
```{include} x86.inc.md
:::{include} x86.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
```
:::
:::{tab-item} ARM
:sync: arm
```{include} arm.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
```
:::
:::{tab-item} Apple silicon
:sync: apple
```{include} apple.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
```
:::
::::
::::{tab-item} ARM
:sync: arm
:::{include} arm.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
::::{tab-item} Apple silicon
:sync: apple
:::{include} apple.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
:::::
## Set up using Docker
### Pre-built images
@@ -142,9 +142,9 @@ $ docker run -it \
vllm-cpu-env
```
:::{tip}
::::{tip}
For ARM or Apple silicon, use `Dockerfile.arm`
:::
::::
## Supported features


@@ -17,10 +17,10 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform,
:::{include} build.inc.md
:::
```{note}
:::{note}
- AVX512_BF16 is an ISA extension that provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
- If you want to force-enable AVX512_BF16 for cross-compilation, please set the environment variable `VLLM_CPU_AVX512BF16=1` before building.
```
:::
## Set up using Docker


@@ -10,9 +10,9 @@ vLLM contains pre-compiled C++ and CUDA (12.1) binaries.
### Create a new Python environment
```{note}
:::{note}
PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details.
```
:::
In order to be performant, vLLM has to compile many CUDA kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different build configurations.
@@ -100,10 +100,10 @@ pip install --editable .
You can find more information about vLLM's wheels in <project:#install-the-latest-code>.
```{note}
:::{note}
Your source code may have a different commit ID than the latest vLLM wheel, which could potentially lead to unknown errors.
It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to <project:#install-the-latest-code> for instructions on how to install a specified wheel.
```
:::
#### Full build (with compilation)
@@ -115,7 +115,7 @@ cd vllm
pip install -e .
```
```{tip}
:::{tip}
Building from source requires a lot of compilation. If you are building from source repeatedly, it's more efficient to cache the compilation results.
For example, you can install [ccache](https://github.com/ccache/ccache) using `conda install ccache` or `apt install ccache`.
@@ -123,7 +123,7 @@ As long as `which ccache` command can find the `ccache` binary, it will be used
[sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments.
The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`.
```
:::
##### Use an existing PyTorch installation


@@ -2,299 +2,299 @@
vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor-specific instructions:
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} CUDA
::::{tab-item} CUDA
:sync: cuda
```{include} cuda.inc.md
:::{include} cuda.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
```
:::
:::{tab-item} ROCm
:sync: rocm
```{include} rocm.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
```
:::
:::{tab-item} XPU
:sync: xpu
```{include} xpu.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
```
:::
::::
::::{tab-item} ROCm
:sync: rocm
:::{include} rocm.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
::::{tab-item} XPU
:sync: xpu
:::{include} xpu.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
:::::
## Requirements
- OS: Linux
- Python: 3.9 -- 3.12
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} CUDA
::::{tab-item} CUDA
:sync: cuda
```{include} cuda.inc.md
:::{include} cuda.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
```
:::
:::{tab-item} ROCm
:sync: rocm
```{include} rocm.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
```
:::
:::{tab-item} XPU
:sync: xpu
```{include} xpu.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
```
:::
::::
::::{tab-item} ROCm
:sync: rocm
:::{include} rocm.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
::::
::::{tab-item} XPU
:sync: xpu
:::{include} xpu.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
::::
:::::
## Set up using Python
### Create a new Python environment
```{include} ../python_env_setup.inc.md
```
::::{tab-set}
:sync-group: device
:::{tab-item} CUDA
:sync: cuda
```{include} cuda.inc.md
:start-after: "## Create a new Python environment"
:end-before: "### Pre-built wheels"
```
:::{include} ../python_env_setup.inc.md
:::
:::{tab-item} ROCm
:::::{tab-set}
:sync-group: device
::::{tab-item} CUDA
:sync: cuda
:::{include} cuda.inc.md
:start-after: "## Create a new Python environment"
:end-before: "### Pre-built wheels"
:::
::::
::::{tab-item} ROCm
:sync: rocm
There is no extra information on creating a new Python environment for this device.
:::
::::
:::{tab-item} XPU
::::{tab-item} XPU
:sync: xpu
There is no extra information on creating a new Python environment for this device.
:::
::::
:::::
### Pre-built wheels
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} CUDA
::::{tab-item} CUDA
:sync: cuda
```{include} cuda.inc.md
:::{include} cuda.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
```
:::
:::{tab-item} ROCm
:sync: rocm
```{include} rocm.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
```
:::
:::{tab-item} XPU
:sync: xpu
```{include} xpu.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
```
:::
::::
::::{tab-item} ROCm
:sync: rocm
:::{include} rocm.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
::::
::::{tab-item} XPU
:sync: xpu
:::{include} xpu.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
::::
:::::
(build-from-source)=
### Build wheel from source
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} CUDA
::::{tab-item} CUDA
:sync: cuda
```{include} cuda.inc.md
:::{include} cuda.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
```
:::
:::{tab-item} ROCm
:sync: rocm
```{include} rocm.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
```
:::
:::{tab-item} XPU
:sync: xpu
```{include} xpu.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
```
:::
::::
::::{tab-item} ROCm
:sync: rocm
:::{include} rocm.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
::::{tab-item} XPU
:sync: xpu
:::{include} xpu.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
:::::
## Set up using Docker
### Pre-built images
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} CUDA
::::{tab-item} CUDA
:sync: cuda
```{include} cuda.inc.md
:::{include} cuda.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
```
:::
:::{tab-item} ROCm
:sync: rocm
```{include} rocm.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
```
:::
:::{tab-item} XPU
:sync: xpu
```{include} xpu.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
```
:::
::::
::::{tab-item} ROCm
:sync: rocm
:::{include} rocm.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
::::
::::{tab-item} XPU
:sync: xpu
:::{include} xpu.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
::::
:::::
### Build image from source
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} CUDA
::::{tab-item} CUDA
:sync: cuda
```{include} cuda.inc.md
:::{include} cuda.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
```
:::
:::{tab-item} ROCm
:sync: rocm
```{include} rocm.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
```
:::
:::{tab-item} XPU
:sync: xpu
```{include} xpu.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
```
:::
::::
::::{tab-item} ROCm
:sync: rocm
:::{include} rocm.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
:::
::::
::::{tab-item} XPU
:sync: xpu
:::{include} xpu.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
:::
::::
:::::
## Supported features
::::{tab-set}
:::::{tab-set}
:sync-group: device
:::{tab-item} CUDA
::::{tab-item} CUDA
:sync: cuda
```{include} cuda.inc.md
:::{include} cuda.inc.md
:start-after: "## Supported features"
```
:::
:::{tab-item} ROCm
:sync: rocm
```{include} rocm.inc.md
:start-after: "## Supported features"
```
:::
:::{tab-item} XPU
:sync: xpu
```{include} xpu.inc.md
:start-after: "## Supported features"
```
:::
::::
::::{tab-item} ROCm
:sync: rocm
:::{include} rocm.inc.md
:start-after: "## Supported features"
:::
::::
::::{tab-item} XPU
:sync: xpu
:::{include} xpu.inc.md
:start-after: "## Supported features"
:::
::::
:::::
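The converted markup above follows one consistent nesting convention: the outermost container uses the most colons, and each nested directive drops one colon, so every opener pairs with a close of the same length. As a minimal sketch (directive names and labels taken from the hunks above, trimmed to a single tab):

```markdown
:::::{tab-set}
:sync-group: device

::::{tab-item} CUDA
:sync: cuda

:::{include} cuda.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::

::::
:::::
```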



@@ -16,10 +16,10 @@ Currently, there are no pre-built ROCm wheels.
However, the [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator.
```{tip}
:::{tip}
Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
for instructions on how to use this prebuilt docker image.
```
:::
### Build wheel from source
@@ -47,9 +47,9 @@ for instructions on how to use this prebuilt docker image.
cd ../..
```
```{note}
- If you see an HTTP issue related to downloading packages while building triton, please try again, as the HTTP error is intermittent.
```
:::{note}
If you see an HTTP issue related to downloading packages while building triton, please try again, as the HTTP error is intermittent.
:::
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck_tile)
@@ -67,9 +67,9 @@ for instructions on how to use this prebuilt docker image.
cd ..
```
```{note}
- You might need to downgrade the "ninja" version to 1.10 as it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
```
:::{note}
You might need to downgrade the "ninja" version to 1.10 as it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
:::
3. Build vLLM. For example, vLLM on ROCM 6.2 can be built with the following steps:
@@ -95,17 +95,18 @@ for instructions on how to use this prebuilt docker image.
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
```{tip}
<!--- pyml disable-num-lines 5 ul-indent-->
:::{tip}
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
- Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
- To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
- The ROCm version of PyTorch, ideally, should match the ROCm driver version.
```
:::
```{tip}
:::{tip}
- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
```
:::
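As a quick illustration of the attention-backend switch described in the tips above — the environment variable comes from the text, while the commented launch command and model name are placeholders, not a confirmed invocation:

```shell
# Turn off Triton flash attention so vLLM falls back to CK flash-attention
# or PyTorch naive attention (flag taken from the tip above).
export VLLM_USE_TRITON_FLASH_ATTN=0

# Confirm the flag is set before launching the server.
echo "VLLM_USE_TRITON_FLASH_ATTN=${VLLM_USE_TRITON_FLASH_ATTN}"

# Illustrative launch; replace <model-name> with your own model.
# vllm serve <model-name>
```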
## Set up using Docker


@@ -30,10 +30,10 @@ pip install -v -r requirements-xpu.txt
VLLM_TARGET_DEVICE=xpu python setup.py install
```
```{note}
:::{note}
- FP16 is the default data type in the current XPU backend. The BF16 data
type will be supported in the future.
```
:::
## Set up using Docker


@@ -4,10 +4,10 @@
vLLM supports the following hardware platforms:
```{toctree}
:::{toctree}
:maxdepth: 1
gpu/index
cpu/index
ai_accelerator/index
```
:::


@@ -6,9 +6,9 @@ conda create -n myenv python=3.12 -y
conda activate myenv
```
```{note}
:::{note}
[PyTorch has deprecated the conda release channel](https://github.com/pytorch/pytorch/issues/138506). If you use `conda`, please only use it to create the Python environment rather than to install packages.
```
:::
Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command: