[doc] split "Other AI Accelerators" tabs (#19708)

David Xia
2025-06-17 09:05:29 -04:00
committed by GitHub
parent 154d063b9f
commit 93aee29fdb
6 changed files with 69 additions and 217 deletions

View File

@@ -2,4 +2,6 @@ nav:
 - README.md
 - gpu.md
 - cpu.md
-- ai_accelerator.md
+- google_tpu.md
+- intel_gaudi.md
+- aws_neuron.md

View File

@@ -14,7 +14,6 @@ vLLM supports the following hardware platforms:
 - [ARM AArch64](cpu.md#arm-aarch64)
 - [Apple silicon](cpu.md#apple-silicon)
 - [IBM Z (S390X)](cpu.md#ibm-z-s390x)
-- [Other AI accelerators](ai_accelerator.md)
-    - [Google TPU](ai_accelerator.md#google-tpu)
-    - [Intel Gaudi](ai_accelerator.md#intel-gaudi)
-    - [AWS Neuron](ai_accelerator.md#aws-neuron)
+- [Google TPU](google_tpu.md)
+- [Intel Gaudi](intel_gaudi.md)
+- [AWS Neuron](aws_neuron.md)

View File

@@ -1,117 +0,0 @@
-# Other AI accelerators
-
-vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions:
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:installation"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:installation"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:installation"
-
-## Requirements
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:requirements"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:requirements"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:requirements"
-
-## Configure a new environment
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:configure-a-new-environment"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:configure-a-new-environment"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:configure-a-new-environment"
-
-## Set up using Python
-
-### Pre-built wheels
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-wheels"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-wheels"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-wheels"
-
-### Build wheel from source
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-wheel-from-source"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-wheel-from-source"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-wheel-from-source"
-
-## Set up using Docker
-
-### Pre-built images
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-images"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-images"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-images"
-
-### Build image from source
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-image-from-source"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-image-from-source"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-image-from-source"
-
-## Extra information
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:extra-information"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:extra-information"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:extra-information"
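For context on the file being deleted above: the `--8<--` lines are PyMdown Extensions "snippets" includes. A `[start:NAME]`/`[end:NAME]` comment pair inside an `.inc.md` file delimits a named section, and `"path/to/file.inc.md:NAME"` pulls in only that section. A minimal sketch (file and section names here are hypothetical):

```markdown
<!-- shared.inc.md -->
# --8<-- [start:requirements]
- OS: Linux
# --8<-- [end:requirements]

<!-- page.md -->
--8<-- "shared.inc.md:requirements"
```

Once each accelerator gets its own standalone page, the shared-tab indirection is unnecessary, which is why the hunks below replace every start/end marker pair with a plain heading.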

View File

@@ -1,15 +1,14 @@
-# --8<-- [start:installation]
+# AWS Neuron
 
 [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and
 generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
 and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores.
-This tab describes how to set up your environment to run vLLM on Neuron.
+This describes how to set up your environment to run vLLM on Neuron.
 
 !!! warning
     There are no pre-built wheels or images for this device, so you must build vLLM from source.
 
-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+## Requirements
 
 - OS: Linux
 - Python: 3.9 or newer
@@ -17,8 +16,7 @@
 - Accelerator: NeuronCore-v2 (in trn1/inf2 chips) or NeuronCore-v3 (in trn2 chips)
 - AWS Neuron SDK 2.23
 
-# --8<-- [end:requirements]
-# --8<-- [start:configure-a-new-environment]
+## Configure a new environment
 
 ### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies
@@ -27,6 +25,7 @@ The easiest way to launch a Trainium or Inferentia instance with pre-installed N
 - After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
 - Once inside your instance, activate the pre-installed virtual environment for inference by running
+
 ```console
 source /opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/bin/activate
 ```
@@ -38,20 +37,15 @@ for alternative setup instructions including using Docker and manually installin
 NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
 library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).
 
-# --8<-- [end:configure-a-new-environment]
-# --8<-- [start:set-up-using-python]
+## Set up using Python
 
-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+### Pre-built wheels
 
 Currently, there are no pre-built Neuron wheels.
 
-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
+### Build wheel from source
 
-#### Install vLLM from source
-
-Install vllm as follows:
+To build and install vLLM from source, run:
 
 ```console
 git clone https://github.com/vllm-project/vllm.git
@@ -61,8 +55,8 @@ VLLM_TARGET_DEVICE="neuron" pip install -e .
 ```
 
 AWS Neuron maintains a [Github fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2) at
-[https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2), which contains several features in addition to what's
+<https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2>, which contains several features in addition to what's
 available on vLLM V0. Please utilize the AWS Fork for the following features:
 
 - Llama-3.2 multi-modal support
 - Multi-node distributed inference
@@ -81,25 +75,22 @@ VLLM_TARGET_DEVICE="neuron" pip install -e .
 Note that the AWS Neuron fork is only intended to support Neuron hardware; compatibility with other hardwares is not tested.
 
-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
+## Set up using Docker
 
-# --8<-- [end:set-up-using-docker]
-# --8<-- [start:pre-built-images]
+### Pre-built images
 
 Currently, there are no pre-built Neuron images.
 
-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
+### Build image from source
 
 See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.
 
 Make sure to use <gh-file:docker/Dockerfile.neuron> in place of the default Dockerfile.
 
-# --8<-- [end:build-image-from-source]
-# --8<-- [start:extra-information]
+## Extra information
 
 [](){ #feature-support-through-nxd-inference-backend }
 
 ### Feature support through NxD Inference backend
 
 The current vLLM and Neuron integration relies on either the `neuronx-distributed-inference` (preferred) or `transformers-neuronx` backend
@@ -108,12 +99,15 @@ to perform most of the heavy lifting which includes PyTorch model initialization
 To configure NxD Inference features through the vLLM entrypoint, use the `override_neuron_config` setting. Provide the configs you want to override
 as a dictionary (or JSON object when starting vLLM from the CLI). For example, to disable auto bucketing, include
+
 ```console
 override_neuron_config={
     "enable_bucketing":False,
 }
 ```
+
 or when launching vLLM from the CLI, pass
+
 ```console
 --override-neuron-config "{\"enable_bucketing\":false}"
 ```
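The CLI form above has to survive shell quoting; single-quoting the JSON avoids the backslash escapes entirely. A minimal sketch showing that both spellings produce the same argument string:

```shell
# Two equivalent ways to quote the JSON passed to --override-neuron-config:
escaped="{\"enable_bucketing\":false}"   # double-quoted, backslash-escaped (as above)
single='{"enable_bucketing":false}'      # single-quoted, no escaping needed
[ "$escaped" = "$single" ] && echo "identical"
```

Either string can then be passed as the value of the flag; the server parses it as a JSON object.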
@@ -142,8 +136,8 @@ Alternatively, users can directly call the NxDI library to trace and compile you
 for paged attention, there isn't another Neuron block for vLLM to allocate. A workaround fix (to terminate 1 iteration early) is
 implemented in the AWS Neuron fork but is not upstreamed to vLLM main as it modifies core vLLM logic.
 
 ### Environment variables
 
 - `NEURON_COMPILED_ARTIFACTS`: set this environment variable to point to your pre-compiled model artifacts directory to avoid
   compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the
   artifacts under `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this environment variable is set,
@@ -151,5 +145,3 @@ Alternatively, users can directly call the NxDI library to trace and compile you
   under this specified path.
 - `NEURON_CONTEXT_LENGTH_BUCKETS`: Bucket sizes for context encoding. (Only applicable to `transformers-neuronx` backend).
 - `NEURON_TOKEN_GEN_BUCKETS`: Bucket sizes for token generation. (Only applicable to `transformers-neuronx` backend).
-
-# --8<-- [end:extra-information]
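The artifacts cache described above is enabled simply by exporting the variable before starting the server; the path below is a placeholder, not a value from the docs:

```shell
# Placeholder path: point this at your own pre-compiled artifacts directory
# so the Neuron backend skips compilation at server start.
export NEURON_COMPILED_ARTIFACTS=/opt/neuron/compiled-artifacts
echo "$NEURON_COMPILED_ARTIFACTS"
```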

View File

@@ -1,4 +1,4 @@
-# --8<-- [start:installation]
+# Google TPU
 
 Tensor Processing Units (TPUs) are Google's custom-developed application-specific
 integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
@@ -33,8 +33,7 @@ information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp
 !!! warning
     There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
 
-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+## Requirements
 
 - Google Cloud TPU VM
 - TPU versions: v6e, v5e, v5p, v4
@@ -63,8 +62,7 @@ For more information about using TPUs with GKE, see:
 - <https://cloud.google.com/kubernetes-engine/docs/concepts/tpus>
 - <https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus>
 
-# --8<-- [end:requirements]
-# --8<-- [start:configure-a-new-environment]
+## Configure a new environment
 
 ### Provision a Cloud TPU with the queued resource API
@@ -100,16 +98,13 @@ gcloud compute tpus tpu-vm ssh TPU_NAME --project PROJECT_ID --zone ZONE
 [TPU VM images]: https://cloud.google.com/tpu/docs/runtimes
 [TPU regions and zones]: https://cloud.google.com/tpu/docs/regions-zones
 
-# --8<-- [end:configure-a-new-environment]
-# --8<-- [start:set-up-using-python]
+## Set up using Python
 
-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+### Pre-built wheels
 
 Currently, there are no pre-built TPU wheels.
 
-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
+### Build wheel from source
 
 Install Miniconda:
@@ -142,7 +137,7 @@ Install build dependencies:
 ```bash
 pip install -r requirements/tpu.txt
-sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
+sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
 ```
 
 Run the setup script:
@@ -151,16 +146,13 @@ Run the setup script:
 VLLM_TARGET_DEVICE="tpu" python -m pip install -e .
 ```
 
-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
+## Set up using Docker
 
-# --8<-- [end:set-up-using-docker]
-# --8<-- [start:pre-built-images]
+### Pre-built images
 
 See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
 
-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
+### Build image from source
 
 You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.
@@ -194,11 +186,5 @@ docker run --privileged --net host --shm-size=16G -it vllm-tpu
 Install OpenBLAS with the following command:
 
 ```console
-sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
+sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
 ```
-
-# --8<-- [end:build-image-from-source]
-# --8<-- [start:extra-information]
-
-There is no extra information for this device.
-# --8<-- [end:extra-information]

View File

@@ -1,12 +1,11 @@
-# --8<-- [start:installation]
+# Intel Gaudi
 
-This tab provides instructions on running vLLM with Intel Gaudi devices.
+This page provides instructions on running vLLM with Intel Gaudi devices.
 
 !!! warning
     There are no pre-built wheels or images for this device, so you must build vLLM from source.
 
-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+## Requirements
 
 - OS: Ubuntu 22.04 LTS
 - Python: 3.10
@@ -19,8 +18,7 @@ to set up the execution environment. To achieve the best performance,
 please follow the methods outlined in the
 [Optimizing Training Platform Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html).
 
-# --8<-- [end:requirements]
-# --8<-- [start:configure-a-new-environment]
+## Configure a new environment
 
 ### Environment verification
@@ -57,16 +55,13 @@ docker run \
 vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
 ```
 
-# --8<-- [end:configure-a-new-environment]
-# --8<-- [start:set-up-using-python]
+## Set up using Python
 
-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+### Pre-built wheels
 
 Currently, there are no pre-built Intel Gaudi wheels.
 
-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
+### Build wheel from source
 
 To build and install vLLM from source, run:
@@ -87,16 +82,13 @@ pip install -r requirements/hpu.txt
 python setup.py develop
 ```
 
-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
+## Set up using Docker
 
-# --8<-- [end:set-up-using-docker]
-# --8<-- [start:pre-built-images]
+### Pre-built images
 
 Currently, there are no pre-built Intel Gaudi images.
 
-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
+### Build image from source
 
 ```console
 docker build -f docker/Dockerfile.hpu -t vllm-hpu-env .
@@ -113,10 +105,9 @@ docker run \
 !!! tip
     If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered.
 
-# --8<-- [end:build-image-from-source]
-# --8<-- [start:extra-information]
+## Extra information
 
-## Supported features
+### Supported features
 
 - [Offline inference][offline-inference]
 - Online serving via [OpenAI-Compatible Server][openai-compatible-server]
@@ -130,14 +121,14 @@ docker run \
 for accelerating low-batch latency and throughput
 - Attention with Linear Biases (ALiBi)
 
-## Unsupported features
+### Unsupported features
 
 - Beam search
 - LoRA adapters
 - Quantization
 - Prefill chunking (mixed-batch inferencing)
 
-## Supported configurations
+### Supported configurations
 
 The following configurations have been validated to function with
 Gaudi2 devices. Configurations that are not listed may or may not work.
@@ -401,4 +392,3 @@ the below:
 higher batches. You can do that by adding `--enforce-eager` flag to
 server (for online serving), or by passing `enforce_eager=True`
 argument to LLM constructor (for offline inference).
-# --8<-- [end:extra-information]