[doc] split "Other AI Accelerators" tabs (#19708)
@@ -2,4 +2,6 @@ nav:
 - README.md
 - gpu.md
 - cpu.md
-- ai_accelerator.md
+- google_tpu.md
+- intel_gaudi.md
+- aws_neuron.md
@@ -14,7 +14,6 @@ vLLM supports the following hardware platforms:
 - [ARM AArch64](cpu.md#arm-aarch64)
 - [Apple silicon](cpu.md#apple-silicon)
 - [IBM Z (S390X)](cpu.md#ibm-z-s390x)
-- [Other AI accelerators](ai_accelerator.md)
-- [Google TPU](ai_accelerator.md#google-tpu)
-- [Intel Gaudi](ai_accelerator.md#intel-gaudi)
-- [AWS Neuron](ai_accelerator.md#aws-neuron)
+- [Google TPU](google_tpu.md)
+- [Intel Gaudi](intel_gaudi.md)
+- [AWS Neuron](aws_neuron.md)
@@ -1,117 +0,0 @@
-# Other AI accelerators
-
-vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions:
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:installation"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:installation"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:installation"
-
-## Requirements
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:requirements"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:requirements"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:requirements"
-
-## Configure a new environment
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:configure-a-new-environment"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:configure-a-new-environment"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:configure-a-new-environment"
-
-## Set up using Python
-
-### Pre-built wheels
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-wheels"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-wheels"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-wheels"
-
-### Build wheel from source
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-wheel-from-source"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-wheel-from-source"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-wheel-from-source"
-
-## Set up using Docker
-
-### Pre-built images
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-images"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-images"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-images"
-
-### Build image from source
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-image-from-source"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-image-from-source"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-image-from-source"
-
-## Extra information
-
-=== "Google TPU"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:extra-information"
-
-=== "Intel Gaudi"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:extra-information"
-
-=== "AWS Neuron"
-
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:extra-information"
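As context for the deleted file above (not part of the diff itself): the `--8<--` lines are pymdown-extensions snippet includes, and a `path.md:section` reference pulls in only the text between the matching `[start:section]` and `[end:section]` markers in the target file. A minimal Python sketch of that resolution, as an illustration rather than the library's actual implementation:

```python
import re

def extract_section(text: str, section: str) -> str:
    # Capture everything between "# --8<-- [start:section]" and
    # "# --8<-- [end:section]" markers, mirroring snippet-section includes.
    pattern = re.compile(
        r"#\s*--8<--\s*\[start:" + re.escape(section) + r"\]\n"
        r"(.*?)"
        r"#\s*--8<--\s*\[end:" + re.escape(section) + r"\]",
        re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(1).strip() if match else ""

doc = """# --8<-- [start:requirements]
- OS: Linux
# --8<-- [end:requirements]"""
print(extract_section(doc, "requirements"))  # - OS: Linux
```

This is why the commit can replace marker pairs with real headings in the `.inc.md` files once they become standalone pages: nothing consumes the sections any more.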
@@ -1,15 +1,14 @@
-# --8<-- [start:installation]
+# AWS Neuron
 
 [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and
 generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
 and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores.
-This tab describes how to set up your environment to run vLLM on Neuron.
+This describes how to set up your environment to run vLLM on Neuron.
 
 !!! warning
     There are no pre-built wheels or images for this device, so you must build vLLM from source.
 
-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+## Requirements
 
 - OS: Linux
 - Python: 3.9 or newer
@@ -17,8 +16,7 @@
 - Accelerator: NeuronCore-v2 (in trn1/inf2 chips) or NeuronCore-v3 (in trn2 chips)
 - AWS Neuron SDK 2.23
 
-# --8<-- [end:requirements]
-# --8<-- [start:configure-a-new-environment]
+## Configure a new environment
 
 ### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies
@@ -27,6 +25,7 @@ The easiest way to launch a Trainium or Inferentia instance with pre-installed N
 
 - After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
 - Once inside your instance, activate the pre-installed virtual environment for inference by running
+
 ```console
 source /opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/bin/activate
 ```
@@ -38,20 +37,15 @@ for alternative setup instructions including using Docker and manually installin
 NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
 library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).
 
-# --8<-- [end:configure-a-new-environment]
-# --8<-- [start:set-up-using-python]
-
-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+## Set up using Python
+
+### Pre-built wheels
 
 Currently, there are no pre-built Neuron wheels.
 
-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
+### Build wheel from source
 
-#### Install vLLM from source
-
-Install vllm as follows:
+To build and install vLLM from source, run:
 
 ```console
 git clone https://github.com/vllm-project/vllm.git
@@ -61,8 +55,8 @@ VLLM_TARGET_DEVICE="neuron" pip install -e .
 ```
 
 AWS Neuron maintains a [Github fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2) at
-[https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2), which contains several features in addition to what's
+<https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2>, which contains several features in addition to what's
 available on vLLM V0. Please utilize the AWS Fork for the following features:
 
 - Llama-3.2 multi-modal support
 - Multi-node distributed inference
@@ -81,25 +75,22 @@ VLLM_TARGET_DEVICE="neuron" pip install -e .
 
 Note that the AWS Neuron fork is only intended to support Neuron hardware; compatibility with other hardwares is not tested.
 
-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
-
-# --8<-- [end:set-up-using-docker]
-# --8<-- [start:pre-built-images]
+## Set up using Docker
+
+### Pre-built images
 
 Currently, there are no pre-built Neuron images.
 
-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
+### Build image from source
 
 See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.
 
 Make sure to use <gh-file:docker/Dockerfile.neuron> in place of the default Dockerfile.
 
-# --8<-- [end:build-image-from-source]
-# --8<-- [start:extra-information]
+## Extra information
 
 [](){ #feature-support-through-nxd-inference-backend }
 
 ### Feature support through NxD Inference backend
 
 The current vLLM and Neuron integration relies on either the `neuronx-distributed-inference` (preferred) or `transformers-neuronx` backend
@@ -108,12 +99,15 @@ to perform most of the heavy lifting which includes PyTorch model initialization
 
 To configure NxD Inference features through the vLLM entrypoint, use the `override_neuron_config` setting. Provide the configs you want to override
 as a dictionary (or JSON object when starting vLLM from the CLI). For example, to disable auto bucketing, include
 
 ```console
 override_neuron_config={
     "enable_bucketing":False,
 }
 ```
 
 or when launching vLLM from the CLI, pass
 
 ```console
 --override-neuron-config "{\"enable_bucketing\":false}"
 ```
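A side note on the CLI form shown in the hunk above (not part of the diff): the escaped JSON string can be generated with Python's standard `json` module instead of hand-escaping quotes, since `json.dumps` emits double quotes and lowercase booleans in exactly the shape the flag expects:

```python
import json

# The overrides as a plain dict, mirroring the example above.
overrides = {"enable_bucketing": False}

# Produces the string to pass to --override-neuron-config.
flag_value = json.dumps(overrides)
print(flag_value)  # {"enable_bucketing": false}
```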
@@ -142,8 +136,8 @@ Alternatively, users can directly call the NxDI library to trace and compile you
 for paged attention, there isn't another Neuron block for vLLM to allocate. A workaround fix (to terminate 1 iteration early) is
 implemented in the AWS Neuron fork but is not upstreamed to vLLM main as it modifies core vLLM logic.
 
 ### Environment variables
 
 - `NEURON_COMPILED_ARTIFACTS`: set this environment variable to point to your pre-compiled model artifacts directory to avoid
   compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the
   artifacts under `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this environment variable is set,
@@ -151,5 +145,3 @@ Alternatively, users can directly call the NxDI library to trace and compile you
 under this specified path.
 - `NEURON_CONTEXT_LENGTH_BUCKETS`: Bucket sizes for context encoding. (Only applicable to `transformers-neuronx` backend).
 - `NEURON_TOKEN_GEN_BUCKETS`: Bucket sizes for token generation. (Only applicable to `transformers-neuronx` backend).
-
-# --8<-- [end:extra-information]
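The `NEURON_COMPILED_ARTIFACTS` lookup described in the hunks above can be sketched as follows; this is a hypothetical helper for illustration, not vLLM's actual code:

```python
import os

def artifacts_dir(model_path: str, unique_hash: str) -> str:
    # Prefer the NEURON_COMPILED_ARTIFACTS override when it is set;
    # otherwise fall back to the per-model sub-directory keyed by the
    # compilation hash, as the docs describe.
    override = os.environ.get("NEURON_COMPILED_ARTIFACTS")
    if override:
        return override
    return os.path.join(model_path, "neuron-compiled-artifacts", unique_hash)

print(artifacts_dir("/models/llama", "abc123"))
```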
@@ -1,4 +1,4 @@
-# --8<-- [start:installation]
+# Google TPU
 
 Tensor Processing Units (TPUs) are Google's custom-developed application-specific
 integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
@@ -33,8 +33,7 @@ information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp
 !!! warning
     There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
 
-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+## Requirements
 
 - Google Cloud TPU VM
 - TPU versions: v6e, v5e, v5p, v4
@@ -63,8 +62,7 @@ For more information about using TPUs with GKE, see:
 - <https://cloud.google.com/kubernetes-engine/docs/concepts/tpus>
 - <https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus>
 
-# --8<-- [end:requirements]
-# --8<-- [start:configure-a-new-environment]
+## Configure a new environment
 
 ### Provision a Cloud TPU with the queued resource API
@@ -100,16 +98,13 @@ gcloud compute tpus tpu-vm ssh TPU_NAME --project PROJECT_ID --zone ZONE
 [TPU VM images]: https://cloud.google.com/tpu/docs/runtimes
 [TPU regions and zones]: https://cloud.google.com/tpu/docs/regions-zones
 
-# --8<-- [end:configure-a-new-environment]
-# --8<-- [start:set-up-using-python]
-
-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+## Set up using Python
+
+### Pre-built wheels
 
 Currently, there are no pre-built TPU wheels.
 
-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
+### Build wheel from source
 
 Install Miniconda:
@@ -142,7 +137,7 @@ Install build dependencies:
 
 ```bash
 pip install -r requirements/tpu.txt
-sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
+sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
 ```
 
 Run the setup script:
@@ -151,16 +146,13 @@ Run the setup script:
 VLLM_TARGET_DEVICE="tpu" python -m pip install -e .
 ```
 
-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
-
-# --8<-- [end:set-up-using-docker]
-# --8<-- [start:pre-built-images]
+## Set up using Docker
+
+### Pre-built images
 
 See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
 
-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
+### Build image from source
 
 You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.
@@ -194,11 +186,5 @@ docker run --privileged --net host --shm-size=16G -it vllm-tpu
 Install OpenBLAS with the following command:
 
 ```console
-sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
+sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
 ```
-
-# --8<-- [end:build-image-from-source]
-# --8<-- [start:extra-information]
-
-There is no extra information for this device.
-# --8<-- [end:extra-information]
@@ -1,12 +1,11 @@
-# --8<-- [start:installation]
+# Intel Gaudi
 
-This tab provides instructions on running vLLM with Intel Gaudi devices.
+This page provides instructions on running vLLM with Intel Gaudi devices.
 
 !!! warning
     There are no pre-built wheels or images for this device, so you must build vLLM from source.
 
-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+## Requirements
 
 - OS: Ubuntu 22.04 LTS
 - Python: 3.10
@@ -19,8 +18,7 @@ to set up the execution environment. To achieve the best performance,
 please follow the methods outlined in the
 [Optimizing Training Platform Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html).
 
-# --8<-- [end:requirements]
-# --8<-- [start:configure-a-new-environment]
+## Configure a new environment
 
 ### Environment verification
@@ -57,16 +55,13 @@ docker run \
 vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
 ```
 
-# --8<-- [end:configure-a-new-environment]
-# --8<-- [start:set-up-using-python]
-
-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+## Set up using Python
+
+### Pre-built wheels
 
 Currently, there are no pre-built Intel Gaudi wheels.
 
-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
+### Build wheel from source
 
 To build and install vLLM from source, run:
@@ -87,16 +82,13 @@ pip install -r requirements/hpu.txt
 python setup.py develop
 ```
 
-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
-
-# --8<-- [end:set-up-using-docker]
-# --8<-- [start:pre-built-images]
+## Set up using Docker
+
+### Pre-built images
 
 Currently, there are no pre-built Intel Gaudi images.
 
-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
+### Build image from source
 
 ```console
 docker build -f docker/Dockerfile.hpu -t vllm-hpu-env .
@@ -113,10 +105,9 @@ docker run \
 !!! tip
     If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered.
 
-# --8<-- [end:build-image-from-source]
-# --8<-- [start:extra-information]
+## Extra information
 
-## Supported features
+### Supported features
 
 - [Offline inference][offline-inference]
 - Online serving via [OpenAI-Compatible Server][openai-compatible-server]
@@ -130,14 +121,14 @@ docker run \
 for accelerating low-batch latency and throughput
 - Attention with Linear Biases (ALiBi)
 
-## Unsupported features
+### Unsupported features
 
 - Beam search
 - LoRA adapters
 - Quantization
 - Prefill chunking (mixed-batch inferencing)
 
-## Supported configurations
+### Supported configurations
 
 The following configurations have been validated to function with
 Gaudi2 devices. Configurations that are not listed may or may not work.
@@ -401,4 +392,3 @@ the below:
 higher batches. You can do that by adding `--enforce-eager` flag to
 server (for online serving), or by passing `enforce_eager=True`
 argument to LLM constructor (for offline inference).
-# --8<-- [end:extra-information]