[doc] split "Other AI Accelerators" tabs (#19708)

2025-06-17 09:05:29 -04:00
parent 154d063b9f
commit 93aee29fdb
6 changed files with 69 additions and 217 deletions
--- a/docs/getting_started/installation/.nav.yml
+++ b/docs/getting_started/installation/.nav.yml
@@ -2,4 +2,6 @@ nav:
  - README.md
  - gpu.md
  - cpu.md
-  - ai_accelerator.md
+  - google_tpu.md
+  - intel_gaudi.md
+  - aws_neuron.md
--- a/docs/getting_started/installation/README.md
+++ b/docs/getting_started/installation/README.md
@@ -14,7 +14,6 @@ vLLM supports the following hardware platforms:
    - [ARM AArch64](cpu.md#arm-aarch64)
    - [Apple silicon](cpu.md#apple-silicon)
    - [IBM Z (S390X)](cpu.md#ibm-z-s390x)
-- [Other AI accelerators](ai_accelerator.md)
-    - [Google TPU](ai_accelerator.md#google-tpu)
-    - [Intel Gaudi](ai_accelerator.md#intel-gaudi)
-    - [AWS Neuron](ai_accelerator.md#aws-neuron)
+- [Google TPU](google_tpu.md)
+- [Intel Gaudi](intel_gaudi.md)
+- [AWS Neuron](aws_neuron.md)
--- a/docs/getting_started/installation/ai_accelerator.md
+++ b/docs/getting_started/installation/ai_accelerator.md
@@ -1,117 +0,0 @@
-# Other AI accelerators
-vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions:
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:installation"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:installation"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:installation"
-## Requirements
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:requirements"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:requirements"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:requirements"
-## Configure a new environment
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:configure-a-new-environment"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:configure-a-new-environment"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:configure-a-new-environment"
-## Set up using Python
-### Pre-built wheels
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-wheels"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-wheels"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-wheels"
-### Build wheel from source
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-wheel-from-source"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-wheel-from-source"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-wheel-from-source"
-## Set up using Docker
-### Pre-built images
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-images"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-images"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-images"
-### Build image from source
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-image-from-source"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-image-from-source"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-image-from-source"
-## Extra information
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:extra-information"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:extra-information"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:extra-information"
--- a/docs/getting_started/installation/ai_accelerator/neuron.inc.md
+++ b/docs/getting_started/installation/ai_accelerator/neuron.inc.md
@@ -1,15 +1,14 @@
-# --8<-- [start:installation]
+# AWS Neuron
 [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and
-    generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
+generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
-    and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores.
+and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores.
-    This tab describes how to set up your environment to run vLLM on Neuron.
+This describes how to set up your environment to run vLLM on Neuron.
 !!! warning
    There are no pre-built wheels or images for this device, so you must build vLLM from source.
-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+## Requirements
 - OS: Linux
 - Python: 3.9 or newer
@@ -17,8 +16,7 @@
 - Accelerator: NeuronCore-v2 (in trn1/inf2 chips) or NeuronCore-v3 (in trn2 chips)
 - AWS Neuron SDK 2.23
-# --8<-- [end:requirements]
-# --8<-- [start:configure-a-new-environment]
+## Configure a new environment
 ### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies
@@ -27,6 +25,7 @@ The easiest way to launch a Trainium or Inferentia instance with pre-installed N
 - After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
 - Once inside your instance, activate the pre-installed virtual environment for inference by running
 ```console
 source /opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/bin/activate
 ```
@@ -38,20 +37,15 @@ for alternative setup instructions including using Docker and manually installin
    NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
    library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).
-# --8<-- [end:configure-a-new-environment]
-# --8<-- [start:set-up-using-python]
+## Set up using Python
-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+### Pre-built wheels
 Currently, there are no pre-built Neuron wheels.
-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
+### Build wheel from source
-#### Install vLLM from source
-Install vllm as follows:
+To build and install vLLM from source, run:
 ```console
 git clone https://github.com/vllm-project/vllm.git
@@ -61,8 +55,8 @@ VLLM_TARGET_DEVICE="neuron" pip install -e .
 ```
 AWS Neuron maintains a [Github fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2) at
-    [https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2), which contains several features in addition to what's
+<https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2>, which contains several features in addition to what's
-    available on vLLM V0. Please utilize the AWS Fork for the following features:
+available on vLLM V0. Please utilize the AWS Fork for the following features:
 - Llama-3.2 multi-modal support
 - Multi-node distributed inference
@@ -81,25 +75,22 @@ VLLM_TARGET_DEVICE="neuron" pip install -e .
 Note that the AWS Neuron fork is only intended to support Neuron hardware; compatibility with other hardwares is not tested.
-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
+## Set up using Docker
-# --8<-- [end:set-up-using-docker]
-# --8<-- [start:pre-built-images]
+### Pre-built images
 Currently, there are no pre-built Neuron images.
-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
+### Build image from source
 See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.
 Make sure to use <gh-file:docker/Dockerfile.neuron> in place of the default Dockerfile.
-# --8<-- [end:build-image-from-source]
-# --8<-- [start:extra-information]
+## Extra information
 [](){ #feature-support-through-nxd-inference-backend }
 ### Feature support through NxD Inference backend
 The current vLLM and Neuron integration relies on either the `neuronx-distributed-inference` (preferred) or `transformers-neuronx` backend
@@ -108,12 +99,15 @@ to perform most of the heavy lifting which includes PyTorch model initialization
 To configure NxD Inference features through the vLLM entrypoint, use the `override_neuron_config` setting. Provide the configs you want to override
 as a dictionary (or JSON object when starting vLLM from the CLI). For example, to disable auto bucketing, include
 ```console
 override_neuron_config={
    "enable_bucketing":False,
 }
 ```
 or when launching vLLM from the CLI, pass
 ```console
 --override-neuron-config "{\"enable_bucketing\":false}"
 ```
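A minimal sketch of deriving the CLI value from the Python dictionary, so the two forms shown above stay in sync. Only the `--override-neuron-config` flag and the `enable_bucketing` key come from the text; `cli_flag` is a hypothetical helper name and is not part of vLLM.

```python
import json
import shlex


def cli_flag(overrides: dict) -> str:
    """Render a dict of overrides as a shell-safe --override-neuron-config argument.

    Hypothetical helper for illustration; not a vLLM API.
    """
    # json.dumps converts Python's False into JSON's false, which is what the CLI form uses.
    payload = json.dumps(overrides, separators=(",", ":"))
    # shlex.quote wraps the JSON in single quotes so the shell passes it through unchanged.
    return f"--override-neuron-config {shlex.quote(payload)}"


print(cli_flag({"enable_bucketing": False}))
# prints: --override-neuron-config '{"enable_bucketing":false}'
```

Single quotes avoid the backslash escaping needed in the double-quoted form above; both reach vLLM as the same JSON string.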
@@ -124,32 +118,30 @@ Alternatively, users can directly call the NxDI library to trace and compile you
 ### Known limitations
 - EAGLE speculative decoding: NxD Inference requires the EAGLE draft checkpoint to include the LM head weights from the target model. Refer to this
-    [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#eagle-checkpoint-compatibility)
+  [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#eagle-checkpoint-compatibility)
-    for how to convert pretrained EAGLE model checkpoints to be compatible for NxDI.
+  for how to convert pretrained EAGLE model checkpoints to be compatible for NxDI.
 - Quantization: the native quantization flow in vLLM is not well supported on NxD Inference. It is recommended to follow this
-    [Neuron quantization guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html)
+  [Neuron quantization guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html)
-    to quantize and compile your model using NxD Inference, and then load the compiled artifacts into vLLM.
+  to quantize and compile your model using NxD Inference, and then load the compiled artifacts into vLLM.
 - Multi-LoRA serving: NxD Inference only supports loading of LoRA adapters at server startup. Dynamic loading of LoRA adapters at
-    runtime is not currently supported. Refer to [multi-lora example](https://github.com/aws-neuron/upstreaming-to-vllm/blob/neuron-2.23-vllm-v0.7.2/examples/offline_inference/neuron_multi_lora.py)
+  runtime is not currently supported. Refer to [multi-lora example](https://github.com/aws-neuron/upstreaming-to-vllm/blob/neuron-2.23-vllm-v0.7.2/examples/offline_inference/neuron_multi_lora.py)
 - Multi-modal support: multi-modal support is only available through the AWS Neuron fork. This feature has not been upstreamed
-    to vLLM main because NxD Inference currently relies on certain adaptations to the core vLLM logic to support this feature.
+  to vLLM main because NxD Inference currently relies on certain adaptations to the core vLLM logic to support this feature.
 - Multi-node support: distributed inference across multiple Trainium/Inferentia instances is only supported on the AWS Neuron fork. Refer
-    to this [multi-node example](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2/examples/neuron/multi_node)
+  to this [multi-node example](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2/examples/neuron/multi_node)
-    to run. Note that tensor parallelism (distributed inference across NeuronCores) is available in vLLM main.
+  to run. Note that tensor parallelism (distributed inference across NeuronCores) is available in vLLM main.
 - Known edge case bug in speculative decoding: An edge case failure may occur in speculative decoding when sequence length approaches
-    max model length (e.g. when requesting max tokens up to the max model length and ignoring eos). In this scenario, vLLM may attempt
+  max model length (e.g. when requesting max tokens up to the max model length and ignoring eos). In this scenario, vLLM may attempt
-    to allocate an additional block to ensure there is enough memory for number of lookahead slots, but since we do not have good support
+  to allocate an additional block to ensure there is enough memory for number of lookahead slots, but since we do not have good support
-    for paged attention, there isn't another Neuron block for vLLM to allocate. A workaround fix (to terminate 1 iteration early) is
+  for paged attention, there isn't another Neuron block for vLLM to allocate. A workaround fix (to terminate 1 iteration early) is
-    implemented in the AWS Neuron fork but is not upstreamed to vLLM main as it modifies core vLLM logic.
+  implemented in the AWS Neuron fork but is not upstreamed to vLLM main as it modifies core vLLM logic.
 ### Environment variables
 - `NEURON_COMPILED_ARTIFACTS`: set this environment variable to point to your pre-compiled model artifacts directory to avoid
-    compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the
+  compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the
-    artifacts under `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this environment variable is set,
+  artifacts under `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this environment variable is set,
-    but the directory does not exist, or the contents are invalid, Neuron will also fallback to a new compilation and store the artifacts
+  but the directory does not exist, or the contents are invalid, Neuron will also fallback to a new compilation and store the artifacts
-    under this specified path.
+  under this specified path.
 - `NEURON_CONTEXT_LENGTH_BUCKETS`: Bucket sizes for context encoding. (Only applicable to `transformers-neuronx` backend).
 - `NEURON_TOKEN_GEN_BUCKETS`: Bucket sizes for token generation. (Only applicable to `transformers-neuronx` backend).
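A minimal sketch of setting `NEURON_COMPILED_ARTIFACTS` before starting vLLM and checking whether the directory already holds files. `use_compiled_artifacts` is a hypothetical helper, not part of vLLM or the Neuron SDK, and a non-empty directory is only a rough hint: it does not prove the artifacts are valid.

```python
import os
from pathlib import Path


def use_compiled_artifacts(path: str, env=os.environ) -> bool:
    """Point NEURON_COMPILED_ARTIFACTS at `path`; return True if it already contains files.

    Hypothetical helper for illustration. A False result means that, per the
    description above, Neuron would compile and store new artifacts under `path`.
    """
    artifacts = Path(path)
    env["NEURON_COMPILED_ARTIFACTS"] = str(artifacts)
    # Existence plus at least one entry is only a heuristic; validity is checked by Neuron itself.
    return artifacts.is_dir() and any(artifacts.iterdir())
```

Call it before constructing the engine or launching the server, since the variable is read during server initialization.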
-# --8<-- [end:extra-information]
--- a/docs/getting_started/installation/ai_accelerator/tpu.inc.md
+++ b/docs/getting_started/installation/ai_accelerator/tpu.inc.md
@@ -1,4 +1,4 @@
-# --8<-- [start:installation]
+# Google TPU
 Tensor Processing Units (TPUs) are Google's custom-developed application-specific
 integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
@@ -33,8 +33,7 @@ information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp
 !!! warning
    There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+## Requirements
 - Google Cloud TPU VM
 - TPU versions: v6e, v5e, v5p, v4
@@ -63,8 +62,7 @@ For more information about using TPUs with GKE, see:
 - <https://cloud.google.com/kubernetes-engine/docs/concepts/tpus>
 - <https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus>
-# --8<-- [end:requirements]
-# --8<-- [start:configure-a-new-environment]
+## Configure a new environment
 ### Provision a Cloud TPU with the queued resource API
@@ -100,16 +98,13 @@ gcloud compute tpus tpu-vm ssh TPU_NAME --project PROJECT_ID --zone ZONE
 [TPU VM images]: https://cloud.google.com/tpu/docs/runtimes
 [TPU regions and zones]: https://cloud.google.com/tpu/docs/regions-zones
-# --8<-- [end:configure-a-new-environment]
-# --8<-- [start:set-up-using-python]
+## Set up using Python
-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+### Pre-built wheels
 Currently, there are no pre-built TPU wheels.
-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
+### Build wheel from source
 Install Miniconda:
@@ -142,7 +137,7 @@ Install build dependencies:
 ```bash
 pip install -r requirements/tpu.txt
-sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
+sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
 ```
 Run the setup script:
@@ -151,16 +146,13 @@ Run the setup script:
 VLLM_TARGET_DEVICE="tpu" python -m pip install -e .
 ```
-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
+## Set up using Docker
-# --8<-- [end:set-up-using-docker]
-# --8<-- [start:pre-built-images]
+### Pre-built images
 See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
+### Build image from source
 You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.
@@ -194,11 +186,5 @@ docker run --privileged --net host --shm-size=16G -it vllm-tpu
    Install OpenBLAS with the following command:
    ```console
-    sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
+    sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
    ```
-# --8<-- [end:build-image-from-source]
-# --8<-- [start:extra-information]
-There is no extra information for this device.
-# --8<-- [end:extra-information]
--- a/docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
+++ b/docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
@@ -1,12 +1,11 @@
-# --8<-- [start:installation]
+# Intel Gaudi
-This tab provides instructions on running vLLM with Intel Gaudi devices.
+This page provides instructions on running vLLM with Intel Gaudi devices.
 !!! warning
    There are no pre-built wheels or images for this device, so you must build vLLM from source.
-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+## Requirements
 - OS: Ubuntu 22.04 LTS
 - Python: 3.10
@@ -19,8 +18,7 @@ to set up the execution environment. To achieve the best performance,
 please follow the methods outlined in the
 [Optimizing Training Platform Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html).
-# --8<-- [end:requirements]
-# --8<-- [start:configure-a-new-environment]
+## Configure a new environment
 ### Environment verification
@@ -57,16 +55,13 @@ docker run \
  vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
 ```
-# --8<-- [end:configure-a-new-environment]
-# --8<-- [start:set-up-using-python]
+## Set up using Python
-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+### Pre-built wheels
 Currently, there are no pre-built Intel Gaudi wheels.
-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
+### Build wheel from source
 To build and install vLLM from source, run:
@@ -87,16 +82,13 @@ pip install -r requirements/hpu.txt
 python setup.py develop
 ```
-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
+## Set up using Docker
-# --8<-- [end:set-up-using-docker]
-# --8<-- [start:pre-built-images]
+### Pre-built images
 Currently, there are no pre-built Intel Gaudi images.
-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
+### Build image from source
 ```console
 docker build -f docker/Dockerfile.hpu -t vllm-hpu-env  .
@@ -113,10 +105,9 @@ docker run \
 !!! tip
    If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered.
-# --8<-- [end:build-image-from-source]
-# --8<-- [start:extra-information]
+## Extra information
-## Supported features
+### Supported features
 - [Offline inference][offline-inference]
 - Online serving via [OpenAI-Compatible Server][openai-compatible-server]
@@ -130,14 +121,14 @@ docker run \
  for accelerating low-batch latency and throughput
 - Attention with Linear Biases (ALiBi)
-## Unsupported features
+### Unsupported features
 - Beam search
 - LoRA adapters
 - Quantization
 - Prefill chunking (mixed-batch inferencing)
-## Supported configurations
+### Supported configurations
 The following configurations have been validated to function with
 Gaudi2 devices. Configurations that are not listed may or may not work.
@@ -401,4 +392,3 @@ the below:
  higher batches. You can do that by adding `--enforce-eager` flag to
  server (for online serving), or by passing `enforce_eager=True`
  argument to LLM constructor (for offline inference).
-# --8<-- [end:extra-information]