# Google TPU
Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
are available in different versions, each with different hardware specifications.
For more information about TPUs, see [TPU System Architecture](https://cloud.google.com/tpu/docs/system-architecture-tpu-vm).
For more information on the TPU versions supported with vLLM, see:

- [TPU v6e](https://cloud.google.com/tpu/docs/v6e)
- [TPU v5e](https://cloud.google.com/tpu/docs/v5e)
- [TPU v5p](https://cloud.google.com/tpu/docs/v5p)
- [TPU v4](https://cloud.google.com/tpu/docs/v4)
These TPU versions allow you to configure the physical arrangement of the TPU
chips, which can improve throughput and networking performance. For more
information, see:

- [TPU v6e topologies](https://cloud.google.com/tpu/docs/v6e#configurations)
- [TPU v5e topologies](https://cloud.google.com/tpu/docs/v5e#tpu-v5e-config)
- [TPU v5p topologies](https://cloud.google.com/tpu/docs/v5p#tpu-v5p-config)
- [TPU v4 topologies](https://cloud.google.com/tpu/docs/v4#tpu-v4-config)
To use Cloud TPUs, you need TPU quota granted to your Google Cloud Platform
project. TPU quotas specify how many TPUs you can use in a GCP project and are
specified in terms of TPU version, the number of TPUs you want to use, and quota
type. For more information, see [TPU quota](https://cloud.google.com/tpu/docs/quota#tpu_quota).

For TPU pricing information, see [Cloud TPU pricing](https://cloud.google.com/tpu/pricing).

You may need additional persistent storage for your TPU VMs. For more
information, see [Storage options for Cloud TPU data](https://cloud.google.com/tpu/docs/storage-options).
!!! warning
There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
## Requirements
- Google Cloud TPU VM
- TPU versions: v6e, v5e, v5p, v4
- Python: 3.11 or newer
### Provision Cloud TPUs
You can provision Cloud TPUs using the [Cloud TPU API](https://cloud.google.com/tpu/docs/reference/rest)
or the [queued resources](https://cloud.google.com/tpu/docs/queued-resources)
API (preferred). This section shows how to create TPUs using the queued resource
API. For more information about using the Cloud TPU API, see [Create a Cloud TPU using the Create Node API](https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm#create-node-api).
Queued resources enable you to request Cloud TPU resources in a queued manner.
When you request queued resources, the request is added to a queue maintained by
the Cloud TPU service. When the requested resource becomes available, it's
assigned to your Google Cloud project for your immediate exclusive use.
!!! note
In all of the following commands, replace the ALL CAPS parameter names with
appropriate values. See the parameter descriptions table for more information.
### Provision Cloud TPUs with GKE
For more information about using TPUs with GKE, see:
- [About TPUs in GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/tpus)
- [Deploy TPU workloads in GKE Standard](https://cloud.google.com/kubernetes-engine/docs/how-to/tpus)
- [Plan for TPUs in GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus)
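
As a minimal sketch of what this looks like in practice, the following command
adds a single-host TPU slice node pool to an existing GKE Standard cluster. The
pool, cluster, and zone names are placeholders, and the `ct5lp-hightpu-4t`
machine type (a single-host v5e slice with 4 chips) is just one example; see the
pages above for the machine types and topologies that apply to your TPU version.

```bash
# Illustrative sketch: add a single-host TPU v5e (4-chip) node pool to an
# existing GKE Standard cluster. Replace all placeholder values.
gcloud container node-pools create tpu-v5e-pool \
  --cluster CLUSTER_NAME \
  --zone ZONE \
  --machine-type ct5lp-hightpu-4t \
  --num-nodes 1
```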
## Configure a new environment
### Provision a Cloud TPU with the queued resource API
Create a TPU v5e with 4 TPU chips:
```bash
gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
--node-id TPU_NAME \
--project PROJECT_ID \
--zone ZONE \
--accelerator-type ACCELERATOR_TYPE \
--runtime-version RUNTIME_VERSION \
--service-account SERVICE_ACCOUNT
```
| Parameter name | Description |
|--------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| QUEUED_RESOURCE_ID | The user-assigned ID of the queued resource request. |
| TPU_NAME | The user-assigned name of the TPU, which is created when the queued resource request is allocated. |
| PROJECT_ID | Your Google Cloud project ID. |
| ZONE | The GCP zone where you want to create your Cloud TPU. The value you use depends on the version of TPUs you are using. For more information, see [TPU regions and zones]. |
| ACCELERATOR_TYPE | The TPU version you want to use. Specify the TPU version, for example `v5litepod-4` specifies a v5e TPU with 4 cores, and `v6e-1` specifies a v6e TPU with 1 core. For more information, see [TPU versions]. |
| RUNTIME_VERSION | The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPUs. |
| SERVICE_ACCOUNT | The email address for your service account. You can find it in the IAM Cloud Console under *Service Accounts*. For example: `tpu-service-account@<your_project_ID>.iam.gserviceaccount.com` |
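
Because the request is queued, the TPU is not necessarily allocated right away.
You can check the state of your request (it becomes `ACTIVE` once the TPU is
assigned to you), for example:

```bash
gcloud alpha compute tpus queued-resources describe QUEUED_RESOURCE_ID \
  --project PROJECT_ID \
  --zone ZONE
```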
Connect to your TPU VM using SSH:
```bash
gcloud compute tpus tpu-vm ssh TPU_NAME --project PROJECT_ID --zone ZONE
```
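
Once connected, a quick sanity check is to confirm that the accelerator device
files are visible. The exact device naming varies by TPU generation (for
example, `/dev/accel*` on older generations, VFIO devices under `/dev/vfio/` on
newer ones), so treat this as a rough check:

```bash
# Device naming varies by TPU generation; at least one of these should exist.
ls /dev/accel* /dev/vfio/ 2>/dev/null
```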
!!! note
When configuring `RUNTIME_VERSION` ("TPU software version") on GCP, ensure it matches the TPU generation you've selected by referencing the [TPU VM images] compatibility matrix. Using an incompatible version may prevent vLLM from running correctly.
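
If you are unsure which runtime versions are offered in your zone, recent
`gcloud` releases can list them; the exact command group can vary between
releases, so fall back to `gcloud compute tpus --help` if this form is
unavailable:

```bash
# May vary by gcloud release; see `gcloud compute tpus --help` if needed.
gcloud compute tpus tpu-vm versions list --zone ZONE
```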
[TPU versions]: https://cloud.google.com/tpu/docs/runtimes
[TPU VM images]: https://cloud.google.com/tpu/docs/runtimes
[TPU regions and zones]: https://cloud.google.com/tpu/docs/regions-zones
## Set up using Python
### Pre-built wheels
Currently, there are no pre-built TPU wheels.
### Build wheel from source
Install Miniconda:
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
```
Create and activate a Conda environment for vLLM:
```bash
conda create -n vllm python=3.12 -y
conda activate vllm
```
Clone the vLLM repository and go to the vLLM directory:
```bash
git clone https://github.com/vllm-project/vllm.git && cd vllm
```
Uninstall the existing `torch` and `torch_xla` packages:
```bash
pip uninstall torch torch-xla -y
```
Install build dependencies:
```bash
pip install -r requirements/tpu.txt
sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
```
Build and install vLLM:
```bash
VLLM_TARGET_DEVICE="tpu" python -m pip install -e .
```
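
Once the build finishes, a short smoke test is a reasonable way to confirm the
install. The model name below is only an example; substitute any model you have
access to (the first run also triggers XLA compilation, so smaller models
finish sooner):

```bash
# Verify the package imports, then start an OpenAI-compatible server.
# The model name is illustrative; use any model you have access to.
python -c 'import vllm; print(vllm.__version__)'
vllm serve Qwen/Qwen2.5-1.5B-Instruct --max-model-len 512
```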
## Set up using Docker
### Pre-built images
See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
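
For example, assuming a published tag on the `vllm/vllm-tpu` repository (the
`nightly` tag here is an assumption; check the registry for the tags that
actually exist):

```bash
# Tag is an assumption; verify available tags before pulling.
docker run --privileged --net host --shm-size=16G -it vllm/vllm-tpu:nightly
```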
### Build image from source
You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.
```bash
docker build -f docker/Dockerfile.tpu -t vllm-tpu .
```
Run the Docker image with the following command:
```bash
# Make sure to add `--privileged --net host --shm-size=16G`.
docker run --privileged --net host --shm-size=16G -it vllm-tpu
```
!!! note
    Since TPU relies on XLA, which requires static shapes, vLLM bucketizes the
    possible input shapes and compiles an XLA graph for each shape. The
    compilation may take 20 to 30 minutes on the first run, but drops to about
    5 minutes afterwards because the XLA graphs are cached on disk (in
    `VLLM_XLA_CACHE_PATH`, `~/.cache/vllm/xla_cache` by default).
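
To keep that cache across container runs, one option is to mount a host
directory into the container and point `VLLM_XLA_CACHE_PATH` at it; the paths
below are illustrative:

```bash
# Illustrative paths: persist the XLA compilation cache on the host.
docker run --privileged --net host --shm-size=16G -it \
  -v /mnt/xla_cache:/root/xla_cache \
  -e VLLM_XLA_CACHE_PATH=/root/xla_cache \
  vllm-tpu
```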
!!! tip
    If you encounter the following error:

    ```console
    from torch._C import * # noqa: F403
    ImportError: libopenblas.so.0: cannot open shared object file: No such file or directory
    ```

    Install OpenBLAS with the following command:

    ```bash
    sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
    ```