diff --git a/docs/getting_started/installation/gpu.rocm.inc.md b/docs/getting_started/installation/gpu.rocm.inc.md index f2b8a7a81..88c57b659 100644 --- a/docs/getting_started/installation/gpu.rocm.inc.md +++ b/docs/getting_started/installation/gpu.rocm.inc.md @@ -1,9 +1,6 @@ # --8<-- [start:installation] -vLLM supports AMD GPUs with ROCm 6.3 or above, and torch 2.8.0 and above. - -!!! tip - [Docker](#set-up-using-docker) is the recommended way to use vLLM on ROCm. +vLLM supports AMD GPUs with ROCm 6.3 or above. Pre-built wheels are available for ROCm 7.0. # --8<-- [end:installation] # --8<-- [start:requirements] @@ -16,12 +13,36 @@ vLLM supports AMD GPUs with ROCm 6.3 or above, and torch 2.8.0 and above. # --8<-- [end:requirements] # --8<-- [start:set-up-using-python] -There is no extra information on creating a new Python environment for this device. +The vLLM wheel bundles PyTorch and all required dependencies, and you should use the included PyTorch for compatibility. Because vLLM compiles many ROCm kernels to ensure a validated, high‑performance stack, the resulting binaries may not be compatible with other ROCm or PyTorch builds. +If you need a different ROCm version or want to use an existing PyTorch installation, you’ll need to build vLLM from source. See [below](#build-wheel-from-source) for more details. # --8<-- [end:set-up-using-python] # --8<-- [start:pre-built-wheels] -Currently, there are no pre-built ROCm wheels. +To install the latest version of vLLM for Python 3.12, ROCm 7.0 and `glibc >= 2.35`: + +```bash +uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/ +``` + +!!! tip + You can find out which ROCm version the latest vLLM supports by checking the index at [https://wheels.vllm.ai/rocm/](https://wheels.vllm.ai/rocm/). + +To install a specific version and ROCm variant of the vLLM wheel: + +```bash +uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.14.0/rocm700 +``` + +!!! 
warning "Caveats for using `pip`" + + We recommend using `uv` to install the vLLM wheel. Using `pip` to install from custom indices is cumbersome, because `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a wheel from a custom index unless the exact versions of all packages are specified. In contrast, `uv` gives the extra index [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). + + If you insist on using `pip`, you have to specify the exact vLLM version and the full URL of the wheel path `https://wheels.vllm.ai/rocm//` (which can be obtained from the web page). + + ```bash + pip install vllm==0.14.0+rocm700 --extra-index-url https://wheels.vllm.ai/rocm/0.14.0/rocm700 + ``` # --8<-- [end:pre-built-wheels] # --8<-- [start:build-wheel-from-source] @@ -84,7 +105,7 @@ Currently, there are no pre-built ROCm wheels. - The validated `$FA_BRANCH` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base). -3. If you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps: +3. Optionally, if you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps: ```bash python3 -m pip uninstall -y aiter @@ -100,14 +121,14 @@ Currently, there are no pre-built ROCm wheels. - The validated `$AITER_BRANCH_OR_COMMIT` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base). -4. If you want to use MORI for EP or PD disaggregation, you can install [MORI](https://github.com/ROCm/mori) using the following steps: +4. 
Optionally, if you want to use MORI for EP or PD disaggregation, you can install [MORI](https://github.com/ROCm/mori) using the following steps: ```bash git clone https://github.com/ROCm/mori.git cd mori git checkout $MORI_BRANCH_OR_COMMIT git submodule sync; git submodule update --init --recursive - MORI_GPU_ARCHS="gfx942;gfx950" python3 install . + MORI_GPU_ARCHS="gfx942;gfx950" python3 setup.py install ``` !!! note @@ -141,7 +162,7 @@ Currently, there are no pre-built ROCm wheels. python3 setup.py develop ``` - This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation. + This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm when installing vLLM from source. !!! tip - The ROCm version of PyTorch, ideally, should match the ROCm driver version. @@ -153,9 +174,51 @@ Currently, there are no pre-built ROCm wheels. # --8<-- [end:build-wheel-from-source] # --8<-- [start:pre-built-images] +#### Use vLLM's Official Docker Image + +vLLM offers an official Docker image for deployment. +The image can be used to run an OpenAI-compatible server and is available on Docker Hub as [vllm/vllm-openai-rocm](https://hub.docker.com/r/vllm/vllm-openai-rocm/tags). + +???+ console "Commands" + ```bash + docker run --rm \ + --group-add=video \ + --cap-add=SYS_PTRACE \ + --security-opt seccomp=unconfined \ + --device /dev/kfd \ + --device /dev/dri \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env "HF_TOKEN=$HF_TOKEN" \ + -p 8000:8000 \ + --ipc=host \ + vllm/vllm-openai-rocm:latest \ + --model Qwen/Qwen3-0.6B + ``` + +To use the Docker image as a base for development, you can launch it in an interactive session by overriding the entrypoint. 
+ +???+ console "Commands" + ```bash + docker run --rm -it \ + --group-add=video \ + --cap-add=SYS_PTRACE \ + --security-opt seccomp=unconfined \ + --device /dev/kfd \ + --device /dev/dri \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env "HF_TOKEN=$HF_TOKEN" \ + -p 8000:8000 \ + --ipc=host \ + --entrypoint bash \ + vllm/vllm-openai-rocm:latest + ``` + + +#### Use AMD's Docker Images + The [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator. -AMD also offers nightly prebuilt docker image from [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev), which has vLLM and all its dependencies installed. +AMD also offers a nightly prebuilt Docker image on [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev), which has vLLM and all its dependencies installed. The entrypoint of this Docker image is `/bin/bash` (unlike vLLM's official Docker image). ???+ console "Commands" ```bash @@ -188,7 +251,7 @@ Building the Docker image from source is the recommended way to use vLLM with RO **This step is optional as this rocm_base image is usually prebuilt and store at [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev) under tag `rocm/vllm-dev:base` to speed up user experience.** If you choose to build this rocm_base image yourself, the steps are as follows. - It is important that the user kicks off the docker build using buildkit. Either the user put DOCKER_BUILDKIT=1 as environment variable when calling docker build command, or the user needs to set up buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon: + It is important that the user kicks off the docker build using buildkit. 
Either set `DOCKER_BUILDKIT=1` as an environment variable when calling the `docker build` command, or enable buildkit in the Docker daemon configuration `/etc/docker/daemon.json` as follows and restart the daemon: ```json { @@ -211,7 +274,7 @@ Building the Docker image from source is the recommended way to use vLLM with RO First, build a docker image from [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm) and launch a docker container from the image. It is important that the user kicks off the docker build using buildkit. Either the user put `DOCKER_BUILDKIT=1` as environment variable when calling docker build command, or the user needs to set up buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon: -```bash +```json { "features": { "buildkit": true @@ -227,7 +290,7 @@ It provides flexibility to customize the build of docker image using the followi Their values can be passed in when running `docker build` with `--build-arg` options. -To build vllm on ROCm 7.0 for MI200 and MI300 series, you can use the default: +To build vLLM on ROCm 7.0 for the MI200 and MI300 series, you can use the default (which builds a Docker image with `vllm serve` as the entrypoint): ???+ console "Commands" ```bash @@ -236,6 +299,7 @@ To build vllm on ROCm 7.0 for MI200 and MI300 series, you can use the default: To run the above docker image `vllm-rocm`, use the below command: + ???+ console "Commands" ```bash docker run -it \ @@ -247,7 +311,8 @@ To run the above docker image `vllm-rocm`, use the below command: --device /dev/kfd \ --device /dev/dri \ -v :/app/model \ - vllm-rocm + vllm-rocm \ + --model Qwen/Qwen3-0.6B ``` Where the `` is the location where the model is stored, for example, the weights for llama2 or llama3 models. 
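The `docker run` commands in this file pass `/dev/kfd` and `/dev/dri` into the container with `--device`. As a minimal sketch (not part of vLLM or ROCm tooling), a host-side check along these lines could confirm that both device nodes exist before a container is launched. The function name `check_rocm_devices` and its optional root-prefix argument are hypothetical and exist only for illustration and testing.

```bash
#!/usr/bin/env bash
# Sketch only: check_rocm_devices is a hypothetical helper, not a vLLM or ROCm command.
# It verifies that the device nodes passed via --device exist on the host.
check_rocm_devices() {
    local root="${1:-}"   # optional path prefix, assumed here purely to allow testing
    local missing=0
    local dev
    for dev in /dev/kfd /dev/dri; do
        if [ ! -e "${root}${dev}" ]; then
            echo "missing device node: ${dev}" >&2
            missing=1
        fi
    done
    # Returns 0 when both nodes exist, 1 otherwise.
    return "$missing"
}
```

Calling `check_rocm_devices` with no argument inspects the real `/dev` tree; a non-zero exit status suggests the ROCm driver is not loaded or the devices are not visible to the current user.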
diff --git a/docs/getting_started/quickstart.md b/docs/getting_started/quickstart.md index 01025c43e..d5c68172d 100644 --- a/docs/getting_started/quickstart.md +++ b/docs/getting_started/quickstart.md @@ -43,25 +43,21 @@ This guide will help you quickly get started with vLLM to perform: === "AMD ROCm" - Use a pre-built docker image from Docker Hub. The public stable image is [rocm/vllm:latest](https://hub.docker.com/r/rocm/vllm). There is also a development image at [rocm/vllm-dev](https://hub.docker.com/r/rocm/vllm-dev). - - The `-v` flag in the `docker run` command below mounts a local directory into the container. Replace `` with the path on your host machine to the directory containing your models. The models will then be accessible inside the container at `/app/models`. - - ???+ console "Commands" - ```bash - docker pull rocm/vllm-dev:nightly # to get the latest image - docker run -it --rm \ - --network=host \ - --group-add=video \ - --ipc=host \ - --cap-add=SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --device /dev/kfd \ - --device /dev/dri \ - -v :/app/models \ - -e HF_HOME="/app/models" \ - rocm/vllm-dev:nightly - ``` + If you are using AMD GPUs, you can install vLLM using `uv`. + + It's recommended to use [uv](https://docs.astral.sh/uv/), as it gives the extra index [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). `uv` is also a very fast Python environment manager for creating and managing Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands: + + ```bash + uv venv --python 3.12 --seed + source .venv/bin/activate + uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/ + ``` + + !!! note + The pre-built wheel currently supports Python 3.12, ROCm 7.0 and `glibc >= 2.35`. + + !!! 
note + Previously, Docker images were published using AMD's Docker release pipeline and were located at `rocm/vllm-dev`. This is being deprecated in favor of vLLM's Docker release pipeline. === "Google TPU" @@ -294,14 +290,7 @@ python script.py --attention-backend FLASHINFER Some of the available backend options include: - On NVIDIA CUDA: `FLASH_ATTN` or `FLASHINFER`. -- On AMD ROCm: `TRITON_ATTN`, `ROCM_ATTN`, `ROCM_AITER_FA` or `ROCM_AITER_UNIFIED_ATTN`. - -For AMD ROCm, you can further control the specific Attention implementation using the following options: - -- Triton Unified Attention: Set the environment variables `VLLM_ROCM_USE_AITER=0 VLLM_ROCM_USE_AITER_MHA=0` and pass `--attention-config.use_prefill_decode_attention=false` as a CLI argument. -- AITER Unified Attention: Set the environment variables `VLLM_ROCM_USE_AITER=1 VLLM_USE_AITER_UNIFIED_ATTENTION=1 VLLM_ROCM_USE_AITER_MHA=0` and pass `--attention-config.use_prefill_decode_attention=false` as a CLI argument. -- Triton Prefill-Decode Attention: Set the environment variables `VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MHA=0` and pass `--attention-config.use_prefill_decode_attention=true` as a CLI argument. -- AITER Multi-head Attention: Set the environment variables `VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MHA=1` and pass `--attention-config.use_prefill_decode_attention=false` as a CLI argument. +- On AMD ROCm: `TRITON_ATTN`, `ROCM_ATTN`, `ROCM_AITER_FA`, `ROCM_AITER_UNIFIED_ATTN`, `TRITON_MLA`, `ROCM_AITER_MLA` or `ROCM_AITER_TRITON_MLA`. !!! warning There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see [docker/Dockerfile](../../docker/Dockerfile) for instructions on how to install it.
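The AMD ROCm quickstart note in this diff lists Python 3.12, ROCm 7.0 and `glibc >= 2.35` as the currently supported configuration for the wheel install. The following is a minimal sketch of a pre-install check covering the Python and glibc parts of that requirement. The helper names `version_ge` and `check_wheel_requirements` are hypothetical, and the comparison relies on `sort -V` (available in GNU coreutils) rather than on anything vLLM provides.

```bash
#!/usr/bin/env bash
# Sketch only: these helpers are illustrative and not part of vLLM.

# Succeeds when version $1 >= version $2, using version-aware sorting.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Succeeds when the given Python and glibc versions satisfy the stated
# requirements (Python 3.12 exactly, glibc 2.35 or newer).
check_wheel_requirements() {
    local python_version="$1"
    local glibc_version="$2"
    [ "$python_version" = "3.12" ] && version_ge "$glibc_version" "2.35"
}
```

On a glibc-based host, the two inputs could be gathered with `python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])'` and `getconf GNU_LIBC_VERSION` (which prints a line such as `glibc 2.35`), then passed to `check_wheel_requirements` before running `uv pip install`.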