[DOC] [ROCm] Update docker deployment doc (#33971)

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
This commit is contained in:
vllmellm
2026-02-07 02:05:35 +08:00
committed by GitHub
parent aca5967416
commit aaa2efbe98
4 changed files with 239 additions and 237 deletions

View File

@@ -239,27 +239,168 @@ uv pip install -e .
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:pre-built-images]
See [Using Docker](../../deployment/docker.md) for instructions on using the official Docker image.
Another way to access the latest code is to use the docker images:
vLLM offers an official Docker image for deployment.
The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
```bash
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT}
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-0.6B
```
These docker images are used for CI and testing only, and they are not intended for production use. They will be expired after several days.
This image can also be used with other container engines such as [Podman](https://podman.io/).
The latest code can contain bugs and may not be stable. Please use it with caution.
```bash
podman run --device nvidia.com/gpu=all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
docker.io/vllm/vllm-openai:latest \
--model Qwen/Qwen3-0.6B
```
You can add any other [engine-args](https://docs.vllm.ai/en/latest/configuration/engine_args/) you need after the image tag (`vllm/vllm-openai:latest`).
!!! note
You can either use the `ipc=host` flag or `--shm-size` flag to allow the
container to access the host's shared memory. vLLM uses PyTorch, which uses shared
memory to share data between processes under the hood, particularly for tensor parallel inference.
!!! note
Optional dependencies are not included in order to avoid licensing issues (e.g. <https://github.com/vllm-project/vllm/issues/8030>).
If you need to use those dependencies (having accepted the license terms),
create a custom Dockerfile on top of the base image with an extra layer that installs them:
```Dockerfile
FROM vllm/vllm-openai:v0.11.0
# e.g. install the `audio` optional dependencies
# NOTE: Make sure the version of vLLM matches the base image!
RUN uv pip install --system vllm[audio]==0.11.0
```
!!! tip
Some new models may only be available on the main branch of [HF Transformers](https://github.com/huggingface/transformers).
To use the development version of `transformers`, create a custom Dockerfile on top of the base image
with an extra layer that installs their code from source:
```Dockerfile
FROM vllm/vllm-openai:latest
RUN uv pip install --system git+https://github.com/huggingface/transformers.git
```
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
See [Building vLLM's Docker Image from Source](../../deployment/docker.md#building-vllms-docker-image-from-source) for instructions on building the Docker image.
You can build and run vLLM from source via the provided [docker/Dockerfile](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile). To build vLLM:
```bash
# optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \
--tag vllm/vllm-openai \
--file docker/Dockerfile
```
!!! note
By default vLLM will build for all GPU types for widest distribution. If you are just building for the
current GPU type the machine is running on, you can add the argument `--build-arg torch_cuda_arch_list=""`
for vLLM to find the current GPU type and build for that.
If you are using Podman instead of Docker, you might need to disable SELinux labeling by
adding `--security-opt label=disable` when running `podman build` command to avoid certain [existing issues](https://github.com/containers/buildah/discussions/4184).
!!! note
If you have not changed any C++ or CUDA kernel code, you can use precompiled wheels to significantly reduce Docker build time.
* **Enable the feature** by adding the build argument: `--build-arg VLLM_USE_PRECOMPILED="1"`.
* **How it works**: By default, vLLM automatically finds the correct wheels from our [Nightly Builds](https://docs.vllm.ai/en/latest/contributing/ci/nightly_builds/) by using the merge-base commit with the upstream `main` branch.
* **Override commit**: To use wheels from a specific commit, provide the `--build-arg VLLM_PRECOMPILED_WHEEL_COMMIT=<commit_hash>` argument.
For a detailed explanation, refer to the documentation on 'Set up using Python-only build (without compilation)' part in [Build wheel from source](https://docs.vllm.ai/en/latest/contributing/ci/nightly_builds/#precompiled-wheels-usage), these args are similar.
#### Building vLLM's Docker Image from Source for Arm64/aarch64
A docker container can be built for aarch64 systems such as the Nvidia Grace-Hopper and Grace-Blackwell. Using the flag `--platform "linux/arm64"` will build for arm64.
!!! note
Multiple modules must be compiled, so this process can take a while. Recommend using `--build-arg max_jobs=` & `--build-arg nvcc_threads=`
flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits.
Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).
??? console "Command"
```bash
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
DOCKER_BUILDKIT=1 docker build . \
--file docker/Dockerfile \
--target vllm-openai \
--platform "linux/arm64" \
-t vllm/vllm-gh200-openai:latest \
--build-arg max_jobs=66 \
--build-arg nvcc_threads=2 \
--build-arg torch_cuda_arch_list="9.0 10.0+PTX" \
--build-arg RUN_WHEEL_CHECK=false
```
For (G)B300, we recommend using CUDA 13, as shown in the following command.
??? console "Command"
```bash
DOCKER_BUILDKIT=1 docker build \
--build-arg CUDA_VERSION=13.0.1 \
--build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 \
--build-arg max_jobs=256 \
--build-arg nvcc_threads=2 \
--build-arg RUN_WHEEL_CHECK=false \
--build-arg torch_cuda_arch_list='9.0 10.0+PTX' \
--platform "linux/arm64" \
--tag vllm/vllm-gb300-openai:latest \
--target vllm-openai \
-f docker/Dockerfile \
.
```
!!! note
If you are building the `linux/arm64` image on a non-ARM host (e.g., an x86_64 machine), you need to ensure your system is set up for cross-compilation using QEMU. This allows your host machine to emulate ARM64 execution.
Run the following command on your host machine to register QEMU user static handlers:
```bash
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
```
After setting up QEMU, you can use the `--platform "linux/arm64"` flag in your `docker build` command.
#### Use the custom-built vLLM Docker image**
To run vLLM with the custom-built Docker image:
```bash
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env "HF_TOKEN=<secret>" \
vllm/vllm-openai <args...>
```
The argument `vllm/vllm-openai` specifies the image to run, and should be replaced with the name of the custom-built image (the `-t` tag from the build command).
!!! note
**For version 0.4.1 and 0.4.2 only** - the vLLM docker images under these versions are supposed to be run under the root user since a library under the root user's home directory, i.e. `/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1` is required to be loaded during runtime. If you are running the container under a different user, you may need to first change the permissions of the library (and all the parent directories) to allow the user to access it, then run vLLM with environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1` .
# --8<-- [end:build-image-from-source]
# --8<-- [start:supported-features]
See [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information.
# --8<-- [end:supported-features]
# --8<-- [end:supported-features]

View File

@@ -1,3 +1,7 @@
---
toc_depth: 3
---
# GPU
vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor specific instructions:
@@ -84,6 +88,9 @@ vLLM is a Python library that supports the following GPU variants. Select your G
### Pre-built images
<!-- markdownlint-disable MD025 -->
# --8<-- [start:pre-built-images]
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:pre-built-images"
@@ -96,7 +103,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images"
# --8<-- [end:pre-built-images]
<!-- markdownlint-enable MD025 -->
<!-- markdownlint-disable MD001 -->
### Build image from source
<!-- markdownlint-enable MD001 -->
<!-- markdownlint-disable MD025 -->
# --8<-- [start:build-image-from-source]
=== "NVIDIA CUDA"
@@ -110,6 +125,9 @@ vLLM is a Python library that supports the following GPU variants. Select your G
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source"
# --8<-- [end:build-image-from-source]
<!-- markdownlint-enable MD025 -->
## Supported features
=== "NVIDIA CUDA"

View File

@@ -174,67 +174,44 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:pre-built-images]
#### Use vLLM's Official Docker Image
vLLM offers an official Docker image for deployment.
The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai-rocm](https://hub.docker.com/r/vllm/vllm-openai-rocm/tags).
???+ console "Commands"
```bash
docker run --rm \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai-rocm:latest \
--model Qwen/Qwen3-0.6B
```
To use the docker image as base for development, you can launch it in interactive session through overriding the entrypoint.
???+ console "Commands"
```bash
docker run --rm -it \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
--entrypoint bash \
vllm/vllm-openai-rocm:latest
```
#### Use AMD's Docker Images
The [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator.
AMD also offers nightly prebuilt docker image from [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev), which has vLLM and all its dependencies installed. The entrypoint of this docker image is `/bin/bash` (different from the vLLM's Official Docker Image).
???+ console "Commands"
```bash
docker pull rocm/vllm-dev:nightly # to get the latest image
docker run -it --rm \
--network=host \
```bash
docker run --rm \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/your/models>:/app/models \
-e HF_HOME="/app/models" \
rocm/vllm-dev:nightly
```
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai-rocm:latest \
--model Qwen/Qwen3-0.6B
```
#### Use AMD's Docker Images
Prior to January 20th, 2026 when the official docker images are available on [upstream vLLM docker hub](https://hub.docker.com/v2/repositories/vllm/vllm-openai-rocm/tags/), the [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
docker image designed for validating inference performance on the AMD Instinct MI300X™ accelerator.
AMD also offers nightly prebuilt docker image from [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev), which has vLLM and all its dependencies installed. The entrypoint of this docker image is `/bin/bash` (different from the vLLM's Official Docker Image).
```bash
docker pull rocm/vllm-dev:nightly # to get the latest image
docker run -it --rm \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/your/models>:/app/models \
-e HF_HOME="/app/models" \
rocm/vllm-dev:nightly
```
!!! tip
Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
@@ -243,7 +220,7 @@ AMD also offers nightly prebuilt docker image from [Docker Hub](https://hub.dock
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
Building the Docker image from source is the recommended way to use vLLM with ROCm.
You can build and run vLLM from source via the provided [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm).
??? info "(Optional) Build an image with ROCm software stack"
@@ -269,8 +246,6 @@ Building the Docker image from source is the recommended way to use vLLM with RO
-t rocm/vllm-dev:base .
```
#### Build an image with vLLM
First, build a docker image from [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm) and launch a docker container from the image.
It is important that the user kicks off the docker build using buildkit. Either the user put `DOCKER_BUILDKIT=1` as environment variable when calling docker build command, or the user needs to set up buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
@@ -292,30 +267,46 @@ Their values can be passed in when running `docker build` with `--build-arg` opt
To build vllm on ROCm 7.0 for MI200 and MI300 series, you can use the default (which build a docker image with `vllm serve` as entrypoint):
???+ console "Commands"
```bash
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm-rocm .
```
To run the above docker image `vllm-rocm`, use the below command:
```bash
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm/vllm-openai-rocm .
```
???+ console "Commands"
```bash
docker run -it \
--network=host \
To run vLLM with the custom-built Docker image:
```bash
docker run --rm \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/model>:/app/model \
vllm-rocm \
--model Qwen/Qwen3-0.6B
```
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai-rocm <args...>
```
Where the `<path/to/model>` is the location where the model is stored, for example, the weights for llama2 or llama3 models.
The argument `vllm/vllm-openai-rocm` specifies the image to run, and should be replaced with the name of the custom-built image (the `-t` tag from the build command).
To use the docker image as base for development, you can launch it in interactive session through overriding the entrypoint.
???+ console "Commands"
```bash
docker run --rm -it \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
--network=host \
--ipc=host \
--entrypoint bash \
vllm/vllm-openai-rocm
```
# --8<-- [end:build-image-from-source]
# --8<-- [start:supported-features]