[DOC] [ROCm] Update docker deployment doc (#33971)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -239,27 +239,168 @@ uv pip install -e .
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
|
||||
See [Using Docker](../../deployment/docker.md) for instructions on using the official Docker image.
|
||||
|
||||
Another way to access the latest code is to use the docker images:
|
||||
vLLM offers an official Docker image for deployment.
|
||||
The image can be used to run an OpenAI-compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
|
||||
|
||||
```bash
|
||||
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
|
||||
docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT}
|
||||
docker run --runtime nvidia --gpus all \
|
||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||
--env "HF_TOKEN=$HF_TOKEN" \
|
||||
-p 8000:8000 \
|
||||
--ipc=host \
|
||||
vllm/vllm-openai:latest \
|
||||
--model Qwen/Qwen3-0.6B
|
||||
```
|
||||
|
||||
These Docker images are used for CI and testing only and are not intended for production use. They expire after several days.
|
||||
This image can also be used with other container engines such as [Podman](https://podman.io/).
|
||||
|
||||
The latest code can contain bugs and may not be stable. Please use it with caution.
|
||||
```bash
podman run --device nvidia.com/gpu=all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    docker.io/vllm/vllm-openai:latest \
    --model Qwen/Qwen3-0.6B
```
|
||||
|
||||
You can add any other [engine-args](https://docs.vllm.ai/en/latest/configuration/engine_args/) you need after the image tag (`vllm/vllm-openai:latest`).
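
As a minimal sketch of where engine arguments go, the snippet below assembles the `docker run` command with two extra flags appended after the image tag. The values chosen for `--max-model-len` and `--gpu-memory-utilization` are illustrative assumptions, not recommendations, and the variable names are hypothetical. The snippet only prints the command so it can be reviewed before running.

```bash
# Illustrative values only; tune them for your model and GPU.
IMAGE="vllm/vllm-openai:latest"
MODEL="Qwen/Qwen3-0.6B"
ENGINE_ARGS="--max-model-len 4096 --gpu-memory-utilization 0.90"

# Flags before the image tag configure the container.
CMD="docker run --runtime nvidia --gpus all"
CMD="$CMD -v $HOME/.cache/huggingface:/root/.cache/huggingface"
CMD="$CMD -p 8000:8000 --ipc=host"
# Everything after the image tag is passed to the vLLM server.
CMD="$CMD $IMAGE --model $MODEL $ENGINE_ARGS"

# Print the assembled command for review.
echo "$CMD"
```

Once the printed command looks right, it can be executed with `eval "$CMD"`.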
|
||||
|
||||
!!! note
    You can use either the `--ipc=host` flag or the `--shm-size` flag to allow the
    container to access the host's shared memory. vLLM uses PyTorch, which uses shared
    memory to share data between processes under the hood, particularly for tensor parallel inference.
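
The alternative mentioned in the note can be sketched as follows: the same run command with `--shm-size` in place of `--ipc=host`. The size `8g` is an assumed placeholder, since the right value depends on the model and parallelism settings. The snippet prints the command rather than running it.

```bash
# Assumed size; adjust for your model and tensor-parallel configuration.
SHM_SIZE="8g"

# --shm-size sets the size of the container's own /dev/shm
# instead of sharing the host's IPC namespace.
CMD="docker run --runtime nvidia --gpus all"
CMD="$CMD -v $HOME/.cache/huggingface:/root/.cache/huggingface"
CMD="$CMD -p 8000:8000 --shm-size=$SHM_SIZE"
CMD="$CMD vllm/vllm-openai:latest --model Qwen/Qwen3-0.6B"

# Print the assembled command for review.
echo "$CMD"
```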
|
||||
|
||||
!!! note
    Optional dependencies are not included in order to avoid licensing issues (e.g. <https://github.com/vllm-project/vllm/issues/8030>).

    If you need to use those dependencies (having accepted the license terms),
    create a custom Dockerfile on top of the base image with an extra layer that installs them:

    ```Dockerfile
    FROM vllm/vllm-openai:v0.11.0

    # e.g. install the `audio` optional dependencies
    # NOTE: Make sure the version of vLLM matches the base image!
    RUN uv pip install --system vllm[audio]==0.11.0
    ```
|
||||
|
||||
!!! tip
    Some new models may only be available on the main branch of [HF Transformers](https://github.com/huggingface/transformers).

    To use the development version of `transformers`, create a custom Dockerfile on top of the base image
    with an extra layer that installs their code from source:

    ```Dockerfile
    FROM vllm/vllm-openai:latest

    RUN uv pip install --system git+https://github.com/huggingface/transformers.git
    ```
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
|
||||
See [Building vLLM's Docker Image from Source](../../deployment/docker.md#building-vllms-docker-image-from-source) for instructions on building the Docker image.
|
||||
You can build and run vLLM from source via the provided [docker/Dockerfile](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile). To build vLLM:
|
||||
|
||||
```bash
# optionally specify: --build-arg max_jobs=8 --build-arg nvcc_threads=2
DOCKER_BUILDKIT=1 docker build . \
    --target vllm-openai \
    --tag vllm/vllm-openai \
    --file docker/Dockerfile
```
|
||||
|
||||
!!! note
    By default, vLLM builds for all GPU types for the widest distribution. If you are building only for the
    GPU type of the machine you are running on, you can add the argument `--build-arg torch_cuda_arch_list=""`
    so that vLLM detects the current GPU type and builds for that.

    If you are using Podman instead of Docker, you might need to disable SELinux labeling by
    adding `--security-opt label=disable` when running the `podman build` command to avoid certain [existing issues](https://github.com/containers/buildah/discussions/4184).
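
The two options in the note can be combined into one sketch. The helper `build_cmd` below is hypothetical (it is not part of vLLM); it prints a build command for either engine, passing an empty `torch_cuda_arch_list` and adding the SELinux option only for Podman.

```bash
# Hypothetical helper: prints a build command for "docker" or "podman".
build_cmd() {
    engine="$1"
    # An empty arch list asks the build to detect the local GPU type.
    args='--build-arg torch_cuda_arch_list=""'
    prefix="DOCKER_BUILDKIT=1 "
    if [ "$engine" = "podman" ]; then
        # Podman may need SELinux labeling disabled during the build.
        args="$args --security-opt label=disable"
        prefix=""
    fi
    echo "${prefix}${engine} build . --target vllm-openai --tag vllm/vllm-openai --file docker/Dockerfile $args"
}

build_cmd docker
build_cmd podman
```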
|
||||
|
||||
!!! note
    If you have not changed any C++ or CUDA kernel code, you can use precompiled wheels to significantly reduce Docker build time.

    * **Enable the feature** by adding the build argument: `--build-arg VLLM_USE_PRECOMPILED="1"`.
    * **How it works**: By default, vLLM automatically finds the correct wheels from our [Nightly Builds](https://docs.vllm.ai/en/latest/contributing/ci/nightly_builds/) by using the merge-base commit with the upstream `main` branch.
    * **Override commit**: To use wheels from a specific commit, provide the `--build-arg VLLM_PRECOMPILED_WHEEL_COMMIT=<commit_hash>` argument.

    For a detailed explanation, refer to the 'Set up using Python-only build (without compilation)' part of [Build wheel from source](https://docs.vllm.ai/en/latest/contributing/ci/nightly_builds/#precompiled-wheels-usage); the arguments are similar.
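
A minimal sketch of how these build arguments fit into the build command shown earlier. The helper `precompiled_build_cmd` is hypothetical; it prints the command and takes an optional commit hash, which remains a placeholder to be supplied by the user.

```bash
# Hypothetical helper: prints a build command that uses precompiled wheels.
# Optional first argument: a commit hash to override the automatic choice.
precompiled_build_cmd() {
    commit="$1"
    args='--build-arg VLLM_USE_PRECOMPILED="1"'
    if [ -n "$commit" ]; then
        # Pin the wheels to a specific commit instead of the merge-base.
        args="$args --build-arg VLLM_PRECOMPILED_WHEEL_COMMIT=$commit"
    fi
    echo "DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai --file docker/Dockerfile $args"
}

# Default: let vLLM pick the wheels from the merge-base commit.
precompiled_build_cmd
```

Calling `precompiled_build_cmd <commit_hash>` prints the same command with the override appended.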
|
||||
|
||||
#### Building vLLM's Docker Image from Source for Arm64/aarch64
|
||||
|
||||
A docker container can be built for aarch64 systems such as the Nvidia Grace-Hopper and Grace-Blackwell. Using the flag `--platform "linux/arm64"` will build for arm64.
|
||||
|
||||
!!! note
    Multiple modules must be compiled, so this process can take a while. We recommend using the `--build-arg max_jobs=` and `--build-arg nvcc_threads=`
    flags to speed up the build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefit.
    Keep an eye on memory usage with parallel jobs, as it can be substantial (see example below).
|
||||
|
||||
??? console "Command"

    ```bash
    # Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
    DOCKER_BUILDKIT=1 docker build . \
        --file docker/Dockerfile \
        --target vllm-openai \
        --platform "linux/arm64" \
        -t vllm/vllm-gh200-openai:latest \
        --build-arg max_jobs=66 \
        --build-arg nvcc_threads=2 \
        --build-arg torch_cuda_arch_list="9.0 10.0+PTX" \
        --build-arg RUN_WHEEL_CHECK=false
    ```
|
||||
|
||||
For (G)B300, we recommend using CUDA 13, as shown in the following command.
|
||||
|
||||
??? console "Command"

    ```bash
    DOCKER_BUILDKIT=1 docker build \
        --build-arg CUDA_VERSION=13.0.1 \
        --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 \
        --build-arg max_jobs=256 \
        --build-arg nvcc_threads=2 \
        --build-arg RUN_WHEEL_CHECK=false \
        --build-arg torch_cuda_arch_list='9.0 10.0+PTX' \
        --platform "linux/arm64" \
        --tag vllm/vllm-gb300-openai:latest \
        --target vllm-openai \
        -f docker/Dockerfile \
        .
    ```
|
||||
|
||||
!!! note
    If you are building the `linux/arm64` image on a non-ARM host (e.g., an x86_64 machine), you need to ensure your system is set up for cross-compilation using QEMU. This allows your host machine to emulate ARM64 execution.

    Run the following command on your host machine to register QEMU user static handlers:

    ```bash
    docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
    ```

    After setting up QEMU, you can use the `--platform "linux/arm64"` flag in your `docker build` command.
|
||||
|
||||
#### Use the custom-built vLLM Docker image
|
||||
|
||||
To run vLLM with the custom-built Docker image:
|
||||
|
||||
```bash
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --env "HF_TOKEN=<secret>" \
    vllm/vllm-openai <args...>
```
|
||||
|
||||
The argument `vllm/vllm-openai` specifies the image to run, and should be replaced with the name of the custom-built image (the `-t` tag from the build command).
|
||||
|
||||
!!! note
    **For versions 0.4.1 and 0.4.2 only:** the vLLM Docker images for these versions must be run as the root user, because a library under the root user's home directory, i.e. `/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`, has to be loaded at runtime. If you are running the container as a different user, you may need to first change the permissions of the library (and all the parent directories) to allow the user to access it, then run vLLM with the environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`.
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:supported-features]
|
||||
|
||||
See [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information.
|
||||
|
||||
# --8<-- [end:supported-features]
|
||||
# --8<-- [end:supported-features]
|
||||
@@ -1,3 +1,7 @@
|
||||
---
|
||||
toc_depth: 3
|
||||
---
|
||||
|
||||
# GPU
|
||||
|
||||
vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor specific instructions:
|
||||
@@ -84,6 +88,9 @@ vLLM is a Python library that supports the following GPU variants. Select your G
|
||||
|
||||
### Pre-built images
|
||||
|
||||
<!-- markdownlint-disable MD025 -->
|
||||
# --8<-- [start:pre-built-images]
|
||||
|
||||
=== "NVIDIA CUDA"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:pre-built-images"
|
||||
@@ -96,7 +103,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images"
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
<!-- markdownlint-enable MD025 -->
|
||||
|
||||
<!-- markdownlint-disable MD001 -->
|
||||
### Build image from source
|
||||
<!-- markdownlint-enable MD001 -->
|
||||
|
||||
<!-- markdownlint-disable MD025 -->
|
||||
# --8<-- [start:build-image-from-source]
|
||||
|
||||
=== "NVIDIA CUDA"
|
||||
|
||||
@@ -110,6 +125,9 @@ vLLM is a Python library that supports the following GPU variants. Select your G
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source"
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
<!-- markdownlint-enable MD025 -->
|
||||
|
||||
## Supported features
|
||||
|
||||
=== "NVIDIA CUDA"
|
||||
|
||||
@@ -174,67 +174,44 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
|
||||
#### Use vLLM's Official Docker Image
|
||||
|
||||
vLLM offers an official Docker image for deployment.
|
||||
The image can be used to run an OpenAI-compatible server and is available on Docker Hub as [vllm/vllm-openai-rocm](https://hub.docker.com/r/vllm/vllm-openai-rocm/tags).
|
||||
|
||||
???+ console "Commands"
    ```bash
    docker run --rm \
        --group-add=video \
        --cap-add=SYS_PTRACE \
        --security-opt seccomp=unconfined \
        --device /dev/kfd \
        --device /dev/dri \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=$HF_TOKEN" \
        -p 8000:8000 \
        --ipc=host \
        vllm/vllm-openai-rocm:latest \
        --model Qwen/Qwen3-0.6B
    ```
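
Once the container is running, the server can be checked from the host. The sketch below assumes the server is reachable on `localhost:8000` (matching `-p 8000:8000` above) and uses the OpenAI-compatible `/v1/models` and `/v1/completions` routes; the prompt and `max_tokens` value are arbitrary examples.

```bash
# Assumes the container above is running and publishing port 8000 on this host.
BASE_URL="http://localhost:8000"
PAYLOAD='{"model": "Qwen/Qwen3-0.6B", "prompt": "Hello, my name is", "max_tokens": 16}'

if command -v curl >/dev/null 2>&1; then
    # List the models the server is currently serving.
    curl -s --max-time 5 "$BASE_URL/v1/models" \
        || echo "server not reachable at $BASE_URL"

    # Request a short completion from the model started above.
    curl -s --max-time 5 "$BASE_URL/v1/completions" \
        -H "Content-Type: application/json" \
        -d "$PAYLOAD" \
        || echo "server not reachable at $BASE_URL"
fi
```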
|
||||
|
||||
To use the Docker image as a base for development, launch it in an interactive session by overriding the entrypoint.
|
||||
|
||||
???+ console "Commands"
    ```bash
    docker run --rm -it \
        --group-add=video \
        --cap-add=SYS_PTRACE \
        --security-opt seccomp=unconfined \
        --device /dev/kfd \
        --device /dev/dri \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=$HF_TOKEN" \
        -p 8000:8000 \
        --ipc=host \
        --entrypoint bash \
        vllm/vllm-openai-rocm:latest
    ```
|
||||
|
||||
|
||||
#### Use AMD's Docker Images
|
||||
|
||||
The [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
|
||||
docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator.
|
||||
AMD also offers a nightly prebuilt Docker image on [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev), which has vLLM and all its dependencies installed. The entrypoint of this Docker image is `/bin/bash` (unlike vLLM's official Docker image).
|
||||
|
||||
???+ console "Commands"
|
||||
```bash
|
||||
docker pull rocm/vllm-dev:nightly # to get the latest image
|
||||
docker run -it --rm \
|
||||
--network=host \
|
||||
```bash
|
||||
docker run --rm \
|
||||
--group-add=video \
|
||||
--ipc=host \
|
||||
--cap-add=SYS_PTRACE \
|
||||
--security-opt seccomp=unconfined \
|
||||
--device /dev/kfd \
|
||||
--device /dev/dri \
|
||||
-v <path/to/your/models>:/app/models \
|
||||
-e HF_HOME="/app/models" \
|
||||
rocm/vllm-dev:nightly
|
||||
```
|
||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||
--env "HF_TOKEN=$HF_TOKEN" \
|
||||
-p 8000:8000 \
|
||||
--ipc=host \
|
||||
vllm/vllm-openai-rocm:latest \
|
||||
--model Qwen/Qwen3-0.6B
|
||||
```
|
||||
|
||||
#### Use AMD's Docker Images
|
||||
|
||||
Before January 20th, 2026, when the official Docker images became available on the [upstream vLLM Docker Hub](https://hub.docker.com/v2/repositories/vllm/vllm-openai-rocm/tags/), the [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offered a prebuilt, optimized
|
||||
docker image designed for validating inference performance on the AMD Instinct MI300X™ accelerator.
|
||||
AMD also offers a nightly prebuilt Docker image on [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev), which has vLLM and all its dependencies installed. The entrypoint of this Docker image is `/bin/bash` (unlike vLLM's official Docker image).
|
||||
|
||||
```bash
docker pull rocm/vllm-dev:nightly # to get the latest image
docker run -it --rm \
    --network=host \
    --group-add=video \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v <path/to/your/models>:/app/models \
    -e HF_HOME="/app/models" \
    rocm/vllm-dev:nightly
```
|
||||
|
||||
!!! tip
|
||||
Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
|
||||
@@ -243,7 +220,7 @@ AMD also offers nightly prebuilt docker image from [Docker Hub](https://hub.dock
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
|
||||
Building the Docker image from source is the recommended way to use vLLM with ROCm.
|
||||
You can build and run vLLM from source via the provided [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm).
|
||||
|
||||
??? info "(Optional) Build an image with ROCm software stack"
|
||||
|
||||
@@ -269,8 +246,6 @@ Building the Docker image from source is the recommended way to use vLLM with RO
|
||||
-t rocm/vllm-dev:base .
|
||||
```
|
||||
|
||||
#### Build an image with vLLM
|
||||
|
||||
First, build a docker image from [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm) and launch a docker container from the image.
|
||||
The Docker build must use BuildKit. Either set `DOCKER_BUILDKIT=1` as an environment variable when calling the `docker build` command, or enable BuildKit in the Docker daemon configuration `/etc/docker/daemon.json` as follows and restart the daemon:
|
||||
|
||||
@@ -292,30 +267,46 @@ Their values can be passed in when running `docker build` with `--build-arg` opt
|
||||
|
||||
To build vLLM on ROCm 7.0 for the MI200 and MI300 series, you can use the defaults (which build a Docker image with `vllm serve` as the entrypoint):
|
||||
|
||||
???+ console "Commands"
|
||||
```bash
|
||||
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm-rocm .
|
||||
```
|
||||
|
||||
To run the `vllm-rocm` Docker image built above, use the following command:
|
||||
```bash
|
||||
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm/vllm-openai-rocm .
|
||||
```
|
||||
|
||||
|
||||
???+ console "Commands"
|
||||
```bash
|
||||
docker run -it \
|
||||
--network=host \
|
||||
To run vLLM with the custom-built Docker image:
|
||||
|
||||
```bash
|
||||
docker run --rm \
|
||||
--group-add=video \
|
||||
--ipc=host \
|
||||
--cap-add=SYS_PTRACE \
|
||||
--security-opt seccomp=unconfined \
|
||||
--device /dev/kfd \
|
||||
--device /dev/dri \
|
||||
-v <path/to/model>:/app/model \
|
||||
vllm-rocm \
|
||||
--model Qwen/Qwen3-0.6B
|
||||
```
|
||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||
--env "HF_TOKEN=$HF_TOKEN" \
|
||||
-p 8000:8000 \
|
||||
--ipc=host \
|
||||
vllm/vllm-openai-rocm <args...>
|
||||
```
|
||||
|
||||
Here, `<path/to/model>` is the location where the model is stored, for example, the weights for Llama 2 or Llama 3 models.
|
||||
The argument `vllm/vllm-openai-rocm` specifies the image to run, and should be replaced with the name of the custom-built image (the `-t` tag from the build command).
|
||||
|
||||
To use the Docker image as a base for development, launch it in an interactive session by overriding the entrypoint.
|
||||
|
||||
???+ console "Commands"
    ```bash
    docker run --rm -it \
        --group-add=video \
        --cap-add=SYS_PTRACE \
        --security-opt seccomp=unconfined \
        --device /dev/kfd \
        --device /dev/dri \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=$HF_TOKEN" \
        --network=host \
        --ipc=host \
        --entrypoint bash \
        vllm/vllm-openai-rocm
    ```
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:supported-features]
|
||||
|
||||