Allow markdownlint to run locally (#36398)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -1,4 +1,5 @@
|
||||
# --8<-- [start:installation]
|
||||
<!-- markdownlint-disable MD041 -->
|
||||
--8<-- [start:installation]
|
||||
|
||||
vLLM has experimental support for macOS with Apple Silicon. For now, users must build from source to natively run on macOS.
|
||||
|
||||
@@ -7,23 +8,23 @@ Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
|
||||
!!! tip "GPU-Accelerated Inference with vLLM-Metal"
|
||||
For GPU-accelerated inference on Apple Silicon using Metal, check out [vllm-metal](https://github.com/vllm-project/vllm-metal), a community-maintained hardware plugin that uses MLX as the compute backend.
|
||||
|
||||
# --8<-- [end:installation]
|
||||
# --8<-- [start:requirements]
|
||||
--8<-- [end:installation]
|
||||
--8<-- [start:requirements]
|
||||
|
||||
- OS: `macOS Sonoma` or later
|
||||
- SDK: `Xcode 15.4` or later with Command Line Tools
|
||||
- Compiler: `Apple Clang >= 15.0.0`
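The compiler requirement above can be checked from a terminal. This is a minimal sketch, assuming `clang --version` prints a first line of the form `Apple clang version 15.0.0 (...)`; the helper name `apple_clang_major` is hypothetical and not part of vLLM.

```bash
# Hypothetical helper: extract the major version from a `clang --version` banner.
apple_clang_major() {
    # Expects text such as "Apple clang version 15.0.0 (clang-1500.3.9.4)".
    printf '%s\n' "$1" | sed -n 's/^Apple clang version \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Only query the compiler when it is actually installed.
if command -v clang >/dev/null 2>&1; then
    major=$(apple_clang_major "$(clang --version 2>/dev/null)")
    if [ -n "$major" ] && [ "$major" -ge 15 ]; then
        echo "Apple Clang $major meets the >= 15.0.0 requirement"
    else
        echo "Apple Clang >= 15.0.0 not detected"
    fi
fi
```

The helper prints nothing for non-Apple builds of Clang, so those fall through to the "not detected" branch.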
|
||||
|
||||
# --8<-- [end:requirements]
|
||||
# --8<-- [start:set-up-using-python]
|
||||
--8<-- [end:requirements]
|
||||
--8<-- [start:set-up-using-python]
|
||||
|
||||
# --8<-- [end:set-up-using-python]
|
||||
# --8<-- [start:pre-built-wheels]
|
||||
--8<-- [end:set-up-using-python]
|
||||
--8<-- [start:pre-built-wheels]
|
||||
|
||||
Currently, there are no pre-built Apple silicon CPU wheels.
|
||||
|
||||
# --8<-- [end:pre-built-wheels]
|
||||
# --8<-- [start:build-wheel-from-source]
|
||||
--8<-- [end:pre-built-wheels]
|
||||
--8<-- [start:build-wheel-from-source]
|
||||
|
||||
After installing Xcode and the Command Line Tools, which include Apple Clang, run the following commands to build and install vLLM from source.
|
||||
|
||||
@@ -36,7 +37,7 @@ uv pip install -e .
|
||||
|
||||
!!! tip
|
||||
The `--index-strategy unsafe-best-match` flag is needed to resolve dependencies across multiple package indexes (PyTorch CPU index and PyPI). Without this flag, you may encounter `typing-extensions` version conflicts.
|
||||
|
||||
|
||||
The term "unsafe" refers to the package resolution strategy, not security. By default, `uv` only searches the first index where a package is found to prevent dependency confusion attacks. This flag allows `uv` to search all configured indexes to find the best compatible versions. Since both PyTorch and PyPI are trusted package sources, using this strategy is safe and appropriate for vLLM installation.
|
||||
|
||||
!!! note
|
||||
@@ -77,14 +78,14 @@ uv pip install -e .
|
||||
```
|
||||
On Apple Clang 16 you should see: `#define __cplusplus 201703L`
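To compare a reported line against that value in a script, one option is a small parser. This is a minimal sketch; the helper name `cplusplus_value` is hypothetical, and the sample line is simply the one quoted above.

```bash
# Hypothetical helper: extract the numeric value from a "#define __cplusplus ..." line.
cplusplus_value() {
    printf '%s\n' "$1" | sed -n 's/^#define __cplusplus \([0-9][0-9]*\).*/\1/p' | head -n 1
}

reported="#define __cplusplus 201703L"   # sample line quoted above
value=$(cplusplus_value "$reported")
if [ -n "$value" ] && [ "$value" -ge 201703 ]; then
    echo "reported value $value is at least 201703"
else
    echo "reported value is missing or older than 201703"
fi
```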
|
||||
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
--8<-- [end:build-wheel-from-source]
|
||||
--8<-- [start:pre-built-images]
|
||||
|
||||
Currently, there are no pre-built Apple silicon CPU images.
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
--8<-- [end:pre-built-images]
|
||||
--8<-- [start:build-image-from-source]
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:extra-information]
|
||||
# --8<-- [end:extra-information]
|
||||
--8<-- [end:build-image-from-source]
|
||||
--8<-- [start:extra-information]
|
||||
--8<-- [end:extra-information]
|
||||
|
||||
@@ -1,19 +1,20 @@
|
||||
# --8<-- [start:installation]
|
||||
<!-- markdownlint-disable MD041 -->
|
||||
--8<-- [start:installation]
|
||||
|
||||
vLLM offers basic model inference and serving on the Arm CPU platform, with NEON support and the FP32, FP16, and BF16 data types.
|
||||
|
||||
# --8<-- [end:installation]
|
||||
# --8<-- [start:requirements]
|
||||
--8<-- [end:installation]
|
||||
--8<-- [start:requirements]
|
||||
|
||||
- OS: Linux
|
||||
- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
|
||||
- Instruction Set Architecture (ISA): NEON support is required
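The compiler recommendation above can be checked with a version comparison. This is a minimal sketch, assuming GNU `sort -V` is available (as it is on typical Linux distributions); the helper name `version_ge` is hypothetical.

```bash
# Hypothetical helper: succeed when version $1 is greater than or equal to $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only query gcc when it is installed.
if command -v gcc >/dev/null 2>&1; then
    gcc_version=$(gcc -dumpfullversion 2>/dev/null || gcc -dumpversion)
    if version_ge "$gcc_version" "12.3.0"; then
        echo "gcc $gcc_version meets the recommended minimum"
    else
        echo "gcc $gcc_version is older than the recommended 12.3.0"
    fi
fi
```

The same helper works for any dotted version string, so it can be reused for `g++`.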
|
||||
|
||||
# --8<-- [end:requirements]
|
||||
# --8<-- [start:set-up-using-python]
|
||||
--8<-- [end:requirements]
|
||||
--8<-- [start:set-up-using-python]
|
||||
|
||||
# --8<-- [end:set-up-using-python]
|
||||
# --8<-- [start:pre-built-wheels]
|
||||
--8<-- [end:set-up-using-python]
|
||||
--8<-- [start:pre-built-wheels]
|
||||
|
||||
Pre-built vLLM wheels for Arm have been available since version 0.11.2. These wheels contain pre-compiled C++ binaries.
|
||||
|
||||
@@ -43,13 +44,14 @@ uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VE
|
||||
|
||||
The `uv` approach works for vLLM `v0.6.6` and later. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.
|
||||
|
||||
**Install the latest code**
|
||||
#### Install the latest code
|
||||
|
||||
LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides working pre-built Arm CPU wheels for every commit since `v0.11.2` on <https://wheels.vllm.ai/nightly>. For native CPU wheels, this index should be used:
|
||||
|
||||
* `https://wheels.vllm.ai/nightly/cpu/vllm`
|
||||
- `https://wheels.vllm.ai/nightly/cpu/vllm`
|
||||
|
||||
To install from the nightly index, run:
|
||||
|
||||
```bash
|
||||
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu --index-strategy first-index
|
||||
```
|
||||
@@ -64,7 +66,7 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu --index
|
||||
pip install https://wheels.vllm.ai/4fa7ce46f31cbd97b4651694caf9991cc395a259/vllm-0.13.0rc2.dev104%2Bg4fa7ce46f.cpu-cp38-abi3-manylinux_2_35_aarch64.whl # current nightly build (the filename will change!)
|
||||
```
|
||||
|
||||
**Install specific revisions**
|
||||
#### Install specific revisions
|
||||
|
||||
If you want to access the wheels for previous commits (e.g. to bisect a behavior change or performance regression), you can specify the commit hash in the URL:
|
||||
|
||||
@@ -73,8 +75,8 @@ export VLLM_COMMIT=730bd35378bf2a5b56b6d3a45be28b3092d26519 # use full commit ha
|
||||
uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}/cpu --index-strategy first-index
|
||||
```
|
||||
|
||||
# --8<-- [end:pre-built-wheels]
|
||||
# --8<-- [start:build-wheel-from-source]
|
||||
--8<-- [end:pre-built-wheels]
|
||||
--8<-- [start:build-wheel-from-source]
|
||||
|
||||
First, install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:
|
||||
|
||||
@@ -133,8 +135,8 @@ Testing has been conducted on AWS Graviton3 instances for compatibility.
|
||||
export LD_PRELOAD="$TC_PATH:$LD_PRELOAD"
|
||||
```
|
||||
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
--8<-- [end:build-wheel-from-source]
|
||||
--8<-- [start:pre-built-images]
|
||||
|
||||
To pull the latest image from Docker Hub:
|
||||
|
||||
@@ -170,10 +172,10 @@ export VLLM_COMMIT=6299628d326f429eba78736acb44e76749b281f5 # use full commit ha
|
||||
docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT}-arm64-cpu
|
||||
```
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
--8<-- [end:pre-built-images]
|
||||
--8<-- [start:build-image-from-source]
|
||||
|
||||
## Building for your target ARM CPU
|
||||
#### Building for your target ARM CPU
|
||||
|
||||
```bash
|
||||
docker build -f docker/Dockerfile.cpu \
|
||||
@@ -189,9 +191,9 @@ docker build -f docker/Dockerfile.cpu \
|
||||
- `VLLM_CPU_ARM_BF16=true` - Force-enable ARM BF16 support (build with BF16 regardless of build system capabilities)
|
||||
- `VLLM_CPU_ARM_BF16=false` - Rely on auto-detection (default)
|
||||
|
||||
### Examples
|
||||
##### Examples
|
||||
|
||||
**Auto-detection build (native ARM)**
|
||||
###### Auto-detection build (native ARM)
|
||||
|
||||
```bash
|
||||
# Building on ARM64 system - platform auto-detected
|
||||
@@ -200,7 +202,7 @@ docker build -f docker/Dockerfile.cpu \
|
||||
--target vllm-openai .
|
||||
```
|
||||
|
||||
**Cross-compile for ARM with BF16 support**
|
||||
###### Cross-compile for ARM with BF16 support
|
||||
|
||||
```bash
|
||||
# Building on ARM64 for newer ARM CPUs with BF16
|
||||
@@ -210,7 +212,7 @@ docker build -f docker/Dockerfile.cpu \
|
||||
--target vllm-openai .
|
||||
```
|
||||
|
||||
**Cross-compile from x86_64 to ARM64 with BF16**
|
||||
###### Cross-compile from x86_64 to ARM64 with BF16
|
||||
|
||||
```bash
|
||||
# Requires Docker buildx with ARM emulation (QEMU)
|
||||
@@ -226,7 +228,7 @@ docker buildx build -f docker/Dockerfile.cpu \
|
||||
!!! note "ARM BF16 requirements"
|
||||
ARM BF16 support requires ARMv8.6-A or later (FEAT_BF16). Supported on AWS Graviton3/4, AmpereOne, and other recent ARM processors.
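Before force-enabling BF16, it may help to check whether the target machine reports the feature. This is a minimal sketch, assuming a Linux Arm64 host whose `/proc/cpuinfo` has a `Features` line that lists `bf16` when FEAT_BF16 is present; the helper name `has_bf16_feature` is hypothetical.

```bash
# Hypothetical helper: succeed when a cpuinfo "Features" string lists bf16
# as a whole word (so "svebf16" alone does not count).
has_bf16_feature() {
    case " $1 " in
        *" bf16 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Only inspect /proc/cpuinfo where it exists (Linux).
if [ -r /proc/cpuinfo ]; then
    features=$(grep -m 1 '^Features' /proc/cpuinfo | cut -d: -f2)
    if has_bf16_feature "$features"; then
        echo "bf16 reported by this CPU"
    else
        echo "bf16 not reported by this CPU"
    fi
fi
```

On hosts without a `Features` line (for example x86_64), the script simply reports that `bf16` is not present.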
|
||||
|
||||
## Launching the OpenAI server
|
||||
#### Launching the OpenAI server
|
||||
|
||||
```bash
|
||||
docker run --rm \
|
||||
@@ -245,6 +247,6 @@ docker run --rm \
|
||||
!!! tip "Alternative to --privileged"
|
||||
Instead of `--privileged=true`, use `--cap-add SYS_NICE --security-opt seccomp=unconfined` for better security.
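The narrower capability set from the tip above can be assembled as follows. This is a minimal sketch that only prints the command; the helper name `docker_run_cmd` and the image tag `my-vllm-image` are hypothetical, and the other `docker run` arguments are omitted.

```bash
# Hypothetical helper: print a `docker run` command that uses the narrower
# capability set instead of --privileged=true.
docker_run_cmd() {
    echo docker run --rm \
        --cap-add SYS_NICE \
        --security-opt seccomp=unconfined \
        "$1"
}

# Replace the placeholder tag with the image you built.
docker_run_cmd my-vllm-image
```

Printing rather than executing keeps the sketch safe to paste. Run the printed command, with your remaining arguments added, once it looks right.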
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:extra-information]
|
||||
# --8<-- [end:extra-information]
|
||||
--8<-- [end:build-image-from-source]
|
||||
--8<-- [start:extra-information]
|
||||
--8<-- [end:extra-information]
|
||||
|
||||
@@ -1,3 +1,7 @@
|
||||
---
|
||||
toc_depth: 3
|
||||
---
|
||||
|
||||
# CPU
|
||||
|
||||
vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor-specific instructions:
|
||||
|
||||
@@ -1,27 +1,28 @@
|
||||
# --8<-- [start:installation]
|
||||
<!-- markdownlint-disable MD041 -->
|
||||
--8<-- [start:installation]
|
||||
|
||||
vLLM has experimental support for the s390x architecture on the IBM Z platform. For now, users must build from source to run natively on IBM Z.
|
||||
|
||||
Currently, the CPU implementation for the s390x architecture supports only the FP32 data type.
|
||||
|
||||
# --8<-- [end:installation]
|
||||
# --8<-- [start:requirements]
|
||||
--8<-- [end:installation]
|
||||
--8<-- [start:requirements]
|
||||
|
||||
- OS: `Linux`
|
||||
- SDK: `gcc/g++ >= 12.3.0` or later with Command Line Tools
|
||||
- Instruction Set Architecture (ISA): VXE support is required. Works with Z14 and above.
|
||||
- Python packages to build and install: `pyarrow`, `torch` and `torchvision`
|
||||
|
||||
# --8<-- [end:requirements]
|
||||
# --8<-- [start:set-up-using-python]
|
||||
--8<-- [end:requirements]
|
||||
--8<-- [start:set-up-using-python]
|
||||
|
||||
# --8<-- [end:set-up-using-python]
|
||||
# --8<-- [start:pre-built-wheels]
|
||||
--8<-- [end:set-up-using-python]
|
||||
--8<-- [start:pre-built-wheels]
|
||||
|
||||
Currently, there are no pre-built IBM Z CPU wheels.
|
||||
|
||||
# --8<-- [end:pre-built-wheels]
|
||||
# --8<-- [start:build-wheel-from-source]
|
||||
--8<-- [end:pre-built-wheels]
|
||||
--8<-- [start:build-wheel-from-source]
|
||||
|
||||
Install the following packages from the package manager before building vLLM. For example, on RHEL 9.4:
|
||||
|
||||
@@ -65,13 +66,13 @@ Execute the following commands to build and install vLLM from source.
|
||||
pip install dist/*.whl
|
||||
```
|
||||
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
--8<-- [end:build-wheel-from-source]
|
||||
--8<-- [start:pre-built-images]
|
||||
|
||||
Currently, there are no pre-built IBM Z CPU images.
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
--8<-- [end:pre-built-images]
|
||||
--8<-- [start:build-image-from-source]
|
||||
|
||||
```bash
|
||||
docker build -f docker/Dockerfile.s390x \
|
||||
@@ -93,6 +94,6 @@ docker run --rm \
|
||||
!!! tip
|
||||
An alternative to `--privileged true` is `--cap-add SYS_NICE --security-opt seccomp=unconfined`.
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:extra-information]
|
||||
# --8<-- [end:extra-information]
|
||||
--8<-- [end:build-image-from-source]
|
||||
--8<-- [start:extra-information]
|
||||
--8<-- [end:extra-information]
|
||||
|
||||
@@ -1,9 +1,10 @@
|
||||
# --8<-- [start:installation]
|
||||
<!-- markdownlint-disable MD041 -->
|
||||
--8<-- [start:installation]
|
||||
|
||||
vLLM supports basic model inference and serving on the x86 CPU platform, with the FP32, FP16, and BF16 data types.
|
||||
|
||||
# --8<-- [end:installation]
|
||||
# --8<-- [start:requirements]
|
||||
--8<-- [end:installation]
|
||||
--8<-- [start:requirements]
|
||||
|
||||
- OS: Linux
|
||||
- CPU flags: `avx512f` (Recommended), `avx512_bf16` (Optional), `avx512_vnni` (Optional)
|
||||
@@ -11,11 +12,11 @@ vLLM supports basic model inferencing and serving on x86 CPU platform, with data
|
||||
!!! tip
|
||||
Use `lscpu` to check the CPU flags.
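The check suggested above can be scripted. This is a minimal sketch, assuming `lscpu` prints a `Flags:` line with space-separated flag names; the helper name `report_cpu_flags` is hypothetical.

```bash
# Hypothetical helper: report which of the three flags appear in a flags string.
report_cpu_flags() {
    for flag in avx512f avx512_bf16 avx512_vnni; do
        case " $1 " in
            *" $flag "*) echo "$flag: yes" ;;
            *) echo "$flag: no" ;;
        esac
    done
}

# Only query lscpu when it is installed.
if command -v lscpu >/dev/null 2>&1; then
    report_cpu_flags "$(lscpu | grep -i '^Flags' | cut -d: -f2)"
fi
```

Matching on whole words avoids counting `avx512f` as present just because a longer flag name contains it.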
|
||||
|
||||
# --8<-- [end:requirements]
|
||||
# --8<-- [start:set-up-using-python]
|
||||
--8<-- [end:requirements]
|
||||
--8<-- [start:set-up-using-python]
|
||||
|
||||
# --8<-- [end:set-up-using-python]
|
||||
# --8<-- [start:pre-built-wheels]
|
||||
--8<-- [end:set-up-using-python]
|
||||
--8<-- [start:pre-built-wheels]
|
||||
|
||||
Pre-built vLLM wheels for x86 with AVX512 are available since version 0.13.0. To install release wheels:
|
||||
|
||||
@@ -25,6 +26,7 @@ export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/rel
|
||||
# use uv
|
||||
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl --torch-backend cpu
|
||||
```
|
||||
|
||||
??? console "pip"
|
||||
```bash
|
||||
# use pip
|
||||
@@ -46,7 +48,7 @@ uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VE
|
||||
export LD_PRELOAD="$TC_PATH:$IOMP_PATH:$LD_PRELOAD"
|
||||
```
|
||||
|
||||
**Install the latest code**
|
||||
#### Install the latest code
|
||||
|
||||
To install the wheel built from the latest main branch:
|
||||
|
||||
@@ -54,7 +56,7 @@ To install the wheel built from the latest main branch:
|
||||
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu --index-strategy first-index --torch-backend cpu
|
||||
```
|
||||
|
||||
**Install specific revisions**
|
||||
#### Install specific revisions
|
||||
|
||||
If you want to access the wheels for previous commits (e.g. to bisect a behavior change or performance regression), you can specify the commit hash in the URL:
|
||||
|
||||
@@ -63,8 +65,8 @@ export VLLM_COMMIT=730bd35378bf2a5b56b6d3a45be28b3092d26519 # use full commit ha
|
||||
uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}/cpu --index-strategy first-index --torch-backend cpu
|
||||
```
|
||||
|
||||
# --8<-- [end:pre-built-wheels]
|
||||
# --8<-- [start:build-wheel-from-source]
|
||||
--8<-- [end:pre-built-wheels]
|
||||
--8<-- [start:build-wheel-from-source]
|
||||
|
||||
Install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:
|
||||
|
||||
@@ -158,8 +160,8 @@ uv pip install dist/*.whl
|
||||
]
|
||||
```
|
||||
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
--8<-- [end:build-wheel-from-source]
|
||||
--8<-- [start:pre-built-images]
|
||||
|
||||
You can pull the latest available CPU image from Docker Hub:
|
||||
|
||||
@@ -189,10 +191,10 @@ vllm/vllm-openai-cpu:latest-x86_64 <args...>
|
||||
!!! warning
|
||||
If deploying the pre-built images on machines without `avx512f`, `avx512_bf16`, or `avx512_vnni` support, an `Illegal instruction` error may be raised. See the build-image-from-source section below for build arguments to match your target CPU capabilities.
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
--8<-- [end:pre-built-images]
|
||||
--8<-- [start:build-image-from-source]
|
||||
|
||||
## Building for your target CPU
|
||||
#### Building for your target CPU
|
||||
|
||||
```bash
|
||||
docker build -f docker/Dockerfile.cpu \
|
||||
@@ -212,15 +214,15 @@ docker build -f docker/Dockerfile.cpu \
|
||||
- `VLLM_CPU_{ISA}=true` - Force-enable the instruction set (build with ISA regardless of build system capabilities)
|
||||
- `VLLM_CPU_{ISA}=false` - Rely on auto-detection (default)
|
||||
|
||||
### Examples
|
||||
##### Examples
|
||||
|
||||
**Auto-detection build (default)**
|
||||
###### Auto-detection build (default)
|
||||
|
||||
```bash
|
||||
docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env --target vllm-openai .
|
||||
```
|
||||
|
||||
**Cross-compile for AVX512**
|
||||
###### Cross-compile for AVX512
|
||||
|
||||
```bash
|
||||
docker build -f docker/Dockerfile.cpu \
|
||||
@@ -231,7 +233,7 @@ docker build -f docker/Dockerfile.cpu \
|
||||
--target vllm-openai .
|
||||
```
|
||||
|
||||
**Cross-compile for AVX2**
|
||||
###### Cross-compile for AVX2
|
||||
|
||||
```bash
|
||||
docker build -f docker/Dockerfile.cpu \
|
||||
@@ -240,7 +242,7 @@ docker build -f docker/Dockerfile.cpu \
|
||||
--target vllm-openai .
|
||||
```
|
||||
|
||||
## Launching the OpenAI server
|
||||
#### Launching the OpenAI server
|
||||
|
||||
```bash
|
||||
docker run --rm \
|
||||
@@ -255,6 +257,6 @@ docker run --rm \
|
||||
other vLLM OpenAI server arguments
|
||||
```
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:extra-information]
|
||||
# --8<-- [end:extra-information]
|
||||
--8<-- [end:build-image-from-source]
|
||||
--8<-- [start:extra-information]
|
||||
--8<-- [end:extra-information]
|
||||
|
||||
@@ -1,14 +1,15 @@
|
||||
# --8<-- [start:installation]
|
||||
<!-- markdownlint-disable MD041 MD051 -->
|
||||
--8<-- [start:installation]
|
||||
|
||||
vLLM contains pre-compiled C++ and CUDA (12.8) binaries.
|
||||
|
||||
# --8<-- [end:installation]
|
||||
# --8<-- [start:requirements]
|
||||
--8<-- [end:installation]
|
||||
--8<-- [start:requirements]
|
||||
|
||||
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
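The GPU requirement above can be checked from a terminal. This is a minimal sketch, assuming an NVIDIA driver recent enough that `nvidia-smi` exposes the `compute_cap` query field; the helper name `compute_cap_ok` is hypothetical.

```bash
# Hypothetical helper: succeed when a compute capability such as "7.5" is >= 7.0.
compute_cap_ok() {
    major=${1%%.*}
    case "$major" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$major" -ge 7 ]
}

# Only query the driver when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader 2>/dev/null |
    while IFS=, read -r name cap; do
        cap=$(printf '%s' "$cap" | tr -d ' ')
        if compute_cap_ok "$cap"; then
            echo "$name: compute capability $cap is supported"
        else
            echo "$name: compute capability $cap is below 7.0"
        fi
    done
fi
```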
|
||||
|
||||
# --8<-- [end:requirements]
|
||||
# --8<-- [start:set-up-using-python]
|
||||
--8<-- [end:requirements]
|
||||
--8<-- [start:set-up-using-python]
|
||||
|
||||
!!! note
|
||||
PyTorch installed via `conda` statically links the `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <https://github.com/vllm-project/vllm/issues/8420> for more details.
|
||||
@@ -17,8 +18,8 @@ In order to be performant, vLLM has to compile many cuda kernels. The compilatio
|
||||
|
||||
Therefore, it is recommended to install vLLM in a **fresh new** environment. If you either have a different CUDA version or want to use an existing PyTorch installation, you need to build vLLM from source. See [below](#build-wheel-from-source) for more details.
|
||||
|
||||
# --8<-- [end:set-up-using-python]
|
||||
# --8<-- [start:pre-built-wheels]
|
||||
--8<-- [end:set-up-using-python]
|
||||
--8<-- [start:pre-built-wheels]
|
||||
|
||||
```bash
|
||||
uv pip install vllm --torch-backend=auto
|
||||
@@ -49,8 +50,8 @@ uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VE
|
||||
|
||||
LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for every commit since `v0.5.3` on <https://wheels.vllm.ai/nightly>. There are multiple indices that could be used:
|
||||
|
||||
* `https://wheels.vllm.ai/nightly`: the default variant (CUDA with version specified in `VLLM_MAIN_CUDA_VERSION`) built with the last commit on the `main` branch. Currently it is CUDA 12.9.
|
||||
* `https://wheels.vllm.ai/nightly/<variant>`: all other variants. Now this includes `cu130`, and `cpu`. The default variant (`cu129`) also has a subdirectory to keep consistency.
|
||||
- `https://wheels.vllm.ai/nightly`: the default variant (CUDA with version specified in `VLLM_MAIN_CUDA_VERSION`) built with the last commit on the `main` branch. Currently it is CUDA 12.9.
|
||||
- `https://wheels.vllm.ai/nightly/<variant>`: all other variants. Now this includes `cu130`, and `cpu`. The default variant (`cu129`) also has a subdirectory to keep consistency.
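The index layout described above maps a variant name to a URL. This is a minimal sketch of that mapping; the helper name `nightly_index_url` is hypothetical, while the base URL and variant names are the ones listed above.

```bash
# Hypothetical helper: print the nightly index URL for an optional variant.
nightly_index_url() {
    base="https://wheels.vllm.ai/nightly"
    if [ -n "${1:-}" ]; then
        echo "$base/$1"
    else
        echo "$base"
    fi
}

nightly_index_url          # default variant
nightly_index_url cu130    # CUDA 13.0 variant
nightly_index_url cpu      # CPU variant
```

The printed URL is what would be passed to `--extra-index-url`.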
|
||||
|
||||
To install from the nightly index, run:
|
||||
|
||||
@@ -82,8 +83,8 @@ uv pip install vllm \
|
||||
--extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT} # add variant subdirectory here if needed
|
||||
```
|
||||
|
||||
# --8<-- [end:pre-built-wheels]
|
||||
# --8<-- [start:build-wheel-from-source]
|
||||
--8<-- [end:pre-built-wheels]
|
||||
--8<-- [start:build-wheel-from-source]
|
||||
|
||||
#### Set up using Python-only build (without compilation) {#python-only-build}
|
||||
|
||||
@@ -116,9 +117,9 @@ uv pip install --editable .
|
||||
|
||||
There are more environment variables to control the behavior of Python-only build:
|
||||
|
||||
* `VLLM_PRECOMPILED_WHEEL_LOCATION`: specify the exact wheel URL or local file path of a pre-compiled wheel to use. All other logic to find the wheel will be skipped.
|
||||
* `VLLM_PRECOMPILED_WHEEL_COMMIT`: override the commit hash to download the pre-compiled wheel. It can be `nightly` to use the last **already built** commit on the main branch.
|
||||
* `VLLM_PRECOMPILED_WHEEL_VARIANT`: specify the variant subdirectory to use on the nightly index, e.g., `cu129`, `cu130`, `cpu`. If not specified, the variant is auto-detected based on your system's CUDA version (from PyTorch or nvidia-smi). You can also set `VLLM_MAIN_CUDA_VERSION` to override auto-detection.
|
||||
- `VLLM_PRECOMPILED_WHEEL_LOCATION`: specify the exact wheel URL or local file path of a pre-compiled wheel to use. All other logic to find the wheel will be skipped.
|
||||
- `VLLM_PRECOMPILED_WHEEL_COMMIT`: override the commit hash to download the pre-compiled wheel. It can be `nightly` to use the last **already built** commit on the main branch.
|
||||
- `VLLM_PRECOMPILED_WHEEL_VARIANT`: specify the variant subdirectory to use on the nightly index, e.g., `cu129`, `cu130`, `cpu`. If not specified, the variant is auto-detected based on your system's CUDA version (from PyTorch or nvidia-smi). You can also set `VLLM_MAIN_CUDA_VERSION` to override auto-detection.
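As an illustration of the variables above, the variant and commit can be pinned in the shell before starting the build. This is a minimal sketch; the values `cu130` and `nightly` are examples taken from the descriptions above, not recommendations.

```bash
# Pin the variant and commit explicitly instead of relying on auto-detection.
export VLLM_PRECOMPILED_WHEEL_VARIANT=cu130
export VLLM_PRECOMPILED_WHEEL_COMMIT=nightly

# Show which precompiled-wheel settings are active before building.
env | grep '^VLLM_PRECOMPILED_WHEEL_' | sort
```

If `VLLM_PRECOMPILED_WHEEL_LOCATION` is also set, it takes precedence, since all other wheel lookup logic is then skipped.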
|
||||
|
||||
You can find more information about vLLM's wheels in [Install the latest code](#install-the-latest-code).
|
||||
|
||||
@@ -236,8 +237,8 @@ export VLLM_TARGET_DEVICE=empty
|
||||
uv pip install -e .
|
||||
```
|
||||
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
--8<-- [end:build-wheel-from-source]
|
||||
--8<-- [start:pre-built-images]
|
||||
|
||||
vLLM offers an official Docker image for deployment.
|
||||
The image can be used to run an OpenAI-compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
|
||||
@@ -314,8 +315,8 @@ docker run --runtime nvidia --gpus all \
|
||||
|
||||
This will automatically configure `LD_LIBRARY_PATH` to point to the compatibility libraries before loading PyTorch and other dependencies.
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
--8<-- [end:pre-built-images]
|
||||
--8<-- [start:build-image-from-source]
|
||||
|
||||
You can build and run vLLM from source via the provided [docker/Dockerfile](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile). To build vLLM:
|
||||
|
||||
@@ -415,9 +416,9 @@ The argument `vllm/vllm-openai` specifies the image to run, and should be replac
|
||||
!!! note
|
||||
**For versions 0.4.1 and 0.4.2 only** - the vLLM Docker images for these versions are supposed to be run as the root user, because a library under the root user's home directory, i.e. `/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`, must be loaded at runtime. If you are running the container as a different user, you may need to first change the permissions of the library (and all its parent directories) to allow that user to access it, then run vLLM with the environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`.
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:supported-features]
|
||||
--8<-- [end:build-image-from-source]
|
||||
--8<-- [start:supported-features]
|
||||
|
||||
See [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information.
|
||||
|
||||
# --8<-- [end:supported-features]
|
||||
--8<-- [end:supported-features]
|
||||
|
||||
@@ -88,8 +88,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
|
||||
|
||||
### Pre-built images
|
||||
|
||||
<!-- markdownlint-disable MD025 -->
|
||||
# --8<-- [start:pre-built-images]
|
||||
--8<-- [start:pre-built-images]
|
||||
|
||||
=== "NVIDIA CUDA"
|
||||
|
||||
@@ -103,15 +102,11 @@ vLLM is a Python library that supports the following GPU variants. Select your G
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images"
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
<!-- markdownlint-enable MD025 -->
|
||||
--8<-- [end:pre-built-images]
|
||||
|
||||
<!-- markdownlint-disable MD001 -->
|
||||
### Build image from source
|
||||
<!-- markdownlint-enable MD001 -->
|
||||
|
||||
<!-- markdownlint-disable MD025 -->
|
||||
# --8<-- [start:build-image-from-source]
|
||||
--8<-- [start:build-image-from-source]
|
||||
|
||||
=== "NVIDIA CUDA"
|
||||
|
||||
@@ -125,8 +120,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source"
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
<!-- markdownlint-enable MD025 -->
|
||||
--8<-- [end:build-image-from-source]
|
||||
|
||||
## Supported features
|
||||
|
||||
|
||||
@@ -1,23 +1,24 @@
|
||||
# --8<-- [start:installation]
|
||||
<!-- markdownlint-disable MD041 MD051 -->
|
||||
--8<-- [start:installation]
|
||||
|
||||
vLLM supports AMD GPUs with ROCm 6.3 or above. Pre-built wheels are available for ROCm 7.0.
|
||||
|
||||
# --8<-- [end:installation]
|
||||
# --8<-- [start:requirements]
|
||||
--8<-- [end:installation]
|
||||
--8<-- [start:requirements]
|
||||
|
||||
- GPU: MI200s (gfx90a), MI300 (gfx942), MI350 (gfx950), Radeon RX 7900 series (gfx1100/1101), Radeon RX 9000 series (gfx1200/1201), Ryzen AI MAX / AI 300 Series (gfx1151/1150)
|
||||
- ROCm 6.3 or above
|
||||
- MI350 requires ROCm 7.0 or above
|
||||
- Ryzen AI MAX / AI 300 Series requires ROCm 7.0.2 or above
|
||||
|
||||
# --8<-- [end:requirements]
|
||||
# --8<-- [start:set-up-using-python]
|
||||
--8<-- [end:requirements]
|
||||
--8<-- [start:set-up-using-python]
|
||||
|
||||
The vLLM wheel bundles PyTorch and all required dependencies, and you should use the included PyTorch for compatibility. Because vLLM compiles many ROCm kernels to ensure a validated, high‑performance stack, the resulting binaries may not be compatible with other ROCm or PyTorch builds.
|
||||
If you need a different ROCm version or want to use an existing PyTorch installation, you’ll need to build vLLM from source. See [below](#build-wheel-from-source) for more details.
|
||||
|
||||
# --8<-- [end:set-up-using-python]
|
||||
# --8<-- [start:pre-built-wheels]
|
||||
--8<-- [end:set-up-using-python]
|
||||
--8<-- [start:pre-built-wheels]
|
||||
|
||||
To install the latest version of vLLM for Python 3.12, ROCm 7.0 and `glibc >= 2.35`.
|
||||
|
||||
@@ -34,7 +35,7 @@ To install a specific version and ROCm variant of vLLM wheel.
|
||||
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
|
||||
```
|
||||
|
||||
!!! warning "Caveats for using `pip`"
|
||||
!!! warning "Caveats for using `pip`"
|
||||
|
||||
We recommend using `uv` to install the vLLM wheel. Using `pip` to install from custom indices is cumbersome, because `pip` combines packages from `--extra-index-url` and the default index and chooses only the latest version, which makes it difficult to install a wheel from a custom index unless the exact versions of all packages are specified. In contrast, `uv` gives the extra index [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes).
|
||||
|
||||
@@ -44,8 +45,8 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
|
||||
pip install vllm==0.15.0+rocm700 --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
|
||||
```
|
||||
|
||||
# --8<-- [end:pre-built-wheels]
|
||||
# --8<-- [start:build-wheel-from-source]
|
||||
--8<-- [end:pre-built-wheels]
|
||||
--8<-- [start:build-wheel-from-source]
|
||||
|
||||
!!! tip
|
||||
- If the following installation steps do not work for you, please refer to [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base). A Dockerfile is itself a record of installation steps.
|
||||
@@ -104,7 +105,6 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
|
||||
!!! note
|
||||
- The validated `$FA_BRANCH` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
|
||||
|
||||
|
||||
3. Optionally, if you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps:
|
||||
|
||||
```bash
|
||||
@@ -120,7 +120,6 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
|
||||
- You will need to configure `$AITER_BRANCH_OR_COMMIT` for your purpose.
|
||||
- The validated `$AITER_BRANCH_OR_COMMIT` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
|
||||
|
||||
|
||||
4. Optionally, if you want to use MORI for EP or PD disaggregation, you can install [MORI](https://github.com/ROCm/mori) using the following steps:
|
||||
|
||||
```bash
|
||||
@@ -135,7 +134,6 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
|
||||
- You will need to configure `$MORI_BRANCH_OR_COMMIT` for your purpose.
|
||||
- The validated `$MORI_BRANCH_OR_COMMIT` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
|
||||
|
||||
|
||||
5. Build vLLM. For example, vLLM on ROCm 7.0 can be built with the following steps:
|
||||
|
||||
???+ console "Commands"
|
||||
@@ -171,8 +169,8 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
|
||||
- For MI300x (gfx942) users, to achieve optimal performance, please refer to the [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips at the system and workflow level.
|
||||
For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/vllm-optimization.html).
|
||||
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
--8<-- [end:build-wheel-from-source]
|
||||
--8<-- [start:pre-built-images]
|
||||
|
||||
vLLM offers an official Docker image for deployment.
|
||||
The image can be used to run an OpenAI-compatible server and is available on Docker Hub as [vllm/vllm-openai-rocm](https://hub.docker.com/r/vllm/vllm-openai-rocm/tags).
|
||||
@@ -217,8 +215,8 @@ rocm/vllm-dev:nightly
|
||||
Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
|
||||
for instructions on how to use this prebuilt docker image.
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
--8<-- [end:pre-built-images]
|
||||
--8<-- [start:build-image-from-source]
|
||||
|
||||
You can build and run vLLM from source via the provided [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm).
|
||||
|
||||
@@ -271,7 +269,6 @@ To build vllm on ROCm 7.0 for MI200 and MI300 series, you can use the default (w
|
||||
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm/vllm-openai-rocm .
|
||||
```
|
||||
|
||||
|
||||
To run vLLM with the custom-built Docker image:
|
||||
|
||||
```bash
|
||||
@@ -308,9 +305,9 @@ To use the docker image as base for development, you can launch it in interactiv
|
||||
vllm/vllm-openai-rocm
|
||||
```
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:supported-features]
|
||||
--8<-- [end:build-image-from-source]
|
||||
--8<-- [start:supported-features]
|
||||
|
||||
See [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information.
|
||||
|
||||
# --8<-- [end:supported-features]
|
||||
--8<-- [end:supported-features]
|
||||
|
||||
@@ -1,29 +1,30 @@
|
||||
# --8<-- [start:installation]
|
||||
<!-- markdownlint-disable MD041 -->
|
||||
--8<-- [start:installation]
|
||||
|
||||
vLLM initially supports basic model inference and serving on the Intel GPU platform.
|
||||
|
||||
# --8<-- [end:installation]
|
||||
# --8<-- [start:requirements]
|
||||
--8<-- [end:installation]
|
||||
--8<-- [start:requirements]
|
||||
|
||||
- Supported Hardware: Intel Data Center GPU, Intel ARC GPU
|
||||
- OneAPI requirements: oneAPI 2025.3
|
||||
- Dependency: [vllm-xpu-kernels](https://github.com/vllm-project/vllm-xpu-kernels): a package that provides all the custom vLLM kernels needed to run vLLM on the Intel GPU platform.
|
||||
- Dependency: [vllm-xpu-kernels](https://github.com/vllm-project/vllm-xpu-kernels): a package that provides all the custom vLLM kernels needed to run vLLM on the Intel GPU platform.
|
||||
- Python: 3.12
|
||||
!!! warning
|
||||
The provided vllm-xpu-kernels wheel is specific to Python 3.12, so this version is required.
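
The Python version requirement above can be checked before building. The following is a minimal sketch, not part of vLLM: the constant `REQUIRED_PYTHON` and the helper `is_supported_python` are hypothetical names introduced only for illustration.

```python
import sys

# Assumption for illustration: only the major.minor pair matters for wheel compatibility.
REQUIRED_PYTHON = (3, 12)


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True when the interpreter's major.minor version is exactly 3.12."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


if __name__ == "__main__":
    if not is_supported_python():
        found = ".".join(str(part) for part in sys.version_info[:3])
        print(f"Python 3.12 is required for vllm-xpu-kernels, found {found}")
```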
|
||||
|
||||
# --8<-- [end:requirements]
|
||||
# --8<-- [start:set-up-using-python]
|
||||
--8<-- [end:requirements]
|
||||
--8<-- [start:set-up-using-python]
|
||||
|
||||
There is no extra information on creating a new Python environment for this device.
|
||||
|
||||
# --8<-- [end:set-up-using-python]
|
||||
# --8<-- [start:pre-built-wheels]
|
||||
--8<-- [end:set-up-using-python]
|
||||
--8<-- [start:pre-built-wheels]
|
||||
|
||||
Currently, there are no pre-built XPU wheels.
|
||||
|
||||
# --8<-- [end:pre-built-wheels]
|
||||
# --8<-- [start:build-wheel-from-source]
|
||||
--8<-- [end:pre-built-wheels]
|
||||
--8<-- [start:build-wheel-from-source]
|
||||
|
||||
- First, install the required [driver](https://dgpu-docs.intel.com/driver/installation.html#installing-gpu-drivers) and [Intel OneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) 2025.3 or later.
|
||||
- Second, install Python packages for vLLM XPU backend building:
|
||||
@@ -54,13 +55,13 @@ pip install -v -r requirements/xpu.txt
|
||||
VLLM_TARGET_DEVICE=xpu pip install --no-build-isolation -e . -v
|
||||
```
|
||||
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
--8<-- [end:build-wheel-from-source]
|
||||
--8<-- [start:pre-built-images]
|
||||
|
||||
Currently, we release prebuilt XPU images on Docker [Hub](https://hub.docker.com/r/intel/vllm/tags) for each vLLM release. For more information, please refer to the release [notes](https://github.com/intel/ai-containers/blob/main/vllm).
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
--8<-- [end:pre-built-images]
|
||||
--8<-- [start:build-image-from-source]
|
||||
|
||||
```bash
|
||||
docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
|
||||
@@ -74,8 +75,8 @@ docker run -it \
|
||||
vllm-xpu-env
|
||||
```
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:supported-features]
|
||||
--8<-- [end:build-image-from-source]
|
||||
--8<-- [start:supported-features]
|
||||
|
||||
The XPU platform supports **tensor parallel** inference/serving and also supports **pipeline parallel** as a beta feature for online serving. **Pipeline parallel** is supported on a single node with mp as the backend. For example, a reference execution looks like the following:
|
||||
|
||||
@@ -90,9 +91,9 @@ vllm serve facebook/opt-13b \
|
||||
|
||||
By default, a Ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equal to `parallel_config.world_size`. We recommend starting a Ray cluster properly before execution; see the [examples/online_serving/run_cluster.sh](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh) helper script.
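
To size a Ray cluster ahead of time, it helps to know how many GPUs the automatically launched instance would request. The sketch below assumes that the world size is the product of the tensor parallel and pipeline parallel sizes. That formula is an assumption not stated in the text above, and the helper name `expected_world_size` is hypothetical.

```python
def expected_world_size(tensor_parallel_size: int, pipeline_parallel_size: int = 1) -> int:
    """Estimate the number of workers (and GPUs) under the stated assumption.

    Assumption: world size = tensor parallel size * pipeline parallel size.
    """
    if tensor_parallel_size < 1 or pipeline_parallel_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    return tensor_parallel_size * pipeline_parallel_size


if __name__ == "__main__":
    # Two-way tensor parallel combined with two-way pipeline parallel.
    print(expected_world_size(2, 2))
```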
|
||||
|
||||
# --8<-- [end:supported-features]
|
||||
# --8<-- [start:distributed-backend]
|
||||
--8<-- [end:supported-features]
|
||||
--8<-- [start:distributed-backend]
|
||||
|
||||
The XPU platform uses **torch-ccl** for torch<2.8 and **xccl** for torch>=2.8 as its distributed backend, since torch 2.8 supports **xccl** as a built-in backend for XPU.
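
The version rule above can be expressed as a small helper. This is an illustrative sketch only, not vLLM's actual selection logic: the function name `select_xpu_backend` is hypothetical, and the returned strings simply mirror the names used in the text rather than any registered backend identifier.

```python
import re


def select_xpu_backend(torch_version: str) -> str:
    """Pick the distributed backend name for a given torch version string.

    Only the leading major.minor pair is inspected, so local suffixes such
    as "+xpu" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"unrecognized torch version: {torch_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison avoids the string-comparison pitfall where "2.10" < "2.8".
    return "xccl" if (major, minor) >= (2, 8) else "torch-ccl"
```

Comparing integer tuples rather than raw strings keeps a version such as 2.10 on the **xccl** side of the boundary.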
|
||||
|
||||
# --8<-- [end:distributed-backend]
|
||||
--8<-- [end:distributed-backend]
|
||||
|
||||
@@ -1,3 +1,4 @@
|
||||
<!-- markdownlint-disable MD041 -->
|
||||
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following commands:
|
||||
|
||||
```bash
|
||||
|
||||