[Docs] Fix syntax highlighting of shell commands (#19870)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
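
A minimal sketch of how the same kind of fence change could be applied in bulk. It is not taken from this commit: the helper name `fix_fences` is hypothetical, and it assumes every `console` block holds shell commands, which is not always true (one hunk below retags a block as `json` by hand). It rewrites `console` fence openers to `bash` and strips a leading `$ ` prompt inside those blocks only.

```bash
# Hypothetical helper (not part of the commit): reads Markdown on stdin,
# writes the converted Markdown to stdout.
fix_fences() {
  # 1st expression: inside a console block (opener through closing fence),
  #                 drop a leading "$ " prompt while keeping indentation.
  # 2nd expression: retag the opener itself from "console" to "bash".
  # `{3} is an ERE interval matching three backticks.
  sed -E \
    -e '/^[[:space:]]*`{3}console[[:space:]]*$/,/^[[:space:]]*`{3}[[:space:]]*$/ s/^([[:space:]]*)\$ /\1/' \
    -e 's/^([[:space:]]*`{3})console[[:space:]]*$/\1bash/'
}
```

Usage, with a placeholder path: `fix_fences < page.md > page.fixed.md`. Review the result by hand, since blocks that contain JSON or command output need a different tag.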
@@ -22,7 +22,7 @@ Therefore, it is recommended to install vLLM with a **fresh new** environment. I

 You can install vLLM using either `pip` or `uv pip`:

-```console
+```bash
 # Install vLLM with CUDA 12.8.
 # If you are using pip.
 pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128
@@ -37,7 +37,7 @@ We recommend leveraging `uv` to [automatically select the appropriate PyTorch in

 As of now, vLLM's binaries are compiled with CUDA 12.8 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.6, 11.8, and public PyTorch release versions:

-```console
+```bash
 # Install vLLM with CUDA 11.8.
 export VLLM_VERSION=0.6.1.post1
 export PYTHON_VERSION=312
@@ -52,7 +52,7 @@ LLM inference is a fast-evolving field, and the latest code may contain bug fixe

 ##### Install the latest code using `pip`

-```console
+```bash
 pip install -U vllm \
 --pre \
 --extra-index-url https://wheels.vllm.ai/nightly
@@ -62,7 +62,7 @@ pip install -U vllm \

 Another way to install the latest code is to use `uv`:

-```console
+```bash
 uv pip install -U vllm \
 --torch-backend=auto \
 --extra-index-url https://wheels.vllm.ai/nightly
@@ -72,7 +72,7 @@ uv pip install -U vllm \

 If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), due to the limitation of `pip`, you have to specify the full URL of the wheel file by embedding the commit hash in the URL:

-```console
+```bash
 export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
 pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
 ```
@@ -83,7 +83,7 @@ Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.p

 If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), you can specify the commit hash in the URL:

-```console
+```bash
 export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
 uv pip install vllm \
 --torch-backend=auto \
@@ -99,7 +99,7 @@ The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-rememb

 If you only need to change Python code, you can build and install vLLM without compilation. Using `pip`'s [`--editable` flag](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs), changes you make to the code will be reflected when you run vLLM:

-```console
+```bash
 git clone https://github.com/vllm-project/vllm.git
 cd vllm
 VLLM_USE_PRECOMPILED=1 pip install --editable .
@@ -118,7 +118,7 @@ This command will do the following:

 In case you see an error about wheel not found when running the above command, it might be because the commit you based on in the main branch was just merged and the wheel is being built. In this case, you can wait for around an hour to try again, or manually assign the previous commit in the installation using the `VLLM_PRECOMPILED_WHEEL_LOCATION` environment variable.

-```console
+```bash
 export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
 export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
 pip install --editable .
@@ -134,7 +134,7 @@ You can find more information about vLLM's wheels in [install-the-latest-code][i

 If you want to modify C++ or CUDA code, you'll need to build vLLM from source. This can take several minutes:

-```console
+```bash
 git clone https://github.com/vllm-project/vllm.git
 cd vllm
 pip install -e .
@@ -160,7 +160,7 @@ There are scenarios where the PyTorch dependency cannot be easily installed via

 To build vLLM using an existing PyTorch installation:

-```console
+```bash
 git clone https://github.com/vllm-project/vllm.git
 cd vllm
 python use_existing_torch.py
@@ -173,7 +173,7 @@ pip install --no-build-isolation -e .
 Currently, before starting the build process, vLLM fetches cutlass code from GitHub. However, there may be scenarios where you want to use a local version of cutlass instead.
 To achieve this, you can set the environment variable VLLM_CUTLASS_SRC_DIR to point to your local cutlass directory.

-```console
+```bash
 git clone https://github.com/vllm-project/vllm.git
 cd vllm
 VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e .
@@ -184,7 +184,7 @@ VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e .
 To avoid your system being overloaded, you can limit the number of compilation jobs
 to be run simultaneously, via the environment variable `MAX_JOBS`. For example:

-```console
+```bash
 export MAX_JOBS=6
 pip install -e .
 ```
@@ -194,7 +194,7 @@ A side effect is a much slower build process.

 Additionally, if you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.

-```console
+```bash
 # Use `--ipc=host` to make sure the shared memory is large enough.
 docker run \
 --gpus all \
@@ -205,14 +205,14 @@ docker run \

 If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.:

-```console
+```bash
 export CUDA_HOME=/usr/local/cuda
 export PATH="${CUDA_HOME}/bin:$PATH"
 ```

 Here is a sanity check to verify that the CUDA Toolkit is correctly installed:

-```console
+```bash
 nvcc --version # verify that nvcc is in your PATH
 ${CUDA_HOME}/bin/nvcc --version # verify that nvcc is in your CUDA_HOME
 ```
@@ -223,7 +223,7 @@ vLLM can fully run only on Linux but for development purposes, you can still bui

 Simply disable the `VLLM_TARGET_DEVICE` environment variable before installing:

-```console
+```bash
 export VLLM_TARGET_DEVICE=empty
 pip install -e .
 ```
@@ -238,7 +238,7 @@ See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for i

 Another way to access the latest code is to use the docker images:

-```console
+```bash
 export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
 docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT}
 ```

@@ -31,17 +31,17 @@ Currently, there are no pre-built ROCm wheels.

 Alternatively, you can install PyTorch using PyTorch wheels. You can check PyTorch installation guide in PyTorch [Getting Started](https://pytorch.org/get-started/locally/). Example:

-```console
+```bash
 # Install PyTorch
-$ pip uninstall torch -y
-$ pip install --no-cache-dir --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.3
+pip uninstall torch -y
+pip install --no-cache-dir --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.3
 ```

 1. Install [Triton flash attention for ROCm](https://github.com/ROCm/triton)

 Install ROCm's Triton flash attention (the default triton-mlir branch) following the instructions from [ROCm/triton](https://github.com/ROCm/triton/blob/triton-mlir/README.md)

-```console
+```bash
 python3 -m pip install ninja cmake wheel pybind11
 pip uninstall -y triton
 git clone https://github.com/OpenAI/triton.git
@@ -62,7 +62,7 @@ Currently, there are no pre-built ROCm wheels.

 For example, for ROCm 6.3, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`.

-```console
+```bash
 git clone https://github.com/ROCm/flash-attention.git
 cd flash-attention
 git checkout b7d29fb
@@ -76,7 +76,7 @@ Currently, there are no pre-built ROCm wheels.

 3. If you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps:

-```console
+```bash
 python3 -m pip uninstall -y aiter
 git clone --recursive https://github.com/ROCm/aiter.git
 cd aiter
@@ -148,7 +148,7 @@ If you choose to build this rocm_base image yourself, the steps are as follows.

 It is important that the user kicks off the docker build using buildkit. Either the user put DOCKER_BUILDKIT=1 as environment variable when calling docker build command, or the user needs to setup buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:

-```console
+```json
 {
 "features": {
 "buildkit": true
@@ -158,7 +158,7 @@ It is important that the user kicks off the docker build using buildkit. Either

 To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:

-```console
+```bash
 DOCKER_BUILDKIT=1 docker build \
 -f docker/Dockerfile.rocm_base \
 -t rocm/vllm-dev:base .
@@ -169,7 +169,7 @@ DOCKER_BUILDKIT=1 docker build \
 First, build a docker image from <gh-file:docker/Dockerfile.rocm> and launch a docker container from the image.
 It is important that the user kicks off the docker build using buildkit. Either the user put `DOCKER_BUILDKIT=1` as environment variable when calling docker build command, or the user needs to setup buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:

-```console
+```bash
 {
 "features": {
 "buildkit": true
@@ -187,13 +187,13 @@ Their values can be passed in when running `docker build` with `--build-arg` opt

 To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:

-```console
+```bash
 DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm-rocm .
 ```

 To build vllm on ROCm 6.3 for Radeon RX7900 series (gfx1100), you should pick the alternative base image:

-```console
+```bash
 DOCKER_BUILDKIT=1 docker build \
 --build-arg BASE_IMAGE="rocm/vllm-dev:navi_base" \
 -f docker/Dockerfile.rocm \
@@ -205,7 +205,7 @@ To run the above docker image `vllm-rocm`, use the below command:

 ??? Command

-```console
+```bash
 docker run -it \
 --network=host \
 --group-add=video \

@@ -25,7 +25,7 @@ Currently, there are no pre-built XPU wheels.
 - First, install required driver and Intel OneAPI 2025.0 or later.
 - Second, install Python packages for vLLM XPU backend building:

-```console
+```bash
 git clone https://github.com/vllm-project/vllm.git
 cd vllm
 pip install --upgrade pip
@@ -34,7 +34,7 @@ pip install -v -r requirements/xpu.txt

 - Then, build and install vLLM XPU backend:

-```console
+```bash
 VLLM_TARGET_DEVICE=xpu python setup.py install
 ```

@@ -53,9 +53,9 @@ Currently, there are no pre-built XPU images.
 # --8<-- [end:pre-built-images]
 # --8<-- [start:build-image-from-source]

-```console
-$ docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
-$ docker run -it \
+```bash
+docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
+docker run -it \
 --rm \
 --network=host \
 --device /dev/dri \
@@ -68,7 +68,7 @@ $ docker run -it \

 XPU platform supports **tensor parallel** inference/serving and also supports **pipeline parallel** as a beta feature for online serving. We require Ray as the distributed runtime backend. For example, a reference execution like following:

-```console
+```bash
 python -m vllm.entrypoints.openai.api_server \
 --model=facebook/opt-13b \
 --dtype=bfloat16 \