[Doc] Update CPU doc (#20676)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -32,7 +32,22 @@ Testing has been conducted on AWS Graviton3 instances for compatibility.
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
```bash
docker build -f docker/Dockerfile.arm \
             --tag vllm-cpu-env .

# Launching OpenAI server
docker run --rm \
           --privileged=true \
           --shm-size=4g \
           -p 8000:8000 \
           -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
           -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
           vllm-cpu-env \
           --model=meta-llama/Llama-3.2-1B-Instruct \
           --dtype=bfloat16 \
           other vLLM OpenAI server arguments
```
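For example, with illustrative values substituted for the placeholders (these are not tuned recommendations): 40 GB of KV cache space and inference threads bound to cores 0-29.

```bash
# Illustrative: 40 GB KV cache, threads bound to cores 0-29
docker run --rm \
           --privileged=true \
           --shm-size=4g \
           -p 8000:8000 \
           -e VLLM_CPU_KVCACHE_SPACE=40 \
           -e VLLM_CPU_OMP_THREADS_BIND=0-29 \
           vllm-cpu-env \
           --model=meta-llama/Llama-3.2-1B-Instruct \
           --dtype=bfloat16
```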
# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
# --8<-- [end:extra-information]
@@ -2,7 +2,7 @@ First, install recommended compiler. We recommend to use `gcc/g++ >= 12.3.0` as
```bash
sudo apt-get update -y
sudo apt-get install -y --no-install-recommends ccache git curl wget ca-certificates gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 jq lsof
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
```
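To confirm that `gcc-12`/`g++-12` are now the selected defaults, a quick sanity check (not part of the original instructions):

```bash
# Both should report version 12.x after the update-alternatives step
gcc --version
g++ --version
```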
@@ -17,7 +17,7 @@ Third, install Python packages for vLLM CPU backend building:
```bash
pip install --upgrade pip
pip install -v -r requirements/cpu-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
```
@@ -33,4 +33,7 @@ If you want to develop vllm, install it in editable mode instead.
VLLM_TARGET_DEVICE=cpu python setup.py develop
```
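A quick, illustrative way to verify that the editable install is importable:

```bash
python -c "import vllm; print(vllm.__version__)"
```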
!!! note
    If you are building vLLM from source and not using the pre-built images, remember to set `LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD"` on x86 machines before running vLLM.
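    For example, assuming the Ubuntu path for `libtcmalloc-minimal4` (adjust for your distribution), a hypothetical launch could look like:

    ```bash
    # Preload tcmalloc, then start the OpenAI-compatible server
    export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD"
    vllm serve meta-llama/Llama-3.2-1B-Instruct
    ```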
# --8<-- [end:extra-information]
@@ -61,6 +61,23 @@ Execute the following commands to build and install vLLM from the source.
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
```bash
docker build -f docker/Dockerfile.s390x \
             --tag vllm-cpu-env .

# Launching OpenAI server
docker run --rm \
           --privileged=true \
           --shm-size=4g \
           -p 8000:8000 \
           -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
           -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
           vllm-cpu-env \
           --model=meta-llama/Llama-3.2-1B-Instruct \
           --dtype=float \
           other vLLM OpenAI server arguments
```
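Once the container is running, you can sanity-check the server from the host; this probes the standard OpenAI-compatible route that vLLM exposes (an illustrative check, not from the original doc):

```bash
curl http://localhost:8000/v1/models
```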
# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
# --8<-- [end:extra-information]
@@ -1,19 +1,15 @@
# --8<-- [start:installation]
vLLM supports basic model inference and serving on the x86 CPU platform, with data types FP32, FP16, and BF16.
# --8<-- [end:installation]
# --8<-- [start:requirements]
- OS: Linux
- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
- Instruction Set Architecture (ISA): AVX512 (optional, recommended)
- CPU flags: `avx512f`, `avx512_bf16` (optional), `avx512_vnni` (optional)
!!! tip
    [Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) extends PyTorch with up-to-date features and optimizations for an extra performance boost on Intel hardware.
    Use `lscpu` to check the CPU flags.
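    For instance, a hypothetical one-liner to list the relevant flags:

    ```bash
    # Print each AVX512-related CPU flag on its own line
    lscpu | grep -o 'avx512[a-z0-9_]*' | sort -u
    ```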
# --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
@@ -26,18 +22,37 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform,
--8<-- "docs/getting_started/installation/cpu/build.inc.md"
!!! note
    - AVX512_BF16 is an extension ISA that provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script checks the host CPU flags to determine whether to enable AVX512_BF16.
    - If you want to force-enable AVX512_BF16 for cross-compilation, set the environment variable `VLLM_CPU_AVX512BF16=1` before building. See the illustrative command after this note.
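For example, combined with the editable build shown earlier (an illustrative invocation, not a prescribed workflow):

```bash
# Force-enable AVX512_BF16 regardless of the detected host flags
VLLM_CPU_AVX512BF16=1 VLLM_TARGET_DEVICE=cpu python setup.py develop
```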
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:pre-built-images]
[https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo)
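Assuming the image path mirrors the gallery URL above, pulling a release image might look like this (the tag is a placeholder, not a specific release):

```bash
docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:<version tag>
```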
!!! warning
    If the pre-built images are deployed on machines that only support `avx512f`, an `Illegal instruction` error may be raised. It is recommended to build images for such machines with `--build-arg VLLM_CPU_AVX512BF16=false` and `--build-arg VLLM_CPU_AVX512VNNI=false`.
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
```bash
docker build -f docker/Dockerfile.cpu \
             --build-arg VLLM_CPU_AVX512BF16=false (default)|true \
             --build-arg VLLM_CPU_AVX512VNNI=false (default)|true \
             --tag vllm-cpu-env \
             --target vllm-openai .

# Launching OpenAI server
docker run --rm \
           --privileged=true \
           --shm-size=4g \
           -p 8000:8000 \
           -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
           -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
           vllm-cpu-env \
           --model=meta-llama/Llama-3.2-1B-Instruct \
           --dtype=bfloat16 \
           other vLLM OpenAI server arguments
```
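Once the server is up, a minimal request against its OpenAI-compatible API might look like this (illustrative prompt and parameters):

```bash
curl http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
           "model": "meta-llama/Llama-3.2-1B-Instruct",
           "messages": [{"role": "user", "content": "Hello!"}],
           "max_tokens": 32
         }'
```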
# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
# --8<-- [end:extra-information]