[doc] improve readability for long commands (#19920)
Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com>
This commit is contained in:
@@ -79,7 +79,9 @@ Currently, there are no pre-built CPU wheels.
|
||||
??? Commands
|
||||
|
||||
```console
|
||||
$ docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env --target vllm-openai .
|
||||
$ docker build -f docker/Dockerfile.cpu \
|
||||
--tag vllm-cpu-env \
|
||||
--target vllm-openai .
|
||||
|
||||
# Launching OpenAI server
|
||||
$ docker run --rm \
|
||||
@@ -188,13 +190,19 @@ vllm serve facebook/opt-125m
|
||||
- Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
|
||||
|
||||
```console
|
||||
VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
|
||||
VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" \
|
||||
vllm serve meta-llama/Llama-2-7b-chat-hf \
|
||||
-tp=2 \
|
||||
--distributed-executor-backend mp
|
||||
```
|
||||
|
||||
or using default auto thread binding:
|
||||
|
||||
```console
|
||||
VLLM_CPU_KVCACHE_SPACE=40 vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
|
||||
VLLM_CPU_KVCACHE_SPACE=40 \
|
||||
vllm serve meta-llama/Llama-2-7b-chat-hf \
|
||||
-tp=2 \
|
||||
--distributed-executor-backend mp
|
||||
```
|
||||
|
||||
- For each thread id list in `VLLM_CPU_OMP_THREADS_BIND`, users should guarantee threads in the list belong to a same NUMA node.
|
||||
|
||||
Reference in New Issue
Block a user