[Docs] Add sections on process architecture and minimum CPU resources (#33940)
It seems users can be confused about vLLM's performance when running with very few CPU cores available. We were missing a clear overview of vLLM's process architecture, so I added one along with some diagrams in arch_overview.md, and included a section on CPU resource recommendations in optimization.md.

Signed-off-by: mgoin <mgoin64@gmail.com>
@@ -291,6 +291,52 @@ Based on the configuration, the content of the multi-modal caches on `P0` and `P
- K: Stores the hashes of multi-modal items
- V: Stores the processed tensor data of multi-modal items
## CPU Resources for GPU Deployments
vLLM V1 uses a multi-process architecture (see [V1 Process Architecture](../design/arch_overview.md#v1-process-architecture)) where each process requires CPU resources. Underprovisioning CPU cores is a common source of performance degradation, especially in virtualized environments.
### Minimum CPU Requirements
For a deployment with `N` GPUs, there are at minimum:
- **1 API server process** -- handles HTTP requests, tokenization, and input processing
- **1 engine core process** -- runs the scheduler and coordinates GPU workers
- **N GPU worker processes** -- one per GPU, each executing model forward passes
This means there are always at least **`2 + N` processes** competing for CPU time.
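The process count above can be sketched as a tiny helper (a hypothetical `min_processes` function for illustration, not part of vLLM):

```python
def min_processes(num_gpus: int) -> int:
    """Minimum vLLM V1 process count for a basic deployment:
    1 API server + 1 engine core + one worker per GPU."""
    return 2 + num_gpus

# A single 8-GPU node: 1 API server + 1 engine core + 8 workers
print(min_processes(8))  # -> 10
```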
!!! warning
    Using fewer physical CPU cores than processes will cause contention and significantly degrade throughput and latency. The engine core process runs a busy loop and is particularly sensitive to CPU starvation.
The minimum is `2 + N` physical cores (1 for the API server, 1 for the engine core, and 1 per GPU worker). In practice, allocating more cores improves performance because the OS, PyTorch background threads, and other system processes also need CPU time.
!!! important
    Please note we are referring to **physical CPU cores** here. If your system has hyperthreading enabled, then 1 vCPU = 1 hyperthread = 1/2 physical CPU core, so you need a minimum of `2 x (2 + N)` vCPUs.
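The vCPU arithmetic above can be written out as a small sketch (a hypothetical `min_vcpus` helper, not a vLLM API):

```python
def min_vcpus(num_gpus: int, hyperthreading: bool = True) -> int:
    """Minimum vCPU count for N GPUs: 2 + N physical cores, doubled
    when each vCPU is a hyperthread (i.e. half a physical core)."""
    physical = 2 + num_gpus
    return 2 * physical if hyperthreading else physical

print(min_vcpus(4))                        # -> 12 vCPUs
print(min_vcpus(4, hyperthreading=False))  # -> 6 physical cores
```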
### Data Parallel and Multi-API Server Deployments
When using data parallelism or multiple API servers, the CPU requirements increase:
```console
Minimum physical cores = A + DP + N + (1 if DP > 1 else 0)
```
where `A` is the API server count (defaults to `DP`), `DP` is the data parallel size, and `N` is the total number of GPUs. For example, with `DP=4, TP=2` on 8 GPUs:
```console
4 API servers + 4 engine cores + 8 GPU workers + 1 DP coordinator = 17 processes
```
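The formula can be sketched as a helper (a hypothetical `min_physical_cores` function for illustration, not part of vLLM), which reproduces the 17-core example above:

```python
from typing import Optional

def min_physical_cores(num_gpus: int, dp: int = 1,
                       api_servers: Optional[int] = None) -> int:
    """Minimum physical cores = A + DP + N + (1 if DP > 1 else 0).

    A (the API server count) defaults to DP; the extra core covers
    the DP coordinator process, which exists only when DP > 1.
    """
    a = dp if api_servers is None else api_servers
    coordinator = 1 if dp > 1 else 0
    return a + dp + num_gpus + coordinator

# DP=4, TP=2 on 8 GPUs: 4 API servers + 4 engine cores + 8 workers + 1 coordinator
print(min_physical_cores(num_gpus=8, dp=4))  # -> 17
```

With `DP=1` this reduces to the `2 + N` minimum from the previous section.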
### Performance Impact
CPU underprovisioning particularly impacts:
- **Input processing throughput** -- tokenization, chat template rendering, and multi-modal data loading all run on CPU
- **Scheduling latency** -- the engine core scheduler runs on CPU and directly affects how quickly new tokens are dispatched to the GPU workers
- **Output processing** -- detokenization, networking, and especially streaming token responses use CPU cycles
If you observe that GPU utilization is lower than expected, CPU contention may be the bottleneck. Increasing the number of available CPU cores, or even their clock speed, can significantly improve end-to-end performance.
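One quick check is to compare the CPUs actually visible to your process against the `2 + N` minimum. A minimal Linux-only sketch (the `cpu_budget` helper is hypothetical; `os.sched_getaffinity` respects cgroup and `taskset` limits, unlike `os.cpu_count()`):

```python
import os

def cpu_budget(num_gpus: int):
    """Return (available, required) CPU counts for this process."""
    available = len(os.sched_getaffinity(0))  # honors cgroup/taskset pinning
    required = 2 + num_gpus
    return available, required

available, required = cpu_budget(num_gpus=1)
if available < required:
    print(f"WARNING: only {available} CPUs for >= {required} processes")
```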
## Attention Backend Selection
vLLM supports multiple attention backends optimized for different hardware and use cases. The backend is automatically selected based on your GPU architecture, model type, and configuration, but you can also manually specify one for optimal performance.
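As a hedged example (the exact variable name and accepted values vary by vLLM version, so check the docs for your installed release), a backend can typically be forced via the `VLLM_ATTENTION_BACKEND` environment variable, set before vLLM is imported:

```python
import os

# Assumption: must be set before `import vllm`; accepted values
# (e.g. FLASH_ATTN, FLASHINFER) depend on the installed version.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
```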