In such cases, vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
available again. When this occurs, you may see the following warning:
```text
WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
```
If you frequently encounter preemptions, consider the following actions:
- Increase `gpu_memory_utilization`. vLLM pre-allocates GPU cache using this percentage of memory. By increasing utilization, you can provide more KV cache space.
- Decrease `max_num_seqs` or `max_num_batched_tokens`. This reduces the number of concurrent requests in a batch, thereby requiring less KV cache space.
- Increase `tensor_parallel_size`. This shards model weights across GPUs, allowing each GPU to have more memory available for KV cache. However, increasing this value may cause excessive synchronization overhead.
- Increase `pipeline_parallel_size`. This distributes model layers across GPUs, reducing the memory needed for model weights on each GPU, indirectly leaving more memory available for KV cache. However, increasing this value may cause latency penalties.
You can monitor the number of preemption requests through Prometheus metrics exposed by vLLM. Additionally, you can log the cumulative number of preemption requests by setting `disable_log_stats=False`.
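The mitigations above map directly onto engine arguments. As a sketch (the keys below are vLLM engine arguments, but the values are illustrative examples, not tuned recommendations):

```python
# Illustrative settings aimed at reducing preemptions. The keys are vLLM
# engine arguments; the values are example tweaks, not recommendations.
engine_kwargs = {
    "gpu_memory_utilization": 0.95,  # up from the 0.9 default: more KV cache
    "max_num_seqs": 128,             # fewer concurrent sequences per batch
    "tensor_parallel_size": 2,       # shard weights, freeing per-GPU memory
    "disable_log_stats": False,      # keep cumulative preemption counts in logs
}

# With vLLM installed, these can be passed straight to the engine, e.g.:
#   from vllm import LLM
#   llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", **engine_kwargs)
```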
### Chunked Prefill
Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.
In V1, **chunked prefill is enabled by default whenever possible**. With chunked prefill enabled, the scheduling policy prioritizes decode requests. It batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it.
- If `max_num_batched_tokens` is the same as `max_model_len`, this is almost equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).
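The decode-first budgeting can be illustrated with a toy calculation (a simplification for illustration, not vLLM's actual scheduler code): each pending decode consumes one token of the budget, and a pending prefill is chunked to whatever remains.

```python
# Toy sketch of chunked-prefill budgeting: decodes are scheduled first at
# one token each, and a pending prefill is chunked to the remaining budget.
def plan_step(num_decodes: int, prefill_len: int, max_num_batched_tokens: int):
    budget = max_num_batched_tokens - num_decodes  # decodes take priority
    prefill_chunk = min(prefill_len, max(budget, 0))
    return num_decodes, prefill_chunk

# 48 decode requests plus an 8000-token prompt under a 2048-token budget:
decodes, chunk = plan_step(48, 8000, 2048)
# schedules 48 decode tokens and a 2000-token prefill chunk; the rest of
# the prompt is prefilled in later steps
```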
### Tensor Parallelism (TP)
Tensor parallelism shards model parameters across multiple GPUs within each model layer. This is the most common strategy for large-model inference within a single node.
### Expert Parallelism (EP)
Expert parallelism is a specialized form of parallelism for Mixture of Experts (MoE) models, where different expert networks are distributed across GPUs.
**When to use:**
- Specifically for MoE models (such as DeepSeekV3, Qwen3MoE, and Llama-4)
- When you want to balance the expert computation load across GPUs
Expert parallelism is enabled by setting `enable_expert_parallel=True`, which will use expert parallelism instead of tensor parallelism for MoE layers.
The expert parallel degree is the same as whatever you have set for the tensor parallel size.
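As a configuration sketch (the keys are vLLM engine arguments; the values are illustrative):

```python
# Enabling expert parallelism alongside tensor parallelism. MoE layers then
# use EP at the same degree as the tensor parallel size (here, 4).
moe_engine_kwargs = {
    "tensor_parallel_size": 4,       # weight sharding for dense layers
    "enable_expert_parallel": True,  # MoE layers use EP instead of TP
}
```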
### Data Parallelism (DP)
Data parallelism replicates the entire model across multiple GPU sets and processes different batches of requests in parallel.
**When to use:**
- When you have enough GPUs to replicate the entire model
- When you need to scale throughput rather than model size
- In multi-user environments where isolation between request batches is beneficial
Data parallelism can be combined with the other parallelism strategies and is set by `data_parallel_size=N`.
Note that MoE layers will be sharded according to the product of the tensor parallel size and data parallel size.
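The resulting MoE sharding degree is a simple product, sketched here for concreteness:

```python
# MoE expert-layer sharding degree under combined parallelism, per the note
# above: experts shard across tensor_parallel_size * data_parallel_size GPUs.
def moe_shard_degree(tensor_parallel_size: int, data_parallel_size: int) -> int:
    return tensor_parallel_size * data_parallel_size

# e.g. TP=2, DP=4 shards MoE layers 8 ways
```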
### NUMA Binding for GPU Workers
On multi-socket GPU servers, GPU worker processes can lose performance if their
CPU execution and memory allocation drift away from the NUMA node nearest to the
GPU. vLLM can pin each worker with `numactl` before the Python subprocess starts,
so the interpreter, imports, and early allocator state are created with the
desired NUMA policy from the beginning.
Use `--numa-bind` to enable the feature. By default, vLLM auto-detects the
GPU-to-NUMA mapping and uses `--cpunodebind=<node> --membind=<node>` for each
worker. When you need a custom CPU policy, add `--numa-bind-cpus` and vLLM will
switch to `--physcpubind=<cpu-list> --membind=<node>`.
These `--numa-bind*` options only apply to GPU execution processes. They do not
configure the CPU backend's separate thread-affinity controls. Automatic
GPU-to-NUMA detection is currently implemented for CUDA/NVML-based platforms;
other GPU backends must provide explicit binding lists if they use these
options.
`--numa-bind-nodes` takes one non-negative NUMA node index per visible GPU, in
the same order as the GPU indices.
`--numa-bind-cpus` takes one `numactl` CPU list per visible GPU, in the same
order as the GPU indices. Each CPU list must use
`numactl --physcpubind` syntax such as `0-3`, `0,2,4-7`, or `16-31,48-63`.
```bash
# Auto-detect NUMA nodes for visible GPUs
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 4 \
--numa-bind
# Explicit NUMA-node mapping
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 4 \
--numa-bind \
--numa-bind-nodes 0 0 1 1
# Explicit CPU pinning, useful for PCT or other high-frequency core layouts
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 4 \
--numa-bind \
--numa-bind-nodes 0 0 1 1 \
--numa-bind-cpus 0-3 4-7 48-51 52-55
```
Notes:
- CLI usage forces multiprocessing to use the `spawn` method automatically. If you enable NUMA binding through the Python API, also set `VLLM_WORKER_MULTIPROC_METHOD=spawn`.
- Automatic detection relies on NVML and NUMA support from the host. If it cannot determine the mapping reliably, pass `--numa-bind-nodes` explicitly.
- Explicit `--numa-bind-nodes` and `--numa-bind-cpus` values must be valid `numactl` inputs. vLLM does a small amount of validation, but the effective binding semantics are still determined by `numactl`.
- The current implementation binds GPU execution processes such as `EngineCore` and multiprocessing workers. It does not apply NUMA binding to frontend API server processes or the DP coordinator.
- In containerized environments, NUMA policy syscalls may require extra permissions, such as `--cap-add SYS_NICE` when running via `docker run`.
### CPU Backend Thread Affinity
The CPU backend uses a different mechanism from `--numa-bind`. CPU execution is
configured through CPU-specific environment variables such as
`VLLM_CPU_OMP_THREADS_BIND`, `VLLM_CPU_NUM_OF_RESERVED_CPU`, and
`CPU_VISIBLE_MEMORY_NODES`, rather than the GPU-oriented `--numa-bind*` CLI
options.
By default, `VLLM_CPU_OMP_THREADS_BIND=auto` derives OpenMP placement from the
available CPU and NUMA topology for each CPU worker. To override the automatic
policy, set `VLLM_CPU_OMP_THREADS_BIND` explicitly using the CPU list format
documented for the CPU backend, or use `nobind` to disable this behavior.
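To override the automatic placement, these variables can be set before launching. A sketch for a hypothetical two-worker, two-socket host (the variable names come from the text above; the CPU lists and the `|` rank separator are examples based on the CPU backend's documented format):

```python
import os

# Hypothetical explicit OpenMP placement for two CPU workers on a
# two-socket host; adjust the CPU lists to your actual topology.
os.environ["VLLM_CPU_OMP_THREADS_BIND"] = "0-31|32-63"  # one list per rank
os.environ["VLLM_CPU_NUM_OF_RESERVED_CPU"] = "1"        # keep a core for serving
```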
For the current CPU backend setup and tuning guidance, see:
vLLM V1 uses a multi-process architecture (see [V1 Process Architecture](../design/arch_overview.md#v1-process-architecture)) where each process requires CPU resources. Underprovisioning CPU cores is a common source of performance degradation, especially in virtualized environments.
### Minimum CPU Requirements
For a deployment with `N` GPUs, there are at minimum:
- **1 API server process** -- handles HTTP requests, tokenization, and input processing
- **1 engine core process** -- runs the scheduler and coordinates GPU workers
- **N GPU worker processes** -- one per GPU, executes model forward passes
This means there are always at least **`2 + N` processes** competing for CPU time.
!!! warning
    Using fewer physical CPU cores than processes will cause contention and significantly degrade throughput and latency. The engine core process runs a busy loop and is particularly sensitive to CPU starvation.
The minimum is `2 + N` physical cores (1 for the API server, 1 for the engine core, and 1 per GPU worker). In practice, allocating more cores improves performance because the OS, PyTorch background threads, and other system processes also need CPU time.
!!! important
    Note that this refers to **physical CPU cores**. If your system has hyperthreading enabled, then 1 vCPU = 1 hyperthread = 1/2 physical CPU core, so you need at least `2 x (2 + N)` vCPUs.
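The sizing rule above can be written out explicitly:

```python
# Core-count arithmetic from the rule above: 1 API server process,
# 1 engine core process, and one worker per GPU.
def min_physical_cores(num_gpus: int) -> int:
    return 2 + num_gpus

# With hyperthreading (SMT), each vCPU is half a physical core.
def min_vcpus_with_smt(num_gpus: int) -> int:
    return 2 * min_physical_cores(num_gpus)

# A 4-GPU deployment needs at least 6 physical cores, or 12 vCPUs with SMT.
```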
### Data Parallel and Multi-API Server Deployments
When using data parallelism or multiple API servers, the CPU requirements increase:
```console
Minimum physical cores = A + DP + N + (1 if DP > 1 else 0)
```
where `A` is the API server count (defaults to `DP`), `DP` is the data parallel size, and `N` is the total number of GPUs. For example, with `DP=4, TP=2` on 8 GPUs and the default `A = DP = 4`, the minimum is `4 + 4 + 8 + 1 = 17` physical cores.
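As a direct transcription of the formula:

```python
# Sizing formula from above. The extra process when DP > 1 is assumed to be
# the DP coordinator mentioned earlier on this page.
def min_physical_cores_dp(api_servers: int, dp_size: int, num_gpus: int) -> int:
    return api_servers + dp_size + num_gpus + (1 if dp_size > 1 else 0)

# DP=4 on 8 GPUs with the default api_servers = DP = 4:
# 4 + 4 + 8 + 1 = 17 physical cores
```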
Beyond raw process counts, CPU capacity affects several stages of the serving path:
- **Input processing throughput** -- tokenization, chat template rendering, and multi-modal data loading all run on CPU
- **Scheduling latency** -- the engine core scheduler runs on CPU and directly affects how quickly new tokens are dispatched to the GPU workers
- **Output processing** -- detokenization, networking, and especially streaming token responses use CPU cycles
If you observe that GPU utilization is lower than expected, CPU contention may be the bottleneck. Increasing the number of available CPU cores, or even their clock speed, can significantly improve end-to-end performance.
### Attention Backends
vLLM supports multiple attention backends optimized for different hardware and use cases. The backend is automatically selected based on your GPU architecture, model type, and configuration, but you can also manually specify one for optimal performance.
For detailed information on available backends, their feature support, and how to configure them, see the [Attention Backend Feature Support](../design/attention_backends.md) documentation.
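To pin a backend manually, set the `VLLM_ATTENTION_BACKEND` environment variable before launching. The value below is just one example backend; consult the feature-support matrix linked above for what your hardware actually supports.

```python
import os

# Force a specific attention backend instead of relying on auto-selection.
# "FLASHINFER" is one example value; others exist per the backend docs.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
```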