In such cases, vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
available again. When this occurs, you may see the following warning:
```text
WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
```
If you frequently encounter preemptions, consider the following actions:
- Increase `gpu_memory_utilization`. vLLM pre-allocates GPU cache using this percentage of memory. By increasing utilization, you can provide more KV cache space.
- Decrease `max_num_seqs` or `max_num_batched_tokens`. This reduces the number of concurrent requests in a batch, thereby requiring less KV cache space.
- Increase `tensor_parallel_size`. This shards model weights across GPUs, allowing each GPU to have more memory available for KV cache. However, increasing this value may cause excessive synchronization overhead.
- Increase `pipeline_parallel_size`. This distributes model layers across GPUs, reducing the memory needed for model weights on each GPU, indirectly leaving more memory available for KV cache. However, increasing this value may cause latency penalties.
You can monitor the number of preemption requests through Prometheus metrics exposed by vLLM. Additionally, you can log the cumulative number of preemption requests by setting `disable_log_stats=False`.
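The mitigations above map directly onto engine arguments. As a sketch (the keys below are vLLM engine arguments, but the values are illustrative examples, not tuned recommendations):

```python
# Illustrative settings aimed at reducing preemptions. The keys are vLLM
# engine arguments; the values are example tweaks, not recommendations.
engine_kwargs = {
    "gpu_memory_utilization": 0.95,  # up from the 0.9 default: more KV cache
    "max_num_seqs": 128,             # fewer concurrent sequences per batch
    "tensor_parallel_size": 2,       # shard weights, freeing per-GPU memory
    "disable_log_stats": False,      # keep cumulative preemption counts in logs
}

# With vLLM installed, these can be passed straight to the engine, e.g.:
#   from vllm import LLM
#   llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", **engine_kwargs)
```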
### Chunked Prefill
Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.
In V1, **chunked prefill is enabled by default whenever possible**. With chunked prefill enabled, the scheduling policy prioritizes decode requests. It batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it.
- If `max_num_batched_tokens` is the same as `max_model_len`, this is almost equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).
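The decode-first budgeting can be illustrated with a toy calculation (a simplification for illustration, not vLLM's actual scheduler code): each pending decode consumes one token of the budget, and a pending prefill is chunked to whatever remains.

```python
# Toy sketch of chunked-prefill budgeting: decodes are scheduled first at
# one token each, and a pending prefill is chunked to the remaining budget.
def plan_step(num_decodes: int, prefill_len: int, max_num_batched_tokens: int):
    budget = max_num_batched_tokens - num_decodes  # decodes take priority
    prefill_chunk = min(prefill_len, max(budget, 0))
    return num_decodes, prefill_chunk

# 48 decode requests plus an 8000-token prompt under a 2048-token budget:
decodes, chunk = plan_step(48, 8000, 2048)
# schedules 48 decode tokens and a 2000-token prefill chunk; the rest of
# the prompt is prefilled in later steps
```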
### Tensor Parallelism (TP)
Tensor parallelism shards model parameters across multiple GPUs within each model layer. This is the most common strategy for large-model inference within a single node.
### Expert Parallelism (EP)
Expert parallelism is a specialized form of parallelism for Mixture of Experts (MoE) models, where different expert networks are distributed across GPUs.
**When to use:**
- Specifically for MoE models (such as DeepSeekV3, Qwen3MoE, and Llama-4)
- When you want to balance the expert computation load across GPUs
Expert parallelism is enabled by setting `enable_expert_parallel=True`, which will use expert parallelism instead of tensor parallelism for MoE layers.
The expert parallel degree is the same as whatever you have set for the tensor parallel size.
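As a configuration sketch (the keys are vLLM engine arguments; the values are illustrative):

```python
# Enabling expert parallelism alongside tensor parallelism. MoE layers then
# use EP at the same degree as the tensor parallel size (here, 4).
moe_engine_kwargs = {
    "tensor_parallel_size": 4,       # weight sharding for dense layers
    "enable_expert_parallel": True,  # MoE layers use EP instead of TP
}
```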
### Data Parallelism (DP)
Data parallelism replicates the entire model across multiple GPU sets and processes different batches of requests in parallel.
**When to use:**
- When you have enough GPUs to replicate the entire model
- When you need to scale throughput rather than model size
- In multi-user environments where isolation between request batches is beneficial
Data parallelism can be combined with the other parallelism strategies and is set by `data_parallel_size=N`.
Note that MoE layers will be sharded according to the product of the tensor parallel size and data parallel size.
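The resulting MoE sharding degree is a simple product, sketched here for concreteness:

```python
# MoE expert-layer sharding degree under combined parallelism, per the note
# above: experts shard across tensor_parallel_size * data_parallel_size GPUs.
def moe_shard_degree(tensor_parallel_size: int, data_parallel_size: int) -> int:
    return tensor_parallel_size * data_parallel_size

# e.g. TP=2, DP=4 shards MoE layers 8 ways
```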
### NUMA Binding for GPU Workers
On multi-socket GPU servers, GPU worker processes can lose performance if their
CPU execution and memory allocation drift away from the NUMA node nearest to the
GPU. vLLM can pin each worker with `numactl` before the Python subprocess starts,
so the interpreter, imports, and early allocator state are created with the
desired NUMA policy from the beginning.
Use `--numa-bind` to enable the feature. By default, vLLM auto-detects the
GPU-to-NUMA mapping and uses `--cpunodebind=<node> --membind=<node>` for each
worker. When you need a custom CPU policy, add `--numa-bind-cpus` and vLLM will
switch to `--physcpubind=<cpu-list> --membind=<node>`.
These `--numa-bind*` options only apply to GPU execution processes. They do not
configure the CPU backend's separate thread-affinity controls. Automatic
GPU-to-NUMA detection is currently implemented for CUDA/NVML-based platforms;
other GPU backends must provide explicit binding lists if they use these
options.
`--numa-bind-nodes` takes one non-negative NUMA node index per visible GPU, in
the same order as the GPU indices.
`--numa-bind-cpus` takes one `numactl` CPU list per visible GPU, in the same
order as the GPU indices. Each CPU list must use
`numactl --physcpubind` syntax such as `0-3`, `0,2,4-7`, or `16-31,48-63`.
```bash
# Auto-detect NUMA nodes for visible GPUs
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 4 \
--numa-bind
# Explicit NUMA-node mapping
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 4 \
--numa-bind \
--numa-bind-nodes 0 0 1 1
# Explicit CPU pinning, useful for PCT or other high-frequency core layouts
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 4 \
--numa-bind \
--numa-bind-nodes 0 0 1 1 \
--numa-bind-cpus 0-3 4-7 48-51 52-55
```
Notes:
- CLI usage forces multiprocessing to use the `spawn` method automatically. If you enable NUMA binding through the Python API, also set `VLLM_WORKER_MULTIPROC_METHOD=spawn`.
- Automatic detection relies on NVML and NUMA support from the host. If it cannot determine the mapping reliably, pass `--numa-bind-nodes` explicitly.
- Explicit `--numa-bind-nodes` and `--numa-bind-cpus` values must be valid `numactl` inputs. vLLM does a small amount of validation, but the effective binding semantics are still determined by `numactl`.
- The current implementation binds GPU execution processes such as `EngineCore` and multiprocessing workers. It does not apply NUMA binding to frontend API server processes or the DP coordinator.
- In containerized environments, NUMA policy syscalls may require extra permissions, such as `--cap-add SYS_NICE` when running via `docker run`.
### CPU Backend Thread Affinity
The CPU backend uses a different mechanism from `--numa-bind`. CPU execution is
configured through CPU-specific environment variables such as
`VLLM_CPU_OMP_THREADS_BIND`, `VLLM_CPU_NUM_OF_RESERVED_CPU`, and
`CPU_VISIBLE_MEMORY_NODES`, rather than the GPU-oriented `--numa-bind*` CLI
options.
By default, `VLLM_CPU_OMP_THREADS_BIND=auto` derives OpenMP placement from the
available CPU and NUMA topology for each CPU worker. To override the automatic
policy, set `VLLM_CPU_OMP_THREADS_BIND` explicitly using the CPU list format
documented for the CPU backend, or use `nobind` to disable this behavior.
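To override the automatic placement, these variables can be set before launching. A sketch for a hypothetical two-worker, two-socket host (the variable names come from the text above; the CPU lists and the `|` rank separator are examples based on the CPU backend's documented format):

```python
import os

# Hypothetical explicit OpenMP placement for two CPU workers on a
# two-socket host; adjust the CPU lists to your actual topology.
os.environ["VLLM_CPU_OMP_THREADS_BIND"] = "0-31|32-63"  # one list per rank
os.environ["VLLM_CPU_NUM_OF_RESERVED_CPU"] = "1"        # keep a core for serving
```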
For the current CPU backend setup and tuning guidance, see:
vLLM V1 uses a multi-process architecture (see [V1 Process Architecture](../design/arch_overview.md#v1-process-architecture)) where each process requires CPU resources. Underprovisioning CPU cores is a common source of performance degradation, especially in virtualized environments.
### Minimum CPU Requirements
For a deployment with `N` GPUs, there are at minimum:
- **1 API server process** -- handles HTTP requests, tokenization, and input processing
- **1 engine core process** -- runs the scheduler and coordinates GPU workers
- **N GPU worker processes** -- one per GPU, executes model forward passes
This means there are always at least **`2 + N` processes** competing for CPU time.
!!! warning
    Using fewer physical CPU cores than processes will cause contention and significantly degrade throughput and latency. The engine core process runs a busy loop and is particularly sensitive to CPU starvation.
The minimum is `2 + N` physical cores (1 for the API server, 1 for the engine core, and 1 per GPU worker). In practice, allocating more cores improves performance because the OS, PyTorch background threads, and other system processes also need CPU time.
!!! important
    Note that this refers to **physical CPU cores**. If your system has hyperthreading enabled, then 1 vCPU = 1 hyperthread = 1/2 physical CPU core, so you need at least `2 x (2 + N)` vCPUs.
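The sizing rule above can be written out explicitly:

```python
# Core-count arithmetic from the rule above: 1 API server process,
# 1 engine core process, and one worker per GPU.
def min_physical_cores(num_gpus: int) -> int:
    return 2 + num_gpus

# With hyperthreading (SMT), each vCPU is half a physical core.
def min_vcpus_with_smt(num_gpus: int) -> int:
    return 2 * min_physical_cores(num_gpus)

# A 4-GPU deployment needs at least 6 physical cores, or 12 vCPUs with SMT.
```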
### Data Parallel and Multi-API Server Deployments
When using data parallelism or multiple API servers, the CPU requirements increase:
```console
Minimum physical cores = A + DP + N + (1 if DP > 1 else 0)
```
where `A` is the API server count (defaults to `DP`), `DP` is the data parallel size, and `N` is the total number of GPUs. For example, with `DP=4, TP=2` on 8 GPUs and the default `A = DP = 4`, the minimum is `4 + 4 + 8 + 1 = 17` physical cores.
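As a direct transcription of the formula:

```python
# Sizing formula from above. The extra process when DP > 1 is assumed to be
# the DP coordinator mentioned earlier on this page.
def min_physical_cores_dp(api_servers: int, dp_size: int, num_gpus: int) -> int:
    return api_servers + dp_size + num_gpus + (1 if dp_size > 1 else 0)

# DP=4 on 8 GPUs with the default api_servers = DP = 4:
# 4 + 4 + 8 + 1 = 17 physical cores
```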
Beyond raw process counts, CPU capacity affects several stages of the serving path:
- **Input processing throughput** -- tokenization, chat template rendering, and multi-modal data loading all run on CPU
- **Scheduling latency** -- the engine core scheduler runs on CPU and directly affects how quickly new tokens are dispatched to the GPU workers
- **Output processing** -- detokenization, networking, and especially streaming token responses use CPU cycles
If you observe that GPU utilization is lower than expected, CPU contention may be the bottleneck. Increasing the number of available CPU cores, or even their clock speed, can significantly improve end-to-end performance.
### Attention Backends
vLLM supports multiple attention backends optimized for different hardware and use cases. The backend is automatically selected based on your GPU architecture, model type, and configuration, but you can also manually specify one for optimal performance.
For detailed information on available backends, their feature support, and how to configure them, see the [Attention Backend Feature Support](../design/attention_backends.md) documentation.
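To pin a backend manually, set the `VLLM_ATTENTION_BACKEND` environment variable before launching. The value below is just one example backend; consult the feature-support matrix linked above for what your hardware actually supports.

```python
import os

# Force a specific attention backend instead of relying on auto-selection.
# "FLASHINFER" is one example value; others exist per the backend docs.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
```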