# Conserving Memory

Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.

## Tensor Parallelism (TP)

Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.

The following code splits the model across 2 GPUs.

```python
from vllm import LLM

llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)
```
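
As a rough back-of-the-envelope check of what the split buys you, the sketch below estimates weight memory per GPU. It assumes about 8 billion parameters stored at 2 bytes each and an even split across GPUs; the helper name `weight_gib_per_gpu` is hypothetical, and real usage also includes activations, the KV cache, and any weights that are not sharded.

```python
def weight_gib_per_gpu(
    num_params: float, bytes_per_param: int, tensor_parallel_size: int
) -> float:
    """Approximate weight memory per GPU in GiB, assuming an even split."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / tensor_parallel_size / 2**30


# Assumed figures: ~8e9 parameters in 16-bit precision (2 bytes each).
single = weight_gib_per_gpu(8e9, 2, 1)
split = weight_gib_per_gpu(8e9, 2, 2)
print(f"TP=1: {single:.1f} GiB per GPU, TP=2: {split:.1f} GiB per GPU")
```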

!!! warning
    To ensure that vLLM initializes CUDA correctly, avoid calling CUDA-related functions (e.g. [torch.accelerator.set_device_index][])
    before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.

    To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
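
A minimal sketch of setting `CUDA_VISIBLE_DEVICES` from Python is shown here. The device IDs `0,1` are an assumption for illustration, and the vLLM lines are left as comments so that the snippet runs on a machine without GPUs.

```python
import os

# Restrict this process to GPUs 0 and 1 (illustrative device IDs).
# This must happen before anything initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"{len(visible)} device(s) visible: {visible}")

# Only after this point would you import vLLM and create the engine, e.g.:
# from vllm import LLM
# llm = LLM(
#     model="ibm-granite/granite-3.1-8b-instruct",
#     tensor_parallel_size=len(visible),
# )
```

You can also export the variable in your shell before launching the Python process, which avoids any dependence on import order.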

!!! note
    With tensor parallelism enabled, each process reads the whole model and then splits it into chunks, so the disk reading time grows in proportion to the tensor parallel size.

    You can convert the model checkpoint to a sharded checkpoint using [examples/offline_inference/save_sharded_state.py](../../examples/offline_inference/save_sharded_state.py). The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.

## Quantization

Quantized models take less memory at the cost of lower precision.

Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
and used directly without extra configuration.

Dynamic quantization is also supported via the `quantization` option. See the [quantization documentation](../features/quantization/README.md) for more details.

## Context length and batch size

You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
and the maximum batch size (`max_num_seqs` option).

```python
from vllm import LLM

llm = LLM(model="adept/fuyu-8b", max_model_len=2048, max_num_seqs=2)
```
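
To get a feel for why these two options matter, the sketch below estimates the KV cache needed if every sequence in a full batch reaches the maximum length. The helper `kv_cache_gib` and the model shape (32 layers, 8 KV heads, head dimension 128, 2 bytes per element) are assumptions for illustration; substitute the values from your model's configuration.

```python
def kv_cache_gib(
    max_model_len: int,
    max_num_seqs: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,
) -> float:
    """Worst-case KV cache size in GiB for a full batch at full length."""
    # The factor of 2 accounts for the separate key and value tensors.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return max_model_len * max_num_seqs * bytes_per_token / 2**30


# Hypothetical model shape; not taken from any specific checkpoint.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

small = kv_cache_gib(max_model_len=2048, max_num_seqs=2, **shape)
large = kv_cache_gib(max_model_len=8192, max_num_seqs=256, **shape)
print(f"small limits: {small} GiB, large limits: {large} GiB")
```

The estimate scales linearly in both `max_model_len` and `max_num_seqs`, so halving either one halves the worst-case requirement.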

## Reduce CUDA Graphs

By default, we optimize model inference using CUDA graphs, which take up extra GPU memory.

You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:

??? code

    ```python
    from vllm import LLM
    from vllm.config import CompilationConfig, CompilationMode

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        compilation_config=CompilationConfig(
            mode=CompilationMode.VLLM_COMPILE,
            # By default, it goes up to max_num_seqs
            cudagraph_capture_sizes=[1, 2, 4, 8, 16],
        ),
    )
    ```

You can disable graph capturing completely via the `enforce_eager` flag:

```python
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enforce_eager=True)
```
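
If you keep CUDA graphs enabled, a short list for `cudagraph_capture_sizes` can be generated rather than typed by hand. The helper below is a hypothetical convenience function, not part of vLLM; it builds powers of two up to a cap that you choose, on the assumption that capturing fewer sizes uses less memory.

```python
def power_of_two_sizes(max_size: int) -> list[int]:
    """Return 1, 2, 4, ... up to and including max_size where possible."""
    sizes = []
    size = 1
    while size <= max_size:
        sizes.append(size)
        size *= 2
    return sizes


capture_sizes = power_of_two_sizes(16)
print(capture_sizes)

# The list could then be passed as
# CompilationConfig(cudagraph_capture_sizes=capture_sizes).
```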

## Adjust cache size

If you run out of CPU RAM, try the following options:

- (Multi-modal models only) you can set the size of the multi-modal cache using the `mm_processor_cache_gb` engine argument (default 4 GiB).
- (CPU backend only) you can set the size of the KV cache using the `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
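
The two options above can be sketched as follows. The values (2 GiB and 1 GiB) are illustrative assumptions rather than recommendations, and the vLLM lines are left as comments so that the snippet runs without vLLM installed.

```python
import os

# Reserve 2 GiB for the KV cache on the CPU backend (illustrative value).
# Set this before the engine is created.
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "2"

# Shrink the multi-modal cache to 1 GiB (illustrative value).
engine_kwargs = {"mm_processor_cache_gb": 1}
print(os.environ["VLLM_CPU_KVCACHE_SPACE"], engine_kwargs)

# from vllm import LLM
# llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", **engine_kwargs)
```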

## Multi-modal input limits

You can allow a smaller number of multi-modal items per prompt to reduce the memory footprint of the model:

```python
from vllm import LLM

# Accept up to 3 images and 1 video per prompt
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    limit_mm_per_prompt={"image": 3, "video": 1},
)
```

You can go a step further and disable unused modalities completely by setting their limits to zero.
For example, if your application only accepts image input, there is no need to allocate any memory for videos.

```python
from vllm import LLM

# Accept any number of images but no videos
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    limit_mm_per_prompt={"video": 0},
)
```

You can even run a multi-modal model for text-only inference:

```python
from vllm import LLM

# Don't accept images. Just text.
llm = LLM(
    model="google/gemma-3-27b-it",
    limit_mm_per_prompt={"image": 0},
)
```

### Configurable options

`limit_mm_per_prompt` also accepts configurable options per modality. In the configurable form, you still specify `count`, and you may optionally provide size hints that control how vLLM profiles and reserves memory for your multi-modal inputs. This helps you tune memory for the actual media you expect, instead of the model's absolute maxima.

Configurable options by modality:

- `image`: `{"count": int, "width": int, "height": int}`
- `video`: `{"count": int, "num_frames": int, "width": int, "height": int}`
- `audio`: `{"count": int, "length": int}`

Details can be found in [`ImageDummyOptions`][vllm.config.multimodal.ImageDummyOptions], [`VideoDummyOptions`][vllm.config.multimodal.VideoDummyOptions], and [`AudioDummyOptions`][vllm.config.multimodal.AudioDummyOptions].

Example:

```python
from vllm import LLM

# Up to 5 images per prompt, profile with 512x512.
# Up to 1 video per prompt, profile with 32 frames at 640x640.
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    limit_mm_per_prompt={
        "image": {"count": 5, "width": 512, "height": 512},
        "video": {"count": 1, "num_frames": 32, "width": 640, "height": 640},
    },
)
```

For backward compatibility, passing an integer works as before and is interpreted as `{"count": <int>}`. For example:

- `limit_mm_per_prompt={"image": 5}` is equivalent to `limit_mm_per_prompt={"image": {"count": 5}}`
- You can mix formats: `limit_mm_per_prompt={"image": 5, "video": {"count": 1, "num_frames": 32, "width": 640, "height": 640}}`
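
The equivalence between the two formats can be illustrated in plain Python. The helper `normalize_limits` below is hypothetical and only mirrors the rule described above; it is not how vLLM implements the conversion.

```python
def normalize_limits(limits: dict) -> dict:
    """Expand integer limits into the {"count": <int>} form."""
    return {
        modality: {"count": value} if isinstance(value, int) else dict(value)
        for modality, value in limits.items()
    }


mixed = {
    "image": 5,
    "video": {"count": 1, "num_frames": 32, "width": 640, "height": 640},
}
normalized = normalize_limits(mixed)
print(normalized)
```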

!!! note
    - The size hints affect memory profiling only. They shape the dummy inputs used to compute reserved activation sizes. They do not change how inputs are actually processed at inference time.
    - If a hint exceeds what the model can accept, vLLM clamps it to the model's effective maximum and may log a warning.

!!! warning
    These size hints currently only affect activation memory profiling. Encoder cache size is determined by the actual inputs at runtime and is not limited by these hints.

## Multi-modal processor arguments

For certain models, you can adjust the multi-modal processor arguments to
reduce the size of the processed multi-modal inputs, which in turn saves memory.

Here are some examples:

```python
from vllm import LLM

# Available for Qwen2-VL series models
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    mm_processor_kwargs={"max_pixels": 768 * 768},  # Default is 1280 * 28 * 28
)

# Available for InternVL series models
llm = LLM(
    model="OpenGVLab/InternVL2-2B",
    mm_processor_kwargs={"max_dynamic_patch": 4},  # Default is 12
)
```
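
To gauge how much a smaller `max_pixels` saves, you can compare pixel budgets directly. The sketch below assumes that one visual token covers a 28 x 28 pixel block, which is how the default `1280 * 28 * 28` is read here; treat that factor and the helper name `max_visual_tokens` as assumptions rather than a statement about the processor's internals.

```python
# Assumption: one visual token per 28x28 pixel block.
PIXELS_PER_TOKEN = 28 * 28


def max_visual_tokens(max_pixels: int) -> int:
    """Approximate upper bound on visual tokens for one image."""
    return max_pixels // PIXELS_PER_TOKEN


default_tokens = max_visual_tokens(1280 * 28 * 28)
reduced_tokens = max_visual_tokens(768 * 768)
print(f"default: {default_tokens} tokens, reduced: {reduced_tokens} tokens")
```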