`torch.compile` can now be applied to multimodal encoders and miscellaneous nn modules in vLLM, including vision-language models like LLaMA 4, Qwen-VL,
and similar encoder-based architectures.
This document covers the basics of how the `torch.compile` integration works for multimodal encoders in vLLM, as well as how to apply the decorator
to new models to improve performance.
!!! note
For general information about `torch.compile` integration in vLLM, see the [torch.compile design document](./torch_compile.md).
2. The `with set_model_tag("<component_name>", is_encoder=True)` context manager should wrap the nn.Module's instantiation. Since `torch.compile`
relies on caching artifacts to reduce startup time, we must properly propagate the `<component_name>` information to the cache in order to avoid collisions
with the LLM text backbone or with other instances of the same artifact (as is the case with the vision block). `is_encoder=True` is also needed for encoder
components (see Compile ranges below).
3. The `with set_forward_context` context manager should wrap the nn.Module's forward call. This properly forwards the `vllm_config`, which is needed
for the `torch.compile` integration.
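The steps above can be sketched as follows. This is a simplified, runnable illustration only, not the real vLLM API: the context managers and decorator below are no-op stand-ins so the pattern runs without vLLM or PyTorch installed. In a real model you would import `support_torch_compile`, `set_model_tag`, and `set_forward_context` from vLLM itself.

```python
from contextlib import contextmanager

# No-op stand-ins for the real vLLM helpers, so this sketch is self-contained.
@contextmanager
def set_model_tag(tag, is_encoder=False):
    yield

@contextmanager
def set_forward_context(attn_metadata, vllm_config):
    yield

def support_torch_compile(cls):
    return cls

@support_torch_compile
class VisionMLP:  # stands in for an nn.Module
    def __init__(self, scale):
        self.scale = scale

    def forward(self, xs):
        return [x * self.scale for x in xs]

# Step 2: tag the component at instantiation so compile-cache artifacts
# for this encoder component do not collide with the text backbone's.
with set_model_tag("vision_mlp", is_encoder=True):
    mlp = VisionMLP(scale=2)

# Step 3: wrap the forward call so vllm_config reaches torch.compile.
with set_forward_context(attn_metadata=None, vllm_config=None):
    out = mlp.forward([1, 2, 3])
```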
### CompilationConfig
With the exception of `compile_mm_encoder: true`, the multimodal encoder inherits the same compilation config as the text LLM. We may extend
this with more encoder-specific configuration in the future.
## Applying torch.compile to a New Multimodal Model/Component
To apply `support_torch_compile` to a new general nn.Module, we advise following the same steps in [`debug_vllm_compile`](./debug_vllm_compile.md); this includes:
1. Applying `support_torch_compile` to small modules first (such as basic MLP layers), then expanding to progressively larger modules until performance is satisfactory
2. Leveraging [`tlparse`](https://github.com/meta-pytorch/tlparse) to identify and eliminate the source of recompiles and graph breaks
3. Using `dynamic_arg_dims` and a proper `dynamic_shapes_config` to handle dynamism.
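As a rough sketch of step 3, the decorator can be told which argument dimensions are dynamic. The decorator below is a simplified stand-in for vLLM's `support_torch_compile` (the argument name `pixel_values` and the class are illustrative), so the example runs without vLLM or PyTorch:

```python
# Simplified stand-in illustrating how dynamic_arg_dims marks which
# dimensions may vary at runtime (here: dim 0 of pixel_values, i.e. a
# variable number of image patches).
def support_torch_compile(*, dynamic_arg_dims=None):
    def wrap(cls):
        cls._dynamic_arg_dims = dynamic_arg_dims or {}
        return cls
    return wrap

@support_torch_compile(dynamic_arg_dims={"pixel_values": 0})
class VisionEncoderBlock:
    def forward(self, pixel_values):
        return pixel_values

print(VisionEncoderBlock._dynamic_arg_dims)  # {'pixel_values': 0}
```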
### Common pitfalls
## VllmBackend Feature Support
### Compile ranges
The `torch.compile` integration tries to rely on `max_batch_size` to infer compilation ranges for dynamic shapes; however, for modules used in the encoder, this
shape can be difficult to infer because the range of input shapes the encoder may see is unspecified. We therefore rely on `is_encoder=True` in `set_model_tag`
to signal to `torch.compile` that this range cannot be inferred, and we default to the range `(1, MAX_INT)`.
!!! note
We may tighten this range for better performance in the future.
### Cudagraphs
We have not yet explored compilation for multimodal encoders with CUDAGraph integration; behavior is currently unspecified.
## Troubleshooting
### Graph Breaks in Vision Encoders
Some vision encoder operations may cause graph breaks. To identify them:
```bash
TORCH_LOGS="+dynamo" vllm serve <MODEL>
```
Common causes of graph breaks in multimodal models:
- **Dynamic image sizes**: Use `dynamic_shapes_config` to handle variable resolutions
- **Untraceable operations**: Some operations (such as `.tolist()`) may not be supported by Dynamo
- **Conditional processing**: Data-dependent branching based on image properties
### Compilation Errors
If compilation fails for a multimodal model:
1. **Disable and test**: First verify the model works without compilation:
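One option (assuming vLLM's `--enforce-eager` flag, which runs the model in eager mode without `torch.compile` or CUDA graphs) is:

```bash
# Run the model in eager mode to rule out compilation as the failure source
vllm serve <MODEL> --enforce-eager
```

If the model works in eager mode, the failure is isolated to the compilation path.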