[Frontend] Use engine argument to control MM cache size (#22441)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Author: Cyrus Leung
Date: 2025-08-08 00:47:10 +08:00
Committed by: GitHub
Parent: 8c9da6be22
Commit: 139d155781
13 changed files with 101 additions and 47 deletions


@@ -161,12 +161,18 @@ By default, the multi-modal processor cache is enabled to avoid repeatedly processing
 the same multi-modal inputs via Hugging Face `AutoProcessor`,
 which commonly occurs in multi-turn conversations.
-You can adjust the size of the cache via `VLLM_MM_INPUT_CACHE_GIB` environment variable
+You can adjust the size of the cache by setting the value of `mm_processor_cache_gb`
 (default 4 GiB per API process + 4 GiB per engine core process).
-If you do not benefit much from the cache, you can disable it completely via `disable_mm_preprocessor_cache`:
+If you do not benefit much from the cache, you can disable it completely via `mm_processor_cache_gb=0`.
+
+Examples:
+
 ```python
+# Use a larger cache
 llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
-          disable_mm_preprocessor_cache=True)
+          mm_processor_cache_gb=8)
+
+# Disable the cache
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+          mm_processor_cache_gb=0)
 ```
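The new wording notes that the default budget is 4 GiB *per API process* plus 4 GiB *per engine core process*, so the cache's total footprint scales with process count, not just with the `mm_processor_cache_gb` value. A minimal sketch of that arithmetic (the helper name and the one-process-each default are assumptions for illustration, not part of vLLM's API):

```python
# Hypothetical helper: estimate the total MM processor cache footprint,
# following the doc's "4 GiB per API process + 4 GiB per engine core
# process" note. Each process holds its own cache of `cache_gb` GiB.
def total_mm_cache_gib(cache_gb: float = 4.0,
                       api_procs: int = 1,
                       engine_procs: int = 1) -> float:
    return cache_gb * (api_procs + engine_procs)

print(total_mm_cache_gib())      # defaults: 4 GiB x 2 processes = 8.0
print(total_mm_cache_gib(0.0))   # mm_processor_cache_gb=0 disables it: 0.0
```

This is why `mm_processor_cache_gb=0` fully disables the cache: zeroing the per-process size zeroes the footprint in every process.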