[Doc] Add more tips to avoid OOM (#16765)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Cyrus Leung authored on 2025-04-17 17:54:34 +08:00; committed by GitHub
parent a6481525b8
commit 61a44a0b22
2 changed files with 33 additions and 0 deletions

@@ -28,6 +28,8 @@ Please refer to the above pages for more details about each API.
[API Reference](/api/offline_inference/index)
:::
(configuration-options)=
## Configuration Options
This section lists the most common options for running the vLLM engine.
@@ -184,6 +186,29 @@ llm = LLM(model="google/gemma-3-27b-it",
          limit_mm_per_prompt={"image": 0})
```
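Instead of disabling a modality outright, you can also cap how many items are accepted per prompt. A minimal sketch, where the limit value is illustrative rather than a recommendation:
```python
from vllm import LLM

# Accept at most 2 images per prompt; requests exceeding the limit are rejected.
# (The limit of 2 is illustrative; pick one that fits your workload.)
llm = LLM(model="google/gemma-3-27b-it",
          limit_mm_per_prompt={"image": 2})
```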
#### Multi-modal processor arguments
For certain models, you can adjust the multi-modal processor arguments to
reduce the size of the processed multi-modal inputs, which in turn saves memory.
Here are some examples:
```python
from vllm import LLM

# Available for Qwen2-VL series models
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          mm_processor_kwargs={
              "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
          })

# Available for InternVL series models
llm = LLM(model="OpenGVLab/InternVL2-2B",
          mm_processor_kwargs={
              "max_dynamic_patch": 4,  # Default is 12
          })
```
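These processor arguments can also be combined with the per-prompt limits shown earlier. A hedged sketch, with values that are illustrative starting points rather than tuned recommendations:
```python
from vllm import LLM

# Shrink each processed image and cap the number of images per prompt.
# Both values are illustrative; adjust them for your workload.
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          mm_processor_kwargs={"max_pixels": 768 * 768},
          limit_mm_per_prompt={"image": 1})
```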
### Performance optimization and tuning
You can potentially improve the performance of vLLM by tuning various options.
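For example, engine options such as `gpu_memory_utilization`, `max_model_len`, and `max_num_seqs` are common starting points. A minimal sketch, where the values are illustrative rather than recommendations:
```python
from vllm import LLM

# Commonly tuned engine options (illustrative values, not recommendations):
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          gpu_memory_utilization=0.85,  # fraction of GPU memory vLLM may reserve (default 0.9)
          max_model_len=4096,           # shorter context length means a smaller KV cache
          max_num_seqs=64)              # cap on concurrently scheduled sequences
```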