[Refactor] Use data parser for matching data items to multi-modal UUIDs (#32955)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
This commit is contained in:
Cyrus Leung
2026-01-26 15:00:28 +08:00
committed by GitHub
parent ee484b3f4b
commit 11b556878b
14 changed files with 701 additions and 604 deletions

View File

@@ -20,67 +20,6 @@ To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:
- `prompt`: The prompt should follow the format that is documented on HuggingFace.
- `multi_modal_data`: This is a dictionary that follows the schema defined in [vllm.multimodal.inputs.MultiModalDataDict][].
### Stable UUIDs for Caching (multi_modal_uuids)
When using multi-modal inputs, vLLM normally hashes each media item by content to enable caching across requests. You can optionally pass `multi_modal_uuids` to provide your own stable IDs for each item so caching can reuse work across requests without rehashing the raw content.
??? code
```python
from vllm import LLM
from PIL import Image
# Qwen2.5-VL example with two images
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
prompt = "USER: <image><image>\nDescribe the differences.\nASSISTANT:"
img_a = Image.open("/path/to/a.jpg")
img_b = Image.open("/path/to/b.jpg")
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {"image": [img_a, img_b]},
# Provide stable IDs for caching.
# Requirements (matched by this example):
# - Include every modality present in multi_modal_data.
# - For lists, provide the same number of entries.
# - Use None to fall back to content hashing for that item.
"multi_modal_uuids": {"image": ["sku-1234-a", None]},
})
for o in outputs:
print(o.outputs[0].text)
```
Using UUIDs, you can also skip sending media data entirely if you expect cache hits for respective items. Note that the request will fail if the skipped media doesn't have a corresponding UUID, or if the UUID fails to hit the cache.
??? code
```python
from vllm import LLM
from PIL import Image
# Qwen2.5-VL example with two images
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
prompt = "USER: <image><image>\nDescribe the differences.\nASSISTANT:"
img_b = Image.open("/path/to/b.jpg")
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {"image": [None, img_b]},
# Since img_a is expected to be cached, we can skip sending the actual
# image entirely.
"multi_modal_uuids": {"image": ["sku-1234-a", None]},
})
for o in outputs:
print(o.outputs[0].text)
```
!!! warning
If both multimodal processor caching and prefix caching are disabled, user-provided `multi_modal_uuids` are ignored.
### Image Inputs
You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:
@@ -397,7 +336,8 @@ No manual conversion is needed - vLLM handles the channel normalization automati
### Embedding Inputs
To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
pass a tensor of shape `(..., hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
The exact shape depends on the model being used.
You must enable this feature via `enable_mm_embeds=True`.
@@ -418,8 +358,7 @@ You must enable this feature via `enable_mm_embeds=True`.
# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
# Embeddings for single image
# torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
# For most models, `image_embeds` has shape: (num_images, image_feature_size, hidden_size)
image_embeds = torch.load(...)
outputs = llm.generate({
@@ -430,21 +369,8 @@ You must enable this feature via `enable_mm_embeds=True`.
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
```
For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:
??? code
```python
# Construct the prompt based on your model
prompt = ...
# Embeddings for multiple images
# torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
image_embeds = torch.load(...)
# Qwen2-VL
# Additional examples for models that require extra fields
llm = LLM(
"Qwen/Qwen2-VL-2B-Instruct",
limit_mm_per_prompt={"image": 4},
@@ -452,13 +378,15 @@ For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embedd
)
mm_data = {
"image": {
"image_embeds": image_embeds,
# Shape: (total_feature_size, hidden_size)
# total_feature_size = sum(image_feature_size for image in images)
"image_embeds": torch.load(...),
# Shape: (num_images, 3)
# image_grid_thw is needed to calculate positional encoding.
"image_grid_thw": torch.load(...), # torch.Tensor of shape (1, 3),
"image_grid_thw": torch.load(...),
}
}
# MiniCPM-V
llm = LLM(
"openbmb/MiniCPM-V-2_6",
trust_remote_code=True,
@@ -467,20 +395,14 @@ For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embedd
)
mm_data = {
"image": {
"image_embeds": image_embeds,
# Shape: (num_images, num_slices, hidden_size)
# num_slices can differ for each image
"image_embeds": [torch.load(...) for image in images],
# Shape: (num_images, 2)
# image_sizes is needed to calculate details of the sliced image.
"image_sizes": [image.size for image in images], # list of image sizes
"image_sizes": [image.size for image in images],
}
}
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": mm_data,
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
```
For Qwen3-VL, the `image_embeds` should contain both the base image embedding and deepstack features.
@@ -501,8 +423,8 @@ You can pass pre-computed audio embeddings similar to image embeddings:
# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <audio>\nWhat is in this audio?\nASSISTANT:"
# Load pre-computed audio embeddings
# torch.Tensor of shape (1, audio_feature_size, hidden_size of LM)
# Load pre-computed audio embeddings, usually with shape:
# (num_audios, audio_feature_size, hidden_size of LM)
audio_embeds = torch.load(...)
outputs = llm.generate({
@@ -515,6 +437,67 @@ You can pass pre-computed audio embeddings similar to image embeddings:
print(generated_text)
```
### Cached Inputs
When using multi-modal inputs, vLLM normally hashes each media item by content to enable caching across requests. You can optionally pass `multi_modal_uuids` to provide your own stable IDs for each item so caching can reuse work across requests without rehashing the raw content.
??? code
```python
from vllm import LLM
from PIL import Image
# Qwen2.5-VL example with two images
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
prompt = "USER: <image><image>\nDescribe the differences.\nASSISTANT:"
img_a = Image.open("/path/to/a.jpg")
img_b = Image.open("/path/to/b.jpg")
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {"image": [img_a, img_b]},
# Provide stable IDs for caching.
# Requirements (matched by this example):
# - Include every modality present in multi_modal_data.
# - For lists, provide the same number of entries.
# - Use None to fall back to content hashing for that item.
"multi_modal_uuids": {"image": ["sku-1234-a", None]},
})
for o in outputs:
print(o.outputs[0].text)
```
Using UUIDs, you can also skip sending media data entirely if you expect cache hits for respective items. Note that the request will fail if the skipped media doesn't have a corresponding UUID, or if the UUID fails to hit the cache.
??? code
```python
from vllm import LLM
from PIL import Image
# Qwen2.5-VL example with two images
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
prompt = "USER: <image><image>\nDescribe the differences.\nASSISTANT:"
img_b = Image.open("/path/to/b.jpg")
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {"image": [None, img_b]},
# Since img_a is expected to be cached, we can skip sending the actual
# image entirely.
"multi_modal_uuids": {"image": ["sku-1234-a", None]},
})
for o in outputs:
print(o.outputs[0].text)
```
!!! warning
If both multimodal processor caching and prefix caching are disabled, user-provided `multi_modal_uuids` are ignored.
## Online Serving
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat). Media inputs also support optional UUIDs users can provide to uniquely identify each media, which is used to cache the media results across requests.
@@ -879,7 +862,11 @@ Full example: [examples/online_serving/openai_chat_completion_client_for_multimo
### Embedding Inputs
To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
pass a tensor of shape `(..., hidden_size of LM)` for each item to the corresponding field of the multi-modal dictionary.
!!! important
Unlike offline inference, the embeddings for each item must be passed separately
in order for placeholder tokens to be applied correctly by the chat template.
You must enable this feature via the `--enable-mm-embeds` flag in `vllm serve`.
@@ -897,11 +884,6 @@ The following example demonstrates how to pass image embeddings to the OpenAI se
```python
from vllm.utils.serial_utils import tensor2base64
image_embedding = torch.load(...)
grid_thw = torch.load(...) # Required by Qwen/Qwen2-VL-2B-Instruct
base64_image_embedding = tensor2base64(image_embedding)
client = OpenAI(
# defaults to os.environ.get("OPENAI_API_KEY")
api_key=openai_api_key,
@@ -912,29 +894,33 @@ The following example demonstrates how to pass image embeddings to the OpenAI se
model = "llava-hf/llava-1.5-7b-hf"
embeds = {
"type": "image_embeds",
"image_embeds": f"{base64_image_embedding}",
"image_embeds": tensor2base64(torch.load(...)), # Shape: (image_feature_size, hidden_size)
"uuid": image_url, # Optional
}
# Pass additional parameters (available to Qwen2-VL and MiniCPM-V)
# Additional examples for models that require extra fields
model = "Qwen/Qwen2-VL-2B-Instruct"
embeds = {
"type": "image_embeds",
"image_embeds": {
"image_embeds": f"{base64_image_embedding}", # Required
"image_grid_thw": f"{base64_image_grid_thw}", # Required by Qwen/Qwen2-VL-2B-Instruct
"image_embeds": tensor2base64(torch.load(...)), # Shape: (image_feature_size, hidden_size)
"image_grid_thw": tensor2base64(torch.load(...)), # Shape: (3,)
},
"uuid": image_url, # Optional
}
model = "openbmb/MiniCPM-V-2_6"
embeds = {
"type": "image_embeds",
"image_embeds": {
"image_embeds": f"{base64_image_embedding}", # Required
"image_sizes": f"{base64_image_sizes}", # Required by openbmb/MiniCPM-V-2_6
"image_embeds": tensor2base64(torch.load(...)), # Shape: (num_slices, hidden_size)
"image_sizes": tensor2base64(torch.load(...)), # Shape: (2,)
},
"uuid": image_url, # Optional
}
# Single image input
chat_completion = client.chat.completions.create(
messages=[
{
@@ -954,9 +940,55 @@ The following example demonstrates how to pass image embeddings to the OpenAI se
],
model=model,
)
# Multi image input
chat_completion = client.chat.completions.create(
messages=[
{
"role": "system",
"content": "You are a helpful assistant.",
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this image?",
},
embeds,
embeds,
],
},
],
model=model,
)
# Multi image input (interleaved)
chat_completion = client.chat.completions.create(
messages=[
{
"role": "system",
"content": "You are a helpful assistant.",
},
{
"role": "user",
"content": [
embeds,
{
"type": "text",
"text": "What's in this image?",
},
embeds,
],
},
],
model=model,
)
```
For Online Serving, you can also skip sending media if you expect cache hits with provided UUIDs. You can do so by sending media like this:
### Cached Inputs
Just like with offline inference, you can skip sending media if you expect cache hits with provided UUIDs. You can do so by sending media like this:
??? code
@@ -990,13 +1022,3 @@ For Online Serving, you can also skip sending media if you expect cache hits wit
},
```
!!! note
Multiple messages can now contain `{"type": "image_embeds"}`, enabling you to pass multiple image embeddings in a single request (similar to regular images). The number of embeddings is limited by `--limit-mm-per-prompt`.
**Important**: The embedding shape format differs based on the number of embeddings:
- **Single embedding**: 3D tensor of shape `(1, feature_size, hidden_size)`
- **Multiple embeddings**: List of 2D tensors, each of shape `(feature_size, hidden_size)`
If used with a model that requires additional parameters, you must also provide a tensor for each of them, e.g. `image_grid_thw`, `image_sizes`, etc.