[Refactor] Use data parser for matching data items to multi-modal UUIDs (#32955)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
This commit is contained in:
@@ -20,67 +20,6 @@ To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:
|
||||
- `prompt`: The prompt should follow the format that is documented on HuggingFace.
|
||||
- `multi_modal_data`: This is a dictionary that follows the schema defined in [vllm.multimodal.inputs.MultiModalDataDict][].
|
||||
|
||||
### Stable UUIDs for Caching (multi_modal_uuids)
|
||||
|
||||
When using multi-modal inputs, vLLM normally hashes each media item by content to enable caching across requests. You can optionally pass `multi_modal_uuids` to provide your own stable IDs for each item so caching can reuse work across requests without rehashing the raw content.
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
from PIL import Image
|
||||
|
||||
# Qwen2.5-VL example with two images
|
||||
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
|
||||
|
||||
prompt = "USER: <image><image>\nDescribe the differences.\nASSISTANT:"
|
||||
img_a = Image.open("/path/to/a.jpg")
|
||||
img_b = Image.open("/path/to/b.jpg")
|
||||
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": {"image": [img_a, img_b]},
|
||||
# Provide stable IDs for caching.
|
||||
# Requirements (matched by this example):
|
||||
# - Include every modality present in multi_modal_data.
|
||||
# - For lists, provide the same number of entries.
|
||||
# - Use None to fall back to content hashing for that item.
|
||||
"multi_modal_uuids": {"image": ["sku-1234-a", None]},
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
print(o.outputs[0].text)
|
||||
```
|
||||
|
||||
Using UUIDs, you can also skip sending media data entirely if you expect cache hits for respective items. Note that the request will fail if the skipped media doesn't have a corresponding UUID, or if the UUID fails to hit the cache.
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
from PIL import Image
|
||||
|
||||
# Qwen2.5-VL example with two images
|
||||
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
|
||||
|
||||
prompt = "USER: <image><image>\nDescribe the differences.\nASSISTANT:"
|
||||
img_b = Image.open("/path/to/b.jpg")
|
||||
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": {"image": [None, img_b]},
|
||||
# Since img_a is expected to be cached, we can skip sending the actual
|
||||
# image entirely.
|
||||
"multi_modal_uuids": {"image": ["sku-1234-a", None]},
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
print(o.outputs[0].text)
|
||||
```
|
||||
|
||||
!!! warning
|
||||
If both multimodal processor caching and prefix caching are disabled, user-provided `multi_modal_uuids` are ignored.
|
||||
|
||||
### Image Inputs
|
||||
|
||||
You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:
|
||||
@@ -397,7 +336,8 @@ No manual conversion is needed - vLLM handles the channel normalization automati
|
||||
### Embedding Inputs
|
||||
|
||||
To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
|
||||
pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
|
||||
pass a tensor of shape `(..., hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
|
||||
The exact shape depends on the model being used.
|
||||
|
||||
You must enable this feature via `enable_mm_embeds=True`.
|
||||
|
||||
@@ -418,8 +358,7 @@ You must enable this feature via `enable_mm_embeds=True`.
|
||||
# Refer to the HuggingFace repo for the correct format to use
|
||||
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
|
||||
|
||||
# Embeddings for single image
|
||||
# torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
|
||||
# For most models, `image_embeds` has shape: (num_images, image_feature_size, hidden_size)
|
||||
image_embeds = torch.load(...)
|
||||
|
||||
outputs = llm.generate({
|
||||
@@ -430,21 +369,8 @@ You must enable this feature via `enable_mm_embeds=True`.
|
||||
for o in outputs:
|
||||
generated_text = o.outputs[0].text
|
||||
print(generated_text)
|
||||
```
|
||||
|
||||
For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
# Construct the prompt based on your model
|
||||
prompt = ...
|
||||
|
||||
# Embeddings for multiple images
|
||||
# torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
|
||||
image_embeds = torch.load(...)
|
||||
|
||||
# Qwen2-VL
|
||||
# Additional examples for models that require extra fields
|
||||
llm = LLM(
|
||||
"Qwen/Qwen2-VL-2B-Instruct",
|
||||
limit_mm_per_prompt={"image": 4},
|
||||
@@ -452,13 +378,15 @@ For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embedd
|
||||
)
|
||||
mm_data = {
|
||||
"image": {
|
||||
"image_embeds": image_embeds,
|
||||
# Shape: (total_feature_size, hidden_size)
|
||||
# total_feature_size = sum(image_feature_size for image in images)
|
||||
"image_embeds": torch.load(...),
|
||||
# Shape: (num_images, 3)
|
||||
# image_grid_thw is needed to calculate positional encoding.
|
||||
"image_grid_thw": torch.load(...), # torch.Tensor of shape (1, 3),
|
||||
"image_grid_thw": torch.load(...),
|
||||
}
|
||||
}
|
||||
|
||||
# MiniCPM-V
|
||||
llm = LLM(
|
||||
"openbmb/MiniCPM-V-2_6",
|
||||
trust_remote_code=True,
|
||||
@@ -467,20 +395,14 @@ For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embedd
|
||||
)
|
||||
mm_data = {
|
||||
"image": {
|
||||
"image_embeds": image_embeds,
|
||||
# Shape: (num_images, num_slices, hidden_size)
|
||||
# num_slices can differ for each image
|
||||
"image_embeds": [torch.load(...) for image in images],
|
||||
# Shape: (num_images, 2)
|
||||
# image_sizes is needed to calculate details of the sliced image.
|
||||
"image_sizes": [image.size for image in images], # list of image sizes
|
||||
"image_sizes": [image.size for image in images],
|
||||
}
|
||||
}
|
||||
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": mm_data,
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
generated_text = o.outputs[0].text
|
||||
print(generated_text)
|
||||
```
|
||||
|
||||
For Qwen3-VL, the `image_embeds` should contain both the base image embedding and deepstack features.
|
||||
@@ -501,8 +423,8 @@ You can pass pre-computed audio embeddings similar to image embeddings:
|
||||
# Refer to the HuggingFace repo for the correct format to use
|
||||
prompt = "USER: <audio>\nWhat is in this audio?\nASSISTANT:"
|
||||
|
||||
# Load pre-computed audio embeddings
|
||||
# torch.Tensor of shape (1, audio_feature_size, hidden_size of LM)
|
||||
# Load pre-computed audio embeddings, usually with shape:
|
||||
# (num_audios, audio_feature_size, hidden_size of LM)
|
||||
audio_embeds = torch.load(...)
|
||||
|
||||
outputs = llm.generate({
|
||||
@@ -515,6 +437,67 @@ You can pass pre-computed audio embeddings similar to image embeddings:
|
||||
print(generated_text)
|
||||
```
|
||||
|
||||
### Cached Inputs
|
||||
|
||||
When using multi-modal inputs, vLLM normally hashes each media item by content to enable caching across requests. You can optionally pass `multi_modal_uuids` to provide your own stable IDs for each item so caching can reuse work across requests without rehashing the raw content.
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
from PIL import Image
|
||||
|
||||
# Qwen2.5-VL example with two images
|
||||
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
|
||||
|
||||
prompt = "USER: <image><image>\nDescribe the differences.\nASSISTANT:"
|
||||
img_a = Image.open("/path/to/a.jpg")
|
||||
img_b = Image.open("/path/to/b.jpg")
|
||||
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": {"image": [img_a, img_b]},
|
||||
# Provide stable IDs for caching.
|
||||
# Requirements (matched by this example):
|
||||
# - Include every modality present in multi_modal_data.
|
||||
# - For lists, provide the same number of entries.
|
||||
# - Use None to fall back to content hashing for that item.
|
||||
"multi_modal_uuids": {"image": ["sku-1234-a", None]},
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
print(o.outputs[0].text)
|
||||
```
|
||||
|
||||
Using UUIDs, you can also skip sending media data entirely if you expect cache hits for respective items. Note that the request will fail if the skipped media doesn't have a corresponding UUID, or if the UUID fails to hit the cache.
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
from PIL import Image
|
||||
|
||||
# Qwen2.5-VL example with two images
|
||||
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
|
||||
|
||||
prompt = "USER: <image><image>\nDescribe the differences.\nASSISTANT:"
|
||||
img_b = Image.open("/path/to/b.jpg")
|
||||
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": {"image": [None, img_b]},
|
||||
# Since img_a is expected to be cached, we can skip sending the actual
|
||||
# image entirely.
|
||||
"multi_modal_uuids": {"image": ["sku-1234-a", None]},
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
print(o.outputs[0].text)
|
||||
```
|
||||
|
||||
!!! warning
|
||||
If both multimodal processor caching and prefix caching are disabled, user-provided `multi_modal_uuids` are ignored.
|
||||
|
||||
## Online Serving
|
||||
|
||||
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat). Media inputs also support optional UUIDs users can provide to uniquely identify each media, which is used to cache the media results across requests.
|
||||
@@ -879,7 +862,11 @@ Full example: [examples/online_serving/openai_chat_completion_client_for_multimo
|
||||
### Embedding Inputs
|
||||
|
||||
To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
|
||||
pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
|
||||
pass a tensor of shape `(..., hidden_size of LM)` for each item to the corresponding field of the multi-modal dictionary.
|
||||
|
||||
!!! important
|
||||
Unlike offline inference, the embeddings for each item must be passed separately
|
||||
in order for placeholder tokens to be applied correctly by the chat template.
|
||||
|
||||
You must enable this feature via the `--enable-mm-embeds` flag in `vllm serve`.
|
||||
|
||||
@@ -897,11 +884,6 @@ The following example demonstrates how to pass image embeddings to the OpenAI se
|
||||
```python
|
||||
from vllm.utils.serial_utils import tensor2base64
|
||||
|
||||
image_embedding = torch.load(...)
|
||||
grid_thw = torch.load(...) # Required by Qwen/Qwen2-VL-2B-Instruct
|
||||
|
||||
base64_image_embedding = tensor2base64(image_embedding)
|
||||
|
||||
client = OpenAI(
|
||||
# defaults to os.environ.get("OPENAI_API_KEY")
|
||||
api_key=openai_api_key,
|
||||
@@ -912,29 +894,33 @@ The following example demonstrates how to pass image embeddings to the OpenAI se
|
||||
model = "llava-hf/llava-1.5-7b-hf"
|
||||
embeds = {
|
||||
"type": "image_embeds",
|
||||
"image_embeds": f"{base64_image_embedding}",
|
||||
"image_embeds": tensor2base64(torch.load(...)), # Shape: (image_feature_size, hidden_size)
|
||||
"uuid": image_url, # Optional
|
||||
}
|
||||
|
||||
# Pass additional parameters (available to Qwen2-VL and MiniCPM-V)
|
||||
|
||||
# Additional examples for models that require extra fields
|
||||
model = "Qwen/Qwen2-VL-2B-Instruct"
|
||||
embeds = {
|
||||
"type": "image_embeds",
|
||||
"image_embeds": {
|
||||
"image_embeds": f"{base64_image_embedding}", # Required
|
||||
"image_grid_thw": f"{base64_image_grid_thw}", # Required by Qwen/Qwen2-VL-2B-Instruct
|
||||
"image_embeds": tensor2base64(torch.load(...)), # Shape: (image_feature_size, hidden_size)
|
||||
"image_grid_thw": tensor2base64(torch.load(...)), # Shape: (3,)
|
||||
},
|
||||
"uuid": image_url, # Optional
|
||||
}
|
||||
|
||||
model = "openbmb/MiniCPM-V-2_6"
|
||||
embeds = {
|
||||
"type": "image_embeds",
|
||||
"image_embeds": {
|
||||
"image_embeds": f"{base64_image_embedding}", # Required
|
||||
"image_sizes": f"{base64_image_sizes}", # Required by openbmb/MiniCPM-V-2_6
|
||||
"image_embeds": tensor2base64(torch.load(...)), # Shape: (num_slices, hidden_size)
|
||||
"image_sizes": tensor2base64(torch.load(...)), # Shape: (2,)
|
||||
},
|
||||
"uuid": image_url, # Optional
|
||||
}
|
||||
|
||||
# Single image input
|
||||
chat_completion = client.chat.completions.create(
|
||||
messages=[
|
||||
{
|
||||
@@ -954,9 +940,55 @@ The following example demonstrates how to pass image embeddings to the OpenAI se
|
||||
],
|
||||
model=model,
|
||||
)
|
||||
|
||||
# Multi image input
|
||||
chat_completion = client.chat.completions.create(
|
||||
messages=[
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are a helpful assistant.",
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "What's in this image?",
|
||||
},
|
||||
embeds,
|
||||
embeds,
|
||||
],
|
||||
},
|
||||
],
|
||||
model=model,
|
||||
)
|
||||
|
||||
# Multi image input (interleaved)
|
||||
chat_completion = client.chat.completions.create(
|
||||
messages=[
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are a helpful assistant.",
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
embeds,
|
||||
{
|
||||
"type": "text",
|
||||
"text": "What's in this image?",
|
||||
},
|
||||
embeds,
|
||||
],
|
||||
},
|
||||
],
|
||||
model=model,
|
||||
)
|
||||
```
|
||||
|
||||
For Online Serving, you can also skip sending media if you expect cache hits with provided UUIDs. You can do so by sending media like this:
|
||||
### Cached Inputs
|
||||
|
||||
Just like with offline inference, you can skip sending media if you expect cache hits with provided UUIDs. You can do so by sending media like this:
|
||||
|
||||
??? code
|
||||
|
||||
@@ -990,13 +1022,3 @@ For Online Serving, you can also skip sending media if you expect cache hits wit
|
||||
},
|
||||
|
||||
```
|
||||
|
||||
!!! note
|
||||
Multiple messages can now contain `{"type": "image_embeds"}`, enabling you to pass multiple image embeddings in a single request (similar to regular images). The number of embeddings is limited by `--limit-mm-per-prompt`.
|
||||
|
||||
**Important**: The embedding shape format differs based on the number of embeddings:
|
||||
|
||||
- **Single embedding**: 3D tensor of shape `(1, feature_size, hidden_size)`
|
||||
- **Multiple embeddings**: List of 2D tensors, each of shape `(feature_size, hidden_size)`
|
||||
|
||||
If used with a model that requires additional parameters, you must also provide a tensor for each of them, e.g. `image_grid_thw`, `image_sizes`, etc.
|
||||
|
||||
Reference in New Issue
Block a user