diff --git a/docs/features/multimodal_inputs.md b/docs/features/multimodal_inputs.md
index 894865208..264fd8c48 100644
--- a/docs/features/multimodal_inputs.md
+++ b/docs/features/multimodal_inputs.md
@@ -20,67 +20,6 @@ To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:
 - `prompt`: The prompt should follow the format that is documented on HuggingFace.
 - `multi_modal_data`: This is a dictionary that follows the schema defined in [vllm.multimodal.inputs.MultiModalDataDict][].
 
-### Stable UUIDs for Caching (multi_modal_uuids)
-
-When using multi-modal inputs, vLLM normally hashes each media item by content to enable caching across requests. You can optionally pass `multi_modal_uuids` to provide your own stable IDs for each item so caching can reuse work across requests without rehashing the raw content.
-
-??? code
-
-    ```python
-    from vllm import LLM
-    from PIL import Image
-
-    # Qwen2.5-VL example with two images
-    llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
-
-    prompt = "USER: \nDescribe the differences.\nASSISTANT:"
-    img_a = Image.open("/path/to/a.jpg")
-    img_b = Image.open("/path/to/b.jpg")
-
-    outputs = llm.generate({
-        "prompt": prompt,
-        "multi_modal_data": {"image": [img_a, img_b]},
-        # Provide stable IDs for caching.
-        # Requirements (matched by this example):
-        # - Include every modality present in multi_modal_data.
-        # - For lists, provide the same number of entries.
-        # - Use None to fall back to content hashing for that item.
-        "multi_modal_uuids": {"image": ["sku-1234-a", None]},
-    })
-
-    for o in outputs:
-        print(o.outputs[0].text)
-    ```
-
-Using UUIDs, you can also skip sending media data entirely if you expect cache hits for respective items. Note that the request will fail if the skipped media doesn't have a corresponding UUID, or if the UUID fails to hit the cache.
-
-??? code
-
-    ```python
-    from vllm import LLM
-    from PIL import Image
-
-    # Qwen2.5-VL example with two images
-    llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
-
-    prompt = "USER: \nDescribe the differences.\nASSISTANT:"
-    img_b = Image.open("/path/to/b.jpg")
-
-    outputs = llm.generate({
-        "prompt": prompt,
-        "multi_modal_data": {"image": [None, img_b]},
-        # Since img_a is expected to be cached, we can skip sending the actual
-        # image entirely.
-        "multi_modal_uuids": {"image": ["sku-1234-a", None]},
-    })
-
-    for o in outputs:
-        print(o.outputs[0].text)
-    ```
-
-!!! warning
-    If both multimodal processor caching and prefix caching are disabled, user-provided `multi_modal_uuids` are ignored.
-
 ### Image Inputs
 
 You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:
@@ -397,7 +336,8 @@ No manual conversion is needed - vLLM handles the channel normalization automati
 ### Embedding Inputs
 
 To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
-pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
+pass a tensor of shape `(..., hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
+The exact shape depends on the model being used.
 
 You must enable this feature via `enable_mm_embeds=True`.
 
@@ -418,8 +358,7 @@ You must enable this feature via `enable_mm_embeds=True`.
     # Refer to the HuggingFace repo for the correct format to use
     prompt = "USER: \nWhat is the content of this image?\nASSISTANT:"
 
-    # Embeddings for single image
-    # torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
+    # For most models, `image_embeds` has shape: (num_images, image_feature_size, hidden_size)
     image_embeds = torch.load(...)
 
     outputs = llm.generate({
@@ -430,21 +369,8 @@ You must enable this feature via `enable_mm_embeds=True`.
     for o in outputs:
         generated_text = o.outputs[0].text
         print(generated_text)
-    ```
 
-For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:
-
-??? code
-
-    ```python
-    # Construct the prompt based on your model
-    prompt = ...
-
-    # Embeddings for multiple images
-    # torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
-    image_embeds = torch.load(...)
-
-    # Qwen2-VL
+    # Additional examples for models that require extra fields
     llm = LLM(
         "Qwen/Qwen2-VL-2B-Instruct",
         limit_mm_per_prompt={"image": 4},
@@ -452,13 +378,15 @@ For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embedd
     )
     mm_data = {
         "image": {
-            "image_embeds": image_embeds,
+            # Shape: (total_feature_size, hidden_size)
+            # total_feature_size = sum(image_feature_size for image in images)
+            "image_embeds": torch.load(...),
+            # Shape: (num_images, 3)
             # image_grid_thw is needed to calculate positional encoding.
-            "image_grid_thw": torch.load(...),  # torch.Tensor of shape (1, 3),
+            "image_grid_thw": torch.load(...),
         }
     }
 
-    # MiniCPM-V
     llm = LLM(
         "openbmb/MiniCPM-V-2_6",
         trust_remote_code=True,
@@ -467,20 +395,14 @@ For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embedd
     )
     mm_data = {
         "image": {
-            "image_embeds": image_embeds,
+            # Shape: (num_images, num_slices, hidden_size)
+            # num_slices can differ for each image
+            "image_embeds": [torch.load(...) for image in images],
+            # Shape: (num_images, 2)
             # image_sizes is needed to calculate details of the sliced image.
-            "image_sizes": [image.size for image in images],  # list of image sizes
+            "image_sizes": [image.size for image in images],
         }
     }
-
-    outputs = llm.generate({
-        "prompt": prompt,
-        "multi_modal_data": mm_data,
-    })
-
-    for o in outputs:
-        generated_text = o.outputs[0].text
-        print(generated_text)
     ```
 
 For Qwen3-VL, the `image_embeds` should contain both the base image embedding and deepstack features.
@@ -501,8 +423,8 @@ You can pass pre-computed audio embeddings similar to image embeddings:
     # Refer to the HuggingFace repo for the correct format to use
     prompt = "USER: