[Frontend] User-provided uuids for medias in chat. (RFC #22044) (#23449)

Signed-off-by: Roger Wang <hey@rogerw.io> Signed-off-by: Chenheli Hua <huachenheli@outlook.com> Signed-off-by: Roger Wang <hey@rogerw.me> Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Roger Wang <hey@rogerw.io> Co-authored-by: Roger Wang <hey@rogerw.me> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-09-08 06:42:20 -07:00
parent 03dd652c16
commit 01dfb5e982
8 changed files with 1079 additions and 79 deletions
--- a/docs/features/multimodal_inputs.md
+++ b/docs/features/multimodal_inputs.md
@@ -215,19 +215,19 @@ When loading RGBA images (images with transparency), vLLM converts them to RGB f

    ```python
    from vllm import LLM
-    
+
    # Default white background (no configuration needed)
    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
-    
+
    # Custom black background for dark theme
    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        media_io_kwargs={"image": {"rgba_background_color": [0, 0, 0]}}
    )
-    
+
    # Custom brand color background (e.g., blue)
    llm = LLM(
-        model="llava-hf/llava-1.5-7b-hf", 
+        model="llava-hf/llava-1.5-7b-hf",
        media_io_kwargs={"image": {"rgba_background_color": [0, 0, 255]}}
    )
    ```
@@ -388,7 +388,7 @@ For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embedd

 ## Online Serving

-Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
+Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat). Media inputs also support optional UUIDs users can provide to uniquely identify each media, which is used to cache the media results across requests.

 !!! important
    A chat template is **required** to use Chat Completions API.
@@ -438,7 +438,13 @@ Then, you can use the OpenAI client as follows:
                # NOTE: The prompt formatting with the image token `<image>` is not needed
                # since the prompt will be processed automatically by the API server.
                {"type": "text", "text": "What’s in this image?"},
-                {"type": "image_url", "image_url": {"url": image_url}},
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        url": image_url
+                    },
+                    "uuid": image_url # Optional
+                },
            ],
        }],
    )
@@ -454,8 +460,20 @@ Then, you can use the OpenAI client as follows:
            "role": "user",
            "content": [
                {"type": "text", "text": "What are the animals in these images?"},
-                {"type": "image_url", "image_url": {"url": image_url_duck}},
-                {"type": "image_url", "image_url": {"url": image_url_lion}},
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": image_url_duck
+                    },
+                    "uuid": image_url_duck # Optional
+                },
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": image_url_lion
+                    },
+                    "uuid": image_url_lion # Optional
+                },
            ],
        }],
    )
@@ -522,6 +540,7 @@ Then, you can use the OpenAI client as follows:
                    "video_url": {
                        "url": video_url
                    },
+                    "uuid": video_url # Optional
                },
            ],
        }],
@@ -613,6 +632,7 @@ Then, you can use the OpenAI client as follows:
                        "data": audio_base64,
                        "format": "wav"
                    },
+                    "uuid": audio_url # Optional
                },
            ],
        }],
@@ -642,6 +662,7 @@ Alternatively, you can pass `audio_url`, which is the audio counterpart of `imag
                    "audio_url": {
                        "url": audio_url
                    },
+                    "uuid": audio_url # Optional
                },
            ],
        }],
@@ -695,7 +716,8 @@ The following example demonstrates how to pass image embeddings to the OpenAI se
    model = "llava-hf/llava-1.5-7b-hf"
    embeds =  {
        "type": "image_embeds",
-        "image_embeds": f"{base64_image_embedding}" 
+        "image_embeds": f"{base64_image_embedding}",
+        "uuid": image_url # Optional
    }

    # Pass additional parameters (available to Qwen2-VL and MiniCPM-V)
@@ -706,6 +728,7 @@ The following example demonstrates how to pass image embeddings to the OpenAI se
            "image_embeds": f"{base64_image_embedding}" , # Required
            "image_grid_thw": f"{base64_image_grid_thw}"  # Required by Qwen/Qwen2-VL-2B-Instruct
        },
+        "uuid": image_url # Optional
    }
    model = "openbmb/MiniCPM-V-2_6"
    embeds =  {
@@ -714,6 +737,7 @@ The following example demonstrates how to pass image embeddings to the OpenAI se
            "image_embeds": f"{base64_image_embedding}" , # Required
            "image_sizes": f"{base64_image_sizes}"  # Required by openbmb/MiniCPM-V-2_6
        },
+        "uuid": image_url # Optional
    }
    chat_completion = client.chat.completions.create(
        messages=[