[Doc] Fix format of multimodal_inputs.md (#31800)

Signed-off-by: BlankR <hjyblanche@gmail.com>
This commit is contained in:
BlankR
2026-01-06 03:30:24 -08:00
committed by GitHub
parent 43d384bab4
commit 6ebb66ccea

View File

@@ -166,49 +166,51 @@ Full example: [examples/offline_inference/vision_language_multi_image.py](../../
If using the [LLM.chat](../models/generative_models.md#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings:
```python
from vllm import LLM
from vllm.assets.image import ImageAsset
??? code
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image_url = "https://picsum.photos/id/32/512/512"
image_pil = ImageAsset('cherry_blossom').pil_image
image_embeds = torch.load(...)
```python
from vllm import LLM
from vllm.assets.image import ImageAsset
conversation = [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hello! How can I assist you today?"},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url": image_url},
},
{
"type": "image_pil",
"image_pil": image_pil,
},
{
"type": "image_embeds",
"image_embeds": image_embeds,
},
{
"type": "text",
"text": "What's in these images?",
},
],
},
]
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image_url = "https://picsum.photos/id/32/512/512"
image_pil = ImageAsset('cherry_blossom').pil_image
image_embeds = torch.load(...)
# Perform inference and log output.
outputs = llm.chat(conversation)
conversation = [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hello! How can I assist you today?"},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url": image_url},
},
{
"type": "image_pil",
"image_pil": image_pil,
},
{
"type": "image_embeds",
"image_embeds": image_embeds,
},
{
"type": "text",
"text": "What's in these images?",
},
],
},
]
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
```
# Perform inference and log output.
outputs = llm.chat(conversation)
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
```
Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
@@ -893,6 +895,8 @@ The following example demonstrates how to pass image embeddings to the OpenAI se
For Online Serving, you can also skip sending media if you expect cache hits with provided UUIDs. You can do so by sending media like this:
??? code
```python
# Image/video/audio URL:
{