[Doc] Fix format of multimodal_inputs.md (#31800)
Signed-off-by: BlankR <hjyblanche@gmail.com>
@@ -166,49 +166,51 @@ Full example: [examples/offline_inference/vision_language_multi_image.py](../../
 If using the [LLM.chat](../models/generative_models.md#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings:
 
-```python
-from vllm import LLM
-from vllm.assets.image import ImageAsset
-
-llm = LLM(model="llava-hf/llava-1.5-7b-hf")
-
-image_url = "https://picsum.photos/id/32/512/512"
-image_pil = ImageAsset('cherry_blossom').pil_image
-image_embeds = torch.load(...)
-
-conversation = [
-    {"role": "system", "content": "You are a helpful assistant"},
-    {"role": "user", "content": "Hello"},
-    {"role": "assistant", "content": "Hello! How can I assist you today?"},
-    {
-        "role": "user",
-        "content": [
-            {
-                "type": "image_url",
-                "image_url": {"url": image_url},
-            },
-            {
-                "type": "image_pil",
-                "image_pil": image_pil,
-            },
-            {
-                "type": "image_embeds",
-                "image_embeds": image_embeds,
-            },
-            {
-                "type": "text",
-                "text": "What's in these images?",
-            },
-        ],
-    },
-]
-
-# Perform inference and log output.
-outputs = llm.chat(conversation)
-
-for o in outputs:
-    generated_text = o.outputs[0].text
-    print(generated_text)
-```
+??? code
+
+    ```python
+    from vllm import LLM
+    from vllm.assets.image import ImageAsset
+
+    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
+
+    image_url = "https://picsum.photos/id/32/512/512"
+    image_pil = ImageAsset('cherry_blossom').pil_image
+    image_embeds = torch.load(...)
+
+    conversation = [
+        {"role": "system", "content": "You are a helpful assistant"},
+        {"role": "user", "content": "Hello"},
+        {"role": "assistant", "content": "Hello! How can I assist you today?"},
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "image_url",
+                    "image_url": {"url": image_url},
+                },
+                {
+                    "type": "image_pil",
+                    "image_pil": image_pil,
+                },
+                {
+                    "type": "image_embeds",
+                    "image_embeds": image_embeds,
+                },
+                {
+                    "type": "text",
+                    "text": "What's in these images?",
+                },
+            ],
+        },
+    ]
+
+    # Perform inference and log output.
+    outputs = llm.chat(conversation)
+
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
+    ```
 
 Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
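The Qwen2-VL example itself sits outside this hunk. As a hedged sketch of the idea only (not part of the commit), one way to do it is to sample a few frames and feed them to `LLM.chat` as a multi-image prompt, reusing the `image_pil` content type from the hunk above; the OpenCV frame-sampling helper, the file name `video.mp4`, and the frame budget of 4 are illustrative assumptions, not the documented example:

```python
# Hedged sketch (not part of this commit): video captioning via a
# multi-image prompt built from sampled frames.
import cv2  # assumption: OpenCV is used only to extract frames
from PIL import Image

from vllm import LLM


def sample_frames(path: str, num_frames: int = 4) -> list[Image.Image]:
    """Grab `num_frames` evenly spaced RGB frames from a video file."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames


# Allow up to 4 images per prompt so every sampled frame fits.
llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})

content = [{"type": "image_pil", "image_pil": f} for f in sample_frames("video.mp4")]
content.append({"type": "text", "text": "Describe what happens across these frames."})

outputs = llm.chat([{"role": "user", "content": content}])
print(outputs[0].outputs[0].text)
```

Models with native video support can instead take a video content type directly; the frame-based route above is just the multi-image extension the sentence describes.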
@@ -893,6 +895,8 @@ The following example demonstrates how to pass image embeddings to the OpenAI se
 For Online Serving, you can also skip sending media if you expect cache hits with provided UUIDs. You can do so by sending media like this:
 
-```python
-# Image/video/audio URL:
-{
+??? code
+
+    ```python
+    # Image/video/audio URL:
+    {
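The payload in this hunk is cut off by the page, so its exact shape is not visible here. As a loosely hedged sketch of the idea against the OpenAI-compatible server, the request below tags a media part with a client-chosen UUID so that a later request can hit the server-side cache; the top-level `"uuid"` key and its placement are assumptions, not taken from this diff:

```python
# Hedged sketch: the "uuid" field on a content part is an assumed spelling
# of vLLM's media-UUID mechanism, not confirmed by the truncated hunk above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

image_url = "https://picsum.photos/id/32/512/512"
image_uuid = "cherry-blossom-001"  # hypothetical client-chosen identifier

chat = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            # Send the media once, tagged with a UUID so the server can cache
            # it; a follow-up request expecting a cache hit could then supply
            # the same UUID instead of re-uploading the bytes.
            {"type": "image_url", "image_url": {"url": image_url}, "uuid": image_uuid},
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
)
print(chat.choices[0].message.content)
```

How a cache-hit request that truly omits the media body is spelled falls in the truncated part of the hunk, so that shape is left to the actual documentation.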