# Multimodal Inputs

This page teaches you how to pass multi-modal inputs to [multi-modal models](../models/supported_models.md#list-of-multimodal-language-models) in vLLM.

!!! note
    We are actively iterating on multi-modal support. See [this RFC](https://github.com/vllm-project/vllm/issues/4194) for upcoming changes, and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.

!!! tip
    When serving multi-modal models, consider setting `--allowed-media-domains` to restrict the domains that vLLM can access, preventing it from reaching arbitrary endpoints that can potentially be vulnerable to Server-Side Request Forgery (SSRF) attacks. You can provide a list of domains for this argument. For example: `--allowed-media-domains upload.wikimedia.org github.com www.bogotobogo.com`. Also, consider setting `VLLM_MEDIA_URL_ALLOW_REDIRECTS=0` to prevent HTTP redirects from being followed to bypass the domain restrictions.

    This restriction is especially important if you run vLLM in a containerized environment where the vLLM pods may have unrestricted access to internal networks.

## Offline Inference

To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:

- `prompt`: The prompt should follow the format that is documented on HuggingFace.
- `multi_modal_data`: This is a dictionary that follows the schema defined in [vllm.inputs.MultiModalDataDict][].

### Image Inputs

You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:

??? code

    ```python
    import PIL.Image

    from vllm import LLM

    llm = LLM(model="llava-hf/llava-1.5-7b-hf")

    # Refer to the HuggingFace repo for the correct format to use
    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

    # Load the image using PIL.Image
    image = PIL.Image.open(...)

    # Single prompt inference
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

    # Batch inference
    image_1 = PIL.Image.open(...)
    image_2 = PIL.Image.open(...)
    outputs = llm.generate(
        [
            {
                "prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
                "multi_modal_data": {"image": image_1},
            },
            {
                "prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
                "multi_modal_data": {"image": image_2},
            },
        ]
    )

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)
    ```

Full example: [examples/offline_inference/vision_language.py](../../examples/offline_inference/vision_language.py)

To substitute multiple images inside the same text prompt, you can pass in a list of images instead:

??? code

    ```python
    import PIL.Image

    from vllm import LLM

    llm = LLM(
        model="microsoft/Phi-3.5-vision-instruct",
        trust_remote_code=True,  # Required to load Phi-3.5-vision
        max_model_len=4096,  # Otherwise, it may not fit in smaller GPUs
        limit_mm_per_prompt={"image": 2},  # The maximum number to accept
    )

    # Refer to the HuggingFace repo for the correct format to use
    prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"

    # Load the images using PIL.Image
    image1 = PIL.Image.open(...)
    image2 = PIL.Image.open(...)

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": [image1, image2]},
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)
    ```

Full example: [examples/offline_inference/vision_language_multi_image.py](../../examples/offline_inference/vision_language_multi_image.py)

If you use the [LLM.chat](../models/generative_models.md#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings:

??? code

    ```python
    import torch

    from vllm import LLM
    from vllm.assets.image import ImageAsset

    llm = LLM(model="llava-hf/llava-1.5-7b-hf")

    image_url = "https://picsum.photos/id/32/512/512"
    image_pil = ImageAsset("cherry_blossom").pil_image
    image_embeds = torch.load(...)

    conversation = [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hello! How can I assist you today?"},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": image_url},
                },
                {
                    "type": "image_pil",
                    "image_pil": image_pil,
                },
                {
                    "type": "image_embeds",
                    "image_embeds": image_embeds,
                },
                {
                    "type": "text",
                    "text": "What's in these images?",
                },
            ],
        },
    ]

    # Perform inference and log output.
    outputs = llm.chat(conversation)

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)
    ```

Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:

??? code

    ```python
    from vllm import LLM

    # Specify the maximum number of frames per video to be 4. This can be changed.
    llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})

    # Create the request payload.
    video_frames = ...  # load your video making sure it only has the number of frames specified earlier.
    message = {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Describe this set of frames. Consider the frames to be a part of the same video.",
            },
        ],
    }
    for i in range(len(video_frames)):
        # `encode_image` is a user-defined helper that base64-encodes a frame.
        base64_image = encode_image(video_frames[i])
        new_image = {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
        }
        message["content"].append(new_image)

    # Perform inference and log output.
    outputs = llm.chat([message])

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)
    ```

#### Custom RGBA Background Color

When loading RGBA images (images with transparency), vLLM converts them to RGB format.
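This conversion is effectively alpha compositing onto a solid background color. A minimal per-channel sketch of the math (illustrative only; the helper name is made up and this is not vLLM's internal code):

```python
def composite_channel(fg: int, alpha: float, bg: int) -> int:
    """Blend one color channel of an RGBA pixel onto an opaque background.

    `alpha` is the pixel's normalized alpha in [0.0, 1.0].
    """
    return round(fg * alpha + bg * (1 - alpha))


# A fully transparent pixel takes on the background color entirely...
print(composite_channel(200, 0.0, 255))  # 255 (white shows through)
# ...while a fully opaque pixel keeps its own value.
print(composite_channel(200, 1.0, 255))  # 200
```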
By default, transparent pixels are replaced with a white background. You can customize this background color using the `rgba_background_color` parameter in `media_io_kwargs`.

??? code

    ```python
    from vllm import LLM

    # Default white background (no configuration needed)
    llm = LLM(model="llava-hf/llava-1.5-7b-hf")

    # Custom black background for dark theme
    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        media_io_kwargs={"image": {"rgba_background_color": [0, 0, 0]}},
    )

    # Custom brand color background (e.g., blue)
    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        media_io_kwargs={"image": {"rgba_background_color": [0, 0, 255]}},
    )
    ```

!!! note
    - `rgba_background_color` accepts RGB values as a list `[R, G, B]` or tuple `(R, G, B)`, where each value is in the range 0-255
    - This setting only affects RGBA images with transparency; RGB images are unchanged
    - If not specified, the default white background `(255, 255, 255)` is used for backward compatibility

### Video Inputs

You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary instead of using multi-image input. You can also pass `torch.Tensor` instances instead of NumPy arrays, as shown in this example using Qwen2.5-VL:

??? code

    ```python
    from transformers import AutoProcessor
    from qwen_vl_utils import process_vision_info

    from vllm import LLM, SamplingParams

    model_path = "Qwen/Qwen2.5-VL-3B-Instruct"
    video_path = "https://content.pexels.com/videos/free-videos.mp4"

    llm = LLM(
        model=model_path,
        gpu_memory_utilization=0.8,
        enforce_eager=True,
        limit_mm_per_prompt={"video": 1},
    )

    sampling_params = SamplingParams(max_tokens=1024)

    video_messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "describe this video."},
                {
                    "type": "video",
                    "video": video_path,
                    "total_pixels": 20480 * 28 * 28,
                    "min_pixels": 16 * 28 * 28,
                },
            ],
        },
    ]

    messages = video_messages
    processor = AutoProcessor.from_pretrained(model_path)
    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    image_inputs, video_inputs = process_vision_info(messages)
    mm_data = {}
    if video_inputs is not None:
        mm_data["video"] = video_inputs

    llm_inputs = {
        "prompt": prompt,
        "multi_modal_data": mm_data,
    }

    outputs = llm.generate([llm_inputs], sampling_params=sampling_params)

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)
    ```

!!! note
    `process_vision_info` is only applicable to Qwen2.5-VL and similar models.

Full example: [examples/offline_inference/vision_language.py](../../examples/offline_inference/vision_language.py)

### Audio Inputs

You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the multi-modal dictionary.

Full example: [examples/offline_inference/audio_language.py](../../examples/offline_inference/audio_language.py)

#### Chunking Long Audio for Transcription

Speech-to-text models like Whisper have a maximum audio length they can process (typically 30 seconds). For longer audio files, vLLM provides a utility to intelligently split audio into chunks at quiet points to minimize cutting through speech.
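Conceptually, the splitter scans a search window for the position with the lowest RMS energy and cuts there. A simplified pure-Python sketch of that idea (the function names here are made up for illustration; this is not the actual `split_audio` implementation):

```python
import math


def rms_energy(window):
    """Root-mean-square energy of a window of audio samples."""
    return math.sqrt(sum(x * x for x in window) / len(window))


def quietest_split_point(samples, start, end, window_size):
    """Return the index in [start, end) whose following window has the lowest RMS energy."""
    best_index, best_energy = start, float("inf")
    for i in range(start, end - window_size):
        energy = rms_energy(samples[i:i + window_size])
        if energy < best_energy:
            best_index, best_energy = i, energy
    return best_index


# Loud - quiet - loud signal: the split point lands in the quiet middle region.
samples = [1.0] * 100 + [0.0] * 100 + [1.0] * 100
print(quietest_split_point(samples, 0, len(samples), 10))  # 100
```

The real utility additionally enforces the maximum chunk duration and confines the search to the configured overlap window, but minimizing RMS energy as above is how it avoids cutting through speech.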
```python
import librosa

from vllm import LLM, SamplingParams
from vllm.multimodal.audio import split_audio

# Load long audio file
audio, sr = librosa.load("long_audio.wav", sr=16000)

# Split into chunks at low-energy (quiet) regions
chunks = split_audio(
    audio_data=audio,
    sample_rate=sr,
    max_clip_duration_s=30.0,  # Maximum chunk length in seconds
    overlap_duration_s=1.0,  # Search window for finding quiet split points
    min_energy_window_size=1600,  # Window size for energy calculation (~100ms at 16kHz)
)

# Initialize Whisper model
llm = LLM(model="openai/whisper-large-v3-turbo")
sampling_params = SamplingParams(temperature=0, max_tokens=256)

# Transcribe each chunk
transcriptions = []
for chunk in chunks:
    outputs = llm.generate({
        "prompt": "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>",
        "multi_modal_data": {"audio": (chunk, sr)},
    }, sampling_params)
    transcriptions.append(outputs[0].outputs[0].text)

# Combine results
full_transcription = " ".join(transcriptions)
```

The `split_audio` function:

- Splits audio at quiet points to avoid cutting through speech
- Uses RMS energy to find low-amplitude regions within the overlap window
- Preserves all audio samples (no data loss)
- Supports any sample rate

#### Automatic Audio Channel Normalization

vLLM automatically normalizes audio channels for models that require specific audio formats. When loading audio with libraries like `torchaudio`, stereo files return shape `[channels, time]`, but many audio models (particularly Whisper-based models) expect mono audio with shape `[time]`.

**Supported models with automatic mono conversion:**

- **Whisper** and all Whisper-based models
- **Qwen2-Audio**
- **Qwen2.5-Omni** / **Qwen3-Omni** (which inherits from Qwen2.5-Omni)
- **Ultravox**

For these models, vLLM automatically:

1. Detects whether the model requires mono audio via the feature extractor
2. Converts multi-channel audio to mono using channel averaging
3. Handles both the `(channels, time)` format (torchaudio) and the `(time, channels)` format (soundfile)

**Example with stereo audio:**

```python
import torchaudio

from vllm import LLM

# Load stereo audio file - returns (channels, time) shape
audio, sr = torchaudio.load("stereo_audio.wav")
print(f"Original shape: {audio.shape}")  # e.g., torch.Size([2, 16000])

# vLLM automatically converts to mono for Whisper-based models
llm = LLM(model="openai/whisper-large-v3")
outputs = llm.generate({
    "prompt": "",
    "multi_modal_data": {"audio": (audio.numpy(), sr)},
})
```

No manual conversion is needed: vLLM handles the channel normalization automatically based on the model's requirements.

### Embedding Inputs

To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model, pass a tensor of shape `(..., hidden_size of LM)` to the corresponding field of the multi-modal dictionary. The exact shape depends on the model being used. You must enable this feature via `enable_mm_embeds=True`.

!!! warning
    The vLLM engine may crash if embeddings with an incorrect shape are passed. Only enable this flag for trusted users!

#### Image Embeddings

??? code

    ```python
    import torch

    from vllm import LLM

    # Inference with image embeddings as input
    llm = LLM(model="llava-hf/llava-1.5-7b-hf", enable_mm_embeds=True)

    # Refer to the HuggingFace repo for the correct format to use
    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

    # For most models, `image_embeds` has shape: (num_images, image_feature_size, hidden_size)
    image_embeds = torch.load(...)

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image_embeds},
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

    # Additional examples for models that require extra fields
    llm = LLM(
        "Qwen/Qwen2-VL-2B-Instruct",
        limit_mm_per_prompt={"image": 4},
        enable_mm_embeds=True,
    )
    mm_data = {
        "image": {
            # Shape: (total_feature_size, hidden_size)
            # total_feature_size = sum(image_feature_size for image in images)
            "image_embeds": torch.load(...),
            # Shape: (num_images, 3)
            # image_grid_thw is needed to calculate positional encoding.
            "image_grid_thw": torch.load(...),
        }
    }

    llm = LLM(
        "openbmb/MiniCPM-V-2_6",
        trust_remote_code=True,
        limit_mm_per_prompt={"image": 4},
        enable_mm_embeds=True,
    )
    mm_data = {
        "image": {
            # Shape: (num_images, num_slices, hidden_size)
            # num_slices can differ for each image
            "image_embeds": [torch.load(...) for image in images],
            # Shape: (num_images, 2)
            # image_sizes is needed to calculate details of the sliced image.
            "image_sizes": [image.size for image in images],
        }
    }
    ```

For Qwen3-VL, the `image_embeds` should contain both the base image embedding and the deepstack features.

#### Audio Embedding Inputs

You can pass pre-computed audio embeddings in the same way as image embeddings:

??? code

    ```python
    import torch

    from vllm import LLM

    # Enable audio embeddings support
    llm = LLM(model="fixie-ai/ultravox-v0_5-llama-3_2-1b", enable_mm_embeds=True)

    # Refer to the HuggingFace repo for the correct format to use
    prompt = "USER: