[Core] Add audio_embeds support to chat completions (#29059)

Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com>
Co-authored-by: Jeremy Teboul <jeremyteboul@fb.com>
This commit is contained in:
jeremyteboul
2025-11-20 19:39:47 -08:00
committed by GitHub
parent a982f5b5ea
commit 0730414999
5 changed files with 360 additions and 3 deletions

View File

@@ -365,6 +365,8 @@ You must enable this feature via `enable_mm_embeds=True`.
The vLLM engine may crash if incorrect shape of embeddings is passed.
Only enable this flag for trusted users!
#### Image Embeddings
??? code
```python
@@ -441,6 +443,36 @@ For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embedd
print(generated_text)
```
#### Audio Embeddings
You can pass pre-computed audio embeddings similar to image embeddings:
??? code
```python
from vllm import LLM
import torch
# Enable audio embeddings support
llm = LLM(model="fixie-ai/ultravox-v0_5-llama-3_2-1b", enable_mm_embeds=True)
# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <audio>\nWhat is in this audio?\nASSISTANT:"
# Load pre-computed audio embeddings
# torch.Tensor of shape (1, audio_feature_size, hidden_size of LM)
audio_embeds = torch.load(...)
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {"audio": audio_embeds},
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
```
## Online Serving
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat). Media inputs also support optional UUIDs users can provide to uniquely identify each media, which is used to cache the media results across requests.