# Speech-to-Text (Transcription/Translation) Support
This document walks you through the steps to add support for speech-to-text (ASR) models to vLLM's transcription and translation APIs by implementing [SupportsTranscription][vllm.model_executor.models.interfaces.SupportsTranscription].
Please refer to the [supported models](../../models/supported_models.md#transcription) for further guidance.
## Update the base vLLM model
It is assumed you have already implemented your model in vLLM according to the basic model guide. Extend your model with the [SupportsTranscription][vllm.model_executor.models.interfaces.SupportsTranscription] interface and implement the following class attributes and methods.
### `supported_languages` and `supports_transcription_only`
Declare supported languages and capabilities:
- The `supported_languages` mapping is validated at init time.
- Set `supports_transcription_only=True` if the model should not serve text generation (e.g., Whisper).
??? code "supported_languages and supports_transcription_only"
```python
    from typing import ClassVar, Mapping, Literal, cast
import numpy as np
import torch
from torch import nn
from vllm.config import ModelConfig, SpeechToTextConfig
from vllm.inputs.data import PromptType
from vllm.model_executor.models.interfaces import SupportsTranscription
class YourASRModel(nn.Module, SupportsTranscription):
# Map of ISO 639-1 language codes to language names
supported_languages: ClassVar[Mapping[str, str]] = {
"en": "English",
"it": "Italian",
# ... add more as needed
}
# If your model only supports audio-conditioned generation
# (no text-only generation), enable this flag.
supports_transcription_only: ClassVar[bool] = True
```
### `get_speech_to_text_config`

Provide an ASR configuration via [get_speech_to_text_config][vllm.model_executor.models.interfaces.SupportsTranscription.get_speech_to_text_config].
This controls the general behavior of the API when serving your model:
??? code "get_speech_to_text_config()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
@classmethod
def get_speech_to_text_config(
cls,
model_config: ModelConfig,
task_type: Literal["transcribe", "translate"],
) -> SpeechToTextConfig:
return SpeechToTextConfig(
sample_rate=16_000,
max_audio_clip_s=30,
# Set to None to disable server-side chunking if your
# model/processor handles it already
min_energy_split_window_size=None,
)
```
See [Audio preprocessing and chunking](#audio-preprocessing-and-chunking) for what each field controls.
### `get_generation_prompt`

Implement the prompt construction via [get_generation_prompt][vllm.model_executor.models.interfaces.SupportsTranscription.get_generation_prompt]. The server passes you the resampled waveform and task parameters; you return a valid [PromptType][vllm.inputs.data.PromptType]. There are two common patterns:
#### Multimodal LLM with audio embeddings (e.g., Voxtral, Gemma3n)
Return a dict containing `multi_modal_data` with the audio, and either a `prompt` string or `prompt_token_ids`:
??? code "get_generation_prompt()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
@classmethod
def get_generation_prompt(
cls,
audio: np.ndarray,
stt_config: SpeechToTextConfig,
model_config: ModelConfig,
language: str | None,
task_type: Literal["transcribe", "translate"],
request_prompt: str,
to_language: str | None,
) -> PromptType:
# Example with a free-form instruction prompt
task_word = "Transcribe" if task_type == "transcribe" else "Translate"
prompt = (
"<start_of_turn>user\n"
f"{task_word} this audio: <audio_soft_token>"
"<end_of_turn>\n<start_of_turn>model\n"
)
return {
"multi_modal_data": {"audio": (audio, stt_config.sample_rate)},
"prompt": prompt,
}
```
For further clarification on multi-modal inputs, please refer to [Multi-Modal Inputs](../../features/multimodal_inputs.md).
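If you want to sanity-check the prompt your model builds before serving it, you can pass the same dict to the offline `LLM` engine. The following is a minimal sketch; the model ID, audio file, and instruction string are illustrative placeholders rather than a real configuration:

```python
import librosa

from vllm import LLM, SamplingParams

# Resample the clip to the rate declared in your SpeechToTextConfig.
audio, sr = librosa.load("sample.wav", sr=16_000)  # hypothetical local file

llm = LLM(model="your-org/your-asr-model")  # hypothetical model ID

# Mirrors what get_generation_prompt() returns for the multimodal-LLM pattern.
outputs = llm.generate(
    {
        "prompt": (
            "<start_of_turn>user\n"
            "Transcribe this audio: <audio_soft_token>"
            "<end_of_turn>\n<start_of_turn>model\n"
        ),
        "multi_modal_data": {"audio": (audio, sr)},
    },
    SamplingParams(temperature=0.0, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```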
#### Encoder-decoder audio-only (e.g., Whisper)
Return a dict with separate `encoder_prompt` and `decoder_prompt` entries:
??? code "get_generation_prompt()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
@classmethod
def get_generation_prompt(
cls,
audio: np.ndarray,
stt_config: SpeechToTextConfig,
model_config: ModelConfig,
language: str | None,
task_type: Literal["transcribe", "translate"],
request_prompt: str,
to_language: str | None,
) -> PromptType:
if language is None:
raise ValueError("Language must be specified")
prompt = {
"encoder_prompt": {
"prompt": "",
"multi_modal_data": {
"audio": (audio, stt_config.sample_rate),
},
},
"decoder_prompt": (
(f"<|prev|>{request_prompt}" if request_prompt else "")
+ f"<|startoftranscript|><|{language}|>"
+ f"<|{task_type}|><|notimestamps|>"
),
}
return cast(PromptType, prompt)
```
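To make the decoder prompt concrete, this is the string the snippet above produces for an English transcription request with an empty `request_prompt`:

```python
# language="en", task_type="transcribe", request_prompt=""
decoder_prompt = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
```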
### `validate_language` (optional)
Language validation via [validate_language][vllm.model_executor.models.interfaces.SupportsTranscription.validate_language]
If your model requires a language and you want a default, override this method (see Whisper):
??? code "validate_language()"
    ```python
    from vllm.logger import init_logger

    logger = init_logger(__name__)


    class YourASRModel(nn.Module, SupportsTranscription):
        ...

        @classmethod
        def validate_language(cls, language: str | None) -> str | None:
            if language is None:
                logger.warning(
                    "Defaulting to language='en'. If you wish to transcribe "
                    "audio in a different language, pass the `language` field "
                    "in the TranscriptionRequest."
                )
                language = "en"
            return super().validate_language(language)
    ```
### `get_num_audio_tokens` (optional)
Token accounting for streaming via [get_num_audio_tokens][vllm.model_executor.models.interfaces.SupportsTranscription.get_num_audio_tokens]
Provide a fast duration→token estimate to improve streaming usage statistics:
??? code "get_num_audio_tokens()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
@classmethod
def get_num_audio_tokens(
cls,
audio_duration_s: float,
stt_config: SpeechToTextConfig,
model_config: ModelConfig,
) -> int | None:
# Return None if unknown; otherwise return an estimate.
return int(audio_duration_s * stt_config.sample_rate // 320) # example
```
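As a quick sanity check of the illustrative formula above, which assumes one audio token per 320 input samples: a 30-second clip at 16 kHz maps to 1500 estimated tokens.

```python
# 30 s of 16 kHz audio with an assumed 320-sample hop:
assert int(30.0 * 16_000 // 320) == 1500
```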
## Audio preprocessing and chunking
The API server takes care of basic audio I/O and optional chunking before building prompts:
- Resampling: Input audio is resampled to `SpeechToTextConfig.sample_rate` using `librosa`.
- Chunking: If `SpeechToTextConfig.allow_audio_chunking` is True and the duration exceeds `max_audio_clip_s`, the server splits the audio into overlapping chunks and generates a prompt per chunk. Overlap is controlled by `overlap_chunk_second`.
- Energy-aware splitting: When `min_energy_split_window_size` is set, the server finds low-energy regions to minimize cutting within words.
Relevant server logic:
??? code "_preprocess_speech_to_text()"
```python
# vllm/entrypoints/openai/speech_to_text.py
async def _preprocess_speech_to_text(...):
language = self.model_cls.validate_language(request.language)
...
y, sr = librosa.load(bytes_, sr=self.asr_config.sample_rate)
duration = librosa.get_duration(y=y, sr=sr)
do_split_audio = (self.asr_config.allow_audio_chunking
and duration > self.asr_config.max_audio_clip_s)
chunks = [y] if not do_split_audio else self._split_audio(y, int(sr))
prompts = []
for chunk in chunks:
prompt = self.model_cls.get_generation_prompt(
audio=chunk,
stt_config=self.asr_config,
model_config=self.model_config,
language=language,
task_type=self.task_type,
request_prompt=request.prompt,
to_language=to_language,
)
prompts.append(prompt)
return prompts, duration
```
## Exposing tasks automatically
vLLM automatically advertises transcription support if your model implements the interface:
```python
if supports_transcription(model):
if model.supports_transcription_only:
return ["transcription"]
supported_tasks.append("transcription")
```
When enabled, the server initializes the transcription and translation handlers:
```python
state.openai_serving_transcription = OpenAIServingTranscription(...) if "transcription" in supported_tasks else None
state.openai_serving_translation = OpenAIServingTranslation(...) if "transcription" in supported_tasks else None
```
No extra registration is required beyond having your model class available via the model registry and implementing `SupportsTranscription`.
## Examples in-tree
- Whisper encoder-decoder (audio-only): [vllm/model_executor/models/whisper.py](../../../vllm/model_executor/models/whisper.py)
- Voxtral decoder-only (audio embeddings + LLM): [vllm/model_executor/models/voxtral.py](../../../vllm/model_executor/models/voxtral.py). Make sure `mistral-common[audio]` is installed.
- Gemma3n decoder-only with fixed instruction prompt: [vllm/model_executor/models/gemma3n_mm.py](../../../vllm/model_executor/models/gemma3n_mm.py)
## Test with the API
Once your model implements `SupportsTranscription`, you can test the endpoints (the API mimics the OpenAI Audio API):
- Transcription (ASR):
```bash
curl -s -X POST \
-H "Authorization: Bearer $VLLM_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F "file=@/path/to/audio .wav" \
-F "model=$MODEL_ID" \
http://localhost:8000/v1/audio/transcriptions
```
- Translation (source → English unless otherwise supported):
```bash
curl -s -X POST \
-H "Authorization: Bearer $VLLM_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F "file=@/path/to/audio .wav" \
-F "model=$MODEL_ID" \
http://localhost:8000/v1/audio/translations
```
Or check out more examples in [examples/online_serving](../../../examples/online_serving).
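You can also call the endpoint from Python with the official OpenAI client pointed at your vLLM server. A minimal sketch, assuming the server runs on `localhost:8000`; the model ID and audio path are placeholders:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # or your VLLM_API_KEY if the server enforces one
)

with open("/path/to/audio.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="your-org/your-asr-model",  # hypothetical model ID
        file=f,
        language="en",
    )
print(transcription.text)
```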
!!! note
- If your model handles chunking internally (e.g., via its processor or encoder), set `min_energy_split_window_size=None` in the returned `SpeechToTextConfig` to disable server-side chunking.
    - Implementing `get_num_audio_tokens` improves accuracy of streaming usage metrics (`prompt_tokens`) without an extra forward pass.
- For multilingual behavior, keep `supported_languages` aligned with actual model capabilities.