[Docs] Fix formatting of transcription doc (#24676)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
This commit is contained in:
Harry Mellor
2025-09-11 19:18:06 +01:00
committed by GitHub
parent e26fef8397
commit 361ae27f8a


This document walks you through the steps to add support for speech-to-text (ASR) models to vLLM's transcription and translation APIs by implementing [SupportsTranscription][vllm.model_executor.models.interfaces.SupportsTranscription].

Please refer to the [supported models](../../models/supported_models.md#transcription) for further guidance.

## Update the base vLLM model

It is assumed you have already implemented your model in vLLM according to the basic model guide. Extend your model with the [SupportsTranscription][vllm.model_executor.models.interfaces.SupportsTranscription] interface and implement the following class attributes and methods.

### `supported_languages` and `supports_transcription_only`

Declare supported languages and capabilities:

- The `supported_languages` mapping is validated at init time.
- Set `supports_transcription_only=True` if the model should not serve text generation (e.g. Whisper).

??? code "supported_languages and supports_transcription_only"

```python
from typing import ClassVar, Mapping, Optional, Literal

import numpy as np
...

    supports_transcription_only: ClassVar[bool] = True
```
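Detached from vLLM, the two attributes can be sketched as plain class variables; the language subset below is invented for illustration:

```python
from typing import ClassVar, Mapping


class YourASRModelSketch:
    """Stand-in for a model class mixing in SupportsTranscription."""

    # Language code -> display name; a real model lists everything it supports.
    supported_languages: ClassVar[Mapping[str, str]] = {
        "en": "English",
        "de": "German",
        "fr": "French",
    }

    # The model serves only transcription/translation, never plain text
    # generation (Whisper-style).
    supports_transcription_only: ClassVar[bool] = True


print(sorted(YourASRModelSketch.supported_languages))  # ['de', 'en', 'fr']
```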
Provide an ASR configuration via [get_speech_to_text_config][vllm.model_executor.models.interfaces.SupportsTranscription.get_speech_to_text_config].
This is for controlling general behavior of the API when serving your model:

??? code "get_speech_to_text_config()"

```python
class YourASRModel(nn.Module, SupportsTranscription):
    ...
    )
```

See [Audio preprocessing and chunking](#audio-preprocessing-and-chunking) for what each field controls.

Implement the prompt construction via [get_generation_prompt][vllm.model_executor.models.interfaces.SupportsTranscription.get_generation_prompt]. The server passes you the resampled waveform and task parameters; you return a valid [PromptType][vllm.inputs.data.PromptType]. There are two common patterns:
#### Multimodal LLM with audio embeddings (e.g., Voxtral, Gemma3n)

Return a dict containing `multi_modal_data` with the audio, and either a `prompt` string or `prompt_token_ids`:

??? code "get_generation_prompt()"

```python
class YourASRModel(nn.Module, SupportsTranscription):
    ...
```

For further clarification on multimodal inputs, please refer to [Multi-Modal Inputs](../../features/multimodal_inputs.md).
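Standing alone, this pattern boils down to returning a plain dict; a minimal sketch, where the placeholder token string and prompt text are invented and the audio is just a list of floats:

```python
def build_multimodal_prompt(audio: list[float], sampling_rate: int) -> dict:
    # The audio travels in multi_modal_data; the prompt text is whatever
    # instruction format your model expects (invented example below).
    return {
        "prompt": "<|audio|> Transcribe the audio.",
        "multi_modal_data": {"audio": (audio, sampling_rate)},
    }


prompt = build_multimodal_prompt([0.0] * 16_000, 16_000)
print(sorted(prompt))  # ['multi_modal_data', 'prompt']
```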
#### Encoder-decoder audio-only (e.g., Whisper)

Return a dict with separate `encoder_prompt` and `decoder_prompt` entries:

??? code "get_generation_prompt()"

```python
class YourASRModel(nn.Module, SupportsTranscription):
    ...
    return cast(PromptType, prompt)
```
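Reduced to a self-contained sketch, the encoder-decoder pattern puts the audio on the encoder side while the decoder prompt carries control tokens (the Whisper-style token strings here are illustrative):

```python
def build_encoder_decoder_prompt(audio: list[float], sampling_rate: int,
                                 language: str = "en") -> dict:
    return {
        # Encoder sees only the audio; an empty text prompt is common here.
        "encoder_prompt": {
            "prompt": "",
            "multi_modal_data": {"audio": (audio, sampling_rate)},
        },
        # Decoder prompt steers the task (Whisper-style control tokens).
        "decoder_prompt": f"<|startoftranscript|><|{language}|><|transcribe|>",
    }


prompt = build_encoder_decoder_prompt([0.0] * 16_000, 16_000)
print(sorted(prompt))  # ['decoder_prompt', 'encoder_prompt']
```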
### `validate_language` (optional)

Language validation via [validate_language][vllm.model_executor.models.interfaces.SupportsTranscription.validate_language].

If your model requires a language and you want a default, override this method (see Whisper):

??? code "validate_language()"

```python
@classmethod
def validate_language(cls, language: Optional[str]) -> Optional[str]:
    ...
    return super().validate_language(language)
```
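The override amounts to "fill in a default, reject the unknown"; a self-contained sketch with an invented three-language set:

```python
from typing import Optional

SUPPORTED_LANGUAGES = {"en", "de", "fr"}  # hypothetical subset


def validate_language(language: Optional[str]) -> Optional[str]:
    # Whisper-style behavior: default to English when unspecified,
    # fail fast on codes the model does not support.
    if language is None:
        return "en"
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language: {language!r}")
    return language


print(validate_language(None))  # en
```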
### `get_num_audio_tokens` (optional)

Token accounting for streaming via [get_num_audio_tokens][vllm.model_executor.models.interfaces.SupportsTranscription.get_num_audio_tokens].

Provide a fast duration→token estimate to improve streaming usage statistics:

??? code "get_num_audio_tokens()"

```python
class YourASRModel(nn.Module, SupportsTranscription):
    ...
    return int(audio_duration_s * stt_config.sample_rate // 320)  # example
```
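To make the example's `// 320` concrete: at a 16 kHz sample rate, one token per 320 samples means a 30-second clip maps to 1500 estimated tokens. The 320-sample ratio is model-specific; it is simply the constant from the example above:

```python
def estimate_audio_tokens(audio_duration_s: float, sample_rate: int = 16_000) -> int:
    # duration * sample_rate = total samples; // 320 = one token per 320 samples.
    return int(audio_duration_s * sample_rate // 320)


print(estimate_audio_tokens(30.0))  # 1500
```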
## Audio preprocessing and chunking

The API server takes care of basic audio I/O and optional chunking before building prompts:
- Energy-aware splitting: When `min_energy_split_window_size` is set, the server finds low-energy regions to minimize cutting within words.

Relevant server logic:

??? code "_preprocess_speech_to_text()"

```python
# vllm/entrypoints/openai/speech_to_text.py
async def _preprocess_speech_to_text(...):
    ...
    return prompts, duration
```
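The energy-aware part can be pictured with a toy splitter that nudges a chunk boundary to the quietest sample in a window around the target index; this is a from-scratch illustration, not vLLM's actual implementation:

```python
def find_low_energy_split(samples: list[float], target: int, window: int) -> int:
    """Return the index of the minimum-|amplitude| sample within
    [target - window, target + window), a cheap low-energy proxy."""
    lo = max(0, target - window)
    hi = min(len(samples), target + window)
    return min(range(lo, hi), key=lambda i: abs(samples[i]))


# Loud throughout except a quiet dip at index 7 -> the split moves there.
wave = [0.9, 0.8, 0.9, 0.7, 0.8, 0.9, 0.8, 0.01, 0.9, 0.8, 0.9, 0.7]
print(find_low_energy_split(wave, target=5, window=4))  # 7
```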
## Exposing tasks automatically

vLLM automatically advertises transcription support if your model implements the interface:

```python
if supports_transcription(model):
    ...
    supported_tasks.append("transcription")
```

When enabled, the server initializes the transcription and translation handlers:

```python
state.openai_serving_transcription = OpenAIServingTranscription(...) if "transcription" in supported_tasks else None
state.openai_serving_translation = OpenAIServingTranslation(...) if "transcription" in supported_tasks else None
```

No extra registration is required beyond having your model class available via the model registry and implementing `SupportsTranscription`.

## Examples in-tree
- Whisper encoder-decoder (audio-only): <gh-file:vllm/model_executor/models/whisper.py>
- Voxtral decoder-only (audio embeddings + LLM): <gh-file:vllm/model_executor/models/voxtral.py>
- Gemma3n decoder-only with fixed instruction prompt: <gh-file:vllm/model_executor/models/gemma3n_mm.py>
## Test with the API

Once your model implements `SupportsTranscription`, you can test the endpoints (API mimics OpenAI):
Or check out more examples in <gh-file:examples/online_serving>.
!!! note
    - If your model handles chunking internally (e.g., via its processor or encoder), set `min_energy_split_window_size=None` in the returned `SpeechToTextConfig` to disable server-side chunking.
    - Implementing `get_num_audio_tokens` improves accuracy of streaming usage metrics (`prompt_tokens`) without an extra forward pass.
    - For multilingual behavior, keep `supported_languages` aligned with actual model capabilities.