# Speech-to-Text (Transcription/Translation) Support
This document walks you through the steps to add support for speech-to-text (ASR) models to vLLM's transcription and translation APIs by implementing [SupportsTranscription][vllm.model_executor.models.interfaces.SupportsTranscription].
Please refer to the [supported models](../../models/supported_models.md#transcription) for further guidance.
## Update the base vLLM model
It is assumed you have already implemented your model in vLLM according to the basic model guide. Extend your model with the [SupportsTranscription][vllm.model_executor.models.interfaces.SupportsTranscription] interface and implement the following class attributes and methods.
### `supported_languages` and `supports_transcription_only`
Declare supported languages and capabilities:
- The `supported_languages` mapping is validated at init time.
- Set `supports_transcription_only=True` if the model should not serve text generation (e.g., Whisper).
??? code "supported_languages and supports_transcription_only"
```python
    from typing import ClassVar, Mapping, Literal, cast
import numpy as np
import torch
from torch import nn
from vllm.config import ModelConfig, SpeechToTextConfig
from vllm.inputs.data import PromptType
from vllm.model_executor.models.interfaces import SupportsTranscription
class YourASRModel(nn.Module, SupportsTranscription):
# Map of ISO 639-1 language codes to language names
supported_languages: ClassVar[Mapping[str, str]] = {
"en": "English",
"it": "Italian",
# ... add more as needed
}
# If your model only supports audio-conditioned generation
# (no text-only generation), enable this flag.
supports_transcription_only: ClassVar[bool] = True
```
### `get_speech_to_text_config`

Provide an ASR configuration via [get_speech_to_text_config][vllm.model_executor.models.interfaces.SupportsTranscription.get_speech_to_text_config].
This controls the general behavior of the API when serving your model:
??? code "get_speech_to_text_config()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
@classmethod
def get_speech_to_text_config(
cls,
model_config: ModelConfig,
task_type: Literal["transcribe", "translate"],
) -> SpeechToTextConfig:
return SpeechToTextConfig(
sample_rate=16_000,
max_audio_clip_s=30,
# Set to None to disable server-side chunking if your
# model/processor handles it already
min_energy_split_window_size=None,
)
```
See [Audio preprocessing and chunking](#audio-preprocessing-and-chunking) for what each field controls.
### `get_generation_prompt`

Implement the prompt construction via [get_generation_prompt][vllm.model_executor.models.interfaces.SupportsTranscription.get_generation_prompt]. The server passes you the resampled waveform and task parameters; you return a valid [PromptType][vllm.inputs.data.PromptType]. There are two common patterns:
#### Multimodal LLM with audio embeddings (e.g., Voxtral, Gemma3n)
Return a dict containing `multi_modal_data` with the audio, and either a `prompt` string or `prompt_token_ids`:
??? code "get_generation_prompt()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
@classmethod
def get_generation_prompt(
cls,
audio: np.ndarray,
stt_config: SpeechToTextConfig,
model_config: ModelConfig,
language: str | None,
task_type: Literal["transcribe", "translate"],
request_prompt: str,
to_language: str | None,
) -> PromptType:
# Example with a free-form instruction prompt
task_word = "Transcribe" if task_type == "transcribe" else "Translate"
prompt = (
"<start_of_turn>user\n"
f"{task_word} this audio: <audio_soft_token>"
"<end_of_turn>\n<start_of_turn>model\n"
)
return {
"multi_modal_data": {"audio": (audio, stt_config.sample_rate)},
"prompt": prompt,
}
```
For further clarification on multi-modal inputs, please refer to [Multi-Modal Inputs](../../features/multimodal_inputs.md).
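If you want to sanity-check the prompt your model builds before serving it, you can pass the same dict to the offline `LLM` engine. The following is a minimal sketch; the model ID, audio file, and instruction string are illustrative placeholders rather than a real configuration:

```python
import librosa

from vllm import LLM, SamplingParams

# Resample the clip to the rate declared in your SpeechToTextConfig.
audio, sr = librosa.load("sample.wav", sr=16_000)  # hypothetical local file

llm = LLM(model="your-org/your-asr-model")  # hypothetical model ID

# Mirrors what get_generation_prompt() returns for the multimodal-LLM pattern.
outputs = llm.generate(
    {
        "prompt": (
            "<start_of_turn>user\n"
            "Transcribe this audio: <audio_soft_token>"
            "<end_of_turn>\n<start_of_turn>model\n"
        ),
        "multi_modal_data": {"audio": (audio, sr)},
    },
    SamplingParams(temperature=0.0, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```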
#### Encoder-decoder audio-only (e.g., Whisper)
Return a dict with separate `encoder_prompt` and `decoder_prompt` entries:
??? code "get_generation_prompt()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
@classmethod
def get_generation_prompt(
cls,
audio: np.ndarray,
stt_config: SpeechToTextConfig,
model_config: ModelConfig,
language: str | None,
task_type: Literal["transcribe", "translate"],
request_prompt: str,
to_language: str | None,
) -> PromptType:
if language is None:
raise ValueError("Language must be specified")
prompt = {
"encoder_prompt": {
"prompt": "",
"multi_modal_data": {
"audio": (audio, stt_config.sample_rate),
},
},
"decoder_prompt": (
(f"<|prev|>{request_prompt}" if request_prompt else "")
+ f"<|startoftranscript|><|{language}|>"
+ f"<|{task_type}|><|notimestamps|>"
),
}
return cast(PromptType, prompt)
```
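To make the decoder prompt concrete, this is the string the snippet above produces for an English transcription request with an empty `request_prompt`:

```python
# language="en", task_type="transcribe", request_prompt=""
decoder_prompt = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
```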
### `validate_language` (optional)
Language validation via [validate_language][vllm.model_executor.models.interfaces.SupportsTranscription.validate_language]
If your model requires a language and you want a default, override this method (see Whisper):
??? code "validate_language()"
    ```python
    from vllm.logger import init_logger

    logger = init_logger(__name__)


    class YourASRModel(nn.Module, SupportsTranscription):
        ...

        @classmethod
        def validate_language(cls, language: str | None) -> str | None:
            if language is None:
                logger.warning(
                    "Defaulting to language='en'. If you wish to transcribe "
                    "audio in a different language, pass the `language` field "
                    "in the TranscriptionRequest."
                )
                language = "en"
            return super().validate_language(language)
    ```
### `get_num_audio_tokens` (optional)
Token accounting for streaming via [get_num_audio_tokens][vllm.model_executor.models.interfaces.SupportsTranscription.get_num_audio_tokens]
Provide a fast duration→token estimate to improve streaming usage statistics:
??? code "get_num_audio_tokens()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
@classmethod
def get_num_audio_tokens(
cls,
audio_duration_s: float,
stt_config: SpeechToTextConfig,
model_config: ModelConfig,
) -> int | None:
# Return None if unknown; otherwise return an estimate.
return int(audio_duration_s * stt_config.sample_rate // 320) # example
```
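As a quick sanity check of the illustrative formula above, which assumes one audio token per 320 input samples: a 30-second clip at 16 kHz maps to 1500 estimated tokens.

```python
# 30 s of 16 kHz audio with an assumed 320-sample hop:
assert int(30.0 * 16_000 // 320) == 1500
```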
## Audio preprocessing and chunking
The API server takes care of basic audio I/O and optional chunking before building prompts:
- Resampling: Input audio is resampled to `SpeechToTextConfig.sample_rate` using `librosa`.
- Chunking: If `SpeechToTextConfig.allow_audio_chunking` is True and the duration exceeds `max_audio_clip_s`, the server splits the audio into overlapping chunks and generates a prompt per chunk. Overlap is controlled by `overlap_chunk_second`.
- Energy-aware splitting: When `min_energy_split_window_size` is set, the server finds low-energy regions to minimize cutting within words.
Relevant server logic:
??? code "_preprocess_speech_to_text()"
```python
# vllm/entrypoints/openai/speech_to_text.py
async def _preprocess_speech_to_text(...):
language = self.model_cls.validate_language(request.language)
...
y, sr = librosa.load(bytes_, sr=self.asr_config.sample_rate)
duration = librosa.get_duration(y=y, sr=sr)
do_split_audio = (self.asr_config.allow_audio_chunking
and duration > self.asr_config.max_audio_clip_s)
chunks = [y] if not do_split_audio else self._split_audio(y, int(sr))
prompts = []
for chunk in chunks:
prompt = self.model_cls.get_generation_prompt(
audio=chunk,
stt_config=self.asr_config,
model_config=self.model_config,
language=language,
task_type=self.task_type,
request_prompt=request.prompt,
to_language=to_language,
)
prompts.append(prompt)
return prompts, duration
```
## Exposing tasks automatically
vLLM automatically advertises transcription support if your model implements the interface:
```python
if supports_transcription(model):
if model.supports_transcription_only:
return ["transcription"]
supported_tasks.append("transcription")
```
When enabled, the server initializes the transcription and translation handlers:
```python
state.openai_serving_transcription = OpenAIServingTranscription(...) if "transcription" in supported_tasks else None
state.openai_serving_translation = OpenAIServingTranslation(...) if "transcription" in supported_tasks else None
```
No extra registration is required beyond having your model class available via the model registry and implementing `SupportsTranscription`.
## Examples in-tree
- Whisper encoder-decoder (audio-only): [vllm/model_executor/models/whisper.py](../../../vllm/model_executor/models/whisper.py)
- Voxtral decoder-only (audio embeddings + LLM): [vllm/model_executor/models/voxtral.py](../../../vllm/model_executor/models/voxtral.py). Make sure `mistral-common[audio]` is installed.
- Gemma3n decoder-only with fixed instruction prompt: [vllm/model_executor/models/gemma3n_mm.py](../../../vllm/model_executor/models/gemma3n_mm.py)
## Test with the API
Once your model implements `SupportsTranscription`, you can test the endpoints (the API mimics the OpenAI Audio API):
- Transcription (ASR):
```bash
curl -s -X POST \
-H "Authorization: Bearer $VLLM_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F "file=@/path/to/audio .wav" \
-F "model=$MODEL_ID" \
http://localhost:8000/v1/audio/transcriptions
```
- Translation (source → English unless otherwise supported):
```bash
curl -s -X POST \
-H "Authorization: Bearer $VLLM_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F "file=@/path/to/audio .wav" \
-F "model=$MODEL_ID" \
http://localhost:8000/v1/audio/translations
```
Or check out more examples in [examples/online_serving](../../../examples/online_serving).
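You can also call the endpoint from Python with the official OpenAI client pointed at your vLLM server. A minimal sketch, assuming the server runs on `localhost:8000`; the model ID and audio path are placeholders:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # or your VLLM_API_KEY if the server enforces one
)

with open("/path/to/audio.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="your-org/your-asr-model",  # hypothetical model ID
        file=f,
        language="en",
    )
print(transcription.text)
```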
!!! note
- If your model handles chunking internally (e.g., via its processor or encoder), set `min_energy_split_window_size=None` in the returned `SpeechToTextConfig` to disable server-side chunking.
    - Implementing `get_num_audio_tokens` improves accuracy of streaming usage metrics (`prompt_tokens`) without an extra forward pass.
- For multilingual behavior, keep `supported_languages` aligned with actual model capabilities.