[MM] Allow audio chunking for offline LLM (#34628)

Signed-off-by: NickLucche <nlucches@redhat.com>
Author: Nicolò Lucchesi
Date: 2026-02-24 06:04:28 +01:00
Committed by: GitHub
Parent: 33a0d43c71
Commit: f91808ae0d
5 changed files with 366 additions and 54 deletions

@@ -295,6 +295,51 @@ You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the mult
Full example: [examples/offline_inference/audio_language.py](../../examples/offline_inference/audio_language.py)

#### Chunking Long Audio for Transcription

Speech-to-text models like Whisper have a maximum audio length they can process (typically 30 seconds). For longer audio files, vLLM provides a utility that intelligently splits audio into chunks at quiet points, minimizing cuts through speech.
```python
import librosa

from vllm import LLM, SamplingParams
from vllm.multimodal.audio import split_audio

# Load long audio file
audio, sr = librosa.load("long_audio.wav", sr=16000)

# Split into chunks at low-energy (quiet) regions
chunks = split_audio(
    audio_data=audio,
    sample_rate=sr,
    max_clip_duration_s=30.0,  # Maximum chunk length in seconds
    overlap_duration_s=1.0,  # Search window for finding quiet split points
    min_energy_window_size=1600,  # Window size for energy calculation (~100ms at 16kHz)
)

# Initialize Whisper model
llm = LLM(model="openai/whisper-large-v3-turbo")
sampling_params = SamplingParams(temperature=0, max_tokens=256)

# Transcribe each chunk
transcriptions = []
for chunk in chunks:
    outputs = llm.generate(
        {
            "prompt": "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>",
            "multi_modal_data": {"audio": (chunk, sr)},
        },
        sampling_params,
    )
    transcriptions.append(outputs[0].outputs[0].text)

# Combine results
full_transcription = " ".join(transcriptions)
```
The `split_audio` function:

- Splits audio at quiet points to avoid cutting through speech
- Uses RMS energy to find low-amplitude regions within the overlap window (see the sketch below)
- Preserves all audio samples (no data loss)
- Supports any sample rate
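For intuition, here is a simplified, self-contained sketch of this kind of quiet-point search. It is not the actual implementation of `split_audio`; the helper `find_quiet_split_point`, its hop size, and the chunking loop are illustrative assumptions.

```python
import numpy as np


def find_quiet_split_point(
    audio: np.ndarray,
    target: int,
    search_window: int,
    energy_window: int,
) -> int:
    """Illustrative helper (not vLLM's code): return the start of the
    lowest-RMS-energy window among the `search_window` samples before
    `target`."""
    start = max(0, target - search_window)
    last = max(start, target - energy_window)
    hop = max(1, energy_window // 2)
    best_pos, best_energy = target, float("inf")
    for pos in range(start, last + 1, hop):
        window = audio[pos : pos + energy_window]
        energy = float(np.sqrt(np.mean(window**2)))  # RMS energy of the window
        if energy < best_energy:
            best_pos, best_energy = pos, energy
    return best_pos


# Usage sketch: cut a long 16 kHz signal into <=30 s chunks at quiet spots.
sr = 16000
audio = np.random.randn(90 * sr).astype(np.float32)  # stand-in signal

chunks, pos = [], 0
while pos < len(audio):
    target = min(pos + 30 * sr, len(audio))
    if target == len(audio):
        cut = target  # final chunk: take everything that remains
    else:
        cut = find_quiet_split_point(
            audio, target, search_window=sr, energy_window=1600
        )
    chunks.append(audio[pos:cut])
    pos = cut

assert sum(len(c) for c in chunks) == len(audio)  # no samples lost
```

Because each chunk starts exactly where the previous one ended, every sample lands in exactly one chunk, which is the no-data-loss property noted above.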

#### Automatic Audio Channel Normalization

vLLM automatically normalizes audio channels for models that require specific audio formats. When audio is loaded with libraries like `torchaudio`, stereo files come back with shape `[channels, time]`, but many audio models (particularly Whisper-based models) expect mono audio with shape `[time]`.
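As a rough illustration of the shape mismatch and the downmix involved, the snippet below averages a stereo `torchaudio` waveform across channels to get mono. Mean-downmixing here is an assumption for illustration; vLLM performs its own normalization automatically, so no manual step like this is required.

```python
import torchaudio

# torchaudio.load returns (waveform, sample_rate); stereo files
# come back with shape [channels, time], e.g. [2, 480000]
waveform, sr = torchaudio.load("stereo_audio.wav")

# Downmix to mono by averaging channels -> shape [time]
# (illustrative only: vLLM normalizes channels automatically,
# so this manual step is not needed when passing audio to vLLM)
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0)
else:
    waveform = waveform.squeeze(0)

print(waveform.shape)  # e.g. torch.Size([480000])
```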