[Docs] Update docs to include mm processor + encoder benchmarks (#34083)
Signed-off-by: Reagan <reaganjlee@gmail.com>
@@ -25,7 +25,7 @@ th {
| BurstGPT | ✅ | ✅ | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
| Sonnet (deprecated) | ✅ | ✅ | Local file: `benchmarks/sonnet.txt` |
| Random | ✅ | ✅ | `synthetic` |
| RandomMultiModal (Image/Video) | ✅ | ✅ | `synthetic` |
| RandomForReranking | ✅ | ✅ | `synthetic` |
| Prefix Repetition | ✅ | ✅ | `synthetic` |
| HuggingFace-VisionArena | ✅ | ✅ | `lmarena-ai/VisionArena-Chat` |
@@ -545,6 +545,24 @@ vllm bench throughput \
    --lora-path yard1/llama-2-7b-sql-lora-test
```

#### Synthetic Random Multimodal (random-mm)

Generate synthetic multimodal inputs for offline throughput testing without external datasets.
Use `--backend vllm-chat` so that image tokens are counted correctly.

```bash
vllm bench throughput \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name random-mm \
  --num-prompts 100 \
  --random-input-len 300 \
  --random-output-len 40 \
  --random-mm-base-items-per-request 2 \
  --random-mm-limit-mm-per-prompt '{"image": 3, "video": 0}' \
  --random-mm-bucket-config '{(256, 256, 1): 0.7, (720, 1280, 1): 0.3}'
```
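The bucket config is a Python-literal dict. A minimal sketch of how such a string can be parsed and sampled from, assuming the keys are (height, width, num_frames) buckets and the values are sampling probabilities (this is an illustration, not vLLM's actual parsing code):

```python
import ast
import random

# Parse a bucket config string like the one passed above. Interpreting the
# tuple keys as (height, width, num_frames) and the values as sampling
# probabilities is an assumption for this sketch.
raw = "{(256, 256, 1): 0.7, (720, 1280, 1): 0.3}"
buckets = ast.literal_eval(raw)  # dict literals with tuple keys parse fine

assert abs(sum(buckets.values()) - 1.0) < 1e-9, "probabilities should sum to 1"

# Draw one bucket per synthetic multimodal item, weighted by probability.
items = random.choices(list(buckets), weights=list(buckets.values()), k=5)
print(items)
```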

</details>

### 🛠️ Structured Output Benchmark
@@ -846,8 +864,8 @@ Generate synthetic image inputs alongside random text prompts to stress-test vis

Notes:

- Video sampling is not yet implemented.
- For online benchmarks, use `--backend openai-chat` with endpoint `/v1/chat/completions`.
- For offline benchmarks, use `--backend vllm-chat` (see [Offline Throughput Benchmark](#-offline-throughput-benchmark) for an example).

Start the server (example):

@@ -913,6 +931,74 @@ This should be seen as an edge case, and if this behavior can be avoided by sett

</details>

### 🔬 Multimodal Processor Benchmark

Benchmark per-stage latency of the multimodal (MM) input processor pipeline, including the encoder forward pass. This is useful for profiling preprocessing bottlenecks in vision-language models.

<details class="admonition abstract" markdown="1">
<summary>Show more</summary>

The benchmark measures the following stages for each request:

| Stage | Description |
|-------|-------------|
| `get_mm_hashes_secs` | Time spent hashing multimodal inputs |
| `get_cache_missing_items_secs` | Time spent looking up the processor cache |
| `apply_hf_processor_secs` | Time spent in the HuggingFace processor |
| `merge_mm_kwargs_secs` | Time spent merging multimodal kwargs |
| `apply_prompt_updates_secs` | Time spent updating prompt tokens |
| `preprocessor_total_secs` | Total preprocessing time |
| `encoder_forward_secs` | Time spent in the encoder model forward pass |
| `num_encoder_calls` | Number of encoder invocations per request |
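Per-request stage timings like these can be aggregated offline into per-stage averages. A rough illustration with invented numbers, using the stage names above (this is not the benchmark's internal code):

```python
from statistics import mean

# Hypothetical per-request timing records keyed by the stage names above;
# the real benchmark's output format may differ.
records = [
    {"get_mm_hashes_secs": 0.002, "apply_hf_processor_secs": 0.090,
     "encoder_forward_secs": 0.041},
    {"get_mm_hashes_secs": 0.003, "apply_hf_processor_secs": 0.110,
     "encoder_forward_secs": 0.039},
]

# Collect every stage seen, then average each one across requests.
stages = sorted({k for r in records for k in r})
averages = {s: mean(r[s] for r in records if s in r) for s in stages}
for stage, avg in averages.items():
    print(f"{stage:30s} {avg * 1e3:7.2f} ms")
```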

The benchmark also reports end-to-end latency (TTFT + decode time) per request. Use `--metric-percentiles` to select which percentiles to report (default: p99) and `--output-json` to save results.
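The reported percentiles are conventional latency percentiles. A minimal nearest-rank implementation, shown only to make the metric concrete (the exact method the benchmark uses internally may differ):

```python
def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the ceil(pct/100 * n)-th smallest sample."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceiling division
    return ordered[int(rank) - 1]

# Invented per-request latencies in seconds, for illustration only.
latencies = [0.8, 1.1, 0.9, 1.4, 2.3, 1.0, 0.95, 1.2, 1.05, 5.0]
for p in (50, 90, 95, 99):
    print(f"p{p}: {percentile(latencies, p):.2f}s")
```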

#### Basic Example with Synthetic Data (random-mm)

```bash
vllm bench mm-processor \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --dataset-name random-mm \
  --num-prompts 50 \
  --random-input-len 300 \
  --random-output-len 40 \
  --random-mm-base-items-per-request 2 \
  --random-mm-limit-mm-per-prompt '{"image": 3, "video": 0}' \
  --random-mm-bucket-config '{(256, 256, 1): 0.7, (720, 1280, 1): 0.3}'
```

#### Using a HuggingFace Dataset

```bash
vllm bench mm-processor \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --num-prompts 100
```

#### Warmup, Custom Percentiles, and JSON Output

```bash
vllm bench mm-processor \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --dataset-name random-mm \
  --num-prompts 200 \
  --num-warmups 5 \
  --random-input-len 300 \
  --random-output-len 40 \
  --random-mm-base-items-per-request 1 \
  --metric-percentiles 50,90,95,99 \
  --output-json results.json
```
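The saved results can be post-processed like any JSON file. A small sketch that pulls out top-level numeric fields; the field names in the payload below are invented for illustration, so inspect your own `results.json` for the actual schema:

```python
import json

def numeric_fields(results: dict) -> dict[str, float]:
    """Extract top-level numeric fields from a results dict (skip bools)."""
    return {
        k: float(v)
        for k, v in results.items()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    }

# Invented payload standing in for a real results.json; actual keys differ.
payload = json.loads(
    '{"model": "Qwen/Qwen2-VL-7B-Instruct",'
    ' "p99_preprocessor_total_secs": 0.1234,'
    ' "num_requests": 200}'
)
for name, value in sorted(numeric_fields(payload).items()):
    print(f"{name}: {value:.4f}")
```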

See [`vllm bench mm-processor`](../cli/bench/mm_processor.md) for the full argument reference.

</details>

### Embedding Benchmark

Benchmark the performance of embedding requests in vLLM.

@@ -1,5 +1,51 @@
# vllm bench mm-processor

## Overview

`vllm bench mm-processor` profiles the multimodal input processor pipeline of vision-language models. It measures per-stage latency from the HuggingFace processor through to the encoder forward pass, helping you identify preprocessing bottlenecks and understand how different image resolutions or item counts affect end-to-end request time.

The benchmark supports two data sources: synthetic random multimodal inputs (`random-mm`) and HuggingFace datasets (`hf`). Warmup requests are run before measurement to ensure stable results.

## Quick Start

```bash
vllm bench mm-processor \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --dataset-name random-mm \
  --num-prompts 50 \
  --random-input-len 300 \
  --random-output-len 40 \
  --random-mm-base-items-per-request 2 \
  --random-mm-limit-mm-per-prompt '{"image": 3, "video": 0}' \
  --random-mm-bucket-config '{(256, 256, 1): 0.7, (720, 1280, 1): 0.3}'
```

## Measured Stages

| Stage | Description |
|-------|-------------|
| `get_mm_hashes_secs` | Time spent hashing multimodal inputs |
| `get_cache_missing_items_secs` | Time spent looking up the processor cache |
| `apply_hf_processor_secs` | Time spent in the HuggingFace processor |
| `merge_mm_kwargs_secs` | Time spent merging multimodal kwargs |
| `apply_prompt_updates_secs` | Time spent updating prompt tokens |
| `preprocessor_total_secs` | Total preprocessing time |
| `encoder_forward_secs` | Time spent in the encoder model forward pass |
| `num_encoder_calls` | Number of encoder invocations per request |

The benchmark also reports end-to-end latency (TTFT + decode time) per request. Use `--metric-percentiles` to select which percentiles to report (default: p99) and `--output-json` to save results.

For more examples (HF datasets, warmup, JSON output), see [Benchmarking CLI — Multimodal Processor Benchmark](../../benchmarking/cli.md#multimodal-processor-benchmark).

## JSON CLI Arguments

--8<-- "docs/cli/json_tip.inc.md"