diff --git a/docs/benchmarking/cli.md b/docs/benchmarking/cli.md
index 8bbd9b0c0..3c2d4992c 100644
--- a/docs/benchmarking/cli.md
+++ b/docs/benchmarking/cli.md
@@ -25,7 +25,7 @@ th {
 | BurstGPT | ✅ | ✅ | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
 | Sonnet (deprecated) | ✅ | ✅ | Local file: `benchmarks/sonnet.txt` |
 | Random | ✅ | ✅ | `synthetic` |
-| RandomMultiModal (Image/Video) | 🟡 | 🚧 | `synthetic` |
+| RandomMultiModal (Image/Video) | ✅ | ✅ | `synthetic` |
 | RandomForReranking | ✅ | ✅ | `synthetic` |
 | Prefix Repetition | ✅ | ✅ | `synthetic` |
 | HuggingFace-VisionArena | ✅ | ✅ | `lmarena-ai/VisionArena-Chat` |
@@ -545,6 +545,24 @@ vllm bench throughput \
   --lora-path yard1/llama-2-7b-sql-lora-test
 ```
 
+#### Synthetic Random Multimodal (random-mm)
+
+Generate synthetic multimodal inputs for offline throughput testing without external datasets.
+Use `--backend vllm-chat` so that image tokens are counted correctly.
+Each key in `--random-mm-bucket-config` is a `(height, width, num_frames)` bucket and each value is the probability of sampling it; `num_frames: 1` denotes a still image.
+
+```bash
+vllm bench throughput \
+  --model Qwen/Qwen2-VL-7B-Instruct \
+  --backend vllm-chat \
+  --dataset-name random-mm \
+  --num-prompts 100 \
+  --random-input-len 300 \
+  --random-output-len 40 \
+  --random-mm-base-items-per-request 2 \
+  --random-mm-limit-mm-per-prompt '{"image": 3, "video": 0}' \
+  --random-mm-bucket-config '{(256, 256, 1): 0.7, (720, 1280, 1): 0.3}'
+```
+
 ### 🛠️ Structured Output Benchmark
 
@@ -846,8 +864,8 @@ Generate synthetic image inputs alongside random text prompts to stress-test vis
 
 Notes:
 
-- Works only with online benchmark via the OpenAI backend (`--backend openai-chat`) and endpoint `/v1/chat/completions`.
-- Video sampling is not yet implemented.
+- For online benchmarks, use `--backend openai-chat` with endpoint `/v1/chat/completions`.
+- For offline benchmarks, use `--backend vllm-chat` (see [Offline Throughput Benchmark](#-offline-throughput-benchmark) for an example).
 
 Start the server (example):
 
@@ -913,6 +931,74 @@ This should be seen as an edge case, and if this behavior can be avoided by sett
 
+### 🔬 Multimodal Processor Benchmark
+
+Benchmark per-stage latency of the multimodal (MM) input processor pipeline, including the encoder forward pass. This is useful for profiling preprocessing bottlenecks in vision-language models.
+
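+As a first pass, the command also runs with a minimal set of flags. The following is a sketch that relies only on options documented below and assumes the remaining `random-mm` settings fall back to their defaults:
+
+```bash
+# Minimal smoke test: relies on default values for the random-mm options
+# shown in the fuller examples below.
+vllm bench mm-processor \
+  --model Qwen/Qwen2-VL-7B-Instruct \
+  --dataset-name random-mm \
+  --num-prompts 10
+```
+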
+<details>
+<summary>Show more</summary>
+
+The benchmark measures the following metrics for each request:
+
+| Metric | Description |
+|--------|-------------|
+| `get_mm_hashes_secs` | Time spent hashing multimodal inputs |
+| `get_cache_missing_items_secs` | Time spent looking up the processor cache |
+| `apply_hf_processor_secs` | Time spent in the HuggingFace processor |
+| `merge_mm_kwargs_secs` | Time spent merging multimodal kwargs |
+| `apply_prompt_updates_secs` | Time spent updating prompt tokens |
+| `preprocessor_total_secs` | Total preprocessing time |
+| `encoder_forward_secs` | Time spent in the encoder model forward pass |
+| `num_encoder_calls` | Number of encoder invocations per request |
+
+The benchmark also reports end-to-end latency (TTFT + decode time) per
+request. Use `--metric-percentiles` to select which percentiles to report
+(default: p99) and `--output-json` to save results.
+
+#### Basic Example with Synthetic Data (random-mm)
+
+```bash
+vllm bench mm-processor \
+  --model Qwen/Qwen2-VL-7B-Instruct \
+  --dataset-name random-mm \
+  --num-prompts 50 \
+  --random-input-len 300 \
+  --random-output-len 40 \
+  --random-mm-base-items-per-request 2 \
+  --random-mm-limit-mm-per-prompt '{"image": 3, "video": 0}' \
+  --random-mm-bucket-config '{(256, 256, 1): 0.7, (720, 1280, 1): 0.3}'
+```
+
+#### Using a HuggingFace Dataset
+
+```bash
+vllm bench mm-processor \
+  --model Qwen/Qwen2-VL-7B-Instruct \
+  --dataset-name hf \
+  --dataset-path lmarena-ai/VisionArena-Chat \
+  --hf-split train \
+  --num-prompts 100
+```
+
+#### Warmup, Custom Percentiles, and JSON Output
+
+```bash
+vllm bench mm-processor \
+  --model Qwen/Qwen2-VL-7B-Instruct \
+  --dataset-name random-mm \
+  --num-prompts 200 \
+  --num-warmups 5 \
+  --random-input-len 300 \
+  --random-output-len 40 \
+  --random-mm-base-items-per-request 1 \
+  --metric-percentiles 50,90,95,99 \
+  --output-json results.json
+```
+
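+#### Sweeping Image Resolutions
+
+Because the bucket config controls the sampled image sizes, two single-bucket runs are a quick way to isolate how resolution affects `apply_hf_processor_secs` and `encoder_forward_secs`. The sketch below uses only flags documented above; the `1080x1920` bucket is an arbitrary "large image" choice rather than a recommended setting:
+
+```bash
+# Run the benchmark once per resolution bucket and save each result
+# to its own JSON file for side-by-side comparison.
+for res in "256, 256" "1080, 1920"; do
+  vllm bench mm-processor \
+    --model Qwen/Qwen2-VL-7B-Instruct \
+    --dataset-name random-mm \
+    --num-prompts 50 \
+    --num-warmups 5 \
+    --random-input-len 300 \
+    --random-output-len 40 \
+    --random-mm-base-items-per-request 1 \
+    --random-mm-bucket-config "{($res, 1): 1.0}" \
+    --output-json "results_$(echo "$res" | tr -d ' ,').json"
+done
+```
+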
+See [`vllm bench mm-processor`](../cli/bench/mm_processor.md) for the full argument reference.
+
+</details>
 
 ### Embedding Benchmark
 
 Benchmark the performance of embedding requests in vLLM.
 
diff --git a/docs/cli/bench/mm_processor.md b/docs/cli/bench/mm_processor.md
index af2c3a8cf..e90583ef9 100644
--- a/docs/cli/bench/mm_processor.md
+++ b/docs/cli/bench/mm_processor.md
@@ -1,5 +1,51 @@
 # vllm bench mm-processor
 
+## Overview
+
+`vllm bench mm-processor` profiles the multimodal input processor pipeline of
+vision-language models. It measures per-stage latency from the HuggingFace
+processor through to the encoder forward pass, helping you identify
+preprocessing bottlenecks and understand how different image resolutions or
+item counts affect end-to-end request time.
+
+The benchmark supports two data sources: synthetic random multimodal inputs
+(`random-mm`) and HuggingFace datasets (`hf`). Warmup requests are run before
+measurement to ensure stable results.
+
+## Quick Start
+
+```bash
+vllm bench mm-processor \
+  --model Qwen/Qwen2-VL-7B-Instruct \
+  --dataset-name random-mm \
+  --num-prompts 50 \
+  --random-input-len 300 \
+  --random-output-len 40 \
+  --random-mm-base-items-per-request 2 \
+  --random-mm-limit-mm-per-prompt '{"image": 3, "video": 0}' \
+  --random-mm-bucket-config '{(256, 256, 1): 0.7, (720, 1280, 1): 0.3}'
+```
+
+## Measured Metrics
+
+| Metric | Description |
+|--------|-------------|
+| `get_mm_hashes_secs` | Time spent hashing multimodal inputs |
+| `get_cache_missing_items_secs` | Time spent looking up the processor cache |
+| `apply_hf_processor_secs` | Time spent in the HuggingFace processor |
+| `merge_mm_kwargs_secs` | Time spent merging multimodal kwargs |
+| `apply_prompt_updates_secs` | Time spent updating prompt tokens |
+| `preprocessor_total_secs` | Total preprocessing time |
+| `encoder_forward_secs` | Time spent in the encoder model forward pass |
+| `num_encoder_calls` | Number of encoder invocations per request |
+
+The benchmark also reports end-to-end latency (TTFT + decode time) per
+request. Use `--metric-percentiles` to select which percentiles to report
+(default: p99) and `--output-json` to save results.
+
+For more examples (HF datasets, warmup, JSON output), see
+[Benchmarking CLI: Multimodal Processor Benchmark](../../benchmarking/cli.md#-multimodal-processor-benchmark).
+
 ## JSON CLI Arguments
 
 --8<-- "docs/cli/json_tip.inc.md"