# vllm bench mm-processor

## Overview
`vllm bench mm-processor` profiles the multimodal input processor pipeline of
vision-language models. It measures per-stage latency from the HuggingFace
processor through to the encoder forward pass, helping you identify
preprocessing bottlenecks and understand how image resolution and the number
of multimodal items per request affect end-to-end request time.
The benchmark supports two data sources: synthetic random multimodal inputs
(`random-mm`) and HuggingFace datasets (`hf`). Warmup requests are run before
measurement to ensure stable results.
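For example, an HF-dataset run might look like the sketch below. The `--dataset-path` and `--hf-split` flags follow the conventions of other `vllm bench` subcommands, and the dataset name is illustrative; both are assumptions, so check the generated argument reference for the exact flags this subcommand accepts.

```shell
# Sketch of an HF-dataset run; --dataset-path / --hf-split mirror other
# vllm bench subcommands and are assumptions, not verified flags.
vllm bench mm-processor \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --hf-split train \
    --num-prompts 50
```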
## Quick Start

```bash
vllm bench mm-processor \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --dataset-name random-mm \
    --num-prompts 50 \
    --random-input-len 300 \
    --random-output-len 40 \
    --random-mm-base-items-per-request 2 \
    --random-mm-limit-mm-per-prompt '{"image": 3, "video": 0}' \
    --random-mm-bucket-config '{(256, 256, 1): 0.7, (720, 1280, 1): 0.3}'
```
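Reading the bucket config above: each key appears to be a (height, width, num_frames) shape and each value the probability of sampling that shape, so the probabilities should sum to 1. That interpretation is an assumption here; a quick sanity check under it:

```shell
# Sanity-check that bucket probabilities sum to 1 (assumed semantics:
# keys are (height, width, num_frames); num_frames == 1 means an image).
python3 -c "buckets = {(256, 256, 1): 0.7, (720, 1280, 1): 0.3}; \
assert abs(sum(buckets.values()) - 1.0) < 1e-9; print('ok')"
```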
## Measured Stages

| Stage | Description |
|---|---|
| `get_mm_hashes_secs` | Time spent hashing multimodal inputs |
| `get_cache_missing_items_secs` | Time spent looking up the processor cache |
| `apply_hf_processor_secs` | Time spent in the HuggingFace processor |
| `merge_mm_kwargs_secs` | Time spent merging multimodal kwargs |
| `apply_prompt_updates_secs` | Time spent updating prompt tokens |
| `preprocessor_total_secs` | Total preprocessing time |
| `encoder_forward_secs` | Time spent in the encoder model forward pass |
| `num_encoder_calls` | Number of encoder invocations per request |
The benchmark also reports end-to-end latency (TTFT plus decode time) per
request. Use `--metric-percentiles` to select which percentiles to report
(default: p99) and `--output-json` to save results.
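The file written by `--output-json` is plain JSON, so it can be post-processed with standard tools. A minimal sketch follows; the field name `p99_preprocessor_total_secs` and the stand-in file contents are assumptions about the schema, so inspect a real output file for the actual keys.

```shell
# Create a stand-in results file with an assumed field name, then read it
# back; in practice, point this at the file written by --output-json.
echo '{"p99_preprocessor_total_secs": 0.042}' > results.json
python3 -c "import json; d = json.load(open('results.json')); \
print(f\"p99 preprocess: {d['p99_preprocessor_total_secs']*1000:.1f} ms\")"
```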
For more examples (HF datasets, warmup, JSON output), see Benchmarking CLI — Multimodal Processor Benchmark.
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Arguments
--8<-- "docs/generated/argparse/bench_mm_processor.inc.md"