[Benchmark] Simplify SLA scan (#35306)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Cyrus Leung
2026-02-26 14:35:41 +08:00
committed by GitHub
parent 186ea22efe
commit d3a51da92a
8 changed files with 253 additions and 799 deletions

@@ -4,6 +4,11 @@ This section guides you through running benchmark tests with the extensive datas
It's a living document, updated as new features and datasets become available.
!!! tip
The benchmarks described on this page are mainly for evaluating specific vLLM features as well as regression testing.
For benchmarking production vLLM servers, we recommend [GuideLLM](https://github.com/vllm-project/guidellm), an established performance benchmarking framework with live progress updates and automatic report generation. It is also more flexible than `vllm bench serve` in terms of dataset loading, request formatting, and workload patterns.
## Dataset Overview
<style>

@@ -1,10 +1,15 @@
# Parameter Sweeps
`vllm bench sweep` is a suite of commands designed to run benchmarks across multiple configurations and compare them by visualizing the results.
## Online Benchmark
### Basic
`vllm bench sweep serve` automatically starts `vllm serve` and runs `vllm bench serve` to evaluate vLLM over multiple configurations.
`vllm bench sweep serve` starts `vllm serve` and iteratively runs `vllm bench serve` for each server configuration.
!!! tip
If you only need to run benchmarks for a single server configuration, consider using [GuideLLM](https://github.com/vllm-project/guidellm), an established performance benchmarking framework with live progress updates and automatic report generation. It is also more flexible than `vllm bench serve` in terms of dataset loading, request formatting, and workload patterns.
Follow these steps to run the script:
@@ -50,14 +55,17 @@ Follow these steps to run the script:
```json
[
{
"_benchmark_name": "scenario_A",
"random_input_len": 128,
"random_output_len": 32
},
{
"_benchmark_name": "scenario_B",
"random_input_len": 256,
"random_output_len": 64
},
{
"_benchmark_name": "scenario_C",
"random_input_len": 512,
"random_output_len": 128
}
@@ -77,6 +85,8 @@ vllm bench sweep serve \
-o benchmarks/results
```
By default, each parameter combination is benchmarked 3 times to make the results more reliable. You can adjust the number of runs by setting `--num-runs`.
!!! important
If both `--serve-params` and `--bench-params` are passed, the script will iterate over the Cartesian product of the two parameter sets.
You can use `--dry-run` to preview the commands to be run.
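The Cartesian product of the two parameter files can be pictured with a short sketch. This is only an illustration under assumptions, not the script's actual implementation: `serve_params`, `bench_params`, and `iter_combinations` are hypothetical names, and the `max_num_seqs` values are made up to stand in for the contents of the JSON files.

```python
from itertools import product

# Hypothetical in-memory stand-ins for the --serve-params and --bench-params files.
serve_params = [{"max_num_seqs": 32}, {"max_num_seqs": 64}]
bench_params = [
    {"random_input_len": 128, "random_output_len": 32},
    {"random_input_len": 256, "random_output_len": 64},
]


def iter_combinations(serve_params, bench_params):
    """Yield every (serve, bench) pair, i.e. the Cartesian product."""
    yield from product(serve_params, bench_params)


combos = list(iter_combinations(serve_params, bench_params))
for serve, bench in combos:
    print(serve, bench)
```

Two server configurations and two benchmark configurations give four combinations, so with the default of 3 runs per combination this example would amount to 12 benchmark runs.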
@@ -86,60 +96,40 @@ vllm bench sweep serve \
In case you are using a custom `--serve-cmd`, you can override the commands used for resetting the state by setting `--after-bench-cmd`.
!!! note
By default, each parameter combination is run 3 times to make the results more reliable. You can adjust the number of runs by setting `--num-runs`.
You should set `_benchmark_name` to provide a human-readable name for parameter combinations involving many variables.
This becomes mandatory if the file name would otherwise exceed the maximum path length allowed by the filesystem.
!!! tip
You can use the `--resume` option to continue the parameter sweep if one of the runs failed.
### SLA auto-tuner
You can use the `--resume` option to continue the parameter sweep if an unexpected error occurs, e.g., timeout when connecting to HF Hub.
`vllm bench sweep serve_sla` is a wrapper over `vllm bench sweep serve` that tunes either the request rate or concurrency (choose using `--sla-variable`) in order to satisfy the SLA constraints given by `--sla-params`.
### SLA Scanner
For example, to ensure E2E latency within different target values for 99% of requests:
```json
[
{
"p99_e2el_ms": "<=200"
},
{
"p99_e2el_ms": "<=500"
},
{
"p99_e2el_ms": "<=1000"
},
{
"p99_e2el_ms": "<=2000"
}
]
```
`vllm bench sweep serve_sla` is a variant of `vllm bench sweep serve` that scans through values of request rate or concurrency (choose using `--sla-variable`) in order to find the tradeoff between latency and throughput. The results can then be [visualized](#visualization) to determine the feasible SLAs.
Example command:
```bash
vllm bench sweep serve_sla \
--serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100' \
--serve-params benchmarks/serve_hparams.json \
--bench-params benchmarks/bench_hparams.json \
--sla-params benchmarks/sla_hparams.json \
--sla-variable max_concurrency \
--bench-params benchmarks/bench_hparams.json \
-o benchmarks/results
```
The algorithm for adjusting the SLA variable is as follows:
The algorithm for scanning through different values of `sla_variable` can be summarized as follows:
1. Run the benchmark once with maximum possible QPS, and once with minimum possible QPS. For each run, calculate the distance of the SLA metrics from their targets, resulting in data points of QPS vs SLA distance.
2. Perform spline interpolation between the data points to estimate the QPS that results in zero SLA distance.
3. Run the benchmark with the estimated QPS and add the resulting data point to the history.
4. Repeat Steps 2 and 3 until the maximum QPS that passes SLA and the minimum QPS that fails SLA in the history are close enough to each other.
1. Run the benchmark once with `sla_variable = 1` to simulate serial inference. This results in the lowest possible latency and throughput.
2. Run the benchmark once with `sla_variable = num_prompts` to simulate batch inference over the whole dataset. This results in the highest possible latency and throughput.
3. Estimate the maximum value of `sla_variable` that can be supported by the server without oversaturating it.
4. Run the benchmark over intermediate values of `sla_variable` uniformly using the remaining iterations.
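The four steps above can be sketched roughly as follows. This is a hedged illustration rather than the actual implementation: `scan_values` is a hypothetical helper, the estimate from Step 3 is simply passed in as `estimated_max` because the text does not say how it is computed, and the sketch assumes that the two endpoint runs count towards the total number of iterations.

```python
def scan_values(num_prompts: int, estimated_max: int, iters: int) -> list[int]:
    """Return the values of the SLA variable to benchmark (hypothetical sketch)."""
    if iters < 2:
        raise ValueError("need at least 2 iterations for the two endpoint runs")

    # Steps 1 and 2: serial inference and batch inference over the whole dataset.
    values = {1, num_prompts}

    # Step 3: the estimate is an input here; never exceed the dataset size.
    upper = min(estimated_max, num_prompts)

    # Step 4: spread the remaining iterations uniformly over (1, upper].
    remaining = iters - 2
    for i in range(1, remaining + 1):
        values.add(1 + (upper - 1) * i // remaining)

    # Duplicates collapse in the set, so no value is benchmarked twice.
    return sorted(values)


print(scan_values(num_prompts=100, estimated_max=40, iters=6))
```

Integer arithmetic keeps the intermediate values whole numbers, which suits a variable such as `max_concurrency`; a real-valued request rate could use floating-point spacing instead.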
!!! important
SLA tuning is applied over each combination of `--serve-params`, `--bench-params`, and `--sla-params`.
You can override the number of iterations in the algorithm by setting `--sla-iters`.
For a given combination of `--serve-params` and `--bench-params`, we share the benchmark results across `--sla-params` to avoid rerunning benchmarks with the same SLA variable value.
!!! tip
This is our equivalent of [GuideLLM's `--profile sweep`](https://github.com/vllm-project/guidellm/blob/v0.5.3/src/guidellm/benchmark/profiles.py#L575).
### Startup
## Startup Benchmark
`vllm bench sweep startup` runs `vllm bench startup` across parameter combinations to compare cold/warm startup time for different engine settings.
@@ -202,15 +192,28 @@ vllm bench sweep startup \
`vllm bench sweep plot` can be used to plot performance curves from parameter sweep results.
Example command:
Control the variables to plot via `--var-x` and `--var-y`, optionally applying `--filter-by` and `--bin-by` to the values. The plot is organized according to `--fig-by`, `--row-by`, `--col-by`, and `--curve-by`.
Example commands for visualizing [SLA Scanner](#sla-scanner) results:
```bash
# Latency increases as the request rate increases
vllm bench sweep plot benchmarks/results/<timestamp> \
--var-x max_concurrency \
--var-x request_rate \
--var-y p99_ttft_ms \
--row-by random_input_len \
--col-by random_output_len \
--curve-by api_server_count,max_num_batched_tokens \
--filter-by 'max_concurrency<=1024'
--curve-by max_num_seqs,max_num_batched_tokens \
--filter-by 'request_rate<=128'
# Tradeoff between latency and throughput
vllm bench sweep plot benchmarks/results/<timestamp> \
--var-x request_throughput \
--var-y median_ttft_ms \
--row-by random_input_len \
--col-by random_output_len \
--curve-by max_num_seqs,max_num_batched_tokens \
--filter-by 'request_rate<=128'
```
!!! tip
@@ -233,3 +236,6 @@ Example:
vllm bench sweep plot_pareto benchmarks/results/<timestamp> \
--label-by max_concurrency,tensor_parallel_size,pipeline_parallel_size
```
!!! tip
You can use `--dry-run` to preview the figures to be plotted.