[Benchmark] Rename SLA Finder to Workload Explorer (#35586)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-02-28 15:31:55 +08:00
parent 57c86c0741
commit 24d6ea8afd
6 changed files with 124 additions and 106 deletions
--- a/docs/benchmarking/sweeps.md
+++ b/docs/benchmarking/sweeps.md
@@ -102,36 +102,39 @@ By default, each parameter combination is benchmarked 3 times to make the result
 !!! tip
    You can use the `--resume` option to continue the parameter sweep if an unexpected error occurs, e.g., timeout when connecting to HF Hub.

-### SLA Scanner
+### Workload Explorer

-`vllm bench sweep serve_sla` is a variant of `vllm bench sweep serve` that scans through values of request rate or concurrency (choose using `--sla-variable`) in order to find the tradeoff between latency and throughput. The results can then be [visualized](#visualization) to determine the feasible SLAs.
+`vllm bench sweep serve_workload` is a variant of `vllm bench sweep serve` that explores different workload levels in order to find the tradeoff between latency and throughput. The results can also be [visualized](#visualization) to determine the feasible SLAs.
+
+The workload can be expressed in terms of request rate or concurrency (choose using `--workload-var`).

 Example command:

 ```bash
-vllm bench sweep serve_sla \
+vllm bench sweep serve_workload \
    --serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
    --bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100' \
-    --sla-variable max_concurrency \
+    --workload-var max_concurrency \
    --serve-params benchmarks/serve_hparams.json \
-    --bench-params benchmarks/bench_hparams.json
+    --bench-params benchmarks/bench_hparams.json \
+    --num-runs 1 \
    -o benchmarks/results
 ```

-The algorithm for scanning through different values of `sla_variable` can be summarized as follows:
+The algorithm for exploring different workload levels can be summarized as follows:

-1. Run the benchmark by sending requests one at a time (serial inference). This results in the lowest possible latency and throughput.
-2. Run the benchmark by sending all requests at once (batch inference). This results in the highest possible latency and throughput.
-3. Estimate the maximum value of `sla_variable` that can be supported by the server without oversaturating it.
-4. Run the benchmark over intermediate values of `sla_variable` uniformly using the remaining iterations.
+1. Run the benchmark by sending requests one at a time (serial inference, lowest workload). This results in the lowest possible latency and throughput.
+2. Run the benchmark by sending all requests at once (batch inference, highest workload). This results in the highest possible latency and throughput.
+3. Estimate the value of `workload_var` corresponding to Step 2.
+4. Run the benchmark over intermediate values of `workload_var` uniformly using the remaining iterations.

-You can override the number of iterations in the algorithm by setting `--sla-iters`.
+You can override the number of iterations in the algorithm by setting `--workload-iters`.
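
As a reading aid for the four steps above, here is a minimal Python sketch of how the workload levels might be planned. It is not the code from this commit: the function name `plan_workload_levels` and its parameters are hypothetical, and it assumes that the iteration count set by `--workload-iters` includes the serial and batch runs, which the documentation does not state.

```python
def plan_workload_levels(
    serial_value: float,
    batch_value: float,
    num_iters: int,
) -> list[float]:
    """Return workload levels in the order the steps above would run them.

    serial_value: value of the workload variable for the serial run (Step 1).
    batch_value:  estimated value for the batch run (Steps 2 and 3).
    num_iters:    total number of runs, assumed to include both endpoints.
    """
    if num_iters < 2:
        raise ValueError("need at least the serial and batch runs")

    # Step 4: spread the remaining runs uniformly strictly between the endpoints.
    remaining = num_iters - 2
    step = (batch_value - serial_value) / (remaining + 1)
    intermediate = [serial_value + step * (i + 1) for i in range(remaining)]

    # The endpoints run first, then the intermediate levels.
    return [serial_value, batch_value, *intermediate]
```

For example, `plan_workload_levels(1.0, 9.0, 5)` yields the two endpoints followed by three evenly spaced intermediate levels. The real tool may round values (for instance, to integer concurrency) or space them differently.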

 !!! tip
    This is our equivalent of [GuideLLM's `--profile sweep`](https://github.com/vllm-project/guidellm/blob/v0.5.3/src/guidellm/benchmark/profiles.py#L575).

-    In general, `--sla-variable max_concurrency` produces more reliable results because it directly controls the workload imposed on the vLLM engine.
-    Nevertheless, we default to `--sla-variable request_rate` to maintain similar behavior as GuideLLM.
+    In general, `--workload-var max_concurrency` produces more reliable results because it directly controls the workload imposed on the vLLM engine.
+    Nevertheless, we default to `--workload-var request_rate` to keep the behavior similar to GuideLLM's.

 ## Startup Benchmark

@@ -198,7 +201,7 @@ vllm bench sweep startup \

 Control the variables to plot via `--var-x` and `--var-y`, optionally applying `--filter-by` and `--bin-by` to the values. The plot is organized according to `--fig-by`, `--row-by`, `--col-by`, and `--curve-by`.

-Example commands for visualizing [SLA Scanner](#sla-scanner) results:
+Example commands for visualizing [Workload Explorer](#workload-explorer) results:

 ```bash
 # Name of the directory that stores the results