vLLM CPU benchmark suite improvement (#34128)
Signed-off-by: louie-tsai <louie.tsai@intel.com>
@@ -176,7 +176,7 @@ For the full and up-to-date list of models validated on CPU platforms, please se
### How to find benchmark configuration examples for supported CPU models?
For any model listed under [Supported Models on CPU](../../models/hardware_supported_models/cpu.md), optimized runtime configurations are provided in the vLLM Benchmark Suite's CPU test cases, defined in [serving-tests-cpu.json](../../../.buildkite/performance-benchmarks/tests/serving-tests-cpu.json). The full test cases for text-only, multi-modal, and embedding models are defined in [serving-tests-cpu-text.json](../../../.buildkite/performance-benchmarks/tests/serving-tests-cpu-text.json), [serving-tests-cpu-multimodal.json](../../../.buildkite/performance-benchmarks/tests/serving-tests-cpu-multimodal.json), and [serving-tests-cpu-embed.json](../../../.buildkite/performance-benchmarks/tests/serving-tests-cpu-embed.json) respectively.
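Each entry in these JSON files describes one serving benchmark: a test name, the request rates to sweep, and the server and client arguments to use. A hypothetical sketch of such an entry is shown below; the field names follow the suite's general serving-test format but are illustrative, so check the JSON files themselves for the exact keys and values:

```json
[
  {
    "test_name": "serving_llama8B_tp1_sharegpt",
    "qps_list": [1, 4, 16, "inf"],
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "dtype": "bfloat16",
      "tensor_parallel_size": 1
    },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "dataset_name": "sharegpt",
      "num_prompts": 200
    }
  }
]
```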
For details on how these optimized configurations are determined, see: [performance-benchmark-details](../../../.buildkite/performance-benchmarks/README.md#performance-benchmark-details).
To benchmark the supported models using these optimized settings, follow the steps in [running vLLM Benchmark Suite manually](../../benchmarking/dashboard.md#manually-trigger-the-benchmark) and run the Benchmark Suite on a CPU environment.
@@ -199,6 +199,28 @@ lscpu | grep "NUMA node(s):" | awk '{print $3}'
For performance reference, users may also consult the [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm&deviceName=cpu), which publishes default-model CPU results produced with the same Benchmark Suite.
#### Dry-Run
For users who only need the optimized runtime configurations without actually running the benchmarks, a Dry-Run mode is provided. Passing the environment variable `DRY_RUN=1` to `run-performance-benchmarks.sh` generates all benchmark commands under `./benchmark/results/` without executing them.
```bash
ON_CPU=1 DRY_RUN=1 bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```
By providing a different JSON file via `SERVING_JSON`, users can generate runtime configurations for other model groups, such as the embedding models.
```bash
ON_CPU=1 SERVING_JSON=serving-tests-cpu-embed.json DRY_RUN=1 bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```
By providing `MODEL_FILTER` and `DTYPE_FILTER`, only the commands for the matching model ID and data type are generated.
```bash
ON_CPU=1 SERVING_JSON=serving-tests-cpu-text.json DRY_RUN=1 MODEL_FILTER=meta-llama/Llama-3.1-8B-Instruct DTYPE_FILTER=bfloat16 bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```
### How to decide `VLLM_CPU_OMP_THREADS_BIND`?
- Default `auto` thread-binding is recommended for most cases. Ideally, each OpenMP thread is bound to a dedicated physical core, the threads of each rank are bound to the same NUMA node, and one CPU per rank is reserved for other vLLM components when `world_size > 1`. If you observe performance problems or unexpected binding behaviour, try binding threads explicitly as follows.
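As an illustration, an explicit binding string can be composed by giving each rank one contiguous block of physical cores, with the per-rank ranges separated by `|`. This is a sketch under an assumed topology (2 ranks, 32 physical cores per NUMA node); adjust `RANKS` and `CORES_PER_NODE` to the values reported by `lscpu` on your machine.

```shell
# Sketch: build a VLLM_CPU_OMP_THREADS_BIND value that assigns each rank
# one contiguous block of physical cores.
# Assumed topology: 2 ranks, 32 physical cores per NUMA node.
RANKS=2
CORES_PER_NODE=32
BIND=""
r=0
while [ "$r" -lt "$RANKS" ]; do
  start=$((r * CORES_PER_NODE))
  end=$((start + CORES_PER_NODE - 1))
  # Per-rank core ranges are joined with '|'
  if [ -z "$BIND" ]; then
    BIND="${start}-${end}"
  else
    BIND="${BIND}|${start}-${end}"
  fi
  r=$((r + 1))
done
echo "$BIND"   # prints: 0-31|32-63
```

The resulting string would then be passed when launching vLLM, e.g. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63`; with the default `auto` binding this step is unnecessary.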