[Docs] Switch to better markdown linting pre-commit hook (#21851)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -4,7 +4,7 @@ This README guides you through running benchmark tests with the extensive
 datasets supported on vLLM. It’s a living document, updated as new features and datasets
 become available.
 
-**Dataset Overview**
+## Dataset Overview
 
 <table style="width:100%; border-collapse: collapse;">
 <thead>
@@ -81,9 +81,10 @@ become available.
 
 **Note**: HuggingFace dataset's `dataset-name` should be set to `hf`
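
For instance, a HuggingFace-hosted dataset is selected by passing the literal name `hf` and putting the repo id in `--dataset-path`. A minimal sketch (the backend, endpoint, model, and `--hf-split` value are illustrative assumptions, not taken from this diff):

```bash
# Sketch: --dataset-name is literally `hf`; the HuggingFace repo id goes
# in --dataset-path. Backend, model, and split are assumed for illustration.
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmms-lab/LLaVA-OneVision-Data \
  --hf-split train \
  --num-prompts 10
```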
 
 ---
 
+## 🚀 Example - Online Benchmark
+
 <details>
-<summary><b>🚀 Example - Online Benchmark</b></summary>
+<summary>Show more</summary>
 
 <br/>
@@ -109,7 +110,7 @@ vllm bench serve \
 
 If successful, you will see the following output:
 
-```
+```text
 ============ Serving Benchmark Result ============
 Successful requests: 10
 Benchmark duration (s): 5.78
@@ -133,11 +134,11 @@ P99 ITL (ms): 8.39
 ==================================================
 ```
 
-**Custom Dataset**
+### Custom Dataset
 
 If the dataset you want to benchmark is not yet supported in vLLM, you can still benchmark it using `CustomDataset`. Your data needs to be in `.jsonl` format and have a "prompt" field per entry, e.g. `data.jsonl`:
 
-```
+```json
 {"prompt": "What is the capital of India?"}
 {"prompt": "What is the capital of Iran?"}
 {"prompt": "What is the capital of China?"}
@@ -166,7 +167,7 @@ vllm bench serve --port 9001 --save-result --save-detailed \
 
 You can skip applying the chat template if your data already has one by using `--custom-skip-chat-template`.
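
A sketch of the same benchmark with the flag applied (here `--dataset-name custom` and `--dataset-path` are assumed to be how `CustomDataset` is selected; verify against your vLLM version):

```bash
# Sketch: skip the chat template for data that already includes one.
# --dataset-name custom / --dataset-path are assumptions for illustration.
vllm bench serve --port 9001 --save-result --save-detailed \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dataset-name custom \
  --dataset-path ./data.jsonl \
  --custom-skip-chat-template \
  --num-prompts 10
```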
 
-**VisionArena Benchmark for Vision Language Models**
+### VisionArena Benchmark for Vision Language Models
 
 ```bash
 # need a model with vision capability here
@@ -184,7 +185,7 @@ vllm bench serve \
 --num-prompts 1000
 ```
 
-**InstructCoder Benchmark with Speculative Decoding**
+### InstructCoder Benchmark with Speculative Decoding
 
 ``` bash
 VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
@@ -201,13 +202,13 @@ vllm bench serve \
 --num-prompts 2048
 ```
 
-**Other HuggingFaceDataset Examples**
+### Other HuggingFaceDataset Examples
 
 ```bash
 vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
 ```
 
-**`lmms-lab/LLaVA-OneVision-Data`**
+`lmms-lab/LLaVA-OneVision-Data`:
 
 ```bash
 vllm bench serve \
@@ -221,7 +222,7 @@ vllm bench serve \
 --num-prompts 10
 ```
 
-**`Aeala/ShareGPT_Vicuna_unfiltered`**
+`Aeala/ShareGPT_Vicuna_unfiltered`:
 
 ```bash
 vllm bench serve \
@@ -234,7 +235,7 @@ vllm bench serve \
 --num-prompts 10
 ```
 
-**`AI-MO/aimo-validation-aime`**
+`AI-MO/aimo-validation-aime`:
 
 ``` bash
 vllm bench serve \
@@ -245,7 +246,7 @@ vllm bench serve \
 --seed 42
 ```
 
-**`philschmid/mt-bench`**
+`philschmid/mt-bench`:
 
 ``` bash
 vllm bench serve \
@@ -255,7 +256,7 @@ vllm bench serve \
 --num-prompts 80
 ```
 
-**Running With Sampling Parameters**
+### Running With Sampling Parameters
 
 When using OpenAI-compatible backends such as `vllm`, optional sampling
 parameters can be specified. Example client command:
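
A sketch of such a command, assuming `vllm bench serve` forwards the usual `--top-k`, `--top-p`, and `--temperature` knobs (the model and dataset path are placeholders):

```bash
# Sketch: sampling parameters passed alongside the benchmark flags.
# The sampling flag names are assumptions; check `vllm bench serve --help`.
vllm bench serve \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --top-k 10 \
  --top-p 0.9 \
  --temperature 0.5 \
  --num-prompts 10
```
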
@@ -273,25 +274,29 @@ vllm bench serve \
 --num-prompts 10
 ```
 
-**Running With Ramp-Up Request Rate**
+### Running With Ramp-Up Request Rate
 
 The benchmark tool also supports ramping up the request rate over the
 duration of the benchmark run. This can be useful for stress testing the
 server or finding the maximum throughput that it can handle, given some latency budget.
 
 Two ramp-up strategies are supported:
 
 - `linear`: Increases the request rate linearly from a start value to an end value.
 - `exponential`: Increases the request rate exponentially.
 
 The following arguments can be used to control the ramp-up (see the sketch after this list):
 
 - `--ramp-up-strategy`: The ramp-up strategy to use (`linear` or `exponential`).
 - `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
 - `--ramp-up-end-rps`: The request rate at the end of the benchmark.
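
For example, a minimal sketch of a linear ramp from 1 RPS to 10 RPS (the model and dataset are placeholders; only the `--ramp-up-*` flags come from the list above):

```bash
# Sketch: ramp the request rate linearly from 1 to 10 requests/s.
# Model and dataset are assumed for illustration.
vllm bench serve \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --ramp-up-strategy linear \
  --ramp-up-start-rps 1 \
  --ramp-up-end-rps 10 \
  --num-prompts 100
```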
 
 </details>
 
+## 📈 Example - Offline Throughput Benchmark
+
 <details>
-<summary><b>📈 Example - Offline Throughput Benchmark</b></summary>
+<summary>Show more</summary>
 
 <br/>
@@ -305,15 +310,15 @@ vllm bench throughput \
 
 If successful, you will see the following output:
 
-```
+```text
 Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
 Total num prompt tokens: 5014
 Total num output tokens: 1500
 ```
 
-**VisionArena Benchmark for Vision Language Models**
+### VisionArena Benchmark for Vision Language Models
 
-``` bash
+```bash
 vllm bench throughput \
 --model Qwen/Qwen2-VL-7B-Instruct \
 --backend vllm-chat \
@@ -325,13 +330,13 @@ vllm bench throughput \
 
 The `num prompt tokens` now includes image token counts:
 
-```
+```text
 Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
 Total num prompt tokens: 14527
 Total num output tokens: 1280
 ```
 
-**InstructCoder Benchmark with Speculative Decoding**
+### InstructCoder Benchmark with Speculative Decoding
 
 ``` bash
 VLLM_WORKER_MULTIPROC_METHOD=spawn \
@@ -349,15 +354,15 @@ vllm bench throughput \
 "prompt_lookup_min": 2}'
 ```
 
-```
+```text
 Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s
 Total num prompt tokens: 261136
 Total num output tokens: 204800
 ```
 
-**Other HuggingFaceDataset Examples**
+### Other HuggingFaceDataset Examples
 
-**`lmms-lab/LLaVA-OneVision-Data`**
+`lmms-lab/LLaVA-OneVision-Data`:
 
 ```bash
 vllm bench throughput \
@@ -370,7 +375,7 @@ vllm bench throughput \
 --num-prompts 10
 ```
 
-**`Aeala/ShareGPT_Vicuna_unfiltered`**
+`Aeala/ShareGPT_Vicuna_unfiltered`:
 
 ```bash
 vllm bench throughput \
@@ -382,7 +387,7 @@ vllm bench throughput \
 --num-prompts 10
 ```
 
-**`AI-MO/aimo-validation-aime`**
+`AI-MO/aimo-validation-aime`:
 
 ```bash
 vllm bench throughput \
@@ -394,7 +399,7 @@ vllm bench throughput \
 --num-prompts 10
 ```
 
-**Benchmark with LoRA Adapters**
+Benchmark with LoRA adapters:
 
 ``` bash
 # download dataset
@@ -413,20 +418,22 @@ vllm bench throughput \
 
 </details>
 
+## 🛠️ Example - Structured Output Benchmark
+
 <details>
-<summary><b>🛠️ Example - Structured Output Benchmark</b></summary>
+<summary>Show more</summary>
 
 <br/>
 
 Benchmark the performance of structured output generation (JSON, grammar, regex).
 
-**Server Setup**
+### Server Setup
 
 ```bash
 vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests
 ```
 
-**JSON Schema Benchmark**
+### JSON Schema Benchmark
 
 ```bash
 python3 benchmarks/benchmark_serving_structured_output.py \
@@ -438,7 +445,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
 --num-prompts 1000
 ```
 
-**Grammar-based Generation Benchmark**
+### Grammar-based Generation Benchmark
 
 ```bash
 python3 benchmarks/benchmark_serving_structured_output.py \
@@ -450,7 +457,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
 --num-prompts 1000
 ```
 
-**Regex-based Generation Benchmark**
+### Regex-based Generation Benchmark
 
 ```bash
 python3 benchmarks/benchmark_serving_structured_output.py \
@@ -461,7 +468,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
 --num-prompts 1000
 ```
 
-**Choice-based Generation Benchmark**
+### Choice-based Generation Benchmark
 
 ```bash
 python3 benchmarks/benchmark_serving_structured_output.py \
@@ -472,7 +479,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
 --num-prompts 1000
 ```
 
-**XGrammar Benchmark Dataset**
+### XGrammar Benchmark Dataset
 
 ```bash
 python3 benchmarks/benchmark_serving_structured_output.py \
@@ -485,14 +492,16 @@ python3 benchmarks/benchmark_serving_structured_output.py \
 
 </details>
 
+## 📚 Example - Long Document QA Benchmark
+
 <details>
-<summary><b>📚 Example - Long Document QA Benchmark</b></summary>
+<summary>Show more</summary>
 
 <br/>
 
 Benchmark the performance of long document question-answering with prefix caching.
 
-**Basic Long Document QA Test**
+### Basic Long Document QA Test
 
 ```bash
 python3 benchmarks/benchmark_long_document_qa_throughput.py \
@@ -504,7 +513,7 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \
 --repeat-count 5
 ```
 
-**Different Repeat Modes**
+### Different Repeat Modes
 
 ```bash
 # Random mode (default) - shuffle prompts randomly
@@ -537,14 +546,16 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \
 
 </details>
 
+## 🗂️ Example - Prefix Caching Benchmark
+
 <details>
-<summary><b>🗂️ Example - Prefix Caching Benchmark</b></summary>
+<summary>Show more</summary>
 
 <br/>
 
 Benchmark the efficiency of automatic prefix caching.
 
-**Fixed Prompt with Prefix Caching**
+### Fixed Prompt with Prefix Caching
 
 ```bash
 python3 benchmarks/benchmark_prefix_caching.py \
@@ -555,7 +566,7 @@ python3 benchmarks/benchmark_prefix_caching.py \
 --input-length-range 128:256
 ```
 
-**ShareGPT Dataset with Prefix Caching**
+### ShareGPT Dataset with Prefix Caching
 
 ```bash
 # download dataset
@@ -572,14 +583,16 @@ python3 benchmarks/benchmark_prefix_caching.py \
 
 </details>
 
+## ⚡ Example - Request Prioritization Benchmark
+
 <details>
-<summary><b>⚡ Example - Request Prioritization Benchmark</b></summary>
+<summary>Show more</summary>
 
 <br/>
 
 Benchmark the performance of request prioritization in vLLM.
 
-**Basic Prioritization Test**
+### Basic Prioritization Test
 
 ```bash
 python3 benchmarks/benchmark_prioritization.py \
@@ -590,7 +603,7 @@ python3 benchmarks/benchmark_prioritization.py \
 --scheduling-policy priority
 ```
 
-**Multiple Sequences per Prompt**
+### Multiple Sequences per Prompt
 
 ```bash
 python3 benchmarks/benchmark_prioritization.py \