# Benchmarking vLLM
This README guides you through running benchmark tests with the extensive
datasets supported on vLLM. It’s a living document, updated as new features and datasets
become available.
## Dataset Overview
<table style="width:100%; border-collapse: collapse;">
<thead>
<tr>
<th style="width:15%; text-align: left;">Dataset</th>
<th style="width:10%; text-align: center;">Online</th>
<th style="width:10%; text-align: center;">Offline</th>
<th style="width:65%; text-align: left;">Data Path</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>ShareGPT</strong></td>
<td style="text-align: center;">✅</td>
<td style="text-align: center;">✅</td>
<td><code>wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json</code></td>
</tr>
<tr>
<td><strong>BurstGPT</strong></td>
<td style="text-align: center;">✅</td>
<td style="text-align: center;">✅</td>
<td><code>wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv</code></td>
</tr>
<tr>
<td><strong>Sonnet</strong></td>
<td style="text-align: center;">✅</td>
<td style="text-align: center;">✅</td>
<td>Local file: <code>benchmarks/sonnet.txt</code></td>
</tr>
<tr>
<td><strong>Random</strong></td>
<td style="text-align: center;">✅</td>
<td style="text-align: center;">✅</td>
<td><code>synthetic</code></td>
</tr>
<tr>
<td><strong>HuggingFace-VisionArena</strong></td>
<td style="text-align: center;">✅</td>
<td style="text-align: center;">✅</td>
<td><code>lmarena-ai/VisionArena-Chat</code></td>
</tr>
<tr>
<td><strong>HuggingFace-InstructCoder</strong></td>
<td style="text-align: center;">✅</td>
<td style="text-align: center;">✅</td>
<td><code>likaixin/InstructCoder</code></td>
</tr>
<tr>
<td><strong>HuggingFace-AIMO</strong></td>
<td style="text-align: center;">✅</td>
<td style="text-align: center;">✅</td>
<td><code>AI-MO/aimo-validation-aime</code>, <code>AI-MO/NuminaMath-1.5</code>, <code>AI-MO/NuminaMath-CoT</code></td>
</tr>
<tr>
<td><strong>HuggingFace-Other</strong></td>
<td style="text-align: center;">✅</td>
<td style="text-align: center;">✅</td>
<td><code>lmms-lab/LLaVA-OneVision-Data</code>, <code>Aeala/ShareGPT_Vicuna_unfiltered</code></td>
</tr>
<tr>
<td><strong>Custom</strong></td>
<td style="text-align: center;">✅</td>
<td style="text-align: center;">✅</td>
<td>Local file: <code>data.jsonl</code></td>
</tr>
</tbody>
</table>

✅: Supported

🟡: Partially supported

🚧: To be supported

**Note**: For HuggingFace datasets, set `--dataset-name` to `hf`.
## 🚀 Example - Online Benchmark
<details>
<summary>Show more</summary>
<br/>

First, start serving your model:
```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B
```
Then run the benchmarking script:
```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench serve \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--endpoint /v1/completions \
--dataset-name sharegpt \
--dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 10
```
If successful, you will see output similar to the following:
```text
============ Serving Benchmark Result ============
Successful requests: 10
Benchmark duration (s): 5.78
Total input tokens: 1369
Total generated tokens: 2212
Request throughput (req/s): 1.73
Output token throughput (tok/s): 382.89
Total Token throughput (tok/s): 619.85
---------------Time to First Token----------------
Mean TTFT (ms): 71.54
Median TTFT (ms): 73.88
P99 TTFT (ms): 79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.91
Median TPOT (ms): 7.96
P99 TPOT (ms): 8.03
---------------Inter-token Latency----------------
Mean ITL (ms): 7.74
Median ITL (ms): 7.70
P99 ITL (ms): 8.39
==================================================
```
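As a sanity check on the report above, the three throughput figures appear to be simple ratios of the totals to the benchmark duration. The sketch below is not part of vLLM; it only redoes that arithmetic with the sample values copied from the output, and the helper name `per_second` is made up for illustration. Because the printed duration is rounded to two decimals, the recomputed token throughputs differ slightly from the reported ones.

```python
# Values copied from the sample "Serving Benchmark Result" above.
successful_requests = 10
duration_s = 5.78
total_input_tokens = 1369
total_generated_tokens = 2212


def per_second(count: float, seconds: float) -> float:
    """Hypothetical helper: convert a total count into a per-second rate."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return count / seconds


# Assumption: each throughput is "total / benchmark duration".
request_throughput = per_second(successful_requests, duration_s)
output_token_throughput = per_second(total_generated_tokens, duration_s)
total_token_throughput = per_second(
    total_input_tokens + total_generated_tokens, duration_s
)

print(f"Request throughput (req/s):      {request_throughput:.2f}")
print(f"Output token throughput (tok/s): {output_token_throughput:.2f}")
print(f"Total token throughput (tok/s):  {total_token_throughput:.2f}")
```

The recomputed values should land within a fraction of a token per second of the reported 382.89 and 619.85.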
### Custom Dataset
If the dataset you want to benchmark is not yet supported in vLLM, you can still benchmark on it using `CustomDataset`. Your data must be in `.jsonl` format with a `prompt` field in each entry, e.g., `data.jsonl`:
```json
{"prompt": "What is the capital of India?"}
{"prompt": "What is the capital of Iran?"}
{"prompt": "What is the capital of China?"}
```
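If your prompts live in another format, a short script can produce a file in this shape. The following is a minimal sketch using only the Python standard library; the function names `write_prompts_jsonl` and `validate_prompts_jsonl` are hypothetical helpers, not part of vLLM, and the only requirement taken from this README is one JSON object per line with a `prompt` field.

```python
import json
from pathlib import Path


def write_prompts_jsonl(prompts, path):
    """Write one {"prompt": ...} JSON object per line (hypothetical helper)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"prompt": prompt}, ensure_ascii=False) + "\n")
    return path


def validate_prompts_jsonl(path):
    """Return the number of entries; raise if a line lacks a string "prompt"."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            if not isinstance(entry, dict) or not isinstance(entry.get("prompt"), str):
                raise ValueError(f"line {line_no}: missing string 'prompt' field")
            count += 1
    return count
```

For example, `write_prompts_jsonl(["What is the capital of India?"], "data.jsonl")` writes a one-line file, and `validate_prompts_jsonl("data.jsonl")` can be run before benchmarking to catch malformed entries early.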
```bash
# start server (its port must match the --port passed to the benchmark below)
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --port 9001
```
```bash
# run benchmarking script
vllm bench serve --port 9001 --save-result --save-detailed \
--backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--endpoint /v1/completions \
--dataset-name custom \
--dataset-path <path-to-your-data-jsonl> \
--custom-skip-chat-template \
--num-prompts 80 \
--max-concurrency 1 \
--temperature=0.3 \
--top-p=0.75 \
--result-dir "./log/"
```
You can skip applying the chat template with `--custom-skip-chat-template` if your data already has it applied.
### VisionArena Benchmark for Vision Language Models
```bash
# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct
```
```bash
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat \
--hf-split train \
--num-prompts 1000
```
### InstructCoder Benchmark with Speculative Decoding
```bash
VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-config $'{"method": "ngram",
"num_speculative_tokens": 5, "prompt_lookup_max": 5,
"prompt_lookup_min": 2}'
```
```bash
vllm bench serve \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--dataset-name hf \
--dataset-path likaixin/InstructCoder \
--num-prompts 2048
```
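The `--speculative-config` value in the server command above is a JSON object passed as a single shell argument, which is easy to mis-quote by hand. As a hedged convenience sketch (not something vLLM provides), the snippet below builds the same JSON from a Python dict and shell-quotes it. The keys and values are copied from the command above; the helper name `speculative_config_arg` is an assumption for illustration.

```python
import json
import shlex

# Keys and values copied from the --speculative-config example above.
speculative_config = {
    "method": "ngram",
    "num_speculative_tokens": 5,
    "prompt_lookup_max": 5,
    "prompt_lookup_min": 2,
}


def speculative_config_arg(config: dict) -> str:
    """Hypothetical helper: serialize to one-line JSON and quote it for a POSIX shell."""
    return shlex.quote(json.dumps(config))


# Prints a fragment that can be pasted after `vllm serve <model>`.
print("--speculative-config", speculative_config_arg(speculative_config))
```

Generating the argument this way keeps the JSON on one line and avoids relying on multi-line quoting in the shell.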
### Other HuggingFaceDataset Examples
```bash
vllm serve Qwen/Qwen2-VL-7B-Instruct
```
`lmms-lab/LLaVA-OneVision-Data`:
```bash
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path lmms-lab/LLaVA-OneVision-Data \
--hf-split train \
--hf-subset "chart2text(cauldron)" \
--num-prompts 10
```
`Aeala/ShareGPT_Vicuna_unfiltered`:
```bash
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
--hf-split train \
--num-prompts 10
```
`AI-MO/aimo-validation-aime`:
```bash
vllm bench serve \
--model Qwen/QwQ-32B \
--dataset-name hf \
--dataset-path AI-MO/aimo-validation-aime \
--num-prompts 10 \
--seed 42
```
`philschmid/mt-bench`:
```bash
vllm bench serve \
--model Qwen/QwQ-32B \
--dataset-name hf \
--dataset-path philschmid/mt-bench \
--num-prompts 80
```
### Running With Sampling Parameters
When using OpenAI-compatible backends such as `vllm`, optional sampling
parameters can be specified. Example client command:
```bash
vllm bench serve \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--endpoint /v1/completions \
--dataset-name sharegpt \
--dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
--top-k 10 \
--top-p 0.9 \
--temperature 0.5 \
--num-prompts 10
```
### Running With Ramp-Up Request Rate
The benchmark tool also supports ramping up the request rate over the
duration of the benchmark run. This can be useful for stress testing the
server or finding the maximum throughput that it can handle, given some latency budget.
Two ramp-up strategies are supported:

- `linear`: Increases the request rate linearly from a start value to an end value.
- `exponential`: Increases the request rate exponentially.

The following arguments can be used to control the ramp-up:

- `--ramp-up-strategy`: The ramp-up strategy to use (`linear` or `exponential`).
- `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
- `--ramp-up-end-rps`: The request rate at the end of the benchmark.
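
To make the difference between the two strategies concrete, here is a minimal sketch of how a request rate could be interpolated between the start and end values. The formulas are an assumption for illustration (plain linear interpolation and geometric interpolation), not necessarily what the benchmark tool computes internally; the function name `ramped_request_rate` and the `progress` parameter (fraction of the run completed, from 0 to 1) are hypothetical.

```python
def ramped_request_rate(strategy: str, start_rps: float, end_rps: float,
                        progress: float) -> float:
    """Illustrative request rate at a given fraction of the benchmark run."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    if strategy == "linear":
        # Constant absolute increase per unit of progress.
        return start_rps + (end_rps - start_rps) * progress
    if strategy == "exponential":
        if start_rps <= 0:
            raise ValueError("exponential ramp-up needs a positive start rate")
        # Constant multiplicative increase per unit of progress.
        return start_rps * (end_rps / start_rps) ** progress
    raise ValueError(f"unknown strategy: {strategy!r}")


for step in range(5):
    fraction = step / 4
    linear = ramped_request_rate("linear", 1.0, 100.0, fraction)
    exponential = ramped_request_rate("exponential", 1.0, 100.0, fraction)
    print(f"{fraction:.2f}  linear={linear:7.2f}  exponential={exponential:7.2f}")
```

Under these assumptions, both strategies start and end at the same rates, but the exponential ramp spends more of the run at low request rates and rises steeply near the end.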
</details>
## 📈 Example - Offline Throughput Benchmark
<details>
<summary>Show more</summary>
<br/>

```bash
vllm bench throughput \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset-name sonnet \
--dataset-path vllm/benchmarks/sonnet.txt \
--num-prompts 10
```
If successful, you will see output similar to the following:
```text
Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens: 5014
Total num output tokens: 1500
```
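When collecting results from several runs, it can help to pull the numbers out of this summary programmatically. The sketch below parses the `Throughput:` line with a regular expression that mirrors the sample output above; the format may differ between vLLM versions, and the names `THROUGHPUT_RE` and `parse_throughput` are hypothetical.

```python
import re

# Pattern written against the sample output above; treat it as an assumption.
THROUGHPUT_RE = re.compile(
    r"Throughput: (?P<requests>[\d.]+) requests/s, "
    r"(?P<total>[\d.]+) total tokens/s, "
    r"(?P<output>[\d.]+) output tokens/s"
)


def parse_throughput(line: str) -> dict:
    """Extract the three rates from a 'Throughput:' summary line as floats."""
    match = THROUGHPUT_RE.search(line)
    if match is None:
        raise ValueError("not a throughput summary line")
    return {name: float(value) for name, value in match.groupdict().items()}


sample = "Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s"
rates = parse_throughput(sample)

# Each rate implies roughly the same elapsed time when divided into its total.
elapsed_from_total = (5014 + 1500) / rates["total"]
elapsed_from_output = 1500 / rates["output"]
print(rates)
print(f"implied elapsed time: {elapsed_from_total:.2f}s / {elapsed_from_output:.2f}s")
```

The two implied elapsed times agreeing is a quick check that the token totals and rates in a report belong to the same run.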
### VisionArena Benchmark for Vision Language Models
```bash
vllm bench throughput \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat \
--num-prompts 1000 \
--hf-split train
```
The `num prompt tokens` count now includes image tokens:
```text
Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
Total num prompt tokens: 14527
Total num output tokens: 1280
```
### InstructCoder Benchmark with Speculative Decoding
```bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_USE_V1=1 \
vllm bench throughput \
--dataset-name=hf \
--dataset-path=likaixin/InstructCoder \
--model=meta-llama/Meta-Llama-3-8B-Instruct \
--input-len=1000 \
--output-len=100 \
--num-prompts=2048 \
--async-engine \
--speculative-config $'{"method": "ngram",
"num_speculative_tokens": 5, "prompt_lookup_max": 5,
"prompt_lookup_min": 2}'
```
```text
Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s
Total num prompt tokens: 261136
Total num output tokens: 204800
```
### Other HuggingFaceDataset Examples
`lmms-lab/LLaVA-OneVision-Data`:
```bash
vllm bench throughput \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name hf \
--dataset-path lmms-lab/LLaVA-OneVision-Data \
--hf-split train \
--hf-subset "chart2text(cauldron)" \
--num-prompts 10
```
`Aeala/ShareGPT_Vicuna_unfiltered`:
```bash
vllm bench throughput \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name hf \
--dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
--hf-split train \
--num-prompts 10
```
`AI-MO/aimo-validation-aime`:
```bash
vllm bench throughput \
--model Qwen/QwQ-32B \
--backend vllm \
--dataset-name hf \
--dataset-path AI-MO/aimo-validation-aime \
--hf-split train \
--num-prompts 10
```
Benchmark with LoRA adapters:
```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench throughput \
--model meta-llama/Llama-2-7b-hf \
--backend vllm \
--dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
--dataset-name sharegpt \
--num-prompts 10 \
--max-loras 2 \
--max-lora-rank 8 \
--enable-lora \
--lora-path yard1/llama-2-7b-sql-lora-test
```
</details>
## 🛠️ Example - Structured Output Benchmark
<details>
<summary>Show more</summary>
<br/>

Benchmark the performance of structured output generation (JSON, grammar, regex).
### Server Setup
```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B
```
### JSON Schema Benchmark
```bash
python3 benchmarks/benchmark_serving_structured_output.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset json \
--structured-output-ratio 1.0 \
--request-rate 10 \
--num-prompts 1000
```
### Grammar-based Generation Benchmark
```bash
python3 benchmarks/benchmark_serving_structured_output.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset grammar \
--structure-type grammar \
--request-rate 10 \
--num-prompts 1000
```
### Regex-based Generation Benchmark
```bash
python3 benchmarks/benchmark_serving_structured_output.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset regex \
--request-rate 10 \
--num-prompts 1000
```
### Choice-based Generation Benchmark
```bash
python3 benchmarks/benchmark_serving_structured_output.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset choice \
--request-rate 10 \
--num-prompts 1000
```
### XGrammar Benchmark Dataset
```bash
python3 benchmarks/benchmark_serving_structured_output.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset xgrammar_bench \
--request-rate 10 \
--num-prompts 1000
```
</details>
## 📚 Example - Long Document QA Benchmark
<details>
<summary>Show more</summary>
<br/>

Benchmark the performance of long document question-answering with prefix caching.
### Basic Long Document QA Test
```bash
python3 benchmarks/benchmark_long_document_qa_throughput.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-documents 16 \
--document-length 2000 \
--output-len 50 \
--repeat-count 5
```
### Different Repeat Modes
```bash
# Random mode (default) - shuffle prompts randomly
python3 benchmarks/benchmark_long_document_qa_throughput.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-documents 8 \
--document-length 3000 \
--repeat-count 3 \
--repeat-mode random
# Tile mode - repeat entire prompt list in sequence
python3 benchmarks/benchmark_long_document_qa_throughput.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-documents 8 \
--document-length 3000 \
--repeat-count 3 \
--repeat-mode tile
# Interleave mode - repeat each prompt consecutively
python3 benchmarks/benchmark_long_document_qa_throughput.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-documents 8 \
--document-length 3000 \
--repeat-count 3 \
--repeat-mode interleave
```
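The three repeat modes differ only in how the repeated prompts are ordered. As a rough illustration based on the comments in the commands above (not the benchmark script's actual code), the sketch below orders a small prompt list in each mode; the function name `repeat_prompts` and its `seed` parameter are hypothetical.

```python
import random


def repeat_prompts(prompts, repeat_count, mode="random", seed=0):
    """Illustrative ordering of repeated prompts for each repeat mode."""
    if mode == "tile":
        # Repeat the entire prompt list in sequence: A B A B ...
        return list(prompts) * repeat_count
    if mode == "interleave":
        # Repeat each prompt consecutively: A A B B ...
        return [p for p in prompts for _ in range(repeat_count)]
    if mode == "random":
        # Same multiset of prompts as "tile", shuffled.
        repeated = list(prompts) * repeat_count
        random.Random(seed).shuffle(repeated)
        return repeated
    raise ValueError(f"unknown repeat mode: {mode!r}")


for mode in ("tile", "interleave", "random"):
    print(mode, repeat_prompts(["A", "B", "C"], 2, mode=mode))
```

The ordering matters for prefix caching: with interleaving, a repeated prompt immediately follows its first occurrence, while tiling and random shuffling put other prompts in between.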
</details>
## 🗂️ Example - Prefix Caching Benchmark
<details>
<summary>Show more</summary>
<br/>

Benchmark the efficiency of automatic prefix caching.
### Fixed Prompt with Prefix Caching
```bash
python3 benchmarks/benchmark_prefix_caching.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-prompts 1 \
--repeat-count 100 \
--input-length-range 128:256
```
### ShareGPT Dataset with Prefix Caching
```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 benchmarks/benchmark_prefix_caching.py \
--model meta-llama/Llama-2-7b-chat-hf \
--dataset-path /path/ShareGPT_V3_unfiltered_cleaned_split.json \
--enable-prefix-caching \
--num-prompts 20 \
--repeat-count 5 \
--input-length-range 128:256
```
</details>
## ⚡ Example - Request Prioritization Benchmark
<details>
<summary>Show more</summary>
<br/>

Benchmark the performance of request prioritization in vLLM.
### Basic Prioritization Test
```bash
python3 benchmarks/benchmark_prioritization.py \
--model meta-llama/Llama-2-7b-chat-hf \
--input-len 128 \
--output-len 64 \
--num-prompts 100 \
--scheduling-policy priority
```
### Multiple Sequences per Prompt
```bash
python3 benchmarks/benchmark_prioritization.py \
--model meta-llama/Llama-2-7b-chat-hf \
--input-len 128 \
--output-len 64 \
--num-prompts 100 \
--scheduling-policy priority \
--n 2
```
</details>