benchmarks/multi_turn/README.md

# Benchmark KV Cache Offloading with Multi-Turn Conversations

The requirements (pip) for `benchmark_serving_multi_turn.py` can be found in `requirements.txt`

First start serving your model

```bash
export MODEL_PATH=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/

vllm serve $MODEL_PATH --served-model-name Llama --disable-log-requests
```

The variable `MODEL_PATH` should be a path to the model files (e.g. downloaded from huggingface).

## Synthetic Multi-Turn Conversations

Download the following text file (used for generation of synthetic conversations)

```bash
wget https://www.gutenberg.org/ebooks/1184.txt.utf-8
mv 1184.txt.utf-8 pg1184.txt
```

The filename `pg1184.txt` is used in `generate_multi_turn.json` (see `"text_files"`).

But you may use other text files if you prefer (using this specific file is not required).

Then run the benchmarking script

```bash
export MODEL_PATH=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/

python benchmark_serving_multi_turn.py --model $MODEL_PATH --served-model-name Llama \
--input-file generate_multi_turn.json --num-clients 2 --max-active-conversations 6
```

You can edit the file `generate_multi_turn.json` to change the conversation parameters (number of turns, etc.).

If successful, you will see the following output

```bash
----------------------------------------------------------------------------------------------------
Statistics summary:
runtime_sec = 215.810
requests_per_sec = 0.769
----------------------------------------------------------------------------------------------------
                   count     mean     std      min      25%      50%      75%      90%      99%      max
ttft_ms            166.0    78.22   67.63    45.91    59.94    62.26    64.43    69.66   353.18   567.54
tpot_ms            166.0    25.37    0.57    24.40    25.07    25.31    25.50    25.84    27.50    28.05
latency_ms         166.0  2591.07  326.90  1998.53  2341.62  2573.01  2860.10  3003.50  3268.46  3862.94
input_num_turns    166.0     7.43    4.57     1.00     3.00     7.00    11.00    13.00    17.00    17.00
input_num_tokens   166.0  2006.20  893.56   522.00  1247.75  2019.00  2718.00  3233.00  3736.45  3899.00
output_num_tokens  166.0   100.01   11.80    80.00    91.00    99.00   109.75   116.00   120.00   120.00
output_num_chunks  166.0    99.01   11.80    79.00    90.00    98.00   108.75   115.00   119.00   119.00
----------------------------------------------------------------------------------------------------
```

If you run with `--warmup-step`, the summary will also include `warmup_runtime_sec`
and `total_runtime_incl_warmup_sec` (while `runtime_sec` continues to reflect the
benchmark-only runtime so the reported throughput stays comparable).

### JSON configuration file for synthetic conversations generation

The input flag `--input-file` is used to determine the input conversations for the benchmark.<br/>
When the input is a JSON file with the field `"filetype": "generate_conversations"` the tool will generate synthetic multi-turn (questions and answers) conversations.

The file `generate_multi_turn.json` is an example file.

The file must contain the sections `prompt_input` and `prompt_output`.

The `prompt_input` section must contain `num_turns`, `prefix_num_tokens` and `num_tokens`:

* `num_turns` - Number of total turns in the conversation (both user & assistant).<br/>
The final value will always be rounded to an even number so each user turn has a reply.
* `prefix_num_tokens` - Tokens added at the start of only the **first user turn** in a conversation (unique per conversation).
* `num_tokens` - Total token length of each **user** message (one turn).

The `prompt_output` section must contain `num_tokens`:

* `num_tokens` - Total token length of each **assistant** message (one turn).

### Random distributions for synthetic conversations generation

When creating an input JSON file (such as `generate_multi_turn.json`),<br/>
every numeric field (such as `num_turns` or `num_tokens`) requires a distribution.<br/>
The distribution determines how to randomly sample values for the field.

The available distributions are listed below.

**Note:** The optional `max` field (for lognormal, zipf, and poisson) can be used to cap sampled values at an upper bound.</br>
Can be used to make sure that the total number of tokens in every request does not exceed `--max-model-len`.

#### constant

```json
{
    "distribution": "constant",
    "value": 500
}
```

* `value` - the fixed integer value (always returns the same number).

#### uniform

```json
{
    "distribution": "uniform",
    "min": 12,
    "max": 18
}
```

* `min` - minimum value (inclusive).
* `max` - maximum value (inclusive), should be equal or larger than min.

#### lognormal

```json
{
    "distribution": "lognormal",
    "average": 1000,
    "max": 5000
}
```

You can parameterize the lognormal distribution in one of two ways:

Using the average and optional median ratio:

* `average` - target average value of the distribution.
* `median_ratio` - the ratio of the median to the average; controls the skewness. Must be in the range (0, 1).

Using the parameters of the underlying normal distribution:

* `mean` - mean of the underlying normal distribution.
* `sigma` - standard deviation of the underlying normal distribution.

#### zipf

```json
{
    "distribution": "zipf",
    "alpha": 1.2,
    "max": 100
}
```

* `alpha` - skew parameter (> 1). Larger values produce stronger skew toward smaller integers.

#### poisson

```json
{
    "distribution": "poisson",
    "alpha": 10,
    "max": 50
}
```

* `alpha` - expected value (λ). Also the variance of the distribution.

## ShareGPT Conversations

To run with the ShareGPT data, download the following ShareGPT dataset:
`https://huggingface.co/datasets/philschmid/sharegpt-raw/blob/main/sharegpt_20230401_clean_lang_split.json`

Use the `convert_sharegpt_to_openai.py` script to convert the dataset to a format supported by `benchmark_serving_multi_turn.py`

```bash
python convert_sharegpt_to_openai.py sharegpt_20230401_clean_lang_split.json sharegpt_conv_128.json --seed=99 --max-items=128
```

The script will convert the ShareGPT dataset to a dataset with the standard user/assistant roles.

The flag `--max-items=128` is used to sample 128 conversations from the original dataset (change as needed).

Use the output JSON file `sharegpt_conv_128.json` as the `--input-file` for `benchmark_serving_multi_turn.py`.
[Benchmark] Add benchmark tool for multi turn conversations (#20267) 2025-08-08 20:28:50 +03:00			`# Benchmark KV Cache Offloading with Multi-Turn Conversations`

			The requirements (pip) for `benchmark_serving_multi_turn.py` can be found in `requirements.txt`

			`First start serving your model`

			```bash
[Benchmark] Add flag --served-model-name to benchmark_serving_multi_turn (#22889) Signed-off-by: daniels <daniels@pliops.com> 2025-08-19 10:48:07 +03:00			`export MODEL_PATH=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/`
[Benchmark] Add benchmark tool for multi turn conversations (#20267) 2025-08-08 20:28:50 +03:00
[Benchmark] Add flag --served-model-name to benchmark_serving_multi_turn (#22889) Signed-off-by: daniels <daniels@pliops.com> 2025-08-19 10:48:07 +03:00			`vllm serve $MODEL_PATH --served-model-name Llama --disable-log-requests`
[Benchmark] Add benchmark tool for multi turn conversations (#20267) 2025-08-08 20:28:50 +03:00			```

[Benchmark] Add flag --served-model-name to benchmark_serving_multi_turn (#22889) Signed-off-by: daniels <daniels@pliops.com> 2025-08-19 10:48:07 +03:00			The variable `MODEL_PATH` should be a path to the model files (e.g. downloaded from huggingface).

[Benchmark] Add benchmark tool for multi turn conversations (#20267) 2025-08-08 20:28:50 +03:00			`## Synthetic Multi-Turn Conversations`

			`Download the following text file (used for generation of synthetic conversations)`

			```bash
			`wget https://www.gutenberg.org/ebooks/1184.txt.utf-8`
			`mv 1184.txt.utf-8 pg1184.txt`
			```

			The filename `pg1184.txt` is used in `generate_multi_turn.json` (see `"text_files"`).

			`But you may use other text files if you prefer (using this specific file is not required).`

			`Then run the benchmarking script`

			```bash
[Benchmark] Add flag --served-model-name to benchmark_serving_multi_turn (#22889) Signed-off-by: daniels <daniels@pliops.com> 2025-08-19 10:48:07 +03:00			`export MODEL_PATH=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/`
[Benchmark] Add benchmark tool for multi turn conversations (#20267) 2025-08-08 20:28:50 +03:00
[Benchmark] Add flag --served-model-name to benchmark_serving_multi_turn (#22889) Signed-off-by: daniels <daniels@pliops.com> 2025-08-19 10:48:07 +03:00			`python benchmark_serving_multi_turn.py --model $MODEL_PATH --served-model-name Llama \`
			`--input-file generate_multi_turn.json --num-clients 2 --max-active-conversations 6`
[Benchmark] Add benchmark tool for multi turn conversations (#20267) 2025-08-08 20:28:50 +03:00			```

			You can edit the file `generate_multi_turn.json` to change the conversation parameters (number of turns, etc.).

			`If successful, you will see the following output`

			```bash
			`----------------------------------------------------------------------------------------------------`
			`Statistics summary:`
			`runtime_sec = 215.810`
			`requests_per_sec = 0.769`
			`----------------------------------------------------------------------------------------------------`
			`count mean std min 25% 50% 75% 90% 99% max`
			`ttft_ms 166.0 78.22 67.63 45.91 59.94 62.26 64.43 69.66 353.18 567.54`
			`tpot_ms 166.0 25.37 0.57 24.40 25.07 25.31 25.50 25.84 27.50 28.05`
			`latency_ms 166.0 2591.07 326.90 1998.53 2341.62 2573.01 2860.10 3003.50 3268.46 3862.94`
			`input_num_turns 166.0 7.43 4.57 1.00 3.00 7.00 11.00 13.00 17.00 17.00`
			`input_num_tokens 166.0 2006.20 893.56 522.00 1247.75 2019.00 2718.00 3233.00 3736.45 3899.00`
			`output_num_tokens 166.0 100.01 11.80 80.00 91.00 99.00 109.75 116.00 120.00 120.00`
			`output_num_chunks 166.0 99.01 11.80 79.00 90.00 98.00 108.75 115.00 119.00 119.00`
			`----------------------------------------------------------------------------------------------------`
			```

[Benchmark] multi_turn: Report warmup-inclusive runtime (#28937) Signed-off-by: Ido Segev <idos@pliops.com> 2025-11-18 18:38:22 +02:00			If you run with `--warmup-step`, the summary will also include `warmup_runtime_sec`
			and `total_runtime_incl_warmup_sec` (while `runtime_sec` continues to reflect the
			`benchmark-only runtime so the reported throughput stays comparable).`

Add more documentation and improve usability of lognormal dist (benchmark_serving_multi_turn) (#23255) Signed-off-by: daniels <daniels@pliops.com> 2025-09-17 08:53:17 +03:00			`### JSON configuration file for synthetic conversations generation`

			The input flag `--input-file` is used to determine the input conversations for the benchmark.<br/>
			When the input is a JSON file with the field `"filetype": "generate_conversations"` the tool will generate synthetic multi-turn (questions and answers) conversations.

			The file `generate_multi_turn.json` is an example file.

			The file must contain the sections `prompt_input` and `prompt_output`.

			The `prompt_input` section must contain `num_turns`, `prefix_num_tokens` and `num_tokens`:

			* `num_turns` - Number of total turns in the conversation (both user & assistant).<br/>
			`The final value will always be rounded to an even number so each user turn has a reply.`
			* `prefix_num_tokens` - Tokens added at the start of only the first user turn in a conversation (unique per conversation).
			* `num_tokens` - Total token length of each user message (one turn).

			The `prompt_output` section must contain `num_tokens`:

			* `num_tokens` - Total token length of each assistant message (one turn).

			`### Random distributions for synthetic conversations generation`

			When creating an input JSON file (such as `generate_multi_turn.json`),<br/>
			every numeric field (such as `num_turns` or `num_tokens`) requires a distribution.<br/>
			`The distribution determines how to randomly sample values for the field.`

			`The available distributions are listed below.`

			Note: The optional `max` field (for lognormal, zipf, and poisson) can be used to cap sampled values at an upper bound.</br>
			Can be used to make sure that the total number of tokens in every request does not exceed `--max-model-len`.

			`#### constant`

			```json
			`{`
			`"distribution": "constant",`
			`"value": 500`
			`}`
			```

			* `value` - the fixed integer value (always returns the same number).

			`#### uniform`

			```json
			`{`
			`"distribution": "uniform",`
			`"min": 12,`
			`"max": 18`
			`}`
			```

			* `min` - minimum value (inclusive).
			* `max` - maximum value (inclusive), should be equal or larger than min.

			`#### lognormal`

			```json
			`{`
			`"distribution": "lognormal",`
			`"average": 1000,`
			`"max": 5000`
			`}`
			```

			`You can parameterize the lognormal distribution in one of two ways:`

			`Using the average and optional median ratio:`

			* `average` - target average value of the distribution.
			* `median_ratio` - the ratio of the median to the average; controls the skewness. Must be in the range (0, 1).

			`Using the parameters of the underlying normal distribution:`

			* `mean` - mean of the underlying normal distribution.
			* `sigma` - standard deviation of the underlying normal distribution.

			`#### zipf`

			```json
			`{`
			`"distribution": "zipf",`
			`"alpha": 1.2,`
			`"max": 100`
			`}`
			```

			* `alpha` - skew parameter (> 1). Larger values produce stronger skew toward smaller integers.

			`#### poisson`

			```json
			`{`
			`"distribution": "poisson",`
			`"alpha": 10,`
			`"max": 50`
			`}`
			```

			* `alpha` - expected value (λ). Also the variance of the distribution.

[Benchmark] Add benchmark tool for multi turn conversations (#20267) 2025-08-08 20:28:50 +03:00			`## ShareGPT Conversations`

			`To run with the ShareGPT data, download the following ShareGPT dataset:`
			`https://huggingface.co/datasets/philschmid/sharegpt-raw/blob/main/sharegpt_20230401_clean_lang_split.json`

			Use the `convert_sharegpt_to_openai.py` script to convert the dataset to a format supported by `benchmark_serving_multi_turn.py`

			```bash
			`python convert_sharegpt_to_openai.py sharegpt_20230401_clean_lang_split.json sharegpt_conv_128.json --seed=99 --max-items=128`
			```

			`The script will convert the ShareGPT dataset to a dataset with the standard user/assistant roles.`

			The flag `--max-items=128` is used to sample 128 conversations from the original dataset (change as needed).

			Use the output JSON file `sharegpt_conv_128.json` as the `--input-file` for `benchmark_serving_multi_turn.py`.