
# vLLM Attention Benchmarking Suite

Fast, flexible benchmarking for vLLM attention and MLA backends with an extended batch specification grammar.

## Quick Start

```bash
cd benchmarks/attention_benchmarks

# Run a pre-configured benchmark
python benchmark.py --config configs/mla_decode.yaml
python benchmark.py --config configs/mla_mixed_batch.yaml
python benchmark.py --config configs/speculative_decode.yaml
python benchmark.py --config configs/standard_attention.yaml
python benchmark.py --config configs/reorder_threshold.yaml

# Or run custom benchmarks
python benchmark.py \
    --backends flash flashinfer \
    --batch-specs "q2k" "8q1s1k" "2q2k_32q1s1k" \
    --output-csv results.csv
```

## Simplified Batch Specification Grammar

Express workloads concisely using query length and sequence length:

```python
"q2k"              # 2048-token prefill (q_len=2048, seq_len=2048)
"q1s1k"            # Decode: 1 token with 1K sequence
"8q1s1k"           # 8 decode requests
"q4s1k"            # 4-token extend (e.g., spec decode)
"2q2k_32q1s1k"     # Mixed: 2 prefills + 32 decodes
"16q4s1k"          # 16 spec decode (4 tokens each)
```

### Grammar Rule

```text
Format: [<count>]q<q_len>[k][s<seq_len>[k]]    (brackets mark optional parts)

- count:   Number of identical requests (optional, default=1)
- q_len:   Query length (number of new tokens)
- seq_len: Total sequence length (optional; defaults to q_len, i.e. a pure prefill)
- 'k':     Suffix that multiplies the preceding value by 1024

Mixed batches: Use _ to combine (e.g., "2q2k_32q1s1k")
```

**Note**: Decode, prefill, and spec decode are just different query lengths - no special syntax needed!
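
To make the grammar concrete, here is a minimal sketch of a parser for it. It is not the suite's `parse_batch_spec`; the name `parse_spec_sketch` and the `(q_len, seq_len)` tuple representation are assumptions made for illustration only.

```python
import re
from typing import List, Tuple

# Hypothetical helper for illustration; NOT the suite's parse_batch_spec.
# One segment: optional count, q<q_len>[k], optional s<seq_len>[k].
_SEGMENT = re.compile(r"^(\d+)?q(\d+)(k)?(?:s(\d+)(k)?)?$")


def parse_spec_sketch(spec: str) -> List[Tuple[int, int]]:
    """Expand a batch spec into one (q_len, seq_len) pair per request."""
    requests: List[Tuple[int, int]] = []
    for segment in spec.split("_"):  # "_" combines segments into a mixed batch
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"invalid batch spec segment: {segment!r}")
        count, q, q_k, s, s_k = match.groups()
        q_len = int(q) * (1024 if q_k else 1)
        # seq_len defaults to q_len (pure prefill) when the s-part is omitted.
        seq_len = int(s) * (1024 if s_k else 1) if s else q_len
        requests.extend([(q_len, seq_len)] * int(count or 1))
    return requests


print(parse_spec_sketch("q2k"))                # [(2048, 2048)]
print(parse_spec_sketch("q4s1k"))              # [(4, 1024)]
print(len(parse_spec_sketch("2q2k_32q1s1k")))  # 34
```

Under this reading, a decode is simply a request with `q_len == 1`, and a prefill is one where `q_len == seq_len`.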

## Pre-configured Benchmarks

The suite includes several pre-configured YAML benchmark configurations:

### MLA Decode Benchmark

Tests pure decode performance across MLA backends with varying batch sizes and sequence lengths.

```bash
python benchmark.py --config configs/mla_decode.yaml
```

### MLA Mixed Batch Benchmark

Tests chunked prefill performance with mixed prefill + decode batches.

```bash
python benchmark.py --config configs/mla_mixed_batch.yaml
```

### Speculative Decoding Benchmark

Tests speculative decode scenarios (K-token verification) and `reorder_batch_threshold` optimization.

```bash
python benchmark.py --config configs/speculative_decode.yaml
```

### Standard Attention Benchmark

Tests standard attention backends (Flash/Triton/FlashInfer) with pure prefill, decode, and mixed batches.

```bash
python benchmark.py --config configs/standard_attention.yaml
```

### Reorder Threshold Study

**Question:** At what query length does the prefill pipeline become faster than the decode pipeline?

Tests query lengths from 1 to 1024 across 9 batch sizes to find the crossover point. Uses `decode_vs_prefill` mode to compare both pipelines for each query length.

```bash
python benchmark.py --config configs/reorder_threshold.yaml
```
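
Once per-query-length timings for both pipelines are available, the crossover point can be located with a few lines of Python. The sketch below assumes the timings have already been loaded into two dictionaries keyed by query length; that layout and the `find_crossover` helper are hypothetical and do not reflect the suite's actual output format.

```python
from typing import Dict, Optional


def find_crossover(
    decode_times: Dict[int, float], prefill_times: Dict[int, float]
) -> Optional[int]:
    """Return the smallest query length where prefill is strictly faster.

    Only query lengths present in both dictionaries are compared.
    Returns None if prefill never beats decode.
    """
    for q_len in sorted(decode_times.keys() & prefill_times.keys()):
        if prefill_times[q_len] < decode_times[q_len]:
            return q_len
    return None


# Placeholder numbers purely to exercise the function; these are NOT measurements.
decode = {1: 1.0, 2: 2.0, 4: 4.0, 8: 8.0}
prefill = {1: 3.0, 2: 3.0, 4: 3.5, 8: 4.0}
print(find_crossover(decode, prefill))  # 4
```

The crossover may differ per batch size, so it is worth running this once for each batch size in the study.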

---

## Universal Benchmark

The `benchmark.py` script handles **all** backends - both standard attention and MLA.

### Standard Attention (Flash/Triton/FlashInfer)

```bash
python benchmark.py \
    --backends flash triton flashinfer \
    --batch-specs "q2k" "8q1s1k" "2q2k_32q1s1k" \
    --num-layers 10 \
    --repeats 5 \
    --output-csv results.csv
```

### MLA Backends

```bash
# Compare all MLA backends
python benchmark.py \
    --backends cutlass_mla flashinfer_mla flashattn_mla flashmla \
    --batch-specs "64q1s1k" "64q1s4k" \
    --output-csv mla_results.csv
```

### Parameter Sweeps

Use `--sweep-param` and `--sweep-values` to run parameter sweeps from the CLI:

#### CUTLASS MLA num-splits Optimization

**Question:** What is the optimal `num_kv_splits` for CUTLASS MLA?

```bash
python benchmark.py \
    --backend cutlass_mla \
    --batch-specs "64q1s1k" "64q1s4k" "64q1s16k" \
    --sweep-param num_kv_splits \
    --sweep-values 1 2 4 8 16 \
    --output-json optimal_splits.json
```

#### Reorder Batch Threshold Optimization

**Question:** What's the optimal `reorder_batch_threshold` for speculative decoding?

```bash
python benchmark.py \
    --backend flashmla \
    --batch-specs "q4s1k" "q8s2k" \
    --sweep-param reorder_batch_threshold \
    --sweep-values 1 4 16 64 256 512 \
    --output-csv threshold_sweep.csv
```
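
The same kind of sweep can be driven from Python. The sketch below is a generic "run each value, keep the fastest" loop; `sweep_parameter` and `fake_run` are hypothetical names, and the stand-in cost function exists only so the example runs without a GPU.

```python
from typing import Callable, Dict, Iterable, Tuple


def sweep_parameter(
    run_once: Callable[[int], float], values: Iterable[int]
) -> Tuple[int, Dict[int, float]]:
    """Run `run_once` for every value and return (best_value, all_timings)."""
    timings = {value: run_once(value) for value in values}
    best = min(timings, key=timings.get)  # lowest time wins
    return best, timings


# Stand-in cost function so the sketch runs anywhere. With the suite installed,
# it could be replaced by something like:
#   lambda v: run_mla_benchmark("cutlass_mla", config, num_kv_splits=v).mean_time
def fake_run(num_kv_splits: int) -> float:
    return abs(num_kv_splits - 4) + 1.0


best, timings = sweep_parameter(fake_run, [1, 2, 4, 8, 16])
print(best)  # 4
```

Keeping the runner as a callable means the same loop works for `num_kv_splits`, `reorder_batch_threshold`, or any other integer parameter.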

### All Command-Line Options

```text
--config CONFIG                     # Path to YAML config file (overrides other args)
--backends BACKEND [BACKEND ...]    # flash, triton, flashinfer, cutlass_mla,
                                    # flashinfer_mla, flashattn_mla, flashmla
--backend BACKEND                   # Single backend (alternative to --backends)
--batch-specs SPEC [SPEC ...]       # Batch specifications using extended grammar

# Model configuration
--num-layers N                      # Number of layers
--head-dim N                        # Head dimension
--num-q-heads N                     # Query heads
--num-kv-heads N                    # KV heads
--block-size N                      # Block size

# Benchmark settings
--device DEVICE                     # Device (default: cuda:0)
--repeats N                         # Repetitions
--warmup-iters N                    # Warmup iterations
--profile-memory                    # Profile memory usage

# Parameter sweeps
--sweep-param PARAM                 # Parameter name to sweep (e.g., num_kv_splits,
                                    # reorder_batch_threshold)
--sweep-values N [N ...]            # Values to sweep for the parameter

# Output
--output-csv FILE                   # Save to CSV
--output-json FILE                  # Save to JSON
```

## Hardware Requirements

| Backend | Hardware |
| ------- | -------- |
| Flash/Triton/FlashInfer | Any CUDA GPU |
| CUTLASS MLA | Blackwell (SM100+) |
| FlashAttn MLA | Hopper (SM90+) |
| FlashMLA | Hopper (SM90+) |
| FlashInfer-MLA | Any CUDA GPU |
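
The table above can be encoded as a lookup to filter backends before launching a run. This is a sketch only: `MIN_CAPABILITY` and `supported_backends` are hypothetical names, and the mapping simply transcribes the table, treating "SM90+" and "SM100+" as minimum CUDA compute capabilities of 9.0 and 10.0.

```python
from typing import Dict, List, Tuple

# Minimum compute capability per backend, transcribed from the table above.
# (0, 0) stands for "any CUDA GPU". Hypothetical mapping, not part of the suite.
MIN_CAPABILITY: Dict[str, Tuple[int, int]] = {
    "flash": (0, 0),
    "triton": (0, 0),
    "flashinfer": (0, 0),
    "flashinfer_mla": (0, 0),
    "flashattn_mla": (9, 0),
    "flashmla": (9, 0),
    "cutlass_mla": (10, 0),
}


def supported_backends(capability: Tuple[int, int]) -> List[str]:
    """Return the backends whose minimum capability is met by `capability`."""
    return [name for name, minimum in MIN_CAPABILITY.items() if capability >= minimum]


# On a Hopper GPU (compute capability 9.0), cutlass_mla is filtered out.
print(supported_backends((9, 0)))
```

With PyTorch available, the capability tuple for the current device can be obtained from `torch.cuda.get_device_capability()`.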

## Using MLA Runner Directly

All MLA backends are available through `mla_runner.run_mla_benchmark()`:

```python
from mla_runner import run_mla_benchmark
from common import BenchmarkConfig

config = BenchmarkConfig(
    backend="cutlass_mla",
    batch_spec="64q1s4k",
    num_layers=10,
    head_dim=576,
    num_q_heads=128,
    num_kv_heads=1,
    block_size=128,
    device="cuda:0",
    repeats=5,
    warmup_iters=3,
)

# CUTLASS MLA with specific num_kv_splits
result = run_mla_benchmark("cutlass_mla", config, num_kv_splits=4)
print(f"Time: {result.mean_time:.6f}s")

# FlashInfer-MLA
result = run_mla_benchmark("flashinfer_mla", config)

# FlashAttn MLA (Hopper SM90+)
result = run_mla_benchmark("flashattn_mla", config, reorder_batch_threshold=64)

# FlashMLA (Hopper SM90+)
result = run_mla_benchmark("flashmla", config, reorder_batch_threshold=64)
```

## Python API

```python
from batch_spec import parse_batch_spec, format_batch_spec, get_batch_stats
from common import BenchmarkConfig, BenchmarkResult, ResultsFormatter

# Parse batch specs
requests = parse_batch_spec("2q2k_q4s1k_32q1s1k")
print(format_batch_spec(requests))
# "2 prefill (2x2k), 1 extend (1xq4kv1k), 32 decode (32x1k)"

# Get batch statistics
stats = get_batch_stats(requests)
print(f"Total tokens: {stats['total_tokens']}")
print(f"Num decode: {stats['num_decode']}, Num prefill: {stats['num_prefill']}")

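# Format results. `results` is assumed here to be a list of BenchmarkResult
# objects collected from earlier benchmark runs (e.g. run_mla_benchmark calls).
formatter = ResultsFormatter()
formatter.save_csv(results, "output.csv")
formatter.save_json(results, "output.json")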
```

## Tips

**1. Warmup matters** - Use `--warmup-iters 10` for stable results

**2. Multiple repeats** - Use `--repeats 20` for low variance

**3. Save results** - Always use `--output-csv` or `--output-json`

**4. Test incrementally** - Start with `--num-layers 1 --repeats 1`

**5. Extended grammar** - Use the batch spec grammar to model spec decode and chunked prefill patterns

**6. Parameter sweeps** - Use `--sweep-param` and `--sweep-values` to find optimal values
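
To illustrate why warmup and repeats matter, here is a minimal, generic timing loop. It is a sketch of the general pattern, not the suite's implementation; `time_with_warmup` is a hypothetical helper and the timed workload is a CPU placeholder.

```python
import statistics
import time
from typing import Callable, Tuple


def time_with_warmup(
    fn: Callable[[], object], warmup_iters: int = 10, repeats: int = 20
) -> Tuple[float, float]:
    """Return (mean, stdev) of wall-clock seconds for `fn` over `repeats` runs."""
    # Warmup runs are executed but not timed (caches, lazy init, autotuning).
    for _ in range(warmup_iters):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    # stdev needs at least two samples.
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return statistics.mean(samples), stdev


# CPU placeholder workload; a real attention benchmark would launch GPU kernels.
mean_s, stdev_s = time_with_warmup(lambda: sum(range(10_000)))
print(f"mean={mean_s:.6f}s stdev={stdev_s:.6f}s")
```

Note that CUDA kernels launch asynchronously, so a GPU version of this loop would need to synchronize (for example with `torch.cuda.synchronize()`) before reading the clock.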

Add attention benchmarking tools (#26835). Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>. Co-authored-by: Claude <noreply@anthropic.com>. 2026-01-27 19:09:20 -05:00
