# vLLM Attention Benchmarking Suite

Fast, flexible benchmarking for vLLM attention and MLA backends with an extended batch specification grammar.

## Quick Start

```bash
cd benchmarks/attention_benchmarks

# Run a pre-configured benchmark
python benchmark.py --config configs/mla_decode.yaml
python benchmark.py --config configs/mla_mixed_batch.yaml
python benchmark.py --config configs/speculative_decode.yaml
python benchmark.py --config configs/standard_attention.yaml
python benchmark.py --config configs/reorder_threshold.yaml

# Or run custom benchmarks
python benchmark.py \
    --backends flash flashinfer \
    --batch-specs "q2k" "8q1s1k" "2q2k_32q1s1k" \
    --output-csv results.csv
```

## Simplified Batch Specification Grammar

Express workloads concisely using query length and sequence length:

```python
"q2k"           # 2048-token prefill (q_len=2048, seq_len=2048)
"q1s1k"         # Decode: 1 token with 1K sequence
"8q1s1k"        # 8 decode requests
"q4s1k"         # 4-token extend (e.g., spec decode)
"2q2k_32q1s1k"  # Mixed: 2 prefills + 32 decodes
"16q4s1k"       # 16 spec decode (4 tokens each)
```

### Grammar Rule

```text
Format: (<count>)? q<q_len>(k)? (s<seq_len>(k)?)?

- count:   Number of identical requests (optional, default=1)
- q_len:   Query length (number of new tokens)
- seq_len: Total sequence length (optional, defaults to q_len for prefill)
- 'k':     Multiplies the value by 1024

Mixed batches: Use _ to combine (e.g., "2q2k_32q1s1k")
```

**Note**: Decode, prefill, and spec decode are just different query lengths - no special syntax needed!
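To make the rules concrete, here is an illustrative, self-contained sketch of how a spec string could expand into per-request `(q_len, seq_len)` pairs. This is not the suite's implementation; for real parsing use `batch_spec.parse_batch_spec`, shown in the Python API section below.

```python
import re

# Illustrative sketch only -- the suite's real parser is batch_spec.parse_batch_spec.
# One segment: optional count, 'q' + query length, optional 's' + sequence length,
# where a trailing 'k' multiplies the preceding number by 1024.
_SEGMENT = re.compile(r"^(\d+)?q(\d+)(k?)(?:s(\d+)(k?))?$")

def expand_spec(spec: str) -> list[tuple[int, int]]:
    """Expand a spec like '2q2k_32q1s1k' into (q_len, seq_len) pairs."""
    requests = []
    for segment in spec.split("_"):
        m = _SEGMENT.match(segment)
        if m is None:
            raise ValueError(f"Bad batch spec segment: {segment!r}")
        count = int(m.group(1) or 1)
        q_len = int(m.group(2)) * (1024 if m.group(3) else 1)
        # seq_len defaults to q_len (a pure prefill) when the 's' part is omitted
        seq_len = int(m.group(4)) * (1024 if m.group(5) else 1) if m.group(4) else q_len
        requests.extend([(q_len, seq_len)] * count)
    return requests

print(expand_spec("2q2k_32q1s1k"))  # [(2048, 2048)] * 2 + [(1, 1024)] * 32
```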
## Pre-configured Benchmarks

The suite includes several pre-configured YAML benchmark configurations:

### MLA Decode Benchmark

Tests pure decode performance across MLA backends with varying batch sizes and sequence lengths.

```bash
python benchmark.py --config configs/mla_decode.yaml
```

### MLA Mixed Batch Benchmark

Tests chunked prefill performance with mixed prefill + decode batches.

```bash
python benchmark.py --config configs/mla_mixed_batch.yaml
```

### Speculative Decoding Benchmark

Tests speculative decode scenarios (K-token verification) and `reorder_batch_threshold` optimization.

```bash
python benchmark.py --config configs/speculative_decode.yaml
```

### Standard Attention Benchmark

Tests standard attention backends (Flash/Triton/FlashInfer) with pure prefill, decode, and mixed batches.

```bash
python benchmark.py --config configs/standard_attention.yaml
```

### Reorder Threshold Study

**Question:** At what query length does the prefill pipeline become faster than the decode pipeline?

Tests query lengths from 1 to 1024 across 9 batch sizes to find the crossover point. Uses `decode_vs_prefill` mode to compare both pipelines for each query length.

```bash
python benchmark.py --config configs/reorder_threshold.yaml
```

---

## Universal Benchmark

The `benchmark.py` script handles **all** backends - both standard attention and MLA.

### Standard Attention (Flash/Triton/FlashInfer)

```bash
python benchmark.py \
    --backends flash triton flashinfer \
    --batch-specs "q2k" "8q1s1k" "2q2k_32q1s1k" \
    --num-layers 10 \
    --repeats 5 \
    --output-csv results.csv
```

### MLA Backends

```bash
# Compare all MLA backends
python benchmark.py \
    --backends cutlass_mla flashinfer_mla flashattn_mla flashmla \
    --batch-specs "64q1s1k" "64q1s4k" \
    --output-csv mla_results.csv
```

### Parameter Sweeps

Use `--sweep-param` and `--sweep-values` to run parameter sweeps from the CLI:

#### CUTLASS MLA num-splits Optimization

**Question:** What is the optimal `num_kv_splits` for CUTLASS MLA?

```bash
python benchmark.py \
    --backend cutlass_mla \
    --batch-specs "64q1s1k" "64q1s4k" "64q1s16k" \
    --sweep-param num_kv_splits \
    --sweep-values 1 2 4 8 16 \
    --output-json optimal_splits.json
```

#### Reorder Batch Threshold Optimization

**Question:** What's the optimal `reorder_batch_threshold` for speculative decoding?

```bash
python benchmark.py \
    --backend flashmla \
    --batch-specs "q4s1k" "q8s2k" \
    --sweep-param reorder_batch_threshold \
    --sweep-values 1 4 16 64 256 512 \
    --output-csv threshold_sweep.csv
```

### All Command-Line Options

```text
--config CONFIG                   # Path to YAML config file (overrides other args)
--backends BACKEND [BACKEND ...]  # flash, triton, flashinfer, cutlass_mla,
                                  # flashinfer_mla, flashattn_mla, flashmla
--backend BACKEND                 # Single backend (alternative to --backends)
--batch-specs SPEC [SPEC ...]     # Batch specifications using extended grammar

# Model configuration
--num-layers N                    # Number of layers
--head-dim N                      # Head dimension
--num-q-heads N                   # Query heads
--num-kv-heads N                  # KV heads
--block-size N                    # Block size

# Benchmark settings
--device DEVICE                   # Device (default: cuda:0)
--repeats N                       # Repetitions
--warmup-iters N                  # Warmup iterations
--profile-memory                  # Profile memory usage

# Parameter sweeps
--sweep-param PARAM               # Parameter name to sweep (e.g., num_kv_splits,
                                  # reorder_batch_threshold)
--sweep-values N [N ...]          # Values to sweep for the parameter

# Output
--output-csv FILE                 # Save to CSV
--output-json FILE                # Save to JSON
```

## Hardware Requirements

| Backend                 | Hardware           |
| ----------------------- | ------------------ |
| Flash/Triton/FlashInfer | Any CUDA GPU       |
| CUTLASS MLA             | Blackwell (SM100+) |
| FlashAttn MLA           | Hopper (SM90+)     |
| FlashMLA                | Hopper (SM90+)     |
| FlashInfer-MLA          | Any CUDA GPU       |

## Using MLA Runner Directly

All MLA backends are available through `mla_runner.run_mla_benchmark()`:

```python
from mla_runner import run_mla_benchmark
from common import BenchmarkConfig

config = BenchmarkConfig(
    backend="cutlass_mla",
    batch_spec="64q1s4k",
    num_layers=10,
    head_dim=576,
    num_q_heads=128,
    num_kv_heads=1,
    block_size=128,
    device="cuda:0",
    repeats=5,
    warmup_iters=3,
)

# CUTLASS MLA with a specific num_kv_splits
result = run_mla_benchmark("cutlass_mla", config, num_kv_splits=4)
print(f"Time: {result.mean_time:.6f}s")

# FlashInfer-MLA
result = run_mla_benchmark("flashinfer_mla", config)

# FlashAttn MLA (Hopper SM90+)
result = run_mla_benchmark("flashattn_mla", config, reorder_batch_threshold=64)

# FlashMLA (Hopper SM90+)
result = run_mla_benchmark("flashmla", config, reorder_batch_threshold=64)
```
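Because `run_mla_benchmark()` takes backend-specific keyword arguments, sweeps can also be scripted directly instead of going through the CLI. A minimal sketch, reusing the `config` object from the snippet above to mirror `--sweep-param num_kv_splits --sweep-values 1 2 4 8 16`:

```python
# Programmatic num_kv_splits sweep -- mirrors the CLI flags
# `--sweep-param num_kv_splits --sweep-values 1 2 4 8 16`.
# Reuses `config` and `run_mla_benchmark` from the snippet above.
best_splits, best_time = None, float("inf")
for num_kv_splits in (1, 2, 4, 8, 16):
    result = run_mla_benchmark("cutlass_mla", config, num_kv_splits=num_kv_splits)
    print(f"num_kv_splits={num_kv_splits}: {result.mean_time:.6f}s")
    if result.mean_time < best_time:
        best_splits, best_time = num_kv_splits, result.mean_time

print(f"Best num_kv_splits: {best_splits} ({best_time:.6f}s)")
```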
## Python API

```python
from batch_spec import parse_batch_spec, format_batch_spec, get_batch_stats
from common import BenchmarkConfig, BenchmarkResult, ResultsFormatter

# Parse batch specs
requests = parse_batch_spec("2q2k_q4s1k_32q1s1k")
print(format_batch_spec(requests))
# "2 prefill (2x2k), 1 extend (1xq4kv1k), 32 decode (32x1k)"

# Get batch statistics
stats = get_batch_stats(requests)
print(f"Total tokens: {stats['total_tokens']}")
print(f"Num decode: {stats['num_decode']}, Num prefill: {stats['num_prefill']}")

# Format results (`results` is a list of BenchmarkResult from earlier runs)
formatter = ResultsFormatter()
formatter.save_csv(results, "output.csv")
formatter.save_json(results, "output.json")
```

## Tips

**1. Warmup matters** - Use `--warmup-iters 10` for stable results

**2. Multiple repeats** - Use `--repeats 20` for low variance

**3. Save results** - Always use `--output-csv` or `--output-json`

**4. Test incrementally** - Start with `--num-layers 1 --repeats 1`

**5. Extended grammar** - Leverage spec decode and chunked prefill patterns

**6. Parameter sweeps** - Use `--sweep-param` and `--sweep-values` to find optimal values (see the sketch below)
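Following up on tip 6, sweep results saved with `--output-csv` can be post-processed to find the best value per workload. Below is a minimal sketch that assumes the CSV has `batch_spec`, `sweep_value`, and `mean_time` columns; these names are illustrative, so check them against the header of your actual output file.

```python
import csv
from collections import defaultdict

# Pick the fastest swept value per batch spec from a sweep CSV.
# Column names below are assumptions -- verify against your real header.
best = defaultdict(lambda: (None, float("inf")))
with open("threshold_sweep.csv") as f:
    for row in csv.DictReader(f):
        spec, t = row["batch_spec"], float(row["mean_time"])
        if t < best[spec][1]:
            best[spec] = (row["sweep_value"], t)

for spec, (value, t) in sorted(best.items()):
    print(f"{spec}: best value = {value} ({t:.6f}s)")
```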