# Attention Backend Feature Support

This document is auto-generated by `tools/pre_commit/generate_attention_backend_docs.py`.
It shows the feature support for each registered attention backend
based on the checks in `AttentionBackend.validate_configuration()`.

**Do not edit this file manually.** Run the following command to
regenerate it:

```bash
python tools/pre_commit/generate_attention_backend_docs.py
```

## Setting the Attention Backend

### Command Line

There are two ways to specify the backend from the command line:

**Option 1: Using `--attention-backend` (simple)**

```bash
vllm serve <model> --attention-backend FLASH_ATTN
```

**Option 2: Using `--attention-config.backend` / `-ac.backend` (structured config)**

```bash
# Dot notation
vllm serve <model> --attention-config.backend FLASH_ATTN
vllm serve <model> -ac.backend FLASH_ATTN

# JSON format
vllm serve <model> --attention-config '{"backend": "FLASH_ATTN"}'
vllm serve <model> -ac '{"backend": "FLASH_ATTN"}'
```

> **Note:** `--attention-backend` and `--attention-config.backend` are mutually
> exclusive. Use one or the other, not both.
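
For example, a command that sets the backend through both mechanisms at once
would be rejected (`<model>` is a placeholder):

```bash
# Invalid: the two options are mutually exclusive
vllm serve <model> --attention-backend FLASH_ATTN --attention-config.backend TRITON_ATTN
```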

### Python API

Use `AttentionConfig` with the `LLM` class:

```python
from vllm import LLM
from vllm.config import AttentionConfig
from vllm.v1.attention.backends.registry import AttentionBackendEnum

# Method 1: Using AttentionConfig with enum
llm = LLM(
    model="Qwen/Qwen3-0.6B",
    attention_config=AttentionConfig(backend=AttentionBackendEnum.FLASH_ATTN),
)

# Method 2: Using attention_backend parameter with string
llm = LLM(
    model="Qwen/Qwen3-0.6B",
    attention_backend="FLASH_ATTN",
)
```

## Backend Selection Behavior

### Manual Selection

When you explicitly set a backend via `--attention-backend` or `AttentionConfig`:

1. The backend is **validated** against your configuration (model dtype, head
   size, compute capability, etc.)
2. If the backend **doesn't support** your configuration, an error is raised
   with the specific reason
3. If valid, the backend is used

Example error when selecting an incompatible backend:

```text
ValueError: Selected backend FLASHMLA is not valid for this configuration.
Reason: ['compute capability not supported']
```
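
As a concrete case (`<model>` stands in for a DeepSeek-style MLA model), the
error above can be triggered by forcing `FLASHMLA` on a GPU outside its
supported range; per the decode table below it requires SM 9.x-10.x:

```bash
# Fails on e.g. an Ampere (SM 8.x) GPU: FLASHMLA's compute-capability
# check rejects the configuration
vllm serve <model> --attention-backend FLASHMLA
```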

### Automatic Selection

When no backend is specified (the default):

1. vLLM iterates through backends in **priority order** (see tables below)
2. Each backend is validated against your configuration
3. The **first compatible backend** is selected (see the example after this list)
4. If no backend is compatible, an error is raised listing all backends and
   their incompatibility reasons
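
For example, serving without any backend flag triggers this automatic search;
which backend wins depends on your GPU and model:

```bash
# No backend specified: vLLM picks the first compatible backend in priority
# order (e.g. FLASH_ATTN on Hopper, FLASHINFER on Blackwell, per the tables below)
vllm serve <model>
```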

## Backend Priority (CUDA)

When no backend is explicitly selected, vLLM chooses the first
compatible backend from these priority-ordered lists.

Priority is **1 = highest** (tried first).

### Standard Attention (MHA, MQA, GQA)

**Blackwell (SM 10.x):**

| Priority | Backend |
|----------|---------|
| 1 | `FLASHINFER` |
| 2 | `FLASH_ATTN` |
| 3 | `TRITON_ATTN` |
| 4 | `FLEX_ATTENTION` |

**Ampere/Hopper (SM 8.x-9.x):**

| Priority | Backend |
|----------|---------|
| 1 | `FLASH_ATTN` |
| 2 | `FLASHINFER` |
| 3 | `TRITON_ATTN` |
| 4 | `FLEX_ATTENTION` |

### MLA Attention (DeepSeek-style)

**Blackwell (SM 10.x):**

| Priority | Backend |
|----------|---------|
| 1 | `FLASHINFER_MLA` |
| 2 | `CUTLASS_MLA` |
| 3 | `FLASH_ATTN_MLA` |
| 4 | `FLASHMLA` |
| 5 | `TRITON_MLA` |
| 6 | `FLASHMLA_SPARSE` |
| 7 | `FLASHINFER_MLA_SPARSE` |

**Ampere/Hopper (SM 8.x-9.x):**

| Priority | Backend |
|----------|---------|
| 1 | `FLASH_ATTN_MLA` |
| 2 | `FLASHMLA` |
| 3 | `FLASHINFER_MLA` |
| 4 | `TRITON_MLA` |
| 5 | `FLASHMLA_SPARSE` |

> **Note:** ROCm and CPU platforms have their own selection logic. See the platform-specific documentation for details.

## Legend

| Column | Description |
|--------|-------------|
| **Dtypes** | Supported model data types (fp16, bf16, fp32) |
| **KV Dtypes** | Supported KV cache data types (`auto`, `fp8`, `fp8_e4m3`, etc.) |
| **Block Sizes** | Supported KV cache block sizes (%N means multiples of N) |
| **Head Sizes** | Supported attention head sizes |
| **Sink** | Attention sink support (for StreamingLLM) |
| **Sparse** | Sparse attention support (MLA only) |
| **MM Prefix** | Multimodal prefix full attention support |
| **DCP** | Decode Context Parallelism support (`--decode-context-parallel-size`) |
| **Attention Types** | Supported attention patterns (Decoder, Encoder, Enc-Dec) |
| **Compute Cap.** | Required CUDA compute capability (N/A for non-CUDA backends) |

**Symbols:** ✅ = Supported, ❌ = Not supported

## Standard Attention (MHA, MQA, GQA) Backends

| Backend | Version | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | MM Prefix | DCP | Attention Types | Compute Cap. |
|---------|---------|--------|-----------|-------------|------------|------|-----------|-----|-----------------|--------------|
| `CPU_ATTN` | | fp16, bf16, fp32 | `auto` | Any | 32, 64, 80, 96, 112, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | All | N/A |
| `FLASHINFER` | Native† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ❌ | ❌ | ✅ | Decoder | 7.x-9.x |
| `FLASHINFER` | TRTLLM† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ✅ | ❌ | ✅ | Decoder | 10.x |
| `FLASH_ATTN` | FA2* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥8.0 |
| `FLASH_ATTN` | FA3* | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ❌ | ✅ | All | 9.x |
| `FLASH_ATTN_DIFFKV` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ✅ | Decoder | Any |
| `FLEX_ATTENTION` | | fp16, bf16, fp32 | `auto`, `bfloat16` | Any | Any | ❌ | ✅ | ❌ | Decoder, Encoder Only | Any |
| `ROCM_AITER_FA` | | fp16, bf16 | `auto` | 16, 32 | 64, 128, 256 | ❌ | ❌ | ❌ | Decoder | N/A |
| `ROCM_AITER_UNIFIED_ATTN` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ❌ | Decoder | N/A |
| `ROCM_ATTN` | | fp16, bf16, fp32 | `auto` | 16, 32, 544 | 32, 64, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | Decoder | N/A |
| `TREE_ATTN` | | fp16, bf16 | `auto` | %16 | 32, 64, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | Decoder | Any |
| `TRITON_ATTN` | | fp16, bf16, fp32 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ✅ | ❌ | All | Any |

> **†** FlashInfer uses TRTLLM attention on Blackwell (SM100), which supports sinks. Disable via `--attention-config.use_trtllm_attention=0`.
>
> **\*** Specify the FlashAttention version via `--attention-config.flash_attn_version=2` or `3`. Default is FA3 on SM90, FA2 otherwise.
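
As a concrete illustration of the two notes above (`<model>` is a placeholder):

```bash
# Force FlashAttention 2 even where FA3 would be the default (SM90)
vllm serve <model> -ac.backend FLASH_ATTN -ac.flash_attn_version=2

# Keep FlashInfer but opt out of the TRTLLM attention path on Blackwell
vllm serve <model> -ac.backend FLASHINFER -ac.use_trtllm_attention=0
```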

## MLA (Multi-head Latent Attention) Backends

MLA uses separate backends for prefill and decode phases.

### Prefill Backends

The prefill backend is selected at runtime based on hardware and
configuration.

| Backend | Description | Compute Cap. | Enable | Disable | Notes |
|---------|-------------|--------------|--------|---------|-------|
| TRT-LLM Ragged‡ | TensorRT-LLM ragged attention | 10.x | Default on SM100 | `-ac.use_trtllm_ragged_deepseek_prefill=0` | DeepSeek R1 dims only |
| FlashInfer | FlashInfer CUTLASS backend | 10.x | `-ac.disable_flashinfer_prefill=0` | `-ac.disable_flashinfer_prefill=1` | DeepSeek R1 dims only |
| cuDNN | cuDNN-based attention | 10.x | `-ac.use_cudnn_prefill=1` | `-ac.use_cudnn_prefill=0` | |
| FlashAttention | FlashAttention varlen (FA2/FA3) | Any | Default fallback | Use other backends | FA3 on SM90, FA2 otherwise |

> **‡** TRT-LLM Ragged is the default on Blackwell (SM100).
> On other GPUs, FlashAttention is used as the default.
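
For instance, using the Enable flag from the table above to opt in to
cuDNN-based MLA prefill (`<model>` is a placeholder):

```bash
# Opt in to the cuDNN prefill path (SM 10.x, per the table above)
vllm serve <model> -ac.use_cudnn_prefill=1
```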

### Decode Backends

| Backend | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | Sparse | MM Prefix | DCP | Attention Types | Compute Cap. |
|---------|--------|-----------|-------------|------------|------|--------|-----------|-----|-----------------|--------------|
| `CUTLASS_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 128 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 10.x |
| `FLASHINFER_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 32, 64 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | 10.x |
| `FLASHINFER_MLA_SPARSE` | fp16, bf16 | `auto`, `bfloat16` | 32, 64 | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | 10.x |
| `FLASHMLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 64 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 9.x-10.x |
| `FLASHMLA_SPARSE` | bf16 | `auto`, `bfloat16`, `fp8_ds_mla` | 64 | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | 9.x-10.x |
| `FLASH_ATTN_MLA` | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 9.x |
| `ROCM_AITER_MLA` | fp16, bf16 | `auto` | 1 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| `ROCM_AITER_MLA_SPARSE` | fp16, bf16 | `auto` | Any | 576 | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| `ROCM_AITER_TRITON_MLA` | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| `TRITON_MLA` | fp16, bf16 | `auto`, `bfloat16` | Any | Any | ❌ | ❌ | ❌ | ✅ | Decoder | Any |
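
The DCP column corresponds to `--decode-context-parallel-size` (see the
Legend). A hypothetical invocation pairing a DCP-capable decode backend with
that flag (`<model>` is a placeholder for an MLA model):

```bash
# FLASHMLA supports DCP per the table above
vllm serve <model> --attention-backend FLASHMLA --decode-context-parallel-size 2
```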