# Attention Backend Feature Support
This document is auto-generated by `tools/pre_commit/generate_attention_backend_docs.py`.
It shows the feature support for each registered attention backend
based on the checks in `AttentionBackend.validate_configuration()`.
**Do not edit this file manually.** Run the following command to
regenerate it:
```bash
python tools/pre_commit/generate_attention_backend_docs.py
```
## Setting the Attention Backend
### Command Line
There are two ways to specify the backend from the command line:
**Option 1: Using `--attention-backend` (simple)**
```bash
vllm serve <model> --attention-backend FLASH_ATTN
```
**Option 2: Using `--attention-config.backend` / `-ac.backend` (structured config)**
```bash
# Dot notation
vllm serve <model> --attention-config.backend FLASH_ATTN
vllm serve <model> -ac.backend FLASH_ATTN
# JSON format
vllm serve <model> --attention-config '{"backend": "FLASH_ATTN"}'
vllm serve <model> -ac '{"backend": "FLASH_ATTN"}'
```
> **Note:** `--attention-backend` and `--attention-config.backend` are mutually
> exclusive. Use one or the other, not both.
### Python API
Use `AttentionConfig` with the `LLM` class:
```python
from vllm import LLM
from vllm.config import AttentionConfig
from vllm.v1.attention.backends.registry import AttentionBackendEnum
# Method 1: Using AttentionConfig with enum
llm = LLM(
    model="Qwen/Qwen3-0.6B",
    attention_config=AttentionConfig(backend=AttentionBackendEnum.FLASH_ATTN),
)

# Method 2: Using attention_backend parameter with string
llm = LLM(
    model="Qwen/Qwen3-0.6B",
    attention_backend="FLASH_ATTN",
)
```
## Backend Selection Behavior
### Manual Selection
When you explicitly set a backend via `--attention-backend` or `AttentionConfig`:
1. The backend is **validated** against your configuration (model dtype, head
size, compute capability, etc.)
2. If the backend **doesn't support** your configuration, an error is raised
with the specific reason
3. If valid, the backend is used
Example error when selecting an incompatible backend:
```text
ValueError: Selected backend FLASHMLA is not valid for this configuration.
Reason: ['compute capability not supported']
```
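From Python, the same validation failure surfaces as a `ValueError` when the engine is constructed. A minimal sketch (backend and model chosen purely for illustration; the exact message depends on your model and hardware):
```python
from vllm import LLM

# Illustrative only: FLASHMLA is an MLA decode backend, so requesting it for a
# standard-attention model (or on an unsupported GPU) is rejected at startup.
try:
    llm = LLM(model="Qwen/Qwen3-0.6B", attention_backend="FLASHMLA")
except ValueError as err:
    print(f"Backend rejected: {err}")
```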
### Automatic Selection
When no backend is specified (the default):
1. vLLM iterates through backends in **priority order** (see tables below)
2. Each backend is validated against your configuration
3. The **first compatible backend** is selected
4. If no backend is compatible, an error is raised listing all backends and
their incompatibility reasons (see the sketch below)
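Conceptually, the selection loop looks like the following sketch (not the actual vLLM internals; `candidates` and `validate` are assumed names standing in for the priority list and for `AttentionBackend.validate_configuration()`):
```python
# Illustrative sketch of automatic backend selection, not vLLM's real code.
def select_backend(candidates, config):
    failures = {}
    for backend in candidates:               # priority-ordered (see tables below)
        reasons = backend.validate(config)   # e.g. dtype, head size, compute cap.
        if not reasons:
            return backend                   # first compatible backend wins
        failures[backend.name] = reasons
    # Nothing fits: report every backend with its incompatibility reasons
    raise ValueError(f"No compatible attention backend found: {failures}")
```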
## Backend Priority (CUDA)
When no backend is explicitly selected, vLLM chooses the first
compatible backend from these priority-ordered lists.
Priority is **1 = highest** (tried first).
### Standard Attention (MHA, MQA, GQA)
**Blackwell (SM 10.x):**
| Priority | Backend |
|----------|---------|
| 1 | `FLASHINFER` |
| 2 | `FLASH_ATTN` |
| 3 | `TRITON_ATTN` |
| 4 | `FLEX_ATTENTION` |
**Ampere/Hopper (SM 8.x-9.x):**
| Priority | Backend |
|----------|---------|
| 1 | `FLASH_ATTN` |
| 2 | `FLASHINFER` |
| 3 | `TRITON_ATTN` |
| 4 | `FLEX_ATTENTION` |
### MLA Attention (DeepSeek-style)
**Blackwell (SM 10.x):**
| Priority | Backend |
|----------|---------|
| 1 | `FLASHINFER_MLA` |
| 2 | `CUTLASS_MLA` |
| 3 | `FLASH_ATTN_MLA` |
| 4 | `FLASHMLA` |
| 5 | `TRITON_MLA` |
| 6 | `FLASHMLA_SPARSE` |
| 7 | `FLASHINFER_MLA_SPARSE` |
**Ampere/Hopper (SM 8.x-9.x):**
| Priority | Backend |
|----------|---------|
| 1 | `FLASH_ATTN_MLA` |
| 2 | `FLASHMLA` |
| 3 | `FLASHINFER_MLA` |
| 4 | `TRITON_MLA` |
| 5 | `FLASHMLA_SPARSE` |
> **Note:** ROCm and CPU platforms have their own selection logic. See the platform-specific documentation for details.
## Legend
| Column | Description |
|--------|-------------|
| **Dtypes** | Supported model data types (fp16, bf16, fp32) |
| **KV Dtypes** | Supported KV cache data types (`auto`, `fp8`, `fp8_e4m3`, etc.) |
| **Block Sizes** | Supported KV cache block sizes (%N means multiples of N) |
| **Head Sizes** | Supported attention head sizes |
| **Sink** | Attention sink support (for StreamingLLM) |
| **Sparse** | Sparse attention support (MLA only) |
| **MM Prefix** | Multimodal prefix full attention support |
| **DCP** | Decode Context Parallelism support (`--decode-context-parallel-size`) |
| **Attention Types** | Supported attention patterns (Decoder, Encoder, Enc-Dec) |
| **Compute Cap.** | Required CUDA compute capability (N/A for non-CUDA backends) |
**Symbols:** ✅ = Supported, ❌ = Not supported
## Standard Attention (MHA, MQA, GQA) Backends
| Backend | Version | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | MM Prefix | DCP | Attention Types | Compute Cap. |
|---------|---------|--------|-----------|-------------|------------|------|-----------|-----|-----------------|--------------|
| `CPU_ATTN` | | fp16, bf16, fp32 | `auto` | Any | 32, 64, 80, 96, 112, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | All | N/A |
| `FLASHINFER` | Native† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ❌ | ❌ | ✅ | Decoder | 7.x-9.x |
| `FLASHINFER` | TRTLLM† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ✅ | ❌ | ✅ | Decoder | 10.x |
| `FLASH_ATTN` | FA2* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥8.0 |
| `FLASH_ATTN` | FA3* | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ❌ | ✅ | All | 9.x |
| `FLASH_ATTN` | FA4* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥10.0 |
| `FLASH_ATTN_DIFFKV` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ✅ | Decoder | Any |
| `FLEX_ATTENTION` | | fp16, bf16, fp32 | `auto`, `bfloat16` | Any | Any | ❌ | ✅ | ❌ | Decoder, Encoder Only | Any |
| `ROCM_AITER_FA` | | fp16, bf16 | `auto` | 16, 32 | 64, 128, 256 | ❌ | ❌ | ❌ | Decoder | N/A |
| `ROCM_AITER_UNIFIED_ATTN` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ❌ | All | N/A |
| `ROCM_ATTN` | | fp16, bf16, fp32 | `auto` | 16, 32, 544 | 32, 64, 80, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | All | N/A |
| `TREE_ATTN` | | fp16, bf16 | `auto` | %16 | 32, 64, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | Decoder | Any |
| `TRITON_ATTN` | | fp16, bf16, fp32 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ✅ | ❌ | All | Any |
> **†** FlashInfer uses TRTLLM attention on Blackwell (SM100), which supports sinks. Disable via `--attention-config.use_trtllm_attention=0`.
>
> **\*** Specify the FlashAttention version via `--attention-config.flash_attn_version=2`, `3`, or `4`. Default is FA4 on SM100+ (Blackwell), FA3 on SM90 (Hopper), FA2 otherwise.
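For example, to pin a FlashAttention version or keep FlashInfer while opting out of the TRTLLM attention path, the flags from the footnotes above can be passed on the command line (an illustration only; `<model>` is a placeholder and combining these flags is assumed to be allowed since only the two backend selectors are mutually exclusive):
```bash
# Force FlashAttention 3 instead of the per-GPU default
vllm serve <model> --attention-backend FLASH_ATTN --attention-config.flash_attn_version=3

# Use FlashInfer but disable TRTLLM attention on Blackwell
vllm serve <model> --attention-backend FLASHINFER --attention-config.use_trtllm_attention=0
```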
## MLA (Multi-head Latent Attention) Backends
MLA uses separate backends for prefill and decode phases.
### Prefill Backends
The prefill backend is selected at runtime based on hardware and
configuration.
| Backend | Description | Compute Cap. | Enable | Disable | Notes |
|---------|-------------|--------------|--------|---------|-------|
| TRT-LLM Ragged‡ | TensorRT-LLM ragged attention | 10.x | Default on SM100 | `-ac.use_trtllm_ragged_deepseek_prefill=0` | DeepSeek R1 dims only |
| FlashInfer | FlashInfer CUTLASS backend | 10.x | `-ac.disable_flashinfer_prefill=0` | `-ac.disable_flashinfer_prefill=1` | DeepSeek R1 dims only |
| cuDNN | cuDNN-based attention | 10.x | `-ac.use_cudnn_prefill=1` | `-ac.use_cudnn_prefill=0` | |
| FlashAttention | FlashAttention varlen (FA2/FA3) | Any | Default fallback | Use other backends | FA3 on SM90, FA2 otherwise |
> **‡** TRT-LLM Ragged is the default on Blackwell (SM100).
> On other GPUs, FlashAttention is used as the default.
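As an illustration, the prefill path can be steered with the flags listed in the table above (`<model>` is a placeholder; availability depends on your GPU and build):
```bash
# Opt into the cuDNN prefill path for MLA models
vllm serve <model> -ac.use_cudnn_prefill=1

# Opt out of the default TRT-LLM ragged prefill on Blackwell (SM100)
vllm serve <model> -ac.use_trtllm_ragged_deepseek_prefill=0
```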
### Decode Backends
| Backend | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | Sparse | MM Prefix | DCP | Attention Types | Compute Cap. |
|---------|--------|-----------|-------------|------------|------|--------|-----------|-----|-----------------|--------------|
| `CUTLASS_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 128 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 10.x |
| `FLASHINFER_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 32, 64 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | 10.x |
| `FLASHINFER_MLA_SPARSE` | fp16, bf16 | `auto`, `bfloat16` | 32, 64 | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | 10.x |
| `FLASHMLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 64 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 9.x-10.x |
| `FLASHMLA_SPARSE` | bf16 | `auto`, `bfloat16`, `fp8_ds_mla` | 64 | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | 9.x-10.x |
| `FLASH_ATTN_MLA` | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 9.x |
| `ROCM_AITER_MLA` | fp16, bf16 | `auto` | 1 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| `ROCM_AITER_MLA_SPARSE` | fp16, bf16 | `auto` | Any | 576 | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| `ROCM_AITER_TRITON_MLA` | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| `TRITON_MLA` | fp16, bf16 | `auto`, `bfloat16` | Any | Any | ❌ | ❌ | ❌ | ✅ | Decoder | Any |