# Attention Backend Feature Support
This document is auto-generated by `tools/pre_commit/generate_attention_backend_docs.py`.
It shows the feature support for each registered attention backend
based on the checks in `AttentionBackend.validate_configuration()`.
**Do not edit this file manually.** Run the following command to
regenerate it:
```bash
python tools/pre_commit/generate_attention_backend_docs.py
```
## Setting the Attention Backend
### Command Line
There are two ways to specify the backend from the command line:
**Option 1: Using `--attention-backend` (simple)**
```bash
vllm serve <model> --attention-backend FLASH_ATTN
```
**Option 2: Using `--attention-config.backend` / `-ac.backend` (structured config)**
```bash
# Dot notation
vllm serve <model> --attention-config.backend FLASH_ATTN
vllm serve <model> -ac.backend FLASH_ATTN
# JSON format
vllm serve <model> --attention-config '{"backend": "FLASH_ATTN"}'
vllm serve <model> -ac '{"backend": "FLASH_ATTN"}'
```
> **Note:** `--attention-backend` and `--attention-config.backend` are mutually
> exclusive. Use one or the other, not both.
### Python API
Use `AttentionConfig` with the `LLM` class:
```python
from vllm import LLM
from vllm.config import AttentionConfig
from vllm.v1.attention.backends.registry import AttentionBackendEnum
# Method 1: Using AttentionConfig with enum
llm = LLM(
    model="Qwen/Qwen3-0.6B",
    attention_config=AttentionConfig(backend=AttentionBackendEnum.FLASH_ATTN),
)

# Method 2: Using attention_backend parameter with string
llm = LLM(
    model="Qwen/Qwen3-0.6B",
    attention_backend="FLASH_ATTN",
)
```
## Backend Selection Behavior
### Manual Selection
When you explicitly set a backend via `--attention-backend` or `AttentionConfig`:
1. The backend is **validated** against your configuration (model dtype, head
size, compute capability, etc.)
2. If the backend **doesn't support** your configuration, an error is raised
with the specific reason
3. If valid, the backend is used
Example error when selecting an incompatible backend:
```text
ValueError: Selected backend FLASHMLA is not valid for this configuration.
Reason: ['compute capability not supported']
```
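From Python, the same validation failure surfaces as a `ValueError` when the engine is constructed. A minimal sketch (backend and model chosen purely for illustration; the exact message depends on your model and hardware):
```python
from vllm import LLM

# Illustrative only: FLASHMLA is an MLA decode backend, so requesting it for a
# standard-attention model (or on an unsupported GPU) is rejected at startup.
try:
    llm = LLM(model="Qwen/Qwen3-0.6B", attention_backend="FLASHMLA")
except ValueError as err:
    print(f"Backend rejected: {err}")
```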
### Automatic Selection
When no backend is specified (the default):
1. vLLM iterates through backends in **priority order** (see tables below)
2. Each backend is validated against your configuration
3. The **first compatible backend** is selected
4. If no backend is compatible, an error is raised listing all backends and
their incompatibility reasons (see the sketch below)
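Conceptually, the selection loop looks like the following sketch (not the actual vLLM internals; `candidates` and `validate` are assumed names standing in for the priority list and for `AttentionBackend.validate_configuration()`):
```python
# Illustrative sketch of automatic backend selection, not vLLM's real code.
def select_backend(candidates, config):
    failures = {}
    for backend in candidates:               # priority-ordered (see tables below)
        reasons = backend.validate(config)   # e.g. dtype, head size, compute cap.
        if not reasons:
            return backend                   # first compatible backend wins
        failures[backend.name] = reasons
    # Nothing fits: report every backend with its incompatibility reasons
    raise ValueError(f"No compatible attention backend found: {failures}")
```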
## Backend Priority (CUDA)
When no backend is explicitly selected, vLLM chooses the first
compatible backend from these priority-ordered lists.
Priority is **1 = highest** (tried first).
### Standard Attention (MHA, MQA, GQA)
**Blackwell (SM 10.x):**
| Priority | Backend |
|----------|---------|
| 1 | `FLASHINFER` |
| 2 | `FLASH_ATTN` |
| 3 | `TRITON_ATTN` |
| 4 | `FLEX_ATTENTION` |
**Ampere/Hopper (SM 8.x-9.x):**
| Priority | Backend |
|----------|---------|
| 1 | `FLASH_ATTN` |
| 2 | `FLASHINFER` |
| 3 | `TRITON_ATTN` |
| 4 | `FLEX_ATTENTION` |
### MLA Attention (DeepSeek-style)
**Blackwell (SM 10.x):**
| Priority | Backend |
|----------|---------|
| 1 | `FLASHINFER_MLA` |
| 2 | `CUTLASS_MLA` |
| 3 | `FLASH_ATTN_MLA` |
| 4 | `FLASHMLA` |
| 5 | `TRITON_MLA` |
| 6 | `FLASHMLA_SPARSE` |
| 7 | `FLASHINFER_MLA_SPARSE` |
**Ampere/Hopper (SM 8.x-9.x):**
| Priority | Backend |
|----------|---------|
| 1 | `FLASH_ATTN_MLA` |
| 2 | `FLASHMLA` |
| 3 | `FLASHINFER_MLA` |
| 4 | `TRITON_MLA` |
| 5 | `FLASHMLA_SPARSE` |
> **Note:** ROCm and CPU platforms have their own selection logic. See the platform-specific documentation for details.
## Legend
| Column | Description |
|--------|-------------|
| **Dtypes** | Supported model data types (fp16, bf16, fp32) |
| **KV Dtypes** | Supported KV cache data types (`auto`, `fp8`, `fp8_e4m3`, etc.) |
| **Block Sizes** | Supported KV cache block sizes (%N means multiples of N) |
| **Head Sizes** | Supported attention head sizes |
| **Sink** | Attention sink support (for StreamingLLM) |
| **Sparse** | Sparse attention support (MLA only) |
| **MM Prefix** | Multimodal prefix full attention support |
| **DCP** | Decode Context Parallelism support (`--decode-context-parallel-size`) |
| **Attention Types** | Supported attention patterns (Decoder, Encoder, Enc-Dec) |
| **Compute Cap.** | Required CUDA compute capability (N/A for non-CUDA backends) |
**Symbols:** ✅ = Supported, ❌ = Not supported
## Standard Attention (MHA, MQA, GQA) Backends
| Backend | Version | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | MM Prefix | DCP | Attention Types | Compute Cap. |
|---------|---------|--------|-----------|-------------|------------|------|-----------|-----|-----------------|--------------|
| `CPU_ATTN` | | fp16, bf16, fp32 | `auto` | Any | 32, 64, 80, 96, 112, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | All | N/A |
| `FLASHINFER` | Native† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ❌ | ❌ | ✅ | Decoder | 7.x-9.x |
| `FLASHINFER` | TRTLLM† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ✅ | ❌ | ✅ | Decoder | 10.x |
| `FLASH_ATTN` | FA2* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥8.0 |
| `FLASH_ATTN` | FA3* | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ❌ | ✅ | All | 9.x |
| `FLASH_ATTN` | FA4* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥10.0 |
| `FLASH_ATTN_DIFFKV` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ✅ | Decoder | Any |
| `FLEX_ATTENTION` | | fp16, bf16, fp32 | `auto`, `bfloat16` | Any | Any | ❌ | ✅ | ❌ | Decoder, Encoder Only | Any |
| `ROCM_AITER_FA` | | fp16, bf16 | `auto` | 16, 32 | 64, 128, 256 | ❌ | ❌ | ❌ | Decoder | N/A |
| `ROCM_AITER_UNIFIED_ATTN` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ❌ | All | N/A |
| `ROCM_ATTN` | | fp16, bf16, fp32 | `auto` | 16, 32, 544 | 32, 64, 80, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | All | N/A |
| `TREE_ATTN` | | fp16, bf16 | `auto` | %16 | 32, 64, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | Decoder | Any |
| `TRITON_ATTN` | | fp16, bf16, fp32 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ✅ | ❌ | All | Any |
> **†** FlashInfer uses TRTLLM attention on Blackwell (SM100), which supports sinks. Disable via `--attention-config.use_trtllm_attention=0`.
>
> **\*** Specify the FlashAttention version via `--attention-config.flash_attn_version=2`, `3`, or `4`. Default is FA4 on SM100+ (Blackwell), FA3 on SM90 (Hopper), FA2 otherwise.
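For example, to pin a FlashAttention version or keep FlashInfer while opting out of the TRTLLM attention path, the flags from the footnotes above can be passed on the command line (an illustration only; `<model>` is a placeholder and combining these flags is assumed to be allowed since only the two backend selectors are mutually exclusive):
```bash
# Force FlashAttention 3 instead of the per-GPU default
vllm serve <model> --attention-backend FLASH_ATTN --attention-config.flash_attn_version=3

# Use FlashInfer but disable TRTLLM attention on Blackwell
vllm serve <model> --attention-backend FLASHINFER --attention-config.use_trtllm_attention=0
```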
## MLA (Multi-head Latent Attention) Backends
MLA uses separate backends for prefill and decode phases.
### Prefill Backends
The prefill backend is selected at runtime based on hardware and
configuration.
| Backend | Description | Compute Cap. | Enable | Disable | Notes |
|---------|-------------|--------------|--------|---------|-------|
| TRT-LLM Ragged‡ | TensorRT-LLM ragged attention | 10.x | Default on SM100 | `-ac.use_trtllm_ragged_deepseek_prefill=0` | DeepSeek R1 dims only |
| FlashInfer | FlashInfer CUTLASS backend | 10.x | `-ac.disable_flashinfer_prefill=0` | `-ac.disable_flashinfer_prefill=1` | DeepSeek R1 dims only |
| cuDNN | cuDNN-based attention | 10.x | `-ac.use_cudnn_prefill=1` | `-ac.use_cudnn_prefill=0` | |
| FlashAttention | FlashAttention varlen (FA2/FA3) | Any | Default fallback | Use other backends | FA3 on SM90, FA2 otherwise |
> **‡** TRT-LLM Ragged is the default on Blackwell (SM100).
> On other GPUs, FlashAttention is used as the default.
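As an illustration, the prefill path can be steered with the flags listed in the table above (`<model>` is a placeholder; availability depends on your GPU and build):
```bash
# Opt into the cuDNN prefill path for MLA models
vllm serve <model> -ac.use_cudnn_prefill=1

# Opt out of the default TRT-LLM ragged prefill on Blackwell (SM100)
vllm serve <model> -ac.use_trtllm_ragged_deepseek_prefill=0
```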
### Decode Backends
| Backend | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | Sparse | MM Prefix | DCP | Attention Types | Compute Cap. |
|---------|--------|-----------|-------------|------------|------|--------|-----------|-----|-----------------|--------------|
| `CUTLASS_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 128 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 10.x |
| `FLASHINFER_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 32, 64 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | 10.x |
| `FLASHINFER_MLA_SPARSE` | fp16, bf16 | `auto`, `bfloat16` | 32, 64 | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | 10.x |
| `FLASHMLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 64 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 9.x-10.x |
| `FLASHMLA_SPARSE` | bf16 | `auto`, `bfloat16`, `fp8_ds_mla` | 64 | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | 9.x-10.x |
| `FLASH_ATTN_MLA` | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 9.x |
| `ROCM_AITER_MLA` | fp16, bf16 | `auto` | 1 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| `ROCM_AITER_MLA_SPARSE` | fp16, bf16 | `auto` | Any | 576 | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| `ROCM_AITER_TRITON_MLA` | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| `TRITON_MLA` | fp16, bf16 | `auto`, `bfloat16` | Any | Any | ❌ | ❌ | ❌ | ✅ | Decoder | Any |