# Attention Backend Feature Support

This document is auto-generated by `tools/pre_commit/generate_attention_backend_docs.py`.
It shows the feature support for each registered attention backend based on the checks in
`AttentionBackend.validate_configuration()`.

**Do not edit this file manually.** Run the following command to regenerate it:

```bash
python tools/pre_commit/generate_attention_backend_docs.py
```

## Setting the Attention Backend

### Command Line

There are two ways to specify the backend from the command line:

**Option 1: Using `--attention-backend` (simple)**

```bash
vllm serve --attention-backend FLASH_ATTN
```

**Option 2: Using `--attention-config.backend` / `-ac.backend` (structured config)**

```bash
# Dot notation
vllm serve --attention-config.backend FLASH_ATTN
vllm serve -ac.backend FLASH_ATTN

# JSON format
vllm serve --attention-config '{"backend": "FLASH_ATTN"}'
vllm serve -ac '{"backend": "FLASH_ATTN"}'
```

> **Note:** `--attention-backend` and `--attention-config.backend` are mutually
> exclusive. Use one or the other, not both.

### Python API

Use `AttentionConfig` with the `LLM` class:

```python
from vllm import LLM
from vllm.config import AttentionConfig
from vllm.v1.attention.backends.registry import AttentionBackendEnum

# Method 1: Using AttentionConfig with an enum value
llm = LLM(
    model="Qwen/Qwen3-0.6B",
    attention_config=AttentionConfig(backend=AttentionBackendEnum.FLASH_ATTN),
)

# Method 2: Using the attention_backend parameter with a string
llm = LLM(
    model="Qwen/Qwen3-0.6B",
    attention_backend="FLASH_ATTN",
)
```

## Backend Selection Behavior

### Manual Selection

When you explicitly set a backend via `--attention-backend` or `AttentionConfig`:

1. The backend is **validated** against your configuration (model dtype, head size, compute capability, etc.)
2. If the backend **doesn't support** your configuration, an error is raised with the specific reason
3. If valid, the backend is used

Example error when selecting an incompatible backend:

```text
ValueError: Selected backend FLASHMLA is not valid for this configuration.
Reason: ['compute capability not supported']
```

### Automatic Selection

When no backend is specified (the default):

1. vLLM iterates through backends in **priority order** (see the tables below)
2. Each backend is validated against your configuration
3. The **first compatible backend** is selected
4. If no backend is compatible, an error is raised listing all backends and their incompatibility reasons
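This first-match behavior amounts to a short loop over the priority tables in the next section. The sketch below is a minimal, self-contained illustration of that loop; the priority list and the toy `validate()` rules are placeholders for this example only, while the real checks live in each backend's `validate_configuration()`.

```python
# Minimal sketch of the automatic-selection loop described above.
# The toy validate() rules only mirror a couple of entries from the feature
# tables below; vLLM's real checks live in validate_configuration().

PRIORITY_SM8X_9X = ["FLASH_ATTN", "FLASHINFER", "TRITON_ATTN", "FLEX_ATTENTION"]


def validate(backend: str, config: dict) -> list[str]:
    """Return incompatibility reasons; an empty list means compatible."""
    reasons = []
    if backend in ("FLASH_ATTN", "FLASHINFER") and config["dtype"] == "fp32":
        reasons.append("fp32 not supported")
    if backend == "FLASHINFER" and config["head_size"] not in (64, 128, 256):
        reasons.append(f"head size {config['head_size']} not supported")
    return reasons


def select_backend(priority: list[str], config: dict) -> str:
    """Return the first backend in priority order that passes validation."""
    failures = {}
    for name in priority:
        reasons = validate(name, config)
        if not reasons:
            return name  # first compatible backend wins
        failures[name] = reasons
    raise ValueError(f"No compatible attention backend found: {failures}")


# FLASH_ATTN is tried first and passes, so it is selected.
print(select_backend(PRIORITY_SM8X_9X, {"dtype": "bf16", "head_size": 80}))
```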
## Backend Priority (CUDA)

When no backend is explicitly selected, vLLM chooses the first compatible backend from these priority-ordered lists. Priority is **1 = highest** (tried first).

### Standard Attention (MHA, MQA, GQA)

**Blackwell (SM 10.x):**

| Priority | Backend |
|----------|---------|
| 1 | `FLASHINFER` |
| 2 | `FLASH_ATTN` |
| 3 | `TRITON_ATTN` |
| 4 | `FLEX_ATTENTION` |

**Ampere/Hopper (SM 8.x-9.x):**

| Priority | Backend |
|----------|---------|
| 1 | `FLASH_ATTN` |
| 2 | `FLASHINFER` |
| 3 | `TRITON_ATTN` |
| 4 | `FLEX_ATTENTION` |

### MLA Attention (DeepSeek-style)

**Blackwell (SM 10.x):**

| Priority | Backend |
|----------|---------|
| 1 | `FLASHINFER_MLA` |
| 2 | `CUTLASS_MLA` |
| 3 | `FLASH_ATTN_MLA` |
| 4 | `FLASHMLA` |
| 5 | `TRITON_MLA` |
| 6 | `FLASHMLA_SPARSE` |
| 7 | `FLASHINFER_MLA_SPARSE` |

**Ampere/Hopper (SM 8.x-9.x):**

| Priority | Backend |
|----------|---------|
| 1 | `FLASH_ATTN_MLA` |
| 2 | `FLASHMLA` |
| 3 | `FLASHINFER_MLA` |
| 4 | `TRITON_MLA` |
| 5 | `FLASHMLA_SPARSE` |

> **Note:** ROCm and CPU platforms have their own selection logic. See the platform-specific documentation for details.

## Legend

| Column | Description |
|--------|-------------|
| **Dtypes** | Supported model data types (fp16, bf16, fp32) |
| **KV Dtypes** | Supported KV cache data types (`auto`, `fp8`, `fp8_e4m3`, etc.) |
| **Block Sizes** | Supported KV cache block sizes (%N means multiples of N) |
| **Head Sizes** | Supported attention head sizes |
| **Sink** | Attention sink support (for StreamingLLM) |
| **Sparse** | Sparse attention support (MLA only) |
| **MM Prefix** | Multimodal prefix full attention support |
| **DCP** | Decode Context Parallelism support (`--decode-context-parallel-size`) |
| **Attention Types** | Supported attention patterns (Decoder, Encoder, Enc-Dec) |
| **Compute Cap.** | Required CUDA compute capability (N/A for non-CUDA backends) |

**Symbols:** ✅ = Supported, ❌ = Not supported
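One reading note for the tables that follow: in the **Block Sizes** column, a literal list means only those exact sizes are accepted, `%N` means any multiple of N, and `Any` means no constraint. The helper below is a small illustrative parser for that notation only and is not part of vLLM.

```python
def block_size_ok(spec: str, block_size: int) -> bool:
    """Check a KV cache block size against a 'Block Sizes' cell from the tables."""
    if spec == "Any":
        return True
    if spec.startswith("%"):  # e.g. "%16" means any multiple of 16
        return block_size % int(spec[1:]) == 0
    return block_size in {int(x) for x in spec.split(",")}  # e.g. "16, 32, 64"


assert block_size_ok("%16", 48)            # FLASH_ATTN accepts multiples of 16
assert not block_size_ok("16, 32, 64", 8)  # FLASHINFER lists exact sizes only
```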
## Standard Attention (MHA, MQA, GQA) Backends

| Backend | Version | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | MM Prefix | DCP | Attention Types | Compute Cap. |
|---------|---------|--------|-----------|-------------|------------|------|-----------|-----|-----------------|--------------|
| `CPU_ATTN` | | fp16, bf16, fp32 | `auto` | Any | 32, 64, 80, 96, 112, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | All | N/A |
| `FLASHINFER` | Native† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ❌ | ❌ | ✅ | Decoder | 7.x-9.x |
| `FLASHINFER` | TRTLLM† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ✅ | ❌ | ✅ | Decoder | 10.x |
| `FLASH_ATTN` | FA2* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥8.0 |
| `FLASH_ATTN` | FA3* | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ❌ | ✅ | All | 9.x |
| `FLASH_ATTN` | FA4* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥10.0 |
| `FLASH_ATTN_DIFFKV` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ✅ | Decoder | Any |
| `FLEX_ATTENTION` | | fp16, bf16, fp32 | `auto`, `bfloat16` | Any | Any | ❌ | ✅ | ❌ | Decoder, Encoder Only | Any |
| `ROCM_AITER_FA` | | fp16, bf16 | `auto` | 16, 32 | 64, 128, 256 | ❌ | ❌ | ❌ | Decoder | N/A |
| `ROCM_AITER_UNIFIED_ATTN` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ❌ | All | N/A |
| `ROCM_ATTN` | | fp16, bf16, fp32 | `auto` | 16, 32, 544 | 32, 64, 80, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | All | N/A |
| `TREE_ATTN` | | fp16, bf16 | `auto` | %16 | 32, 64, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | Decoder | Any |
| `TRITON_ATTN` | | fp16, bf16, fp32 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ✅ | ❌ | All | Any |

> **†** FlashInfer uses TRTLLM attention on Blackwell (SM100), which supports sinks. Disable via `--attention-config.use_trtllm_attention=0`.
>
> **\*** Specify the FlashAttention version via `--attention-config.flash_attn_version=2`, `3`, or `4`. Default is FA4 on SM100+ (Blackwell), FA3 on SM90 (Hopper), FA2 otherwise.

## MLA (Multi-head Latent Attention) Backends

MLA uses separate backends for prefill and decode phases.

### Prefill Backends

The prefill backend is selected at runtime based on hardware and configuration.

| Backend | Description | Compute Cap. | Enable | Disable | Notes |
|---------|-------------|--------------|--------|---------|-------|
| TRT-LLM Ragged‡ | TensorRT-LLM ragged attention | 10.x | Default on SM100 | `-ac.use_trtllm_ragged_deepseek_prefill=0` | DeepSeek R1 dims only |
| FlashInfer | FlashInfer CUTLASS backend | 10.x | `-ac.disable_flashinfer_prefill=0` | `-ac.disable_flashinfer_prefill=1` | DeepSeek R1 dims only |
| cuDNN | cuDNN-based attention | 10.x | `-ac.use_cudnn_prefill=1` | `-ac.use_cudnn_prefill=0` | |
| FlashAttention | FlashAttention varlen (FA2/FA3) | Any | Default fallback | Use other backends | FA3 on SM90, FA2 otherwise |

> **‡** TRT-LLM Ragged is the default on Blackwell (SM100).
> On other GPUs, FlashAttention is used as the default.
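The Enable/Disable columns above use the `-ac.*` dot notation from the command line. The sketch below shows the same thing from Python, assuming those dot keys map one-to-one onto `AttentionConfig` fields (check `AttentionConfig` for the authoritative names); the model name is just an example MLA-style model.

```python
from vllm import LLM
from vllm.config import AttentionConfig
from vllm.v1.attention.backends.registry import AttentionBackendEnum

# Illustrative sketch: on Blackwell, pick an MLA decode backend and opt into
# the cuDNN prefill path. The use_cudnn_prefill field name is assumed to
# mirror the `-ac.use_cudnn_prefill=1` CLI key shown in the table above.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # example MLA-style model
    attention_config=AttentionConfig(
        backend=AttentionBackendEnum.FLASHMLA,
        use_cudnn_prefill=True,
    ),
)
```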
### Decode Backends

| Backend | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | Sparse | MM Prefix | DCP | Attention Types | Compute Cap. |
|---------|--------|-----------|-------------|------------|------|--------|-----------|-----|-----------------|--------------|
| `CUTLASS_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 128 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 10.x |
| `FLASHINFER_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 32, 64 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | 10.x |
| `FLASHINFER_MLA_SPARSE` | fp16, bf16 | `auto`, `bfloat16` | 32, 64 | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | 10.x |
| `FLASHMLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 64 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 9.x-10.x |
| `FLASHMLA_SPARSE` | bf16 | `auto`, `bfloat16`, `fp8_ds_mla` | 64 | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | 9.x-10.x |
| `FLASH_ATTN_MLA` | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 9.x |
| `ROCM_AITER_MLA` | fp16, bf16 | `auto` | 1 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| `ROCM_AITER_MLA_SPARSE` | fp16, bf16 | `auto` | Any | 576 | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| `ROCM_AITER_TRITON_MLA` | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| `TRITON_MLA` | fp16, bf16 | `auto`, `bfloat16` | Any | Any | ❌ | ❌ | ❌ | ✅ | Decoder | Any |
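Pinning one of these decode backends manually goes through the same validation described under "Manual Selection": an incompatible choice fails fast with a `ValueError` naming the failed checks. A minimal sketch, assuming `FLASH_ATTN_MLA` (listed above as compute capability 9.x only) and an example MLA-style model:

```python
from vllm import LLM
from vllm.config import AttentionConfig
from vllm.v1.attention.backends.registry import AttentionBackendEnum

# Illustrative sketch: pin the MLA decode backend to FLASH_ATTN_MLA.
# On non-Hopper GPUs this raises the validation error described under
# "Manual Selection".
try:
    llm = LLM(
        model="deepseek-ai/DeepSeek-V2-Lite",  # example MLA-style model
        attention_config=AttentionConfig(backend=AttentionBackendEnum.FLASH_ATTN_MLA),
    )
except ValueError as err:
    print(f"FLASH_ATTN_MLA is not usable with this configuration: {err}")
```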