Allow markdownlint to run locally (#36398)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
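With this change, the markdownlint rules used in CI can also be run locally (for example via `pre-commit run --all-files` or `npx markdownlint-cli2 '**/*.md'`). As a sketch of the rule-configuration syntax only, a minimal `.markdownlint.yaml` might look like the following; the rule choices shown here are assumptions, not vLLM's actual configuration:

```yaml
# Hypothetical markdownlint config; the repository's real rule set may differ.
default: true  # enable all rules by default
MD013: false   # line-length: disabled, since long table rows are common
MD041: false   # first-line-heading: disabled for included doc fragments
```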
@@ -101,7 +101,7 @@ Priority is **1 = highest** (tried first).
 **Blackwell (SM 10.x):**

 | Priority | Backend |
-|----------|---------|
+| -------- | ------- |
 | 1 | `FLASHINFER` |
 | 2 | `FLASH_ATTN` |
 | 3 | `TRITON_ATTN` |
@@ -110,7 +110,7 @@ Priority is **1 = highest** (tried first).
 **Ampere/Hopper (SM 8.x-9.x):**

 | Priority | Backend |
-|----------|---------|
+| -------- | ------- |
 | 1 | `FLASH_ATTN` |
 | 2 | `FLASHINFER` |
 | 3 | `TRITON_ATTN` |
@@ -121,7 +121,7 @@ Priority is **1 = highest** (tried first).
 **Blackwell (SM 10.x):**

 | Priority | Backend |
-|----------|---------|
+| -------- | ------- |
 | 1 | `FLASHINFER_MLA` |
 | 2 | `CUTLASS_MLA` |
 | 3 | `FLASH_ATTN_MLA` |
@@ -133,7 +133,7 @@ Priority is **1 = highest** (tried first).
 **Ampere/Hopper (SM 8.x-9.x):**

 | Priority | Backend |
-|----------|---------|
+| -------- | ------- |
 | 1 | `FLASH_ATTN_MLA` |
 | 2 | `FLASHMLA` |
 | 3 | `FLASHINFER_MLA` |
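The priority tables above describe a first-match fallback: the highest-priority backend that the current platform supports wins. A minimal sketch of that selection loop (the function names and the availability check are illustrative, not vLLM's actual API):

```python
# Blackwell MLA priority order, taken from the table above.
BLACKWELL_MLA_PRIORITY = ["FLASHINFER_MLA", "CUTLASS_MLA", "FLASH_ATTN_MLA"]

def select_backend(priority, is_supported):
    """Return the first backend (1 = highest priority) that passes the support check."""
    for name in priority:
        if is_supported(name):
            return name
    raise RuntimeError("no supported attention backend found")

# Example: pretend FLASHINFER_MLA is unavailable on this machine,
# so selection falls through to the next entry.
chosen = select_backend(BLACKWELL_MLA_PRIORITY, lambda n: n != "FLASHINFER_MLA")
```

Here `chosen` is `"CUTLASS_MLA"`, the second-priority backend.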
@@ -145,7 +145,7 @@ Priority is **1 = highest** (tried first).
 ## Legend

 | Column | Description |
-|--------|-------------|
+| ------ | ----------- |
 | **Dtypes** | Supported model data types (fp16, bf16, fp32) |
 | **KV Dtypes** | Supported KV cache data types (`auto`, `fp8`, `fp8_e4m3`, etc.) |
 | **Block Sizes** | Supported KV cache block sizes (%N means multiples of N) |
@@ -162,20 +162,20 @@ Priority is **1 = highest** (tried first).
 ## Standard Attention (MHA, MQA, GQA) Backends

 | Backend | Version | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | MM Prefix | DCP | Attention Types | Compute Cap. |
-|---------|---------|--------|-----------|-------------|------------|------|-----------|-----|-----------------|--------------|
+| ------- | ------- | ------ | --------- | ----------- | ---------- | ---- | --------- | --- | --------------- | ------------ |
 | `CPU_ATTN` | | fp16, bf16, fp32 | `auto` | Any | 32, 64, 80, 96, 112, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | All | N/A |
 | `FLASHINFER` | Native† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ❌ | ❌ | ✅ | Decoder | 7.x-9.x |
 | `FLASHINFER` | TRTLLM† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ✅ | ❌ | ✅ | Decoder | 10.x |
 | `FLASH_ATTN` | FA2* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥8.0 |
 | `FLASH_ATTN` | FA3* | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ❌ | ✅ | All | 9.x |
 | `FLASH_ATTN` | FA4* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥10.0 |
 | `FLASH_ATTN_DIFFKV` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ✅ | Decoder | Any |
 | `FLEX_ATTENTION` | | fp16, bf16, fp32 | `auto`, `bfloat16` | Any | Any | ❌ | ✅ | ❌ | Decoder, Encoder Only | Any |
 | `ROCM_AITER_FA` | | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32 | 64, 128, 256 | ❌ | ❌ | ❌ | Decoder, Enc-Dec | N/A |
 | `ROCM_AITER_UNIFIED_ATTN` | | fp16, bf16 | `auto` | %16 | Any | ✅ | ✅ | ❌ | All | N/A |
 | `ROCM_ATTN` | | fp16, bf16, fp32 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 544 | 32, 64, 80, 96, 128, 160, 192, 224, 256 | ✅ | ✅ | ❌ | All | N/A |
 | `TREE_ATTN` | | fp16, bf16 | `auto` | %16 | 32, 64, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | Decoder | Any |
 | `TRITON_ATTN` | | fp16, bf16, fp32 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ✅ | ❌ | All | Any |

 > **†** FlashInfer uses TRTLLM attention on Blackwell (SM100), which supports sinks. Disable via `--attention-config.use_trtllm_attention=0`.
 >
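The Legend's constraint notation reads naturally as a per-backend capability check. A hedged sketch using the constraints from the `FLASH_ATTN` FA2 row (the function and its name are illustrative, not part of vLLM):

```python
def fa2_supports(dtype, kv_dtype, block_size):
    """Check the FA2 row's constraints: fp16/bf16 dtypes, auto/bfloat16
    KV cache, and block sizes that are multiples of 16 ("%16")."""
    return (
        dtype in ("fp16", "bf16")
        and kv_dtype in ("auto", "bfloat16")
        and block_size % 16 == 0
    )
```

For example, `fa2_supports("bf16", "auto", 32)` passes, while a block size of 24 fails the `%16` constraint.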
@@ -191,10 +191,10 @@ The prefill backend is selected at runtime based on hardware and
 configuration.

 | Backend | Description | Compute Cap. | Enable | Disable | Notes |
-|---------|-------------|--------------|--------|---------|-------|
+| ------- | ----------- | ------------ | ------ | ------- | ----- |
 | TRT-LLM Ragged‡ | TensorRT-LLM ragged attention | 10.x | Default on SM100 | `-ac.use_trtllm_ragged_deepseek_prefill=0` | DeepSeek R1 dims only |
 | FlashInfer | FlashInfer CUTLASS backend | 10.x | `-ac.disable_flashinfer_prefill=0` | `-ac.disable_flashinfer_prefill=1` | DeepSeek R1 dims only |
 | cuDNN | cuDNN-based attention | 10.x | `-ac.use_cudnn_prefill=1` | `-ac.use_cudnn_prefill=0` | |
 | FlashAttention | FlashAttention varlen (FA2/FA3) | Any | Default fallback | Use other backends | FA3 on SM90, FA2 otherwise |

 > **‡** TRT-LLM Ragged is the default on Blackwell (SM100).
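The enable/disable flags above gate which prefill path runs. A simplified sketch of that runtime dispatch, assuming (not verified against vLLM's source) that the table's defaults encode the decision order:

```python
def choose_prefill_backend(sm_major, use_trtllm_ragged=True, use_cudnn=False):
    """Hypothetical prefill dispatch mirroring the table's defaults:
    TRT-LLM Ragged is the default on SM100, cuDNN is opt-in on SM100,
    and FlashAttention is the fallback everywhere else."""
    if sm_major == 10 and use_trtllm_ragged:
        return "TRT-LLM Ragged"
    if sm_major == 10 and use_cudnn:
        return "cuDNN"
    return "FlashAttention"  # FA3 on SM90, FA2 otherwise
```

On Blackwell (`sm_major == 10`) this returns `"TRT-LLM Ragged"` unless that path is disabled; on Hopper and earlier it falls back to FlashAttention.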
@@ -203,7 +203,7 @@ configuration.
 ### Decode Backends

 | Backend | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | Sparse | MM Prefix | DCP | Attention Types | Compute Cap. |
-|---------|--------|-----------|-------------|------------|------|--------|-----------|-----|-----------------|--------------|
+| ------- | ------ | --------- | ----------- | ---------- | ---- | ------ | --------- | --- | --------------- | ------------ |
 | `CUTLASS_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 128 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 10.x |
 | `FLASHINFER_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 32, 64 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | 10.x |
 | `FLASHINFER_MLA_SPARSE` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 32, 64 | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | 10.x |