[Perf] Set Flashinfer sparse MLA as default backend for FP8 kv cache (#37252)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
2026-03-17 16:09:20 -04:00
parent e78821b438
commit b36adfa349
3 changed files with 61 additions and 20 deletions
--- a/docs/design/attention_backends.md
+++ b/docs/design/attention_backends.md
@@ -127,8 +127,8 @@ Priority is **1 = highest** (tried first).
 | 3 | `FLASH_ATTN_MLA` |
 | 4 | `FLASHMLA` |
 | 5 | `TRITON_MLA` |
-| 6 | `FLASHMLA_SPARSE` |
-| 7 | `FLASHINFER_MLA_SPARSE` |
+| 6 | `FLASHINFER_MLA_SPARSE`**\*** |
+| 7 | `FLASHMLA_SPARSE` |

 **Ampere/Hopper (SM 8.x-9.x):**

@@ -140,6 +140,8 @@ Priority is **1 = highest** (tried first).
 | 4 | `TRITON_MLA` |
 | 5 | `FLASHMLA_SPARSE` |

+> **\*** For sparse MLA, FP8 KV cache always prefers `FLASHINFER_MLA_SPARSE`. With BF16 KV cache, `FLASHINFER_MLA_SPARSE` is preferred for low query-head counts (<= 16), while `FLASHMLA_SPARSE` is preferred otherwise.
+>
 > **Note:** ROCm and CPU platforms have their own selection logic. See the platform-specific documentation for details.

 ## Legend
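
The footnote added in this change describes a small decision rule for the two sparse MLA backends. A minimal sketch of that rule follows. It is not the selector code from this commit: the function name, the dtype strings, and the error handling are hypothetical. Only the backend names and the 16-head threshold come from the note above.

```python
# Illustrative sketch only; names other than the two backend identifiers
# are assumptions and do not mirror the real selector.
FLASHINFER_MLA_SPARSE = "FLASHINFER_MLA_SPARSE"
FLASHMLA_SPARSE = "FLASHMLA_SPARSE"

# Threshold taken from the footnote: "low query-head counts (<= 16)".
LOW_QUERY_HEAD_LIMIT = 16


def preferred_sparse_mla_backend(kv_cache_dtype: str, num_query_heads: int) -> str:
    """Return the sparse MLA backend the footnote says is tried first.

    kv_cache_dtype: assumed spelling; any string starting with "fp8" is
    treated as FP8, and "bf16"/"bfloat16" as BF16.
    """
    dtype = kv_cache_dtype.lower()
    if dtype.startswith("fp8"):
        # FP8 KV cache: FlashInfer sparse MLA regardless of head count.
        return FLASHINFER_MLA_SPARSE
    if dtype in ("bf16", "bfloat16"):
        # BF16 KV cache: the choice depends on the number of query heads.
        if num_query_heads <= LOW_QUERY_HEAD_LIMIT:
            return FLASHINFER_MLA_SPARSE
        return FLASHMLA_SPARSE
    # The footnote says nothing about other dtypes, so this sketch refuses to guess.
    raise ValueError(f"dtype not covered by the footnote: {kv_cache_dtype!r}")


if __name__ == "__main__":
    for dtype, heads in [("fp8", 128), ("bfloat16", 16), ("bfloat16", 128)]:
        print(dtype, heads, preferred_sparse_mla_backend(dtype, heads))
```

The sketch only captures which backend is preferred. Per the priority table, the other sparse backend is still the next candidate when the preferred one cannot be used.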