Allow markdownlint to run locally (#36398)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
The following table lists the backends that support full CUDA Graphs at the time of writing.

| Attention Backend | cudagraph_support | Comments |
| :---------------- | :---------------- | :------- |
| FlashAttention v2 | `UNIFORM_BATCH` | Actually `ALWAYS`, but falls back to `FULL_AND_PIECEWISE` as a workaround for performance reasons |
| FlashAttention v3 | `ALWAYS` | Has a unified routine for both batch types, so `FULL` mode works well |
| Triton Attention | `ALWAYS` | Prefers `FULL_AND_PIECEWISE` since it has different kernels for prefill/mixed and pure decode batches |
| AITER FlashAttention | `UNIFORM_BATCH` | |
| FlashInfer | `UNIFORM_SINGLE_TOKEN_DECODE` | Will be set to `UNIFORM_BATCH` when using TRTLLM attention on Blackwell |
| FlashMLA | `UNIFORM_BATCH` | |
| FlashInferMLA | `UNIFORM_BATCH` | |
| FlashInferMLASparse | `UNIFORM_BATCH` | |
| AITER MLA | `UNIFORM_SINGLE_TOKEN_DECODE` | |
| CUTLASS MLA | `UNIFORM_SINGLE_TOKEN_DECODE` | |
| Mamba attention | `UNIFORM_SINGLE_TOKEN_DECODE` | |
Unlisted backends are all declared as `NEVER`.
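The support levels above can be read as predicates on the shape of the current batch. A minimal Python sketch of that reading (the class and function names here are illustrative, not vLLM's actual API), assuming `query_lens` holds the number of new tokens scheduled per request:

```python
from enum import Enum, auto

class CudagraphSupport(Enum):
    """Hypothetical mirror of the support levels in the table above."""
    NEVER = auto()                        # full CUDA graphs never allowed
    UNIFORM_SINGLE_TOKEN_DECODE = auto()  # pure decode only, one token per request
    UNIFORM_BATCH = auto()                # all requests must have the same query length
    ALWAYS = auto()                       # any batch, including prefill/mixed

def can_use_full_cudagraph(support: CudagraphSupport, query_lens: list[int]) -> bool:
    """Return True if this batch may run under a full CUDA graph."""
    if support is CudagraphSupport.NEVER or not query_lens:
        return False
    if support is CudagraphSupport.ALWAYS:
        return True
    uniform = len(set(query_lens)) == 1
    if support is CudagraphSupport.UNIFORM_BATCH:
        return uniform
    # UNIFORM_SINGLE_TOKEN_DECODE: uniform AND exactly one token per request
    return uniform and query_lens[0] == 1
```

For example, a backend declared `UNIFORM_BATCH` accepts a batch with query lengths `[4, 4, 4]` but not `[4, 1]`, while `UNIFORM_SINGLE_TOKEN_DECODE` additionally rejects `[2, 2]`.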