[7/N][Attention][Docs] Add documentation for attention backends (#32477)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@@ -288,5 +288,11 @@ Based on the configuration, the content of the multi-modal caches on `P0` and `P
 | shm | Shared Memory Caching | K | N/A | V | `mm_processor_cache_gb * api_server_count` |
 | N/A | Disabled | N/A | N/A | N/A | `0` |
 
 K: Stores the hashes of multi-modal items
 V: Stores the processed tensor data of multi-modal items
+
+## Attention Backend Selection
+
+vLLM supports multiple attention backends optimized for different hardware and use cases. The backend is automatically selected based on your GPU architecture, model type, and configuration, but you can also manually specify one for optimal performance.
+
+For detailed information on available backends, their feature support, and how to configure them, see the [Attention Backend Feature Support](../design/attention_backends.md) documentation.
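
As a companion to the new section, here is a minimal sketch of manually pinning a backend, assuming the `VLLM_ATTENTION_BACKEND` environment variable and the `FLASHINFER` backend name apply to your installed vLLM build; the set of valid backend names varies by release and hardware, so check the linked Attention Backend Feature Support page first.

```python
# Minimal sketch: pinning vLLM's attention backend explicitly.
# Assumes the VLLM_ATTENTION_BACKEND environment variable and the
# "FLASHINFER" backend name are supported by your installed vLLM
# version; automatic selection is used when the variable is unset.
import os

# The backend is chosen at engine initialization, so set the
# variable before importing vLLM.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
sampling = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["The capital of France is"], sampling)
print(outputs[0].outputs[0].text)
```

Unsetting the variable restores the automatic selection described above, which picks a backend based on GPU architecture, model type, and configuration.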