[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799)

Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Author:    Eric Xihui Lin
Date:      2024-05-25 01:00:52 -04:00
Committer: GitHub
Parent:    e64fde4b01
Commit:    8e192ff967

23 changed files with 2445 additions and 87 deletions

@@ -29,7 +29,14 @@ def get_attn_backend(
     dtype: torch.dtype,
     kv_cache_dtype: Optional[str],
     block_size: int,
+    is_blocksparse: bool = False,
 ) -> Type[AttentionBackend]:
+    if is_blocksparse:
+        logger.info("Using BlocksparseFlashAttention backend.")
+        from vllm.attention.backends.blocksparse_attn import (
+            BlocksparseFlashAttentionBackend)
+        return BlocksparseFlashAttentionBackend
+
     """Determine which attention backend to use and only import
     the selected backend module.
     """