Lily Liu
7041de4384
[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode ( #4628 )
...
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com >, bong-furiosa <bongwon.jang@furiosa.ai >
2024-06-28 15:28:49 -07:00
Ilya Lavrenov
57f09a419c
[Hardware][Intel] OpenVINO vLLM backend ( #5379 )
2024-06-28 13:50:16 +00:00
Kunshang Ji
728c4c8a06
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend ( #3814 )
...
Co-authored-by: Jiang Li <jiang1.li@intel.com >
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com >
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com >
2024-06-17 11:01:25 -07:00
Li, Jiang
c3c2903e72
[Bugfix] Add device assertion to TorchSDPA ( #5402 )
2024-06-12 12:58:53 -07:00
Woosuk Kwon
1a8bfd92d5
[Hardware] Initial TPU integration ( #5292 )
2024-06-12 11:53:03 -07:00
Woosuk Kwon
c65146e75e
[Misc] Fix docstring of get_attn_backend ( #5271 )
2024-06-05 09:18:59 -07:00
Eric Xihui Lin
8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model ( #4799 )
...
Co-authored-by: beagleski <yunanzhang@microsoft.com >
Co-authored-by: bapatra <bapatra@microsoft.com >
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-05-24 22:00:52 -07:00
Cody Yu
ee3eea0a1b
[Misc] Take user preference in attention selector ( #4960 )
2024-05-23 07:55:56 +09:00
Isotr0py
99eff67ba9
[Bugfix][Kernel] Add head size check for attention backend selection ( #4944 )
2024-05-21 15:33:25 -04:00
Woosuk Kwon
b57e6c5949
[Kernel] Add flash-attn back ( #4907 )
2024-05-19 18:11:30 -07:00
SangBin Cho
8a7cc254a0
Revert "[Kernel] Use flash-attn for decoding ( #3648 )" ( #4820 )
...
Lora 3 & 4 test seems to have illegal memory access failure after this commit;
[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
<br class="Apple-interchange-newline">
Exmaple: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241
This reverts commit 1356df5 .
FILL IN THE PR DESCRIPTION HERE
FIX #xxxx (link existing issues this PR will resolve)
2024-05-15 11:52:45 +09:00
Stephen Krider
1356df53bd
[Kernel] Use flash-attn for decoding ( #3648 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2024-05-13 15:50:33 -07:00
Woosuk Kwon
0fca3cdcf2
[Misc] Enhance attention selector ( #4751 )
2024-05-13 10:47:25 -07:00
Woosuk Kwon
89579a201f
[Misc] Use vllm-flash-attn instead of flash-attn ( #4686 )
2024-05-08 13:15:34 -07:00
Lily Liu
43c413ec57
[Kernel] Use flashinfer for decoding ( #4353 )
...
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com >
2024-05-03 15:51:27 -07:00
youkaichao
5b8a7c1cb0
[Misc] centralize all usage of environment variables ( #4548 )
2024-05-02 11:13:25 -07:00
Roy
b6dcb4d442
[Misc] Fix flash attention backend log ( #4368 )
2024-04-25 12:43:32 -07:00
SangBin Cho
e42df7227d
[Test] Add xformer and flash attn tests ( #3961 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-04-11 03:09:50 +00:00
Juan Villamizar
6c0b04515f
[ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm ( #3643 )
...
Co-authored-by: jpvillam <jpvillam@amd.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-04-09 15:10:47 -07:00
bigPYJ1151
0e3f06fe9c
[Hardware][Intel] Add CPU inference backend ( #3634 )
...
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com >
2024-04-01 22:07:30 -07:00
Woosuk Kwon
395aa823ea
[Misc] Minor type annotation fix ( #3716 )
2024-03-28 21:12:24 -07:00
Simon Mo
4716a32dd4
fix logging msg for block manager ( #3701 )
2024-03-28 23:29:55 +00:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. ( #3495 )
2024-03-25 07:59:47 -07:00
Woosuk Kwon
925f3332ca
[Core] Refactor Attention Take 2 ( #3462 )
2024-03-25 04:39:33 +00:00