2828 Commits

Author SHA1 Message Date
Zhuohan Li
002800f081 Align vLLM's beam search implementation with HF generate (#857) 2023-09-04 17:29:42 -07:00
Woosuk Kwon
32b6816e55 Add tests for models (#922) 2023-09-01 11:19:43 +09:00
Aman Gupta Karmani
75471386de use flash-attn via xformers (#877) 2023-08-29 21:52:13 -07:00
Woosuk Kwon
d64bf1646c Implement approximate GELU kernels (#828) 2023-08-23 07:43:21 +09:00
Tao Peng
d7a1c6d614 Fix paged attention testing. (#495)
Signed-off-by: Tao Peng <jiankeng.pt@alibaba-inc.com>
2023-07-24 21:01:56 -07:00
Song
bda41c70dd hotfix attn alibi wo head mapping (#496)
Co-authored-by: oliveryuan <oliveryuan@basemind.com>
2023-07-18 11:31:48 -07:00
Andre Slavescu
c894836108 [Model] Add support for GPT-J (#226)
Co-authored-by: woWoosuk Kwon <woosuk.kwon@berkeley.edu>
2023-07-08 17:55:16 -07:00
Woosuk Kwon
e41f06702c Add support for BLOOM (#331) 2023-07-03 13:12:35 -07:00
Zhuohan Li
d6fa1be3a8 [Quality] Add code formatter and linter (#326) 2023-07-03 11:31:55 -07:00
Woosuk Kwon
0b98ba15c7 Change the name to vLLM (#150) 2023-06-17 03:07:40 -07:00
Woosuk Kwon
e38074b1e6 Support FP32 (#141) 2023-06-07 00:40:21 -07:00
Woosuk Kwon
a283ec2eec Add contributing guideline and mypy config (#122) 2023-05-23 17:58:51 -07:00
Woosuk Kwon
825d8892b5 Use pytest format for unit tests (#107) 2023-05-17 17:11:23 -07:00
Woosuk Kwon
c9d5b6d4a8 Replace FlashAttention with xformers (#70) 2023-05-05 02:01:08 -07:00
Woosuk Kwon
436e523bf1 Refactor attention kernels (#53) 2023-05-03 13:40:13 -07:00
Woosuk Kwon
a96d63c21d Add support for GPT-NeoX (Pythia) (#50) 2023-04-28 00:32:10 -07:00
Siyuan (Ryans) Zhuang
e3cec88aa5 Memcpy kernel for flash attention (#29)
* optimize
* add benchmark
* add assert
* add test
2023-04-10 18:22:49 -07:00
Woosuk Kwon
b9926f7f66 Support block size 32 (#35) 2023-04-09 23:07:18 -07:00
Woosuk Kwon
c267b1a02c Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script (#27)
* Add query stride to multi_query_cached_kv_attention
* Add kernel benchmark script
2023-04-08 13:36:09 -07:00
Woosuk Kwon
0f40557af6 Implement block copy kernel to optimize beam search (#32) 2023-04-07 17:45:07 -07:00
Siyuan (Ryans) Zhuang
21b3671bbc Basic attention kernel that supports cached KV + (multi-)prompts (#24) 2023-04-04 20:34:46 -07:00
Woosuk Kwon
897cb2ae28 Optimize data movement (#20) 2023-04-02 00:30:17 -07:00
Woosuk Kwon
09e9245478 Add custom kernel for RMS normalization (#16) 2023-04-01 00:51:22 +08:00
Woosuk Kwon
88c0268a18 Implement custom kernel for LLaMA rotary embedding (#14) 2023-03-30 11:04:21 -07:00
Woosuk Kwon
a1b3de86cd Refactor the test code for attention kernels (#13) 2023-03-29 18:59:27 -07:00
Woosuk Kwon
3e9f991d6a Use FlashAttention for multi_query_kv_attention (#4) 2023-03-01 21:13:08 -08:00
Woosuk Kwon
0deacbce6e Implement single_query_cached_kv_attention kernel (#3) 2023-03-01 15:02:19 -08:00
Woosuk Kwon
af68ec1c5c Add tests for kernels 2023-02-18 19:23:07 +00:00