Cyrus Leung
9831aec49f
[Core] Dynamic image size support for VLMs ( #5276 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: ywang96 <ywang@roblox.com >
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-07-02 20:34:00 -07:00
Mor Zusman
9d6a8daa87
[Model] Jamba support ( #4115 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: Erez Schwartz <erezs@ai21.com >
Co-authored-by: Mor Zusman <morz@ai21.com >
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com >
Co-authored-by: Tomer Asida <tomera@ai21.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-02 23:11:29 +00:00
Murali Andoorveedu
c5832d2ae9
[Core] Pipeline Parallel Support ( #4412 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-02 10:58:08 -07:00
xwjiang2010
98d6682cd1
[VLM] Remove image_input_type from VLM config ( #5852 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-02 07:57:09 +00:00
Alexander Matveev
3476ed0809
[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) ( #5602 )
2024-07-01 20:10:37 -07:00
Antoni Baum
7c01f70641
[Core] Optimize SequenceStatus.is_finished by switching to IntEnum ( #5974 )
2024-06-29 12:47:53 +00:00
Cody Yu
b2c620230a
[Spec Decode] Introduce DraftModelRunner ( #5799 )
2024-06-28 09:17:51 -07:00
Cyrus Leung
5cbe8d155c
[Core] Registry for processing model inputs ( #5214 )
...
Co-authored-by: ywang96 <ywang@roblox.com >
2024-06-28 12:09:56 +00:00
youkaichao
64e8d2a783
[core][misc] remove logical block ( #5882 )
2024-06-27 13:34:55 -07:00
Stephanie Wang
dda4811591
[Core] Refactor Worker and ModelRunner to consolidate control plane communication ( #5408 )
...
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu >
Signed-off-by: Stephanie <swang@anyscale.com >
Co-authored-by: Stephanie <swang@anyscale.com >
2024-06-25 20:30:03 -07:00
Joshua Rosenkranz
b12518d3cf
[Model] MLPSpeculator speculative decoding support ( #4947 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com >
2024-06-20 20:23:12 -04:00
Ronen Schaffer
7879f24dcc
[Misc] Add OpenTelemetry support ( #4687 )
...
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.
I've also added a Markdown guide with screenshots that shows users how to use this feature. You can find it here
2024-06-19 01:17:03 +09:00
Cyrus Leung
0e9164b40a
[mypy] Enable type checking for test directory ( #5017 )
2024-06-15 04:45:31 +00:00
Cyrus Leung
7a64d24aad
[Core] Support image processor ( #4197 )
2024-06-02 22:56:41 -07:00
Cyrus Leung
b1c255630d
[Core] Avoid the need to pass None values to Sequence.inputs ( #5099 )
2024-05-29 16:05:01 -07:00
afeldman-nm
4238bc82f2
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) ( #4837 )
2024-05-29 16:09:13 +00:00
Cyrus Leung
5ae5ed1e60
[Core] Consolidate prompt arguments to LLM engines ( #4328 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-05-28 13:29:31 -07:00
SangBin Cho
65bf2ac165
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API ( #4681 )
...
This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attn metadata for prefill and decode into a single class and allows it to be sliced when running the attn backend.
It also refactors subquery_start_loc, which was not refactored in the previous PR.
2024-05-15 14:00:10 +09:00
Kuntai Du
ccb63a8245
[Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies ( #4696 )
2024-05-14 21:34:33 +09:00
Chang Su
e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API ( #3734 )
2024-05-11 11:30:37 -07:00
Cody Yu
f942efb5a3
[Dynamic Spec Decoding] Auto-disable by the running queue size ( #4592 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
2024-05-08 21:44:00 +00:00
youkaichao
20cfcdec99
[Core][Optimization] change python dict to pytorch tensor for blocks to swap ( #4659 )
2024-05-08 12:07:05 -07:00
youkaichao
63575bc2e1
[Core][Optimization] change python dict to pytorch tensor ( #4607 )
2024-05-06 21:30:27 -07:00
Cody Yu
bc8ad68455
[Misc][Refactor] Introduce ExecuteModelData ( #4540 )
2024-05-03 17:47:07 -07:00
Cade Daniel
ab50275111
[Speculative decoding] Support target-model logprobs ( #4378 )
2024-05-03 15:52:01 -07:00
Lily Liu
43c413ec57
[Kernel] Use flashinfer for decoding ( #4353 )
...
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com >
2024-05-03 15:51:27 -07:00
Ronen Schaffer
bf480c5302
Add more Prometheus metrics ( #2764 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2024-04-28 15:59:33 -07:00
SangBin Cho
603ad84815
[Core] Refactoring sampler and support prompt logprob for chunked prefill ( #4309 )
2024-04-26 13:02:02 +00:00
SangBin Cho
050f285ff6
[Core] Scheduling optimization 2 ( #4280 )
2024-04-23 08:02:11 +00:00
Uranus
8f20fc04bf
[Misc] fix docstrings ( #4191 )
...
Co-authored-by: Zhong Wang <wangzhong@infini-ai.com >
2024-04-19 08:18:33 +00:00
Cade Daniel
e95cd87959
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine ( #3894 )
2024-04-16 13:09:21 -07:00
SangBin Cho
09473ee41c
[mypy] Add mypy type annotation part 1 ( #4006 )
2024-04-12 14:35:50 -07:00
Nick Hill
e46a60aa4c
[BugFix] Fix handling of stop strings and stop token ids ( #3672 )
2024-04-11 15:34:12 -07:00
SangBin Cho
67b4221a61
[Core][5/N] Fully working chunked prefill e2e ( #3884 )
2024-04-10 17:56:48 -07:00
SangBin Cho
18de883489
[Chunked Prefill][4/n] Chunked prefill scheduler. ( #3853 )
2024-04-05 10:17:58 -07:00
SangBin Cho
b51c1cc9d2
[2/N] Chunked prefill data update ( #3538 )
2024-03-28 10:06:01 -07:00
Cade Daniel
14ccd94c89
[Core][Bugfix]Refactor block manager for better testability ( #3492 )
2024-03-27 23:59:28 -07:00
Nick Hill
dfeb2ecc3a
[Misc] Include matched stop string/token in responses ( #2976 )
...
Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com >
2024-03-25 17:31:32 -07:00
xwjiang2010
64172a976c
[Feature] Add vision language model support. ( #3042 )
2024-03-25 14:16:30 -07:00
Swapnil Parekh
819924e749
[Core] Adding token ranks along with logprobs ( #3516 )
...
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com >
2024-03-25 10:13:10 -07:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. ( #3495 )
2024-03-25 07:59:47 -07:00
Woosuk Kwon
925f3332ca
[Core] Refactor Attention Take 2 ( #3462 )
2024-03-25 04:39:33 +00:00
Antoni Baum
426ec4ec67
[1/n] Triton sampling kernel ( #3186 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-03-20 14:45:08 -07:00
Zhuohan Li
2f8844ba08
Re-enable the 80 char line width limit ( #3305 )
2024-03-10 19:49:14 -07:00
Cade Daniel
8437bae6ef
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling ( #3103 )
2024-03-08 23:32:46 -08:00
jacobthebanana
8cbba4622c
Possible fix for conflict between Automated Prefix Caching ( #2762 ) and multi-LoRA support ( #1804 ) ( #3263 )
2024-03-07 23:03:22 +00:00
Cade Daniel
a33ce60c66
[Testing] Fix core tests ( #3224 )
2024-03-06 01:04:23 -08:00
Nick Hill
8999ec3c16
Store eos_token_id in Sequence for easy access ( #3166 )
2024-03-05 15:35:43 -08:00
Antoni Baum
22de45235c
Push logprob generation to LLMEngine ( #3065 )
...
Co-authored-by: Avnish Narayan <avnish@anyscale.com >
2024-03-04 19:54:06 +00:00
Zhuohan Li
996d095c54
[FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark ( #3158 )
2024-03-03 14:37:18 -08:00