75 Commits

Author SHA1 Message Date
Nick Hill
d830656a97 [BugFix] Avoid unnecessary Ray import warnings (#6079) 2024-07-03 14:09:40 +08:00
Mor Zusman
9d6a8daa87 [Model] Jamba support (#4115)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Erez Schwartz <erezs@ai21.com>
Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: Tomer Asida <tomera@ai21.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 23:11:29 +00:00
Murali Andoorveedu
c5832d2ae9 [Core] Pipeline Parallel Support (#4412)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 10:58:08 -07:00
Ilya Lavrenov
57f09a419c [Hardware][Intel] OpenVINO vLLM backend (#5379) 2024-06-28 13:50:16 +00:00
Cyrus Leung
5cbe8d155c [Core] Registry for processing model inputs (#5214)
Co-authored-by: ywang96 <ywang@roblox.com>
2024-06-28 12:09:56 +00:00
Antoni Baum
67882dbb44 [Core] Add fault tolerance for RayTokenizerGroupPool (#5748) 2024-06-25 10:15:10 -07:00
zifeitong
78687504f7 [Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654) 2024-06-19 13:57:12 -07:00
Ronen Schaffer
7879f24dcc [Misc] Add OpenTelemetry support (#4687)
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.

I've also added a markdown with print-screens to guide users how to use this feature. You can find it here
2024-06-19 01:17:03 +09:00
Kunshang Ji
728c4c8a06 [Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-06-17 11:01:25 -07:00
Cyrus Leung
77490c6f2f [Core] Remove duplicate processing in async engine (#5525) 2024-06-14 10:04:42 -07:00
Woosuk Kwon
1a8bfd92d5 [Hardware] Initial TPU integration (#5292) 2024-06-12 11:53:03 -07:00
Alex Wu
0f83ddd4d7 [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. (#5290) 2024-06-05 15:18:12 -07:00
Cyrus Leung
a9bcc7afb2 [Doc] Use intersphinx and update entrypoints docs (#5125) 2024-05-30 09:59:23 -07:00
Cyrus Leung
5ae5ed1e60 [Core] Consolidate prompt arguments to LLM engines (#4328)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-05-28 13:29:31 -07:00
Nick Hill
eb6d3c264d [Core] Eliminate parallel worker per-step task scheduling overhead (#4894) 2024-05-23 06:17:27 +09:00
Nick Hill
676a99982f [Core] Add MultiprocessingGPUExecutor (#4539)
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
2024-05-14 10:38:59 -07:00
Chang Su
e254497b66 [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) 2024-05-11 11:30:37 -07:00
Cyrus Leung
323f27b904 [Bugfix] Fix asyncio.Task not being subscriptable (#4623) 2024-05-06 09:31:05 -07:00
Cody Yu
bc8ad68455 [Misc][Refactor] Introduce ExecuteModelData (#4540) 2024-05-03 17:47:07 -07:00
youkaichao
5b8a7c1cb0 [Misc] centralize all usage of environment variables (#4548) 2024-05-02 11:13:25 -07:00
Roy
3a922c1e7e [Bugfix][Core] Fix and refactor logging stats (#4336) 2024-05-01 20:08:14 +00:00
leiwen83
4bb53e2dde [BugFix] fix num_lookahead_slots missing in async executor (#4165)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-04-30 10:12:59 -07:00
Roy
7134303cbb [Bugfix][Core] Fix get decoding config from ray (#4335) 2024-04-27 11:30:08 +00:00
SangBin Cho
603ad84815 [Core] Refactoring sampler and support prompt logprob for chunked prefill (#4309) 2024-04-26 13:02:02 +00:00
SangBin Cho
a88081bf76 [CI] Disable non-lazy string operation on logging (#4326)
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
2024-04-26 00:16:58 -07:00
Nick Hill
479d69fad0 [Core] Move ray_utils.py from engine to executor package (#4347) 2024-04-25 06:52:22 +00:00
Tao He
077f0a2e8a [Frontend] Enable support for CPU backend in AsyncLLMEngine. (#3993)
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-04-22 09:19:51 +00:00
Ronen Schaffer
7be4f5628f [Bugfix][Core] Restore logging of stats in the async engine (#4150) 2024-04-19 08:08:26 -07:00
Liangfu Chen
cd2f63fb36 [CI/CD] add neuron docker and ci test scripts (#3571) 2024-04-18 15:26:01 -07:00
SangBin Cho
533d2a1f39 [Typing] Mypy typing part 2 (#4043)
Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local>
2024-04-17 17:28:43 -07:00
Cade Daniel
e95cd87959 [Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) 2024-04-16 13:09:21 -07:00
SangBin Cho
4e7ee664e2 [Core] Fix engine-use-ray broken (#4105) 2024-04-16 05:24:53 +00:00
Cade Daniel
5757d90e26 [Speculative decoding] Adding configuration object for speculative decoding (#3706)
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
2024-04-03 00:40:57 +00:00
yhu422
d8658c8cc1 Usage Stats Collection (#2852) 2024-03-28 22:16:12 -07:00
xwjiang2010
64172a976c [Feature] Add vision language model support. (#3042) 2024-03-25 14:16:30 -07:00
SangBin Cho
01bfb22b41 [CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
Zhuohan Li
e90fc21f2e [Hardware][Neuron] Refactor neuron support (#3471) 2024-03-22 01:22:17 +00:00
Tao He
14b8ae02e7 Fixes the misuse/mixuse of time.time()/time.monotonic() (#3220)
Signed-off-by: Tao He <sighingnow@gmail.com>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-03-15 18:25:43 +00:00
Zhuohan Li
4c922709b6 Add distributed model executor abstraction (#3191) 2024-03-11 11:03:45 -07:00
Roy
9e8744a545 [BugFix] Fix get tokenizer when using ray (#3301) 2024-03-10 19:17:16 -07:00
Antoni Baum
ff578cae54 Add health check, make async Engine more robust (#3015)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-04 22:01:40 +00:00
Antoni Baum
22de45235c Push logprob generation to LLMEngine (#3065)
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
Sage Moore
ce4f5a29fb Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
felixzhu555
703e42ee4b Add guided decoding for OpenAI API server (#2819)
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-02-29 22:13:08 +00:00
zspo
c664b0e683 fix some bugs (#2689) 2024-01-31 10:09:23 -08:00
Wen Sun
d79ced3292 Fix 'Actor methods cannot be called directly' when using --engine-use-ray (#2664)
* fix: engine-useray complain

* fix: typo
2024-01-30 17:17:05 +01:00
Murali Andoorveedu
89be30fa7d Small async_llm_engine refactor (#2618) 2024-01-27 23:28:37 -08:00
Antoni Baum
9b945daaf1 [Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
shiyi.c_98
d10f8e1d43 [Experimental] Prefix Caching Support (#1669)
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-17 16:32:10 -08:00
Jiaxiang
6549aef245 [DOC] Add additional comments for LLMEngine and AsyncLLMEngine (#1011) 2024-01-11 19:26:49 -08:00