biondizzle/vllm - vllm - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Mahesh Keralapura	933790c209	[Core] Add span metrics for model_forward, scheduler and sampler time (#7089 )	2024-08-09 13:55:13 -07:00
William Lin	57b7be0e1c	[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace (#6971 )	2024-08-09 05:42:45 +00:00
Bongwon Jang	e9630458c7	[SpecDecode] Support FlashInfer in DraftModelRunner (#6926 )	2024-08-05 08:05:05 -07:00
Cade Daniel	82a1b1a82b	[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification (#6963 )	2024-08-05 08:46:44 +00:00
Cyrus Leung	f230cc2ca6	[Bugfix] Fix broadcasting logic for `multi_modal_kwargs` (#6836 )	2024-07-31 10:38:45 +08:00
Nick Hill	5cf9254a9c	[BugFix] Fix use of per-request seed with pipeline parallel (#6698 )	2024-07-30 10:40:08 -07:00
Allen.Dou	40468b13fa	[Bugfix] Miscalculated latency lead to time_to_first_token_seconds inaccurate. (#6686 )	2024-07-24 08:58:42 -07:00
sroy745	14f91fe67c	[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485 )	2024-07-20 23:58:58 -07:00
Matt Wong	06d6c5fe9f	[Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543 )	2024-07-20 09:39:07 -07:00
Thomas Parnell	f0bbfaf917	[Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection (#6578 )	2024-07-19 14:01:03 -07:00
Woo-Yeon Lee	a921e86392	[BUGFIX] Raise an error for no draft token case when draft_tp>1 (#6369 )	2024-07-19 06:01:09 -07:00
Thomas Parnell	d4201e06d5	[Bugfix] Make spec. decode respect per-request seed. (#6034 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-07-18 19:22:08 -07:00
Cody Yu	8a74c68bd1	[Misc] Minor patch for draft model runner (#6523 )	2024-07-18 06:06:21 +00:00
Alexander Matveev	e76466dde2	[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338 )	2024-07-17 14:30:28 -07:00
shangmingc	a19e8d3726	[Misc][Speculative decoding] Typos and typing fixes (#6467 ) Co-authored-by: caishangming.csm <caishangming.csm@alibaba-inc.com>	2024-07-17 07:17:07 +00:00
sroy745	ae151d73be	[Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models (#5765 )	2024-07-10 16:02:47 -07:00
Abhinav Goyal	2416b26e11	[Speculative Decoding] Medusa Implementation with Top-1 proposer (#4978 )	2024-07-09 18:34:02 -07:00
Swapnil Parekh	4d6ada947c	[CORE] Adding support for insertion of soft-tuned prompts (#4645 ) Co-authored-by: Swapnil Parekh <swapnilp@ibm.com> Co-authored-by: Joe G <joseph.granados@h2o.ai> Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-07-09 13:26:36 -07:00
xwjiang2010	d9e98f42e4	[vlm] Remove vision language config. (#6089 ) Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-07-03 22:14:16 +00:00
Mor Zusman	9d6a8daa87	[Model] Jamba support (#4115 ) Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai> Co-authored-by: Erez Schwartz <erezs@ai21.com> Co-authored-by: Mor Zusman <morz@ai21.com> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com> Co-authored-by: Tomer Asida <tomera@ai21.com> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-07-02 23:11:29 +00:00
Murali Andoorveedu	c5832d2ae9	[Core] Pipeline Parallel Support (#4412 ) Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-07-02 10:58:08 -07:00
Sirej Dua	15aba081f3	[Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) (#6050 ) Co-authored-by: Sirej Dua <sirej.dua@databricks.com> Co-authored-by: Sirej Dua <Sirej Dua>	2024-07-02 07:20:29 -07:00
sroy745	80ca1e6a3a	[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (#5348 )	2024-07-01 00:33:05 -07:00
Cody Yu	b2c620230a	[Spec Decode] Introduce DraftModelRunner (#5799 )	2024-06-28 09:17:51 -07:00
Stephanie Wang	dda4811591	[Core] Refactor Worker and ModelRunner to consolidate control plane communication (#5408 ) Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Signed-off-by: Stephanie <swang@anyscale.com> Co-authored-by: Stephanie <swang@anyscale.com>	2024-06-25 20:30:03 -07:00
Woo-Yeon Lee	2ce5d6688b	[Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414 )	2024-06-25 09:56:06 +00:00
Joshua Rosenkranz	b12518d3cf	[Model] MLPSpeculator speculative decoding support (#4947 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com> Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>	2024-06-20 20:23:12 -04:00
Cyrus Leung	0e9164b40a	[mypy] Enable type checking for test directory (#5017 )	2024-06-15 04:45:31 +00:00
Nick Hill	a008629807	[Misc] Various simplifications and typing fixes (#5368 )	2024-06-11 10:29:02 +08:00
Nick Hill	faf71bcd4b	[Speculative Decoding] Add `ProposerWorkerBase` abstract class (#5252 )	2024-06-05 14:53:05 -07:00
Lily Liu	d5a1697772	[Dynamic Spec Decoding] Minor fix for disabling speculative decoding (#5000 )	2024-05-25 10:00:14 -07:00
Nick Hill	eb6d3c264d	[Core] Eliminate parallel worker per-step task scheduling overhead (#4894 )	2024-05-23 06:17:27 +09:00
Cody Yu	973617ae02	[Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840 ) Co-authored-by: Cade Daniel <edacih@gmail.com> Co-authored-by: Cade Daniel <cade@anyscale.com>	2024-05-16 00:53:51 -07:00
SangBin Cho	65bf2ac165	[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681 ) This PR combines prepare_prompt and prepare_decode into a single API. This PR also coalesces the attn metadata for prefill/decode into a single class and allows slicing it when running the attn backend. It also refactors subquery_start_loc, which was not refactored in the previous PR.	2024-05-15 14:00:10 +09:00
Cody Yu	ce532ff45c	[Speculative decoding] Improve n-gram efficiency (#4724 )	2024-05-13 15:00:13 -07:00
Chang Su	e254497b66	[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734 )	2024-05-11 11:30:37 -07:00
SangBin Cho	6a0f617210	[Core] Fix circular reference which leaked llm instance in local dev env (#4737 ) Storing an exception frame is extremely prone to circular references because it holds references to objects. When tensorizer is not installed, it leaks the llm instance because the error frame has references to various modules, which causes a circular reference problem. I also found spec decoding has a circular reference issue, and I solved it using weakref.proxy.	2024-05-10 23:54:32 +09:00
Cade Daniel	8b9241be3a	[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs (#4672 )	2024-05-08 23:24:46 +00:00
Cody Yu	f942efb5a3	[Dynamic Spec Decoding] Auto-disable by the running queue size (#4592 ) Co-authored-by: Cade Daniel <edacih@gmail.com>	2024-05-08 21:44:00 +00:00
leiwen83	8344f7742b	[Bug fix][Core] fixup ngram not setup correctly (#4551 ) Co-authored-by: Lei Wen <wenlei03@qiyi.com> Co-authored-by: Cade Daniel <edacih@gmail.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-05-07 11:40:18 -07:00
Cody Yu	bc8ad68455	[Misc][Refactor] Introduce ExecuteModelData (#4540 )	2024-05-03 17:47:07 -07:00
Cade Daniel	ab50275111	[Speculative decoding] Support target-model logprobs (#4378 )	2024-05-03 15:52:01 -07:00
leiwen83	b38e42fbca	[Speculative decoding] Add ngram prompt lookup decoding (#4237 ) Co-authored-by: Lei Wen <wenlei03@qiyi.com>	2024-05-01 11:13:03 -07:00
SangBin Cho	a88081bf76	[CI] Disable non-lazy string operation on logging (#4326 ) Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>	2024-04-26 00:16:58 -07:00
Cade Daniel	62b8aebc6f	[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951 )	2024-04-23 08:02:36 +00:00
SangBin Cho	533d2a1f39	[Typing] Mypy typing part 2 (#4043 ) Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local>	2024-04-17 17:28:43 -07:00
Cade Daniel	e95cd87959	[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894 )	2024-04-16 13:09:21 -07:00
Cade Daniel	e7c7067b45	[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837 )	2024-04-09 11:44:15 -07:00
Michael Goin	b321d4881b	[Bugfix] Add `__init__.py` files for `vllm/core/block/` and `vllm/spec_decode/` (#3798 )	2024-04-02 12:35:31 -07:00
SangBin Cho	01bfb22b41	[CI] Try introducing isort. (#3495 )	2024-03-25 07:59:47 -07:00

53 Commits