Author | Commit | Message | Date
Woosuk Kwon | 928de46888 | Implement PagedAttention V2 (#1348) | 2023-10-16 00:59:57 -07:00
Lu Wang | de89472897 | Fix the issue for AquilaChat2-* models (#1339) | 2023-10-13 11:51:29 -07:00
Woosuk Kwon | e7c8555d06 | Bump up transformers version & Remove MistralConfig (#1254) | 2023-10-13 10:05:26 -07:00
Antoni Baum | ec3b5ce9cc | Improve detokenization performance (#1338) | 2023-10-13 09:59:07 -07:00
Woosuk Kwon | 875afe38ab | Add blacklist in model checkpoint (#1325) | 2023-10-12 01:05:37 -07:00
amaleshvemula | ee8217e5be | Add Mistral to quantization model list (#1278) | 2023-10-11 00:26:24 -07:00
twaka | 8285736840 | workaround of AWQ for Turing GPUs (#1252) | 2023-10-10 19:48:16 -07:00
yhlskt23 | 91fce82c6f | change the timing of sorting logits (#1309) | 2023-10-10 19:37:42 -07:00
Wang Ran (汪然) | ac5cf86aa6 | Fix __repr__ of SequenceOutputs (#1311) | 2023-10-10 09:58:28 -07:00
Zhuohan Li | b95ee898fe | [Minor] Fix comment in mistral.py (#1303) | 2023-10-09 19:44:37 -07:00
Zhuohan Li | 6b5296aa3a | [FIX] Explain why the finished_reason of ignored sequences are length (#1289) | 2023-10-08 15:22:38 -07:00
Antoni Baum | ee92b58b3a | Move bfloat16 check to worker (#1259) | 2023-10-07 22:10:44 -07:00
Yunfeng Bai | 09ff7f106a | API server support ipv4 / ipv6 dualstack (#1288) (Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>) | 2023-10-07 15:15:54 -07:00
Antoni Baum | acbed3ef40 | Use monotonic time where appropriate (#1249) | 2023-10-02 19:22:05 -07:00
Federico Cassano | 66d18a7fb0 | add support for tokenizer revision (#1163) (Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>) | 2023-10-02 19:19:46 -07:00
Zhuohan Li | ba0bfd40e2 | TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) | 2023-10-02 15:36:09 -07:00
Woosuk Kwon | 84e4e37d14 | [Minor] Fix type annotations (#1238) | 2023-10-02 15:28:31 -07:00
Zhuohan Li | a60b353005 | support sharding llama2-70b on more than 8 GPUs (#1209) (Co-authored-by: JiCheng <247153481@qq.com>) | 2023-10-02 15:26:33 -07:00
Woosuk Kwon | e2fb71ec9f | Bump up the version to v0.2.0 (#1212) | 2023-09-28 15:30:38 -07:00
Woosuk Kwon | f936657eb6 | Provide default max model length (#1224) | 2023-09-28 14:44:02 -07:00
Woosuk Kwon | 2e8e49fce3 | [Fix] Remove false assertion (#1222) | 2023-09-28 10:52:38 -07:00
Woosuk Kwon | a8e98aee0c | Fix Mistral model (#1220) | 2023-09-28 10:44:05 -07:00
Chris Bamford | bb1ba58f06 | [Mistral] Mistral-7B-v0.1 support (#1196) (Co-authored-by: timlacroix <t@mistral.ai>) | 2023-09-28 10:41:03 -07:00
Qing | 7bedab5748 | Add rope_scaling to Qwen (#1210) | 2023-09-28 00:49:23 -07:00
Dan Lord | 20f7cc4cde | Add skip_special_tokens sampling params (#1186) | 2023-09-27 19:21:42 -07:00
Woosuk Kwon | a19bc5c628 | Automatically configure max_num_batched_tokens (#1198) | 2023-09-27 16:34:00 -07:00
Qing | 28e616c4e3 | fix qwen-14b model (#1173) | 2023-09-27 16:33:16 -07:00
Wang Ran (汪然) | 30e775281d | fix typo (#1184) (Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>) | 2023-09-27 16:22:45 -07:00
Lily Liu | 21877b0d75 | Support Longchat and RoPE scaling (#555) (Co-authored-by: Wing Lian <wing.lian@gmail.com>, Woosuk Kwon <woosuk.kwon@berkeley.edu>) | 2023-09-27 03:36:02 -07:00
Antoni Baum | cf5cb1e33e | Allocate more shared memory to attention kernel (#1154) | 2023-09-26 22:27:13 -07:00
Woosuk Kwon | 03ffd0a022 | Add comments on RoPE initialization (#1176) | 2023-09-26 10:48:33 -07:00
Wen Sun | bbbf86565f | Align max_tokens behavior with openai (#852) | 2023-09-23 18:10:13 -07:00
Woosuk Kwon | 9f6be8692e | Fix config for Falcon (#1164) | 2023-09-23 17:38:43 -07:00
Zhuohan Li | f187877945 | [FIX] Simplify sampler logic (#1156) | 2023-09-23 17:21:56 -07:00
Zhuohan Li | 947b794146 | [Sampler] Vectorized sampling (simplified) (#1048) (Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>) | 2023-09-22 17:48:04 -07:00
Ricardo Lu | f98b745a81 | feat: support stop_token_ids parameter. (#1097) | 2023-09-21 15:34:02 -07:00
Roy | 2d1e86f1b1 | clean api code, remove redundant background task. (#1102) | 2023-09-21 13:25:05 -07:00
Woosuk Kwon | 1ac4ccf73c | Add float16 and float32 (#1115) | 2023-09-21 00:52:47 -07:00
Woosuk Kwon | 2ac4d5e2bf | Replace DtypeTensor (#1123) | 2023-09-21 00:51:47 -07:00
Antoni Baum | 3302f0aef3 | rope_theta and max_position_embeddings from config (#1096) (Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>, wnma3mz <wnma3mz@gmail.com>) | 2023-09-20 13:35:11 -07:00
Woosuk Kwon | bc0644574c | Add gpu_memory_utilization and swap_space to LLM (#1090) | 2023-09-19 22:16:04 -07:00
Woosuk Kwon | 400b8289f7 | Add pyarrow to dependencies & Print warning on Ray import error (#1094) | 2023-09-18 22:36:17 -07:00
Woosuk Kwon | 2b1c116b5a | Add minimum capability requirement for AWQ (#1064) | 2023-09-18 12:02:01 -07:00
Woosuk Kwon | cc796b1358 | Convert before transpose (#1073) | 2023-09-18 11:51:48 -07:00
Zhuohan Li | f029ef94d7 | Fix get_max_num_running_seqs for waiting and swapped seq groups (#1068) | 2023-09-18 11:49:40 -07:00
Roy | 95592fa00a | align llm_engine and async_engine. (#1081) | 2023-09-18 11:49:10 -07:00
orellavie1212 | fbe66e1d0b | added support for quantize on LLM module (#1080) | 2023-09-18 11:04:21 -07:00
Zhuohan Li | 90979c38f8 | [FIX] Don't initialize parameter by default (#1067) | 2023-09-17 17:15:38 -07:00
陈序 | e21d7687a9 | Fix hanging when prompt exceeds limit (#1029) | 2023-09-17 01:48:56 -07:00
Antoni Baum | ff36139ffc | Remove AsyncLLMEngine busy loop, shield background task (#1059) | 2023-09-17 00:29:08 -07:00