Commit Graph

61 Commits

Author SHA1 Message Date
Kunshang Ji
96b6f475dd Remove hardcoded device="cuda" to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
Robert Shaw
93b38bea5d Refactor Prometheus and Add Request Level Metrics (#2316) 2024-01-31 14:58:07 -08:00
zhaoyang-star
b72af8f1ed Fix error when tp > 1 (#2644)
Co-authored-by: zhaoyang-star <zhao.yang16@zte.com.cn>
2024-01-28 22:47:39 -08:00
zhaoyang-star
9090bf02e7 Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Hanzhi Zhou
380170038e Implement custom all reduce kernels (#2192) 2024-01-27 12:46:35 -08:00
Antoni Baum
9b945daaf1 [Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Philipp Moritz
ab7e6006d6 Fix https://github.com/vllm-project/vllm/issues/2540 (#2545) 2024-01-22 19:02:38 +01:00
Cade Daniel
18bfcdd05c [Speculative decoding 2/9] Multi-step worker for draft model (#2424) 2024-01-21 16:31:47 -08:00
shiyi.c_98
d10f8e1d43 [Experimental] Prefix Caching Support (#1669)
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-17 16:32:10 -08:00
Jiaxiang
6549aef245 [DOC] Add additional comments for LLMEngine and AsyncLLMEngine (#1011) 2024-01-11 19:26:49 -08:00
Nadav Shmayovits
05921a9a7a Changed scheduler to use deques instead of lists (#2290)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-01-07 09:48:07 -08:00
Iskren Ivov Chernev
d0215a58e7 Ensure metrics are logged regardless of requests (#2347) 2024-01-05 05:24:42 -08:00
Zhuohan Li
fd4ea8ef5c Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) 2024-01-03 11:30:22 -08:00
Zhuohan Li
e0ff920001 [BUGFIX] Do not return ignored sentences twice in async llm engine (#2258) 2023-12-26 13:41:09 +08:00
Woosuk Kwon
3a4fd5ca59 Disable Ray usage stats collection (#2206) 2023-12-20 21:52:08 -08:00
Woosuk Kwon
8041b7305e [BugFix] Raise error when max_model_len is larger than KV cache (#2163) 2023-12-17 17:08:23 -08:00
Woosuk Kwon
c3372e87be Remove dependency on CuPy (#2152) 2023-12-17 01:49:07 -08:00
Woosuk Kwon
37ca558103 Optimize model execution with CUDA graph (#1926)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
Yunfeng Bai
c06170cc8e Add a flag to include stop string in output text (#1976) 2023-12-15 00:45:58 -08:00
Woosuk Kwon
464dd985e3 Fix num_gpus when TP > 1 (#1852) 2023-12-03 12:24:30 -08:00
Simon Mo
5313c2cb8b Add Production Metrics in Prometheus format (#1890) 2023-12-02 16:37:44 -08:00
Woosuk Kwon
27feead2f8 Refactor Worker & InputMetadata (#1843) 2023-11-29 22:16:37 -08:00
FlorianJoncour
0229c386c5 Better integration with Ray Serve (#1821)
Co-authored-by: FlorianJoncour <florian@zetta-sys.com>
2023-11-29 13:25:43 -08:00
Zhuohan Li
708e6c18b0 [FIX] Fix class naming (#1803) 2023-11-28 14:08:01 -08:00
boydfd
4bb6b67188 fix RAM OOM when load large models in tensor parallel mode. (#1395)
Co-authored-by: ran_lin <rlin@thoughtworks.com>
2023-11-20 19:02:42 -08:00
Simon Mo
5ffc0d13a2 Migrate linter from pylint to ruff (#1665) 2023-11-20 11:58:01 -08:00
Simon Mo
cb08cd0d75 [Minor] Fix duplication of ignored seq group in engine step (#1666) 2023-11-16 13:11:41 -08:00
Dan Lord
7013a80170 Add support for spaces_between_special_tokens 2023-10-30 16:52:56 -07:00
Zhuohan Li
9d9072a069 Implement prompt logprobs & Batched topk for computing logprobs (#1328)
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
Antoni Baum
acbed3ef40 Use monotonic time where appropriate (#1249) 2023-10-02 19:22:05 -07:00
Federico Cassano
66d18a7fb0 add support for tokenizer revision (#1163)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-02 19:19:46 -07:00
Woosuk Kwon
f936657eb6 Provide default max model length (#1224) 2023-09-28 14:44:02 -07:00
Chris Bamford
bb1ba58f06 [Mistral] Mistral-7B-v0.1 support (#1196)
Co-authored-by: timlacroix <t@mistral.ai>
2023-09-28 10:41:03 -07:00
Dan Lord
20f7cc4cde Add skip_special_tokens sampling params (#1186) 2023-09-27 19:21:42 -07:00
Wang Ran (汪然)
30e775281d fix typo (#1184)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-27 16:22:45 -07:00
Ricardo Lu
f98b745a81 feat: support stop_token_ids parameter. (#1097) 2023-09-21 15:34:02 -07:00
陈序
e21d7687a9 Fix hanging when prompt exceeds limit (#1029) 2023-09-17 01:48:56 -07:00
Woosuk Kwon
e3e79e9e8a Implement AWQ quantization support for LLaMA (#1032)
Co-authored-by: Robert Irvine <robert@seamlessml.com>
Co-authored-by: root <rirv938@gmail.com>
Co-authored-by: Casper <casperbh.96@gmail.com>
Co-authored-by: julian-q <julianhquevedo@gmail.com>
2023-09-16 00:03:37 -07:00
Jasmond L
ab019eea75 Add Model Revision Support (#1014)
Co-authored-by: Jasmond Loh <Jasmond.Loh@hotmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-13 15:20:02 -07:00
Antoni Baum
9841d48a10 Use TGI-like incremental detokenization (#984) 2023-09-13 13:38:01 -07:00
Jingru
4042d192f5 fix "tansformers_module" ModuleNotFoundError when load model with trust_remote_code=True (#871) 2023-09-08 17:21:30 -07:00
Zhuohan Li
c957c741d9 Enable safetensors loading for all models (#974) 2023-09-07 15:49:52 -07:00
Zhuohan Li
002800f081 Align vLLM's beam search implementation with HF generate (#857) 2023-09-04 17:29:42 -07:00
Antoni Baum
ce741ba3e4 Refactor AsyncLLMEngine (#880) 2023-09-03 21:43:43 -07:00
Woosuk Kwon
55fe8a81ec Refactor scheduler (#658) 2023-08-02 16:42:01 -07:00
Chaofan Lin
aa39e42c5a fix doc (#622) 2023-07-31 13:11:57 -07:00
Fang li
953f28cf9a fix ModuleNotFoundError (#599)
Co-authored-by: fangli <fangli@tencent.com>
2023-07-29 20:52:41 -07:00
Antoni Baum
9925c17940 Ray placement group support (#397) 2023-07-19 22:49:31 -07:00
Lily Liu
b4b195b360 fix max seq len (#489) 2023-07-17 23:20:20 -07:00
Zhuohan Li
2bdea7ac11 [Fix] Fix the condition of max_seq_len (#477) 2023-07-17 00:33:48 -04:00