Kunshang Ji
96b6f475dd
Remove hardcoded device="cuda" to support more devices (#2503)
...
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
Robert Shaw
93b38bea5d
Refactor Prometheus and Add Request Level Metrics (#2316)
2024-01-31 14:58:07 -08:00
zspo
c664b0e683
fix some bugs (#2689)
2024-01-31 10:09:23 -08:00
Wen Sun
d79ced3292
Fix 'Actor methods cannot be called directly' when using --engine-use-ray (#2664)
...
* fix: engine-use-ray complaint
* fix: typo
2024-01-30 17:17:05 +01:00
zhaoyang-star
b72af8f1ed
Fix error when tp > 1 (#2644)
...
Co-authored-by: zhaoyang-star <zhao.yang16@zte.com.cn>
2024-01-28 22:47:39 -08:00
zhaoyang-star
9090bf02e7
Support FP8-E5M2 KV Cache (#2279)
...
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Murali Andoorveedu
89be30fa7d
Small async_llm_engine refactor (#2618)
2024-01-27 23:28:37 -08:00
Woosuk Kwon
5f036d2bcc
[Minor] Fix warning on Ray dependencies (#2630)
2024-01-27 15:43:40 -08:00
Hanzhi Zhou
380170038e
Implement custom all reduce kernels (#2192)
2024-01-27 12:46:35 -08:00
Antoni Baum
9b945daaf1
[Experimental] Add multi-LoRA support (#1804)
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Philipp Moritz
ab7e6006d6
Fix https://github.com/vllm-project/vllm/issues/2540 (#2545)
2024-01-22 19:02:38 +01:00
Cade Daniel
18bfcdd05c
[Speculative decoding 2/9] Multi-step worker for draft model (#2424)
2024-01-21 16:31:47 -08:00
shiyi.c_98
d10f8e1d43
[Experimental] Prefix Caching Support (#1669)
...
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-17 16:32:10 -08:00
Jiaxiang
6549aef245
[DOC] Add additional comments for LLMEngine and AsyncLLMEngine (#1011)
2024-01-11 19:26:49 -08:00
Nadav Shmayovits
05921a9a7a
Changed scheduler to use deques instead of lists (#2290)
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-01-07 09:48:07 -08:00
Iskren Ivov Chernev
d0215a58e7
Ensure metrics are logged regardless of requests (#2347)
2024-01-05 05:24:42 -08:00
ljss
aee8ef661a
Minor fix of type hint (#2340)
2024-01-03 21:27:56 -08:00
Zhuohan Li
fd4ea8ef5c
Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221)
2024-01-03 11:30:22 -08:00
Zhuohan Li
e0ff920001
[BUGFIX] Do not return ignored sentences twice in async llm engine (#2258)
2023-12-26 13:41:09 +08:00
Woosuk Kwon
3a4fd5ca59
Disable Ray usage stats collection (#2206)
2023-12-20 21:52:08 -08:00
Suhong Moon
290e015c6c
Update Help Text for --gpu-memory-utilization Argument (#2183)
2023-12-18 11:33:24 -08:00
JohnSaxon
bbe4466fd9
[Minor] Fix typo (#2166)
...
Co-authored-by: John-Saxon <zhang.xiangxuan@oushu.com>
2023-12-17 23:28:49 -08:00
Woosuk Kwon
8041b7305e
[BugFix] Raise error when max_model_len is larger than KV cache (#2163)
2023-12-17 17:08:23 -08:00
Woosuk Kwon
30fb0956df
[Minor] Add more detailed explanation on quantization argument (#2145)
2023-12-17 01:56:16 -08:00
Woosuk Kwon
c3372e87be
Remove dependency on CuPy (#2152)
2023-12-17 01:49:07 -08:00
Woosuk Kwon
37ca558103
Optimize model execution with CUDA graph (#1926)
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
CHU Tianxiang
0fbfc4b81b
Add GPTQ support (#916)
2023-12-15 03:04:22 -08:00
Yunfeng Bai
c06170cc8e
Add a flag to include stop string in output text (#1976)
2023-12-15 00:45:58 -08:00
mezuzza
6774bd50b0
Fix typing in AsyncLLMEngine & add toml to requirements-dev (#2100)
2023-12-14 00:19:41 -08:00
TJian
6ccc0bfffb
Merge EmbeddedLLM/vllm-rocm into vLLM main (#1836)
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
2023-12-07 23:16:52 -08:00
Woosuk Kwon
464dd985e3
Fix num_gpus when TP > 1 (#1852)
2023-12-03 12:24:30 -08:00
Simon Mo
5313c2cb8b
Add Production Metrics in Prometheus format (#1890)
2023-12-02 16:37:44 -08:00
Woosuk Kwon
27feead2f8
Refactor Worker & InputMetadata (#1843)
2023-11-29 22:16:37 -08:00
FlorianJoncour
0229c386c5
Better integration with Ray Serve (#1821)
...
Co-authored-by: FlorianJoncour <florian@zetta-sys.com>
2023-11-29 13:25:43 -08:00
Zhuohan Li
708e6c18b0
[FIX] Fix class naming (#1803)
2023-11-28 14:08:01 -08:00
Casper
a921d8be9d
[DOCS] Add engine args documentation (#1741)
2023-11-22 12:31:27 -08:00
boydfd
4bb6b67188
Fix RAM OOM when loading large models in tensor parallel mode (#1395)
...
Co-authored-by: ran_lin <rlin@thoughtworks.com>
2023-11-20 19:02:42 -08:00
Simon Mo
5ffc0d13a2
Migrate linter from pylint to ruff (#1665)
2023-11-20 11:58:01 -08:00
Simon Mo
cb08cd0d75
[Minor] Fix duplication of ignored seq group in engine step (#1666)
2023-11-16 13:11:41 -08:00
Zhuohan Li
7076fa1c9f
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622)
...
Refactor the tensor parallelism, quantization, and weight-loading codes.
Summary of the new features enabled by this PR:
- **All models** are able to be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- Model loading code became much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
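The last bullet covers MQA/GQA models whose key/value head count is smaller than the tensor-parallel size. A minimal sketch of that sharding rule (a hypothetical helper for illustration, not vLLM's actual implementation; the names `total_kv_heads` and `tp_size` are assumed):

```python
def kv_heads_per_rank(total_kv_heads: int, tp_size: int) -> int:
    """How many KV heads one tensor-parallel rank holds.

    With at least as many KV heads as ranks, heads are sharded evenly.
    With fewer KV heads than ranks, each head is replicated across
    tp_size // total_kv_heads ranks so every rank still owns one head.
    """
    if total_kv_heads >= tp_size:
        assert total_kv_heads % tp_size == 0, "heads must divide evenly across ranks"
        return total_kv_heads // tp_size
    assert tp_size % total_kv_heads == 0, "ranks must divide evenly across heads"
    return 1  # each rank holds one (replicated) KV head
```

For example, a GQA model with 8 KV heads at tp_size=4 places 2 heads per rank, while an MQA model with 1 KV head at tp_size=4 replicates that single head on every rank.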
2023-11-15 22:50:41 -08:00
Dominik Schwabe
1b290ace4f
Run default _AsyncLLMEngine._run_workers_async in threadpool (#1628)
2023-11-11 14:50:44 -08:00
ljss
5687d584fe
[BugFix] Set engine_use_ray=True when TP>1 (#1531)
2023-11-01 02:14:18 -07:00
Dan Lord
7013a80170
Add support for spaces_between_special_tokens
2023-10-30 16:52:56 -07:00
chooper1
1f24755bf8
Support SqueezeLLM (#1326)
...
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Woosuk Kwon
c1376e0f82
Change scheduler & input tensor shape (#1381)
2023-10-16 17:48:42 -07:00
Zhuohan Li
9d9072a069
Implement prompt logprobs & Batched topk for computing logprobs (#1328)
...
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
Antoni Baum
acbed3ef40
Use monotonic time where appropriate (#1249)
2023-10-02 19:22:05 -07:00
Federico Cassano
66d18a7fb0
add support for tokenizer revision (#1163)
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-02 19:19:46 -07:00
Woosuk Kwon
f936657eb6
Provide default max model length (#1224)
2023-09-28 14:44:02 -07:00
Chris Bamford
bb1ba58f06
[Mistral] Mistral-7B-v0.1 support (#1196)
...
Co-authored-by: timlacroix <t@mistral.ai>
2023-09-28 10:41:03 -07:00