Commit Graph - vllm - Gitea: Git with a cup of tea

biondizzle/vllm

Commit Graph

93b38bea5d Refactor Prometheus and Add Request Level Metrics (#2316) Robert Shaw 2024-01-31 14:58:07 -08:00
d0d93b92b1 Add unit test for Mixtral MoE layer (#2677) Philipp Moritz 2024-01-31 14:34:17 -08:00
89efcf1ce5 [Minor] Fix test_cache.py CI test failure (#2684) Philipp Moritz 2024-01-31 10:12:11 -08:00
c664b0e683 fix some bugs (#2689) zspo 2024-02-01 02:09:23 +08:00
d69ff0cbbb Fixes assertion failure in prefix caching: the lora index mapping should respect prefix_len (#2688) Tao He 2024-02-01 01:00:13 +08:00
1af090b57d Bump up version to v0.3.0 (#2656) [tag: v0.3.0] Zhuohan Li 2024-01-31 00:07:07 -08:00
3dad944485 Add quantized mixtral support (#2673) Woosuk Kwon 2024-01-30 16:34:10 -08:00
105a40f53a [Minor] Fix false warning when TP=1 (#2674) Woosuk Kwon 2024-01-30 14:39:40 -08:00
bbe9bd9684 [Minor] Fix a small typo (#2672) Philipp Moritz 2024-01-30 13:40:37 -08:00
4f65af0e25 Add swap_blocks unit tests (#2616) Vladimir 2024-01-30 18:30:50 +01:00
d79ced3292 Fix 'Actor methods cannot be called directly' when using --engine-use-ray (#2664) Wen Sun 2024-01-31 00:17:05 +08:00
ab40644669 Fused MOE for Mixtral (#2542) Philipp Moritz 2024-01-29 22:43:37 -08:00
5d60def02c DeepseekMoE support with Fused MoE kernel (#2453) wangding zeng 2024-01-30 13:19:48 +08:00
ea8489fce2 ROCm: Allow setting compilation target (#2581) Rasmus Larsen 2024-01-29 19:52:31 +01:00
1b20639a43 No repeated IPC open (#2642) Hanzhi Zhou 2024-01-30 02:46:29 +08:00
b72af8f1ed Fix error when tp > 1 (#2644) zhaoyang-star 2024-01-29 14:47:39 +08:00
9090bf02e7 Support FP8-E5M2 KV Cache (#2279) zhaoyang-star 2024-01-29 08:43:54 +08:00
7d648418b8 Update Ray version requirements (#2636) Simon Mo 2024-01-28 14:27:22 -08:00
89be30fa7d Small async_llm_engine refactor (#2618) Murali Andoorveedu 2024-01-27 23:28:37 -08:00
f8ecb84c02 Speed up Punica compilation (#2632) Woosuk Kwon 2024-01-27 17:46:56 -08:00
5f036d2bcc [Minor] Fix warning on Ray dependencies (#2630) Woosuk Kwon 2024-01-27 15:43:40 -08:00
380170038e Implement custom all reduce kernels (#2192) Hanzhi Zhou 2024-01-28 04:46:35 +08:00
220a47627b Use head_dim in config if exists (#2622) Xiang Xu 2024-01-27 10:30:49 -08:00
beb89f68b4 AWQ: Up to 2.66x higher throughput (#2566) Casper 2024-01-27 08:53:17 +01:00
390b495ff3 Don't build punica kernels by default (#2605) Philipp Moritz 2024-01-26 15:19:19 -08:00
3a0e1fc070 Support for Stable LM 2 (#2598) dakotamahan-stability 2024-01-26 14:45:19 -06:00
6b7de1a030 [ROCm] add support to ROCm 6.0 and MI300 (#2274) Hongxia Yang 2024-01-26 15:41:10 -05:00
5265631d15 use a correct device when creating OptionalCUDAGuard (#2583) Vladimir 2024-01-26 08:48:17 +01:00
2832e7b9f9 fix names and license for Qwen2 (#2589) Junyang Lin 2024-01-25 14:37:51 +08:00
3a7dd7e367 Support Batch Completion in Server (#2529) Simon Mo 2024-01-24 17:11:07 -08:00
223c19224b Fix the syntax error in the doc of supported_models (#2584) LastWhisper 2024-01-25 03:22:51 +08:00
f1f6cc10c7 Added include_stop_str_in_output and length_penalty parameters to OpenAI API (#2562) Federico Galatolo 2024-01-24 19:21:56 +01:00
3209b49033 [Bugfix] fix crash if max_tokens=None (#2570) Nikola Borisov 2024-01-23 22:38:55 -08:00
1e4277d2d1 lint: format all python file instead of just source code (#2567) Simon Mo 2024-01-23 15:53:06 -08:00
9b945daaf1 [Experimental] Add multi-LoRA support (#1804) Antoni Baum 2024-01-24 00:26:37 +01:00
9c1352eb57 [Feature] Simple API token authentication and pluggable middlewares (#1106) Erfan Al-Hossami 2024-01-23 18:13:00 -05:00
7a0b011dd5 Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py (#2553) Jason Zhu 2024-01-22 14:47:25 -08:00
63e835cbcc Fix progress bar and allow HTTPS in benchmark_serving.py (#2552) Harry Mellor 2024-01-22 22:40:31 +00:00
94b5edeb53 Add qwen2 (#2495) Junyang Lin 2024-01-23 06:34:21 +08:00
ab7e6006d6 Fix https://github.com/vllm-project/vllm/issues/2540 (#2545) Philipp Moritz 2024-01-22 10:02:38 -08:00
18bfcdd05c [Speculative decoding 2/9] Multi-step worker for draft model (#2424) Cade Daniel 2024-01-21 16:31:47 -08:00
71d63ed72e migrate pydantic from v1 to v2 (#2531) Jannis Schönleber 2024-01-22 01:05:56 +01:00
d75c40734a [Fix] Keep scheduler.running as deque (#2523) Nick Hill 2024-01-20 22:36:09 -08:00
5b23c3f26f Add group as an argument in broadcast ops (#2522) Junda Chen 2024-01-20 16:00:26 -08:00
00efdc84ba Add benchmark serving to CI (#2505) Simon Mo 2024-01-19 20:20:19 -08:00
91a61da9b1 [Bugfix] fix load local safetensors model (#2512) Roy 2024-01-20 08:26:16 +08:00
ef9b636e2d Simplify broadcast logic for control messages (#2501) Zhuohan Li 2024-01-19 11:23:30 -08:00
2709c0009a Support OpenAI API server in benchmark_serving.py (#2172) Harry Mellor 2024-01-19 04:34:08 +00:00
dd7e8f5f64 refactor complemention api for readability (#2499) Simon Mo 2024-01-18 16:45:14 -08:00
d2a68364c4 [BugFix] Fix abort_seq_group (#2463) ljss 2024-01-19 07:10:42 +08:00
7e1081139d Don't download both safetensor and bin files. (#2480) Nikola Borisov 2024-01-18 11:05:53 -08:00
18473cf498 [Neuron] Add an option to build with neuron (#2065) Liangfu Chen 2024-01-18 10:58:50 -08:00
4df417d059 fix: fix some args desc (#2487) zspo 2024-01-19 01:41:44 +08:00
5d80a9178b Minor fix in prefill cache example (#2494) Jason Zhu 2024-01-18 09:40:34 -08:00
8a25d3a71a fix stablelm.py tensor-parallel-size bug (#2482) YingchaoX 2024-01-19 01:39:46 +08:00
d10f8e1d43 [Experimental] Prefix Caching Support (#1669) shiyi.c_98 2024-01-17 16:32:10 -08:00
14cc317ba4 OpenAI Server refactoring (#2360) FlorianJoncour 2024-01-17 05:33:14 +00:00
e1957c6ebd Add StableLM3B model (#2372) Hyunsung Lee 2024-01-17 13:32:40 +09:00
8cd5a992bf ci: retry on build failure as well (#2457) Simon Mo 2024-01-16 12:51:04 -08:00
947f0b23cc CI: make sure benchmark script exit on error (#2449) Simon Mo 2024-01-16 09:50:13 -08:00
f780504d12 fix weigit loading for GQA with TP (#2379) Chenhui Zhang 2024-01-16 07:43:59 +08:00
bfc072addf Allow buildkite to retry build on agent lost (#2446) Simon Mo 2024-01-15 15:43:15 -08:00
2a18da257c Announce the second vLLM meetup (#2444) Woosuk Kwon 2024-01-15 14:11:59 -08:00
6e01e8c1c8 [CI] Add Buildkite (#2355) Simon Mo 2024-01-14 12:37:58 -08:00
9f659bf07f [Minor] Optimize cuda graph memory usage (#2437) Roy 2024-01-15 01:40:51 +08:00
35c4bc20d9 [Minor] Fix err msg (#2431) Woosuk Kwon 2024-01-12 14:02:52 -08:00
218dc2ccda Aligning top_p and top_k Sampling (#1885) 陈序 2024-01-13 05:51:03 +08:00
827cbcd37c Update quickstart.rst (#2369) Simon 2024-01-12 14:56:18 -06:00
cb7a1c1cbf Suggest using dtype=half when OOM. Ben 2024-01-13 04:33:29 +08:00
7878958c0d Address Phi modeling update 2 (#2428) Gary Hui 2024-01-13 04:16:49 +08:00
ce036244c9 Allow setting fastapi root_path argument (#2341) Chirag Jain 2024-01-13 00:29:59 +05:30
48cf1e413c fix: deque mutated during iteration in abort_seq_group (#2371) 陈序 2024-01-13 00:44:18 +08:00
97460585d9 Add gradio chatbot for openai webserver (#2307) arkohut 2024-01-12 11:45:56 +08:00
f745847ef7 [Minor] Fix the format in quick start guide related to Model Scope (#2425) Zhuohan Li 2024-01-11 19:44:01 -08:00
6549aef245 [DOC] Add additional comments for LLMEngine and AsyncLLMEngine (#1011) Jiaxiang 2024-01-12 11:26:49 +08:00
50376faa7b Rename phi_1_5 -> phi (#2385) Woosuk Kwon 2024-01-11 16:23:43 -08:00
4b61c6b669 get_ip(): Fix ipv4 ipv6 dualstack (#2408) Yunfeng Bai 2024-01-10 11:39:58 -08:00
79d64c4954 [Speculative decoding 1/9] Optimized rejection sampler (#2336) Cade Daniel 2024-01-09 15:38:41 -08:00
74cd5abdd1 Add baichuan chat template jinjia file (#2390) KKY 2024-01-09 11:13:02 -06:00
28c3f12104 [Minor] Remove unused code in attention (#2384) Woosuk Kwon 2024-01-08 13:13:08 -08:00
c884819135 Fix eager mode performance (#2377) Woosuk Kwon 2024-01-08 10:11:06 -08:00
05921a9a7a Changed scheduler to use deques instead of lists (#2290) Nadav Shmayovits 2024-01-07 19:48:07 +02:00
d0215a58e7 Ensure metrics are logged regardless of requests (#2347) Iskren Ivov Chernev 2024-01-05 15:24:42 +02:00
937e7b7d7c Build docker image with shared objects from "build" step (#2237) Alexandre Payot 2024-01-04 18:35:18 +01:00
aee8ef661a Miner fix of type hint (#2340) ljss 2024-01-04 13:27:56 +08:00
2e0b6e7757 Bump up to v0.2.7 (#2337) [tag: v0.2.7] Woosuk Kwon 2024-01-03 17:35:56 -08:00
941767127c Revert the changes in test_cache (#2335) Woosuk Kwon 2024-01-03 17:32:05 -08:00
74d8d77626 Remove unused const TIMEOUT_TO_PREVENT_DEADLOCK (#2321) Ronen Schaffer 2024-01-04 01:49:07 +02:00
fd4ea8ef5c Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) Zhuohan Li 2024-01-04 03:30:22 +08:00
1066cbd152 Remove deprecated parameter: concurrency_count (#2315) Ronen Schaffer 2024-01-03 19:56:21 +02:00
6ef00b03a2 Enable CUDA graph for GPTQ & SqueezeLLM (#2318) Woosuk Kwon 2024-01-03 09:52:29 -08:00
9140561059 [Minor] Fix typo and remove unused code (#2305) Roy 2024-01-03 11:23:15 +08:00
77af974b40 [FIX] Support non-zero CUDA devices in custom kernels (#1959) Jee Li 2024-01-03 11:09:59 +08:00
4934d49274 Support GPT-NeoX Models without attention biases (#2301) Jong-hun Shin 2023-12-31 01:42:04 +09:00
358c328d69 [BUGFIX] Fix communication test (#2285) Zhuohan Li 2023-12-28 06:18:11 +08:00
4aaafdd289 [BUGFIX] Fix the path of test prompts (#2273) Zhuohan Li 2023-12-27 02:37:21 +08:00
66b108d142 [BUGFIX] Fix API server test (#2270) Zhuohan Li 2023-12-27 02:37:06 +08:00
e0ff920001 [BUGFIX] Do not return ignored sentences twice in async llm engine (#2258) Zhuohan Li 2023-12-26 13:41:09 +08:00
face83c7ec [Docs] Add "About" Heading to README.md (#2260) blueceiling 2023-12-25 17:37:07 -07:00
1db83e31a2 [Docs] Update installation instructions to include CUDA 11.8 xFormers (#2246) Shivam Thakkar 2023-12-23 12:50:02 +05:30
