Commit Graph

  • 93b38bea5d Refactor Prometheus and Add Request Level Metrics (#2316) Robert Shaw 2024-01-31 14:58:07 -08:00
  • d0d93b92b1 Add unit test for Mixtral MoE layer (#2677) Philipp Moritz 2024-01-31 14:34:17 -08:00
  • 89efcf1ce5 [Minor] Fix test_cache.py CI test failure (#2684) Philipp Moritz 2024-01-31 10:12:11 -08:00
  • c664b0e683 fix some bugs (#2689) zspo 2024-02-01 02:09:23 +08:00
  • d69ff0cbbb Fixes assertion failure in prefix caching: the lora index mapping should respect prefix_len (#2688) Tao He 2024-02-01 01:00:13 +08:00
  • 1af090b57d Bump up version to v0.3.0 (#2656) v0.3.0 Zhuohan Li 2024-01-31 00:07:07 -08:00
  • 3dad944485 Add quantized mixtral support (#2673) Woosuk Kwon 2024-01-30 16:34:10 -08:00
  • 105a40f53a [Minor] Fix false warning when TP=1 (#2674) Woosuk Kwon 2024-01-30 14:39:40 -08:00
  • bbe9bd9684 [Minor] Fix a small typo (#2672) Philipp Moritz 2024-01-30 13:40:37 -08:00
  • 4f65af0e25 Add swap_blocks unit tests (#2616) Vladimir 2024-01-30 18:30:50 +01:00
  • d79ced3292 Fix 'Actor methods cannot be called directly' when using --engine-use-ray (#2664) Wen Sun 2024-01-31 00:17:05 +08:00
  • ab40644669 Fused MOE for Mixtral (#2542) Philipp Moritz 2024-01-29 22:43:37 -08:00
  • 5d60def02c DeepseekMoE support with Fused MoE kernel (#2453) wangding zeng 2024-01-30 13:19:48 +08:00
  • ea8489fce2 ROCm: Allow setting compilation target (#2581) Rasmus Larsen 2024-01-29 19:52:31 +01:00
  • 1b20639a43 No repeated IPC open (#2642) Hanzhi Zhou 2024-01-30 02:46:29 +08:00
  • b72af8f1ed Fix error when tp > 1 (#2644) zhaoyang-star 2024-01-29 14:47:39 +08:00
  • 9090bf02e7 Support FP8-E5M2 KV Cache (#2279) zhaoyang-star 2024-01-29 08:43:54 +08:00
  • 7d648418b8 Update Ray version requirements (#2636) Simon Mo 2024-01-28 14:27:22 -08:00
  • 89be30fa7d Small async_llm_engine refactor (#2618) Murali Andoorveedu 2024-01-27 23:28:37 -08:00
  • f8ecb84c02 Speed up Punica compilation (#2632) Woosuk Kwon 2024-01-27 17:46:56 -08:00
  • 5f036d2bcc [Minor] Fix warning on Ray dependencies (#2630) Woosuk Kwon 2024-01-27 15:43:40 -08:00
  • 380170038e Implement custom all reduce kernels (#2192) Hanzhi Zhou 2024-01-28 04:46:35 +08:00
  • 220a47627b Use head_dim in config if exists (#2622) Xiang Xu 2024-01-27 10:30:49 -08:00
  • beb89f68b4 AWQ: Up to 2.66x higher throughput (#2566) Casper 2024-01-27 08:53:17 +01:00
  • 390b495ff3 Don't build punica kernels by default (#2605) Philipp Moritz 2024-01-26 15:19:19 -08:00
  • 3a0e1fc070 Support for Stable LM 2 (#2598) dakotamahan-stability 2024-01-26 14:45:19 -06:00
  • 6b7de1a030 [ROCm] add support to ROCm 6.0 and MI300 (#2274) Hongxia Yang 2024-01-26 15:41:10 -05:00
  • 5265631d15 use a correct device when creating OptionalCUDAGuard (#2583) Vladimir 2024-01-26 08:48:17 +01:00
  • 2832e7b9f9 fix names and license for Qwen2 (#2589) Junyang Lin 2024-01-25 14:37:51 +08:00
  • 3a7dd7e367 Support Batch Completion in Server (#2529) Simon Mo 2024-01-24 17:11:07 -08:00
  • 223c19224b Fix the syntax error in the doc of supported_models (#2584) LastWhisper 2024-01-25 03:22:51 +08:00
  • f1f6cc10c7 Added include_stop_str_in_output and length_penalty parameters to OpenAI API (#2562) Federico Galatolo 2024-01-24 19:21:56 +01:00
  • 3209b49033 [Bugfix] fix crash if max_tokens=None (#2570) Nikola Borisov 2024-01-23 22:38:55 -08:00
  • 1e4277d2d1 lint: format all python file instead of just source code (#2567) Simon Mo 2024-01-23 15:53:06 -08:00
  • 9b945daaf1 [Experimental] Add multi-LoRA support (#1804) Antoni Baum 2024-01-24 00:26:37 +01:00
  • 9c1352eb57 [Feature] Simple API token authentication and pluggable middlewares (#1106) Erfan Al-Hossami 2024-01-23 18:13:00 -05:00
  • 7a0b011dd5 Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py (#2553) Jason Zhu 2024-01-22 14:47:25 -08:00
  • 63e835cbcc Fix progress bar and allow HTTPS in benchmark_serving.py (#2552) Harry Mellor 2024-01-22 22:40:31 +00:00
  • 94b5edeb53 Add qwen2 (#2495) Junyang Lin 2024-01-23 06:34:21 +08:00
  • ab7e6006d6 Fix https://github.com/vllm-project/vllm/issues/2540 (#2545) Philipp Moritz 2024-01-22 10:02:38 -08:00
  • 18bfcdd05c [Speculative decoding 2/9] Multi-step worker for draft model (#2424) Cade Daniel 2024-01-21 16:31:47 -08:00
  • 71d63ed72e migrate pydantic from v1 to v2 (#2531) Jannis Schönleber 2024-01-22 01:05:56 +01:00
  • d75c40734a [Fix] Keep scheduler.running as deque (#2523) Nick Hill 2024-01-20 22:36:09 -08:00
  • 5b23c3f26f Add group as an argument in broadcast ops (#2522) Junda Chen 2024-01-20 16:00:26 -08:00
  • 00efdc84ba Add benchmark serving to CI (#2505) Simon Mo 2024-01-19 20:20:19 -08:00
  • 91a61da9b1 [Bugfix] fix load local safetensors model (#2512) Roy 2024-01-20 08:26:16 +08:00
  • ef9b636e2d Simplify broadcast logic for control messages (#2501) Zhuohan Li 2024-01-19 11:23:30 -08:00
  • 2709c0009a Support OpenAI API server in benchmark_serving.py (#2172) Harry Mellor 2024-01-19 04:34:08 +00:00
  • dd7e8f5f64 refactor completion api for readability (#2499) Simon Mo 2024-01-18 16:45:14 -08:00
  • d2a68364c4 [BugFix] Fix abort_seq_group (#2463) ljss 2024-01-19 07:10:42 +08:00
  • 7e1081139d Don't download both safetensor and bin files. (#2480) Nikola Borisov 2024-01-18 11:05:53 -08:00
  • 18473cf498 [Neuron] Add an option to build with neuron (#2065) Liangfu Chen 2024-01-18 10:58:50 -08:00
  • 4df417d059 fix: fix some args desc (#2487) zspo 2024-01-19 01:41:44 +08:00
  • 5d80a9178b Minor fix in prefill cache example (#2494) Jason Zhu 2024-01-18 09:40:34 -08:00
  • 8a25d3a71a fix stablelm.py tensor-parallel-size bug (#2482) YingchaoX 2024-01-19 01:39:46 +08:00
  • d10f8e1d43 [Experimental] Prefix Caching Support (#1669) shiyi.c_98 2024-01-17 16:32:10 -08:00
  • 14cc317ba4 OpenAI Server refactoring (#2360) FlorianJoncour 2024-01-17 05:33:14 +00:00
  • e1957c6ebd Add StableLM3B model (#2372) Hyunsung Lee 2024-01-17 13:32:40 +09:00
  • 8cd5a992bf ci: retry on build failure as well (#2457) Simon Mo 2024-01-16 12:51:04 -08:00
  • 947f0b23cc CI: make sure benchmark script exit on error (#2449) Simon Mo 2024-01-16 09:50:13 -08:00
  • f780504d12 fix weight loading for GQA with TP (#2379) Chenhui Zhang 2024-01-16 07:43:59 +08:00
  • bfc072addf Allow buildkite to retry build on agent lost (#2446) Simon Mo 2024-01-15 15:43:15 -08:00
  • 2a18da257c Announce the second vLLM meetup (#2444) Woosuk Kwon 2024-01-15 14:11:59 -08:00
  • 6e01e8c1c8 [CI] Add Buildkite (#2355) Simon Mo 2024-01-14 12:37:58 -08:00
  • 9f659bf07f [Minor] Optimize cuda graph memory usage (#2437) Roy 2024-01-15 01:40:51 +08:00
  • 35c4bc20d9 [Minor] Fix err msg (#2431) Woosuk Kwon 2024-01-12 14:02:52 -08:00
  • 218dc2ccda Aligning top_p and top_k Sampling (#1885) 陈序 2024-01-13 05:51:03 +08:00
  • 827cbcd37c Update quickstart.rst (#2369) Simon 2024-01-12 14:56:18 -06:00
  • cb7a1c1cbf Suggest using dtype=half when OOM. Ben 2024-01-13 04:33:29 +08:00
  • 7878958c0d Address Phi modeling update 2 (#2428) Gary Hui 2024-01-13 04:16:49 +08:00
  • ce036244c9 Allow setting fastapi root_path argument (#2341) Chirag Jain 2024-01-13 00:29:59 +05:30
  • 48cf1e413c fix: deque mutated during iteration in abort_seq_group (#2371) 陈序 2024-01-13 00:44:18 +08:00
  • 97460585d9 Add gradio chatbot for openai webserver (#2307) arkohut 2024-01-12 11:45:56 +08:00
  • f745847ef7 [Minor] Fix the format in quick start guide related to Model Scope (#2425) Zhuohan Li 2024-01-11 19:44:01 -08:00
  • 6549aef245 [DOC] Add additional comments for LLMEngine and AsyncLLMEngine (#1011) Jiaxiang 2024-01-12 11:26:49 +08:00
  • 50376faa7b Rename phi_1_5 -> phi (#2385) Woosuk Kwon 2024-01-11 16:23:43 -08:00
  • 4b61c6b669 get_ip(): Fix ipv4 ipv6 dualstack (#2408) Yunfeng Bai 2024-01-10 11:39:58 -08:00
  • 79d64c4954 [Speculative decoding 1/9] Optimized rejection sampler (#2336) Cade Daniel 2024-01-09 15:38:41 -08:00
  • 74cd5abdd1 Add baichuan chat template jinja file (#2390) KKY 2024-01-09 11:13:02 -06:00
  • 28c3f12104 [Minor] Remove unused code in attention (#2384) Woosuk Kwon 2024-01-08 13:13:08 -08:00
  • c884819135 Fix eager mode performance (#2377) Woosuk Kwon 2024-01-08 10:11:06 -08:00
  • 05921a9a7a Changed scheduler to use deques instead of lists (#2290) Nadav Shmayovits 2024-01-07 19:48:07 +02:00
  • d0215a58e7 Ensure metrics are logged regardless of requests (#2347) Iskren Ivov Chernev 2024-01-05 15:24:42 +02:00
  • 937e7b7d7c Build docker image with shared objects from "build" step (#2237) Alexandre Payot 2024-01-04 18:35:18 +01:00
  • aee8ef661a Minor fix of type hint (#2340) ljss 2024-01-04 13:27:56 +08:00
  • 2e0b6e7757 Bump up to v0.2.7 (#2337) v0.2.7 Woosuk Kwon 2024-01-03 17:35:56 -08:00
  • 941767127c Revert the changes in test_cache (#2335) Woosuk Kwon 2024-01-03 17:32:05 -08:00
  • 74d8d77626 Remove unused const TIMEOUT_TO_PREVENT_DEADLOCK (#2321) Ronen Schaffer 2024-01-04 01:49:07 +02:00
  • fd4ea8ef5c Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) Zhuohan Li 2024-01-04 03:30:22 +08:00
  • 1066cbd152 Remove deprecated parameter: concurrency_count (#2315) Ronen Schaffer 2024-01-03 19:56:21 +02:00
  • 6ef00b03a2 Enable CUDA graph for GPTQ & SqueezeLLM (#2318) Woosuk Kwon 2024-01-03 09:52:29 -08:00
  • 9140561059 [Minor] Fix typo and remove unused code (#2305) Roy 2024-01-03 11:23:15 +08:00
  • 77af974b40 [FIX] Support non-zero CUDA devices in custom kernels (#1959) Jee Li 2024-01-03 11:09:59 +08:00
  • 4934d49274 Support GPT-NeoX Models without attention biases (#2301) Jong-hun Shin 2023-12-31 01:42:04 +09:00
  • 358c328d69 [BUGFIX] Fix communication test (#2285) Zhuohan Li 2023-12-28 06:18:11 +08:00
  • 4aaafdd289 [BUGFIX] Fix the path of test prompts (#2273) Zhuohan Li 2023-12-27 02:37:21 +08:00
  • 66b108d142 [BUGFIX] Fix API server test (#2270) Zhuohan Li 2023-12-27 02:37:06 +08:00
  • e0ff920001 [BUGFIX] Do not return ignored sentences twice in async llm engine (#2258) Zhuohan Li 2023-12-26 13:41:09 +08:00
  • face83c7ec [Docs] Add "About" Heading to README.md (#2260) blueceiling 2023-12-25 17:37:07 -07:00
  • 1db83e31a2 [Docs] Update installation instructions to include CUDA 11.8 xFormers (#2246) Shivam Thakkar 2023-12-23 12:50:02 +05:30