Commit Graph

  • f17a1a8f96 [Misc] Make Serving Benchmark More User-friendly (#5044) Roger Wang 2024-05-25 10:28:16 -07:00
  • d5a1697772 [Dynamic Spec Decoding] Minor fix for disabling speculative decoding (#5000) Lily Liu 2024-05-25 10:00:14 -07:00
  • 325c119961 [Misc] add logging level env var (#5045) youkaichao 2024-05-24 23:49:49 -07:00
  • 8e192ff967 [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) Eric Xihui Lin 2024-05-25 01:00:52 -04:00
  • e64fde4b01 [Core][Bugfix]: fix prefix caching for blockv2 (#4764) leiwen83 2024-05-25 01:07:09 +08:00
  • 919770957f [Bugfix] Fix Mistral v0.3 Weight Loading (#5005) Robert Shaw 2024-05-24 14:28:27 +02:00
  • 6a50f4cafa [Doc] add ccache guide in doc (#5012) youkaichao 2024-05-23 16:21:54 -07:00
  • e3470f8753 [Core]: Option To Use Prompt Token Ids Inside Logits Processor (#4985) Elisei Smirnov 2024-05-24 01:04:24 +03:00
  • a1242324c9 [Kernel] Initial Activation Quantization Support (#4525) Dipika Sikka 2024-05-23 17:29:18 -04:00
  • 5eda2ea02a [Core][1/N] Support send/recv in PyNCCL Groups (#4988) Murali Andoorveedu 2024-05-23 09:54:48 -07:00
  • 2ba80bed27 [Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined (#5009) Letian Li 2024-05-23 17:08:58 +01:00
  • 6066253296 Marlin 24 prefill performance improvement (about 25% better on average) (#4983) Alexander Matveev 2024-05-23 02:39:27 -04:00
  • ee3eea0a1b [Misc] Take user preference in attention selector (#4960) Cody Yu 2024-05-22 15:55:56 -07:00
  • a36de682d4 [Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig (#4991) Philipp Moritz 2024-05-22 15:26:56 -07:00
  • eb6d3c264d [Core] Eliminate parallel worker per-step task scheduling overhead (#4894) Nick Hill 2024-05-22 14:17:27 -07:00
  • 97b030005c [Model] LoRA gptbigcode implementation (#3949) raywanb 2024-05-23 04:58:59 +08:00
  • a3a73ab069 [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) Cody Yu 2024-05-22 13:28:20 -07:00
  • 8674f9880e [Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954) Tyler Michael Smith 2024-05-22 10:10:43 -04:00
  • c74c913bfb [misc] remove comments that were supposed to be removed (#4977) SangBin Cho 2024-05-22 22:02:58 +09:00
  • 5f6d10c14c [CI/Build] Enforce style for C++ and CUDA code with clang-format (#4722) Michael Goin 2024-05-22 03:18:41 -04:00
  • 9b9a10d6cb [Frontend] Dynamic RoPE scaling (#4638) sasha0552 2024-05-22 05:32:35 +00:00
  • 99eff67ba9 [Bugfix][Kernel] Add head size check for attention backend selection (#4944) Isotr0py 2024-05-22 03:33:25 +08:00
  • 14772eeb8e [Bugfix] Fix flag name for max_seq_len_to_capture (#4935) Kante Yin 2024-05-22 00:30:52 +08:00
  • 757b62c495 [CI/Build] Codespell ignore build/ directory (#4945) Michael Goin 2024-05-21 12:06:10 -04:00
  • e941f88584 [Docs] Add acknowledgment for sponsors (#4925) Simon Mo 2024-05-21 00:17:25 -07:00
  • f12c3b5b3d [Model] Add Phi-2 LoRA support (#4886) Isotr0py 2024-05-21 13:24:17 +08:00
  • d130b573a0 [Model] add rope_scaling support for qwen2 (#4930) HUANG Fei 2024-05-21 13:22:22 +08:00
  • 65ae8c2c8f [Core] Fix scheduler considering "no LoRA" as "LoRA" (#4897) Antoni Baum 2024-05-20 17:48:32 -07:00
  • c3af44722c [Doc]Add documentation to benchmarking script when running TGI (#4920) Kuntai Du 2024-05-20 13:16:57 -07:00
  • 1937e29848 [Core] Sharded State Loader download from HF (#4889) Aurick Qiao 2024-05-20 14:46:12 -04:00
  • f0eecee610 [Bugfix] Fix dummy weight for fp8 (#4916) Mor Zusman 2024-05-20 21:44:25 +03:00
  • 943e72ca56 [Build/CI] Enabling AMD Entrypoints Test (#4834) Alexei-V-Ivanov-AMD 2024-05-20 13:29:28 -05:00
  • 546a97ef69 [Misc]: allow user to specify port in distributed setting (#4914) Wenwei Zhang 2024-05-21 01:45:06 +08:00
  • da5a0b539d Remove marlin warning (#4918) Alexander Matveev 2024-05-20 10:55:34 -04:00
  • 6287537a0c [Model] LLaVA model refactor (#4910) Cyrus Leung 2024-05-20 16:11:25 +08:00
  • b57e6c5949 [Kernel] Add flash-attn back (#4907) Woosuk Kwon 2024-05-19 18:11:30 -07:00
  • 27ce85476e [Kernel] Add marlin_24 unit tests (#4901) Alexander Matveev 2024-05-19 11:37:34 -04:00
  • f68470e803 [Bugfix][Model] Add base class for vision-language models (#4809) Cyrus Leung 2024-05-19 15:13:33 +08:00
  • 2e9a2227ec [Lora] Support long context lora (#4787) SangBin Cho 2024-05-18 16:05:23 +09:00
  • c0724fc915 [ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if Triton is not used (#4658) alexeykondrat 2024-05-18 01:09:11 -04:00
  • 86b45ae065 [Bugfix] Relax tiktoken to >= 0.6.0 (#4890) Michael Goin 2024-05-17 14:58:52 -04:00
  • c5711ef985 [Doc] Update Ray Data distributed offline inference example (#4871) Antoni Baum 2024-05-17 10:52:11 -07:00
  • 48d5985a08 Sync huggingface modifications of qwen Moe model (#4774) eigenLiu 2024-05-18 00:43:19 +08:00
  • 33e0823de5 [Bugfix] fix rope error when load models with different dtypes (#4835) Jinzhen Lin 2024-05-17 17:43:34 +08:00
  • 26148120b3 [Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797) Alexei-V-Ivanov-AMD 2024-05-16 22:58:25 -05:00
  • 0150a10630 [Frontend] OpenAI API server: Do not add bos token by default when encoding (#4688) bofeng huang 2024-05-17 03:47:22 +02:00
  • 8e7fb5d43a Support to serve vLLM on Kubernetes with LWS (#4829) Kante Yin 2024-05-17 07:37:29 +08:00
  • 9a31a817a8 [Bugfix] Fix FP8 KV cache support (#4869) Woosuk Kwon 2024-05-16 15:42:29 -07:00
  • 2060e93659 [Kernel] Add w8a8 CUTLASS kernels (#4749) Tyler Michael Smith 2024-05-16 18:32:50 -04:00
  • 8435b207af [Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850) Silencio 2024-05-17 02:16:09 +08:00
  • 10fa9eea21 [Misc] remove old comments (#4866) youkaichao 2024-05-16 11:07:41 -07:00
  • e08188081b [Core][Distributed] remove graph mode function (#4818) youkaichao 2024-05-16 10:59:52 -07:00
  • b5853f9963 [ROCm][AMD][Bugfix] adding a missing triton autotune config (#4845) Hongxia Yang 2024-05-16 13:46:52 -04:00
  • f09edd8a25 Add JSON output support for benchmark_latency and benchmark_throughput (#4848) Simon Mo 2024-05-16 10:02:56 -07:00
  • 6979ade384 Add GPTQ Marlin 2:4 sparse structured support (#4790) Alexander Matveev 2024-05-16 12:56:15 -04:00
  • 9216b9cc38 [Bugfix] Bypass authorization API token for preflight requests (#4862) Pierre Dulac 2024-05-16 18:42:21 +02:00
  • 5e0391c040 [Frontend] Separate OpenAI Batch Runner usage from API Server (#4851) Alex Wu 2024-05-16 11:42:41 -04:00
  • dbc0754ddf [docs] Fix typo in examples filename openi -> openai (#4864) Alex Wu 2024-05-16 11:42:17 -04:00
  • 99caa49106 [Kernel] add bfloat16 support for gptq marlin kernel (#4788) Jinzhen Lin 2024-05-16 21:55:29 +08:00
  • 5c342570d7 Add marlin unit tests and marlin benchmark script (#4815) alexm-nm 2024-05-16 09:36:49 -04:00
  • 973617ae02 [Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840) Cody Yu 2024-05-16 00:53:51 -07:00
  • 30e754390c [Core] Implement sharded state loader (#4690) Aurick Qiao 2024-05-16 01:11:54 -04:00
  • 52f8107cf2 [Frontend] Support OpenAI batch file format (#4794) Alex Wu 2024-05-15 19:13:36 -04:00
  • fc0d9dfc3a [Frontend] Re-enable custom roles in Chat Completions API (#4758) Cyrus Leung 2024-05-16 05:58:46 +08:00
  • 361c461a12 [Doc] Highlight the fourth meetup in the README (#4842) Zhuohan Li 2024-05-15 11:38:49 -07:00
  • a5675d348b [Bugfix] Properly set distributed_executor_backend in ParallelConfig (#4816) zifeitong 2024-05-15 07:22:09 -07:00
  • e9cdd2b1e2 [CI/Build] Further decouple HuggingFace implementation from ours during tests (#4166) Cyrus Leung 2024-05-15 14:38:40 +08:00
  • 65bf2ac165 [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) SangBin Cho 2024-05-15 14:00:10 +09:00
  • 8a7cc254a0 Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820) SangBin Cho 2024-05-15 11:52:45 +09:00
  • 29bc01bf3b Add 4th meetup announcement to readme (#4817) Simon Mo 2024-05-14 15:33:06 -07:00
  • 676a99982f [Core] Add MultiprocessingGPUExecutor (#4539) Nick Hill 2024-05-14 10:38:59 -07:00
  • dc72402b57 [Bugfix][Doc] Fix CI failure in docs (#4804) Cyrus Leung 2024-05-15 00:57:08 +08:00
  • ccb63a8245 [Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies (#4696) Kuntai Du 2024-05-14 05:34:33 -07:00
  • c579b750a0 [Doc] Add meetups to the doc (#4798) Zhuohan Li 2024-05-13 18:48:00 -07:00
  • 4bfa7e7f75 [Doc] Add API reference for offline inference (#4710) Cyrus Leung 2024-05-14 08:47:42 +08:00
  • ac1fbf7fd2 [Doc] Shorten README by removing supported model list (#4796) Zhuohan Li 2024-05-13 16:23:54 -07:00
  • 33d3914b1e [Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793) Philipp Moritz 2024-05-13 16:00:27 -07:00
  • 1356df53bd [Kernel] Use flash-attn for decoding (#3648) Stephen Krider 2024-05-13 15:50:33 -07:00
  • ce532ff45c [Speculative decoding] Improve n-gram efficiency (#4724) Cody Yu 2024-05-13 15:00:13 -07:00
  • 8bc68e198c [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update tensorizer to version 2.9.0 (#4208) Sanger Steel 2024-05-13 17:57:07 -04:00
  • 0fca3cdcf2 [Misc] Enhance attention selector (#4751) Woosuk Kwon 2024-05-13 10:47:25 -07:00
  • e7c46b9527 [Scheduler] Warning upon preemption and Swapping (#4647) SangBin Cho 2024-05-13 23:50:44 +09:00
  • 350f9e107f [CI/Build] Move test_utils.py to tests/utils.py (#4425) Cyrus Leung 2024-05-13 22:50:09 +08:00
  • 702bee461f [Core][Distributed] refactor custom allreduce to support multiple tp groups (#4754) youkaichao 2024-05-12 17:47:59 -07:00
  • a7be4d0072 [CORE] Improvement in ranks code (#4718) Swapnil Parekh 2024-05-12 20:47:47 -04:00
  • a709e87a4f [CI/Build] Tweak Marlin Nondeterminism Issues (#4713) Robert Shaw 2024-05-12 20:46:31 -04:00
  • 6eaccb7353 [Model] Add support for IBM Granite Code models (#4636) Yikang Shen 2024-05-12 00:27:24 -04:00
  • e254497b66 [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) Chang Su 2024-05-11 11:30:37 -07:00
  • 4e12131089 [Core][Test] fix function name typo in custom allreduce (#4750) youkaichao 2024-05-10 15:14:40 -07:00
  • fcc2994be6 [CI] Nits for bad initialization of SeqGroup in testing (#4748) Robert Shaw 2024-05-10 16:01:01 -06:00
  • 2e7796f2cf [Speculative decoding] CUDA graph support (#4295) heeju-kim2 2024-05-11 02:36:25 +09:00
  • 706588a77d [Bugfix] Fix CLI arguments in OpenAI server docs (#4729) Allen.Dou 2024-05-10 23:00:56 +08:00
  • 6a0f617210 [Core] Fix circular reference which leaked llm instance in local dev env (#4737) SangBin Cho 2024-05-10 23:54:32 +09:00
  • dac6a3f6ed [Misc] Apply a couple g++ cleanups (#4719) Steve Grubb 2024-05-10 09:37:05 -04:00
  • 64b77dfd7e [Core]fix type annotation for swap_blocks (#4726) Kunshang Ji 2024-05-10 20:52:48 +08:00
  • 51d4094fda chunked-prefill-doc-syntax (#4603) Simon Mo 2024-05-09 22:13:23 -07:00
  • e965d46184 [Misc] Keep only one implementation of the create_dummy_prompt function. (#4716) Allen.Dou 2024-05-10 12:42:38 +08:00
  • 208b71bcc1 [Core][Distributed] refactor pynccl (#4591) youkaichao 2024-05-09 19:48:43 -07:00
  • c833101740 [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) Cody Yu 2024-05-09 17:04:17 -07:00
  • 379da6dcb5 [Kernel] [FP8] Improve FP8 linear layer performance (#4691) Philipp Moritz 2024-05-09 16:38:07 -07:00