biondizzle/vllm

Commit Graph

f17a1a8f96 [Misc] Make Serving Benchmark More User-friendly (#5044) Roger Wang 2024-05-25 10:28:16 -07:00
d5a1697772 [Dynamic Spec Decoding] Minor fix for disabling speculative decoding (#5000) Lily Liu 2024-05-25 10:00:14 -07:00
325c119961 [Misc] add logging level env var (#5045) youkaichao 2024-05-24 23:49:49 -07:00
8e192ff967 [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) Eric Xihui Lin 2024-05-25 01:00:52 -04:00
e64fde4b01 [Core][Bugfix]: fix prefix caching for blockv2 (#4764) leiwen83 2024-05-25 01:07:09 +08:00
919770957f [Bugfix] Fix Mistral v0.3 Weight Loading (#5005) Robert Shaw 2024-05-24 14:28:27 +02:00
6a50f4cafa [Doc] add ccache guide in doc (#5012) youkaichao 2024-05-23 16:21:54 -07:00
e3470f8753 [Core]: Option To Use Prompt Token Ids Inside Logits Processor (#4985) Elisei Smirnov 2024-05-24 01:04:24 +03:00
a1242324c9 [Kernel] Initial Activation Quantization Support (#4525) Dipika Sikka 2024-05-23 17:29:18 -04:00
5eda2ea02a [Core][1/N] Support send/recv in PyNCCL Groups (#4988) Murali Andoorveedu 2024-05-23 09:54:48 -07:00
2ba80bed27 [Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined (#5009) Letian Li 2024-05-23 17:08:58 +01:00
6066253296 Marlin 24 prefill performance improvement (about 25% better on average) (#4983) Alexander Matveev 2024-05-23 02:39:27 -04:00
ee3eea0a1b [Misc] Take user preference in attention selector (#4960) Cody Yu 2024-05-22 15:55:56 -07:00
a36de682d4 [Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig (#4991) Philipp Moritz 2024-05-22 15:26:56 -07:00
eb6d3c264d [Core] Eliminate parallel worker per-step task scheduling overhead (#4894) Nick Hill 2024-05-22 14:17:27 -07:00
97b030005c [Model] LoRA gptbigcode implementation (#3949) raywanb 2024-05-23 04:58:59 +08:00
a3a73ab069 [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) Cody Yu 2024-05-22 13:28:20 -07:00
8674f9880e [Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954) Tyler Michael Smith 2024-05-22 10:10:43 -04:00
c74c913bfb [misc] remove comments that were supposed to be removed (#4977) SangBin Cho 2024-05-22 22:02:58 +09:00
5f6d10c14c [CI/Build] Enforce style for C++ and CUDA code with clang-format (#4722) Michael Goin 2024-05-22 03:18:41 -04:00
9b9a10d6cb [Frontend] Dynamic RoPE scaling (#4638) sasha0552 2024-05-22 05:32:35 +00:00
99eff67ba9 [Bugfix][Kernel] Add head size check for attention backend selection (#4944) Isotr0py 2024-05-22 03:33:25 +08:00
14772eeb8e [Bugfix] Fix flag name for max_seq_len_to_capture (#4935) Kante Yin 2024-05-22 00:30:52 +08:00
757b62c495 [CI/Build] Codespell ignore build/ directory (#4945) Michael Goin 2024-05-21 12:06:10 -04:00
e941f88584 [Docs] Add acknowledgment for sponsors (#4925) Simon Mo 2024-05-21 00:17:25 -07:00
f12c3b5b3d [Model] Add Phi-2 LoRA support (#4886) Isotr0py 2024-05-21 13:24:17 +08:00
d130b573a0 [Model] add rope_scaling support for qwen2 (#4930) HUANG Fei 2024-05-21 13:22:22 +08:00
65ae8c2c8f [Core] Fix scheduler considering "no LoRA" as "LoRA" (#4897) Antoni Baum 2024-05-20 17:48:32 -07:00
c3af44722c [Doc]Add documentation to benchmarking script when running TGI (#4920) Kuntai Du 2024-05-20 13:16:57 -07:00
1937e29848 [Core] Sharded State Loader download from HF (#4889) Aurick Qiao 2024-05-20 14:46:12 -04:00
f0eecee610 [Bugfix] Fix dummy weight for fp8 (#4916) Mor Zusman 2024-05-20 21:44:25 +03:00
943e72ca56 [Build/CI] Enabling AMD Entrypoints Test (#4834) Alexei-V-Ivanov-AMD 2024-05-20 13:29:28 -05:00
546a97ef69 [Misc]: allow user to specify port in distributed setting (#4914) Wenwei Zhang 2024-05-21 01:45:06 +08:00
da5a0b539d Remove marlin warning (#4918) Alexander Matveev 2024-05-20 10:55:34 -04:00
6287537a0c [Model] LLaVA model refactor (#4910) Cyrus Leung 2024-05-20 16:11:25 +08:00
b57e6c5949 [Kernel] Add flash-attn back (#4907) Woosuk Kwon 2024-05-19 18:11:30 -07:00
27ce85476e [Kernel] Add marlin_24 unit tests (#4901) Alexander Matveev 2024-05-19 11:37:34 -04:00
f68470e803 [Bugfix][Model] Add base class for vision-language models (#4809) Cyrus Leung 2024-05-19 15:13:33 +08:00
2e9a2227ec [Lora] Support long context lora (#4787) SangBin Cho 2024-05-18 16:05:23 +09:00
c0724fc915 [ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if Triton is not used (#4658) alexeykondrat 2024-05-18 01:09:11 -04:00
86b45ae065 [Bugfix] Relax tiktoken to >= 0.6.0 (#4890) Michael Goin 2024-05-17 14:58:52 -04:00
c5711ef985 [Doc] Update Ray Data distributed offline inference example (#4871) Antoni Baum 2024-05-17 10:52:11 -07:00
48d5985a08 Sync huggingface modifications of qwen Moe model (#4774) eigenLiu 2024-05-18 00:43:19 +08:00
33e0823de5 [Bugfix] fix rope error when load models with different dtypes (#4835) Jinzhen Lin 2024-05-17 17:43:34 +08:00
26148120b3 [Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797) Alexei-V-Ivanov-AMD 2024-05-16 22:58:25 -05:00
0150a10630 [Frontend] OpenAI API server: Do not add bos token by default when encoding (#4688) bofeng huang 2024-05-17 03:47:22 +02:00
8e7fb5d43a Support to serve vLLM on Kubernetes with LWS (#4829) Kante Yin 2024-05-17 07:37:29 +08:00
9a31a817a8 [Bugfix] Fix FP8 KV cache support (#4869) Woosuk Kwon 2024-05-16 15:42:29 -07:00
2060e93659 [Kernel] Add w8a8 CUTLASS kernels (#4749) Tyler Michael Smith 2024-05-16 18:32:50 -04:00
8435b207af [Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850) Silencio 2024-05-17 02:16:09 +08:00
10fa9eea21 [Misc] remove old comments (#4866) youkaichao 2024-05-16 11:07:41 -07:00
e08188081b [Core][Distributed] remove graph mode function (#4818) youkaichao 2024-05-16 10:59:52 -07:00
b5853f9963 [ROCm][AMD][Bugfix] adding a missing triton autotune config (#4845) Hongxia Yang 2024-05-16 13:46:52 -04:00
f09edd8a25 Add JSON output support for benchmark_latency and benchmark_throughput (#4848) Simon Mo 2024-05-16 10:02:56 -07:00
6979ade384 Add GPTQ Marlin 2:4 sparse structured support (#4790) Alexander Matveev 2024-05-16 12:56:15 -04:00
9216b9cc38 [Bugfix] Bypass authorization API token for preflight requests (#4862) Pierre Dulac 2024-05-16 18:42:21 +02:00
5e0391c040 [Frontend] Separate OpenAI Batch Runner usage from API Server (#4851) Alex Wu 2024-05-16 11:42:41 -04:00
dbc0754ddf [docs] Fix typo in examples filename openi -> openai (#4864) Alex Wu 2024-05-16 11:42:17 -04:00
99caa49106 [Kernel] add bfloat16 support for gptq marlin kernel (#4788) Jinzhen Lin 2024-05-16 21:55:29 +08:00
5c342570d7 Add marlin unit tests and marlin benchmark script (#4815) alexm-nm 2024-05-16 09:36:49 -04:00
973617ae02 [Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840) Cody Yu 2024-05-16 00:53:51 -07:00
30e754390c [Core] Implement sharded state loader (#4690) Aurick Qiao 2024-05-16 01:11:54 -04:00
52f8107cf2 [Frontend] Support OpenAI batch file format (#4794) Alex Wu 2024-05-15 19:13:36 -04:00
fc0d9dfc3a [Frontend] Re-enable custom roles in Chat Completions API (#4758) Cyrus Leung 2024-05-16 05:58:46 +08:00
361c461a12 [Doc] Highlight the fourth meetup in the README (#4842) Zhuohan Li 2024-05-15 11:38:49 -07:00
a5675d348b [Bugfix] Properly set distributed_executor_backend in ParallelConfig (#4816) zifeitong 2024-05-15 07:22:09 -07:00
e9cdd2b1e2 [CI/Build] Further decouple HuggingFace implementation from ours during tests (#4166) Cyrus Leung 2024-05-15 14:38:40 +08:00
65bf2ac165 [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) SangBin Cho 2024-05-15 14:00:10 +09:00
8a7cc254a0 Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820) SangBin Cho 2024-05-15 11:52:45 +09:00
29bc01bf3b Add 4th meetup announcement to readme (#4817) Simon Mo 2024-05-14 15:33:06 -07:00
676a99982f [Core] Add MultiprocessingGPUExecutor (#4539) Nick Hill 2024-05-14 10:38:59 -07:00
dc72402b57 [Bugfix][Doc] Fix CI failure in docs (#4804) Cyrus Leung 2024-05-15 00:57:08 +08:00
ccb63a8245 [Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies (#4696) Kuntai Du 2024-05-14 05:34:33 -07:00
c579b750a0 [Doc] Add meetups to the doc (#4798) Zhuohan Li 2024-05-13 18:48:00 -07:00
4bfa7e7f75 [Doc] Add API reference for offline inference (#4710) Cyrus Leung 2024-05-14 08:47:42 +08:00
ac1fbf7fd2 [Doc] Shorten README by removing supported model list (#4796) Zhuohan Li 2024-05-13 16:23:54 -07:00
33d3914b1e [Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793) Philipp Moritz 2024-05-13 16:00:27 -07:00
1356df53bd [Kernel] Use flash-attn for decoding (#3648) Stephen Krider 2024-05-13 15:50:33 -07:00
ce532ff45c [Speculative decoding] Improve n-gram efficiency (#4724) Cody Yu 2024-05-13 15:00:13 -07:00
8bc68e198c [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update tensorizer to version 2.9.0 (#4208) Sanger Steel 2024-05-13 17:57:07 -04:00
0fca3cdcf2 [Misc] Enhance attention selector (#4751) Woosuk Kwon 2024-05-13 10:47:25 -07:00
e7c46b9527 [Scheduler] Warning upon preemption and Swapping (#4647) SangBin Cho 2024-05-13 23:50:44 +09:00
350f9e107f [CI/Build] Move test_utils.py to tests/utils.py (#4425) Cyrus Leung 2024-05-13 22:50:09 +08:00
702bee461f [Core][Distributed] refactor custom allreduce to support multiple tp groups (#4754) youkaichao 2024-05-12 17:47:59 -07:00
a7be4d0072 [CORE] Improvement in ranks code (#4718) Swapnil Parekh 2024-05-12 20:47:47 -04:00
a709e87a4f [CI/Build] Tweak Marlin Nondeterminism Issues (#4713) Robert Shaw 2024-05-12 20:46:31 -04:00
6eaccb7353 [Model] Add support for IBM Granite Code models (#4636) Yikang Shen 2024-05-12 00:27:24 -04:00
e254497b66 [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) Chang Su 2024-05-11 11:30:37 -07:00
4e12131089 [Core][Test] fix function name typo in custom allreduce (#4750) youkaichao 2024-05-10 15:14:40 -07:00
fcc2994be6 [CI] Nits for bad initialization of SeqGroup in testing (#4748) Robert Shaw 2024-05-10 16:01:01 -06:00
2e7796f2cf [Speculative decoding] CUDA graph support (#4295) heeju-kim2 2024-05-11 02:36:25 +09:00
706588a77d [Bugfix] Fix CLI arguments in OpenAI server docs (#4729) Allen.Dou 2024-05-10 23:00:56 +08:00
6a0f617210 [Core] Fix circular reference which leaked llm instance in local dev env (#4737) SangBin Cho 2024-05-10 23:54:32 +09:00
dac6a3f6ed [Misc] Apply a couple g++ cleanups (#4719) Steve Grubb 2024-05-10 09:37:05 -04:00
64b77dfd7e [Core]fix type annotation for swap_blocks (#4726) Kunshang Ji 2024-05-10 20:52:48 +08:00
51d4094fda chunked-prefill-doc-syntax (#4603) Simon Mo 2024-05-09 22:13:23 -07:00
e965d46184 [Misc] Keep only one implementation of the create_dummy_prompt function. (#4716) Allen.Dou 2024-05-10 12:42:38 +08:00
208b71bcc1 [Core][Distributed] refactor pynccl (#4591) youkaichao 2024-05-09 19:48:43 -07:00
c833101740 [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) Cody Yu 2024-05-09 17:04:17 -07:00
379da6dcb5 [Kernel] [FP8] Improve FP8 linear layer performance (#4691) Philipp Moritz 2024-05-09 16:38:07 -07:00
