Commit Graph

  • 0003e9154b [Misc][Minor] Fix CPU block num log in CPUExecutor. (#4088) Li, Jiang 2024-04-15 23:35:55 +08:00
  • e11e200736 [Bugfix] Fix filelock version requirement (#4075) Zhuohan Li 2024-04-14 21:50:08 -07:00
  • 8db1bf32f8 [Misc] Upgrade triton to 2.2.0 (#4061) Roy 2024-04-15 08:43:54 +08:00
  • aceb17cf2d [Docs] document that mixtral 8x22b is supported (#4073) Simon Mo 2024-04-14 14:35:55 -07:00
  • 563c54f760 [BugFix] Fix tensorizer extra in setup.py (#4072) Nick Hill 2024-04-14 22:12:42 +01:00
  • 2cd6b4f362 [Core] avoid too many cuda context by caching p2p test (#4021) youkaichao 2024-04-13 23:40:21 -07:00
  • 711a000255 [Frontend] [Core] feat: Add model loading using tensorizer (#3476) Sanger Steel 2024-04-13 20:13:01 -04:00
  • 989ae2538d [Kernel] Add punica dimension for Baichuan-13B (#4053) Jee Li 2024-04-13 22:55:05 +08:00
  • 0a430b4ae2 [Bugfix] fix_small_bug_in_neuron_executor (#4051) zspo 2024-04-13 22:54:03 +08:00
  • ec8e3c695f [Bugfix] fix_log_time_in_metrics (#4050) zspo 2024-04-13 22:52:36 +08:00
  • 98afde19fc [Core][Distributed] improve logging for init dist (#4042) youkaichao 2024-04-13 07:12:53 -07:00
  • 5c2e66e487 [Bugfix] More type hint fixes for py 3.8 (#4039) Dylan Hawk 2024-04-12 21:07:04 -07:00
  • 546e721168 [CI/Test] expand ruff and yapf for all supported python version (#4037) youkaichao 2024-04-12 18:43:37 -07:00
  • b8aacac31a [Bugfix] Fix LoRA bug (#4032) Jee Li 2024-04-13 07:56:37 +08:00
  • d04973ad54 Fix triton compilation issue (#3984) Bellk17 2024-04-12 16:41:26 -07:00
  • fbb9d9eef4 [Core] fix custom allreduce default value (#4040) youkaichao 2024-04-12 16:40:39 -07:00
  • 09473ee41c [mypy] Add mypy type annotation part 1 (#4006) SangBin Cho 2024-04-13 06:35:50 +09:00
  • d4ec9ffb95 [Misc] Fix typo in scheduler.py (#4022) Zhuohan Li 2024-04-12 13:56:04 -07:00
  • 96b6a6d790 [Bugfix] fix type hint for py 3.8 (#4036) youkaichao 2024-04-12 12:35:44 -07:00
  • 36729bac13 [Test] Test multiple attn backend for chunked prefill. (#4023) SangBin Cho 2024-04-13 01:56:57 +09:00
  • 7fd3949a0b [Frontend][Core] Move merge_async_iterators to utils (#4026) Cyrus Leung 2024-04-12 13:30:54 +08:00
  • 1096717ae9 [Core] Support LoRA on quantized models (#4012) Jee Li 2024-04-12 12:02:44 +08:00
  • c2b4a1bce9 [Doc] Add typing hints / mypy types cleanup (#3816) Michael Feil 2024-04-11 17:17:21 -07:00
  • e46a60aa4c [BugFix] Fix handling of stop strings and stop token ids (#3672) Nick Hill 2024-04-11 23:34:12 +01:00
  • 1e96c3341a Add extra punica sizes to support bigger vocabs (#4015) Antoni Baum 2024-04-11 15:18:57 -07:00
  • 95e7d4a97c Fix echo/logprob OpenAI completion bug (#3441) Dylan Hawk 2024-04-11 15:15:50 -07:00
  • 559eb852f8 [Core] init_distributed_environment align with init_process_group(#4014) youkaichao 2024-04-11 14:00:48 -07:00
  • a10d3056da [Core] Set linear_weights directly on the layer (#3977) Antoni Baum 2024-04-11 13:35:51 -07:00
  • 8afca50889 [Hardware][Intel] Isolate CPUModelRunner and ModelRunner for better maintenance (#3824) bigPYJ1151 2024-04-12 02:56:49 +08:00
  • 08ccee1e83 punica fix-bgmv-kernel-640 (#4007) fuchen.ljl 2024-04-11 23:59:26 +08:00
  • c1dc547129 [Kernel] Fused MoE Config for Mixtral 8x22 (#4002) Roger Wang 2024-04-11 07:50:00 -07:00
  • f3d0bf7589 [Doc][Installation] delete python setup.py develop (#3989) youkaichao 2024-04-10 20:33:02 -07:00
  • e9da5a40c6 [Misc] Add indirection layer for custom ops (#3913) Kunshang Ji 2024-04-11 03:26:07 +00:00
  • e42df7227d [Test] Add xformer and flash attn tests (#3961) SangBin Cho 2024-04-11 12:09:50 +09:00
  • caada5e50a [Core][Model] torch.compile for layernorm in commandr (#3985) youkaichao 2024-04-10 18:48:26 -07:00
  • 67b4221a61 [Core][5/N] Fully working chunked prefill e2e (#3884) SangBin Cho 2024-04-11 09:56:48 +09:00
  • 63e7176f26 [Core][Refactor] move parallel_utils into vllm/distributed (#3950) youkaichao 2024-04-10 15:33:30 -07:00
  • 934d3662f7 [Bugfix] handle hf_config with architectures == None (#3982) Travis Johnson 2024-04-10 16:28:25 -06:00
  • 92cd2e2f21 [Doc] Fix getting stared to use publicly available model (#3963) Frαnçois 2024-04-10 20:05:52 +02:00
  • e4c4072c94 [Bugfix] Remove key sorting for guided_json parameter in OpenAi compatible Server (#3945) Daniel E Marasco 2024-04-10 13:15:51 -04:00
  • e35397468f [Doc] Add doc to state our model support policy (#3948) youkaichao 2024-04-10 10:03:02 -07:00
  • 8b317c6dd0 [Model][AMD] ROCm support for 256 head dims for Gemma (#3972) James Whedbee 2024-04-10 10:12:00 -05:00
  • bd3c144e0b [Bugfix][ROCm] Add numba to Dockerfile.rocm (#3962) Woosuk Kwon 2024-04-10 07:37:17 -07:00
  • 0258b7a94b [Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty (#3876) Travis Johnson 2024-04-10 02:39:56 -06:00
  • b3104b2a10 [Bugfix] Fix logits processor when prompt_logprobs is not None (#3899) 胡译文 2024-04-10 15:09:36 +08:00
  • c2e00af523 [Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable (#3955) zhaotyer 2024-04-10 12:49:11 +08:00
  • c013d32c75 [Benchmark] Add cpu options to bench scripts (#3915) Zedong Peng 2024-04-10 12:30:03 +08:00
  • 11dd6ebb89 [Misc] Avoid loading incorrect LoRA config (#3777) Jee Li 2024-04-10 10:47:15 +08:00
  • 6c0b04515f [ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm (#3643) Juan Villamizar 2024-04-09 17:10:47 -05:00
  • e23a43aef8 [Bugfix] Fix KeyError on loading GPT-NeoX (#3925) Junichi Sato 2024-04-10 04:11:31 +09:00
  • e7c7067b45 [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837) Cade Daniel 2024-04-09 11:44:15 -07:00
  • 6d592eb430 [Core] separate distributed_init from worker (#3904) youkaichao 2024-04-09 01:49:02 -07:00
  • d036198e23 [BugFix][Model] Fix commandr RoPE max_position_embeddings (#3919) Roy 2024-04-09 06:17:21 +08:00
  • 59a6abf3c9 [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations (#3782) Matt Wong 2024-04-08 14:31:02 -07:00
  • bc0c0192d1 [Bugfix] Enable Proper attention_bias Usage in Llama Model Configuration (#3767) Kiran R 2024-04-09 01:12:35 +05:30
  • f46864d68d [Bugfix] Added Command-R GPTQ support (#3849) egortolmachev 2024-04-08 17:59:38 +03:00
  • b4543c8f6b [Model] add minicpm (#3893) ywfang 2024-04-08 18:28:36 +08:00
  • 0ce0539d47 [Bugfix] Fix Llava inference with Tensor Parallelism. (#3883) Isotr0py 2024-04-07 22:54:13 +08:00
  • 2f19283549 [Core] latency optimization (#3890) youkaichao 2024-04-06 19:14:06 -07:00
  • 95baec828f [Core] enable out-of-tree model register (#3871) youkaichao 2024-04-06 17:11:41 -07:00
  • e4be7d70bb [CI/Benchmark] add more iteration and use median for robust latency benchmark (#3889) youkaichao 2024-04-06 14:32:30 -07:00
  • 54951ac4bf [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism (#3869) Isotr0py 2024-04-06 03:02:09 +08:00
  • 18de883489 [Chunked Prefill][4/n] Chunked prefill scheduler. (#3853) SangBin Cho 2024-04-06 02:17:58 +09:00
  • 1d7c940d74 Add option to completion API to truncate prompt tokens (#3144) Thomas Parnell 2024-04-05 19:15:42 +02:00
  • cfaf49a167 [Misc] Define common requirements (#3841) Woosuk Kwon 2024-04-05 00:39:17 -07:00
  • 9edec652e2 [Bugfix] Fixing requirements.txt (#3865) Noam Gat 2024-04-05 09:46:01 +03:00
  • e0dd4d3589 [Misc] Fix linter issues in examples/fp8/quantizer/quantize.py (#3864) Cade Daniel 2024-04-04 21:57:33 -07:00
  • e5043a3e75 [Misc] Add pytest marker to opt-out of global test cleanup (#3863) Cade Daniel 2024-04-04 21:54:16 -07:00
  • d03d64fd2e [CI/Build] refactor dockerfile & fix pip cache youkaichao 2024-04-04 21:53:16 -07:00
  • 78107fa091 [Doc]Add asynchronous engine arguments to documentation. (#3810) Sean Gallen 2024-04-04 23:52:01 -05:00
  • c391e4b68e [Core] improve robustness of pynccl (#3860) youkaichao 2024-04-04 16:52:12 -07:00
  • 9117f892f0 [Model] Cohere CommandR+ (#3829) Saurabh Dash 2024-04-05 02:01:49 +05:30
  • db2a6a41e2 [Hardware][CPU] Update cpu torch to match default of 2.2.1 (#3854) Michael Goin 2024-04-04 12:49:49 -07:00
  • ca81ff5196 [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 (#3805) youkaichao 2024-04-04 10:26:19 -07:00
  • b7782002e1 [Benchmark] Refactor sample_requests in benchmark_throughput (#3613) TianYu GUO 2024-04-04 17:56:22 +08:00
  • 819a309c0f [Bugfix] Fix args in benchmark_serving (#3836) Chang Su 2024-04-04 00:41:05 -07:00
  • aabe8f40f2 [Core] [Frontend] Make detokenization optional (#3749) Matthias Gerstgrasser 2024-04-03 21:52:18 -07:00
  • 498eb5cfa3 [Bugfix] Add kv_scale input parameter to CPU backend (#3840) Woosuk Kwon 2024-04-03 21:33:08 -07:00
  • 537ee25f43 [Core] Enable hf_transfer by default if available (#3817) Michael Feil 2024-04-03 21:02:43 -07:00
  • 294f8f6665 [BugFix] Pass tokenizer_config to local_tokenizer_group (#3754) Tao He 2024-04-04 11:31:46 +08:00
  • b95047f2da [Misc] Publish 3rd meetup slides (#3835) Woosuk Kwon 2024-04-03 15:46:10 -07:00
  • 2ff767b513 Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) Adrian Abeyta 2024-04-03 16:15:55 -05:00
  • 3dcb3e8b98 [3/N] Refactor scheduler for chunked prefill scheduling (#3550) SangBin Cho 2024-04-04 06:13:49 +09:00
  • c64cf38673 [Doc] Update contribution guidelines for better onboarding (#3819) Michael Feil 2024-04-03 00:31:43 -07:00
  • 76b889bf1d [Doc] Update README.md (#3806) Robert Shaw 2024-04-02 23:11:10 -07:00
  • c9b506dad4 [BugFix] Use different mechanism to get vllm version in is_cpu() (#3804) Nick Hill 2024-04-02 23:06:25 -07:00
  • 5757d90e26 [Speculative decoding] Adding configuration object for speculative decoding (#3706) Cade Daniel 2024-04-02 17:40:57 -07:00
  • a3c226e7eb [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary (#3803) (tag: v0.4.0.post1) youkaichao 2024-04-02 12:57:04 -07:00
  • b321d4881b [Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ (#3798) Michael Goin 2024-04-02 12:35:31 -07:00
  • ad6eca408b Fix early CUDA init via get_architecture_class_name import (#3770) leiwen83 2024-04-03 02:56:26 +08:00
  • 205b94942e [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build (#3801) youkaichao 2024-04-02 11:54:33 -07:00
  • 3bec41f41a [Doc] Fix vLLMEngine Doc Page (#3791) Roger Wang 2024-04-02 09:49:37 -07:00
  • 0739b1947f [Frontend][Bugfix] allow using the default middleware with a root path (#3788) A-Mahla 2024-04-02 10:20:28 +02:00
  • 77a6572aa5 [HotFix] [CI/Build] Minor fix for CPU backend CI (#3787) bigPYJ1151 2024-04-02 13:50:53 +08:00
  • 0e3f06fe9c [Hardware][Intel] Add CPU inference backend (#3634) bigPYJ1151 2024-04-02 13:07:30 +08:00
  • eb69d68804 [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup (#3783) Cade Daniel 2024-04-01 17:49:51 -07:00
  • 7d4e1b85e7 [Misc] Add support for new autogptq checkpoint_format (#3689) Qubitium 2024-04-02 07:32:01 +08:00
  • 93deb0b38f [Speculative decoding 4/9] Lookahead scheduling for speculative decoding (#3250) Cade Daniel 2024-04-01 15:55:24 -07:00
  • ccb58b23e6 [Misc] Fix Benchmark TTFT Calculation for Chat Completions (#3768) Roger Wang 2024-04-01 15:24:30 -07:00
  • 49782fcb76 [Misc] Some minor simplifications to detokenization logic (#3670) Nick Hill 2024-04-01 13:22:06 -07:00