Commit Graph

  • 29678cd213 Minor fix on AWQ kernel launch (#1356) Woosuk Kwon 2023-10-15 21:53:56 -07:00
  • d0740dff1b Fix error message on TORCH_CUDA_ARCH_LIST (#1239) Woosuk Kwon 2023-10-14 14:47:43 -07:00
  • de89472897 Fix the issue for AquilaChat2-* models (#1339) Lu Wang 2023-10-13 11:51:29 -07:00
  • e7c8555d06 Bump up transformers version & Remove MistralConfig (#1254) Woosuk Kwon 2023-10-13 10:05:26 -07:00
  • ec3b5ce9cc Improve detokenization performance (#1338) Antoni Baum 2023-10-13 09:59:07 -07:00
  • 6368e777a8 Add Aquila2 to README (#1331) ldwang 2023-10-13 03:11:16 +08:00
  • 875afe38ab Add blacklist in model checkpoint (#1325) Woosuk Kwon 2023-10-12 01:05:37 -07:00
  • ee8217e5be Add Mistral to quantization model list (#1278) amaleshvemula 2023-10-11 09:26:24 +02:00
  • 980dd4a2c4 Fix overflow in awq kernel (#1295) CHU Tianxiang 2023-10-11 15:19:53 +08:00
  • 8285736840 workaround of AWQ for Turing GPUs (#1252) twaka 2023-10-11 11:48:16 +09:00
  • 91fce82c6f change the timing of sorting logits (#1309) yhlskt23 2023-10-11 11:37:42 +09:00
  • ac5cf86aa6 Fix __repr__ of SequenceOutputs (#1311) Wang Ran (汪然) 2023-10-11 00:58:28 +08:00
  • 6a6119554c lock torch version to 2.0.1 (#1290) yanxiyue 2023-10-11 00:21:57 +08:00
  • b95ee898fe [Minor] Fix comment in mistral.py (#1303) Zhuohan Li 2023-10-09 19:44:37 -07:00
  • 9eed4d1f3e Update README.md (#1292) Zhuohan Li 2023-10-08 23:15:50 -07:00
  • 6b5296aa3a [FIX] Explain why the finished_reason of ignored sequences are length (#1289) Zhuohan Li 2023-10-08 15:22:38 -07:00
  • ee92b58b3a Move bfloat16 check to worker (#1259) Antoni Baum 2023-10-07 22:10:44 -07:00
  • 09ff7f106a API server support ipv4 / ipv6 dualstack (#1288) Yunfeng Bai 2023-10-07 15:15:54 -07:00
  • acbed3ef40 Use monotonic time where appropriate (#1249) Antoni Baum 2023-10-02 19:22:05 -07:00
  • 66d18a7fb0 add support for tokenizer revision (#1163) Federico Cassano 2023-10-02 22:19:46 -04:00
  • ba0bfd40e2 TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) Zhuohan Li 2023-10-02 15:36:09 -07:00
  • 84e4e37d14 [Minor] Fix type annotations (#1238) Woosuk Kwon 2023-10-02 15:28:31 -07:00
  • a60b353005 support sharding llama2-70b on more than 8 GPUs (#1209) Zhuohan Li 2023-10-02 15:26:33 -07:00
  • ebe4d1db3a Fix boundary check in paged attention kernel (#1241) Liang 2023-10-02 02:35:06 +08:00
  • b5a10eb0ef Added dtype arg to benchmarks (#1228) kg6-sleipnir 2023-10-01 00:04:03 -04:00
  • 0967102c6d fixing typo in tiiuae/falcon-rw-7b model name (#1226) Usama Ahmed 2023-09-29 23:40:25 +03:00
  • e2fb71ec9f Bump up the version to v0.2.0 (#1212) v0.2.0 Woosuk Kwon 2023-09-28 15:30:38 -07:00
  • f936657eb6 Provide default max model length (#1224) Woosuk Kwon 2023-09-28 14:44:02 -07:00
  • 6f88f762bf Fix OOM in attention kernel test (#1223) Woosuk Kwon 2023-09-28 14:33:24 -07:00
  • 202351d5bf Add Mistral to supported model list (#1221) Woosuk Kwon 2023-09-28 14:33:04 -07:00
  • 2e8e49fce3 [Fix] Remove false assertion (#1222) Woosuk Kwon 2023-09-28 10:52:38 -07:00
  • a8e98aee0c Fix Mistral model (#1220) Woosuk Kwon 2023-09-28 10:44:05 -07:00
  • bb1ba58f06 [Mistral] Mistral-7B-v0.1 support (#1196) Chris Bamford 2023-09-28 19:41:03 +02:00
  • 7bedab5748 Add rope_scaling to Qwen (#1210) Qing 2023-09-28 15:49:23 +08:00
  • 20f7cc4cde Add skip_special_tokens sampling params (#1186) Dan Lord 2023-09-27 19:21:42 -07:00
  • 649aa730c5 Use standard extras for uvicorn (#1166) Danilo Peixoto 2023-09-27 21:41:36 -03:00
  • a19bc5c628 Automatically configure max_num_batched_tokens (#1198) Woosuk Kwon 2023-09-27 16:34:00 -07:00
  • 28e616c4e3 fix qwen-14b model (#1173) Qing 2023-09-28 07:33:16 +08:00
  • 30e775281d fix typo (#1184) Wang Ran (汪然) 2023-09-28 07:22:45 +08:00
  • 21877b0d75 Support Longchat and RoPE scaling (#555) Lily Liu 2023-09-27 03:36:02 -07:00
  • cf5cb1e33e Allocate more shared memory to attention kernel (#1154) Antoni Baum 2023-09-26 22:27:13 -07:00
  • 03ffd0a022 Add comments on RoPE initialization (#1176) Woosuk Kwon 2023-09-26 10:48:33 -07:00
  • a425bd9a9a [Setup] Enable TORCH_CUDA_ARCH_LIST for selecting target GPUs (#1074) Woosuk Kwon 2023-09-26 10:21:08 -07:00
  • bbbf86565f Align max_tokens behavior with openai (#852) Wen Sun 2023-09-24 09:10:13 +08:00
  • 9f6be8692e Fix config for Falcon (#1164) Woosuk Kwon 2023-09-23 17:38:43 -07:00
  • f187877945 [FIX] Simplify sampler logic (#1156) Zhuohan Li 2023-09-23 17:21:56 -07:00
  • 947b794146 [Sampler] Vectorized sampling (simplified) (#1048) Zhuohan Li 2023-09-22 17:48:04 -07:00
  • 8d926e91f1 Announce the First vLLM Meetup (#1148) Woosuk Kwon 2023-09-22 11:37:14 -07:00
  • 4ee52bb169 Docs: Fix broken link to openai example (#1145) Nick Perez 2023-09-22 14:36:09 -04:00
  • 7d7e3b78a3 Use --ipc=host in docker run for distributed inference (#1125) Woosuk Kwon 2023-09-21 18:26:47 -07:00
  • f98b745a81 feat: support stop_token_ids parameter. (#1097) Ricardo Lu 2023-09-22 06:34:02 +08:00
  • 2d1e86f1b1 clean api code, remove redundant background task. (#1102) Roy 2023-09-22 04:25:05 +08:00
  • 1ac4ccf73c Add float16 and float32 (#1115) Woosuk Kwon 2023-09-21 00:52:47 -07:00
  • 2ac4d5e2bf Replace DtypeTensor (#1123) Woosuk Kwon 2023-09-21 00:51:47 -07:00
  • 3302f0aef3 rope_theta and max_position_embeddings from config (#1096) Antoni Baum 2023-09-20 13:35:11 -07:00
  • 6f2dd6c37e Add documentation to Triton server tutorial (#983) Tanmay Verma 2023-09-20 10:32:40 -07:00
  • bc0644574c Add gpu_memory_utilization and swap_space to LLM (#1090) Woosuk Kwon 2023-09-19 22:16:04 -07:00
  • 400b8289f7 Add pyarrow to dependencies & Print warning on Ray import error (#1094) Woosuk Kwon 2023-09-18 22:36:17 -07:00
  • c1026311b5 [Community] Add vLLM Discord server (#1086) Zhuohan Li 2023-09-18 12:23:35 -07:00
  • 2b1c116b5a Add minimum capability requirement for AWQ (#1064) Woosuk Kwon 2023-09-18 12:02:01 -07:00
  • cc796b1358 Convert before transpose (#1073) Woosuk Kwon 2023-09-18 11:51:48 -07:00
  • f029ef94d7 Fix get_max_num_running_seqs for waiting and swapped seq groups (#1068) Zhuohan Li 2023-09-18 11:49:40 -07:00
  • 95592fa00a align llm_engine and async_engine. (#1081) Roy 2023-09-19 02:49:10 +08:00
  • fbe66e1d0b added support for quantize on LLM module (#1080) orellavie1212 2023-09-18 21:04:21 +03:00
  • 90979c38f8 [FIX] Don't initialize parameter by default (#1067) Zhuohan Li 2023-09-17 17:15:38 -07:00
  • e21d7687a9 Fix hanging when prompt exceeds limit (#1029) 陈序 2023-09-17 16:48:56 +08:00
  • ff36139ffc Remove AsyncLLMEngine busy loop, shield background task (#1059) Antoni Baum 2023-09-17 00:29:08 -07:00
  • e3e79e9e8a Implement AWQ quantization support for LLaMA (#1032) Woosuk Kwon 2023-09-16 00:03:37 -07:00
  • b9fe4616f9 Abort when coroutine is cancelled (#1020) Jerry Yang 2023-09-15 08:40:18 +08:00
  • 64ca424e75 Fix warning message on LLaMA FastTokenizer (#1037) Woosuk Kwon 2023-09-14 17:33:32 -07:00
  • b5f93d0631 Only fail if logit_bias has actual values (#1045) Lukas Kreussel 2023-09-15 02:33:01 +02:00
  • a58936966f Add pandas to requirements.txt (#1047) Woosuk Kwon 2023-09-14 17:31:38 -07:00
  • dd54a4b026 Fix detokenization leaving special tokens (#1044) Antoni Baum 2023-09-14 16:37:03 -07:00
  • eda1a7cad3 Announce paper release (#1036) Woosuk Kwon 2023-09-13 17:38:13 -07:00
  • f04908cae7 [FIX] Minor bug fixes (#1035) Zhuohan Li 2023-09-13 16:38:12 -07:00
  • ab019eea75 Add Model Revision Support (#1014) Jasmond L 2023-09-14 06:20:02 +08:00
  • 9841d48a10 Use TGI-like incremental detokenization (#984) Antoni Baum 2023-09-13 13:38:01 -07:00
  • 3272d7a0b7 Fix typo in README.md (#1033) Ikko Eltociear Ashimine 2023-09-14 04:55:23 +09:00
  • 0bb1e885a0 Make max_model_len configurable (#972) Antoni Baum 2023-09-12 16:29:19 -07:00
  • d6545ad22e add option to shorten prompt print in log (#991) leiwen83 2023-09-13 06:10:14 +08:00
  • 90eb3f43ca Bump up the version to v0.1.7 (#1013) v0.1.7 Woosuk Kwon 2023-09-11 00:54:30 -07:00
  • e67b4f2c2a Use FP32 in RoPE initialization (#1004) Woosuk Kwon 2023-09-11 00:26:35 -07:00
  • d6770d1f23 Update setup.py (#1006) Woosuk Kwon 2023-09-10 23:42:45 -07:00
  • b9cecc2635 [Docs] Update installation page (#1005) Woosuk Kwon 2023-09-10 14:23:31 -07:00
  • 898285c9bf fix: CUDA error when inferencing with Falcon-40B base model (#992) Kyujin Cho 2023-09-10 17:39:02 +09:00
  • a62de9ecfd Fix wrong dtype in PagedAttentionWithALiBi bias (#996) Antoni Baum 2023-09-09 14:58:35 -07:00
  • 4042d192f5 fix "tansformers_module" ModuleNotFoundError when load model with trust_remote_code=True (#871) Jingru 2023-09-09 08:21:30 +08:00
  • 1117aa1411 Bump up the version to v0.1.6 (#989) v0.1.6 Zhuohan Li 2023-09-08 00:07:46 -07:00
  • 080438477f Start background task in AsyncLLMEngine.generate (#988) Antoni Baum 2023-09-08 00:03:39 -07:00
  • 4b5bcf8906 faster startup of vLLM (#982) Robert Irvine 2023-09-08 06:48:54 +01:00
  • 852ef5b4f5 Bump up the version to v0.1.5 (#944) v0.1.5 Woosuk Kwon 2023-09-08 08:15:31 +09:00
  • db09d4ad83 [FIX] Fix Alibi implementation in PagedAttention kernel (#945) Zhuohan Li 2023-09-07 15:53:14 -07:00
  • c957c741d9 Enable safetensors loading for all models (#974) Zhuohan Li 2023-09-07 15:49:52 -07:00
  • c07ece5ca4 Make AsyncLLMEngine more robust & fix batched abort (#969) Antoni Baum 2023-09-07 13:43:45 -07:00
  • 7a9c20c715 Bum up transformers version (#976) Woosuk Kwon 2023-09-08 05:15:53 +09:00
  • 005ba458b5 Set torch default dtype in a context manager (#971) Antoni Baum 2023-09-06 23:39:37 -07:00
  • 320a622ec4 [BugFix] Implement RoPE for GPT-J (#941) Woosuk Kwon 2023-09-06 11:54:33 +09:00
  • c9927c1a6a Use queue for finished requests (#957) Antoni Baum 2023-09-05 19:27:23 -07:00
  • fbd80ad409 Clean up kernel unit tests (#938) Woosuk Kwon 2023-09-06 08:57:38 +09:00
  • 22379d5513 fix: typo (#948) Wen Sun 2023-09-05 14:22:30 +08:00