Commit Graph

  • a1b9cb2a34 [BugFix] Fix recovery logic for sequence group (#2186) Woosuk Kwon 2023-12-20 21:52:37 -08:00
  • 3a4fd5ca59 Disable Ray usage stats collection (#2206) Woosuk Kwon 2023-12-20 21:52:08 -08:00
  • c17daa9f89 [Docs] Fix broken links (#2222) Ronen Schaffer 2023-12-20 22:43:42 +02:00
  • bd29cf3d3a Remove Sampler copy stream (#2209) Antoni Baum 2023-12-20 00:04:33 -08:00
  • 31bff69151 Make _prepare_sample non-blocking and use pinned memory for input buffers (#2207) Hanzhi Zhou 2023-12-19 16:52:46 -08:00
  • ba4f826738 [BugFix] Fix weight loading for Mixtral with TP (#2208) Woosuk Kwon 2023-12-19 16:16:11 -08:00
  • de60a3fb93 Added DeciLM-7b and DeciLM-7b-instruct (#2062) avideci 2023-12-19 12:29:33 +02:00
  • 21d5daa4ac Add warning on CUDA graph memory usage (#2182) Woosuk Kwon 2023-12-18 18:16:17 -08:00
  • 290e015c6c Update Help Text for --gpu-memory-utilization Argument (#2183) Suhong Moon 2023-12-18 14:33:24 -05:00
  • 1b7c791d60 [ROCm] Fixes for GPTQ on ROCm (#2180) kliuae 2023-12-19 02:41:04 +08:00
  • bbe4466fd9 [Minor] Fix typo (#2166) JohnSaxon 2023-12-18 15:28:49 +08:00
  • 08133c4d1a Add SSL arguments to API servers (#2109) Harry Mellor 2023-12-18 02:56:23 +00:00
  • 76a7983b23 [BugFix] Fix RoPE kernel on long sequences (#2164) Woosuk Kwon 2023-12-17 17:09:10 -08:00
  • 8041b7305e [BugFix] Raise error when max_model_len is larger than KV cache (#2163) Woosuk Kwon 2023-12-17 17:08:23 -08:00
  • 3ec8c25cd0 [Docs] Update documentation for gpu-memory-utilization option (#2162) Suhong Moon 2023-12-17 13:51:57 -05:00
  • 671af2b1c0 Bump up to v0.2.6 (#2157) v0.2.6 Woosuk Kwon 2023-12-17 10:34:56 -08:00
  • 6f41f0e377 Disable CUDA graph for SqueezeLLM (#2161) Woosuk Kwon 2023-12-17 10:24:25 -08:00
  • 2c9b638065 [Minor] Fix a typo in .pt weight support (#2160) Woosuk Kwon 2023-12-17 10:12:44 -08:00
  • a7347d9a6d Make sampler less blocking (#1889) Antoni Baum 2023-12-17 07:03:49 -08:00
  • f8c688d746 [Minor] Add Phi 2 to supported models (#2159) Woosuk Kwon 2023-12-17 02:54:57 -08:00
  • c9fadda543 [Minor] Fix xformers version (#2158) Woosuk Kwon 2023-12-17 02:28:02 -08:00
  • 30fb0956df [Minor] Add more detailed explanation on quantization argument (#2145) Woosuk Kwon 2023-12-17 01:56:16 -08:00
  • 3a765bd5e1 Temporarily enforce eager mode for GPTQ models (#2154) Woosuk Kwon 2023-12-17 01:51:12 -08:00
  • 26c52a5ea6 [Docs] Add CUDA graph support to docs (#2148) Woosuk Kwon 2023-12-17 01:49:20 -08:00
  • c3372e87be Remove dependency on CuPy (#2152) Woosuk Kwon 2023-12-17 01:49:07 -08:00
  • b0a1d667b0 Pin PyTorch & xformers versions (#2155) Woosuk Kwon 2023-12-17 01:46:54 -08:00
  • e1d5402238 Fix all-reduce memory usage (#2151) Woosuk Kwon 2023-12-17 01:44:45 -08:00
  • 3d1cfbfc74 [Minor] Delete Llama tokenizer warnings (#2146) Woosuk Kwon 2023-12-16 22:05:18 -08:00
  • 37ca558103 Optimize model execution with CUDA graph (#1926) Woosuk Kwon 2023-12-16 21:12:08 -08:00
  • eed74a558f Simplify weight loading logic (#2133) Roy 2023-12-17 04:41:23 +08:00
  • 2acd76f346 [ROCm] Temporarily remove GPTQ ROCm support (#2138) Woosuk Kwon 2023-12-15 17:13:58 -08:00
  • b81a6a6bb3 [Docs] Add supported quantization methods to docs (#2135) Woosuk Kwon 2023-12-15 13:29:22 -08:00
  • 0fbfc4b81b Add GPTQ support (#916) CHU Tianxiang 2023-12-15 19:04:22 +08:00
  • c06170cc8e Add a flag to include stop string in output text (#1976) Yunfeng Bai 2023-12-15 00:45:58 -08:00
  • 614856da25 Avoid multiple redefinition (#1817) Mingcan Xiang 2023-12-14 12:35:58 -05:00
  • 05bdf4eaf3 Fix Dockerfile.rocm (#2101) TJian 2023-12-14 16:45:58 +08:00
  • 6774bd50b0 Fix typing in AsyncLLMEngine & add toml to requirements-dev (#2100) mezuzza 2023-12-14 03:19:41 -05:00
  • 31c1f3255e Bump up to v0.2.5 (#2095) v0.2.5 Woosuk Kwon 2023-12-13 23:56:15 -08:00
  • 21d93c140d Optimize Mixtral with expert parallelism (#2090) Antoni Baum 2023-12-13 23:55:07 -08:00
  • f1c8520146 [BugFix] Fix input positions for long context with sliding window (#2088) Woosuk Kwon 2023-12-13 12:28:13 -08:00
  • 096827c284 [Docs] Add notes on ROCm-supported models (#2087) Woosuk Kwon 2023-12-13 09:45:34 -08:00
  • 6565d9e33e Update installation instruction for vLLM + CUDA 11.8 (#2086) Woosuk Kwon 2023-12-13 09:25:59 -08:00
  • f375ec8440 [ROCm] Upgrade xformers version for ROCm & update doc (#2079) TJian 2023-12-13 16:56:05 +08:00
  • 518369d78c Implement lazy model loader (#2044) Woosuk Kwon 2023-12-12 22:21:45 -08:00
  • 30bad5c492 Fix peak memory profiling (#2031) Woosuk Kwon 2023-12-12 22:01:53 -08:00
  • 3fefe271ec Update Dockerfile to build Megablocks (#2042) Simon Mo 2023-12-12 17:34:17 -08:00
  • 6428f1d051 Support MPT with GQA (#1938) Megha Agarwal 2023-12-12 10:16:05 -08:00
  • 7e1b21daac Remove einops from requirements (#2049) Woosuk Kwon 2023-12-12 09:34:09 -08:00
  • cb3f30c600 Upgrade transformers version to 4.36.0 (#2046) Woosuk Kwon 2023-12-11 18:39:14 -08:00
  • f3e024bece [CI/CD] Upgrade PyTorch version to v2.1.1 (#2045) Woosuk Kwon 2023-12-11 17:48:11 -08:00
  • 31d2ab4aff Remove python 3.10 requirement (#2040) Woosuk Kwon 2023-12-11 12:26:42 -08:00
  • eb17212858 Update Dockerfile to support Mixtral (#2027) Simon Mo 2023-12-11 11:59:08 -08:00
  • 4dd4b5c538 Bump up to v0.2.4 (#2034) v0.2.4 Woosuk Kwon 2023-12-11 11:49:39 -08:00
  • 6120e5aaea Fix import error msg for megablocks (#2038) Woosuk Kwon 2023-12-11 11:40:56 -08:00
  • 2eaa81b236 Update README.md to add megablocks requirement for mixtral (#2033) Ram 2023-12-12 01:07:34 +05:30
  • 81ce2a4b26 [Minor] Fix type annotation in Mixtral (#2036) Woosuk Kwon 2023-12-11 11:32:39 -08:00
  • 5dd80d3777 Fix latency benchmark script (#2035) Woosuk Kwon 2023-12-11 11:19:08 -08:00
  • beeee69bc9 Revert adding Megablocks (#2030) Woosuk Kwon 2023-12-11 10:49:00 -08:00
  • 9bf28d0b69 Update requirements.txt for mixtral (#2029) Ram 2023-12-12 00:09:29 +05:30
  • c0ce15dfb2 Update run_on_sky.rst (#2025) Ikko Eltociear Ashimine 2023-12-12 03:32:58 +09:00
  • b9bcdc7158 Change the load format to pt for Mixtral (#2028) Woosuk Kwon 2023-12-11 10:32:17 -08:00
  • 4ff0203987 Minor fixes for Mixtral (#2015) Woosuk Kwon 2023-12-11 09:16:15 -08:00
  • b5f882cc98 Mixtral 8x7B support (#2011) Pierre Stock 2023-12-11 10:09:15 +01:00
  • 2e8fc0d4c3 Fix completion API echo and logprob combo (#1992) Simon Mo 2023-12-10 13:20:30 -08:00
  • dacaf5a400 Replace head_mapping params with num_kv_heads to attention kernel. (#1997) wbn 2023-12-11 02:12:53 +08:00
  • 24cde76a15 [Minor] Add comment on skipping rope caches (#2004) Woosuk Kwon 2023-12-10 10:04:12 -08:00
  • 1aa1361510 Fix OpenAI server completion_tokens referenced before assignment (#1996) Jin Shang 2023-12-10 13:01:21 +08:00
  • fe470ae5ad [Minor] Fix code style for baichuan (#2003) Woosuk Kwon 2023-12-09 19:24:29 -08:00
  • 3a8c2381f7 Fix for KeyError on Loading LLaMA (#1978) Jun Gao 2023-12-10 07:59:57 +08:00
  • c85b80c2b6 [Docker] Add cuda arch list as build option (#1950) Simon Mo 2023-12-08 09:53:47 -08:00
  • 2b981012a6 Fix Baichuan2-7B-Chat (#1987) firebook 2023-12-09 01:38:36 +08:00
  • 6ccc0bfffb Merge EmbeddedLLM/vllm-rocm into vLLM main (#1836) TJian 2023-12-08 15:16:52 +08:00
  • c8e7eb1eb3 fix typo in getenv call (#1972) Daya Khudia 2023-12-07 16:04:41 -08:00
  • 24f60a54f4 [Docker] Adding number of nvcc_threads during build as envar (#1893) AguirreNicolas 2023-12-07 16:00:32 -03:00
  • 42c02f5892 Fix quickstart.rst typo jinja (#1964) gottlike 2023-12-07 17:34:44 +01:00
  • ebede26ebf Make InternLM follow rope_scaling in config.json (#1956) Jie Li 2023-12-08 00:32:08 +08:00
  • d940ce497e Fix typo in adding_model.rst (#1947) Peter Götz 2023-12-06 19:04:26 +01:00
  • 05ff90b692 Save pytorch profiler output for latency benchmark (#1871) Antoni Baum 2023-12-05 20:55:55 -08:00
  • 1d9b737e05 Support ChatGLMForConditionalGeneration (#1932) dancingpipi 2023-12-06 02:52:48 +08:00
  • 60dc62dc9e add custom server params (#1868) Roy 2023-12-04 04:59:18 +08:00
  • 0f90effc66 Bump up to v0.2.3 (#1903) v0.2.3 Woosuk Kwon 2023-12-03 12:27:47 -08:00
  • 464dd985e3 Fix num_gpus when TP > 1 (#1852) Woosuk Kwon 2023-12-03 12:24:30 -08:00
  • c07a442854 chore(examples-docs): upgrade to OpenAI V1 (#1785) Massimiliano Pronesti 2023-12-03 10:11:22 +01:00
  • cd3aa153a4 Fix broken worker test (#1900) Woosuk Kwon 2023-12-02 22:17:33 -08:00
  • 9b294976a2 Add PyTorch-native implementation of custom layers (#1898) Woosuk Kwon 2023-12-02 21:18:40 -08:00
  • 5313c2cb8b Add Production Metrics in Prometheus format (#1890) Simon Mo 2023-12-02 16:37:44 -08:00
  • 5f09cbdb63 Fix broken sampler tests (#1896) Woosuk Kwon 2023-12-02 16:06:17 -08:00
  • 4cefa9b49b [Docs] Update the AWQ documentation to highlight performance issue (#1883) Simon Mo 2023-12-02 15:52:47 -08:00
  • f86bd6190a Fix the typo in SamplingParams' docstring (#1886) Jerry 2023-12-01 18:06:36 +08:00
  • e5452ddfd6 Normalize head weights for Baichuan 2 (#1876) Woosuk Kwon 2023-11-30 20:03:58 -08:00
  • d06980dfa7 Fix Baichuan tokenizer error (#1874) Woosuk Kwon 2023-11-30 18:35:50 -08:00
  • 66785cc05c Support chat template and echo for chat API (#1756) Adam Brusselback 2023-11-30 19:43:13 -05:00
  • 05a38612b0 docs: add instruction for langchain (#1162) Massimiliano Pronesti 2023-11-30 19:57:44 +01:00
  • d27f4bae39 Fix rope cache key error (#1867) Roy 2023-12-01 00:29:28 +08:00
  • 8d8c2f6ffe Support max-model-len argument for throughput benchmark (#1858) aisensiy 2023-12-01 00:10:24 +08:00
  • 51d3cb951d Remove max_num_seqs in latency benchmark script (#1855) Woosuk Kwon 2023-11-30 00:00:32 -08:00
  • e74b1736a1 Add profile option to latency benchmark script (#1839) Woosuk Kwon 2023-11-29 23:42:52 -08:00
  • f07c1ceaa5 [FIX] Fix docker build error (#1831) (#1832) Allen 2023-11-30 15:06:50 +08:00
  • 63b2206ad0 Avoid multiple instantiations of the RoPE class (#1828) Jee Li 2023-11-30 15:06:27 +08:00
  • 27feead2f8 Refactor Worker & InputMetadata (#1843) Woosuk Kwon 2023-11-29 22:16:37 -08:00