Commit Graph

  • 27ca23dc00 Remove exclude_unset in streaming response (#3143) Seonghyeon 2024-03-02 02:59:06 +09:00
  • 54d3544784 Fix: Output text is always truncated in some models (#3016) Sherry 2024-03-01 15:52:22 +08:00
  • 703e42ee4b Add guided decoding for OpenAI API server (#2819) felixzhu555 2024-02-29 14:13:08 -08:00
  • 29a8d6a554 [Fix] Don't deep-copy LogitsProcessors when copying SamplingParams (#3099) Nick Hill 2024-02-29 11:20:42 -08:00
  • 2c08ff23c0 Fix building from source on WSL (#3112) Billy Cao 2024-03-01 03:13:58 +08:00
  • bfdcfa6a05 Support starcoder2 architecture (#3089) Seonghyeon 2024-02-29 17:51:48 +09:00
  • 9289e577ec add cache_config's info to prometheus metrics. (#3100) Allen.Dou 2024-02-29 14:15:18 +08:00
  • a6d471c759 Fix: AttributeError in OpenAI-compatible server (#3018) Jae-Won Chung 2024-02-29 01:04:07 -05:00
  • 01a5d18a53 Add Support for 2/3/8-bit GPTQ Quantization Models (#2330) CHU Tianxiang 2024-02-29 13:52:23 +08:00
  • 929b4f2973 Add LoRA support for Gemma (#3050) Woosuk Kwon 2024-02-28 13:03:28 -08:00
  • 3b7178cfa4 [Neuron] Support inference with transformers-neuronx (#2569) Liangfu Chen 2024-02-28 09:34:34 -08:00
  • e46fa5d52e Restrict prometheus_client >= 0.18.0 to prevent errors when importing pkgs (#3070) Allen.Dou 2024-02-28 13:38:26 +08:00
  • a8683102cc multi-lora documentation fix (#3064) Ganesh Jagadeesan 2024-02-28 00:26:15 -05:00
  • 71bcaf99e2 Enable GQA support in the prefix prefill kernels (#3007) Tao He 2024-02-27 17:14:31 +08:00
  • 8b430d7dea [Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM (#3046) Woosuk Kwon 2024-02-26 20:23:50 -08:00
  • e0ade06d63 Support logit bias for OpenAI API (#3027) Dylan Hawk 2024-02-26 19:51:53 -08:00
  • 4bd18ec0c7 [Minor] Fix type annotation in fused moe (#3045) Woosuk Kwon 2024-02-26 19:44:29 -08:00
  • 2410e320b3 fix get_ip error in pure ipv6 environment (#2931) Jingru 2024-02-27 11:22:16 +08:00
  • 48a8f4a7fd Support Orion model (#2539) 张大成 2024-02-27 11:17:06 +08:00
  • 4dd6416faf Fix stablelm (#3038) Roy 2024-02-27 10:31:10 +08:00
  • c1c0d00b88 Don't use cupy when enforce_eager=True (#3037) Roy 2024-02-27 09:33:38 +08:00
  • d9f726c4d0 [Minor] Remove unused config files (#3039) Roy 2024-02-27 09:25:22 +08:00
  • d6e4a130b0 [Minor] Remove gather_cached_kv kernel (#3043) Woosuk Kwon 2024-02-26 15:00:54 -08:00
  • cfc15a1031 Optimize Triton MoE Kernel (#2979) Philipp Moritz 2024-02-26 13:48:56 -08:00
  • 70f3e8e3a1 Add LogProbs for Chat Completions in OpenAI (#2918) Jared Moore 2024-02-25 18:39:34 -08:00
  • ef978fe411 Port metrics from aioprometheus to prometheus_client (#2730) Harry Mellor 2024-02-25 19:54:00 +00:00
  • f7c1234990 [Fix] Fix assertion on YaRN model len (#2984) Woosuk Kwon 2024-02-23 12:57:48 -08:00
  • 57f044945f Fix nvcc not found in vllm-openai image (#2781) zhaoyang-star 2024-02-23 06:25:07 +08:00
  • 4caf7044e0 Include tokens from prompt phase in counter_generation_tokens (#2802) Ronen Schaffer 2024-02-23 00:00:12 +02:00
  • 6f32cddf1c Remove Flash Attention in test env (#2982) Woosuk Kwon 2024-02-22 09:58:29 -08:00
  • c530e2cfe3 [FIX] Fix a bug in initializing Yarn RoPE (#2983) 44670 2024-02-22 17:40:05 +08:00
  • fd5dcc5c81 Optimize GeGLU layer in Gemma (#2975) Woosuk Kwon 2024-02-21 20:17:52 -08:00
  • 93dc5a2870 chore(vllm): codespell for spell checking (#2820) Massimiliano Pronesti 2024-02-22 02:56:01 +00:00
  • 95529e3253 Use Llama RMSNorm custom op for Gemma (#2974) Woosuk Kwon 2024-02-21 18:28:23 -08:00
  • 344020c926 Migrate MistralForCausalLM to LlamaForCausalLM (#2868) Roy 2024-02-22 10:25:05 +08:00
  • 5574081c49 Added early stopping to completion APIs (#2939) Mustafa Eyceoz 2024-02-21 21:24:01 -05:00
  • d7f396486e Update comment (#2934) Ronen Schaffer 2024-02-22 04:18:37 +02:00
  • 8fbd84bf78 Bump up version to v0.3.2 (#2968) v0.3.2 Zhuohan Li 2024-02-21 11:47:25 -08:00
  • 7d2dcce175 Support per-request seed (#2514) Nick Hill 2024-02-21 11:47:00 -08:00
  • dc903e70ac [ROCm] Upgrade transformers to v4.38.0 (#2967) Woosuk Kwon 2024-02-21 09:46:57 -08:00
  • a9c8212895 [FIX] Add Gemma model to the doc (#2966) Zhuohan Li 2024-02-21 09:46:15 -08:00
  • c20ecb6a51 Upgrade transformers to v4.38.0 (#2965) Woosuk Kwon 2024-02-21 09:38:03 -08:00
  • 5253edaacb Add Gemma model (#2964) Xiang Xu 2024-02-21 09:34:30 -08:00
  • 017d9f1515 Add metrics to RequestOutput (#2876) Antoni Baum 2024-02-20 21:55:57 -08:00
  • 181b27d881 Make vLLM logging formatting optional (#2877) Antoni Baum 2024-02-20 14:38:55 -08:00
  • 63e2a6419d [FIX] Fix beam search test (#2930) Zhuohan Li 2024-02-20 14:37:39 -08:00
  • 264017a2bf [ROCm] include gfx908 as supported (#2792) James Whedbee 2024-02-19 19:58:59 -06:00
  • e433c115bc Fix vllm:prompt_tokens_total metric calculation (#2869) Ronen Schaffer 2024-02-19 09:55:41 +02:00
  • 86fd8bb0ac Add warning to prevent changes to benchmark api server (#2858) Simon Mo 2024-02-18 21:36:19 -08:00
  • ab3a5a8259 Support OLMo models. (#2832) Isotr0py 2024-02-19 13:05:15 +08:00
  • a61f0521b8 [Test] Add basic correctness test (#2908) Zhuohan Li 2024-02-18 16:44:50 -08:00
  • 537c9755a7 [Minor] Small fix to make distributed init logic in worker looks cleaner (#2905) Zhuohan Li 2024-02-18 14:39:00 -08:00
  • 786b7f18a5 Add code-revision config argument for Hugging Face Hub (#2892) Mark Mozolewski 2024-02-17 22:36:53 -08:00
  • 8f36444c4f multi-LoRA as extra models in OpenAI server (#2775) jvmncs 2024-02-17 15:00:48 -05:00
  • 185b2c29e2 Defensively copy sampling_params (#2881) Nick Hill 2024-02-17 11:18:04 -08:00
  • 5f08050d8d Bump up to v0.3.1 (#2887) v0.3.1 Woosuk Kwon 2024-02-16 15:05:18 -08:00
  • 64da65b322 Prefix Caching- fix t4 triton error (#2517) shiyi.c_98 2024-02-16 14:17:55 -08:00
  • 5255d99dc5 [ROCm] Dockerfile fix for flash-attention build (#2885) Hongxia Yang 2024-02-15 13:22:39 -05:00
  • 4f2ad11135 Fix DeciLM (#2883) Philipp Moritz 2024-02-14 22:29:57 -08:00
  • d7afab6d3a [BugFix] Fix GC bug for LLM class (#2882) Woosuk Kwon 2024-02-14 22:17:44 -08:00
  • 31348dff03 Align LoRA code between Mistral and Mixtral (fixes #2875) (#2880) Philipp Moritz 2024-02-14 16:00:43 -08:00
  • 25e86b6a61 Don't use cupy NCCL for AMD backends (#2855) Woosuk Kwon 2024-02-14 12:30:44 -08:00
  • 4efbac6d35 Migrate AquilaForCausalLM to LlamaForCausalLM (#2867) Roy 2024-02-15 04:30:24 +08:00
  • 87069ccf68 Fix docker python version (#2845) Nikola Borisov 2024-02-14 10:17:57 -08:00
  • 7e45107f51 [Fix] Fix memory profiling when GPU is used by multiple processes (#2863) Woosuk Kwon 2024-02-13 19:52:34 -08:00
  • 0c48b37c31 Fix internlm after https://github.com/vllm-project/vllm/pull/2860 (#2861) Philipp Moritz 2024-02-13 18:01:15 -08:00
  • 7eacffd951 Migrate InternLMForCausalLM to LlamaForCausalLM (#2860) Philipp Moritz 2024-02-13 17:12:05 -08:00
  • 2a543d6efe Add LoRA support for Mixtral (#2831) Terry 2024-02-13 15:55:45 -08:00
  • 317b29de0f Remove Yi model definition, please use LlamaForCausalLM instead (#2854) Philipp Moritz 2024-02-13 14:22:22 -08:00
  • a463c333dd Use CuPy for CUDA graphs (#2811) Woosuk Kwon 2024-02-13 11:32:06 -08:00
  • ea356004d4 Revert "Refactor llama family models (#2637)" (#2851) Philipp Moritz 2024-02-13 09:24:59 -08:00
  • 5c976a7e1a Refactor llama family models (#2637) Roy 2024-02-13 16:09:23 +08:00
  • f964493274 [CI] Ensure documentation build is checked in CI (#2842) Simon Mo 2024-02-12 22:53:07 -08:00
  • a4211a4dc3 Serving Benchmark Refactoring (#2433) Roger Wang 2024-02-12 22:53:00 -08:00
  • 563836496a Refactor 2 awq gemm kernels into m16nXk32 (#2723) Rex 2024-02-12 11:02:17 -08:00
  • 4ca2c358b1 Add documentation section about LoRA (#2834) Philipp Moritz 2024-02-12 08:24:45 -08:00
  • 0580aab02f [ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention (#2768) Hongxia Yang 2024-02-11 02:14:37 -05:00
  • 3711811b1d Disable custom all reduce by default (#2808) Woosuk Kwon 2024-02-08 09:58:03 -08:00
  • 65b89d16ee [Ray] Integration compiled DAG off by default (#2471) SangBin Cho 2024-02-09 02:57:25 +09:00
  • 931746bc6d Add documentation on how to do incremental builds (#2796) Philipp Moritz 2024-02-07 14:42:02 -08:00
  • c81dddb45c [ROCm] Fix build problem resulted from previous commit related to FP8 kv-cache support (#2790) Hongxia Yang 2024-02-07 01:36:59 -05:00
  • fe6d09ae61 [Minor] More fix of test_cache.py CI test failure (#2750) Lily Liu 2024-02-06 11:38:38 -08:00
  • ed70c70ea3 modelscope: fix issue when model parameter is not a model id but path of the model. (#2489) liuyhwangyh 2024-02-07 01:57:15 +08:00
  • f0d4e14557 Add fused top-K softmax kernel for MoE (#2769) Woosuk Kwon 2024-02-05 17:38:02 -08:00
  • 2ccee3def6 [ROCm] Fixup arch checks for ROCM (#2627) Douglas Lehr 2024-02-05 16:59:09 -06:00
  • b92adec8e8 Set local logging level via env variable (#2774) Lukas 2024-02-05 23:26:50 +01:00
  • 56f738ae9b [ROCm] Fix some kernels failed unit tests (#2498) Hongxia Yang 2024-02-05 17:25:36 -05:00
  • 72d3a30c63 [Minor] Fix benchmark_latency script (#2765) Woosuk Kwon 2024-02-05 12:45:37 -08:00
  • c9b45adeeb Require triton >= 2.1.0 (#2746) whyiug 2024-02-05 15:07:36 +08:00
  • 5a6c81b051 Remove eos tokens from output by default (#2611) Rex 2024-02-04 14:32:42 -08:00
  • 51cd22ce56 set&get llm internal tokenizer instead of the TokenizerGroup (#2741) dancingpipi 2024-02-05 06:25:36 +08:00
  • 5ed704ec8c docs: fix langchain (#2736) Massimiliano Pronesti 2024-02-04 03:17:55 +01:00
  • 4abf6336ec Add one example to run batch inference distributed on Ray (#2696) Cheng Su 2024-02-02 15:41:42 -08:00
  • 0e163fce18 Fix default length_penalty to 1.0 (#2667) zspo 2024-02-02 07:59:39 +08:00
  • 96b6f475dd Remove hardcoded device="cuda" to support more devices (#2503) Kunshang Ji 2024-02-02 07:46:39 +08:00
  • c410f5d020 Use revision when downloading the quantization config file (#2697) Pernekhan Utemuratov 2024-02-01 15:41:58 -08:00
  • bb8c697ee0 Update README for meetup slides (#2718) Simon Mo 2024-02-01 14:56:53 -08:00
  • b9e96b17de fix python 3.8 syntax (#2716) Simon Mo 2024-02-01 14:00:58 -08:00
  • 923797fea4 Fix compile error when using rocm (#2648) zhaoyang-star 2024-02-02 01:35:09 +08:00
  • cd9e60c76c Add Internlm2 (#2666) Fengzhe Zhou 2024-02-02 01:27:40 +08:00