Commit Graph

800 Commits

Author SHA1 Message Date
Jae-Won Chung
a6d471c759 Fix: AttributeError in OpenAI-compatible server (#3018) 2024-02-28 22:04:07 -08:00
CHU Tianxiang
01a5d18a53 Add Support for 2/3/8-bit GPTQ Quantization Models (#2330) 2024-02-28 21:52:23 -08:00
Woosuk Kwon
929b4f2973 Add LoRA support for Gemma (#3050) 2024-02-28 13:03:28 -08:00
Liangfu Chen
3b7178cfa4 [Neuron] Support inference with transformers-neuronx (#2569) 2024-02-28 09:34:34 -08:00
Allen.Dou
e46fa5d52e Restrict prometheus_client >= 0.18.0 to prevent errors when importing pkgs (#3070) 2024-02-28 05:38:26 +00:00
Ganesh Jagadeesan
a8683102cc multi-lora documentation fix (#3064) 2024-02-27 21:26:15 -08:00
Tao He
71bcaf99e2 Enable GQA support in the prefix prefill kernels (#3007)
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-02-27 01:14:31 -08:00
Woosuk Kwon
8b430d7dea [Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM (#3046) 2024-02-26 20:23:50 -08:00
Dylan Hawk
e0ade06d63 Support logit bias for OpenAI API (#3027) 2024-02-27 11:51:53 +08:00
Woosuk Kwon
4bd18ec0c7 [Minor] Fix type annotation in fused moe (#3045) 2024-02-26 19:44:29 -08:00
Jingru
2410e320b3 fix get_ip error in pure ipv6 environment (#2931) 2024-02-26 19:22:16 -08:00
张大成
48a8f4a7fd Support Orion model (#2539)
Co-authored-by: zhangdacheng <zhangdacheng@ainirobot.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-02-26 19:17:06 -08:00
Roy
4dd6416faf Fix stablelm (#3038) 2024-02-26 18:31:10 -08:00
Roy
c1c0d00b88 Don't use cupy when enforce_eager=True (#3037) 2024-02-26 17:33:38 -08:00
Roy
d9f726c4d0 [Minor] Remove unused config files (#3039) 2024-02-26 17:25:22 -08:00
Woosuk Kwon
d6e4a130b0 [Minor] Remove gather_cached_kv kernel (#3043) 2024-02-26 15:00:54 -08:00
Philipp Moritz
cfc15a1031 Optimize Triton MoE Kernel (#2979)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-02-26 13:48:56 -08:00
Jared Moore
70f3e8e3a1 Add LogProbs for Chat Completions in OpenAI (#2918) 2024-02-26 10:39:34 +08:00
Harry Mellor
ef978fe411 Port metrics from aioprometheus to prometheus_client (#2730) 2024-02-25 11:54:00 -08:00
Woosuk Kwon
f7c1234990 [Fix] Fix assertion on YaRN model len (#2984) 2024-02-23 12:57:48 -08:00
zhaoyang-star
57f044945f Fix nvcc not found in vllm-openai image (#2781) 2024-02-22 14:25:07 -08:00
Ronen Schaffer
4caf7044e0 Include tokens from prompt phase in counter_generation_tokens (#2802) 2024-02-22 14:00:12 -08:00
Woosuk Kwon
6f32cddf1c Remove Flash Attention in test env (#2982) 2024-02-22 09:58:29 -08:00
44670
c530e2cfe3 [FIX] Fix a bug in initializing Yarn RoPE (#2983) 2024-02-22 01:40:05 -08:00
Woosuk Kwon
fd5dcc5c81 Optimize GeGLU layer in Gemma (#2975) 2024-02-21 20:17:52 -08:00
Massimiliano Pronesti
93dc5a2870 chore(vllm): codespell for spell checking (#2820) 2024-02-21 18:56:01 -08:00
Woosuk Kwon
95529e3253 Use Llama RMSNorm custom op for Gemma (#2974) 2024-02-21 18:28:23 -08:00
Roy
344020c926 Migrate MistralForCausalLM to LlamaForCausalLM (#2868) 2024-02-21 18:25:05 -08:00
Mustafa Eyceoz
5574081c49 Added early stopping to completion APIs (#2939) 2024-02-21 18:24:01 -08:00
Ronen Schaffer
d7f396486e Update comment (#2934) 2024-02-21 18:18:37 -08:00
Zhuohan Li
8fbd84bf78 Bump up version to v0.3.2 (#2968)
This version is for more model support. Add support for Gemma models (#2964) and OLMo models (#2832).
v0.3.2
2024-02-21 11:47:25 -08:00
Nick Hill
7d2dcce175 Support per-request seed (#2514) 2024-02-21 11:47:00 -08:00
Woosuk Kwon
dc903e70ac [ROCm] Upgrade transformers to v4.38.0 (#2967) 2024-02-21 09:46:57 -08:00
Zhuohan Li
a9c8212895 [FIX] Add Gemma model to the doc (#2966) 2024-02-21 09:46:15 -08:00
Woosuk Kwon
c20ecb6a51 Upgrade transformers to v4.38.0 (#2965) 2024-02-21 09:38:03 -08:00
Xiang Xu
5253edaacb Add Gemma model (#2964) 2024-02-21 09:34:30 -08:00
Antoni Baum
017d9f1515 Add metrics to RequestOutput (#2876) 2024-02-20 21:55:57 -08:00
Antoni Baum
181b27d881 Make vLLM logging formatting optional (#2877) 2024-02-20 14:38:55 -08:00
Zhuohan Li
63e2a6419d [FIX] Fix beam search test (#2930) 2024-02-20 14:37:39 -08:00
James Whedbee
264017a2bf [ROCm] include gfx908 as supported (#2792) 2024-02-19 17:58:59 -08:00
Ronen Schaffer
e433c115bc Fix vllm:prompt_tokens_total metric calculation (#2869) 2024-02-18 23:55:41 -08:00
Simon Mo
86fd8bb0ac Add warning to prevent changes to benchmark api server (#2858) 2024-02-18 21:36:19 -08:00
Isotr0py
ab3a5a8259 Support OLMo models. (#2832) 2024-02-18 21:05:15 -08:00
Zhuohan Li
a61f0521b8 [Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00
Zhuohan Li
537c9755a7 [Minor] Small fix to make distributed init logic in worker look cleaner (#2905) 2024-02-18 14:39:00 -08:00
Mark Mozolewski
786b7f18a5 Add code-revision config argument for Hugging Face Hub (#2892) 2024-02-17 22:36:53 -08:00
jvmncs
8f36444c4f multi-LoRA as extra models in OpenAI server (#2775)
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.openai.api_server \
 --model meta-llama/Llama-2-7b-hf \
 --enable-lora \
 --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
The above server will list three separate entries if the user queries `/models`: one for the base served model, and one for each of the specified LoRA modules. In this case sql-lora and sql-lora2 point to the same underlying LoRA, but this need not be the case. LoRA config values take the same values they do in EngineArgs.

No work has been done here to scope client permissions to specific models.
2024-02-17 12:00:48 -08:00
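The model listing described in commit 8f36444c4f can be checked from a client. The following is a minimal sketch, not part of the commit: it assumes the server is reachable at `http://localhost:8000/v1` and that it returns an OpenAI-style model list (an object with a `data` array whose entries carry an `id`). The helper names `model_ids` and `list_models` are hypothetical, and the sample payload is illustrative only, not captured server output.

```python
import json
from typing import Any, Dict, List
from urllib.request import urlopen


def model_ids(payload: Dict[str, Any]) -> List[str]:
    """Extract model ids from an OpenAI-style model-list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = "http://localhost:8000/v1") -> List[str]:
    """Query a running server; base_url is an assumed default, adjust as needed."""
    with urlopen(f"{base_url}/models") as response:
        return model_ids(json.load(response))


# Illustrative payload only (assumed shape, not real server output):
# one entry for the base model and one per LoRA module.
sample = {
    "object": "list",
    "data": [
        {"id": "meta-llama/Llama-2-7b-hf", "object": "model"},
        {"id": "sql-lora", "object": "model"},
        {"id": "sql-lora2", "object": "model"},
    ],
}
print(model_ids(sample))
```

With a server running, `list_models()` would be called instead of parsing the sample; a request then selects an adapter by passing one of the listed names as its `model` field.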
Nick Hill
185b2c29e2 Defensively copy sampling_params (#2881)
If the SamplingParams object passed to LLMEngine.add_request() is mutated after it returns, it could affect the async sampling process for that request.

Suggested by @Yard1 https://github.com/vllm-project/vllm/pull/2514#discussion_r1490106059
2024-02-17 11:18:04 -08:00
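The hazard described in commit 185b2c29e2 can be illustrated with a small, self-contained sketch. `ToySamplingParams` and `ToyEngine` are hypothetical stand-ins, not vLLM's `SamplingParams` or `LLMEngine`, and the use of `copy.deepcopy` is an assumption about how such a copy might be made rather than a description of the actual change.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToySamplingParams:
    """Hypothetical stand-in for a sampling-parameters object."""
    temperature: float = 1.0
    stop: List[str] = field(default_factory=list)


class ToyEngine:
    """Hypothetical stand-in for an engine; it only stores requests."""

    def __init__(self) -> None:
        self.requests: Dict[str, ToySamplingParams] = {}

    def add_request(self, request_id: str, params: ToySamplingParams) -> None:
        # Deep-copy so that later mutation by the caller cannot reach
        # the parameters the engine keeps for this request.
        self.requests[request_id] = copy.deepcopy(params)


engine = ToyEngine()
params = ToySamplingParams(temperature=0.7, stop=["\n"])
engine.add_request("req-0", params)

# The caller reuses and mutates the object after add_request() returns.
params.temperature = 0.0
params.stop.append("###")

stored = engine.requests["req-0"]
print(stored.temperature, stored.stop)
```

A deep copy is used here instead of a shallow one because the object holds a mutable list; a shallow copy would still share `stop` with the caller.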
Woosuk Kwon
5f08050d8d Bump up to v0.3.1 (#2887)
v0.3.1
2024-02-16 15:05:18 -08:00
shiyi.c_98
64da65b322 Prefix Caching- fix t4 triton error (#2517) 2024-02-16 14:17:55 -08:00