biondizzle/vllm - vllm - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Jae-Won Chung	a6d471c759	Fix: `AttributeError` in OpenAI-compatible server (#3018 )	2024-02-28 22:04:07 -08:00
CHU Tianxiang	01a5d18a53	Add Support for 2/3/8-bit GPTQ Quantization Models (#2330 )	2024-02-28 21:52:23 -08:00
Woosuk Kwon	929b4f2973	Add LoRA support for Gemma (#3050 )	2024-02-28 13:03:28 -08:00
Liangfu Chen	3b7178cfa4	[Neuron] Support inference with transformers-neuronx (#2569 )	2024-02-28 09:34:34 -08:00
Allen.Dou	e46fa5d52e	Restrict prometheus_client >= 0.18.0 to prevent errors when importing pkgs (#3070 )	2024-02-28 05:38:26 +00:00
Ganesh Jagadeesan	a8683102cc	multi-lora documentation fix (#3064 )	2024-02-27 21:26:15 -08:00
Tao He	71bcaf99e2	Enable GQA support in the prefix prefill kernels (#3007 ) Signed-off-by: Tao He <sighingnow@gmail.com>	2024-02-27 01:14:31 -08:00
Woosuk Kwon	8b430d7dea	[Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM (#3046 )	2024-02-26 20:23:50 -08:00
Dylan Hawk	e0ade06d63	Support logit bias for OpenAI API (#3027 )	2024-02-27 11:51:53 +08:00
Woosuk Kwon	4bd18ec0c7	[Minor] Fix type annotation in fused moe (#3045 )	2024-02-26 19:44:29 -08:00
Jingru	2410e320b3	fix `get_ip` error in pure ipv6 environment (#2931 )	2024-02-26 19:22:16 -08:00
张大成	48a8f4a7fd	Support Orion model (#2539 ) Co-authored-by: zhangdacheng <zhangdacheng@ainirobot.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-02-26 19:17:06 -08:00
Roy	4dd6416faf	Fix stablelm (#3038 )	2024-02-26 18:31:10 -08:00
Roy	c1c0d00b88	Don't use cupy when `enforce_eager=True` (#3037 )	2024-02-26 17:33:38 -08:00
Roy	d9f726c4d0	[Minor] Remove unused config files (#3039 )	2024-02-26 17:25:22 -08:00
Woosuk Kwon	d6e4a130b0	[Minor] Remove gather_cached_kv kernel (#3043 )	2024-02-26 15:00:54 -08:00
Philipp Moritz	cfc15a1031	Optimize Triton MoE Kernel (#2979 ) Co-authored-by: Cade Daniel <edacih@gmail.com>	2024-02-26 13:48:56 -08:00
Jared Moore	70f3e8e3a1	Add LogProbs for Chat Completions in OpenAI (#2918 )	2024-02-26 10:39:34 +08:00
Harry Mellor	ef978fe411	Port metrics from `aioprometheus` to `prometheus_client` (#2730 )	2024-02-25 11:54:00 -08:00
Woosuk Kwon	f7c1234990	[Fix] Fix assertion on YaRN model len (#2984 )	2024-02-23 12:57:48 -08:00
zhaoyang-star	57f044945f	Fix nvcc not found in vllm-openai image (#2781 )	2024-02-22 14:25:07 -08:00
Ronen Schaffer	4caf7044e0	Include tokens from prompt phase in `counter_generation_tokens` (#2802 )	2024-02-22 14:00:12 -08:00
Woosuk Kwon	6f32cddf1c	Remove Flash Attention in test env (#2982 )	2024-02-22 09:58:29 -08:00
44670	c530e2cfe3	[FIX] Fix a bug in initializing Yarn RoPE (#2983 )	2024-02-22 01:40:05 -08:00
Woosuk Kwon	fd5dcc5c81	Optimize GeGLU layer in Gemma (#2975 )	2024-02-21 20:17:52 -08:00
Massimiliano Pronesti	93dc5a2870	chore(vllm): codespell for spell checking (#2820 )	2024-02-21 18:56:01 -08:00
Woosuk Kwon	95529e3253	Use Llama RMSNorm custom op for Gemma (#2974 )	2024-02-21 18:28:23 -08:00
Roy	344020c926	Migrate MistralForCausalLM to LlamaForCausalLM (#2868 )	2024-02-21 18:25:05 -08:00
Mustafa Eyceoz	5574081c49	Added early stopping to completion APIs (#2939 )	2024-02-21 18:24:01 -08:00
Ronen Schaffer	d7f396486e	Update comment (#2934 )	2024-02-21 18:18:37 -08:00
Zhuohan Li	8fbd84bf78	Bump up version to v0.3.2 (#2968 ) This version is for more model support. Add support for Gemma models (#2964) and OLMo models (#2832). v0.3.2	2024-02-21 11:47:25 -08:00
Nick Hill	7d2dcce175	Support per-request seed (#2514 )	2024-02-21 11:47:00 -08:00
Woosuk Kwon	dc903e70ac	[ROCm] Upgrade transformers to v4.38.0 (#2967 )	2024-02-21 09:46:57 -08:00
Zhuohan Li	a9c8212895	[FIX] Add Gemma model to the doc (#2966 )	2024-02-21 09:46:15 -08:00
Woosuk Kwon	c20ecb6a51	Upgrade transformers to v4.38.0 (#2965 )	2024-02-21 09:38:03 -08:00
Xiang Xu	5253edaacb	Add Gemma model (#2964 )	2024-02-21 09:34:30 -08:00
Antoni Baum	017d9f1515	Add metrics to RequestOutput (#2876 )	2024-02-20 21:55:57 -08:00
Antoni Baum	181b27d881	Make vLLM logging formatting optional (#2877 )	2024-02-20 14:38:55 -08:00
Zhuohan Li	63e2a6419d	[FIX] Fix beam search test (#2930 )	2024-02-20 14:37:39 -08:00
James Whedbee	264017a2bf	[ROCm] include gfx908 as supported (#2792 )	2024-02-19 17:58:59 -08:00
Ronen Schaffer	e433c115bc	Fix `vllm:prompt_tokens_total` metric calculation (#2869 )	2024-02-18 23:55:41 -08:00
Simon Mo	86fd8bb0ac	Add warning to prevent changes to benchmark api server (#2858 )	2024-02-18 21:36:19 -08:00
Isotr0py	ab3a5a8259	Support OLMo models. (#2832 )	2024-02-18 21:05:15 -08:00
Zhuohan Li	a61f0521b8	[Test] Add basic correctness test (#2908 )	2024-02-18 16:44:50 -08:00
Zhuohan Li	537c9755a7	[Minor] Small fix to make distributed init logic in worker looks cleaner (#2905 )	2024-02-18 14:39:00 -08:00
Mark Mozolewski	786b7f18a5	Add code-revision config argument for Hugging Face Hub (#2892 )	2024-02-17 22:36:53 -08:00
jvmncs	8f36444c4f	multi-LoRA as extra models in OpenAI server (#2775 ) how to serve the loras (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)): ```terminal $ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/ $ python -m vllm.entrypoints.api_server \ --model meta-llama/Llama-2-7b-hf \ --enable-lora \ --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH ``` the above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified lora modules. in this case sql-lora and sql-lora2 point to the same underlying lora, but this need not be the case. lora config values take the same values they do in EngineArgs. no work has been done here to scope client permissions to specific models.	2024-02-17 12:00:48 -08:00
Nick Hill	185b2c29e2	Defensively copy `sampling_params` (#2881 ) If the SamplingParams object passed to LLMEngine.add_request() is mutated after it returns, it could affect the async sampling process for that request. Suggested by @Yard1 https://github.com/vllm-project/vllm/pull/2514#discussion_r1490106059	2024-02-17 11:18:04 -08:00
Woosuk Kwon	5f08050d8d	Bump up to v0.3.1 (#2887 ) v0.3.1	2024-02-16 15:05:18 -08:00
shiyi.c_98	64da65b322	Prefix Caching- fix t4 triton error (#2517 )	2024-02-16 14:17:55 -08:00
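
The rationale given in commit 185b2c29e2 (copying `sampling_params` so that later mutation by the caller cannot affect an in-flight request) can be illustrated with a minimal sketch. The `Params` and `Engine` classes below are hypothetical stand-ins, not vLLM's actual classes, and the log does not say whether the real change uses a shallow or a deep copy; a deep copy is assumed here.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Params:
    """Hypothetical stand-in for a mutable sampling-parameters object."""
    temperature: float = 1.0
    stop: list = field(default_factory=list)


class Engine:
    """Hypothetical engine that queues requests for later processing."""

    def __init__(self):
        self.pending = {}

    def add_request(self, request_id, params):
        # Store a deep copy so that mutations made by the caller after this
        # method returns cannot reach the queued request.
        self.pending[request_id] = copy.deepcopy(params)


engine = Engine()
params = Params(temperature=0.7, stop=["\n"])
engine.add_request("req-0", params)

# The caller reuses and mutates its own object after add_request() returns.
params.temperature = 0.0
params.stop.append("###")

# The queued request still holds the values it was submitted with.
queued = engine.pending["req-0"]
print(queued.temperature, queued.stop)
```

Without the copy, `queued` and `params` would be the same object, and both mutations would show up in the queued request.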

800 Commits