Brendan Wong
168cab6bbf
[Frontend] API support for beam search ( #9087 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-10-05 23:39:03 -07:00
Joe Runde
062c89e7c9
[Frontend][Core] Move guided decoding params into sampling params ( #8252 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-01 09:34:25 +08:00
Jiaxin Shan
260d40b5ea
[Core] Support Lora lineage and base model metadata management ( #6315 )
2024-09-20 06:20:56 +00:00
Alexander Matveev
7c7714d856
[Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH ( #8157 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-09-18 13:56:58 +00:00
Jiaxin Shan
db3bf7c991
[Core] Support load and unload LoRA in api server ( #6566 )
...
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-09-05 18:10:33 -07:00
Cyrus Leung
baaedfdb2d
[mypy] Enable following imports for entrypoints ( #7248 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Fei <dfdfcai4@gmail.com >
2024-08-20 23:28:21 -07:00
Cyrus Leung
7eb4a51c5f
[Core] Support serving encoder/decoder models ( #7258 )
2024-08-09 10:39:41 +08:00
Robert Shaw
ed812a73fa
[ Frontend ] Multiprocessing for OpenAI Server with zeromq ( #6883 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Joe Runde <joe@joerun.de >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-08-02 18:27:28 -07:00
zifeitong
3c10591ef2
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user ( #6954 )
2024-07-31 21:13:34 -07:00
Cyrus Leung
da1f7cc12a
[mypy] Enable following imports for some directories ( #6681 )
2024-07-31 10:38:03 +08:00
Evan Z. Liu
5689e256ba
[Frontend] Represent tokens with identifiable strings ( #6626 )
2024-07-25 09:51:00 +08:00
Jiaxin Shan
42c7f66a38
[Core] Support dynamically loading Lora adapter from HuggingFace ( #6234 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-07-22 15:42:40 -07:00
Cyrus Leung
739b61a348
[Frontend] Refactor prompt processing ( #4028 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-22 10:13:53 -07:00
Nick Hill
e2fbaee725
[BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs ( #6227 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-18 15:13:30 +08:00
Joe
d92b3c5cde
[Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests ( #6419 )
2024-07-15 18:54:15 -07:00
Swapnil Parekh
4d6ada947c
[CORE] Adding support for insertion of soft-tuned prompts ( #4645 )
...
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com >
Co-authored-by: Joe G <joseph.granados@h2o.ai >
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-07-09 13:26:36 -07:00
Cyrus Leung
9831aec49f
[Core] Dynamic image size support for VLMs ( #5276 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: ywang96 <ywang@roblox.com >
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-07-02 20:34:00 -07:00
sasha0552
c54269d967
[Frontend] Add tokenize/detokenize endpoints ( #5054 )
2024-06-26 16:54:22 +00:00
tomeras91
f0a500545f
[Frontend] OpenAI API server: Add add_special_tokens to ChatCompletionRequest (default False) ( #5278 )
2024-06-05 09:32:58 -07:00
Avinash Raj
f790ad3c50
[Frontend][OpenAI] Support for returning max_model_len on /v1/models response ( #4643 )
2024-06-02 08:06:13 +00:00
Breno Faria
87d41c849d
[BUGFIX] [FRONTEND] Correct chat logprobs ( #5029 )
...
Co-authored-by: Breno Faria <breno.faria@intrafind.com >
2024-05-30 02:52:14 -07:00
Cyrus Leung
5ae5ed1e60
[Core] Consolidate prompt arguments to LLM engines ( #4328 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-05-28 13:29:31 -07:00
Eric Xihui Lin
8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model ( #4799 )
...
Co-authored-by: beagleski <yunanzhang@microsoft.com >
Co-authored-by: bapatra <bapatra@microsoft.com >
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-05-24 22:00:52 -07:00
bofeng huang
0150a10630
[Frontend] OpenAI API server: Do not add bos token by default when encoding ( #4688 )
2024-05-16 18:47:22 -07:00
Chang Su
e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API ( #3734 )
2024-05-11 11:30:37 -07:00
Cyrus Leung
f12b20decc
[Frontend] Move async logic outside of constructor ( #4674 )
2024-05-08 22:48:33 -07:00
Sebastian Schoennenbeck
f8e7adda21
Fix/async chat serving ( #2727 )
2024-05-03 11:04:14 -07:00
Cyrus Leung
8947bc3c15
[Frontend][Bugfix] Disallow extra fields in OpenAI API ( #4355 )
2024-04-27 05:08:24 +00:00
Jack Gordley
d3c8180ac4
[Bugfix] Fixing max token error message for openai compatible server ( #4016 )
2024-04-23 19:06:29 +08:00
SangBin Cho
0ae11f78ab
[Mypy] Part 3 fix typing for nested directories for most of directory ( #4161 )
2024-04-22 21:32:44 -07:00
Chirag Jain
bc9df1571b
Pass tokenizer_revision when getting tokenizer in openai serving ( #4214 )
2024-04-19 17:13:56 -07:00
James Whedbee
e1bb2fd52d
[Bugfix] Support logprobs when using guided_json and other constrained decoding fields ( #4149 )
2024-04-18 21:12:55 +00:00
Harry Mellor
66ded03067
Allow model to be served under multiple names ( #2894 )
...
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai >
2024-04-18 00:16:26 -07:00
Dylan Hawk
95e7d4a97c
Fix echo/logprob OpenAI completion bug ( #3441 )
...
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com >
2024-04-11 22:15:50 +00:00
Thomas Parnell
1d7c940d74
Add option to completion API to truncate prompt tokens ( #3144 )
2024-04-05 10:15:42 -07:00
Roy
6110c39dc8
[BugFix] Fix tokenizer out of vocab size ( #3685 )
2024-03-29 08:18:59 -07:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. ( #3495 )
2024-03-25 07:59:47 -07:00
Roy
865732342b
[Misc][Log] Add log for tokenizer length not equal to vocabulary size ( #3500 )
2024-03-21 18:07:48 +08:00
Zhuohan Li
2f8844ba08
Re-enable the 80 char line width limit ( #3305 )
2024-03-10 19:49:14 -07:00
Antoni Baum
22de45235c
Push logprob generation to LLMEngine ( #3065 )
...
Co-authored-by: Avnish Narayan <avnish@anyscale.com >
2024-03-04 19:54:06 +00:00
jvmncs
8f36444c4f
multi-LoRA as extra models in OpenAI server ( #2775 )
...
how to serve the loras (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
the above server will list 3 separate models if the user queries `/models`: one for the base served model, and one for each of the specified lora modules. in this case sql-lora and sql-lora2 point to the same underlying lora, but this need not be the case. lora config values take the same values they do in EngineArgs.
no work has been done here to scope client permissions to specific models, so any client can request any of the listed models.
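a minimal client-side sketch of the `/models` listing described above. it assumes the server listens on `http://localhost:8000/v1` and returns an OpenAI-style list object (`{"object": "list", "data": [{"id": ...}, ...]}`); the helper names `model_ids` and `fetch_model_ids` and the sample payload are hypothetical and only for illustration, not part of this change:

```python
import json
from urllib.request import urlopen


def model_ids(models_response: dict) -> list:
    """Return the `id` of every entry in an OpenAI-style model list."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8000/v1") -> list:
    """Query the models route of a running server (URL is an assumption)."""
    with urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))


# Illustrative payload only: the shape a client might see for the command
# above, with one entry for the base model and one per LoRA module.
sample_response = {
    "object": "list",
    "data": [
        {"id": "meta-llama/Llama-2-7b-hf", "object": "model"},
        {"id": "sql-lora", "object": "model"},
        {"id": "sql-lora2", "object": "model"},
    ],
}

print(model_ids(sample_response))
```

against a live server, `fetch_model_ids()` would be the call to use; a client can then pass any of the returned ids as the `model` field of a request.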
2024-02-17 12:00:48 -08:00
Simon Mo
dd7e8f5f64
refactor complemention api for readability ( #2499 )
2024-01-18 16:45:14 -08:00
FlorianJoncour
14cc317ba4
OpenAI Server refactoring ( #2360 )
2024-01-16 21:33:14 -08:00