[Frontend] Chat-based Embeddings API (#9759)
@@ -96,7 +96,6 @@ def setup(app):
 # Mock out external dependencies here, otherwise the autodoc pages may be blank.
 autodoc_mock_imports = [
-    "aiohttp",
     "compressed_tensors",
     "cpuinfo",
     "cv2",
@@ -143,6 +142,7 @@ intersphinx_mapping = {
     "python": ("https://docs.python.org/3", None),
     "typing_extensions":
     ("https://typing-extensions.readthedocs.io/en/latest", None),
+    "aiohttp": ("https://docs.aiohttp.org/en/stable", None),
     "pillow": ("https://pillow.readthedocs.io/en/stable", None),
     "numpy": ("https://numpy.org/doc/stable", None),
     "torch": ("https://pytorch.org/docs/stable", None),
docs/source/dev/pooling_params.rst (new file, 5 lines)
@@ -0,0 +1,5 @@
+Pooling Parameters
+==================
+
+.. autoclass:: vllm.PoolingParams
+    :members:
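For orientation, a minimal sketch of where ``PoolingParams`` plugs in, assuming vLLM's offline ``LLM.encode`` entry point (the model name is a placeholder, not part of this diff):

```python
from vllm import LLM, PoolingParams

# Placeholder embedding model; any model run in embedding mode works similarly.
llm = LLM(model="intfloat/e5-mistral-7b-instruct")

# PoolingParams travels with each encode request,
# analogous to SamplingParams for generate.
outputs = llm.encode(["Hello, my name is"], PoolingParams())
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```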
@@ -138,10 +138,10 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep
 
 A more detailed client example can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`__.
 
-OpenAI Chat API with vLLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+OpenAI Chat Completions API with vLLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-vLLM is designed to also support the OpenAI Chat API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
+vLLM is designed to also support the OpenAI Chat Completions API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
 
 You can use the `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_ endpoint to interact with the model:
 
@@ -157,7 +157,7 @@ You can use the `create chat completion <https://platform.openai.com/docs/api-re
 $         ]
 $     }'
 
-Alternatively, you can use the `openai` python package:
+Alternatively, you can use the ``openai`` python package:
 
 .. code-block:: python
 
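The body of that ``code-block`` falls outside this hunk's context. For reference, a minimal sketch of the same chat call through the ``openai`` client (the base URL, API key, and model name are assumptions, not taken from this diff):

```python
from openai import OpenAI

# vLLM's server speaks the OpenAI protocol; the API key is required but unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chat_response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
)
print(chat_response.choices[0].message.content)
```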
@@ -134,6 +134,7 @@ Documentation
    :caption: Developer Documentation
 
    dev/sampling_params
+   dev/pooling_params
    dev/offline_inference/offline_index
    dev/engine/engine_index
    dev/kernel/paged_attention
@@ -185,7 +185,7 @@ Below is an example on how to launch the same ``microsoft/Phi-3.5-vision-instruc
     --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2
 
 .. important::
-    Since OpenAI Vision API is based on `Chat Completions <https://platform.openai.com/docs/api-reference/chat>`_ API,
+    Since OpenAI Vision API is based on `Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`_,
     a chat template is **required** to launch the API server.
 
 Although Phi-3.5-Vision comes with a chat template, for other models you may have to provide one if the model's tokenizer does not come with it.
@@ -243,6 +243,10 @@ To consume the server, you can use the OpenAI client like in the example below:
 
 A full code example can be found in `examples/openai_api_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_api_client_for_multimodal.py>`_.
 
+.. tip::
+    There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
+    In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
+
 .. note::
 
     By default, the timeout for fetching images through an HTTP URL is ``5`` seconds. You can override this by setting the environment variable:
@@ -251,5 +255,49 @@ A full code example can be found in `examples/openai_api_client_for_multimodal.p
 
     $ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
 
 .. note::
     There is no need to format the prompt in the API request since it will be handled by the server.
+
+Chat Embeddings API
+^^^^^^^^^^^^^^^^^^^
+
+vLLM's Chat Embeddings API is a superset of OpenAI's `Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`_,
+where a list of ``messages`` can be passed instead of batched ``inputs``. This enables multi-modal inputs to be passed to embedding models.
+
+.. tip::
+    The schema of ``messages`` is exactly the same as in the Chat Completions API.
+
+In this example, we will serve the ``TIGER-Lab/VLM2Vec-Full`` model.
+
+.. code-block:: bash
+
+    vllm serve TIGER-Lab/VLM2Vec-Full --task embedding \
+        --trust-remote-code --max-model-len 4096
+
+.. important::
+
+    Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass ``--task embedding``
+    to run this model in embedding mode instead of text generation mode.
+
+Since this schema is not defined by the OpenAI client, we post a request to the server using the lower-level ``requests`` library:
+
+.. code-block:: python
+
+    import requests
+
+    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
+
+    response = requests.post(
+        "http://localhost:8000/v1/embeddings",
+        json={
+            "model": "TIGER-Lab/VLM2Vec-Full",
+            "messages": [{
+                "role": "user",
+                "content": [
+                    {"type": "image_url", "image_url": {"url": image_url}},
+                    {"type": "text", "text": "Represent the given image."},
+                ],
+            }],
+            "encoding_format": "float",
+        },
+    )
+    response.raise_for_status()
+    response_json = response.json()
+    print("Embedding output:", response_json["data"][0]["embedding"])
@@ -26,13 +26,26 @@ print(completion.choices[0].message)
 ```
 
 ## API Reference
-Please see the [OpenAI API Reference](https://platform.openai.com/docs/api-reference) for more information on the API. We support all parameters except:
-- Chat: `tools`, and `tool_choice`.
-- Completions: `suffix`.
-
-vLLM also provides experimental support for OpenAI Vision API compatible inference. See more details in [Using VLMs](../models/vlm.rst).
+We currently support the following OpenAI APIs:
+
+- [Completions API](https://platform.openai.com/docs/api-reference/completions)
+  - *Note: `suffix` parameter is not supported.*
+- [Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
+  - [Vision](https://platform.openai.com/docs/guides/vision)-related parameters are supported; see [Using VLMs](../models/vlm.rst).
+    - *Note: `image_url.detail` parameter is not supported.*
+  - We also support the `audio_url` content type for audio files.
+    - Refer to [vllm.entrypoints.chat_utils](https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/chat_utils.py) for the exact schema.
+    - *TODO: Support `input_audio` content type as defined [here](https://github.com/openai/openai-python/blob/v1.52.2/src/openai/types/chat/chat_completion_content_part_input_audio_param.py).*
+  - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
+- [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings)
+  - Instead of `inputs`, you can pass in a list of `messages` (same schema as the Chat Completions API),
+    which will be treated as a single prompt to the model according to its chat template (a sketch follows this list).
+  - This enables multi-modal inputs to be passed to embedding models; see [Using VLMs](../models/vlm.rst).
+  - *Note: You should run `vllm serve` with `--task embedding` to ensure that the model is being run in embedding mode.*
 
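As the Embeddings API items above note, a `messages`-based request is plain HTTP. A sketch with text-only content (assumes a local server launched with `--task embedding`; the model name follows the VLM2Vec example earlier in this PR):

```python
import requests

# Assumes: vllm serve TIGER-Lab/VLM2Vec-Full --task embedding \
#              --trust-remote-code --max-model-len 4096
response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "TIGER-Lab/VLM2Vec-Full",
        # `messages` replaces the usual `inputs`; same schema as Chat Completions.
        "messages": [{"role": "user", "content": "A sentence to embed."}],
        "encoding_format": "float",
    },
)
response.raise_for_status()
print(response.json()["data"][0]["embedding"][:4])  # first few dimensions
```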
 ## Extra Parameters
 
 vLLM supports a set of parameters that are not part of the OpenAI API.
 In order to use them, you can pass them as extra parameters in the OpenAI client,
 or merge them directly into the JSON payload if you are calling the HTTP API directly.
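The client example elided in the next hunk's context presumably relies on `extra_body`; here is a hedged sketch (the model name is a placeholder, and `top_k` stands in for any vLLM-specific extra parameter):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
    # Non-OpenAI parameters go through extra_body and are merged into the payload.
    extra_body={"top_k": 50},
)
print(completion.choices[0].message.content)
```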
@@ -49,7 +62,26 @@ completion = client.chat.completions.create(
 )
 ```
 
-### Extra Parameters for Chat API
+### Extra Parameters for Completions API
+
+The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.
+
+```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:language: python
+:start-after: begin-completion-sampling-params
+:end-before: end-completion-sampling-params
+```
+
+The following extra parameters are supported:
+
+```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:language: python
+:start-after: begin-completion-extra-params
+:end-before: end-completion-extra-params
+```
+
+### Extra Parameters for Chat Completions API
 
 The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.
 
 ```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
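For context on the `literalinclude` blocks above: `:start-after:`/`:end-before:` excerpt whatever lies between two marker comments in `protocol.py`. A sketch of the pattern only; the fields shown are illustrative, not the file's actual contents:

```python
from pydantic import BaseModel


class CompletionRequest(BaseModel):
    # Marker comments like these delimit the excerpt rendered in the docs.
    # begin-completion-sampling-params
    temperature: float = 1.0  # illustrative field only
    top_p: float = 1.0        # illustrative field only
    # end-completion-sampling-params
```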
@@ -66,21 +98,22 @@ The following extra parameters are supported:
 :end-before: end-chat-completion-extra-params
 ```
 
-### Extra Parameters for Completions API
-
-The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.
+### Extra Parameters for Embeddings API
+
+The following [pooling parameters (click through to see documentation)](../dev/pooling_params.rst) are supported.
 
 ```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
 :language: python
-:start-after: begin-completion-sampling-params
-:end-before: end-completion-sampling-params
+:start-after: begin-embedding-pooling-params
+:end-before: end-embedding-pooling-params
 ```
 
 The following extra parameters are supported:
 
 ```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
 :language: python
-:start-after: begin-completion-extra-params
-:end-before: end-completion-extra-params
+:start-after: begin-embedding-extra-params
+:end-before: end-embedding-extra-params
 ```
 
 ## Chat Template