[Frontend] Online Pooling API (#11457)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
This commit is contained in:
Cyrus Leung
2024-12-24 17:54:30 +08:00
committed by GitHub
parent 4f074fbf53
commit 9edca6bf8f
15 changed files with 808 additions and 156 deletions


@@ -42,6 +42,8 @@ In addition, we have the following custom APIs:
- [Tokenizer API](#tokenizer-api) (`/tokenize`, `/detokenize`)
- Applicable to any model with a tokenizer.
- [Pooling API](#pooling-api) (`/pooling`)
- Applicable to all [pooling models](../models/pooling_models.md).
- [Score API](#score-api) (`/score`)
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
@@ -179,7 +181,12 @@ The order of priorities is `command line > config file values > defaults`.
(completions-api)=
### Completions API
Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
#### Code example
See [examples/openai_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py).
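For a rough sketch of the request body behind such a client call (field names follow the OpenAI Completions schema; the model name and sampling values are placeholders, not from this commit):

```python
import json

# Hypothetical request body for POST http://localhost:8000/v1/completions
# (the default vLLM server address; nothing is sent here).
payload = {
    "model": "facebook/opt-125m",  # placeholder: use the model the server was launched with
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0,
}

body = json.dumps(payload)
print(body)
```

The linked example script performs the same call through the official OpenAI Python client instead of raw HTTP.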
#### Extra parameters
@@ -200,15 +207,20 @@ The following extra parameters are supported:
```
(chat-api)=
### Chat API
Our Chat API is compatible with [OpenAI's Chat Completions API](https://platform.openai.com/docs/api-reference/chat);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
see our [Multimodal Inputs](../usage/multimodal_inputs.md) guide for more information.
- *Note: the `image_url.detail` parameter is not supported.*
#### Code example
See [examples/openai_chat_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client.py).
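As a sketch of a Chat API request body that includes an image input (assuming the server was started with a vision-capable model; the model name and image URL are placeholders):

```python
import json

# Hypothetical request body for POST http://localhost:8000/v1/chat/completions.
# Content parts follow the OpenAI vision message schema.
payload = {
    "model": "llava-hf/llava-1.5-7b-hf",  # placeholder vision model
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                # Note: the image_url.detail parameter is not supported.
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        }
    ],
}

print(json.dumps(payload, indent=2))
```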
#### Extra parameters
The following [sampling parameters (click through to see documentation)](../dev/sampling_params.md) are supported.
@@ -230,15 +242,20 @@ The following extra parameters are supported:
(embeddings-api)=
### Embeddings API
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
which will be treated as a single prompt to the model.
```{tip}
This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details.
```
#### Code example
See [examples/openai_embedding_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py).
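The two input styles can be sketched as request bodies (the `input` field name follows the OpenAI Embeddings schema; the model name is a placeholder):

```python
import json

# Plain-text request, following the OpenAI Embeddings schema.
text_request = {
    "model": "intfloat/e5-mistral-7b-instruct",  # placeholder embedding model
    "input": "The quick brown fox jumps over the lazy dog",
}

# Chat-style request: for models with a chat template, the messages are
# rendered into a single prompt before embedding.
chat_style_request = {
    "model": "intfloat/e5-mistral-7b-instruct",
    "messages": [
        {"role": "user", "content": "The quick brown fox jumps over the lazy dog"},
    ],
}

print(json.dumps(text_request))
print(json.dumps(chat_style_request))
```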
#### Extra parameters
The following [pooling parameters (click through to see documentation)](../dev/pooling_params.md) are supported.
@@ -268,20 +285,35 @@ For chat-like input (i.e. if `messages` is passed), these extra parameters are s
(tokenizer-api)=
### Tokenizer API
Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
It consists of two endpoints:
- `/tokenize` corresponds to calling `tokenizer.encode()`.
- `/detokenize` corresponds to calling `tokenizer.decode()`.
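As a sketch of the two request bodies (the field names and token IDs here are illustrative assumptions, not taken from this commit; the model name is a placeholder):

```python
import json

MODEL = "meta-llama/Llama-3.1-8B"  # placeholder model name

# Hypothetical body for POST /tokenize (wraps tokenizer.encode()).
tokenize_request = {"model": MODEL, "prompt": "Hello, world!"}

# Hypothetical body for POST /detokenize (wraps tokenizer.decode());
# the token IDs are made-up example values.
detokenize_request = {"model": MODEL, "tokens": [101, 7592, 1010, 2088, 999]}

print(json.dumps(tokenize_request))
print(json.dumps(detokenize_request))
```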
(pooling-api)=
### Pooling API
Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.
The input format is the same as that of the [Embeddings API](#embeddings-api), but the output data can contain an arbitrarily nested list, not just a 1-D list of floats.
#### Code example
See [examples/openai_pooling_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_pooling_client.py).
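Because the returned hidden states may be arbitrarily nested rather than a flat vector, clients may need to walk the structure. A minimal sketch, using a made-up sample response (the field layout mirrors the Embeddings response format, which is an assumption here):

```python
# Made-up sample Pooling API response: `data` holds a 2-D list of hidden
# states rather than the flat 1-D vector an Embeddings response would carry.
sample_response = {
    "object": "list",
    "data": [
        {"object": "pooling", "index": 0, "data": [[0.1, 0.2], [0.3, 0.4]]},
    ],
}


def flatten(x):
    """Recursively flatten arbitrarily nested lists of floats into one flat list."""
    if isinstance(x, list):
        out = []
        for item in x:
            out.extend(flatten(item))
        return out
    return [x]


hidden = flatten(sample_response["data"][0]["data"])
print(hidden)  # -> [0.1, 0.2, 0.3, 0.4]
```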
(score-api)=
### Score API
Our Score API applies a cross-encoder model to predict scores for sentence pairs.
Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.
You can find the documentation for these kinds of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
#### Code example
See [examples/openai_cross_encoder_score.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_cross_encoder_score.py).
#### Single inference
You can pass a string to both `text_1` and `text_2`, forming a single sentence pair.
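A single-inference request body can be sketched as follows (the model name and sentences are placeholders):

```python
import json

# Hypothetical body for POST /score: one string each for text_1 and text_2
# forms a single sentence pair for the cross-encoder.
payload = {
    "model": "BAAI/bge-reranker-v2-m3",  # placeholder cross-encoder model
    "text_1": "What is the capital of France?",
    "text_2": "The capital of France is Paris.",
}

print(json.dumps(payload))
```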