[Docs] Reorganize pooling docs. (#35592)

Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
This commit is contained in:
wang.yuqi
2026-03-19 19:25:47 +08:00
committed by GitHub
parent 4426447bba
commit f9e2a38386
17 changed files with 2393 additions and 1736 deletions

View File

@@ -53,8 +53,8 @@ We currently support the following OpenAI APIs:
- Only applicable to [text generation models](../models/generative_models.md) with a [chat template](../serving/openai_compatible_server.md#chat-template).
- *Note: `user` parameter is ignored.*
- *Note:* Setting the `parallel_tool_calls` parameter to `false` ensures vLLM only returns zero or one tool call per request. Setting it to `true` (the default) allows returning more than one tool call per request. There is no guarantee more than one tool call will be returned if this is set to `true`, as that behavior is model dependent and not all models are designed to support parallel tool calls.
- [Embeddings API](#embeddings-api) (`/v1/embeddings`)
- Only applicable to [embedding models](../models/pooling_models.md).
- [Embeddings API](../models/pooling_models/embed.md#openai-compatible-embeddings-api) (`/v1/embeddings`)
- Only applicable to [embedding models](../models/pooling_models/embed.md).
- [Transcriptions API](#transcriptions-api) (`/v1/audio/transcriptions`)
- Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
- [Translation API](#translations-api) (`/v1/audio/translations`)
@@ -66,20 +66,19 @@ In addition, we have the following custom APIs:
- [Tokenizer API](#tokenizer-api) (`/tokenize`, `/detokenize`)
- Applicable to any model with a tokenizer.
- [Pooling API](#pooling-api) (`/pooling`)
- Applicable to all [pooling models](../models/pooling_models.md).
- [Classification API](#classification-api) (`/classify`)
- Only applicable to [classification models](../models/pooling_models.md).
- [Score API](#score-api) (`/score`)
- Applicable to [embedding models and cross-encoder models](../models/pooling_models.md).
- [Cohere Embed API](#cohere-embed-api) (`/v2/embed`)
- [pooling API](../models/pooling_models/README.md#pooling-api) (`/pooling`)
- Applicable to all [pooling models](../models/pooling_models/README.md).
- [Classification API](../models/pooling_models/classify.md#classification-api) (`/classify`)
- Only applicable to [classification models](../models/pooling_models/classify.md).
- [Cohere Embed API](../models/pooling_models/embed.md#cohere-embed-api) (`/v2/embed`)
- Compatible with [Cohere's Embed API](https://docs.cohere.com/reference/embed)
- Works with any [embedding model](../models/pooling_models.md), including multimodal models.
- [Re-rank API](#re-rank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
- Works with any [embedding model](../models/pooling_models/embed.md#supported-models), including multimodal models.
- [Score API](../models/pooling_models/scoring.md#score-api) (`/score`)
- Applicable to [score models](../models/pooling_models/scoring.md).
- [Rerank API](../models/pooling_models/scoring.md#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- Implements [Jina AI's v1 rerank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 rerank APIs](https://docs.cohere.com/v2/reference/rerank)
- Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- Only applicable to [cross-encoder models](../models/pooling_models.md).
## Chat Template
@@ -269,300 +268,6 @@ The following extra parameters in the response object are supported:
--8<-- "vllm/entrypoints/openai/responses/protocol.py:responses-response-extra-params"
```
### Embeddings API
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
Code example: [examples/pooling/embed/openai_embedding_client.py](../../examples/pooling/embed/openai_embedding_client.py)
If the model has a [chat template](../serving/openai_compatible_server.md#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
which will be treated as a single prompt to the model. Here is a convenience function for calling the API while retaining OpenAI's type annotations:
??? code
```python
from openai import OpenAI
from openai._types import NOT_GIVEN, NotGiven
from openai.types.chat import ChatCompletionMessageParam
from openai.types.create_embedding_response import CreateEmbeddingResponse
def create_chat_embeddings(
client: OpenAI,
*,
messages: list[ChatCompletionMessageParam],
model: str,
encoding_format: Union[Literal["base64", "float"], NotGiven] = NOT_GIVEN,
) -> CreateEmbeddingResponse:
return client.post(
"/embeddings",
cast_to=CreateEmbeddingResponse,
body={"messages": messages, "model": model, "encoding_format": encoding_format},
)
```
#### Multi-modal inputs
You can pass multi-modal inputs to embedding models by defining a custom chat template for the server
and passing a list of `messages` in the request. Refer to the examples below for illustration.
=== "VLM2Vec"
To serve the model:
```bash
vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
--trust-remote-code \
--max-model-len 4096 \
--chat-template examples/pooling/embed/template/vlm2vec_phi3v.jinja
```
!!! important
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--runner pooling`
to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model,
and can be found here: [examples/pooling/embed/template/vlm2vec_phi3v.jinja](../../examples/pooling/embed/template/vlm2vec_phi3v.jinja)
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
??? code
```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="EMPTY",
)
image_url = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
response = create_chat_embeddings(
client,
model="TIGER-Lab/VLM2Vec-Full",
messages=[
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": image_url}},
{"type": "text", "text": "Represent the given image."},
],
}
],
encoding_format="float",
)
print("Image embedding output:", response.data[0].embedding)
```
=== "DSE-Qwen2-MRL"
To serve the model:
```bash
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --runner pooling \
--trust-remote-code \
--max-model-len 8192 \
--chat-template examples/pooling/embed/template/dse_qwen2_vl.jinja
```
!!! important
Like with VLM2Vec, we have to explicitly pass `--runner pooling`.
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by a custom chat template: [examples/pooling/embed/template/dse_qwen2_vl.jinja](../../examples/pooling/embed/template/dse_qwen2_vl.jinja)
!!! important
`MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
example below for details.
Full example: [examples/pooling/embed/vision_embedding_online.py](../../examples/pooling/embed/vision_embedding_online.py)
#### Extra parameters
The following [pooling parameters][vllm.PoolingParams] are supported.
```python
--8<-- "vllm/pooling_params.py:common-pooling-params"
--8<-- "vllm/pooling_params.py:embed-pooling-params"
```
The following Embeddings API parameters are supported:
??? code
```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:completion-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:encoding-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:embed-params"
```
The following extra parameters are supported:
??? code
```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:completion-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:encoding-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:embed-extra-params"
```
For chat-like input (i.e. if `messages` is passed), the following parameters are supported:
The following parameters are supported by default:
??? code
```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:chat-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:encoding-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:embed-params"
```
these extra parameters are supported instead:
??? code
```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:chat-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:encoding-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:embed-extra-params"
```
### Cohere Embed API
Our API is also compatible with [Cohere's Embed v2 API](https://docs.cohere.com/reference/embed) which adds support for some modern embedding feature such as truncation, output dimensions, embedding types, and input types. This endpoint works with any embedding model (including multimodal models).
#### Cohere Embed API request parameters
| Parameter | Type | Required | Description |
| --------- | ---- | -------- | ----------- |
| `model` | string | Yes | Model name |
| `input_type` | string | No | Prompt prefix key (model-dependent, see below) |
| `texts` | list[string] | No | Text inputs (use one of `texts`, `images`, or `inputs`) |
| `images` | list[string] | No | Base64 data URI images |
| `inputs` | list[object] | No | Mixed text and image content objects |
| `embedding_types` | list[string] | No | Output types (default: `["float"]`) |
| `output_dimension` | int | No | Truncate embeddings to this dimension (Matryoshka) |
| `truncate` | string | No | `END`, `START`, or `NONE` (default: `END`) |
#### Text embedding
```bash
curl -X POST "http://localhost:8000/v2/embed" \
-H "Content-Type: application/json" \
-d '{
"model": "Snowflake/snowflake-arctic-embed-m-v1.5",
"input_type": "query",
"texts": ["Hello world", "How are you?"],
"embedding_types": ["float"]
}'
```
??? console "Response"
```json
{
"id": "embd-...",
"embeddings": {
"float": [
[0.012, -0.034, ...],
[0.056, 0.078, ...]
]
},
"texts": ["Hello world", "How are you?"],
"meta": {
"api_version": {"version": "2"},
"billed_units": {"input_tokens": 12}
}
}
```
#### Mixed text and image inputs
For multimodal models, you can embed images by passing base64 data URIs. The `inputs` field accepts a list of objects with mixed text and image content:
```bash
curl -X POST "http://localhost:8000/v2/embed" \
-H "Content-Type: application/json" \
-d '{
"model": "google/siglip-so400m-patch14-384",
"inputs": [
{
"content": [
{"type": "text", "text": "A photo of a cat"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
]
}
],
"embedding_types": ["float"]
}'
```
#### Embedding types
The `embedding_types` parameter controls the output format. Multiple types can be requested in a single call:
| Type | Description |
| ---- | ----------- |
| `float` | Raw float32 embeddings (default) |
| `binary` | Bit-packed signed binary |
| `ubinary` | Bit-packed unsigned binary |
| `base64` | Little-endian float32 encoded as base64 |
```bash
curl -X POST "http://localhost:8000/v2/embed" \
-H "Content-Type: application/json" \
-d '{
"model": "Snowflake/snowflake-arctic-embed-m-v1.5",
"input_type": "query",
"texts": ["What is machine learning?"],
"embedding_types": ["float", "binary"]
}'
```
??? console "Response"
```json
{
"id": "embd-...",
"embeddings": {
"float": [[0.012, -0.034, ...]],
"binary": [[42, -117, ...]]
},
"texts": ["What is machine learning?"],
"meta": {
"api_version": {"version": "2"},
"billed_units": {"input_tokens": 8}
}
}
```
#### Truncation
The `truncate` parameter controls how inputs exceeding the model's maximum sequence length are handled:
| Value | Behavior |
| ----- | --------- |
| `END` (default) | Keep the first tokens, drop the end |
| `START` | Keep the last tokens, drop the beginning |
| `NONE` | Return an error if the input is too long |
#### Input type and prompt prefixes
The `input_type` field selects a prompt prefix to prepend to each text input. The available values
depend on the model:
- **Models with `task_instructions` in `config.json`**: The keys from the `task_instructions` dict are
the valid `input_type` values and the corresponding value is prepended to each text.
- **Models with `config_sentence_transformers.json` prompts**: The keys from the `prompts` dict are
the valid `input_type` values. For example, `Snowflake/snowflake-arctic-embed-xs` defines `"query"`,
so setting `input_type: "query"` prepends `"Represent this sentence for searching relevant passages: "`.
- **Other models**: `input_type` is not accepted and will raise a validation error if passed.
### Transcriptions API
Our Transcriptions API is compatible with [OpenAI's Transcriptions API](https://platform.openai.com/docs/api-reference/audio/createTranscription);
@@ -759,172 +464,8 @@ It consists of two endpoints:
- `/tokenize` corresponds to calling `tokenizer.encode()`.
- `/detokenize` corresponds to calling `tokenizer.decode()`.
### Pooling API
Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.
The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
Code example: [examples/pooling/pooling/pooling_online.py](../../examples/pooling/pooling/pooling_online.py)
### Classification API
Our Classification API directly supports Hugging Face sequence-classification models such as [ai21labs/Jamba-tiny-reward-dev](https://huggingface.co/ai21labs/Jamba-tiny-reward-dev) and [jason9693/Qwen2.5-1.5B-apeach](https://huggingface.co/jason9693/Qwen2.5-1.5B-apeach).
We automatically wrap any other transformer via `as_seq_cls_model()`, which pools on the last token, attaches a `RowParallelLinear` head, and applies a softmax to produce per-class probabilities.
Code example: [examples/pooling/classify/classification_online.py](../../examples/pooling/classify/classification_online.py)
#### Example Requests
You can classify multiple texts by passing an array of strings:
```bash
curl -v "http://127.0.0.1:8000/classify" \
-H "Content-Type: application/json" \
-d '{
"model": "jason9693/Qwen2.5-1.5B-apeach",
"input": [
"Loved the new café—coffee was great.",
"This update broke everything. Frustrating."
]
}'
```
??? console "Response"
```json
{
"id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
"object": "list",
"created": 1745383065,
"model": "jason9693/Qwen2.5-1.5B-apeach",
"data": [
{
"index": 0,
"label": "Default",
"probs": [
0.565970778465271,
0.4340292513370514
],
"num_classes": 2
},
{
"index": 1,
"label": "Spoiled",
"probs": [
0.26448777318000793,
0.7355121970176697
],
"num_classes": 2
}
],
"usage": {
"prompt_tokens": 20,
"total_tokens": 20,
"completion_tokens": 0,
"prompt_tokens_details": null
}
}
```
You can also pass a string directly to the `input` field:
```bash
curl -v "http://127.0.0.1:8000/classify" \
-H "Content-Type: application/json" \
-d '{
"model": "jason9693/Qwen2.5-1.5B-apeach",
"input": "Loved the new café—coffee was great."
}'
```
??? console "Response"
```json
{
"id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
"object": "list",
"created": 1745383213,
"model": "jason9693/Qwen2.5-1.5B-apeach",
"data": [
{
"index": 0,
"label": "Default",
"probs": [
0.565970778465271,
0.4340292513370514
],
"num_classes": 2
}
],
"usage": {
"prompt_tokens": 10,
"total_tokens": 10,
"completion_tokens": 0,
"prompt_tokens_details": null
}
}
```
#### Extra parameters
The following [pooling parameters][vllm.PoolingParams] are supported.
```python
--8<-- "vllm/pooling_params.py:common-pooling-params"
--8<-- "vllm/pooling_params.py:classify-pooling-params"
```
The following Classification API parameters are supported:
??? code
```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:completion-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-params"
```
The following extra parameters are supported:
??? code
```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:completion-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
```
For chat-like input (i.e. if `messages` is passed), the following parameters are supported:
??? code
```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:chat-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-params"
```
these extra parameters are supported instead:
??? code
```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:chat-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
```
### Score API
Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence or multimodal pairs. When using an embedding model the score corresponds to the cosine similarity between each embedding pair.
Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.
You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
Code example: [examples/pooling/score/score_api_online.py](../../examples/pooling/score/score_api_online.py)
#### Score Template
Some scoring models require a specific prompt format to work correctly. You can specify a custom score template using the `--chat-template` parameter (see [Chat Template](#chat-template)).
@@ -940,307 +481,6 @@ This approach is more robust than index-based access (`messages[0]`, `messages[1
Example template file: [examples/pooling/score/template/nemotron-rerank.jinja](../../examples/pooling/score/template/nemotron-rerank.jinja)
#### Single inference
You can pass a string to both `queries` and `documents`, forming a single sentence pair.
```bash
curl -X 'POST' \
'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"encoding_format": "float",
"queries": "What is the capital of France?",
"documents": "The capital of France is Paris."
}'
```
??? console "Response"
```json
{
"id": "score-request-id",
"object": "list",
"created": 693447,
"model": "BAAI/bge-reranker-v2-m3",
"data": [
{
"index": 0,
"object": "score",
"score": 1
}
],
"usage": {}
}
```
#### Batch inference
You can pass a string to `queries` and a list to `documents`, forming multiple sentence pairs
where each pair is built from `queries` and a string in `documents`.
The total number of pairs is `len(documents)`.
??? console "Request"
```bash
curl -X 'POST' \
'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"queries": "What is the capital of France?",
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris."
]
}'
```
??? console "Response"
```json
{
"id": "score-request-id",
"object": "list",
"created": 693570,
"model": "BAAI/bge-reranker-v2-m3",
"data": [
{
"index": 0,
"object": "score",
"score": 0.001094818115234375
},
{
"index": 1,
"object": "score",
"score": 1
}
],
"usage": {}
}
```
You can pass a list to both `queries` and `documents`, forming multiple sentence pairs
where each pair is built from a string in `queries` and the corresponding string in `documents` (similar to `zip()`).
The total number of pairs is `len(documents)`.
??? console "Request"
```bash
curl -X 'POST' \
'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"encoding_format": "float",
"queries": [
"What is the capital of Brazil?",
"What is the capital of France?"
],
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris."
]
}'
```
??? console "Response"
```json
{
"id": "score-request-id",
"object": "list",
"created": 693447,
"model": "BAAI/bge-reranker-v2-m3",
"data": [
{
"index": 0,
"object": "score",
"score": 1
},
{
"index": 1,
"object": "score",
"score": 1
}
],
"usage": {}
}
```
#### Multi-modal inputs
You can pass multi-modal inputs to scoring models by passing `content` including a list of multi-modal input (image, etc.) in the request. Refer to the examples below for illustration.
=== "JinaVL-Reranker"
To serve the model:
```bash
vllm serve jinaai/jina-reranker-m0
```
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
??? Code
```python
import requests
response = requests.post(
"http://localhost:8000/v1/score",
json={
"model": "jinaai/jina-reranker-m0",
"queries": "slm markdown",
"documents": [
{
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
},
}
],
},
{
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
},
}
]
},
],
},
)
response.raise_for_status()
response_json = response.json()
print("Scoring output:", response_json["data"][0]["score"])
print("Scoring output:", response_json["data"][1]["score"])
```
Full example:
- [examples/pooling/score/vision_score_api_online.py](../../examples/pooling/score/vision_score_api_online.py)
- [examples/pooling/score/vision_rerank_api_online.py](../../examples/pooling/score/vision_rerank_api_online.py)
#### Extra parameters
The following [pooling parameters][vllm.PoolingParams] are supported.
```python
--8<-- "vllm/pooling_params.py:common-pooling-params"
--8<-- "vllm/pooling_params.py:classify-pooling-params"
```
The following Score API parameters are supported:
```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
```
The following extra parameters are supported:
```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
```
### Re-rank API
Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevant scores between a single query, and
each of a list of documents. Usually, the score for a sentence pair refers to the similarity between two sentences or multi-modal inputs (image, etc.), on a scale of 0 to 1.
You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
The rerank endpoints support popular re-rank models such as `BAAI/bge-reranker-base` and other models supporting the
`score` task. Additionally, `/rerank`, `/v1/rerank`, and `/v2/rerank`
endpoints are compatible with both [Jina AI's re-rank API interface](https://jina.ai/reranker/) and
[Cohere's re-rank API interface](https://docs.cohere.com/v2/reference/rerank) to ensure compatibility with
popular open-source tools.
Code example: [examples/pooling/score/rerank_api_online.py](../../examples/pooling/score/rerank_api_online.py)
#### Example Request
Note that the `top_n` request parameter is optional and will default to the length of the `documents` field.
Result documents will be sorted by relevance, and the `index` property can be used to determine original order.
??? console "Request"
```bash
curl -X 'POST' \
'http://127.0.0.1:8000/v1/rerank' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-base",
"query": "What is the capital of France?",
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris.",
"Horses and cows are both animals"
]
}'
```
??? console "Response"
```json
{
"id": "rerank-fae51b2b664d4ed38f5969b612edff77",
"model": "BAAI/bge-reranker-base",
"usage": {
"total_tokens": 56
},
"results": [
{
"index": 1,
"document": {
"text": "The capital of France is Paris."
},
"relevance_score": 0.99853515625
},
{
"index": 0,
"document": {
"text": "The capital of Brazil is Brasilia."
},
"relevance_score": 0.0005860328674316406
}
]
}
```
#### Extra parameters
The following [pooling parameters][vllm.PoolingParams] are supported.
```python
--8<-- "vllm/pooling_params.py:common-pooling-params"
--8<-- "vllm/pooling_params.py:classify-pooling-params"
```
The following Re-rank API parameters are supported:
```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
```
The following extra parameters are supported:
```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
```
## Ray Serve LLM
Ray Serve LLM enables scalable, production-grade serving of the vLLM engine. It integrates tightly with vLLM and extends it with features such as auto-scaling, load balancing, and back-pressure.