# Scoring Usages

Score models compute similarity scores between two input prompts. vLLM supports three model types (aka `score_type`): `cross-encoder`, `late-interaction`, and `bi-encoder`.

!!! note
    vLLM handles only the model inference component of RAG pipelines (such as embedding generation and reranking). For higher-level RAG orchestration, you should leverage integration frameworks like [LangChain](https://github.com/langchain-ai/langchain).

## Summary

- Model Usage: Scoring
- Pooling Task:

| Score Types        | Pooling Tasks         | scoring function         |
|--------------------|-----------------------|--------------------------|
| `cross-encoder`    | `classify` (see note) | linear classifier        |
| `late-interaction` | `token_embed`         | late interaction(MaxSim) |
| `bi-encoder`       | `embed`               | cosine similarity        |

- Offline APIs:
    - `LLM.score`
- Online APIs:
    - [Score API](scoring.md#score-api) (`/score`)
    - [Rerank API](scoring.md#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)

!!! note
    A classification model can be used as a scoring model, with its scoring API enabled, only if its `num_labels` equals 1.

## Supported Models

### Cross-encoder models

[Cross-encoder](https://www.sbert.net/examples/applications/cross-encoder/README.html) (aka reranker) models are a subset of classification models that accept two prompts as input and have `num_labels` equal to 1.

--8<-- [start:supported-cross-encoder-models]

#### Text-only Models

| Architecture | Models | Example HF Models | Score template (see note) | [LoRA](../../features/lora.md) | [PP](../../serving/parallelism_scaling.md) |
| ------------ | ------ | ----------------- | ------------------------- | --------------------------- | --------------------------------------- |
| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | N/A | | |
| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma`(see note), etc. | [bge-reranker-v2-gemma.jinja](../../../examples/pooling/score/template/bge-reranker-v2-gemma.jinja) | ✅︎ | ✅︎ |
| `GteNewForSequenceClassification` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-reranker-base`, etc. | N/A | | |
| `LlamaBidirectionalForSequenceClassification`<sup>C</sup> | Llama-based with bidirectional attention | `nvidia/llama-nemotron-rerank-1b-v2`, etc. | [nemotron-rerank.jinja](../../../examples/pooling/score/template/nemotron-rerank.jinja) | ✅︎ | ✅︎ |
| `Qwen2ForSequenceClassification`<sup>C</sup> | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2`(see note), etc. | [mxbai_rerank_v2.jinja](../../../examples/pooling/score/template/mxbai_rerank_v2.jinja) | ✅︎ | ✅︎ |
| `Qwen3ForSequenceClassification`<sup>C</sup> | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B`(see note), etc. | [qwen3_reranker.jinja](../../../examples/pooling/score/template/qwen3_reranker.jinja) | ✅︎ | ✅︎ |
| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | N/A | | |
| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | N/A | | |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | N/A | \* | \* |

<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./README.md#model-conversion))  
\* Feature support is the same as that of the original model.

!!! note
    Some models require a specific prompt format to work correctly.

    The score templates for the example HF models are in [examples/pooling/score/template/](../../../examples/pooling/score/template).

    Examples: [examples/pooling/score/using_template_offline.py](../../../examples/pooling/score/using_template_offline.py), [examples/pooling/score/using_template_online.py](../../../examples/pooling/score/using_template_online.py)

!!! note
    To load the original `BAAI/bge-reranker-v2-gemma` checkpoint, use the following command:

    ```bash
    vllm serve BAAI/bge-reranker-v2-gemma --hf_overrides '{"architectures": ["GemmaForSequenceClassification"],"classifier_from_token": ["Yes"],"method": "no_post_processing"}'
    ```

!!! note
    The second-generation GTE model (mGTE-TRM) is named `NewForSequenceClassification`. Because that name is too generic, you should set `--hf-overrides '{"architectures": ["GteNewForSequenceClassification"]}'` to select the `GteNewForSequenceClassification` architecture.

!!! note
    To load the original `mxbai-rerank-v2` checkpoint, use the following command:

    ```bash
    vllm serve mixedbread-ai/mxbai-rerank-base-v2 --hf_overrides '{"architectures": ["Qwen2ForSequenceClassification"],"classifier_from_token": ["0", "1"], "method": "from_2_way_softmax"}'
    ```

!!! note
    To load the original `Qwen3 Reranker` checkpoint, use the following command. For more information, see [examples/pooling/score/qwen3_reranker_offline.py](../../../examples/pooling/score/qwen3_reranker_offline.py) and [examples/pooling/score/qwen3_reranker_online.py](../../../examples/pooling/score/qwen3_reranker_online.py).

    ```bash
    vllm serve Qwen/Qwen3-Reranker-0.6B --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
    ```

#### Multimodal Models

!!! note
    For more information about multimodal model inputs, see [this page](../supported_models.md#list-of-multimodal-language-models).

| Architecture | Models | Inputs | Example HF Models | [LoRA](../../features/lora.md) | [PP](../../serving/parallelism_scaling.md) |
| ------------ | ------ | ------ | ----------------- | ------------------------------ | ------------------------------------------ |
| `JinaVLForSequenceClassification` | JinaVL-based | T + I<sup>E+</sup> | `jinaai/jina-reranker-m0`, etc. | ✅︎ | ✅︎ |
| `LlamaNemotronVLForSequenceClassification` | Llama Nemotron Reranker + SigLIP | T + I<sup>E+</sup> | `nvidia/llama-nemotron-rerank-vl-1b-v2` | | |
| `Qwen3VLForSequenceClassification` | Qwen3-VL-Reranker | T + I<sup>E+</sup> + V<sup>E+</sup> | `Qwen/Qwen3-VL-Reranker-2B`(see note), etc. | ✅︎ | ✅︎ |

<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](README.md#model-conversion))  
\* Feature support is the same as that of the original model.

!!! note
    As with Qwen3-Reranker, you need the following `--hf_overrides` to load the original `Qwen3-VL-Reranker` checkpoint.

    ```bash
    vllm serve Qwen/Qwen3-VL-Reranker-2B --hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
    ```

--8<-- [end:supported-cross-encoder-models]

### Late-interaction models

All models that support the token embedding task can also be used with the score API, which computes similarity scores from the late interaction of the two input prompts. See [this page](token_embed.md) for more information about token embedding models.
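
To make the late-interaction (MaxSim) scoring function concrete, here is a minimal pure-Python sketch. It assumes the common ColBERT-style definition: normalize each token embedding, take for every query token the best dot product with any document token, and sum those maxima. The helper names and toy embeddings are illustrative, and vLLM's actual implementation may differ in details such as normalization.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best dot product with any document token."""
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


# Toy 2-dimensional token embeddings (made-up values, not model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
score = maxsim_score(query, doc)
print(f"MaxSim score: {score}")
```

Each query token here has an exactly aligned document token, so every per-token maximum is 1 and the total equals the number of query tokens.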

--8<-- "docs/models/pooling_models/token_embed.md:supported-token-embed-models"

### Bi-encoder

All models that support the embedding task can also be used with the score API, which computes similarity scores as the cosine similarity of the two input prompts' embeddings. See [this page](embed.md) for more information about embedding models.
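
As a reminder of what the bi-encoder scoring function computes, the following sketch implements cosine similarity on two plain Python lists. The vectors are made-up stand-ins for pooled embeddings, and the function name is illustrative rather than part of vLLM.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1; orthogonal vectors score 0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```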

--8<-- "docs/models/pooling_models/embed.md:supported-embed-models"

## Offline Inference

### Pooling Parameters

The following [pooling parameters][vllm.PoolingParams] are only supported by cross-encoder models and do not work for late-interaction and bi-encoder models.

```python
--8<-- "vllm/pooling_params.py:common-pooling-params"
--8<-- "vllm/pooling_params.py:classify-pooling-params"
```

### `LLM.score`

The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs.

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
(output,) = llm.score(
    "What is the capital of France?",
    "The capital of Brazil is Brasilia.",
)

score = output.outputs.score
print(f"Score: {score}")
```

A code example can be found here: [examples/basic/offline_inference/score.py](../../../examples/basic/offline_inference/score.py)

## Online Serving

### Score API

The Score API (`/score`) is similar to `LLM.score`: it computes similarity scores between two input prompts.

#### Parameters

The following Score API parameters are supported:

```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
```

#### Examples

##### Single inference

You can pass a string to both `queries` and `documents`, forming a single sentence pair.

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "queries": "What is the capital of France?",
  "documents": "The capital of France is Paris."
}'
```

??? console "Response"

    ```json
    {
      "id": "score-request-id",
      "object": "list",
      "created": 693447,
      "model": "BAAI/bge-reranker-v2-m3",
      "data": [
        {
          "index": 0,
          "object": "score",
          "score": 1
        }
      ],
      "usage": {}
    }
    ```

##### Batch inference

You can pass a string to `queries` and a list to `documents`, forming multiple sentence pairs
where each pair is built from `queries` and a string in `documents`.
The total number of pairs is `len(documents)`.

??? console "Request"

    ```bash
    curl -X 'POST' \
      'http://127.0.0.1:8000/score' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "model": "BAAI/bge-reranker-v2-m3",
      "queries": "What is the capital of France?",
      "documents": [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris."
      ]
    }'
    ```

??? console "Response"

    ```json
    {
      "id": "score-request-id",
      "object": "list",
      "created": 693570,
      "model": "BAAI/bge-reranker-v2-m3",
      "data": [
        {
          "index": 0,
          "object": "score",
          "score": 0.001094818115234375
        },
        {
          "index": 1,
          "object": "score",
          "score": 1
        }
      ],
      "usage": {}
    }
    ```

You can pass a list to both `queries` and `documents`, forming multiple sentence pairs
where each pair is built from a string in `queries` and the corresponding string in `documents` (similar to `zip()`).
The total number of pairs is `len(documents)`.

??? console "Request"

    ```bash
    curl -X 'POST' \
      'http://127.0.0.1:8000/score' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "model": "BAAI/bge-reranker-v2-m3",
      "encoding_format": "float",
      "queries": [
        "What is the capital of Brazil?",
        "What is the capital of France?"
      ],
      "documents": [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris."
      ]
    }'
    ```

??? console "Response"

    ```json
    {
      "id": "score-request-id",
      "object": "list",
      "created": 693447,
      "model": "BAAI/bge-reranker-v2-m3",
      "data": [
        {
          "index": 0,
          "object": "score",
          "score": 1
        },
        {
          "index": 1,
          "object": "score",
          "score": 1
        }
      ],
      "usage": {}
    }
    ```
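
The pairing rules described in this section can be sketched in a few lines of Python. `build_pairs` is a hypothetical helper, not a vLLM function: it handles string inputs only, treats a one-element `queries` list like a single string, and raises on mismatched list lengths. The last two behaviors are assumptions about how a server might react, not documented guarantees.

```python
def build_pairs(queries, documents):
    """Return the (query, document) pairs implied by the input shapes."""
    if isinstance(queries, str):
        queries = [queries]
    if isinstance(documents, str):
        documents = [documents]
    if len(queries) == 1:
        # A single query is paired with every document (1-to-N).
        return [(queries[0], doc) for doc in documents]
    if len(queries) != len(documents):
        raise ValueError("queries and documents must have the same length")
    # Lists of equal length are paired element-wise, like zip().
    return list(zip(queries, documents))


single = build_pairs("q", "d")
one_to_many = build_pairs("q", ["d1", "d2"])
pairwise = build_pairs(["q1", "q2"], ["d1", "d2"])
print(single, one_to_many, pairwise, sep="\n")
```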

##### Multi-modal inputs

You can pass multi-modal inputs to scoring models by including a `content` list of multi-modal inputs (images, etc.) in the request. See the example below.

=== "JinaVL-Reranker"

    To serve the model:

    ```bash
    vllm serve jinaai/jina-reranker-m0
    ```

    Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level `requests` library:

    ??? Code

        ```python
        import requests
        
        response = requests.post(
            "http://localhost:8000/v1/score",
            json={
                "model": "jinaai/jina-reranker-m0",
                "queries": "slm markdown",
                "documents": [
                    {
                        "content": [
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                                },
                            }
                        ],
                    },
                    {
                        "content": [
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                                },
                            }
                        ]
                    },
                ],
            },
        )
        response.raise_for_status()
        response_json = response.json()
        print("Scoring output:", response_json["data"][0]["score"])
        print("Scoring output:", response_json["data"][1]["score"])
        ```

Full examples:

- [examples/pooling/score/vision_score_api_online.py](../../../examples/pooling/score/vision_score_api_online.py)
- [examples/pooling/score/vision_rerank_api_online.py](../../../examples/pooling/score/vision_rerank_api_online.py)

### Rerank API

The `/rerank`, `/v1/rerank`, and `/v2/rerank` endpoints are compatible with both [Jina AI's rerank API interface](https://jina.ai/reranker/) and
[Cohere's rerank API interface](https://docs.cohere.com/v2/reference/rerank), so they work with
popular open-source tools.

Code example: [examples/pooling/score/rerank_api_online.py](../../../examples/pooling/score/rerank_api_online.py)

#### Parameters

The following Rerank API parameters are supported:

```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
```

#### Examples

The `top_n` request parameter is optional and defaults to the length of the `documents` field.
Result documents are sorted by relevance, and the `index` property gives each document's position in the original request.

??? console "Request"

    ```bash
    curl -X 'POST' \
      'http://127.0.0.1:8000/v1/rerank' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "model": "BAAI/bge-reranker-base",
      "query": "What is the capital of France?",
      "documents": [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
        "Horses and cows are both animals"
      ]
    }'
    ```

??? console "Response"

    ```json
    {
      "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
      "model": "BAAI/bge-reranker-base",
      "usage": {
        "total_tokens": 56
      },
      "results": [
        {
          "index": 1,
          "document": {
            "text": "The capital of France is Paris."
          },
          "relevance_score": 0.99853515625
        },
        {
          "index": 0,
          "document": {
            "text": "The capital of Brazil is Brasilia."
          },
          "relevance_score": 0.0005860328674316406
        }
      ]
    }
    ```
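
The sorting and `top_n` behavior described above can be mimicked client-side. The sketch below is a hypothetical helper (not vLLM code) that takes documents and their scores, sorts them by descending relevance, and keeps the original position in `index`. The two scores are copied from the response above.

```python
def rerank(documents, scores, top_n=None):
    """Sort documents by descending score and keep the top_n best."""
    if top_n is None:
        # Mirror the documented default: return every document.
        top_n = len(documents)
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [
        {
            "index": i,  # position in the original request
            "document": {"text": documents[i]},
            "relevance_score": scores[i],
        }
        for i in order[:top_n]
    ]


docs = [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
]
scores = [0.0005860328674316406, 0.99853515625]
results = rerank(docs, scores)
best_only = rerank(docs, scores, top_n=1)
print([r["index"] for r in results])
```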

## More examples

More examples can be found here: [examples/pooling/score](../../../examples/pooling/score)

## Supported Features

Because cross-encoder models are a subset of classification models that accept two prompts as input and have `num_labels` equal to 1, their supported features match those of (sequence) classification models. For more information, see [this page](classify.md#supported-features).

### Score Template

Score templates are supported for **cross-encoder** models only. If you are using an **embedding** model for scoring, vLLM does not apply a score template.

Some scoring models require a specific prompt format to work correctly. You can specify a custom score template using the `--chat-template` parameter (see [Chat Template](../../serving/openai_compatible_server.md#chat-template)).

Like chat templates, the score template receives a `messages` list. For scoring, each message has a `role` attribute—either `"query"` or `"document"`. For the usual kind of point-wise cross-encoder, you can expect exactly two messages: one query and one document. To access the query and document content, use Jinja's `selectattr` filter:

- **Query**: `{{ (messages | selectattr("role", "eq", "query") | first).content }}`
- **Document**: `{{ (messages | selectattr("role", "eq", "document") | first).content }}`

This approach is more robust than index-based access (`messages[0]`, `messages[1]`) because it selects messages by their semantic role. It also avoids assumptions about message ordering if additional message types are added to `messages` in the future.

Example template file: [examples/pooling/score/template/nemotron-rerank.jinja](../../../examples/pooling/score/template/nemotron-rerank.jinja)
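
To illustrate why role-based selection is order-independent, the following sketch mirrors the `selectattr(...) | first` lookup in plain Python. The `content_for` and `build_prompt` helpers and the `Query:`/`Document:` layout are hypothetical and are not taken from any shipped template.

```python
def content_for(messages, role):
    """Return the content of the first message with the given role."""
    return next(m for m in messages if m["role"] == role)["content"]


def build_prompt(messages):
    # Hypothetical prompt layout, for illustration only.
    query = content_for(messages, "query")
    document = content_for(messages, "document")
    return f"Query: {query}\nDocument: {document}"


messages = [
    {"role": "query", "content": "What is the capital of France?"},
    {"role": "document", "content": "The capital of France is Paris."},
]
prompt = build_prompt(messages)
# Reversing the message order does not change the rendered prompt.
reversed_prompt = build_prompt(list(reversed(messages)))
print(prompt)
```

An index-based version (`messages[0]`, `messages[1]`) would swap the query and the document when given the reversed list.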

### Enable/disable activation

You can enable or disable activation via `use_activation`. This option only works for cross-encoder models.
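
As a rough illustration of what toggling activation changes, the sketch below assumes the activation is a sigmoid, which is a common choice for single-label cross-encoders. That is an assumption rather than something stated on this page, and the logit value is made up.

```python
import math


def sigmoid(logit):
    """Map a raw logit to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


raw_logit = 2.0  # made-up classifier output, as returned with activation disabled
activated = sigmoid(raw_logit)  # assumed shape of the output with activation enabled
print(f"raw: {raw_logit}, activated: {activated:.4f}")
```

Under this assumption, disabling activation yields unbounded logits, while enabling it yields scores between 0 and 1 with the same ranking.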
[Docs] Reorganize pooling docs. (#35592) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2026-03-19 19:25:47 +08:00			`# Scoring Usages`

			The score models is designed to compute similarity scores between two input prompts. It supports three model types (aka `score_type`): `cross-encoder`, `late-interaction`, and `bi-encoder`.

			`!!! note`
			`vLLM handles only the model inference component of RAG pipelines (such as embedding generation and reranking). For higher-level RAG orchestration, you should leverage integration frameworks like [LangChain](https://github.com/langchain-ai/langchain).`

			`## Summary`

			`- Model Usage: Scoring`
			`- Pooling Task:`

[Model] Deprecate the score task (this will not affect users). (#37537) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> 2026-03-20 16:07:56 +08:00			`\| Score Types \| Pooling Tasks \| scoring function \|`
			`\|--------------------\|-----------------------\|--------------------------\|`
			\| `cross-encoder` \| `classify` (see note) \| linear classifier \|
			\| `late-interaction` \| `token_embed` \| late interaction(MaxSim) \|
			\| `bi-encoder` \| `embed` \| cosine similarity \|
[Docs] Reorganize pooling docs. (#35592) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2026-03-19 19:25:47 +08:00
			`- Offline APIs:`
			- `LLM.score`
			`- Online APIs:`
			- [Score API](scoring.md#score-api) (`/score`)
			- [Rerank API](scoring.md#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)

[Model] Deprecate the score task (this will not affect users). (#37537) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> 2026-03-20 16:07:56 +08:00			`!!! note`
			`Only when a classification model outputs num_labels equal to 1 can it be used as a scoring model and have its scoring API enabled.`

[Docs] Reorganize pooling docs. (#35592) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2026-03-19 19:25:47 +08:00			`## Supported Models`

			`### Cross-encoder models`

			`[Cross-encoder](https://www.sbert.net/examples/applications/cross-encoder/README.html) (aka reranker) models are a subset of classification models that accept two prompts as input and output num_labels equal to 1.`

[Model] Deprecate the score task (this will not affect users). (#37537) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> 2026-03-20 16:07:56 +08:00			`--8<-- [start:supported-cross-encoder-models]`
[Docs] Reorganize pooling docs. (#35592) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2026-03-19 19:25:47 +08:00
			`#### Text-only Models`

			`\| Architecture \| Models \| Example HF Models \| Score template (see note) \| [LoRA](../../features/lora.md) \| [PP](../../serving/parallelism_scaling.md) \|`
			`\| ------------ \| ------ \| ----------------- \| ------------------------- \| --------------------------- \| --------------------------------------- \|`
			\| `BertForSequenceClassification` \| BERT-based \| `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. \| N/A \| \| \|
			\| `GemmaForSequenceClassification` \| Gemma-based \| `BAAI/bge-reranker-v2-gemma`(see note), etc. \| [bge-reranker-v2-gemma.jinja](../../../examples/pooling/score/template/bge-reranker-v2-gemma.jinja) \| ✅︎ \| ✅︎ \|
			\| `GteNewForSequenceClassification` \| mGTE-TRM (see note) \| `Alibaba-NLP/gte-multilingual-reranker-base`, etc. \| N/A \| \| \|
			\| `LlamaBidirectionalForSequenceClassification`<sup>C</sup> \| Llama-based with bidirectional attention \| `nvidia/llama-nemotron-rerank-1b-v2`, etc. \| [nemotron-rerank.jinja](../../../examples/pooling/score/template/nemotron-rerank.jinja) \| ✅︎ \| ✅︎ \|
			\| `Qwen2ForSequenceClassification`<sup>C</sup> \| Qwen2-based \| `mixedbread-ai/mxbai-rerank-base-v2`(see note), etc. \| [mxbai_rerank_v2.jinja](../../../examples/pooling/score/template/mxbai_rerank_v2.jinja) \| ✅︎ \| ✅︎ \|
			\| `Qwen3ForSequenceClassification`<sup>C</sup> \| Qwen3-based \| `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B`(see note), etc. \| [qwen3_reranker.jinja](../../../examples/pooling/score/template/qwen3_reranker.jinja) \| ✅︎ \| ✅︎ \|
			\| `RobertaForSequenceClassification` \| RoBERTa-based \| `cross-encoder/quora-roberta-base`, etc. \| N/A \| \| \|
			\| `XLMRobertaForSequenceClassification` \| XLM-RoBERTa-based \| `BAAI/bge-reranker-v2-m3`, etc. \| N/A \| \| \|
			\| `Model`<sup>C</sup>, `ForCausalLM`<sup>C</sup>, etc. \| Generative models \| N/A \| N/A \| \* \| \* \|

			<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./README.md#model-conversion))
			`\* Feature support is the same as that of the original model.`

			`!!! note`
			`Some models require a specific prompt format to work correctly.`

			`You can find Example HF Models's corresponding score template in [examples/pooling/score/template/](../../../examples/pooling/score/template)`

			`Examples : [examples/pooling/score/using_template_offline.py](../../../examples/pooling/score/using_template_offline.py) [examples/pooling/score/using_template_online.py](../../../examples/pooling/score/using_template_online.py)`

			`!!! note`
			Load the official original `BAAI/bge-reranker-v2-gemma` by using the following command.

			```bash
			`vllm serve BAAI/bge-reranker-v2-gemma --hf_overrides '{"architectures": ["GemmaForSequenceClassification"],"classifier_from_token": ["Yes"],"method": "no_post_processing"}'`
			```

			`!!! note`
			The second-generation GTE model (mGTE-TRM) is named `NewForSequenceClassification`. The name `NewForSequenceClassification` is too generic, you should set `--hf-overrides '{"architectures": ["GteNewForSequenceClassification"]}'` to specify the use of the `GteNewForSequenceClassification` architecture.

			`!!! note`
			Load the official original `mxbai-rerank-v2` by using the following command.

			```bash
			`vllm serve mixedbread-ai/mxbai-rerank-base-v2 --hf_overrides '{"architectures": ["Qwen2ForSequenceClassification"],"classifier_from_token": ["0", "1"], "method": "from_2_way_softmax"}'`
			```

			`!!! note`
			Load the official original `Qwen3 Reranker` by using the following command. More information can be found at: [examples/pooling/score/qwen3_reranker_offline.py](../../../examples/pooling/score/qwen3_reranker_offline.py) [examples/pooling/score/qwen3_reranker_online.py](../../../examples/pooling/score/qwen3_reranker_online.py).

			```bash
			`vllm serve Qwen/Qwen3-Reranker-0.6B --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'`
			```

			`#### Multimodal Models`

			`!!! note`
			`For more information about multimodal models inputs, see [this page](../supported_models.md#list-of-multimodal-language-models).`

			`\| Architecture \| Models \| Inputs \| Example HF Models \| [LoRA](../../features/lora.md) \| [PP](../../serving/parallelism_scaling.md) \|`
			`\| ------------ \| ------ \| ------ \| ----------------- \| ------------------------------ \| ------------------------------------------ \|`
			\| `JinaVLForSequenceClassification` \| JinaVL-based \| T + I<sup>E+</sup> \| `jinaai/jina-reranker-m0`, etc. \| ✅︎ \| ✅︎ \|
			\| `LlamaNemotronVLForSequenceClassification` \| Llama Nemotron Reranker + SigLIP \| T + I<sup>E+</sup> \| `nvidia/llama-nemotron-rerank-vl-1b-v2` \| \| \|
			\| `Qwen3VLForSequenceClassification` \| Qwen3-VL-Reranker \| T + I<sup>E+</sup> + V<sup>E+</sup> \| `Qwen/Qwen3-VL-Reranker-2B`(see note), etc. \| ✅︎ \| ✅︎ \|

			<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](README.md#model-conversion))
			`\* Feature support is the same as that of the original model.`

			`!!! note`
			Similar to Qwen3-Reranker, you need to use the following `--hf_overrides` to load the official original `Qwen3-VL-Reranker`.

			```bash
			`vllm serve Qwen/Qwen3-VL-Reranker-2B --hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'`
			```

[Model] Deprecate the score task (this will not affect users). (#37537) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> 2026-03-20 16:07:56 +08:00			`--8<-- [end:supported-cross-encoder-models]`
[Docs] Reorganize pooling docs. (#35592) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2026-03-19 19:25:47 +08:00
			`### Late-interaction models`

			`All models that support token embedding task also support using the score API to compute similarity scores by calculating the late interaction of two input prompts. See [this page](token_embed.md) for more information about token embedding models.`

			`--8<-- "docs/models/pooling_models/token_embed.md:supported-token-embed-models"`

			`### Bi-encoder`

			`All models that support embedding task also support using the score API to compute similarity scores by calculating the cosine similarity of two input prompt's embeddings. See [this page](embed.md) for more information about embedding models.`

			`--8<-- "docs/models/pooling_models/embed.md:supported-embed-models"`

			`## Offline Inference`

			`### Pooling Parameters`

			`The following [pooling parameters][vllm.PoolingParams] are only supported by cross-encoder models and do not work for late-interaction and bi-encoder models.`

			```python
			`--8<-- "vllm/pooling_params.py:common-pooling-params"`
			`--8<-- "vllm/pooling_params.py:classify-pooling-params"`
			```

			### `LLM.score`

			`The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs.`

			```python
			`from vllm import LLM`

			`llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")`
			`(output,) = llm.score(`
			`"What is the capital of France?",`
			`"The capital of Brazil is Brasilia.",`
			`)`

			`score = output.outputs.score`
			`print(f"Score: {score}")`
			```

			`A code example can be found here: [examples/basic/offline_inference/score.py](../../../examples/basic/offline_inference/score.py)`

			`## Online Serving`

			`### Score API`

			Our Score API (`/score`) is similar to `LLM.score`, compute similarity scores between two input prompts.

			`#### Parameters`

			`The following Score API parameters are supported:`

			```python
			`--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"`
			`--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"`
			`--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"`
			```

			`#### Examples`

			`##### Single inference`

			You can pass a string to both `queries` and `documents`, forming a single sentence pair.

			```bash
			`curl -X 'POST' \`
			`'http://127.0.0.1:8000/score' \`
			`-H 'accept: application/json' \`
			`-H 'Content-Type: application/json' \`
			`-d '{`
			`"model": "BAAI/bge-reranker-v2-m3",`
			`"encoding_format": "float",`
			`"queries": "What is the capital of France?",`
			`"documents": "The capital of France is Paris."`
			`}'`
			```

			`??? console "Response"`

			```json
			`{`
			`"id": "score-request-id",`
			`"object": "list",`
			`"created": 693447,`
			`"model": "BAAI/bge-reranker-v2-m3",`
			`"data": [`
			`{`
			`"index": 0,`
			`"object": "score",`
			`"score": 1`
			`}`
			`],`
			`"usage": {}`
			`}`
			```

			`##### Batch inference`

			You can pass a string to `queries` and a list to `documents`, forming multiple sentence pairs
			where each pair is built from `queries` and a string in `documents`.
			The total number of pairs is `len(documents)`.

			`??? console "Request"`

			```bash
			`curl -X 'POST' \`
			`'http://127.0.0.1:8000/score' \`
			`-H 'accept: application/json' \`
			`-H 'Content-Type: application/json' \`
			`-d '{`
			`"model": "BAAI/bge-reranker-v2-m3",`
			`"queries": "What is the capital of France?",`
			`"documents": [`
			`"The capital of Brazil is Brasilia.",`
			`"The capital of France is Paris."`
			`]`
			`}'`
			```

			`??? console "Response"`

			```json
			`{`
			`"id": "score-request-id",`
			`"object": "list",`
			`"created": 693570,`
			`"model": "BAAI/bge-reranker-v2-m3",`
			`"data": [`
			`{`
			`"index": 0,`
			`"object": "score",`
			`"score": 0.001094818115234375`
			`},`
			`{`
			`"index": 1,`
			`"object": "score",`
			`"score": 1`
			`}`
			`],`
			`"usage": {}`
			`}`
			```

			You can pass a list to both `queries` and `documents`, forming multiple sentence pairs
			where each pair is built from a string in `queries` and the corresponding string in `documents` (similar to `zip()`).
			The total number of pairs is `len(documents)`.

??? console "Request"

    ```bash
    curl -X 'POST' \
      'http://127.0.0.1:8000/score' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "model": "BAAI/bge-reranker-v2-m3",
      "encoding_format": "float",
      "queries": [
        "What is the capital of Brazil?",
        "What is the capital of France?"
      ],
      "documents": [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris."
      ]
    }'
    ```

??? console "Response"

    ```json
    {
      "id": "score-request-id",
      "object": "list",
      "created": 693447,
      "model": "BAAI/bge-reranker-v2-m3",
      "data": [
        {
          "index": 0,
          "object": "score",
          "score": 1
        },
        {
          "index": 1,
          "object": "score",
          "score": 1
        }
      ],
      "usage": {}
    }
    ```
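
The two pairing rules for batch inference (one query against many documents, and element-wise pairing of two lists) can be sketched in plain Python. This is only an illustration of the rules as described in this section, not vLLM's implementation. The function name `build_pairs` is hypothetical, and raising an error on mismatched list lengths is an assumption of this sketch; the server's actual behavior may differ.

```python
def build_pairs(queries, documents):
    """Form (query, document) pairs following the batch-inference rules."""
    if isinstance(documents, str):
        documents = [documents]
    if isinstance(queries, str):
        # 1-to-N: the single query is paired with every document.
        return [(queries, doc) for doc in documents]
    if len(queries) != len(documents):
        # Assumption of this sketch, not documented server behavior.
        raise ValueError("queries and documents must have the same length")
    # N-to-N: pair element-wise, like zip().
    return list(zip(queries, documents))


one_to_many = build_pairs(
    "What is the capital of France?",
    ["The capital of Brazil is Brasilia.", "The capital of France is Paris."],
)
many_to_many = build_pairs(
    ["What is the capital of Brazil?", "What is the capital of France?"],
    ["The capital of Brazil is Brasilia.", "The capital of France is Paris."],
)
print(len(one_to_many), len(many_to_many))
```

In both cases the number of pairs equals `len(documents)`.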

##### Multi-modal inputs

You can pass multi-modal inputs to scoring models by including a `content` field that holds a list of multi-modal inputs (images, etc.) in the request. Refer to the examples below for illustration.

=== "JinaVL-Reranker"

    To serve the model:

    ```bash
    vllm serve jinaai/jina-reranker-m0
    ```

    Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level `requests` library:

    ??? Code

        ```python
        import requests

        # Each document is a `content` list of multi-modal parts (here, one image URL).
        response = requests.post(
            "http://localhost:8000/v1/score",
            json={
                "model": "jinaai/jina-reranker-m0",
                "queries": "slm markdown",
                "documents": [
                    {
                        "content": [
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                                },
                            }
                        ],
                    },
                    {
                        "content": [
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                                },
                            }
                        ]
                    },
                ],
            },
        )
        # Fail early on HTTP errors before parsing the body.
        response.raise_for_status()
        response_json = response.json()
        # Print the score for each of the two documents.
        print("Scoring output:", response_json["data"][0]["score"])
        print("Scoring output:", response_json["data"][1]["score"])
        ```

Full example:

- [examples/pooling/score/vision_score_api_online.py](../../../examples/pooling/score/vision_score_api_online.py)
- [examples/pooling/score/vision_rerank_api_online.py](../../../examples/pooling/score/vision_rerank_api_online.py)

### Rerank API

The `/rerank`, `/v1/rerank`, and `/v2/rerank` endpoints are compatible with both [Jina AI's rerank API interface](https://jina.ai/reranker/) and
[Cohere's rerank API interface](https://docs.cohere.com/v2/reference/rerank) to ensure compatibility with
popular open-source tools.

Code example: [examples/pooling/score/rerank_api_online.py](../../../examples/pooling/score/rerank_api_online.py)

#### Parameters

The following Rerank API parameters are supported:

```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
```

#### Examples

Note that the `top_n` request parameter is optional and defaults to the length of the `documents` field.
Result documents are sorted by relevance, and the `index` property can be used to determine the original order.

??? console "Request"

    ```bash
    curl -X 'POST' \
      'http://127.0.0.1:8000/v1/rerank' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "model": "BAAI/bge-reranker-base",
      "query": "What is the capital of France?",
      "documents": [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
        "Horses and cows are both animals"
      ]
    }'
    ```

??? console "Response"

    ```json
    {
      "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
      "model": "BAAI/bge-reranker-base",
      "usage": {
        "total_tokens": 56
      },
      "results": [
        {
          "index": 1,
          "document": {
            "text": "The capital of France is Paris."
          },
          "relevance_score": 0.99853515625
        },
        {
          "index": 0,
          "document": {
            "text": "The capital of Brazil is Brasilia."
          },
          "relevance_score": 0.0005860328674316406
        }
      ]
    }
    ```
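
The ordering and `top_n` behavior described in this section can be sketched client-side. The helper below is hypothetical and is not vLLM's implementation: it takes one score per document, sorts by relevance, keeps the original position in `index`, and truncates to `top_n` (defaulting to all documents). The two scores are copied from the sample response above.

```python
def rerank(documents, scores, top_n=None):
    """Sort documents by score (highest first) and keep their original index."""
    if top_n is None:
        # Mirror the documented default: return every document.
        top_n = len(documents)
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [
        {"index": i, "document": {"text": documents[i]}, "relevance_score": score}
        for i, score in ranked[:top_n]
    ]


documents = [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
]
scores = [0.0005860328674316406, 0.99853515625]

for result in rerank(documents, scores):
    print(result["index"], result["relevance_score"], result["document"]["text"])
```

Reading `index` from each result recovers the position of that document in the original request, regardless of how the results were sorted.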

## More examples

More examples can be found here: [examples/pooling/score](../../../examples/pooling/score)

## Supported Features

As cross-encoder models are a subset of classification models that accept two prompts as input and output num_labels equal to 1, cross-encoder features should be consistent with (sequence) classification. For more information, see [this page](classify.md#supported-features).

### Score Template

Score templates are supported for cross-encoder models only. If you are using an embedding model for scoring, vLLM does not apply a score template.

Some scoring models require a specific prompt format to work correctly. You can specify a custom score template using the `--chat-template` parameter (see [Chat Template](../../serving/openai_compatible_server.md#chat-template)).

Like chat templates, the score template receives a `messages` list. For scoring, each message has a `role` attribute, either `"query"` or `"document"`. For the usual point-wise cross-encoder, you can expect exactly two messages: one query and one document. To access the query and document content, use Jinja's `selectattr` filter:

- Query: `{{ (messages | selectattr("role", "eq", "query") | first).content }}`
- Document: `{{ (messages | selectattr("role", "eq", "document") | first).content }}`

This approach is more robust than index-based access (`messages[0]`, `messages[1]`) because it selects messages by their semantic role. It also avoids assumptions about message ordering if additional message types are added to `messages` in the future.

Example template file: [examples/pooling/score/template/nemotron-rerank.jinja](../../../examples/pooling/score/template/nemotron-rerank.jinja)
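
To see why role-based selection is less brittle than positional access, the same idea can be mimicked in plain Python. The sketch below is only an analogue of the Jinja expressions above. It does not render a template, and the message dictionaries and the helper name `first_with_role` are hypothetical.

```python
def first_with_role(messages, role):
    """Return the content of the first message with the given role."""
    return next(m["content"] for m in messages if m["role"] == role)


# Hypothetical messages; the document is deliberately listed first.
messages = [
    {"role": "document", "content": "The capital of France is Paris."},
    {"role": "query", "content": "What is the capital of France?"},
]

query = first_with_role(messages, "query")
document = first_with_role(messages, "document")
print(query)     # What is the capital of France?
print(document)  # The capital of France is Paris.
```

Positional access (`messages[0]`) would have returned the document as the query here, while selecting by role is unaffected by the ordering.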

### Enable/disable activation

You can enable or disable activation via `use_activation`. This option only works for cross-encoder models.
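
As an illustration of what such a switch changes, the sketch below assumes the activation is a sigmoid applied to a single raw logit, which is a common convention for cross-encoders with one output label. That assumption, and the function name `to_score`, are not taken from vLLM; the snippet does not call vLLM at all.

```python
import math


def to_score(logit, use_activation=True):
    """Return the sigmoid of the logit, or the raw logit when activation is off."""
    if not use_activation:
        return logit
    # Numerically stable sigmoid (assumed activation for a single logit).
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


print(to_score(0.0))                        # 0.5
print(to_score(2.5, use_activation=False))  # 2.5
```

With activation enabled, scores fall between 0 and 1; with it disabled, the raw logit is returned unchanged.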