wang.yuqi ed359c497a [Model] Deprecate the score task (this will not affect users). (#37537 )

Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>

2026-03-20 08:07:56 +00:00


Scoring Usages

Score models are designed to compute similarity scores between two input prompts. Three model types (aka score_type) are supported: cross-encoder, late-interaction, and bi-encoder.

!!! note vLLM handles only the model inference component of RAG pipelines (such as embedding generation and reranking). For higher-level RAG orchestration, you should leverage integration frameworks like LangChain.

Summary

Model Usage: Scoring
Pooling Task:

| Score Types | Pooling Tasks | Scoring function |
|---|---|---|
| `cross-encoder` | `classify` (see note) | linear classifier |
| `late-interaction` | `token_embed` | late interaction (MaxSim) |
| `bi-encoder` | `embed` | cosine similarity |

Offline APIs:
- LLM.score
Online APIs:
- Score API (/score)
- Rerank API (/rerank, /v1/rerank, /v2/rerank)

!!! note Only when a classification model outputs num_labels equal to 1 can it be used as a scoring model and have its scoring API enabled.

Supported Models

Cross-encoder models

Cross-encoder (aka reranker) models are a subset of classification models that accept two prompts as input and output num_labels equal to 1.

--8<-- [start:supported-cross-encoder-models]

Text-only Models

| Architecture | Models | Example HF Models | Score template (see note) | LoRA | PP |
|---|---|---|---|---|---|
| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | N/A | | |
| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma` (see note), etc. | bge-reranker-v2-gemma.jinja | ✅︎ | ✅︎ |
| `GteNewForSequenceClassification` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-reranker-base`, etc. | N/A | | |
| `LlamaBidirectionalForSequenceClassification`^C | Llama-based with bidirectional attention | `nvidia/llama-nemotron-rerank-1b-v2`, etc. | nemotron-rerank.jinja | ✅︎ | ✅︎ |
| `Qwen2ForSequenceClassification`^C | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2` (see note), etc. | mxbai_rerank_v2.jinja | ✅︎ | ✅︎ |
| `Qwen3ForSequenceClassification`^C | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | qwen3_reranker.jinja | ✅︎ | ✅︎ |
| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | N/A | | |
| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | N/A | | |
| `Model`^C, `ForCausalLM`^C, etc. | Generative models | N/A | N/A | * | * |

^C Automatically converted into a classification model via --convert classify. (details)
* Feature support is the same as that of the original model.

!!! note Some models require a specific prompt format to work correctly.

The score templates for the example HF models are in [examples/pooling/score/template/](../../../examples/pooling/score/template).

Examples : [examples/pooling/score/using_template_offline.py](../../../examples/pooling/score/using_template_offline.py) [examples/pooling/score/using_template_online.py](../../../examples/pooling/score/using_template_online.py)

!!! note Load the official original BAAI/bge-reranker-v2-gemma by using the following command.

```bash
vllm serve BAAI/bge-reranker-v2-gemma --hf_overrides '{"architectures": ["GemmaForSequenceClassification"],"classifier_from_token": ["Yes"],"method": "no_post_processing"}'
```

!!! note The second-generation GTE model (mGTE-TRM) is named NewForSequenceClassification. Because that name is too generic, you should set --hf-overrides '{"architectures": ["GteNewForSequenceClassification"]}' to select the GteNewForSequenceClassification architecture explicitly.

!!! note Load the official original mxbai-rerank-v2 by using the following command.

```bash
vllm serve mixedbread-ai/mxbai-rerank-base-v2 --hf_overrides '{"architectures": ["Qwen2ForSequenceClassification"],"classifier_from_token": ["0", "1"], "method": "from_2_way_softmax"}'
```

!!! note Load the official original Qwen3 Reranker by using the following command. More information can be found in examples/pooling/score/qwen3_reranker_offline.py and examples/pooling/score/qwen3_reranker_online.py.

```bash
vllm serve Qwen/Qwen3-Reranker-0.6B --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
```

Multimodal Models

!!! note For more information about multimodal model inputs, see this page.

| Architecture | Models | Inputs | Example HF Models | LoRA | PP |
|---|---|---|---|---|---|
| `JinaVLForSequenceClassification` | JinaVL-based | T + I^E+ | `jinaai/jina-reranker-m0`, etc. | ✅︎ | ✅︎ |
| `LlamaNemotronVLForSequenceClassification` | Llama Nemotron Reranker + SigLIP | T + I^E+ | `nvidia/llama-nemotron-rerank-vl-1b-v2` | | |
| `Qwen3VLForSequenceClassification` | Qwen3-VL-Reranker | T + I^E+ + V^E+ | `Qwen/Qwen3-VL-Reranker-2B` (see note), etc. | ✅︎ | ✅︎ |

^C Automatically converted into a classification model via --convert classify. (details)
* Feature support is the same as that of the original model.

!!! note Similar to Qwen3-Reranker, you need to use the following --hf_overrides to load the official original Qwen3-VL-Reranker.

```bash
vllm serve Qwen/Qwen3-VL-Reranker-2B --hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
```

--8<-- [end:supported-cross-encoder-models]

Late-interaction models

All models that support the token embedding task also support the score API, which computes similarity scores from the late interaction of the two input prompts. See this page for more information about token embedding models.
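To make the late-interaction (MaxSim) idea concrete, here is a minimal pure-Python sketch. It assumes the token embeddings are L2-normalized before comparison and that the final score is the sum of per-query-token maxima; the helper names are hypothetical and this is not vLLM's implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm (assumed preprocessing step)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def maxsim_score(query_tokens, doc_tokens):
    """For each query token, take its best match among the document
    tokens, then sum those maxima."""
    q = [normalize(t) for t in query_tokens]
    d = [normalize(t) for t in doc_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q)


# Toy 2-dimensional token embeddings, made up for illustration.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]

score = maxsim_score(query_tokens, doc_tokens)
print(f"MaxSim score: {score:.4f}")
```

Each query token contributes independently, which is why this scoring needs per-token embeddings rather than a single pooled vector.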

--8<-- "docs/models/pooling_models/token_embed.md:supported-token-embed-models"

Bi-encoder

All models that support the embedding task also support the score API, which computes similarity scores as the cosine similarity of the two input prompts' embeddings. See this page for more information about embedding models.
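As a rough illustration of bi-encoder scoring, the sketch below computes the cosine similarity of two embedding vectors in plain Python. The vectors are made-up toy values, and the function name is hypothetical rather than part of vLLM.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy embeddings standing in for the pooled outputs of the two prompts.
query_embedding = [3.0, 4.0]
document_embedding = [4.0, 3.0]

score = cosine_similarity(query_embedding, document_embedding)
print(f"Cosine similarity: {score:.2f}")
```

Because the two prompts are embedded independently, document embeddings can be computed once and reused across queries, unlike cross-encoder scoring.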

--8<-- "docs/models/pooling_models/embed.md:supported-embed-models"

Offline Inference

Pooling Parameters

The following [pooling parameters][vllm.PoolingParams] are only supported by cross-encoder models and do not work for late-interaction and bi-encoder models.

--8<-- "vllm/pooling_params.py:common-pooling-params"
--8<-- "vllm/pooling_params.py:classify-pooling-params"

`LLM.score`

The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs.

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
(output,) = llm.score(
    "What is the capital of France?",
    "The capital of Brazil is Brasilia.",
)

score = output.outputs.score
print(f"Score: {score}")
```

A code example can be found here: examples/basic/offline_inference/score.py

Online Serving

Score API

Our Score API (/score) is similar to LLM.score: it computes similarity scores between two input prompts.

Parameters

The following Score API parameters are supported:

--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"

Examples

Single inference

You can pass a string to both queries and documents, forming a single sentence pair.

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "queries": "What is the capital of France?",
  "documents": "The capital of France is Paris."
}'
```

??? console "Response"

```json
{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}
```
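The same single-pair request can be built from Python using only the standard library. This is a sketch: `build_score_request` is a hypothetical helper, the host and port are copied from the curl example above, and the actual send is left commented out because it needs a running server.

```python
import json
import urllib.request


def build_score_request(base_url, model, queries, documents):
    """Build (but do not send) a POST request for the /score endpoint."""
    payload = {"model": model, "queries": queries, "documents": documents}
    return urllib.request.Request(
        url=f"{base_url}/score",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_score_request(
    "http://127.0.0.1:8000",
    "BAAI/bge-reranker-v2-m3",
    "What is the capital of France?",
    "The capital of France is Paris.",
)
print(request.get_method(), request.full_url)

# To send it against a running server (not executed here):
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
#     print(body["data"][0]["score"])
```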

Batch inference

You can pass a string to queries and a list to documents, forming multiple sentence pairs where each pair is built from queries and a string in documents. The total number of pairs is len(documents).

??? console "Request"

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "queries": "What is the capital of France?",
  "documents": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'
```

??? console "Response"

```json
{
  "id": "score-request-id",
  "object": "list",
  "created": 693570,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.001094818115234375
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}
```

You can pass a list to both queries and documents, forming multiple sentence pairs where each pair is built from a string in queries and the corresponding string in documents (similar to zip()). The total number of pairs is len(documents).

??? console "Request"

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "queries": [
    "What is the capital of Brazil?",
    "What is the capital of France?"
  ],
  "documents": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'
```

??? console "Response"

```json
{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}
```
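The pairing rules described in this section (one string against a list, or two lists matched position by position) can be sketched in a few lines of Python. `build_pairs` is a hypothetical helper for illustration only; in particular, rejecting lists of different lengths is an assumption of this sketch, not documented server behavior.

```python
def build_pairs(queries, documents):
    """Return the (query, document) pairs implied by the request shape."""
    if isinstance(documents, str):
        documents = [documents]
    if isinstance(queries, str):
        # A single query string is paired with every document.
        return [(queries, doc) for doc in documents]
    if len(queries) != len(documents):
        # Assumption of this sketch: mismatched lists are an error.
        raise ValueError("queries and documents must have the same length")
    # Two lists are matched element by element, like zip().
    return list(zip(queries, documents))


one_to_many = build_pairs(
    "What is the capital of France?",
    ["The capital of Brazil is Brasilia.", "The capital of France is Paris."],
)
many_to_many = build_pairs(
    ["What is the capital of Brazil?", "What is the capital of France?"],
    ["The capital of Brazil is Brasilia.", "The capital of France is Paris."],
)
print(len(one_to_many), len(many_to_many))
```

In both shapes the number of pairs equals `len(documents)`, which matches the number of entries in the response's `data` list.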

Multi-modal inputs

You can pass multi-modal inputs to scoring models by including a content list of multi-modal items (images, etc.) in the request. Refer to the examples below for illustration.

=== "JinaVL-Reranker"

To serve the model:

```bash
vllm serve jinaai/jina-reranker-m0
```

Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level `requests` library:

??? Code

    ```python
    import requests
    
    response = requests.post(
        "http://localhost:8000/v1/score",
        json={
            "model": "jinaai/jina-reranker-m0",
            "queries": "slm markdown",
            "documents": [
                {
                    "content": [
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                            },
                        }
                    ],
                },
                {
                    "content": [
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                            },
                        }
                    ]
                },
            ],
        },
    )
    response.raise_for_status()
    response_json = response.json()
    print("Scoring output:", response_json["data"][0]["score"])
    print("Scoring output:", response_json["data"][1]["score"])
    ```

Full example:

Rerank API

/rerank, /v1/rerank, and /v2/rerank APIs are compatible with both Jina AI's rerank API interface and Cohere's rerank API interface to ensure compatibility with popular open-source tools.

Code example: examples/pooling/score/rerank_api_online.py

Parameters

The following Rerank API parameters are supported:

--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"

Examples

Note that the top_n request parameter is optional and defaults to the length of the documents field. Result documents are sorted by relevance, and the index property can be used to recover the original order.

??? console "Request"

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/v1/rerank' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-base",
  "query": "What is the capital of France?",
  "documents": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
    "Horses and cows are both animals"
  ]
}'
```

??? console "Response"

```json
{
  "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
  "model": "BAAI/bge-reranker-base",
  "usage": {
    "total_tokens": 56
  },
  "results": [
    {
      "index": 1,
      "document": {
        "text": "The capital of France is Paris."
      },
      "relevance_score": 0.99853515625
    },
    {
      "index": 0,
      "document": {
        "text": "The capital of Brazil is Brasilia."
      },
      "relevance_score": 0.0005860328674316406
    }
  ]
}
```
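As a small client-side illustration of the `index` and `relevance_score` fields, the sketch below post-processes the response shown above. The helper names are hypothetical, and the `results` list is copied from that sample response rather than fetched from a server.

```python
# Results copied from the sample response above.
results = [
    {
        "index": 1,
        "document": {"text": "The capital of France is Paris."},
        "relevance_score": 0.99853515625,
    },
    {
        "index": 0,
        "document": {"text": "The capital of Brazil is Brasilia."},
        "relevance_score": 0.0005860328674316406,
    },
]


def in_original_order(results):
    """Re-sort results by their position in the request's documents list."""
    return sorted(results, key=lambda r: r["index"])


def best_n(results, n):
    """Client-side equivalent of keeping only the n most relevant results."""
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return ranked[:n]


print([r["index"] for r in in_original_order(results)])
print(best_n(results, 1)[0]["document"]["text"])
```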

More examples

More examples can be found here: examples/pooling/score

Supported Features

Because cross-encoder models are a subset of classification models that accept two prompts as input and output num_labels equal to 1, cross-encoder features should be consistent with (sequence) classification. For more information, see this page.

Score Template

Score templates are supported for cross-encoder models only. If you are using an embedding model for scoring, vLLM does not apply a score template.

Some scoring models require a specific prompt format to work correctly. You can specify a custom score template using the --chat-template parameter (see Chat Template).

Like chat templates, the score template receives a messages list. For scoring, each message has a role attribute—either "query" or "document". For the usual kind of point-wise cross-encoder, you can expect exactly two messages: one query and one document. To access the query and document content, use Jinja's selectattr filter:

```jinja
Query: {{ (messages | selectattr("role", "eq", "query") | first).content }}
Document: {{ (messages | selectattr("role", "eq", "document") | first).content }}
```

This approach is more robust than index-based access (messages[0], messages[1]) because it selects messages by their semantic role. It also avoids assumptions about message ordering if additional message types are added to messages in the future.

Example template file: examples/pooling/score/template/nemotron-rerank.jinja
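The benefit of role-based selection over index-based access can be seen in a plain-Python analogue of the `selectattr(...) | first` lookup. This is only an illustration of the idea: `content_for_role` is a hypothetical helper, and the messages are modeled as plain dictionaries, which may differ from what the template actually receives.

```python
def content_for_role(messages, role):
    """Return the content of the first message with the given role,
    mirroring `selectattr("role", "eq", role) | first` in the template."""
    return next(m["content"] for m in messages if m["role"] == role)


# The document deliberately comes first to show that order does not matter.
messages = [
    {"role": "document", "content": "The capital of France is Paris."},
    {"role": "query", "content": "What is the capital of France?"},
]

prompt = (
    f"Query: {content_for_role(messages, 'query')}\n"
    f"Document: {content_for_role(messages, 'document')}"
)
print(prompt)
```

With index-based access, the reversed list above would silently swap the query and the document in the rendered prompt.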

Enable/disable activation

You can enable or disable activation via use_activation. This option only works for cross-encoder models.
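To illustrate what toggling the activation changes, the sketch below assumes the activation is a sigmoid applied to the single classifier logit, which is a common choice for num_labels equal to 1 but is not confirmed by this page. `apply_activation` is a hypothetical helper, not a vLLM function.

```python
import math


def apply_activation(logit, use_activation=True):
    """Return the raw logit, or its sigmoid when activation is enabled.

    Assumption: the activation is a sigmoid over one logit.
    """
    if not use_activation:
        return logit
    return 1.0 / (1.0 + math.exp(-logit))


raw_logit = 2.5  # made-up classifier output for one query/document pair
print(apply_activation(raw_logit, use_activation=True))   # value in (0, 1)
print(apply_activation(raw_logit, use_activation=False))  # unchanged logit
```

Under this assumption, disabling the activation leaves the ranking of documents unchanged, since the sigmoid is monotonic, but the scores are no longer bounded to the 0 to 1 range.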
