# Scoring Usages

Score models are designed to compute similarity scores between two input prompts. Three model types (aka `score_type`) are supported: `cross-encoder`, `late-interaction`, and `bi-encoder`.

!!! note
    vLLM handles only the model inference component of RAG pipelines (such as embedding generation and reranking). For higher-level RAG orchestration, you should leverage integration frameworks like [LangChain](https://github.com/langchain-ai/langchain).

## Summary

- Model Usage: Scoring
- Pooling Task:

| Score Types        | Pooling Tasks         | Scoring function          |
|--------------------|-----------------------|---------------------------|
| `cross-encoder`    | `classify` (see note) | linear classifier         |
| `late-interaction` | `token_embed`         | late interaction (MaxSim) |
| `bi-encoder`       | `embed`               | cosine similarity         |

- Offline APIs:
    - `LLM.score`
- Online APIs:
    - [Score API](scoring.md#score-api) (`/score`)
    - [Rerank API](scoring.md#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)

!!! note
    A classification model can be used as a scoring model, and have its scoring API enabled, only if it outputs num_labels equal to 1.

## Supported Models

### Cross-encoder models

[Cross-encoder](https://www.sbert.net/examples/applications/cross-encoder/README.html) (aka reranker) models are a subset of classification models that accept two prompts as input and output num_labels equal to 1.

--8<-- [start:supported-cross-encoder-models]

#### Text-only Models

| Architecture | Models | Example HF Models | Score template (see note) | [LoRA](../../features/lora.md) | [PP](../../serving/parallelism_scaling.md) |
| ------------ | ------ | ----------------- | ------------------------- | --------------------------- | --------------------------------------- |
| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | N/A | | |
| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma`(see note), etc. | [bge-reranker-v2-gemma.jinja](../../../examples/pooling/score/template/bge-reranker-v2-gemma.jinja) | ✅︎ | ✅︎ |
| `GteNewForSequenceClassification` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-reranker-base`, etc. | N/A | | |
| `LlamaBidirectionalForSequenceClassification`<sup>C</sup> | Llama-based with bidirectional attention | `nvidia/llama-nemotron-rerank-1b-v2`, etc. | [nemotron-rerank.jinja](../../../examples/pooling/score/template/nemotron-rerank.jinja) | ✅︎ | ✅︎ |
| `Qwen2ForSequenceClassification`<sup>C</sup> | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2`(see note), etc. | [mxbai_rerank_v2.jinja](../../../examples/pooling/score/template/mxbai_rerank_v2.jinja) | ✅︎ | ✅︎ |
| `Qwen3ForSequenceClassification`<sup>C</sup> | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B`(see note), etc. | [qwen3_reranker.jinja](../../../examples/pooling/score/template/qwen3_reranker.jinja) | ✅︎ | ✅︎ |
| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | N/A | | |
| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | N/A | | |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | N/A | \* | \* |

<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./README.md#model-conversion))  
\* Feature support is the same as that of the original model.

!!! note
    Some models require a specific prompt format to work correctly.

    You can find the score template for each example HF model in [examples/pooling/score/template/](../../../examples/pooling/score/template)

    Examples: [examples/pooling/score/using_template_offline.py](../../../examples/pooling/score/using_template_offline.py) and [examples/pooling/score/using_template_online.py](../../../examples/pooling/score/using_template_online.py)

!!! note
    Load the official original `BAAI/bge-reranker-v2-gemma` by using the following command.

    ```bash
    vllm serve BAAI/bge-reranker-v2-gemma --hf_overrides '{"architectures": ["GemmaForSequenceClassification"],"classifier_from_token": ["Yes"],"method": "no_post_processing"}'
    ```

!!! note
    The second-generation GTE model (mGTE-TRM) is named `NewForSequenceClassification`. Because that name is too generic, set `--hf-overrides '{"architectures": ["GteNewForSequenceClassification"]}'` to specify the use of the `GteNewForSequenceClassification` architecture.

!!! note
    Load the official original `mxbai-rerank-v2` by using the following command.

    ```bash
    vllm serve mixedbread-ai/mxbai-rerank-base-v2 --hf_overrides '{"architectures": ["Qwen2ForSequenceClassification"],"classifier_from_token": ["0", "1"], "method": "from_2_way_softmax"}'
    ```

!!! note
    Load the official original `Qwen3 Reranker` by using the following command. More information can be found at: [examples/pooling/score/qwen3_reranker_offline.py](../../../examples/pooling/score/qwen3_reranker_offline.py) and [examples/pooling/score/qwen3_reranker_online.py](../../../examples/pooling/score/qwen3_reranker_online.py).

    ```bash
    vllm serve Qwen/Qwen3-Reranker-0.6B --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
    ```

#### Multimodal Models

!!! note
    For more information about multimodal model inputs, see [this page](../supported_models.md#list-of-multimodal-language-models).

| Architecture | Models | Inputs | Example HF Models | [LoRA](../../features/lora.md) | [PP](../../serving/parallelism_scaling.md) |
| ------------ | ------ | ------ | ----------------- | ------------------------------ | ------------------------------------------ |
| `JinaVLForSequenceClassification` | JinaVL-based | T + I<sup>E+</sup> | `jinaai/jina-reranker-m0`, etc. | ✅︎ | ✅︎ |
| `LlamaNemotronVLForSequenceClassification` | Llama Nemotron Reranker + SigLIP | T + I<sup>E+</sup> | `nvidia/llama-nemotron-rerank-vl-1b-v2` | | |
| `Qwen3VLForSequenceClassification` | Qwen3-VL-Reranker | T + I<sup>E+</sup> + V<sup>E+</sup> | `Qwen/Qwen3-VL-Reranker-2B`(see note), etc. | ✅︎ | ✅︎ |

<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](README.md#model-conversion))  
\* Feature support is the same as that of the original model.

!!! note
    Similar to Qwen3-Reranker, you need to use the following `--hf_overrides` to load the official original `Qwen3-VL-Reranker`.

    ```bash
    vllm serve Qwen/Qwen3-VL-Reranker-2B --hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
    ```

--8<-- [end:supported-cross-encoder-models]

### Late-interaction models

All models that support the token embedding task also support using the score API to compute similarity scores by calculating the late interaction of two input prompts. See [this page](token_embed.md) for more information about token embedding models.

--8<-- "docs/models/pooling_models/token_embed.md:supported-token-embed-models"

### Bi-encoder

All models that support the embedding task also support using the score API to compute similarity scores by calculating the cosine similarity of the two input prompts' embeddings. See [this page](embed.md) for more information about embedding models.

--8<-- "docs/models/pooling_models/embed.md:supported-embed-models"

## Offline Inference

### Pooling Parameters

The following [pooling parameters][vllm.PoolingParams] are only supported by cross-encoder models and do not work for late-interaction and bi-encoder models.

```python
--8<-- "vllm/pooling_params.py:common-pooling-params"
--8<-- "vllm/pooling_params.py:classify-pooling-params"
```

### `LLM.score`

The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs.

```python
from vllm import LLM

# Load a cross-encoder reranker with the pooling runner.
llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")

# Score a single (query, document) pair; one output is returned per pair.
(output,) = llm.score(
    "What is the capital of France?",
    "The capital of Brazil is Brasilia.",
)

score = output.outputs.score
print(f"Score: {score}")
```

A code example can be found here: [examples/basic/offline_inference/score.py](../../../examples/basic/offline_inference/score.py)

## Online Serving

### Score API

The Score API (`/score`) is similar to `LLM.score`: it computes similarity scores between two input prompts.

#### Parameters

The following Score API parameters are supported:

```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
```

#### Examples

##### Single inference

You can pass a string to both `queries` and `documents`, forming a single sentence pair.

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "queries": "What is the capital of France?",
  "documents": "The capital of France is Paris."
}'
```

??? console "Response"

    ```json
    {
      "id": "score-request-id",
      "object": "list",
      "created": 693447,
      "model": "BAAI/bge-reranker-v2-m3",
      "data": [
        {
          "index": 0,
          "object": "score",
          "score": 1
        }
      ],
      "usage": {}
    }
    ```

##### Batch inference

You can pass a string to `queries` and a list to `documents`, forming multiple sentence pairs where each pair is built from `queries` and a string in `documents`. The total number of pairs is `len(documents)`.

??? console "Request"

    ```bash
    curl -X 'POST' \
      'http://127.0.0.1:8000/score' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "model": "BAAI/bge-reranker-v2-m3",
      "queries": "What is the capital of France?",
      "documents": [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris."
      ]
    }'
    ```

??? console "Response"

    ```json
    {
      "id": "score-request-id",
      "object": "list",
      "created": 693570,
      "model": "BAAI/bge-reranker-v2-m3",
      "data": [
        {
          "index": 0,
          "object": "score",
          "score": 0.001094818115234375
        },
        {
          "index": 1,
          "object": "score",
          "score": 1
        }
      ],
      "usage": {}
    }
    ```

You can pass a list to both `queries` and `documents`, forming multiple sentence pairs where each pair is built from a string in `queries` and the corresponding string in `documents` (similar to `zip()`). The total number of pairs is `len(documents)`.

??? console "Request"

    ```bash
    curl -X 'POST' \
      'http://127.0.0.1:8000/score' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "model": "BAAI/bge-reranker-v2-m3",
      "encoding_format": "float",
      "queries": [
        "What is the capital of Brazil?",
        "What is the capital of France?"
      ],
      "documents": [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris."
      ]
    }'
    ```

??? console "Response"

    ```json
    {
      "id": "score-request-id",
      "object": "list",
      "created": 693447,
      "model": "BAAI/bge-reranker-v2-m3",
      "data": [
        {
          "index": 0,
          "object": "score",
          "score": 1
        },
        {
          "index": 1,
          "object": "score",
          "score": 1
        }
      ],
      "usage": {}
    }
    ```

##### Multi-modal inputs

You can pass multi-modal inputs to scoring models by including a `content` field that holds a list of multi-modal inputs (images, etc.) in the request. Refer to the examples below for illustration.

=== "JinaVL-Reranker"

    To serve the model:

    ```bash
    vllm serve jinaai/jina-reranker-m0
    ```

    Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level `requests` library:

    ??? Code

        ```python
        import requests

        # One text query scored against two image documents.
        response = requests.post(
            "http://localhost:8000/v1/score",
            json={
                "model": "jinaai/jina-reranker-m0",
                "queries": "slm markdown",
                "documents": [
                    {
                        "content": [
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                                },
                            }
                        ],
                    },
                    {
                        "content": [
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                                },
                            }
                        ]
                    },
                ],
            },
        )
        response.raise_for_status()
        response_json = response.json()
        print("Scoring output:", response_json["data"][0]["score"])
        print("Scoring output:", response_json["data"][1]["score"])
        ```

Full example:

- [examples/pooling/score/vision_score_api_online.py](../../../examples/pooling/score/vision_score_api_online.py)
- [examples/pooling/score/vision_rerank_api_online.py](../../../examples/pooling/score/vision_rerank_api_online.py)

### Rerank API

The `/rerank`, `/v1/rerank`, and `/v2/rerank` APIs are compatible with both [Jina AI's rerank API interface](https://jina.ai/reranker/) and [Cohere's rerank API interface](https://docs.cohere.com/v2/reference/rerank) to ensure compatibility with popular open-source tools.

Code example: [examples/pooling/score/rerank_api_online.py](../../../examples/pooling/score/rerank_api_online.py)

#### Parameters

The following Rerank API parameters are supported:

```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
```

#### Examples

Note that the `top_n` request parameter is optional and defaults to the length of the `documents` field. Result documents are sorted by relevance, and the `index` property can be used to recover the original order.

??? console "Request"

    ```bash
    curl -X 'POST' \
      'http://127.0.0.1:8000/v1/rerank' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "model": "BAAI/bge-reranker-base",
      "query": "What is the capital of France?",
      "documents": [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
        "Horses and cows are both animals"
      ]
    }'
    ```

??? console "Response"

    ```json
    {
      "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
      "model": "BAAI/bge-reranker-base",
      "usage": {
        "total_tokens": 56
      },
      "results": [
        {
          "index": 1,
          "document": {
            "text": "The capital of France is Paris."
          },
          "relevance_score": 0.99853515625
        },
        {
          "index": 0,
          "document": {
            "text": "The capital of Brazil is Brasilia."
          },
          "relevance_score": 0.0005860328674316406
        }
      ]
    }
    ```

## More examples

More examples can be found here: [examples/pooling/score](../../../examples/pooling/score)

## Supported Features

Because cross-encoder models are a subset of classification models that accept two prompts as input and output num_labels equal to 1, cross-encoder features should be consistent with (sequence) classification. For more information, see [this page](classify.md#supported-features).

### Score Template

Score templates are supported for **cross-encoder** models only. If you are using an **embedding** model for scoring, vLLM does not apply a score template.

Some scoring models require a specific prompt format to work correctly. You can specify a custom score template using the `--chat-template` parameter (see [Chat Template](../../serving/openai_compatible_server.md#chat-template)).

Like chat templates, the score template receives a `messages` list. For scoring, each message has a `role` attribute, either `"query"` or `"document"`. For the usual kind of point-wise cross-encoder, you can expect exactly two messages: one query and one document.
To access the query and document content, use Jinja's `selectattr` filter:

- **Query**: `{{ (messages | selectattr("role", "eq", "query") | first).content }}`
- **Document**: `{{ (messages | selectattr("role", "eq", "document") | first).content }}`

This approach is more robust than index-based access (`messages[0]`, `messages[1]`) because it selects messages by their semantic role. It also avoids assumptions about message ordering if additional message types are added to `messages` in the future.

Example template file: [examples/pooling/score/template/nemotron-rerank.jinja](../../../examples/pooling/score/template/nemotron-rerank.jinja)

### Enable/disable activation

You can enable or disable activation via `use_activation`. This option only works for cross-encoder models.
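
To make the effect of the activation toggle concrete, here is a minimal sketch in plain Python (it does not call vLLM). It assumes the cross-encoder head emits a single raw logit per pair and that the activation is a logistic sigmoid, which is common for single-label rerankers but may differ per model. `apply_activation` and the logit values are hypothetical, chosen only for illustration.

```python
import math


def apply_activation(logit: float, use_activation: bool = True) -> float:
    """Turn one raw classifier logit into a score.

    Hypothetical helper: assumes a single-logit head and a sigmoid
    activation; the real activation depends on the model.
    """
    if not use_activation:
        # Activation disabled: return the unbounded raw logit.
        return logit
    # Numerically stable sigmoid, mapping the logit into the range 0..1.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


# Made-up logits for three (query, document) pairs.
raw_logits = [-6.8, 0.0, 7.2]

for logit in raw_logits:
    with_act = apply_activation(logit, use_activation=True)
    without_act = apply_activation(logit, use_activation=False)
    print(f"logit={logit:+.1f}  activated={with_act:.4f}  raw={without_act:+.1f}")
```

Under these assumptions the sigmoid is monotonic, so the ranking of documents is the same with or without activation; only the scale of the scores changes.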