# Pooling Models

vLLM also supports pooling models, such as embedding, classification, and reward models.

In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input
before returning them.

!!! note
    We currently support pooling models primarily for convenience. This is not guaranteed to provide any performance improvements over using Hugging Face Transformers or Sentence Transformers directly.

    We plan to optimize pooling models in vLLM. Please comment on <https://github.com/vllm-project/vllm/issues/21796> if you have any suggestions!

## Configuration

### Model Runner

Run a model in pooling mode via the option `--runner pooling`.

!!! tip
    There is no need to set this option in the vast majority of cases as vLLM can automatically
    detect the appropriate model runner via `--runner auto`.

### Model Conversion

vLLM can adapt models for various pooling tasks via the option `--convert <type>`.

If `--runner pooling` has been set (manually or automatically) but the model does not implement the
[VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface,
vLLM will attempt to automatically convert the model according to the architecture names
shown in the table below.

| Architecture                                    | `--convert` | Supported pooling tasks               |
|-------------------------------------------------|-------------|---------------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed`     | `token_embed`, `embed`                |
| `*ForRewardModeling`, `*RewardModel`            | `embed`     | `token_embed`, `embed`                |
| `*For*Classification`, `*ClassificationModel`   | `classify`  | `token_classify`, `classify`, `score` |

!!! tip
    You can explicitly set `--convert <type>` to specify how to convert the model.
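
As a rough illustration of the table above, the pattern matching could be sketched as follows. This is not vLLM's actual implementation: `guess_convert_type` and `CONVERT_PATTERNS` are hypothetical names, and checking the classification patterns first is an assumption made here because `*Model` would otherwise also match `*ClassificationModel`.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns copied from the table above. The ordering (most specific first)
# is an assumption of this sketch, not a documented vLLM rule.
CONVERT_PATTERNS = [
    ("classify", ["*For*Classification", "*ClassificationModel"]),
    ("embed", ["*ForRewardModeling", "*RewardModel"]),
    ("embed", ["*ForTextEncoding", "*EmbeddingModel", "*Model"]),
]


def guess_convert_type(architecture: str) -> Optional[str]:
    """Return the `--convert` type suggested by an architecture name, if any."""
    for convert_type, patterns in CONVERT_PATTERNS:
        if any(fnmatchcase(architecture, pattern) for pattern in patterns):
            return convert_type
    return None  # no pattern matched, so no automatic conversion is implied


print(guess_convert_type("Qwen2ForSequenceClassification"))
print(guess_convert_type("BertModel"))
print(guess_convert_type("LlamaForCausalLM"))
```

When the guess is not what you want, passing `--convert <type>` explicitly avoids relying on name matching at all.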

### Pooling Tasks

Each pooling model in vLLM supports one or more of these tasks according to
[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
enabling the corresponding APIs:

| Task             | APIs                                                                          |
|------------------|-------------------------------------------------------------------------------|
| `embed`          | `LLM.embed(...)`, `LLM.score(...)`\*, `LLM.encode(..., pooling_task="embed")` |
| `classify`       | `LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")`               |
| `score`          | `LLM.score(...)`                                                              |
| `token_classify` | `LLM.reward(...)`, `LLM.encode(..., pooling_task="token_classify")`           |
| `token_embed`    | `LLM.encode(..., pooling_task="token_embed")`                                 |
| `plugin`         | `LLM.encode(..., pooling_task="plugin")`                                      |

\* The `LLM.score(...)` API falls back to the `embed` task if the model does not support the `score` task.

### Pooler Configuration

#### Predefined models

If the [Pooler][vllm.model_executor.layers.pooler.Pooler] defined by the model accepts `pooler_config`,
you can override some of its attributes via the `--pooler-config` option.

#### Converted models

If the model has been converted via `--convert` (see above),
the pooler assigned to each task has the following attributes by default:

| Task       | Pooling Type | Normalization | Softmax |
|------------|--------------|---------------|---------|
| `embed`    | `LAST`       | ✅︎            | ❌      |
| `classify` | `LAST`       | ❌            | ✅︎      |

When loading a [Sentence Transformers](https://huggingface.co/sentence-transformers) model,
its Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.

You can further customize this via the `--pooler-config` option,
which takes priority over both the model's and Sentence Transformers' defaults.
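
To make the defaults in the table above concrete, here is a minimal pure-Python sketch of `LAST` pooling followed by either normalization (`embed`) or softmax (`classify`). It only illustrates the arithmetic on made-up numbers. The helper names are hypothetical and this is not the code vLLM runs.

```python
import math


def last_pool(token_states):
    """`LAST` pooling: keep only the final token's vector."""
    return token_states[-1]


def l2_normalize(vec):
    """Scale a vector to unit length (the `embed` default)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Turn logits into probabilities (the `classify` default)."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy per-token outputs for a 2-token prompt (made-up numbers).
token_states = [[1.0, 2.0], [3.0, 4.0]]

embedding = l2_normalize(last_pool(token_states))
probs = softmax(last_pool(token_states))
print(embedding)
print(probs)
```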

## Offline Inference

The [LLM][vllm.LLM] class provides various methods for offline inference.
See [configuration](../api/README.md#configuration) for a list of options when initializing the model.

### `LLM.embed`

The [embed][vllm.LLM.embed] method outputs an embedding vector for each prompt.
It is primarily designed for embedding models.

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```

A code example can be found here: [examples/offline_inference/basic/embed.py](../../examples/offline_inference/basic/embed.py)

### `LLM.classify`

The [classify][vllm.LLM.classify] method outputs a probability vector for each prompt.
It is primarily designed for classification models.

```python
from vllm import LLM

llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
(output,) = llm.classify("Hello, my name is")

probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")
```

A code example can be found here: [examples/offline_inference/basic/classify.py](../../examples/offline_inference/basic/classify.py)

### `LLM.score`

The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs.
It is designed for embedding models and cross-encoder models. Embedding models use cosine similarity, and [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html) serve as rerankers between candidate query-document pairs in RAG systems.

!!! note
    vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
    To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain).

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
(output,) = llm.score(
    "What is the capital of France?",
    "The capital of Brazil is Brasilia.",
)

score = output.outputs.score
print(f"Score: {score}")
```

A code example can be found here: [examples/offline_inference/basic/score.py](../../examples/offline_inference/basic/score.py)
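
For embedding models, the score is a cosine similarity between two embedding vectors. The following is a minimal sketch of that computation on toy vectors, independent of vLLM. `cosine_similarity` is a hypothetical helper, not a vLLM API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-dimensional "embeddings" (made-up numbers).
query_vec = [1.0, 0.0]
doc_vec = [1.0, 1.0]

score = cosine_similarity(query_vec, doc_vec)
print(f"Score: {score:.4f}")
```

Cross-encoder models do not work this way: they read both texts together and output the score directly.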

### `LLM.reward`

The [reward][vllm.LLM.reward] method is available to all reward models in vLLM.

```python
from vllm import LLM

llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True)
(output,) = llm.reward("Hello, my name is")

data = output.outputs.data
print(f"Data: {data!r}")
```

A code example can be found here: [examples/offline_inference/basic/reward.py](../../examples/offline_inference/basic/reward.py)

### `LLM.encode`

The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.

!!! note
    Please use one of the more specific methods or set the task directly when using `LLM.encode`:

    - For embeddings, use `LLM.embed(...)` or `pooling_task="embed"`.
    - For classification logits, use `LLM.classify(...)` or `pooling_task="classify"`.
    - For similarity scores, use `LLM.score(...)`.
    - For rewards, use `LLM.reward(...)` or `pooling_task="token_classify"`.
    - For token classification, use `pooling_task="token_classify"`.
    - For multi-vector retrieval, use `pooling_task="token_embed"`.
    - For IO Processor Plugins, use `pooling_task="plugin"`.

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")
(output,) = llm.encode("Hello, my name is", pooling_task="embed")

data = output.outputs.data
print(f"Data: {data!r}")
```

## Online Serving

Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:

- [Embeddings API](../serving/openai_compatible_server.md#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
- [Classification API](../serving/openai_compatible_server.md#classification-api) is similar to `LLM.classify` and is applicable to sequence classification models.
- [Score API](../serving/openai_compatible_server.md#score-api) is similar to `LLM.score` for cross-encoder models.
- [Pooling API](../serving/openai_compatible_server.md#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.

!!! note
    Please use one of the more specific endpoints or set the task directly when using the [Pooling API](../serving/openai_compatible_server.md#pooling-api):

    - For embeddings, use [Embeddings API](../serving/openai_compatible_server.md#embeddings-api) or `"task":"embed"`.
    - For classification logits, use [Classification API](../serving/openai_compatible_server.md#classification-api) or `"task":"classify"`.
    - For similarity scores, use [Score API](../serving/openai_compatible_server.md#score-api).
    - For rewards, use `"task":"token_classify"`.
    - For token classification, use `"task":"token_classify"`.
    - For multi-vector retrieval, use `"task":"token_embed"`.
    - For IO Processor Plugins, use `"task":"plugin"`.

```python
# start a supported embeddings model server with `vllm serve`, e.g.
# vllm serve intfloat/e5-small
import requests

host = "localhost"
port = "8000"
model_name = "intfloat/e5-small"

api_url = f"http://{host}:{port}/pooling"

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
prompt = {"model": model_name, "input": prompts, "task": "embed"}

response = requests.post(api_url, json=prompt)

for output in response.json()["data"]:
    data = output["data"]
    print(f"Data: {data!r} (size={len(data)})")
```

## Matryoshka Embeddings

[Matryoshka Embeddings](https://sbert.net/examples/sentence_transformer/training/matryoshka/README.html#matryoshka-embeddings) or [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) is a technique used in training embedding models. It allows users to trade off between performance and cost.
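
Conceptually, a Matryoshka embedding is shortened by keeping its leading components. The sketch below assumes the truncated vector is then re-normalized to unit length. It is an illustration on toy numbers with a hypothetical helper name, not vLLM's implementation.

```python
import math


def truncate_embedding(embedding, dimensions):
    """Keep the first `dimensions` components and rescale to unit length."""
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and len(embedding)")
    head = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0]  # toy 3-dimensional embedding
small = truncate_embedding(full, 2)
print(small)
```

A shorter vector is cheaper to store and compare, at the price of some retrieval quality. This only works well for models trained with MRL.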

!!! warning
    Not all embedding models are trained using Matryoshka Representation Learning. To avoid misuse of the `dimensions` parameter, vLLM returns an error for requests that attempt to change the output dimension of models that do not support Matryoshka Embeddings.

    For example, setting the `dimensions` parameter while using the `BAAI/bge-m3` model will result in the following error.

    ```json
    {"object":"error","message":"Model \"BAAI/bge-m3\" does not support matryoshka representation, changing output dimensions will lead to poor results.","type":"BadRequestError","param":null,"code":400}
    ```

### Manually enable Matryoshka Embeddings

There is currently no official interface for specifying support for Matryoshka Embeddings. In vLLM, if `is_matryoshka` is `True` in `config.json`, you can change the output dimension to arbitrary values. Use `matryoshka_dimensions` to control the allowed output dimensions.

For models that support Matryoshka Embeddings but are not recognized by vLLM, manually override the config using `hf_overrides={"is_matryoshka": True}` or `hf_overrides={"matryoshka_dimensions": [<allowed output dimensions>]}` (offline), or `--hf-overrides '{"is_matryoshka": true}'` or `--hf-overrides '{"matryoshka_dimensions": [<allowed output dimensions>]}'` (online).

Here is an example to serve a model with Matryoshka Embeddings enabled.

```bash
vllm serve Snowflake/snowflake-arctic-embed-m-v1.5 --hf-overrides '{"matryoshka_dimensions":[256]}'
```
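
The acceptance rule described in this section can be sketched as a small check over the config values. `dimensions_allowed` is a hypothetical helper, and giving `matryoshka_dimensions` precedence over `is_matryoshka` when both are set is an assumption of this sketch rather than documented behavior.

```python
from typing import Optional


def dimensions_allowed(hf_config: dict, dimensions: Optional[int]) -> bool:
    """Return True if a request for `dimensions` would be acceptable."""
    if dimensions is None:
        return True  # the request does not change the output size

    allowed = hf_config.get("matryoshka_dimensions")
    if allowed is not None:
        # Only the explicitly listed sizes are accepted.
        return dimensions in allowed

    if hf_config.get("is_matryoshka", False):
        # Any positive size is accepted.
        return dimensions >= 1

    # The model is not marked as supporting Matryoshka Embeddings.
    return False


print(dimensions_allowed({"matryoshka_dimensions": [256]}, 256))
print(dimensions_allowed({"matryoshka_dimensions": [256]}, 128))
print(dimensions_allowed({"is_matryoshka": True}, 32))
print(dimensions_allowed({}, 32))
```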

### Offline Inference

You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the `dimensions` parameter in [PoolingParams][vllm.PoolingParams].

```python
from vllm import LLM, PoolingParams

llm = LLM(
    model="jinaai/jina-embeddings-v3",
    runner="pooling",
    trust_remote_code=True,
)
outputs = llm.embed(
    ["Follow the white rabbit."],
    pooling_params=PoolingParams(dimensions=32),
)
print(outputs[0].outputs)
```

A code example can be found here: [examples/pooling/embed/embed_matryoshka_fy_offline.py](../../examples/pooling/embed/embed_matryoshka_fy_offline.py)

### Online Inference

Use the following command to start the vLLM server.

```bash
vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
```

You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the `dimensions` parameter.

```bash
curl http://127.0.0.1:8000/v1/embeddings \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Follow the white rabbit.",
    "model": "jinaai/jina-embeddings-v3",
    "encoding_format": "float",
    "dimensions": 32
  }'
```

Expected output:

```json
{"id":"embd-5c21fc9a5c9d4384a1b021daccaf9f64","object":"list","created":1745476417,"model":"jinaai/jina-embeddings-v3","data":[{"index":0,"object":"embedding","embedding":[-0.3828125,-0.1357421875,0.03759765625,0.125,0.21875,0.09521484375,-0.003662109375,0.1591796875,-0.130859375,-0.0869140625,-0.1982421875,0.1689453125,-0.220703125,0.1728515625,-0.2275390625,-0.0712890625,-0.162109375,-0.283203125,-0.055419921875,-0.0693359375,0.031982421875,-0.04052734375,-0.2734375,0.1826171875,-0.091796875,0.220703125,0.37890625,-0.0888671875,-0.12890625,-0.021484375,-0.0091552734375,0.23046875]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0,"prompt_tokens_details":null}}
```

An OpenAI client example can be found here: [examples/pooling/embed/openai_embedding_matryoshka_fy_client.py](../../examples/pooling/embed/openai_embedding_matryoshka_fy_client.py)

## Specific models

### ColBERT Late Interaction Models

[ColBERT](https://arxiv.org/abs/2004.12832) (Contextualized Late Interaction over BERT) is a retrieval model that uses per-token embeddings and MaxSim scoring for document ranking. Unlike single-vector embedding models, ColBERT retains token-level representations and computes relevance scores through late interaction, providing better accuracy than single-vector models while being more efficient than cross-encoders.
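
MaxSim scoring can be sketched in a few lines: for each query token, take the highest dot product against any document token, then sum those maxima. The example below uses toy 2-dimensional token vectors and hypothetical helper names. It illustrates the scoring rule only, not vLLM's implementation.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """Sum over query tokens of the best match among document tokens."""
    return sum(
        max(dot(q, d) for d in doc_tokens)
        for q in query_tokens
    )


# Toy token embeddings (made-up, already unit length).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # matches both query tokens exactly
doc_b = [[0.6, 0.8]]              # one token that partially matches each

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a > score_b)
```

Because document token embeddings do not depend on the query, they can be computed once and reused, which is where the efficiency over cross-encoders comes from.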

vLLM supports ColBERT models with multiple encoder backbones:

| Architecture | Backbone | Example HF Models |
|---|---|---|
| `HF_ColBERT` | BERT | `answerdotai/answerai-colbert-small-v1`, `colbert-ir/colbertv2.0` |
| `ColBERTModernBertModel` | ModernBERT | `lightonai/GTE-ModernColBERT-v1` |
| `ColBERTJinaRobertaModel` | Jina XLM-RoBERTa | `jinaai/jina-colbert-v2` |

**BERT-based ColBERT** models work out of the box:

```shell
vllm serve answerdotai/answerai-colbert-small-v1
```

For **non-BERT backbones**, use `--hf-overrides` to set the correct architecture:

```shell
# ModernBERT backbone
vllm serve lightonai/GTE-ModernColBERT-v1 \
    --hf-overrides '{"architectures": ["ColBERTModernBertModel"]}'

# Jina XLM-RoBERTa backbone
vllm serve jinaai/jina-colbert-v2 \
    --hf-overrides '{"architectures": ["ColBERTJinaRobertaModel"]}' \
    --trust-remote-code
```

Then you can use the rerank endpoint:

```shell
curl -s http://localhost:8000/rerank -H "Content-Type: application/json" -d '{
    "model": "answerdotai/answerai-colbert-small-v1",
    "query": "What is machine learning?",
    "documents": [
        "Machine learning is a subset of artificial intelligence.",
        "Python is a programming language.",
        "Deep learning uses neural networks."
    ]
}'
```

Or the score endpoint:

```shell
curl -s http://localhost:8000/score -H "Content-Type: application/json" -d '{
    "model": "answerdotai/answerai-colbert-small-v1",
    "text_1": "What is machine learning?",
    "text_2": ["Machine learning is a subset of AI.", "The weather is sunny."]
}'
```

You can also get the raw token embeddings using the pooling endpoint with the `token_embed` task:

```shell
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
    "model": "answerdotai/answerai-colbert-small-v1",
    "input": "What is machine learning?",
    "task": "token_embed"
}'
```

An example can be found here: [examples/pooling/score/colbert_rerank_online.py](../../examples/pooling/score/colbert_rerank_online.py)

### ColQwen3 Multi-Modal Late Interaction Models

ColQwen3 is based on [ColPali](https://arxiv.org/abs/2407.01449), which extends ColBERT's late interaction approach to **multi-modal** inputs. While ColBERT operates on text-only token embeddings, ColPali/ColQwen3 can embed both **text and images** (e.g. PDF pages, screenshots, diagrams) into per-token L2-normalized vectors and compute relevance via MaxSim scoring. ColQwen3 specifically uses Qwen3-VL as its vision-language backbone.

| Architecture | Backbone | Example HF Models |
|---|---|---|
| `ColQwen3` | Qwen3-VL | `TomoroAI/tomoro-colqwen3-embed-4b`, `TomoroAI/tomoro-colqwen3-embed-8b` |
| `OpsColQwen3Model` | Qwen3-VL | `OpenSearch-AI/Ops-Colqwen3-4B`, `OpenSearch-AI/Ops-Colqwen3-8B` |
| `Qwen3VLNemotronEmbedModel` | Qwen3-VL | `nvidia/nemotron-colembed-vl-4b-v2`, `nvidia/nemotron-colembed-vl-8b-v2` |

Start the server:

```shell
vllm serve TomoroAI/tomoro-colqwen3-embed-4b --max-model-len 4096
```

#### Text-only scoring and reranking

Use the `/rerank` endpoint:

```shell
curl -s http://localhost:8000/rerank -H "Content-Type: application/json" -d '{
    "model": "TomoroAI/tomoro-colqwen3-embed-4b",
    "query": "What is machine learning?",
    "documents": [
        "Machine learning is a subset of artificial intelligence.",
        "Python is a programming language.",
        "Deep learning uses neural networks."
    ]
}'
```

Or the `/score` endpoint:

```shell
curl -s http://localhost:8000/score -H "Content-Type: application/json" -d '{
    "model": "TomoroAI/tomoro-colqwen3-embed-4b",
    "text_1": "What is the capital of France?",
    "text_2": ["The capital of France is Paris.", "Python is a programming language."]
}'
```

#### Multi-modal scoring and reranking (text query × image documents)

The `/score` and `/rerank` endpoints also accept multi-modal inputs directly.
Pass image documents using the `data_1`/`data_2` (for `/score`) or `documents` (for `/rerank`) fields
with a `content` list containing `image_url` and `text` parts, the same format used by the
OpenAI chat completion API.

Score a text query against image documents:

```shell
curl -s http://localhost:8000/score -H "Content-Type: application/json" -d '{
    "model": "TomoroAI/tomoro-colqwen3-embed-4b",
    "data_1": "Retrieve the city of Beijing",
    "data_2": [
        {
            "content": [
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64>"}},
                {"type": "text", "text": "Describe the image."}
            ]
        }
    ]
}'
```

Rerank image documents by a text query:

```shell
curl -s http://localhost:8000/rerank -H "Content-Type: application/json" -d '{
    "model": "TomoroAI/tomoro-colqwen3-embed-4b",
    "query": "Retrieve the city of Beijing",
    "documents": [
        {
            "content": [
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_1>"}},
                {"type": "text", "text": "Describe the image."}
            ]
        },
        {
            "content": [
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_2>"}},
                {"type": "text", "text": "Describe the image."}
            ]
        }
    ],
    "top_n": 2
}'
```
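
The `<BASE64>` placeholders above stand for base64-encoded image bytes. A minimal sketch of building such a document entry and a `/rerank` request body in Python follows. `image_document` is a hypothetical helper, and the bytes used here are a placeholder rather than a real image.

```python
import base64
import json


def image_document(image_bytes: bytes, text: str, mime: str = "image/png") -> dict:
    """Wrap raw image bytes and a text part in the `content` format shown above."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            {"type": "text", "text": text},
        ]
    }


# Placeholder bytes for illustration; read a real file with open(path, "rb").read().
fake_png = b"\x89PNG\r\n\x1a\n"

payload = {
    "model": "TomoroAI/tomoro-colqwen3-embed-4b",
    "query": "Retrieve the city of Beijing",
    "documents": [image_document(fake_png, "Describe the image.")],
    "top_n": 1,
}
body = json.dumps(payload)  # JSON body to POST to the /rerank endpoint
print(body[:80])
```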

#### Raw token embeddings

You can also get the raw token embeddings using the `/pooling` endpoint with the `token_embed` task:

```shell
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
    "model": "TomoroAI/tomoro-colqwen3-embed-4b",
    "input": "What is machine learning?",
    "task": "token_embed"
}'
```

For **image inputs** via the pooling endpoint, use the chat-style `messages` field:

```shell
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
    "model": "TomoroAI/tomoro-colqwen3-embed-4b",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64>"}},
                {"type": "text", "text": "Describe the image."}
            ]
        }
    ]
}'
```

#### Examples

- Multi-vector retrieval: [examples/pooling/token_embed/colqwen3_token_embed_online.py](../../examples/pooling/token_embed/colqwen3_token_embed_online.py)
- Reranking (text + multi-modal): [examples/pooling/score/colqwen3_rerank_online.py](../../examples/pooling/score/colqwen3_rerank_online.py)

### BAAI/bge-m3

The `BAAI/bge-m3` model comes with extra weights for sparse and ColBERT embeddings. However, its `config.json`
declares the architecture as `XLMRobertaModel`, which makes vLLM load it as a vanilla RoBERTa model without the
extra weights. To load the full model weights, override its architecture like this:

```shell
vllm serve BAAI/bge-m3 --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'
```

Then you can obtain the sparse embeddings like this:

```shell
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
     "model": "BAAI/bge-m3",
     "task": "token_classify",
     "input": ["What is BGE M3?", "Defination of BM25"]
}'
```

Due to limitations in the output schema, the output is a list of per-token scores
for each input, without the tokens themselves. This means that you'll have to call
`/tokenize` as well to be able to pair tokens with scores.
Refer to the tests in `tests/models/language/pooling/test_bge_m3.py` to see how
to do that.
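
Once you have the tokens and the scores for one input as two parallel lists, the pairing step itself is simple. The sketch below uses made-up values and a hypothetical helper name, and it does not model the exact response shapes of `/tokenize` or `/pooling`. Keeping the maximum score for repeated tokens is an assumption of this sketch.

```python
def pair_tokens_with_scores(tokens, scores):
    """Map each token to its score, keeping the highest score per token."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    weights = {}
    for token, score in zip(tokens, scores):
        # A repeated token keeps its highest score (assumption of this sketch).
        weights[token] = max(score, weights.get(token, score))
    return weights


# Made-up tokens and scores for a single input.
tokens = ["What", "is", "BGE", "M3", "is"]
scores = [0.1, 0.05, 0.3, 0.25, 0.07]

sparse = pair_tokens_with_scores(tokens, scores)
print(sparse)
```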

You can obtain the ColBERT embeddings like this:

```shell
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
     "model": "BAAI/bge-m3",
     "task": "token_embed",
     "input": ["What is BGE M3?", "Defination of BM25"]
}'
```

## Deprecated Features

### Encode task

We have split the `encode` task into two more specific token-wise tasks: `token_embed` and `token_classify`:

- `token_embed` is the same as `embed`, using normalization as the activation.
- `token_classify` is the same as `classify`, by default using softmax as the activation.

Pooling models now support all pooling tasks by default, so no extra settings are required.

- To extract hidden states, prefer the `token_embed` task.
- For reward models, prefer the `token_classify` task.
-												Stop using title frontmatter and fix doc that can only be reached by search (#20623)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-08 11:27:40 +01:00
+								# Pooling Models
-												[Docs] Convert rST to MyST (Markdown) (#11145)

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
											
										
										
											2024-12-23 17:35:38 -05:00
-												Optimize the wording of the document and unify the terminology and th… (#29491)


											
										
										
											2025-11-26 21:16:12 +08:00
+								vLLM also supports pooling models, such as embedding, classification, and reward models.
-												[Docs] Convert rST to MyST (Markdown) (#11145)

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
											
										
										
											2024-12-23 17:35:38 -05:00
-												Migrate docs from Sphinx to MkDocs (#18145)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-05-23 11:09:53 +02:00
+								In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
-												[Deprecation][2/N] Replace `--task` with `--runner` and `--convert` (#21470)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-28 10:42:40 +08:00
+								These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input
-												[Docs] Convert rST to MyST (Markdown) (#11145)

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
											
										
										
											2024-12-23 17:35:38 -05:00
+								before returning them.
-												Migrate docs from Sphinx to MkDocs (#18145)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-05-23 11:09:53 +02:00
+								!!! note
-												Optimize the wording of the document and unify the terminology and th… (#29491)


											
										
										
											2025-11-26 21:16:12 +08:00
+								    We currently support pooling models primarily for convenience. This is not guaranteed to provide any performance improvements over using Hugging Face Transformers or Sentence Transformers directly.
-												[Doc] Link to RFC for pooling optimizations (#21806)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
											
										
										
											2025-07-29 14:53:18 +08:00
-												Optimize the wording of the document and unify the terminology and th… (#29491)


											
										
										
											2025-11-26 21:16:12 +08:00
+								    We plan to optimize pooling models in vLLM. Please comment on <https://github.com/vllm-project/vllm/issues/21796> if you have any suggestions!
-												[Docs] Convert rST to MyST (Markdown) (#11145)

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
											
										
										
											2024-12-23 17:35:38 -05:00
-												[Deprecation][2/N] Replace `--task` with `--runner` and `--convert` (#21470)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-28 10:42:40 +08:00
+								## Configuration
-												[Doc] Show default pooling method in a table (#11904)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
											
										
										
											2025-01-10 11:25:20 +08:00
-												[Deprecation][2/N] Replace `--task` with `--runner` and `--convert` (#21470)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-28 10:42:40 +08:00
+								### Model Runner
-												[Docs] Convert rST to MyST (Markdown) (#11145)

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
											
										
										
											2024-12-23 17:35:38 -05:00
-												[Deprecation][2/N] Replace `--task` with `--runner` and `--convert` (#21470)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-28 10:42:40 +08:00
+								Run a model in pooling mode via the option `--runner pooling`.
-												[Docs] Convert rST to MyST (Markdown) (#11145)

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
											
										
										
											2024-12-23 17:35:38 -05:00
-												[Deprecation][2/N] Replace `--task` with `--runner` and `--convert` (#21470)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-28 10:42:40 +08:00
+								!!! tip
 								    There is no need to set this option in the vast majority of cases as vLLM can automatically
-												Optimize the wording of the document and unify the terminology and th… (#29491)


											
										
										
											2025-11-26 21:16:12 +08:00
+								    detect the appropriate model runner via `--runner auto`.
-												[Deprecation][2/N] Replace `--task` with `--runner` and `--convert` (#21470)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-28 10:42:40 +08:00
 								### Model Conversion
 								vLLM can adapt models for various pooling tasks via the option `--convert <type>`.
 								If `--runner pooling` has been set (manually or automatically) but the model does not implement the
 								[VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface,
 								vLLM will attempt to automatically convert the model according to the architecture names
 								shown in the table below.
-												[Frontend][Doc][5/N] Improve all pooling task | Polish encode (pooling) api & Document. (#25524)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
											
										
										
											2025-10-30 20:13:05 +08:00
+								| Architecture                                    | `--convert` | Supported pooling tasks               |
 								|-------------------------------------------------|-------------|---------------------------------------|
 								| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed`     | `token_embed`, `embed`                |
-												[Model][7/N] Improve all pooling task | Deprecation as_reward_model. Extract hidden states prefer using new multi-vector retrieval API (#26686)

Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
											
										
										
											2025-12-08 16:10:09 +08:00
+								| `*ForRewardModeling`, `*RewardModel`            | `embed`     | `token_embed`, `embed`                |
-												[Frontend][Doc][5/N] Improve all pooling task | Polish encode (pooling) api & Document. (#25524)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
											
										
										
											2025-10-30 20:13:05 +08:00
+								| `*For*Classification`, `*ClassificationModel`   | `classify`  | `token_classify`, `classify`, `score` |
-												[Deprecation][2/N] Replace `--task` with `--runner` and `--convert` (#21470)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-28 10:42:40 +08:00
 								!!! tip
 								    You can explicitly set `--convert <type>` to specify how to convert the model.
### Pooling Tasks

Each pooling model in vLLM supports one or more of these tasks according to
[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
enabling the corresponding APIs:

| Task             | APIs                                                                          |
|------------------|-------------------------------------------------------------------------------|
| `embed`          | `LLM.embed(...)`, `LLM.score(...)`\*, `LLM.encode(..., pooling_task="embed")` |
| `classify`       | `LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")`               |
| `score`          | `LLM.score(...)`                                                              |
| `token_classify` | `LLM.reward(...)`, `LLM.encode(..., pooling_task="token_classify")`           |
| `token_embed`    | `LLM.encode(..., pooling_task="token_embed")`                                 |
| `plugin`         | `LLM.encode(..., pooling_task="plugin")`                                      |

\* The `LLM.score(...)` API falls back to the `embed` task if the model does not support the `score` task.

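The fallback described in the footnote can be sketched as a small decision function. This is only an illustration of the rule as documented; `resolve_score_task` is a hypothetical helper, not part of vLLM.

```python
from typing import Iterable


def resolve_score_task(supported_tasks: Iterable[str]) -> str:
    """Pick the pooling task a score request would run under (hypothetical helper)."""
    tasks = set(supported_tasks)
    if "score" in tasks:
        return "score"  # the model supports the score task directly
    if "embed" in tasks:
        return "embed"  # fallback: score via embeddings
    raise ValueError("model supports neither the `score` nor the `embed` task")


print(resolve_score_task(["token_classify", "classify", "score"]))
print(resolve_score_task(["token_embed", "embed"]))
```
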
### Pooler Configuration

#### Predefined models

If the [Pooler][vllm.model_executor.layers.pooler.Pooler] defined by the model accepts `pooler_config`,
you can override some of its attributes via the `--pooler-config` option.

#### Converted models

If the model has been converted via `--convert` (see above),
the pooler assigned to each task has the following attributes by default:

| Task       | Pooling Type | Normalization | Softmax |
|------------|--------------|---------------|---------|
| `embed`    | `LAST`       | ✅︎            | ❌      |
| `classify` | `LAST`       | ❌            | ✅︎      |

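To make the defaults in the table concrete, here is a minimal pure-Python sketch of what each row means. It assumes that "Normalization" refers to L2 normalization, and it treats the pooled vector as the classification logits (a real classifier applies its own head first). The function names are illustrative and are not vLLM APIs.

```python
import math


def last_pool(hidden_states: list[list[float]]) -> list[float]:
    """`LAST` pooling: keep only the hidden state of the final token."""
    return hidden_states[-1]


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (assumed meaning of "Normalization")."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def softmax(logits: list[float]) -> list[float]:
    """Turn logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pool(hidden_states: list[list[float]], task: str) -> list[float]:
    """Apply the default post-processing from the table for `embed` or `classify`."""
    pooled = last_pool(hidden_states)
    if task == "embed":
        return l2_normalize(pooled)
    if task == "classify":
        return softmax(pooled)
    raise ValueError(f"unsupported task: {task}")


states = [[1.0, 0.0], [3.0, 4.0]]  # two tokens, hidden size 2
print(pool(states, "embed"))
print(pool(states, "classify"))
```
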
When loading a [Sentence Transformers](https://huggingface.co/sentence-transformers) model,
its Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.

You can further customize this via the `--pooler-config` option,
which takes priority over both the model's and Sentence Transformers' defaults.

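The precedence described here (model defaults, then Sentence Transformers settings, then `--pooler-config`) behaves like a layered dictionary merge. The sketch below only illustrates that ordering; the helper `resolve_pooler_settings` and the keys used in the example are assumptions, not a description of how vLLM stores these settings.

```python
from typing import Optional


def resolve_pooler_settings(
    model_defaults: dict,
    sentence_transformers: Optional[dict] = None,
    pooler_config: Optional[dict] = None,
) -> dict:
    """Merge three layers of settings; later layers win (hypothetical helper)."""
    resolved = dict(model_defaults)
    for layer in (sentence_transformers, pooler_config):
        if layer:
            # Only explicitly set values override the lower-priority layers.
            resolved.update({k: v for k, v in layer.items() if v is not None})
    return resolved


settings = resolve_pooler_settings(
    model_defaults={"pooling_type": "LAST", "normalize": True},
    sentence_transformers={"pooling_type": "MEAN"},
    pooler_config={"normalize": False},
)
print(settings)
```
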
## Offline Inference

The [LLM][vllm.LLM] class provides various methods for offline inference.
See [configuration](../api/README.md#configuration) for a list of options when initializing the model.

### `LLM.embed`

The [embed][vllm.LLM.embed] method outputs an embedding vector for each prompt.
It is primarily designed for embedding models.

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```

A code example can be found here: [examples/offline_inference/basic/embed.py](../../examples/offline_inference/basic/embed.py)

### `LLM.classify`

The [classify][vllm.LLM.classify] method outputs a probability vector for each prompt.
It is primarily designed for classification models.

```python
from vllm import LLM

llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
(output,) = llm.classify("Hello, my name is")

probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")
```

A code example can be found here: [examples/offline_inference/basic/classify.py](../../examples/offline_inference/basic/classify.py)

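A probability vector like `probs` is usually reduced to a single predicted label. The sketch below shows one way to do that in plain Python. The label names are hypothetical; in practice the mapping would come from the model's own configuration.

```python
def top_label(probs: list[float], id2label: dict[int, str]) -> tuple[str, float]:
    """Return the most likely label and its probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical two-class output and label mapping, for illustration only.
example_probs = [0.12, 0.88]
example_labels = {0: "negative", 1: "positive"}
print(top_label(example_probs, example_labels))
```
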
### `LLM.score`

The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs.
It is designed for embedding models and cross-encoder models. Embedding models use cosine similarity, and [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html) serve as rerankers between candidate query-document pairs in RAG systems.

!!! note
    vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
    To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain).

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
(output,) = llm.score(
    "What is the capital of France?",
    "The capital of Brazil is Brasilia.",
)

score = output.outputs.score
print(f"Score: {score}")
```

A code example can be found here: [examples/offline_inference/basic/score.py](../../examples/offline_inference/basic/score.py)

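For embedding models, the score is a cosine similarity between the two embeddings. The following sketch computes that quantity in plain Python. It illustrates the formula only; the vectors are made up, and this is not how vLLM computes it internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


# Made-up embeddings, for illustration only.
query = [0.2, 0.1, 0.7]
document = [0.1, 0.0, 0.9]
print(f"Score: {cosine_similarity(query, document):.4f}")
```
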
### `LLM.reward`

The [reward][vllm.LLM.reward] method is available to all reward models in vLLM.

```python
from vllm import LLM

llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True)
(output,) = llm.reward("Hello, my name is")

data = output.outputs.data
print(f"Data: {data!r}")
```

A code example can be found here: [examples/offline_inference/basic/reward.py](../../examples/offline_inference/basic/reward.py)

### `LLM.encode`

The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.

!!! note
    Please use one of the more specific methods or set the task directly when using `LLM.encode`:

    - For embeddings, use `LLM.embed(...)` or `pooling_task="embed"`.
    - For classification logits, use `LLM.classify(...)` or `pooling_task="classify"`.
    - For similarity scores, use `LLM.score(...)`.
    - For rewards, use `LLM.reward(...)` or `pooling_task="token_classify"`.
    - For token classification, use `pooling_task="token_classify"`.
    - For multi-vector retrieval, use `pooling_task="token_embed"`.
    - For IO Processor Plugins, use `pooling_task="plugin"`.

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")
(output,) = llm.encode("Hello, my name is", pooling_task="embed")

data = output.outputs.data
print(f"Data: {data!r}")
```

## Online Serving

Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:

- [Embeddings API](../serving/openai_compatible_server.md#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
- [Classification API](../serving/openai_compatible_server.md#classification-api) is similar to `LLM.classify` and is applicable to sequence classification models.
- [Score API](../serving/openai_compatible_server.md#score-api) is similar to `LLM.score` for cross-encoder models.
- [Pooling API](../serving/openai_compatible_server.md#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.

!!! note
    Please use one of the more specific endpoints or set the task directly when using the [Pooling API](../serving/openai_compatible_server.md#pooling-api):

    - For embeddings, use [Embeddings API](../serving/openai_compatible_server.md#embeddings-api) or `"task":"embed"`.
    - For classification logits, use [Classification API](../serving/openai_compatible_server.md#classification-api) or `"task":"classify"`.
    - For similarity scores, use [Score API](../serving/openai_compatible_server.md#score-api).
    - For rewards, use `"task":"token_classify"`.
    - For token classification, use `"task":"token_classify"`.
    - For multi-vector retrieval, use `"task":"token_embed"`.
    - For IO Processor Plugins, use `"task":"plugin"`.

```python
# start a supported embeddings model server with `vllm serve`, e.g.
# vllm serve intfloat/e5-small
import requests

host = "localhost"
port = "8000"
model_name = "intfloat/e5-small"

api_url = f"http://{host}:{port}/pooling"

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
prompt = {"model": model_name, "input": prompts, "task": "embed"}

response = requests.post(api_url, json=prompt)

for output in response.json()["data"]:
    data = output["data"]
    print(f"Data: {data!r} (size={len(data)})")
```

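Because the Pooling API takes the task as a request field, it can help to validate the task name before sending anything. The sketch below builds the same kind of payload as the example above without contacting a server. `build_pooling_request` is a hypothetical helper, and the set of task names is taken from the note above rather than from the server's schema.

```python
import json

# Task names listed in the note above for the Pooling API.
POOLING_TASKS = {"embed", "classify", "token_classify", "token_embed", "plugin"}


def build_pooling_request(model: str, inputs: list[str], task: str) -> dict:
    """Build a JSON-serializable payload for `/pooling` (hypothetical helper)."""
    if task not in POOLING_TASKS:
        raise ValueError(f"unknown pooling task {task!r}; expected one of {sorted(POOLING_TASKS)}")
    return {"model": model, "input": list(inputs), "task": task}


payload = build_pooling_request(
    "intfloat/e5-small",
    ["Hello, my name is", "The capital of France is"],
    task="embed",
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed as `json=payload` to `requests.post`, as in the example above.
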
## Matryoshka Embeddings

[Matryoshka Embeddings](https://sbert.net/examples/sentence_transformer/training/matryoshka/README.html#matryoshka-embeddings) or [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) is a technique used in training embedding models. It allows users to trade off between performance and cost.

!!! warning
    Not all embedding models are trained using Matryoshka Representation Learning. To avoid misuse of the `dimensions` parameter, vLLM returns an error for requests that attempt to change the output dimension of models that do not support Matryoshka Embeddings.

    For example, setting the `dimensions` parameter while using the `BAAI/bge-m3` model results in the following error:

    ```json
    {"object":"error","message":"Model \"BAAI/bge-m3\" does not support matryoshka representation, changing output dimensions will lead to poor results.","type":"BadRequestError","param":null,"code":400}
    ```

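Conceptually, a Matryoshka-trained embedding can be shortened by keeping its leading dimensions and re-normalizing. The sketch below shows that general idea in plain Python. It is an assumption-laden illustration of MRL usage, not a description of vLLM's internal implementation, and `truncate_embedding` is a hypothetical helper.

```python
import math


def truncate_embedding(embedding: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` values and rescale to unit length."""
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and the embedding size")
    head = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


full = [3.0, 4.0, 12.0]  # made-up 3-dimensional embedding
print(truncate_embedding(full, 2))
```

Smaller vectors cost less to store and compare, which is the cost side of the trade-off. Only models trained with MRL are expected to keep useful quality after truncation.
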
### Manually enable Matryoshka Embeddings

There is currently no official interface for specifying support for Matryoshka Embeddings. In vLLM, if `is_matryoshka` is `True` in `config.json`, you can change the output dimension to arbitrary values. Use `matryoshka_dimensions` to control the allowed output dimensions.

For models that support Matryoshka Embeddings but are not recognized by vLLM, manually override the config using `hf_overrides={"is_matryoshka": True}` or `hf_overrides={"matryoshka_dimensions": [<allowed output dimensions>]}` (offline), or `--hf-overrides '{"is_matryoshka": true}'` or `--hf-overrides '{"matryoshka_dimensions": [<allowed output dimensions>]}'` (online).

Here is an example of serving a model with Matryoshka Embeddings enabled:

```bash
vllm serve Snowflake/snowflake-arctic-embed-m-v1.5 --hf-overrides '{"matryoshka_dimensions":[256]}'
```

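The offline and online forms differ only in syntax: a Python `dict` offline, a JSON string on the command line. A small sketch of deriving the CLI flag from the Python value, using only the standard library (`hf_overrides_flag` is a hypothetical helper name):

```python
import json
import shlex


def hf_overrides_flag(overrides: dict) -> str:
    """Render a Python dict as a shell-safe `--hf-overrides` argument."""
    as_json = json.dumps(overrides, separators=(",", ":"))
    return "--hf-overrides " + shlex.quote(as_json)


print(hf_overrides_flag({"matryoshka_dimensions": [256]}))
print(hf_overrides_flag({"is_matryoshka": True}))
```

Note that `json.dumps` renders Python's `True` as JSON `true`, which matches the online form shown above.
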
### Offline Inference

You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the `dimensions` parameter in [PoolingParams][vllm.PoolingParams].

```python
from vllm import LLM, PoolingParams

llm = LLM(
    model="jinaai/jina-embeddings-v3",
    runner="pooling",
    trust_remote_code=True,
)
outputs = llm.embed(
    ["Follow the white rabbit."],
    pooling_params=PoolingParams(dimensions=32),
)
print(outputs[0].outputs)
```

A code example can be found here: [examples/pooling/embed/embed_matryoshka_fy_offline.py](../../examples/pooling/embed/embed_matryoshka_fy_offline.py)

 								### Online Inference
-												Optimize the wording of the document and unify the terminology and th… (#29491)


											
										
										
											2025-11-26 21:16:12 +08:00
+								Use the following command to start the vLLM server.
-												[Doc] Document Matryoshka Representation Learning support (#16770)


											
										
										
											2025-04-17 21:37:37 +08:00
-												[Doc] ruff format remaining Python examples (#26795)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
											
										
										
											2025-10-15 16:25:49 +08:00
+								```bash
-												[Doc] Document Matryoshka Representation Learning support (#16770)


											
										
										
											2025-04-17 21:37:37 +08:00
+								vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
 								```
 								You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the dimensions parameter.
-												[Doc] ruff format remaining Python examples (#26795)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
											
										
										
											2025-10-15 16:25:49 +08:00
+								```bash
-												[Doc] Document Matryoshka Representation Learning support (#16770)


											
										
										
											2025-04-17 21:37:37 +08:00
+								curl http://127.0.0.1:8000/v1/embeddings \
 								  -H 'accept: application/json' \
 								  -H 'Content-Type: application/json' \
 								  -d '{
 								    "input": "Follow the white rabbit.",
 								    "model": "jinaai/jina-embeddings-v3",
 								    "encoding_format": "float",
-												[Frontend] Using matryoshka_dimensions control the allowed output dimensions. (#16970)


											
										
										
											2025-04-24 22:06:28 +08:00
+								    "dimensions": 32
-												[Doc] Document Matryoshka Representation Learning support (#16770)


											
										
										
											2025-04-17 21:37:37 +08:00
+								  }'
 								```

Expected output:

```json
{"id":"embd-5c21fc9a5c9d4384a1b021daccaf9f64","object":"list","created":1745476417,"model":"jinaai/jina-embeddings-v3","data":[{"index":0,"object":"embedding","embedding":[-0.3828125,-0.1357421875,0.03759765625,0.125,0.21875,0.09521484375,-0.003662109375,0.1591796875,-0.130859375,-0.0869140625,-0.1982421875,0.1689453125,-0.220703125,0.1728515625,-0.2275390625,-0.0712890625,-0.162109375,-0.283203125,-0.055419921875,-0.0693359375,0.031982421875,-0.04052734375,-0.2734375,0.1826171875,-0.091796875,0.220703125,0.37890625,-0.0888671875,-0.12890625,-0.021484375,-0.0091552734375,0.23046875]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0,"prompt_tokens_details":null}}
```

An OpenAI client example can be found here: [examples/pooling/embed/openai_embedding_matryoshka_fy_client.py](../../examples/pooling/embed/openai_embedding_matryoshka_fy_client.py)
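
Conceptually, asking a Matryoshka model for fewer `dimensions` corresponds to keeping the leading components of the full embedding and re-normalizing the result. The sketch below illustrates that idea on the client side in plain Python. `truncate_embedding` is a hypothetical helper written for this illustration, not a vLLM API, and it assumes the model was trained so that prefixes of its embedding remain meaningful.

```python
import math


def truncate_embedding(embedding: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` components and L2-normalize the result.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and len(embedding)")
    head = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # avoid dividing by zero for an all-zero prefix
    return [x / norm for x in head]


# Toy 4-dimensional vector standing in for a real embedding.
full = [3.0, 4.0, 12.0, 0.5]
short = truncate_embedding(full, 2)
print(short)  # the 2-dimensional prefix, rescaled to unit length
```

In practice, prefer passing `dimensions` in the request so the server returns the shortened vector directly.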

## Specific models

### ColBERT Late Interaction Models

[ColBERT](https://arxiv.org/abs/2004.12832) (Contextualized Late Interaction over BERT) is a retrieval model that uses per-token embeddings and MaxSim scoring for document ranking. Unlike single-vector embedding models, ColBERT retains token-level representations and computes relevance scores through late interaction, which typically gives better accuracy than single-vector models while remaining more efficient than cross-encoders.
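
To make MaxSim concrete: for each query token, take the highest similarity to any document token, then sum those maxima over the query tokens. The following is a minimal sketch in plain Python using made-up 2-dimensional vectors. The function names are illustrative, and this is not how vLLM implements scoring internally.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy per-token embeddings (one vector per token), invented for this example.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]
doc_b = [[0.0, -1.0]]

scores = {
    "doc_a": maxsim_score(query, doc_a),
    "doc_b": maxsim_score(query, doc_b),
}
print(scores)  # doc_a should rank above doc_b
```

Because the interaction happens only after documents and queries are encoded separately, document token embeddings can be computed once and reused across queries.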

vLLM supports ColBERT models with multiple encoder backbones:

| Architecture | Backbone | Example HF Models |
|---|---|---|
| `HF_ColBERT` | BERT | `answerdotai/answerai-colbert-small-v1`, `colbert-ir/colbertv2.0` |
| `ColBERTModernBertModel` | ModernBERT | `lightonai/GTE-ModernColBERT-v1` |
| `ColBERTJinaRobertaModel` | Jina XLM-RoBERTa | `jinaai/jina-colbert-v2` |

**BERT-based ColBERT** models work out of the box:

```shell
vllm serve answerdotai/answerai-colbert-small-v1
```

For **non-BERT backbones**, use `--hf-overrides` to set the correct architecture:

```shell
# ModernBERT backbone
vllm serve lightonai/GTE-ModernColBERT-v1 \
    --hf-overrides '{"architectures": ["ColBERTModernBertModel"]}'

# Jina XLM-RoBERTa backbone
vllm serve jinaai/jina-colbert-v2 \
    --hf-overrides '{"architectures": ["ColBERTJinaRobertaModel"]}' \
    --trust-remote-code
```

Then you can use the `/rerank` endpoint:

```shell
curl -s http://localhost:8000/rerank -H "Content-Type: application/json" -d '{
    "model": "answerdotai/answerai-colbert-small-v1",
    "query": "What is machine learning?",
    "documents": [
        "Machine learning is a subset of artificial intelligence.",
        "Python is a programming language.",
        "Deep learning uses neural networks."
    ]
}'
```

Or the `/score` endpoint:

```shell
curl -s http://localhost:8000/score -H "Content-Type: application/json" -d '{
    "model": "answerdotai/answerai-colbert-small-v1",
    "text_1": "What is machine learning?",
    "text_2": ["Machine learning is a subset of AI.", "The weather is sunny."]
}'
```

You can also get the raw token embeddings using the `/pooling` endpoint with the `token_embed` task:

```shell
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
    "model": "answerdotai/answerai-colbert-small-v1",
    "input": "What is machine learning?",
    "task": "token_embed"
}'
```

An example can be found here: [examples/pooling/score/colbert_rerank_online.py](../../examples/pooling/score/colbert_rerank_online.py)
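
The same `/rerank` request can be sent from Python. The sketch below uses only the standard library and mirrors the curl body shown earlier. `build_rerank_payload` and `post_json` are hypothetical helper names, the server address is assumed to be `http://localhost:8000` as in the curl examples, and the response is returned as decoded JSON without assuming any particular fields.

```python
import json
import urllib.request


def build_rerank_payload(model: str, query: str, documents: list[str]) -> dict:
    """Build the same JSON body as the curl /rerank example."""
    return {"model": model, "query": query, "documents": documents}


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_rerank_payload(
    "answerdotai/answerai-colbert-small-v1",
    "What is machine learning?",
    [
        "Machine learning is a subset of artificial intelligence.",
        "Python is a programming language.",
        "Deep learning uses neural networks.",
    ],
)
print(json.dumps(payload, indent=2))

# With a server running (assumed address), send the request like this:
# result = post_json("http://localhost:8000/rerank", payload)
```

Keeping payload construction separate from the network call makes it easy to inspect the request before a server is available.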

### ColQwen3 Multi-Modal Late Interaction Models

ColQwen3 is based on [ColPali](https://arxiv.org/abs/2407.01449), which extends ColBERT's late interaction approach to **multi-modal** inputs. While ColBERT operates on text-only token embeddings, ColPali/ColQwen3 can embed both **text and images** (e.g. PDF pages, screenshots, diagrams) into per-token L2-normalized vectors and compute relevance via MaxSim scoring. ColQwen3 specifically uses Qwen3-VL as its vision-language backbone.

| Architecture | Backbone | Example HF Models |
|---|---|---|
| `ColQwen3` | Qwen3-VL | `TomoroAI/tomoro-colqwen3-embed-4b`, `TomoroAI/tomoro-colqwen3-embed-8b` |
| `OpsColQwen3Model` | Qwen3-VL | `OpenSearch-AI/Ops-Colqwen3-4B`, `OpenSearch-AI/Ops-Colqwen3-8B` |
| `Qwen3VLNemotronEmbedModel` | Qwen3-VL | `nvidia/nemotron-colembed-vl-4b-v2`, `nvidia/nemotron-colembed-vl-8b-v2` |

Start the server:

```shell
vllm serve TomoroAI/tomoro-colqwen3-embed-4b --max-model-len 4096
```

#### Text-only scoring and reranking

Use the `/rerank` endpoint:

```shell
curl -s http://localhost:8000/rerank -H "Content-Type: application/json" -d '{
    "model": "TomoroAI/tomoro-colqwen3-embed-4b",
    "query": "What is machine learning?",
    "documents": [
        "Machine learning is a subset of artificial intelligence.",
        "Python is a programming language.",
        "Deep learning uses neural networks."
    ]
}'
```

Or the `/score` endpoint:

```shell
curl -s http://localhost:8000/score -H "Content-Type: application/json" -d '{
    "model": "TomoroAI/tomoro-colqwen3-embed-4b",
    "text_1": "What is the capital of France?",
    "text_2": ["The capital of France is Paris.", "Python is a programming language."]
}'
```

#### Multi-modal scoring and reranking (text query × image documents)

The `/score` and `/rerank` endpoints also accept multi-modal inputs directly.
Pass image documents using the `data_1`/`data_2` fields (for `/score`) or the `documents` field (for `/rerank`),
each with a `content` list containing `image_url` and `text` parts. This is the same format used by the
OpenAI chat completion API.

Score a text query against image documents:

```shell
curl -s http://localhost:8000/score -H "Content-Type: application/json" -d '{
    "model": "TomoroAI/tomoro-colqwen3-embed-4b",
    "data_1": "Retrieve the city of Beijing",
    "data_2": [
        {
            "content": [
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64>"}},
                {"type": "text", "text": "Describe the image."}
            ]
        }
    ]
}'
```

Rerank image documents by a text query:

```shell
curl -s http://localhost:8000/rerank -H "Content-Type: application/json" -d '{
    "model": "TomoroAI/tomoro-colqwen3-embed-4b",
    "query": "Retrieve the city of Beijing",
    "documents": [
        {
            "content": [
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_1>"}},
                {"type": "text", "text": "Describe the image."}
            ]
        },
        {
            "content": [
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_2>"}},
                {"type": "text", "text": "Describe the image."}
            ]
        }
    ],
    "top_n": 2
}'
```
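
The `<BASE64>` placeholders above stand for base64-encoded image bytes embedded in a data URL. A minimal Python sketch of building such a document follows. `image_document` is a hypothetical helper, and the byte string used here is only a stand-in for the contents of a real image file.

```python
import base64
import json


def image_document(image_bytes: bytes, text: str, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes and a text part in the `content` format shown above."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
            {"type": "text", "text": text},
        ]
    }


# Placeholder bytes; in practice read them from a file opened in binary mode.
fake_png = b"\x89PNG\r\n\x1a\n"

payload = {
    "model": "TomoroAI/tomoro-colqwen3-embed-4b",
    "query": "Retrieve the city of Beijing",
    "documents": [image_document(fake_png, "Describe the image.")],
    "top_n": 1,
}
print(json.dumps(payload)[:80])
```

The resulting `payload` has the same shape as the `/rerank` body in the curl example and can be posted with any HTTP client.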

#### Raw token embeddings

You can also get the raw token embeddings using the `/pooling` endpoint with the `token_embed` task:

```shell
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
    "model": "TomoroAI/tomoro-colqwen3-embed-4b",
    "input": "What is machine learning?",
    "task": "token_embed"
}'
```

For **image inputs** via the pooling endpoint, use the chat-style `messages` field:

```shell
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
    "model": "TomoroAI/tomoro-colqwen3-embed-4b",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64>"}},
                {"type": "text", "text": "Describe the image."}
            ]
        }
    ]
}'
```

#### Examples

- Multi-vector retrieval: [examples/pooling/token_embed/colqwen3_token_embed_online.py](../../examples/pooling/token_embed/colqwen3_token_embed_online.py)
- Reranking (text + multi-modal): [examples/pooling/score/colqwen3_rerank_online.py](../../examples/pooling/score/colqwen3_rerank_online.py)

### BAAI/bge-m3

The `BAAI/bge-m3` model comes with extra weights for sparse and ColBERT embeddings. However, its `config.json`
declares the architecture as `XLMRobertaModel`, which makes vLLM load it as a vanilla RoBERTa model without the
extra weights. To load the full model weights, override its architecture like this:

```shell
vllm serve BAAI/bge-m3 --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'
```

Then you can obtain the sparse embeddings like this:

```shell
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
    "model": "BAAI/bge-m3",
    "task": "token_classify",
    "input": ["What is BGE M3?", "Definition of BM25"]
}'
```

Due to limitations in the output schema, the output for each input is a list of
per-token scores without the tokens themselves. This means that you also have to call
`/tokenize` to pair tokens with scores.
Refer to the tests in `tests/models/language/pooling/test_bge_m3.py` to see how
to do that.
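
Once you have a token list and a score list of the same length, pairing them is straightforward. The sketch below is a rough illustration only: `sparse_weights` is a hypothetical helper, the tokens and scores are invented, and it does not parse the actual `/pooling` or `/tokenize` responses (see the test file mentioned above for that). Keeping the highest score for a repeated token and dropping non-positive scores are assumptions made for this example.

```python
def sparse_weights(tokens: list[str], scores: list[float]) -> dict[str, float]:
    """Map each token to its highest score, skipping non-positive scores.

    Hypothetical helper; the aggregation rule is an assumption for illustration.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    weights: dict[str, float] = {}
    for token, score in zip(tokens, scores):
        if score > 0 and score > weights.get(token, 0.0):
            weights[token] = score
    return weights


# Invented tokens and scores standing in for /tokenize and /pooling results.
tokens = ["What", "is", "BGE", "M3", "?", "is"]
scores = [0.1, 0.05, 0.3, 0.25, 0.0, 0.07]

weights = sparse_weights(tokens, scores)
print(weights)
```

Depending on the tokenizer, you may also need to drop special tokens before pairing so that both lists line up.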

You can obtain the ColBERT embeddings like this:

```shell
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
    "model": "BAAI/bge-m3",
    "task": "token_embed",
    "input": ["What is BGE M3?", "Definition of BM25"]
}'
```

## Deprecated Features

### Encode task

We have split the `encode` task into two more specific token-wise tasks: `token_embed` and `token_classify`:

- `token_embed` is the same as `embed`, using normalization as the activation.
- `token_classify` is the same as `classify`, by default using softmax as the activation.

Pooling models now support all pooling tasks by default, so you can use them without any extra settings.

- To extract hidden states, prefer the `token_embed` task.
- For reward models, prefer the `token_classify` task.