The difference between the (sequence) embedding task and the token embedding task is that (sequence) embedding outputs one embedding per sequence, while token embedding outputs an embedding for each token.
Many embedding models support both (sequence) embedding and token embedding. For further details on (sequence) embedding, please refer to [this page](embed.md).
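To make the distinction concrete, here is a minimal sketch in plain Python (not using vLLM): the toy hidden states are made up, and mean pooling is just one common way to reduce token embeddings to a single sequence embedding.

```python
# Toy hidden states for a 3-token sequence, hidden size 2 (illustrative numbers).
hidden_states = [
    [1.0, 2.0],  # token 0
    [3.0, 4.0],  # token 1
    [5.0, 0.0],  # token 2
]

# Token embedding: one vector per token -- the hidden states themselves.
token_embeddings = hidden_states

# (Sequence) embedding: a single vector for the whole sequence,
# here via mean pooling over tokens (one common pooling strategy).
dim = len(hidden_states[0])
sequence_embedding = [
    sum(tok[d] for tok in hidden_states) / len(hidden_states) for d in range(dim)
]

print(len(token_embeddings))  # one embedding per token
print(sequence_embedding)     # one embedding for the sequence
```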
Similarity scores can be computed using late interaction between two input prompts via the score API. For more information, see [Score API](scoring.md).
### Extract last hidden states
Models of any architecture can be converted into embedding models using `--convert embed`. Token embedding can then be used to extract the last hidden states from these models.
<sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./README.md#model-conversion))
\* Feature support is the same as that of the original model.
If your model is not in the above list, we will try to automatically convert the model using [as_embedding_model][vllm.model_executor.models.adapters.as_embedding_model].
--8<-- [end:supported-token-embed-models]
## Offline Inference
### Pooling Parameters
The following [pooling parameters][vllm.PoolingParams] are supported.
```python
from vllm import LLM

# The model name is illustrative; any model that supports token embedding works.
llm = LLM(model="intfloat/e5-small", runner="pooling")

(output,) = llm.encode("Hello, my name is", pooling_task="token_embed")
data = output.outputs.data
print(f"Data: {data!r}")
```
### `LLM.score`
The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs.
All models that support the token embedding task also support the score API, which computes similarity scores via late interaction between the two input prompts.
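Late interaction (ColBERT-style MaxSim) scores a pair by matching each query token embedding against its most similar document token embedding and summing the matches. A minimal pure-Python sketch; the toy vectors and plain dot-product similarity are illustrative, not vLLM's exact implementation:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def late_interaction_score(query_tokens, doc_tokens):
    """MaxSim: for each query token, take the best-matching doc token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy token embeddings (2-dimensional, illustrative).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

print(late_interaction_score(query, doc))  # 0.9 (best match for q0) + 0.8 (for q1)
```

In practice the score API runs the embedding model on both prompts, performs this matching over the resulting token embeddings, and returns the aggregated score.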