# Pooling Models

vLLM also supports pooling models, such as embedding, classification, and reward models.

In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface. These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input before returning them.

!!! note
    We currently support pooling models primarily for convenience.
    This is not guaranteed to provide any performance improvements over using Hugging Face Transformers or Sentence Transformers directly.
    We plan to optimize pooling models in vLLM. Please comment if you have any suggestions!

## Configuration

### Model Runner

Run a model in pooling mode via the option `--runner pooling`.

!!! tip
    There is no need to set this option in the vast majority of cases as vLLM can automatically detect the appropriate model runner via `--runner auto`.

### Model Conversion

vLLM can adapt models for various pooling tasks via the option `--convert <type>`.

If `--runner pooling` has been set (manually or automatically) but the model does not implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface, vLLM will attempt to automatically convert the model according to the architecture names shown in the table below.

| Architecture                                    | `--convert` | Supported pooling tasks               |
|-------------------------------------------------|-------------|---------------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed`     | `token_embed`, `embed`                |
| `*ForRewardModeling`, `*RewardModel`            | `embed`     | `token_embed`, `embed`                |
| `*For*Classification`, `*ClassificationModel`   | `classify`  | `token_classify`, `classify`, `score` |

!!! tip
    You can explicitly set `--convert <type>` to specify how to convert the model.

### Pooling Tasks

Each pooling model in vLLM supports one or more of these tasks according to [Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks], enabling the corresponding APIs:

| Task             | APIs                                                                          |
|------------------|-------------------------------------------------------------------------------|
| `embed`          | `LLM.embed(...)`, `LLM.score(...)`\*, `LLM.encode(..., pooling_task="embed")` |
| `classify`       | `LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")`               |
| `score`          | `LLM.score(...)`                                                              |
| `token_classify` | `LLM.reward(...)`, `LLM.encode(..., pooling_task="token_classify")`           |
| `token_embed`    | `LLM.encode(..., pooling_task="token_embed")`                                 |
| `plugin`         | `LLM.encode(..., pooling_task="plugin")`                                      |

\* The `LLM.score(...)` API falls back to the `embed` task if the model does not support the `score` task.

### Pooler Configuration

#### Predefined models

If the [Pooler][vllm.model_executor.layers.pooler.Pooler] defined by the model accepts `pooler_config`, you can override some of its attributes via the `--pooler-config` option.

#### Converted models

If the model has been converted via `--convert` (see above), the pooler assigned to each task has the following attributes by default:

| Task       | Pooling Type | Normalization | Softmax |
|------------|--------------|---------------|---------|
| `embed`    | `LAST`       | ✅︎            | ❌      |
| `classify` | `LAST`       | ❌            | ✅︎      |

When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models, the Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.

You can further customize this via the `--pooler-config` option, which takes priority over both the model's and Sentence Transformers' defaults.
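For example, a converted model's pooling behavior could be overridden like this. This is a minimal sketch, assuming the `pooling_type` and `normalize` fields correspond to the Pooling Type and Normalization attributes in the table above:

```bash
# override the default last-token pooling with mean pooling,
# keeping L2 normalization of the resulting embedding
vllm serve intfloat/e5-small \
  --runner pooling \
  --pooler-config '{"pooling_type": "MEAN", "normalize": true}'
```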
## Offline Inference

The [LLM][vllm.LLM] class provides various methods for offline inference. See [configuration](../api/README.md#configuration) for a list of options when initializing the model.

### `LLM.embed`

The [embed][vllm.LLM.embed] method outputs an embedding vector for each prompt. It is primarily designed for embedding models.

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```

A code example can be found here: [examples/offline_inference/basic/embed.py](../../examples/offline_inference/basic/embed.py)

### `LLM.classify`

The [classify][vllm.LLM.classify] method outputs a probability vector for each prompt. It is primarily designed for classification models.

```python
from vllm import LLM

llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
(output,) = llm.classify("Hello, my name is")

probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")
```

A code example can be found here: [examples/offline_inference/basic/classify.py](../../examples/offline_inference/basic/classify.py)

### `LLM.score`

The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs. It is designed for embedding models and cross-encoder models. Embedding models use cosine similarity, while [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html) serve as rerankers between candidate query-document pairs in RAG systems.

!!! note
    vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
    To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain).

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
(output,) = llm.score(
    "What is the capital of France?",
    "The capital of Brazil is Brasilia.",
)

score = output.outputs.score
print(f"Score: {score}")
```

A code example can be found here: [examples/offline_inference/basic/score.py](../../examples/offline_inference/basic/score.py)

### `LLM.reward`

The [reward][vllm.LLM.reward] method is available to all reward models in vLLM.

```python
from vllm import LLM

llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True)
(output,) = llm.reward("Hello, my name is")

data = output.outputs.data
print(f"Data: {data!r}")
```

A code example can be found here: [examples/offline_inference/basic/reward.py](../../examples/offline_inference/basic/reward.py)

### `LLM.encode`

The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.

!!! note
    Please use one of the more specific methods or set the task directly when using `LLM.encode`:

    - For embeddings, use `LLM.embed(...)` or `pooling_task="embed"`.
    - For classification logits, use `LLM.classify(...)` or `pooling_task="classify"`.
    - For similarity scores, use `LLM.score(...)`.
    - For rewards, use `LLM.reward(...)` or `pooling_task="token_classify"`.
    - For token classification, use `pooling_task="token_classify"`.
    - For multi-vector retrieval, use `pooling_task="token_embed"`.
    - For IO Processor Plugins, use `pooling_task="plugin"`.
```python
from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")
(output,) = llm.encode("Hello, my name is", pooling_task="embed")

data = output.outputs.data
print(f"Data: {data!r}")
```

## Online Serving

Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:

- [Embeddings API](../serving/openai_compatible_server.md#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
- [Classification API](../serving/openai_compatible_server.md#classification-api) is similar to `LLM.classify` and is applicable to sequence classification models.
- [Score API](../serving/openai_compatible_server.md#score-api) is similar to `LLM.score` for cross-encoder models.
- [Pooling API](../serving/openai_compatible_server.md#pooling-api) is similar to `LLM.encode` and is applicable to all types of pooling models.

!!! note
    Please use one of the more specific endpoints or set the task directly when using the [Pooling API](../serving/openai_compatible_server.md#pooling-api):

    - For embeddings, use the [Embeddings API](../serving/openai_compatible_server.md#embeddings-api) or `"task":"embed"`.
    - For classification logits, use the [Classification API](../serving/openai_compatible_server.md#classification-api) or `"task":"classify"`.
    - For similarity scores, use the [Score API](../serving/openai_compatible_server.md#score-api).
    - For rewards, use `"task":"token_classify"`.
    - For token classification, use `"task":"token_classify"`.
    - For multi-vector retrieval, use `"task":"token_embed"`.
    - For IO Processor Plugins, use `"task":"plugin"`.

```python
# start a supported embeddings model server with `vllm serve`, e.g.
# vllm serve intfloat/e5-small
import requests

host = "localhost"
port = "8000"
model_name = "intfloat/e5-small"
api_url = f"http://{host}:{port}/pooling"

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

payload = {"model": model_name, "input": prompts, "task": "embed"}
response = requests.post(api_url, json=payload)

for output in response.json()["data"]:
    data = output["data"]
    print(f"Data: {data!r} (size={len(data)})")
```
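Since the server is OpenAI-compatible, the Embeddings API can also be called with the official `openai` client instead of raw `requests`. A minimal sketch, assuming the same `intfloat/e5-small` server as above (the API key value is a placeholder; vLLM ignores it unless one is configured):

```python
from openai import OpenAI

# assumes a running server: vllm serve intfloat/e5-small
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.embeddings.create(
    model="intfloat/e5-small",
    input=["Hello, my name is", "The capital of France is"],
)
for item in response.data:
    print(f"Embedding: size={len(item.embedding)}")
```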
## Matryoshka Embeddings

[Matryoshka Embeddings](https://sbert.net/examples/sentence_transformer/training/matryoshka/README.html#matryoshka-embeddings) or [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) is a technique used in training embedding models. It allows users to trade off between performance and cost.

!!! warning
    Not all embedding models are trained using Matryoshka Representation Learning. To avoid misuse of the `dimensions` parameter, vLLM returns an error for requests that attempt to change the output dimension of models that do not support Matryoshka Embeddings.

    For example, setting the `dimensions` parameter while using the `BAAI/bge-m3` model will result in the following error.

    ```json
    {"object":"error","message":"Model \"BAAI/bge-m3\" does not support matryoshka representation, changing output dimensions will lead to poor results.","type":"BadRequestError","param":null,"code":400}
    ```

### Manually enable Matryoshka Embeddings

There is currently no official interface for specifying support for Matryoshka Embeddings. In vLLM, if `is_matryoshka` is `True` in `config.json`, you can change the output dimension to arbitrary values. Use `matryoshka_dimensions` to control the allowed output dimensions.

For models that support Matryoshka Embeddings but are not recognized by vLLM, manually override the config using `hf_overrides={"is_matryoshka": True}` or `hf_overrides={"matryoshka_dimensions": []}` (offline), or `--hf-overrides '{"is_matryoshka": true}'` or `--hf-overrides '{"matryoshka_dimensions": []}'` (online).

Here is an example to serve a model with Matryoshka Embeddings enabled.

```bash
vllm serve Snowflake/snowflake-arctic-embed-m-v1.5 --hf-overrides '{"matryoshka_dimensions":[256]}'
```

### Offline Inference

You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the `dimensions` parameter in [PoolingParams][vllm.PoolingParams].

```python
from vllm import LLM, PoolingParams

llm = LLM(
    model="jinaai/jina-embeddings-v3",
    runner="pooling",
    trust_remote_code=True,
)
outputs = llm.embed(
    ["Follow the white rabbit."],
    pooling_params=PoolingParams(dimensions=32),
)
print(outputs[0].outputs)
```

A code example can be found here: [examples/pooling/embed/embed_matryoshka_fy_offline.py](../../examples/pooling/embed/embed_matryoshka_fy_offline.py)

### Online Inference

Use the following command to start the vLLM server.

```bash
vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
```

You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the `dimensions` parameter.

```bash
curl http://127.0.0.1:8000/v1/embeddings \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Follow the white rabbit.",
    "model": "jinaai/jina-embeddings-v3",
    "encoding_format": "float",
    "dimensions": 32
  }'
```

Expected output:

```json
{"id":"embd-5c21fc9a5c9d4384a1b021daccaf9f64","object":"list","created":1745476417,"model":"jinaai/jina-embeddings-v3","data":[{"index":0,"object":"embedding","embedding":[-0.3828125,-0.1357421875,0.03759765625,0.125,0.21875,0.09521484375,-0.003662109375,0.1591796875,-0.130859375,-0.0869140625,-0.1982421875,0.1689453125,-0.220703125,0.1728515625,-0.2275390625,-0.0712890625,-0.162109375,-0.283203125,-0.055419921875,-0.0693359375,0.031982421875,-0.04052734375,-0.2734375,0.1826171875,-0.091796875,0.220703125,0.37890625,-0.0888671875,-0.12890625,-0.021484375,-0.0091552734375,0.23046875]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0,"prompt_tokens_details":null}}
```

An OpenAI client example can be found here: [examples/pooling/embed/openai_embedding_matryoshka_fy_client.py](../../examples/pooling/embed/openai_embedding_matryoshka_fy_client.py)

## Specific models

### ColBERT Late Interaction Models

[ColBERT](https://arxiv.org/abs/2004.12832) (Contextualized Late Interaction over BERT) is a retrieval model that uses per-token embeddings and MaxSim scoring for document ranking. Unlike single-vector embedding models, ColBERT retains token-level representations and computes relevance scores through late interaction, providing better accuracy while being more efficient than cross-encoders.
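As a concrete illustration of late interaction, here is a minimal, self-contained sketch of MaxSim scoring (illustrative only, not part of vLLM's API): each query token is matched against its most similar document token, and the per-token maxima are summed.

```python
import torch

def maxsim_score(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """MaxSim: for each query token embedding, take the best-matching
    document token embedding, then sum over query tokens.

    q: (num_query_tokens, dim), d: (num_doc_tokens, dim),
    both assumed L2-normalized so dot products are cosine similarities.
    """
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=-1).values.sum()  # scalar relevance score

# toy example with random per-token embeddings
q = torch.nn.functional.normalize(torch.randn(5, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
print(maxsim_score(q, d))
```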
vLLM supports ColBERT models with multiple encoder backbones:

| Architecture | Backbone | Example HF Models |
|---|---|---|
| `HF_ColBERT` | BERT | `answerdotai/answerai-colbert-small-v1`, `colbert-ir/colbertv2.0` |
| `ColBERTModernBertModel` | ModernBERT | `lightonai/GTE-ModernColBERT-v1` |
| `ColBERTJinaRobertaModel` | Jina XLM-RoBERTa | `jinaai/jina-colbert-v2` |

**BERT-based ColBERT** models work out of the box:

```shell
vllm serve answerdotai/answerai-colbert-small-v1
```

For **non-BERT backbones**, use `--hf-overrides` to set the correct architecture:

```shell
# ModernBERT backbone
vllm serve lightonai/GTE-ModernColBERT-v1 \
  --hf-overrides '{"architectures": ["ColBERTModernBertModel"]}'

# Jina XLM-RoBERTa backbone
vllm serve jinaai/jina-colbert-v2 \
  --hf-overrides '{"architectures": ["ColBERTJinaRobertaModel"]}' \
  --trust-remote-code
```

Then you can use the rerank endpoint:

```shell
curl -s http://localhost:8000/rerank -H "Content-Type: application/json" -d '{
  "model": "answerdotai/answerai-colbert-small-v1",
  "query": "What is machine learning?",
  "documents": [
    "Machine learning is a subset of artificial intelligence.",
    "Python is a programming language.",
    "Deep learning uses neural networks."
  ]
}'
```

Or the score endpoint:

```shell
curl -s http://localhost:8000/score -H "Content-Type: application/json" -d '{
  "model": "answerdotai/answerai-colbert-small-v1",
  "text_1": "What is machine learning?",
  "text_2": ["Machine learning is a subset of AI.", "The weather is sunny."]
}'
```

You can also get the raw token embeddings using the pooling endpoint with the `token_embed` task:

```shell
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
  "model": "answerdotai/answerai-colbert-small-v1",
  "input": "What is machine learning?",
  "task": "token_embed"
}'
```

An example can be found here: [examples/pooling/score/colbert_rerank_online.py](../../examples/pooling/score/colbert_rerank_online.py)

### ColQwen3 Multi-Modal Late Interaction Models

ColQwen3 is based on [ColPali](https://arxiv.org/abs/2407.01449), which extends ColBERT's late interaction approach to **multi-modal** inputs. While ColBERT operates on text-only token embeddings, ColPali/ColQwen3 can embed both **text and images** (e.g. PDF pages, screenshots, diagrams) into per-token L2-normalized vectors and compute relevance via MaxSim scoring. ColQwen3 specifically uses Qwen3-VL as its vision-language backbone.

| Architecture | Backbone | Example HF Models |
|---|---|---|
| `ColQwen3` | Qwen3-VL | `TomoroAI/tomoro-colqwen3-embed-4b`, `TomoroAI/tomoro-colqwen3-embed-8b` |
| `OpsColQwen3Model` | Qwen3-VL | `OpenSearch-AI/Ops-Colqwen3-4B`, `OpenSearch-AI/Ops-Colqwen3-8B` |
| `Qwen3VLNemotronEmbedModel` | Qwen3-VL | `nvidia/nemotron-colembed-vl-4b-v2`, `nvidia/nemotron-colembed-vl-8b-v2` |

Start the server:

```shell
vllm serve TomoroAI/tomoro-colqwen3-embed-4b --max-model-len 4096
```

#### Text-only scoring and reranking

Use the `/rerank` endpoint:

```shell
curl -s http://localhost:8000/rerank -H "Content-Type: application/json" -d '{
  "model": "TomoroAI/tomoro-colqwen3-embed-4b",
  "query": "What is machine learning?",
  "documents": [
    "Machine learning is a subset of artificial intelligence.",
    "Python is a programming language.",
    "Deep learning uses neural networks."
  ]
}'
```

Or the `/score` endpoint:

```shell
curl -s http://localhost:8000/score -H "Content-Type: application/json" -d '{
  "model": "TomoroAI/tomoro-colqwen3-embed-4b",
  "text_1": "What is the capital of France?",
  "text_2": ["The capital of France is Paris.", "Python is a programming language."]
}'
```

#### Multi-modal scoring and reranking (text query × image documents)

The `/score` and `/rerank` endpoints also accept multi-modal inputs directly. Pass image documents using the `data_1`/`data_2` (for `/score`) or `documents` (for `/rerank`) fields with a `content` list containing `image_url` and `text` parts, the same format used by the OpenAI chat completion API.

Score a text query against image documents:

```shell
curl -s http://localhost:8000/score -H "Content-Type: application/json" -d '{
  "model": "TomoroAI/tomoro-colqwen3-embed-4b",
  "data_1": "Retrieve the city of Beijing",
  "data_2": [
    {
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,"}},
        {"type": "text", "text": "Describe the image."}
      ]
    }
  ]
}'
```

Rerank image documents by a text query:

```shell
curl -s http://localhost:8000/rerank -H "Content-Type: application/json" -d '{
  "model": "TomoroAI/tomoro-colqwen3-embed-4b",
  "query": "Retrieve the city of Beijing",
  "documents": [
    {
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,"}},
        {"type": "text", "text": "Describe the image."}
      ]
    },
    {
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,"}},
        {"type": "text", "text": "Describe the image."}
      ]
    }
  ],
  "top_n": 2
}'
```

#### Raw token embeddings

You can also get the raw token embeddings using the `/pooling` endpoint with the `token_embed` task:

```shell
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
  "model": "TomoroAI/tomoro-colqwen3-embed-4b",
  "input": "What is machine learning?",
  "task": "token_embed"
}'
```

For **image inputs** via the pooling endpoint, use the chat-style `messages` field:

```shell
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
  "model": "TomoroAI/tomoro-colqwen3-embed-4b",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,"}},
        {"type": "text", "text": "Describe the image."}
      ]
    }
  ]
}'
```

#### Examples

- Multi-vector retrieval: [examples/pooling/token_embed/colqwen3_token_embed_online.py](../../examples/pooling/token_embed/colqwen3_token_embed_online.py)
- Reranking (text + multi-modal): [examples/pooling/score/colqwen3_rerank_online.py](../../examples/pooling/score/colqwen3_rerank_online.py)

### BAAI/bge-m3

The `BAAI/bge-m3` model comes with extra weights for sparse and ColBERT embeddings, but unfortunately its `config.json` declares the architecture as `XLMRobertaModel`, which makes vLLM load it as a vanilla RoBERTa model without the extra weights.

To load the full model weights, override its architecture like this:

```shell
vllm serve BAAI/bge-m3 --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'
```

Then you can obtain the sparse embeddings like this:

```shell
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
  "model": "BAAI/bge-m3",
  "task": "token_classify",
  "input": ["What is BGE M3?", "Definition of BM25"]
}'
```

Due to limitations in the output schema, the output consists of a list of token scores for each token of each input. This means that you have to call `/tokenize` as well to pair tokens with scores. Refer to the tests in `tests/models/language/pooling/test_bge_m3.py` to see how to do that.
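A minimal sketch of that pairing, assuming the server above is running locally (`/tokenize` returns token IDs, so decoding them to strings requires the model's tokenizer; exact response shapes may vary):

```python
import requests

base_url = "http://localhost:8000"
model = "BAAI/bge-m3"
text = "What is BGE M3?"

# token IDs for the prompt
tokens = requests.post(
    f"{base_url}/tokenize",
    json={"model": model, "prompt": text},
).json()["tokens"]

# one sparse score per token
scores = requests.post(
    f"{base_url}/pooling",
    json={"model": model, "task": "token_classify", "input": text},
).json()["data"][0]["data"]

for token_id, score in zip(tokens, scores):
    print(f"token_id={token_id}\tscore={score}")
```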
You can obtain the ColBERT embeddings like this:

```shell
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
  "model": "BAAI/bge-m3",
  "task": "token_embed",
  "input": ["What is BGE M3?", "Definition of BM25"]
}'
```

## Deprecated Features

### Encode task

The `encode` task has been split into two more specific token-wise tasks, `token_embed` and `token_classify`:

- `token_embed` is the same as `embed`, using normalization as the activation.
- `token_classify` is the same as `classify`, using softmax as the activation by default.

Pooling models now support all pooling tasks by default, so no extra settings are needed. An example follows this list.

- To extract hidden states, prefer the `token_embed` task.
- For reward models, prefer the `token_classify` task.
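A minimal sketch of the migration, reusing the `intfloat/e5-small` embedding model from the examples above (the exact shape of `outputs.data` depends on the model):

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")

# previously: llm.encode("Hello, my name is")  # deprecated "encode" task
# now, name the token-wise task explicitly:
(output,) = llm.encode("Hello, my name is", pooling_task="token_embed")

# one normalized embedding vector per input token
data = output.outputs.data
print(f"Data: {data!r}")
```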