

Embedding Usages

Embedding models are a class of machine learning models designed to transform unstructured data—such as text, images, or audio—into a structured numerical representation known as an embedding.

Summary

  • Model Usage: (sequence) embedding
  • Pooling Task: embed
  • Offline APIs:
    • LLM.embed(...)
    • LLM.encode(..., pooling_task="embed")
    • LLM.score(...)
  • Online APIs:
    • OpenAI-Compatible Embeddings API
    • Cohere Embed API

The primary distinction between (sequence) embedding and token embedding lies in their output granularity: (sequence) embedding produces a single embedding vector for an entire input sequence, whereas token embedding generates an embedding for each individual token within the sequence.
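To make the difference in granularity concrete, the sketch below starts from per-token vectors (what token embedding returns) and collapses them into one vector (what sequence embedding returns). Mean pooling is used purely for illustration; the toy values and the `mean_pool` helper are assumptions, and real models may pool differently (for example, by taking the last token).

```python
def mean_pool(token_embeddings: list[list[float]]) -> list[float]:
    """Average per-token vectors into a single sequence-level vector."""
    num_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[i] for tok in token_embeddings) / num_tokens for i in range(dim)]


# Toy token-embedding output: 3 tokens, 4 dimensions each (made-up values).
token_embeddings = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
]

# Sequence embedding: one vector for the whole input.
sequence_embedding = mean_pool(token_embeddings)

print(f"token embedding: {len(token_embeddings)} vectors of size {len(token_embeddings[0])}")
print(f"sequence embedding: 1 vector of size {len(sequence_embedding)}")
```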

Many embedding models support both (sequence) embedding and token embedding. For further details on token embedding, please refer to this page.

Typical Use Cases

Embedding

The most basic use case of embedding models is to embed the inputs, e.g. for RAG.

Pairwise Similarity

You can compute pairwise similarity scores to build a similarity matrix using the Score API.

Supported Models

--8<-- [start:supported-embed-models]

Text-only Models

| Architecture | Models | Example HF Models | LoRA | PP |
|---|---|---|---|---|
| `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | |
| `BertSpladeSparseEmbeddingModel` | SPLADE | `naver/splade-v3` | | |
| `ErnieModel` | BERT-like Chinese ERNIE | `shibing624/text2vec-base-chinese-sentence` | | |
| `Gemma2Model`<sup>C</sup> | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | | |
| `Gemma3TextModel`<sup>C</sup> | Gemma 3-based | `google/embeddinggemma-300m`, etc. | | |
| `GritLM` | GritLM | `parasail-ai/GritLM-7B-vllm` | | |
| `GteModel` | Arctic-Embed-2.0-M | `Snowflake/snowflake-arctic-embed-m-v2.0` | | |
| `GteNewModel` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | |
| `LlamaBidirectionalModel`<sup>C</sup> | Llama-based with bidirectional attention | `nvidia/llama-nemotron-embed-1b-v2`, etc. | | |
| `LlamaModel`<sup>C</sup>, `LlamaForCausalLM`<sup>C</sup>, `MistralModel`<sup>C</sup>, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | | |
| `ModernBertModel` | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | |
| `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | |
| `Qwen2Model`<sup>C</sup>, `Qwen2ForCausalLM`<sup>C</sup> | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | | |
| `Qwen3Model`<sup>C</sup>, `Qwen3ForCausalLM`<sup>C</sup> | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | | |
| `RobertaModel`, `RobertaForMaskedLM` | RoBERTa-based | `sentence-transformers/all-roberta-large-v1`, etc. | | |
| `VoyageQwen3BidirectionalEmbedModel`<sup>C</sup> | Voyage Qwen3-based with bidirectional attention | `voyageai/voyage-4-nano`, etc. | | |
| `XLMRobertaModel` | XLMRobertaModel-based | `BAAI/bge-m3` (see note), `intfloat/multilingual-e5-base`, `jinaai/jina-embeddings-v3` (see note), etc. | | |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | * | * |

!!! note The second-generation GTE model (mGTE-TRM) is named NewModel. Because the name NewModel is too generic, you should set --hf-overrides '{"architectures": ["GteNewModel"]}' to specify the use of the GteNewModel architecture.

!!! note ssmits/Qwen2-7B-Instruct-embed-base has an improperly defined Sentence Transformers config. You need to manually set mean pooling by passing --pooler-config '{"pooling_type": "MEAN"}'.

!!! note For Alibaba-NLP/gte-Qwen2-*, you need to enable --trust-remote-code for the correct tokenizer to be loaded. See relevant issue on HF Transformers.

!!! note The BAAI/bge-m3 model comes with extra weights for sparse and ColBERT embeddings. See this page for more information.

!!! note jinaai/jina-embeddings-v3 supports multiple tasks through LoRA, but vLLM currently supports only the text-matching task, which it handles by merging the LoRA weights.

Multimodal Models

!!! note For more information about multimodal model inputs, see this page.

| Architecture | Models | Inputs | Example HF Models | LoRA | PP |
|---|---|---|---|---|---|
| `CLIPModel` | CLIP | T / I | `openai/clip-vit-base-patch32`, `openai/clip-vit-large-patch14`, etc. | | |
| `LlamaNemotronVLModel` | Llama Nemotron Embedding + SigLIP | T + I | `nvidia/llama-nemotron-embed-vl-1b-v2` | | |
| `LlavaNextForConditionalGeneration`<sup>C</sup> | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | |
| `Phi3VForCausalLM`<sup>C</sup> | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | | |
| `Qwen3VLForConditionalGeneration`<sup>C</sup> | Qwen3-VL | T + I + V | `Qwen/Qwen3-VL-Embedding-2B`, etc. | | |
| `SiglipModel` | SigLIP, SigLIP2 | T / I | `google/siglip-base-patch16-224`, `google/siglip2-base-patch16-224` | | |
| `*ForConditionalGeneration`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | * | N/A | * | * |

<sup>C</sup> Automatically converted into an embedding model via --convert embed. (details)
<sup>*</sup> Feature support is the same as that of the original model.

If your model is not in the above list, we will try to automatically convert the model using [as_embedding_model][vllm.model_executor.models.adapters.as_embedding_model]. By default, the embeddings of the whole prompt are extracted from the normalized hidden state corresponding to the last token.

!!! note Although vLLM supports automatically converting models of any architecture into embedding models via --convert embed, to get the best results, you should use pooling models that are specifically trained as such.

--8<-- [end:supported-embed-models]
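As a rough illustration of the default conversion described above (take the hidden state of the last token, then normalize it), here is a minimal sketch. It assumes L2 normalization; the helper names and toy hidden states are hypothetical, and this is not vLLM's actual implementation.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def last_token_embedding(hidden_states: list[list[float]]) -> list[float]:
    """Use the final token's hidden state as the embedding of the whole prompt."""
    return l2_normalize(hidden_states[-1])


# Toy hidden states: 3 tokens, 2 dimensions each (made-up values).
hidden_states = [[0.5, 0.5], [1.0, -1.0], [3.0, 4.0]]
embedding = last_token_embedding(hidden_states)
print(embedding)
```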

Offline Inference

Pooling Parameters

The following [pooling parameters][vllm.PoolingParams] are supported.

--8<-- "vllm/pooling_params.py:common-pooling-params"
--8<-- "vllm/pooling_params.py:embed-pooling-params"

LLM.embed

The [embed][vllm.LLM.embed] method outputs an embedding vector for each prompt.

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```

A code example can be found here: examples/basic/offline_inference/embed.py

LLM.encode

The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.

Set pooling_task="embed" when using LLM.encode for embedding models:

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")
(output,) = llm.encode("Hello, my name is", pooling_task="embed")

data = output.outputs.data
print(f"Data: {data!r}")
```

LLM.score

The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs.

All models that support the embedding task also support the score API, which computes a similarity score as the cosine similarity between the embeddings of the two input prompts.

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")
(output,) = llm.score(
    "What is the capital of France?",
    "The capital of Brazil is Brasilia.",
)

score = output.outputs.score
print(f"Score: {score}")
```
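For intuition about what a cosine-similarity score measures, the following standalone sketch computes it by hand and extends it to a pairwise similarity matrix. The helper names and toy vectors are assumptions made for illustration; in practice the embeddings would come from `LLM.embed`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings: list[list[float]]) -> list[list[float]]:
    """Score every embedding against every other embedding."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


# Toy embeddings (made-up values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matrix = similarity_matrix(embeddings)
for row in matrix:
    print([round(value, 3) for value in row])
```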

Online Serving

OpenAI-Compatible Embeddings API

Our Embeddings API is compatible with OpenAI's Embeddings API; you can use the official OpenAI Python client to interact with it.
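If you would rather see the wire format than go through the OpenAI client, the sketch below builds the equivalent HTTP request with the standard library only. It assumes a server listening on `http://localhost:8000` with the `/v1/embeddings` route; the helper names are hypothetical, and the actual network call is left commented out so that the snippet runs without a server.

```python
import json
import urllib.request


def build_embeddings_request(
    base_url: str,
    model: str,
    inputs: list[str],
    encoding_format: str = "float",
) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style /embeddings endpoint."""
    payload = {"model": model, "input": inputs, "encoding_format": encoding_format}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_embeddings(request: urllib.request.Request) -> list[list[float]]:
    """Send the request and return one embedding per input."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return [item["embedding"] for item in body["data"]]


request = build_embeddings_request(
    "http://localhost:8000/v1",  # assumed server address
    "intfloat/e5-small",
    ["Hello, my name is"],
)
# embeddings = fetch_embeddings(request)  # requires a running server
```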

Code example: examples/pooling/embed/openai_embedding_client.py

Completion Parameters

The following Embeddings API parameters are supported:

??? code

```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:completion-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:encoding-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:embed-params"
```

The following extra parameters are supported:

??? code

```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:completion-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:encoding-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:embed-extra-params"
```

Chat Parameters

For chat-like input (i.e. if messages is passed), the following parameters are supported:

??? code

```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:chat-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:encoding-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:embed-params"
```

The following extra parameters are supported instead:

??? code

```python
--8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:chat-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:encoding-extra-params"
--8<-- "vllm/entrypoints/pooling/base/protocol.py:embed-extra-params"
```

Examples

If the model has a chat template, you can replace input with a list of messages (the same schema as the Chat API), which is treated as a single prompt to the model. Here is a convenience function for calling the API while retaining OpenAI's type annotations:

??? code

```python
from typing import Literal, Union

from openai import OpenAI
from openai._types import NOT_GIVEN, NotGiven
from openai.types.chat import ChatCompletionMessageParam
from openai.types.create_embedding_response import CreateEmbeddingResponse

def create_chat_embeddings(
    client: OpenAI,
    *,
    messages: list[ChatCompletionMessageParam],
    model: str,
    encoding_format: Union[Literal["base64", "float"], NotGiven] = NOT_GIVEN,
) -> CreateEmbeddingResponse:
    return client.post(
        "/embeddings",
        cast_to=CreateEmbeddingResponse,
        body={"messages": messages, "model": model, "encoding_format": encoding_format},
    )
```

Multi-modal inputs

You can pass multi-modal inputs to embedding models by defining a custom chat template for the server and passing a list of messages in the request. Refer to the examples below for illustration.

=== "VLM2Vec"

To serve the model:

```bash
vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
  --trust-remote-code \
  --max-model-len 4096 \
  --chat-template examples/pooling/embed/template/vlm2vec_phi3v.jinja
```

!!! important
    Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--runner pooling`
    to run this model in embedding mode instead of text generation mode.

    The custom chat template is completely different from the original one for this model,
    and can be found here: [examples/pooling/embed/template/vlm2vec_phi3v.jinja](../../../examples/pooling/embed/template/vlm2vec_phi3v.jinja)

Since the request schema is not defined by the OpenAI client, we post the request through the client's lower-level `post` method, using the `create_chat_embeddings` helper defined above:

??? code

    ```python
    from openai import OpenAI
    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="EMPTY",
    )
    image_url = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

    response = create_chat_embeddings(
        client,
        model="TIGER-Lab/VLM2Vec-Full",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": "Represent the given image."},
                ],
            }
        ],
        encoding_format="float",
    )

    print("Image embedding output:", response.data[0].embedding)
    ```

=== "DSE-Qwen2-MRL"

To serve the model:

```bash
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --runner pooling \
  --trust-remote-code \
  --max-model-len 8192 \
  --chat-template examples/pooling/embed/template/dse_qwen2_vl.jinja
```

!!! important
    Like with VLM2Vec, we have to explicitly pass `--runner pooling`.

    Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
    by a custom chat template: [examples/pooling/embed/template/dse_qwen2_vl.jinja](../../../examples/pooling/embed/template/dse_qwen2_vl.jinja)

!!! important
    `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
    example below for details.

Full example: examples/pooling/embed/vision_embedding_online.py

Cohere Embed API

Our API is also compatible with Cohere's Embed v2 API, which adds support for some modern embedding features such as truncation, output dimensions, embedding types, and input types. This endpoint works with any embedding model (including multimodal models).

Cohere Embed API request parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Model name |
| `input_type` | string | No | Prompt prefix key (model-dependent, see below) |
| `texts` | list[string] | No | Text inputs (use one of `texts`, `images`, or `inputs`) |
| `images` | list[string] | No | Base64 data URI images |
| `inputs` | list[object] | No | Mixed text and image content objects |
| `embedding_types` | list[string] | No | Output types (default: `["float"]`) |
| `output_dimension` | int | No | Truncate embeddings to this dimension (Matryoshka) |
| `truncate` | string | No | `END`, `START`, or `NONE` (default: `END`) |
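The parameters above can be assembled client-side before posting to `/v2/embed`. The sketch below is one way to do that with a few sanity checks. The `build_embed_payload` helper is hypothetical, and the "exactly one of `texts`, `images`, or `inputs`" check is an assumption drawn from the table's wording, not a statement about how the server validates requests.

```python
from typing import Any, Optional

VALID_TRUNCATE = {"END", "START", "NONE"}


def build_embed_payload(
    model: str,
    *,
    texts: Optional[list[str]] = None,
    images: Optional[list[str]] = None,
    inputs: Optional[list[dict[str, Any]]] = None,
    input_type: Optional[str] = None,
    embedding_types: Optional[list[str]] = None,
    output_dimension: Optional[int] = None,
    truncate: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble a request body from the parameters listed in the table."""
    sources = {"texts": texts, "images": images, "inputs": inputs}
    provided = {name: value for name, value in sources.items() if value is not None}
    if len(provided) != 1:
        raise ValueError("pass exactly one of texts, images, or inputs")
    if truncate is not None and truncate not in VALID_TRUNCATE:
        raise ValueError(f"truncate must be one of {sorted(VALID_TRUNCATE)}")

    payload: dict[str, Any] = {"model": model, **provided}
    optional = {
        "input_type": input_type,
        "embedding_types": embedding_types,
        "output_dimension": output_dimension,
        "truncate": truncate,
    }
    # Leave unset fields out so that the server applies its own defaults.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


payload = build_embed_payload(
    "Snowflake/snowflake-arctic-embed-m-v1.5",
    texts=["Hello world", "How are you?"],
    input_type="query",
    embedding_types=["float"],
)
print(payload)
```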

Text embedding

```bash
curl -X POST "http://localhost:8000/v2/embed" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Snowflake/snowflake-arctic-embed-m-v1.5",
    "input_type": "query",
    "texts": ["Hello world", "How are you?"],
    "embedding_types": ["float"]
  }'
```

??? console "Response"

```json
{
  "id": "embd-...",
  "embeddings": {
    "float": [
      [0.012, -0.034, ...],
      [0.056, 0.078, ...]
    ]
  },
  "texts": ["Hello world", "How are you?"],
  "meta": {
    "api_version": {"version": "2"},
    "billed_units": {"input_tokens": 12}
  }
}
```

Mixed text and image inputs

For multimodal models, you can embed images by passing base64 data URIs. The inputs field accepts a list of objects with mixed text and image content:

```bash
curl -X POST "http://localhost:8000/v2/embed" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/siglip-so400m-patch14-384",
    "inputs": [
      {
        "content": [
          {"type": "text", "text": "A photo of a cat"},
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
        ]
      }
    ],
    "embedding_types": ["float"]
  }'
```
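The base64 data URI in the request above can be produced from raw image bytes with the standard library. This is a minimal sketch: the helper names are hypothetical, and the bytes used here are only the 8-byte PNG signature standing in for a real image file.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_content_part(data_uri: str) -> dict:
    """Wrap a data URI in the image content object used by the `inputs` field."""
    return {"type": "image_url", "image_url": {"url": data_uri}}


# Placeholder bytes: just the PNG signature, not a complete image.
# In practice, read the bytes from an image file instead.
png_signature = b"\x89PNG\r\n\x1a\n"
data_uri = to_data_uri(png_signature)

content = [
    {"type": "text", "text": "A photo of a cat"},
    image_content_part(data_uri),
]
print(data_uri)
```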

Embedding types

The embedding_types parameter controls the output format. Multiple types can be requested in a single call:

| Type | Description |
|---|---|
| `float` | Raw float32 embeddings (default) |
| `binary` | Bit-packed signed binary |
| `ubinary` | Bit-packed unsigned binary |
| `base64` | Little-endian float32 encoded as base64 |

```bash
curl -X POST "http://localhost:8000/v2/embed" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Snowflake/snowflake-arctic-embed-m-v1.5",
    "input_type": "query",
    "texts": ["What is machine learning?"],
    "embedding_types": ["float", "binary"]
  }'
```

??? console "Response"

```json
{
  "id": "embd-...",
  "embeddings": {
    "float": [[0.012, -0.034, ...]],
    "binary": [[42, -117, ...]]
  },
  "texts": ["What is machine learning?"],
  "meta": {
    "api_version": {"version": "2"},
    "billed_units": {"input_tokens": 8}
  }
}
```
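To show how a client might work with the non-float types, the sketch below round-trips a vector through the `base64` layout (little-endian float32, as stated in the table) and bit-packs the same vector. The bit-packing rule is an assumption made for illustration (one bit per dimension, set when the value is positive, most significant bit first); the signed `binary` variant is not reproduced because its exact offset convention is not specified here.

```python
import base64
import struct


def encode_base64_float32(embedding: list[float]) -> str:
    """Pack floats as little-endian float32 and base64-encode the bytes."""
    raw = struct.pack(f"<{len(embedding)}f", *embedding)
    return base64.b64encode(raw).decode("ascii")


def decode_base64_float32(encoded: str) -> list[float]:
    """Inverse of encode_base64_float32: 4 bytes per float32 value."""
    raw = base64.b64decode(encoded)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def pack_ubinary(embedding: list[float]) -> list[int]:
    """Assumed scheme: bit = 1 if value > 0, packed MSB-first, 8 dims per byte."""
    packed = []
    for start in range(0, len(embedding), 8):
        byte = 0
        for offset, value in enumerate(embedding[start:start + 8]):
            if value > 0:
                byte |= 1 << (7 - offset)
        packed.append(byte)
    return packed


# Values chosen to be exactly representable in float32.
values = [0.5, -0.25, 1.0, 2.0, -1.0, -3.0, -0.5, 0.75]
encoded = encode_base64_float32(values)
decoded = decode_base64_float32(encoded)
bits = pack_ubinary(values)
print(encoded, decoded, bits)
```

Packing 8 dimensions into each byte is what makes the binary types compact: a 1024-dimensional embedding would shrink from 4096 bytes of float32 to 128 bytes.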

Truncation

The truncate parameter controls how inputs exceeding the model's maximum sequence length are handled:

| Value | Behavior |
|---|---|
| `END` (default) | Keep the first tokens, drop the end |
| `START` | Keep the last tokens, drop the beginning |
| `NONE` | Return an error if the input is too long |
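The three modes can be sketched on a plain list of token IDs as follows. The `truncate_tokens` helper is hypothetical and only mirrors the behavior in the table; it is not the server's implementation.

```python
def truncate_tokens(token_ids: list[int], max_length: int, truncate: str = "END") -> list[int]:
    """Apply END / START / NONE truncation to a token sequence."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if len(token_ids) <= max_length:
        return list(token_ids)  # Short enough: nothing to do.
    if truncate == "END":
        return list(token_ids[:max_length])  # Keep the first tokens.
    if truncate == "START":
        return list(token_ids[-max_length:])  # Keep the last tokens.
    if truncate == "NONE":
        raise ValueError(f"input has {len(token_ids)} tokens, limit is {max_length}")
    raise ValueError(f"unknown truncate mode: {truncate!r}")


tokens = list(range(10))
kept_from_start = truncate_tokens(tokens, 4, "END")
kept_from_end = truncate_tokens(tokens, 4, "START")
print(kept_from_start, kept_from_end)
```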

Input type and prompt prefixes

The input_type field selects a prompt prefix to prepend to each text input. The available values depend on the model:

  • Models with task_instructions in config.json: The keys from the task_instructions dict are the valid input_type values and the corresponding value is prepended to each text.
  • Models with config_sentence_transformers.json prompts: The keys from the prompts dict are the valid input_type values. For example, Snowflake/snowflake-arctic-embed-xs defines "query", so setting input_type: "query" prepends "Represent this sentence for searching relevant passages: ".
  • Other models: input_type is not accepted and will raise a validation error if passed.
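The lookup described in the list above amounts to a dictionary lookup followed by string concatenation. The sketch below assumes the prefix map has already been loaded from the model's config; the `apply_input_type` helper is hypothetical, and the error type is chosen only for illustration.

```python
from typing import Optional


def apply_input_type(
    texts: list[str],
    input_type: Optional[str],
    prompts: dict[str, str],
) -> list[str]:
    """Prepend the prefix registered for `input_type` to every text."""
    if input_type is None:
        return list(texts)
    if input_type not in prompts:
        # Covers models without any prefixes (empty dict) and unknown keys.
        raise ValueError(f"input_type {input_type!r} is not valid for this model")
    prefix = prompts[input_type]
    return [prefix + text for text in texts]


# Prefix map as described for Snowflake/snowflake-arctic-embed-xs.
prompts = {"query": "Represent this sentence for searching relevant passages: "}
prefixed = apply_input_type(["What is machine learning?"], "query", prompts)
print(prefixed[0])
```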

More examples

More examples can be found here: examples/pooling/embed

Supported Features

Enable/disable normalize

You can enable or disable normalization via use_activation.

Matryoshka Embeddings

Matryoshka Embeddings, or Matryoshka Representation Learning (MRL), is a technique used in training embedding models. It lets users shorten the output embedding to trade off between performance and cost.
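A common way to consume Matryoshka embeddings is to keep only the leading dimensions and then renormalize. The sketch below shows that idea on a toy vector; it is an assumption about typical MRL usage, with a hypothetical helper name, and is not a description of vLLM internals.

```python
import math


def truncate_embedding(embedding: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` values and rescale them to unit length."""
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and the embedding size")
    head = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


# Toy 4-dimensional embedding (made-up values), shortened to 2 dimensions.
full = [3.0, 4.0, 12.0, 0.0]
short = truncate_embedding(full, 2)
print(short)
```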

!!! warning Not all embedding models are trained using Matryoshka Representation Learning. To avoid misuse of the dimensions parameter, vLLM returns an error for requests that attempt to change the output dimension of models that do not support Matryoshka Embeddings.

For example, setting the `dimensions` parameter while using the `BAAI/bge-m3` model will result in the following error.

```json
{"object":"error","message":"Model \"BAAI/bge-m3\" does not support matryoshka representation, changing output dimensions will lead to poor results.","type":"BadRequestError","param":null,"code":400}
```

Manually enable Matryoshka Embeddings

There is currently no official interface for specifying support for Matryoshka Embeddings. In vLLM, if is_matryoshka is True in config.json, you can change the output dimension to arbitrary values. Use matryoshka_dimensions to control the allowed output dimensions.

For models that support Matryoshka Embeddings but are not recognized by vLLM, manually override the config using hf_overrides={"is_matryoshka": True} or hf_overrides={"matryoshka_dimensions": [<allowed output dimensions>]} (offline), or --hf-overrides '{"is_matryoshka": true}' or --hf-overrides '{"matryoshka_dimensions": [<allowed output dimensions>]}' (online).

Here is an example to serve a model with Matryoshka Embeddings enabled.

```bash
vllm serve Snowflake/snowflake-arctic-embed-m-v1.5 --hf-overrides '{"matryoshka_dimensions":[256]}'
```
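One way to read the rules in this section is as a small validation step applied to each requested dimension. The sketch below is a hypothetical restatement of those rules in plain Python (the function name and error messages are invented for illustration) and should not be taken as vLLM's actual check.

```python
from typing import Optional


def check_dimensions(
    requested: int,
    *,
    is_matryoshka: bool = False,
    matryoshka_dimensions: Optional[list[int]] = None,
) -> None:
    """Raise ValueError if `requested` is not an allowed output dimension."""
    if requested < 1:
        raise ValueError("dimensions must be a positive integer")
    if matryoshka_dimensions:
        # An explicit list restricts the allowed output dimensions.
        if requested not in matryoshka_dimensions:
            raise ValueError(f"dimensions must be one of {matryoshka_dimensions}")
        return
    if not is_matryoshka:
        raise ValueError("model does not support matryoshka representation")
    # is_matryoshka without a list: any positive dimension is accepted.


# Mirrors the override in the serve command above: only 256 is allowed.
check_dimensions(256, matryoshka_dimensions=[256])
# is_matryoshka alone allows arbitrary values.
check_dimensions(32, is_matryoshka=True)
```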

Offline Inference

You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the dimensions parameter in [PoolingParams][vllm.PoolingParams].

```python
from vllm import LLM, PoolingParams

llm = LLM(
    model="jinaai/jina-embeddings-v3",
    runner="pooling",
    trust_remote_code=True,
)
outputs = llm.embed(
    ["Follow the white rabbit."],
    pooling_params=PoolingParams(dimensions=32),
)
print(outputs[0].outputs)
```

A code example can be found here: examples/pooling/embed/embed_matryoshka_fy_offline.py

Online Inference

Use the following command to start the vLLM server.

```bash
vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
```

You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the dimensions parameter.

```bash
curl http://127.0.0.1:8000/v1/embeddings \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Follow the white rabbit.",
    "model": "jinaai/jina-embeddings-v3",
    "encoding_format": "float",
    "dimensions": 32
  }'
```

Expected output:

```json
{"id":"embd-5c21fc9a5c9d4384a1b021daccaf9f64","object":"list","created":1745476417,"model":"jinaai/jina-embeddings-v3","data":[{"index":0,"object":"embedding","embedding":[-0.3828125,-0.1357421875,0.03759765625,0.125,0.21875,0.09521484375,-0.003662109375,0.1591796875,-0.130859375,-0.0869140625,-0.1982421875,0.1689453125,-0.220703125,0.1728515625,-0.2275390625,-0.0712890625,-0.162109375,-0.283203125,-0.055419921875,-0.0693359375,0.031982421875,-0.04052734375,-0.2734375,0.1826171875,-0.091796875,0.220703125,0.37890625,-0.0888671875,-0.12890625,-0.021484375,-0.0091552734375,0.23046875]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0,"prompt_tokens_details":null}}
```

An OpenAI client example can be found here: examples/pooling/embed/openai_embedding_matryoshka_fy_client.py

Removed Features

Remove normalize from PoolingParams

normalize has already been removed from PoolingParams; use use_activation instead.