[Misc] Split up pooling tasks (#10820)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Cyrus Leung
2024-12-11 17:28:00 +08:00
committed by GitHub
parent 40766ca1b8
commit 8f10d5e393
27 changed files with 527 additions and 168 deletions


@@ -3,11 +3,21 @@
Supported Models
================

vLLM supports generative and pooling models across various tasks.
If a model supports more than one task, you can set the task via the :code:`--task` argument.
For each task, we list the model architectures that have been implemented in vLLM.
Alongside each architecture, we include some popular models that use it.

Loading a Model
^^^^^^^^^^^^^^^

HuggingFace Hub
+++++++++++++++

By default, vLLM loads models from `HuggingFace (HF) Hub <https://huggingface.co/models>`_.

To determine whether a given model is supported, you can check the :code:`config.json` file inside the HF repository.
If the :code:`"architectures"` field contains a model architecture listed below, then it should be supported in theory.

.. tip::
@@ -17,38 +27,57 @@ If the :code:`"architectures"` field contains a model architecture listed below,

   .. code-block:: python

      from vllm import LLM

      # For generative models (task=generate) only
      llm = LLM(model=..., task="generate")  # Name or path of your model
      output = llm.generate("Hello, my name is")
      print(output)

      # For pooling models (task={embed,classify,reward}) only
      llm = LLM(model=..., task="embed")  # Name or path of your model
      output = llm.encode("Hello, my name is")
      print(output)

   If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.

Otherwise, please refer to :ref:`Adding a New Model <adding_a_new_model>` and :ref:`Enabling Multimodal Inputs <enabling_multimodal_inputs>`
for instructions on how to implement your model in vLLM.
Alternatively, you can `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ to request vLLM support.
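As a quick way to act on the :code:`config.json` advice above, the architecture check can be scripted. This is only an illustrative sketch: the supported-architecture set below is a made-up subset standing in for vLLM's real model registry, and the config payload is invented.

```python
import json

# A minimal config.json payload like one found in an HF model repository
# (values are illustrative, not copied from a real repo).
config = json.loads('{"architectures": ["OPTForCausalLM"], "model_type": "opt"}')

# Made-up subset of supported architectures, standing in for vLLM's registry.
SUPPORTED_ARCHS = {"OPTForCausalLM", "LlamaForCausalLM"}

def is_supported(config: dict, supported: set[str]) -> bool:
    """Return True if any listed architecture is in the supported set."""
    return any(arch in supported for arch in config.get("architectures", []))

print(is_supported(config, SUPPORTED_ARCHS))  # True
```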
ModelScope
++++++++++

To use models from `ModelScope <https://www.modelscope.cn>`_ instead of HuggingFace Hub, set an environment variable:

.. code-block:: shell

   $ export VLLM_USE_MODELSCOPE=True

And use with :code:`trust_remote_code=True`.

.. code-block:: python

   from vllm import LLM

   llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)  # Name or path of your model

   # For generative models (task=generate) only
   output = llm.generate("Hello, my name is")
   print(output)

   # For pooling models (task={embed,classify,reward}) only
   output = llm.encode("Hello, my name is")
   print(output)
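Equivalently, the :code:`VLLM_USE_MODELSCOPE` flag can be set from Python instead of the shell; a minimal sketch (the flag name comes from the docs above, everything else is generic stdlib usage):

```python
import os

# Must be set before vllm is imported, otherwise the flag is read too late
# to take effect.
os.environ["VLLM_USE_MODELSCOPE"] = "True"

print(os.environ["VLLM_USE_MODELSCOPE"])  # True
```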
List of Text-only Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Generative Models
+++++++++++++++++

See :ref:`this page <generative_models>` for more information on how to use generative models.

Text Generation (``--task generate``)
-------------------------------------
.. list-table::
:widths: 25 25 50 5 5
@@ -328,8 +357,24 @@ Text Generation

.. note::
   Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.

Pooling Models
++++++++++++++

See :ref:`this page <pooling_models>` for more information on how to use pooling models.

.. important::
   Since some model architectures support both generative and pooling tasks,
   you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.

Text Embedding (``--task embed``)
---------------------------------

Any text generation model can be converted into an embedding model by passing :code:`--task embed`.

.. note::
   To get the best results, you should use pooling models that are specifically trained as such.
   The following table lists those that are tested in vLLM.
.. list-table::
:widths: 25 25 50 5 5
@@ -371,13 +416,6 @@ Text Embedding
   -
   -

.. tip::
   You can override the model's pooling method by passing :code:`--override-pooler-config`.

.. note::
   :code:`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
   You should manually set mean pooling by passing :code:`--override-pooler-config '{"pooling_type": "MEAN"}'`.
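For intuition about what :code:`{"pooling_type": "MEAN"}` configures, mean pooling simply averages the per-token hidden states into a single sequence embedding. A dependency-free sketch with made-up numbers (not vLLM's actual implementation):

```python
# Mean pooling: average per-token hidden states into one sequence embedding.
# The 3x4 "hidden states" below are invented for illustration.
hidden_states = [
    [1.0, 2.0, 3.0, 4.0],  # token 0
    [3.0, 2.0, 1.0, 0.0],  # token 1
    [2.0, 2.0, 2.0, 2.0],  # token 2
]

def mean_pool(states: list[list[float]]) -> list[float]:
    """Average each hidden dimension across all token positions."""
    n = len(states)
    return [sum(col) / n for col in zip(*states)]

print(mean_pool(hidden_states))  # [2.0, 2.0, 2.0, 2.0]
```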
@@ -389,8 +427,8 @@ Text Embedding
On the other hand, its 1.5B variant (:code:`Alibaba-NLP/gte-Qwen2-1.5B-instruct`) uses causal attention
despite being described otherwise on its model card.
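To show what the embeddings produced under ``--task embed`` are typically used for, here is a dependency-free cosine-similarity sketch; the vectors are made up and much shorter than real model embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 4-dimensional "embeddings" for illustration.
query_emb = [1.0, 0.0, 1.0, 0.0]
doc_emb = [1.0, 0.0, 0.0, 1.0]

print(round(cosine_similarity(query_emb, doc_emb), 3))  # 0.5
```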
Reward Modeling (``--task reward``)
-----------------------------------
.. list-table::
:widths: 25 25 50 5 5
@@ -416,11 +454,8 @@ Reward Modeling
For process-supervised reward models such as :code:`peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
e.g.: :code:`--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
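To illustrate the :code:`STEP` pooling config above: per-token outputs are kept only at positions matching the configured step tag token. A toy sketch in which every token ID and score is invented:

```python
# STEP pooling (toy sketch): keep scores only where the step tag token appears.
STEP_TAG_ID = 123  # matches the step_tag_id in the example config above

token_ids = [5, 9, 123, 7, 123, 11]
scores = [0.1, 0.2, 0.9, 0.3, 0.7, 0.4]  # one score per token position

# Select the score at each step-tag position, in order of appearance.
step_scores = [s for t, s in zip(token_ids, scores) if t == STEP_TAG_ID]
print(step_scores)  # [0.9, 0.7]
```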
Classification (``--task classify``)
------------------------------------
.. list-table::
:widths: 25 25 50 5 5
@@ -437,11 +472,8 @@ Classification
- ✅︎
- ✅︎
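Classification pooling ends by normalizing the class logits into probabilities, typically with a softmax; a dependency-free sketch (the logits are made up):

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Normalize logits into probabilities that sum to 1."""
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a 3-class sequence classification head.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
```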
Sentence Pair Scoring (``--task score``)
----------------------------------------
.. list-table::
:widths: 25 25 50 5 5
@@ -468,13 +500,10 @@ Sentence Pair Scoring
-
-
.. _supported_mm_models:
List of Multimodal Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following modalities are supported depending on the model:
@@ -491,8 +520,15 @@ On the other hand, modalities separated by :code:`/` are mutually exclusive.
- e.g.: :code:`T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
See :ref:`this page <multimodal_inputs>` on how to pass multi-modal inputs to the model.

Generative Models
+++++++++++++++++

See :ref:`this page <generative_models>` for more information on how to use generative models.

Text Generation (``--task generate``)
-------------------------------------
.. list-table::
:widths: 25 25 15 20 5 5 5
@@ -696,8 +732,24 @@ Text Generation
The official :code:`openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (:code:`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630
Pooling Models
++++++++++++++

See :ref:`this page <pooling_models>` for more information on how to use pooling models.

.. important::
   Since some model architectures support both generative and pooling tasks,
   you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.

Text Embedding (``--task embed``)
---------------------------------

Any text generation model can be converted into an embedding model by passing :code:`--task embed`.

.. note::
   To get the best results, you should use pooling models that are specifically trained as such.
   The following table lists those that are tested in vLLM.
.. list-table::
:widths: 25 25 15 25 5 5
@@ -728,12 +780,7 @@ Multimodal Embedding
-
- ✅︎
.. tip::
   You can override the model's pooling method by passing :code:`--override-pooler-config`.
----
Model Support Policy
=====================