[Misc] Split up pooling tasks (#10820)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Cyrus Leung
2024-12-11 17:28:00 +08:00
committed by GitHub
parent 40766ca1b8
commit 8f10d5e393
27 changed files with 527 additions and 168 deletions


@@ -3,11 +3,21 @@
Supported Models
================

vLLM supports generative and pooling models across various tasks.
If a model supports more than one task, you can set the task via the :code:`--task` argument.
For each task, we list the model architectures that have been implemented in vLLM.
Alongside each architecture, we include some popular models that use it.

Loading a Model
^^^^^^^^^^^^^^^

HuggingFace Hub
+++++++++++++++

By default, vLLM loads models from `HuggingFace (HF) Hub <https://huggingface.co/models>`_.

To determine whether a given model is supported, you can check the :code:`config.json` file inside the HF repository.
If the :code:`"architectures"` field contains a model architecture listed below, then it should be supported in theory.

.. tip::
@@ -17,38 +27,57 @@ If the :code:`"architectures"` field contains a model architecture listed below,

   .. code-block:: python

      from vllm import LLM

      # For generative models (task=generate) only
      llm = LLM(model=..., task="generate")  # Name or path of your model
      output = llm.generate("Hello, my name is")
      print(output)

      # For pooling models (task={embed,classify,reward}) only
      llm = LLM(model=..., task="embed")  # Name or path of your model
      output = llm.encode("Hello, my name is")
      print(output)

   If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.

Otherwise, please refer to :ref:`Adding a New Model <adding_a_new_model>` and :ref:`Enabling Multimodal Inputs <enabling_multimodal_inputs>`
for instructions on how to implement your model in vLLM.
Alternatively, you can `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ to request vLLM support.
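As a quick way to act on the :code:`config.json` advice above, the architecture check can be scripted. This is only an illustrative sketch: the supported-architecture set below is a made-up subset standing in for vLLM's real model registry, and the config payload is invented.

```python
import json

# A minimal config.json payload like one found in an HF model repository
# (values are illustrative, not copied from a real repo).
config = json.loads('{"architectures": ["OPTForCausalLM"], "model_type": "opt"}')

# Made-up subset of supported architectures, standing in for vLLM's registry.
SUPPORTED_ARCHS = {"OPTForCausalLM", "LlamaForCausalLM"}

def is_supported(config: dict, supported: set[str]) -> bool:
    """Return True if any listed architecture is in the supported set."""
    return any(arch in supported for arch in config.get("architectures", []))

print(is_supported(config, SUPPORTED_ARCHS))  # True
```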
ModelScope
++++++++++

To use models from `ModelScope <https://www.modelscope.cn>`_ instead of HuggingFace Hub, set an environment variable:

.. code-block:: shell

   $ export VLLM_USE_MODELSCOPE=True

And use with :code:`trust_remote_code=True`.

.. code-block:: python

   from vllm import LLM

   llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)  # Name or path of your model

   # For generative models (task=generate) only
   output = llm.generate("Hello, my name is")
   print(output)

   # For pooling models (task={embed,classify,reward}) only
   output = llm.encode("Hello, my name is")
   print(output)
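Equivalently, the :code:`VLLM_USE_MODELSCOPE` flag can be set from Python instead of the shell; a minimal sketch (the flag name comes from the docs above, everything else is generic stdlib usage):

```python
import os

# Must be set before vllm is imported, otherwise the flag is read too late
# to take effect.
os.environ["VLLM_USE_MODELSCOPE"] = "True"

print(os.environ["VLLM_USE_MODELSCOPE"])  # True
```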
List of Text-only Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Generative Models
+++++++++++++++++

See :ref:`this page <generative_models>` for more information on how to use generative models.

Text Generation (``--task generate``)
-------------------------------------
.. list-table::
:widths: 25 25 50 5 5
@@ -328,8 +357,24 @@ Text Generation

.. note::
   Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.

Pooling Models
++++++++++++++

See :ref:`this page <pooling_models>` for more information on how to use pooling models.

.. important::
   Since some model architectures support both generative and pooling tasks,
   you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.

Text Embedding (``--task embed``)
---------------------------------

Any text generation model can be converted into an embedding model by passing :code:`--task embed`.

.. note::
   To get the best results, you should use pooling models that are specifically trained as such.
   The following table lists those that are tested in vLLM.
.. list-table::
:widths: 25 25 50 5 5
@@ -371,13 +416,6 @@ Text Embedding
   -
   -

.. tip::
   You can override the model's pooling method by passing :code:`--override-pooler-config`.

.. note::
   :code:`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
   You should manually set mean pooling by passing :code:`--override-pooler-config '{"pooling_type": "MEAN"}'`.
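For intuition about what :code:`{"pooling_type": "MEAN"}` configures, mean pooling simply averages the per-token hidden states into a single sequence embedding. A dependency-free sketch with made-up numbers (not vLLM's actual implementation):

```python
# Mean pooling: average per-token hidden states into one sequence embedding.
# The 3x4 "hidden states" below are invented for illustration.
hidden_states = [
    [1.0, 2.0, 3.0, 4.0],  # token 0
    [3.0, 2.0, 1.0, 0.0],  # token 1
    [2.0, 2.0, 2.0, 2.0],  # token 2
]

def mean_pool(states: list[list[float]]) -> list[float]:
    """Average each hidden dimension across all token positions."""
    n = len(states)
    return [sum(col) / n for col in zip(*states)]

print(mean_pool(hidden_states))  # [2.0, 2.0, 2.0, 2.0]
```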
@@ -389,8 +427,8 @@ Text Embedding
On the other hand, its 1.5B variant (:code:`Alibaba-NLP/gte-Qwen2-1.5B-instruct`) uses causal attention
despite being described otherwise on its model card.
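To show what the embeddings produced under ``--task embed`` are typically used for, here is a dependency-free cosine-similarity sketch; the vectors are made up and much shorter than real model embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 4-dimensional "embeddings" for illustration.
query_emb = [1.0, 0.0, 1.0, 0.0]
doc_emb = [1.0, 0.0, 0.0, 1.0]

print(round(cosine_similarity(query_emb, doc_emb), 3))  # 0.5
```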
Reward Modeling (``--task reward``)
-----------------------------------
.. list-table::
:widths: 25 25 50 5 5
@@ -416,11 +454,8 @@ Reward Modeling
For process-supervised reward models such as :code:`peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
e.g.: :code:`--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
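To illustrate the :code:`STEP` pooling config above: per-token outputs are kept only at positions matching the configured step tag token. A toy sketch in which every token ID and score is invented:

```python
# STEP pooling (toy sketch): keep scores only where the step tag token appears.
STEP_TAG_ID = 123  # matches the step_tag_id in the example config above

token_ids = [5, 9, 123, 7, 123, 11]
scores = [0.1, 0.2, 0.9, 0.3, 0.7, 0.4]  # one score per token position

# Select the score at each step-tag position, in order of appearance.
step_scores = [s for t, s in zip(token_ids, scores) if t == STEP_TAG_ID]
print(step_scores)  # [0.9, 0.7]
```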
Classification (``--task classify``)
------------------------------------
.. list-table::
:widths: 25 25 50 5 5
@@ -437,11 +472,8 @@ Classification
- ✅︎
- ✅︎
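Classification pooling ends by normalizing the class logits into probabilities, typically with a softmax; a dependency-free sketch (the logits are made up):

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Normalize logits into probabilities that sum to 1."""
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a 3-class sequence classification head.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
```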
Sentence Pair Scoring (``--task score``)
----------------------------------------
.. list-table::
:widths: 25 25 50 5 5
@@ -468,13 +500,10 @@ Sentence Pair Scoring
-
-
.. _supported_mm_models:
List of Multimodal Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following modalities are supported depending on the model:
@@ -491,8 +520,15 @@ On the other hand, modalities separated by :code:`/` are mutually exclusive.
- e.g.: :code:`T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
See :ref:`this page <multimodal_inputs>` on how to pass multi-modal inputs to the model.

Generative Models
+++++++++++++++++

See :ref:`this page <generative_models>` for more information on how to use generative models.

Text Generation (``--task generate``)
-------------------------------------
.. list-table::
:widths: 25 25 15 20 5 5 5
@@ -696,8 +732,24 @@ Text Generation
The official :code:`openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (:code:`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630
Pooling Models
++++++++++++++

See :ref:`this page <pooling_models>` for more information on how to use pooling models.

.. important::
   Since some model architectures support both generative and pooling tasks,
   you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.

Text Embedding (``--task embed``)
---------------------------------

Any text generation model can be converted into an embedding model by passing :code:`--task embed`.

.. note::
   To get the best results, you should use pooling models that are specifically trained as such.
   The following table lists those that are tested in vLLM.
.. list-table::
:widths: 25 25 15 25 5 5
@@ -728,12 +780,7 @@ Multimodal Embedding
-
- ✅︎
.. tip::
   You can override the model's pooling method by passing :code:`--override-pooler-config`.
----
Model Support Policy
=====================