[VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126)

This commit is contained in:
Cyrus Leung
2024-08-15 01:55:42 +08:00
committed by GitHub
parent 70b746efcf
commit 3f674a49b5
38 changed files with 572 additions and 216 deletions


@@ -17,4 +17,4 @@ Input Processing Pipeline
 6. If the data contains multi-modal data, convert it into keyword arguments using :meth:`MULTIMODAL_REGISTRY.map_input <vllm.multimodal.MultiModalRegistry.map_input>`.
-   - For example, convert a :class:`PIL.Image.Image` input to its pixel values for a vision language model.
+   - For example, convert a :class:`PIL.Image.Image` input to its pixel values for a vision model.
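The mapping step above can be sketched in plain Python. This is an illustrative stand-in only, not the actual vLLM implementation: the function name ``map_image_input`` and the ``pixel_values`` key are hypothetical, chosen to mirror the idea of turning raw image data into model keyword arguments.

```python
# Hypothetical sketch of what an input mapper does conceptually:
# take raw multi-modal data (here, a grid of 0-255 grayscale pixels)
# and return keyword arguments a vision model could consume.
# Names below are NOT the real vLLM API.

def map_image_input(image_rows):
    """Normalize a 2D grid of 0-255 pixel rows into floats in [0, 1],
    returned as keyword arguments for a (hypothetical) vision model."""
    pixel_values = [px / 255.0 for row in image_rows for px in row]
    return {"pixel_values": pixel_values}

# A tiny 2x2 grayscale "image" in place of a PIL.Image.Image.
kwargs = map_image_input([[0, 255], [128, 64]])
print(sorted(kwargs))  # keyword-argument names passed on to the model
```

In vLLM itself, the registered mapper for each modality performs this conversion (e.g. via the model's preprocessor), so the model forward pass only ever sees tensors, never raw ``PIL`` objects.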


@@ -15,6 +15,9 @@ by following :ref:`this guide <adding_multimodal_plugin>`.
 Looking to add your own multi-modal model? Please follow the instructions listed :ref:`here <enabling_multimodal_inputs>`.
 
+..
+  TODO: Add usage of --limit-mm-per-prompt when multi-image input is officially supported
+
 Guides
 ++++++