[Frontend] Add OpenAI Vision API Support (#5237)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
@@ -3,7 +3,7 @@
 Using VLMs
 ==========
 
-This document shows you how to run and serve Vision Language Models (VLMs) using vLLM.
+vLLM provides experimental support for Vision Language Models (VLMs). This document shows you how to run and serve these models using vLLM.
 
 Engine Arguments
 ----------------
@@ -54,3 +54,69 @@ For now, we only support a single image per text prompt. To pass an image to the
print(generated_text)

A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.

Online OpenAI Vision API Compatible Inference
----------------------------------------------

You can serve vision language models with vLLM's HTTP server that is compatible with the `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.

.. note::
    Currently, vLLM supports only **single** ``image_url`` input per ``messages``. Support for multi-image inputs will be
    added in the future.

Below is an example of how to launch the same ``llava-hf/llava-1.5-7b-hf`` model with the vLLM API server.

.. important::
    Since the OpenAI Vision API is based on the `Chat <https://platform.openai.com/docs/api-reference/chat>`_ API, a chat template
    is **required** to launch the API server if the model's tokenizer does not come with one. In this example, we use the
    HuggingFace Llava chat template that you can find in the example folder `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`_.

.. code-block:: bash

    python -m vllm.entrypoints.openai.api_server \
        --model llava-hf/llava-1.5-7b-hf \
        --image-input-type pixel_values \
        --image-token-id 32000 \
        --image-input-shape 1,3,336,336 \
        --image-feature-size 576 \
        --chat-template template_llava.jinja

To consume the server, you can use the OpenAI client as in the example below:

.. code-block:: python

    from openai import OpenAI

    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    chat_response = client.chat.completions.create(
        model="llava-hf/llava-1.5-7b-hf",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }],
    )
    print("Chat response:", chat_response)

.. note::

    By default, the timeout for fetching images over HTTP URLs is ``5`` seconds. You can override this by setting the environment variable:

    .. code-block:: shell

        export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>

.. note::
    Prompt formatting with the image token ``<image>`` is not needed when serving VLMs with the API server, since the prompt will be
    processed automatically by the server.