[Doc] Create a new "Usage" section (#10827)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
This commit is contained in:
Cyrus Leung
2024-12-05 11:19:35 +08:00
committed by GitHub
parent 8d370e91cb
commit aa39a8e175
25 changed files with 218 additions and 125 deletions


@@ -3,7 +3,7 @@
Enabling Multimodal Inputs
==========================
This document walks you through the steps to extend a vLLM model so that it accepts :ref:`multi-modal <multi_modality>` inputs.
This document walks you through the steps to extend a vLLM model so that it accepts :ref:`multi-modal inputs <multimodal_inputs>`.
.. seealso::
:ref:`adding_a_new_model`


@@ -1,23 +0,0 @@
.. _engine_args:
Engine Arguments
================
Below, you can find an explanation of every engine argument for vLLM:
.. argparse::
:module: vllm.engine.arg_utils
:func: _engine_args_parser
:prog: vllm serve
:nodefaultconst:
Async Engine Arguments
----------------------
Below are the additional arguments related to the asynchronous engine:
.. argparse::
:module: vllm.engine.arg_utils
:func: _async_engine_args_parser
:prog: vllm serve
:nodefaultconst:


@@ -1,225 +0,0 @@
.. _lora:
Using LoRA adapters
===================
This document shows you how to use `LoRA adapters <https://arxiv.org/abs/2106.09685>`_ with vLLM on top of a base model.
LoRA adapters can be used with any vLLM model that implements :class:`~vllm.model_executor.models.interfaces.SupportsLoRA`.
Adapters can be efficiently served on a per-request basis with minimal overhead. First, we download the adapter(s) and save
them locally with
.. code-block:: python
from huggingface_hub import snapshot_download
sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
Then we instantiate the base model and pass in the ``enable_lora=True`` flag:
.. code-block:: python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
We can now submit the prompts and call ``llm.generate`` with the ``lora_request`` parameter. The first parameter
of ``LoRARequest`` is a human-identifiable name, the second is a globally unique ID for the adapter, and
the third is the path to the LoRA adapter.
.. code-block:: python
sampling_params = SamplingParams(
temperature=0,
max_tokens=256,
stop=["[/assistant]"]
)
prompts = [
"[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
"[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]
outputs = llm.generate(
prompts,
sampling_params,
lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
)
Check out `examples/multilora_inference.py <https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py>`_
for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
Serving LoRA Adapters
---------------------
LoRA-adapted models can also be served with the OpenAI-compatible vLLM server. To do so, we use
``--lora-modules {name}={path} {name}={path}`` to specify each LoRA module when we start the server:
.. code-block:: bash
vllm serve meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
.. note::
The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
The server entrypoint accepts all other LoRA configuration parameters (``max_loras``, ``max_lora_rank``, ``max_cpu_loras``,
etc.), which will apply to all forthcoming requests; a launch sketch with these flags follows the example output below. Upon querying the ``/models`` endpoint, we should see our LoRA along
with its base model:
.. code-block:: bash
curl localhost:8000/v1/models | jq .
{
"object": "list",
"data": [
{
"id": "meta-llama/Llama-2-7b-hf",
"object": "model",
...
},
{
"id": "sql-lora",
"object": "model",
...
}
]
}
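As a sketch of how those server-wide LoRA flags are passed at launch (the limit values below are illustrative, not recommendations):
.. code-block:: bash
    vllm serve meta-llama/Llama-2-7b-hf \
        --enable-lora \
        --lora-modules sql-lora=/path/to/sql-lora-adapter \
        --max-loras 4 \
        --max-lora-rank 64 \
        --max-cpu-loras 8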
Requests can specify the LoRA adapter as if it were any other model via the ``model`` request parameter. The requests will be
processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and potentially other
LoRA adapter requests if they were provided and ``max_loras`` is set high enough).
The following is an example request:
.. code-block:: bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "sql-lora",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}' | jq
Dynamically serving LoRA Adapters
---------------------------------
In addition to serving LoRA adapters at server startup, the vLLM server now supports dynamically loading and unloading
LoRA adapters at runtime through dedicated API endpoints. This feature can be particularly useful when the flexibility
to change models on-the-fly is needed.
Note: Enabling this feature in production environments is risky, as users may then participate in model adapter management.
To enable dynamic LoRA loading and unloading, ensure that the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING`
is set to `True`. When this option is enabled, the API server will log a warning to indicate that dynamic loading is active.
.. code-block:: bash
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
Loading a LoRA Adapter:
To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary
details of the adapter to be loaded. The request payload should include the name and path to the LoRA adapter.
Example request to load a LoRA adapter:
.. code-block:: bash
curl -X POST http://localhost:8000/v1/load_lora_adapter \
-H "Content-Type: application/json" \
-d '{
"lora_name": "sql_adapter",
"lora_path": "/path/to/sql-lora-adapter"
}'
Upon a successful request, the API will respond with a 200 OK status code. If an error occurs, such as if the adapter
cannot be found or loaded, an appropriate error message will be returned.
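If you prefer Python over curl, here is a minimal sketch that posts to the same endpoint using the ``requests`` library (the adapter name and path are placeholders):
.. code-block:: python
    import requests

    response = requests.post(
        "http://localhost:8000/v1/load_lora_adapter",
        json={"lora_name": "sql_adapter", "lora_path": "/path/to/sql-lora-adapter"},
    )
    # A 200 status code means the adapter was loaded; otherwise the body contains the error message.
    print(response.status_code, response.text)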
Unloading a LoRA Adapter:
To unload a LoRA adapter that has been previously loaded, send a POST request to the `/v1/unload_lora_adapter` endpoint
with the name or ID of the adapter to be unloaded.
Example request to unload a LoRA adapter:
.. code-block:: bash
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
-H "Content-Type: application/json" \
-d '{
"lora_name": "sql_adapter"
}'
New format for `--lora-modules`
-------------------------------
In the previous version, users would provide LoRA modules via the following format, either as a key-value pair or in JSON format. For example:
.. code-block:: bash
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
This format only included the `name` and `path` for each LoRA module, but provided no way to specify a `base_model_name`.
Now, you can specify a `base_model_name` alongside the `name` and `path` using JSON format. For example:
.. code-block:: bash
--lora-modules '{"name": "sql-lora", "path": "/path/to/lora", "base_model_name": "meta-llama/Llama-2-7b"}'
For backward compatibility, you can still use the old key-value format (name=path), but the `base_model_name` will remain unspecified in that case.
LoRA model lineage in model card
--------------------------------
The new format of `--lora-modules` is mainly to support the display of parent model information in the model card. Here's an explanation of how the ``/v1/models`` response supports this:
- The `parent` field of LoRA model `sql-lora` now links to its base model `meta-llama/Llama-2-7b-hf`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
- The `root` field points to the artifact location of the LoRA adapter.
.. code-block:: bash
$ curl http://localhost:8000/v1/models
{
"object": "list",
"data": [
{
"id": "meta-llama/Llama-2-7b-hf",
"object": "model",
"created": 1715644056,
"owned_by": "vllm",
"root": "~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/",
"parent": null,
"permission": [
{
.....
}
]
},
{
"id": "sql-lora",
"object": "model",
"created": 1715644056,
"owned_by": "vllm",
"root": "~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/",
"parent": meta-llama/Llama-2-7b-hf,
"permission": [
{
....
}
]
}
]
}


@@ -1,65 +0,0 @@
.. _performance:
Performance and Tuning
======================
Preemption
----------
Due to the auto-regressive nature of the transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
available again. When this occurs, the following warning is printed:
.. code-block:: text
    WARNING 05-09 00:49:33 scheduler.py:1057] Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
While this mechanism ensures system robustness, preemption and recomputation can adversely affect end-to-end latency.
If you frequently encounter preemptions from the vLLM engine, consider the following actions:
- Increase `gpu_memory_utilization`. vLLM pre-allocates GPU memory for the KV cache using this fraction of total GPU memory. By increasing the utilization, you can provide more KV cache space.
- Decrease `max_num_seqs` or `max_num_batched_tokens`. This can reduce the number of concurrent requests in a batch, thereby requiring less KV cache space.
- Increase `tensor_parallel_size`. This approach shards model weights, so each GPU has more memory available for KV cache.
You can also monitor the number of preemption requests through Prometheus metrics exposed by vLLM. Additionally, you can log the cumulative number of preemption requests by setting `disable_log_stats=False`.
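The knobs above map directly onto ``LLM`` constructor arguments. Here is a minimal sketch; the values are illustrative and should be tuned for your hardware and workload:
.. code-block:: python
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-2-7b-hf",
        gpu_memory_utilization=0.95,  # reserve more GPU memory for the KV cache
        max_num_seqs=128,             # cap concurrent sequences per batch
        tensor_parallel_size=2,       # shard weights across 2 GPUs, freeing memory for KV cache
        disable_log_stats=False,      # log stats, including cumulative preemption counts
    )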
.. _chunked-prefill:
Chunked Prefill
---------------
vLLM supports an experimental feature called chunked prefill. Chunked prefill allows large prefills to be chunked into smaller pieces and batched together with decode requests.
You can enable the feature by specifying ``--enable-chunked-prefill`` in the command line or setting ``enable_chunked_prefill=True`` in the LLM constructor.
.. code-block:: python
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True)
# Set max_num_batched_tokens to tune performance.
# NOTE: 512 is the default max_num_batched_tokens for chunked prefill.
# llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True, max_num_batched_tokens=512)
By default, the vLLM scheduler prioritizes prefills and doesn't batch prefill and decode requests in the same batch.
This policy optimizes TTFT (time to first token), but incurs slower ITL (inter-token latency) and inefficient GPU utilization.
Once chunked prefill is enabled, the policy is changed to prioritize decode requests.
It batches all pending decode requests into the batch before scheduling any prefill.
When there is available token budget (``max_num_batched_tokens``), it schedules pending prefills.
If the last pending prefill request cannot fit into ``max_num_batched_tokens``, it chunks it.
This policy has two benefits:
- It improves ITL and decode-phase generation because decode requests are prioritized.
- It helps achieve better GPU utilization by colocating compute-bound (prefill) and memory-bound (decode) requests in the same batch.
You can tune the performance by changing ``max_num_batched_tokens``.
By default, it is set to 512, which has the best ITL on A100 in the initial benchmark (llama 70B and mixtral 8x22B).
Smaller ``max_num_batched_tokens`` achieves better ITL because there are fewer prefills interrupting decodes.
Higher ``max_num_batched_tokens`` achieves better TTFT as you can put more prefill tokens in the batch.
- If ``max_num_batched_tokens`` is the same as ``max_model_len``, that's almost equivalent to the default scheduling policy (except that it still prioritizes decodes).
- Note that the default value (512) of ``max_num_batched_tokens`` is optimized for ITL, and it may have lower throughput than the default scheduler.
We recommend you set ``max_num_batched_tokens > 2048`` for throughput, as in the sketch below.
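For example, a throughput-oriented configuration might look like the following sketch (4096 is an illustrative value, not a universal recommendation):
.. code-block:: python
    llm = LLM(
        model="meta-llama/Llama-2-7b-hf",
        enable_chunked_prefill=True,
        max_num_batched_tokens=4096,  # > 2048 trades some ITL for higher throughput
    )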
See related papers for more details (https://arxiv.org/pdf/2401.08671 or https://arxiv.org/pdf/2308.16369).
Please try out this feature and let us know your feedback via GitHub issues!


@@ -1,207 +0,0 @@
.. _spec_decode:
Speculative decoding in vLLM
============================
.. warning::
Please note that speculative decoding in vLLM is not yet optimized and does
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters. The work
to optimize it is ongoing and can be followed in `this issue <https://github.com/vllm-project/vllm/issues/4630>`_.
This document shows how to use `Speculative Decoding <https://x.com/karpathy/status/1697318534555336961>`_ with vLLM.
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
Speculating with a draft model
------------------------------
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
.. code-block:: python
from vllm import LLM, SamplingParams
prompts = [
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="facebook/opt-6.7b",
tensor_parallel_size=1,
speculative_model="facebook/opt-125m",
num_speculative_tokens=5,
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
To perform the same with online mode, launch the server:
.. code-block:: bash
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
--seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
--num_speculative_tokens 5 --gpu_memory_utilization 0.8
Then use a client:
.. code-block:: python
from openai import OpenAI
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
# defaults to os.environ.get("OPENAI_API_KEY")
api_key=openai_api_key,
base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
# Completion API
stream = False
completion = client.completions.create(
model=model,
prompt="The future of AI is",
echo=False,
n=1,
stream=stream,
)
print("Completion results:")
if stream:
for c in completion:
print(c)
else:
print(completion)
Speculating by matching n-grams in the prompt
---------------------------------------------
The following code configures vLLM to use speculative decoding where proposals are generated by
matching n-grams in the prompt. For more information, read `this thread <https://x.com/joao_gante/status/1747322413006643259>`_.
.. code-block:: python
from vllm import LLM, SamplingParams
prompts = [
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="facebook/opt-6.7b",
tensor_parallel_size=1,
speculative_model="[ngram]",
num_speculative_tokens=5,
ngram_prompt_lookup_max=4,
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Speculating using MLP speculators
---------------------------------
The following code configures vLLM to use speculative decoding where proposals are generated by
draft models that condition draft predictions on both context vectors and sampled tokens.
For more information see `this blog <https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/>`_ or
`this technical report <https://arxiv.org/abs/2404.19124>`_.
.. code-block:: python
from vllm import LLM, SamplingParams
prompts = [
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
tensor_parallel_size=4,
speculative_model="ibm-fms/llama3-70b-accelerator",
speculative_draft_tensor_parallel_size=1,
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Note that these speculative models currently need to be run without tensor parallelism, although
it is possible to run the main model using tensor parallelism (see example above). Since the
speculative models are relatively small, we still see significant speedups. However, this
limitation will be fixed in a future release.
A variety of speculative models of this type are available on HF hub:
* `llama-13b-accelerator <https://huggingface.co/ibm-fms/llama-13b-accelerator>`_
* `llama3-8b-accelerator <https://huggingface.co/ibm-fms/llama3-8b-accelerator>`_
* `codellama-34b-accelerator <https://huggingface.co/ibm-fms/codellama-34b-accelerator>`_
* `llama2-70b-accelerator <https://huggingface.co/ibm-fms/llama2-70b-accelerator>`_
* `llama3-70b-accelerator <https://huggingface.co/ibm-fms/llama3-70b-accelerator>`_
* `granite-3b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-3b-code-instruct-accelerator>`_
* `granite-8b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-8b-code-instruct-accelerator>`_
* `granite-7b-instruct-accelerator <https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator>`_
* `granite-20b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator>`_
Lossless guarantees of Speculative Decoding
-------------------------------------------
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
speculative decoding, breaking down the guarantees into three key areas:
1. **Theoretical Losslessness**
- Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might
cause slight variations in output distributions, as discussed
in `Accelerating Large Language Model Decoding with Speculative Sampling <https://arxiv.org/pdf/2302.01318>`_
2. **Algorithmic Losslessness**
- vLLM's implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:
- **Rejection Sampler Convergence**: Ensures that samples from vLLM's rejection sampler align with the target
distribution. `View Test Code <https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252>`_
- **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
provides a lossless guarantee. Almost all of the tests in `this directory <https://github.com/vllm-project/vllm/tree/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e>`_
verify this property using `this assertion implementation <https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291>`_
3. **vLLM Logprob Stability**
- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the `FAQs <../serving/faq>`_.
**Conclusion**
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to the following factors:
- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.
**Mitigation Strategies**
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the `FAQs <../serving/faq>`_.
Resources for vLLM contributors
-------------------------------
* `A Hacker's Guide to Speculative Decoding in vLLM <https://www.youtube.com/watch?v=9wNAgpX6z_4>`_
* `What is Lookahead Scheduling in vLLM? <https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a>`_
* `Information on batch expansion <https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8>`_
* `Dynamic speculative decoding <https://github.com/vllm-project/vllm/issues/4565>`_


@@ -1,267 +0,0 @@
.. _structured_outputs:
Structured Outputs
==================
vLLM supports the generation of structured outputs using `outlines <https://github.com/dottxt-ai/outlines>`_ or `lm-format-enforcer <https://github.com/noamgat/lm-format-enforcer>`_ as backends for guided decoding.
This document shows you some examples of the different options that are available to generate structured outputs.
Online Inference (OpenAI API)
-----------------------------
You can generate structured outputs using OpenAI's `Completions <https://platform.openai.com/docs/api-reference/completions>`_ and `Chat <https://platform.openai.com/docs/api-reference/chat>`_ APIs.
The following parameters are supported, which must be added as extra parameters:
- ``guided_choice``: the output will be exactly one of the choices.
- ``guided_regex``: the output will follow the regex pattern.
- ``guided_json``: the output will follow the JSON schema.
- ``guided_grammar``: the output will follow the context-free grammar.
- ``guided_whitespace_pattern``: used to override the default whitespace pattern for guided json decoding.
- ``guided_decoding_backend``: used to select the guided decoding backend to use.
You can see the complete list of supported parameters on the `OpenAI Compatible Server </../serving/openai_compatible_server.html>`_ page.
Now let's see an example for each of these cases, starting with ``guided_choice``, as it's the easiest one:
.. code-block:: python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="-",
)
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
],
extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
The next example shows how to use ``guided_regex``. The idea is to generate an email address, given a simple regex template:
.. code-block:: python
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{
"role": "user",
"content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
}
],
extra_body={"guided_regex": "\w+@\w+\.com\n", "stop": ["\n"]},
)
print(completion.choices[0].message.content)
One of the most relevant features in structured text generation is the option to generate valid JSON with pre-defined fields and formats.
For this we can use the ``guided_json`` parameter in two different ways:
- Using directly a `JSON Schema <https://json-schema.org/>`_
- Defining a `Pydantic model <https://docs.pydantic.dev/latest/>`_ and then extracting the JSON Schema from it (which is normally an easier option).
The next example shows how to use the ``guided_json`` parameter with a Pydantic model:
.. code-block:: python
from pydantic import BaseModel
from enum import Enum
class CarType(str, Enum):
sedan = "sedan"
suv = "SUV"
truck = "Truck"
coupe = "Coupe"
class CarDescription(BaseModel):
brand: str
model: str
car_type: CarType
json_schema = CarDescription.model_json_schema()
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{
"role": "user",
"content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
}
],
extra_body={"guided_json": json_schema},
)
print(completion.choices[0].message.content)
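If you prefer to write the schema by hand, a plain `JSON Schema <https://json-schema.org/>`_ dictionary can be passed directly as ``guided_json``. The following sketch is roughly equivalent to the Pydantic example above:
.. code-block:: python
    car_schema = {
        "type": "object",
        "properties": {
            "brand": {"type": "string"},
            "model": {"type": "string"},
            "car_type": {"type": "string", "enum": ["sedan", "SUV", "Truck", "Coupe"]},
        },
        "required": ["brand", "model", "car_type"],
    }
    completion = client.chat.completions.create(
        model="Qwen/Qwen2.5-3B-Instruct",
        messages=[
            {
                "role": "user",
                "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
            }
        ],
        extra_body={"guided_json": car_schema},
    )
    print(completion.choices[0].message.content)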
.. tip::
While not strictly necessary, it's usually better to indicate in the prompt that JSON needs to be generated, and to specify which fields the LLM should fill and how.
This can improve the results notably in most cases.
Finally, we have ``guided_grammar``, which is probably the most difficult one to use but is really powerful, as it allows us to define complete languages like SQL queries.
It works by using a context-free EBNF grammar, which we can use, for example, to define a specific format of simplified SQL queries, as in the example below:
.. code-block:: python
simplified_sql_grammar = """
?start: select_statement
?select_statement: "SELECT " column_list " FROM " table_name
?column_list: column_name ("," column_name)*
?table_name: identifier
?column_name: identifier
?identifier: /[a-zA-Z_][a-zA-Z0-9_]*/
"""
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{
"role": "user",
"content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
}
],
extra_body={"guided_grammar": simplified_sql_grammar},
)
print(completion.choices[0].message.content)
The complete code of the examples can be found on `examples/openai_chat_completion_structured_outputs.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_structured_outputs.py>`_.
Experimental Automatic Parsing (OpenAI API)
--------------------------------------------
This section covers the OpenAI beta wrapper over the ``client.chat.completions.create()`` method that provides richer integrations with Python-specific types.
At the time of writing (``openai==1.54.4``), this is a "beta" feature in the OpenAI client library. Code reference can be found `here <https://github.com/openai/openai-python/blob/52357cff50bee57ef442e94d78a0de38b4173fc2/src/openai/resources/beta/chat/completions.py#L100-L104>`_.
For the following examples, vLLM was set up using ``vllm serve meta-llama/Llama-3.1-8B-Instruct``.
Here is a simple example demonstrating how to get structured output using Pydantic models:
.. code-block:: python
from pydantic import BaseModel
from openai import OpenAI
class Info(BaseModel):
name: str
age: int
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "My name is Cameron, I'm 28. What's my name and age?"},
],
response_format=Info,
extra_body=dict(guided_decoding_backend="outlines"),
)
message = completion.choices[0].message
print(message)
assert message.parsed
print("Name:", message.parsed.name)
print("Age:", message.parsed.age)
Output:
.. code-block:: console
ParsedChatCompletionMessage[Info](content='{"name": "Cameron", "age": 28}', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=Info(name='Cameron', age=28))
Name: Cameron
Age: 28
Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:
.. code-block:: python
from typing import List
from pydantic import BaseModel
from openai import OpenAI
class Step(BaseModel):
explanation: str
output: str
class MathResponse(BaseModel):
steps: List[Step]
final_answer: str
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful expert math tutor."},
{"role": "user", "content": "Solve 8x + 31 = 2."},
],
response_format=MathResponse,
extra_body=dict(guided_decoding_backend="outlines"),
)
message = completion.choices[0].message
print(message)
assert message.parsed
for i, step in enumerate(message.parsed.steps):
print(f"Step #{i}:", step)
print("Answer:", message.parsed.final_answer)
Output:
.. code-block:: console
ParsedChatCompletionMessage[MathResponse](content='{ "steps": [{ "explanation": "First, let\'s isolate the term with the variable \'x\'. To do this, we\'ll subtract 31 from both sides of the equation.", "output": "8x + 31 - 31 = 2 - 31"}, { "explanation": "By subtracting 31 from both sides, we simplify the equation to 8x = -29.", "output": "8x = -29"}, { "explanation": "Next, let\'s isolate \'x\' by dividing both sides of the equation by 8.", "output": "8x / 8 = -29 / 8"}], "final_answer": "x = -29/8" }', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=MathResponse(steps=[Step(explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation.", output='8x + 31 - 31 = 2 - 31'), Step(explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.', output='8x = -29'), Step(explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8.", output='8x / 8 = -29 / 8')], final_answer='x = -29/8'))
Step #0: explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation." output='8x + 31 - 31 = 2 - 31'
Step #1: explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.' output='8x = -29'
Step #2: explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8." output='8x / 8 = -29 / 8'
Answer: x = -29/8
Offline Inference
-----------------
Offline inference allows for the same types of guided decoding.
To use it, we'll need to configure guided decoding using the ``GuidedDecodingParams`` class inside ``SamplingParams``.
The main available options inside ``GuidedDecodingParams`` are:
- ``json``
- ``regex``
- ``choice``
- ``grammar``
- ``backend``
- ``whitespace_pattern``
These parameters can be used in the same way as the parameters from the Online Inference examples above.
One example of the usage of the ``choice`` parameter is shown below:
.. code-block:: python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams
llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"])
sampling_params = SamplingParams(guided_decoding=guided_decoding_params)
outputs = llm.generate(
prompts="Classify this sentiment: vLLM is wonderful!",
sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
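The other options are used in the same way. For instance, here is a minimal sketch of offline guided decoding with ``regex``, reusing the ``llm`` instance from above (the pattern and prompt are illustrative):
.. code-block:: python
    guided_decoding_params = GuidedDecodingParams(regex=r"\w+@\w+\.com\n")
    sampling_params = SamplingParams(guided_decoding=guided_decoding_params, stop=["\n"])
    outputs = llm.generate(
        prompts="Generate an example email address for Alan Turing, who works at Enigma: ",
        sampling_params=sampling_params,
    )
    print(outputs[0].outputs[0].text)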
A complete example with all options can be found in `examples/offline_inference_structured_outputs.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_structured_outputs.py>`_.


@@ -471,6 +471,8 @@ Sentence Pair Scoring
.. note::
These models are supported in both offline and online inference via Score API.
.. _supported_mm_models:
Multimodal Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -489,8 +491,6 @@ On the other hand, modalities separated by :code:`/` are mutually exclusive.
- e.g.: :code:`T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
.. _supported_vlms:
Text Generation
---------------
@@ -646,6 +646,21 @@ Text Generation
| :sup:`E` Pre-computed embeddings can be inputted for this modality.
| :sup:`+` Multiple items can be inputted per text prompt for this modality.
.. important::
To enable multiple multi-modal items per text prompt, you have to set :code:`limit_mm_per_prompt` (offline inference)
or :code:`--limit-mm-per-prompt` (online inference). For example, to enable passing up to 4 images per text prompt:
.. code-block:: python
llm = LLM(
model="Qwen/Qwen2-VL-7B-Instruct",
limit_mm_per_prompt={"image": 4},
)
.. code-block:: bash
vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
.. note::
vLLM currently only supports adding LoRA to the language backbone of multimodal models.


@@ -1,330 +0,0 @@
.. _vlm:
Using VLMs
==========
vLLM provides experimental support for Vision Language Models (VLMs). See the :ref:`list of supported VLMs here <supported_vlms>`.
This document shows you how to run and serve these models using vLLM.
.. note::
We are actively iterating on VLM support. See `this RFC <https://github.com/vllm-project/vllm/issues/4194>`_ for upcoming changes,
and `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
Offline Inference
-----------------
Single-image input
^^^^^^^^^^^^^^^^^^
The :class:`~vllm.LLM` class can be instantiated in much the same way as language-only models.
.. code-block:: python
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
To pass an image to the model, note the following in :class:`vllm.inputs.PromptType`:
* ``prompt``: The prompt should follow the format that is documented on HuggingFace.
* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.
.. code-block:: python
# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
# Load the image using PIL.Image
image = PIL.Image.open(...)
# Single prompt inference
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {"image": image},
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
# Inference with image embeddings as input
image_embeds = torch.load(...) # torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {"image": image_embeds},
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
# Inference with image embeddings as input with additional parameters
# Specifically, we are conducting a trial run of Qwen2VL and MiniCPM-V with the new input format, which utilizes additional parameters.
mm_data = {}
image_embeds = torch.load(...) # torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
# For Qwen2VL, image_grid_thw is needed to calculate positional encoding.
mm_data['image'] = {
"image_embeds": image_embeds,
"image_grid_thw": torch.load(...) # torch.Tensor of shape (1, 3),
}
# For MiniCPM-V, image_size_list is needed to calculate details of the sliced image.
mm_data['image'] = {
"image_embeds": image_embeds,
"image_size_list": [image.size] # list of image sizes
}
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": mm_data,
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
# Batch inference
image_1 = PIL.Image.open(...)
image_2 = PIL.Image.open(...)
outputs = llm.generate(
[
{
"prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
"multi_modal_data": {"image": image_1},
},
{
"prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
"multi_modal_data": {"image": image_2},
}
]
)
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
A code example can be found in `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_.
Multi-image input
^^^^^^^^^^^^^^^^^
Multi-image input is only supported for a subset of VLMs, as shown :ref:`here <supported_vlms>`.
To enable multiple multi-modal items per text prompt, you have to set ``limit_mm_per_prompt`` for the :class:`~vllm.LLM` class.
.. code-block:: python
llm = LLM(
model="microsoft/Phi-3.5-vision-instruct",
trust_remote_code=True, # Required to load Phi-3.5-vision
max_model_len=4096, # Otherwise, it may not fit in smaller GPUs
limit_mm_per_prompt={"image": 2}, # The maximum number to accept
)
Instead of passing in a single image, you can pass in a list of images.
.. code-block:: python
# Refer to the HuggingFace repo for the correct format to use
prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"
# Load the images using PIL.Image
image1 = PIL.Image.open(...)
image2 = PIL.Image.open(...)
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {
"image": [image1, image2]
},
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
A code example can be found in `examples/offline_inference_vision_language_multi_image.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py>`_.
Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL <https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct>`_ as it supports videos:
.. code-block:: python
# Specify the maximum number of frames per video to be 4. This can be changed.
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
# Create the request payload.
video_frames = ... # load your video making sure it only has the number of frames specified earlier.
message = {
"role": "user",
"content": [
{"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
],
}
for i in range(len(video_frames)):
base64_image = encode_image(video_frames[i]) # base64 encoding.
new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
message["content"].append(new_image)
# Perform inference and log output.
outputs = llm.chat([message])
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
Online Inference
----------------
OpenAI Vision API
^^^^^^^^^^^^^^^^^
You can serve vision language models with vLLM's HTTP server that is compatible with `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.
Below is an example of how to launch the same ``microsoft/Phi-3.5-vision-instruct`` model with vLLM's OpenAI-compatible API server.
.. code-block:: bash
vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2
.. important::
Since OpenAI Vision API is based on `Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`_,
a chat template is **required** to launch the API server.
Phi-3.5-Vision comes with a chat template, but for other models you may have to provide one if the model's tokenizer does not include it.
The chat template can be inferred based on the documentation on the model's HuggingFace repo.
For example, LLaVA-1.5 (``llava-hf/llava-1.5-7b-hf``) requires a chat template that can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`_.
To consume the server, you can use the OpenAI client like in the example below:
.. code-block:: python
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
# Single-image input inference
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
chat_response = client.chat.completions.create(
model="microsoft/Phi-3.5-vision-instruct",
messages=[{
"role": "user",
"content": [
# NOTE: The prompt formatting with the image token `<image>` is not needed
# since the prompt will be processed automatically by the API server.
{"type": "text", "text": "Whats in this image?"},
{"type": "image_url", "image_url": {"url": image_url}},
],
}],
)
print("Chat completion output:", chat_response.choices[0].message.content)
# Multi-image input inference
image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"
chat_response = client.chat.completions.create(
model="microsoft/Phi-3.5-vision-instruct",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "What are the animals in these images?"},
{"type": "image_url", "image_url": {"url": image_url_duck}},
{"type": "image_url", "image_url": {"url": image_url_lion}},
],
}],
)
print("Chat completion output:", chat_response.choices[0].message.content)
A full code example can be found in `examples/openai_chat_completion_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py>`_.
.. tip::
Loading from local file paths is also supported in vLLM: you can specify the allowed local media path via ``--allowed-local-media-path`` when launching the API server/engine,
and pass the file path as ``url`` in the API request.
.. tip::
There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
.. note::
By default, the timeout for fetching images through http url is ``5`` seconds. You can override this by setting the environment variable:
.. code-block:: console
$ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
Chat Embeddings API
^^^^^^^^^^^^^^^^^^^
vLLM's Chat Embeddings API is a superset of OpenAI's `Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`_,
where a list of ``messages`` can be passed instead of batched ``inputs``. This enables multi-modal inputs to be passed to embedding models.
.. tip::
The schema of ``messages`` is exactly the same as in Chat Completions API.
In this example, we will serve the ``TIGER-Lab/VLM2Vec-Full`` model.
.. code-block:: bash
vllm serve TIGER-Lab/VLM2Vec-Full --task embedding \
--trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
.. important::
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass ``--task embedding``
to run this model in embedding mode instead of text generation mode.
.. important::
VLM2Vec does not expect chat-based input. We use a `custom chat template <https://github.com/vllm-project/vllm/blob/main/examples/template_vlm2vec.jinja>`_
to combine the text and images together.
Since this request schema is not defined by the OpenAI client, we post a request to the server using the lower-level ``requests`` library:
.. code-block:: python
import requests
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
response = requests.post(
"http://localhost:8000/v1/embeddings",
json={
"model": "TIGER-Lab/VLM2Vec-Full",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": image_url}},
{"type": "text", "text": "Represent the given image."},
],
}],
"encoding_format": "float",
},
)
response.raise_for_status()
response_json = response.json()
print("Embedding output:", response_json["data"][0]["embedding"])
Here is an example of serving the ``MrLight/dse-qwen2-2b-mrl-v1`` model.
.. code-block:: bash
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embedding \
--trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
.. important::
Like with VLM2Vec, we have to explicitly pass ``--task embedding``. Additionally, ``MrLight/dse-qwen2-2b-mrl-v1`` requires an EOS token for embeddings,
which is handled by the jinja template.
.. important::
``MrLight/dse-qwen2-2b-mrl-v1`` requires a placeholder image of the minimum image size for text query embeddings. See the full code
example below for details.
A full code example can be found in `examples/openai_chat_embedding_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_embedding_client_for_multimodal.py>`_.