[Docs] Convert rST to MyST (Markdown) (#11145)

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
This commit is contained in:
Rafael Vasquez
2024-12-23 17:35:38 -05:00
committed by GitHub
parent 94d545a1a1
commit 32aa2059ad
167 changed files with 7863 additions and 8131 deletions


@@ -0,0 +1,468 @@
(compatibility-matrix)=
# Compatibility Matrix
The tables below show mutually exclusive features and their support on some hardware.
```{note}
Check the ✗ entries with links to see the tracking issue for that unsupported feature/hardware combination.
```
## Feature x Feature
```{raw} html
<style>
/* Make smaller to try to improve readability */
td {
font-size: 0.8rem;
text-align: center;
}
th {
text-align: center;
font-size: 0.8rem;
}
</style>
```
```{list-table}
:header-rows: 1
:stub-columns: 1
:widths: auto
* - Feature
- [CP](#chunked-prefill)
- [APC](#apc)
- [LoRA](#lora-adapter)
- <abbr title="Prompt Adapter">prmpt adptr</abbr>
- [SD](#spec_decode)
- CUDA graph
- <abbr title="Pooling Models">pooling</abbr>
- <abbr title="Encoder-Decoder Models">enc-dec</abbr>
- <abbr title="Logprobs">logP</abbr>
- <abbr title="Prompt Logprobs">prmpt logP</abbr>
- <abbr title="Async Output Processing">async output</abbr>
- multi-step
- <abbr title="Multimodal Inputs">mm</abbr>
- best-of
- beam-search
- <abbr title="Guided Decoding">guided dec</abbr>
* - [CP](#chunked-prefill)
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
* - [APC](#apc)
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
* - [LoRA](#lora-adapter)
- [✗](https://github.com/vllm-project/vllm/pull/9057)
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
-
-
* - <abbr title="Prompt Adapter">prmpt adptr</abbr>
- ✅
- ✅
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
-
* - [SD](#spec_decode)
- ✅
- ✅
- ✗
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
* - CUDA graph
- ✅
- ✅
- ✅
- ✅
- ✅
-
-
-
-
-
-
-
-
-
-
-
* - <abbr title="Pooling Models">pooling</abbr>
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
-
-
-
-
-
-
-
-
-
-
* - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
- ✗
- [✗](https://github.com/vllm-project/vllm/issues/7366)
- ✗
- ✗
- [✗](https://github.com/vllm-project/vllm/issues/7366)
- ✅
- ✅
-
-
-
-
-
-
-
-
-
* - <abbr title="Logprobs">logP</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
-
-
-
-
-
-
-
-
* - <abbr title="Prompt Logprobs">prmpt logP</abbr>
- ✅
- ✅
- ✅
- ✅
- [✗](https://github.com/vllm-project/vllm/pull/8199)
- ✅
- ✗
- ✅
- ✅
-
-
-
-
-
-
-
* - <abbr title="Async Output Processing">async output</abbr>
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
- ✗
- ✗
- ✅
- ✅
-
-
-
-
-
-
* - multi-step
- ✗
- ✅
- ✗
- ✅
- ✗
- ✅
- ✗
- ✗
- ✅
- [✗](https://github.com/vllm-project/vllm/issues/8198)
- ✅
-
-
-
-
-
* - <abbr title="Multimodal Inputs">mm</abbr>
- ✅
- [✗](https://github.com/vllm-project/vllm/pull/8348)
- [✗](https://github.com/vllm-project/vllm/pull/7199)
- ?
- ?
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ?
-
-
-
-
* - best-of
- ✅
- ✅
- ✅
- ✅
- [✗](https://github.com/vllm-project/vllm/issues/6137)
- ✅
- ✗
- ✅
- ✅
- ✅
- ?
- [✗](https://github.com/vllm-project/vllm/issues/7968)
- ✅
-
-
-
* - beam-search
- ✅
- ✅
- ✅
- ✅
- [✗](https://github.com/vllm-project/vllm/issues/6137)
- ✅
- ✗
- ✅
- ✅
- ✅
- ?
- [✗](https://github.com/vllm-project/vllm/issues/7968)
- ?
- ✅
-
-
* - <abbr title="Guided Decoding">guided dec</abbr>
- ✅
- ✅
- ?
- ?
- ✅
- ✅
- ✗
- ?
- ✅
- ✅
- ✅
- [✗](https://github.com/vllm-project/vllm/issues/9893)
- ?
- ✅
- ✅
-
```
## Feature x Hardware
```{list-table}
:header-rows: 1
:stub-columns: 1
:widths: auto
* - Feature
- Volta
- Turing
- Ampere
- Ada
- Hopper
- CPU
- AMD
* - [CP](#chunked-prefill)
- [✗](https://github.com/vllm-project/vllm/issues/2729)
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - [APC](#apc)
- [✗](https://github.com/vllm-project/vllm/issues/3687)
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - [LoRA](#lora-adapter)
- ✅
- ✅
- ✅
- ✅
- ✅
- [✗](https://github.com/vllm-project/vllm/pull/4830)
- ✅
* - <abbr title="Prompt Adapter">prmpt adptr</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- [✗](https://github.com/vllm-project/vllm/issues/8475)
- ✅
* - [SD](#spec_decode)
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - CUDA graph
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
* - <abbr title="Pooling Models">pooling</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ?
* - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
* - <abbr title="Multimodal Inputs">mm</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - <abbr title="Logprobs">logP</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - <abbr title="Prompt Logprobs">prmpt logP</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - <abbr title="Async Output Processing">async output</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
- ✗
* - multi-step
- ✅
- ✅
- ✅
- ✅
- ✅
- [✗](https://github.com/vllm-project/vllm/issues/8477)
- ✅
* - best-of
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - beam-search
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - <abbr title="Guided Decoding">guided dec</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
```


@@ -1,468 +0,0 @@
.. _compatibility_matrix:
Compatibility Matrix
====================
The tables below show mutually exclusive features and the support on some hardware.
.. note::
Check the '✗' with links to see tracking issue for unsupported feature/hardware combination.
Feature x Feature
-----------------
.. raw:: html
<style>
/* Make smaller to try to improve readability */
td {
font-size: 0.8rem;
text-align: center;
}
th {
text-align: center;
font-size: 0.8rem;
}
</style>
.. list-table::
:header-rows: 1
:widths: auto
* - Feature
- :ref:`CP <chunked-prefill>`
- :ref:`APC <apc>`
- :ref:`LoRA <lora>`
- :abbr:`prmpt adptr (Prompt Adapter)`
- :ref:`SD <spec_decode>`
- CUDA graph
- :abbr:`pooling (Pooling Models)`
- :abbr:`enc-dec (Encoder-Decoder Models)`
- :abbr:`logP (Logprobs)`
- :abbr:`prmpt logP (Prompt Logprobs)`
- :abbr:`async output (Async Output Processing)`
- multi-step
- :abbr:`mm (Multimodal Inputs)`
- best-of
- beam-search
- :abbr:`guided dec (Guided Decoding)`
* - :ref:`CP <chunked-prefill>`
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
* - :ref:`APC <apc>`
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
* - :ref:`LoRA <lora>`
- `✗ <https://github.com/vllm-project/vllm/pull/9057>`__
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
-
-
* - :abbr:`prmpt adptr (Prompt Adapter)`
- ✅
- ✅
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
-
* - :ref:`SD <spec_decode>`
- ✅
- ✅
- ✗
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
* - CUDA graph
- ✅
- ✅
- ✅
- ✅
- ✅
-
-
-
-
-
-
-
-
-
-
-
* - :abbr:`pooling (Pooling Models)`
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
-
-
-
-
-
-
-
-
-
-
* - :abbr:`enc-dec (Encoder-Decoder Models)`
- ✗
- `✗ <https://github.com/vllm-project/vllm/issues/7366>`__
- ✗
- ✗
- `✗ <https://github.com/vllm-project/vllm/issues/7366>`__
- ✅
- ✅
-
-
-
-
-
-
-
-
-
* - :abbr:`logP (Logprobs)`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
-
-
-
-
-
-
-
-
* - :abbr:`prmpt logP (Prompt Logprobs)`
- ✅
- ✅
- ✅
- ✅
- `✗ <https://github.com/vllm-project/vllm/pull/8199>`__
- ✅
- ✗
- ✅
- ✅
-
-
-
-
-
-
-
* - :abbr:`async output (Async Output Processing)`
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
- ✗
- ✗
- ✅
- ✅
-
-
-
-
-
-
* - multi-step
- ✗
- ✅
- ✗
- ✅
- ✗
- ✅
- ✗
- ✗
- ✅
- `✗ <https://github.com/vllm-project/vllm/issues/8198>`__
- ✅
-
-
-
-
-
* - :abbr:`mm (Multimodal Inputs)`
- ✅
- `✗ <https://github.com/vllm-project/vllm/pull/8348>`__
- `✗ <https://github.com/vllm-project/vllm/pull/7199>`__
- ?
- ?
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ?
-
-
-
-
* - best-of
- ✅
- ✅
- ✅
- ✅
- `✗ <https://github.com/vllm-project/vllm/issues/6137>`__
- ✅
- ✗
- ✅
- ✅
- ✅
- ?
- `✗ <https://github.com/vllm-project/vllm/issues/7968>`__
- ✅
-
-
-
* - beam-search
- ✅
- ✅
- ✅
- ✅
- `✗ <https://github.com/vllm-project/vllm/issues/6137>`__
- ✅
- ✗
- ✅
- ✅
- ✅
- ?
- `✗ <https://github.com/vllm-project/vllm/issues/7968>`__
- ?
- ✅
-
-
* - :abbr:`guided dec (Guided Decoding)`
- ✅
- ✅
- ?
- ?
- ✅
- ✅
- ✗
- ?
- ✅
- ✅
- ✅
- `✗ <https://github.com/vllm-project/vllm/issues/9893>`__
- ?
- ✅
- ✅
-
Feature x Hardware
^^^^^^^^^^^^^^^^^^
.. list-table::
:header-rows: 1
:widths: auto
* - Feature
- Volta
- Turing
- Ampere
- Ada
- Hopper
- CPU
- AMD
* - :ref:`CP <chunked-prefill>`
- `✗ <https://github.com/vllm-project/vllm/issues/2729>`__
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - :ref:`APC <apc>`
- `✗ <https://github.com/vllm-project/vllm/issues/3687>`__
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - :ref:`LoRA <lora>`
- ✅
- ✅
- ✅
- ✅
- ✅
- `✗ <https://github.com/vllm-project/vllm/pull/4830>`__
- ✅
* - :abbr:`prmpt adptr (Prompt Adapter)`
- ✅
- ✅
- ✅
- ✅
- ✅
- `✗ <https://github.com/vllm-project/vllm/issues/8475>`__
- ✅
* - :ref:`SD <spec_decode>`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - CUDA graph
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
* - :abbr:`pooling (Pooling Models)`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ?
* - :abbr:`enc-dec (Encoder-Decoder Models)`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
* - :abbr:`mm (Multimodal Inputs)`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - :abbr:`logP (Logprobs)`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - :abbr:`prmpt logP (Prompt Logprobs)`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - :abbr:`async output (Async Output Processing)`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
- ✗
* - multi-step
- ✅
- ✅
- ✅
- ✅
- ✅
- `✗ <https://github.com/vllm-project/vllm/issues/8477>`__
- ✅
* - best-of
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - beam-search
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - :abbr:`guided dec (Guided Decoding)`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅


@@ -0,0 +1,64 @@
(disagg-prefill)=
# Disaggregated prefilling (experimental)
This page introduces you to the disaggregated prefilling feature in vLLM. This feature is experimental and subject to change.
## Why disaggregated prefilling?
Two main reasons:
- **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling puts the prefill and decode phases of LLM inference in separate vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. `tp` and `pp`) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
- **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert prefill jobs during the decoding of a request, which results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL. Chunked prefill with a proper chunk size can also achieve the same goal, but in practice it is hard to figure out the correct chunk size, so disaggregated prefilling is a much more reliable way to control tail ITL.
```{note}
Disaggregated prefill DOES NOT improve throughput.
```
## Usage example
Please refer to `examples/disaggregated_prefill.sh` for the example usage of disaggregated prefilling.
## Benchmarks
Please refer to `benchmarks/disagg_benchmarks/` for disaggregated prefilling benchmarks.
## Development
We implement disaggregated prefilling by running two vLLM instances: one for prefill (the **prefill instance**) and one for decode (the **decode instance**), together with a connector that transfers the prefill KV caches and results from the prefill instance to the decode instance.
All of the disaggregated prefilling implementation lives under `vllm/distributed/kv_transfer`.
Key abstractions for disaggregated prefilling:
- **Connector**: allows the **kv consumer** to retrieve the KV caches of a batch of requests from the **kv producer**.
- **LookupBuffer**: provides two APIs: `insert` KV cache and `drop_select` KV cache. The semantics of `insert` and `drop_select` are similar to SQL, where `insert` inserts a KV cache into the buffer, and `drop_select` returns the KV cache that matches the given condition and drops it from the buffer.
- **Pipe**: a single-direction FIFO pipe for tensor transmission. It supports `send_tensor` and `recv_tensor`. A simplified sketch of these interfaces is shown after the note below.
```{note}
`insert` is a non-blocking operation, while `drop_select` is a blocking operation.
```
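To make these abstractions concrete, here is a simplified, hypothetical sketch of what the `LookupBuffer` and `Pipe` interfaces might look like. The method names follow the description above, but the signatures are illustrative only and are not the actual classes under `vllm/distributed/kv_transfer`:

```python
from abc import ABC, abstractmethod
from typing import Optional

import torch


class KVPipe(ABC):
    """Single-direction FIFO pipe for tensor transmission (illustrative only)."""

    @abstractmethod
    def send_tensor(self, tensor: Optional[torch.Tensor]) -> None:
        """Send one tensor through the pipe."""

    @abstractmethod
    def recv_tensor(self) -> Optional[torch.Tensor]:
        """Receive one tensor from the pipe, in FIFO order."""


class KVLookupBuffer(ABC):
    """Buffer with SQL-like insert / drop_select semantics (illustrative only)."""

    @abstractmethod
    def insert(self, key: torch.Tensor, kv_cache: torch.Tensor) -> None:
        """Non-blocking: store the KV cache of a request under the given key."""

    @abstractmethod
    def drop_select(self, key: torch.Tensor) -> Optional[torch.Tensor]:
        """Blocking: return the KV cache matching the key and drop it from the buffer."""
```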
Here is a figure illustrating how the above 3 abstractions are organized:
```{image} /assets/usage/disagg_prefill/abstraction.jpg
:alt: Disaggregated prefilling abstractions
```
The workflow of disaggregated prefilling is as follows:
```{image} /assets/usage/disagg_prefill/overview.jpg
:alt: Disaggregated prefilling workflow
```
The `buffer` corresponds to the `insert` API of LookupBuffer, and `drop_select` corresponds to the `drop_select` API of LookupBuffer.
## Third-party contributions
Disaggregated prefilling is highly related to infrastructure, so vLLM relies on third-party connectors for production-level disaggregated prefilling (and the vLLM team will actively review and merge new PRs for third-party connectors).
We recommend three ways of implementing a connector:
- **Fully-customized connector**: Implement your own `Connector` and call third-party libraries to send and receive KV caches, and much more (for example, editing vLLM's model input to perform customized prefilling). This approach gives you the most control, but at the risk of being incompatible with future vLLM versions.
- **Database-like connector**: Implement your own `LookupBuffer` and support the `insert` and `drop_select` APIs just like SQL.
- **Distributed P2P connector**: Implement your own `Pipe` and support the `send_tensor` and `recv_tensor` APIs, just like `torch.distributed`.


@@ -1,69 +0,0 @@
.. _disagg_prefill:
Disaggregated prefilling (experimental)
=======================================
This page introduces you the disaggregated prefilling feature in vLLM. This feature is experimental and subject to change.
Why disaggregated prefilling?
-----------------------------
Two main reasons:
* **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling put prefill and decode phase of LLM inference inside different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. ``tp`` and ``pp``) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
* **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert some prefill jobs during the decoding of one request. This results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL. Chunked prefill with a proper chunk size also can achieve the same goal, but in practice it's hard to figure out the correct chunk size value. So disaggregated prefilling is a much more reliable way to control tail ITL.
.. note::
Disaggregated prefill DOES NOT improve throughput.
Usage example
-------------
Please refer to ``examples/disaggregated_prefill.sh`` for the example usage of disaggregated prefilling.
Benchmarks
----------
Please refer to ``benchmarks/disagg_benchmarks/`` for disaggregated prefilling benchmarks.
Development
-----------
We implement disaggregated prefilling by running 2 vLLM instances. One for prefill (we call it prefill instance) and one for decode (we call it decode instance), and then use a connector to transfer the prefill KV caches and results from prefill instance to decode instance.
All disaggregated prefilling implementation is under ``vllm/distributed/kv_transfer``.
Key abstractions for disaggregated prefilling:
* **Connector**: Connector allows **kv consumer** to retrieve the KV caches of a batch of request from **kv producer**.
* **LookupBuffer**: LookupBuffer provides two API: ``insert`` KV cache and ``drop_select`` KV cache. The semantics of ``insert`` and ``drop_select`` are similar to SQL, where ``insert`` inserts a KV cache into the buffer, and ``drop_select`` returns the KV cache that matches the given condition and drop it from the buffer.
* **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports ``send_tensor`` and ``recv_tensor``.
.. note::
``insert`` is non-blocking operation but ``drop_select`` is blocking operation.
Here is a figure illustrating how the above 3 abstractions are organized:
.. image:: /assets/usage/disagg_prefill/abstraction.jpg
:alt: Disaggregated prefilling abstractions
The workflow of disaggregated prefilling is as follows:
.. image:: /assets/usage/disagg_prefill/overview.jpg
:alt: Disaggregated prefilling workflow
The ``buffer`` corresponds to ``insert`` API in LookupBuffer, and the ``drop_select`` corresponds to ``drop_select`` API in LookupBuffer.
Third-party contributions
-------------------------
Disaggregated prefilling is highly related to infrastructure, so vLLM relies on third-party connectors for production-level disaggregated prefilling (and vLLM team will actively review and merge new PRs for third-party connectors).
We recommend three ways of implementations:
* **Fully-customized connector**: Implement your own ``Connector``, and call third-party libraries to send and receive KV caches, and many many more (like editing vLLM's model input to perform customized prefilling, etc). This approach gives you the most control, but at the risk of being incompatible with future vLLM versions.
* **Database-like connector**: Implement your own ``LookupBuffer`` and support the ``insert`` and ``drop_select`` APIs just like SQL.
* **Distributed P2P connector**: Implement your own ``Pipe`` and support the ``send_tensor`` and ``recv_tensor`` APIs, just like `torch.distributed`.


@@ -1,23 +1,25 @@
.. _engine_args:
(engine-args)=
Engine Arguments
================
# Engine Arguments
Below, you can find an explanation of every engine argument for vLLM:
```{eval-rst}
.. argparse::
:module: vllm.engine.arg_utils
:func: _engine_args_parser
:prog: vllm serve
:nodefaultconst:
```
Async Engine Arguments
----------------------
## Async Engine Arguments
Below are the additional arguments related to the asynchronous engine:
```{eval-rst}
.. argparse::
:module: vllm.engine.arg_utils
:func: _async_engine_args_parser
:prog: vllm serve
:nodefaultconst:
:nodefaultconst:
```


@@ -0,0 +1,15 @@
# Environment Variables
vLLM uses the following environment variables to configure the system:
```{warning}
Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and IP for vLLM's **internal usage**. They are not the port and IP of the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.

All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
```
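For example, these internal variables can be set from Python before vLLM is initialized. This is a minimal sketch with placeholder values, not recommended settings:

```python
import os

# Placeholder values for illustration only; set these before vLLM is initialized.
os.environ["VLLM_HOST_IP"] = "192.168.0.10"    # IP used for vLLM's internal communication
os.environ["VLLM_PORT"] = "54321"              # port used for vLLM's internal communication
os.environ["VLLM_IMAGE_FETCH_TIMEOUT"] = "15"  # timeout (seconds) for fetching images over HTTP

from vllm import LLM  # import after the environment is configured
```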
```{literalinclude} ../../../vllm/envs.py
:end-before: end-env-vars-definition
:language: python
:start-after: begin-env-vars-definition
```


@@ -1,14 +0,0 @@
Environment Variables
========================
vLLM uses the following environment variables to configure the system:
.. warning::
Please note that ``VLLM_PORT`` and ``VLLM_HOST_IP`` set the port and ip for vLLM's **internal usage**. It is not the port and ip for the API server. If you use ``--host $VLLM_HOST_IP`` and ``--port $VLLM_PORT`` to start the API server, it will not work.
All environment variables used by vLLM are prefixed with ``VLLM_``. **Special care should be taken for Kubernetes users**: please do not name the service as ``vllm``, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because `Kubernetes sets environment variables for each service with the capitalized service name as the prefix <https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables>`_.
.. literalinclude:: ../../../vllm/envs.py
:language: python
:start-after: begin-env-vars-definition
:end-before: end-env-vars-definition


@@ -1,34 +1,33 @@
.. _faq:
(faq)=
Frequently Asked Questions
===========================
# Frequently Asked Questions
Q: How can I serve multiple models on a single port using the OpenAI API?
> Q: How can I serve multiple models on a single port using the OpenAI API?
A: Assuming that you're referring to using the OpenAI-compatible server to serve multiple models at once, that is not currently supported. You can run multiple instances of the server (each serving a different model) at the same time and add another layer in front of them to route each incoming request to the correct server.
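As an illustration, here is a minimal client-side routing sketch; the model names and ports are placeholders for two independently launched vLLM servers:

```python
from openai import OpenAI

# Map each model to the vLLM server that hosts it (placeholder values).
SERVERS = {
    "meta-llama/Llama-2-7b-hf": "http://localhost:8000/v1",
    "mistralai/Mistral-7B-Instruct-v0.3": "http://localhost:8001/v1",
}

def complete(model: str, prompt: str) -> str:
    # Route the request to whichever server hosts the requested model.
    client = OpenAI(api_key="EMPTY", base_url=SERVERS[model])
    response = client.completions.create(model=model, prompt=prompt, max_tokens=32)
    return response.choices[0].text
```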
----------------------------------------
______________________________________________________________________
Q: Which model to use for offline inference embedding?
> Q: Which model to use for offline inference embedding?
A: You can try `e5-mistral-7b-instruct <https://huggingface.co/intfloat/e5-mistral-7b-instruct>`__ and `BAAI/bge-base-en-v1.5 <https://huggingface.co/BAAI/bge-base-en-v1.5>`__;
more are listed :ref:`here <supported_models>`.
A: You can try [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) and [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5);
more are listed [here](#supported-models).
By extracting hidden states, vLLM can automatically convert text generation models like `Llama-3-8B <https://huggingface.co/meta-llama/Meta-Llama-3-8B>`__,
`Mistral-7B-Instruct-v0.3 <https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3>`__ into embedding models,
By extracting hidden states, vLLM can automatically convert text generation models like [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B),
[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) into embedding models,
but they are expected to be inferior to models that are specifically trained on embedding tasks.
----------------------------------------
______________________________________________________________________
Q: Can the output of a prompt vary across runs in vLLM?
> Q: Can the output of a prompt vary across runs in vLLM?
A: Yes, it can. vLLM does not guarantee stable log probabilities (logprobs) for the output tokens. Variations in logprobs may occur due to
numerical instability in Torch operations or non-deterministic behavior in batched Torch operations when batching changes. For more details,
see the `Numerical Accuracy section <https://pytorch.org/docs/stable/notes/numerical_accuracy.html#batched-computations-or-slice-computations>`_.
numerical instability in Torch operations or non-deterministic behavior in batched Torch operations when batching changes. For more details,
see the [Numerical Accuracy section](https://pytorch.org/docs/stable/notes/numerical_accuracy.html#batched-computations-or-slice-computations).
In vLLM, the same requests might be batched differently due to factors such as other concurrent requests,
changes in batch size, or batch expansion in speculative decoding. These batching variations, combined with numerical instability of Torch operations,
can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in
changes in batch size, or batch expansion in speculative decoding. These batching variations, combined with numerical instability of Torch operations,
can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in
different tokens being sampled. Once a different token is sampled, further divergence is likely.
**Mitigation Strategies**

docs/source/usage/lora.md (new file)

@@ -0,0 +1,215 @@
(lora-adapter)=
# LoRA Adapters
This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.
LoRA adapters can be used with any vLLM model that implements {class}`~vllm.model_executor.models.interfaces.SupportsLoRA`.
Adapters can be efficiently served on a per-request basis with minimal overhead. First, we download the adapter(s) and save
them locally with
```python
from huggingface_hub import snapshot_download
sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
```
Then we instantiate the base model and pass in the `enable_lora=True` flag:
```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
```
We can now submit the prompts and call `llm.generate` with the `lora_request` parameter. The first parameter
of `LoRARequest` is a human-identifiable name, the second parameter is a globally unique ID for the adapter, and
the third parameter is the path to the LoRA adapter.
```python
sampling_params = SamplingParams(
temperature=0,
max_tokens=256,
stop=["[/assistant]"]
)
prompts = [
"[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
"[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]
outputs = llm.generate(
prompts,
sampling_params,
lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
)
```
Check out [examples/multilora_inference.py](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)
for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
## Serving LoRA Adapters
LoRA-adapted models can also be served with the OpenAI-compatible vLLM server. To do so, we use
`--lora-modules {name}={path} {name}={path}` to specify each LoRA module when we kick off the server:
```bash
vllm serve meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
```
```{note}
The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
```
The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
with its base model:
```bash
curl localhost:8000/v1/models | jq .
{
"object": "list",
"data": [
{
"id": "meta-llama/Llama-2-7b-hf",
"object": "model",
...
},
{
"id": "sql-lora",
"object": "model",
...
}
]
}
```
Requests can specify the LoRA adapter as if it were any other model via the `model` request parameter. The requests will be
processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and potentially other
LoRA adapter requests if they were provided and `max_loras` is set high enough).
The following is an example request:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "sql-lora",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}' | jq
```
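The same request can also be issued with the OpenAI Python client, using the adapter name as the model:

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

completion = client.completions.create(
    model="sql-lora",  # the LoRA adapter registered via --lora-modules
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)
```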
## Dynamically serving LoRA Adapters
In addition to serving LoRA adapters at server startup, the vLLM server now supports dynamically loading and unloading
LoRA adapters at runtime through dedicated API endpoints. This feature can be particularly useful when the flexibility
to change models on-the-fly is needed.
Note: Enabling this feature in production environments is risky, as it allows users to take part in model adapter management.
To enable dynamic LoRA loading and unloading, ensure that the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING`
is set to `True`. When this option is enabled, the API server will log a warning to indicate that dynamic loading is active.
```bash
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
```
Loading a LoRA Adapter:
To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary
details of the adapter to be loaded. The request payload should include the name and path to the LoRA adapter.
Example request to load a LoRA adapter:
```bash
curl -X POST http://localhost:8000/v1/load_lora_adapter \
-H "Content-Type: application/json" \
-d '{
"lora_name": "sql_adapter",
"lora_path": "/path/to/sql-lora-adapter"
}'
```
Upon a successful request, the API will respond with a 200 OK status code. If an error occurs, such as if the adapter
cannot be found or loaded, an appropriate error message will be returned.
Unloading a LoRA Adapter:
To unload a LoRA adapter that has been previously loaded, send a POST request to the `/v1/unload_lora_adapter` endpoint
with the name or ID of the adapter to be unloaded.
Example request to unload a LoRA adapter:
```bash
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
-H "Content-Type: application/json" \
-d '{
"lora_name": "sql_adapter"
}'
```
## New format for `--lora-modules`
In the previous version, users would provide LoRA modules via the following format, either as a key-value pair or in JSON format. For example:
```bash
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
```
This would only include the `name` and `path` for each LoRA module, but did not provide a way to specify a `base_model_name`.
Now, you can specify a `base_model_name` alongside the `name` and `path` using JSON format. For example:
```bash
--lora-modules '{"name": "sql-lora", "path": "/path/to/lora", "base_model_name": "meta-llama/Llama-2-7b"}'
```
For backward compatibility, you can still use the old key-value format (`name=path`), but the `base_model_name` will remain unspecified in that case.
## LoRA model lineage in model card
The new format of `--lora-modules` is mainly to support the display of parent model information in the model card. Here is how the `/v1/models` response supports this:
- The `parent` field of the LoRA model `sql-lora` now links to its base model `meta-llama/Llama-2-7b-hf`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
- The `root` field points to the artifact location of the LoRA adapter.
```bash
$ curl http://localhost:8000/v1/models
{
"object": "list",
"data": [
{
"id": "meta-llama/Llama-2-7b-hf",
"object": "model",
"created": 1715644056,
"owned_by": "vllm",
"root": "~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/",
"parent": null,
"permission": [
{
.....
}
]
},
{
"id": "sql-lora",
"object": "model",
"created": 1715644056,
"owned_by": "vllm",
"root": "~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/",
"parent": meta-llama/Llama-2-7b-hf,
"permission": [
{
....
}
]
}
]
}
```


@@ -1,225 +0,0 @@
.. _lora:
LoRA Adapters
=============
This document shows you how to use `LoRA adapters <https://arxiv.org/abs/2106.09685>`_ with vLLM on top of a base model.
LoRA adapters can be used with any vLLM model that implements :class:`~vllm.model_executor.models.interfaces.SupportsLoRA`.
Adapters can be efficiently served on a per request basis with minimal overhead. First we download the adapter(s) and save
them locally with
.. code-block:: python
from huggingface_hub import snapshot_download
sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
Then we instantiate the base model and pass in the ``enable_lora=True`` flag:
.. code-block:: python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
We can now submit the prompts and call ``llm.generate`` with the ``lora_request`` parameter. The first parameter
of ``LoRARequest`` is a human identifiable name, the second parameter is a globally unique ID for the adapter and
the third parameter is the path to the LoRA adapter.
.. code-block:: python
sampling_params = SamplingParams(
temperature=0,
max_tokens=256,
stop=["[/assistant]"]
)
prompts = [
"[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
"[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]
outputs = llm.generate(
prompts,
sampling_params,
lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
)
Check out `examples/multilora_inference.py <https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py>`_
for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
Serving LoRA Adapters
---------------------
LoRA adapted models can also be served with the Open-AI compatible vLLM server. To do so, we use
``--lora-modules {name}={path} {name}={path}`` to specify each LoRA module when we kickoff the server:
.. code-block:: bash
vllm serve meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
.. note::
The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
The server entrypoint accepts all other LoRA configuration parameters (``max_loras``, ``max_lora_rank``, ``max_cpu_loras``,
etc.), which will apply to all forthcoming requests. Upon querying the ``/models`` endpoint, we should see our LoRA along
with its base model:
.. code-block:: bash
curl localhost:8000/v1/models | jq .
{
"object": "list",
"data": [
{
"id": "meta-llama/Llama-2-7b-hf",
"object": "model",
...
},
{
"id": "sql-lora",
"object": "model",
...
}
]
}
Requests can specify the LoRA adapter as if it were any other model via the ``model`` request parameter. The requests will be
processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and potentially other
LoRA adapter requests if they were provided and ``max_loras`` is set high enough).
The following is an example request
.. code-block:: bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "sql-lora",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}' | jq
Dynamically serving LoRA Adapters
---------------------------------
In addition to serving LoRA adapters at server startup, the vLLM server now supports dynamically loading and unloading
LoRA adapters at runtime through dedicated API endpoints. This feature can be particularly useful when the flexibility
to change models on-the-fly is needed.
Note: Enabling this feature in production environments is risky as user may participate model adapter management.
To enable dynamic LoRA loading and unloading, ensure that the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING`
is set to `True`. When this option is enabled, the API server will log a warning to indicate that dynamic loading is active.
.. code-block:: bash
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
Loading a LoRA Adapter:
To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary
details of the adapter to be loaded. The request payload should include the name and path to the LoRA adapter.
Example request to load a LoRA adapter:
.. code-block:: bash
curl -X POST http://localhost:8000/v1/load_lora_adapter \
-H "Content-Type: application/json" \
-d '{
"lora_name": "sql_adapter",
"lora_path": "/path/to/sql-lora-adapter"
}'
Upon a successful request, the API will respond with a 200 OK status code. If an error occurs, such as if the adapter
cannot be found or loaded, an appropriate error message will be returned.
Unloading a LoRA Adapter:
To unload a LoRA adapter that has been previously loaded, send a POST request to the `/v1/unload_lora_adapter` endpoint
with the name or ID of the adapter to be unloaded.
Example request to unload a LoRA adapter:
.. code-block:: bash
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
-H "Content-Type: application/json" \
-d '{
"lora_name": "sql_adapter"
}'
New format for `--lora-modules`
-------------------------------
In the previous version, users would provide LoRA modules via the following format, either as a key-value pair or in JSON format. For example:
.. code-block:: bash
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
This would only include the `name` and `path` for each LoRA module, but did not provide a way to specify a `base_model_name`.
Now, you can specify a base_model_name alongside the name and path using JSON format. For example:
.. code-block:: bash
--lora-modules '{"name": "sql-lora", "path": "/path/to/lora", "base_model_name": "meta-llama/Llama-2-7b"}'
To provide the backward compatibility support, you can still use the old key-value format (name=path), but the `base_model_name` will remain unspecified in that case.
Lora model lineage in model card
--------------------------------
The new format of `--lora-modules` is mainly to support the display of parent model information in the model card. Here's an explanation of how your current response supports this:
- The `parent` field of LoRA model `sql-lora` now links to its base model `meta-llama/Llama-2-7b-hf`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
- The `root` field points to the artifact location of the lora adapter.
.. code-block:: bash
$ curl http://localhost:8000/v1/models
{
"object": "list",
"data": [
{
"id": "meta-llama/Llama-2-7b-hf",
"object": "model",
"created": 1715644056,
"owned_by": "vllm",
"root": "~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/",
"parent": null,
"permission": [
{
.....
}
]
},
{
"id": "sql-lora",
"object": "model",
"created": 1715644056,
"owned_by": "vllm",
"root": "~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/",
"parent": meta-llama/Llama-2-7b-hf,
"permission": [
{
....
}
]
}
]
}


@@ -0,0 +1,486 @@
(multimodal-inputs)=
# Multimodal Inputs
This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM.
```{note}
We are actively iterating on multi-modal support. See [this RFC](https://github.com/vllm-project/vllm/issues/4194) for upcoming changes,
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
```
## Offline Inference
To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`:
- `prompt`: The prompt should follow the format that is documented on HuggingFace.
- `multi_modal_data`: This is a dictionary that follows the schema defined in {class}`vllm.multimodal.MultiModalDataDict`.
### Image
You can pass a single image to the {code}`'image'` field of the multi-modal dictionary, as shown in the following examples:
```python
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
# Load the image using PIL.Image
image = PIL.Image.open(...)
# Single prompt inference
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {"image": image},
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
# Batch inference
image_1 = PIL.Image.open(...)
image_2 = PIL.Image.open(...)
outputs = llm.generate(
[
{
"prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
"multi_modal_data": {"image": image_1},
},
{
"prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
"multi_modal_data": {"image": image_2},
}
]
)
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
```
A code example can be found in [examples/offline_inference_vision_language.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py).
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
```python
llm = LLM(
model="microsoft/Phi-3.5-vision-instruct",
trust_remote_code=True, # Required to load Phi-3.5-vision
max_model_len=4096, # Otherwise, it may not fit in smaller GPUs
limit_mm_per_prompt={"image": 2}, # The maximum number to accept
)
# Refer to the HuggingFace repo for the correct format to use
prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"
# Load the images using PIL.Image
image1 = PIL.Image.open(...)
image2 = PIL.Image.open(...)
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {
"image": [image1, image2]
},
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
```
A code example can be found in [examples/offline_inference_vision_language_multi_image.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py).
Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
```python
# Specify the maximum number of frames per video to be 4. This can be changed.
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
# Create the request payload.
video_frames = ... # load your video making sure it only has the number of frames specified earlier.
message = {
"role": "user",
"content": [
{"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
],
}
for i in range(len(video_frames)):
base64_image = encode_image(video_frames[i]) # base64 encoding.
new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
message["content"].append(new_image)
# Perform inference and log output.
outputs = llm.chat([message])
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
```
### Video
You can pass a list of NumPy arrays directly to the {code}`'video'` field of the multi-modal dictionary
instead of using multi-image input.
Please refer to [examples/offline_inference_vision_language.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py) for more details.
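For illustration, a minimal sketch of passing raw video frames to a video-capable model is shown below. The prompt format is an assumption for Qwen2-VL; always check the model's HuggingFace repo for the exact placeholder tokens:

```python
import numpy as np
from vllm import LLM

llm = LLM("Qwen/Qwen2-VL-2B-Instruct")

# Assumed Qwen2-VL prompt format; verify against the HuggingFace repo.
prompt = ("<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>"
          "Describe this video.<|im_end|>\n<|im_start|>assistant\n")

# A video is passed as an array of frames, e.g. shape (num_frames, height, width, 3).
video_frames = np.load(...)  # load your video frames as a NumPy array

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"video": video_frames},
})

for o in outputs:
    print(o.outputs[0].text)
```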
### Audio
You can pass a tuple {code}`(array, sampling_rate)` to the {code}`'audio'` field of the multi-modal dictionary.
Please refer to [examples/offline_inference_audio_language.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_audio_language.py) for more details.
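For illustration, a minimal sketch using Ultravox is shown below. `librosa.load` returns exactly the `(array, sampling_rate)` tuple expected here; the prompt construction is left to the model's documentation:

```python
import librosa
from vllm import LLM

llm = LLM(model="fixie-ai/ultravox-v0_3")

# Construct the prompt (including the audio placeholder) based on your model;
# refer to the HuggingFace repo for the correct format to use.
prompt = ...

# librosa returns (audio_array, sampling_rate), matching the expected schema.
audio, sampling_rate = librosa.load("audio.wav", sr=None)

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"audio": (audio, sampling_rate)},
})

for o in outputs:
    print(o.outputs[0].text)
```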
### Embedding
To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of shape {code}`(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
```python
# Inference with image embeddings as input
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
# Embeddings for single image
# torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
image_embeds = torch.load(...)
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {"image": image_embeds},
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
```
For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:
```python
# Construct the prompt based on your model
prompt = ...
# Embeddings for multiple images
# torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
image_embeds = torch.load(...)
# Qwen2-VL
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
mm_data = {
"image": {
"image_embeds": image_embeds,
# image_grid_thw is needed to calculate positional encoding.
"image_grid_thw": torch.load(...), # torch.Tensor of shape (1, 3),
}
}
# MiniCPM-V
llm = LLM("openbmb/MiniCPM-V-2_6", trust_remote_code=True, limit_mm_per_prompt={"image": 4})
mm_data = {
"image": {
"image_embeds": image_embeds,
# image_size_list is needed to calculate details of the sliced image.
"image_size_list": [image.size for image in images], # list of image sizes
}
}
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": mm_data,
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
```
## Online Inference
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
```{important}
A chat template is **required** to use Chat Completions API.
Although most models come with a chat template, for others you have to define one yourself.
The chat template can be inferred based on the documentation on the model's HuggingFace repo.
For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found [here](https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja).
```
### Image
Image input is supported according to [OpenAI Vision API](https://platform.openai.com/docs/guides/vision).
Here is a simple example using Phi-3.5-Vision.
First, launch the OpenAI-compatible server:
```bash
vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2
```
Then, you can use the OpenAI client as follows:
```python
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
# Single-image input inference
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
chat_response = client.chat.completions.create(
model="microsoft/Phi-3.5-vision-instruct",
messages=[{
"role": "user",
"content": [
# NOTE: The prompt formatting with the image token `<image>` is not needed
# since the prompt will be processed automatically by the API server.
{"type": "text", "text": "Whats in this image?"},
{"type": "image_url", "image_url": {"url": image_url}},
],
}],
)
print("Chat completion output:", chat_response.choices[0].message.content)
# Multi-image input inference
image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"
chat_response = client.chat.completions.create(
model="microsoft/Phi-3.5-vision-instruct",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "What are the animals in these images?"},
{"type": "image_url", "image_url": {"url": image_url_duck}},
{"type": "image_url", "image_url": {"url": image_url_lion}},
],
}],
)
print("Chat completion output:", chat_response.choices[0].message.content)
```
A full code example can be found in [examples/openai_chat_completion_client_for_multimodal.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py).
```{tip}
Loading from local file paths is also supported in vLLM: you can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
and pass the file path as `url` in the API request.
```
```{tip}
There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
```
````{note}
By default, the timeout for fetching images through HTTP URL is `5` seconds.
You can override this by setting the environment variable:
```console
$ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
```
````
### Video
Instead of {code}`image_url`, you can pass a video file via {code}`video_url`.
You can use [these tests](https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_video.py) as reference.
````{note}
By default, the timeout for fetching videos through HTTP URL is `30` seconds.
You can override this by setting the environment variable:
```console
$ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
```
````
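For illustration, a hedged sketch using the OpenAI client is shown below, assuming a video-capable model such as `Qwen/Qwen2-VL-2B-Instruct` has been launched with `vllm serve`; the video URL is a placeholder:

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

video_url = "http://example.com/video.mp4"  # placeholder URL

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-2B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this video."},
            {"type": "video_url", "video_url": {"url": video_url}},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)
```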
### Audio
Audio input is supported according to [OpenAI Audio API](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in).
Here is a simple example using Ultravox-v0.3.
First, launch the OpenAI-compatible server:
```bash
vllm serve fixie-ai/ultravox-v0_3
```
Then, you can use the OpenAI client as follows:
```python
import base64
import requests
from openai import OpenAI
from vllm.assets.audio import AudioAsset
def encode_base64_content_from_url(content_url: str) -> str:
"""Encode a content retrieved from a remote url to base64 format."""
with requests.get(content_url) as response:
response.raise_for_status()
result = base64.b64encode(response.content).decode('utf-8')
return result
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
# The served model name; must match the model passed to `vllm serve` above.
model = "fixie-ai/ultravox-v0_3"
# Any format supported by librosa is supported
audio_url = AudioAsset("winning_call").url
audio_base64 = encode_base64_content_from_url(audio_url)
chat_completion_from_base64 = client.chat.completions.create(
messages=[{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this audio?"
},
{
"type": "input_audio",
"input_audio": {
"data": audio_base64,
"format": "wav"
},
},
],
}],
model=model,
max_completion_tokens=64,
)
result = chat_completion_from_base64.choices[0].message.content
print("Chat completion output from input audio:", result)
```
Alternatively, you can pass {code}`audio_url`, which is the audio counterpart of {code}`image_url` for image input:
```python
chat_completion_from_url = client.chat.completions.create(
messages=[{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this audio?"
},
{
"type": "audio_url",
"audio_url": {
"url": audio_url
},
},
],
}],
model=model,
max_completion_tokens=64,
)
result = chat_completion_from_url.choices[0].message.content
print("Chat completion output from audio url:", result)
```
A full code example can be found in [examples/openai_chat_completion_client_for_multimodal.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py).
````{note}
By default, the timeout for fetching audios through HTTP URL is `10` seconds.
You can override this by setting the environment variable:
```console
$ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
```
````
### Embedding
vLLM's Embeddings API is a superset of OpenAI's [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings),
where a list of chat `messages` can be passed instead of batched `inputs`. This enables multi-modal inputs to be passed to embedding models.
```{tip}
The schema of `messages` is exactly the same as in Chat Completions API.
You can refer to the above tutorials for more details on how to pass each type of multi-modal data.
```
Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images.
Refer to the examples below for illustration.
Here is an end-to-end example using VLM2Vec. To serve the model:
```bash
vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
--trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
```
```{important}
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model,
and can be found [here](https://github.com/vllm-project/vllm/blob/main/examples/template_vlm2vec.jinja).
```
Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level `requests` library:
```python
import requests
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
response = requests.post(
"http://localhost:8000/v1/embeddings",
json={
"model": "TIGER-Lab/VLM2Vec-Full",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": image_url}},
{"type": "text", "text": "Represent the given image."},
],
}],
"encoding_format": "float",
},
)
response.raise_for_status()
response_json = response.json()
print("Embedding output:", response_json["data"][0]["embedding"])
```
Below is another example, this time using the `MrLight/dse-qwen2-2b-mrl-v1` model.
```bash
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
--trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
```
```{important}
Like with VLM2Vec, we have to explicitly pass `--task embed`.
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by [this custom chat template](https://github.com/vllm-project/vllm/blob/main/examples/template_dse_qwen2_vl.jinja).
```
```{important}
Also important: `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
example below for details.
```
A full code example can be found in [examples/openai_chat_embedding_client_for_multimodal.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_embedding_client_for_multimodal.py).


@@ -1,492 +0,0 @@
.. _multimodal_inputs:
Multimodal Inputs
=================
This page teaches you how to pass multi-modal inputs to :ref:`multi-modal models <supported_mm_models>` in vLLM.
.. note::
We are actively iterating on multi-modal support. See `this RFC <https://github.com/vllm-project/vllm/issues/4194>`_ for upcoming changes,
and `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
Offline Inference
-----------------
To input multi-modal data, follow this schema in :class:`vllm.inputs.PromptType`:
* ``prompt``: The prompt should follow the format that is documented on HuggingFace.
* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.
Image
^^^^^
You can pass a single image to the :code:`'image'` field of the multi-modal dictionary, as shown in the following examples:
.. code-block:: python
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
# Load the image using PIL.Image
image = PIL.Image.open(...)
# Single prompt inference
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {"image": image},
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
# Batch inference
image_1 = PIL.Image.open(...)
image_2 = PIL.Image.open(...)
outputs = llm.generate(
[
{
"prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
"multi_modal_data": {"image": image_1},
},
{
"prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
"multi_modal_data": {"image": image_2},
}
]
)
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
A code example can be found in `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_.
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
.. code-block:: python
llm = LLM(
model="microsoft/Phi-3.5-vision-instruct",
trust_remote_code=True, # Required to load Phi-3.5-vision
max_model_len=4096, # Otherwise, it may not fit in smaller GPUs
limit_mm_per_prompt={"image": 2}, # The maximum number to accept
)
# Refer to the HuggingFace repo for the correct format to use
prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"
# Load the images using PIL.Image
image1 = PIL.Image.open(...)
image2 = PIL.Image.open(...)
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {
"image": [image1, image2]
},
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
A code example can be found in `examples/offline_inference_vision_language_multi_image.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py>`_.
Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL <https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct>`_ as it supports videos:
.. code-block:: python
# Specify the maximum number of frames per video to be 4. This can be changed.
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
# Create the request payload.
video_frames = ... # load your video making sure it only has the number of frames specified earlier.
message = {
"role": "user",
"content": [
{"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
],
}
for i in range(len(video_frames)):
base64_image = encode_image(video_frames[i]) # base64 encoding.
new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
message["content"].append(new_image)
# Perform inference and log output.
outputs = llm.chat([message])
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
Video
^^^^^
You can pass a list of NumPy arrays directly to the :code:`'video'` field of the multi-modal dictionary
instead of using multi-image input.
Please refer to `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_ for more details.
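As a minimal sketch (the model name, prompt, and ``load_video`` helper here are assumptions for illustration; follow the video prompt format documented on the model's HuggingFace repo):
.. code-block:: python
import numpy as np
# Qwen2-VL is used here only as an example of a video-capable model.
llm = LLM("Qwen/Qwen2-VL-2B-Instruct")
# Refer to the HuggingFace repo for the correct video prompt format
prompt = ...
# A video is passed as a list of RGB frames, each a NumPy array of shape (height, width, 3)
video_frames = [np.asarray(frame) for frame in load_video(...)]  # load_video is a hypothetical helper
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"video": video_frames},
})
for o in outputs:
    print(o.outputs[0].text)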
Audio
^^^^^
You can pass a tuple :code:`(array, sampling_rate)` to the :code:`'audio'` field of the multi-modal dictionary.
Please refer to `examples/offline_inference_audio_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_audio_language.py>`_ for more details.
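For instance, a minimal sketch (the model name is an assumption; any audio format supported by librosa works):
.. code-block:: python
import librosa
# Ultravox is used here only as an example of an audio-capable model.
llm = LLM("fixie-ai/ultravox-v0_3")
# Refer to the HuggingFace repo for the correct audio prompt format
prompt = ...
# Load the audio as a 1-D float array together with its sampling rate
audio, sampling_rate = librosa.load("speech.wav", sr=None)
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"audio": (audio, sampling_rate)},
})
for o in outputs:
    print(o.outputs[0].text)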
Embedding
^^^^^^^^^
To input pre-computed embeddings for a given modality (i.e. image, video, or audio) directly to the language model,
pass a tensor of shape :code:`(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
.. code-block:: python
# Inference with image embeddings as input
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
# Embeddings for single image
# torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
image_embeds = torch.load(...)
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {"image": image_embeds},
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:
.. code-block:: python
# Construct the prompt based on your model
prompt = ...
# Embeddings for multiple images
# torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
image_embeds = torch.load(...)
# Qwen2-VL
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
mm_data = {
"image": {
"image_embeds": image_embeds,
# image_grid_thw is needed to calculate positional encoding.
"image_grid_thw": torch.load(...), # torch.Tensor of shape (1, 3),
}
}
# MiniCPM-V
llm = LLM("openbmb/MiniCPM-V-2_6", trust_remote_code=True, limit_mm_per_prompt={"image": 4})
mm_data = {
"image": {
"image_embeds": image_embeds,
# image_size_list is needed to calculate details of the sliced image.
"image_size_list": [image.size for image in images], # list of image sizes
}
}
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": mm_data,
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
Online Inference
----------------
Our OpenAI-compatible server accepts multi-modal data via the `Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`_.
.. important::
A chat template is **required** to use Chat Completions API.
Although most models come with a chat template, for others you have to define one yourself.
The chat template can be inferred based on the documentation on the model's HuggingFace repo.
For example, LLaVA-1.5 (``llava-hf/llava-1.5-7b-hf``) requires a chat template that can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`__.
Image
^^^^^
Image input is supported according to `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.
Here is a simple example using Phi-3.5-Vision.
First, launch the OpenAI-compatible server:
.. code-block:: bash
vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2
Then, you can use the OpenAI client as follows:
.. code-block:: python
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
# Single-image input inference
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
chat_response = client.chat.completions.create(
model="microsoft/Phi-3.5-vision-instruct",
messages=[{
"role": "user",
"content": [
# NOTE: The prompt formatting with the image token `<image>` is not needed
# since the prompt will be processed automatically by the API server.
{"type": "text", "text": "Whats in this image?"},
{"type": "image_url", "image_url": {"url": image_url}},
],
}],
)
print("Chat completion output:", chat_response.choices[0].message.content)
# Multi-image input inference
image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"
chat_response = client.chat.completions.create(
model="microsoft/Phi-3.5-vision-instruct",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "What are the animals in these images?"},
{"type": "image_url", "image_url": {"url": image_url_duck}},
{"type": "image_url", "image_url": {"url": image_url_lion}},
],
}],
)
print("Chat completion output:", chat_response.choices[0].message.content)
A full code example can be found in `examples/openai_chat_completion_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py>`_.
.. tip::
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via ``--allowed-local-media-path`` when launching the API server/engine,
and pass the file path as ``url`` in the API request.
.. tip::
There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
.. note::
By default, the timeout for fetching images through HTTP URL is ``5`` seconds.
You can override this by setting the environment variable:
.. code-block:: console
$ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
Video
^^^^^
Instead of :code:`image_url`, you can pass a video file via :code:`video_url`.
You can use `these tests <https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_video.py>`_ as reference.
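For illustration, a sketch that mirrors the image example above (the model and video URL are placeholders; the message schema is the same as for images, with ``video_url`` in place of ``image_url``):
.. code-block:: python
# Assumes the server was launched with a video-capable model, e.g.:
#   vllm serve Qwen/Qwen2-VL-2B-Instruct
video_url = "http://example.com/sample_video.mp4"  # placeholder URL
chat_response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-2B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this video."},
            {"type": "video_url", "video_url": {"url": video_url}},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)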
.. note::
By default, the timeout for fetching videos through HTTP URL is ``30`` seconds.
You can override this by setting the environment variable:
.. code-block:: console
$ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
Audio
^^^^^
Audio input is supported according to `OpenAI Audio API <https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in>`_.
Here is a simple example using Ultravox-v0.3.
First, launch the OpenAI-compatible server:
.. code-block:: bash
vllm serve fixie-ai/ultravox-v0_3
Then, you can use the OpenAI client as follows:
.. code-block:: python
import base64
import requests
from openai import OpenAI
from vllm.assets.audio import AudioAsset
def encode_base64_content_from_url(content_url: str) -> str:
"""Encode a content retrieved from a remote url to base64 format."""
with requests.get(content_url) as response:
response.raise_for_status()
result = base64.b64encode(response.content).decode('utf-8')
return result
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
# Any format supported by librosa is supported
audio_url = AudioAsset("winning_call").url
audio_base64 = encode_base64_content_from_url(audio_url)
chat_completion_from_base64 = client.chat.completions.create(
messages=[{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this audio?"
},
{
"type": "input_audio",
"input_audio": {
"data": audio_base64,
"format": "wav"
},
},
],
}],
model=model,
max_completion_tokens=64,
)
result = chat_completion_from_base64.choices[0].message.content
print("Chat completion output from input audio:", result)
Alternatively, you can pass :code:`audio_url`, which is the audio counterpart of :code:`image_url` for image input:
.. code-block:: python
chat_completion_from_url = client.chat.completions.create(
messages=[{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this audio?"
},
{
"type": "audio_url",
"audio_url": {
"url": audio_url
},
},
],
}],
model=model,
max_completion_tokens=64,
)
result = chat_completion_from_url.choices[0].message.content
print("Chat completion output from audio url:", result)
A full code example can be found in `examples/openai_chat_completion_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py>`_.
.. note::
By default, the timeout for fetching audios through HTTP URL is ``10`` seconds.
You can override this by setting the environment variable:
.. code-block:: console
$ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
Embedding
^^^^^^^^^
vLLM's Embeddings API is a superset of OpenAI's `Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`_,
where a list of chat ``messages`` can be passed instead of batched ``inputs``. This enables multi-modal inputs to be passed to embedding models.
.. tip::
The schema of ``messages`` is exactly the same as in Chat Completions API.
You can refer to the above tutorials for more details on how to pass each type of multi-modal data.
Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images.
Refer to the examples below for illustration.
Here is an end-to-end example using VLM2Vec. To serve the model:
.. code-block:: bash
vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
--trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
.. important::
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass ``--task embed``
to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model,
and can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_vlm2vec.jinja>`__.
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level ``requests`` library:
.. code-block:: python
import requests
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
response = requests.post(
"http://localhost:8000/v1/embeddings",
json={
"model": "TIGER-Lab/VLM2Vec-Full",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": image_url}},
{"type": "text", "text": "Represent the given image."},
],
}],
"encoding_format": "float",
},
)
response.raise_for_status()
response_json = response.json()
print("Embedding output:", response_json["data"][0]["embedding"])
Below is another example, this time using the ``MrLight/dse-qwen2-2b-mrl-v1`` model.
.. code-block:: bash
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
--trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
.. important::
Like with VLM2Vec, we have to explicitly pass ``--task embed``.
Additionally, ``MrLight/dse-qwen2-2b-mrl-v1`` requires an EOS token for embeddings, which is handled
by `this custom chat template <https://github.com/vllm-project/vllm/blob/main/examples/template_dse_qwen2_vl.jinja>`__.
.. important::
``MrLight/dse-qwen2-2b-mrl-v1`` also requires a placeholder image of the minimum image size for text-only query embeddings. See the full code
example below for details.
A full code example can be found in `examples/openai_chat_embedding_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_embedding_client_for_multimodal.py>`_.

View File

@@ -1,16 +1,15 @@
(performance)=
# Performance and Tuning
## Preemption
Due to the auto-regressive nature of the transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
available again. When this occurs, the following warning is printed:
```
WARNING 05-09 00:49:33 scheduler.py:1057] Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
```
While this mechanism ensures system robustness, preemption and recomputation can adversely affect end-to-end latency.
@@ -22,44 +21,44 @@ If you frequently encounter preemptions from the vLLM engine, consider the follo
You can also monitor the number of preemption requests through the Prometheus metrics exposed by vLLM. Additionally, you can log the cumulative number of preemption requests by setting `disable_log_stats=False`.
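For example, a minimal sketch of keeping engine stats logging enabled in the offline API (the model name is just a placeholder):
```python
from vllm import LLM

# Keep periodic engine stats, including cumulative preemption counts, in the logs.
llm = LLM(model="meta-llama/Llama-2-7b-hf", disable_log_stats=False)
```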
(chunked-prefill)=
## Chunked Prefill
vLLM supports an experimental feature called chunked prefill. Chunked prefill splits large prefills into smaller chunks and batches them together with decode requests.
You can enable the feature by specifying `--enable-chunked-prefill` in the command line or setting `enable_chunked_prefill=True` in the LLM constructor.
```python
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True)
# Set max_num_batched_tokens to tune performance.
# NOTE: 512 is the default max_num_batched_tokens for chunked prefill.
# llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True, max_num_batched_tokens=512)
```
By default, the vLLM scheduler prioritizes prefills and doesn't batch prefill and decode requests together.
This policy optimizes TTFT (time to first token), but incurs slower ITL (inter-token latency) and inefficient GPU utilization.
Once chunked prefill is enabled, the policy changes to prioritize decode requests.
It batches all pending decode requests before scheduling any prefill.
When there is available token budget (`max_num_batched_tokens`), it schedules pending prefills.
If the last pending prefill request cannot fit into `max_num_batched_tokens`, it is chunked.
This policy has two benefits:
- It improves ITL and generation decode because decode requests are prioritized.
- It helps achieve better GPU utilization by placing compute-bound (prefill) and memory-bound (decode) requests in the same batch.
You can tune the performance by changing `max_num_batched_tokens`.
By default, it is set to 512, which has the best ITL on A100 in the initial benchmark (Llama 70B and Mixtral 8x22B).
Smaller `max_num_batched_tokens` achieves better ITL because there are fewer prefills interrupting decodes.
Higher `max_num_batched_tokens` achieves better TTFT as you can put more prefill into the batch.
- If `max_num_batched_tokens` is the same as `max_model_len`, that's almost equivalent to the default scheduling policy (except that it still prioritizes decodes).
- Note that the default value (512) of `max_num_batched_tokens` is optimized for ITL, and it may have lower throughput than the default scheduler.
We recommend you set `max_num_batched_tokens > 2048` for throughput.
See the related papers for more details (<https://arxiv.org/pdf/2401.08671> or <https://arxiv.org/pdf/2308.16369>).
Please try out this feature and let us know your feedback via GitHub issues!

View File

@@ -0,0 +1,205 @@
(spec-decode)=
# Speculative decoding
```{warning}
Please note that speculative decoding in vLLM is not yet optimized and does
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters. The work
to optimize it is ongoing and can be followed in [this issue](https://github.com/vllm-project/vllm/issues/4630).
```
```{warning}
Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
```
This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
## Speculating with a draft model
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
```python
from vllm import LLM, SamplingParams
prompts = [
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="facebook/opt-6.7b",
tensor_parallel_size=1,
speculative_model="facebook/opt-125m",
num_speculative_tokens=5,
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
To perform the same with an online mode launch the server:
```bash
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
--seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
--num_speculative_tokens 5 --gpu_memory_utilization 0.8
```
Then use a client:
```python
from openai import OpenAI
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
# defaults to os.environ.get("OPENAI_API_KEY")
api_key=openai_api_key,
base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
# Completion API
stream = False
completion = client.completions.create(
model=model,
prompt="The future of AI is",
echo=False,
n=1,
stream=stream,
)
print("Completion results:")
if stream:
for c in completion:
print(c)
else:
print(completion)
```
## Speculating by matching n-grams in the prompt
The following code configures vLLM to use speculative decoding where proposals are generated by
matching n-grams in the prompt. For more information read [this thread](https://x.com/joao_gante/status/1747322413006643259).
```python
from vllm import LLM, SamplingParams
prompts = [
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="facebook/opt-6.7b",
tensor_parallel_size=1,
speculative_model="[ngram]",
num_speculative_tokens=5,
ngram_prompt_lookup_max=4,
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
## Speculating using MLP speculators
The following code configures vLLM to use speculative decoding where proposals are generated by
draft models that condition draft predictions on both context vectors and sampled tokens.
For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
[this technical report](https://arxiv.org/abs/2404.19124).
```python
from vllm import LLM, SamplingParams
prompts = [
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
tensor_parallel_size=4,
speculative_model="ibm-fms/llama3-70b-accelerator",
speculative_draft_tensor_parallel_size=1,
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Note that these speculative models currently need to be run without tensor parallelism, although
it is possible to run the main model using tensor parallelism (see example above). Since the
speculative models are relatively small, we still see significant speedups. However, this
limitation will be fixed in a future release.
A variety of speculative models of this type are available on HF hub:
- [llama-13b-accelerator](https://huggingface.co/ibm-fms/llama-13b-accelerator)
- [llama3-8b-accelerator](https://huggingface.co/ibm-fms/llama3-8b-accelerator)
- [codellama-34b-accelerator](https://huggingface.co/ibm-fms/codellama-34b-accelerator)
- [llama2-70b-accelerator](https://huggingface.co/ibm-fms/llama2-70b-accelerator)
- [llama3-70b-accelerator](https://huggingface.co/ibm-fms/llama3-70b-accelerator)
- [granite-3b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-3b-code-instruct-accelerator)
- [granite-8b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-8b-code-instruct-accelerator)
- [granite-7b-instruct-accelerator](https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator)
- [granite-20b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator)
## Lossless guarantees of Speculative Decoding
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
speculative decoding, breaking down the guarantees into three key areas:
1. **Theoretical Losslessness**
\- Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might
cause slight variations in output distributions, as discussed
in [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/pdf/2302.01318)
2. **Algorithmic Losslessness**
\- vLLM's implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:
> - **Rejection Sampler Convergence**: Ensures that samples from vLLM's rejection sampler align with the target
> distribution. [View Test Code](https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252)
> - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
> without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
> provides a lossless guarantee. Almost all of the tests in [this directory](https://github.com/vllm-project/vllm/tree/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e)
> verify this property using [this assertion implementation](https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291)
3. **vLLM Logprob Stability**
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the {ref}`FAQs <faq>`.
**Conclusion**
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to the following factors:
- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.
**Mitigation Strategies**
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the {ref}`FAQs <faq>`.
## Resources for vLLM contributors
- [A Hacker's Guide to Speculative Decoding in vLLM](https://www.youtube.com/watch?v=9wNAgpX6z_4)
- [What is Lookahead Scheduling in vLLM?](https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a)
- [Information on batch expansion](https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8)
- [Dynamic speculative decoding](https://github.com/vllm-project/vllm/issues/4565)

View File

@@ -1,210 +0,0 @@
.. _spec_decode:
Speculative decoding
====================
.. warning::
Please note that speculative decoding in vLLM is not yet optimized and does
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters. The work
to optimize it is ongoing and can be followed in `this issue. <https://github.com/vllm-project/vllm/issues/4630>`_
.. warning::
Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
This document shows how to use `Speculative Decoding <https://x.com/karpathy/status/1697318534555336961>`_ with vLLM.
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
Speculating with a draft model
------------------------------
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
.. code-block:: python
from vllm import LLM, SamplingParams
prompts = [
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="facebook/opt-6.7b",
tensor_parallel_size=1,
speculative_model="facebook/opt-125m",
num_speculative_tokens=5,
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
To perform the same with an online mode launch the server:
.. code-block:: bash
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
--seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
--num_speculative_tokens 5 --gpu_memory_utilization 0.8
Then use a client:
.. code-block:: python
from openai import OpenAI
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
# defaults to os.environ.get("OPENAI_API_KEY")
api_key=openai_api_key,
base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
# Completion API
stream = False
completion = client.completions.create(
model=model,
prompt="The future of AI is",
echo=False,
n=1,
stream=stream,
)
print("Completion results:")
if stream:
for c in completion:
print(c)
else:
print(completion)
Speculating by matching n-grams in the prompt
---------------------------------------------
The following code configures vLLM to use speculative decoding where proposals are generated by
matching n-grams in the prompt. For more information read `this thread. <https://x.com/joao_gante/status/1747322413006643259>`_
.. code-block:: python
from vllm import LLM, SamplingParams
prompts = [
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="facebook/opt-6.7b",
tensor_parallel_size=1,
speculative_model="[ngram]",
num_speculative_tokens=5,
ngram_prompt_lookup_max=4,
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Speculating using MLP speculators
---------------------------------
The following code configures vLLM to use speculative decoding where proposals are generated by
draft models that condition draft predictions on both context vectors and sampled tokens.
For more information see `this blog <https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/>`_ or
`this technical report <https://arxiv.org/abs/2404.19124>`_.
.. code-block:: python
from vllm import LLM, SamplingParams
prompts = [
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
tensor_parallel_size=4,
speculative_model="ibm-fms/llama3-70b-accelerator",
speculative_draft_tensor_parallel_size=1,
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Note that these speculative models currently need to be run without tensor parallelism, although
it is possible to run the main model using tensor parallelism (see example above). Since the
speculative models are relatively small, we still see significant speedups. However, this
limitation will be fixed in a future release.
A variety of speculative models of this type are available on HF hub:
* `llama-13b-accelerator <https://huggingface.co/ibm-fms/llama-13b-accelerator>`_
* `llama3-8b-accelerator <https://huggingface.co/ibm-fms/llama3-8b-accelerator>`_
* `codellama-34b-accelerator <https://huggingface.co/ibm-fms/codellama-34b-accelerator>`_
* `llama2-70b-accelerator <https://huggingface.co/ibm-fms/llama2-70b-accelerator>`_
* `llama3-70b-accelerator <https://huggingface.co/ibm-fms/llama3-70b-accelerator>`_
* `granite-3b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-3b-code-instruct-accelerator>`_
* `granite-8b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-8b-code-instruct-accelerator>`_
* `granite-7b-instruct-accelerator <https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator>`_
* `granite-20b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator>`_
Lossless guarantees of Speculative Decoding
-------------------------------------------
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
speculative decoding, breaking down the guarantees into three key areas:
1. **Theoretical Losslessness**
- Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might
cause slight variations in output distributions, as discussed
in `Accelerating Large Language Model Decoding with Speculative Sampling <https://arxiv.org/pdf/2302.01318>`_
2. **Algorithmic Losslessness**
- vLLM's implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:
- **Rejection Sampler Convergence**: Ensures that samples from vLLM's rejection sampler align with the target
distribution. `View Test Code <https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252>`_
- **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
provides a lossless guarantee. Almost all of the tests in `this directory <https://github.com/vllm-project/vllm/tree/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e>`_
verify this property using `this assertion implementation <https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291>`_
3. **vLLM Logprob Stability**
- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the :ref:`FAQs <faq>`.
**Conclusion**
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to the following factors:
- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.
**Mitigation Strategies**
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the :ref:`FAQs <faq>`.
Resources for vLLM contributors
-------------------------------
* `A Hacker's Guide to Speculative Decoding in vLLM <https://www.youtube.com/watch?v=9wNAgpX6z_4>`_
* `What is Lookahead Scheduling in vLLM? <https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a>`_
* `Information on batch expansion <https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8>`_
* `Dynamic speculative decoding <https://github.com/vllm-project/vllm/issues/4565>`_

View File

@@ -0,0 +1,260 @@
(structured-outputs)=
# Structured Outputs
vLLM supports the generation of structured outputs using [outlines](https://github.com/dottxt-ai/outlines) or [lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer) as backends for guided decoding.
This document shows you some examples of the different options that are available to generate structured outputs.
## Online Inference (OpenAI API)
You can generate structured outputs using OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs.
The following parameters are supported, which must be added as extra parameters:
- `guided_choice`: the output will be exactly one of the choices.
- `guided_regex`: the output will follow the regex pattern.
- `guided_json`: the output will follow the JSON schema.
- `guided_grammar`: the output will follow the context free grammar.
- `guided_whitespace_pattern`: used to override the default whitespace pattern for guided json decoding.
- `guided_decoding_backend`: used to select the guided decoding backend to use.
You can see the complete list of supported parameters on the [OpenAI Compatible Server](../serving/openai_compatible_server.md) page.
Now let's see an example for each of these cases, starting with `guided_choice`, as it's the easiest one:
```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="-",
)
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
],
extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```
The next example shows how to use the `guided_regex`. The idea is to generate an email address, given a simple regex template:
```python
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{
"role": "user",
"content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
}
],
extra_body={"guided_regex": "\w+@\w+\.com\n", "stop": ["\n"]},
)
print(completion.choices[0].message.content)
```
One of the most relevant features in structured text generation is the option to generate valid JSON with pre-defined fields and formats.
For this we can use the `guided_json` parameter in two different ways:
- Using a [JSON Schema](https://json-schema.org/) directly
- Defining a [Pydantic model](https://docs.pydantic.dev/latest/) and then extracting the JSON Schema from it (which is normally the easier option).
The next example shows how to use the `guided_json` parameter with a Pydantic model:
```python
from pydantic import BaseModel
from enum import Enum
class CarType(str, Enum):
sedan = "sedan"
suv = "SUV"
truck = "Truck"
coupe = "Coupe"
class CarDescription(BaseModel):
brand: str
model: str
car_type: CarType
json_schema = CarDescription.model_json_schema()
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{
"role": "user",
"content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
}
],
extra_body={"guided_json": json_schema},
)
print(completion.choices[0].message.content)
```
```{tip}
While not strictly necessary, it's usually better to indicate in the prompt that JSON needs to be generated and to specify which fields the LLM should fill and how.
This can notably improve the results in most cases.
```
Finally we have `guided_grammar`, which is probably the most difficult one to use but is really powerful, as it allows us to define complete languages like SQL queries.
It works by using a context-free EBNF grammar, which we can use, for example, to define a specific format of simplified SQL queries, as in the example below:
```python
simplified_sql_grammar = """
?start: select_statement
?select_statement: "SELECT " column_list " FROM " table_name
?column_list: column_name ("," column_name)*
?table_name: identifier
?column_name: identifier
?identifier: /[a-zA-Z_][a-zA-Z0-9_]*/
"""
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{
"role": "user",
"content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
}
],
extra_body={"guided_grammar": simplified_sql_grammar},
)
print(completion.choices[0].message.content)
```
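Beyond the constraint parameters shown above, a request can also pin the decoding backend via `guided_decoding_backend`. Here is a brief sketch (reusing the `client` and `json_schema` defined earlier; the backend value is just an example):
```python
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
        }
    ],
    extra_body={
        "guided_json": json_schema,
        "guided_decoding_backend": "outlines",  # or "lm-format-enforcer"
    },
)
print(completion.choices[0].message.content)
```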
The complete code of the examples can be found on [examples/openai_chat_completion_structured_outputs.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_structured_outputs.py).
## Experimental Automatic Parsing (OpenAI API)
This section covers the OpenAI beta wrapper over the `client.chat.completions.create()` method that provides richer integrations with Python specific types.
At the time of writing (`openai==1.54.4`), this is a "beta" feature in the OpenAI client library. Code reference can be found [here](https://github.com/openai/openai-python/blob/52357cff50bee57ef442e94d78a0de38b4173fc2/src/openai/resources/beta/chat/completions.py#L100-L104).
For the following examples, vLLM was set up using `vllm serve meta-llama/Llama-3.1-8B-Instruct`.
Here is a simple example demonstrating how to get structured output using Pydantic models:
```python
from pydantic import BaseModel
from openai import OpenAI
class Info(BaseModel):
name: str
age: int
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "My name is Cameron, I'm 28. What's my name and age?"},
],
response_format=Info,
extra_body=dict(guided_decoding_backend="outlines"),
)
message = completion.choices[0].message
print(message)
assert message.parsed
print("Name:", message.parsed.name)
print("Age:", message.parsed.age)
```
Output:
```console
ParsedChatCompletionMessage[Info](content='{"name": "Cameron", "age": 28}', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=Info(name='Cameron', age=28))
Name: Cameron
Age: 28
```
Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:
```python
from typing import List
from pydantic import BaseModel
from openai import OpenAI
class Step(BaseModel):
explanation: str
output: str
class MathResponse(BaseModel):
steps: List[Step]
final_answer: str
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful expert math tutor."},
{"role": "user", "content": "Solve 8x + 31 = 2."},
],
response_format=MathResponse,
extra_body=dict(guided_decoding_backend="outlines"),
)
message = completion.choices[0].message
print(message)
assert message.parsed
for i, step in enumerate(message.parsed.steps):
print(f"Step #{i}:", step)
print("Answer:", message.parsed.final_answer)
```
Output:
```console
ParsedChatCompletionMessage[MathResponse](content='{ "steps": [{ "explanation": "First, let\'s isolate the term with the variable \'x\'. To do this, we\'ll subtract 31 from both sides of the equation.", "output": "8x + 31 - 31 = 2 - 31"}, { "explanation": "By subtracting 31 from both sides, we simplify the equation to 8x = -29.", "output": "8x = -29"}, { "explanation": "Next, let\'s isolate \'x\' by dividing both sides of the equation by 8.", "output": "8x / 8 = -29 / 8"}], "final_answer": "x = -29/8" }', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=MathResponse(steps=[Step(explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation.", output='8x + 31 - 31 = 2 - 31'), Step(explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.', output='8x = -29'), Step(explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8.", output='8x / 8 = -29 / 8')], final_answer='x = -29/8'))
Step #0: explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation." output='8x + 31 - 31 = 2 - 31'
Step #1: explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.' output='8x = -29'
Step #2: explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8." output='8x / 8 = -29 / 8'
Answer: x = -29/8
```
## Offline Inference
Offline inference allows for the same types of guided decoding.
To use it, we'll need to configure guided decoding using the `GuidedDecodingParams` class inside `SamplingParams`.
The main available options inside `GuidedDecodingParams` are:
- `json`
- `regex`
- `choice`
- `grammar`
- `backend`
- `whitespace_pattern`
These parameters can be used in the same way as the parameters from the Online Inference examples above.
One example for the usage of the `choice` parameter is shown below:
```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams
llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"])
sampling_params = SamplingParams(guided_decoding=guided_decoding_params)
outputs = llm.generate(
prompts="Classify this sentiment: vLLM is wonderful!",
sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```
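The other options follow the same pattern. For instance, a minimal sketch using the `regex` option with the same `llm` instance (the prompt and pattern mirror the online regex example above):
```python
guided_decoding_params = GuidedDecodingParams(regex=r"\w+@\w+\.com\n")
sampling_params = SamplingParams(guided_decoding=guided_decoding_params, stop=["\n"])
outputs = llm.generate(
    prompts="Generate an example email address for Alan Turing, who works in Enigma: ",
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```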
A complete example with all options can be found in [examples/offline_inference_structured_outputs.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_structured_outputs.py).

View File

@@ -1,267 +0,0 @@
.. _structured_outputs:
Structured Outputs
==================
vLLM supports the generation of structured outputs using `outlines <https://github.com/dottxt-ai/outlines>`_ or `lm-format-enforcer <https://github.com/noamgat/lm-format-enforcer>`_ as backends for guided decoding.
This document shows you some examples of the different options that are available to generate structured outputs.
Online Inference (OpenAI API)
-----------------------------
You can generate structured outputs using OpenAI's `Completions <https://platform.openai.com/docs/api-reference/completions>`_ and `Chat <https://platform.openai.com/docs/api-reference/chat>`_ APIs.
The following parameters are supported, which must be added as extra parameters:
- ``guided_choice``: the output will be exactly one of the choices.
- ``guided_regex``: the output will follow the regex pattern.
- ``guided_json``: the output will follow the JSON schema.
- ``guided_grammar``: the output will follow the context free grammar.
- ``guided_whitespace_pattern``: used to override the default whitespace pattern for guided json decoding.
- ``guided_decoding_backend``: used to select the guided decoding backend to use.
You can see the complete list of supported parameters on the `OpenAI Compatible Server </../serving/openai_compatible_server.html>`_ page.
Now let's see an example for each of these cases, starting with ``guided_choice``, as it's the easiest one:
.. code-block:: python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="-",
)
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
],
extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
The next example shows how to use the ``guided_regex``. The idea is to generate an email address, given a simple regex template:
.. code-block:: python
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{
"role": "user",
"content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
}
],
extra_body={"guided_regex": "\w+@\w+\.com\n", "stop": ["\n"]},
)
print(completion.choices[0].message.content)
One of the most relevant features in structured text generation is the option to generate valid JSON with pre-defined fields and formats.
For this we can use the ``guided_json`` parameter in two different ways:
- Using a `JSON Schema <https://json-schema.org/>`_ directly
- Defining a `Pydantic model <https://docs.pydantic.dev/latest/>`_ and then extracting the JSON Schema from it (which is normally the easier option).
The next example shows how to use the ``guided_json`` parameter with a Pydantic model:
.. code-block:: python
from pydantic import BaseModel
from enum import Enum
class CarType(str, Enum):
sedan = "sedan"
suv = "SUV"
truck = "Truck"
coupe = "Coupe"
class CarDescription(BaseModel):
brand: str
model: str
car_type: CarType
json_schema = CarDescription.model_json_schema()
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{
"role": "user",
"content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
}
],
extra_body={"guided_json": json_schema},
)
print(completion.choices[0].message.content)
.. tip::
While not strictly necessary, it's usually better to indicate in the prompt that JSON needs to be generated and to specify which fields the LLM should fill and how.
This can notably improve the results in most cases.
Finally we have ``guided_grammar``, which is probably the most difficult one to use but is really powerful, as it allows us to define complete languages like SQL queries.
It works by using a context-free EBNF grammar, which we can use, for example, to define a specific format of simplified SQL queries, as in the example below:
.. code-block:: python
simplified_sql_grammar = """
?start: select_statement
?select_statement: "SELECT " column_list " FROM " table_name
?column_list: column_name ("," column_name)*
?table_name: identifier
?column_name: identifier
?identifier: /[a-zA-Z_][a-zA-Z0-9_]*/
"""
completion = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{
"role": "user",
"content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
}
],
extra_body={"guided_grammar": simplified_sql_grammar},
)
print(completion.choices[0].message.content)
The complete code of the examples can be found on `examples/openai_chat_completion_structured_outputs.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_structured_outputs.py>`_.
Experimental Automatic Parsing (OpenAI API)
--------------------------------------------
This section covers the OpenAI beta wrapper over the ``client.chat.completions.create()`` method that provides richer integrations with Python specific types.
At the time of writing (``openai==1.54.4``), this is a "beta" feature in the OpenAI client library. Code reference can be found `here <https://github.com/openai/openai-python/blob/52357cff50bee57ef442e94d78a0de38b4173fc2/src/openai/resources/beta/chat/completions.py#L100-L104>`_.
For the following examples, vLLM was set up using ``vllm serve meta-llama/Llama-3.1-8B-Instruct``.
Here is a simple example demonstrating how to get structured output using Pydantic models:
.. code-block:: python
from pydantic import BaseModel
from openai import OpenAI
class Info(BaseModel):
name: str
age: int
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "My name is Cameron, I'm 28. What's my name and age?"},
],
response_format=Info,
extra_body=dict(guided_decoding_backend="outlines"),
)
message = completion.choices[0].message
print(message)
assert message.parsed
print("Name:", message.parsed.name)
print("Age:", message.parsed.age)
Output:
.. code-block:: console
ParsedChatCompletionMessage[Info](content='{"name": "Cameron", "age": 28}', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=Info(name='Cameron', age=28))
Name: Cameron
Age: 28
Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:
.. code-block:: python
from typing import List
from pydantic import BaseModel
from openai import OpenAI
class Step(BaseModel):
explanation: str
output: str
class MathResponse(BaseModel):
steps: List[Step]
final_answer: str
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful expert math tutor."},
{"role": "user", "content": "Solve 8x + 31 = 2."},
],
response_format=MathResponse,
extra_body=dict(guided_decoding_backend="outlines"),
)
message = completion.choices[0].message
print(message)
assert message.parsed
for i, step in enumerate(message.parsed.steps):
print(f"Step #{i}:", step)
print("Answer:", message.parsed.final_answer)
Output:
.. code-block:: console
ParsedChatCompletionMessage[MathResponse](content='{ "steps": [{ "explanation": "First, let\'s isolate the term with the variable \'x\'. To do this, we\'ll subtract 31 from both sides of the equation.", "output": "8x + 31 - 31 = 2 - 31"}, { "explanation": "By subtracting 31 from both sides, we simplify the equation to 8x = -29.", "output": "8x = -29"}, { "explanation": "Next, let\'s isolate \'x\' by dividing both sides of the equation by 8.", "output": "8x / 8 = -29 / 8"}], "final_answer": "x = -29/8" }', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=MathResponse(steps=[Step(explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation.", output='8x + 31 - 31 = 2 - 31'), Step(explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.', output='8x = -29'), Step(explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8.", output='8x / 8 = -29 / 8')], final_answer='x = -29/8'))
Step #0: explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation." output='8x + 31 - 31 = 2 - 31'
Step #1: explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.' output='8x = -29'
Step #2: explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8." output='8x / 8 = -29 / 8'
Answer: x = -29/8
Offline Inference
-----------------
Offline inference allows for the same types of guided decoding.
To use it, we'll need to configure guided decoding using the ``GuidedDecodingParams`` class inside ``SamplingParams``.
The main available options inside ``GuidedDecodingParams`` are:
- ``json``
- ``regex``
- ``choice``
- ``grammar``
- ``backend``
- ``whitespace_pattern``
These parameters can be used in the same way as the parameters from the Online Inference examples above.
One example for the usage of the ``choice`` parameter is shown below:
.. code-block:: python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams
llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"])
sampling_params = SamplingParams(guided_decoding=guided_decoding_params)
outputs = llm.generate(
prompts="Classify this sentiment: vLLM is wonderful!",
sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
A complete example with all options can be found in `examples/offline_inference_structured_outputs.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_structured_outputs.py>`_.

View File

@@ -47,7 +47,7 @@ tail ~/.config/vllm/usage_stats.json
## Opt-out of Usage Stats Collection
You can opt out of usage stats collection by setting the `VLLM_NO_USAGE_STATS` or `DO_NOT_TRACK` environment variable, or by creating a `~/.config/vllm/do_not_track` file:
```bash
# Any of the following methods can disable usage stats collection