[Docs] Convert rST to MyST (Markdown) (#11145)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
468
docs/source/usage/compatibility_matrix.md
Normal file
@@ -0,0 +1,468 @@
|
||||
(compatibility-matrix)=
|
||||
|
||||
# Compatibility Matrix
|
||||
|
||||
The tables below show which features are mutually exclusive and which features are supported on selected hardware.
|
||||
|
||||
```{note}
|
||||
Follow the link on a '✗' to see the tracking issue for that unsupported feature/hardware combination.
|
||||
```
|
||||
|
||||
## Feature x Feature
|
||||
|
||||
```{raw} html
|
||||
<style>
|
||||
/* Make smaller to try to improve readability */
|
||||
td {
|
||||
font-size: 0.8rem;
|
||||
text-align: center;
|
||||
}
|
||||
|
||||
th {
|
||||
text-align: center;
|
||||
font-size: 0.8rem;
|
||||
}
|
||||
</style>
|
||||
```
|
||||
|
||||
```{list-table}
|
||||
:header-rows: 1
|
||||
:stub-columns: 1
|
||||
:widths: auto
|
||||
|
||||
* - Feature
|
||||
- [CP](#chunked-prefill)
|
||||
- [APC](#apc)
|
||||
- [LoRA](#lora-adapter)
|
||||
- <abbr title="Prompt Adapter">prmpt adptr</abbr>
|
||||
- [SD](#spec_decode)
|
||||
- CUDA graph
|
||||
- <abbr title="Pooling Models">pooling</abbr>
|
||||
- <abbr title="Encoder-Decoder Models">enc-dec</abbr>
|
||||
- <abbr title="Logprobs">logP</abbr>
|
||||
- <abbr title="Prompt Logprobs">prmpt logP</abbr>
|
||||
- <abbr title="Async Output Processing">async output</abbr>
|
||||
- multi-step
|
||||
- <abbr title="Multimodal Inputs">mm</abbr>
|
||||
- best-of
|
||||
- beam-search
|
||||
- <abbr title="Guided Decoding">guided dec</abbr>
|
||||
* - [CP](#chunked-prefill)
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - [APC](#apc)
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - [LoRA](#lora-adapter)
|
||||
- [✗](https://github.com/vllm-project/vllm/pull/9057)
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - <abbr title="Prompt Adapter">prmpt adptr</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - [SD](#spec_decode)
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - CUDA graph
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - <abbr title="Pooling Models">pooling</abbr>
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
|
||||
- ✗
|
||||
- [✗](https://github.com/vllm-project/vllm/issues/7366)
|
||||
- ✗
|
||||
- ✗
|
||||
- [✗](https://github.com/vllm-project/vllm/issues/7366)
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - <abbr title="Logprobs">logP</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - <abbr title="Prompt Logprobs">prmpt logP</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- [✗](https://github.com/vllm-project/vllm/pull/8199)
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - <abbr title="Async Output Processing">async output</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - multi-step
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✗
|
||||
- ✅
|
||||
- [✗](https://github.com/vllm-project/vllm/issues/8198)
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - <abbr title="Multimodal Inputs">mm</abbr>
|
||||
- ✅
|
||||
- [✗](https://github.com/vllm-project/vllm/pull/8348)
|
||||
- [✗](https://github.com/vllm-project/vllm/pull/7199)
|
||||
- ?
|
||||
- ?
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - best-of
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- [✗](https://github.com/vllm-project/vllm/issues/6137)
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
- [✗](https://github.com/vllm-project/vllm/issues/7968)
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
* - beam-search
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- [✗](https://github.com/vllm-project/vllm/issues/6137)
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
- [✗](https://github.com/vllm-project/vllm/issues/7968)
|
||||
- ?
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
* - <abbr title="Guided Decoding">guided dec</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
- ?
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ?
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- [✗](https://github.com/vllm-project/vllm/issues/9893)
|
||||
- ?
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
|
||||
```
|
||||
|
||||
### Feature x Hardware
|
||||
|
||||
```{list-table}
|
||||
:header-rows: 1
|
||||
:stub-columns: 1
|
||||
:widths: auto
|
||||
|
||||
* - Feature
|
||||
- Volta
|
||||
- Turing
|
||||
- Ampere
|
||||
- Ada
|
||||
- Hopper
|
||||
- CPU
|
||||
- AMD
|
||||
* - [CP](#chunked-prefill)
|
||||
- [✗](https://github.com/vllm-project/vllm/issues/2729)
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - [APC](#apc)
|
||||
- [✗](https://github.com/vllm-project/vllm/issues/3687)
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - [LoRA](#lora-adapter)
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- [✗](https://github.com/vllm-project/vllm/pull/4830)
|
||||
- ✅
|
||||
* - <abbr title="Prompt Adapter">prmpt adptr</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- [✗](https://github.com/vllm-project/vllm/issues/8475)
|
||||
- ✅
|
||||
* - [SD](#spec_decode)
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - CUDA graph
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
* - <abbr title="Pooling Models">pooling</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
* - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
* - <abbr title="Multimodal Inputs">mm</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - <abbr title="Logprobs">logP</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - <abbr title="Prompt Logprobs">prmpt logP</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - <abbr title="Async Output Processing">async output</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✗
|
||||
* - multi-step
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- [✗](https://github.com/vllm-project/vllm/issues/8477)
|
||||
- ✅
|
||||
* - best-of
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - beam-search
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - <abbr title="Guided Decoding">guided dec</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
```
|
||||
@@ -1,468 +0,0 @@
|
||||
.. _compatibility_matrix:
|
||||
|
||||
Compatibility Matrix
|
||||
====================
|
||||
|
||||
The tables below show which features are mutually exclusive and which features are supported on selected hardware.
|
||||
|
||||
.. note::
|
||||
|
||||
Follow the link on a '✗' to see the tracking issue for that unsupported feature/hardware combination.
|
||||
|
||||
Feature x Feature
|
||||
-----------------
|
||||
|
||||
|
||||
.. raw:: html
|
||||
|
||||
<style>
|
||||
/* Make smaller to try to improve readability */
|
||||
td {
|
||||
font-size: 0.8rem;
|
||||
text-align: center;
|
||||
}
|
||||
|
||||
th {
|
||||
text-align: center;
|
||||
font-size: 0.8rem;
|
||||
}
|
||||
</style>
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: auto
|
||||
|
||||
* - Feature
|
||||
- :ref:`CP <chunked-prefill>`
|
||||
- :ref:`APC <apc>`
|
||||
- :ref:`LoRA <lora>`
|
||||
- :abbr:`prmpt adptr (Prompt Adapter)`
|
||||
- :ref:`SD <spec_decode>`
|
||||
- CUDA graph
|
||||
- :abbr:`pooling (Pooling Models)`
|
||||
- :abbr:`enc-dec (Encoder-Decoder Models)`
|
||||
- :abbr:`logP (Logprobs)`
|
||||
- :abbr:`prmpt logP (Prompt Logprobs)`
|
||||
- :abbr:`async output (Async Output Processing)`
|
||||
- multi-step
|
||||
- :abbr:`mm (Multimodal Inputs)`
|
||||
- best-of
|
||||
- beam-search
|
||||
- :abbr:`guided dec (Guided Decoding)`
|
||||
* - :ref:`CP <chunked-prefill>`
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :ref:`APC <apc>`
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :ref:`LoRA <lora>`
|
||||
- `✗ <https://github.com/vllm-project/vllm/pull/9057>`__
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :abbr:`prmpt adptr (Prompt Adapter)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :ref:`SD <spec_decode>`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - CUDA graph
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :abbr:`pooling (Pooling Models)`
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :abbr:`enc-dec (Encoder-Decoder Models)`
|
||||
- ✗
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/7366>`__
|
||||
- ✗
|
||||
- ✗
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/7366>`__
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :abbr:`logP (Logprobs)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :abbr:`prmpt logP (Prompt Logprobs)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/pull/8199>`__
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :abbr:`async output (Async Output Processing)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - multi-step
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✗
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/8198>`__
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :abbr:`mm (Multimodal Inputs)`
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/pull/8348>`__
|
||||
- `✗ <https://github.com/vllm-project/vllm/pull/7199>`__
|
||||
- ?
|
||||
- ?
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - best-of
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/6137>`__
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/7968>`__
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
* - beam-search
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/6137>`__
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/7968>`__
|
||||
- ?
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
* - :abbr:`guided dec (Guided Decoding)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
- ?
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ?
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/9893>`__
|
||||
- ?
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
|
||||
|
||||
Feature x Hardware
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: auto
|
||||
|
||||
* - Feature
|
||||
- Volta
|
||||
- Turing
|
||||
- Ampere
|
||||
- Ada
|
||||
- Hopper
|
||||
- CPU
|
||||
- AMD
|
||||
* - :ref:`CP <chunked-prefill>`
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/2729>`__
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - :ref:`APC <apc>`
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/3687>`__
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - :ref:`LoRA <lora>`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/pull/4830>`__
|
||||
- ✅
|
||||
* - :abbr:`prmpt adptr (Prompt Adapter)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/8475>`__
|
||||
- ✅
|
||||
* - :ref:`SD <spec_decode>`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - CUDA graph
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
* - :abbr:`pooling (Pooling Models)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
* - :abbr:`enc-dec (Encoder-Decoder Models)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
* - :abbr:`mm (Multimodal Inputs)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - :abbr:`logP (Logprobs)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - :abbr:`prmpt logP (Prompt Logprobs)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - :abbr:`async output (Async Output Processing)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✗
|
||||
* - multi-step
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/8477>`__
|
||||
- ✅
|
||||
* - best-of
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - beam-search
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - :abbr:`guided dec (Guided Decoding)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
64
docs/source/usage/disagg_prefill.md
Normal file
@@ -0,0 +1,64 @@
|
||||
(disagg-prefill)=
|
||||
|
||||
# Disaggregated prefilling (experimental)
|
||||
|
||||
This page introduces the disaggregated prefilling feature in vLLM. This feature is experimental and subject to change.
|
||||
|
||||
## Why disaggregated prefilling?
|
||||
|
||||
Two main reasons:
|
||||
|
||||
- **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling puts the prefill and decode phases of LLM inference in different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. `tp` and `pp`) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
|
||||
- **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert prefill jobs while a request is decoding, which results in higher tail latency. Disaggregated prefilling avoids this and lets you control tail ITL. Chunked prefill with a proper chunk size can achieve the same goal, but in practice it is hard to find the right chunk size, so disaggregated prefilling is a much more reliable way to control tail ITL.
|
||||
|
||||
```{note}
|
||||
Disaggregated prefill DOES NOT improve throughput.
|
||||
```
|
||||
|
||||
## Usage example
|
||||
|
||||
Please refer to `examples/disaggregated_prefill.sh` for the example usage of disaggregated prefilling.
|
||||
|
||||
## Benchmarks
|
||||
|
||||
Please refer to `benchmarks/disagg_benchmarks/` for disaggregated prefilling benchmarks.
|
||||
|
||||
## Development
|
||||
|
||||
We implement disaggregated prefilling by running two vLLM instances: one for prefill (the prefill instance) and one for decode (the decode instance). A connector then transfers the prefill KV caches and results from the prefill instance to the decode instance.
|
||||
|
||||
The entire disaggregated prefilling implementation lives under `vllm/distributed/kv_transfer`.
|
||||
|
||||
Key abstractions for disaggregated prefilling:
|
||||
|
||||
- **Connector**: Connector allows the **kv consumer** to retrieve the KV caches of a batch of requests from the **kv producer**.
|
||||
- **LookupBuffer**: LookupBuffer provides two APIs: `insert` and `drop_select`. Their semantics are similar to SQL: `insert` inserts a KV cache into the buffer, and `drop_select` returns the KV cache that matches the given condition and drops it from the buffer.
|
||||
- **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports `send_tensor` and `recv_tensor`.
|
||||
|
||||
```{note}
|
||||
`insert` is a non-blocking operation, but `drop_select` is a blocking operation.
|
||||
```
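To make the blocking behaviour concrete, here is a minimal in-memory sketch of the `insert` / `drop_select` semantics described above. It is a toy model, not vLLM's actual `LookupBuffer`: the class name `ToyLookupBuffer`, the string keys, the `timeout` parameter, and the list used as a stand-in for a KV cache are all assumptions made for illustration.

```python
import threading


class ToyLookupBuffer:
    """Illustrative in-memory buffer; NOT vLLM's real LookupBuffer."""

    def __init__(self):
        self._items = {}  # hypothetical key -> KV cache payload
        self._cond = threading.Condition()

    def insert(self, key, kv_cache):
        # Non-blocking: store the entry, wake any waiting consumer, return.
        with self._cond:
            self._items[key] = kv_cache
            self._cond.notify_all()

    def drop_select(self, key, timeout=None):
        # Blocking: wait until a matching entry exists, then return it
        # and remove it from the buffer in the same step.
        with self._cond:
            if not self._cond.wait_for(lambda: key in self._items, timeout):
                raise TimeoutError(f"no KV cache for key {key!r}")
            return self._items.pop(key)


buffer = ToyLookupBuffer()
# The "prefill" side inserts shortly after the "decode" side starts waiting.
producer = threading.Timer(0.05, buffer.insert, args=("req-0", [1.0, 2.0]))
producer.start()
kv = buffer.drop_select("req-0", timeout=2.0)  # blocks until the insert lands
producer.join()
print(kv)
```

A condition variable is used here so that the consumer sleeps until a producer inserts a matching entry. A real connector would move tensors between processes or machines rather than between threads.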
|
||||
|
||||
Here is a figure illustrating how the above 3 abstractions are organized:
|
||||
|
||||
```{image} /assets/usage/disagg_prefill/abstraction.jpg
|
||||
:alt: Disaggregated prefilling abstractions
|
||||
```
|
||||
|
||||
The workflow of disaggregated prefilling is as follows:
|
||||
|
||||
```{image} /assets/usage/disagg_prefill/overview.jpg
|
||||
:alt: Disaggregated prefilling workflow
|
||||
```
|
||||
|
||||
In the figure, `buffer` corresponds to the `insert` API of LookupBuffer, and `drop_select` corresponds to its `drop_select` API.
|
||||
|
||||
## Third-party contributions
|
||||
|
||||
Disaggregated prefilling is closely tied to infrastructure, so vLLM relies on third-party connectors for production-level disaggregated prefilling (and the vLLM team will actively review and merge new PRs for third-party connectors).
|
||||
|
||||
We recommend three implementation approaches:
|
||||
|
||||
- **Fully-customized connector**: Implement your own `Connector` and call third-party libraries to send and receive KV caches, and more (for example, editing vLLM's model input to perform customized prefilling). This approach gives you the most control, but at the risk of being incompatible with future vLLM versions.
|
||||
- **Database-like connector**: Implement your own `LookupBuffer` and support the `insert` and `drop_select` APIs just like SQL.
|
||||
- **Distributed P2P connector**: Implement your own `Pipe` and support the `send_tensor` and `recv_tensor` APIs, just like `torch.distributed`.
|
||||
@@ -1,69 +0,0 @@
|
||||
.. _disagg_prefill:
|
||||
|
||||
Disaggregated prefilling (experimental)
|
||||
=======================================
|
||||
|
||||
This page introduces the disaggregated prefilling feature in vLLM. This feature is experimental and subject to change.
|
||||
|
||||
Why disaggregated prefilling?
|
||||
-----------------------------
|
||||
|
||||
Two main reasons:
|
||||
|
||||
* **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling puts the prefill and decode phases of LLM inference in different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. ``tp`` and ``pp``) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
|
||||
* **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert prefill jobs while a request is decoding, which results in higher tail latency. Disaggregated prefilling avoids this and lets you control tail ITL. Chunked prefill with a proper chunk size can achieve the same goal, but in practice it is hard to find the right chunk size, so disaggregated prefilling is a much more reliable way to control tail ITL.
|
||||
|
||||
.. note::
|
||||
Disaggregated prefill DOES NOT improve throughput.
|
||||
|
||||
Usage example
|
||||
-------------
|
||||
|
||||
Please refer to ``examples/disaggregated_prefill.sh`` for the example usage of disaggregated prefilling.
|
||||
|
||||
|
||||
Benchmarks
|
||||
----------
|
||||
|
||||
Please refer to ``benchmarks/disagg_benchmarks/`` for disaggregated prefilling benchmarks.
|
||||
|
||||
|
||||
Development
|
||||
-----------
|
||||
|
||||
We implement disaggregated prefilling by running two vLLM instances: one for prefill (the prefill instance) and one for decode (the decode instance). A connector then transfers the prefill KV caches and results from the prefill instance to the decode instance.
|
||||
|
||||
The entire disaggregated prefilling implementation lives under ``vllm/distributed/kv_transfer``.
|
||||
|
||||
Key abstractions for disaggregated prefilling:
|
||||
|
||||
* **Connector**: Connector allows the **kv consumer** to retrieve the KV caches of a batch of requests from the **kv producer**.
|
||||
* **LookupBuffer**: LookupBuffer provides two APIs: ``insert`` and ``drop_select``. Their semantics are similar to SQL: ``insert`` inserts a KV cache into the buffer, and ``drop_select`` returns the KV cache that matches the given condition and drops it from the buffer.
|
||||
* **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports ``send_tensor`` and ``recv_tensor``.
|
||||
|
||||
.. note::
|
||||
``insert`` is a non-blocking operation, but ``drop_select`` is a blocking operation.
|
||||
|
||||
Here is a figure illustrating how the above 3 abstractions are organized:
|
||||
|
||||
.. image:: /assets/usage/disagg_prefill/abstraction.jpg
|
||||
:alt: Disaggregated prefilling abstractions
|
||||
|
||||
The workflow of disaggregated prefilling is as follows:
|
||||
|
||||
.. image:: /assets/usage/disagg_prefill/overview.jpg
|
||||
:alt: Disaggregated prefilling workflow
|
||||
|
||||
In the figure, ``buffer`` corresponds to the ``insert`` API of LookupBuffer, and ``drop_select`` corresponds to its ``drop_select`` API.
|
||||
|
||||
|
||||
Third-party contributions
|
||||
-------------------------
|
||||
|
||||
Disaggregated prefilling is closely tied to infrastructure, so vLLM relies on third-party connectors for production-level disaggregated prefilling (and the vLLM team will actively review and merge new PRs for third-party connectors).
|
||||
|
||||
We recommend three implementation approaches:
|
||||
|
||||
* **Fully-customized connector**: Implement your own ``Connector`` and call third-party libraries to send and receive KV caches, and more (for example, editing vLLM's model input to perform customized prefilling). This approach gives you the most control, but at the risk of being incompatible with future vLLM versions.
|
||||
* **Database-like connector**: Implement your own ``LookupBuffer`` and support the ``insert`` and ``drop_select`` APIs just like SQL.
|
||||
* **Distributed P2P connector**: Implement your own ``Pipe`` and support the ``send_tensor`` and ``recv_tensor`` APIs, just like ``torch.distributed``.
|
||||
@@ -1,23 +1,25 @@
|
||||
.. _engine_args:
|
||||
(engine-args)=
|
||||
|
||||
Engine Arguments
|
||||
================
|
||||
# Engine Arguments
|
||||
|
||||
Below, you can find an explanation of every engine argument for vLLM:
|
||||
|
||||
```{eval-rst}
|
||||
.. argparse::
|
||||
:module: vllm.engine.arg_utils
|
||||
:func: _engine_args_parser
|
||||
:prog: vllm serve
|
||||
:nodefaultconst:
|
||||
```
|
||||
|
||||
Async Engine Arguments
|
||||
----------------------
|
||||
## Async Engine Arguments
|
||||
|
||||
Below are the additional arguments related to the asynchronous engine:
|
||||
|
||||
```{eval-rst}
|
||||
.. argparse::
|
||||
:module: vllm.engine.arg_utils
|
||||
:func: _async_engine_args_parser
|
||||
:prog: vllm serve
|
||||
:nodefaultconst:
|
||||
:nodefaultconst:
|
||||
```
|
||||
15
docs/source/usage/env_vars.md
Normal file
@@ -0,0 +1,15 @@
|
||||
# Environment Variables
|
||||
|
||||
vLLM uses the following environment variables to configure the system:
|
||||
|
||||
```{warning}
|
||||
Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and IP for vLLM's **internal usage**. They are not the port and IP for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.
|
||||
|
||||
All environment variables used by vLLM are prefixed with `VLLM_`. **Kubernetes users should take special care**: do not name the service `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
|
||||
```
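The Kubernetes naming conflict described in the warning can be sketched in a few lines. The helper below approximates a subset of the variables Kubernetes injects for a Service (upper-cased name, dashes turned into underscores). The function name `k8s_service_env`, the cluster IP, the port, and the alternative service name `llm-server` are all hypothetical values chosen for illustration.

```python
def k8s_service_env(service_name, cluster_ip, port):
    """Approximate a subset of the variables Kubernetes injects for a Service."""
    prefix = service_name.upper().replace("-", "_")
    return {
        f"{prefix}_SERVICE_HOST": cluster_ip,
        f"{prefix}_SERVICE_PORT": str(port),
        # Docker-links-style variable; for a service named "vllm" this
        # is the one that collides with vLLM's own VLLM_PORT.
        f"{prefix}_PORT": f"tcp://{cluster_ip}:{port}",
    }


# A service named "vllm" yields a VLLM_PORT that is a URI, not a port number.
injected = k8s_service_env("vllm", "10.0.0.11", 8000)
print(injected["VLLM_PORT"])

# Any other service name stays out of the VLLM_ namespace.
safe = k8s_service_env("llm-server", "10.0.0.11", 8000)
print(any(name.startswith("VLLM_") for name in safe))
```

With a service named `vllm`, the injected `VLLM_PORT` holds a `tcp://` URI rather than a plain port number, which is the kind of conflict the warning is about.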
|
||||
|
||||
```{literalinclude} ../../../vllm/envs.py
|
||||
:end-before: end-env-vars-definition
|
||||
:language: python
|
||||
:start-after: begin-env-vars-definition
|
||||
```
|
||||
@@ -1,14 +0,0 @@
|
||||
Environment Variables
|
||||
========================
|
||||
|
||||
vLLM uses the following environment variables to configure the system:
|
||||
|
||||
.. warning::
|
||||
Please note that ``VLLM_PORT`` and ``VLLM_HOST_IP`` set the port and IP for vLLM's **internal usage**. They are not the port and IP for the API server. If you use ``--host $VLLM_HOST_IP`` and ``--port $VLLM_PORT`` to start the API server, it will not work.
|
||||
|
||||
All environment variables used by vLLM are prefixed with ``VLLM_``. **Kubernetes users should take special care**: do not name the service ``vllm``, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because `Kubernetes sets environment variables for each service with the capitalized service name as the prefix <https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables>`_.
|
||||
|
||||
.. literalinclude:: ../../../vllm/envs.py
|
||||
:language: python
|
||||
:start-after: begin-env-vars-definition
|
||||
:end-before: end-env-vars-definition
|
||||
@@ -1,34 +1,33 @@
|
||||
.. _faq:
|
||||
(faq)=
|
||||
|
||||
Frequently Asked Questions
|
||||
===========================
|
||||
# Frequently Asked Questions
|
||||
|
||||
Q: How can I serve multiple models on a single port using the OpenAI API?
|
||||
> Q: How can I serve multiple models on a single port using the OpenAI API?
|
||||
|
||||
A: Assuming you mean using the OpenAI-compatible server to serve multiple models at once, that is not currently supported. Instead, you can run multiple instances of the server (each serving a different model) at the same time, and add another layer that routes each incoming request to the correct server.
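As a rough sketch of what such a routing layer might do, the snippet below picks a backend from the `model` field of an OpenAI-style request body. The model names, ports, and the `pick_backend` helper are hypothetical. A real deployment would more likely use a reverse proxy or gateway and forward the request instead of only returning a URL.

```python
import json

# Hypothetical mapping: model name -> base URL of the server instance serving it.
BACKENDS = {
    "model-a": "http://localhost:8001",
    "model-b": "http://localhost:8002",
}


def pick_backend(request_body: bytes) -> str:
    """Return the base URL of the instance that serves the requested model."""
    model = json.loads(request_body).get("model")
    if model not in BACKENDS:
        raise KeyError(f"no backend registered for model {model!r}")
    return BACKENDS[model]


body = json.dumps({"model": "model-b", "prompt": "Hello"}).encode()
print(pick_backend(body))
```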
|
||||
|
||||
----------------------------------------
|
||||
______________________________________________________________________
|
||||
|
||||
Q: Which model to use for offline inference embedding?
|
||||
> Q: Which model to use for offline inference embedding?
|
||||
|
||||
A: You can try `e5-mistral-7b-instruct <https://huggingface.co/intfloat/e5-mistral-7b-instruct>`__ and `BAAI/bge-base-en-v1.5 <https://huggingface.co/BAAI/bge-base-en-v1.5>`__;
|
||||
more are listed :ref:`here <supported_models>`.
|
||||
A: You can try [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) and [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5);
|
||||
more are listed [here](#supported-models).
|
||||
|
||||
By extracting hidden states, vLLM can automatically convert text generation models like `Llama-3-8B <https://huggingface.co/meta-llama/Meta-Llama-3-8B>`__,
|
||||
`Mistral-7B-Instruct-v0.3 <https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3>`__ into embedding models,
|
||||
By extracting hidden states, vLLM can automatically convert text generation models like [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B),
|
||||
[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) into embedding models,
|
||||
but they are expected to be inferior to models that are specifically trained on embedding tasks.
|
||||
|
||||
----------------------------------------
|
||||
______________________________________________________________________
|
||||
|
||||
Q: Can the output of a prompt vary across runs in vLLM?
|
||||
> Q: Can the output of a prompt vary across runs in vLLM?
|
||||
|
||||
A: Yes, it can. vLLM does not guarantee stable log probabilities (logprobs) for the output tokens. Variations in logprobs may occur due to
|
||||
numerical instability in Torch operations or non-deterministic behavior in batched Torch operations when batching changes. For more details,
|
||||
see the `Numerical Accuracy section <https://pytorch.org/docs/stable/notes/numerical_accuracy.html#batched-computations-or-slice-computations>`_.
|
||||
numerical instability in Torch operations or non-deterministic behavior in batched Torch operations when batching changes. For more details,
|
||||
see the [Numerical Accuracy section](https://pytorch.org/docs/stable/notes/numerical_accuracy.html#batched-computations-or-slice-computations).
|
||||
|
||||
In vLLM, the same requests might be batched differently due to factors such as other concurrent requests,
|
||||
changes in batch size, or batch expansion in speculative decoding. These batching variations, combined with numerical instability of Torch operations,
|
||||
can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in
|
||||
changes in batch size, or batch expansion in speculative decoding. These batching variations, combined with numerical instability of Torch operations,
|
||||
can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in
|
||||
different tokens being sampled. Once a different token is sampled, further divergence is likely.
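The underlying effect can be seen without vLLM or Torch: floating-point addition is not associative, so summing the same values in a different order (as can happen when the batch layout changes) may give a slightly different result. This is a plain-Python illustration of that general property, not a reproduction of any specific vLLM kernel.

```python
a, b, c = 0.1, 0.2, 0.3

# Same three numbers, two different reduction orders.
left_to_right = (a + b) + c
right_to_left = a + (b + c)

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6
```

The difference is tiny here, but across many layers and decoding steps such discrepancies can be enough to flip which token has the highest probability.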
|
||||
|
||||
**Mitigation Strategies**
|
||||
215
docs/source/usage/lora.md
Normal file
@@ -0,0 +1,215 @@
|
||||
(lora-adapter)=
|
||||
|
||||
# LoRA Adapters
|
||||
|
||||
This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.
|
||||
|
||||
LoRA adapters can be used with any vLLM model that implements {class}`~vllm.model_executor.models.interfaces.SupportsLoRA`.
|
||||
|
||||
Adapters can be efficiently served on a per-request basis with minimal overhead. First we download the adapter(s) and save
|
||||
them locally with
|
||||
|
||||
```python
|
||||
from huggingface_hub import snapshot_download
|
||||
|
||||
sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
|
||||
```
|
||||
|
||||
Then we instantiate the base model and pass in the `enable_lora=True` flag:
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
from vllm.lora.request import LoRARequest
|
||||
|
||||
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
|
||||
```
|
||||
|
||||
We can now submit the prompts and call `llm.generate` with the `lora_request` parameter. The first parameter
|
||||
of `LoRARequest` is a human-identifiable name, the second parameter is a globally unique ID for the adapter, and
|
||||
the third parameter is the path to the LoRA adapter.
|
||||
|
||||
```python
|
||||
sampling_params = SamplingParams(
|
||||
temperature=0,
|
||||
max_tokens=256,
|
||||
stop=["[/assistant]"]
|
||||
)
|
||||
|
||||
prompts = [
|
||||
"[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
|
||||
"[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
|
||||
]
|
||||
|
||||
outputs = llm.generate(
|
||||
prompts,
|
||||
sampling_params,
|
||||
lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
|
||||
)
|
||||
```
|
||||
|
||||
Check out [examples/multilora_inference.py](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)
|
||||
for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
|
||||
|
||||
## Serving LoRA Adapters
|
||||
|
||||
LoRA-adapted models can also be served with the OpenAI-compatible vLLM server. To do so, we use
|
||||
`--lora-modules {name}={path} {name}={path}` to specify each LoRA module when we kick off the server:
|
||||
|
||||
```bash
|
||||
vllm serve meta-llama/Llama-2-7b-hf \
|
||||
--enable-lora \
|
||||
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
|
||||
```
|
||||
|
||||
```{note}
|
||||
The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
|
||||
```
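
Because the snapshot directory name depends on the commit, it can help to look up the path instead of hard-coding it. The sketch below lists the snapshot directories for a repository, assuming the cache layout seen in the command above (`models--{org}--{name}/snapshots/{commit}`). The helper name `find_snapshot_dirs` and the default cache root are assumptions for illustration and are not part of vLLM. The cache root may differ in your environment.

```python
from pathlib import Path


def find_snapshot_dirs(repo_id: str, cache_root: Path) -> list[Path]:
    """Return the snapshot directories cached for repo_id, sorted by name.

    Hypothetical helper: assumes the layout
    <cache_root>/models--<org>--<name>/snapshots/<commit>/.
    """
    snapshots_dir = cache_root / ("models--" + repo_id.replace("/", "--")) / "snapshots"
    if not snapshots_dir.is_dir():
        return []
    return sorted(p for p in snapshots_dir.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Assumed default cache location, matching the path used above.
    cache_root = Path.home() / ".cache" / "huggingface" / "hub"
    for path in find_snapshot_dirs("yard1/llama-2-7b-sql-lora-test", cache_root):
        print(path)
```

Alternatively, the value returned by `snapshot_download` in the earlier example is already the local snapshot path and can be passed to `--lora-modules` directly.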
|
||||
|
||||
The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
|
||||
etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
|
||||
with its base model:
|
||||
|
||||
```bash
|
||||
curl localhost:8000/v1/models | jq .
|
||||
{
|
||||
"object": "list",
|
||||
"data": [
|
||||
{
|
||||
"id": "meta-llama/Llama-2-7b-hf",
|
||||
"object": "model",
|
||||
...
|
||||
},
|
||||
{
|
||||
"id": "sql-lora",
|
||||
"object": "model",
|
||||
...
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Requests can specify the LoRA adapter as if it were any other model via the `model` request parameter. The requests will be
|
||||
processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and potentially other
|
||||
LoRA adapter requests if they were provided and `max_loras` is set high enough).
|
||||
|
||||
The following is an example request:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "sql-lora",
|
||||
"prompt": "San Francisco is a",
|
||||
"max_tokens": 7,
|
||||
"temperature": 0
|
||||
}' | jq
|
||||
```
|
||||
|
||||
## Dynamically serving LoRA Adapters
|
||||
|
||||
In addition to serving LoRA adapters at server startup, the vLLM server now supports dynamically loading and unloading
|
||||
LoRA adapters at runtime through dedicated API endpoints. This feature can be particularly useful when the flexibility
|
||||
to change models on-the-fly is needed.
|
||||
|
||||
Note: Enabling this feature in production environments is risky, as users may participate in model adapter management.
|
||||
|
||||
To enable dynamic LoRA loading and unloading, ensure that the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING`
|
||||
is set to `True`. When this option is enabled, the API server will log a warning to indicate that dynamic loading is active.
|
||||
|
||||
```bash
|
||||
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
|
||||
```
|
||||
|
||||
Loading a LoRA Adapter:
|
||||
|
||||
To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary
|
||||
details of the adapter to be loaded. The request payload should include the name and path to the LoRA adapter.
|
||||
|
||||
Example request to load a LoRA adapter:
|
||||
|
||||
```bash
|
||||
curl -X POST http://localhost:8000/v1/load_lora_adapter \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"lora_name": "sql_adapter",
|
||||
"lora_path": "/path/to/sql-lora-adapter"
|
||||
}'
|
||||
```
|
||||
|
||||
Upon a successful request, the API will respond with a 200 OK status code. If an error occurs, such as if the adapter
|
||||
cannot be found or loaded, an appropriate error message will be returned.
|
||||
|
||||
Unloading a LoRA Adapter:
|
||||
|
||||
To unload a LoRA adapter that has been previously loaded, send a POST request to the `/v1/unload_lora_adapter` endpoint
|
||||
with the name or ID of the adapter to be unloaded.
|
||||
|
||||
Example request to unload a LoRA adapter:
|
||||
|
||||
```bash
|
||||
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"lora_name": "sql_adapter"
|
||||
}'
|
||||
```
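
The same load and unload calls can be prepared from Python. The following is a minimal sketch using only the standard library. It builds the two requests without sending them, and it assumes the server address from the curl examples above. The helper name `build_adapter_request` is hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed server address, as in the curl examples


def build_adapter_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a LoRA management endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


load_req = build_adapter_request(
    "/v1/load_lora_adapter",
    {"lora_name": "sql_adapter", "lora_path": "/path/to/sql-lora-adapter"},
)
unload_req = build_adapter_request(
    "/v1/unload_lora_adapter",
    {"lora_name": "sql_adapter"},
)

# To send a request against a running server:
# with urllib.request.urlopen(load_req) as resp:
#     print(resp.status)
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for error status codes, so a failed load or unload surfaces as an exception.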
|
||||
|
||||
## New format for `--lora-modules`
|
||||
|
||||
In the previous version, users would provide LoRA modules via the following format, either as a key-value pair or in JSON format. For example:
|
||||
|
||||
```bash
|
||||
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
|
||||
```
|
||||
|
||||
This format only included the `name` and `path` for each LoRA module and did not provide a way to specify a `base_model_name`.
|
||||
Now, you can specify a `base_model_name` alongside the `name` and `path` using JSON format. For example:
|
||||
|
||||
```bash
|
||||
--lora-modules '{"name": "sql-lora", "path": "/path/to/lora", "base_model_name": "meta-llama/Llama-2-7b"}'
|
||||
```
|
||||
|
||||
For backward compatibility, you can still use the old key-value format (`name=path`), but the `base_model_name` will remain unspecified in that case.
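
When the server command is generated by a script, the JSON value is easy to get wrong by hand because of shell quoting. The sketch below builds a shell-safe value for `--lora-modules` in either format. The helper name `lora_module_arg` is hypothetical, and the field names are taken from the JSON example above.

```python
import json
import shlex
from typing import Optional


def lora_module_arg(name: str, path: str, base_model_name: Optional[str] = None) -> str:
    """Return a shell-quoted value for --lora-modules (hypothetical helper)."""
    if base_model_name is None:
        # Old key-value format: name=path
        return shlex.quote(f"{name}={path}")
    # JSON format, which can also carry base_model_name
    spec = {"name": name, "path": path, "base_model_name": base_model_name}
    return shlex.quote(json.dumps(spec))


print("--lora-modules", lora_module_arg("sql-lora", "/path/to/lora", "meta-llama/Llama-2-7b"))
print("--lora-modules", lora_module_arg("sql-lora", "/path/to/lora"))
```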
|
||||
|
||||
## LoRA model lineage in model card
|
||||
|
||||
The new format of `--lora-modules` mainly exists to support displaying parent model information in the model card. The example response below shows how this works:
|
||||
|
||||
- The `parent` field of LoRA model `sql-lora` now links to its base model `meta-llama/Llama-2-7b-hf`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
|
||||
- The `root` field points to the artifact location of the LoRA adapter.
|
||||
|
||||
```bash
|
||||
$ curl http://localhost:8000/v1/models
|
||||
|
||||
{
|
||||
"object": "list",
|
||||
"data": [
|
||||
{
|
||||
"id": "meta-llama/Llama-2-7b-hf",
|
||||
"object": "model",
|
||||
"created": 1715644056,
|
||||
"owned_by": "vllm",
|
||||
"root": "~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/",
|
||||
"parent": null,
|
||||
"permission": [
|
||||
{
|
||||
.....
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": "sql-lora",
|
||||
"object": "model",
|
||||
"created": 1715644056,
|
||||
"owned_by": "vllm",
|
||||
"root": "~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/",
|
||||
"parent": "meta-llama/Llama-2-7b-hf",
|
||||
"permission": [
|
||||
{
|
||||
....
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
```
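
To use this lineage information programmatically, the response can be reduced to a mapping from each model `id` to its `parent`. The following is a minimal sketch operating on a trimmed sample shaped like the response above. The function name `model_lineage` is hypothetical, and the sample omits fields that are not needed here.

```python
import json


def model_lineage(models_response: dict) -> dict:
    """Map each model id to its parent id (None for base models)."""
    return {m["id"]: m.get("parent") for m in models_response.get("data", [])}


# Trimmed sample shaped like the /v1/models response shown above.
sample = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "meta-llama/Llama-2-7b-hf", "object": "model", "parent": null},
    {"id": "sql-lora", "object": "model", "parent": "meta-llama/Llama-2-7b-hf"}
  ]
}
""")

for model_id, parent in model_lineage(sample).items():
    print(model_id, "->", parent if parent is not None else "(base model)")
```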
|
||||
@@ -1,225 +0,0 @@
|
||||
.. _lora:
|
||||
|
||||
LoRA Adapters
|
||||
=============
|
||||
|
||||
This document shows you how to use `LoRA adapters <https://arxiv.org/abs/2106.09685>`_ with vLLM on top of a base model.
|
||||
|
||||
LoRA adapters can be used with any vLLM model that implements :class:`~vllm.model_executor.models.interfaces.SupportsLoRA`.
|
||||
|
||||
Adapters can be efficiently served on a per request basis with minimal overhead. First we download the adapter(s) and save
|
||||
them locally with
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from huggingface_hub import snapshot_download
|
||||
|
||||
sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
|
||||
|
||||
|
||||
Then we instantiate the base model and pass in the ``enable_lora=True`` flag:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
from vllm.lora.request import LoRARequest
|
||||
|
||||
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
|
||||
|
||||
|
||||
We can now submit the prompts and call ``llm.generate`` with the ``lora_request`` parameter. The first parameter
|
||||
of ``LoRARequest`` is a human identifiable name, the second parameter is a globally unique ID for the adapter and
|
||||
the third parameter is the path to the LoRA adapter.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
sampling_params = SamplingParams(
|
||||
temperature=0,
|
||||
max_tokens=256,
|
||||
stop=["[/assistant]"]
|
||||
)
|
||||
|
||||
prompts = [
|
||||
"[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
|
||||
"[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
|
||||
]
|
||||
|
||||
outputs = llm.generate(
|
||||
prompts,
|
||||
sampling_params,
|
||||
lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
|
||||
)
|
||||
|
||||
|
||||
Check out `examples/multilora_inference.py <https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py>`_
|
||||
for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
|
||||
|
||||
Serving LoRA Adapters
|
||||
---------------------
|
||||
LoRA adapted models can also be served with the Open-AI compatible vLLM server. To do so, we use
|
||||
``--lora-modules {name}={path} {name}={path}`` to specify each LoRA module when we kickoff the server:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
vllm serve meta-llama/Llama-2-7b-hf \
|
||||
--enable-lora \
|
||||
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
|
||||
|
||||
.. note::
|
||||
The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
|
||||
|
||||
The server entrypoint accepts all other LoRA configuration parameters (``max_loras``, ``max_lora_rank``, ``max_cpu_loras``,
|
||||
etc.), which will apply to all forthcoming requests. Upon querying the ``/models`` endpoint, we should see our LoRA along
|
||||
with its base model:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
curl localhost:8000/v1/models | jq .
|
||||
{
|
||||
"object": "list",
|
||||
"data": [
|
||||
{
|
||||
"id": "meta-llama/Llama-2-7b-hf",
|
||||
"object": "model",
|
||||
...
|
||||
},
|
||||
{
|
||||
"id": "sql-lora",
|
||||
"object": "model",
|
||||
...
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
Requests can specify the LoRA adapter as if it were any other model via the ``model`` request parameter. The requests will be
|
||||
processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and potentially other
|
||||
LoRA adapter requests if they were provided and ``max_loras`` is set high enough).
|
||||
|
||||
The following is an example request
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
curl http://localhost:8000/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "sql-lora",
|
||||
"prompt": "San Francisco is a",
|
||||
"max_tokens": 7,
|
||||
"temperature": 0
|
||||
}' | jq
|
||||
|
||||
|
||||
Dynamically serving LoRA Adapters
|
||||
---------------------------------
|
||||
|
||||
In addition to serving LoRA adapters at server startup, the vLLM server now supports dynamically loading and unloading
|
||||
LoRA adapters at runtime through dedicated API endpoints. This feature can be particularly useful when the flexibility
|
||||
to change models on-the-fly is needed.
|
||||
|
||||
Note: Enabling this feature in production environments is risky as user may participate model adapter management.
|
||||
|
||||
To enable dynamic LoRA loading and unloading, ensure that the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING`
|
||||
is set to `True`. When this option is enabled, the API server will log a warning to indicate that dynamic loading is active.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
|
||||
|
||||
|
||||
Loading a LoRA Adapter:
|
||||
|
||||
To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary
|
||||
details of the adapter to be loaded. The request payload should include the name and path to the LoRA adapter.
|
||||
|
||||
Example request to load a LoRA adapter:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
curl -X POST http://localhost:8000/v1/load_lora_adapter \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"lora_name": "sql_adapter",
|
||||
"lora_path": "/path/to/sql-lora-adapter"
|
||||
}'
|
||||
|
||||
Upon a successful request, the API will respond with a 200 OK status code. If an error occurs, such as if the adapter
|
||||
cannot be found or loaded, an appropriate error message will be returned.
|
||||
|
||||
Unloading a LoRA Adapter:
|
||||
|
||||
To unload a LoRA adapter that has been previously loaded, send a POST request to the `/v1/unload_lora_adapter` endpoint
|
||||
with the name or ID of the adapter to be unloaded.
|
||||
|
||||
Example request to unload a LoRA adapter:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"lora_name": "sql_adapter"
|
||||
}'
|
||||
|
||||
|
||||
New format for `--lora-modules`
|
||||
-------------------------------
|
||||
|
||||
In the previous version, users would provide LoRA modules via the following format, either as a key-value pair or in JSON format. For example:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
|
||||
|
||||
This would only include the `name` and `path` for each LoRA module, but did not provide a way to specify a `base_model_name`.
|
||||
Now, you can specify a base_model_name alongside the name and path using JSON format. For example:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
--lora-modules '{"name": "sql-lora", "path": "/path/to/lora", "base_model_name": "meta-llama/Llama-2-7b"}'
|
||||
|
||||
To provide the backward compatibility support, you can still use the old key-value format (name=path), but the `base_model_name` will remain unspecified in that case.
|
||||
|
||||
|
||||
Lora model lineage in model card
|
||||
--------------------------------
|
||||
|
||||
The new format of `--lora-modules` is mainly to support the display of parent model information in the model card. Here's an explanation of how your current response supports this:
|
||||
|
||||
- The `parent` field of LoRA model `sql-lora` now links to its base model `meta-llama/Llama-2-7b-hf`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
|
||||
- The `root` field points to the artifact location of the lora adapter.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ curl http://localhost:8000/v1/models
|
||||
|
||||
{
|
||||
"object": "list",
|
||||
"data": [
|
||||
{
|
||||
"id": "meta-llama/Llama-2-7b-hf",
|
||||
"object": "model",
|
||||
"created": 1715644056,
|
||||
"owned_by": "vllm",
|
||||
"root": "~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/",
|
||||
"parent": null,
|
||||
"permission": [
|
||||
{
|
||||
.....
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": "sql-lora",
|
||||
"object": "model",
|
||||
"created": 1715644056,
|
||||
"owned_by": "vllm",
|
||||
"root": "~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/",
|
||||
"parent": meta-llama/Llama-2-7b-hf,
|
||||
"permission": [
|
||||
{
|
||||
....
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
486
docs/source/usage/multimodal_inputs.md
Normal file
@@ -0,0 +1,486 @@
|
||||
(multimodal-inputs)=
|
||||
|
||||
# Multimodal Inputs
|
||||
|
||||
This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM.
|
||||
|
||||
```{note}
|
||||
We are actively iterating on multi-modal support. See [this RFC](https://github.com/vllm-project/vllm/issues/4194) for upcoming changes,
|
||||
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
|
||||
```
|
||||
|
||||
## Offline Inference
|
||||
|
||||
To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`:
|
||||
|
||||
- `prompt`: The prompt should follow the format that is documented on HuggingFace.
|
||||
- `multi_modal_data`: This is a dictionary that follows the schema defined in {class}`vllm.multimodal.MultiModalDataDict`.
|
||||
|
||||
### Image
|
||||
|
||||
You can pass a single image to the {code}`'image'` field of the multi-modal dictionary, as shown in the following examples:
|
||||
|
||||
```python
|
||||
import PIL.Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
|
||||
|
||||
# Refer to the HuggingFace repo for the correct format to use
|
||||
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
|
||||
|
||||
# Load the image using PIL.Image
|
||||
image = PIL.Image.open(...)
|
||||
|
||||
# Single prompt inference
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": {"image": image},
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
    generated_text = o.outputs[0].text
    print(generated_text)
|
||||
|
||||
# Batch inference
|
||||
image_1 = PIL.Image.open(...)
|
||||
image_2 = PIL.Image.open(...)
|
||||
outputs = llm.generate(
|
||||
[
|
||||
{
|
||||
"prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
|
||||
"multi_modal_data": {"image": image_1},
|
||||
},
|
||||
{
|
||||
"prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
|
||||
"multi_modal_data": {"image": image_2},
|
||||
}
|
||||
]
|
||||
)
|
||||
|
||||
for o in outputs:
|
||||
    generated_text = o.outputs[0].text
    print(generated_text)
|
||||
```
|
||||
|
||||
A code example can be found in [examples/offline_inference_vision_language.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py).
|
||||
|
||||
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
|
||||
|
||||
```python
|
||||
llm = LLM(
|
||||
model="microsoft/Phi-3.5-vision-instruct",
|
||||
trust_remote_code=True, # Required to load Phi-3.5-vision
|
||||
max_model_len=4096, # Otherwise, it may not fit in smaller GPUs
|
||||
limit_mm_per_prompt={"image": 2}, # The maximum number to accept
|
||||
)
|
||||
|
||||
# Refer to the HuggingFace repo for the correct format to use
|
||||
prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"
|
||||
|
||||
# Load the images using PIL.Image
|
||||
image1 = PIL.Image.open(...)
|
||||
image2 = PIL.Image.open(...)
|
||||
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": {
|
||||
"image": [image1, image2]
|
||||
},
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
    generated_text = o.outputs[0].text
    print(generated_text)
|
||||
```
|
||||
|
||||
A code example can be found in [examples/offline_inference_vision_language_multi_image.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py).
|
||||
|
||||
Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
|
||||
|
||||
```python
|
||||
# Specify the maximum number of frames per video to be 4. This can be changed.
|
||||
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
|
||||
|
||||
# Create the request payload.
|
||||
video_frames = ... # load your video making sure it only has the number of frames specified earlier.
|
||||
message = {
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
|
||||
],
|
||||
}
|
||||
for i in range(len(video_frames)):
|
||||
    # encode_image is a user-defined helper that returns a base64-encoded string.
    base64_image = encode_image(video_frames[i])
    new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
    message["content"].append(new_image)
|
||||
|
||||
# Perform inference and log output.
|
||||
outputs = llm.chat([message])
|
||||
|
||||
for o in outputs:
|
||||
    generated_text = o.outputs[0].text
    print(generated_text)
|
||||
```
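
The example above calls `encode_image` without defining it. One possible definition is sketched below, under the assumption that each frame is already available as JPEG-encoded bytes (for example, read from disk). Frames held as NumPy arrays or PIL images would first need to be encoded to JPEG. The helper `frames_to_message` is hypothetical and only assembles the message. It does not call the model.

```python
import base64


def encode_image(jpeg_bytes: bytes) -> str:
    """Base64-encode JPEG bytes for use in a data URL (assumed helper)."""
    return base64.b64encode(jpeg_bytes).decode("utf-8")


def frames_to_message(frames: list[bytes], instruction: str) -> dict:
    """Build a single user message containing the instruction and one image per frame."""
    content = [{"type": "text", "text": instruction}]
    for jpeg_bytes in frames:
        url = f"data:image/jpeg;base64,{encode_image(jpeg_bytes)}"
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}


# Placeholder bytes stand in for real JPEG-encoded frames.
message = frames_to_message(
    [b"frame-0", b"frame-1"],
    "Describe this set of frames. Consider the frames to be a part of the same video.",
)
print(len(message["content"]))  # one text part plus one part per frame
```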
|
||||
|
||||
### Video
|
||||
|
||||
You can pass a list of NumPy arrays directly to the {code}`'video'` field of the multi-modal dictionary
|
||||
instead of using multi-image input.
|
||||
|
||||
Please refer to [examples/offline_inference_vision_language.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py) for more details.
|
||||
|
||||
### Audio
|
||||
|
||||
You can pass a tuple {code}`(array, sampling_rate)` to the {code}`'audio'` field of the multi-modal dictionary.
|
||||
|
||||
Please refer to [examples/offline_inference_audio_language.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_audio_language.py) for more details.
|
||||
|
||||
### Embedding
|
||||
|
||||
To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
|
||||
pass a tensor of shape {code}`(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
|
||||
|
||||
```python
|
||||
# Inference with image embeddings as input
|
||||
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
|
||||
|
||||
# Refer to the HuggingFace repo for the correct format to use
|
||||
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
|
||||
|
||||
# Embeddings for single image
|
||||
# torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
|
||||
image_embeds = torch.load(...)
|
||||
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": {"image": image_embeds},
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
    generated_text = o.outputs[0].text
    print(generated_text)
|
||||
```
|
||||
|
||||
For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:
|
||||
|
||||
```python
|
||||
# Construct the prompt based on your model
|
||||
prompt = ...
|
||||
|
||||
# Embeddings for multiple images
|
||||
# torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
|
||||
image_embeds = torch.load(...)
|
||||
|
||||
# Qwen2-VL
|
||||
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
|
||||
mm_data = {
|
||||
"image": {
|
||||
"image_embeds": image_embeds,
|
||||
# image_grid_thw is needed to calculate positional encoding.
|
||||
"image_grid_thw": torch.load(...), # torch.Tensor of shape (1, 3),
|
||||
}
|
||||
}
|
||||
|
||||
# MiniCPM-V
|
||||
llm = LLM("openbmb/MiniCPM-V-2_6", trust_remote_code=True, limit_mm_per_prompt={"image": 4})
|
||||
mm_data = {
|
||||
"image": {
|
||||
"image_embeds": image_embeds,
|
||||
# image_size_list is needed to calculate details of the sliced image.
|
||||
"image_size_list": [image.size for image in images], # list of image sizes
|
||||
}
|
||||
}
|
||||
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": mm_data,
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
    generated_text = o.outputs[0].text
    print(generated_text)
|
||||
```
|
||||
|
||||
## Online Inference
|
||||
|
||||
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
|
||||
|
||||
```{important}
|
||||
A chat template is **required** to use the Chat Completions API.
|
||||
|
||||
Although most models come with a chat template, for others you have to define one yourself.
|
||||
The chat template can be inferred based on the documentation on the model's HuggingFace repo.
|
||||
For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found [here](https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja).
|
||||
```
|
||||
|
||||
### Image
|
||||
|
||||
Image input is supported according to [OpenAI Vision API](https://platform.openai.com/docs/guides/vision).
|
||||
Here is a simple example using Phi-3.5-Vision.
|
||||
|
||||
First, launch the OpenAI-compatible server:
|
||||
|
||||
```bash
|
||||
vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
|
||||
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2
|
||||
```
|
||||
|
||||
Then, you can use the OpenAI client as follows:
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
# Single-image input inference
|
||||
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
|
||||
|
||||
chat_response = client.chat.completions.create(
|
||||
model="microsoft/Phi-3.5-vision-instruct",
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": [
|
||||
# NOTE: The prompt formatting with the image token `<image>` is not needed
|
||||
# since the prompt will be processed automatically by the API server.
|
||||
{"type": "text", "text": "What’s in this image?"},
|
||||
{"type": "image_url", "image_url": {"url": image_url}},
|
||||
],
|
||||
}],
|
||||
)
|
||||
print("Chat completion output:", chat_response.choices[0].message.content)
|
||||
|
||||
# Multi-image input inference
|
||||
image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
|
||||
image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"
|
||||
|
||||
chat_response = client.chat.completions.create(
|
||||
model="microsoft/Phi-3.5-vision-instruct",
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": "What are the animals in these images?"},
|
||||
{"type": "image_url", "image_url": {"url": image_url_duck}},
|
||||
{"type": "image_url", "image_url": {"url": image_url_lion}},
|
||||
],
|
||||
}],
|
||||
)
|
||||
print("Chat completion output:", chat_response.choices[0].message.content)
|
||||
```
|
||||
|
||||
A full code example can be found in [examples/openai_chat_completion_client_for_multimodal.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py).
|
||||
|
||||
```{tip}
|
||||
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
|
||||
and pass the file path as `url` in the API request.
|
||||
```
|
||||
|
||||
```{tip}
|
||||
There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
|
||||
In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
|
||||
```
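
As a sketch of the interleaving described in the tip above, the `content` list can alternate text and image parts in the order they should appear. The helper name `interleaved_content` and the image URLs are placeholders for illustration. The part schema follows the client example earlier in this section.

```python
def interleaved_content(parts):
    """Build chat `content` from a sequence of ("text", str) or ("image", url) pairs."""
    content = []
    for kind, value in parts:
        if kind == "text":
            content.append({"type": "text", "text": value})
        elif kind == "image":
            content.append({"type": "image_url", "image_url": {"url": value}})
        else:
            raise ValueError(f"unsupported part kind: {kind!r}")
    return content


messages = [{
    "role": "user",
    "content": interleaved_content([
        ("text", "Compare this image"),
        ("image", "https://example.com/first.jpg"),   # placeholder URL
        ("text", "with this one:"),
        ("image", "https://example.com/second.jpg"),  # placeholder URL
    ]),
}]
```

The resulting `messages` list can be passed to `client.chat.completions.create` in the same way as in the examples above.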
|
||||
|
||||
````{note}
|
||||
By default, the timeout for fetching images through HTTP URL is `5` seconds.
|
||||
You can override this by setting the environment variable:
|
||||
|
||||
```console
|
||||
$ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
|
||||
```
|
||||
````
|
||||
|
||||
### Video
|
||||
|
||||
Instead of {code}`image_url`, you can pass a video file via {code}`video_url`.
|
||||
|
||||
You can use [these tests](https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_video.py) as reference.
|
||||
|
||||
````{note}
|
||||
By default, the timeout for fetching videos through HTTP URL is `30` seconds.
|
||||
You can override this by setting the environment variable:
|
||||
|
||||
```console
|
||||
$ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
|
||||
```
|
||||
````
|
||||
|
||||
### Audio
|
||||
|
||||
Audio input is supported according to [OpenAI Audio API](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in).
|
||||
Here is a simple example using Ultravox-v0.3.
|
||||
|
||||
First, launch the OpenAI-compatible server:
|
||||
|
||||
```bash
|
||||
vllm serve fixie-ai/ultravox-v0_3
|
||||
```
|
||||
|
||||
Then, you can use the OpenAI client as follows:
|
||||
|
||||
```python
|
||||
import base64
|
||||
import requests
|
||||
from openai import OpenAI
|
||||
from vllm.assets.audio import AudioAsset
|
||||
|
||||
def encode_base64_content_from_url(content_url: str) -> str:
    """Encode content retrieved from a remote URL to base64 format."""

    with requests.get(content_url) as response:
        response.raise_for_status()
        result = base64.b64encode(response.content).decode('utf-8')

    return result
|
||||
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
# Any format supported by librosa is supported
|
||||
audio_url = AudioAsset("winning_call").url
|
||||
audio_base64 = encode_base64_content_from_url(audio_url)
|
||||
|
||||
chat_completion_from_base64 = client.chat.completions.create(
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "What's in this audio?"
|
||||
},
|
||||
{
|
||||
"type": "input_audio",
|
||||
"input_audio": {
|
||||
"data": audio_base64,
|
||||
"format": "wav"
|
||||
},
|
||||
},
|
||||
],
|
||||
}],
|
||||
model="fixie-ai/ultravox-v0_3",
|
||||
max_completion_tokens=64,
|
||||
)
|
||||
|
||||
result = chat_completion_from_base64.choices[0].message.content
|
||||
print("Chat completion output from input audio:", result)
|
||||
```
|
||||
|
||||
Alternatively, you can pass {code}`audio_url`, which is the audio counterpart of {code}`image_url` for image input:
|
||||
|
||||
```python
chat_completion_from_url = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What's in this audio?"
            },
            {
                "type": "audio_url",
                "audio_url": {
                    "url": audio_url
                },
            },
        ],
    }],
    model=model,
    max_completion_tokens=64,
)

result = chat_completion_from_url.choices[0].message.content
print("Chat completion output from audio url:", result)
```
|
||||
|
||||
A full code example can be found in [examples/openai_chat_completion_client_for_multimodal.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py).
|
||||
|
||||
````{note}
|
||||
By default, the timeout for fetching audio from an HTTP URL is `10` seconds.
|
||||
You can override this by setting the environment variable:
|
||||
|
||||
```console
|
||||
$ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
|
||||
```
|
||||
````
|
||||
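If the server is started from a Python script rather than a shell, the same override can be passed through the child process environment. The following is a minimal sketch under that assumption; the helper name `env_with_audio_timeout` is hypothetical and not part of vLLM, and the launch command is shown only as a comment.

```python
import os


def env_with_audio_timeout(timeout_s, base_env=None):
    """Return a copy of the environment with VLLM_AUDIO_FETCH_TIMEOUT set.

    Hypothetical helper for illustration; vLLM only reads the variable.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Environment variable values must be strings.
    env["VLLM_AUDIO_FETCH_TIMEOUT"] = str(timeout_s)
    return env


env = env_with_audio_timeout(30, base_env={"PATH": "/usr/bin"})
print(env["VLLM_AUDIO_FETCH_TIMEOUT"])

# To launch the server with this environment (not executed here):
# subprocess.Popen(["vllm", "serve", "fixie-ai/ultravox-v0_3"], env=env)
```

Copying the environment instead of mutating `os.environ` keeps the override local to the server process.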
|
||||
### Embedding
|
||||
|
||||
vLLM's Embeddings API is a superset of OpenAI's [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings),
|
||||
where a list of chat `messages` can be passed instead of batched `inputs`. This enables multi-modal inputs to be passed to embedding models.
|
||||
|
||||
```{tip}
|
||||
The schema of `messages` is exactly the same as in Chat Completions API.
|
||||
You can refer to the above tutorials for more details on how to pass each type of multi-modal data.
|
||||
```
|
||||
|
||||
Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images.
|
||||
Refer to the examples below for illustration.
|
||||
|
||||
Here is an end-to-end example using VLM2Vec. To serve the model:
|
||||
|
||||
```bash
|
||||
vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
|
||||
--trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
|
||||
```
|
||||
|
||||
```{important}
|
||||
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
|
||||
to run this model in embedding mode instead of text generation mode.
|
||||
|
||||
The custom chat template is completely different from the original one for this model,
|
||||
and can be found [here](https://github.com/vllm-project/vllm/blob/main/examples/template_vlm2vec.jinja).
|
||||
```
|
||||
|
||||
Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level `requests` library:
|
||||
|
||||
```python
import requests

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Represent the given image."},
            ],
        }],
        "encoding_format": "float",
    },
)
response.raise_for_status()
response_json = response.json()
print("Embedding output:", response_json["data"][0]["embedding"])
```
|
||||
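The embedding returned under `data[0].embedding` is a plain list of floats, so two responses (for example an image and a text query) can be compared client-side. The sketch below assumes cosine similarity is the desired metric; the function name and the stand-in vectors are illustrative and are not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Stand-in vectors; in practice these would come from
# response_json["data"][0]["embedding"] of two separate requests.
image_embedding = [0.1, 0.3, 0.5]
query_embedding = [0.2, 0.6, 1.0]
print(cosine_similarity(image_embedding, query_embedding))
```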
|
||||
Below is another example, this time using the `MrLight/dse-qwen2-2b-mrl-v1` model.
|
||||
|
||||
```bash
|
||||
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
|
||||
--trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
|
||||
```
|
||||
|
||||
```{important}
|
||||
As with VLM2Vec, we have to explicitly pass `--task embed`.
|
||||
|
||||
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
|
||||
by [this custom chat template](https://github.com/vllm-project/vllm/blob/main/examples/template_dse_qwen2_vl.jinja).
|
||||
```
|
||||
|
||||
```{important}
|
||||
In addition, `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
|
||||
example below for details.
|
||||
```
|
||||
|
||||
A full code example can be found in [examples/openai_chat_embedding_client_for_multimodal.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_embedding_client_for_multimodal.py).
|
||||
@@ -1,492 +0,0 @@
|
||||
.. _multimodal_inputs:
|
||||
|
||||
Multimodal Inputs
|
||||
=================
|
||||
|
||||
This page teaches you how to pass multi-modal inputs to :ref:`multi-modal models <supported_mm_models>` in vLLM.
|
||||
|
||||
.. note::
|
||||
We are actively iterating on multi-modal support. See `this RFC <https://github.com/vllm-project/vllm/issues/4194>`_ for upcoming changes,
|
||||
and `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
|
||||
|
||||
Offline Inference
|
||||
-----------------
|
||||
|
||||
To input multi-modal data, follow this schema in :class:`vllm.inputs.PromptType`:
|
||||
|
||||
* ``prompt``: The prompt should follow the format that is documented on HuggingFace.
|
||||
* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.
|
||||
|
||||
Image
|
||||
^^^^^
|
||||
|
||||
You can pass a single image to the :code:`'image'` field of the multi-modal dictionary, as shown in the following examples:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
|
||||
|
||||
# Refer to the HuggingFace repo for the correct format to use
|
||||
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
|
||||
|
||||
# Load the image using PIL.Image
|
||||
image = PIL.Image.open(...)
|
||||
|
||||
# Single prompt inference
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": {"image": image},
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
generated_text = o.outputs[0].text
|
||||
print(generated_text)
|
||||
|
||||
# Batch inference
|
||||
image_1 = PIL.Image.open(...)
|
||||
image_2 = PIL.Image.open(...)
|
||||
outputs = llm.generate(
|
||||
[
|
||||
{
|
||||
"prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
|
||||
"multi_modal_data": {"image": image_1},
|
||||
},
|
||||
{
|
||||
"prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
|
||||
"multi_modal_data": {"image": image_2},
|
||||
}
|
||||
]
|
||||
)
|
||||
|
||||
for o in outputs:
|
||||
generated_text = o.outputs[0].text
|
||||
print(generated_text)
|
||||
|
||||
A code example can be found in `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_.
|
||||
|
||||
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
llm = LLM(
|
||||
model="microsoft/Phi-3.5-vision-instruct",
|
||||
trust_remote_code=True, # Required to load Phi-3.5-vision
|
||||
max_model_len=4096, # Otherwise, it may not fit in smaller GPUs
|
||||
limit_mm_per_prompt={"image": 2}, # The maximum number to accept
|
||||
)
|
||||
|
||||
# Refer to the HuggingFace repo for the correct format to use
|
||||
prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"
|
||||
|
||||
# Load the images using PIL.Image
|
||||
image1 = PIL.Image.open(...)
|
||||
image2 = PIL.Image.open(...)
|
||||
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": {
|
||||
"image": [image1, image2]
|
||||
},
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
generated_text = o.outputs[0].text
|
||||
print(generated_text)
|
||||
|
||||
A code example can be found in `examples/offline_inference_vision_language_multi_image.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py>`_.
|
||||
|
||||
Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL <https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct>`_ as it supports videos:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# Specify the maximum number of frames per video to be 4. This can be changed.
|
||||
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
|
||||
|
||||
# Create the request payload.
|
||||
video_frames = ... # load your video making sure it only has the number of frames specified earlier.
|
||||
message = {
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
|
||||
],
|
||||
}
|
||||
for i in range(len(video_frames)):
|
||||
base64_image = encode_image(video_frames[i]) # base64 encoding.
|
||||
new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
|
||||
message["content"].append(new_image)
|
||||
|
||||
# Perform inference and log output.
|
||||
outputs = llm.chat([message])
|
||||
|
||||
for o in outputs:
|
||||
generated_text = o.outputs[0].text
|
||||
print(generated_text)
|
||||
|
||||
Video
|
||||
^^^^^
|
||||
|
||||
You can pass a list of NumPy arrays directly to the :code:`'video'` field of the multi-modal dictionary
|
||||
instead of using multi-image input.
|
||||
|
||||
Please refer to `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_ for more details.
|
||||
|
||||
Audio
|
||||
^^^^^
|
||||
|
||||
You can pass a tuple :code:`(array, sampling_rate)` to the :code:`'audio'` field of the multi-modal dictionary.
|
||||
|
||||
Please refer to `examples/offline_inference_audio_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_audio_language.py>`_ for more details.
|
||||
|
||||
Embedding
|
||||
^^^^^^^^^
|
||||
|
||||
To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
|
||||
pass a tensor of shape :code:`(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# Inference with image embeddings as input
|
||||
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
|
||||
|
||||
# Refer to the HuggingFace repo for the correct format to use
|
||||
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
|
||||
|
||||
# Embeddings for single image
|
||||
# torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
|
||||
image_embeds = torch.load(...)
|
||||
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": {"image": image_embeds},
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
generated_text = o.outputs[0].text
|
||||
print(generated_text)
|
||||
|
||||
For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# Construct the prompt based on your model
|
||||
prompt = ...
|
||||
|
||||
# Embeddings for multiple images
|
||||
# torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
|
||||
image_embeds = torch.load(...)
|
||||
|
||||
# Qwen2-VL
|
||||
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
|
||||
mm_data = {
|
||||
"image": {
|
||||
"image_embeds": image_embeds,
|
||||
# image_grid_thw is needed to calculate positional encoding.
|
||||
"image_grid_thw": torch.load(...), # torch.Tensor of shape (1, 3),
|
||||
}
|
||||
}
|
||||
|
||||
# MiniCPM-V
|
||||
llm = LLM("openbmb/MiniCPM-V-2_6", trust_remote_code=True, limit_mm_per_prompt={"image": 4})
|
||||
mm_data = {
|
||||
"image": {
|
||||
"image_embeds": image_embeds,
|
||||
# image_size_list is needed to calculate details of the sliced image.
|
||||
"image_size_list": [image.size for image in images], # list of image sizes
|
||||
}
|
||||
}
|
||||
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": mm_data,
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
generated_text = o.outputs[0].text
|
||||
print(generated_text)
|
||||
|
||||
Online Inference
|
||||
----------------
|
||||
|
||||
Our OpenAI-compatible server accepts multi-modal data via the `Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`_.
|
||||
|
||||
.. important::
|
||||
A chat template is **required** to use Chat Completions API.
|
||||
|
||||
Although most models come with a chat template, for others you have to define one yourself.
|
||||
The chat template can be inferred based on the documentation on the model's HuggingFace repo.
|
||||
For example, LLaVA-1.5 (``llava-hf/llava-1.5-7b-hf``) requires a chat template that can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`__.
|
||||
|
||||
Image
|
||||
^^^^^
|
||||
|
||||
Image input is supported according to `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.
|
||||
Here is a simple example using Phi-3.5-Vision.
|
||||
|
||||
First, launch the OpenAI-compatible server:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
|
||||
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2
|
||||
|
||||
Then, you can use the OpenAI client as follows:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from openai import OpenAI
|
||||
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
# Single-image input inference
|
||||
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
|
||||
|
||||
chat_response = client.chat.completions.create(
|
||||
model="microsoft/Phi-3.5-vision-instruct",
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": [
|
||||
# NOTE: The prompt formatting with the image token `<image>` is not needed
|
||||
# since the prompt will be processed automatically by the API server.
|
||||
{"type": "text", "text": "What’s in this image?"},
|
||||
{"type": "image_url", "image_url": {"url": image_url}},
|
||||
],
|
||||
}],
|
||||
)
|
||||
print("Chat completion output:", chat_response.choices[0].message.content)
|
||||
|
||||
# Multi-image input inference
|
||||
image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
|
||||
image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"
|
||||
|
||||
chat_response = client.chat.completions.create(
|
||||
model="microsoft/Phi-3.5-vision-instruct",
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": "What are the animals in these images?"},
|
||||
{"type": "image_url", "image_url": {"url": image_url_duck}},
|
||||
{"type": "image_url", "image_url": {"url": image_url_lion}},
|
||||
],
|
||||
}],
|
||||
)
|
||||
print("Chat completion output:", chat_response.choices[0].message.content)
|
||||
|
||||
A full code example can be found in `examples/openai_chat_completion_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py>`_.
|
||||
|
||||
.. tip::
|
||||
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via ``--allowed-local-media-path`` when launching the API server/engine,
|
||||
and pass the file path as ``url`` in the API request.
|
||||
|
||||
.. tip::
|
||||
There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
|
||||
In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
|
||||
|
||||
.. note::
|
||||
|
||||
By default, the timeout for fetching images through HTTP URL is ``5`` seconds.
|
||||
You can override this by setting the environment variable:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
|
||||
|
||||
Video
|
||||
^^^^^
|
||||
|
||||
Instead of :code:`image_url`, you can pass a video file via :code:`video_url`.
|
||||
|
||||
You can use `these tests <https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_video.py>`_ as reference.
|
||||
|
||||
.. note::
|
||||
|
||||
By default, the timeout for fetching videos through HTTP URL url is ``30`` seconds.
|
||||
You can override this by setting the environment variable:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
|
||||
|
||||
Audio
|
||||
^^^^^
|
||||
|
||||
Audio input is supported according to `OpenAI Audio API <https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in>`_.
|
||||
Here is a simple example using Ultravox-v0.3.
|
||||
|
||||
First, launch the OpenAI-compatible server:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
vllm serve fixie-ai/ultravox-v0_3
|
||||
|
||||
Then, you can use the OpenAI client as follows:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import base64
|
||||
import requests
|
||||
from openai import OpenAI
|
||||
from vllm.assets.audio import AudioAsset
|
||||
|
||||
def encode_base64_content_from_url(content_url: str) -> str:
|
||||
"""Encode a content retrieved from a remote url to base64 format."""
|
||||
|
||||
with requests.get(content_url) as response:
|
||||
response.raise_for_status()
|
||||
result = base64.b64encode(response.content).decode('utf-8')
|
||||
|
||||
return result
|
||||
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
# Any format supported by librosa is supported
|
||||
audio_url = AudioAsset("winning_call").url
|
||||
audio_base64 = encode_base64_content_from_url(audio_url)
|
||||
|
||||
chat_completion_from_base64 = client.chat.completions.create(
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "What's in this audio?"
|
||||
},
|
||||
{
|
||||
"type": "input_audio",
|
||||
"input_audio": {
|
||||
"data": audio_base64,
|
||||
"format": "wav"
|
||||
},
|
||||
},
|
||||
],
|
||||
}],
|
||||
model=model,
|
||||
max_completion_tokens=64,
|
||||
)
|
||||
|
||||
result = chat_completion_from_base64.choices[0].message.content
|
||||
print("Chat completion output from input audio:", result)
|
||||
|
||||
Alternatively, you can pass :code:`audio_url`, which is the audio counterpart of :code:`image_url` for image input:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
chat_completion_from_url = client.chat.completions.create(
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "What's in this audio?"
|
||||
},
|
||||
{
|
||||
"type": "audio_url",
|
||||
"audio_url": {
|
||||
"url": audio_url
|
||||
},
|
||||
},
|
||||
],
|
||||
}],
|
||||
model=model,
|
||||
max_completion_tokens=64,
|
||||
)
|
||||
|
||||
result = chat_completion_from_url.choices[0].message.content
|
||||
print("Chat completion output from audio url:", result)
|
||||
|
||||
A full code example can be found in `examples/openai_chat_completion_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py>`_.
|
||||
|
||||
.. note::
|
||||
|
||||
By default, the timeout for fetching audios through HTTP URL is ``10`` seconds.
|
||||
You can override this by setting the environment variable:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
|
||||
|
||||
Embedding
|
||||
^^^^^^^^^
|
||||
|
||||
vLLM's Embeddings API is a superset of OpenAI's `Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`_,
|
||||
where a list of chat ``messages`` can be passed instead of batched ``inputs``. This enables multi-modal inputs to be passed to embedding models.
|
||||
|
||||
.. tip::
|
||||
The schema of ``messages`` is exactly the same as in Chat Completions API.
|
||||
You can refer to the above tutorials for more details on how to pass each type of multi-modal data.
|
||||
|
||||
Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images.
|
||||
Refer to the examples below for illustration.
|
||||
|
||||
Here is an end-to-end example using VLM2Vec. To serve the model:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
|
||||
--trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
|
||||
|
||||
.. important::
|
||||
|
||||
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass ``--task embed``
|
||||
to run this model in embedding mode instead of text generation mode.
|
||||
|
||||
The custom chat template is completely different from the original one for this model,
|
||||
and can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_vlm2vec.jinja>`__.
|
||||
|
||||
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level ``requests`` library:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import requests
|
||||
|
||||
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
|
||||
|
||||
response = requests.post(
|
||||
"http://localhost:8000/v1/embeddings",
|
||||
json={
|
||||
"model": "TIGER-Lab/VLM2Vec-Full",
|
||||
"messages": [{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "image_url", "image_url": {"url": image_url}},
|
||||
{"type": "text", "text": "Represent the given image."},
|
||||
],
|
||||
}],
|
||||
"encoding_format": "float",
|
||||
},
|
||||
)
|
||||
response.raise_for_status()
|
||||
response_json = response.json()
|
||||
print("Embedding output:", response_json["data"][0]["embedding"])
|
||||
|
||||
Below is another example, this time using the ``MrLight/dse-qwen2-2b-mrl-v1`` model.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
|
||||
--trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
|
||||
|
||||
.. important::
|
||||
|
||||
Like with VLM2Vec, we have to explicitly pass ``--task embed``.
|
||||
|
||||
Additionally, ``MrLight/dse-qwen2-2b-mrl-v1`` requires an EOS token for embeddings, which is handled
|
||||
by `this custom chat template <https://github.com/vllm-project/vllm/blob/main/examples/template_dse_qwen2_vl.jinja>`__.
|
||||
|
||||
.. important::
|
||||
|
||||
Also important, ``MrLight/dse-qwen2-2b-mrl-v1`` requires a placeholder image of the minimum image size for text query embeddings. See the full code
|
||||
example below for details.
|
||||
|
||||
A full code example can be found in `examples/openai_chat_embedding_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_embedding_client_for_multimodal.py>`_.
|
||||
@@ -1,16 +1,15 @@
|
||||
.. _performance:
|
||||
(performance)=
|
||||
|
||||
Performance and Tuning
|
||||
======================
|
||||
# Performance and Tuning
|
||||
|
||||
## Preemption
|
||||
|
||||
Preemption
|
||||
----------
|
||||
Due to the auto-regressive nature of the transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
|
||||
vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
|
||||
available again. When this occurs, the following warning is printed:
|
||||
|
||||
```
|
||||
WARNING 05-09 00:49:33 scheduler.py:1057] Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
|
||||
WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
|
||||
```
|
||||
|
||||
While this mechanism ensures system robustness, preemption and recomputation can adversely affect end-to-end latency.
|
||||
@@ -22,44 +21,44 @@ If you frequently encounter preemptions from the vLLM engine, consider the follo
|
||||
|
||||
You can also monitor the number of preemption requests through the Prometheus metrics exposed by vLLM. Additionally, you can log the cumulative number of preemption requests by setting `disable_log_stats=False`.
|
||||
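As a rough illustration of the monitoring step, the sketch below filters Prometheus text-format output for preemption counters. The metric name `vllm:num_preemptions_total`, the sample text, and the helper name are assumptions made for this example; check the metrics exposed by your own server for the exact names.

```python
def preemption_samples(metrics_text):
    """Extract samples whose name mentions preemption from Prometheus text.

    Assumes the simple "name{labels} value" form without timestamps.
    """
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "preempt" not in line:
            continue
        name_and_labels, _, value = line.rpartition(" ")
        samples[name_and_labels] = float(value)
    return samples


# Made-up sample; a real one would come from the server's metrics endpoint.
sample_text = """\
# HELP vllm:num_preemptions_total Cumulative number of preemptions.
# TYPE vllm:num_preemptions_total counter
vllm:num_preemptions_total{model_name="facebook/opt-125m"} 3.0
vllm:num_requests_running{model_name="facebook/opt-125m"} 1.0
"""
print(preemption_samples(sample_text))
```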
|
||||
.. _chunked-prefill:
|
||||
(chunked-prefill)=
|
||||
|
||||
## Chunked Prefill
|
||||
|
||||
Chunked Prefill
|
||||
---------------
|
||||
vLLM supports chunked prefill as an experimental feature. Chunked prefill splits large prefills into smaller chunks and batches them together with decode requests.
|
||||
|
||||
You can enable the feature by specifying ``--enable-chunked-prefill`` in the command line or setting ``enable_chunked_prefill=True`` in the LLM constructor.
|
||||
You can enable the feature by specifying `--enable-chunked-prefill` in the command line or setting `enable_chunked_prefill=True` in the LLM constructor.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True)
|
||||
# Set max_num_batched_tokens to tune performance.
|
||||
# NOTE: 512 is the default max_num_batched_tokens for chunked prefill.
|
||||
# llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True, max_num_batched_tokens=512)
|
||||
```python
|
||||
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True)
|
||||
# Set max_num_batched_tokens to tune performance.
|
||||
# NOTE: 512 is the default max_num_batched_tokens for chunked prefill.
|
||||
# llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True, max_num_batched_tokens=512)
|
||||
```
|
||||
|
||||
By default, the vLLM scheduler prioritizes prefills and does not place prefill and decode requests in the same batch.
|
||||
This policy optimizes the TTFT (time to the first token), but incurs slower ITL (inter token latency) and inefficient GPU utilization.
|
||||
|
||||
Once chunked prefill is enabled, the policy is changed to prioritize decode requests.
|
||||
It adds all pending decode requests to the batch before scheduling any prefill.
|
||||
When there are available token_budget (``max_num_batched_tokens``), it schedules pending prefills.
|
||||
If a last pending prefill request cannot fit into ``max_num_batched_tokens``, it chunks it.
|
||||
When there are available token_budget (`max_num_batched_tokens`), it schedules pending prefills.
|
||||
If a last pending prefill request cannot fit into `max_num_batched_tokens`, it chunks it.
|
||||
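The scheduling policy described above can be sketched as a toy simulation of a single scheduling step. This is not vLLM's scheduler: the function name, the tuple layout, and the assumption that each decode request consumes exactly one token of the budget are simplifications for illustration.

```python
def schedule_step(num_decodes, pending_prefills, max_num_batched_tokens):
    """Toy model of one scheduling step with chunked prefill enabled.

    pending_prefills is a list of (request_id, prompt_tokens_left).
    Returns (decode_tokens, prefill_chunks, remaining_prefills).
    """
    budget = max_num_batched_tokens

    # 1. Decode requests are scheduled first; each uses one token here.
    decode_tokens = min(num_decodes, budget)
    budget -= decode_tokens

    # 2. The remaining budget goes to prefills, in order. A prefill that
    #    does not fit is chunked and the rest is left for a later step.
    chunks = []
    remaining = []
    for request_id, tokens_left in pending_prefills:
        if budget == 0:
            remaining.append((request_id, tokens_left))
            continue
        take = min(tokens_left, budget)
        chunks.append((request_id, take))
        budget -= take
        if tokens_left > take:
            remaining.append((request_id, tokens_left - take))
    return decode_tokens, chunks, remaining


step = schedule_step(4, [("a", 300), ("b", 400)], max_num_batched_tokens=512)
print(step)
```

In this run, 4 tokens go to decodes, request `a` is prefilled in full, and request `b` is chunked so that its last 192 prompt tokens wait for a later step.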
|
||||
This policy has two benefits:
|
||||
|
||||
- It improves ITL and decode latency because decode requests are prioritized.
|
||||
- It helps achieve better GPU utilization by placing compute-bound (prefill) and memory-bound (decode) requests in the same batch.
|
||||
|
||||
You can tune the performance by changing ``max_num_batched_tokens``.
|
||||
You can tune the performance by changing `max_num_batched_tokens`.
|
||||
By default, it is set to 512, which has the best ITL on A100 in the initial benchmark (Llama 70B and Mixtral 8x22B).
|
||||
Smaller ``max_num_batched_tokens`` achieves better ITL because there are fewer prefills interrupting decodes.
|
||||
Higher ``max_num_batched_tokens`` achieves better TTFT as you can put more prefill to the batch.
|
||||
Smaller `max_num_batched_tokens` achieves better ITL because there are fewer prefills interrupting decodes.
|
||||
Higher `max_num_batched_tokens` achieves better TTFT as you can put more prefill to the batch.
|
||||
|
||||
- If ``max_num_batched_tokens`` is the same as ``max_model_len``, that's almost the equivalent to the default scheduling policy (except that it still prioritizes decodes).
|
||||
- Note that the default value (512) of ``max_num_batched_tokens`` is optimized for ITL, and it may have lower throughput than the default scheduler.
|
||||
- If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the default scheduling policy (except that it still prioritizes decodes).
|
||||
- Note that the default value (512) of `max_num_batched_tokens` is optimized for ITL, and it may have lower throughput than the default scheduler.
|
||||
|
||||
We recommend you set ``max_num_batched_tokens > 2048`` for throughput.
|
||||
We recommend you set `max_num_batched_tokens > 2048` for throughput.
|
||||
|
||||
See related papers for more details (https://arxiv.org/pdf/2401.08671 or https://arxiv.org/pdf/2308.16369).
|
||||
See related papers for more details (<https://arxiv.org/pdf/2401.08671> or <https://arxiv.org/pdf/2308.16369>).
|
||||
|
||||
Please try out this feature and let us know your feedback via GitHub issues!
|
||||
Please try out this feature and let us know your feedback via GitHub issues!
|
||||
205
docs/source/usage/spec_decode.md
Normal file
@@ -0,0 +1,205 @@
|
||||
(spec-decode)=
|
||||
|
||||
# Speculative decoding
|
||||
|
||||
```{warning}
|
||||
Please note that speculative decoding in vLLM is not yet optimized and does
|
||||
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters. The work
|
||||
to optimize it is ongoing and can be followed in [this issue](https://github.com/vllm-project/vllm/issues/4630).
|
||||
```
|
||||
|
||||
```{warning}
|
||||
Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
|
||||
```
|
||||
|
||||
This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
|
||||
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
|
||||
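To make the idea concrete, here is a toy sketch of one speculative step in the greedy case: a cheap draft model proposes several tokens, and the target model keeps the longest prefix it agrees with. The two "models" are stand-in functions over integers, and the sequential verification loop is a simplification. A real engine scores all proposed positions in a single target forward pass, and non-greedy sampling needs a rejection-sampling scheme that is not shown here.

```python
def speculative_step(target_next, draft_next, context, num_speculative_tokens):
    """One greedy speculative decoding step over toy next-token functions."""
    # The draft model proposes tokens autoregressively.
    proposal = []
    draft_context = list(context)
    for _ in range(num_speculative_tokens):
        token = draft_next(draft_context)
        proposal.append(token)
        draft_context.append(token)

    # The target model verifies the proposal position by position.
    accepted = []
    verify_context = list(context)
    for token in proposal:
        expected = target_next(verify_context)
        if token != expected:
            # First mismatch: emit the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        verify_context.append(token)

    # Every proposal was accepted; the target adds one extra token.
    accepted.append(target_next(verify_context))
    return accepted


def toy_target(ctx):
    return (ctx[-1] + 1) % 10


def toy_draft(ctx):
    # Agrees with the target except right after token 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step(toy_target, toy_draft, [1], num_speculative_tokens=5))
```

Each step emits at least one token chosen by the target model, so the output matches what greedy decoding with the target alone would produce; the speedup comes from accepting several draft tokens per target pass.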
|
||||
## Speculating with a draft model
|
||||
|
||||
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
|
||||
|
||||
```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="facebook/opt-6.7b",
    tensor_parallel_size=1,
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
|
||||
|
||||
To do the same in online mode, launch the server:
|
||||
|
||||
```bash
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
    --seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
    --num_speculative_tokens 5 --gpu_memory_utilization 0.8
```
|
||||
|
||||
Then use a client:
|
||||
|
||||
```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Completion API
stream = False
completion = client.completions.create(
    model=model,
    prompt="The future of AI is",
    echo=False,
    n=1,
    stream=stream,
)

print("Completion results:")
if stream:
    for c in completion:
        print(c)
else:
    print(completion)
```
|
||||
|
||||
## Speculating by matching n-grams in the prompt
|
||||
|
||||
The following code configures vLLM to use speculative decoding where proposals are generated by
|
||||
matching n-grams in the prompt. For more information read [this thread.](https://x.com/joao_gante/status/1747322413006643259)
|
||||
|
||||
```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="facebook/opt-6.7b",
    tensor_parallel_size=1,
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
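
To make the n-gram idea concrete, here is a minimal sketch of prompt-lookup proposal in plain Python. It is an illustration under stated assumptions, not vLLM's implementation: `ngram_propose` is a hypothetical helper, tokens are plain integers, and the policy of preferring the longest n-gram and the most recent earlier match is an assumption chosen for this sketch. The arguments `max_ngram` and `num_speculative` only mirror the roles of `ngram_prompt_lookup_max` and `num_speculative_tokens` above.

```python
from typing import List


def ngram_propose(tokens: List[int], max_ngram: int, num_speculative: int) -> List[int]:
    """Propose draft tokens by finding the trailing n-gram earlier in `tokens`.

    Tries the longest n-gram first and returns the tokens that followed its
    most recent earlier occurrence, or an empty list if nothing matches.
    """
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan backwards over earlier start positions (the suffix itself is excluded).
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                return tokens[start + n:start + n + num_speculative]
    return []


# The trailing 3-gram [5, 6, 7] also appears at the start of the sequence,
# so the tokens that followed it there become the proposal.
history = [5, 6, 7, 8, 9, 5, 6, 7]
print(ngram_propose(history, max_ngram=4, num_speculative=3))
```

No draft model is involved: proposals cost only a scan over the existing tokens, which is why this method tends to help most when the output repeats spans of the prompt.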
|
||||
|
||||
## Speculating using MLP speculators
|
||||
|
||||
The following code configures vLLM to use speculative decoding where proposals are generated by
|
||||
draft models that condition draft predictions on both context vectors and sampled tokens.
|
||||
For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
|
||||
[this technical report](https://arxiv.org/abs/2404.19124).
|
||||
|
||||
```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    speculative_model="ibm-fms/llama3-70b-accelerator",
    speculative_draft_tensor_parallel_size=1,
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
|
||||
|
||||
Note that these speculative models currently need to be run without tensor parallelism, although
|
||||
it is possible to run the main model using tensor parallelism (see example above). Since the
|
||||
speculative models are relatively small, we still see significant speedups. However, this
|
||||
limitation will be fixed in a future release.
|
||||
|
||||
A variety of speculative models of this type are available on HF hub:
|
||||
|
||||
- [llama-13b-accelerator](https://huggingface.co/ibm-fms/llama-13b-accelerator)
|
||||
- [llama3-8b-accelerator](https://huggingface.co/ibm-fms/llama3-8b-accelerator)
|
||||
- [codellama-34b-accelerator](https://huggingface.co/ibm-fms/codellama-34b-accelerator)
|
||||
- [llama2-70b-accelerator](https://huggingface.co/ibm-fms/llama2-70b-accelerator)
|
||||
- [llama3-70b-accelerator](https://huggingface.co/ibm-fms/llama3-70b-accelerator)
|
||||
- [granite-3b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-3b-code-instruct-accelerator)
|
||||
- [granite-8b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-8b-code-instruct-accelerator)
|
||||
- [granite-7b-instruct-accelerator](https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator)
|
||||
- [granite-20b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator)
|
||||
|
||||
## Lossless guarantees of Speculative Decoding
|
||||
|
||||
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
|
||||
speculative decoding, breaking down the guarantees into three key areas:
|
||||
|
||||
1. **Theoretical Losslessness**
|
||||
\- Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might
|
||||
cause slight variations in output distributions, as discussed
|
||||
in [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/pdf/2302.01318)
|
||||
|
||||
2. **Algorithmic Losslessness**
|
||||
\- vLLM’s implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:
|
||||
|
||||
> - **Rejection Sampler Convergence**: Ensures that samples from vLLM’s rejection sampler align with the target
|
||||
> distribution. [View Test Code](https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252)
|
||||
> - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
|
||||
> without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
|
||||
> provides a lossless guarantee. Almost all of the tests in [this directory](https://github.com/vllm-project/vllm/tree/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e)
|
||||
> verify this property using [this assertion implementation](https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291)
|
||||
|
||||
3. **vLLM Logprob Stability**
|
||||
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
|
||||
same request across runs. For more details, see the FAQ section
|
||||
titled *Can the output of a prompt vary across runs in vLLM?* in the {ref}`FAQs <faq>`.
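
The theoretical guarantee in item 1 rests on the accept/reject rule from the speculative sampling paper linked there: a token drawn from the draft distribution q is accepted with probability min(1, p/q), where p is the target distribution, and on rejection a replacement is drawn from the normalized residual max(0, p - q). The sketch below checks this empirically for a single position with a three-token vocabulary. It is a standalone illustration, not vLLM's rejection sampler; the function names and the two distributions are made up for the example.

```python
import random
from collections import Counter
from typing import List


def sample(weights: List[float], rng: random.Random) -> int:
    """Draw one index with probability proportional to `weights`."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_sample(target: List[float], draft: List[float], rng: random.Random) -> int:
    """One step of speculative sampling for a single token position."""
    proposed = sample(draft, rng)
    # Accept the draft token with probability min(1, p(x) / q(x)).
    accept_prob = min(1.0, target[proposed] / draft[proposed])
    if rng.random() < accept_prob:
        return proposed
    # On rejection, resample from the residual max(0, p - q); rng.choices
    # normalizes the weights internally.
    residual = [max(0.0, p - q) for p, q in zip(target, draft)]
    return sample(residual, rng)


def empirical_distribution(
    target: List[float], draft: List[float], num_samples: int, seed: int = 0
) -> List[float]:
    """Estimate the output distribution of `speculative_sample` by simulation."""
    rng = random.Random(seed)
    counts = Counter(speculative_sample(target, draft, rng) for _ in range(num_samples))
    return [counts[i] / num_samples for i in range(len(target))]


target_probs = [0.6, 0.3, 0.1]  # illustrative target distribution p
draft_probs = [0.2, 0.5, 0.3]   # illustrative draft distribution q
freqs = empirical_distribution(target_probs, draft_probs, num_samples=100_000)
print([round(f, 3) for f in freqs])
```

The observed frequencies should track `target_probs` rather than `draft_probs`, up to sampling noise, even though every proposal comes from the draft distribution.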
|
||||
|
||||
**Conclusion**
|
||||
|
||||
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
|
||||
can occur due to the following factors:
|
||||
|
||||
- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
|
||||
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
|
||||
due to non-deterministic behavior in batched operations or numerical instability.
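
A small, self-contained illustration of why the order of arithmetic matters: floating-point addition is not associative, so summing the same values in a different order (as can happen when batch composition or kernel scheduling changes) may give a slightly different result. The snippet below uses ordinary Python floats only and does not involve vLLM.

```python
import math

values = [0.1, 0.2, 0.3]

# Same three numbers, two groupings.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)              # prints False
print(math.isclose(left_to_right, right_to_left))  # prints True
```

The two sums differ only in the last bits, but when two candidate tokens have nearly equal probabilities, a difference of that size can be enough to change which one is selected.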
|
||||
|
||||
**Mitigation Strategies**
|
||||
|
||||
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the {ref}`FAQs <faq>`.
|
||||
|
||||
## Resources for vLLM contributors
|
||||
|
||||
- [A Hacker's Guide to Speculative Decoding in vLLM](https://www.youtube.com/watch?v=9wNAgpX6z_4)
|
||||
- [What is Lookahead Scheduling in vLLM?](https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a)
|
||||
- [Information on batch expansion](https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8)
|
||||
- [Dynamic speculative decoding](https://github.com/vllm-project/vllm/issues/4565)
|
||||
@@ -1,210 +0,0 @@
|
||||
.. _spec_decode:
|
||||
|
||||
Speculative decoding
|
||||
====================
|
||||
|
||||
.. warning::
|
||||
Please note that speculative decoding in vLLM is not yet optimized and does
|
||||
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters. The work
|
||||
to optimize it is ongoing and can be followed in `this issue. <https://github.com/vllm-project/vllm/issues/4630>`_
|
||||
|
||||
.. warning::
|
||||
Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
|
||||
|
||||
This document shows how to use `Speculative Decoding <https://x.com/karpathy/status/1697318534555336961>`_ with vLLM.
|
||||
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
|
||||
|
||||
Speculating with a draft model
|
||||
------------------------------
|
||||
|
||||
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
prompts = [
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
|
||||
llm = LLM(
|
||||
model="facebook/opt-6.7b",
|
||||
tensor_parallel_size=1,
|
||||
speculative_model="facebook/opt-125m",
|
||||
num_speculative_tokens=5,
|
||||
)
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
|
||||
To do the same in online mode, launch the server:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
|
||||
--seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
|
||||
--num_speculative_tokens 5 --gpu_memory_utilization 0.8
|
||||
|
||||
Then use a client:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from openai import OpenAI
|
||||
|
||||
# Modify OpenAI's API key and API base to use vLLM's API server.
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
# defaults to os.environ.get("OPENAI_API_KEY")
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
models = client.models.list()
|
||||
model = models.data[0].id
|
||||
|
||||
# Completion API
|
||||
stream = False
|
||||
completion = client.completions.create(
|
||||
model=model,
|
||||
prompt="The future of AI is",
|
||||
echo=False,
|
||||
n=1,
|
||||
stream=stream,
|
||||
)
|
||||
|
||||
print("Completion results:")
|
||||
if stream:
|
||||
for c in completion:
|
||||
print(c)
|
||||
else:
|
||||
print(completion)
|
||||
|
||||
Speculating by matching n-grams in the prompt
|
||||
---------------------------------------------
|
||||
|
||||
The following code configures vLLM to use speculative decoding where proposals are generated by
|
||||
matching n-grams in the prompt. For more information read `this thread. <https://x.com/joao_gante/status/1747322413006643259>`_
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
prompts = [
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
|
||||
llm = LLM(
|
||||
model="facebook/opt-6.7b",
|
||||
tensor_parallel_size=1,
|
||||
speculative_model="[ngram]",
|
||||
num_speculative_tokens=5,
|
||||
ngram_prompt_lookup_max=4,
|
||||
)
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
|
||||
Speculating using MLP speculators
|
||||
---------------------------------
|
||||
|
||||
The following code configures vLLM to use speculative decoding where proposals are generated by
|
||||
draft models that condition draft predictions on both context vectors and sampled tokens.
|
||||
For more information see `this blog <https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/>`_ or
|
||||
`this technical report <https://arxiv.org/abs/2404.19124>`_.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
prompts = [
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
|
||||
tensor_parallel_size=4,
|
||||
speculative_model="ibm-fms/llama3-70b-accelerator",
|
||||
speculative_draft_tensor_parallel_size=1,
|
||||
)
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
|
||||
Note that these speculative models currently need to be run without tensor parallelism, although
|
||||
it is possible to run the main model using tensor parallelism (see example above). Since the
|
||||
speculative models are relatively small, we still see significant speedups. However, this
|
||||
limitation will be fixed in a future release.
|
||||
|
||||
A variety of speculative models of this type are available on HF hub:
|
||||
|
||||
* `llama-13b-accelerator <https://huggingface.co/ibm-fms/llama-13b-accelerator>`_
|
||||
* `llama3-8b-accelerator <https://huggingface.co/ibm-fms/llama3-8b-accelerator>`_
|
||||
* `codellama-34b-accelerator <https://huggingface.co/ibm-fms/codellama-34b-accelerator>`_
|
||||
* `llama2-70b-accelerator <https://huggingface.co/ibm-fms/llama2-70b-accelerator>`_
|
||||
* `llama3-70b-accelerator <https://huggingface.co/ibm-fms/llama3-70b-accelerator>`_
|
||||
* `granite-3b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-3b-code-instruct-accelerator>`_
|
||||
* `granite-8b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-8b-code-instruct-accelerator>`_
|
||||
* `granite-7b-instruct-accelerator <https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator>`_
|
||||
* `granite-20b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator>`_
|
||||
|
||||
Lossless guarantees of Speculative Decoding
|
||||
-------------------------------------------
|
||||
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
|
||||
speculative decoding, breaking down the guarantees into three key areas:
|
||||
|
||||
1. **Theoretical Losslessness**
|
||||
- Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might
|
||||
cause slight variations in output distributions, as discussed
|
||||
in `Accelerating Large Language Model Decoding with Speculative Sampling <https://arxiv.org/pdf/2302.01318>`_
|
||||
|
||||
2. **Algorithmic Losslessness**
|
||||
- vLLM’s implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:
|
||||
|
||||
- **Rejection Sampler Convergence**: Ensures that samples from vLLM’s rejection sampler align with the target
|
||||
distribution. `View Test Code <https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252>`_
|
||||
|
||||
- **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
|
||||
without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
|
||||
provides a lossless guarantee. Almost all of the tests in `this directory <https://github.com/vllm-project/vllm/tree/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e>`_
|
||||
verify this property using `this assertion implementation <https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291>`_
|
||||
|
||||
3. **vLLM Logprob Stability**
|
||||
- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
|
||||
same request across runs. For more details, see the FAQ section
|
||||
titled *Can the output of a prompt vary across runs in vLLM?* in the :ref:`FAQs <faq>`.
|
||||
|
||||
|
||||
**Conclusion**
|
||||
|
||||
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
|
||||
can occur due to the following factors:
|
||||
|
||||
- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
|
||||
|
||||
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
|
||||
due to non-deterministic behavior in batched operations or numerical instability.
|
||||
|
||||
**Mitigation Strategies**
|
||||
|
||||
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the :ref:`FAQs <faq>`.
|
||||
|
||||
Resources for vLLM contributors
|
||||
-------------------------------
|
||||
* `A Hacker's Guide to Speculative Decoding in vLLM <https://www.youtube.com/watch?v=9wNAgpX6z_4>`_
|
||||
* `What is Lookahead Scheduling in vLLM? <https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a>`_
|
||||
* `Information on batch expansion <https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8>`_
|
||||
* `Dynamic speculative decoding <https://github.com/vllm-project/vllm/issues/4565>`_
|
||||
260
docs/source/usage/structured_outputs.md
Normal file
260
docs/source/usage/structured_outputs.md
Normal file
@@ -0,0 +1,260 @@
|
||||
(structured-outputs)=
|
||||
|
||||
# Structured Outputs
|
||||
|
||||
vLLM supports the generation of structured outputs using [outlines](https://github.com/dottxt-ai/outlines) or [lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer) as backends for the guided decoding.
|
||||
This document shows you some examples of the different options that are available to generate structured outputs.
|
||||
|
||||
## Online Inference (OpenAI API)
|
||||
|
||||
You can generate structured outputs using OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs.
|
||||
|
||||
The following parameters are supported, which must be added as extra parameters:
|
||||
|
||||
- `guided_choice`: the output will be exactly one of the choices.
|
||||
- `guided_regex`: the output will follow the regex pattern.
|
||||
- `guided_json`: the output will follow the JSON schema.
|
||||
- `guided_grammar`: the output will follow the context free grammar.
|
||||
- `guided_whitespace_pattern`: used to override the default whitespace pattern for guided json decoding.
|
||||
- `guided_decoding_backend`: used to select the guided decoding backend to use.
|
||||
|
||||
You can see the complete list of supported parameters on the [OpenAI Compatible Server](../serving/openai_compatible_server.md) page.
|
||||
|
||||
Now let's see an example for each of the cases, starting with the `guided_choice`, as it's the easiest one:
|
||||
|
||||
```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="-",
)

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```
|
||||
|
||||
The next example shows how to use the `guided_regex`. The idea is to generate an email address, given a simple regex template:
|
||||
|
||||
```python
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
        }
    ],
    # Raw string, so that \w and \. reach the regex engine without invalid-escape warnings.
    extra_body={"guided_regex": r"\w+@\w+\.com\n", "stop": ["\n"]},
)
print(completion.choices[0].message.content)
```
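
Before sending a pattern to the server, it can help to check locally which strings it admits. The sketch below does this with Python's standard `re` module, on the assumption that the guided decoding backend interprets the pattern in the same way as Python does for this simple case; `matches_guided_regex` is a hypothetical helper, not part of vLLM.

```python
import re

# Same pattern as in the request above, written as a raw string.
EMAIL_PATTERN = re.compile(r"\w+@\w+\.com\n")


def matches_guided_regex(text: str) -> bool:
    """Return True if the whole string is admitted by the pattern."""
    return EMAIL_PATTERN.fullmatch(text) is not None


print(matches_guided_regex("alanturing@enigma.com\n"))
print(matches_guided_regex("alan.turing@enigma.com\n"))
```

Under Python's semantics, `\w` does not match a dot, so the address with a dot in the local part (the one suggested in the prompt) is not admitted by this pattern as a whole, while `alanturing@enigma.com` followed by a newline is. If dotted names are wanted, the pattern would need to allow them, for example with `[\w.]+` before the `@`.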
|
||||
|
||||
One of the most relevant features in structured text generation is the option to generate a valid JSON with pre-defined fields and formats.
|
||||
For this we can use the `guided_json` parameter in two different ways:
|
||||
|
||||
- Using a [JSON Schema](https://json-schema.org/) directly
|
||||
- Defining a [Pydantic model](https://docs.pydantic.dev/latest/) and then extracting the JSON Schema from it (which is normally an easier option).
|
||||
|
||||
The next example shows how to use the `guided_json` parameter with a Pydantic model:
|
||||
|
||||
```python
from pydantic import BaseModel
from enum import Enum

class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"


class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: CarType


json_schema = CarDescription.model_json_schema()

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
        }
    ],
    extra_body={"guided_json": json_schema},
)
print(completion.choices[0].message.content)
```
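
The other option from the list above, writing the JSON Schema by hand, is sketched below using only the standard library. The dictionary `car_schema` is a hand-written approximation of the car description; it is not the exact schema that Pydantic emits (which places the enum under `$defs`). The `conforms` helper is a hypothetical, deliberately minimal check for this one schema, not a general JSON Schema validator.

```python
import json

# Hand-written schema roughly equivalent to the CarDescription model.
car_schema = {
    "type": "object",
    "properties": {
        "brand": {"type": "string"},
        "model": {"type": "string"},
        "car_type": {"type": "string", "enum": ["sedan", "SUV", "Truck", "Coupe"]},
    },
    "required": ["brand", "model", "car_type"],
}

# This dict would be passed as `extra_body` in the same request as above.
extra_body = {"guided_json": car_schema}


def conforms(raw: str) -> bool:
    """Minimal check of a model response against `car_schema` (illustrative only)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    # All required fields must be present and must be strings.
    if not all(isinstance(obj.get(key), str) for key in car_schema["required"]):
        return False
    return obj["car_type"] in car_schema["properties"]["car_type"]["enum"]


print(json.dumps(extra_body, indent=2))
print(conforms('{"brand": "Toyota", "model": "Supra", "car_type": "Coupe"}'))
```

Writing the schema by hand gives direct control over what is sent to the server, at the cost of keeping it in sync with any application types yourself.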
|
||||
|
||||
```{tip}
|
||||
While not strictly necessary, it's normally better to indicate in the prompt that a JSON needs to be generated, which fields it should contain, and how the LLM should fill them.
|
||||
This can improve the results notably in most cases.
|
||||
```
|
||||
|
||||
Finally we have the `guided_grammar`, which is probably the most difficult one to use, but it's really powerful, as it allows us to define complete languages like SQL queries.
|
||||
It works by using a context free EBNF grammar, which for example we can use to define a specific format of simplified SQL queries, like in the example below:
|
||||
|
||||
```python
simplified_sql_grammar = """
?start: select_statement

?select_statement: "SELECT " column_list " FROM " table_name

?column_list: column_name ("," column_name)*

?table_name: identifier

?column_name: identifier

?identifier: /[a-zA-Z_][a-zA-Z0-9_]*/
"""

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
        }
    ],
    extra_body={"guided_grammar": simplified_sql_grammar},
)
print(completion.choices[0].message.content)
```
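
To see exactly which strings the simplified SQL grammar above admits, the sketch below re-expresses it as a regular expression, which is possible because this particular grammar has no recursion. It assumes the string literals in the grammar are matched exactly, with no whitespace skipped; `conforms_to_grammar` is a hypothetical helper for checking responses and is not part of vLLM or of any grammar backend.

```python
import re

# identifier: /[a-zA-Z_][a-zA-Z0-9_]*/
IDENT = r"[a-zA-Z_][a-zA-Z0-9_]*"

# select_statement: "SELECT " column_list " FROM " table_name
# column_list:      column_name ("," column_name)*
SIMPLIFIED_SQL = re.compile(rf"SELECT {IDENT}(?:,{IDENT})* FROM {IDENT}")


def conforms_to_grammar(query: str) -> bool:
    """Return True if the whole query is a sentence of the simplified grammar."""
    return SIMPLIFIED_SQL.fullmatch(query) is not None


print(conforms_to_grammar("SELECT username,email FROM users"))
print(conforms_to_grammar("SELECT username, email FROM users"))
```

Under that reading, the grammar allows a comma between column names but no space after it, and it has no rule for `*` or `WHERE`, so a constrained response would look like `SELECT username,email FROM users`.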
|
||||
|
||||
The complete code of the examples can be found on [examples/openai_chat_completion_structured_outputs.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_structured_outputs.py).
|
||||
|
||||
## Experimental Automatic Parsing (OpenAI API)
|
||||
|
||||
This section covers the OpenAI beta wrapper over the `client.chat.completions.create()` method that provides richer integrations with Python specific types.
|
||||
|
||||
At the time of writing (`openai==1.54.4`), this is a "beta" feature in the OpenAI client library. Code reference can be found [here](https://github.com/openai/openai-python/blob/52357cff50bee57ef442e94d78a0de38b4173fc2/src/openai/resources/beta/chat/completions.py#L100-L104).
|
||||
|
||||
For the following examples, vLLM was set up using `vllm serve meta-llama/Llama-3.1-8B-Instruct`.
|
||||
|
||||
Here is a simple example demonstrating how to get structured output using Pydantic models:
|
||||
|
||||
```python
from pydantic import BaseModel
from openai import OpenAI


class Info(BaseModel):
    name: str
    age: int


client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "My name is Cameron, I'm 28. What's my name and age?"},
    ],
    response_format=Info,
    extra_body=dict(guided_decoding_backend="outlines"),
)

message = completion.choices[0].message
print(message)
assert message.parsed
print("Name:", message.parsed.name)
print("Age:", message.parsed.age)
```
|
||||
|
||||
Output:
|
||||
|
||||
```console
|
||||
ParsedChatCompletionMessage[Info](content='{"name": "Cameron", "age": 28}', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=Info(name='Cameron', age=28))
|
||||
Name: Cameron
|
||||
Age: 28
|
||||
```
|
||||
|
||||
Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:
|
||||
|
||||
```python
from typing import List
from pydantic import BaseModel
from openai import OpenAI


class Step(BaseModel):
    explanation: str
    output: str


class MathResponse(BaseModel):
    steps: List[Step]
    final_answer: str


client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful expert math tutor."},
        {"role": "user", "content": "Solve 8x + 31 = 2."},
    ],
    response_format=MathResponse,
    extra_body=dict(guided_decoding_backend="outlines"),
)

message = completion.choices[0].message
print(message)
assert message.parsed
for i, step in enumerate(message.parsed.steps):
    print(f"Step #{i}:", step)
print("Answer:", message.parsed.final_answer)
```
|
||||
|
||||
Output:
|
||||
|
||||
```console
|
||||
ParsedChatCompletionMessage[MathResponse](content='{ "steps": [{ "explanation": "First, let\'s isolate the term with the variable \'x\'. To do this, we\'ll subtract 31 from both sides of the equation.", "output": "8x + 31 - 31 = 2 - 31"}, { "explanation": "By subtracting 31 from both sides, we simplify the equation to 8x = -29.", "output": "8x = -29"}, { "explanation": "Next, let\'s isolate \'x\' by dividing both sides of the equation by 8.", "output": "8x / 8 = -29 / 8"}], "final_answer": "x = -29/8" }', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=MathResponse(steps=[Step(explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation.", output='8x + 31 - 31 = 2 - 31'), Step(explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.', output='8x = -29'), Step(explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8.", output='8x / 8 = -29 / 8')], final_answer='x = -29/8'))
|
||||
Step #0: explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation." output='8x + 31 - 31 = 2 - 31'
|
||||
Step #1: explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.' output='8x = -29'
|
||||
Step #2: explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8." output='8x / 8 = -29 / 8'
|
||||
Answer: x = -29/8
|
||||
```
|
||||
|
||||
## Offline Inference
|
||||
|
||||
Offline inference allows for the same types of guided decoding.
|
||||
To use it, we'll need to configure the guided decoding using the class `GuidedDecodingParams` inside `SamplingParams`.
|
||||
The main available options inside `GuidedDecodingParams` are:
|
||||
|
||||
- `json`
|
||||
- `regex`
|
||||
- `choice`
|
||||
- `grammar`
|
||||
- `backend`
|
||||
- `whitespace_pattern`
|
||||
|
||||
These parameters can be used in the same way as the parameters from the Online Inference examples above.
|
||||
One example of using the `choice` parameter is shown below:
|
||||
|
||||
```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"])
sampling_params = SamplingParams(guided_decoding=guided_decoding_params)
outputs = llm.generate(
    prompts="Classify this sentiment: vLLM is wonderful!",
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```
|
||||
|
||||
A complete example with all options can be found in [examples/offline_inference_structured_outputs.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_structured_outputs.py).
|
||||
@@ -1,267 +0,0 @@
|
||||
.. _structured_outputs:
|
||||
|
||||
Structured Outputs
|
||||
==================
|
||||
|
||||
vLLM supports the generation of structured outputs using `outlines <https://github.com/dottxt-ai/outlines>`_ or `lm-format-enforcer <https://github.com/noamgat/lm-format-enforcer>`_ as backends for the guided decoding.
|
||||
This document shows you some examples of the different options that are available to generate structured outputs.
|
||||
|
||||
|
||||
Online Inference (OpenAI API)
|
||||
-----------------------------
|
||||
|
||||
You can generate structured outputs using the OpenAI's `Completions <https://platform.openai.com/docs/api-reference/completions>`_ and `Chat <https://platform.openai.com/docs/api-reference/chat>`_ API.
|
||||
|
||||
The following parameters are supported, which must be added as extra parameters:
|
||||
|
||||
- ``guided_choice``: the output will be exactly one of the choices.
|
||||
- ``guided_regex``: the output will follow the regex pattern.
|
||||
- ``guided_json``: the output will follow the JSON schema.
|
||||
- ``guided_grammar``: the output will follow the context free grammar.
|
||||
- ``guided_whitespace_pattern``: used to override the default whitespace pattern for guided json decoding.
|
||||
- ``guided_decoding_backend``: used to select the guided decoding backend to use.
|
||||
|
||||
You can see the complete list of supported parameters on the `OpenAI Compatible Server </../serving/openai_compatible_server.html>`_ page.
|
||||
|
||||
Now let's see an example for each of the cases, starting with the ``guided_choice``, as it's the easiest one:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from openai import OpenAI
|
||||
client = OpenAI(
|
||||
base_url="http://localhost:8000/v1",
|
||||
api_key="-",
|
||||
)
|
||||
|
||||
completion = client.chat.completions.create(
|
||||
model="Qwen/Qwen2.5-3B-Instruct",
|
||||
messages=[
|
||||
{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
|
||||
],
|
||||
extra_body={"guided_choice": ["positive", "negative"]},
|
||||
)
|
||||
print(completion.choices[0].message.content)
|
||||
|
||||
|
||||
The next example shows how to use ``guided_regex``. The idea is to generate an email address, given a simple regex template:

.. code-block:: python

    completion = client.chat.completions.create(
        model="Qwen/Qwen2.5-3B-Instruct",
        messages=[
            {
                "role": "user",
                "content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
            }
        ],
        # A raw string avoids invalid escape sequences such as "\w" in Python.
        extra_body={"guided_regex": r"\w+@\w+\.com\n", "stop": ["\n"]},
    )
    print(completion.choices[0].message.content)

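As a quick local sanity check, the pattern passed to ``guided_regex`` can be tried with Python's ``re`` module before sending a request. This is only a sketch: the guided decoding backends use their own regex engines, so ``re`` is assumed here to agree with them for a simple pattern like this one, and the sample strings are made up for illustration rather than produced by a model.

.. code-block:: python

    import re

    # Same pattern as in the request above; a raw string keeps the backslashes intact.
    EMAIL_PATTERN = r"\w+@\w+\.com\n"

    # Hypothetical sample outputs, not produced by a model.
    samples = [
        "alan@enigma.com\n",
        "alan.turing@enigma.com\n",  # "\w" does not include "."
        "alan@enigma.org\n",         # must end in ".com"
    ]

    # Map each sample to whether the whole string matches the pattern.
    results = {s: re.fullmatch(EMAIL_PATTERN, s) is not None for s in samples}
    for text, ok in results.items():
        print(repr(text), "->", ok)

Only the first sample matches. Because ``\w`` excludes the dot, an address shaped like the one suggested in the prompt (``alan.turing@enigma.com``) would not satisfy this particular pattern.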
One of the most relevant features in structured text generation is the option to generate valid JSON with pre-defined fields and formats.
For this we can use the ``guided_json`` parameter in two different ways:

- Using a `JSON Schema <https://json-schema.org/>`_ directly
- Defining a `Pydantic model <https://docs.pydantic.dev/latest/>`_ and then extracting the JSON Schema from it (which is normally the easier option).

The next example shows how to use the ``guided_json`` parameter with a Pydantic model:

.. code-block:: python

    from pydantic import BaseModel
    from enum import Enum

    class CarType(str, Enum):
        sedan = "sedan"
        suv = "SUV"
        truck = "Truck"
        coupe = "Coupe"


    class CarDescription(BaseModel):
        brand: str
        model: str
        car_type: CarType


    # Extract the JSON Schema that the output must conform to.
    json_schema = CarDescription.model_json_schema()

    completion = client.chat.completions.create(
        model="Qwen/Qwen2.5-3B-Instruct",
        messages=[
            {
                "role": "user",
                "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
            }
        ],
        extra_body={"guided_json": json_schema},
    )
    print(completion.choices[0].message.content)

.. tip::
    While not strictly necessary, it's usually better to state in the prompt that JSON should be generated, which fields are expected, and how the LLM should fill them.
    This can improve the results notably in most cases.


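The first option, passing a JSON Schema directly, could look like the sketch below. The schema is hand-written here to mirror the ``CarDescription`` model and is an assumption for illustration; the schema that Pydantic generates is structured differently (for example, it places the enum under ``$defs``). Only the request payload is built, so no server is needed to run it.

.. code-block:: python

    import json

    # Hand-written JSON Schema (illustrative) with the same three fields.
    car_schema = {
        "type": "object",
        "properties": {
            "brand": {"type": "string"},
            "model": {"type": "string"},
            "car_type": {"type": "string", "enum": ["sedan", "SUV", "Truck", "Coupe"]},
        },
        "required": ["brand", "model", "car_type"],
    }

    # This dict would be passed as ``extra_body`` to ``client.chat.completions.create``.
    extra_body = {"guided_json": car_schema}
    print(json.dumps(extra_body, indent=2))
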
Finally, there is ``guided_grammar``, which is probably the most difficult option to use but is very powerful, as it lets us define complete languages such as SQL queries.
It works by using a context-free EBNF grammar, which we can use, for example, to define a specific format of simplified SQL queries, as in the example below:

.. code-block:: python

    simplified_sql_grammar = """
        ?start: select_statement

        ?select_statement: "SELECT " column_list " FROM " table_name

        ?column_list: column_name ("," column_name)*

        ?table_name: identifier

        ?column_name: identifier

        ?identifier: /[a-zA-Z_][a-zA-Z0-9_]*/
    """

    completion = client.chat.completions.create(
        model="Qwen/Qwen2.5-3B-Instruct",
        messages=[
            {
                "role": "user",
                "content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
            }
        ],
        extra_body={"guided_grammar": simplified_sql_grammar},
    )
    print(completion.choices[0].message.content)

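To get a feel for which strings this grammar admits, the sketch below re-expresses it as a plain Python regular expression. The regex is a hand-derived equivalent written for illustration, not something vLLM uses internally, and the candidate queries are made-up examples. Note that the grammar contains no whitespace rule, so a space after the comma is not allowed.

.. code-block:: python

    import re

    IDENT = r"[a-zA-Z_][a-zA-Z0-9_]*"
    # "SELECT " column_list " FROM " table_name, with column_list = ident ("," ident)*
    SQL_PATTERN = re.compile(rf"SELECT {IDENT}(?:,{IDENT})* FROM {IDENT}")


    def matches_grammar(query: str) -> bool:
        """Return True if the whole string fits the simplified SQL shape."""
        return SQL_PATTERN.fullmatch(query) is not None


    print(matches_grammar("SELECT username,email FROM users"))   # True
    print(matches_grammar("SELECT username, email FROM users"))  # False: space after comma
    print(matches_grammar("SELECT * FROM users"))                # False: "*" is not an identifier
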
The complete code of the examples can be found in `examples/openai_chat_completion_structured_outputs.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_structured_outputs.py>`_.

Experimental Automatic Parsing (OpenAI API)
--------------------------------------------

This section covers the OpenAI beta wrapper over the ``client.chat.completions.create()`` method that provides richer integrations with Python-specific types.

At the time of writing (``openai==1.54.4``), this is a "beta" feature in the OpenAI client library. Code reference can be found `here <https://github.com/openai/openai-python/blob/52357cff50bee57ef442e94d78a0de38b4173fc2/src/openai/resources/beta/chat/completions.py#L100-L104>`_.

For the following examples, vLLM was set up using ``vllm serve meta-llama/Llama-3.1-8B-Instruct``.

Here is a simple example demonstrating how to get structured output using Pydantic models:

.. code-block:: python

    from pydantic import BaseModel
    from openai import OpenAI


    class Info(BaseModel):
        name: str
        age: int


    client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
    completion = client.beta.chat.completions.parse(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "My name is Cameron, I'm 28. What's my name and age?"},
        ],
        # The Pydantic class is passed directly as the response format.
        response_format=Info,
        extra_body=dict(guided_decoding_backend="outlines"),
    )

    message = completion.choices[0].message
    print(message)
    assert message.parsed
    print("Name:", message.parsed.name)
    print("Age:", message.parsed.age)

Output:

.. code-block:: console

    ParsedChatCompletionMessage[Info](content='{"name": "Cameron", "age": 28}', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=Info(name='Cameron', age=28))
    Name: Cameron
    Age: 28


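Conceptually, the ``parsed`` attribute holds the JSON in ``content`` validated into the class passed as ``response_format``. The sketch below imitates that step using only the standard library; the ``Info`` dataclass and the ``parse_info`` helper are stand-ins invented for illustration, whereas the real client relies on Pydantic for this validation.

.. code-block:: python

    import json
    from dataclasses import dataclass


    @dataclass
    class Info:
        name: str
        age: int


    def parse_info(content: str) -> Info:
        """Decode the JSON text and check field types before building Info."""
        data = json.loads(content)
        if not isinstance(data.get("name"), str) or not isinstance(data.get("age"), int):
            raise ValueError(f"unexpected payload: {data!r}")
        return Info(name=data["name"], age=data["age"])


    # The content string is taken from the example output above.
    info = parse_info('{"name": "Cameron", "age": 28}')
    print("Name:", info.name)
    print("Age:", info.age)
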
Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:

.. code-block:: python

    from typing import List
    from pydantic import BaseModel
    from openai import OpenAI


    class Step(BaseModel):
        explanation: str
        output: str


    class MathResponse(BaseModel):
        steps: List[Step]
        final_answer: str


    client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
    completion = client.beta.chat.completions.parse(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful expert math tutor."},
            {"role": "user", "content": "Solve 8x + 31 = 2."},
        ],
        response_format=MathResponse,
        extra_body=dict(guided_decoding_backend="outlines"),
    )

    message = completion.choices[0].message
    print(message)
    assert message.parsed
    for i, step in enumerate(message.parsed.steps):
        print(f"Step #{i}:", step)
    print("Answer:", message.parsed.final_answer)

Output:

.. code-block:: console

    ParsedChatCompletionMessage[MathResponse](content='{ "steps": [{ "explanation": "First, let\'s isolate the term with the variable \'x\'. To do this, we\'ll subtract 31 from both sides of the equation.", "output": "8x + 31 - 31 = 2 - 31"}, { "explanation": "By subtracting 31 from both sides, we simplify the equation to 8x = -29.", "output": "8x = -29"}, { "explanation": "Next, let\'s isolate \'x\' by dividing both sides of the equation by 8.", "output": "8x / 8 = -29 / 8"}], "final_answer": "x = -29/8" }', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=MathResponse(steps=[Step(explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation.", output='8x + 31 - 31 = 2 - 31'), Step(explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.', output='8x = -29'), Step(explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8.", output='8x / 8 = -29 / 8')], final_answer='x = -29/8'))
    Step #0: explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation." output='8x + 31 - 31 = 2 - 31'
    Step #1: explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.' output='8x = -29'
    Step #2: explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8." output='8x / 8 = -29 / 8'
    Answer: x = -29/8

Offline Inference
-----------------

Offline inference allows for the same types of guided decoding.
To use it, we'll need to configure the guided decoding using the class ``GuidedDecodingParams`` inside ``SamplingParams``.
The main available options inside ``GuidedDecodingParams`` are:

- ``json``
- ``regex``
- ``choice``
- ``grammar``
- ``backend``
- ``whitespace_pattern``

These parameters can be used in the same way as the parameters from the Online Inference examples above.
One example of the usage of the ``choice`` parameter is shown below:

.. code-block:: python

    from vllm import LLM, SamplingParams
    from vllm.sampling_params import GuidedDecodingParams

    llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

    # Constrain the generated text to one of the two labels.
    guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"])
    sampling_params = SamplingParams(guided_decoding=guided_decoding_params)
    outputs = llm.generate(
        prompts="Classify this sentiment: vLLM is wonderful!",
        sampling_params=sampling_params,
    )
    print(outputs[0].outputs[0].text)

A complete example with all options can be found in `examples/offline_inference_structured_outputs.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_structured_outputs.py>`_.

@@ -47,7 +47,7 @@ tail ~/.config/vllm/usage_stats.json
|
||||
|
||||
## Opt-out of Usage Stats Collection
|
||||
|
||||
You can opt-out of usage stats collection by setting the VLLM_NO_USAGE_STATS or DO_NOT_TRACK environment variable, or by creating a ~/.config/vllm/do_not_track file:
|
||||
You can opt-out of usage stats collection by setting the `VLLM_NO_USAGE_STATS` or `DO_NOT_TRACK` environment variable, or by creating a `~/.config/vllm/do_not_track` file:
|
||||
|
||||
```bash
|
||||
# Any of the following methods can disable usage stats collection
|
||||
|
||||