[Doc] Convert docs to use colon fences (#12471)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

@@ -6,9 +6,9 @@

Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.

```{note}
:::{note}
Technical details on how vLLM implements APC can be found [here](#design-automatic-prefix-caching).
```
:::

## Enabling APC in vLLM
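
APC is switched on through the `enable_prefix_caching` engine argument. A minimal sketch, assuming the vLLM Python API as of this commit (the model choice is purely illustrative):

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching=True turns on APC for this engine instance.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

shared_prefix = "You are a concise assistant. " * 50
params = SamplingParams(temperature=0.0, max_tokens=32)

# The second query shares the long prefix with the first, so its prefill
# can reuse the cached KV blocks instead of recomputing them.
print(llm.generate([shared_prefix + "Summarize document A."], params)[0].outputs[0].text)
print(llm.generate([shared_prefix + "Summarize document B."], params)[0].outputs[0].text)
```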

@@ -4,13 +4,13 @@

The tables below show mutually exclusive features and the support on some hardware.

```{note}
:::{note}
Check the '✗' with links to see the tracking issue for an unsupported feature/hardware combination.
```
:::

## Feature x Feature

```{raw} html
:::{raw} html
<style>
/* Make smaller to try to improve readability */
td {
@@ -23,448 +23,447 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  font-size: 0.8rem;
}
</style>
```
:::

```{list-table}
:header-rows: 1
:stub-columns: 1
:widths: auto
:::{list-table}
:header-rows: 1
:stub-columns: 1
:widths: auto

* - Feature
  - [CP](#chunked-prefill)
  - [APC](#automatic-prefix-caching)
  - [LoRA](#lora-adapter)
  - <abbr title="Prompt Adapter">prmpt adptr</abbr>
  - [SD](#spec_decode)
  - CUDA graph
  - <abbr title="Pooling Models">pooling</abbr>
  - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
  - <abbr title="Logprobs">logP</abbr>
  - <abbr title="Prompt Logprobs">prmpt logP</abbr>
  - <abbr title="Async Output Processing">async output</abbr>
  - multi-step
  - <abbr title="Multimodal Inputs">mm</abbr>
  - best-of
  - beam-search
  - <abbr title="Guided Decoding">guided dec</abbr>
* - [CP](#chunked-prefill)
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
* - [APC](#automatic-prefix-caching)
  - ✅
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
* - [LoRA](#lora-adapter)
  - [✗](gh-pr:9057)
  - ✅
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
* - <abbr title="Prompt Adapter">prmpt adptr</abbr>
  - ✅
  - ✅
  - ✅
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
* - [SD](#spec_decode)
  - ✅
  - ✅
  - ✗
  - ✅
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
* - CUDA graph
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
* - <abbr title="Pooling Models">pooling</abbr>
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
  -
  -
  -
  -
  -
  -
  -
  -
  -
  -
* - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
  - ✗
  - [✗](gh-issue:7366)
  - ✗
  - ✗
  - [✗](gh-issue:7366)
  - ✅
  - ✅
  -
  -
  -
  -
  -
  -
  -
  -
  -
* - <abbr title="Logprobs">logP</abbr>
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✗
  - ✅
  -
  -
  -
  -
  -
  -
  -
  -
* - <abbr title="Prompt Logprobs">prmpt logP</abbr>
  - ✅
  - ✅
  - ✅
  - ✅
  - [✗](gh-pr:8199)
  - ✅
  - ✗
  - ✅
  - ✅
  -
  -
  -
  -
  -
  -
  -
* - <abbr title="Async Output Processing">async output</abbr>
  - ✅
  - ✅
  - ✅
  - ✅
  - ✗
  - ✅
  - ✗
  - ✗
  - ✅
  - ✅
  -
  -
  -
  -
  -
  -
* - multi-step
  - ✗
  - ✅
  - ✗
  - ✅
  - ✗
  - ✅
  - ✗
  - ✗
  - ✅
  - [✗](gh-issue:8198)
  - ✅
  -
  -
  -
  -
  -
* - <abbr title="Multimodal Inputs">mm</abbr>
  - ✅
  - [✗](gh-pr:8348)
  - [✗](gh-pr:7199)
  - ?
  - ?
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ?
  -
  -
  -
  -
* - best-of
  - ✅
  - ✅
  - ✅
  - ✅
  - [✗](gh-issue:6137)
  - ✅
  - ✗
  - ✅
  - ✅
  - ✅
  - ?
  - [✗](gh-issue:7968)
  - ✅
  -
  -
  -
* - beam-search
  - ✅
  - ✅
  - ✅
  - ✅
  - [✗](gh-issue:6137)
  - ✅
  - ✗
  - ✅
  - ✅
  - ✅
  - ?
  - [✗](gh-issue:7968)
  - ?
  - ✅
  -
  -
* - <abbr title="Guided Decoding">guided dec</abbr>
  - ✅
  - ✅
  - ?
  - ?
  - [✗](gh-issue:11484)
  - ✅
  - ✗
  - ?
  - ✅
  - ✅
  - ✅
  - [✗](gh-issue:9893)
  - ?
  - ✅
  - ✅
  -

```
- * Feature
  * [CP](#chunked-prefill)
  * [APC](#automatic-prefix-caching)
  * [LoRA](#lora-adapter)
  * <abbr title="Prompt Adapter">prmpt adptr</abbr>
  * [SD](#spec_decode)
  * CUDA graph
  * <abbr title="Pooling Models">pooling</abbr>
  * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
  * <abbr title="Logprobs">logP</abbr>
  * <abbr title="Prompt Logprobs">prmpt logP</abbr>
  * <abbr title="Async Output Processing">async output</abbr>
  * multi-step
  * <abbr title="Multimodal Inputs">mm</abbr>
  * best-of
  * beam-search
  * <abbr title="Guided Decoding">guided dec</abbr>
- * [CP](#chunked-prefill)
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
- * [APC](#automatic-prefix-caching)
  * ✅
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
- * [LoRA](#lora-adapter)
  * [✗](gh-pr:9057)
  * ✅
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
- * <abbr title="Prompt Adapter">prmpt adptr</abbr>
  * ✅
  * ✅
  * ✅
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
- * [SD](#spec_decode)
  * ✅
  * ✅
  * ✗
  * ✅
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
- * CUDA graph
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
- * <abbr title="Pooling Models">pooling</abbr>
  * ✗
  * ✗
  * ✗
  * ✗
  * ✗
  * ✗
  *
  *
  *
  *
  *
  *
  *
  *
  *
  *
- * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
  * ✗
  * [✗](gh-issue:7366)
  * ✗
  * ✗
  * [✗](gh-issue:7366)
  * ✅
  * ✅
  *
  *
  *
  *
  *
  *
  *
  *
  *
- * <abbr title="Logprobs">logP</abbr>
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✗
  * ✅
  *
  *
  *
  *
  *
  *
  *
  *
- * <abbr title="Prompt Logprobs">prmpt logP</abbr>
  * ✅
  * ✅
  * ✅
  * ✅
  * [✗](gh-pr:8199)
  * ✅
  * ✗
  * ✅
  * ✅
  *
  *
  *
  *
  *
  *
  *
- * <abbr title="Async Output Processing">async output</abbr>
  * ✅
  * ✅
  * ✅
  * ✅
  * ✗
  * ✅
  * ✗
  * ✗
  * ✅
  * ✅
  *
  *
  *
  *
  *
  *
- * multi-step
  * ✗
  * ✅
  * ✗
  * ✅
  * ✗
  * ✅
  * ✗
  * ✗
  * ✅
  * [✗](gh-issue:8198)
  * ✅
  *
  *
  *
  *
  *
- * <abbr title="Multimodal Inputs">mm</abbr>
  * ✅
  * [✗](gh-pr:8348)
  * [✗](gh-pr:7199)
  * ?
  * ?
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ?
  *
  *
  *
  *
- * best-of
  * ✅
  * ✅
  * ✅
  * ✅
  * [✗](gh-issue:6137)
  * ✅
  * ✗
  * ✅
  * ✅
  * ✅
  * ?
  * [✗](gh-issue:7968)
  * ✅
  *
  *
  *
- * beam-search
  * ✅
  * ✅
  * ✅
  * ✅
  * [✗](gh-issue:6137)
  * ✅
  * ✗
  * ✅
  * ✅
  * ✅
  * ?
  * [✗](gh-issue:7968)
  * ?
  * ✅
  *
  *
- * <abbr title="Guided Decoding">guided dec</abbr>
  * ✅
  * ✅
  * ?
  * ?
  * [✗](gh-issue:11484)
  * ✅
  * ✗
  * ?
  * ✅
  * ✅
  * ✅
  * [✗](gh-issue:9893)
  * ?
  * ✅
  * ✅
  *
:::

(feature-x-hardware)=

## Feature x Hardware

```{list-table}
:header-rows: 1
:stub-columns: 1
:widths: auto
:::{list-table}
:header-rows: 1
:stub-columns: 1
:widths: auto

* - Feature
  - Volta
  - Turing
  - Ampere
  - Ada
  - Hopper
  - CPU
  - AMD
* - [CP](#chunked-prefill)
  - [✗](gh-issue:2729)
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
* - [APC](#automatic-prefix-caching)
  - [✗](gh-issue:3687)
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
* - [LoRA](#lora-adapter)
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
* - <abbr title="Prompt Adapter">prmpt adptr</abbr>
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - [✗](gh-issue:8475)
  - ✅
* - [SD](#spec_decode)
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
* - CUDA graph
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✗
  - ✅
* - <abbr title="Pooling Models">pooling</abbr>
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ?
* - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✗
* - <abbr title="Multimodal Inputs">mm</abbr>
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
* - <abbr title="Logprobs">logP</abbr>
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
* - <abbr title="Prompt Logprobs">prmpt logP</abbr>
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
* - <abbr title="Async Output Processing">async output</abbr>
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✗
  - ✗
* - multi-step
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - [✗](gh-issue:8477)
  - ✅
* - best-of
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
* - beam-search
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
* - <abbr title="Guided Decoding">guided dec</abbr>
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
  - ✅
```
- * Feature
  * Volta
  * Turing
  * Ampere
  * Ada
  * Hopper
  * CPU
  * AMD
- * [CP](#chunked-prefill)
  * [✗](gh-issue:2729)
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
- * [APC](#automatic-prefix-caching)
  * [✗](gh-issue:3687)
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
- * [LoRA](#lora-adapter)
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
- * <abbr title="Prompt Adapter">prmpt adptr</abbr>
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * [✗](gh-issue:8475)
  * ✅
- * [SD](#spec_decode)
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
- * CUDA graph
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✗
  * ✅
- * <abbr title="Pooling Models">pooling</abbr>
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ?
- * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✗
- * <abbr title="Multimodal Inputs">mm</abbr>
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
- * <abbr title="Logprobs">logP</abbr>
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
- * <abbr title="Prompt Logprobs">prmpt logP</abbr>
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
- * <abbr title="Async Output Processing">async output</abbr>
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✗
  * ✗
- * multi-step
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * [✗](gh-issue:8477)
  * ✅
- * best-of
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
- * beam-search
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
- * <abbr title="Guided Decoding">guided dec</abbr>
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
:::

@@ -4,9 +4,9 @@

This page introduces you to the disaggregated prefilling feature in vLLM.

```{note}
:::{note}
This feature is experimental and subject to change.
```
:::

## Why disaggregated prefilling?

@@ -15,9 +15,9 @@ Two main reasons:

- **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling puts the prefill and decode phases of LLM inference inside different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. `tp` and `pp`) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
- **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert prefill jobs during the decoding of a request, which results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL. Chunked prefill with a proper chunk size can also achieve the same goal, but in practice it is hard to figure out the correct chunk size, so disaggregated prefilling is a much more reliable way to control tail ITL.

```{note}
:::{note}
Disaggregated prefill DOES NOT improve throughput.
```
:::

## Usage example

@@ -39,21 +39,21 @@ Key abstractions for disaggregated prefilling:

- **LookupBuffer**: LookupBuffer provides two APIs: `insert` KV cache and `drop_select` KV cache. The semantics of `insert` and `drop_select` are similar to SQL, where `insert` inserts a KV cache into the buffer, and `drop_select` returns the KV cache that matches the given condition and drops it from the buffer.
- **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports `send_tensor` and `recv_tensor`.

```{note}
:::{note}
`insert` is a non-blocking operation but `drop_select` is a blocking operation.
```
:::
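
To make those semantics concrete, here is a toy in-memory stand-in (not vLLM's actual implementation) sketching the non-blocking `insert` and blocking `drop_select` behaviour:

```python
import queue
import threading

class ToyLookupBuffer:
    """Toy stand-in illustrating LookupBuffer semantics; not vLLM's code."""

    def __init__(self) -> None:
        self._buffer: "queue.Queue[tuple[str, list[float]]]" = queue.Queue()

    def insert(self, key: str, kv_cache: list[float]) -> None:
        # Non-blocking: hand off the KV cache and return immediately.
        self._buffer.put((key, kv_cache))

    def drop_select(self, key: str) -> list[float]:
        # Blocking: wait until a matching KV cache arrives, then remove it.
        while True:
            k, kv = self._buffer.get()  # blocks while the buffer is empty
            if k == key:
                return kv
            self._buffer.put((k, kv))  # not a match; put it back

buf = ToyLookupBuffer()
threading.Thread(target=lambda: buf.insert("prompt-1", [0.1, 0.2])).start()
print(buf.drop_select("prompt-1"))  # [0.1, 0.2]
```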

Here is a figure illustrating how the above 3 abstractions are organized:

```{image} /assets/features/disagg_prefill/abstraction.jpg
:::{image} /assets/features/disagg_prefill/abstraction.jpg
:alt: Disaggregated prefilling abstractions
```
:::

The workflow of disaggregated prefilling is as follows:

```{image} /assets/features/disagg_prefill/overview.jpg
:::{image} /assets/features/disagg_prefill/overview.jpg
:alt: Disaggregated prefilling workflow
```
:::

The `buffer` corresponds to the `insert` API in LookupBuffer, and `drop_select` corresponds to the `drop_select` API in LookupBuffer.

@@ -60,9 +60,9 @@ vllm serve meta-llama/Llama-2-7b-hf \

    --lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
```

```{note}
:::{note}
The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
```
:::
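
Once the server is up, the adapter should be listed alongside the base model. A quick check, assuming the default port and the standard OpenAI Python client:

```python
from openai import OpenAI

# Point the client at the locally running vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

for model in client.models.list().data:
    print(model.id)  # expect the base model plus "sql-lora"
```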

The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along

@@ -2,11 +2,11 @@

# AutoAWQ

```{warning}
:::{warning}
Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend using the unquantized version of the model for better
accuracy and higher throughput. Currently, you can use AWQ as a way to reduce memory footprint. As of now, it is more suitable for low-latency
inference with a small number of concurrent requests. vLLM's AWQ implementation has lower throughput than the unquantized version.
```
:::

To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
Quantizing reduces the model's precision from FP16 to INT4, which effectively reduces the file size by ~70%.
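
As a sketch of that flow (the model and output path below are illustrative; see the AutoAWQ README for the authoritative API):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"
quant_path = "mistral-instruct-v0.2-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model, quantize its weights to INT4, and save the result.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```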

@@ -14,10 +14,10 @@ The FP8 types typically supported in hardware have two distinct representations,

- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.
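
These two formats map directly onto PyTorch's FP8 dtypes, so their ranges can be checked empirically (assuming a PyTorch build with FP8 dtype support):

```python
import torch

# E4M3 saturates at +/-448; E5M2 trades precision for a larger range.
print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0
print(torch.finfo(torch.float8_e5m2).max)    # 57344.0
```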

```{note}
:::{note}
FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
FP8 models will run on compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
```
:::

## Quick Start with Online Dynamic Quantization

@@ -32,9 +32,9 @@ model = LLM("facebook/opt-125m", quantization="fp8")

result = model.generate("Hello, my name is")
```

```{warning}
:::{warning}
Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
```
:::

## Installation

@@ -110,9 +110,9 @@ model.generate("Hello my name is")

Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):

```{note}
:::{note}
Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
```
:::

```console
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
@@ -137,10 +137,10 @@ If you encounter any issues or have feature requests, please open an issue on th

## Deprecated Flow

```{note}
:::{note}
The following information is preserved for reference and search purposes.
The quantization method described below is deprecated in favor of the `llmcompressor` method described above.
```
:::

For static per-tensor offline quantization to FP8, please install the [AutoFP8 library](https://github.com/neuralmagic/autofp8).

@@ -2,13 +2,13 @@

# GGUF

```{warning}
:::{warning}
Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment; it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
```
:::

```{warning}
:::{warning}
Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge the files into a single-file model.
```
:::

To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:

@@ -25,9 +25,9 @@ You can also add `--tensor-parallel-size 2` to enable tensor parallelism inferen

vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
```

```{warning}
:::{warning}
We recommend using the tokenizer from the base model instead of the GGUF model, because tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary size.
```
:::

You can also use the GGUF model directly through the LLM entrypoint:
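
The full example falls outside this hunk; a sketch of what such usage looks like (the GGUF path and tokenizer follow the serve command above):

```python
from vllm import LLM, SamplingParams

# Point the engine at the local GGUF file; as recommended above, take the
# tokenizer from the base model rather than converting it from GGUF.
llm = LLM(
    model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)
print(llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```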

@@ -4,7 +4,7 @@

Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.

```{toctree}
:::{toctree}
:caption: Contents
:maxdepth: 1

@@ -15,4 +15,4 @@ gguf

int8
fp8
quantized_kvcache
```
:::

@@ -7,9 +7,9 @@ This quantization method is particularly useful for reducing model size while ma

Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).

```{note}
:::{note}
INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
```
:::

## Prerequisites

@@ -119,9 +119,9 @@ $ lm_eval --model vllm \

  --batch_size 'auto'
```

```{note}
:::{note}
Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
```
:::

## Best Practices

@@ -4,128 +4,129 @@

The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:

```{list-table}
:::{list-table}
:header-rows: 1
:widths: 20 8 8 8 8 8 8 8 8 8 8

* - Implementation
  - Volta
  - Turing
  - Ampere
  - Ada
  - Hopper
  - AMD GPU
  - Intel GPU
  - x86 CPU
  - AWS Inferentia
  - Google TPU
* - AWQ
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✅︎
  - ✅︎
  - ✗
  - ✗
* - GPTQ
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✅︎
  - ✅︎
  - ✗
  - ✗
* - Marlin (GPTQ/AWQ/FP8)
  - ✗
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - INT8 (W8A8)
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✅︎
  - ✗
  - ✗
* - FP8 (W8A8)
  - ✗
  - ✗
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
* - AQLM
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - bitsandbytes
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - DeepSpeedFP
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - GGUF
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
```
- * Implementation
  * Volta
  * Turing
  * Ampere
  * Ada
  * Hopper
  * AMD GPU
  * Intel GPU
  * x86 CPU
  * AWS Inferentia
  * Google TPU
- * AWQ
  * ✗
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✗
  * ✅︎
  * ✅︎
  * ✗
  * ✗
- * GPTQ
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✗
  * ✅︎
  * ✅︎
  * ✗
  * ✗
- * Marlin (GPTQ/AWQ/FP8)
  * ✗
  * ✗
  * ✅︎
  * ✅︎
  * ✅︎
  * ✗
  * ✗
  * ✗
  * ✗
  * ✗
- * INT8 (W8A8)
  * ✗
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✗
  * ✗
  * ✅︎
  * ✗
  * ✗
- * FP8 (W8A8)
  * ✗
  * ✗
  * ✗
  * ✅︎
  * ✅︎
  * ✅︎
  * ✗
  * ✗
  * ✗
  * ✗
- * AQLM
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✗
  * ✗
  * ✗
  * ✗
  * ✗
- * bitsandbytes
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✗
  * ✗
  * ✗
  * ✗
  * ✗
- * DeepSpeedFP
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✗
  * ✗
  * ✗
  * ✗
  * ✗
- * GGUF
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✗
  * ✗
  * ✗
  * ✗

:::

- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅︎" indicates that the quantization method is supported on the specified hardware.
- "✗" indicates that the quantization method is not supported on the specified hardware.

```{note}
:::{note}
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.

For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
```
:::

@@ -2,15 +2,15 @@

# Speculative Decoding

```{warning}
:::{warning}
Please note that speculative decoding in vLLM is not yet optimized and does
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
The work to optimize it is ongoing and can be followed here: <gh-issue:4630>
```
:::

```{warning}
:::{warning}
Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
```
:::

This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
Speculative decoding is a technique that improves inter-token latency in memory-bound LLM inference.
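
A minimal draft-model configuration, sketched with the speculative decoding arguments used by vLLM at the time of this commit (the model choices are illustrative):

```python
from vllm import LLM, SamplingParams

# The small draft model proposes up to 5 tokens per step; the target
# model then verifies them in a single forward pass.
llm = LLM(
    model="facebook/opt-6.7b",
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
)
print(llm.generate(["The future of AI is"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```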

@@ -95,10 +95,10 @@ completion = client.chat.completions.create(

print(completion.choices[0].message.content)
```

```{tip}
:::{tip}
While not strictly necessary, it is usually better to indicate in the prompt that JSON needs to be generated, along with which fields the LLM should fill and how.
This can improve the results notably in most cases.
```
:::

Finally we have the `guided_grammar` option, which is probably the most difficult to use, but it's really powerful, as it allows us to define complete languages like SQL queries.
It works by using a context-free EBNF grammar, which we can use, for example, to define a specific format of simplified SQL queries, like in the example below:
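
The example itself is cut off by the end of this hunk; a sketch of the idea, assuming vLLM's OpenAI-compatible server and its `guided_grammar` pass-through (the grammar and model name below are illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# A tiny grammar restricting output to "SELECT <column> from <table>" queries.
simplified_sql_grammar = """
    ?start: select_statement
    select_statement: "SELECT " column " from " table
    column: /[a-z_]+/
    table: /[a-z_]+/
"""

completion = client.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",  # whatever model the server is running
    prompt="Generate an SQL query to show the 'username' column from the 'users' table.",
    extra_body={"guided_grammar": simplified_sql_grammar},
)
print(completion.choices[0].text)
```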