Migrate docs from Sphinx to MkDocs (#18145)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
28
docs/features/automatic_prefix_caching.md
Normal file
@@ -0,0 +1,28 @@
---
title: Automatic Prefix Caching
---
[](){ #automatic-prefix-caching }

## Introduction

Automatic Prefix Caching (APC for short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.

!!! note
    Technical details on how vLLM implements APC can be found [here][design-automatic-prefix-caching].

## Enabling APC in vLLM

Set `enable_prefix_caching=True` in the vLLM engine to enable APC. Here is an example:

<gh-file:examples/offline_inference/automatic_prefix_caching.py>

## Example workloads

We describe two example workloads where APC can provide a large performance benefit:

- Long document query, where the user repeatedly queries the same long document (e.g. a software manual or an annual report) with different queries. In this case, instead of processing the long document again and again, APC allows vLLM to process this long document *only once*, and all future requests can avoid recomputing this long document by reusing its KV cache. This allows vLLM to serve future requests with much higher throughput and much lower latency.
- Multi-round conversation, where the user may chat with the application multiple times in the same chat session. In this case, instead of processing the whole chat history again and again, APC allows vLLM to reuse the processing results of the chat history across all future rounds of conversation, allowing vLLM to serve future requests with much higher throughput and much lower latency.
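
The reuse described in both workloads can be illustrated with a toy model. The sketch below is not vLLM's implementation: `ToyPrefixCache`, `BLOCK_SIZE`, and the token values are hypothetical, and it only counts how many prompt tokens could be served from previously seen prefixes instead of storing any real KV cache.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per cached block (illustrative value, not a vLLM default)


@dataclass
class ToyPrefixCache:
    """Counts how many prompt tokens can be served from previously seen prefixes."""

    seen_prefixes: set = field(default_factory=set)

    def process(self, tokens: list) -> tuple:
        """Return (reused_tokens, computed_tokens) for one prompt."""
        reused = 0
        # Walk the prompt one full block at a time. A block is identified by
        # the whole prefix up to and including it, so a hit on a block implies
        # that everything before it matched as well.
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            if tuple(tokens[:end]) in self.seen_prefixes:
                reused = end
            else:
                break
        # Register every full block of this prompt for future requests.
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            self.seen_prefixes.add(tuple(tokens[:end]))
        return reused, len(tokens) - reused


cache = ToyPrefixCache()
document = list(range(100, 116))  # 16 "document" tokens shared by all queries
first = cache.process(document + [1, 2, 3, 4])   # nothing is cached yet
second = cache.process(document + [5, 6, 7, 8])  # the document prefix is reused
print(first, second)
```

In this toy run the first request computes all 20 tokens, while the second one only computes the 4 tokens of its new question, which mirrors the long-document and multi-round cases above.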

## Limits

APC in general does not reduce the performance of vLLM. That said, APC only reduces the time spent processing the queries (the prefill phase) and does not reduce the time spent generating new tokens (the decode phase). So APC does not bring a performance gain when vLLM spends most of its time generating answers to the queries (e.g. when the answers are long), or when new queries do not share a prefix with any existing query (so that the computation cannot be reused).
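
A back-of-the-envelope estimate makes this limit concrete. The sketch below assumes a deliberately simple model in which a request's latency is prefill time plus decode time and a cache hit removes a proportional share of the prefill time; the function name and the timings are hypothetical, not measurements.

```python
def apc_speedup(prefill_s: float, decode_s: float, cached_fraction: float) -> float:
    """Estimated end-to-end speedup when part of the prefill is served from cache."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    before = prefill_s + decode_s
    # Only the prefill term shrinks; the decode term is untouched by APC.
    after = prefill_s * (1.0 - cached_fraction) + decode_s
    return before / after


# Prefill-heavy request (long shared document, short answer).
prefill_heavy = apc_speedup(prefill_s=4.0, decode_s=1.0, cached_fraction=0.9)
# Decode-heavy request (short prompt, long answer).
decode_heavy = apc_speedup(prefill_s=0.5, decode_s=9.5, cached_fraction=0.9)
# No shared prefix at all.
no_sharing = apc_speedup(prefill_s=4.0, decode_s=1.0, cached_fraction=0.0)

print(round(prefill_heavy, 2), round(decode_heavy, 2), round(no_sharing, 2))
```

Under these assumed numbers the prefill-heavy request speeds up by roughly 3.6x, the decode-heavy one by about 5%, and the request without a shared prefix not at all.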
77
docs/features/compatibility_matrix.md
Normal file
@@ -0,0 +1,77 @@
---
title: Compatibility Matrix
---
[](){ #compatibility-matrix }

The tables below show which features are mutually exclusive and which features are supported on some hardware.

The symbols used have the following meanings:

- ✅ = Full compatibility
- 🟠 = Partial compatibility
- ❌ = No compatibility

!!! note
    Check the ❌ or 🟠 entries with links to see the tracking issue for the unsupported feature/hardware combination.

## Feature x Feature

<style>
td:not(:first-child) {
  text-align: center !important;
}
td {
  padding: 0.5rem !important;
  white-space: nowrap;
}

th {
  padding: 0.5rem !important;
  min-width: 0 !important;
}

th:not(:first-child) {
  writing-mode: vertical-lr;
  transform: rotate(180deg)
}
</style>

| Feature | [CP][chunked-prefill] | [APC][automatic-prefix-caching] | [LoRA][lora-adapter] | <abbr title="Prompt Adapter">prmpt adptr</abbr> | [SD][spec-decode] | CUDA graph | <abbr title="Pooling Models">pooling</abbr> | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | |
| [APC][automatic-prefix-caching] | ✅ | ✅ | | | | | | | | | | | | | |
| [LoRA][lora-adapter] | ✅ | ✅ | ✅ | | | | | | | | | | | | |
| <abbr title="Prompt Adapter">prmpt adptr</abbr> | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | | |
| [SD][spec-decode] | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | | | | |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | |
| <abbr title="Pooling Models">pooling</abbr> | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | | | | | | | | |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [❌](gh-issue:7366) | ❌ | ❌ | [❌](gh-issue:7366) | ✅ | ✅ | ✅ | | | | | | | |
| <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | | | | | |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | |
| multi-step | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | | |
| <abbr title="Multimodal Inputs">mm</abbr> | ✅ | [🟠](gh-pr:8348) | [🟠](gh-pr:4194) | ❔ | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | |
| best-of | ✅ | ✅ | ✅ | ✅ | [❌](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [❌](gh-issue:7968) | ✅ | ✅ | |
| beam-search | ✅ | ✅ | ✅ | ✅ | [❌](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [❌](gh-issue:7968) | ❔ | ✅ | ✅ |

[](){ #feature-x-hardware }

## Feature x Hardware

| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD |
|---|---|---|---|---|---|---|---|
| [CP][chunked-prefill] | [❌](gh-issue:2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [APC][automatic-prefix-caching] | [❌](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [LoRA][lora-adapter] | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Prompt Adapter">prmpt adptr</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | [❌](gh-issue:8475) | ✅ |
| [SD][spec-decode] | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| <abbr title="Pooling Models">pooling</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| <abbr title="Multimodal Inputs">mm</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| multi-step | ✅ | ✅ | ✅ | ✅ | ✅ | [❌](gh-issue:8477) | ✅ |
| best-of | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| beam-search | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
62
docs/features/disagg_prefill.md
Normal file
@@ -0,0 +1,62 @@
---
title: Disaggregated Prefilling (experimental)
---
[](){ #disagg-prefill }

This page introduces the disaggregated prefilling feature in vLLM.

!!! note
    This feature is experimental and subject to change.

## Why disaggregated prefilling?

Two main reasons:

- **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling puts the prefill and decode phases of LLM inference inside different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. `tp` and `pp`) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
- **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert some prefill jobs during the decoding of one request. This results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL. Chunked prefill with a proper chunk size can also achieve the same goal, but in practice it's hard to figure out the correct chunk size value. So disaggregated prefilling is a much more reliable way to control tail ITL.

!!! note
    Disaggregated prefill DOES NOT improve throughput.

## Usage example

Please refer to <gh-file:examples/online_serving/disaggregated_prefill.sh> for an example of how to use disaggregated prefilling.

## Benchmarks

Please refer to <gh-file:benchmarks/disagg_benchmarks> for disaggregated prefilling benchmarks.

## Development

We implement disaggregated prefilling by running 2 vLLM instances: one for prefill (we call it the prefill instance) and one for decode (we call it the decode instance). A connector then transfers the prefill KV caches and results from the prefill instance to the decode instance.

The entire disaggregated prefilling implementation lives under `vllm/distributed/kv_transfer`.

Key abstractions for disaggregated prefilling:

- **Connector**: Connector allows the **kv consumer** to retrieve the KV caches of a batch of requests from the **kv producer**.
- **LookupBuffer**: LookupBuffer provides two APIs: `insert` KV cache and `drop_select` KV cache. The semantics of `insert` and `drop_select` are similar to SQL, where `insert` inserts a KV cache into the buffer, and `drop_select` returns the KV cache that matches the given condition and drops it from the buffer.
- **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports `send_tensor` and `recv_tensor`.

!!! note
    `insert` is a non-blocking operation but `drop_select` is a blocking operation.
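
The `insert` / `drop_select` semantics can be sketched with a small in-memory buffer. This is only an illustration under assumptions: `ToyLookupBuffer`, its method signatures, and the string stand-ins for KV caches are hypothetical and do not mirror the classes under `vllm/distributed/kv_transfer`.

```python
import threading
from typing import Any, Callable


class ToyLookupBuffer:
    """In-memory stand-in for a LookupBuffer; stores (key, kv_cache) pairs."""

    def __init__(self) -> None:
        self._items: list = []
        self._cond = threading.Condition()

    def insert(self, key: Any, kv_cache: Any) -> None:
        """Non-blocking: append the entry and wake up any waiting consumer."""
        with self._cond:
            self._items.append((key, kv_cache))
            self._cond.notify_all()

    def drop_select(self, matches: Callable[[Any], bool]) -> Any:
        """Blocking: wait for a matching entry, remove it, and return its KV cache."""
        with self._cond:
            while True:
                for i, (key, kv_cache) in enumerate(self._items):
                    if matches(key):
                        del self._items[i]
                        return kv_cache
                self._cond.wait()  # sleep until the next insert


buffer = ToyLookupBuffer()
# The producer (prefill side) inserts shortly after the consumer starts waiting.
producer = threading.Timer(0.05, buffer.insert, args=("request-42", "kv-for-request-42"))
producer.start()
result = buffer.drop_select(lambda key: key == "request-42")  # blocks until the insert lands
producer.join()
remaining = len(buffer._items)
print(result, remaining)
```

The consumer call returns only once a matching entry exists, and the entry is gone from the buffer afterwards, which is the "select and drop" behaviour described above.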

Here is a figure illustrating how the above 3 abstractions are organized:

The workflow of disaggregated prefilling is as follows:

In the workflow figure, `buffer` corresponds to the `insert` API in LookupBuffer, and `drop_select` corresponds to the `drop_select` API in LookupBuffer.

## Third-party contributions

Disaggregated prefilling is closely tied to infrastructure, so vLLM relies on third-party connectors for production-level disaggregated prefilling (and the vLLM team will actively review and merge new PRs for third-party connectors).

We recommend three implementation approaches:

- **Fully-customized connector**: Implement your own `Connector` and call third-party libraries to send and receive KV caches, and more (such as editing vLLM's model input to perform customized prefilling). This approach gives you the most control, but at the risk of being incompatible with future vLLM versions.
- **Database-like connector**: Implement your own `LookupBuffer` and support the `insert` and `drop_select` APIs just like SQL.
- **Distributed P2P connector**: Implement your own `Pipe` and support the `send_tensor` and `recv_tensor` APIs, just like `torch.distributed`.
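
As a rough picture of what the P2P option asks for, the sketch below models a `Pipe` as a single-direction FIFO within one process. `ToyPipe` is hypothetical, plain Python lists stand in for tensors, and a real implementation would move data between processes or hosts rather than through a local queue.

```python
import queue
from typing import Any


class ToyPipe:
    """Single-direction FIFO pipe; items come out in the order they were sent."""

    def __init__(self) -> None:
        self._fifo: queue.Queue = queue.Queue()

    def send_tensor(self, tensor: Any) -> None:
        """Hand one item to the pipe (returns immediately for an unbounded queue)."""
        self._fifo.put(tensor)

    def recv_tensor(self) -> Any:
        """Return the oldest item; blocks until the sender has put something."""
        return self._fifo.get()


pipe = ToyPipe()
sent = [[1.0, 2.0], [3.0], [4.0, 5.0]]
for chunk in sent:
    pipe.send_tensor(chunk)
received = [pipe.recv_tensor() for _ in range(len(sent))]
print(received)
```

The only contract shown here is ordering: whatever the producer sends first is what the consumer receives first.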
270
docs/features/lora.md
Normal file
@@ -0,0 +1,270 @@
---
title: LoRA Adapters
---
[](){ #lora-adapter }

This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.

LoRA adapters can be used with any vLLM model that implements [SupportsLoRA][vllm.model_executor.models.interfaces.SupportsLoRA].

Adapters can be efficiently served on a per-request basis with minimal overhead. First we download the adapter(s) and save
them locally with

```python
from huggingface_hub import snapshot_download

sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
```

Then we instantiate the base model and pass in the `enable_lora=True` flag:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
```

We can now submit the prompts and call `llm.generate` with the `lora_request` parameter. The first parameter
of `LoRARequest` is a human-identifiable name, the second parameter is a globally unique ID for the adapter and
the third parameter is the path to the LoRA adapter.

```python
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    stop=["[/assistant]"]
)

prompts = [
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]

outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
)
```

Check out <gh-file:examples/offline_inference/multilora_inference.py> for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.

## Serving LoRA Adapters

LoRA adapted models can also be served with the OpenAI-compatible vLLM server. To do so, we use
`--lora-modules {name}={path} {name}={path}` to specify each LoRA module when we start the server:

```bash
vllm serve meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
```

!!! note
    The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.

The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
with its base model (if `jq` is not installed, you can follow [this guide](https://jqlang.org/download/) to install it):

```bash
curl localhost:8000/v1/models | jq .
{
    "object": "list",
    "data": [
        {
            "id": "meta-llama/Llama-2-7b-hf",
            "object": "model",
            ...
        },
        {
            "id": "sql-lora",
            "object": "model",
            ...
        }
    ]
}
```

Requests can specify the LoRA adapter as if it were any other model via the `model` request parameter. The requests will be
processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and potentially other
LoRA adapter requests if they were provided and `max_loras` is set high enough).

The following is an example request:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "sql-lora",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }' | jq
```

## Dynamically serving LoRA Adapters

In addition to serving LoRA adapters at server startup, the vLLM server supports dynamically configuring LoRA adapters at runtime through dedicated API endpoints and plugins. This feature can be particularly useful when the flexibility to change models on-the-fly is needed.

Note: Enabling this feature in production environments is risky as users may participate in model adapter management.

To enable dynamic LoRA configuration, ensure that the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING`
is set to `True`.

```bash
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
```

### Using API Endpoints

Loading a LoRA Adapter:

To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary
details of the adapter to be loaded. The request payload should include the name and path to the LoRA adapter.

Example request to load a LoRA adapter:

```bash
curl -X POST http://localhost:8000/v1/load_lora_adapter \
-H "Content-Type: application/json" \
-d '{
    "lora_name": "sql_adapter",
    "lora_path": "/path/to/sql-lora-adapter"
}'
```

Upon a successful request, the API will respond with a `200 OK` status code from `vllm serve`, and `curl` returns the response body: `Success: LoRA adapter 'sql_adapter' added successfully`. If an error occurs, such as if the adapter
cannot be found or loaded, an appropriate error message will be returned.

Unloading a LoRA Adapter:

To unload a LoRA adapter that has been previously loaded, send a POST request to the `/v1/unload_lora_adapter` endpoint
with the name or ID of the adapter to be unloaded.

Upon a successful request, the API responds with a `200 OK` status code from `vllm serve`, and `curl` returns the response body: `Success: LoRA adapter 'sql_adapter' removed successfully`.

Example request to unload a LoRA adapter:

```bash
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
-H "Content-Type: application/json" \
-d '{
    "lora_name": "sql_adapter"
}'
```
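
The same two requests can be prepared from Python using only the standard library. This is a minimal sketch: `build_lora_request` and `BASE_URL` are hypothetical names, the payload fields are the ones shown in the `curl` examples, and the requests are only constructed here because sending them needs a running server with `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True`.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # assumed server address, as in the curl examples


def build_lora_request(action: str, lora_name: str,
                       lora_path: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for loading or unloading an adapter."""
    if action not in ("load", "unload"):
        raise ValueError("action must be 'load' or 'unload'")
    payload = {"lora_name": lora_name}
    if action == "load":
        if lora_path is None:
            raise ValueError("lora_path is required when loading an adapter")
        payload["lora_path"] = lora_path
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/{action}_lora_adapter",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


load_req = build_lora_request("load", "sql_adapter", "/path/to/sql-lora-adapter")
unload_req = build_lora_request("unload", "sql_adapter")
print(load_req.get_method(), load_req.full_url, load_req.data.decode())
print(unload_req.get_method(), unload_req.full_url, unload_req.data.decode())

# To actually send a request against a running server:
#     with urllib.request.urlopen(load_req) as resp:
#         print(resp.status, resp.read().decode())
```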

### Using Plugins

Alternatively, you can use the LoRAResolver plugin to dynamically load LoRA adapters. LoRAResolver plugins enable you to load LoRA adapters from both local and remote sources such as the local file system and S3. On every request, when there's a new model name that hasn't been loaded yet, the LoRAResolver will try to resolve and load the corresponding LoRA adapter.

You can set up multiple LoRAResolver plugins if you want to load LoRA adapters from different sources. For example, you might have one resolver for local files and another for S3 storage. vLLM will load the first LoRA adapter that it finds.

You can either install existing plugins or implement your own. By default, vLLM comes with a [resolver plugin to load LoRA adapters from a local directory.](https://github.com/vllm-project/vllm/tree/main/vllm/plugins/lora_resolvers)
To enable this resolver, set `VLLM_ALLOW_RUNTIME_LORA_UPDATING` to `True`, set `VLLM_PLUGINS` to include `lora_filesystem_resolver`, and then set `VLLM_LORA_RESOLVER_CACHE_DIR` to a local directory. When vLLM receives a request using a LoRA adapter `foobar`,
it will first look in the local directory for a directory `foobar`, and attempt to load the contents of that directory as a LoRA adapter. If successful, the request will complete as normal and
that adapter will then be available for normal use on the server.
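
The directory lookup and the "first adapter found wins" rule can be sketched in a few lines. This is a simplified, synchronous illustration, not the plugin's code: `make_directory_resolver` and `resolve_first` are hypothetical helpers, and they return a path rather than the `LoRARequest` a real resolver would produce.

```python
import os
import tempfile
from typing import Callable, List, Optional

# A toy resolver maps an adapter name to a local path, or None if it has no match.
Resolver = Callable[[str], Optional[str]]


def make_directory_resolver(cache_dir: str) -> Resolver:
    """Resolve an adapter name to `<cache_dir>/<name>` if that directory exists."""
    def resolve(lora_name: str) -> Optional[str]:
        candidate = os.path.join(cache_dir, lora_name)
        return candidate if os.path.isdir(candidate) else None
    return resolve


def resolve_first(resolvers: List[Resolver], lora_name: str) -> Optional[str]:
    """Ask each resolver in turn and return the first path that is found."""
    for resolver in resolvers:
        path = resolver(lora_name)
        if path is not None:
            return path
    return None


with tempfile.TemporaryDirectory() as primary, tempfile.TemporaryDirectory() as secondary:
    os.mkdir(os.path.join(secondary, "foobar"))  # only the second source has the adapter
    resolvers = [make_directory_resolver(primary), make_directory_resolver(secondary)]
    found = resolve_first(resolvers, "foobar")
    missing = resolve_first(resolvers, "does-not-exist")
    found_in_secondary = found == os.path.join(secondary, "foobar")

print(found_in_secondary, missing)
```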

Alternatively, follow these example steps to implement your own plugin:

1. Implement the LoRAResolver interface.

    Example of a simple S3 LoRAResolver implementation:

    ```python
    import os
    import s3fs
    from vllm.lora.request import LoRARequest
    from vllm.lora.resolver import LoRAResolver

    class S3LoRAResolver(LoRAResolver):
        def __init__(self):
            self.s3 = s3fs.S3FileSystem()
            self.s3_path_format = os.getenv("S3_PATH_TEMPLATE")
            self.local_path_format = os.getenv("LOCAL_PATH_TEMPLATE")

        async def resolve_lora(self, base_model_name, lora_name):
            s3_path = self.s3_path_format.format(base_model_name=base_model_name, lora_name=lora_name)
            local_path = self.local_path_format.format(base_model_name=base_model_name, lora_name=lora_name)

            # Download the LoRA from S3 to the local path
            await self.s3._get(
                s3_path, local_path, recursive=True, maxdepth=1
            )

            lora_request = LoRARequest(
                lora_name=lora_name,
                lora_path=local_path,
                lora_int_id=abs(hash(lora_name))
            )
            return lora_request
    ```

2. Register LoRAResolver plugin.

    ```python
    from vllm.lora.resolver import LoRAResolverRegistry

    s3_resolver = S3LoRAResolver()
    LoRAResolverRegistry.register_resolver("s3_resolver", s3_resolver)
    ```

For more details, refer to [vLLM's Plugins System](../design/plugin_system.md).

## New format for `--lora-modules`

In previous versions, users would provide LoRA modules via the following format, either as a key-value pair or in JSON format. For example:

```bash
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
```

This would only include the `name` and `path` for each LoRA module, but did not provide a way to specify a `base_model_name`.
Now, you can specify a `base_model_name` alongside the name and path using JSON format. For example:

```bash
--lora-modules '{"name": "sql-lora", "path": "/path/to/lora", "base_model_name": "meta-llama/Llama-2-7b"}'
```

For backward compatibility, you can still use the old key-value format (name=path), but the `base_model_name` will remain unspecified in that case.
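
To make the difference between the two formats concrete, here is a small sketch of a parser that accepts both. It is an assumption-laden illustration, not vLLM's argument parser: `parse_lora_module` and `LoRAModule` are hypothetical names, and only the three fields mentioned above are handled.

```python
import json
from typing import Optional, TypedDict


class LoRAModule(TypedDict):
    name: str
    path: str
    base_model_name: Optional[str]


def parse_lora_module(value: str) -> LoRAModule:
    """Parse one `--lora-modules` value given as `name=path` or as a JSON object."""
    value = value.strip()
    if value.startswith("{"):
        data = json.loads(value)
        return {
            "name": data["name"],
            "path": data["path"],
            "base_model_name": data.get("base_model_name"),
        }
    name, sep, path = value.partition("=")
    if not sep or not name or not path:
        raise ValueError(f"expected name=path or a JSON object, got: {value!r}")
    # The key-value format has no place to carry a base model name.
    return {"name": name, "path": path, "base_model_name": None}


old_style = parse_lora_module("sql-lora=/path/to/lora")
new_style = parse_lora_module(
    '{"name": "sql-lora", "path": "/path/to/lora", "base_model_name": "meta-llama/Llama-2-7b"}'
)
print(old_style)
print(new_style)
```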

## LoRA model lineage in model card

The new format of `--lora-modules` is mainly to support the display of parent model information in the model card. The response from the `/v1/models` endpoint reflects this as follows:

- The `parent` field of LoRA model `sql-lora` now links to its base model `meta-llama/Llama-2-7b-hf`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
- The `root` field points to the artifact location of the LoRA adapter.

```bash
$ curl http://localhost:8000/v1/models

{
    "object": "list",
    "data": [
        {
        "id": "meta-llama/Llama-2-7b-hf",
        "object": "model",
        "created": 1715644056,
        "owned_by": "vllm",
        "root": "~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/",
        "parent": null,
        "permission": [
            {
            .....
            }
        ]
        },
        {
        "id": "sql-lora",
        "object": "model",
        "created": 1715644056,
        "owned_by": "vllm",
        "root": "~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/",
        "parent": "meta-llama/Llama-2-7b-hf",
        "permission": [
            {
            ....
            }
        ]
        }
    ]
}
```
531
docs/features/multimodal_inputs.md
Normal file
@@ -0,0 +1,531 @@
---
title: Multimodal Inputs
---
[](){ #multimodal-inputs }

This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.

!!! note
    We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
    and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.

## Offline Inference

To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:

- `prompt`: The prompt should follow the format that is documented on HuggingFace.
- `multi_modal_data`: This is a dictionary that follows the schema defined in [vllm.multimodal.inputs.MultiModalDataDict][].

### Image Inputs

You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:

```python
import PIL.Image

from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

# Load the image using PIL.Image
image = PIL.Image.open(...)

# Single prompt inference
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image},
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

# Batch inference
image_1 = PIL.Image.open(...)
image_2 = PIL.Image.open(...)
outputs = llm.generate(
    [
        {
            "prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
            "multi_modal_data": {"image": image_1},
        },
        {
            "prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
            "multi_modal_data": {"image": image_2},
        }
    ]
)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```

Full example: <gh-file:examples/offline_inference/vision_language.py>

To substitute multiple images inside the same text prompt, you can pass in a list of images instead:

```python
import PIL.Image

from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,  # Required to load Phi-3.5-vision
    max_model_len=4096,  # Otherwise, it may not fit in smaller GPUs
    limit_mm_per_prompt={"image": 2},  # The maximum number to accept
)

# Refer to the HuggingFace repo for the correct format to use
prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"

# Load the images using PIL.Image
image1 = PIL.Image.open(...)
image2 = PIL.Image.open(...)

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {
        "image": [image1, image2]
    },
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```

Full example: <gh-file:examples/offline_inference/vision_language_multi_image.py>

Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:

```python
from vllm import LLM

# Specify the maximum number of frames per video to be 4. This can be changed.
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})

# Create the request payload.
video_frames = ... # load your video making sure it only has the number of frames specified earlier.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
    ],
}
# `encode_image` is a helper you provide; it must return the frame as a base64-encoded JPEG string.
for i in range(len(video_frames)):
    base64_image = encode_image(video_frames[i]) # base64 encoding.
    new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
    message["content"].append(new_image)

# Perform inference and log output.
outputs = llm.chat([message])

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```
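
The snippet above calls `encode_image` without defining it. One possible shape for that helper is sketched below, assuming each frame is already available as JPEG-encoded bytes; `encode_image` and `image_part` here are illustrative, and the placeholder bytes are not a real image. For frames held as PIL images, you would first serialize each one to JPEG bytes (for example by saving it into an `io.BytesIO` buffer).

```python
import base64


def encode_image(jpeg_bytes: bytes) -> str:
    """Return the base64 text form of already JPEG-encoded image bytes."""
    return base64.b64encode(jpeg_bytes).decode("ascii")


def image_part(jpeg_bytes: bytes) -> dict:
    """Build one `image_url` content entry carrying the frame as a data URL."""
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{encode_image(jpeg_bytes)}"},
    }


# Placeholder bytes standing in for one encoded video frame.
fake_frame = b"\xff\xd8\xff\xe0 placeholder bytes, not a real JPEG"
part = image_part(fake_frame)
decoded = base64.b64decode(part["image_url"]["url"].split(",", 1)[1])
print(part["type"], decoded == fake_frame)
```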

### Video Inputs

You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary
instead of using multi-image input.

Full example: <gh-file:examples/offline_inference/vision_language.py>

### Audio Inputs

You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the multi-modal dictionary.

Full example: <gh-file:examples/offline_inference/audio_language.py>

### Embedding Inputs

To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.

```python
import torch

from vllm import LLM

# Inference with image embeddings as input
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

# Embeddings for single image
# torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
image_embeds = torch.load(...)

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image_embeds},
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```

For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:

```python
# Construct the prompt based on your model
prompt = ...

# Embeddings for multiple images
# torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
image_embeds = torch.load(...)

# Qwen2-VL
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
mm_data = {
    "image": {
        "image_embeds": image_embeds,
        # image_grid_thw is needed to calculate positional encoding.
        "image_grid_thw": torch.load(...),  # torch.Tensor of shape (1, 3),
    }
}

# MiniCPM-V
llm = LLM("openbmb/MiniCPM-V-2_6", trust_remote_code=True, limit_mm_per_prompt={"image": 4})
mm_data = {
    "image": {
        "image_embeds": image_embeds,
        # image_sizes is needed to calculate details of the sliced image.
        "image_sizes": [image.size for image in images],  # list of image sizes
    }
}

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": mm_data,
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```

## Online Serving

Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).

!!! warning
    A chat template is **required** to use Chat Completions API.
    For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`.

    If no default chat template is available, we will first look for a built-in fallback in <gh-file:vllm/transformers_utils/chat_templates/registry.py>.
    If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.

    For certain models, we provide alternative chat templates inside <gh-dir:vllm/examples>.
    For example, VLM2Vec uses <gh-file:examples/template_vlm2vec.jinja> which is different from the default one for Phi-3-Vision.
|
||||
|
||||
### Image Inputs
|
||||
|
||||
Image input is supported according to [OpenAI Vision API](https://platform.openai.com/docs/guides/vision).
|
||||
Here is a simple example using Phi-3.5-Vision.
|
||||
|
||||
First, launch the OpenAI-compatible server:
|
||||
|
||||
```bash
|
||||
vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
|
||||
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
|
||||
```
|
||||
|
||||
Then, you can use the OpenAI client as follows:
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
# Single-image input inference
|
||||
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
|
||||
|
||||
chat_response = client.chat.completions.create(
|
||||
model="microsoft/Phi-3.5-vision-instruct",
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": [
|
||||
# NOTE: The prompt formatting with the image token `<image>` is not needed
|
||||
# since the prompt will be processed automatically by the API server.
|
||||
{"type": "text", "text": "What’s in this image?"},
|
||||
{"type": "image_url", "image_url": {"url": image_url}},
|
||||
],
|
||||
}],
|
||||
)
|
||||
print("Chat completion output:", chat_response.choices[0].message.content)
|
||||
|
||||
# Multi-image input inference
|
||||
image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
|
||||
image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"
|
||||
|
||||
chat_response = client.chat.completions.create(
|
||||
model="microsoft/Phi-3.5-vision-instruct",
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": "What are the animals in these images?"},
|
||||
{"type": "image_url", "image_url": {"url": image_url_duck}},
|
||||
{"type": "image_url", "image_url": {"url": image_url_lion}},
|
||||
],
|
||||
}],
|
||||
)
|
||||
print("Chat completion output:", chat_response.choices[0].message.content)
|
||||
```
|
||||
|
||||
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
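
If the image is not reachable over HTTP, one option is to inline it in the `url` field as a base64 data URL. The sketch below shows only the encoding step. The helper names are hypothetical, the `image/jpeg` default is an assumption about your input, and it assumes the server accepts `data:` URLs in the same way as the OpenAI Vision API.

```python
import base64


def image_bytes_to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def image_content_part(image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build an `image_url` content part for the Chat Completions `messages` list."""
    return {
        "type": "image_url",
        "image_url": {"url": image_bytes_to_data_url(image_bytes, mime_type)},
    }
```

The returned dictionary can be placed in the `content` list of a user message, in the same position as the `image_url` entries in the example above.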
|
||||
|
||||
!!! tip
|
||||
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
|
||||
and pass the file path as `url` in the API request.
|
||||
|
||||
!!! tip
|
||||
There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
|
||||
In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
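
As a rough illustration of interleaving, the sketch below builds a `content` list in which text and image parts alternate in the order given. The helper name and the `example.com` URLs are placeholders introduced for this illustration, not part of vLLM.

```python
def interleaved_content(segments: list) -> list:
    """Turn an ordered list of text strings and ("image", url) tuples into content parts."""
    content = []
    for segment in segments:
        if isinstance(segment, str):
            content.append({"type": "text", "text": segment})
        else:
            kind, url = segment
            if kind != "image":
                raise ValueError(f"unsupported segment kind: {kind!r}")
            content.append({"type": "image_url", "image_url": {"url": url}})
    return content


# Placeholder URLs for illustration only.
content = interleaved_content([
    "Compare this image",
    ("image", "https://example.com/first.jpg"),
    "with this one:",
    ("image", "https://example.com/second.jpg"),
])
messages = [{"role": "user", "content": content}]
```

The resulting `messages` list can be passed to `client.chat.completions.create` as in the example above.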
|
||||
|
||||
!!! note
|
||||
By default, the timeout for fetching images through HTTP URL is `5` seconds.
|
||||
You can override this by setting the environment variable:
|
||||
|
||||
```console
|
||||
export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
|
||||
```
|
||||
|
||||
### Video Inputs
|
||||
|
||||
Instead of `image_url`, you can pass a video file via `video_url`. Here is a simple example using [LLaVA-OneVision](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf).
|
||||
|
||||
First, launch the OpenAI-compatible server:
|
||||
|
||||
```bash
|
||||
vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model-len 8192
|
||||
```
|
||||
|
||||
Then, you can use the OpenAI client as follows:
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
video_url = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4"
|
||||
|
||||
# Use the video URL in the payload
|
||||
chat_completion_from_url = client.chat.completions.create(
|
||||
messages=[{
|
||||
"role":
|
||||
"user",
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "What's in this video?"
|
||||
},
|
||||
{
|
||||
"type": "video_url",
|
||||
"video_url": {
|
||||
"url": video_url
|
||||
},
|
||||
},
|
||||
],
|
||||
}],
|
||||
    model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
|
||||
max_completion_tokens=64,
|
||||
)
|
||||
|
||||
result = chat_completion_from_url.choices[0].message.content
|
||||
print("Chat completion output from video url:", result)
|
||||
```
|
||||
|
||||
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
|
||||
|
||||
!!! note
|
||||
By default, the timeout for fetching videos through HTTP URL is `30` seconds.
|
||||
You can override this by setting the environment variable:
|
||||
|
||||
```console
|
||||
export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
|
||||
```
|
||||
|
||||
### Audio Inputs
|
||||
|
||||
Audio input is supported according to [OpenAI Audio API](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in).
|
||||
Here is a simple example using Ultravox-v0.5-1B.
|
||||
|
||||
First, launch the OpenAI-compatible server:
|
||||
|
||||
```bash
|
||||
vllm serve fixie-ai/ultravox-v0_5-llama-3_2-1b
|
||||
```
|
||||
|
||||
Then, you can use the OpenAI client as follows:
|
||||
|
||||
```python
|
||||
import base64
|
||||
import requests
|
||||
from openai import OpenAI
|
||||
from vllm.assets.audio import AudioAsset
|
||||
|
||||
def encode_base64_content_from_url(content_url: str) -> str:
    """Encode content retrieved from a remote URL to base64 format."""
    with requests.get(content_url) as response:
        response.raise_for_status()
        result = base64.b64encode(response.content).decode('utf-8')

    return result
|
||||
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
# Any format supported by librosa is supported
|
||||
audio_url = AudioAsset("winning_call").url
|
||||
audio_base64 = encode_base64_content_from_url(audio_url)
|
||||
|
||||
chat_completion_from_base64 = client.chat.completions.create(
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "What's in this audio?"
|
||||
},
|
||||
{
|
||||
"type": "input_audio",
|
||||
"input_audio": {
|
||||
"data": audio_base64,
|
||||
"format": "wav"
|
||||
},
|
||||
},
|
||||
],
|
||||
}],
|
||||
    model="fixie-ai/ultravox-v0_5-llama-3_2-1b",
|
||||
max_completion_tokens=64,
|
||||
)
|
||||
|
||||
result = chat_completion_from_base64.choices[0].message.content
|
||||
print("Chat completion output from input audio:", result)
|
||||
```
|
||||
|
||||
Alternatively, you can pass `audio_url`, which is the audio counterpart of `image_url` for image input:
|
||||
|
||||
```python
|
||||
chat_completion_from_url = client.chat.completions.create(
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "What's in this audio?"
|
||||
},
|
||||
{
|
||||
"type": "audio_url",
|
||||
"audio_url": {
|
||||
"url": audio_url
|
||||
},
|
||||
},
|
||||
],
|
||||
}],
|
||||
    model="fixie-ai/ultravox-v0_5-llama-3_2-1b",
|
||||
max_completion_tokens=64,
|
||||
)
|
||||
|
||||
result = chat_completion_from_url.choices[0].message.content
|
||||
print("Chat completion output from audio url:", result)
|
||||
```
|
||||
|
||||
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
|
||||
|
||||
!!! note
|
||||
    By default, the timeout for fetching audio through HTTP URL is `10` seconds.
|
||||
You can override this by setting the environment variable:
|
||||
|
||||
```console
|
||||
export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
|
||||
```
|
||||
|
||||
### Embedding Inputs
|
||||
|
||||
To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of the expected shape (for images, `(num_images, image_feature_size, hidden_size of LM)`) to the corresponding field of the multi-modal dictionary.
|
||||
#### Image Embedding Inputs
|
||||
For image embeddings, you can pass the base64-encoded tensor to the `image_embeds` field.
|
||||
The following example demonstrates how to pass image embeddings to the OpenAI server:
|
||||
|
||||
```python
|
||||
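import base64
import io

import torch
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"


def encode_tensor(tensor: torch.Tensor) -> str:
    """Serialize a tensor with torch.save and base64-encode the bytes."""
    buffer = io.BytesIO()
    torch.save(tensor, buffer)
    return base64.b64encode(buffer.getvalue()).decode('utf-8')


image_embedding = torch.load(...)
grid_thw = torch.load(...)  # Required by Qwen/Qwen2-VL-2B-Instruct

base64_image_embedding = encode_tensor(image_embedding)
base64_image_grid_thw = encode_tensor(grid_thw)
# For openbmb/MiniCPM-V-2_6, encode the image sizes tensor the same way
# to obtain base64_image_sizes.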
|
||||
|
||||
client = OpenAI(
|
||||
# defaults to os.environ.get("OPENAI_API_KEY")
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
# Basic usage - this is equivalent to the LLaVA example for offline inference
|
||||
model = "llava-hf/llava-1.5-7b-hf"
|
||||
embeds = {
|
||||
"type": "image_embeds",
|
||||
"image_embeds": f"{base64_image_embedding}"
|
||||
}
|
||||
|
||||
# Pass additional parameters (available to Qwen2-VL and MiniCPM-V)
|
||||
model = "Qwen/Qwen2-VL-2B-Instruct"
|
||||
embeds = {
|
||||
"type": "image_embeds",
|
||||
"image_embeds": {
|
||||
"image_embeds": f"{base64_image_embedding}" , # Required
|
||||
"image_grid_thw": f"{base64_image_grid_thw}" # Required by Qwen/Qwen2-VL-2B-Instruct
|
||||
},
|
||||
}
|
||||
model = "openbmb/MiniCPM-V-2_6"
|
||||
embeds = {
|
||||
"type": "image_embeds",
|
||||
"image_embeds": {
|
||||
"image_embeds": f"{base64_image_embedding}" , # Required
|
||||
"image_sizes": f"{base64_image_sizes}" # Required by openbmb/MiniCPM-V-2_6
|
||||
},
|
||||
}
|
||||
chat_completion = client.chat.completions.create(
|
||||
messages=[
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "What's in this image?",
|
||||
},
|
||||
embeds,
|
||||
],
|
||||
},
|
||||
],
|
||||
model=model,
|
||||
)
|
||||
```
|
||||
|
||||
!!! note
|
||||
Only one message can contain `{"type": "image_embeds"}`.
|
||||
If used with a model that requires additional parameters, you must also provide a tensor for each of them, e.g. `image_grid_thw`, `image_sizes`, etc.
|
||||
43
docs/features/prompt_embeds.md
Normal file
@@ -0,0 +1,43 @@
|
||||
# Prompt Embedding Inputs
|
||||
|
||||
This page teaches you how to pass prompt embedding inputs to vLLM.
|
||||
|
||||
## What are prompt embeddings?
|
||||
|
||||
The traditional flow of text data for a Large Language Model goes from text to token ids (via a tokenizer) then from token ids to prompt embeddings. For a traditional decoder-only model (such as meta-llama/Llama-3.1-8B-Instruct), this step of converting token ids to prompt embeddings happens via a look-up from a learned embedding matrix, but the model is not limited to processing only the embeddings corresponding to its token vocabulary.
|
||||
|
||||
!!! note
|
||||
Prompt embeddings are currently only supported in the v0 engine.
|
||||
|
||||
## Offline Inference
|
||||
|
||||
To input prompt embeddings, follow this schema in [vllm.inputs.EmbedsPrompt][]:
|
||||
|
||||
- `prompt_embeds`: A torch tensor representing a sequence of prompt/token embeddings. This has the shape `(sequence_length, hidden_size)`, where `sequence_length` is the number of token embeddings and `hidden_size` is the hidden size (embedding size) of the model.
|
||||
|
||||
### Hugging Face Transformers Inputs
|
||||
|
||||
You can pass prompt embeddings from Hugging Face Transformers models to the `'prompt_embeds'` field of the prompt embedding dictionary, as shown in the following examples:
|
||||
|
||||
<gh-file:examples/offline_inference/prompt_embed_inference.py>
|
||||
|
||||
## Online Serving
|
||||
|
||||
Our OpenAI-compatible server accepts prompt embedding inputs via the [Completions API](https://platform.openai.com/docs/api-reference/completions). Prompt embedding inputs are added via a new `'prompt_embeds'` key in the JSON payload.
|
||||
|
||||
When a mixture of `'prompt_embeds'` and `'prompt'` inputs is provided in a single request, the prompt embeds are always returned first.
|
||||
|
||||
Prompt embeddings are passed in as base64 encoded torch tensors.
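
The encoding step can be sketched as follows. This is a minimal illustration, not the server's reference client: the helper name and the `max_tokens` field are assumptions, and `serialized_tensor` is assumed to be the bytes written by `torch.save` into an in-memory buffer for a tensor of shape `(sequence_length, hidden_size)`.

```python
import base64
import json


def build_prompt_embeds_request(serialized_tensor: bytes, model: str,
                                max_tokens: int = 16) -> dict:
    """Build a Completions request body carrying base64-encoded prompt embeddings.

    `serialized_tensor` is assumed to be the output of torch.save on an
    in-memory buffer (hypothetical helper, for illustration only).
    """
    return {
        "model": model,
        "max_tokens": max_tokens,
        "prompt_embeds": base64.b64encode(serialized_tensor).decode("utf-8"),
    }


# Stand-in bytes; in practice these come from torch.save on a real tensor.
body = build_prompt_embeds_request(b"\x00\x01\x02", "meta-llama/Llama-3.2-1B-Instruct")
print(json.dumps(body, indent=2))
```

When using the OpenAI Python client, a non-standard key such as `prompt_embeds` would typically be passed through the client's `extra_body` argument rather than as a named parameter.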
|
||||
|
||||
### Transformers Inputs via OpenAI Client
|
||||
|
||||
First, launch the OpenAI-compatible server:
|
||||
|
||||
```bash
|
||||
vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \
|
||||
--max-model-len 4096 --enable-prompt-embeds
|
||||
```
|
||||
|
||||
Then, you can use the OpenAI client as follows:
|
||||
|
||||
<gh-file:examples/online_serving/prompt_embed_inference_with_openai_client.py>
|
||||
22
docs/features/quantization/README.md
Normal file
@@ -0,0 +1,22 @@
|
||||
---
|
||||
title: Quantization
|
||||
---
|
||||
[](){ #quantization-index }
|
||||
|
||||
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
|
||||
|
||||
Contents:
|
||||
|
||||
- [Supported Hardware](supported_hardware.md)
- [AutoAWQ](auto_awq.md)
- [BitsAndBytes](bnb.md)
- [BitBLAS](bitblas.md)
- [GGUF](gguf.md)
- [GPTQModel](gptqmodel.md)
- [INT4](int4.md)
- [INT8](int8.md)
- [FP8](fp8.md)
- [ModelOpt](modelopt.md)
- [Quark](quark.md)
- [Quantized KV Cache](quantized_kvcache.md)
- [TorchAO](torchao.md)
|
||||
73
docs/features/quantization/auto_awq.md
Normal file
@@ -0,0 +1,73 @@
|
||||
---
|
||||
title: AutoAWQ
|
||||
---
|
||||
[](){ #auto-awq }
|
||||
|
||||
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
|
||||
Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint.
|
||||
The main benefits are lower latency and memory usage.
|
||||
|
||||
You can quantize your own models by installing AutoAWQ, or pick one of the [6500+ models on Hugging Face](https://huggingface.co/models?search=awq).
|
||||
|
||||
```console
|
||||
pip install autoawq
|
||||
```
|
||||
|
||||
After installing AutoAWQ, you are ready to quantize a model. Please refer to the [AutoAWQ documentation](https://casper-hansen.github.io/AutoAWQ/examples/#basic-quantization) for further details. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
|
||||
|
||||
```python
|
||||
from awq import AutoAWQForCausalLM
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
|
||||
quant_path = 'mistral-instruct-v0.2-awq'
|
||||
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
|
||||
|
||||
# Load model
|
||||
model = AutoAWQForCausalLM.from_pretrained(
|
||||
model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
|
||||
)
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
|
||||
|
||||
# Quantize
|
||||
model.quantize(tokenizer, quant_config=quant_config)
|
||||
|
||||
# Save quantized model
|
||||
model.save_quantized(quant_path)
|
||||
tokenizer.save_pretrained(quant_path)
|
||||
|
||||
print(f'Model is quantized and saved at "{quant_path}"')
|
||||
```
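
To get a feel for what `w_bit` and `q_group_size` in the `quant_config` above mean for storage, the following back-of-the-envelope sketch estimates the average number of bits stored per weight. It assumes one 16-bit scale per group and, with `zero_point` enabled, one `w_bit`-wide zero point per group. These storage assumptions are ours for illustration and are not taken from the AutoAWQ documentation.

```python
def effective_bits_per_weight(w_bit: int, q_group_size: int,
                              scale_bits: int = 16, zero_point: bool = True) -> float:
    """Average storage cost per weight, including per-group metadata.

    Assumes one `scale_bits`-wide scale per group and, when `zero_point` is
    set, one `w_bit`-wide zero point per group.
    """
    overhead = scale_bits + (w_bit if zero_point else 0)
    return w_bit + overhead / q_group_size


bits = effective_bits_per_weight(w_bit=4, q_group_size=128)
print(f"{bits} bits per weight, about {16 / bits:.2f}x smaller than FP16")
```

Under these assumptions, a smaller group size increases the metadata overhead per weight.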
|
||||
|
||||
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
|
||||
|
||||
```console
|
||||
python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
|
||||
```
|
||||
|
||||
AWQ models are also supported directly through the LLM entrypoint:
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
# Sample prompts.
|
||||
prompts = [
|
||||
"Hello, my name is",
|
||||
"The president of the United States is",
|
||||
"The capital of France is",
|
||||
"The future of AI is",
|
||||
]
|
||||
# Create a sampling params object.
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
|
||||
# Create an LLM.
|
||||
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")
|
||||
# Generate texts from the prompts. The output is a list of RequestOutput objects
|
||||
# that contain the prompt, generated text, and other information.
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
# Print the outputs.
|
||||
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
```
|
||||
48
docs/features/quantization/bitblas.md
Normal file
@@ -0,0 +1,48 @@
|
||||
---
|
||||
title: BitBLAS
|
||||
---
|
||||
[](){ #bitblas }
|
||||
|
||||
vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.
|
||||
|
||||
!!! note
|
||||
Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`).
|
||||
Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper.
|
||||
For details see [supported hardware](https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html).
|
||||
|
||||
Below are the steps to utilize BitBLAS with vLLM.
|
||||
|
||||
```console
|
||||
pip install "bitblas>=0.1.0"
|
||||
```
|
||||
|
||||
vLLM reads the model's config file and supports pre-quantized checkpoints.
|
||||
|
||||
You can find pre-quantized models on:
|
||||
|
||||
- [Hugging Face (BitBLAS)](https://huggingface.co/models?search=bitblas)
|
||||
- [Hugging Face (GPTQ)](https://huggingface.co/models?search=gptq)
|
||||
|
||||
Usually, these repositories have a `quantize_config.json` file that includes a `quantization_config` section.
|
||||
|
||||
## Read bitblas format checkpoint
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
import torch
|
||||
|
||||
# "hxbgsyxh/llama-13b-4bit-g-1-bitblas" is a pre-quantized checkpoint.
|
||||
model_id = "hxbgsyxh/llama-13b-4bit-g-1-bitblas"
|
||||
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, quantization="bitblas")
|
||||
```
|
||||
|
||||
## Read gptq format checkpoint
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
import torch
|
||||
|
||||
# "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint.
|
||||
model_id = "hxbgsyxh/llama-13b-4bit-g-1"
|
||||
llm = LLM(model=model_id, dtype=torch.float16, trust_remote_code=True, quantization="bitblas", max_model_len=1024)
|
||||
```
|
||||
51
docs/features/quantization/bnb.md
Normal file
@@ -0,0 +1,51 @@
|
||||
---
|
||||
title: BitsAndBytes
|
||||
---
|
||||
[](){ #bits-and-bytes }
|
||||
|
||||
vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
|
||||
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
|
||||
Compared to other quantization methods, BitsAndBytes eliminates the need for calibrating the quantized model with input data.
|
||||
|
||||
Below are the steps to utilize BitsAndBytes with vLLM.
|
||||
|
||||
```console
|
||||
pip install "bitsandbytes>=0.45.3"
|
||||
```
|
||||
|
||||
vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint.
|
||||
|
||||
You can find bitsandbytes quantized models on <https://huggingface.co/models?search=bitsandbytes>.
These repositories usually have a `config.json` file that includes a `quantization_config` section.
|
||||
|
||||
## Read quantized checkpoint
|
||||
|
||||
For pre-quantized checkpoints, vLLM will try to infer the quantization method from the config file, so you don't need to explicitly specify the quantization argument.
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
import torch
|
||||
# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
|
||||
model_id = "unsloth/tinyllama-bnb-4bit"
|
||||
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True)
|
||||
```
|
||||
|
||||
## Inflight quantization: load as 4bit quantization
|
||||
|
||||
For inflight 4bit quantization with BitsAndBytes, you need to explicitly specify the quantization argument.
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
import torch
|
||||
model_id = "huggyllama/llama-7b"
|
||||
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
          quantization="bitsandbytes")
|
||||
```
|
||||
|
||||
## OpenAI Compatible Server
|
||||
|
||||
Append the following to your model arguments for 4bit inflight quantization:
|
||||
|
||||
```console
|
||||
--quantization bitsandbytes
|
||||
```
|
||||
136
docs/features/quantization/fp8.md
Normal file
@@ -0,0 +1,136 @@
|
||||
---
|
||||
title: FP8 W8A8
|
||||
---
|
||||
[](){ #fp8 }
|
||||
|
||||
vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
|
||||
Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
|
||||
Ampere GPUs are supported for W8A16 (weight-only FP8) utilizing Marlin kernels.
|
||||
Quantization of models with FP8 allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.
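
The 2x figure follows from simple arithmetic on weight storage, sketched below. The helper name is hypothetical, the 8-billion parameter count is only an example, and the estimate covers weights alone (no activations, KV cache, or quantization scales).

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


bf16_gb = weight_memory_gb(8e9, 16)  # 16-bit baseline
fp8_gb = weight_memory_gb(8e9, 8)    # FP8 weights
print(f"BF16: {bf16_gb} GB, FP8: {fp8_gb} GB, ratio: {bf16_gb / fp8_gb}x")
```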
|
||||
|
||||
Please visit the HF collection of [quantized FP8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127).
|
||||
|
||||
The FP8 types typically supported in hardware have two distinct representations, each useful in different scenarios:
|
||||
|
||||
- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
|
||||
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.
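
The maximum values quoted above can be reproduced from the bit layouts. The sketch below assumes the common FP8 encoding conventions: E5M2 follows IEEE-style rules (the top exponent code is reserved for `inf` and `nan`), while E4M3 reserves only the all-ones mantissa at the top exponent code for `nan` and has no `inf`. The function name is ours.

```python
def max_finite(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value of a small float format with the given bit layout."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/nan, so the largest usable
        # exponent code is one below it, with a full mantissa.
        max_exp_code = 2 ** exp_bits - 2
        max_mantissa = 2 ** man_bits - 1
    else:
        # Top exponent code is usable; only its all-ones mantissa means nan.
        max_exp_code = 2 ** exp_bits - 1
        max_mantissa = 2 ** man_bits - 2
    return 2.0 ** (max_exp_code - bias) * (1 + max_mantissa / 2 ** man_bits)


e4m3_max = max_finite(exp_bits=4, man_bits=3, ieee_like=False)
e5m2_max = max_finite(exp_bits=5, man_bits=2, ieee_like=True)
print(e4m3_max, e5m2_max)
```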
|
||||
|
||||
!!! note
|
||||
    FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
    FP8 models will run on compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
|
||||
|
||||
## Installation
|
||||
|
||||
To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
|
||||
|
||||
```console
|
||||
pip install llmcompressor
|
||||
```
|
||||
|
||||
## Quantization Process
|
||||
|
||||
The quantization process involves three main steps:
|
||||
|
||||
1. Loading the model
|
||||
2. Applying quantization
|
||||
3. Evaluating accuracy in vLLM
|
||||
|
||||
### 1. Loading the Model
|
||||
|
||||
Load your model and tokenizer using the standard `transformers` AutoModel classes:
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
MODEL_ID, device_map="auto", torch_dtype="auto",
|
||||
)
|
||||
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
|
||||
```
|
||||
|
||||
### 2. Applying Quantization
|
||||
|
||||
For FP8 quantization, we can recover accuracy with simple round-to-nearest (RTN) quantization. We recommend targeting all `Linear` layers using the `FP8_DYNAMIC` scheme, which uses:
|
||||
|
||||
- Static, per-channel quantization on the weights
|
||||
- Dynamic, per-token quantization on the activations
|
||||
|
||||
Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
|
||||
|
||||
```python
|
||||
from llmcompressor.transformers import oneshot
|
||||
from llmcompressor.modifiers.quantization import QuantizationModifier
|
||||
|
||||
# Configure the simple PTQ quantization
|
||||
recipe = QuantizationModifier(
|
||||
targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
|
||||
|
||||
# Apply the quantization algorithm.
|
||||
oneshot(model=model, recipe=recipe)
|
||||
|
||||
# Save the model: Meta-Llama-3-8B-Instruct-FP8-Dynamic
|
||||
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
|
||||
model.save_pretrained(SAVE_DIR)
|
||||
tokenizer.save_pretrained(SAVE_DIR)
|
||||
```
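
To make "dynamic, per-token quantization on the activations" more concrete, here is a simplified pure-Python sketch of how a per-token scale could be derived at runtime. It is not the llm-compressor or vLLM implementation: the function names are hypothetical, plain lists stand in for tensors, and the final rounding to real FP8 values is omitted.

```python
E4M3_MAX = 448.0  # largest finite E4M3 value


def per_token_scales(activations: list[list[float]]) -> list[float]:
    """One scale per token (row): the row's max magnitude mapped onto E4M3_MAX."""
    scales = []
    for row in activations:
        amax = max(abs(v) for v in row)
        # Use a neutral scale for all-zero rows to avoid dividing by zero later.
        scales.append(amax / E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_into_fp8_range(activations, scales):
    """Divide each row by its scale; rounding to actual FP8 values is omitted."""
    return [[v / s for v in row] for row, s in zip(activations, scales)]


acts = [[1.0, -448.0, 3.5], [0.0, 224.0, -112.0]]
scales = per_token_scales(acts)
scaled = scale_into_fp8_range(acts, scales)
print(scales, scaled)
```

Because the scales are computed from the activations themselves on each forward pass, no calibration dataset is needed for them.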
|
||||
|
||||
### 3. Evaluating Accuracy
|
||||
|
||||
Install `vllm` and `lm-evaluation-harness` for evaluation:
|
||||
|
||||
```console
|
||||
pip install vllm lm-eval==0.4.4
|
||||
```
|
||||
|
||||
Load and run the model in `vllm`:
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
|
||||
result = model.generate("Hello my name is")
|
||||
print(result[0].outputs[0].text)
|
||||
```
|
||||
|
||||
Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
|
||||
|
||||
!!! note
|
||||
Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
|
||||
|
||||
```console
|
||||
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
|
||||
$ lm_eval \
|
||||
--model vllm \
|
||||
--model_args pretrained=$MODEL,add_bos_token=True \
|
||||
--tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
|
||||
```
|
||||
|
||||
Here's an example of the resulting scores:
|
||||
|
||||
```text
|
||||
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|
||||
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|
||||
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.768|± |0.0268|
|
||||
| | |strict-match | 5|exact_match|↑ |0.768|± |0.0268|
|
||||
```
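
As a sanity check on the `Stderr` column, the sketch below recomputes it from the reported accuracy, assuming it is the standard error of the mean of 0/1 outcomes using the sample (n - 1) variance. A value of 0.768 on 250 samples corresponds to 192 correct answers if all 250 samples were scored. The function name is ours.

```python
import math


def accuracy_stderr(num_correct: int, num_samples: int) -> float:
    """Standard error of an accuracy estimate from 0/1 outcomes (sample variance)."""
    p = num_correct / num_samples
    sample_variance = p * (1 - p) * num_samples / (num_samples - 1)
    return math.sqrt(sample_variance / num_samples)


stderr = accuracy_stderr(num_correct=192, num_samples=250)
print(f"accuracy={192 / 250:.3f} stderr={stderr:.4f}")
```

With only 250 samples the uncertainty is a few percentage points, so small accuracy differences between the quantized and original model may not be significant.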
|
||||
|
||||
## Troubleshooting and Support
|
||||
|
||||
If you encounter any issues or have feature requests, please open an issue on the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor/issues) GitHub repository.
|
||||
|
||||
## Online Dynamic Quantization
|
||||
|
||||
Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying `--quantization="fp8"` in the command line or setting `quantization="fp8"` in the LLM constructor.
|
||||
|
||||
In this mode, all Linear modules (except for the final `lm_head`) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
model = LLM("facebook/opt-125m", quantization="fp8")
|
||||
# INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
|
||||
result = model.generate("Hello, my name is")
|
||||
print(result[0].outputs[0].text)
|
||||
```
|
||||
|
||||
!!! warning
|
||||
    Currently, we load the model at original precision before quantizing down to 8 bits, so you need enough memory to load the whole model.
|
||||
77
docs/features/quantization/gguf.md
Normal file
@@ -0,0 +1,77 @@
|
||||
---
|
||||
title: GGUF
|
||||
---
|
||||
[](){ #gguf }
|
||||
|
||||
!!! warning
|
||||
    GGUF support in vLLM is highly experimental and under-optimized at the moment, and it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
|
||||
|
||||
!!! warning
|
||||
    Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge it into a single-file model.
|
||||
|
||||
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
|
||||
|
||||
```console
|
||||
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
|
||||
# We recommend using the tokenizer from the base model to avoid slow and error-prone tokenizer conversion.
|
||||
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
|
||||
```
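
Before serving, it can be useful to check that the downloaded file is a single GGUF file rather than one shard of a split model. The sketch below is an illustration only: it relies on GGUF files starting with the four magic bytes `GGUF`, and it assumes that split shards follow the `-00001-of-00003.gguf` naming scheme used by the gguf-split tool. The helper names are hypothetical.

```python
import re
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file
# Assumed gguf-split naming scheme: <name>-00001-of-00003.gguf
SHARD_PATTERN = re.compile(r"-\d{5}-of-\d{5}\.gguf$")


def is_gguf_file(path: Path) -> bool:
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


def looks_like_shard(path: Path) -> bool:
    """Return True if the file name matches the assumed split-shard pattern."""
    return SHARD_PATTERN.search(path.name) is not None
```

For example, `looks_like_shard(Path("tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"))` is false, so that file can be passed to `vllm serve` directly.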
|
||||
|
||||
You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:
|
||||
|
||||
```console
|
||||
# We recommend using the tokenizer from the base model to avoid slow and error-prone tokenizer conversion.
|
||||
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
|
||||
```
|
||||
|
||||
!!! warning
|
||||
    We recommend using the tokenizer from the base model instead of the GGUF model, because the tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary.
|
||||
|
||||
GGUF assumes that Hugging Face can convert the metadata to a config file. If Hugging Face doesn't support your model, you can manually create a config and pass it via `--hf-config-path`:
|
||||
|
||||
```console
|
||||
# If your model is not supported by Hugging Face, you can manually provide a Hugging Face-compatible config path
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --hf-config-path TinyLlama/TinyLlama-1.1B-Chat-v1.0
|
||||
```
|
||||
|
||||
You can also use the GGUF model directly through the LLM entrypoint:
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
# In this script, we demonstrate how to pass input to the chat method:
|
||||
conversation = [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are a helpful assistant"
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Hello"
|
||||
},
|
||||
{
|
||||
"role": "assistant",
|
||||
"content": "Hello! How can I assist you today?"
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Write an essay about the importance of higher education.",
|
||||
},
|
||||
]
|
||||
|
||||
# Create a sampling params object.
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
|
||||
# Create an LLM.
|
||||
llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
|
||||
tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
|
||||
# Generate texts from the prompts. The output is a list of RequestOutput objects
|
||||
# that contain the prompt, generated text, and other information.
|
||||
outputs = llm.chat(conversation, sampling_params)
|
||||
|
||||
# Print the outputs.
|
||||
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
```
|
||||
97
docs/features/quantization/gptqmodel.md
Normal file
@@ -0,0 +1,97 @@
|
||||
---
|
||||
title: GPTQModel
|
||||
---
|
||||
[](){ #gptqmodel }
|
||||
|
||||
To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.
|
||||
|
||||
Quantization reduces the model's precision from BF16/FP16 (16-bit) to INT4 (4-bit) or INT8 (8-bit), which significantly reduces the
total model memory footprint while also increasing inference performance.
|
||||
|
||||
Compatible GPTQModel quantized models can leverage the `Marlin` and `Machete` vLLM custom kernels to maximize batched
transactions-per-second (`tps`) and token-latency performance on both Ampere (A100+) and Hopper (H100+) NVIDIA GPUs.
These two kernels are highly optimized by vLLM and NeuralMagic (now part of Red Hat) to accelerate inference of quantized GPTQ
models.
|
||||
|
||||
GPTQModel is one of the few quantization toolkits that supports `Dynamic` per-module quantization, where different layers and/or modules within an LLM can be further optimized with custom quantization parameters. `Dynamic` quantization
is fully integrated into vLLM and supported by the ModelCloud.AI team. Please refer to the [GPTQModel readme](https://github.com/ModelCloud/GPTQModel?tab=readme-ov-file#dynamic-quantization-per-module-quantizeconfig-override)
for more details on this and other advanced features.
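To make the idea of per-module overrides concrete, here is a minimal, library-independent sketch of how a base configuration could be overridden for modules whose names match a pattern. The function `resolve_module_config`, the override table, and the module names are hypothetical illustrations and not GPTQModel's actual API; the readme linked above documents the real option.

```python
import re

# Hypothetical base settings applied to every module by default.
BASE_CONFIG = {"bits": 4, "group_size": 128}

# Hypothetical override table: regex on the module name -> settings to override.
OVERRIDES = {
    r".*\.mlp\.down_proj$": {"bits": 8},
    r".*layers\.0\..*": {"group_size": 32},
}


def resolve_module_config(module_name, base=BASE_CONFIG, overrides=OVERRIDES):
    """Return the effective settings for one module (later matches win)."""
    config = dict(base)
    for pattern, override in overrides.items():
        if re.match(pattern, module_name):
            config.update(override)
    return config


if __name__ == "__main__":
    for name in [
        "model.layers.0.self_attn.q_proj",
        "model.layers.5.mlp.down_proj",
        "model.layers.5.self_attn.q_proj",
    ]:
        print(name, resolve_module_config(name))
```

Modules that match no pattern keep the base settings, so the override table only needs to list the exceptions.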
|
||||
|
||||
## Installation
|
||||
|
||||
You can quantize your own models by installing [GPTQModel](https://github.com/ModelCloud/GPTQModel) or picking one of the [5000+ models on Huggingface](https://huggingface.co/models?search=gptq).
|
||||
|
||||
```console
|
||||
pip install -U gptqmodel --no-build-isolation -v
|
||||
```
|
||||
|
||||
## Quantizing a model
|
||||
|
||||
After installing GPTQModel, you are ready to quantize a model. Please refer to the [GPTQModel readme](https://github.com/ModelCloud/GPTQModel/?tab=readme-ov-file#quantization) for further details.
|
||||
|
||||
Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
from gptqmodel import GPTQModel, QuantizeConfig
|
||||
|
||||
model_id = "meta-llama/Llama-3.2-1B-Instruct"
|
||||
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"
|
||||
|
||||
calibration_dataset = load_dataset(
|
||||
"allenai/c4",
|
||||
data_files="en/c4-train.00001-of-01024.json.gz",
|
||||
split="train"
|
||||
).select(range(1024))["text"]
|
||||
|
||||
quant_config = QuantizeConfig(bits=4, group_size=128)
|
||||
|
||||
model = GPTQModel.load(model_id, quant_config)
|
||||
|
||||
# increase `batch_size` to match gpu/vram specs to speed up quantization
|
||||
model.quantize(calibration_dataset, batch_size=2)
|
||||
|
||||
model.save(quant_path)
|
||||
```
|
||||
|
||||
## Running a quantized model with vLLM
|
||||
|
||||
To run a GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:
|
||||
|
||||
```console
|
||||
python examples/offline_inference/llm_engine_example.py --model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
|
||||
```
|
||||
|
||||
## Using GPTQModel with vLLM's Python API
|
||||
|
||||
GPTQModel quantized models are also supported directly through the LLM entrypoint:
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
# Sample prompts.
|
||||
prompts = [
|
||||
"Hello, my name is",
|
||||
"The president of the United States is",
|
||||
"The capital of France is",
|
||||
"The future of AI is",
|
||||
]
|
||||
|
||||
# Create a sampling params object.
|
||||
sampling_params = SamplingParams(temperature=0.6, top_p=0.9)
|
||||
|
||||
# Create an LLM.
|
||||
llm = LLM(model="ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2")
|
||||
|
||||
# Generate texts from the prompts. The output is a list of RequestOutput objects
|
||||
# that contain the prompt, generated text, and other information.
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
|
||||
# Print the outputs.
|
||||
print("-"*50)
|
||||
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}")
    print("-"*50)
|
||||
```
|
||||
171
docs/features/quantization/int4.md
Normal file
@@ -0,0 +1,171 @@
|
||||
---
|
||||
title: INT4 W4A16
|
||||
---
|
||||
[](){ #int4 }
|
||||
|
||||
vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS).
|
||||
|
||||
Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int4-llms-for-vllm-668ec34bf3c9fa45f857df2c).
|
||||
|
||||
!!! note
|
||||
    INT4 computation is supported on NVIDIA GPUs with compute capability >= 8.0 (Ampere, Ada Lovelace, Hopper, Blackwell).
|
||||
|
||||
## Prerequisites
|
||||
|
||||
To use INT4 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
|
||||
|
||||
```console
|
||||
pip install llmcompressor
|
||||
```
|
||||
|
||||
Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
|
||||
|
||||
```console
|
||||
pip install vllm lm-eval==0.4.4
|
||||
```
|
||||
|
||||
## Quantization Process
|
||||
|
||||
The quantization process involves four main steps:
|
||||
|
||||
1. Loading the model
|
||||
2. Preparing calibration data
|
||||
3. Applying quantization
|
||||
4. Evaluating accuracy in vLLM
|
||||
|
||||
### 1. Loading the Model
|
||||
|
||||
Load your model and tokenizer using the standard `transformers` AutoModel classes:
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
MODEL_ID, device_map="auto", torch_dtype="auto",
|
||||
)
|
||||
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
|
||||
```
|
||||
|
||||
### 2. Preparing Calibration Data
|
||||
|
||||
When quantizing weights to INT4, you need sample data to estimate the weight updates and calibrated scales.
|
||||
It's best to use calibration data that closely matches your deployment data.
|
||||
For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
|
||||
NUM_CALIBRATION_SAMPLES = 512
|
||||
MAX_SEQUENCE_LENGTH = 2048
|
||||
|
||||
# Load and preprocess the dataset
|
||||
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
|
||||
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
|
||||
|
||||
def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)
|
||||
```
|
||||
|
||||
### 3. Applying Quantization
|
||||
|
||||
Now, apply the quantization algorithms:
|
||||
|
||||
```python
|
||||
from llmcompressor.transformers import oneshot
|
||||
from llmcompressor.modifiers.quantization import GPTQModifier
|
||||
|
||||
# Configure the quantization algorithms
|
||||
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
|
||||
|
||||
# Apply quantization
|
||||
oneshot(
|
||||
model=model,
|
||||
dataset=ds,
|
||||
recipe=recipe,
|
||||
max_seq_length=MAX_SEQUENCE_LENGTH,
|
||||
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
|
||||
)
|
||||
|
||||
# Save the compressed model: Meta-Llama-3-8B-Instruct-W4A16-G128
|
||||
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
|
||||
model.save_pretrained(SAVE_DIR, save_compressed=True)
|
||||
tokenizer.save_pretrained(SAVE_DIR)
|
||||
```
|
||||
|
||||
This process creates a W4A16 model with weights quantized to 4-bit integers.
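As a rough illustration of the memory savings, the arithmetic below estimates weight storage before and after quantization. It is a back-of-the-envelope sketch under stated assumptions: a round parameter count of 8 billion, one 16-bit scale per group of 128 weights (the `G128` in the directory name), and no accounting for zero-points, the unquantized `lm_head`, embeddings, activations, or the KV cache. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, weight_bits, group_size=None, scale_bits=16):
    """Approximate weight storage in GiB.

    With group-wise quantization, each group of `group_size` weights also
    stores one scale of `scale_bits` bits (zero-points are ignored here).
    """
    bits_per_weight = weight_bits
    if group_size is not None:
        bits_per_weight += scale_bits / group_size
    return num_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    n = 8_000_000_000  # assumed round parameter count for an "8B" model
    print(f"16-bit weights:        {weight_memory_gib(n, 16):.2f} GiB")
    print(f"W4A16, group size 128: {weight_memory_gib(n, 4, group_size=128):.2f} GiB")
```

The per-group scales add 16 / 128 = 0.125 bits per weight, so the effective cost is about 4.125 bits per weight rather than exactly 4.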
|
||||
|
||||
### 4. Evaluating Accuracy
|
||||
|
||||
After quantization, you can load and run the model in vLLM:
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128")
|
||||
```
|
||||
|
||||
To evaluate accuracy, you can use `lm_eval`:
|
||||
|
||||
```console
|
||||
$ lm_eval --model vllm \
|
||||
--model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A16-G128",add_bos_token=true \
|
||||
--tasks gsm8k \
|
||||
--num_fewshot 5 \
|
||||
--limit 250 \
|
||||
--batch_size 'auto'
|
||||
```
|
||||
|
||||
!!! note
|
||||
Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
|
||||
|
||||
## Best Practices
|
||||
|
||||
- Start with 512 samples for calibration data, and increase if accuracy drops
|
||||
- Ensure the calibration data contains a high variety of samples to prevent overfitting towards a specific use case
|
||||
- Use a sequence length of 2048 as a starting point
|
||||
- Employ the chat template or instruction template that the model was trained with
|
||||
- If you've fine-tuned a model, consider using a sample of your training data for calibration
|
||||
- Tune key hyperparameters to the quantization algorithm:
    - `dampening_frac` sets how much influence the GPTQ algorithm has. Lower values can improve accuracy, but can lead to numerical instabilities that cause the algorithm to fail.
    - `actorder` sets the activation ordering. When compressing the weights of a layer, the order in which channels are quantized matters. Setting `actorder="weight"` can improve accuracy without added latency.
|
||||
|
||||
The following is an example of an expanded quantization recipe you can tune to your own use case:
|
||||
|
||||
```python
|
||||
from compressed_tensors.quantization import (
|
||||
QuantizationArgs,
|
||||
QuantizationScheme,
|
||||
QuantizationStrategy,
|
||||
QuantizationType,
|
||||
)
|
||||
recipe = GPTQModifier(
|
||||
targets="Linear",
|
||||
config_groups={
|
||||
"config_group": QuantizationScheme(
|
||||
targets=["Linear"],
|
||||
weights=QuantizationArgs(
|
||||
num_bits=4,
|
||||
type=QuantizationType.INT,
|
||||
strategy=QuantizationStrategy.GROUP,
|
||||
group_size=128,
|
||||
symmetric=True,
|
||||
dynamic=False,
|
||||
actorder="weight",
|
||||
),
|
||||
),
|
||||
},
|
||||
ignore=["lm_head"],
|
||||
update_size=NUM_CALIBRATION_SAMPLES,
|
||||
dampening_frac=0.01
|
||||
)
|
||||
```
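To show what the `num_bits=4`, `strategy=GROUP`, `group_size=128`, and `symmetric=True` settings describe, here is a minimal pure-Python sketch of symmetric group-wise quantization with plain round-to-nearest. It is only an illustration of the quantization grid: GPTQ additionally compensates for rounding error using calibration data, which this sketch does not attempt. The function names are hypothetical and not part of `llm-compressor`.

```python
def quantize_group_symmetric(weights, num_bits=4, group_size=128):
    """Round-to-nearest symmetric group quantization of a flat list of floats.

    Returns (integer codes, per-group scales).
    """
    qmax = 2 ** (num_bits - 1) - 1  # 7 for INT4
    qmin = -(2 ** (num_bits - 1))   # -8 for INT4
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Symmetric: the zero-point is fixed at 0, only a scale is stored.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(qmin, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize_group(codes, scales, group_size=128):
    """Map integer codes back to floats using the scale of their group."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


if __name__ == "__main__":
    # Tiny example with a group size of 4 so the groups are easy to read.
    w = [3.5, -1.0, 0.5, 0.0, -7.0, 2.0, 0.0, 1.0]
    codes, scales = quantize_group_symmetric(w, group_size=4)
    print(codes, scales)
    print(dequantize_group(codes, scales, group_size=4))
```

Smaller groups give each scale fewer weights to cover, which usually lowers rounding error at the cost of storing more scales.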
|
||||
|
||||
## Troubleshooting and Support
|
||||
|
||||
If you encounter any issues or have feature requests, please open an issue on the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor/issues) GitHub repository. The full INT4 quantization example in `llm-compressor` is available [here](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py).
|
||||
140
docs/features/quantization/int8.md
Normal file
@@ -0,0 +1,140 @@
|
||||
---
|
||||
title: INT8 W8A8
|
||||
---
|
||||
[](){ #int8 }
|
||||
|
||||
vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
|
||||
This quantization method is particularly useful for reducing model size while maintaining good performance.
|
||||
|
||||
Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).
|
||||
|
||||
!!! note
|
||||
    INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper, Blackwell).
|
||||
|
||||
## Prerequisites
|
||||
|
||||
To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
|
||||
|
||||
```console
|
||||
pip install llmcompressor
|
||||
```
|
||||
|
||||
Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
|
||||
|
||||
```console
|
||||
pip install vllm lm-eval==0.4.4
|
||||
```
|
||||
|
||||
## Quantization Process
|
||||
|
||||
The quantization process involves four main steps:
|
||||
|
||||
1. Loading the model
|
||||
2. Preparing calibration data
|
||||
3. Applying quantization
|
||||
4. Evaluating accuracy in vLLM
|
||||
|
||||
### 1. Loading the Model
|
||||
|
||||
Load your model and tokenizer using the standard `transformers` AutoModel classes:
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
MODEL_ID, device_map="auto", torch_dtype="auto",
|
||||
)
|
||||
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
|
||||
```
|
||||
|
||||
### 2. Preparing Calibration Data
|
||||
|
||||
When quantizing activations to INT8, you need sample data to estimate the activation scales.
|
||||
It's best to use calibration data that closely matches your deployment data.
|
||||
For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
|
||||
NUM_CALIBRATION_SAMPLES = 512
|
||||
MAX_SEQUENCE_LENGTH = 2048
|
||||
|
||||
# Load and preprocess the dataset
|
||||
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
|
||||
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
|
||||
|
||||
def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)
|
||||
```
|
||||
|
||||
### 3. Applying Quantization
|
||||
|
||||
Now, apply the quantization algorithms:
|
||||
|
||||
```python
|
||||
from llmcompressor.transformers import oneshot
|
||||
from llmcompressor.modifiers.quantization import GPTQModifier
|
||||
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
|
||||
|
||||
# Configure the quantization algorithms
|
||||
recipe = [
|
||||
SmoothQuantModifier(smoothing_strength=0.8),
|
||||
GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
|
||||
]
|
||||
|
||||
# Apply quantization
|
||||
oneshot(
|
||||
model=model,
|
||||
dataset=ds,
|
||||
recipe=recipe,
|
||||
max_seq_length=MAX_SEQUENCE_LENGTH,
|
||||
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
|
||||
)
|
||||
|
||||
# Save the compressed model: Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token
|
||||
SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
|
||||
model.save_pretrained(SAVE_DIR, save_compressed=True)
|
||||
tokenizer.save_pretrained(SAVE_DIR)
|
||||
```
|
||||
|
||||
This process creates a W8A8 model with weights and activations quantized to 8-bit integers.
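As the `Dynamic-Per-Token` suffix in the directory name suggests, activations are given one scale per token, computed at runtime. The following is a minimal pure-Python sketch of that idea, assuming symmetric INT8 quantization with round-to-nearest; the function name is hypothetical and the code is not taken from vLLM or `llm-compressor`.

```python
def quantize_per_token_int8(activations):
    """Symmetric INT8 quantization with one scale per token (row).

    `activations` is a list of rows; each row is a list of floats.
    Returns (quantized rows, per-row scales).
    """
    q_rows, scales = [], []
    for row in activations:
        max_abs = max(abs(x) for x in row)
        # An all-zero row gets a dummy scale of 1.0 to avoid dividing by zero.
        scale = max_abs / 127 if max_abs > 0 else 1.0
        scales.append(scale)
        q_rows.append([max(-128, min(127, round(x / scale))) for x in row])
    return q_rows, scales


if __name__ == "__main__":
    acts = [
        [127.0, -64.0, 0.0],   # token 0
        [254.0, -100.0, 4.0],  # token 1 has a larger range, so a larger scale
    ]
    q, s = quantize_per_token_int8(acts)
    print(q, s)
```

Because each token gets its own scale, a single token with unusually large activations does not reduce the resolution available to the others.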
|
||||
|
||||
### 4. Evaluating Accuracy
|
||||
|
||||
After quantization, you can load and run the model in vLLM:
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
|
||||
```
|
||||
|
||||
To evaluate accuracy, you can use `lm_eval`:
|
||||
|
||||
```console
|
||||
$ lm_eval --model vllm \
|
||||
--model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \
|
||||
--tasks gsm8k \
|
||||
--num_fewshot 5 \
|
||||
--limit 250 \
|
||||
--batch_size 'auto'
|
||||
```
|
||||
|
||||
!!! note
|
||||
Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
|
||||
|
||||
## Best Practices
|
||||
|
||||
- Start with 512 samples for calibration data (increase if accuracy drops)
|
||||
- Use a sequence length of 2048 as a starting point
|
||||
- Employ the chat template or instruction template that the model was trained with
|
||||
- If you've fine-tuned a model, consider using a sample of your training data for calibration
|
||||
|
||||
## Troubleshooting and Support
|
||||
|
||||
If you encounter any issues or have feature requests, please open an issue on the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor/issues) GitHub repository.
|
||||
78
docs/features/quantization/modelopt.md
Normal file
@@ -0,0 +1,78 @@
|
||||
# NVIDIA TensorRT Model Optimizer
|
||||
|
||||
The [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a library designed to optimize models for inference with NVIDIA GPUs. It includes tools for Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) of Large Language Models (LLMs), Vision Language Models (VLMs), and diffusion models.
|
||||
|
||||
We recommend installing the library with:
|
||||
|
||||
```console
|
||||
pip install nvidia-modelopt
|
||||
```
|
||||
|
||||
## Quantizing HuggingFace Models with PTQ
|
||||
|
||||
You can quantize HuggingFace models using the example scripts provided in the TensorRT Model Optimizer repository. The primary script for LLM PTQ is typically found within the `examples/llm_ptq` directory.
|
||||
|
||||
Below is an example showing how to quantize a model using modelopt's PTQ API:
|
||||
|
||||
```python
|
||||
import modelopt.torch.quantization as mtq
|
||||
from transformers import AutoModelForCausalLM
|
||||
|
||||
# Load the model from HuggingFace
|
||||
model = AutoModelForCausalLM.from_pretrained("<path_or_model_id>")
|
||||
|
||||
# Select the quantization config, for example, FP8
|
||||
config = mtq.FP8_DEFAULT_CFG
|
||||
|
||||
# Define a forward loop function for calibration
|
||||
def forward_loop(model):
    # `calib_set` is your own iterable of calibration batches (not defined in this snippet)
    for data in calib_set:
        model(data)
|
||||
|
||||
# PTQ with in-place replacement of quantized modules
|
||||
model = mtq.quantize(model, config, forward_loop)
|
||||
```
|
||||
|
||||
After the model is quantized, you can export it to a quantized checkpoint using the export API:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from modelopt.torch.export import export_hf_checkpoint
|
||||
|
||||
with torch.inference_mode():
    export_hf_checkpoint(
        model,  # The quantized model.
        export_dir,  # The directory where the exported files will be stored.
    )
|
||||
```
|
||||
|
||||
The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, which is the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`, using vLLM:
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
def main():

    model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"
    # Ensure you specify quantization='modelopt' when loading the modelopt checkpoint
    llm = LLM(model=model_id, quantization="modelopt", trust_remote_code=True)

    sampling_params = SamplingParams(temperature=0.8, top_p=0.9)

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()
|
||||
```
|
||||
148
docs/features/quantization/quantized_kvcache.md
Normal file
@@ -0,0 +1,148 @@
|
||||
---
|
||||
title: Quantized KV Cache
|
||||
---
|
||||
[](){ #quantized-kvcache }
|
||||
|
||||
## FP8 KV Cache
|
||||
|
||||
Quantizing the KV cache to FP8 reduces its memory footprint. This increases the number of tokens that can be stored in the cache, improving throughput.
|
||||
|
||||
### FP8 Formats
|
||||
|
||||
[OCP (Open Compute Project)](https://www.opencompute.org) specifies two common 8-bit floating point data formats:
|
||||
|
||||
- E5M2 (5 exponent bits and 2 mantissa bits)
|
||||
- E4M3FN (4 exponent bits and 3 mantissa bits, often shortened as E4M3)
|
||||
|
||||
The E4M3 format offers higher precision compared to E5M2. However, due to its small dynamic range (±448 for E4M3FN; the FNUZ variant tops out at ±240), E4M3 typically requires a higher-precision (FP32) scaling factor alongside each quantized tensor.
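The dynamic range of each format follows directly from its bit layout. The sketch below computes the largest finite value for E5M2, E4M3FN, and, for comparison, the FNUZ variant of E4M3. The exponent biases (15, 7, and 8) and the rules for which bit patterns are reserved are stated here as assumptions about those encodings; the helper name `max_finite` is hypothetical.

```python
def max_finite(exp_bits, man_bits, bias, top_exponent_reserved, top_mantissa_reserved):
    """Largest finite value of a small floating-point format.

    top_exponent_reserved: True if the all-ones exponent encodes only inf/NaN
        (IEEE-style, as in E5M2).
    top_mantissa_reserved: True if, at the largest usable exponent, the
        all-ones mantissa is taken by NaN (as in E4M3FN).
    """
    max_exp_field = (1 << exp_bits) - 1
    if top_exponent_reserved:
        max_exp_field -= 1
    max_man_field = (1 << man_bits) - 1
    if top_mantissa_reserved:
        max_man_field -= 1
    return 2.0 ** (max_exp_field - bias) * (1 + max_man_field / (1 << man_bits))


E5M2_MAX = max_finite(5, 2, bias=15, top_exponent_reserved=True, top_mantissa_reserved=False)
E4M3FN_MAX = max_finite(4, 3, bias=7, top_exponent_reserved=False, top_mantissa_reserved=True)
E4M3FNUZ_MAX = max_finite(4, 3, bias=8, top_exponent_reserved=False, top_mantissa_reserved=False)

if __name__ == "__main__":
    print(E5M2_MAX, E4M3FN_MAX, E4M3FNUZ_MAX)
```

E5M2 trades one mantissa bit for a much wider range, which is why E4M3 leans on a separate scaling factor to cover tensors with large values.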
|
||||
|
||||
### Current Limitations
|
||||
|
||||
For now, only per-tensor (scalar) scaling factors are supported. Development is ongoing to support scaling factors of a finer granularity (e.g. per-channel).
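The difference in granularity can be sketched in a few lines. The example below computes a single per-tensor scale and, for contrast, one scale per channel (column) for a tiny matrix. It assumes an FP8 maximum of 448 (the E4M3FN value) and uses hypothetical helper names; it is not vLLM's implementation.

```python
FP8_E4M3_MAX = 448.0  # assumed largest finite E4M3FN value


def per_tensor_scale(tensor, fp8_max=FP8_E4M3_MAX):
    """One scale for the whole tensor (given as a list of rows)."""
    max_abs = max(abs(x) for row in tensor for x in row)
    return max_abs / fp8_max


def per_channel_scales(tensor, fp8_max=FP8_E4M3_MAX):
    """One scale per column (channel)."""
    num_channels = len(tensor[0])
    return [max(abs(row[c]) for row in tensor) / fp8_max for c in range(num_channels)]


if __name__ == "__main__":
    t = [[448.0, 7.0], [-224.0, -14.0]]
    print(per_tensor_scale(t))    # dominated by the large first channel
    print(per_channel_scales(t))  # the second channel gets a much finer scale
```

With a single scalar, the channel with the largest magnitude dictates the resolution for every other channel, which is the motivation for finer-grained scales.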
|
||||
|
||||
### Performance Impact
|
||||
|
||||
The current FP8 KV cache implementation primarily benefits throughput by allowing approximately double the amount of space for KV cache allocation. This enables either:
|
||||
|
||||
- Processing longer context lengths for individual requests, or
|
||||
- Handling more concurrent request batches
|
||||
|
||||
However, there are currently no latency improvements as the implementation does not yet include fused dequantization and attention operations. Future releases will support quantized attention with hardware acceleration, which should provide additional performance benefits. While the most recent silicon offerings (e.g. AMD MI300, NVIDIA Hopper or later) support native hardware conversion between FP8 and other formats (fp32, fp16, bf16), this benefit is not yet fully realized.
|
||||
|
||||
Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy, making it a practical choice for throughput optimization.
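The "approximately double" figure can be checked with simple arithmetic. The sketch below assumes a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128) and a hypothetical 16 GiB budget for the KV cache; it ignores the storage for scaling factors, which is negligible with per-tensor scales. The function names are illustrative only.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes needed to cache keys and values for one token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element  # 2 = K and V


def tokens_that_fit(cache_bytes, bytes_per_token):
    """Number of whole tokens that fit in a cache of the given size."""
    return cache_bytes // bytes_per_token


# Assumed model shape, similar to Llama-2-7B.
SHAPE = dict(num_layers=32, num_kv_heads=32, head_dim=128)
FP16_BYTES = kv_cache_bytes_per_token(**SHAPE, bytes_per_element=2)
FP8_BYTES = kv_cache_bytes_per_token(**SHAPE, bytes_per_element=1)

if __name__ == "__main__":
    budget = 16 * 2**30  # hypothetical 16 GiB reserved for the KV cache
    print("FP16 tokens:", tokens_that_fit(budget, FP16_BYTES))
    print("FP8 tokens: ", tokens_that_fit(budget, FP8_BYTES))
```

Halving the bytes per element halves the cost per token, so the same budget holds twice as many tokens, whether they are spent on longer contexts or more concurrent requests.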
|
||||
|
||||
## Usage Example
|
||||
|
||||
Here is an example of how to enable FP8 quantization:
|
||||
|
||||
```python
|
||||
# To calculate KV cache scales on the fly, enable the calculate_kv_scales
# parameter
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
sampling_params = SamplingParams(temperature=0.7, top_p=0.8)
|
||||
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
|
||||
kv_cache_dtype="fp8",
|
||||
calculate_kv_scales=True)
|
||||
prompt = "London is the capital of"
|
||||
out = llm.generate(prompt, sampling_params)[0].outputs[0].text
|
||||
print(out)
|
||||
```
|
||||
|
||||
The `kv_cache_dtype` argument specifies the data type for KV cache storage:
|
||||
- `"auto"`: Uses the model's default "unquantized" data type
|
||||
- `"fp8"` or `"fp8_e4m3"`: Supported on CUDA 11.8+ and ROCm (AMD GPU)
|
||||
- `"fp8_e5m2"`: Supported on CUDA 11.8+
|
||||
|
||||
## Calibrated Scales for Better Accuracy
|
||||
|
||||
For optimal model quality when using FP8 KV Cache, we recommend using calibrated scales tuned to representative inference data. [LLM Compressor](https://github.com/vllm-project/llm-compressor/) is the recommended tool for this process.
|
||||
|
||||
### Installation
|
||||
|
||||
First, install the required dependencies:
|
||||
|
||||
```console
|
||||
pip install llmcompressor
|
||||
```
|
||||
|
||||
### Example Usage
|
||||
|
||||
Here's a complete example using `meta-llama/Llama-3.1-8B-Instruct` (most models can use this same pattern):
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
from llmcompressor.transformers import oneshot
|
||||
|
||||
# Select model and load it
|
||||
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
|
||||
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
|
||||
|
||||
# Select calibration dataset
|
||||
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
|
||||
DATASET_SPLIT = "train_sft"
|
||||
|
||||
# Configure calibration parameters
|
||||
NUM_CALIBRATION_SAMPLES = 512 # 512 samples is a good starting point
|
||||
MAX_SEQUENCE_LENGTH = 2048
|
||||
|
||||
# Load and preprocess dataset
|
||||
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
|
||||
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
|
||||
|
||||
def process_and_tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )
|
||||
|
||||
ds = ds.map(process_and_tokenize, remove_columns=ds.column_names)
|
||||
|
||||
# Configure quantization settings
|
||||
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""
|
||||
|
||||
# Apply quantization
|
||||
oneshot(
|
||||
model=model,
|
||||
dataset=ds,
|
||||
recipe=recipe,
|
||||
max_seq_length=MAX_SEQUENCE_LENGTH,
|
||||
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
|
||||
)
|
||||
|
||||
# Save quantized model: Llama-3.1-8B-Instruct-FP8-KV
|
||||
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-KV"
|
||||
model.save_pretrained(SAVE_DIR, save_compressed=True)
|
||||
tokenizer.save_pretrained(SAVE_DIR)
|
||||
```
|
||||
|
||||
The above script will create a folder in your current directory containing your quantized model (e.g., `Llama-3.1-8B-Instruct-FP8-KV`) with calibrated scales.
|
||||
|
||||
When running the model, you must specify `kv_cache_dtype="fp8"` to enable KV cache quantization and use the calibrated scales.
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
sampling_params = SamplingParams(temperature=0.7, top_p=0.8)
|
||||
llm = LLM(model="Llama-3.1-8B-Instruct-FP8-KV", kv_cache_dtype="fp8")
|
||||
prompt = "London is the capital of"
|
||||
out = llm.generate(prompt, sampling_params)[0].outputs[0].text
|
||||
print(out)
|
||||
```
|
||||
224
docs/features/quantization/quark.md
Normal file
@@ -0,0 +1,224 @@
|
||||
---
|
||||
title: AMD QUARK
|
||||
---
|
||||
[](){ #quark }
|
||||
|
||||
Quantization can effectively reduce memory and bandwidth usage, accelerate computation, and improve
throughput with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),
a flexible and powerful quantization toolkit, to produce performant quantized models that run on AMD GPUs. Quark has specialized support for quantizing large language models with weight,
activation, and KV cache quantization, as well as cutting-edge quantization algorithms such as
AWQ, GPTQ, Rotation, and SmoothQuant.
|
||||
|
||||
## Quark Installation
|
||||
|
||||
Before quantizing models, you need to install Quark. The latest release of Quark can be installed with pip:
|
||||
|
||||
```console
|
||||
pip install amd-quark
|
||||
```
|
||||
|
||||
You can refer to [Quark installation guide](https://quark.docs.amd.com/latest/install.html)
|
||||
for more installation details.
|
||||
|
||||
Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
|
||||
|
||||
```console
|
||||
pip install vllm lm-eval==0.4.4
|
||||
```
|
||||
|
||||
## Quantization Process
|
||||
|
||||
After installing Quark, we will walk through an example to illustrate how to use it.
The Quark quantization process consists of the following five steps:
|
||||
|
||||
1. Load the model
|
||||
2. Prepare the calibration dataloader
|
||||
3. Set the quantization configuration
|
||||
4. Quantize the model and export
|
||||
5. Evaluation in vLLM
|
||||
|
||||
### 1. Load the Model
|
||||
|
||||
Quark uses [Transformers](https://huggingface.co/docs/transformers/en/index)
to fetch the model and tokenizer.
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
MODEL_ID = "meta-llama/Llama-2-70b-chat-hf"
|
||||
MAX_SEQ_LEN = 512
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
MODEL_ID, device_map="auto", torch_dtype="auto",
|
||||
)
|
||||
model.eval()
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, model_max_length=MAX_SEQ_LEN)
|
||||
tokenizer.pad_token = tokenizer.eos_token
|
||||
```
|
||||
|
||||
### 2. Prepare the Calibration Dataloader
|
||||
|
||||
Quark uses the [PyTorch Dataloader](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)
|
||||
to load calibration data. For more details about how to use calibration datasets efficiently, please refer
|
||||
to [Adding Calibration Datasets](https://quark.docs.amd.com/latest/pytorch/calibration_datasets.html).
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
from torch.utils.data import DataLoader
|
||||
|
||||
BATCH_SIZE = 1
|
||||
NUM_CALIBRATION_DATA = 512
|
||||
|
||||
# Load the dataset and get calibration data.
|
||||
dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation")
|
||||
text_data = dataset["text"][:NUM_CALIBRATION_DATA]
|
||||
|
||||
tokenized_outputs = tokenizer(text_data, return_tensors="pt",
|
||||
padding=True, truncation=True, max_length=MAX_SEQ_LEN)
|
||||
calib_dataloader = DataLoader(tokenized_outputs['input_ids'],
|
||||
batch_size=BATCH_SIZE, drop_last=True)
|
||||
```
|
||||
|
||||
### 3. Set the Quantization Configuration
|
||||
|
||||
Next, set the quantization configuration; see the
[Quark config guide](https://quark.docs.amd.com/latest/pytorch/user_guide_config_description.html)
for further details. Here we use FP8 per-tensor quantization for weights, activations,
and the KV cache, with AutoSmoothQuant as the quantization algorithm.
|
||||
|
||||
!!! note
|
||||
    The quantization algorithm needs a JSON config file, which is located in
    [Quark PyTorch examples](https://quark.docs.amd.com/latest/pytorch/pytorch_examples.html),
    under the directory `examples/torch/language_modeling/llm_ptq/models`. For example,
    the AutoSmoothQuant config file for Llama is
    `examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
|
||||
|
||||
```python
|
||||
from quark.torch.quantization import (Config, QuantizationConfig,
|
||||
FP8E4M3PerTensorSpec,
|
||||
load_quant_algo_config_from_file)
|
||||
|
||||
# Define fp8/per-tensor/static spec.
|
||||
FP8_PER_TENSOR_SPEC = FP8E4M3PerTensorSpec(observer_method="min_max",
|
||||
is_dynamic=False).to_quantization_spec()
|
||||
|
||||
# Define global quantization config, input tensors and weight apply FP8_PER_TENSOR_SPEC.
|
||||
global_quant_config = QuantizationConfig(input_tensors=FP8_PER_TENSOR_SPEC,
|
||||
weight=FP8_PER_TENSOR_SPEC)
|
||||
|
||||
# Define quantization config for kv-cache layers, output tensors apply FP8_PER_TENSOR_SPEC.
|
||||
KV_CACHE_SPEC = FP8_PER_TENSOR_SPEC
|
||||
kv_cache_layer_names_for_llama = ["*k_proj", "*v_proj"]
|
||||
kv_cache_quant_config = {name :
|
||||
QuantizationConfig(input_tensors=global_quant_config.input_tensors,
|
||||
weight=global_quant_config.weight,
|
||||
output_tensors=KV_CACHE_SPEC)
|
||||
for name in kv_cache_layer_names_for_llama}
|
||||
layer_quant_config = kv_cache_quant_config.copy()
|
||||
|
||||
# Define algorithm config by config file.
|
||||
LLAMA_AUTOSMOOTHQUANT_CONFIG_FILE = (
    'examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json')
|
||||
algo_config = load_quant_algo_config_from_file(LLAMA_AUTOSMOOTHQUANT_CONFIG_FILE)
|
||||
|
||||
EXCLUDE_LAYERS = ["lm_head"]
|
||||
quant_config = Config(
|
||||
global_quant_config=global_quant_config,
|
||||
layer_quant_config=layer_quant_config,
|
||||
kv_cache_quant_config=kv_cache_quant_config,
|
||||
exclude=EXCLUDE_LAYERS,
|
||||
algo_config=algo_config)
|
||||
```
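
To make "FP8 per-tensor static quantization with a min/max observer" more concrete, here is a minimal sketch of how a single scale for a whole tensor could be derived from calibration values. It is an illustration under stated assumptions, not Quark's implementation: the helper names `per_tensor_scale` and `scale_and_clamp` are hypothetical, the scale is assumed to be symmetric, and rounding onto the FP8 grid is omitted.

```python
# Largest finite magnitude representable in the FP8 E4M3 format.
FP8_E4M3_MAX = 448.0


def per_tensor_scale(values: list[float]) -> float:
    """Derive one static scale for the whole tensor from its min/max (hypothetical helper)."""
    amax = max(abs(min(values)), abs(max(values)))
    # Fall back to 1.0 for an all-zero tensor to avoid dividing by zero later.
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def scale_and_clamp(values: list[float], scale: float) -> list[float]:
    """Scale values into the FP8 range and clamp; rounding to FP8 is left out of this sketch."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


calibration_values = [-2.0, 0.5, 4.48]
scale = per_tensor_scale(calibration_values)
scaled = scale_and_clamp(calibration_values, scale)
```

Because the scale is computed once from calibration data (`is_dynamic=False` in the config above), the same value is reused at inference time rather than being recomputed per batch.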
|
||||
|
||||
### 4. Quantize the Model and Export
|
||||
|
||||
Next, apply the quantization. After quantizing, the model must be frozen
before it can be exported. The model must be exported in the
HuggingFace `safetensors` format; refer to
[HuggingFace format exporting](https://quark.docs.amd.com/latest/pytorch/export/quark_export_hf.html)
for more details on export formats.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from quark.torch import ModelQuantizer, ModelExporter
|
||||
from quark.torch.export import ExporterConfig, JsonExporterConfig
|
||||
|
||||
# Apply quantization.
|
||||
quantizer = ModelQuantizer(quant_config)
|
||||
quant_model = quantizer.quantize_model(model, calib_dataloader)
|
||||
|
||||
# Freeze quantized model to export.
|
||||
freezed_model = quantizer.freeze(model)
|
||||
|
||||
# Define export config.
|
||||
LLAMA_KV_CACHE_GROUP = ["*k_proj", "*v_proj"]
|
||||
export_config = ExporterConfig(json_export_config=JsonExporterConfig())
|
||||
export_config.json_export_config.kv_cache_group = LLAMA_KV_CACHE_GROUP
|
||||
|
||||
# Model: Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant
|
||||
EXPORT_DIR = MODEL_ID.split("/")[1] + "-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant"
|
||||
exporter = ModelExporter(config=export_config, export_dir=EXPORT_DIR)
|
||||
with torch.no_grad():
    exporter.export_safetensors_model(freezed_model,
        quant_config=quant_config, tokenizer=tokenizer)
|
||||
```
|
||||
|
||||
### 5. Evaluation in vLLM
|
||||
|
||||
Now, you can load and run the Quark quantized model directly through the LLM entrypoint:
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
# Sample prompts.
|
||||
prompts = [
|
||||
"Hello, my name is",
|
||||
"The president of the United States is",
|
||||
"The capital of France is",
|
||||
"The future of AI is",
|
||||
]
|
||||
# Create a sampling params object.
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
|
||||
# Create an LLM.
|
||||
llm = LLM(model="Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant",
          kv_cache_dtype='fp8', quantization='quark')
|
||||
# Generate texts from the prompts. The output is a list of RequestOutput objects
|
||||
# that contain the prompt, generated text, and other information.
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
# Print the outputs.
|
||||
print("\nGenerated Outputs:\n" + "-" * 60)
|
||||
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Output: {generated_text!r}")
    print("-" * 60)
|
||||
```
|
||||
|
||||
Or, you can use `lm_eval` to evaluate accuracy:
|
||||
|
||||
```console
|
||||
$ lm_eval --model vllm \
|
||||
--model_args pretrained=Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant,kv_cache_dtype='fp8',quantization='quark' \
|
||||
--tasks gsm8k
|
||||
```
|
||||
|
||||
## Quark Quantization Script
|
||||
In addition to the Python API example above, Quark also offers a
[quantization script](https://quark.docs.amd.com/latest/pytorch/example_quark_torch_llm_ptq.html)
to quantize large language models more conveniently. It supports quantizing models with a variety
of quantization schemes and optimization algorithms, and it can export the quantized model
and run evaluation tasks on the fly. With the script, the example above becomes:
|
||||
|
||||
```console
|
||||
python3 quantize_quark.py --model_dir meta-llama/Llama-2-70b-chat-hf \
|
||||
--output_dir /path/to/output \
|
||||
--quant_scheme w_fp8_a_fp8 \
|
||||
--kv_cache_dtype fp8 \
|
||||
--quant_algo autosmoothquant \
|
||||
--num_calib_data 512 \
|
||||
--model_export hf_format \
|
||||
--tasks gsm8k
|
||||
```
|
||||
28
docs/features/quantization/supported_hardware.md
Normal file
@@ -0,0 +1,28 @@
|
||||
---
|
||||
title: Supported Hardware
|
||||
---
|
||||
[](){ #quantization-supported-hardware }
|
||||
|
||||
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
|
||||
|
||||
| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU | AWS Inferentia | Google TPU |
|
||||
|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-----------|------------------|--------------|
|
||||
| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | ❌ | ❌ |
|
||||
| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | ❌ | ❌ |
|
||||
| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
|
||||
| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ | ❌ | ✅︎ |
|
||||
| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
|
||||
| BitBLAS (GPTQ) | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
|
||||
| AQLM | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
|
||||
| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
|
||||
| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
|
||||
| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
|
||||
|
||||
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
|
||||
- ✅︎ indicates that the quantization method is supported on the specified hardware.
|
||||
- ❌ indicates that the quantization method is not supported on the specified hardware.
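
The SM numbers in the legend can be turned into a quick lookup when checking which column of the table applies to a given GPU. The sketch below is illustrative only: the helper name `arch_from_compute_capability` is hypothetical, and the mapping covers only the architectures named in the legend above.

```python
from typing import Optional

# Mapping taken from the legend above; other GPUs are intentionally not covered.
SM_TO_ARCH = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}


def arch_from_compute_capability(major: int, minor: int) -> Optional[str]:
    """Return the architecture name for an SM version, or None if it is not in the legend."""
    return SM_TO_ARCH.get((major, minor))
```

On a CUDA machine, `torch.cuda.get_device_capability()` returns such a `(major, minor)` tuple, which can be passed to the helper as `arch_from_compute_capability(*capability)`.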
|
||||
|
||||
!!! note
|
||||
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
|
||||
|
||||
For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
|
||||
33
docs/features/quantization/torchao.md
Normal file
@@ -0,0 +1,33 @@
|
||||
# TorchAO
|
||||
|
||||
TorchAO is an architecture optimization library for PyTorch. It provides high-performance dtypes, optimization techniques, and kernels for inference and training, and is composable with native PyTorch features such as `torch.compile` and FSDP. Some benchmark numbers can be found [here](https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks).
|
||||
|
||||
We recommend installing the latest torchao nightly with:
|
||||
|
||||
```console
|
||||
# Install the latest TorchAO nightly build
|
||||
# Choose the CUDA version that matches your system (cu126, cu128, etc.)
|
||||
pip install --pre torchao>=10.0.0 --index-url https://download.pytorch.org/whl/nightly/cu126
|
||||
```
|
||||
|
||||
## Quantizing HuggingFace Models
|
||||
You can quantize your own Hugging Face model with torchao, e.g. via [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) or [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to the Hugging Face Hub like [this](https://huggingface.co/jerryzh168/llama3-8b-int8wo) with the following example code:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
from torchao.quantization import Int8WeightOnlyConfig
|
||||
|
||||
model_name = "meta-llama/Meta-Llama-3-8B"
|
||||
quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config)
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
|
||||
|
||||
hub_repo = "YOUR_HUB_REPO_ID"  # replace with your Hub repo ID
|
||||
tokenizer.push_to_hub(hub_repo)
|
||||
quantized_model.push_to_hub(hub_repo, safe_serialization=False)
|
||||
```
|
||||
|
||||
Alternatively, you can use the [TorchAO Quantization space](https://huggingface.co/spaces/medmekk/TorchAO_Quantization) for quantizing models with a simple UI.
|
||||
325
docs/features/reasoning_outputs.md
Normal file
@@ -0,0 +1,325 @@
|
||||
---
|
||||
title: Reasoning Outputs
|
||||
---
|
||||
[](){ #reasoning-outputs }
|
||||
|
||||
vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
|
||||
|
||||
Reasoning models return an additional `reasoning_content` field in their outputs, which contains the reasoning steps that led to the final conclusion. This field is not present in the outputs of other models.
|
||||
|
||||
## Supported Models
|
||||
|
||||
vLLM currently supports the following reasoning models:
|
||||
|
||||
| Model Series | Parser Name | Structured Output Support | Tool Calling |
|
||||
|--------------|-------------|------------------|-------------|
|
||||
| [DeepSeek R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) | `deepseek_r1` | `guided_json`, `guided_regex` | ❌ |
|
||||
| [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | `deepseek_r1` | `guided_json`, `guided_regex` | ✅ |
|
||||
| [IBM Granite 3.2 language models](https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a) | `granite` | ❌ | ❌ |
|
||||
| [Qwen3 series](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `qwen3` | `guided_json`, `guided_regex` | ✅ |
|
||||
|
||||
!!! note
|
||||
IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`.
|
||||
The reasoning feature for the Qwen3 series is enabled by default. To disable it, you must pass `enable_thinking=False` in your `chat_template_kwargs`.
|
||||
|
||||
## Quickstart
|
||||
|
||||
To use reasoning models, specify the `--reasoning-parser` flag when launching the vLLM server. The flag selects the reasoning parser used to extract reasoning content from the model output.
|
||||
|
||||
```bash
|
||||
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --reasoning-parser deepseek_r1
|
||||
```
|
||||
|
||||
Next, make a request to the model that should return the reasoning content in the response.
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
|
||||
# Modify OpenAI's API key and API base to use vLLM's API server.
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
models = client.models.list()
|
||||
model = models.data[0].id
|
||||
|
||||
# Round 1
|
||||
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
|
||||
# For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}`
|
||||
# For Qwen3 series, if you want to disable thinking in reasoning mode, add:
|
||||
# extra_body={"chat_template_kwargs": {"enable_thinking": False}}
|
||||
response = client.chat.completions.create(model=model, messages=messages)
|
||||
|
||||
reasoning_content = response.choices[0].message.reasoning_content
|
||||
content = response.choices[0].message.content
|
||||
|
||||
print("reasoning_content:", reasoning_content)
|
||||
print("content:", content)
|
||||
```
|
||||
|
||||
The `reasoning_content` field contains the reasoning steps that led to the final conclusion, while the `content` field contains the final conclusion.
|
||||
|
||||
## Streaming chat completions
|
||||
|
||||
Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field in [chat completion response chunks](https://platform.openai.com/docs/api-reference/chat/streaming).
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "chatcmpl-123",
|
||||
"object": "chat.completion.chunk",
|
||||
"created": 1694268190,
|
||||
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
|
||||
"system_fingerprint": "fp_44709d6fcb",
|
||||
"choices": [
|
||||
    {
      "index": 0,
      "delta": {
        "role": "assistant",
        "reasoning_content": "is"
      },
      "logprobs": null,
      "finish_reason": null
    }
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The OpenAI Python client library does not officially support the `reasoning_content` attribute for streaming output, but the client does accept extra attributes in the response. You can use `hasattr` to check whether the `reasoning_content` attribute is present in the response. For example:
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
|
||||
# Modify OpenAI's API key and API base to use vLLM's API server.
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
models = client.models.list()
|
||||
model = models.data[0].id
|
||||
|
||||
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
|
||||
# For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}`
|
||||
# For Qwen3 series, if you want to disable thinking in reasoning mode, add:
|
||||
# extra_body={"chat_template_kwargs": {"enable_thinking": False}}
|
||||
stream = client.chat.completions.create(model=model,
|
||||
messages=messages,
|
||||
stream=True)
|
||||
|
||||
print("client: Start streaming chat completions...")
|
||||
printed_reasoning_content = False
|
||||
printed_content = False
|
||||
|
||||
for chunk in stream:
    reasoning_content = None
    content = None
    # Check the content is reasoning_content or content
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    elif hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("reasoning_content:", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    elif content is not None:
        if not printed_content:
            printed_content = True
            print("\ncontent:", end="", flush=True)
        # Extract and print the content
        print(content, end="", flush=True)
|
||||
```
|
||||
|
||||
Remember to check whether `reasoning_content` exists in the response before accessing it. You can check out the [example](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py).
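
One compact way to perform this check is `getattr` with a default, which returns `None` instead of raising when the attribute is absent. The sketch below is a standalone illustration: the helper `read_delta` is hypothetical, and the `SimpleNamespace` objects are stand-ins that only mimic the shape of `chunk.choices[0].delta`.

```python
from types import SimpleNamespace


def read_delta(delta):
    """Return (reasoning_content, content), using None for any field the delta lacks."""
    return (getattr(delta, "reasoning_content", None),
            getattr(delta, "content", None))


# Stand-ins for streamed deltas; a real delta comes from the OpenAI client.
reasoning_delta = SimpleNamespace(role="assistant", reasoning_content="is")
answer_delta = SimpleNamespace(content="9.8")

print(read_delta(reasoning_delta))
print(read_delta(answer_delta))
```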
|
||||
|
||||
## Structured output
|
||||
|
||||
The reasoning content is also available in the structured output. A structured output engine such as `xgrammar` will use the reasoning content to generate structured output. This is currently supported only in the V0 engine.
|
||||
|
||||
```bash
|
||||
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --reasoning-parser deepseek_r1
|
||||
```
|
||||
|
||||
The following is an example client:
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
from pydantic import BaseModel
|
||||
|
||||
# Modify OpenAI's API key and API base to use vLLM's API server.
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
models = client.models.list()
|
||||
model = models.data[0].id
|
||||
|
||||
class People(BaseModel):
    name: str
    age: int
|
||||
|
||||
json_schema = People.model_json_schema()
|
||||
|
||||
prompt = ("Generate a JSON with the name and age of one random person.")
|
||||
completion = client.chat.completions.create(
|
||||
model=model,
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": prompt,
|
||||
}],
|
||||
extra_body={"guided_json": json_schema},
|
||||
)
|
||||
print("reasoning_content: ", completion.choices[0].message.reasoning_content)
|
||||
print("content: ", completion.choices[0].message.content)
|
||||
```
|
||||
|
||||
## Tool Calling
|
||||
|
||||
The reasoning content is also available when both tool calling and the reasoning parser are enabled. Additionally, tool calling only parses functions from the `content` field, not from the `reasoning_content`.
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
|
||||
|
||||
tools = [{
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "get_weather",
|
||||
"description": "Get the current weather in a given location",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
|
||||
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
|
||||
},
|
||||
"required": ["location", "unit"]
|
||||
}
|
||||
}
|
||||
}]
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model=client.models.list().data[0].id,
|
||||
messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
|
||||
tools=tools,
|
||||
tool_choice="auto"
|
||||
)
|
||||
|
||||
print(response)
|
||||
tool_call = response.choices[0].message.tool_calls[0].function
|
||||
|
||||
print(f"reasoning_content: {response.choices[0].message.reasoning_content}")
|
||||
print(f"Function called: {tool_call.name}")
|
||||
print(f"Arguments: {tool_call.arguments}")
|
||||
```
|
||||
|
||||
For more examples, please refer to <gh-file:examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py>.
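
Once a tool call has been returned, the client still has to execute it. In the OpenAI-compatible response, `tool_call.arguments` is a JSON-encoded string, so it must be decoded before use. The following is a minimal sketch of that step; the local `get_weather` implementation, the `TOOL_REGISTRY` dictionary, and the `dispatch` helper are all hypothetical names introduced for illustration.

```python
import json


def get_weather(location: str, unit: str) -> str:
    # Hypothetical local implementation; a real one would query a weather service.
    return f"Weather in {location} reported in {unit}"


# Maps tool names advertised to the model onto local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch(name: str, arguments: str) -> str:
    """Decode the JSON argument string and call the matching local function."""
    return TOOL_REGISTRY[name](**json.loads(arguments))


result = dispatch("get_weather",
                  '{"location": "San Francisco, CA", "unit": "celsius"}')
```

With the response from the example above, the call would be `dispatch(tool_call.name, tool_call.arguments)`.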
|
||||
|
||||
## Limitations
|
||||
|
||||
- The reasoning content is only available for online serving's chat completion endpoint (`/v1/chat/completions`).
|
||||
|
||||
## How to support a new reasoning model
|
||||
|
||||
You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
|
||||
|
||||
```python
|
||||
# import the required packages
from collections.abc import Sequence
from typing import Optional, Union

from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
                                              DeltaMessage)
from vllm.reasoning import ReasoningParser, ReasoningParserManager
from vllm.transformers_utils.tokenizer import AnyTokenizer

# define a reasoning parser and register it to vllm
# the name list in register_module can be used
# in --reasoning-parser.
@ReasoningParserManager.register_module(["example"])
class ExampleParser(ReasoningParser):
    def __init__(self, tokenizer: AnyTokenizer):
        super().__init__(tokenizer)

    def extract_reasoning_content_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
    ) -> Union[DeltaMessage, None]:
        """
        Instance method that should be implemented for extracting reasoning
        from an incomplete response; for use when handling reasoning calls and
        streaming. Has to be an instance method because it requires state -
        the current tokens/diffs, but also the information about what has
        previously been parsed and extracted (see constructor)
        """

    def extract_reasoning_content(
            self, model_output: str, request: ChatCompletionRequest
    ) -> tuple[Optional[str], Optional[str]]:
        """
        Extract reasoning content from a complete model-generated string.

        Used for non-streaming responses where we have the entire model response
        available before sending to the client.

        Parameters:
        model_output: str
            The model-generated string to extract reasoning content from.

        request: ChatCompletionRequest
            The request object that was used to generate the model_output.

        Returns:
        tuple[Optional[str], Optional[str]]
            A tuple containing the reasoning content and the content.
        """
|
||||
```
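
To illustrate what the non-streaming method is expected to return, here is a standalone sketch of the split between reasoning and final content for models that wrap their reasoning in `<think>` and `</think>`. It is not vLLM's parser: the function name `split_reasoning` is hypothetical, and treating output without an end marker as plain content is an assumption of this sketch that a real parser may handle differently.

```python
from typing import Optional


def split_reasoning(
    model_output: str,
    start_token: str = "<think>",
    end_token: str = "</think>",
) -> tuple[Optional[str], Optional[str]]:
    """Return (reasoning_content, content) for a complete model output."""
    if end_token not in model_output:
        # Assumption of this sketch: no end marker means there is no reasoning part.
        return None, model_output
    reasoning, _, content = model_output.partition(end_token)
    # The start marker is optional; drop it when present (str.removeprefix needs Python 3.9+).
    reasoning = reasoning.removeprefix(start_token)
    return reasoning or None, content or None


print(split_reasoning("<think>9.8 > 9.11</think>9.8 is greater."))
```

A streaming implementation has to make the same decision incrementally, which is why the streaming method receives both the previous and the current text and token IDs.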
|
||||
|
||||
Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
|
||||
|
||||
```python
|
||||
@dataclass
class DeepSeekReasoner(Reasoner):
    """
    Reasoner for DeepSeek R series models.
    """
    start_token_id: int
    end_token_id: int

    start_token: str = "<think>"
    end_token: str = "</think>"

    @classmethod
    def from_tokenizer(cls, tokenizer: PreTrainedTokenizer) -> Reasoner:
        return cls(start_token_id=tokenizer.encode(
            "<think>", add_special_tokens=False)[0],
                   end_token_id=tokenizer.encode("</think>",
                                                 add_special_tokens=False)[0])

    def is_reasoning_end(self, input_ids: list[int]) -> bool:
        return self.end_token_id in input_ids
    ...
|
||||
```
|
||||
|
||||
The structured output engine like [xgrammar](https://github.com/mlc-ai/xgrammar) will use `end_token_id` to check if the reasoning content is present in the model output and skip the structured output if it is the case.
|
||||
|
||||
Finally, you can enable reasoning for the model by using the `--reasoning-parser` flag.
|
||||
|
||||
```bash
|
||||
vllm serve <model_tag> --reasoning-parser example
|
||||
```
|
||||
272
docs/features/spec_decode.md
Normal file
@@ -0,0 +1,272 @@
|
||||
---
|
||||
title: Speculative Decoding
|
||||
---
|
||||
[](){ #spec-decode }
|
||||
|
||||
!!! warning
|
||||
Please note that speculative decoding in vLLM is not yet optimized and does
|
||||
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
|
||||
The work to optimize it is ongoing and can be followed here: <gh-issue:4630>
|
||||
|
||||
!!! warning
|
||||
Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
|
||||
|
||||
This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
|
||||
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
|
||||
|
||||
## Speculating with a draft model
|
||||
|
||||
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
prompts = [
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
|
||||
llm = LLM(
|
||||
model="facebook/opt-6.7b",
|
||||
tensor_parallel_size=1,
|
||||
speculative_config={
|
||||
"model": "facebook/opt-125m",
|
||||
"num_speculative_tokens": 5,
|
||||
},
|
||||
)
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
|
||||
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
```
|
||||
|
||||
To do the same in online mode, launch the server:
|
||||
|
||||
```bash
|
||||
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
|
||||
--seed 42 -tp 1 --gpu_memory_utilization 0.8 \
|
||||
--speculative_config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'
|
||||
```
|
||||
|
||||
!!! warning
|
||||
    Use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately is deprecated.
|
||||
|
||||
Then use a client:
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
|
||||
# Modify OpenAI's API key and API base to use vLLM's API server.
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
# defaults to os.environ.get("OPENAI_API_KEY")
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
models = client.models.list()
|
||||
model = models.data[0].id
|
||||
|
||||
# Completion API
|
||||
stream = False
|
||||
completion = client.completions.create(
|
||||
model=model,
|
||||
prompt="The future of AI is",
|
||||
echo=False,
|
||||
n=1,
|
||||
stream=stream,
|
||||
)
|
||||
|
||||
print("Completion results:")
|
||||
if stream:
    for c in completion:
        print(c)
else:
    print(completion)
|
||||
```
|
||||
|
||||
## Speculating by matching n-grams in the prompt
|
||||
|
||||
The following code configures vLLM to use speculative decoding where proposals are generated by
|
||||
matching n-grams in the prompt. For more information read [this thread.](https://x.com/joao_gante/status/1747322413006643259)
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
prompts = [
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
|
||||
llm = LLM(
|
||||
model="facebook/opt-6.7b",
|
||||
tensor_parallel_size=1,
|
||||
speculative_config={
|
||||
"method": "ngram",
|
||||
"num_speculative_tokens": 5,
|
||||
"prompt_lookup_max": 4,
|
||||
},
|
||||
)
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
|
||||
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
```
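
The idea behind n-gram proposals can be sketched in a few lines: take the last few tokens of the context, look for an earlier occurrence of the same n-gram, and propose the tokens that followed it. The code below is a simplified illustration, not vLLM's implementation; the function `propose_ngram` and its parameter names are hypothetical, with `max_ngram` loosely playing the role of `prompt_lookup_max` in the configuration above.

```python
from collections.abc import Sequence


def propose_ngram(token_ids: Sequence[int], max_ngram: int,
                  num_speculative_tokens: int) -> list[int]:
    """Propose draft tokens by finding an earlier occurrence of the trailing n-gram."""
    tokens = list(token_ids)
    # Try the longest n-gram first, then fall back to shorter ones.
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan earlier start positions, most recent first, skipping the suffix itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                follow = tokens[start + n:start + n + num_speculative_tokens]
                if follow:
                    return follow
    # No earlier occurrence: nothing to speculate on for this step.
    return []


draft = propose_ngram([1, 2, 3, 4, 1, 2], max_ngram=2, num_speculative_tokens=3)
```

Because the proposals come straight from the existing context, no draft model is needed, and the approach tends to pay off when the output repeats spans of the input, as in summarization or code editing.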
|
||||
|
||||
## Speculating using MLP speculators
|
||||
|
||||
The following code configures vLLM to use speculative decoding where proposals are generated by
|
||||
draft models that condition draft predictions on both context vectors and sampled tokens.
|
||||
For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
|
||||
[this technical report](https://arxiv.org/abs/2404.19124).
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
prompts = [
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
|
||||
tensor_parallel_size=4,
|
||||
speculative_config={
|
||||
"model": "ibm-ai-platform/llama3-70b-accelerator",
|
||||
"draft_tensor_parallel_size": 1,
|
||||
},
|
||||
)
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
|
||||
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
```
|
||||
|
||||
Note that these speculative models currently need to be run without tensor parallelism, although
|
||||
it is possible to run the main model using tensor parallelism (see example above). Since the
|
||||
speculative models are relatively small, we still see significant speedups. However, this
|
||||
limitation will be fixed in a future release.
|
||||
|
||||
A variety of speculative models of this type are available on HF hub:
|
||||
|
||||
- [llama-13b-accelerator](https://huggingface.co/ibm-ai-platform/llama-13b-accelerator)
|
||||
- [llama3-8b-accelerator](https://huggingface.co/ibm-ai-platform/llama3-8b-accelerator)
|
||||
- [codellama-34b-accelerator](https://huggingface.co/ibm-ai-platform/codellama-34b-accelerator)
|
||||
- [llama2-70b-accelerator](https://huggingface.co/ibm-ai-platform/llama2-70b-accelerator)
|
||||
- [llama3-70b-accelerator](https://huggingface.co/ibm-ai-platform/llama3-70b-accelerator)
|
||||
- [granite-3b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-3b-code-instruct-accelerator)
|
||||
- [granite-8b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-8b-code-instruct-accelerator)
|
||||
- [granite-7b-instruct-accelerator](https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator)
|
||||
- [granite-20b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator)
|
||||
|
||||
## Speculating using EAGLE based draft models
|
||||
|
||||
The following code configures vLLM to use speculative decoding where proposals are generated by
|
||||
an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](<gh-file:examples/offline_inference/eagle.py>).
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
prompts = [
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3-8B-Instruct",
|
||||
tensor_parallel_size=4,
|
||||
speculative_config={
|
||||
"model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
|
||||
"draft_tensor_parallel_size": 1,
|
||||
},
|
||||
)
|
||||
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
|
||||
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
|
||||
```
|
||||
|
||||
A few important things to consider when using the EAGLE based draft models:
|
||||
|
||||
1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) should
   be able to be loaded and used directly by vLLM after [PR 12304](https://github.com/vllm-project/vllm/pull/12304).
   If you are using a vLLM version from before [PR 12304](https://github.com/vllm-project/vllm/pull/12304), please use the
   [script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model,
   and specify `"model": "path/to/modified/eagle/model"` in `speculative_config`. If weight-loading problems still occur when using the latest version of vLLM, please leave a comment or raise an issue.
|
||||
|
||||
2. The EAGLE based draft models need to be run without tensor parallelism
|
||||
   (i.e. `draft_tensor_parallel_size` is set to 1 in `speculative_config`), although
|
||||
it is possible to run the main model using tensor parallelism (see example above).
|
||||
|
||||
3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
|
||||
reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
|
||||
investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).
|
||||
|
||||
A variety of EAGLE draft models are available on the Hugging Face hub:
|
||||
|
||||
| Base Model                 | EAGLE on Hugging Face               | # EAGLE Parameters |
|----------------------------|-------------------------------------|--------------------|
| Vicuna-7B-v1.3             | yuhuili/EAGLE-Vicuna-7B-v1.3        | 0.24B              |
| Vicuna-13B-v1.3            | yuhuili/EAGLE-Vicuna-13B-v1.3       | 0.37B              |
| Vicuna-33B-v1.3            | yuhuili/EAGLE-Vicuna-33B-v1.3       | 0.56B              |
| LLaMA2-Chat 7B             | yuhuili/EAGLE-llama2-chat-7B        | 0.24B              |
| LLaMA2-Chat 13B            | yuhuili/EAGLE-llama2-chat-13B       | 0.37B              |
| LLaMA2-Chat 70B            | yuhuili/EAGLE-llama2-chat-70B       | 0.99B              |
| Mixtral-8x7B-Instruct-v0.1 | yuhuili/EAGLE-mixtral-instruct-8x7B | 0.28B              |
| LLaMA3-Instruct 8B         | yuhuili/EAGLE-LLaMA3-Instruct-8B    | 0.25B              |
| LLaMA3-Instruct 70B        | yuhuili/EAGLE-LLaMA3-Instruct-70B   | 0.99B              |
| Qwen2-7B-Instruct          | yuhuili/EAGLE-Qwen2-7B-Instruct     | 0.26B              |
| Qwen2-72B-Instruct         | yuhuili/EAGLE-Qwen2-72B-Instruct    | 1.05B              |
|
||||
|
||||
## Lossless guarantees of Speculative Decoding
|
||||
|
||||
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
|
||||
speculative decoding, breaking down the guarantees into three key areas:
|
||||
|
||||
1. **Theoretical Losslessness**
   \- Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might
   cause slight variations in output distributions, as discussed
   in [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/pdf/2302.01318).

2. **Algorithmic Losslessness**
   \- vLLM’s implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:

    > - **Rejection Sampler Convergence**: Ensures that samples from vLLM’s rejection sampler align with the target
    >   distribution. [View Test Code](https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252)
    > - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
    >   without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
    >   provides a lossless guarantee. Almost all of the tests in <gh-dir:tests/spec_decode/e2e>
    >   verify this property using [this assertion implementation](https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291).

3. **vLLM Logprob Stability**
   \- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
   same request across runs. For more details, see the FAQ section
   titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
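
To make the theoretical claim concrete, the sketch below computes the exact output distribution of a single speculative sampling step for a toy three-token vocabulary. It follows the accept/resample rule from the speculative sampling paper linked above; it is an illustration only, not vLLM's rejection sampler, and the `target` and `draft` probabilities are made-up values.

```python
def speculative_output_distribution(target, draft):
    """Exact output distribution of one speculative sampling step.

    target, draft: probability lists over the same toy vocabulary.
    """
    # A token x proposed by the draft (probability draft[x]) is accepted
    # with probability min(1, target[x] / draft[x]).
    accepted = [q * min(1.0, p / q) if q > 0 else 0.0
                for p, q in zip(target, draft)]
    reject_prob = 1.0 - sum(accepted)

    # On rejection, a token is resampled from the normalised residual
    # max(0, target - draft).
    residual = [max(0.0, p - q) for p, q in zip(target, draft)]
    residual_mass = sum(residual)
    if residual_mass == 0.0:
        return accepted  # draft equals target: nothing is ever rejected

    return [a + reject_prob * r / residual_mass
            for a, r in zip(accepted, residual)]


target = [0.6, 0.3, 0.1]  # toy target-model probabilities
draft = [0.2, 0.5, 0.3]   # toy draft-model probabilities
result = speculative_output_distribution(target, draft)
print(result)
```

Up to floating-point rounding, `result` equals `target`, which is the sense in which the sampling scheme is lossless.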
|
||||
|
||||
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to the following factors:

- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
  due to non-deterministic behavior in batched operations or numerical instability.
|
||||
|
||||
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
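
The greedy sampling equality property described above can be illustrated with toy stand-ins for the two models. Everything in this sketch is hypothetical (`target_next` and `draft_next` are arbitrary deterministic functions, not real models), and it ignores the numerical effects listed above; it only shows why accepting the matching prefix of a proposal and then emitting the target's own token reproduces plain greedy decoding.

```python
def target_next(context):
    """Toy deterministic 'target model' (hypothetical)."""
    return (sum(context) * 31 + len(context)) % 7


def draft_next(context):
    """Toy 'draft model': disagrees with the target on some contexts."""
    token = target_next(context)
    return (token + 1) % 7 if len(context) % 3 == 0 else token


def greedy_decode(prompt, num_tokens):
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_greedy_decode(prompt, num_tokens, k=4):
    tokens = list(prompt)
    end = len(prompt) + num_tokens
    while len(tokens) < end:
        # 1. The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target scores every proposed position plus one bonus
        #    position (a single batched forward pass in a real system).
        verified = [target_next(tokens + proposal[:i]) for i in range(k + 1)]
        # 3. Keep the longest matching prefix, then the target's own token.
        accepted = 0
        while accepted < k and proposal[accepted] == verified[accepted]:
            accepted += 1
        tokens.extend(proposal[:accepted] + [verified[accepted]])
    return tokens[len(prompt):end]


baseline = greedy_decode([1, 2, 3], 20)
speculative = speculative_greedy_decode([1, 2, 3], 20)
print(baseline == speculative)
```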
|
||||
|
||||
## Resources for vLLM contributors
|
||||
|
||||
- [A Hacker's Guide to Speculative Decoding in vLLM](https://www.youtube.com/watch?v=9wNAgpX6z_4)
|
||||
- [What is Lookahead Scheduling in vLLM?](https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a)
|
||||
- [Information on batch expansion](https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8)
|
||||
- [Dynamic speculative decoding](gh-issue:4565)
|
||||
268
docs/features/structured_outputs.md
Normal file
@@ -0,0 +1,268 @@
|
||||
---
|
||||
title: Structured Outputs
|
||||
---
|
||||
[](){ #structured-outputs }
|
||||
|
||||
vLLM supports the generation of structured outputs using
|
||||
[xgrammar](https://github.com/mlc-ai/xgrammar) or
|
||||
[guidance](https://github.com/guidance-ai/llguidance) as backends.
|
||||
This document shows you some examples of the different options that are
|
||||
available to generate structured outputs.
|
||||
|
||||
## Online Serving (OpenAI API)
|
||||
|
||||
You can generate structured outputs using OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs.
|
||||
|
||||
The following parameters are supported, which must be added as extra parameters:
|
||||
|
||||
- `guided_choice`: the output will be exactly one of the choices.
- `guided_regex`: the output will follow the regex pattern.
- `guided_json`: the output will follow the JSON schema.
- `guided_grammar`: the output will follow the context-free grammar.
- `structural_tag`: the output will follow a JSON schema within a set of specified tags in the generated text.
|
||||
|
||||
You can see the complete list of supported parameters on the [OpenAI-Compatible Server][openai-compatible-server] page.
|
||||
|
||||
Structured outputs are supported by default in the OpenAI-Compatible Server. You
|
||||
may choose to specify the backend to use by setting the
|
||||
`--guided-decoding-backend` flag to `vllm serve`. The default backend is `auto`,
|
||||
which will try to choose an appropriate backend based on the details of the
|
||||
request. You may also choose a specific backend, along with
|
||||
some options. A full set of options is available in the `vllm serve --help`
|
||||
text.
|
||||
|
||||
Now let's see an example for each of the cases, starting with `guided_choice`, as it's the easiest one:
|
||||
|
||||
```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="-",
)

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```
|
||||
|
||||
The next example shows how to use the `guided_regex`. The idea is to generate an email address, given a simple regex template:
|
||||
|
||||
```python
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
        }
    ],
    extra_body={"guided_regex": r"\w+@\w+\.com\n", "stop": ["\n"]},
)
print(completion.choices[0].message.content)
```
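
It can help to test a pattern locally before sending it as `guided_regex`. The sketch below uses only Python's standard `re` module and the pattern from the request above; `matches_guided_regex` is a hypothetical helper name. Note that `\w` does not match `.`, so an address with a dot in the local part (like the one in the prompt text) does not satisfy this particular pattern.

```python
import re

EMAIL_PATTERN = r"\w+@\w+\.com\n"  # same pattern as in the request above


def matches_guided_regex(text: str, pattern: str = EMAIL_PATTERN) -> bool:
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches_guided_regex("alan_turing@enigma.com\n"))  # True
print(matches_guided_regex("alan.turing@enigma.com\n"))  # False: '.' is not \w
print(matches_guided_regex("alan_turing@enigma.com"))    # False: no newline
```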
|
||||
|
||||
One of the most relevant features in structured text generation is the option to generate a valid JSON with pre-defined fields and formats.
|
||||
For this we can use the `guided_json` parameter in two different ways:
|
||||
|
||||
- Using a [JSON Schema](https://json-schema.org/) directly
- Defining a [Pydantic model](https://docs.pydantic.dev/latest/) and then extracting the JSON Schema from it (which is normally an easier option).
|
||||
|
||||
The next example shows how to use the `guided_json` parameter with a Pydantic model:
|
||||
|
||||
```python
from pydantic import BaseModel
from enum import Enum

class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"

class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: CarType

json_schema = CarDescription.model_json_schema()

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
        }
    ],
    extra_body={"guided_json": json_schema},
)
print(completion.choices[0].message.content)
```
|
||||
|
||||
!!! tip
    While not strictly necessary, it's normally better to indicate in the prompt the
    JSON schema and how the fields should be populated. This can improve the
    results notably in most cases.
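
For the first option, passing a JSON Schema directly, a hand-written schema can be used in place of the Pydantic model. The sketch below is an assumption-laden illustration: `car_schema` is a hand-written approximation of the `CarDescription` model above (the schema Pydantic generates is structured differently), and `conforms_to_car_schema` is a deliberately small check for these three fields, not a full JSON Schema validator.

```python
import json

# Hand-written approximation of the CarDescription model above.
car_schema = {
    "type": "object",
    "properties": {
        "brand": {"type": "string"},
        "model": {"type": "string"},
        "car_type": {"type": "string",
                     "enum": ["sedan", "SUV", "Truck", "Coupe"]},
    },
    "required": ["brand", "model", "car_type"],
}


def conforms_to_car_schema(text: str) -> bool:
    """Minimal client-side check of a response against car_schema."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # All three required fields must be present and be strings.
    for key in car_schema["required"]:
        if not isinstance(data.get(key), str):
            return False
    return data["car_type"] in car_schema["properties"]["car_type"]["enum"]


print(conforms_to_car_schema(
    '{"brand": "Toyota", "model": "Supra", "car_type": "Coupe"}'))
```

The dict would be sent the same way as in the Pydantic example, that is as `extra_body={"guided_json": car_schema}`.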
|
||||
|
||||
Finally we have the `guided_grammar` option, which is probably the most
difficult to use, but it's really powerful. It allows us to define complete
languages like SQL queries. It works by using a context-free EBNF grammar.
As an example, we can use it to define a specific format of simplified SQL queries:
|
||||
|
||||
```python
simplified_sql_grammar = """
    root ::= select_statement

    select_statement ::= "SELECT " column " from " table " where " condition

    column ::= "col_1 " | "col_2 "

    table ::= "table_1 " | "table_2 "

    condition ::= column "= " number

    number ::= "1 " | "2 "
"""

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
        }
    ],
    extra_body={"guided_grammar": simplified_sql_grammar},
)
print(completion.choices[0].message.content)
```
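
Because this grammar is finite, the full set of strings it permits can be enumerated, which makes it clear how tightly the output is constrained regardless of what the prompt asks for. The sketch below re-implements the grammar's rules by hand in plain Python; it is a reading aid and does not use any grammar backend.

```python
from itertools import product

# Terminals copied from the grammar above (trailing spaces included).
COLUMNS = ["col_1 ", "col_2 "]
TABLES = ["table_1 ", "table_2 "]
NUMBERS = ["1 ", "2 "]


def enumerate_queries():
    """List every string derivable from `root` in the grammar above."""
    queries = []
    for column, table, cond_column, number in product(
            COLUMNS, TABLES, COLUMNS, NUMBERS):
        condition = cond_column + "= " + number
        queries.append(
            "SELECT " + column + " from " + table + " where " + condition)
    return queries


queries = enumerate_queries()
print(len(queries))      # 16
print(repr(queries[0]))  # 'SELECT col_1  from table_1  where col_1 = 1 '
```

Only these 16 strings can be generated, so the column and table names mentioned in the prompt (`username`, `email`, `users`) cannot appear in the output.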
|
||||
|
||||
Full example: <gh-file:examples/online_serving/openai_chat_completion_structured_outputs.py>
|
||||
|
||||
## Experimental Automatic Parsing (OpenAI API)
|
||||
|
||||
This section covers the OpenAI beta wrapper over the `client.chat.completions.create()` method that provides richer integrations with Python specific types.
|
||||
|
||||
At the time of writing (`openai==1.54.4`), this is a "beta" feature in the OpenAI client library. Code reference can be found [here](https://github.com/openai/openai-python/blob/52357cff50bee57ef442e94d78a0de38b4173fc2/src/openai/resources/beta/chat/completions.py#L100-L104).
|
||||
|
||||
For the following examples, vLLM was set up using `vllm serve meta-llama/Llama-3.1-8B-Instruct`.
|
||||
|
||||
Here is a simple example demonstrating how to get structured output using Pydantic models:
|
||||
|
||||
```python
from pydantic import BaseModel
from openai import OpenAI

class Info(BaseModel):
    name: str
    age: int

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "My name is Cameron, I'm 28. What's my name and age?"},
    ],
    response_format=Info,
    extra_body=dict(guided_decoding_backend="outlines"),
)

message = completion.choices[0].message
print(message)
assert message.parsed
print("Name:", message.parsed.name)
print("Age:", message.parsed.age)
```
|
||||
|
||||
Output:
|
||||
|
||||
```console
|
||||
ParsedChatCompletionMessage[Info](content='{"name": "Cameron", "age": 28}', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=Info(name='Cameron', age=28))
|
||||
Name: Cameron
|
||||
Age: 28
|
||||
```
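
Conceptually, `message.parsed` is the JSON text in `content` turned into a typed object. The sketch below imitates that step with the standard library only, as a way to see the relationship between the two fields. It is a stand-in: the `Info` dataclass and `parse_info` are hypothetical, and this is not the validation logic that the OpenAI client or Pydantic actually runs.

```python
import json
from dataclasses import dataclass


@dataclass
class Info:
    """Standard-library stand-in for the Pydantic model above."""
    name: str
    age: int


def parse_info(content: str) -> Info:
    """Decode the JSON content and check the two expected fields."""
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("name"), str):
        raise ValueError("'name' must be a string")
    if not isinstance(data.get("age"), int):
        raise ValueError("'age' must be an integer")
    return Info(name=data["name"], age=data["age"])


parsed = parse_info('{"name": "Cameron", "age": 28}')
print("Name:", parsed.name)
print("Age:", parsed.age)
```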
|
||||
|
||||
Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:
|
||||
|
||||
```python
from pydantic import BaseModel
from openai import OpenAI

class Step(BaseModel):
    explanation: str
    output: str

class MathResponse(BaseModel):
    steps: list[Step]
    final_answer: str

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful expert math tutor."},
        {"role": "user", "content": "Solve 8x + 31 = 2."},
    ],
    response_format=MathResponse,
    extra_body=dict(guided_decoding_backend="outlines"),
)

message = completion.choices[0].message
print(message)
assert message.parsed
for i, step in enumerate(message.parsed.steps):
    print(f"Step #{i}:", step)
print("Answer:", message.parsed.final_answer)
```
|
||||
|
||||
Output:
|
||||
|
||||
```console
|
||||
ParsedChatCompletionMessage[MathResponse](content='{ "steps": [{ "explanation": "First, let\'s isolate the term with the variable \'x\'. To do this, we\'ll subtract 31 from both sides of the equation.", "output": "8x + 31 - 31 = 2 - 31"}, { "explanation": "By subtracting 31 from both sides, we simplify the equation to 8x = -29.", "output": "8x = -29"}, { "explanation": "Next, let\'s isolate \'x\' by dividing both sides of the equation by 8.", "output": "8x / 8 = -29 / 8"}], "final_answer": "x = -29/8" }', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=MathResponse(steps=[Step(explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation.", output='8x + 31 - 31 = 2 - 31'), Step(explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.', output='8x = -29'), Step(explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8.", output='8x / 8 = -29 / 8')], final_answer='x = -29/8'))
|
||||
Step #0: explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation." output='8x + 31 - 31 = 2 - 31'
|
||||
Step #1: explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.' output='8x = -29'
|
||||
Step #2: explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8." output='8x / 8 = -29 / 8'
|
||||
Answer: x = -29/8
|
||||
```
|
||||
|
||||
An example of using `structural_tag` can be found here: <gh-file:examples/online_serving/openai_chat_completion_structured_outputs_structural_tag.py>
|
||||
|
||||
## Offline Inference
|
||||
|
||||
Offline inference allows for the same types of guided decoding.
|
||||
To use it, we'll need to configure the guided decoding using the class `GuidedDecodingParams` inside `SamplingParams`.
|
||||
The main available options inside `GuidedDecodingParams` are:
|
||||
|
||||
- `json`
|
||||
- `regex`
|
||||
- `choice`
|
||||
- `grammar`
|
||||
- `structural_tag`
|
||||
|
||||
These parameters can be used in the same way as the parameters from the Online
|
||||
Serving examples above. One example for the usage of the `choice` parameter is
|
||||
shown below:
|
||||
|
||||
```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"])
sampling_params = SamplingParams(guided_decoding=guided_decoding_params)
outputs = llm.generate(
    prompts="Classify this sentiment: vLLM is wonderful!",
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```
|
||||
|
||||
Full example: <gh-file:examples/offline_inference/structured_outputs.py>
|
||||
334
docs/features/tool_calling.md
Normal file
@@ -0,0 +1,334 @@
|
||||
# Tool Calling
|
||||
|
||||
vLLM currently supports named function calling, as well as the `auto`, `required` (as of `vllm>=0.8.3`) and `none` options for the `tool_choice` field in the chat completion API.
|
||||
|
||||
## Quickstart
|
||||
|
||||
Start the server with tool calling enabled. This example uses Meta's Llama 3.1 8B model, so we need to use the llama3 tool calling chat template from the vLLM examples directory:
|
||||
|
||||
```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --chat-template examples/tool_chat_template_llama3.1_json.jinja
```
|
||||
|
||||
Next, make a request to the model that should result in it using the available tools:
|
||||
|
||||
```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def get_weather(location: str, unit: str):
    return f"Getting the weather for {location} in {unit}..."

# Map tool names to the Python functions that implement them
tool_functions = {"get_weather": get_weather}

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location", "unit"]
        }
    }
}]

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
    tools=tools,
    tool_choice="auto"
)

tool_call = response.choices[0].message.tool_calls[0].function
print(f"Function called: {tool_call.name}")
print(f"Arguments: {tool_call.arguments}")
# Look up the requested tool by name and call it with the parsed arguments
print(f"Result: {tool_functions[tool_call.name](**json.loads(tool_call.arguments))}")
```
|
||||
|
||||
Example output:
|
||||
|
||||
```text
|
||||
Function called: get_weather
|
||||
Arguments: {"location": "San Francisco, CA", "unit": "fahrenheit"}
|
||||
Result: Getting the weather for San Francisco, CA in fahrenheit...
|
||||
```
|
||||
|
||||
This example demonstrates:
|
||||
|
||||
* Setting up the server with tool calling enabled
|
||||
* Defining an actual function to handle tool calls
|
||||
* Making a request with `tool_choice="auto"`
|
||||
* Handling the structured response and executing the corresponding function
|
||||
|
||||
You can also specify a particular function using named function calling by setting `tool_choice={"type": "function", "function": {"name": "get_weather"}}`. Note that this will use the guided decoding backend - so the first time this is used, there will be several seconds of latency (or more) as the FSM is compiled for the first time before it is cached for subsequent requests.
|
||||
|
||||
Remember that it's the caller's responsibility to:
|
||||
|
||||
1. Define appropriate tools in the request
|
||||
2. Include relevant context in the chat messages
|
||||
3. Handle the tool calls in your application logic
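
For the third responsibility, application code typically needs to cope with a model that names an unknown tool or emits arguments that do not fit the function. The sketch below shows one defensive way to do that with a dispatch table. It works on plain strings rather than real response objects, and `run_tool_call` and `TOOL_FUNCTIONS` are hypothetical names.

```python
import json


def get_weather(location: str, unit: str) -> str:
    return f"Getting the weather for {location} in {unit}..."


# Registry mapping tool names to their Python implementations.
TOOL_FUNCTIONS = {"get_weather": get_weather}


def run_tool_call(name: str, arguments: str) -> str:
    """Execute one tool call given its name and JSON-encoded arguments."""
    func = TOOL_FUNCTIONS.get(name)
    if func is None:
        return f"Error: unknown tool {name!r}"
    try:
        kwargs = json.loads(arguments)
        return func(**kwargs)
    except (json.JSONDecodeError, TypeError) as exc:
        # Invalid JSON, or arguments that do not match the signature.
        return f"Error: bad arguments for {name!r}: {exc}"


print(run_tool_call(
    "get_weather",
    '{"location": "San Francisco, CA", "unit": "fahrenheit"}'))
print(run_tool_call("get_time", "{}"))
```

Returning the error as a string, rather than raising, lets the application send it back to the model as the tool result so that the model can retry.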
|
||||
|
||||
For more advanced usage, including parallel tool calls and different model-specific parsers, see the sections below.
|
||||
|
||||
## Named Function Calling
|
||||
|
||||
vLLM supports named function calling in the chat completion API by default. It does so using Outlines through guided decoding, so it
works with any supported model. You are guaranteed a validly-parsable function call, but not necessarily a
high-quality one.
|
||||
|
||||
vLLM will use guided decoding to ensure the response matches the tool parameter object defined by the JSON schema in the `tools` parameter.
|
||||
For best results, we recommend ensuring that the expected output format / schema is specified in the prompt to ensure that the model's intended generation is aligned with the schema that it's being forced to generate by the guided decoding backend.
|
||||
|
||||
To use a named function, you need to define the functions in the `tools` parameter of the chat completion request, and
|
||||
specify the `name` of one of the tools in the `tool_choice` parameter of the chat completion request.
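
As a small illustration, the `tool_choice` value for a named function can be built from the tool list, which also catches a misspelled name before the request is sent. `named_tool_choice` is a hypothetical helper, and the `tools` list here is a trimmed-down version of the one in the Quickstart.

```python
def named_tool_choice(name: str, tools: list) -> dict:
    """Build a tool_choice dict for a named function defined in `tools`."""
    known = [tool["function"]["name"] for tool in tools]
    if name not in known:
        raise ValueError(f"{name!r} is not one of the defined tools: {known}")
    return {"type": "function", "function": {"name": name}}


tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {"type": "object", "properties": {}},
    },
}]

tool_choice = named_tool_choice("get_weather", tools)
print(tool_choice)
```

Both `tools` and `tool_choice` would then be passed to `client.chat.completions.create(...)`.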
|
||||
|
||||
## Required Function Calling
|
||||
|
||||
vLLM supports the `tool_choice='required'` option in the chat completion API. Similar to named function calling, it also uses guided decoding, so this is enabled by default and will work with any supported model. The required guided decoding features (JSON schema with `anyOf`) are currently only supported in the V0 engine with the guided decoding backend `outlines`. However, support for alternative decoding backends is on the [roadmap](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html#feature-model) for the V1 engine.
|
||||
|
||||
When `tool_choice='required'` is set, the model is guaranteed to generate one or more tool calls based on the specified tool list in the `tools` parameter. The number of tool calls depends on the user's query. The output format strictly follows the schema defined in the `tools` parameter.
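
To see why `anyOf` support is needed, consider what a schema for "one or more calls to any of the listed tools" might look like. The sketch below is purely illustrative and is not claimed to be the schema vLLM builds internally; `required_tool_calls_schema` is a hypothetical function, and the two tools are made-up examples.

```python
def required_tool_calls_schema(tools: list) -> dict:
    """Illustrative JSON schema: a non-empty array of calls to listed tools."""
    return {
        "type": "array",
        "minItems": 1,  # at least one tool call must be produced
        "items": {
            # Each call must match exactly the shape of one known tool.
            "anyOf": [
                {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string",
                                 "enum": [tool["function"]["name"]]},
                        "parameters": tool["function"]["parameters"],
                    },
                    "required": ["name", "parameters"],
                }
                for tool in tools
            ]
        },
    }


tools = [
    {"type": "function",
     "function": {"name": "get_weather",
                  "parameters": {"type": "object", "properties": {}}}},
    {"type": "function",
     "function": {"name": "get_time",
                  "parameters": {"type": "object", "properties": {}}}},
]

schema = required_tool_calls_schema(tools)
print(len(schema["items"]["anyOf"]))  # one alternative per tool
```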
|
||||
|
||||
## Automatic Function Calling
|
||||
|
||||
To enable this feature, you should set the following flags:

* `--enable-auto-tool-choice` -- **mandatory** for auto tool choice. It tells vLLM that you want to enable the model to generate its own tool calls when it
  deems appropriate.
* `--tool-call-parser` -- select the tool parser to use (listed below). Additional tool parsers
  will continue to be added in the future. You can also register your own tool parsers using `--tool-parser-plugin`.
* `--tool-parser-plugin` -- **optional** tool parser plugin used to register user-defined tool parsers into vLLM. The registered tool parser name can be specified in `--tool-call-parser`.
* `--chat-template` -- **optional** for auto tool choice. It is the path to the chat template which handles `tool`-role messages and `assistant`-role messages
  that contain previously generated tool calls. Hermes, Mistral and Llama models have tool-compatible chat templates in their
  `tokenizer_config.json` files, but you can specify a custom template. This argument can be set to `tool_use` if your model has a tool use-specific chat
  template configured in the `tokenizer_config.json`. In this case, it will be used per the `transformers` specification. More on this [here](https://huggingface.co/docs/transformers/en/chat_templating#why-do-some-models-have-multiple-templates)
  from HuggingFace; and you can find an example of this in a `tokenizer_config.json` [here](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B/blob/main/tokenizer_config.json).
|
||||
|
||||
If your favorite tool-calling model is not supported, please feel free to contribute a parser & tool use chat template!
|
||||
|
||||
### Hermes Models (`hermes`)
|
||||
|
||||
All Nous Research Hermes-series models newer than Hermes 2 Pro should be supported.
|
||||
|
||||
* `NousResearch/Hermes-2-Pro-*`
|
||||
* `NousResearch/Hermes-2-Theta-*`
|
||||
* `NousResearch/Hermes-3-*`
|
||||
|
||||
_Note that the Hermes 2 **Theta** models are known to have degraded tool call quality & capabilities due to the merge
|
||||
step in their creation_.
|
||||
|
||||
Flags: `--tool-call-parser hermes`
|
||||
|
||||
### Mistral Models (`mistral`)
|
||||
|
||||
Supported models:
|
||||
|
||||
* `mistralai/Mistral-7B-Instruct-v0.3` (confirmed)
|
||||
* Additional mistral function-calling models are compatible as well.
|
||||
|
||||
Known issues:
|
||||
|
||||
1. Mistral 7B struggles to generate parallel tool calls correctly.
|
||||
2. Mistral's `tokenizer_config.json` chat template requires tool call IDs that are exactly 9 digits, which is
|
||||
much shorter than what vLLM generates. Since an exception is thrown when this condition
|
||||
is not met, the following additional chat templates are provided:
|
||||
|
||||
* <gh-file:examples/tool_chat_template_mistral.jinja> - this is the "official" Mistral chat template, but tweaked so that
|
||||
it works with vLLM's tool call IDs (provided `tool_call_id` fields are truncated to the last 9 digits)
|
||||
* <gh-file:examples/tool_chat_template_mistral_parallel.jinja> - this is a "better" version that adds a tool-use system prompt
|
||||
when tools are provided, that results in much better reliability when working with parallel tool calling.
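
The truncation that the tweaked template relies on amounts to keeping the last nine characters of the ID. The one-liner below shows the same operation in Python for clarity; the example ID is a made-up value and `to_mistral_tool_call_id` is a hypothetical helper, not something vLLM exposes.

```python
def to_mistral_tool_call_id(tool_call_id: str) -> str:
    """Keep only the last 9 characters of a tool call ID."""
    return tool_call_id[-9:]


example_id = "chatcmpl-tool-0123456789abcdef"  # hypothetical ID
print(to_mistral_tool_call_id(example_id))     # 789abcdef
```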
|
||||
|
||||
Recommended flags: `--tool-call-parser mistral --chat-template examples/tool_chat_template_mistral_parallel.jinja`
|
||||
|
||||
### Llama Models (`llama3_json`)
|
||||
|
||||
Supported models:
|
||||
|
||||
All Llama 3.1, 3.2 and 4 models should be supported.
|
||||
|
||||
* `meta-llama/Llama-3.1-*`
|
||||
* `meta-llama/Llama-3.2-*`
|
||||
* `meta-llama/Llama-4-*`
|
||||
|
||||
The tool calling that is supported is the [JSON based tool calling](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling). For [pythonic tool calling](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/text_prompt_format.md#zero-shot-function-calling) introduced by the Llama-3.2 models, see the `pythonic` tool parser below. For Llama 4 models, it is recommended to use the `llama4_pythonic` tool parser.
|
||||
|
||||
Other tool calling formats, like the built-in Python tool calling or custom tool calling, are not supported.
|
||||
|
||||
Known issues:
|
||||
|
||||
1. Parallel tool calls are not supported for Llama 3, but they are supported for Llama 4 models.
|
||||
2. The model can generate parameters with a wrong format, such as generating
|
||||
an array serialized as string instead of an array.
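
The second issue can be worked around on the client side by decoding string values for parameters that are expected to be arrays. The sketch below is one possible approach, assuming the caller knows which parameters are arrays; `normalize_arguments` and the `cities` parameter are hypothetical.

```python
import json


def normalize_arguments(arguments: dict, array_params: set) -> dict:
    """Decode array parameters that arrived as JSON-encoded strings."""
    fixed = dict(arguments)
    for key in array_params:
        value = fixed.get(key)
        if not isinstance(value, str):
            continue  # already a list (or missing): leave untouched
        try:
            decoded = json.loads(value)
        except json.JSONDecodeError:
            continue  # not JSON at all: leave the original string
        if isinstance(decoded, list):
            fixed[key] = decoded
    return fixed


raw = {"cities": '["Paris", "Rome"]', "unit": "celsius"}
print(normalize_arguments(raw, {"cities"}))
```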
|
||||
|
||||
vLLM provides two JSON based chat templates for Llama 3.1 and 3.2:
|
||||
|
||||
* <gh-file:examples/tool_chat_template_llama3.1_json.jinja> - this is the "official" chat template for the Llama 3.1
|
||||
models, but tweaked so that it works better with vLLM.
|
||||
* <gh-file:examples/tool_chat_template_llama3.2_json.jinja> - this extends upon the Llama 3.1 chat template by adding support for
|
||||
images.
|
||||
|
||||
Recommended flags: `--tool-call-parser llama3_json --chat-template {see_above}`
|
||||
|
||||
vLLM also provides pythonic and JSON based chat templates for Llama 4, but pythonic tool calling is recommended:
|
||||
* <gh-file:examples/tool_chat_template_llama4_pythonic.jinja> - this is based on the [official chat template](https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/) for the Llama 4 models.
|
||||
|
||||
For Llama 4 models, use `--tool-call-parser llama4_pythonic --chat-template examples/tool_chat_template_llama4_pythonic.jinja`.
|
||||
|
||||
### IBM Granite
|
||||
|
||||
Supported models:
|
||||
|
||||
* `ibm-granite/granite-3.0-8b-instruct`
|
||||
|
||||
Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja`
|
||||
|
||||
<gh-file:examples/tool_chat_template_granite.jinja>: this is a modified chat template from the original on Huggingface. Parallel function calls are supported.
|
||||
|
||||
* `ibm-granite/granite-3.1-8b-instruct`
|
||||
|
||||
Recommended flags: `--tool-call-parser granite`
|
||||
|
||||
The chat template from Huggingface can be used directly. Parallel function calls are supported.
|
||||
|
||||
* `ibm-granite/granite-20b-functioncalling`
|
||||
|
||||
Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja`
|
||||
|
||||
<gh-file:examples/tool_chat_template_granite_20b_fc.jinja>: this is a modified chat template from the original on Huggingface, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.
|
||||
|
||||
### InternLM Models (`internlm`)
|
||||
|
||||
Supported models:
|
||||
|
||||
* `internlm/internlm2_5-7b-chat` (confirmed)
|
||||
* Additional internlm2.5 function-calling models are compatible as well
|
||||
|
||||
Known issues:
|
||||
|
||||
* Although this implementation also supports InternLM2, the tool call results are not stable when testing with the `internlm/internlm2-chat-7b` model.
|
||||
|
||||
Recommended flags: `--tool-call-parser internlm --chat-template examples/tool_chat_template_internlm2_tool.jinja`
|
||||
|
||||
### Jamba Models (`jamba`)
|
||||
|
||||
AI21's Jamba-1.5 models are supported.
|
||||
|
||||
* `ai21labs/AI21-Jamba-1.5-Mini`
|
||||
* `ai21labs/AI21-Jamba-1.5-Large`
|
||||
|
||||
Flags: `--tool-call-parser jamba`
|
||||
|
||||
### Qwen Models
|
||||
|
||||
For Qwen2.5, the chat template in `tokenizer_config.json` already includes support for Hermes-style tool use. Therefore, you can use the `hermes` parser to enable tool calls for Qwen models. For more detailed information, please refer to the official [Qwen documentation](https://qwen.readthedocs.io/en/latest/framework/function_call.html#vllm).
|
||||
|
||||
* `Qwen/Qwen2.5-*`
|
||||
* `Qwen/QwQ-32B`
|
||||
|
||||
Flags: `--tool-call-parser hermes`
|
||||
|
||||
### DeepSeek-V3 Models (`deepseek_v3`)
|
||||
|
||||
Supported models:
|
||||
* `deepseek-ai/DeepSeek-V3-0324`
|
||||
|
||||
Flags: `--tool-call-parser deepseek_v3 --chat-template examples/tool_chat_template_deepseekv3.jinja`
|
||||
|
||||
### Models with Pythonic Tool Calls (`pythonic`)
|
||||
|
||||
A growing number of models output a python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The `pythonic` tool parser can support such models.
|
||||
|
||||
As a concrete example, these models may look up the weather in San Francisco and Seattle by generating:
|
||||
|
||||
```python
|
||||
[get_weather(city='San Francisco', metric='celsius'), get_weather(city='Seattle', metric='celsius')]
|
||||
```
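
Output in this form can be turned into structured tool calls with Python's `ast` module, without executing any of it. The sketch below illustrates the idea only and is not the implementation of vLLM's `pythonic` parser; `parse_pythonic_tool_calls` is a hypothetical function that handles only keyword arguments with literal values.

```python
import ast


def parse_pythonic_tool_calls(text: str) -> list:
    """Parse "[f(a=1), g(b='x')]" into a list of name/arguments dicts."""
    tree = ast.parse(text.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a Python list of calls")
    calls = []
    for node in tree.body.elts:
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            raise ValueError("expected simple function calls")
        # literal_eval only accepts literals, so no code is executed.
        arguments = {kw.arg: ast.literal_eval(kw.value)
                     for kw in node.keywords}
        calls.append({"name": node.func.id, "arguments": arguments})
    return calls


output = ("[get_weather(city='San Francisco', metric='celsius'), "
          "get_weather(city='Seattle', metric='celsius')]")
for call in parse_pythonic_tool_calls(output):
    print(call)
```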
|
||||
|
||||
Limitations:
|
||||
|
||||
* The model must not generate both text and tool calls in the same generation. This may not be hard to change for a specific model, but the community currently lacks consensus on which tokens to emit when starting and ending tool calls. (In particular, the Llama 3.2 models emit no such tokens.)
|
||||
* Llama's smaller models struggle to use tools effectively.
|
||||
|
||||
Example supported models:
|
||||
|
||||
* `meta-llama/Llama-3.2-1B-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>)
|
||||
* `meta-llama/Llama-3.2-3B-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>)
|
||||
* `Team-ACE/ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>)
|
||||
* `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>)
|
||||
* `meta-llama/Llama-4-Scout-17B-16E-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>)
|
||||
* `meta-llama/Llama-4-Maverick-17B-128E-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>)
|
||||
|
||||
Flags: `--tool-call-parser pythonic --chat-template {see_above}`
|
||||
|
||||
---
|
||||
**WARNING**
|
||||
Llama's smaller models frequently fail to emit tool calls in the correct format. Your mileage may vary.
|
||||
|
||||
---
|
||||
|
||||
## How to write a tool parser plugin
|
||||
|
||||
A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in <gh-file:vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py>.
|
||||
|
||||
Here is a summary of a plugin file:
|
||||
|
||||
```python

# import the required packages
# (module paths follow the Hermes parser referenced above and may
# differ between vLLM versions)
from collections.abc import Sequence
from typing import Union

from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
                                              DeltaMessage,
                                              ExtractedToolCallInformation)
from vllm.entrypoints.openai.tool_parsers.abstract_tool_parser import (
    ToolParser, ToolParserManager)
from vllm.transformers_utils.tokenizer import AnyTokenizer

# define a tool parser and register it to vllm
# the name list in register_module can be used
# in --tool-call-parser. you can define as many
# tool parsers as you want here.
@ToolParserManager.register_module(["example"])
class ExampleToolParser(ToolParser):
    def __init__(self, tokenizer: AnyTokenizer):
        super().__init__(tokenizer)

    # adjust request. e.g.: set skip special tokens
    # to False for tool call output.
    def adjust_request(
            self, request: ChatCompletionRequest) -> ChatCompletionRequest:
        return request

    # implement the tool call parse for stream call
    def extract_tool_calls_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
        request: ChatCompletionRequest,
    ) -> Union[DeltaMessage, None]:
        # placeholder: pass the new text through as plain content
        return DeltaMessage(content=delta_text)

    # implement the tool parse for non-stream call
    def extract_tool_calls(
        self,
        model_output: str,
        request: ChatCompletionRequest,
    ) -> ExtractedToolCallInformation:
        # placeholder: report no tool calls and return the raw output
        return ExtractedToolCallInformation(tools_called=False,
                                            tool_calls=[],
                                            content=model_output)

```
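
The parsing logic inside `extract_tool_calls` depends entirely on the model's output format. As a standalone illustration, the sketch below extracts JSON tool calls wrapped in `<tool_call>...</tool_call>` tags, a Hermes-style convention that is assumed here for the example. It returns a plain dict whose keys mirror the fields used in the skeleton above rather than a real `ExtractedToolCallInformation`, so it runs without vLLM installed; `extract_tool_calls_from_text` is a hypothetical name.

```python
import json
import re

# Assumed output convention: JSON objects wrapped in <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls_from_text(model_output: str) -> dict:
    """Split model output into tool calls and any leading plain text."""
    matches = TOOL_CALL_RE.findall(model_output)
    if not matches:
        return {"tools_called": False, "tool_calls": [],
                "content": model_output}
    tool_calls = [json.loads(match) for match in matches]
    # Text before the first tag is kept as regular content, if any.
    content = model_output[:model_output.find("<tool_call>")] or None
    return {"tools_called": True, "tool_calls": tool_calls,
            "content": content}


sample = ('Let me check.<tool_call>{"name": "get_weather", '
          '"arguments": {"city": "Paris"}}</tool_call>')
print(extract_tool_calls_from_text(sample))
```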
|
||||
|
||||
Then you can use this plugin in the command line like this:

```console
    --enable-auto-tool-choice \
    --tool-parser-plugin <absolute path of the plugin file> \
    --tool-call-parser example \
    --chat-template <your chat template>
```
|
||||