[Doc] Create a new "Usage" section (#10827)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
This commit is contained in:
468
docs/source/usage/compatibility_matrix.rst
Normal file
468
docs/source/usage/compatibility_matrix.rst
Normal file
@@ -0,0 +1,468 @@
|
||||
.. _compatibility_matrix:
|
||||
|
||||
Compatibility Matrix
|
||||
====================
|
||||
|
||||
The tables below show mutually exclusive features and the support on some hardware.
|
||||
|
||||
.. note::
|
||||
|
||||
Check the '✗' with links to see tracking issue for unsupported feature/hardware combination.
|
||||
|
||||
Feature x Feature
|
||||
-----------------
|
||||
|
||||
|
||||
.. raw:: html
|
||||
|
||||
<style>
|
||||
/* Make smaller to try to improve readability */
|
||||
td {
|
||||
font-size: 0.8rem;
|
||||
text-align: center;
|
||||
}
|
||||
|
||||
th {
|
||||
text-align: center;
|
||||
font-size: 0.8rem;
|
||||
}
|
||||
</style>
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: auto
|
||||
|
||||
* - Feature
|
||||
- :ref:`CP <chunked-prefill>`
|
||||
- :ref:`APC <apc>`
|
||||
- :ref:`LoRA <lora>`
|
||||
- :abbr:`prmpt adptr (Prompt Adapter)`
|
||||
- :ref:`SD <spec_decode>`
|
||||
- CUDA graph
|
||||
- :abbr:`emd (Embedding Models)`
|
||||
- :abbr:`enc-dec (Encoder-Decoder Models)`
|
||||
- :abbr:`logP (Logprobs)`
|
||||
- :abbr:`prmpt logP (Prompt Logprobs)`
|
||||
- :abbr:`async output (Async Output Processing)`
|
||||
- multi-step
|
||||
- :abbr:`mm (Multimodal)`
|
||||
- best-of
|
||||
- beam-search
|
||||
- :abbr:`guided dec (Guided Decoding)`
|
||||
* - :ref:`CP <chunked-prefill>`
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :ref:`APC <apc>`
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :ref:`LoRA <lora>`
|
||||
- `✗ <https://github.com/vllm-project/vllm/pull/9057>`__
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :abbr:`prmpt adptr (Prompt Adapter)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :ref:`SD <spec_decode>`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - CUDA graph
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :abbr:`emd (Embedding Models)`
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :abbr:`enc-dec (Encoder-Decoder Models)`
|
||||
- ✗
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/7366>`__
|
||||
- ✗
|
||||
- ✗
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/7366>`__
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :abbr:`logP (Logprobs)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :abbr:`prmpt logP (Prompt Logprobs)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/pull/8199>`__
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :abbr:`async output (Async Output Processing)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - multi-step
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✗
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/8198>`__
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - :abbr:`mm (Multimodal)`
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/pull/8348>`__
|
||||
- `✗ <https://github.com/vllm-project/vllm/pull/7199>`__
|
||||
- ?
|
||||
- ?
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - best-of
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/6137>`__
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/7968>`__
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
* - beam-search
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/6137>`__
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/7968>`__
|
||||
- ?
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
* - :abbr:`guided dec (Guided Decoding)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
- ?
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ?
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/9893>`__
|
||||
- ?
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
|
||||
|
||||
Feature x Hardware
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: auto
|
||||
|
||||
* - Feature
|
||||
- Volta
|
||||
- Turing
|
||||
- Ampere
|
||||
- Ada
|
||||
- Hopper
|
||||
- CPU
|
||||
- AMD
|
||||
* - :ref:`CP <chunked-prefill>`
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/2729>`__
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - :ref:`APC <apc>`
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/3687>`__
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - :ref:`LoRA <lora>`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/pull/4830>`__
|
||||
- ✅
|
||||
* - :abbr:`prmpt adptr (Prompt Adapter)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/8475>`__
|
||||
- ✅
|
||||
* - :ref:`SD <spec_decode>`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - CUDA graph
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
* - :abbr:`emd (Embedding Models)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
* - :abbr:`enc-dec (Encoder-Decoder Models)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
* - :abbr:`mm (Multimodal)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - :abbr:`logP (Logprobs)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - :abbr:`prmpt logP (Prompt Logprobs)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - :abbr:`async output (Async Output Processing)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✗
|
||||
* - multi-step
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- `✗ <https://github.com/vllm-project/vllm/issues/8477>`__
|
||||
- ✅
|
||||
* - best-of
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - beam-search
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - :abbr:`guided dec (Guided Decoding)`
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
23
docs/source/usage/engine_args.rst
Normal file
23
docs/source/usage/engine_args.rst
Normal file
@@ -0,0 +1,23 @@
|
||||
.. _engine_args:
|
||||
|
||||
Engine Arguments
|
||||
================
|
||||
|
||||
Below, you can find an explanation of every engine argument for vLLM:
|
||||
|
||||
.. argparse::
|
||||
:module: vllm.engine.arg_utils
|
||||
:func: _engine_args_parser
|
||||
:prog: vllm serve
|
||||
:nodefaultconst:
|
||||
|
||||
Async Engine Arguments
|
||||
----------------------
|
||||
|
||||
Below are the additional arguments related to the asynchronous engine:
|
||||
|
||||
.. argparse::
|
||||
:module: vllm.engine.arg_utils
|
||||
:func: _async_engine_args_parser
|
||||
:prog: vllm serve
|
||||
:nodefaultconst:
|
||||
14
docs/source/usage/env_vars.rst
Normal file
14
docs/source/usage/env_vars.rst
Normal file
@@ -0,0 +1,14 @@
|
||||
Environment Variables
|
||||
========================
|
||||
|
||||
vLLM uses the following environment variables to configure the system:
|
||||
|
||||
.. warning::
|
||||
Please note that ``VLLM_PORT`` and ``VLLM_HOST_IP`` set the port and ip for vLLM's **internal usage**. It is not the port and ip for the API server. If you use ``--host $VLLM_HOST_IP`` and ``--port $VLLM_PORT`` to start the API server, it will not work.
|
||||
|
||||
All environment variables used by vLLM are prefixed with ``VLLM_``. **Special care should be taken for Kubernetes users**: please do not name the service as ``vllm``, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because `Kubernetes sets environment variables for each service with the capitalized service name as the prefix <https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables>`_.
|
||||
|
||||
.. literalinclude:: ../../../vllm/envs.py
|
||||
:language: python
|
||||
:start-after: begin-env-vars-definition
|
||||
:end-before: end-env-vars-definition
|
||||
33
docs/source/usage/faq.rst
Normal file
33
docs/source/usage/faq.rst
Normal file
@@ -0,0 +1,33 @@
|
||||
.. _faq:
|
||||
|
||||
Frequently Asked Questions
|
||||
===========================
|
||||
|
||||
Q: How can I serve multiple models on a single port using the OpenAI API?
|
||||
|
||||
A: Assuming that you're referring to using OpenAI compatible server to serve multiple models at once, that is not currently supported, you can run multiple instances of the server (each serving a different model) at the same time, and have another layer to route the incoming request to the correct server accordingly.
|
||||
|
||||
----------------------------------------
|
||||
|
||||
Q: Which model to use for offline inference embedding?
|
||||
|
||||
A: If you want to use an embedding model, try: https://huggingface.co/intfloat/e5-mistral-7b-instruct. Instead models, such as Llama-3-8b, Mistral-7B-Instruct-v0.3, are generation models rather than an embedding model
|
||||
|
||||
----------------------------------------
|
||||
|
||||
Q: Can the output of a prompt vary across runs in vLLM?
|
||||
|
||||
A: Yes, it can. vLLM does not guarantee stable log probabilities (logprobs) for the output tokens. Variations in logprobs may occur due to
|
||||
numerical instability in Torch operations or non-deterministic behavior in batched Torch operations when batching changes. For more details,
|
||||
see the `Numerical Accuracy section <https://pytorch.org/docs/stable/notes/numerical_accuracy.html#batched-computations-or-slice-computations>`_.
|
||||
|
||||
In vLLM, the same requests might be batched differently due to factors such as other concurrent requests,
|
||||
changes in batch size, or batch expansion in speculative decoding. These batching variations, combined with numerical instability of Torch operations,
|
||||
can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in
|
||||
different tokens being sampled. Once a different token is sampled, further divergence is likely.
|
||||
|
||||
**Mitigation Strategies**
|
||||
|
||||
- For improved stability and reduced variance, use `float32`. Note that this will require more memory.
|
||||
- If using `bfloat16`, switching to `float16` can also help.
|
||||
- Using request seeds can aid in achieving more stable generation for temperature > 0, but discrepancies due to precision differences may still occur.
|
||||
225
docs/source/usage/lora.rst
Normal file
225
docs/source/usage/lora.rst
Normal file
@@ -0,0 +1,225 @@
|
||||
.. _lora:
|
||||
|
||||
LoRA Adapters
|
||||
=============
|
||||
|
||||
This document shows you how to use `LoRA adapters <https://arxiv.org/abs/2106.09685>`_ with vLLM on top of a base model.
|
||||
|
||||
LoRA adapters can be used with any vLLM model that implements :class:`~vllm.model_executor.models.interfaces.SupportsLoRA`.
|
||||
|
||||
Adapters can be efficiently served on a per request basis with minimal overhead. First we download the adapter(s) and save
|
||||
them locally with
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from huggingface_hub import snapshot_download
|
||||
|
||||
sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
|
||||
|
||||
|
||||
Then we instantiate the base model and pass in the ``enable_lora=True`` flag:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
from vllm.lora.request import LoRARequest
|
||||
|
||||
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
|
||||
|
||||
|
||||
We can now submit the prompts and call ``llm.generate`` with the ``lora_request`` parameter. The first parameter
|
||||
of ``LoRARequest`` is a human identifiable name, the second parameter is a globally unique ID for the adapter and
|
||||
the third parameter is the path to the LoRA adapter.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
sampling_params = SamplingParams(
|
||||
temperature=0,
|
||||
max_tokens=256,
|
||||
stop=["[/assistant]"]
|
||||
)
|
||||
|
||||
prompts = [
|
||||
"[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
|
||||
"[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
|
||||
]
|
||||
|
||||
outputs = llm.generate(
|
||||
prompts,
|
||||
sampling_params,
|
||||
lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
|
||||
)
|
||||
|
||||
|
||||
Check out `examples/multilora_inference.py <https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py>`_
|
||||
for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
|
||||
|
||||
Serving LoRA Adapters
|
||||
---------------------
|
||||
LoRA adapted models can also be served with the Open-AI compatible vLLM server. To do so, we use
|
||||
``--lora-modules {name}={path} {name}={path}`` to specify each LoRA module when we kickoff the server:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
vllm serve meta-llama/Llama-2-7b-hf \
|
||||
--enable-lora \
|
||||
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
|
||||
|
||||
.. note::
|
||||
The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
|
||||
|
||||
The server entrypoint accepts all other LoRA configuration parameters (``max_loras``, ``max_lora_rank``, ``max_cpu_loras``,
|
||||
etc.), which will apply to all forthcoming requests. Upon querying the ``/models`` endpoint, we should see our LoRA along
|
||||
with its base model:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
curl localhost:8000/v1/models | jq .
|
||||
{
|
||||
"object": "list",
|
||||
"data": [
|
||||
{
|
||||
"id": "meta-llama/Llama-2-7b-hf",
|
||||
"object": "model",
|
||||
...
|
||||
},
|
||||
{
|
||||
"id": "sql-lora",
|
||||
"object": "model",
|
||||
...
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
Requests can specify the LoRA adapter as if it were any other model via the ``model`` request parameter. The requests will be
|
||||
processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and potentially other
|
||||
LoRA adapter requests if they were provided and ``max_loras`` is set high enough).
|
||||
|
||||
The following is an example request
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
curl http://localhost:8000/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "sql-lora",
|
||||
"prompt": "San Francisco is a",
|
||||
"max_tokens": 7,
|
||||
"temperature": 0
|
||||
}' | jq
|
||||
|
||||
|
||||
Dynamically serving LoRA Adapters
|
||||
---------------------------------
|
||||
|
||||
In addition to serving LoRA adapters at server startup, the vLLM server now supports dynamically loading and unloading
|
||||
LoRA adapters at runtime through dedicated API endpoints. This feature can be particularly useful when the flexibility
|
||||
to change models on-the-fly is needed.
|
||||
|
||||
Note: Enabling this feature in production environments is risky as user may participate model adapter management.
|
||||
|
||||
To enable dynamic LoRA loading and unloading, ensure that the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING`
|
||||
is set to `True`. When this option is enabled, the API server will log a warning to indicate that dynamic loading is active.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
|
||||
|
||||
|
||||
Loading a LoRA Adapter:
|
||||
|
||||
To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary
|
||||
details of the adapter to be loaded. The request payload should include the name and path to the LoRA adapter.
|
||||
|
||||
Example request to load a LoRA adapter:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
curl -X POST http://localhost:8000/v1/load_lora_adapter \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"lora_name": "sql_adapter",
|
||||
"lora_path": "/path/to/sql-lora-adapter"
|
||||
}'
|
||||
|
||||
Upon a successful request, the API will respond with a 200 OK status code. If an error occurs, such as if the adapter
|
||||
cannot be found or loaded, an appropriate error message will be returned.
|
||||
|
||||
Unloading a LoRA Adapter:
|
||||
|
||||
To unload a LoRA adapter that has been previously loaded, send a POST request to the `/v1/unload_lora_adapter` endpoint
|
||||
with the name or ID of the adapter to be unloaded.
|
||||
|
||||
Example request to unload a LoRA adapter:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"lora_name": "sql_adapter"
|
||||
}'
|
||||
|
||||
|
||||
New format for `--lora-modules`
|
||||
-------------------------------
|
||||
|
||||
In the previous version, users would provide LoRA modules via the following format, either as a key-value pair or in JSON format. For example:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
|
||||
|
||||
This would only include the `name` and `path` for each LoRA module, but did not provide a way to specify a `base_model_name`.
|
||||
Now, you can specify a base_model_name alongside the name and path using JSON format. For example:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
--lora-modules '{"name": "sql-lora", "path": "/path/to/lora", "base_model_name": "meta-llama/Llama-2-7b"}'
|
||||
|
||||
To provide the backward compatibility support, you can still use the old key-value format (name=path), but the `base_model_name` will remain unspecified in that case.
|
||||
|
||||
|
||||
Lora model lineage in model card
|
||||
--------------------------------
|
||||
|
||||
The new format of `--lora-modules` is mainly to support the display of parent model information in the model card. Here's an explanation of how your current response supports this:
|
||||
|
||||
- The `parent` field of LoRA model `sql-lora` now links to its base model `meta-llama/Llama-2-7b-hf`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
|
||||
- The `root` field points to the artifact location of the lora adapter.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ curl http://localhost:8000/v1/models
|
||||
|
||||
{
|
||||
"object": "list",
|
||||
"data": [
|
||||
{
|
||||
"id": "meta-llama/Llama-2-7b-hf",
|
||||
"object": "model",
|
||||
"created": 1715644056,
|
||||
"owned_by": "vllm",
|
||||
"root": "~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/",
|
||||
"parent": null,
|
||||
"permission": [
|
||||
{
|
||||
.....
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": "sql-lora",
|
||||
"object": "model",
|
||||
"created": 1715644056,
|
||||
"owned_by": "vllm",
|
||||
"root": "~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/",
|
||||
"parent": meta-llama/Llama-2-7b-hf,
|
||||
"permission": [
|
||||
{
|
||||
....
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
404
docs/source/usage/multimodal_inputs.rst
Normal file
404
docs/source/usage/multimodal_inputs.rst
Normal file
@@ -0,0 +1,404 @@
|
||||
.. _multimodal_inputs:
|
||||
|
||||
Multimodal Inputs
|
||||
=================
|
||||
|
||||
This page teaches you how to pass multi-modal inputs to :ref:`multi-modal models <supported_mm_models>` in vLLM.
|
||||
|
||||
.. note::
|
||||
We are actively iterating on multi-modal support. See `this RFC <https://github.com/vllm-project/vllm/issues/4194>`_ for upcoming changes,
|
||||
and `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
|
||||
|
||||
Offline Inference
|
||||
-----------------
|
||||
|
||||
To input multi-modal data, follow this schema in :class:`vllm.inputs.PromptType`:
|
||||
|
||||
* ``prompt``: The prompt should follow the format that is documented on HuggingFace.
|
||||
* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.
|
||||
|
||||
Image
|
||||
^^^^^
|
||||
|
||||
You can pass a single image to the :code:`'image'` field of the multi-modal dictionary, as shown in the following examples:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
|
||||
|
||||
# Refer to the HuggingFace repo for the correct format to use
|
||||
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
|
||||
|
||||
# Load the image using PIL.Image
|
||||
image = PIL.Image.open(...)
|
||||
|
||||
# Single prompt inference
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": {"image": image},
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
generated_text = o.outputs[0].text
|
||||
print(generated_text)
|
||||
|
||||
# Batch inference
|
||||
image_1 = PIL.Image.open(...)
|
||||
image_2 = PIL.Image.open(...)
|
||||
outputs = llm.generate(
|
||||
[
|
||||
{
|
||||
"prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
|
||||
"multi_modal_data": {"image": image_1},
|
||||
},
|
||||
{
|
||||
"prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
|
||||
"multi_modal_data": {"image": image_2},
|
||||
}
|
||||
]
|
||||
)
|
||||
|
||||
for o in outputs:
|
||||
generated_text = o.outputs[0].text
|
||||
print(generated_text)
|
||||
|
||||
A code example can be found in `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_.
|
||||
|
||||
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
llm = LLM(
|
||||
model="microsoft/Phi-3.5-vision-instruct",
|
||||
trust_remote_code=True, # Required to load Phi-3.5-vision
|
||||
max_model_len=4096, # Otherwise, it may not fit in smaller GPUs
|
||||
limit_mm_per_prompt={"image": 2}, # The maximum number to accept
|
||||
)
|
||||
|
||||
# Refer to the HuggingFace repo for the correct format to use
|
||||
prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"
|
||||
|
||||
# Load the images using PIL.Image
|
||||
image1 = PIL.Image.open(...)
|
||||
image2 = PIL.Image.open(...)
|
||||
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": {
|
||||
"image": [image1, image2]
|
||||
},
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
generated_text = o.outputs[0].text
|
||||
print(generated_text)
|
||||
|
||||
A code example can be found in `examples/offline_inference_vision_language_multi_image.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py>`_.
|
||||
|
||||
Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL <https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct>`_ as it supports videos:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# Specify the maximum number of frames per video to be 4. This can be changed.
|
||||
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
|
||||
|
||||
# Create the request payload.
|
||||
video_frames = ... # load your video making sure it only has the number of frames specified earlier.
|
||||
message = {
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
|
||||
],
|
||||
}
|
||||
for i in range(len(video_frames)):
|
||||
base64_image = encode_image(video_frames[i]) # base64 encoding.
|
||||
new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
|
||||
message["content"].append(new_image)
|
||||
|
||||
# Perform inference and log output.
|
||||
outputs = llm.chat([message])
|
||||
|
||||
for o in outputs:
|
||||
generated_text = o.outputs[0].text
|
||||
print(generated_text)
|
||||
|
||||
Video
|
||||
^^^^^
|
||||
|
||||
You can pass a list of NumPy arrays directly to the :code:`'video'` field of the multi-modal dictionary
|
||||
instead of using multi-image input.
|
||||
|
||||
Please refer to `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_ for more details.
|
||||
|
||||
Audio
|
||||
^^^^^
|
||||
|
||||
You can pass a tuple :code:`(array, sampling_rate)` to the :code:`'audio'` field of the multi-modal dictionary.
|
||||
|
||||
Please refer to `examples/offline_inference_audio_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_audio_language.py>`_ for more details.
|
||||
|
||||
Embedding
|
||||
^^^^^^^^^
|
||||
|
||||
To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
|
||||
pass a tensor of shape :code:`(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# Inference with image embeddings as input
|
||||
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
|
||||
|
||||
# Refer to the HuggingFace repo for the correct format to use
|
||||
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
|
||||
|
||||
# Embeddings for single image
|
||||
# torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
|
||||
image_embeds = torch.load(...)
|
||||
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": {"image": image_embeds},
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
generated_text = o.outputs[0].text
|
||||
print(generated_text)
|
||||
|
||||
For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# Construct the prompt based on your model
|
||||
prompt = ...
|
||||
|
||||
# Embeddings for multiple images
|
||||
# torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
|
||||
image_embeds = torch.load(...)
|
||||
|
||||
# Qwen2-VL
|
||||
llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
|
||||
mm_data = {
|
||||
"image": {
|
||||
"image_embeds": image_embeds,
|
||||
# image_grid_thw is needed to calculate positional encoding.
|
||||
"image_grid_thw": torch.load(...), # torch.Tensor of shape (1, 3),
|
||||
}
|
||||
}
|
||||
|
||||
# MiniCPM-V
|
||||
llm = LLM("openbmb/MiniCPM-V-2_6", trust_remote_code=True, limit_mm_per_prompt={"image": 4})
|
||||
mm_data = {
|
||||
"image": {
|
||||
"image_embeds": image_embeds,
|
||||
# image_size_list is needed to calculate details of the sliced image.
|
||||
"image_size_list": [image.size for image in images], # list of image sizes
|
||||
}
|
||||
}
|
||||
|
||||
outputs = llm.generate({
|
||||
"prompt": prompt,
|
||||
"multi_modal_data": mm_data,
|
||||
})
|
||||
|
||||
for o in outputs:
|
||||
generated_text = o.outputs[0].text
|
||||
print(generated_text)
|
||||
|
||||
Online Inference
|
||||
----------------
|
||||
|
||||
Our OpenAI-compatible server accepts multi-modal data via the `Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`_.
|
||||
|
||||
.. important::
|
||||
A chat template is **required** to use Chat Completions API.
|
||||
|
||||
Although most models come with a chat template, for others you have to define one yourself.
|
||||
The chat template can be inferred based on the documentation on the model's HuggingFace repo.
|
||||
For example, LLaVA-1.5 (``llava-hf/llava-1.5-7b-hf``) requires a chat template that can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`__.
|
||||
|
||||
Image
|
||||
^^^^^
|
||||
|
||||
Image input is supported according to `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.
|
||||
Here is a simple example using Phi-3.5-Vision.
|
||||
|
||||
First, launch the OpenAI-compatible server:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
|
||||
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2
|
||||
|
||||
Then, you can use the OpenAI client as follows:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from openai import OpenAI
|
||||
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
# Single-image input inference
|
||||
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
|
||||
|
||||
chat_response = client.chat.completions.create(
|
||||
model="microsoft/Phi-3.5-vision-instruct",
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": [
|
||||
# NOTE: The prompt formatting with the image token `<image>` is not needed
|
||||
# since the prompt will be processed automatically by the API server.
|
||||
{"type": "text", "text": "What’s in this image?"},
|
||||
{"type": "image_url", "image_url": {"url": image_url}},
|
||||
],
|
||||
}],
|
||||
)
|
||||
print("Chat completion output:", chat_response.choices[0].message.content)
|
||||
|
||||
# Multi-image input inference
|
||||
image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
|
||||
image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"
|
||||
|
||||
chat_response = client.chat.completions.create(
|
||||
model="microsoft/Phi-3.5-vision-instruct",
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": "What are the animals in these images?"},
|
||||
{"type": "image_url", "image_url": {"url": image_url_duck}},
|
||||
{"type": "image_url", "image_url": {"url": image_url_lion}},
|
||||
],
|
||||
}],
|
||||
)
|
||||
print("Chat completion output:", chat_response.choices[0].message.content)
|
||||
|
||||
A full code example can be found in `examples/openai_chat_completion_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py>`_.
|
||||
|
||||
.. tip::
|
||||
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via ``--allowed-local-media-path`` when launching the API server/engine,
|
||||
and pass the file path as ``url`` in the API request.
|
||||
|
||||
.. tip::
|
||||
There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
|
||||
In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
|
||||
|
||||
.. note::
|
||||
|
||||
By default, the timeout for fetching images through HTTP URL is ``5`` seconds.
|
||||
You can override this by setting the environment variable:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
|
||||
|
||||
Video
|
||||
^^^^^
|
||||
|
||||
Instead of :code:`image_url`, you can pass a video file via :code:`video_url`.
|
||||
|
||||
You can use `these tests <https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_video.py>`_ as reference.
|
||||
|
||||
.. note::
|
||||
|
||||
By default, the timeout for fetching videos through HTTP URL url is ``30`` seconds.
|
||||
You can override this by setting the environment variable:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
|
||||
|
||||
Audio
|
||||
^^^^^
|
||||
|
||||
Instead of :code:`image_url`, you can pass an audio file via :code:`audio_url`.
|
||||
|
||||
A full code example can be found in `examples/openai_chat_completion_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py>`_.
|
||||
|
||||
.. note::
|
||||
|
||||
By default, the timeout for fetching audios through HTTP URL is ``10`` seconds.
|
||||
You can override this by setting the environment variable:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
|
||||
|
||||
Embedding
|
||||
^^^^^^^^^
|
||||
|
||||
vLLM's Embeddings API is a superset of OpenAI's `Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`_,
|
||||
where a list of chat ``messages`` can be passed instead of batched ``inputs``. This enables multi-modal inputs to be passed to embedding models.
|
||||
|
||||
.. tip::
|
||||
The schema of ``messages`` is exactly the same as in Chat Completions API.
|
||||
You can refer to the above tutorials for more details on how to pass each type of multi-modal data.
|
||||
|
||||
Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images.
|
||||
Refer to the examples below for illustration.
|
||||
|
||||
Here is an end-to-end example using VLM2Vec. To serve the model:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
vllm serve TIGER-Lab/VLM2Vec-Full --task embedding \
|
||||
--trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
|
||||
|
||||
.. important::
|
||||
|
||||
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass ``--task embedding``
|
||||
to run this model in embedding mode instead of text generation mode.
|
||||
|
||||
The custom chat template is completely different from the original one for this model,
|
||||
and can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_vlm2vec.jinja>`__.
|
||||
|
||||
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level ``requests`` library:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import requests
|
||||
|
||||
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
|
||||
|
||||
response = requests.post(
|
||||
"http://localhost:8000/v1/embeddings",
|
||||
json={
|
||||
"model": "TIGER-Lab/VLM2Vec-Full",
|
||||
"messages": [{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "image_url", "image_url": {"url": image_url}},
|
||||
{"type": "text", "text": "Represent the given image."},
|
||||
],
|
||||
}],
|
||||
"encoding_format": "float",
|
||||
},
|
||||
)
|
||||
response.raise_for_status()
|
||||
response_json = response.json()
|
||||
print("Embedding output:", response_json["data"][0]["embedding"])
|
||||
|
||||
Below is another example, this time using the ``MrLight/dse-qwen2-2b-mrl-v1`` model.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embedding \
|
||||
--trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
|
||||
|
||||
.. important::
|
||||
|
||||
Like with VLM2Vec, we have to explicitly pass ``--task embedding``.
|
||||
|
||||
Additionally, ``MrLight/dse-qwen2-2b-mrl-v1`` requires an EOS token for embeddings, which is handled
|
||||
by `this custom chat template <https://github.com/vllm-project/vllm/blob/main/examples/template_dse_qwen2_vl.jinja>`__.
|
||||
|
||||
.. important::
|
||||
|
||||
Also important, ``MrLight/dse-qwen2-2b-mrl-v1`` requires a placeholder image of the minimum image size for text query embeddings. See the full code
|
||||
example below for details.
|
||||
|
||||
A full code example can be found in `examples/openai_chat_embedding_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_embedding_client_for_multimodal.py>`_.
|
||||
65
docs/source/usage/performance.rst
Normal file
65
docs/source/usage/performance.rst
Normal file
@@ -0,0 +1,65 @@
|
||||
.. _performance:
|
||||
|
||||
Performance and Tuning
|
||||
======================
|
||||
|
||||
Preemption
|
||||
----------
|
||||
Due to the auto-regressive nature of transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
|
||||
The vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
|
||||
available again. When this occurs, the following warning is printed:
|
||||
|
||||
```
|
||||
WARNING 05-09 00:49:33 scheduler.py:1057] Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
|
||||
```
|
||||
|
||||
While this mechanism ensures system robustness, preemption and recomputation can adversely affect end-to-end latency.
|
||||
If you frequently encounter preemptions from the vLLM engine, consider the following actions:
|
||||
|
||||
- Increase `gpu_memory_utilization`. The vLLM pre-allocates GPU cache by using gpu_memory_utilization% of memory. By increasing this utilization, you can provide more KV cache space.
|
||||
- Decrease `max_num_seqs` or `max_num_batched_tokens`. This can reduce the number of concurrent requests in a batch, thereby requiring less KV cache space.
|
||||
- Increase `tensor_parallel_size`. This approach shards model weights, so each GPU has more memory available for KV cache.
|
||||
|
||||
You can also monitor the number of preemption requests through Prometheus metrics exposed by the vLLM. Additionally, you can log the cumulative number of preemption requests by setting disable_log_stats=False.
|
||||
|
||||
.. _chunked-prefill:
|
||||
|
||||
Chunked Prefill
|
||||
---------------
|
||||
vLLM supports an experimental feature chunked prefill. Chunked prefill allows to chunk large prefills into smaller chunks and batch them together with decode requests.
|
||||
|
||||
You can enable the feature by specifying ``--enable-chunked-prefill`` in the command line or setting ``enable_chunked_prefill=True`` in the LLM constructor.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True)
|
||||
# Set max_num_batched_tokens to tune performance.
|
||||
# NOTE: 512 is the default max_num_batched_tokens for chunked prefill.
|
||||
# llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True, max_num_batched_tokens=512)
|
||||
|
||||
By default, vLLM scheduler prioritizes prefills and doesn't batch prefill and decode to the same batch.
|
||||
This policy optimizes the TTFT (time to the first token), but incurs slower ITL (inter token latency) and inefficient GPU utilization.
|
||||
|
||||
Once chunked prefill is enabled, the policy is changed to prioritize decode requests.
|
||||
It batches all pending decode requests to the batch before scheduling any prefill.
|
||||
When there are available token_budget (``max_num_batched_tokens``), it schedules pending prefills.
|
||||
If a last pending prefill request cannot fit into ``max_num_batched_tokens``, it chunks it.
|
||||
|
||||
This policy has two benefits:
|
||||
|
||||
- It improves ITL and generation decode because decode requests are prioritized.
|
||||
- It helps achieve better GPU utilization by locating compute-bound (prefill) and memory-bound (decode) requests to the same batch.
|
||||
|
||||
You can tune the performance by changing ``max_num_batched_tokens``.
|
||||
By default, it is set to 512, which has the best ITL on A100 in the initial benchmark (llama 70B and mixtral 8x22B).
|
||||
Smaller ``max_num_batched_tokens`` achieves better ITL because there are fewer prefills interrupting decodes.
|
||||
Higher ``max_num_batched_tokens`` achieves better TTFT as you can put more prefill to the batch.
|
||||
|
||||
- If ``max_num_batched_tokens`` is the same as ``max_model_len``, that's almost the equivalent to the default scheduling policy (except that it still prioritizes decodes).
|
||||
- Note that the default value (512) of ``max_num_batched_tokens`` is optimized for ITL, and it may have lower throughput than the default scheduler.
|
||||
|
||||
We recommend you set ``max_num_batched_tokens > 2048`` for throughput.
|
||||
|
||||
See related papers for more details (https://arxiv.org/pdf/2401.08671 or https://arxiv.org/pdf/2308.16369).
|
||||
|
||||
Please try out this feature and let us know your feedback via GitHub issues!
|
||||
207
docs/source/usage/spec_decode.rst
Normal file
207
docs/source/usage/spec_decode.rst
Normal file
@@ -0,0 +1,207 @@
|
||||
.. _spec_decode:
|
||||
|
||||
Speculative decoding
|
||||
====================
|
||||
|
||||
.. warning::
|
||||
Please note that speculative decoding in vLLM is not yet optimized and does
|
||||
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters. The work
|
||||
to optimize it is ongoing and can be followed in `this issue. <https://github.com/vllm-project/vllm/issues/4630>`_
|
||||
|
||||
This document shows how to use `Speculative Decoding <https://x.com/karpathy/status/1697318534555336961>`_ with vLLM.
|
||||
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
|
||||
|
||||
Speculating with a draft model
|
||||
------------------------------
|
||||
|
||||
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
prompts = [
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
|
||||
llm = LLM(
|
||||
model="facebook/opt-6.7b",
|
||||
tensor_parallel_size=1,
|
||||
speculative_model="facebook/opt-125m",
|
||||
num_speculative_tokens=5,
|
||||
)
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
|
||||
To perform the same with an online mode launch the server:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
|
||||
--seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
|
||||
--num_speculative_tokens 5 --gpu_memory_utilization 0.8
|
||||
|
||||
Then use a client:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from openai import OpenAI
|
||||
|
||||
# Modify OpenAI's API key and API base to use vLLM's API server.
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
# defaults to os.environ.get("OPENAI_API_KEY")
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
models = client.models.list()
|
||||
model = models.data[0].id
|
||||
|
||||
# Completion API
|
||||
stream = False
|
||||
completion = client.completions.create(
|
||||
model=model,
|
||||
prompt="The future of AI is",
|
||||
echo=False,
|
||||
n=1,
|
||||
stream=stream,
|
||||
)
|
||||
|
||||
print("Completion results:")
|
||||
if stream:
|
||||
for c in completion:
|
||||
print(c)
|
||||
else:
|
||||
print(completion)
|
||||
|
||||
Speculating by matching n-grams in the prompt
|
||||
---------------------------------------------
|
||||
|
||||
The following code configures vLLM to use speculative decoding where proposals are generated by
|
||||
matching n-grams in the prompt. For more information read `this thread. <https://x.com/joao_gante/status/1747322413006643259>`_
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
prompts = [
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
|
||||
llm = LLM(
|
||||
model="facebook/opt-6.7b",
|
||||
tensor_parallel_size=1,
|
||||
speculative_model="[ngram]",
|
||||
num_speculative_tokens=5,
|
||||
ngram_prompt_lookup_max=4,
|
||||
)
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
|
||||
Speculating using MLP speculators
|
||||
---------------------------------
|
||||
|
||||
The following code configures vLLM to use speculative decoding where proposals are generated by
|
||||
draft models that conditioning draft predictions on both context vectors and sampled tokens.
|
||||
For more information see `this blog <https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/>`_ or
|
||||
`this technical report <https://arxiv.org/abs/2404.19124>`_.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
prompts = [
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
|
||||
tensor_parallel_size=4,
|
||||
speculative_model="ibm-fms/llama3-70b-accelerator",
|
||||
speculative_draft_tensor_parallel_size=1,
|
||||
)
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
|
||||
Note that these speculative models currently need to be run without tensor parallelism, although
|
||||
it is possible to run the main model using tensor parallelism (see example above). Since the
|
||||
speculative models are relatively small, we still see significant speedups. However, this
|
||||
limitation will be fixed in a future release.
|
||||
|
||||
A variety of speculative models of this type are available on HF hub:
|
||||
|
||||
* `llama-13b-accelerator <https://huggingface.co/ibm-fms/llama-13b-accelerator>`_
|
||||
* `llama3-8b-accelerator <https://huggingface.co/ibm-fms/llama3-8b-accelerator>`_
|
||||
* `codellama-34b-accelerator <https://huggingface.co/ibm-fms/codellama-34b-accelerator>`_
|
||||
* `llama2-70b-accelerator <https://huggingface.co/ibm-fms/llama2-70b-accelerator>`_
|
||||
* `llama3-70b-accelerator <https://huggingface.co/ibm-fms/llama3-70b-accelerator>`_
|
||||
* `granite-3b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-3b-code-instruct-accelerator>`_
|
||||
* `granite-8b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-8b-code-instruct-accelerator>`_
|
||||
* `granite-7b-instruct-accelerator <https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator>`_
|
||||
* `granite-20b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator>`_
|
||||
|
||||
Lossless guarantees of Speculative Decoding
|
||||
-------------------------------------------
|
||||
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
|
||||
speculative decoding, breaking down the guarantees into three key areas:
|
||||
|
||||
1. **Theoretical Losslessness**
|
||||
- Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might
|
||||
cause slight variations in output distributions, as discussed
|
||||
in `Accelerating Large Language Model Decoding with Speculative Sampling <https://arxiv.org/pdf/2302.01318>`_
|
||||
|
||||
2. **Algorithmic Losslessness**
|
||||
- vLLM’s implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:
|
||||
|
||||
- **Rejection Sampler Convergence**: Ensures that samples from vLLM’s rejection sampler align with the target
|
||||
distribution. `View Test Code <https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252>`_
|
||||
|
||||
- **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
|
||||
without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
|
||||
provides a lossless guarantee. Almost all of the tests in `this directory <https://github.com/vllm-project/vllm/tree/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e>`_
|
||||
verify this property using `this assertion implementation <https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291>`_
|
||||
|
||||
3. **vLLM Logprob Stability**
|
||||
- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
|
||||
same request across runs. For more details, see the FAQ section
|
||||
titled *Can the output of a prompt vary across runs in vLLM?* in the :ref:`FAQs <faq>`.
|
||||
|
||||
|
||||
**Conclusion**
|
||||
|
||||
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
|
||||
can occur due to following factors:
|
||||
|
||||
- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
|
||||
|
||||
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
|
||||
due to non-deterministic behavior in batched operations or numerical instability.
|
||||
|
||||
**Mitigation Strategies**
|
||||
|
||||
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the :ref:`FAQs <faq>`.
|
||||
|
||||
Resources for vLLM contributors
|
||||
-------------------------------
|
||||
* `A Hacker's Guide to Speculative Decoding in vLLM <https://www.youtube.com/watch?v=9wNAgpX6z_4>`_
|
||||
* `What is Lookahead Scheduling in vLLM? <https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a>`_
|
||||
* `Information on batch expansion <https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8>`_
|
||||
* `Dynamic speculative decoding <https://github.com/vllm-project/vllm/issues/4565>`_
|
||||
267
docs/source/usage/structured_outputs.rst
Normal file
267
docs/source/usage/structured_outputs.rst
Normal file
@@ -0,0 +1,267 @@
|
||||
.. _structured_outputs:
|
||||
|
||||
Structured Outputs
|
||||
==================
|
||||
|
||||
vLLM supports the generation of structured outputs using `outlines <https://github.com/dottxt-ai/outlines>`_ or `lm-format-enforcer <https://github.com/noamgat/lm-format-enforcer>`_ as backends for the guided decoding.
|
||||
This document shows you some examples of the different options that are available to generate structured outputs.
|
||||
|
||||
|
||||
Online Inference (OpenAI API)
|
||||
-----------------------------
|
||||
|
||||
You can generate structured outputs using the OpenAI's `Completions <https://platform.openai.com/docs/api-reference/completions>`_ and `Chat <https://platform.openai.com/docs/api-reference/chat>`_ API.
|
||||
|
||||
The following parameters are supported, which must be added as extra parameters:
|
||||
|
||||
- ``guided_choice``: the output will be exactly one of the choices.
|
||||
- ``guided_regex``: the output will follow the regex pattern.
|
||||
- ``guided_json``: the output will follow the JSON schema.
|
||||
- ``guided_grammar``: the output will follow the context free grammar.
|
||||
- ``guided_whitespace_pattern``: used to override the default whitespace pattern for guided json decoding.
|
||||
- ``guided_decoding_backend``: used to select the guided decoding backend to use.
|
||||
|
||||
You can see the complete list of supported parameters on the `OpenAI Compatible Server </../serving/openai_compatible_server.html>`_ page.
|
||||
|
||||
Now let´s see an example for each of the cases, starting with the ``guided_choice``, as it´s the easiest one:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from openai import OpenAI
|
||||
client = OpenAI(
|
||||
base_url="http://localhost:8000/v1",
|
||||
api_key="-",
|
||||
)
|
||||
|
||||
completion = client.chat.completions.create(
|
||||
model="Qwen/Qwen2.5-3B-Instruct",
|
||||
messages=[
|
||||
{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
|
||||
],
|
||||
extra_body={"guided_choice": ["positive", "negative"]},
|
||||
)
|
||||
print(completion.choices[0].message.content)
|
||||
|
||||
|
||||
The next example shows how to use the ``guided_regex``. The idea is to generate an email address, given a simple regex template:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
completion = client.chat.completions.create(
|
||||
model="Qwen/Qwen2.5-3B-Instruct",
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
|
||||
}
|
||||
],
|
||||
extra_body={"guided_regex": "\w+@\w+\.com\n", "stop": ["\n"]},
|
||||
)
|
||||
print(completion.choices[0].message.content)
|
||||
|
||||
One of the most relevant features in structured text generation is the option to generate a valid JSON with pre-defined fields and formats.
|
||||
For this we can use the ``guided_json`` parameter in two different ways:
|
||||
|
||||
- Using directly a `JSON Schema <https://json-schema.org/>`_
|
||||
- Defining a `Pydantic model <https://docs.pydantic.dev/latest/>`_ and then extracting the JSON Schema from it (which is normally an easier option).
|
||||
|
||||
The next example shows how to use the ``guided_json`` parameter with a Pydantic model:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from pydantic import BaseModel
|
||||
from enum import Enum
|
||||
|
||||
class CarType(str, Enum):
|
||||
sedan = "sedan"
|
||||
suv = "SUV"
|
||||
truck = "Truck"
|
||||
coupe = "Coupe"
|
||||
|
||||
|
||||
class CarDescription(BaseModel):
|
||||
brand: str
|
||||
model: str
|
||||
car_type: CarType
|
||||
|
||||
|
||||
json_schema = CarDescription.model_json_schema()
|
||||
|
||||
completion = client.chat.completions.create(
|
||||
model="Qwen/Qwen2.5-3B-Instruct",
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
|
||||
}
|
||||
],
|
||||
extra_body={"guided_json": json_schema},
|
||||
)
|
||||
print(completion.choices[0].message.content)
|
||||
|
||||
.. tip::
|
||||
While not strictly necessary, normally it´s better to indicate in the prompt that a JSON needs to be generated and which fields and how should the LLM fill them.
|
||||
This can improve the results notably in most cases.
|
||||
|
||||
|
||||
Finally we have the ``guided_grammar``, which probably is the most difficult one to use but it´s really powerful, as it allows us to define complete languages like SQL queries.
|
||||
It works by using a context free EBNF grammar, which for example we can use to define a specific format of simplified SQL queries, like in the example below:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
simplified_sql_grammar = """
|
||||
?start: select_statement
|
||||
|
||||
?select_statement: "SELECT " column_list " FROM " table_name
|
||||
|
||||
?column_list: column_name ("," column_name)*
|
||||
|
||||
?table_name: identifier
|
||||
|
||||
?column_name: identifier
|
||||
|
||||
?identifier: /[a-zA-Z_][a-zA-Z0-9_]*/
|
||||
"""
|
||||
|
||||
completion = client.chat.completions.create(
|
||||
model="Qwen/Qwen2.5-3B-Instruct",
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
|
||||
}
|
||||
],
|
||||
extra_body={"guided_grammar": simplified_sql_grammar},
|
||||
)
|
||||
print(completion.choices[0].message.content)
|
||||
|
||||
The complete code of the examples can be found on `examples/openai_chat_completion_structured_outputs.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_structured_outputs.py>`_.
|
||||
|
||||
Experimental Automatic Parsing (OpenAI API)
|
||||
--------------------------------------------
|
||||
|
||||
This section covers the OpenAI beta wrapper over the ``client.chat.completions.create()`` method that provides richer integrations with Python specific types.
|
||||
|
||||
At the time of writing (``openai==1.54.4``), this is a "beta" feature in the OpenAI client library. Code reference can be found `here <https://github.com/openai/openai-python/blob/52357cff50bee57ef442e94d78a0de38b4173fc2/src/openai/resources/beta/chat/completions.py#L100-L104>`_.
|
||||
|
||||
For the following examples, vLLM was setup using ``vllm serve meta-llama/Llama-3.1-8B-Instruct``
|
||||
|
||||
Here is a simple example demonstrating how to get structured output using Pydantic models:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from pydantic import BaseModel
|
||||
from openai import OpenAI
|
||||
|
||||
|
||||
class Info(BaseModel):
|
||||
name: str
|
||||
age: int
|
||||
|
||||
|
||||
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
|
||||
completion = client.beta.chat.completions.parse(
|
||||
model="meta-llama/Llama-3.1-8B-Instruct",
|
||||
messages=[
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": "My name is Cameron, I'm 28. What's my name and age?"},
|
||||
],
|
||||
response_format=Info,
|
||||
extra_body=dict(guided_decoding_backend="outlines"),
|
||||
)
|
||||
|
||||
message = completion.choices[0].message
|
||||
print(message)
|
||||
assert message.parsed
|
||||
print("Name:", message.parsed.name)
|
||||
print("Age:", message.parsed.age)
|
||||
|
||||
Output:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
ParsedChatCompletionMessage[Testing](content='{"name": "Cameron", "age": 28}', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=Testing(name='Cameron', age=28))
|
||||
Name: Cameron
|
||||
Age: 28
|
||||
|
||||
|
||||
Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from typing import List
|
||||
from pydantic import BaseModel
|
||||
from openai import OpenAI
|
||||
|
||||
|
||||
class Step(BaseModel):
|
||||
explanation: str
|
||||
output: str
|
||||
|
||||
|
||||
class MathResponse(BaseModel):
|
||||
steps: List[Step]
|
||||
final_answer: str
|
||||
|
||||
|
||||
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
|
||||
completion = client.beta.chat.completions.parse(
|
||||
model="meta-llama/Llama-3.1-8B-Instruct",
|
||||
messages=[
|
||||
{"role": "system", "content": "You are a helpful expert math tutor."},
|
||||
{"role": "user", "content": "Solve 8x + 31 = 2."},
|
||||
],
|
||||
response_format=MathResponse,
|
||||
extra_body=dict(guided_decoding_backend="outlines"),
|
||||
)
|
||||
|
||||
message = completion.choices[0].message
|
||||
print(message)
|
||||
assert message.parsed
|
||||
for i, step in enumerate(message.parsed.steps):
|
||||
print(f"Step #{i}:", step)
|
||||
print("Answer:", message.parsed.final_answer)
|
||||
|
||||
Output:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
ParsedChatCompletionMessage[MathResponse](content='{ "steps": [{ "explanation": "First, let\'s isolate the term with the variable \'x\'. To do this, we\'ll subtract 31 from both sides of the equation.", "output": "8x + 31 - 31 = 2 - 31"}, { "explanation": "By subtracting 31 from both sides, we simplify the equation to 8x = -29.", "output": "8x = -29"}, { "explanation": "Next, let\'s isolate \'x\' by dividing both sides of the equation by 8.", "output": "8x / 8 = -29 / 8"}], "final_answer": "x = -29/8" }', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=MathResponse(steps=[Step(explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation.", output='8x + 31 - 31 = 2 - 31'), Step(explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.', output='8x = -29'), Step(explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8.", output='8x / 8 = -29 / 8')], final_answer='x = -29/8'))
|
||||
Step #0: explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation." output='8x + 31 - 31 = 2 - 31'
|
||||
Step #1: explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.' output='8x = -29'
|
||||
Step #2: explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8." output='8x / 8 = -29 / 8'
|
||||
Answer: x = -29/8
|
||||
|
||||
Offline Inference
|
||||
-----------------
|
||||
|
||||
Offline inference allows for the same types of guided decoding.
|
||||
To use it, we´ll need to configure the guided decoding using the class ``GuidedDecodingParams`` inside ``SamplingParams``.
|
||||
The main available options inside ``GuidedDecodingParams`` are:
|
||||
|
||||
- ``json``
|
||||
- ``regex``
|
||||
- ``choice``
|
||||
- ``grammar``
|
||||
- ``backend``
|
||||
- ``whitespace_pattern``
|
||||
|
||||
These parameters can be used in the same way as the parameters from the Online Inference examples above.
|
||||
One example for the usage of the ``choices`` parameter is shown below:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
from vllm.sampling_params import GuidedDecodingParams
|
||||
|
||||
llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
|
||||
|
||||
guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"])
|
||||
sampling_params = SamplingParams(guided_decoding=guided_decoding_params)
|
||||
outputs = llm.generate(
|
||||
prompts="Classify this sentiment: vLLM is wonderful!",
|
||||
sampling_params=sampling_params,
|
||||
)
|
||||
print(outputs[0].outputs[0].text)
|
||||
|
||||
A complete example with all options can be found in `examples/offline_inference_structured_outputs.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_structured_outputs.py>`_.
|
||||
57
docs/source/usage/usage_stats.md
Normal file
57
docs/source/usage/usage_stats.md
Normal file
@@ -0,0 +1,57 @@
|
||||
# Usage Stats Collection
|
||||
|
||||
vLLM collects anonymous usage data by default to help the engineering team better understand which hardware and model configurations are widely used. This data allows them to prioritize their efforts on the most common workloads. The collected data is transparent, does not contain any sensitive information, and will be publicly released for the community's benefit.
|
||||
|
||||
## What data is collected?
|
||||
|
||||
You can see the up to date list of data collected by vLLM in the [usage_lib.py](https://github.com/vllm-project/vllm/blob/main/vllm/usage/usage_lib.py).
|
||||
|
||||
Here is an example as of v0.4.0:
|
||||
|
||||
```json
|
||||
{
|
||||
"uuid": "fbe880e9-084d-4cab-a395-8984c50f1109",
|
||||
"provider": "GCP",
|
||||
"num_cpu": 24,
|
||||
"cpu_type": "Intel(R) Xeon(R) CPU @ 2.20GHz",
|
||||
"cpu_family_model_stepping": "6,85,7",
|
||||
"total_memory": 101261135872,
|
||||
"architecture": "x86_64",
|
||||
"platform": "Linux-5.10.0-28-cloud-amd64-x86_64-with-glibc2.31",
|
||||
"gpu_count": 2,
|
||||
"gpu_type": "NVIDIA L4",
|
||||
"gpu_memory_per_device": 23580639232,
|
||||
"model_architecture": "OPTForCausalLM",
|
||||
"vllm_version": "0.3.2+cu123",
|
||||
"context": "LLM_CLASS",
|
||||
"log_time": 1711663373492490000,
|
||||
"source": "production",
|
||||
"dtype": "torch.float16",
|
||||
"tensor_parallel_size": 1,
|
||||
"block_size": 16,
|
||||
"gpu_memory_utilization": 0.9,
|
||||
"quantization": null,
|
||||
"kv_cache_dtype": "auto",
|
||||
"enable_lora": false,
|
||||
"enable_prefix_caching": false,
|
||||
"enforce_eager": false,
|
||||
"disable_custom_all_reduce": true
|
||||
}
|
||||
```
|
||||
|
||||
You can preview the collected data by running the following command:
|
||||
|
||||
```bash
|
||||
tail ~/.config/vllm/usage_stats.json
|
||||
```
|
||||
|
||||
## Opt-out of Usage Stats Collection
|
||||
|
||||
You can opt-out of usage stats collection by setting the VLLM_NO_USAGE_STATS or DO_NOT_TRACK environment variable, or by creating a ~/.config/vllm/do_not_track file:
|
||||
|
||||
```bash
|
||||
# Any of the following methods can disable usage stats collection
|
||||
export VLLM_NO_USAGE_STATS=1
|
||||
export DO_NOT_TRACK=1
|
||||
mkdir -p ~/.config/vllm && touch ~/.config/vllm/do_not_track
|
||||
```
|
||||
Reference in New Issue
Block a user