[Docs] Clean up speculators docs (#34065)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
docs/features/speculative_decoding/README.md
@@ -0,0 +1,62 @@
# Speculative Decoding

This document shows how to use [Speculative Decoding](https://arxiv.org/pdf/2302.01318) with vLLM to reduce inter-token latency under medium-to-low QPS (queries per second), memory-bound workloads.

To train your own draft models for optimized speculative decoding, see [vllm-project/speculators](speculators.md) for seamless training and integration with vLLM.

## vLLM Speculation Methods

vLLM supports a variety of speculative decoding methods. Model-based methods such as EAGLE, draft models, and MLP speculators provide the best latency reduction, while simpler methods such as n-gram and suffix decoding provide modest speedups without increasing the workload during peak traffic. Each method is selected via the `method` field of `speculative_config`; a minimal example follows the list below.

- [EAGLE](eagle.md)
- [Draft Model](draft_model.md)
- [Multi-Layer Perceptron](mlp.md)
- [N-Gram](n_gram.md)
- [Suffix Decoding](suffix.md)
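
The example below is a minimal sketch of that selection, using the n-gram method from [N-Gram](n_gram.md) because it requires no separate draft model; the model name and parameter values are illustrative defaults taken from the linked pages.

```python
from vllm import LLM, SamplingParams

# The "method" key of `speculative_config` chooses the speculation strategy;
# model-based methods additionally set a draft "model" (see the linked pages).
llm = LLM(
    model="Qwen/Qwen3-8B",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)

outputs = llm.generate(
    ["The future of AI is"],
    SamplingParams(temperature=0.8, top_p=0.95),
)
print(outputs[0].outputs[0].text)
```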

## Lossless Guarantees of Speculative Decoding

In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of speculative decoding, breaking the guarantees down into three key areas:

1. **Theoretical Losslessness**
    \- Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might cause slight variations in output distributions, as discussed in [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/pdf/2302.01318).

2. **Algorithmic Losslessness**
    \- vLLM's implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:

    > - **Rejection Sampler Convergence**: Ensures that samples from vLLM's rejection sampler align with the target distribution. [View Test Code](https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252)
    > - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler, provides a lossless guarantee. Almost all of the tests in [tests/spec_decode/e2e](/tests/v1/spec_decode) verify this property using [this assertion implementation](https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291).

3. **vLLM Logprob Stability**
    \- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the same request across runs. For more details, see the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).

While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding can occur due to the following factors:

- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially due to non-deterministic behavior in batched operations or numerical instability.

For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).
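
As a practical complement to these guarantees, the sketch below compares greedy outputs with and without speculative decoding on a single prompt; it assumes the Qwen3 models used in [Draft Model](draft_model.md) and a GPU with enough free memory for two engine instances (otherwise, run the two configurations as separate scripts and compare the saved outputs).

```python
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
greedy = SamplingParams(temperature=0.0, max_tokens=64)

def generate_greedy(speculative_config=None):
    # Build a fresh engine, run greedy decoding, and return the generated texts.
    llm = LLM(
        model="Qwen/Qwen3-8B",
        gpu_memory_utilization=0.4,  # leave headroom for the second engine
        speculative_config=speculative_config,
    )
    return [out.outputs[0].text for out in llm.generate(prompts, greedy)]

baseline = generate_greedy()
speculative = generate_greedy({
    "model": "Qwen/Qwen3-0.6B",
    "num_speculative_tokens": 5,
    "method": "draft_model",
})
assert baseline == speculative, "Greedy outputs differ with speculative decoding"
```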

## Known Feature Incompatibility

1. Pipeline parallelism is not composable with speculative decoding as of `vllm<=0.15.0`.
2. Speculative decoding with a draft model is not supported in `vllm<=0.10.0`.

## Resources for vLLM Contributors

- [[vLLM Office Hours #40] Intro to Speculators](https://www.youtube.com/watch?v=2ISAr_JVGLs)
- [A Hacker's Guide to Speculative Decoding in vLLM](https://www.youtube.com/watch?v=9wNAgpX6z_4)
- [What is Lookahead Scheduling in vLLM?](https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a)
- [Information on batch expansion](https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8)
- [Dynamic speculative decoding](https://github.com/vllm-project/vllm/issues/4565)
docs/features/speculative_decoding/draft_model.md
@@ -0,0 +1,80 @@
# Draft Models

The following code configures vLLM in offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.

```python
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="Qwen/Qwen3-8B",
    tensor_parallel_size=1,
    speculative_config={
        "model": "Qwen/Qwen3-0.6B",
        "num_speculative_tokens": 5,
        "method": "draft_model",
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

To launch a similar configuration in online mode, start the server with:

```bash
vllm serve Qwen/Qwen3-4B-Thinking-2507 \
    --host 0.0.0.0 \
    --port 8000 \
    --seed 42 \
    -tp 1 \
    --max_model_len 2048 \
    --gpu_memory_utilization 0.8 \
    --speculative_config '{"model": "Qwen/Qwen3-0.6B", "num_speculative_tokens": 5, "method": "draft_model"}'
```

The client code used to request completions remains unchanged:

??? code

    ```python
    from openai import OpenAI

    # Modify OpenAI's API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        # defaults to os.environ.get("OPENAI_API_KEY")
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    models = client.models.list()
    model = models.data[0].id

    # Completion API
    stream = False
    completion = client.completions.create(
        model=model,
        prompt="The future of AI is",
        echo=False,
        n=1,
        stream=stream,
    )

    print("Completion results:")
    if stream:
        for c in completion:
            print(c)
    else:
        print(completion)
    ```

!!! warning
    Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model with `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately has been deprecated.
docs/features/speculative_decoding/eagle.md
@@ -0,0 +1,67 @@
# EAGLE Draft Models

The following code configures vLLM to use speculative decoding where proposals are generated by an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077)-based draft model. A more detailed example for offline mode, including how to extract request-level acceptance rates, can be found in [examples/offline_inference/spec_decode.py](../../../examples/offline_inference/spec_decode.py).

## EAGLE Drafter Example

```python
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
        "method": "eagle",
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

## EAGLE-3 Drafter Example

```python
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    speculative_config={
        "model": "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3",
        "draft_tensor_parallel_size": 2,
        "num_speculative_tokens": 2,
        "method": "eagle3",
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

## Pre-Trained EAGLE Draft Models

A variety of EAGLE draft models are available on the Hugging Face Hub:

* [RedHatAI/speculator-models](https://huggingface.co/collections/RedHatAI/speculator-models)
* [yuhuili/models](https://huggingface.co/yuhuili/models?search=eagle)

!!! warning
    If you are using `vllm<0.7.0`, please use [this script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model and specify `"model": "path/to/modified/eagle/model"` in `speculative_config`.
docs/features/speculative_decoding/mlp.md
@@ -0,0 +1,42 @@
# MLP Draft Models

The following code configures vLLM to use speculative decoding where proposals are generated by draft models that condition draft predictions on both context vectors and sampled tokens. For more information, see [The Hitchhiker's Guide to Speculative Decoding](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) and [IBM Research's Technical Report](https://arxiv.org/abs/2404.19124).

## MLP Drafter Example

```python
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "model": "ibm-ai-platform/llama3-70b-accelerator",
        "draft_tensor_parallel_size": 1,
        "method": "mlp_speculator",
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
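
To build intuition for how these drafters use both signals, the following is a simplified, illustrative PyTorch sketch of a single speculator head (not the actual `ibm-ai-platform` accelerator architecture, which stacks several such heads with additional normalization):

```python
import torch
import torch.nn as nn

class MLPSpeculatorHead(nn.Module):
    """Predict next-token logits from the verifier's hidden state (context vector)
    and the embedding of the most recently sampled token."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        self.state_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.token_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.activation = nn.GELU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_state: torch.Tensor, last_token_id: torch.Tensor) -> torch.Tensor:
        # Condition the draft prediction on both the context vector and the sampled token.
        mixed = self.state_proj(hidden_state) + self.token_proj(self.token_embedding(last_token_id))
        return self.lm_head(self.activation(mixed))

# Shape check with random inputs: batch of 2, hidden size 64, vocabulary of 1000.
head = MLPSpeculatorHead(hidden_size=64, vocab_size=1000)
logits = head(torch.randn(2, 64), torch.randint(0, 1000, (2,)))
print(logits.shape)  # torch.Size([2, 1000])
```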

## Pre-Trained MLP Drafter Models

A variety of speculative models of this type are available on the Hugging Face Hub:

- [llama-13b-accelerator](https://huggingface.co/ibm-ai-platform/llama-13b-accelerator)
- [llama3-8b-accelerator](https://huggingface.co/ibm-ai-platform/llama3-8b-accelerator)
- [codellama-34b-accelerator](https://huggingface.co/ibm-ai-platform/codellama-34b-accelerator)
- [llama2-70b-accelerator](https://huggingface.co/ibm-ai-platform/llama2-70b-accelerator)
- [llama3-70b-accelerator](https://huggingface.co/ibm-ai-platform/llama3-70b-accelerator)
- [granite-3b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-3b-code-instruct-accelerator)
- [granite-8b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-8b-code-instruct-accelerator)
- [granite-7b-instruct-accelerator](https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator)
- [granite-20b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator)
docs/features/speculative_decoding/n_gram.md
@@ -0,0 +1,27 @@
# N-Gram Speculation

The following code configures vLLM to use speculative decoding where proposals are generated by matching n-grams in the prompt. For more information, read [this thread](https://x.com/joao_gante/status/1747322413006643259).

```python
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="Qwen/Qwen3-8B",
    tensor_parallel_size=1,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
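
For intuition about how proposals are formed, the following is a simplified, self-contained sketch of prompt-lookup n-gram matching over token IDs; it is only an illustration of the idea, not vLLM's implementation, and the helper name and parameters are hypothetical.

```python
def propose_ngram_draft(token_ids, n_max=4, n_min=1, k=5):
    """Propose up to k draft tokens by matching the last n tokens
    (trying n_max down to n_min) against earlier positions in the context."""
    for n in range(n_max, n_min - 1, -1):
        if len(token_ids) <= n:
            continue
        suffix = token_ids[-n:]
        # Scan backwards so the most recent earlier occurrence wins.
        for start in range(len(token_ids) - n - 1, -1, -1):
            if token_ids[start:start + n] == suffix:
                continuation = token_ids[start + n:start + n + k]
                if continuation:
                    return continuation
    return []  # no match: fall back to normal decoding for this step

# Toy example: the bigram [3, 4] occurred earlier and was followed by [5, 6, 7, 8, 3].
context = [1, 2, 3, 4, 5, 6, 7, 8, 3, 4]
print(propose_ngram_draft(context))  # -> [5, 6, 7, 8, 3]
```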
docs/features/speculative_decoding/speculators.md
@@ -0,0 +1,32 @@
# vLLM-Project/Speculators



[Speculators](https://docs.vllm.ai/projects/speculators/en/latest/) is a library for accelerating LLM inference through speculative decoding, providing efficient draft model training that integrates seamlessly with vLLM to reduce latency and improve throughput.

Speculators provides the following key features:

- **Offline training data generation using vLLM**: Enables the generation of hidden states using vLLM. Data samples are saved to disk and can be used for draft model training.
- **Draft model training support**: End-to-end training support for single- and multi-layer draft models. Training is supported for both non-MoE and MoE models.
- **Standardized, extensible format**: Provides a Hugging Face-compatible format for defining speculative models, with tools to convert from external research repositories into a standard speculators format for easy adoption.
- **Seamless vLLM Integration**: Built for direct deployment into vLLM, enabling low-latency, production-grade inference with minimal overhead (see the sketch after this list).
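
As a minimal sketch of that integration, a speculators-format checkpoint is passed to vLLM through `speculative_config`; this reuses the `RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3` drafter shown in [EAGLE](eagle.md), and the parameter values are only illustrative.

```python
from vllm import LLM, SamplingParams

# Deploy a speculators-format EAGLE-3 drafter alongside its verifier in vLLM.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "model": "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3",
        "num_speculative_tokens": 2,
        "method": "eagle3",
    },
)

outputs = llm.generate(
    ["The future of AI is"],
    SamplingParams(temperature=0.8, top_p=0.95),
)
print(outputs[0].outputs[0].text)
```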

## Why use Speculators?

Large language models generate text one token at a time, which creates a fundamental bottleneck: each token requires a full forward pass through the model, leaving GPU compute underutilized while waiting for memory-bound operations.
Speculative decoding addresses this by using a smaller, faster "draft" model (often just a single transformer layer) to predict multiple tokens ahead, which are then verified in parallel by the primary model.

Speculative decoding provides the following benefits:

- **Reduced latency**: Generates tokens 2-3 times faster for interactive applications such as chatbots and code assistants, where response time directly impacts user experience.
- **Better GPU utilization**: Converts latency- and memory-bound decoding in the large model into compute-bound parallel token verification, improving hardware utilization.
- **No quality loss**: Speculative decoding does not approximate the target model. Accepted tokens are exactly those the target model would have produced under the same sampling configuration; rejected draft tokens are discarded and regenerated by the target model.
- **Cost efficiency**: Serves more requests per GPU by reducing the time each request occupies the hardware.

Speculators is particularly valuable for latency-sensitive applications where users are waiting for responses in real time, such as conversational AI, interactive coding assistants, and streaming text generation.

## Resources

- [Speculators examples](https://github.com/vllm-project/speculators/tree/main/examples)
- [GitHub Repository](https://github.com/vllm-project/speculators)
docs/features/speculative_decoding/suffix.md
@@ -0,0 +1,35 @@
# Suffix Decoding

The following code configures vLLM to use speculative decoding where proposals are generated using Suffix Decoding ([technical report](https://arxiv.org/abs/2411.04975)).

Like n-gram, Suffix Decoding can generate draft tokens by pattern-matching using the last `n` generated tokens. Unlike n-gram, Suffix Decoding (1) can pattern-match against both the prompt and previous generations, (2) uses frequency counts to propose the most likely continuations, and (3) speculates an adaptive number of tokens for each request at each iteration to get better acceptance rates.

Suffix Decoding can achieve better performance for tasks with high repetition, such as code editing, agentic loops (e.g., self-reflection, self-consistency), and RL rollouts.
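
For intuition about how proposals are formed, the following is a simplified, self-contained sketch of frequency-based suffix matching with an adaptive stopping rule; it is only an illustration of the idea, not Arctic Inference's suffix-tree implementation, and the helper name and thresholds are hypothetical.

```python
from collections import Counter

def propose_suffix_draft(token_ids, max_suffix_len=8, max_spec_tokens=32):
    """Greedily extend the most frequent continuation of the longest matching suffix."""
    context = list(token_ids)
    draft = []
    while len(draft) < max_spec_tokens:
        counts = Counter()
        # Match the longest suffix that has occurred before; fall back to shorter ones.
        for n in range(min(max_suffix_len, len(context) - 1), 0, -1):
            suffix = context[-n:]
            for start in range(len(context) - n):
                if context[start:start + n] == suffix:
                    counts[context[start + n]] += 1
            if counts:
                break
        if not counts:
            break  # nothing in the context continues this suffix
        token, count = counts.most_common(1)[0]
        if count / sum(counts.values()) < 0.5:
            break  # adaptive stop: the continuation is ambiguous, so stop speculating
        draft.append(token)
        context.append(token)
    return draft

# Toy example with a repeated phrase; the draft simply continues the repetition.
history = [10, 11, 12, 13, 10, 11, 12, 13, 10, 11]
print(propose_suffix_draft(history, max_spec_tokens=4))  # -> [12, 13, 10, 11]
```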

!!! tip "Install Arctic Inference"
    Suffix Decoding requires [Arctic Inference](https://github.com/snowflakedb/ArcticInference). You can install it with `pip install arctic-inference`.

!!! tip "Suffix Decoding Speculative Tokens"
    Suffix Decoding will speculate a dynamic number of tokens for each request at each decoding step, so the `num_speculative_tokens` configuration specifies the *maximum* number of speculative tokens. It is suggested to use a high number such as `16` or `32` (the default).

```python
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="Qwen/Qwen3-8B",
    tensor_parallel_size=1,
    speculative_config={
        "method": "suffix",
        "num_speculative_tokens": 32,
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```