Signed-off-by: <zihaoan2@amd.com> Signed-off-by: zihaoanllm <zihaoan2@amd.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
84 lines
5.4 KiB
Markdown
84 lines
5.4 KiB
Markdown
# Speculative Decoding
|
||
|
||
This document shows how to use [Speculative Decoding](https://arxiv.org/pdf/2302.01318) with vLLM to reduce inter-token latency under medium-to-low QPS (query per second), memory-bound workloads.
|
||
|
||
To train your own draft models for optimized speculative decoding, see [vllm-project/speculators](speculators.md) for seamless training and integration with vLLM.
|
||
|
||
## vLLM Speculation Methods
|
||
|
||
vLLM supports a variety of methods of speculative decoding. Model-based methods such as EAGLE, MTP, draft models, PARD and MLP provide the best latency reduction, while simpler methods such as n-gram and suffix decoding provide modest speedups without increasing workload during peak traffic.
|
||
|
||
- [EAGLE](eagle.md)
|
||
- [Multi-Token Prediction (MTP)](mtp.md)
|
||
- [Draft Model](draft_model.md)
|
||
- [Parallel Draft Model (PARD)](parallel_draft_model.md)
|
||
- [Multi-Layer Perceptron](mlp.md)
|
||
- [N-Gram](n_gram.md)
|
||
- [Suffix Decoding](suffix.md)
|
||
|
||
## Method Selection at a Glance
|
||
|
||
Use this qualitative table as a starting point for method selection. Real gains
|
||
depend on your model family, traffic pattern, hardware, and sampling settings.
|
||
|
||
| Method | Low QPS (latency focused) | High QPS (throughput focused) | Notes |
|
||
| --- | --- | --- | --- |
|
||
| EAGLE | High gain | Medium to high gain | Strong general-purpose model-based method. |
|
||
| MTP | High gain | Medium to high gain | Best when the target model has native MTP support. |
|
||
| Draft model | High gain | Medium gain | Needs a separate draft model. |
|
||
| Parallel Draft Model | High gain | Medium to high gain | Low draft model latency. |
|
||
| MLP speculator | Medium to high gain | Medium gain | Good when compatible MLP speculators are available. |
|
||
| N-gram | Low to medium gain | Medium gain | Lightweight and easy to enable. |
|
||
| Suffix decoding | Low to medium gain | Medium gain | No extra draft model; dynamic speculation depth. |
|
||
|
||
For reproducible measurements in your environment, use
|
||
[`examples/offline_inference/spec_decode.py`](../../../examples/offline_inference/spec_decode.py)
|
||
or the [benchmark CLI guide](../../benchmarking/cli.md).
|
||
|
||
## Lossless guarantees of Speculative Decoding
|
||
|
||
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
|
||
speculative decoding, breaking down the guarantees into three key areas:
|
||
|
||
1. **Theoretical Losslessness**
|
||
\- Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might
|
||
cause slight variations in output distributions, as discussed
|
||
in [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/pdf/2302.01318)
|
||
|
||
2. **Algorithmic Losslessness**
|
||
\- vLLM’s implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:
|
||
|
||
> - **Rejection Sampler Convergence**: Ensures that samples from vLLM’s rejection sampler align with the target
|
||
> distribution. [View Test Code](https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252)
|
||
> - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
|
||
> without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
|
||
> provides a lossless guarantee. Almost all of the tests in [tests/spec_decode/e2e](/tests/v1/spec_decode).
|
||
> verify this property using [this assertion implementation](https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291)
|
||
|
||
3. **vLLM Logprob Stability**
|
||
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
|
||
same request across runs. For more details, see the FAQ section
|
||
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).
|
||
|
||
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
|
||
can occur due to following factors:
|
||
|
||
- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
|
||
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
|
||
due to non-deterministic behavior in batched operations or numerical instability.
|
||
|
||
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).
|
||
|
||
## Known Feature Incompatibility
|
||
|
||
1. Pipeline parallelism is not composible with speculative decoding as of `vllm<=0.15.0`
|
||
2. Speculative decoding with a draft models is not supported in `vllm<=0.10.0`
|
||
|
||
## Resources for vLLM contributors
|
||
|
||
- [[vLLM Office Hours #40] Intro to Speculators](https://www.youtube.com/watch?v=2ISAr_JVGLs)
|
||
- [A Hacker's Guide to Speculative Decoding in vLLM](https://www.youtube.com/watch?v=9wNAgpX6z_4)
|
||
- [What is Lookahead Scheduling in vLLM?](https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a)
|
||
- [Information on batch expansion](https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8)
|
||
- [Dynamic speculative decoding](https://github.com/vllm-project/vllm/issues/4565)
|