[Docs] Adding links and intro to Speculators and LLM Compressor (#32849)
Signed-off-by: Aidan Reilly <aireilly@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
docs/features/spec_decode/speculators.md (new file, 29 lines)
@@ -0,0 +1,29 @@
# Speculators
[Speculators](https://docs.vllm.ai/projects/speculators/en/latest/) is a library for accelerating LLM inference through speculative decoding, providing efficient draft model training that integrates seamlessly with vLLM to reduce latency and improve throughput.
Speculators provides the following key features:
- **Offline training data generation using vLLM**: Generates hidden states using vLLM; data samples are saved to disk and can be used for draft model training.
- **Draft model training support**: End-to-end training of single- and multi-layer draft models. Training is supported for both dense (non-MoE) and MoE models.
- **Standardized, extensible format**: Provides a Hugging Face-compatible format for defining speculative models, with tools to convert from external research repositories into a standard speculators format for easy adoption.
- **Seamless vLLM integration**: Built for direct deployment into vLLM, enabling low-latency, production-grade inference with minimal overhead (see the sketch after this list).
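
The snippet below is a minimal sketch of that last point: pairing a target model with a speculators-trained draft through vLLM's offline inference API. The draft checkpoint name is a hypothetical placeholder, and the `speculative_config` keys shown are assumptions about your vLLM version; check the vLLM and Speculators documentation for the options your versions support.

```python
# Minimal sketch (assumptions): a target model plus a speculators-format
# EAGLE-3 draft via vLLM's offline LLM API. The draft model name below is
# a hypothetical placeholder; substitute your converted checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",    # target model
    speculative_config={
        "method": "eagle3",                      # draft architecture
        "model": "my-org/llama-3.1-8b-drafter",  # speculators-format draft
        "num_speculative_tokens": 5,             # tokens proposed per step
    },
)

outputs = llm.generate(
    ["Speculative decoding reduces latency by"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```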
## Why use Speculators?
Large language models generate text one token at a time, which creates a fundamental bottleneck: each token requires a full forward pass through the model, leaving GPU compute underutilized while waiting for memory-bound operations.
Speculative decoding addresses this by using a smaller, faster "draft" model (often just a single transformer layer) to predict multiple tokens ahead, then verifying those tokens in parallel with the target model.
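
To make the propose-and-verify loop concrete, here is a toy sketch of the greedy variant. It is illustrative only (plain Python with stand-in model functions), not vLLM's or Speculators' implementation; in a real engine, step 2 is a single batched forward pass of the target model.

```python
def speculative_step(tokens, draft, target, k=4):
    """One greedy speculative decoding step (toy illustration).

    `draft` and `target` are stand-ins for argmax next-token functions
    that map a token sequence to the next token id.
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft(proposal))

    # 2. The target model checks each proposed position. In a real engine
    #    this is one parallel forward pass; here it is spelled out serially.
    accepted = list(tokens)
    for i in range(len(tokens), len(proposal)):
        expected = target(accepted)   # token the target would emit next
        accepted.append(expected)     # always keep the target's token
        if proposal[i] != expected:   # first mismatch: discard the rest
            break
    return accepted

# Trivial stand-ins: both models count upward, so every draft token matches
# and one step yields k accepted tokens instead of one.
count_up = lambda seq: seq[-1] + 1
print(speculative_step([1, 2, 3], count_up, count_up))  # [1, 2, 3, 4, 5, 6, 7]
```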
Speculative decoding provides the following benefits:
- **Reduced latency**: Generates tokens 2-3 times faster for interactive applications such as chatbots and code assistants, where response time directly impacts user experience.
- **Better GPU utilization**: Converts the memory-bound, latency-dominated decoding of the large model into compute-bound parallel token verification, improving hardware utilization.
- **No quality loss**: Speculative decoding does not approximate the target model. Accepted tokens are exactly those the target model would have produced under the same sampling configuration; rejected draft tokens are discarded and regenerated by the target model (the acceptance rule is sketched after this list).
- **Cost efficiency**: Serve more requests per GPU by reducing the time each request occupies the hardware.
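
The "no quality loss" guarantee is not specific to Speculators; it follows from the standard acceptance rule in the speculative sampling literature. As a sketch for a single position: if the draft proposes token $x$ with probability $q(x)$ and the target model assigns it probability $p(x)$, the token is accepted with probability

$$
P(\text{accept } x) = \min\left(1, \frac{p(x)}{q(x)}\right)
$$

and, on rejection, a replacement token is sampled from the residual distribution $\propto \max(0,\, p(\cdot) - q(\cdot))$. The combined procedure provably samples from exactly $p$, the target model's distribution.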
Speculators is particularly valuable for latency-sensitive applications where users are waiting for responses in real-time, such as conversational AI, interactive coding assistants, and streaming text generation.
## Resources
- [Speculators examples](https://github.com/vllm-project/speculators/tree/main/examples)
- [GitHub Repository](https://github.com/vllm-project/speculators)