[Docs] Adding links and intro to Speculators and LLM Compressor (#32849)
Signed-off-by: Aidan Reilly <aireilly@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -36,12 +36,12 @@ th:not(:first-child) {
}
</style>
-| Feature | [CP](../configuration/optimization.md#chunked-prefill) | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | [pooling](../models/pooling_models.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search | [prompt-embeds](prompt_embeds.md) |
+| Feature | [CP](../configuration/optimization.md#chunked-prefill) | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode/README.md) | CUDA graph | [pooling](../models/pooling_models.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search | [prompt-embeds](prompt_embeds.md) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [CP](../configuration/optimization.md#chunked-prefill) | ✅ | | | | | | | | | | | | | | |
| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | |
-| [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | |
+| [SD](spec_decode/README.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | |
| [pooling](../models/pooling_models.md) | 🟠\* | 🟠\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | | |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [❌](https://github.com/vllm-project/vllm/issues/7366) | ❌ | [❌](https://github.com/vllm-project/vllm/issues/7366) | ✅ | ✅ | ✅ | | | | | | | | |
@@ -64,7 +64,7 @@ th:not(:first-child) {
| [CP](../configuration/optimization.md#chunked-prefill) | [❌](https://github.com/vllm-project/vllm/issues/2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [APC](automatic_prefix_caching.md) | [❌](https://github.com/vllm-project/vllm/issues/3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
+| [SD](spec_decode/README.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | [❌](https://github.com/vllm-project/vllm/issues/26970) |
| [pooling](../models/pooling_models.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
@@ -2,7 +2,10 @@
Quantization trades off model precision for a smaller memory footprint, allowing large models to be run on a wider range of devices.
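The precision-for-memory trade-off can be sketched with a toy example (illustrative Python only, with made-up weights; this is not vLLM or kernel code):

```python
# Toy symmetric INT8 quantization: map floats to 8-bit integers plus one
# shared scale. Illustrative only -- not how vLLM's quantization kernels work.

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Quantize floats to int8 codes with a single shared scale."""
    scale = max(abs(w) for w in weights) / 127  # 127 = int8 max magnitude
    return [round(w / scale) for w in weights], scale

def dequantize_int8(codes: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from int8 codes."""
    return [c * scale for c in codes]

weights = [0.02, -1.27, 0.64, 0.008]      # made-up FP32 weights
codes, scale = quantize_int8(weights)     # each code fits in 1 byte
approx = dequantize_int8(codes, scale)    # close to, not equal to, weights
# Storage drops from 4 bytes to 1 byte per weight, at a bounded precision cost.
```

The rounding error per weight is at most half the scale, which is why calibration (choosing good scales) matters for the formats listed below.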
-Contents:
+!!! tip
+    To get started with quantization, see [LLM Compressor](llm_compressor.md), a library for optimizing models for deployment with vLLM that supports FP8, INT8, INT4, and other quantization formats.
+
+The following are the supported quantization formats for vLLM:
- [AutoAWQ](auto_awq.md)
- [BitsAndBytes](bnb.md)
31 docs/features/quantization/llm_compressor.md Normal file
@@ -0,0 +1,31 @@
# LLM Compressor
[LLM Compressor](https://docs.vllm.ai/projects/llm-compressor/en/latest/) is a library for optimizing models for deployment with vLLM.
It provides a comprehensive set of quantization algorithms, including support for techniques such as FP4, FP8, INT8, and INT4 quantization.
## Why use LLM Compressor?
Modern LLMs often contain billions of parameters stored in 16-bit or 32-bit floating point, requiring substantial GPU memory and limiting deployment options.
Quantization lowers memory requirements while maintaining inference output quality by reducing the precision of model weights and activations to smaller data types.
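As a rough, hypothetical illustration of why this matters (back-of-envelope arithmetic, not measured numbers): weight memory scales linearly with bits per parameter.

```python
# Back-of-envelope weight-memory estimate for a hypothetical 8B-parameter
# model. Real deployments also need memory for activations, the KV cache,
# and runtime overhead, so these are lower bounds on total usage.

PARAMS = 8e9  # hypothetical 8 billion parameters

def weight_gib(bits_per_param: int) -> float:
    """Weight storage in GiB at the given precision."""
    return PARAMS * bits_per_param / 8 / 2**30

fp16 = weight_gib(16)  # ~14.9 GiB
int8 = weight_gib(8)   # half of FP16
int4 = weight_gib(4)   # a quarter of FP16
```

Under these assumptions, an 8B model that does not fit on a 16 GiB GPU in FP16 fits comfortably after INT8 or INT4 quantization.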
LLM Compressor provides the following benefits:
- **Reduced memory footprint**: Run larger models on smaller GPUs.
- **Lower inference costs**: Serve more concurrent users per GPU, directly reducing the cost per query in production deployments.
- **Faster inference**: Smaller data types mean less memory bandwidth consumed, which often translates to higher throughput, especially for memory-bound workloads.
LLM Compressor handles the complexity of quantization, calibration, and format conversion, producing models ready for immediate use with vLLM.
## Key features
- **Multiple Quantization Algorithms**: Support for AWQ, GPTQ, AutoRound, and Round-to-Nearest.
  Also includes support for QuIP and SpinQuant-style transforms as well as KV cache and attention quantization.
- **Multiple Quantization Methods**: Support for FP8, INT8, INT4, NVFP4, MXFP4, and mixed-precision quantization.
- **One-Shot Quantization**: Quantize models quickly with minimal calibration data.
- **vLLM Integration**: Seamlessly deploy quantized models with vLLM using the compressed-tensors format.
- **Hugging Face Compatibility**: Works with models from the Hugging Face Hub.
## Resources
- [LLM Compressor examples](https://github.com/vllm-project/llm-compressor/tree/main/examples)
- [GitHub Repository](https://github.com/vllm-project/llm-compressor)
@@ -11,6 +11,9 @@
This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
Speculative decoding is a technique that improves inter-token latency in memory-bound LLM inference.
!!! tip
    To train your own draft models for speculative decoding, see [Speculators](speculators.md), a library for training draft models that integrates seamlessly with vLLM.
## Speculating with a draft model
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
@@ -223,7 +226,7 @@ A variety of speculative models of this type are available on HF hub:
## Speculating using EAGLE based draft models
The following code configures vLLM to use speculative decoding where proposals are generated by
-an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](../../examples/offline_inference/spec_decode.py).
+an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found in [examples/offline_inference/spec_decode.py](../../../examples/offline_inference/spec_decode.py).
??? code
@@ -313,7 +316,7 @@ speculative decoding, breaking down the guarantees into three key areas:
3. **vLLM Logprob Stability**
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section
-titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
+titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to the following factors:
@@ -322,7 +325,7 @@ can occur due to following factors:
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.
-For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
+For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).
## Resources for vLLM contributors
29 docs/features/spec_decode/speculators.md Normal file
@@ -0,0 +1,29 @@
# Speculators
[Speculators](https://docs.vllm.ai/projects/speculators/en/latest/) is a library for accelerating LLM inference through speculative decoding, providing efficient draft model training that integrates seamlessly with vLLM to reduce latency and improve throughput.
Speculators provides the following key features:
- **Offline training data generation using vLLM**: Enable the generation of hidden states using vLLM. Data samples are saved to disk and can be used for draft model training.
- **Draft model training support**: End-to-end training support for single- and multi-layer draft models. Training is supported for both non-MoE and MoE models.
- **Standardized, extensible format**: Provides a Hugging Face-compatible format for defining speculative models, with tools to convert from external research repositories into a standard speculators format for easy adoption.
- **Seamless vLLM Integration**: Built for direct deployment into vLLM, enabling low-latency, production-grade inference with minimal overhead.
## Why use Speculators?
Large language models generate text one token at a time, which creates a fundamental bottleneck: each token requires a full forward pass through the model, leaving GPU compute underutilized while waiting for memory-bound operations.
Speculative decoding addresses this by using a smaller, faster "draft" model (often just a single transformer layer) to predict multiple tokens ahead, then verifying those tokens in parallel with the primary model.
Speculative decoding provides the following benefits:
- **Reduced latency**: Generates tokens 2-3 times faster for interactive applications such as chatbots and code assistants, where response time directly impacts user experience.
- **Better GPU utilization**: Converts latency- and memory-bound decoding in the large model into compute-bound parallel token verification, improving hardware utilization.
- **No quality loss**: Speculative decoding does not approximate the target model. Accepted tokens are exactly those the target model would have produced under the same sampling configuration; rejected draft tokens are discarded and regenerated by the target model.
- **Cost efficiency**: Serve more requests per GPU by reducing the time each request occupies the hardware.
Speculators is particularly valuable for latency-sensitive applications where users are waiting for responses in real-time, such as conversational AI, interactive coding assistants, and streaming text generation.
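A rough way to reason about the gain is the standard back-of-envelope model from the speculative decoding literature: with k draft tokens per step and per-token acceptance rate a (assumed i.i.d., which real workloads only approximate), each target verification step yields (1 - a^(k+1)) / (1 - a) tokens on average. The acceptance rates below are hypothetical:

```python
# Expected tokens generated per target-model verification step, assuming
# an i.i.d. per-token acceptance rate `a` and `k` draft tokens per step.
# Illustrative numbers only; real acceptance rates depend on the workload.

def expected_tokens_per_step(a: float, k: int) -> float:
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1 - a ** (k + 1)) / (1 - a)

low = expected_tokens_per_step(0.6, 5)   # weak draft model: ~2.4 tokens/step
high = expected_tokens_per_step(0.8, 5)  # well-trained draft: ~3.7 tokens/step
```

This is why draft model quality matters: raising the acceptance rate from 0.6 to 0.8 cuts the number of target-model passes per generated token by roughly a third in this model.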
## Resources
- [Speculators examples](https://github.com/vllm-project/speculators/tree/main/examples)
- [GitHub Repository](https://github.com/vllm-project/speculators)