[Docs] Adding links and intro to Speculators and LLM Compressor (#32849)
Signed-off-by: Aidan Reilly <aireilly@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -36,12 +36,12 @@ th:not(:first-child) {
}
</style>
-| Feature | [CP](../configuration/optimization.md#chunked-prefill) | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | [pooling](../models/pooling_models.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search | [prompt-embeds](prompt_embeds.md) |
+| Feature | [CP](../configuration/optimization.md#chunked-prefill) | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode/README.md) | CUDA graph | [pooling](../models/pooling_models.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search | [prompt-embeds](prompt_embeds.md) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [CP](../configuration/optimization.md#chunked-prefill) | ✅ | | | | | | | | | | | | | | |
| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | |
-| [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | |
+| [SD](spec_decode/README.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | |
| [pooling](../models/pooling_models.md) | 🟠\* | 🟠\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | | |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [❌](https://github.com/vllm-project/vllm/issues/7366) | ❌ | [❌](https://github.com/vllm-project/vllm/issues/7366) | ✅ | ✅ | ✅ | | | | | | | | |
@@ -64,7 +64,7 @@ th:not(:first-child) {
| [CP](../configuration/optimization.md#chunked-prefill) | [❌](https://github.com/vllm-project/vllm/issues/2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [APC](automatic_prefix_caching.md) | [❌](https://github.com/vllm-project/vllm/issues/3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
+| [SD](spec_decode/README.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | [❌](https://github.com/vllm-project/vllm/issues/26970) |
| [pooling](../models/pooling_models.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
@@ -2,7 +2,10 @@
Quantization trades off model precision for a smaller memory footprint, allowing large models to be run on a wider range of devices.
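The precision-for-memory trade-off can be sketched with a toy example (illustrative Python only, with made-up weights; this is not vLLM or kernel code):

```python
# Toy symmetric INT8 quantization: map floats to 8-bit integers plus one
# shared scale. Illustrative only -- not how vLLM's quantization kernels work.

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Quantize floats to int8 codes with a single shared scale."""
    scale = max(abs(w) for w in weights) / 127  # 127 = int8 max magnitude
    return [round(w / scale) for w in weights], scale

def dequantize_int8(codes: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from int8 codes."""
    return [c * scale for c in codes]

weights = [0.02, -1.27, 0.64, 0.008]      # made-up FP32 weights
codes, scale = quantize_int8(weights)     # each code fits in 1 byte
approx = dequantize_int8(codes, scale)    # close to, not equal to, weights
# Storage drops from 4 bytes to 1 byte per weight, at a bounded precision cost.
```

The rounding error per weight is at most half the scale, which is why calibration (choosing good scales) matters for the formats listed below.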
-Contents:
+!!! tip
+    To get started with quantization, see [LLM Compressor](llm_compressor.md), a library for optimizing models for deployment with vLLM that supports FP8, INT8, INT4, and other quantization formats.
+
+The following are the supported quantization formats for vLLM:
- [AutoAWQ](auto_awq.md)
- [BitsAndBytes](bnb.md)
31 docs/features/quantization/llm_compressor.md Normal file
@@ -0,0 +1,31 @@
# LLM Compressor
[LLM Compressor](https://docs.vllm.ai/projects/llm-compressor/en/latest/) is a library for optimizing models for deployment with vLLM.
It provides a comprehensive set of quantization algorithms, including support for techniques such as FP4, FP8, INT8, and INT4 quantization.
## Why use LLM Compressor?
Modern LLMs often contain billions of parameters stored in 16-bit or 32-bit floating point, requiring substantial GPU memory and limiting deployment options.
Quantization lowers memory requirements while maintaining inference output quality by reducing the precision of model weights and activations to smaller data types.
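As a rough, hypothetical illustration of why this matters (back-of-envelope arithmetic, not measured numbers): weight memory scales linearly with bits per parameter.

```python
# Back-of-envelope weight-memory estimate for a hypothetical 8B-parameter
# model. Real deployments also need memory for activations, the KV cache,
# and runtime overhead, so these are lower bounds on total usage.

PARAMS = 8e9  # hypothetical 8 billion parameters

def weight_gib(bits_per_param: int) -> float:
    """Weight storage in GiB at the given precision."""
    return PARAMS * bits_per_param / 8 / 2**30

fp16 = weight_gib(16)  # ~14.9 GiB
int8 = weight_gib(8)   # half of FP16
int4 = weight_gib(4)   # a quarter of FP16
```

Under these assumptions, an 8B model that does not fit on a 16 GiB GPU in FP16 fits comfortably after INT8 or INT4 quantization.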
LLM Compressor provides the following benefits:
- **Reduced memory footprint**: Run larger models on smaller GPUs.
- **Lower inference costs**: Serve more concurrent users per GPU, directly reducing the cost per query in production deployments.
- **Faster inference**: Smaller data types mean less memory bandwidth consumed, which often translates to higher throughput, especially for memory-bound workloads.
LLM Compressor handles the complexity of quantization, calibration, and format conversion, producing models ready for immediate use with vLLM.
## Key features
- **Multiple Quantization Algorithms**: Support for AWQ, GPTQ, AutoRound, and Round-to-Nearest.
  Also includes support for QuIP and SpinQuant-style transforms as well as KV cache and attention quantization.
- **Multiple Quantization Methods**: Support for FP8, INT8, INT4, NVFP4, MXFP4, and mixed-precision quantization.
- **One-Shot Quantization**: Quantize models quickly with minimal calibration data.
- **vLLM Integration**: Seamlessly deploy quantized models with vLLM using the compressed-tensors format.
- **Hugging Face Compatibility**: Works with models from the Hugging Face Hub.
## Resources
- [LLM Compressor examples](https://github.com/vllm-project/llm-compressor/tree/main/examples)
- [GitHub Repository](https://github.com/vllm-project/llm-compressor)
@@ -11,6 +11,9 @@
This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
Speculative decoding is a technique that improves inter-token latency in memory-bound LLM inference.
!!! tip
    To train your own draft models for speculative decoding, see [Speculators](speculators.md), a library for training draft models that integrates seamlessly with vLLM.
## Speculating with a draft model
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
@@ -223,7 +226,7 @@ A variety of speculative models of this type are available on HF hub:
## Speculating using EAGLE based draft models
The following code configures vLLM to use speculative decoding where proposals are generated by
-an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](../../examples/offline_inference/spec_decode.py).
+an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found in [examples/offline_inference/spec_decode.py](../../../examples/offline_inference/spec_decode.py).
??? code
@@ -313,7 +316,7 @@ speculative decoding, breaking down the guarantees into three key areas:
3. **vLLM Logprob Stability**
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section
-titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
+titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to the following factors:
@@ -322,7 +325,7 @@ can occur due to following factors:
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.
-For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
+For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).
## Resources for vLLM contributors
29 docs/features/spec_decode/speculators.md Normal file
@@ -0,0 +1,29 @@
# Speculators
[Speculators](https://docs.vllm.ai/projects/speculators/en/latest/) is a library for accelerating LLM inference through speculative decoding, providing efficient draft model training that integrates seamlessly with vLLM to reduce latency and improve throughput.
Speculators provides the following key features:
- **Offline training data generation using vLLM**: Enable the generation of hidden states using vLLM. Data samples are saved to disk and can be used for draft model training.
- **Draft model training support**: End-to-end training support for single- and multi-layer draft models. Training is supported for both non-MoE and MoE models.
- **Standardized, extensible format**: Provides a Hugging Face-compatible format for defining speculative models, with tools to convert from external research repositories into a standard speculators format for easy adoption.
- **Seamless vLLM Integration**: Built for direct deployment into vLLM, enabling low-latency, production-grade inference with minimal overhead.
## Why use Speculators?
Large language models generate text one token at a time, which creates a fundamental bottleneck: each token requires a full forward pass through the model, leaving GPU compute underutilized while waiting for memory-bound operations.
Speculative decoding addresses this by using a smaller, faster "draft" model (often just a single transformer layer) to predict multiple tokens ahead, then verifying those tokens in parallel with the primary model.
Speculative decoding provides the following benefits:
- **Reduced latency**: Generates tokens 2-3 times faster for interactive applications such as chatbots and code assistants, where response time directly impacts user experience.
- **Better GPU utilization**: Converts latency- and memory-bound decoding in the large model into compute-bound parallel token verification, improving hardware utilization.
- **No quality loss**: Speculative decoding does not approximate the target model. Accepted tokens are exactly those the target model would have produced under the same sampling configuration; rejected draft tokens are discarded and regenerated by the target model.
- **Cost efficiency**: Serve more requests per GPU by reducing the time each request occupies the hardware.
Speculators is particularly valuable for latency-sensitive applications where users are waiting for responses in real-time, such as conversational AI, interactive coding assistants, and streaming text generation.
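A rough way to reason about the gain is the standard back-of-envelope model from the speculative decoding literature: with k draft tokens per step and per-token acceptance rate a (assumed i.i.d., which real workloads only approximate), each target verification step yields (1 - a^(k+1)) / (1 - a) tokens on average. The acceptance rates below are hypothetical:

```python
# Expected tokens generated per target-model verification step, assuming
# an i.i.d. per-token acceptance rate `a` and `k` draft tokens per step.
# Illustrative numbers only; real acceptance rates depend on the workload.

def expected_tokens_per_step(a: float, k: int) -> float:
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1 - a ** (k + 1)) / (1 - a)

low = expected_tokens_per_step(0.6, 5)   # weak draft model: ~2.4 tokens/step
high = expected_tokens_per_step(0.8, 5)  # well-trained draft: ~3.7 tokens/step
```

This is why draft model quality matters: raising the acceptance rate from 0.6 to 0.8 cuts the number of target-model passes per generated token by roughly a third in this model.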
## Resources
- [Speculators examples](https://github.com/vllm-project/speculators/tree/main/examples)
- [GitHub Repository](https://github.com/vllm-project/speculators)