diff --git a/docs/features/README.md b/docs/features/README.md
index b9083b999..d51216219 100644
--- a/docs/features/README.md
+++ b/docs/features/README.md
@@ -36,12 +36,12 @@ th:not(:first-child) {
}
-| Feature | [CP](../configuration/optimization.md#chunked-prefill) | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | [pooling](../models/pooling_models.md) | enc-dec | logP | prmpt logP | async output | multi-step | mm | best-of | beam-search | [prompt-embeds](prompt_embeds.md) |
+| Feature | [CP](../configuration/optimization.md#chunked-prefill) | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode/README.md) | CUDA graph | [pooling](../models/pooling_models.md) | enc-dec | logP | prmpt logP | async output | multi-step | mm | best-of | beam-search | [prompt-embeds](prompt_embeds.md) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [CP](../configuration/optimization.md#chunked-prefill) | ✅ | | | | | | | | | | | | | | |
| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | |
-| [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | |
+| [SD](spec_decode/README.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | |
| [pooling](../models/pooling_models.md) | 🟠\* | 🟠\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | | |
| enc-dec | ❌ | [❌](https://github.com/vllm-project/vllm/issues/7366) | ❌ | [❌](https://github.com/vllm-project/vllm/issues/7366) | ✅ | ✅ | ✅ | | | | | | | | |
@@ -64,7 +64,7 @@ th:not(:first-child) {
| [CP](../configuration/optimization.md#chunked-prefill) | [❌](https://github.com/vllm-project/vllm/issues/2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [APC](automatic_prefix_caching.md) | [❌](https://github.com/vllm-project/vllm/issues/3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
+| [SD](spec_decode/README.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | [❌](https://github.com/vllm-project/vllm/issues/26970) |
| [pooling](../models/pooling_models.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| enc-dec | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
diff --git a/docs/features/quantization/README.md b/docs/features/quantization/README.md
index 8c4a06166..77213bb35 100644
--- a/docs/features/quantization/README.md
+++ b/docs/features/quantization/README.md
@@ -2,7 +2,10 @@
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
-Contents:
+!!! tip
+ To get started with quantization, see [LLM Compressor](llm_compressor.md), a library for optimizing models for deployment with vLLM that supports FP8, INT8, INT4, and other quantization formats.
+
+vLLM supports the following quantization formats:
- [AutoAWQ](auto_awq.md)
- [BitsAndBytes](bnb.md)
diff --git a/docs/features/quantization/llm_compressor.md b/docs/features/quantization/llm_compressor.md
new file mode 100644
index 000000000..31bb0f36f
--- /dev/null
+++ b/docs/features/quantization/llm_compressor.md
@@ -0,0 +1,31 @@
+# LLM Compressor
+
+[LLM Compressor](https://docs.vllm.ai/projects/llm-compressor/en/latest/) is a library for optimizing models for deployment with vLLM.
+It provides a comprehensive set of quantization algorithms and supports formats such as FP4, FP8, INT8, and INT4.
+
+## Why use LLM Compressor?
+
+Modern LLMs often contain billions of parameters stored in 16-bit or 32-bit floating point, requiring substantial GPU memory and limiting deployment options.
+Quantization lowers memory requirements while maintaining inference output quality by reducing the precision of model weights and activations to smaller data types.
+
+LLM Compressor provides the following benefits:
+
+- **Reduced memory footprint**: Run larger models on smaller GPUs.
+- **Lower inference costs**: Serve more concurrent users per GPU, directly reducing the cost per query in production deployments.
+- **Faster inference**: Smaller data types mean less memory bandwidth consumed, which often translates to higher throughput, especially for memory-bound workloads.
+
+LLM Compressor handles the complexity of quantization, calibration, and format conversion, producing models ready for immediate use with vLLM.
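As a minimal illustration of the precision reduction involved (a hand-rolled sketch, not LLM Compressor's API), symmetric round-to-nearest INT8 quantization stores each weight as a one-byte integer plus a shared per-tensor scale:

```python
# Sketch of symmetric round-to-nearest INT8 weight quantization.
# Illustration only: LLM Compressor's algorithms (GPTQ, AWQ, ...) are
# considerably more sophisticated than plain rounding.

def quantize_int8(weights):
    """Map float weights to int8 codes plus one per-tensor scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    codes = [max(-128, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights; error is at most about scale / 2."""
    return [c * scale for c in codes]

weights = [0.12, -0.5, 0.33, 1.27, -1.27]
codes, scale = quantize_int8(weights)
print(codes)  # → [12, -50, 33, 127, -127]  (1 byte each instead of 2-4)
print(dequantize(codes, scale))
```

Real quantization algorithms improve on naive rounding by using calibration data to minimize the error the rounding introduces, which is what LLM Compressor automates.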
+
+## Key features
+
+- **Multiple Quantization Algorithms**: Support for AWQ, GPTQ, AutoRound, and Round-to-Nearest, plus QuIP and SpinQuant-style transforms as well as KV cache and attention quantization.
+- **Multiple Quantization Formats**: Support for FP8, INT8, INT4, NVFP4, MXFP4, and mixed-precision quantization.
+- **One-Shot Quantization**: Quantize models quickly with minimal calibration data.
+- **vLLM Integration**: Seamlessly deploy quantized models with vLLM using the compressed-tensors format.
+- **Hugging Face Compatibility**: Works with models from the Hugging Face Hub.
+
+## Resources
+
+- [LLM Compressor examples](https://github.com/vllm-project/llm-compressor/tree/main/examples)
+- [GitHub Repository](https://github.com/vllm-project/llm-compressor)
diff --git a/docs/features/spec_decode.md b/docs/features/spec_decode/README.md
similarity index 97%
rename from docs/features/spec_decode.md
rename to docs/features/spec_decode/README.md
index bd525ae33..0d19ef839 100644
--- a/docs/features/spec_decode.md
+++ b/docs/features/spec_decode/README.md
@@ -11,6 +11,9 @@
This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
+!!! tip
+ To train your own draft models for speculative decoding, see [Speculators](speculators.md), a library for training draft models that integrates seamlessly with vLLM.
+
## Speculating with a draft model
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
@@ -223,7 +226,7 @@ A variety of speculative models of this type are available on HF hub:
## Speculating using EAGLE based draft models
The following code configures vLLM to use speculative decoding where proposals are generated by
-an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](../../examples/offline_inference/spec_decode.py).
+an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found in [examples/offline_inference/spec_decode.py](../../../examples/offline_inference/spec_decode.py).
??? code
@@ -313,7 +316,7 @@ speculative decoding, breaking down the guarantees into three key areas:
3. **vLLM Logprob Stability**
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section
- titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
+ titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to following factors:
@@ -322,7 +325,7 @@ can occur due to following factors:
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.
-For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
+For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).
## Resources for vLLM contributors
diff --git a/docs/features/spec_decode/speculators.md b/docs/features/spec_decode/speculators.md
new file mode 100644
index 000000000..7735e18ec
--- /dev/null
+++ b/docs/features/spec_decode/speculators.md
@@ -0,0 +1,29 @@
+# Speculators
+
+[Speculators](https://docs.vllm.ai/projects/speculators/en/latest/) is a library for accelerating LLM inference through speculative decoding. It provides efficient draft model training that integrates seamlessly with vLLM to reduce latency and improve throughput.
+
+Speculators provides the following key features:
+
+- **Offline training data generation using vLLM**: Generates hidden states with vLLM; data samples are saved to disk and can be used for draft model training.
+- **Draft model training support**: End-to-end training of single- and multi-layer draft models, supporting both MoE and non-MoE models.
+- **Standardized, extensible format**: Provides a Hugging Face-compatible format for defining speculative models, with tools to convert from external research repositories into a standard speculators format for easy adoption.
+- **Seamless vLLM Integration**: Built for direct deployment into vLLM, enabling low-latency, production-grade inference with minimal overhead.
+
+## Why use Speculators?
+
+Large language models generate text one token at a time, which creates a fundamental bottleneck: each token requires a full forward pass through the model, leaving GPU compute underutilized while waiting for memory-bound operations.
+Speculative decoding addresses this by using a smaller, faster "draft" model (often just a single transformer layer) to predict multiple tokens ahead, then verifying those tokens in parallel with the target model.
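The draft-then-verify loop can be sketched for the greedy case as follows; the "models" here are toy stand-in functions over integer token ids, not vLLM or Speculators APIs:

```python
# Toy sketch of draft-then-verify speculative decoding (greedy case only).

def speculative_decode(target, draft, prompt, k, n_tokens):
    """The draft proposes k tokens, the target verifies them and keeps the
    longest agreeing prefix, then emits one token of its own. The output is
    identical to target-only greedy decoding."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft proposes k tokens autoregressively (cheap, sequential).
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target checks every proposed position (one batched pass in practice).
        ctx = list(out)
        for t in proposal:
            if target(ctx) != t:
                break  # first mismatch: discard the rest of the proposal
            ctx.append(t)
        out = ctx
        if len(out) - len(prompt) < n_tokens:
            # The target contributes one token per step: the correction after
            # a mismatch, or a bonus token after a fully accepted proposal.
            out.append(target(out))
    return out[len(prompt):][:n_tokens]

# Stand-ins: the target counts upward; the draft guesses wrong whenever the
# next token would be a multiple of 4.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + (2 if (ctx[-1] + 1) % 4 == 0 else 1)
print(speculative_decode(target, draft, [0], k=4, n_tokens=8))
# → [1, 2, 3, 4, 5, 6, 7, 8]
```

Even with an imperfect draft, the output matches what the target alone would produce; the draft's accuracy only determines how many tokens are accepted per verification step, and hence the speedup.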
+
+Speculative decoding provides the following benefits:
+
+- **Reduced latency**: Generates tokens 2-3x faster for interactive applications such as chatbots and code assistants, where response time directly impacts user experience.
+- **Better GPU utilization**: Converts memory-bound autoregressive decoding in the large model into compute-bound parallel token verification, improving hardware utilization.
+- **No quality loss**: Speculative decoding does not approximate the target model. Accepted tokens are exactly those the target model would have produced under the same sampling configuration; rejected draft tokens are discarded and regenerated by the target model.
+- **Cost efficiency**: Serves more requests per GPU by reducing the time each request occupies the hardware.
+
+Speculators is particularly valuable for latency-sensitive applications where users wait for responses in real time, such as conversational AI, interactive coding assistants, and streaming text generation.
+
+## Resources
+
+- [Speculators examples](https://github.com/vllm-project/speculators/tree/main/examples)
+- [GitHub Repository](https://github.com/vllm-project/speculators)