diff --git a/docs/features/README.md b/docs/features/README.md
index b9083b999..d51216219 100644
--- a/docs/features/README.md
+++ b/docs/features/README.md
@@ -36,12 +36,12 @@ th:not(:first-child) {
 }
 </style>
 
-| Feature | [CP](../configuration/optimization.md#chunked-prefill) | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | [pooling](../models/pooling_models.md) | enc-dec | logP | prmpt logP | async output | multi-step | mm | best-of | beam-search | [prompt-embeds](prompt_embeds.md) |
+| Feature | [CP](../configuration/optimization.md#chunked-prefill) | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode/README.md) | CUDA graph | [pooling](../models/pooling_models.md) | enc-dec | logP | prmpt logP | async output | multi-step | mm | best-of | beam-search | [prompt-embeds](prompt_embeds.md) |
 |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
 | [CP](../configuration/optimization.md#chunked-prefill) | ✅ | | | | | | | | | | | | | | |
 | [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | |
 | [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | |
-| [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | |
+| [SD](spec_decode/README.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | |
 | CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | |
 | [pooling](../models/pooling_models.md) | 🟠\* | 🟠\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | | |
 | enc-dec | ❌ | [❌](https://github.com/vllm-project/vllm/issues/7366) | ❌ | [❌](https://github.com/vllm-project/vllm/issues/7366) | ✅ | ✅ | ✅ | | | | | | | | |
@@ -64,7 +64,7 @@ th:not(:first-child) {
 | [CP](../configuration/optimization.md#chunked-prefill) | [❌](https://github.com/vllm-project/vllm/issues/2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
 | [APC](automatic_prefix_caching.md) | [❌](https://github.com/vllm-project/vllm/issues/3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
 | [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
+| [SD](spec_decode/README.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
 | CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | [❌](https://github.com/vllm-project/vllm/issues/26970) |
 | [pooling](../models/pooling_models.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
 | enc-dec | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
diff --git a/docs/features/quantization/README.md b/docs/features/quantization/README.md
index 8c4a06166..77213bb35 100644
--- a/docs/features/quantization/README.md
+++ b/docs/features/quantization/README.md
@@ -2,7 +2,10 @@
 
 Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
 
-Contents:
+!!! tip
+    To get started with quantization, see [LLM Compressor](llm_compressor.md), a library for optimizing models for deployment with vLLM that supports FP8, INT8, INT4, and other quantization formats.
+
+vLLM supports the following quantization formats:
 
 - [AutoAWQ](auto_awq.md)
 - [BitsAndBytes](bnb.md)
diff --git a/docs/features/quantization/llm_compressor.md b/docs/features/quantization/llm_compressor.md
new file mode 100644
index 000000000..31bb0f36f
--- /dev/null
+++ b/docs/features/quantization/llm_compressor.md
@@ -0,0 +1,31 @@
+# LLM Compressor
+
+[LLM Compressor](https://docs.vllm.ai/projects/llm-compressor/en/latest/) is a library for optimizing models for deployment with vLLM.
+It provides a comprehensive set of quantization algorithms and supports formats such as FP4, FP8, INT8, and INT4.
+
+## Why use LLM Compressor?
+
+Modern LLMs often contain billions of parameters stored in 16-bit or 32-bit floating point, requiring substantial GPU memory and limiting deployment options.
+Quantization reduces the precision of model weights and activations to smaller data types, lowering memory requirements while maintaining inference output quality.
+
+LLM Compressor provides the following benefits:
+
+- **Reduced memory footprint**: Run larger models on smaller GPUs.
+- **Lower inference costs**: Serve more concurrent users per GPU, directly reducing the cost per query in production deployments.
+- **Faster inference**: Smaller data types consume less memory bandwidth, which often translates to higher throughput, especially for memory-bound workloads.
+
+LLM Compressor handles the complexity of quantization, calibration, and format conversion, producing models ready for immediate use with vLLM.
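+
+As a quick taste, the following sketch applies one-shot FP8 dynamic quantization, closely following the quantization examples in the LLM Compressor documentation (the model ID and output directory are illustrative, and import paths may vary between releases):
+
+```python
+from llmcompressor import oneshot
+from llmcompressor.modifiers.quantization import QuantizationModifier
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative; any HF causal LM works
+model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+
+# FP8 dynamic quantization needs no calibration data;
+# lm_head stays in higher precision to protect output quality.
+recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
+oneshot(model=model, recipe=recipe)
+
+# Save in the compressed-tensors format that vLLM loads directly.
+SAVE_DIR = "Meta-Llama-3-8B-Instruct-FP8-Dynamic"
+model.save_pretrained(SAVE_DIR, save_compressed=True)
+tokenizer.save_pretrained(SAVE_DIR)
+```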
+
+## Key features
+
+- **Multiple Quantization Algorithms**: Support for AWQ, GPTQ, AutoRound, and Round-to-Nearest.
+Also includes QuIP and SpinQuant-style transforms, as well as KV cache and attention quantization.
+- **Multiple Quantization Formats**: Support for FP8, INT8, INT4, NVFP4, MXFP4, and mixed-precision quantization.
+- **One-Shot Quantization**: Quantize models quickly with minimal calibration data.
+- **vLLM Integration**: Seamlessly deploy quantized models with vLLM using the compressed-tensors format, as shown in the sketch below.
+- **Hugging Face Compatibility**: Works with models from the Hugging Face Hub.
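+
+Because the output is a standard compressed-tensors checkpoint, deployment needs no extra steps. A minimal sketch, reusing the illustrative output directory from the example above:
+
+```python
+from vllm import LLM, SamplingParams
+
+# vLLM detects the quantization config from the checkpoint automatically.
+llm = LLM(model="./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
+outputs = llm.generate(["What is quantization?"], SamplingParams(max_tokens=64))
+print(outputs[0].outputs[0].text)
+```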
+
+## Resources
+
+- [LLM Compressor examples](https://github.com/vllm-project/llm-compressor/tree/main/examples)
+- [GitHub Repository](https://github.com/vllm-project/llm-compressor)
diff --git a/docs/features/spec_decode.md b/docs/features/spec_decode/README.md
similarity index 97%
rename from docs/features/spec_decode.md
rename to docs/features/spec_decode/README.md
index bd525ae33..0d19ef839 100644
--- a/docs/features/spec_decode.md
+++ b/docs/features/spec_decode/README.md
@@ -11,6 +11,9 @@
 This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
 Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
 
+!!! tip
+    To train your own draft models for speculative decoding, see [Speculators](speculators.md), a draft-model training library that integrates seamlessly with vLLM.
+
 ## Speculating with a draft model
 
 The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
@@ -223,7 +226,7 @@ A variety of speculative models of this type are available on HF hub:
 ## Speculating using EAGLE based draft models
 
 The following code configures vLLM to use speculative decoding where proposals are generated by
-an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](../../examples/offline_inference/spec_decode.py).
+an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request-level acceptance rates, can be found in [examples/offline_inference/spec_decode.py](../../../examples/offline_inference/spec_decode.py).
 
 ??? code
 
@@ -313,7 +316,7 @@ speculative decoding, breaking down the guarantees into three key areas:
 3. **vLLM Logprob Stability** \- vLLM does not currently guarantee
    stable token log probabilities (logprobs). This can result in different
    outputs for the same request across runs. For more details, see the FAQ section
-   titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
+   titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).
 
 While vLLM strives to ensure losslessness in speculative
 decoding, variations in generated outputs with and without speculative decoding
@@ -322,7 +325,7 @@ can occur due to following factors:
 - **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output
   probabilities, potentially due to non-deterministic behavior in batched operations or numerical instability.
 
-For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
+For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).
 
 ## Resources for vLLM contributors
 
diff --git a/docs/features/spec_decode/speculators.md b/docs/features/spec_decode/speculators.md
new file mode 100644
index 000000000..7735e18ec
--- /dev/null
+++ b/docs/features/spec_decode/speculators.md
@@ -0,0 +1,29 @@
+# Speculators
+
+[Speculators](https://docs.vllm.ai/projects/speculators/en/latest/) is a library for accelerating LLM inference through speculative decoding. It provides efficient draft model training and integrates seamlessly with vLLM to reduce latency and improve throughput.
+
+Speculators provides the following key features:
+
+- **Offline training data generation using vLLM**: Generate hidden states with vLLM; data samples are saved to disk and can be reused for draft model training.
+- **Draft model training support**: End-to-end training of single- and multi-layer draft models, for both MoE and non-MoE models.
+- **Standardized, extensible format**: A Hugging Face-compatible format for defining speculative models, with tools to convert checkpoints from external research repositories into the standard speculators format for easy adoption.
+- **Seamless vLLM integration**: Built for direct deployment into vLLM, enabling low-latency, production-grade inference with minimal overhead.
+
+## Why use Speculators?
+
+Large language models generate text one token at a time, which creates a fundamental bottleneck: each token requires a full forward pass through the model, leaving GPU compute underutilized while waiting on memory-bound operations.
+Speculative decoding addresses this by using a smaller, faster "draft" model (often just a single transformer layer) to predict multiple tokens ahead, then verifying those tokens in parallel with the primary model.
+
+Speculative decoding provides the following benefits:
+
+- **Reduced latency**: Generates tokens 2-3 times faster for interactive applications such as chatbots and code assistants, where response time directly impacts user experience.
+- **Better GPU utilization**: Converts latency- and memory-bound decoding in the large model into compute-bound parallel token verification, improving hardware utilization.
+- **No quality loss**: Speculative decoding does not approximate the target model. Accepted tokens are exactly those the target model would have produced under the same sampling configuration; rejected draft tokens are discarded and regenerated by the target model.
+- **Cost efficiency**: Serves more requests per GPU by reducing the time each request occupies the hardware.
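+
+As a sketch of how a trained draft model is used (the draft checkpoint name and config keys here are illustrative; see the vLLM speculative decoding docs for the exact options):
+
+```python
+from vllm import LLM, SamplingParams
+
+# Target model plus a speculators-format draft model; vLLM verifies the
+# draft's proposed tokens in parallel with the target model.
+llm = LLM(
+    model="meta-llama/Llama-3.1-8B-Instruct",
+    speculative_config={
+        "model": "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3",  # illustrative checkpoint
+        "num_speculative_tokens": 5,
+    },
+)
+outputs = llm.generate(["Speculative decoding works by"], SamplingParams(max_tokens=64))
+print(outputs[0].outputs[0].text)
+```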
+
+Speculators is particularly valuable for latency-sensitive applications where users wait for responses in real time, such as conversational AI, interactive coding assistants, and streaming text generation.
+
+## Resources
+
+- [Speculators examples](https://github.com/vllm-project/speculators/tree/main/examples)
+- [GitHub Repository](https://github.com/vllm-project/speculators)