---
hide:
- navigation
- toc
---
# Welcome to vLLM
<figure markdown="span">
{ align="center" alt="vLLM Light" class="logo-light" width="60%" }
{ align="center" alt="vLLM Dark" class="logo-dark" width="60%" }
</figure>
<p style="text-align:center">
<strong>Easy, fast, and cheap LLM serving for everyone
</strong>
</p>
<p style="text-align:center">
<script async defer src="https://buttons.github.io/buttons.js"></script>
<a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-show-count="true" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-show-count="true" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
</p>
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
Where you should start with vLLM depends on what you want to do. If you are looking to:

- Run open-source models on vLLM, we recommend starting with the [Quickstart Guide](./getting_started/quickstart.md)
- Build applications with vLLM, we recommend starting with the [User Guide](./usage)
- Build vLLM itself, we recommend starting with the [Developer Guide](./contributing)
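As a first taste of the Quickstart, spinning up an OpenAI-compatible server is typically two commands. This is a CLI sketch, not a full install guide; the model name below is just an example, and any supported HuggingFace model works:

```shell
pip install vllm                        # assumes a supported accelerator setup
vllm serve Qwen/Qwen2.5-1.5B-Instruct   # starts an OpenAI-compatible server on port 8000
```

Once the server is up, `curl http://localhost:8000/v1/models` should list the served model.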
For information about the development of vLLM, see:
- [Roadmap](https://roadmap.vllm.ai)
- [Releases](https://github.com/vllm-project/vllm/releases)
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graphs
- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
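The core PagedAttention idea can be illustrated with a toy allocator: the KV cache lives in fixed-size blocks, and each sequence keeps a block table mapping its logical positions to physical blocks, so memory is claimed one block at a time instead of as one large contiguous reservation. This is a simplified sketch for intuition only, not vLLM's actual allocator (names and the block size are illustrative):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class ToyBlockAllocator:
    """Toy paged KV-cache bookkeeping: block tables over a shared block pool."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq id -> list of physical block ids
        self.lengths = {}                    # seq id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, grabbing a new block only when full."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # all of this sequence's blocks are full
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_sequence(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = ToyBlockAllocator(num_blocks=8)
for _ in range(20):  # a 20-token sequence needs ceil(20/16) = 2 blocks
    alloc.append_token("seq0")
print(len(alloc.tables["seq0"]), len(alloc.free))  # → 2 6
```

Because unused tail space is bounded by one block per sequence, many more sequences fit in the same memory, which is what enables the continuous batching listed above.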
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPUs, and AWS Trainium and Inferentia accelerators
- Prefix caching support
- Multi-LoRA support
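Because the server speaks the OpenAI chat-completions protocol, clients only need to build standard request bodies. A minimal sketch of such a request (the model name and port are placeholders for whatever `vllm serve` launched):

```python
import json

# Endpoint of a locally running `vllm serve` (8000 is the default port).
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "Qwen/Qwen2.5-1.5B-Instruct",  # must match the served model
    "messages": [{"role": "user", "content": "What is PagedAttention?"}],
    "max_tokens": 64,
    "stream": True,  # stream tokens back as server-sent events, OpenAI-style
}
body = json.dumps(payload)
print(body)
```

Any OpenAI-compatible client, including the official `openai` SDK pointed at the URL above, can send this request unchanged.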
For more information, check out the following:
- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
- [vLLM Meetups](community/meetups.md)