---
hide:
- navigation
- toc
---
# Welcome to vLLM
<figure markdown="span">
{ align="center" alt="vLLM Light" class="logo-light" width="60%" }
{ align="center" alt="vLLM Dark" class="logo-dark" width="60%" }
</figure>
<p style="text-align:center">
<strong>Easy, fast, and cheap LLM serving for everyone
</strong>
</p>
<p style="text-align:center">
<script async defer src="https://buttons.github.io/buttons.js"></script>
<a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-show-count="true" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-show-count="true" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
</p>
vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has grown into one of the most active open-source AI projects, built and maintained by a community spanning dozens of academic institutions and companies, with over 2,000 contributors.

Where to get started with vLLM depends on your goals. If you are looking to:

- Run open-source models on vLLM, we recommend starting with the [Quickstart Guide](./getting_started/quickstart.md)
- Build applications with vLLM, we recommend starting with the [User Guide](./usage/README.md)
- Build vLLM itself, we recommend starting with the [Developer Guide](./contributing/README.md)

For information about the development of vLLM, see:

- [Roadmap](https://roadmap.vllm.ai)
- [Releases](https://github.com/vllm-project/vllm/releases)

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests, chunked prefill, prefix caching
- Fast and flexible model execution with piecewise and full CUDA/HIP graphs
- Quantization: FP8, MXFP8/MXFP4, NVFP4, INT8, INT4, GPTQ/AWQ, GGUF, compressed-tensors, ModelOpt, TorchAO, and [more](https://docs.vllm.ai/en/latest/features/quantization/index.html)
- Optimized attention kernels including FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, and Triton
- Optimized GEMM/MoE kernels for various precisions using CUTLASS, TRTLLM-GEN, CuTeDSL
- Speculative decoding including n-gram, suffix, EAGLE, DFlash
- Automatic kernel generation and graph-level transformations using torch.compile
- Disaggregated prefill, decode, and encode
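As a rough illustration of the PagedAttention idea listed above, here is a toy block-table sketch in pure Python. The names and block size are invented for illustration, not vLLM's actual implementation (which lives in its block manager and attention kernels): each sequence's KV cache is split into small fixed-size blocks allocated on demand from a shared pool, so memory grows with the tokens actually cached rather than with a preallocated maximum length.

```python
# Toy sketch of PagedAttention-style KV-cache paging (illustrative only).
BLOCK_SIZE = 4  # tokens per physical KV block (hypothetical value)

class BlockTable:
    """Maps a sequence's logical token positions to physical block IDs."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # shared pool of physical block IDs
        self.blocks = []                # physical blocks owned by this sequence
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1

    def physical_slot(self, pos):
        # Translate a logical position into (block ID, offset within block).
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

pool = list(range(100))   # 100 physical blocks shared by all sequences
seq = BlockTable(pool)
for _ in range(6):        # cache KV for 6 tokens
    seq.append_token()

print(len(seq.blocks))        # 6 tokens need ceil(6/4) = 2 blocks
print(seq.physical_slot(5))   # logical position 5 -> (2nd block, offset 1)
```

Because blocks are allocated lazily from a pool shared by all sequences, memory is wasted only in the last partially filled block of each sequence, which is what makes the large batch sizes behind continuous batching affordable.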

vLLM is flexible and easy to use with:

- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor, pipeline, data, expert, and context parallelism for distributed inference
- Streaming outputs
- Generation of structured outputs using xgrammar or guidance
- Tool calling and reasoning parsers
- OpenAI-compatible API server, plus Anthropic Messages API and gRPC support
- Efficient multi-LoRA support for dense and MoE layers
- Support for NVIDIA GPUs, AMD GPUs, and x86/ARM/PowerPC CPUs, as well as hardware plugins for Google TPU, Intel Gaudi, IBM Spyre, Huawei Ascend, Rebellions NPU, Apple Silicon, MetaX GPU, and more
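To picture the structured-outputs support listed above, here is a toy constrained decoder. It is purely illustrative (vLLM delegates the real work to grammar backends such as XGrammar and llguidance, and operates on tokens and logits rather than characters): at each step, candidates that would violate the "grammar" are masked out before the highest-scoring one is chosen.

```python
# Toy sketch of grammar-constrained decoding (illustrative only).
# A trivial "grammar": the output must be one of a fixed set of strings,
# enforced character by character via an allowed-prefix check.
CHOICES = ["yes", "no", "maybe"]

def allowed_next_chars(prefix):
    """Characters that keep the partial output a prefix of some choice."""
    return {c[len(prefix)] for c in CHOICES
            if c.startswith(prefix) and len(c) > len(prefix)}

def constrained_decode(score):
    """Greedy decode, masking characters the grammar disallows.
    `score` ranks candidate characters (it stands in for model logits)."""
    out = ""
    while out not in CHOICES:
        allowed = allowed_next_chars(out)
        out += max(allowed, key=score)
    return out

# A "model" that loves the letter 'm' still yields a valid choice.
result = constrained_decode(lambda ch: 1.0 if ch == "m" else 0.0)
print(result)  # -> "maybe"
```

The key property carries over to the real thing: whatever the model prefers, masking guarantees the output always satisfies the grammar.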

vLLM seamlessly supports 200+ model architectures on Hugging Face, including:
- Decoder-only LLMs (e.g., Llama, Qwen, Gemma)
- Mixture-of-Expert LLMs (e.g., Mixtral, DeepSeek-V3, Qwen-MoE, GPT-OSS)
- Hybrid attention and state-space models (e.g., Mamba, Qwen3.5)
- Multi-modal models (e.g., LLaVA, Qwen-VL, Pixtral)
- Embedding and retrieval models (e.g., E5-Mistral, GTE, ColBERT)
- Reward and classification models (e.g., Qwen-Math)

Find the full list of supported models [here](./models/supported_models.md).
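The OpenAI-compatible server mentioned above is typically launched with `vllm serve <model>` and then speaks the standard Chat Completions protocol. Below is a minimal request sketch using only the Python standard library; the model name and the default port 8000 are assumptions, and the request is built but not actually sent.

```python
import json
import urllib.request

# Build a Chat Completions request for a locally running vLLM server.
# Assumption: the server was started with something like
#   vllm serve Qwen/Qwen2.5-1.5B-Instruct
# and is listening on the default port 8000.
payload = {
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "What is PagedAttention?"}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With a server running, the next line would return an OpenAI-style response:
# body = json.loads(urllib.request.urlopen(req).read())
print(req.get_full_url())
```

Any OpenAI-compatible client works the same way, for example the official `openai` Python SDK pointed at `base_url="http://localhost:8000/v1"`.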

For more information, check out the following:

- [vLLM announcing blog post](https://blog.vllm.ai/2023/06/20/vllm.html) (intro to PagedAttention)
- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
- [vLLM Meetups](community/meetups.md)