<!-- markdownlint-disable MD001 MD041 -->
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
<img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>
</picture>
</p>
<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>
<p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p>

🔥 We have built the vLLM website to help you get started with vLLM. Please visit [vllm.ai](https://vllm.ai) to learn more.
For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.

---

## About

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has grown into one of the most active open-source AI projects, built and maintained by a diverse community spanning dozens of academic institutions and companies, with over 2,000 contributors.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests, chunked prefill, prefix caching
- Fast and flexible model execution with piecewise and full CUDA/HIP graphs
- Quantization: FP8, MXFP8/MXFP4, NVFP4, INT8, INT4, GPTQ/AWQ, GGUF, compressed-tensors, ModelOpt, TorchAO, and [more](https://docs.vllm.ai/en/latest/features/quantization/index.html)
- Optimized attention kernels including FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, and Triton
- Optimized GEMM/MoE kernels for various precisions using CUTLASS, TRTLLM-GEN, CuTeDSL
- Speculative decoding including n-gram, suffix, EAGLE, DFlash
- Automatic kernel generation and graph-level transformations using torch.compile
- Disaggregated prefill, decode, and encode

vLLM is flexible and easy to use with:

- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor, pipeline, data, expert, and context parallelism for distributed inference
- Streaming outputs
- Generation of structured outputs using xgrammar or guidance
- Tool calling and reasoning parsers
- OpenAI-compatible API server, plus Anthropic Messages API and gRPC support
- Efficient multi-LoRA support for dense and MoE layers
- Support for NVIDIA GPUs, AMD GPUs, and x86/ARM/PowerPC CPUs. Additionally, diverse hardware plugins such as Google TPUs, Intel Gaudi, IBM Spyre, Huawei Ascend, Rebellions NPU, Apple Silicon, MetaX GPU, and more.

vLLM seamlessly supports 200+ model architectures on Hugging Face, including:

- Decoder-only LLMs (e.g., Llama, Qwen, Gemma)
- Mixture-of-Experts LLMs (e.g., Mixtral, DeepSeek-V3, Qwen-MoE, GPT-OSS)
- Hybrid attention and state-space models (e.g., Mamba, Qwen3.5)
- Multi-modal models (e.g., LLaVA, Qwen-VL, Pixtral)
- Embedding and retrieval models (e.g., E5-Mistral, GTE, ColBERT)
- Reward and classification models (e.g., Qwen-Math)

Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).

## Getting Started

Install vLLM with [`uv`](https://docs.astral.sh/uv/) (recommended) or `pip`:

```bash
uv pip install vllm
```

Or [build from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source) for development.

Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.

- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)

## Contributing

We welcome and value any contributions and collaborations.
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
```bibtex
@inproceedings{kwon2023efficient,
title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
year={2023}
}
```

## Contact Us

<!-- --8<-- [start:contact-us] -->
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
- For discussions with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
<!-- --8<-- [end:contact-us] -->

## Media Kit

- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)