<!-- markdownlint-disable MD001 MD041 -->
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
<img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>
</picture>
</p>
<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>
<p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p>

🔥 We have built a vLLM website to help you get started with vLLM. Please visit [vllm.ai](https://vllm.ai) to learn more.
For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.

---

## About

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill

vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor, pipeline, data, and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPUs, as well as hardware plugins for platforms such as Intel Gaudi, IBM Spyre, and Huawei Ascend
- Prefix caching support
- Multi-LoRA support
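
The OpenAI-compatible server can be queried with the official `openai` Python client. A minimal sketch, assuming a server is already running locally (for example, started with `vllm serve Qwen/Qwen2.5-1.5B-Instruct`; the model name here is only an illustrative choice):

```python
# Query a locally running vLLM OpenAI-compatible server.
# Assumes the server was started with, e.g.:
#   vllm serve Qwen/Qwen2.5-1.5B-Instruct
from openai import OpenAI

# By default vLLM's server does not check API keys, so any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(completion.choices[0].message.content)
```

Because the server speaks the OpenAI API, existing OpenAI-client code can be pointed at vLLM by changing only `base_url`.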

vLLM seamlessly supports most popular open-source models on Hugging Face, including:

- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Experts LLMs (e.g., Mixtral, DeepSeek-V2 and V3)
- Embedding models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).

## Getting Started

Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):

```bash
pip install vllm
```

Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.

- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
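
After installation, the quickstart flow can be sketched with the offline inference API. The model below is only an illustrative choice (any supported Hugging Face model ID works), and running this downloads the model and requires supported hardware:

```python
from vllm import LLM, SamplingParams

# Load a small example model; replace with any supported model ID.
llm = LLM(model="facebook/opt-125m")

# Control generation: softened sampling, capped output length.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["The capital of France is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

The same `LLM` instance batches many prompts in one `generate` call, which is where vLLM's continuous batching and PagedAttention pay off.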

## Contributing

We welcome and value any contributions and collaborations.
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):

```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```

## Contact Us

<!-- --8<-- [start:contact-us] -->

- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)

<!-- --8<-- [end:contact-us] -->

## Media Kit

- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)