<!-- markdownlint-disable MD001 MD041 -->
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
<img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>
</picture>
</p>
<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>
<p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p>

🔥 We have built a vLLM website to help you get started with vLLM. Please visit [vllm.ai](https://vllm.ai) to learn more.
For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.

---

## About

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill

vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor, pipeline, data, and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPUs, as well as hardware plugins for platforms such as Intel Gaudi, IBM Spyre, and Huawei Ascend
- Prefix caching support
- Multi-LoRA support
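
The OpenAI-compatible server can be queried with the official `openai` Python client. A minimal sketch, assuming a server is already running locally (for example, started with `vllm serve Qwen/Qwen2.5-1.5B-Instruct`; the model name here is only an illustrative choice):

```python
# Query a locally running vLLM OpenAI-compatible server.
# Assumes the server was started with, e.g.:
#   vllm serve Qwen/Qwen2.5-1.5B-Instruct
from openai import OpenAI

# By default vLLM's server does not check API keys, so any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(completion.choices[0].message.content)
```

Because the server speaks the OpenAI API, existing OpenAI-client code can be pointed at vLLM by changing only `base_url`.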

vLLM seamlessly supports most popular open-source models on Hugging Face, including:

- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Experts LLMs (e.g., Mixtral, DeepSeek-V2 and V3)
- Embedding models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).

## Getting Started

Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):

```bash
pip install vllm
```

Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.

- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
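
After installation, the quickstart flow can be sketched with the offline inference API. The model below is only an illustrative choice (any supported Hugging Face model ID works), and running this downloads the model and requires supported hardware:

```python
from vllm import LLM, SamplingParams

# Load a small example model; replace with any supported model ID.
llm = LLM(model="facebook/opt-125m")

# Control generation: softened sampling, capped output length.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["The capital of France is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

The same `LLM` instance batches many prompts in one `generate` call, which is where vLLM's continuous batching and PagedAttention pay off.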

## Contributing

We welcome and value any contributions and collaborations.
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):

```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```

## Contact Us

<!-- --8<-- [start:contact-us] -->

- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)

<!-- --8<-- [end:contact-us] -->

## Media Kit

- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)