README.md

<!-- markdownlint-disable MD001 MD041 -->
<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
    <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>
  </picture>
</p>

<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>

<p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p>

🔥 We have built a vllm website to help you get started with vllm. Please visit [vllm.ai](https://vllm.ai) to learn more.
For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.

---

## About

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has grown into one of the most active open-source AI projects built and maintained by a diverse community of many dozens of academic institutions and companies from over 2000 contributors.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests, chunked prefill, prefix caching
- Fast and flexible model execution with piecewise and full CUDA/HIP graphs
- Quantization: FP8, MXFP8/MXFP4, NVFP4, INT8, INT4, GPTQ/AWQ, GGUF, compressed-tensors, ModelOpt, TorchAO, and [more](https://docs.vllm.ai/en/latest/features/quantization/index.html)
- Optimized attention kernels including FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, and Triton
- Optimized GEMM/MoE kernels for various precisions using CUTLASS, TRTLLM-GEN, CuTeDSL
- Speculative decoding including n-gram, suffix, EAGLE, DFlash
- Automatic kernel generation and graph-level transformations using torch.compile
- Disaggregated prefill, decode, and encode

vLLM is flexible and easy to use with:

- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor, pipeline, data, expert, and context parallelism for distributed inference
- Streaming outputs
- Generation of structured outputs using xgrammar or guidance
- Tool calling and reasoning parsers
- OpenAI-compatible API server, plus Anthropic Messages API and gRPC support
- Efficient multi-LoRA support for dense and MoE layers
- Support for NVIDIA GPUs, AMD GPUs, and x86/ARM/PowerPC CPUs. Additionally, diverse hardware plugins such as Google TPUs, Intel Gaudi, IBM Spyre, Huawei Ascend, Rebellions NPU, Apple Silicon, MetaX GPU, and more.

vLLM seamlessly supports 200+ model architectures on HuggingFace, including:

- Decoder-only LLMs (e.g., Llama, Qwen, Gemma)
- Mixture-of-Expert LLMs (e.g., Mixtral, DeepSeek-V3, Qwen-MoE, GPT-OSS)
- Hybrid attention and state-space models (e.g., Mamba, Qwen3.5)
- Multi-modal models (e.g., LLaVA, Qwen-VL, Pixtral)
- Embedding and retrieval models (e.g., E5-Mistral, GTE, ColBERT)
- Reward and classification models (e.g., Qwen-Math)

Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).

## Getting Started

Install vLLM with [`uv`](https://docs.astral.sh/uv/) (recommended) or `pip`:

```bash
uv pip install vllm
```

Or [build from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source) for development.

Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.

- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)

## Contributing

We welcome and value any contributions and collaborations.
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):

```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```

## Contact Us

<!-- --8<-- [start:contact-us] -->
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
<!-- --8<-- [end:contact-us] -->

## Media Kit

- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)
[Docs] Switch to better markdown linting pre-commit hook (#21851) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2025-07-30 03:45:08 +01:00			`<!-- markdownlint-disable MD001 MD041 -->`
Add logo and polish readme (#156) 2023-06-19 16:31:13 +08:00			`<p align="center">`
			`<picture>`
[Doc] Update README links, mark external links (#18635) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> 2025-05-24 17:57:15 +08:00			`<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">`
			`<img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>`
Add logo and polish readme (#156) 2023-06-19 16:31:13 +08:00			`</picture>`
			`</p>`
Add README 2023-02-24 12:04:49 +00:00
Add logo and polish readme (#156) 2023-06-19 16:31:13 +08:00			`<h3 align="center">`
			`Easy, fast, and cheap LLM serving for everyone`
			`</h3>`
Add README 2023-02-24 12:04:49 +00:00
Add logo and polish readme (#156) 2023-06-19 16:31:13 +08:00			`<p align="center">`
[Doc] Fix link to vLLM blog (#16519) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> 2025-04-11 20:39:23 -04:00			`\| <a href="https://docs.vllm.ai"><b>Documentation</b></a> \| <a href="https://blog.vllm.ai/"><b>Blog</b></a> \| <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> \| <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> \| <a href="https://discuss.vllm.ai"><b>User Forum</b></a> \| <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> \|`
Add logo and polish readme (#156) 2023-06-19 16:31:13 +08:00			`</p>`
Add README 2023-02-24 12:04:49 +00:00
Migrate meetups & sponsors [2/N] (#31500) Signed-off-by: esmeetu <jasonailu87@gmail.com> 2025-12-30 12:26:15 +08:00			`🔥 We have built a vllm website to help you get started with vllm. Please visit [vllm.ai](https://vllm.ai) to learn more.`
			`For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.`
[doc] Add back previous news (#15331) Signed-off-by: Chen Zhang <zhangch99@outlook.com> 2025-03-23 08:38:33 +08:00
			`---`
[Docs] Switch to better markdown linting pre-commit hook (#21851) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2025-07-30 03:45:08 +01:00
[Docs] Add "About" Heading to README.md (#2260) 2023-12-25 17:37:07 -07:00			`## About`
[CI/Build] Auto-fix Markdown files (#12941) 2025-02-08 20:25:15 +08:00
[Docs] Minor fix (#162) 2023-06-19 19:58:23 -07:00			`vLLM is a fast and easy-to-use library for LLM inference and serving.`
FastAPI-based working frontend (#10) 2023-03-29 14:48:56 +08:00
[Docs] Update README (#39251) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-04-08 05:34:07 +02:00			`Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has grown into one of the most active open-source AI projects built and maintained by a diverse community of many dozens of academic institutions and companies from over 2000 contributors.`
[Docs] Add Sky Computing Lab to project intro (#12019) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> 2025-01-13 17:24:36 -08:00
Add logo and polish readme (#156) 2023-06-19 16:31:13 +08:00			`vLLM is fast with:`
FastAPI-based working frontend (#10) 2023-03-29 14:48:56 +08:00
Add logo and polish readme (#156) 2023-06-19 16:31:13 +08:00			`- State-of-the-art serving throughput`
[Doc][5/N] Move Community and API Reference to the bottom (#11896) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Simon Mo <simon.mo@hey.com> 2025-01-10 11:10:12 +08:00			`- Efficient management of attention key and value memory with [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html)`
[Docs] Update README (#39251) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-04-08 05:34:07 +02:00			`- Continuous batching of incoming requests, chunked prefill, prefix caching`
			`- Fast and flexible model execution with piecewise and full CUDA/HIP graphs`
			`- Quantization: FP8, MXFP8/MXFP4, NVFP4, INT8, INT4, GPTQ/AWQ, GGUF, compressed-tensors, ModelOpt, TorchAO, and [more](https://docs.vllm.ai/en/latest/features/quantization/index.html)`
			`- Optimized attention kernels including FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, and Triton`
			`- Optimized GEMM/MoE kernels for various precisions using CUTLASS, TRTLLM-GEN, CuTeDSL`
			`- Speculative decoding including n-gram, suffix, EAGLE, DFlash`
			`- Automatic kernel generation and graph-level transformations using torch.compile`
			`- Disaggregated prefill, decode, and encode`
Add logo and polish readme (#156) 2023-06-19 16:31:13 +08:00
			`vLLM is flexible and easy to use with:`

Fix typo in README.md (#1033) 2023-09-14 04:55:23 +09:00			`- Seamless integration with popular Hugging Face models`
Add logo and polish readme (#156) 2023-06-19 16:31:13 +08:00			`- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more`
[Docs] Update README (#39251) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-04-08 05:34:07 +02:00			`- Tensor, pipeline, data, expert, and context parallelism for distributed inference`
Write README and front page of doc (#147) 2023-06-18 03:19:38 -07:00			`- Streaming outputs`
[Docs] Update README (#39251) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-04-08 05:34:07 +02:00			`- Generation of structured outputs using xgrammar or guidance`
			`- Tool calling and reasoning parsers`
			`- OpenAI-compatible API server, plus Anthropic Messages API and gRPC support`
			`- Efficient multi-LoRA support for dense and MoE layers`
			`- Support for NVIDIA GPUs, AMD GPUs, and x86/ARM/PowerPC CPUs. Additionally, diverse hardware plugins such as Google TPUs, Intel Gaudi, IBM Spyre, Huawei Ascend, Rebellions NPU, Apple Silicon, MetaX GPU, and more.`
FastAPI-based working frontend (#10) 2023-03-29 14:48:56 +08:00
[Docs] Update README (#39251) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-04-08 05:34:07 +02:00			`vLLM seamlessly supports 200+ model architectures on HuggingFace, including:`
[Docs] Switch to better markdown linting pre-commit hook (#21851) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2025-07-30 03:45:08 +01:00
[Docs] Update README (#39251) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-04-08 05:34:07 +02:00			`- Decoder-only LLMs (e.g., Llama, Qwen, Gemma)`
			`- Mixture-of-Expert LLMs (e.g., Mixtral, DeepSeek-V3, Qwen-MoE, GPT-OSS)`
			`- Hybrid attention and state-space models (e.g., Mamba, Qwen3.5)`
			`- Multi-modal models (e.g., LLaVA, Qwen-VL, Pixtral)`
			`- Embedding and retrieval models (e.g., E5-Mistral, GTE, ColBERT)`
			`- Reward and classification models (e.g., Qwen-Math)`
[Doc] Shorten README by removing supported model list (#4796) 2024-05-13 16:23:54 -07:00
			`Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).`

			`## Getting Started`
Add and list supported models in README (#161) 2023-06-20 10:57:46 +08:00
[Docs] Update README (#39251) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-04-08 05:34:07 +02:00			Install vLLM with [`uv`](https://docs.astral.sh/uv/) (recommended) or `pip`:
Add logo and polish readme (#156) 2023-06-19 16:31:13 +08:00
			```bash
[Docs] Update README (#39251) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-04-08 05:34:07 +02:00			`uv pip install vllm`
Add logo and polish readme (#156) 2023-06-19 16:31:13 +08:00			```

[Docs] Update README (#39251) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-04-08 05:34:07 +02:00			`Or [build from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source) for development.`

[Doc] Fix build from source and installation link in README.md (#12013) Signed-off-by: Yikun <yikunkero@gmail.com> 2025-01-14 01:23:59 +08:00			`Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.`
[Docs] Switch to better markdown linting pre-commit hook (#21851) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2025-07-30 03:45:08 +01:00
[Docs] Make installation URLs nicer (#14556) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2025-03-10 18:43:08 +01:00			`- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)`
[Doc] Fix build from source and installation link in README.md (#12013) Signed-off-by: Yikun <yikunkero@gmail.com> 2025-01-14 01:23:59 +08:00			`- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)`
			`- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)`
Add logo and polish readme (#156) 2023-06-19 16:31:13 +08:00
Write README and front page of doc (#147) 2023-06-18 03:19:38 -07:00			`## Contributing`
Modify README to include info on loading LLaMA (#18) 2023-04-01 01:07:57 +08:00
Write README and front page of doc (#147) 2023-06-18 03:19:38 -07:00			`We welcome and value any contributions and collaborations.`
[doc] fix broken links (#18671) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> 2025-05-25 16:36:33 +08:00			`Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.`
Announce paper release (#1036) 2023-09-13 17:38:13 -07:00
			`## Citation`

			`If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):`
[CI/Build] Auto-fix Markdown files (#12941) 2025-02-08 20:25:15 +08:00
Announce paper release (#1036) 2023-09-13 17:38:13 -07:00			```bibtex
			`@inproceedings{kwon2023efficient,`
[Community] Add vLLM Discord server (#1086) 2023-09-18 12:23:35 -07:00			`title={Efficient Memory Management for Large Language Model Serving with PagedAttention},`
Announce paper release (#1036) 2023-09-13 17:38:13 -07:00			`author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},`
			`booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},`
			`year={2023}`
			`}`
			```
Add NVIDIA Meetup slides, announce AMD meetup, and add contact info (#8319) 2024-09-09 23:21:00 -07:00
			`## Contact Us`

[doc] use snippets for contact us (#19944) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> 2025-06-22 18:26:13 +08:00			`<!-- --8<-- [start:contact-us] -->`
[Misc] remove GH discussions link (#22722) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> 2025-08-12 18:15:33 +08:00			`- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)`
Add user forum to README (#15220) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2025-03-20 14:39:51 +00:00			`- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)`
docs: fix Slack bulletpoint in README (#19811) Signed-off-by: Nathan Weinberg <nweinber@redhat.com> 2025-06-18 16:47:08 -04:00			`- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)`
Add user forum to README (#15220) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2025-03-20 14:39:51 +00:00			`- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature`
Migrate meetups & sponsors [2/N] (#31500) Signed-off-by: esmeetu <jasonailu87@gmail.com> 2025-12-30 12:26:15 +08:00			`- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)`
[doc] use snippets for contact us (#19944) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> 2025-06-22 18:26:13 +08:00			`<!-- --8<-- [end:contact-us] -->`
[Docs] Add media kit (#11121) 2024-12-11 17:33:11 -08:00
			`## Media Kit`

[Doc] Readme standardization (#18695) Co-authored-by: Soren Dreano <soren@numind.ai> 2025-06-03 20:50:55 +02:00			`- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)`