Write README and front page of doc (#147)
@@ -1,7 +1,21 @@
 Welcome to vLLM!
 ================
 
-vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLM).
+**vLLM** is a fast and easy-to-use library for LLM inference and serving.
+Its core features include:
+
+- State-of-the-art performance in serving throughput
+- Efficient management of attention key and value memory with **PagedAttention**
+- Seamless integration with popular HuggingFace models
+- Dynamic batching of incoming requests
+- Optimized CUDA kernels
+- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
+- Tensor parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server
+
+For more information, please refer to our `blog post <>`_.
+
 
 Documentation
 -------------
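For a sense of what the "fast and easy-to-use" claim in the diff above looks like in practice, here is a minimal offline-inference sketch against vLLM's public Python API (`LLM` and `SamplingParams`); the model name and sampling values are illustrative assumptions, not part of this commit.

```python
# Minimal offline-inference sketch using vLLM's public Python API.
# The model name and sampling values are illustrative, not from this commit.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# SamplingParams controls decoding; the parallel sampling and beam search
# mentioned in the feature list are configured through objects like this.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# LLM loads a HuggingFace model and manages attention key/value memory
# with PagedAttention. Tensor parallelism for distributed inference is
# exposed the same way, via a constructor argument such as
# tensor_parallel_size=2 (an assumption based on later vLLM docs).
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```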
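The "OpenAI-compatible API server" bullet means a running vLLM server accepts the same request shape as OpenAI's completions endpoint. A hedged sketch of querying one over plain HTTP, assuming a server launched separately; the launch command, port, and model name are typical-setup assumptions rather than anything this commit specifies.

```python
# Querying a running vLLM OpenAI-compatible server with plain HTTP.
# Assumes the server was started separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
# The entrypoint, port, and model name are assumptions, not from this commit.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 32,
        "temperature": 0.7,
    },
)
print(response.json()["choices"][0]["text"])
```

Because the request and response formats mirror OpenAI's, existing OpenAI client code can typically be pointed at such a server by changing only the base URL.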