Write README and front page of doc (#147)

Woosuk Kwon
2023-06-18 03:19:38 -07:00
committed by GitHub
parent bf5f121c02
commit dcda03b4cb
9 changed files with 65 additions and 60 deletions


@@ -3,17 +3,20 @@
 Installation
 ============
 
-vLLM is a Python library that includes some C++ and CUDA code.
-vLLM can run on systems that meet the following requirements:
+vLLM is a Python library that also contains some C++ and CUDA code.
+This additional code requires compilation on the user's machine.
 
 Requirements
 ------------
 
 * OS: Linux
 * Python: 3.8 or higher
 * CUDA: 11.0 -- 11.8
-* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, etc.)
+* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, etc.)
 
 .. note::
     As of now, vLLM does not support CUDA 12.
-    If you are using Hopper or Lovelace GPUs, please use CUDA 11.8.
+    If you are using Hopper or Lovelace GPUs, please use CUDA 11.8 instead of CUDA 12.
 
 .. tip::
     If you have trouble installing vLLM, we recommend using the NVIDIA PyTorch Docker image.
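Before compiling, it can help to confirm that the local environment actually meets the requirements above. The following is a minimal sketch, assuming PyTorch with CUDA support is already installed:

.. code-block:: python

    import sys
    import torch

    # Python 3.8 or higher (see the requirements list above).
    assert sys.version_info >= (3, 8), "vLLM requires Python 3.8+"

    # CUDA version that PyTorch was built with; vLLM needs 11.0 -- 11.8.
    print("CUDA:", torch.version.cuda)

    # GPU compute capability must be 7.0 or higher (e.g., V100 is 7.0, A100 is 8.0).
    major, minor = torch.cuda.get_device_capability()
    assert (major, minor) >= (7, 0), "GPU compute capability must be >= 7.0"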
@@ -45,7 +48,7 @@ You can install vLLM using pip:
 Build from source
 -----------------
 
-You can also build and install vLLM from source.
+You can also build and install vLLM from source:
 
 .. code-block:: console


@@ -1,7 +1,21 @@
 Welcome to vLLM!
 ================
 
-vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLMs).
+**vLLM** is a fast and easy-to-use library for LLM inference and serving.
+
+Its core features include:
+
+- State-of-the-art serving throughput
+- Efficient management of attention key and value memory with **PagedAttention**
+- Seamless integration with popular HuggingFace models
+- Dynamic batching of incoming requests
+- Optimized CUDA kernels
+- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
+- Tensor parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server
+
+For more information, please refer to our `blog post <>`_.
 
 Documentation
 -------------
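The quickest way to see these features in action is vLLM's offline inference API. A minimal usage sketch (the model name is illustrative; any supported HuggingFace model works):

.. code-block:: python

    from vllm import LLM, SamplingParams

    # Load a small model for illustration; downloads from HuggingFace on first use.
    llm = LLM(model="facebook/opt-125m")

    # Illustrative sampling settings.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    outputs = llm.generate(["Hello, my name is"], sampling_params)
    for output in outputs:
        print(output.prompt, output.outputs[0].text)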


@@ -3,7 +3,7 @@
 Supported Models
 ================
 
-vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://github.com/huggingface/transformers>`_.
+vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://huggingface.co/models>`_.
 The following is the list of model architectures that are currently supported by vLLM.
 Alongside each architecture, we include some popular models that use it.
 
@@ -18,7 +18,7 @@ Alongside each architecture, we include some popular models that use it.
   * - :code:`GPTNeoXForCausalLM`
     - GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
   * - :code:`LlamaForCausalLM`
-    - LLaMA, Vicuna, Alpaca, Koala
+    - LLaMA, Vicuna, Alpaca, Koala, Guanaco
   * - :code:`OPTForCausalLM`
     - OPT, OPT-IML
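A checkpoint is matched to one of these implementations through the :code:`architectures` field of its HuggingFace config, so one way to check support up front is to compare that field against the table above. A hedged sketch (the set below lists only the architectures visible in this excerpt, and :code:`transformers` is assumed to be installed):

.. code-block:: python

    from transformers import AutoConfig

    # Only the architectures visible in the table excerpt above.
    SUPPORTED = {"GPTNeoXForCausalLM", "LlamaForCausalLM", "OPTForCausalLM"}

    config = AutoConfig.from_pretrained("facebook/opt-125m")
    architectures = config.architectures or []  # e.g., ["OPTForCausalLM"]

    if any(arch in SUPPORTED for arch in architectures):
        print("Architecture appears in the supported list.")
    else:
        print("Architecture not in this excerpt; check the full table.")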