Write README and front page of doc (#147)

@@ -3,17 +3,20 @@
Installation
============

-vLLM is a Python library that includes some C++ and CUDA code.
-vLLM can run on systems that meet the following requirements:
+vLLM is a Python library that also contains some C++ and CUDA code.
+This additional code requires compilation on the user's machine.

Requirements
------------

* OS: Linux
* Python: 3.8 or higher
* CUDA: 11.0 -- 11.8
-* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, etc.)
+* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, etc.)

.. note::
    As of now, vLLM does not support CUDA 12.
-    If you are using Hopper or Lovelace GPUs, please use CUDA 11.8.
+    If you are using Hopper or Lovelace GPUs, please use CUDA 11.8 instead of CUDA 12.

.. tip::
    If you have trouble installing vLLM, we recommend using the NVIDIA PyTorch Docker image.
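
For readers who want to confirm the requirements and the CUDA note above before installing, a small PyTorch-based check is one option. This is only an illustrative sketch, not part of the documented steps; it assumes PyTorch with CUDA support is already present in the environment (for example, inside the NVIDIA PyTorch container mentioned in the tip).

.. code-block:: python

    # Illustrative check of the requirements above (assumes PyTorch is installed).
    import sys

    import torch

    print("Python:", sys.version.split()[0])            # expect 3.8 or higher
    print("CUDA seen by PyTorch:", torch.version.cuda)  # expect 11.0 -- 11.8, not 12.x

    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(f"GPU compute capability: {major}.{minor}")  # expect 7.0 or higher
    else:
        print("No CUDA-capable GPU detected.")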

@@ -45,7 +48,7 @@ You can install vLLM using pip:
Build from source
-----------------

-You can also build and install vLLM from source.
+You can also build and install vLLM from source:

.. code-block:: console

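Whichever route is used (pip or a source build), a quick import check confirms that the package installed correctly and reports its version. The snippet below is a generic sanity check using only the standard library, not an official vLLM command:

.. code-block:: python

    # Generic post-install sanity check (standard library only).
    from importlib.metadata import version

    import vllm  # fails here if the package did not install correctly

    print("vLLM version:", version("vllm"))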

@@ -1,7 +1,21 @@
Welcome to vLLM!
================

-vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLM).
+**vLLM** is a fast and easy-to-use library for LLM inference and serving.
+Its core features include:

+- State-of-the-art performance in serving throughput
+- Efficient management of attention key and value memory with **PagedAttention**
+- Seamless integration with popular HuggingFace models
+- Dynamic batching of incoming requests
+- Optimized CUDA kernels
+- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
+- Tensor parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server

+For more information, please refer to our `blog post <>`_.
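
To make the feature list above concrete, here is a minimal offline-generation sketch using vLLM's Python API. The model name is just an illustrative choice of a supported HuggingFace model, and defaults are used for everything else:

.. code-block:: python

    # Minimal offline inference sketch with vLLM (model name is illustrative).
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")  # any supported HuggingFace model
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    outputs = llm.generate(["The capital of France is"], params)
    for output in outputs:
        print(output.outputs[0].text)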

Documentation
-------------

@@ -3,7 +3,7 @@
Supported Models
================

-vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://github.com/huggingface/transformers>`_.
+vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://huggingface.co/models>`_.
The following is the list of model architectures that are currently supported by vLLM.
Alongside each architecture, we include some popular models that use it.

@@ -18,7 +18,7 @@ Alongside each architecture, we include some popular models that use it.
* - :code:`GPTNeoXForCausalLM`
  - GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
* - :code:`LlamaForCausalLM`
-  - LLaMA, Vicuna, Alpaca, Koala
+  - LLaMA, Vicuna, Alpaca, Koala, Guanaco
* - :code:`OPTForCausalLM`
  - OPT, OPT-IML
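
One way to check whether a given HuggingFace model maps to one of the architectures in this table is to inspect its config. The sketch below uses the `transformers` library for that lookup; the model name and the supported-architecture set shown are illustrative, not exhaustive:

.. code-block:: python

    # Sketch: look up a model's architecture and compare it against the table above.
    from transformers import AutoConfig

    # A few architectures from the table; extend as needed.
    SUPPORTED = {"GPTNeoXForCausalLM", "LlamaForCausalLM", "OPTForCausalLM"}

    config = AutoConfig.from_pretrained("facebook/opt-1.3b")  # illustrative model
    architectures = set(config.architectures or [])

    if architectures & SUPPORTED:
        print("Supported by vLLM:", architectures & SUPPORTED)
    else:
        print("Not in the supported list:", architectures)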