[Doc] Documentation for distributed inference (#261)
@@ -29,7 +29,7 @@ vLLM is fast with:
 * State-of-the-art serving throughput
 * Efficient management of attention key and value memory with **PagedAttention**
-* Dynamic batching of incoming requests
+* Continuous batching of incoming requests
 * Optimized CUDA kernels

 vLLM is flexible and easy to use with:
@@ -40,7 +40,11 @@ vLLM is flexible and easy to use with:
 * Streaming outputs
 * OpenAI-compatible API server

-For more information, please refer to our `blog post <https://vllm.ai>`_.
+For more information, check out the following:
+
+* `vLLM announcing blog post <https://vllm.ai>`_ (intro to PagedAttention)
+* `How continuous batching enables 23x throughput in LLM inference while reducing p50 latency <https://www.anyscale.com/blog/continuous-batching-llm-inference>`_ by Cade Daniel et al.
+

 Documentation
@@ -53,6 +57,12 @@ Documentation
    getting_started/installation
    getting_started/quickstart

+.. toctree::
+   :maxdepth: 1
+   :caption: Serving
+
+   serving/distributed_serving
+
 .. toctree::
    :maxdepth: 1
    :caption: Models
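The hunk above adds a new Serving section to the toctree whose first page, serving/distributed_serving, documents multi-GPU inference. As a rough illustration of the feature being documented (a minimal sketch, not the page's exact content, assuming vLLM's ``LLM`` entrypoint and its ``tensor_parallel_size`` argument; the model name is only an example):

.. code-block:: python

   # Minimal sketch: shard one model across 4 GPUs with tensor parallelism.
   from vllm import LLM, SamplingParams

   # tensor_parallel_size splits the model's weights over the visible GPUs.
   llm = LLM(model="facebook/opt-13b", tensor_parallel_size=4)
   outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
   for out in outputs:
       print(out.outputs[0].text)

The same degree of parallelism is typically passed to the serving entrypoint as a ``--tensor-parallel-size`` flag when launching the API server.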