[Doc][CI/Build] Update docs and tests to use vllm serve (#6431)
@@ -35,16 +35,14 @@ To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument wh
 
 .. code-block:: console
 
-    $ python -m vllm.entrypoints.openai.api_server \
-    $     --model facebook/opt-13b \
+    $ vllm serve facebook/opt-13b \
     $     --tensor-parallel-size 4
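A quick sanity check of the new entrypoint once the server is running (a minimal sketch, assuming vLLM's default port 8000 and the OpenAI-compatible ``/v1/models`` route):

.. code-block:: console

    $ # should list facebook/opt-13b among the served models
    $ curl http://localhost:8000/v1/models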
 
 You can additionally specify :code:`--pipeline-parallel-size` to enable pipeline parallelism. For example, to run the API server on 8 GPUs with pipeline parallelism and tensor parallelism:
 
 .. code-block:: console
 
-    $ python -m vllm.entrypoints.openai.api_server \
-    $     --model gpt2 \
+    $ vllm serve gpt2 \
     $     --tensor-parallel-size 4 \
     $     --pipeline-parallel-size 2 \
     $     --distributed-executor-backend ray
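Once this configuration is up, the server can be exercised end to end; a minimal sketch of a completion request, again assuming the default port 8000 and the OpenAI-compatible ``/v1/completions`` route:

.. code-block:: console

    $ # model name must match the served model from the command above
    $ curl http://localhost:8000/v1/completions \
    $     -H "Content-Type: application/json" \
    $     -d '{"model": "gpt2", "prompt": "San Francisco is a", "max_tokens": 16}'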