[Doc][CI/Build] Update docs and tests to use vllm serve (#6431)
@@ -35,16 +35,14 @@ To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument wh
 
 .. code-block:: console
 
-    $ python -m vllm.entrypoints.openai.api_server \
-    $     --model facebook/opt-13b \
+    $ vllm serve facebook/opt-13b \
     $     --tensor-parallel-size 4
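A quick sanity check of the new entrypoint once the server is running (a minimal sketch, assuming vLLM's default port 8000 and the OpenAI-compatible ``/v1/models`` route):

.. code-block:: console

    $ # should list facebook/opt-13b among the served models
    $ curl http://localhost:8000/v1/models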
 
 You can additionally specify :code:`--pipeline-parallel-size` to enable pipeline parallelism. For example, to run the API server on 8 GPUs with pipeline parallelism and tensor parallelism:
 
 .. code-block:: console
 
-    $ python -m vllm.entrypoints.openai.api_server \
-    $     --model gpt2 \
+    $ vllm serve gpt2 \
     $     --tensor-parallel-size 4 \
     $     --pipeline-parallel-size 2 \
     $     --distributed-executor-backend ray
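Once this configuration is up, the server can be exercised end to end; a minimal sketch of a completion request, again assuming the default port 8000 and the OpenAI-compatible ``/v1/completions`` route:

.. code-block:: console

    $ # model name must match the served model from the command above
    $ curl http://localhost:8000/v1/completions \
    $     -H "Content-Type: application/json" \
    $     -d '{"model": "gpt2", "prompt": "San Francisco is a", "max_tokens": 16}'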