[Bugfix] Fix multi nodes TP+PP for XPU (#8884)
Signed-off-by: YiSheng5 <syhm@mail.ustc.edu.cn>
Signed-off-by: yan ma <yan.ma@intel.com>
Co-authored-by: YiSheng5 <syhm@mail.ustc.edu.cn>
@@ -60,3 +60,21 @@ Build from source
- FP16 is the default data type in the current XPU backend. The BF16 data
  type will be supported in the future.

Distributed inference and serving
---------------------------------

The XPU platform supports tensor-parallel inference and serving, and also supports pipeline parallelism as a beta feature for online serving. Ray is required as the distributed runtime backend. For example, a reference execution looks like the following:

.. code-block:: console

    $ python -m vllm.entrypoints.openai.api_server \
         --model=facebook/opt-13b \
         --dtype=bfloat16 \
         --device=xpu \
         --max_model_len=1024 \
         --distributed-executor-backend=ray \
         --pipeline-parallel-size=2 \
         -tp=8

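Once the server started by the command above is running, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal sketch using only the Python standard library. It assumes the server listens on vLLM's default ``localhost:8000`` (the command above sets neither ``--host`` nor ``--port``); the helper names ``build_completion_request`` and ``send``, the prompt, and the ``max_tokens`` value are illustrative choices, not part of the original documentation.

.. code-block:: python

    import json
    import urllib.request

    # Assumption: default host/port of the vLLM OpenAI-compatible server.
    BASE_URL = "http://localhost:8000"


    def build_completion_request(prompt, model="facebook/opt-13b", max_tokens=32):
        """Build (but do not send) a POST request for the /v1/completions endpoint."""
        payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
        return urllib.request.Request(
            f"{BASE_URL}/v1/completions",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )


    def send(request):
        """Send the request and return the decoded JSON response.

        This needs a running server, so it is not called here.
        """
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))


    request = build_completion_request("Hello, my name is")
    # With the server running: print(send(request)["choices"][0]["text"])

The ``model`` field should match the value passed to ``--model`` when the server was launched.
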
By default, a Ray instance is launched automatically if no existing one is detected on the system, with ``num-gpus`` equal to ``parallel_config.world_size``. We recommend starting a Ray cluster properly before execution; see the helper `script <https://github.com/vllm-project/vllm/tree/main/examples/run_cluster.sh>`_.
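
To make the ``num-gpus`` default concrete, the sketch below works through the arithmetic for the reference command, on the understanding that the world size is the product of the tensor-parallel and pipeline-parallel sizes. The function names and the figure of 8 devices per node are hypothetical and used only for illustration; adjust them to the actual cluster.

.. code-block:: python

    import math


    def world_size(tensor_parallel_size, pipeline_parallel_size):
        """Total number of worker devices: one per (TP rank, PP stage) pair."""
        return tensor_parallel_size * pipeline_parallel_size


    def nodes_needed(total_devices, devices_per_node):
        """Smallest number of nodes that can host all worker devices."""
        return math.ceil(total_devices / devices_per_node)


    # Values from the reference command: -tp=8 and --pipeline-parallel-size=2.
    total = world_size(tensor_parallel_size=8, pipeline_parallel_size=2)
    print(total)  # 16

    # Assumption: 8 XPU devices per node (hypothetical hardware layout).
    print(nodes_needed(total, devices_per_node=8))  # 2

Under these assumptions the Ray cluster has to expose at least 16 devices in total, which is why a multi-node cluster should be started before launching the server.
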