[Feat] Adds runai distributed streamer (#27230)
Signed-off-by: bbartels <benjamin@bartels.dev> Signed-off-by: Benjamin Bartels <benjamin@bartels.dev> Co-authored-by: omer-dayan <omdayan@nvidia.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
This commit is contained in:
@@ -45,6 +45,15 @@ vllm serve s3://core-llm/Llama-3-8b \
|
||||
|
||||
You can tune parameters using `--model-loader-extra-config`:
|
||||
|
||||
You can tune `distributed` that controls whether distributed streaming should be used. This is currently only possible on CUDA and ROCM devices. This can significantly improve loading times from object storage or high-throughput network fileshares.
|
||||
You can read further about Distributed streaming [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/usage.md#distributed-streaming)
|
||||
|
||||
```bash
|
||||
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
|
||||
--load-format runai_streamer \
|
||||
--model-loader-extra-config '{"distributed":true}'
|
||||
```
|
||||
|
||||
You can tune `concurrency` that controls the level of concurrency and number of OS threads reading tensors from the file to the CPU buffer.
|
||||
For reading from S3, it will be the number of client instances the host is opening to the S3 server.
|
||||
|
||||
|
||||
Reference in New Issue
Block a user