[Feat] Adds runai distributed streamer (#27230)

Signed-off-by: bbartels <benjamin@bartels.dev>
Signed-off-by: Benjamin Bartels <benjamin@bartels.dev>
Co-authored-by: omer-dayan <omdayan@nvidia.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Benjamin Bartels authored on 2025-10-30 04:09:10 +00:00; committed by GitHub
parent 2ce5c5d3d6
commit 17d055f527
9 changed files with 39 additions and 11 deletions

@@ -45,6 +45,15 @@ vllm serve s3://core-llm/Llama-3-8b \
You can tune parameters using `--model-loader-extra-config`:
You can set the `distributed` option, which controls whether distributed streaming is used. This is currently supported only on CUDA and ROCm devices, and it can significantly improve loading times from object storage or high-throughput network file shares.
You can read more about distributed streaming [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/usage.md#distributed-streaming).
```bash
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
    --load-format runai_streamer \
    --model-loader-extra-config '{"distributed":true}'
```
You can tune the `concurrency` option, which controls the level of concurrency and the number of OS threads reading tensors from the file into the CPU buffer.
When reading from S3, it is the number of client instances the host opens to the S3 server.
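For example, a sketch of raising the concurrency level when streaming from S3, following the same pattern as the example above (the value 16 is illustrative; substitute your own bucket path and tune the value for your storage backend):

```bash
# Stream from S3 with 16 concurrent readers (illustrative value)
vllm serve s3://core-llm/Llama-3-8b \
    --load-format runai_streamer \
    --model-loader-extra-config '{"concurrency":16}'
```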