[Feat] Adds runai distributed streamer (#27230)

Signed-off-by: bbartels <benjamin@bartels.dev>
Signed-off-by: Benjamin Bartels <benjamin@bartels.dev>
Co-authored-by: omer-dayan <omdayan@nvidia.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Benjamin Bartels authored on 2025-10-30 04:09:10 +00:00; committed by GitHub
parent 2ce5c5d3d6
commit 17d055f527
9 changed files with 39 additions and 11 deletions

@@ -45,6 +45,15 @@ vllm serve s3://core-llm/Llama-3-8b \
You can tune parameters using `--model-loader-extra-config`:
You can set the `distributed` option, which controls whether distributed streaming is used. This is currently supported only on CUDA and ROCm devices, and it can significantly improve loading times from object storage or high-throughput network file shares.
You can read more about distributed streaming [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/usage.md#distributed-streaming).
```bash
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
    --load-format runai_streamer \
    --model-loader-extra-config '{"distributed":true}'
```
You can tune the `concurrency` option, which controls the level of concurrency and the number of OS threads reading tensors from the file into the CPU buffer.
When reading from S3, it is the number of client instances the host opens to the S3 server.
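For example, a sketch of raising the concurrency level when streaming from S3, following the same pattern as the example above (the value 16 is illustrative; substitute your own bucket path and tune the value for your storage backend):

```bash
# Stream from S3 with 16 concurrent readers (illustrative value)
vllm serve s3://core-llm/Llama-3-8b \
    --load-format runai_streamer \
    --model-loader-extra-config '{"concurrency":16}'
```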