[Performance] Add prefetch for checkpoints to OS page cache (#36012)

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Artem Perevedentsev, 2026-03-16 13:32:02 +02:00 (committed by GitHub)
parent 9b005edc48
commit f5e59ee7a6
2 changed files with 76 additions and 1 deletion


@@ -62,6 +62,9 @@ class LoadConfig:
     This is recommended for models on network filesystems (e.g., Lustre, NFS)
     as it avoids inefficient random reads, significantly speeding up model
     initialization. However, it uses more CPU RAM.
+    - "prefetch": Checkpoint files are read into the OS page cache before
+      workers load them, speeding up the model loading phase. Useful on
+      network or high-latency storage.
     - "torchao": Weights are loaded in upfront and then reconstructed
       into torchao tensor subclasses. This is used when the checkpoint
       was quantized using torchao and saved using safetensors.
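The "prefetch" mode described above amounts to warming the OS page cache before the workers open the checkpoint files, so that the workers' subsequent reads are served from RAM rather than from slow network storage. A minimal sketch of that idea (the helper name and chunk size here are hypothetical, not vLLM's actual implementation):

```python
import os


def prefetch_files(paths, chunk_size=16 * 1024 * 1024):
    """Sequentially read files so their contents land in the OS page cache."""
    for path in paths:
        with open(path, "rb") as f:
            try:
                # Hint the kernel that the whole file will be needed soon,
                # letting it start readahead asynchronously.
                os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_WILLNEED)
            except (AttributeError, OSError):
                pass  # posix_fadvise is not available on all platforms
            # Sequential chunked reads pull the file into the page cache.
            while f.read(chunk_size):
                pass
```

Reading sequentially in large chunks matters on network filesystems: it replaces the many small random reads a memory-mapped load would otherwise issue with a streaming access pattern the storage backend can serve efficiently.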