vllm/managed_alloc.cu at 013b73e9b2ecd370635cfcc712fb245ce45496e7

biondizzle 487dd34e04 Selective prefetch: only prefetch allocations <2 GiB to GPU

Model weights (small tensors) must be in HBM for cuBLAS GEMM ops
which can't page-fault into managed memory. KV cache blocks are
large and numerous — prefetching them all fills HBM and causes
OOM. The 2 GiB threshold separates compute data from cache data.
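The policy described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in managed_alloc.cu: the names `kPrefetchThresholdBytes`, `should_prefetch`, and `managed_alloc_sketch` are hypothetical, and the prefetch call assumes the pre-CUDA 13 `cudaMemPrefetchAsync(ptr, count, dstDevice, stream)` signature. The CUDA-specific part is guarded so that the size predicate also compiles as plain C++.

```cpp
// Sketch only: builds with nvcc as a .cu file, or as plain C++ (predicate only).
#include <cstddef>

static_assert(sizeof(std::size_t) >= 8, "sketch assumes a 64-bit size_t");

// Threshold taken from the commit message: allocations under 2 GiB are prefetched.
constexpr std::size_t kPrefetchThresholdBytes = std::size_t{2} << 30;  // 2 GiB

// Hypothetical helper: true for "compute data" (e.g. weight tensors),
// false for large allocations such as KV cache blocks.
constexpr bool should_prefetch(std::size_t bytes) {
    return bytes < kPrefetchThresholdBytes;
}

#if defined(__CUDACC__)
#include <cuda_runtime.h>

// Hypothetical allocator entry point; the real function may look different.
inline cudaError_t managed_alloc_sketch(void** ptr, std::size_t bytes,
                                        int device, cudaStream_t stream) {
    // Allocate unified (managed) memory visible to host and device.
    cudaError_t err = cudaMallocManaged(ptr, bytes, cudaMemAttachGlobal);
    if (err != cudaSuccess) return err;

    if (should_prefetch(bytes)) {
        // Assumption: pre-CUDA 13 signature (dstDevice passed as an int).
        // Small allocations are migrated to the GPU up front.
        err = cudaMemPrefetchAsync(*ptr, bytes, device, stream);
        if (err != cudaSuccess) {
            cudaFree(*ptr);
            *ptr = nullptr;
        }
    }
    // Large allocations are left unprefetched and migrate on demand.
    return err;
}
#endif
```

Under this sketch, a 512 MiB weight tensor would be prefetched to the device, while an allocation of exactly 2 GiB or more would not, since the comparison is strict.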

2026-04-10 14:58:57 +00:00

4.1 KiB
