grace-gpu-containers/vllm/managed_alloc.cu at cuda-malloc-managed

Files

biondizzle 07468031db Sync managed_alloc.cu: selective prefetch (<2 GiB to GPU)

Previously the Dockerfile was copying the old version that had
cudaMemPrefetchAsync completely removed, which caused cublasGemmEx
crashes because model weights stayed in LPDDR and cuBLAS couldn't
access them. This version prefetches allocations <2 GiB (model
weights) to GPU but skips large allocations (KV cache).

2026-04-10 18:37:11 +00:00

4.1 KiB

Raw Permalink Blame History

View Raw

4.1 KiB Raw Permalink Blame History

4.1 KiB

Raw Permalink Blame History