grace-gpu-containers/vllm/managed_alloc.cu
biondizzle 07468031db Sync managed_alloc.cu: selective prefetch (<2 GiB to GPU)
Previously the Dockerfile copied an old version of this file in which
cudaMemPrefetchAsync had been removed entirely. Model weights then
stayed in LPDDR, where cuBLAS could not access them, and cublasGemmEx
crashed. This version prefetches allocations smaller than 2 GiB (model
weights) to the GPU but skips larger allocations (KV cache).
2026-04-10 18:37:11 +00:00
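
The size rule described in the commit message can be sketched as a small host-side predicate. This is a minimal illustration, not the contents of managed_alloc.cu: the names `kPrefetchLimitBytes` and `should_prefetch_to_gpu` are hypothetical, and treating an allocation of exactly 2 GiB as "large" is an assumption, since the message only says "<2 GiB". The actual cudaMemPrefetchAsync call is left out because its signature differs between CUDA toolkit versions.

```cpp
#include <cstdint>

// Hypothetical threshold, taken from the commit message: allocations
// smaller than 2 GiB are assumed to be model weights.
constexpr std::uint64_t kPrefetchLimitBytes = 2ULL * 1024 * 1024 * 1024;

// Returns true when an allocation of `size_bytes` should be prefetched to
// the GPU. Allocations of 2 GiB or more (assumed to be KV cache) are left
// where they are; the real allocator would call cudaMemPrefetchAsync only
// when this returns true.
constexpr bool should_prefetch_to_gpu(std::uint64_t size_bytes) {
    return size_bytes < kPrefetchLimitBytes;
}
```

Under these assumptions a 512 MiB weight tensor would be prefetched, while an 8 GiB KV cache block would not.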

4.1 KiB