Commit Graph

2 Commits

487dd34e04 Selective prefetch: only prefetch allocations <2 GiB to GPU
Model weights (small tensors) must be in HBM for cuBLAS GEMM ops,
which can't page-fault into managed memory. KV cache blocks are
large and numerous — prefetching them all fills HBM and causes
OOM. The 2 GiB threshold separates compute data from cache data.
2026-04-10 14:58:57 +00:00
a15f86ecfa Remove cudaMemPrefetchAsync from managed allocator
Eager prefetching was filling HBM+EGM, causing subsequent
cudaMallocManaged calls to fail after model loading. On GH200
with EGM, pages should migrate on-demand via hardware page faults
over C2C NVLink. The cudaMemAdviseSetPreferredLocation(GPU) hint
is sufficient to prefer GPU placement with LPDDR fallback.
2026-04-10 05:58:11 +00:00
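The allocator change this commit describes can be sketched in C. The two CUDA runtime calls are mocked below so the sketch runs without a GH200; the names and argument order match the real CUDA runtime API (`cudaMallocManaged`, `cudaMemAdvise`), but `managed_alloc` and the mock bookkeeping are illustrative assumptions, not the project's actual code.

```c
#include <stddef.h>
#include <stdlib.h>

/* --- Mocked CUDA runtime, so this compiles and runs without a GPU.
 * On real hardware, include <cuda_runtime.h> and delete these stubs. */
typedef int cudaError_t;
#define cudaSuccess 0
#define cudaMemAdviseSetPreferredLocation 3
static const void *last_advised = NULL;  /* records the last hinted pointer */
static cudaError_t cudaMallocManaged(void **p, size_t n) {
    *p = malloc(n);
    return cudaSuccess;
}
static cudaError_t cudaMemAdvise(const void *p, size_t n, int advice, int device) {
    (void)n; (void)advice; (void)device;
    last_advised = p;
    return cudaSuccess;
}

/* Allocator after the change: a managed allocation plus a
 * preferred-location hint, with NO cudaMemPrefetchAsync. Pages then
 * migrate to HBM on demand via hardware page faults over C2C NVLink,
 * and the driver can leave them in LPDDR (EGM) when HBM is full. */
void *managed_alloc(size_t bytes, int gpu_device) {
    void *ptr = NULL;
    if (cudaMallocManaged(&ptr, bytes) != cudaSuccess)
        return NULL;
    cudaMemAdvise(ptr, bytes, cudaMemAdviseSetPreferredLocation, gpu_device);
    return ptr;
}
```

The design point the commit makes: the advise call is a placement *hint* the driver may relax under memory pressure, whereas an eager prefetch forcibly populates HBM/EGM and can fail later `cudaMallocManaged` calls.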