Commit Graph

2 Commits

487dd34e04 Selective prefetch: only prefetch allocations <2 GiB to GPU
Model weights (small tensors) must be in HBM for cuBLAS GEMM ops,
which can't page-fault into managed memory. KV cache blocks are
large and numerous — prefetching them all fills HBM and causes
OOM. The 2 GiB threshold separates compute data from cache data.
2026-04-10 14:58:57 +00:00
a15f86ecfa Remove cudaMemPrefetchAsync from managed allocator
Eager prefetching was filling HBM+EGM, causing subsequent
cudaMallocManaged calls to fail after model loading. On GH200
with EGM, pages should migrate on-demand via hardware page faults
over C2C NVLink. The cudaMemAdviseSetPreferredLocation(GPU) hint
is sufficient to prefer GPU placement with LPDDR fallback.
2026-04-10 05:58:11 +00:00
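The allocator change this commit describes can be sketched in C. The two CUDA runtime calls are mocked below so the sketch runs without a GH200; the names and argument order match the real CUDA runtime API (`cudaMallocManaged`, `cudaMemAdvise`), but `managed_alloc` and the mock bookkeeping are illustrative assumptions, not the project's actual code.

```c
#include <stddef.h>
#include <stdlib.h>

/* --- Mocked CUDA runtime, so this compiles and runs without a GPU.
 * On real hardware, include <cuda_runtime.h> and delete these stubs. */
typedef int cudaError_t;
#define cudaSuccess 0
#define cudaMemAdviseSetPreferredLocation 3
static const void *last_advised = NULL;  /* records the last hinted pointer */
static cudaError_t cudaMallocManaged(void **p, size_t n) {
    *p = malloc(n);
    return cudaSuccess;
}
static cudaError_t cudaMemAdvise(const void *p, size_t n, int advice, int device) {
    (void)n; (void)advice; (void)device;
    last_advised = p;
    return cudaSuccess;
}

/* Allocator after the change: a managed allocation plus a
 * preferred-location hint, with NO cudaMemPrefetchAsync. Pages then
 * migrate to HBM on demand via hardware page faults over C2C NVLink,
 * and the driver can leave them in LPDDR (EGM) when HBM is full. */
void *managed_alloc(size_t bytes, int gpu_device) {
    void *ptr = NULL;
    if (cudaMallocManaged(&ptr, bytes) != cudaSuccess)
        return NULL;
    cudaMemAdvise(ptr, bytes, cudaMemAdviseSetPreferredLocation, gpu_device);
    return ptr;
}
```

The design point the commit makes: the advise call is a placement *hint* the driver may relax under memory pressure, whereas an eager prefetch forcibly populates HBM/EGM and can fail later `cudaMallocManaged` calls.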