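Model weights (small tensors) must be resident in HBM, because the cuBLAS GEMM ops that read them can't page-fault into managed memory. KV cache blocks are large and numerous; prefetching all of them fills HBM and causes OOM. The 2 GiB threshold separates the two: allocations below it are treated as compute data and prefetched, while larger ones are treated as cache data and are not.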
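The size rule above can be sketched as a small planning function. This is only an illustration under stated assumptions: the names `Allocation`, `should_prefetch`, and `plan_prefetch` are hypothetical, the example sizes are made up, the threshold is assumed to apply per allocation, and treating an allocation of exactly 2 GiB as cache data is a guess that the text does not settle.

```python
from dataclasses import dataclass

GIB = 1024 ** 3
PREFETCH_THRESHOLD_BYTES = 2 * GIB  # the 2 GiB threshold from the text


@dataclass(frozen=True)
class Allocation:
    name: str    # hypothetical label, for illustration only
    nbytes: int  # size of the managed allocation in bytes


def should_prefetch(alloc, threshold=PREFETCH_THRESHOLD_BYTES):
    """True if the allocation counts as compute data (strictly below the threshold).

    Assumption: exactly-at-threshold allocations are treated as cache data.
    """
    return alloc.nbytes < threshold


def plan_prefetch(allocs, threshold=PREFETCH_THRESHOLD_BYTES):
    """Split allocations into (prefetch to HBM, leave unprefetched)."""
    prefetch, on_demand = [], []
    for alloc in allocs:
        target = prefetch if should_prefetch(alloc, threshold) else on_demand
        target.append(alloc)
    return prefetch, on_demand


# Made-up sizes: one weight tensor and one large cache allocation.
allocs = [
    Allocation("layer0.qkv_weight", 96 * 1024 ** 2),  # assumed 96 MiB
    Allocation("kv_cache_pool", 8 * GIB),             # assumed 8 GiB
]
prefetch, on_demand = plan_prefetch(allocs)
print("prefetch:", [a.name for a in prefetch])
print("on demand:", [a.name for a in on_demand])
```

In a CUDA program, the `prefetch` list would be the set of managed allocations passed to a call such as `cudaMemPrefetchAsync`; the sketch only models the decision, not the transfer.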