Previously, the Dockerfile copied an old version in which cudaMemPrefetchAsync had been removed entirely. That caused cublasGemmEx crashes because the model weights stayed in LPDDR, where cuBLAS could not access them. This version prefetches allocations smaller than 2 GiB (model weights) to the GPU but skips larger allocations (KV cache).
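The size rule can be sketched roughly as follows. This is only an illustration of the threshold check, not the actual patched code: the names kPrefetchLimitBytes and should_prefetch are hypothetical, and where the check is hooked into the allocator is an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical constant: the 2 GiB cutoff described above.
constexpr std::uint64_t kPrefetchLimitBytes = 2ULL * 1024 * 1024 * 1024;

// Hypothetical helper: true when an allocation is small enough to be
// treated as model weights and prefetched to the GPU. Allocations of
// 2 GiB or more (assumed to be KV cache) are left where they are.
constexpr bool should_prefetch(std::uint64_t bytes) {
    return bytes > 0 && bytes < kPrefetchLimitBytes;
}

// Assumed call site, shown as a comment because it needs the CUDA runtime
// (the exact cudaMemPrefetchAsync signature depends on the CUDA version):
//
//   if (should_prefetch(bytes)) {
//       cudaMemPrefetchAsync(ptr, bytes, device, stream);
//   }
```

With this rule a 512 MiB weight buffer would be prefetched, while an 8 GiB KV-cache buffer would be skipped.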