Previously the Dockerfile copied an old version of the allocator with
cudaMemPrefetchAsync removed entirely. Model weights therefore stayed
resident in LPDDR where cuBLAS could not access them, causing
cublasGemmEx crashes. This version prefetches allocations smaller than
2 GiB (model weights) to the GPU but skips large allocations (KV cache).
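The size-threshold policy above can be sketched as follows. This is an illustrative Python sketch, not the actual managed_alloc.cu code; the constant and function names are invented for the example:

```python
# Illustrative sketch of the allocator's placement policy (hypothetical
# names, not the real managed_alloc.cu symbols): small allocations such
# as model weights are prefetched to HBM, large ones such as the KV
# cache are left to demand paging so they can spill into LPDDR (EGM).

PREFETCH_LIMIT = 2 * 1024**3  # 2 GiB threshold described above

def placement_for(nbytes: int) -> str:
    """Decide what to do with a fresh cudaMallocManaged allocation."""
    if nbytes < PREFETCH_LIMIT:
        # The real allocator calls cudaMemPrefetchAsync(ptr, nbytes, dev)
        # here to migrate the pages to HBM immediately.
        return "prefetch-to-gpu"
    # Large allocations stay pageable; CPU access is advised via
    # cudaMemAdviseSetAccessedBy so reads cross C2C NVLink instead of
    # triggering page migration.
    return "demand-paged"

print(placement_for(512 * 1024**2))  # weight-shard-sized chunk
print(placement_for(8 * 1024**3))    # KV-cache-sized chunk
```

Note the strict less-than: a 2 GiB allocation is treated as large and left to demand paging.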
Key changes:
- managed_alloc.cu: Add cudaMemPrefetchAsync to migrate pages to GPU
immediately (prevents OOM from system RAM pinning on EGM systems
where only ~102 GiB RAM remains). Add cudaMemAdviseSetAccessedBy
for CPU so reads go over C2C NVLink without page migration.
- vllm_managed_mem.py: Rewrite with idempotent patches, proper
MemorySnapshot.measure() override, and torch.cuda tracking stubs
for CUDAPluggableAllocator compatibility.
- sitecustomize.py: Auto-loaded by Python in ALL subprocesses
(including vLLM EngineCore). Applies allocator swap, torch patches,
MemorySnapshot override, and request_memory override before any
CUDA operations in spawned processes.
- Dockerfile: Install sitecustomize.py into Python dist-packages.
- README.md: Full rewrite with EGM problem statement, memory layout,
architecture diagram, and build pipeline documentation.
Components:
- managed_alloc.cu: PyTorch pluggable allocator using cudaMallocManaged
- vllm_managed_mem.py: Launcher that patches vLLM for managed memory
- Dockerfile: Build and install managed memory components
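The idempotent-patch pattern mentioned for vllm_managed_mem.py and sitecustomize.py can be sketched as below. All names here (`apply_patch`, `_cmm_patched`, the stand-in `MemorySnapshot` class) are illustrative, not the actual vLLM or launcher symbols:

```python
# Sketch of the idempotent monkey-patch pattern (hypothetical names).
# sitecustomize.py is imported automatically at interpreter startup,
# so a patch must be safe to apply again when a spawned subprocess
# (e.g. vLLM EngineCore) re-imports it.

def apply_patch(target, attr, wrapper_factory):
    """Replace target.attr with a wrapper exactly once per process."""
    original = getattr(target, attr)
    if getattr(original, "_cmm_patched", False):
        return original  # already patched; applying again is a no-op
    wrapped = wrapper_factory(original)
    wrapped._cmm_patched = True
    setattr(target, attr, wrapped)
    return wrapped

# Example: override a measure() method the way the launcher overrides
# MemorySnapshot.measure() (this class is a stand-in, not vLLM's).
class MemorySnapshot:
    def measure(self):
        return "device-query"

def make_override(original):
    def measure(self):
        # With a pluggable managed allocator, raw device queries are
        # misleading, so report managed-memory accounting instead.
        return "managed-memory-accounting"
    return measure

apply_patch(MemorySnapshot, "measure", make_override)
apply_patch(MemorySnapshot, "measure", make_override)  # no-op
print(MemorySnapshot().measure())
```

The `_cmm_patched` marker is what makes double application safe when both the launcher and sitecustomize.py run in the same process.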
This enables vLLM to use cudaMallocManaged for transparent page-fault
access to both HBM (~96 GiB) and LPDDR (EGM, up to 480 GiB additional)
on GH200 systems with Extended GPU Memory enabled.
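Using the figures quoted in this message, the managed address space works out as below. This is a back-of-the-envelope check from the approximate numbers above, not measured values:

```python
# Rough memory budget from the figures quoted in this commit message.
HBM_GIB = 96              # GPU HBM on GH200 (~96 GiB)
EGM_GIB = 480             # additional LPDDR addressable via EGM (up to)
SYSTEM_RESERVE_GIB = 102  # host RAM remaining after the EGM carve-out

managed_total = HBM_GIB + EGM_GIB
print(f"managed address space: up to ~{managed_total} GiB")
print(f"host RAM remaining:    ~{SYSTEM_RESERVE_GIB} GiB")
```

The small remaining host RAM is why unprefetched managed pages pinned in system memory can OOM the host, motivating the immediate prefetch in managed_alloc.cu.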
Experimental branch: v0.19.0-cmm