sitecustomize.py: No longer swaps in CUDAPluggableAllocator globally;
sets VLLM_KV_CACHE_USE_MANAGED_MEMORY=1 instead.
vllm_managed_mem.py: No global allocator swap, no torch.cuda patches.
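The new sitecustomize.py behavior described above amounts to a few lines; a minimal sketch (the variable name is the one this branch reads, everything else is illustrative):

```python
# sitecustomize.py -- the `site` module imports this automatically at
# interpreter startup, in the parent process and in every subprocess
# that starts its own Python interpreter.
import os

# Opt vLLM's KV cache into CUDA managed memory instead of swapping the
# global allocator. setdefault lets an explicit user setting win.
os.environ.setdefault("VLLM_KV_CACHE_USE_MANAGED_MEMORY", "1")
```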
Key changes:
- managed_alloc.cu: Add cudaMemPrefetchAsync to migrate pages to the GPU
immediately after allocation (prevents OOM from pages pinning system RAM
on EGM systems, where only ~102 GiB of RAM remains). Add
cudaMemAdviseSetAccessedBy for the CPU so CPU reads go over the C2C
NVLink without triggering page migration.
- vllm_managed_mem.py: Rewrite with idempotent patches, proper
MemorySnapshot.measure() override, and torch.cuda tracking stubs
for CUDAPluggableAllocator compatibility.
- sitecustomize.py: Auto-loaded by Python in ALL subprocesses
(including the vLLM EngineCore). Applies the allocator swap, torch
patches, MemorySnapshot override, and request_memory override before
any CUDA operation runs in spawned processes.
- Dockerfile: Install sitecustomize.py into Python dist-packages.
- README.md: Full rewrite with EGM problem statement, memory layout,
architecture diagram, and build pipeline documentation.
Components:
- managed_alloc.cu: PyTorch pluggable allocator using cudaMallocManaged
- vllm_managed_mem.py: Launcher that patches vLLM for managed memory
- Dockerfile: Builds and installs the managed-memory components
This enables vLLM to use cudaMallocManaged for transparent page-fault
access to both HBM (~96 GiB) and LPDDR (EGM, up to 480 GiB additional)
on GH200 systems with Extended GPU Memory enabled.
Experimental branch: v0.19.0-cmm