CMM: Fix OOM and subprocess crashes for GH200 EGM

Key changes:
- managed_alloc.cu: Add cudaMemPrefetchAsync to migrate pages to the GPU
  immediately, preventing OOM from system-RAM pinning on EGM systems
  where only ~102 GiB of RAM remains. Add cudaMemAdviseSetAccessedBy
  for the CPU so reads go over C2C NVLink without migrating pages back
  (see the allocator sketch after this list).
- vllm_managed_mem.py: Rewrite with idempotent patches, a proper
  MemorySnapshot.measure() override, and torch.cuda tracking stubs
  for CUDAPluggableAllocator compatibility.
- sitecustomize.py: Auto-loaded by Python in ALL subprocesses
  (including vLLM EngineCore). Applies the allocator swap, torch patches,
  MemorySnapshot override, and request_memory override before any
  CUDA operations in spawned processes (see the sitecustomize sketch
  after the diff below).
- Dockerfile: Install sitecustomize.py into Python dist-packages.
- README.md: Full rewrite with EGM problem statement, memory layout,
  architecture diagram, and build pipeline documentation.
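
For illustration, a minimal sketch of the allocation path described above.
It follows the C calling convention documented for PyTorch's
torch.cuda.memory.CUDAPluggableAllocator; the symbol names managed_alloc
and managed_free are placeholders and need not match those in the actual
managed_alloc.cu:

// Sketch: pluggable allocator entry points built around cudaMallocManaged.
// Signatures follow torch.cuda.memory.CUDAPluggableAllocator's expected
// C interface: alloc(size, device, stream) / free(ptr, size, device, stream).
#include <cuda_runtime.h>
#include <sys/types.h>  // ssize_t

extern "C" void* managed_alloc(ssize_t size, int device, cudaStream_t stream) {
    void* ptr = nullptr;
    if (cudaMallocManaged(&ptr, (size_t)size) != cudaSuccess) return nullptr;
    // Prefer GPU residency: once pages land on the GPU, keep them there.
    cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation, device);
    // Map pages for the CPU so host reads go over C2C NVLink instead of
    // migrating pages back to system RAM (the EGM OOM fix).
    cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
    // Migrate pages to the GPU now, so first-touch writes (model weight
    // loads) land in HBM/EGM rather than pinning the ~102 GiB of system RAM.
    cudaMemPrefetchAsync(ptr, size, device, stream);
    return ptr;
}

extern "C" void managed_free(void* ptr, ssize_t size, int device, cudaStream_t stream) {
    (void)size; (void)device; (void)stream;
    cudaFree(ptr);  // cudaFree releases managed allocations as well
}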
2026-04-09 23:25:48 +00:00
parent 079eb88d7d
commit aadde3ddf9
5 changed files with 460 additions and 96 deletions

@@ -249,15 +249,29 @@ RUN pip uninstall -y pynvml && pip install nvidia-ml-py
 # access to both HBM and LPDDR (EGM) memory on GH200 systems.
 #
 # The managed_alloc.cu provides a PyTorch pluggable allocator that uses
-# cudaMallocManaged instead of cudaMalloc. vllm_managed_mem.py is a
-# launcher that swaps the allocator before any CUDA operations and patches
-# vLLM's memory validation to understand the larger managed memory space.
+# cudaMallocManaged instead of cudaMalloc. Key features:
+# - cudaMemAdviseSetPreferredLocation(GPU): keep pages on GPU
+# - cudaMemAdviseSetAccessedBy(CPU): CPU reads over C2C NVLink without
+#   migrating pages back to system RAM (prevents OOM on EGM systems)
+# - cudaMemPrefetchAsync(GPU): actively migrates pages to GPU immediately,
+#   so model weight writes go to HBM/EGM, not system RAM
+#
+# vllm_managed_mem.py is the launcher that swaps the allocator before any
+# CUDA operations and patches vLLM's memory validation to understand the
+# larger managed memory space.
+#
+# sitecustomize.py is auto-loaded by Python in ALL subprocesses (including
+# vLLM's EngineCore). It applies the allocator swap and torch.cuda patches
+# before any CUDA operations in spawned processes.
 # ==============================================================================
 COPY managed_alloc.cu /tmp/managed_alloc.cu
 RUN nvcc -shared -o /usr/local/lib/libmanaged_alloc.so \
     /tmp/managed_alloc.cu -Xcompiler -fPIC && rm /tmp/managed_alloc.cu
 COPY vllm_managed_mem.py /usr/local/bin/vllm_managed_mem.py
 RUN chmod +x /usr/local/bin/vllm_managed_mem.py
+# sitecustomize.py is auto-loaded by Python before any other imports.
+# This ensures CMM patches apply in ALL subprocesses (EngineCore, etc.)
+COPY sitecustomize.py /usr/local/lib/python3.12/dist-packages/sitecustomize.py
 # API server entrypoint
 # ENTRYPOINT ["vllm", "serve"]
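
For context, a hypothetical sketch of the allocator swap that
sitecustomize.py performs, using only public torch APIs
(torch.cuda.memory.CUDAPluggableAllocator and change_current_allocator).
The MemorySnapshot and request_memory overrides described above are
omitted, and the symbol names assume the allocator sketch under the
commit message rather than the real library:

# Hypothetical sketch of sitecustomize.py's allocator swap; the actual
# file in this commit also patches vLLM's memory validation.
try:
    import torch

    # Swap PyTorch's caching allocator for the managed one before any
    # CUDA context exists in this (sub)process. Symbol names are assumed;
    # the exports of libmanaged_alloc.so may differ.
    _alloc = torch.cuda.memory.CUDAPluggableAllocator(
        "/usr/local/lib/libmanaged_alloc.so", "managed_alloc", "managed_free"
    )
    torch.cuda.memory.change_current_allocator(_alloc)
except Exception:
    # Never break non-CUDA Python processes at interpreter startup.
    pass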