CMM: Fix OOM and subprocess crashes for GH200 EGM

Key changes:
- managed_alloc.cu: Add cudaMemPrefetchAsync to migrate pages to the GPU
  immediately, preventing OOM from system-RAM pinning on EGM systems
  where only ~102 GiB of RAM remains. Add cudaMemAdviseSetAccessedBy
  for the CPU so reads go over C2C NVLink without migrating pages back
  (see the allocator sketch after this list).
- vllm_managed_mem.py: Rewrite with idempotent patches, a proper
  MemorySnapshot.measure() override, and torch.cuda tracking stubs
  for CUDAPluggableAllocator compatibility.
- sitecustomize.py: Auto-loaded by Python in ALL subprocesses
  (including vLLM EngineCore). Applies the allocator swap, torch patches,
  MemorySnapshot override, and request_memory override before any
  CUDA operations in spawned processes (see the sitecustomize sketch
  after the diff below).
- Dockerfile: Install sitecustomize.py into Python dist-packages.
- README.md: Full rewrite with EGM problem statement, memory layout,
  architecture diagram, and build pipeline documentation.
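
For illustration, a minimal sketch of the allocation path described above.
It follows the C calling convention documented for PyTorch's
torch.cuda.memory.CUDAPluggableAllocator; the symbol names managed_alloc
and managed_free are placeholders and need not match those in the actual
managed_alloc.cu:

// Sketch: pluggable allocator entry points built around cudaMallocManaged.
// Signatures follow torch.cuda.memory.CUDAPluggableAllocator's expected
// C interface: alloc(size, device, stream) / free(ptr, size, device, stream).
#include <cuda_runtime.h>
#include <sys/types.h>  // ssize_t

extern "C" void* managed_alloc(ssize_t size, int device, cudaStream_t stream) {
    void* ptr = nullptr;
    if (cudaMallocManaged(&ptr, (size_t)size) != cudaSuccess) return nullptr;
    // Prefer GPU residency: once pages land on the GPU, keep them there.
    cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation, device);
    // Map pages for the CPU so host reads go over C2C NVLink instead of
    // migrating pages back to system RAM (the EGM OOM fix).
    cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
    // Migrate pages to the GPU now, so first-touch writes (model weight
    // loads) land in HBM/EGM rather than pinning the ~102 GiB of system RAM.
    cudaMemPrefetchAsync(ptr, size, device, stream);
    return ptr;
}

extern "C" void managed_free(void* ptr, ssize_t size, int device, cudaStream_t stream) {
    (void)size; (void)device; (void)stream;
    cudaFree(ptr);  // cudaFree releases managed allocations as well
}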
2026-04-09 23:25:48 +00:00
parent 079eb88d7d
commit aadde3ddf9
5 changed files with 460 additions and 96 deletions

@@ -249,15 +249,29 @@ RUN pip uninstall -y pynvml && pip install nvidia-ml-py
 # access to both HBM and LPDDR (EGM) memory on GH200 systems.
 #
 # The managed_alloc.cu provides a PyTorch pluggable allocator that uses
-# cudaMallocManaged instead of cudaMalloc. vllm_managed_mem.py is a
-# launcher that swaps the allocator before any CUDA operations and patches
-# vLLM's memory validation to understand the larger managed memory space.
+# cudaMallocManaged instead of cudaMalloc. Key features:
+# - cudaMemAdviseSetPreferredLocation(GPU): keep pages on GPU
+# - cudaMemAdviseSetAccessedBy(CPU): CPU reads over C2C NVLink without
+#   migrating pages back to system RAM (prevents OOM on EGM systems)
+# - cudaMemPrefetchAsync(GPU): actively migrates pages to GPU immediately,
+#   so model weight writes go to HBM/EGM, not system RAM
+#
+# vllm_managed_mem.py is the launcher that swaps the allocator before any
+# CUDA operations and patches vLLM's memory validation to understand the
+# larger managed memory space.
+#
+# sitecustomize.py is auto-loaded by Python in ALL subprocesses (including
+# vLLM's EngineCore). It applies the allocator swap and torch.cuda patches
+# before any CUDA operations in spawned processes.
 # ==============================================================================
 COPY managed_alloc.cu /tmp/managed_alloc.cu
 RUN nvcc -shared -o /usr/local/lib/libmanaged_alloc.so \
     /tmp/managed_alloc.cu -Xcompiler -fPIC && rm /tmp/managed_alloc.cu
 COPY vllm_managed_mem.py /usr/local/bin/vllm_managed_mem.py
 RUN chmod +x /usr/local/bin/vllm_managed_mem.py
+# sitecustomize.py is auto-loaded by Python before any other imports.
+# This ensures CMM patches apply in ALL subprocesses (EngineCore, etc.)
+COPY sitecustomize.py /usr/local/lib/python3.12/dist-packages/sitecustomize.py
 # API server entrypoint
 # ENTRYPOINT ["vllm", "serve"]
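
For context, a hypothetical sketch of the allocator swap that
sitecustomize.py performs, using only public torch APIs
(torch.cuda.memory.CUDAPluggableAllocator and change_current_allocator).
The MemorySnapshot and request_memory overrides described above are
omitted, and the symbol names assume the allocator sketch under the
commit message rather than the real library:

# Hypothetical sketch of sitecustomize.py's allocator swap; the actual
# file in this commit also patches vLLM's memory validation.
try:
    import torch

    # Swap PyTorch's caching allocator for the managed one before any
    # CUDA context exists in this (sub)process. Symbol names are assumed;
    # the exports of libmanaged_alloc.so may differ.
    _alloc = torch.cuda.memory.CUDAPluggableAllocator(
        "/usr/local/lib/libmanaged_alloc.so", "managed_alloc", "managed_free"
    )
    torch.cuda.memory.change_current_allocator(_alloc)
except Exception:
    # Never break non-CUDA Python processes at interpreter startup.
    pass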