CMM: Fix OOM and subprocess crashes for GH200 EGM
Key changes:

- managed_alloc.cu: Add cudaMemPrefetchAsync to migrate pages to the GPU immediately (prevents OOM from system-RAM pinning on EGM systems, where only ~102 GiB RAM remains). Add cudaMemAdviseSetAccessedBy for the CPU so reads go over C2C NVLink without page migration (see the sketch below).
- vllm_managed_mem.py: Rewrite with idempotent patches, a proper MemorySnapshot.measure() override, and torch.cuda tracking stubs for CUDAPluggableAllocator compatibility.
- sitecustomize.py: Auto-loaded by Python in ALL subprocesses (including vLLM EngineCore). Applies the allocator swap, torch patches, MemorySnapshot override, and request_memory override before any CUDA operations in spawned processes.
- Dockerfile: Install sitecustomize.py into Python dist-packages.
- README.md: Full rewrite with EGM problem statement, memory layout, architecture diagram, and build pipeline documentation.
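For orientation, here is a minimal sketch of what the pluggable-allocator entry points in managed_alloc.cu could look like. This is illustrative, not the committed source: the export names `managed_malloc`/`managed_free` are assumptions, and error handling is pared down.

```cuda
// Sketch only: a cudaMallocManaged-based PyTorch pluggable allocator.
// The export names (managed_malloc/managed_free) are hypothetical; the real
// ones are whatever the launcher passes to CUDAPluggableAllocator.
// Build (matches the Dockerfile):
//   nvcc -shared -o libmanaged_alloc.so managed_alloc.cu -Xcompiler -fPIC
#include <sys/types.h>
#include <cuda_runtime_api.h>

extern "C" {

// Signature expected by torch.cuda.memory.CUDAPluggableAllocator.
void* managed_malloc(ssize_t size, int device, cudaStream_t stream) {
    void* ptr = nullptr;
    if (cudaMallocManaged(&ptr, size) != cudaSuccess) return nullptr;
    // Keep pages resident on the GPU (HBM/EGM) rather than in system RAM.
    cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation, device);
    // Map the range for the CPU so host reads go over C2C NVLink without
    // migrating pages back to system RAM (the EGM OOM fix).
    cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
    // Migrate pages to the GPU now, so the first writes (model weight
    // loading) land in HBM/EGM instead of pinning system RAM.
    cudaMemPrefetchAsync(ptr, size, device, stream);
    return ptr;
}

void managed_free(void* ptr, ssize_t size, int device, cudaStream_t stream) {
    (void)size; (void)device; (void)stream;
    cudaFree(ptr);
}

}  // extern "C"
```

On the Python side, such a library would be registered through PyTorch's documented pluggable-allocator API, e.g. torch.cuda.memory.CUDAPluggableAllocator("/usr/local/lib/libmanaged_alloc.so", "managed_malloc", "managed_free") followed by torch.cuda.memory.change_current_allocator(...), before any CUDA operation runs (again, the symbol names here are assumptions).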
@@ -249,15 +249,29 @@ RUN pip uninstall -y pynvml && pip install nvidia-ml-py
 # access to both HBM and LPDDR (EGM) memory on GH200 systems.
 #
 # The managed_alloc.cu provides a PyTorch pluggable allocator that uses
-# cudaMallocManaged instead of cudaMalloc. vllm_managed_mem.py is a
-# launcher that swaps the allocator before any CUDA operations and patches
-# vLLM's memory validation to understand the larger managed memory space.
+# cudaMallocManaged instead of cudaMalloc. Key features:
+# - cudaMemAdviseSetPreferredLocation(GPU): keep pages on GPU
+# - cudaMemAdviseSetAccessedBy(CPU): CPU reads over C2C NVLink without
+#   migrating pages back to system RAM (prevents OOM on EGM systems)
+# - cudaMemPrefetchAsync(GPU): actively migrates pages to GPU immediately,
+#   so model weight writes go to HBM/EGM, not system RAM
+#
+# vllm_managed_mem.py is the launcher that swaps the allocator before any
+# CUDA operations and patches vLLM's memory validation to understand the
+# larger managed memory space.
+#
+# sitecustomize.py is auto-loaded by Python in ALL subprocesses (including
+# vLLM's EngineCore). It applies the allocator swap and torch.cuda patches
+# before any CUDA operations in spawned processes.
 # ==============================================================================
 COPY managed_alloc.cu /tmp/managed_alloc.cu
 RUN nvcc -shared -o /usr/local/lib/libmanaged_alloc.so \
     /tmp/managed_alloc.cu -Xcompiler -fPIC && rm /tmp/managed_alloc.cu
 COPY vllm_managed_mem.py /usr/local/bin/vllm_managed_mem.py
 RUN chmod +x /usr/local/bin/vllm_managed_mem.py
+# sitecustomize.py is auto-loaded by Python before any other imports.
+# This ensures CMM patches apply in ALL subprocesses (EngineCore, etc.)
+COPY sitecustomize.py /usr/local/lib/python3.12/dist-packages/sitecustomize.py
 
 # API server entrypoint
 # ENTRYPOINT ["vllm", "serve"]
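Not part of the commit, but a quick standalone way to sanity-check the advice/prefetch behavior on a GH200 node is to query the managed range's attributes after prefetching; device 0 and the 64 MiB test range below are illustrative assumptions:

```cuda
// Hypothetical check: confirm managed pages end up GPU-resident and stay
// there across a host read. Compile with: nvcc check_egm.cu -o check_egm
#include <cstdio>
#include <cuda_runtime_api.h>

int main() {
    const size_t bytes = 64UL << 20;  // 64 MiB illustrative test range
    void* ptr = nullptr;
    cudaMallocManaged(&ptr, bytes);
    cudaMemAdvise(ptr, bytes, cudaMemAdviseSetPreferredLocation, /*device=*/0);
    cudaMemAdvise(ptr, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
    cudaMemPrefetchAsync(ptr, bytes, /*dstDevice=*/0, /*stream=*/0);
    cudaDeviceSynchronize();

    // On GH200 this host read should go over C2C without migrating the page.
    volatile char first = static_cast<volatile char*>(ptr)[0];
    (void)first;

    int last = -2, preferred = -2;
    // Where did the last cudaMemPrefetchAsync send these pages?
    cudaMemRangeGetAttribute(&last, sizeof(last),
                             cudaMemRangeAttributeLastPrefetchLocation,
                             ptr, bytes);
    // Which processor do the pages prefer to live on?
    cudaMemRangeGetAttribute(&preferred, sizeof(preferred),
                             cudaMemRangeAttributePreferredLocation,
                             ptr, bytes);
    // Expect both to print 0 (the GPU) if the advice took effect.
    printf("last prefetch location: %d, preferred location: %d\n",
           last, preferred);

    cudaFree(ptr);
    return 0;
}
```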