grace-gpu-containers

Author	SHA1	Message	Date
biondizzle	aadde3ddf9	CMM: Fix OOM and subprocess crashes for GH200 EGM. Key changes: - managed_alloc.cu: Add cudaMemPrefetchAsync to migrate pages to GPU immediately (prevents OOM from system RAM pinning on EGM systems where only ~102 GiB RAM remains). Add cudaMemAdviseSetAccessedBy for CPU so reads go over C2C NVLink without page migration. - vllm_managed_mem.py: Rewrite with idempotent patches, proper MemorySnapshot.measure() override, and torch.cuda tracking stubs for CUDAPluggableAllocator compatibility. - sitecustomize.py: Auto-loaded by Python in ALL subprocesses (including vLLM EngineCore). Applies allocator swap, torch patches, MemorySnapshot override, and request_memory override before any CUDA operations in spawned processes. - Dockerfile: Install sitecustomize.py into Python dist-packages. - README.md: Full rewrite with EGM problem statement, memory layout, architecture diagram, and build pipeline documentation.	2026-04-09 23:25:48 +00:00
Rajesh Shashi Kumar	5fa395825a	Updated to vLLM v0.11.1rc3	2025-10-23 18:16:57 +00:00
Rajesh Shashi Kumar	3c4796ed55	Updated for CUDA 13	2025-10-21 19:21:13 +00:00
Rajesh Shashi Kumar	02430037ea	Updated for v0.11.0	2025-10-16 01:08:21 +00:00
Rajesh Shashi Kumar	31f4489d1f	Update README.md	2025-09-24 01:43:49 -05:00
Rajesh Shashi Kumar	9f2769285a	Initial commit	2025-03-28 14:31:17 -05:00
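
The most recent commit relies on Python importing a module named sitecustomize at interpreter startup, so patches reach every spawned subprocess before any user code runs. The sketch below illustrates only that loading mechanism, not the repository's actual hook: the helper `run_child_with_sitecustomize` and the `DEMO_HOOK_APPLIED` marker are hypothetical names invented for this example, and the hook is placed on `PYTHONPATH` in a temporary directory rather than installed into dist-packages as the Dockerfile change does.

```python
import os
import subprocess
import sys
import tempfile
import textwrap


def run_child_with_sitecustomize(hook_source: str, child_code: str) -> str:
    """Run child_code in a fresh interpreter with hook_source as sitecustomize.py.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    with tempfile.TemporaryDirectory() as hook_dir:
        with open(os.path.join(hook_dir, "sitecustomize.py"), "w") as fh:
            fh.write(hook_source)

        env = dict(os.environ)
        # Put the hook directory first so it is found before any other
        # sitecustomize module that may exist on the system.
        parts = [hook_dir]
        if env.get("PYTHONPATH"):
            parts.append(env["PYTHONPATH"])
        env["PYTHONPATH"] = os.pathsep.join(parts)

        result = subprocess.run(
            [sys.executable, "-c", child_code],
            env=env,
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()


# Stand-in hook: the real one would swap the allocator and patch torch here.
HOOK = textwrap.dedent(
    """
    import os
    os.environ["DEMO_HOOK_APPLIED"] = "1"  # hypothetical marker
    """
)

# The child never imports the hook explicitly; it only reads the marker.
CHILD = 'import os; print(os.environ.get("DEMO_HOOK_APPLIED", "0"))'

if __name__ == "__main__":
    print(run_child_with_sitecustomize(HOOK, CHILD))
```

The child process reports the marker even though it never imports the hook itself, which is the property that lets patches apply inside processes the caller does not launch directly. Interpreters started with `-S` or `-I` skip this mechanism, so it should not be treated as guaranteed in every launch configuration.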