Commit Graph

6 Commits

aadde3ddf9 CMM: Fix OOM and subprocess crashes for GH200 EGM
Key changes:
- managed_alloc.cu: Add cudaMemPrefetchAsync to migrate pages to GPU
  immediately (prevents OOM from system RAM pinning on EGM systems
  where only ~102 GiB RAM remains). Add cudaMemAdviseSetAccessedBy
  for CPU so reads go over C2C NVLink without page migration.
- vllm_managed_mem.py: Rewrite with idempotent patches, proper
  MemorySnapshot.measure() override, and torch.cuda tracking stubs
  for CUDAPluggableAllocator compatibility.
- sitecustomize.py: Auto-loaded by Python in ALL subprocesses
  (including vLLM EngineCore). Applies allocator swap, torch patches,
  MemorySnapshot override, and request_memory override before any
  CUDA operations in spawned processes.
- Dockerfile: Install sitecustomize.py into Python dist-packages.
- README.md: Full rewrite with EGM problem statement, memory layout,
  architecture diagram, and build pipeline documentation.
2026-04-09 23:25:48 +00:00
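The managed_alloc.cu change described above boils down to two CUDA runtime calls per managed block. The sketch below is a hypothetical Python rendering of that placement policy (e.g. driven through a `ctypes.CDLL("libcudart.so")` handle), not the repository's actual code; `cudaMemAdviseSetAccessedBy` (5) and `cudaCpuDeviceId` (-1) are the documented CUDA runtime constants, and the classic pre-CUDA-13 call signatures are assumed.

```python
# Hypothetical sketch of the managed_alloc.cu logic, driving the CUDA
# runtime C API from Python. `cudart` is any object exposing the two
# runtime calls (e.g. a ctypes.CDLL handle); it is passed in so the
# policy can be exercised without a GPU.

cudaMemAdviseSetAccessedBy = 5   # cudaMemoryAdvise enum value
cudaCpuDeviceId = -1             # pseudo-device id for the host

def place_managed_block(cudart, ptr, nbytes, gpu_id, stream=None):
    """Prefetch a cudaMallocManaged block to the GPU, then let the CPU
    access it in place over C2C NVLink instead of migrating pages back."""
    # Migrate the pages to GPU memory immediately so they never sit
    # pinned in the limited remaining system RAM.
    err = cudart.cudaMemPrefetchAsync(ptr, nbytes, gpu_id, stream)
    if err != 0:
        return err
    # CPU reads now go over the coherent link without page migration.
    return cudart.cudaMemAdvise(ptr, nbytes, cudaMemAdviseSetAccessedBy,
                                cudaCpuDeviceId)
```

Passing the runtime handle in explicitly also makes the policy easy to unit-test against a stub.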
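The "idempotent patches" mentioned for vllm_managed_mem.py can be illustrated with a generic guard: because sitecustomize.py runs at startup in every Python subprocess, and the parent process may also import the patch module explicitly, each monkey-patch must detect that it has already been applied. The helper name and marker attribute below are illustrative, not taken from the repository.

```python
# Hypothetical sketch of the idempotent-patch guard that a
# sitecustomize.py-style startup hook needs: applying the same patch a
# second time in the same interpreter must be a no-op.

_PATCH_MARK = "_cmm_patched"  # illustrative marker attribute

def patch_once(obj, attr, make_wrapper):
    """Replace obj.attr with make_wrapper(original), at most once."""
    current = getattr(obj, attr)
    if getattr(current, _PATCH_MARK, False):
        return current  # already patched in this interpreter
    wrapped = make_wrapper(current)
    setattr(wrapped, _PATCH_MARK, True)
    setattr(obj, attr, wrapped)
    return wrapped
```

The same guard pattern covers both entry points: whether the swap of the allocator or the MemorySnapshot override runs via the parent's explicit import or via sitecustomize in a spawned EngineCore process, it takes effect exactly once.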
Rajesh Shashi Kumar
5fa395825a Updated to vLLM v0.11.1rc3 2025-10-23 18:16:57 +00:00
Rajesh Shashi Kumar
3c4796ed55 Updated for CUDA 13 2025-10-21 19:21:13 +00:00
Rajesh Shashi Kumar
02430037ea Updated for v0.11.0 2025-10-16 01:08:21 +00:00
Rajesh Shashi Kumar
31f4489d1f Update README.md 2025-09-24 01:43:49 -05:00
Rajesh Shashi Kumar
9f2769285a Initial commit 2025-03-28 14:31:17 -05:00