Commit Graph

6 Commits

Author SHA1 Message Date
07468031db Sync managed_alloc.cu: selective prefetch (<2 GiB to GPU)
Previously the Dockerfile was copying the old version that had
cudaMemPrefetchAsync completely removed, which caused cublasGemmEx
crashes because model weights stayed in LPDDR and cuBLAS couldn't
access them. This version prefetches allocations <2 GiB (model
weights) to GPU but skips large allocations (KV cache).
2026-04-10 18:37:11 +00:00
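The selective-prefetch policy this commit describes can be sketched as follows. The names `should_prefetch` and `maybe_prefetch` are hypothetical (the actual managed_alloc.cu is not shown here); the prefetch call uses the cudaMemLocation-based CUDA 13 signature that commit c583bcb4fc describes.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Policy from the commit: prefetch "small" allocations (model weights,
// < 2 GiB) into HBM; skip large allocations (KV cache) so their pages
// can stay in, or spill to, LPDDR (EGM).
static const size_t kPrefetchLimit = 2ULL << 30;  // 2 GiB

static bool should_prefetch(size_t bytes) {
    return bytes < kPrefetchLimit;
}

// CUDA 13 form (per commit c583bcb4fc): cudaMemLocation target plus a
// flags argument; the stream argument is left at its default.
static void maybe_prefetch(void* ptr, size_t bytes, int device) {
    if (!should_prefetch(bytes)) return;
    cudaMemLocation loc{};
    loc.type = cudaMemLocationTypeDevice;
    loc.id = device;
    cudaMemPrefetchAsync(ptr, bytes, loc, /*flags=*/0);
}

int main() {
    (void)maybe_prefetch;  // compile-time check only; no GPU required
    // Exercise just the host-side policy, which needs no device.
    printf("1GiB -> %s\n", should_prefetch(1ULL << 30) ? "prefetch" : "skip");
    printf("4GiB -> %s\n", should_prefetch(4ULL << 30) ? "prefetch" : "skip");
    return 0;
}
```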
c583bcb4fc Fix cudaMemPrefetchAsync for CUDA 13: use cudaMemLocation + flags=0 (no stream param) 2026-04-10 02:45:05 +00:00
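A minimal sketch of the migrated call this commit describes, assuming the CUDA 13 signature takes a cudaMemLocation and a flags argument with the stream defaulted; the wrapper name `prefetch_cuda13` is hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// CUDA 12 form (removed in CUDA 13): bare device ordinal + stream:
//   cudaMemPrefetchAsync(ptr, bytes, device, stream);
// CUDA 13 form per this commit: cudaMemLocation + flags (the commit
// passes 0), no explicit stream argument.
static cudaError_t prefetch_cuda13(void* ptr, size_t bytes, int device) {
    cudaMemLocation loc{};
    loc.type = cudaMemLocationTypeDevice;  // target is a GPU
    loc.id = device;                       // ordinal of that GPU
    return cudaMemPrefetchAsync(ptr, bytes, loc, /*flags=*/0);
}

int main() {
    (void)prefetch_cuda13;  // signature check only; no GPU required
    printf("cuda13 prefetch wrapper compiled\n");
    return 0;
}
```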
6053e6d0ea Fix cudaMemPrefetchAsync: use int device instead of cudaMemLocation struct 2026-04-10 01:48:01 +00:00
aadde3ddf9 CMM: Fix OOM and subprocess crashes for GH200 EGM
Key changes:
- managed_alloc.cu: Add cudaMemPrefetchAsync to migrate pages to GPU
  immediately (prevents OOM from system RAM pinning on EGM systems
  where only ~102 GiB RAM remains). Add cudaMemAdviseSetAccessedBy
  for CPU so reads go over C2C NVLink without page migration.
- vllm_managed_mem.py: Rewrite with idempotent patches, proper
  MemorySnapshot.measure() override, and torch.cuda tracking stubs
  for CUDAPluggableAllocator compatibility.
- sitecustomize.py: Auto-loaded by Python in ALL subprocesses
  (including vLLM EngineCore). Applies allocator swap, torch patches,
  MemorySnapshot override, and request_memory override before any
  CUDA operations in spawned processes.
- Dockerfile: Install sitecustomize.py into Python dist-packages.
- README.md: Full rewrite with EGM problem statement, memory layout,
  architecture diagram, and build pipeline documentation.
2026-04-09 23:25:48 +00:00
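The managed_alloc.cu change above pairs a prefetch to the GPU with a cudaMemAdviseSetAccessedBy hint for the CPU. A sketch of that sequence, with the caveats that `place_for_egm` is a hypothetical name and that encoding "the CPU" as cudaMemLocationTypeHost is an assumption:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// After allocating managed memory: migrate pages to the GPU (avoids
// pinning them in the limited system RAM on EGM systems), then advise
// that the CPU will also access the range, so CPU reads can go over
// the C2C NVLink without migrating pages back.
static void place_for_egm(void* ptr, size_t bytes, int device) {
    cudaMemLocation gpu{};
    gpu.type = cudaMemLocationTypeDevice;
    gpu.id = device;
    cudaMemPrefetchAsync(ptr, bytes, gpu, /*flags=*/0);  // pages -> HBM

    cudaMemLocation cpu{};
    cpu.type = cudaMemLocationTypeHost;  // assumption: host location
    cpu.id = 0;
    // CUDA 13 cudaMemAdvise takes a cudaMemLocation (commit 7c79fb4ee7).
    cudaMemAdvise(ptr, bytes, cudaMemAdviseSetAccessedBy, cpu);
}

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) == cudaSuccess && n > 0) {
        void* p = nullptr;
        if (cudaMallocManaged(&p, 1 << 20) == cudaSuccess) {
            place_for_egm(p, 1 << 20, 0);
            cudaFree(p);
        }
    }
    printf("EGM placement sketch done\n");
    return 0;
}
```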
7c79fb4ee7 fix: Update cudaMemAdvise for CUDA 13 API
CUDA 13 changed cudaMemAdvise to take a cudaMemLocation struct instead of an int device ID.
Updated the call to pass a cudaMemLocation with type=cudaMemLocationTypeDevice.
2026-04-07 21:32:17 +00:00
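The before/after of the API change this commit describes, as a sketch; the specific advice value (cudaMemAdviseSetPreferredLocation) and the wrapper name are illustrative, not taken from the repository.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Before (CUDA 12): the advise target was a bare device ordinal:
//   cudaMemAdvise(ptr, bytes, cudaMemAdviseSetPreferredLocation, device);
// After (CUDA 13): the target is a cudaMemLocation struct.
static cudaError_t prefer_device(void* ptr, size_t bytes, int device) {
    cudaMemLocation loc{};
    loc.type = cudaMemLocationTypeDevice;
    loc.id = device;
    return cudaMemAdvise(ptr, bytes, cudaMemAdviseSetPreferredLocation, loc);
}

int main() {
    (void)prefer_device;  // signature check only; no GPU required
    printf("cuda13 cudaMemAdvise wrapper compiled\n");
    return 0;
}
```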
2757bffcb6 Add cudaMallocManaged allocator for GH200 EGM support
- managed_alloc.cu: PyTorch pluggable allocator using cudaMallocManaged
- vllm_managed_mem.py: Launcher that patches vLLM for managed memory
- Dockerfile: Build and install managed memory components

This enables vLLM to use cudaMallocManaged for transparent page-fault
access to both HBM (~96 GiB) and LPDDR (EGM, up to 480 GiB additional)
on GH200 systems with Extended GPU Memory enabled.

Experimental branch: v0.19.0-cmm
2026-04-07 21:19:39 +00:00
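The pluggable-allocator shape this initial commit describes can be sketched as below. The function names `cmm_malloc`/`cmm_free` are hypothetical (the real managed_alloc.cu is not shown); the signatures are the ones torch.cuda.memory.CUDAPluggableAllocator expects from a shared library.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

extern "C" {

// PyTorch pluggable-allocator entry points. Backing allocations with
// cudaMallocManaged lets pages migrate on demand between HBM and
// LPDDR (EGM) via GPU/CPU page faults.
void* cmm_malloc(size_t size, int device, cudaStream_t stream) {
    void* ptr = nullptr;
    if (cudaMallocManaged(&ptr, size, cudaMemAttachGlobal) != cudaSuccess)
        return nullptr;
    return ptr;
}

void cmm_free(void* ptr, size_t size, int device, cudaStream_t stream) {
    cudaFree(ptr);
}

} // extern "C"

int main() {
    // Smoke test: allocate and free 1 MiB if a CUDA device is present;
    // otherwise the sketch still compiles and runs as a no-op.
    int n = 0;
    if (cudaGetDeviceCount(&n) == cudaSuccess && n > 0) {
        void* p = cmm_malloc(1 << 20, 0, nullptr);
        if (p) cmm_free(p, 1 << 20, 0, nullptr);
    }
    printf("cmm allocator sketch done\n");
    return 0;
}
```

On the Python side, such a library would typically be loaded with torch.cuda.memory.CUDAPluggableAllocator("libmanaged_alloc.so", "cmm_malloc", "cmm_free") and installed via torch.cuda.memory.change_current_allocator before any CUDA allocation.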