Previously the Dockerfile copied an old version of the allocator with
cudaMemPrefetchAsync removed entirely. Model weights therefore stayed
resident in LPDDR where cuBLAS could not access them, causing
cublasGemmEx crashes. This version prefetches allocations smaller than
2 GiB (model weights) to the GPU but skips large allocations (KV cache).
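The size-threshold policy above can be sketched as follows. This is an illustrative Python sketch, not the actual managed_alloc.cu code; the constant and function names are invented for the example:

```python
# Illustrative sketch of the allocator's placement policy (hypothetical
# names, not the real managed_alloc.cu symbols): small allocations such
# as model weights are prefetched to HBM, large ones such as the KV
# cache are left to demand paging so they can spill into LPDDR (EGM).

PREFETCH_LIMIT = 2 * 1024**3  # 2 GiB threshold described above

def placement_for(nbytes: int) -> str:
    """Decide what to do with a fresh cudaMallocManaged allocation."""
    if nbytes < PREFETCH_LIMIT:
        # The real allocator calls cudaMemPrefetchAsync(ptr, nbytes, dev)
        # here to migrate the pages to HBM immediately.
        return "prefetch-to-gpu"
    # Large allocations stay pageable; CPU access is advised via
    # cudaMemAdviseSetAccessedBy so reads cross C2C NVLink instead of
    # triggering page migration.
    return "demand-paged"

print(placement_for(512 * 1024**2))  # weight-shard-sized chunk
print(placement_for(8 * 1024**3))    # KV-cache-sized chunk
```

Note the strict less-than: a 2 GiB allocation is treated as large and left to demand paging.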
Key changes:
- managed_alloc.cu: Add cudaMemPrefetchAsync to migrate pages to GPU
immediately (prevents OOM from system RAM pinning on EGM systems
where only ~102 GiB RAM remains). Add cudaMemAdviseSetAccessedBy
for CPU so reads go over C2C NVLink without page migration.
- vllm_managed_mem.py: Rewrite with idempotent patches, proper
MemorySnapshot.measure() override, and torch.cuda tracking stubs
for CUDAPluggableAllocator compatibility.
- sitecustomize.py: Auto-loaded by Python in ALL subprocesses
(including vLLM EngineCore). Applies allocator swap, torch patches,
MemorySnapshot override, and request_memory override before any
CUDA operations in spawned processes.
- Dockerfile: Install sitecustomize.py into Python dist-packages.
- README.md: Full rewrite with EGM problem statement, memory layout,
architecture diagram, and build pipeline documentation.
Components:
- managed_alloc.cu: PyTorch pluggable allocator using cudaMallocManaged
- vllm_managed_mem.py: Launcher that patches vLLM for managed memory
- Dockerfile: Build and install managed memory components
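The idempotent-patch pattern mentioned for vllm_managed_mem.py and sitecustomize.py can be sketched as below. All names here (`apply_patch`, `_cmm_patched`, the stand-in `MemorySnapshot` class) are illustrative, not the actual vLLM or launcher symbols:

```python
# Sketch of the idempotent monkey-patch pattern (hypothetical names).
# sitecustomize.py is imported automatically at interpreter startup,
# so a patch must be safe to apply again when a spawned subprocess
# (e.g. vLLM EngineCore) re-imports it.

def apply_patch(target, attr, wrapper_factory):
    """Replace target.attr with a wrapper exactly once per process."""
    original = getattr(target, attr)
    if getattr(original, "_cmm_patched", False):
        return original  # already patched; applying again is a no-op
    wrapped = wrapper_factory(original)
    wrapped._cmm_patched = True
    setattr(target, attr, wrapped)
    return wrapped

# Example: override a measure() method the way the launcher overrides
# MemorySnapshot.measure() (this class is a stand-in, not vLLM's).
class MemorySnapshot:
    def measure(self):
        return "device-query"

def make_override(original):
    def measure(self):
        # With a pluggable managed allocator, raw device queries are
        # misleading, so report managed-memory accounting instead.
        return "managed-memory-accounting"
    return measure

apply_patch(MemorySnapshot, "measure", make_override)
apply_patch(MemorySnapshot, "measure", make_override)  # no-op
print(MemorySnapshot().measure())
```

The `_cmm_patched` marker is what makes double application safe when both the launcher and sitecustomize.py run in the same process.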
This enables vLLM to use cudaMallocManaged for transparent page-fault
access to both HBM (~96 GiB) and LPDDR (EGM, up to 480 GiB additional)
on GH200 systems with Extended GPU Memory enabled.
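Using the figures quoted in this message, the managed address space works out as below. This is a back-of-the-envelope check from the approximate numbers above, not measured values:

```python
# Rough memory budget from the figures quoted in this commit message.
HBM_GIB = 96              # GPU HBM on GH200 (~96 GiB)
EGM_GIB = 480             # additional LPDDR addressable via EGM (up to)
SYSTEM_RESERVE_GIB = 102  # host RAM remaining after the EGM carve-out

managed_total = HBM_GIB + EGM_GIB
print(f"managed address space: up to ~{managed_total} GiB")
print(f"host RAM remaining:    ~{SYSTEM_RESERVE_GIB} GiB")
```

The small remaining host RAM is why unprefetched managed pages pinned in system memory can OOM the host, motivating the immediate prefetch in managed_alloc.cu.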
Experimental branch: v0.19.0-cmm