sitecustomize.py: No longer swaps CUDAPluggableAllocator globally.
Sets VLLM_KV_CACHE_USE_MANAGED_MEMORY=1 instead.
vllm_managed_mem.py: No global allocator swap, no torch.cuda patches.
Previously the Dockerfile was copying the old version that had
cudaMemPrefetchAsync completely removed, which caused cublasGemmEx
crashes because model weights stayed in LPDDR and cuBLAS couldn't
access them. This version prefetches allocations <2 GiB (model
weights) to GPU but skips large allocations (KV cache).
Docker was caching the git clone layer because the command
string didn't change. Adding VLLM_COMMIT arg forces cache
invalidation when the source changes.
Key changes:
- managed_alloc.cu: Add cudaMemPrefetchAsync to migrate pages to GPU
immediately (prevents OOM from system RAM pinning on EGM systems
where only ~102 GiB RAM remains). Add cudaMemAdviseSetAccessedBy
for CPU so reads go over C2C NVLink without page migration.
- vllm_managed_mem.py: Rewrite with idempotent patches, proper
MemorySnapshot.measure() override, and torch.cuda tracking stubs
for CUDAPluggableAllocator compatibility.
- sitecustomize.py: Auto-loaded by Python in ALL subprocesses
(including vLLM EngineCore). Applies allocator swap, torch patches,
MemorySnapshot override, and request_memory override before any
CUDA operations in spawned processes.
- Dockerfile: Install sitecustomize.py into Python dist-packages.
- README.md: Full rewrite with EGM problem statement, memory layout,
architecture diagram, and build pipeline documentation.
- managed_alloc.cu: PyTorch pluggable allocator using cudaMallocManaged
- vllm_managed_mem.py: Launcher that patches vLLM for managed memory
- Dockerfile: Build and install managed memory components
This enables vLLM to use cudaMallocManaged for transparent page-fault
access to both HBM (~96 GiB) and LPDDR (EGM, up to 480 GiB additional)
on GH200 systems with Extended GPU Memory enabled.
Experimental branch: v0.19.0-cmm
- Add build-triton-kernels stage to fetch triton_kernels from Triton v3.6.0
- Install to site-packages for vLLM to find at runtime
- Resolves: No module named 'triton_kernels.matmul_ogs'
- Image tag: gh200-vllm-tfa:v0.19.0-tfa
CUDA 13 removed the fail_idx parameter from cuMemcpyBatchAsync.
Patch cache_kernels.cu to match new API signature instead of downgrading.
- Restore CUDA 13.0.1, PyTorch 2.9.0+cu130, flashinfer cu130
- Patch: remove fail_idx variable and parameter from cuMemcpyBatchAsync call
- Simplify error message to not reference fail_idx
- Switch to official facebook/xformers (johnnynunez fork has TORCH_STABLE_ONLY requiring PyTorch headers not in 2.9.0)
- Increase MAX_JOBS from 2-4 to 8 for all builds (native GH200 has 97GB HBM3)
- Increase NVCC_THREADS from 1 to 4 for flash-attention
- LMCache: reduced parallelism to avoid memory pressure
- vLLM: restored build from source (was using PyPI wheel)
- Will test with docker --memory=24g limit
- vLLM 0.18.1 aarch64 wheel includes pre-compiled FA2, FA3, MoE kernels
- Original build-from-source code commented out for GH200 restoration
- CMake compiler ABI detection fails under QEMU emulation