Docker was caching the git clone layer because the command
string never changed. Adding a VLLM_COMMIT build arg forces cache
invalidation whenever the pinned source commit changes.
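A minimal sketch of the pattern (the ARG name VLLM_COMMIT is from this change; the repo URL and paths are illustrative):

```dockerfile
# Changing the value passed for VLLM_COMMIT alters this layer's cache key,
# which invalidates every later layer and forces a fresh clone/checkout.
ARG VLLM_COMMIT=main
RUN git clone https://github.com/vllm-project/vllm.git /src/vllm \
 && git -C /src/vllm checkout "${VLLM_COMMIT}"
```

Built with e.g. `docker build --build-arg VLLM_COMMIT=<sha> .`, so bumping the commit rebuilds from the clone onward while unrelated layers stay cached.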
Key changes:
- managed_alloc.cu: Add cudaMemPrefetchAsync to migrate pages to the GPU
immediately (prevents OOM from pages pinning in system RAM on EGM
systems, where only ~102 GiB of RAM remains). Add
cudaMemAdviseSetAccessedBy for the CPU so reads go over C2C NVLink
without page migration.
- vllm_managed_mem.py: Rewrite with idempotent patches, proper
MemorySnapshot.measure() override, and torch.cuda tracking stubs
for CUDAPluggableAllocator compatibility.
- sitecustomize.py: Auto-loaded by Python in ALL subprocesses
(including vLLM EngineCore). Applies allocator swap, torch patches,
MemorySnapshot override, and request_memory override before any
CUDA operations in spawned processes.
- Dockerfile: Install sitecustomize.py into Python dist-packages.
- README.md: Full rewrite with EGM problem statement, memory layout,
architecture diagram, and build pipeline documentation.
- managed_alloc.cu: PyTorch pluggable allocator using cudaMallocManaged
- vllm_managed_mem.py: Launcher that patches vLLM for managed memory
- Dockerfile: Build and install managed memory components
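The idempotent-patch guard used in the changes above can be sketched as follows; the class and method names here are illustrative stand-ins, not vLLM's actual internals:

```python
# Minimal sketch of an idempotent monkey-patch. sitecustomize.py runs at
# interpreter startup in the parent and every spawned process, and a patch
# module may be imported more than once, so applying twice must be safe.

_PATCH_FLAG = "_managed_mem_patched"  # sentinel marking an applied patch

def apply_patch(target, attr, wrapper_factory):
    """Replace target.attr with wrapper_factory(original), exactly once."""
    original = getattr(target, attr)
    if getattr(original, _PATCH_FLAG, False):
        return  # already patched in this process; no-op
    wrapped = wrapper_factory(original)
    setattr(wrapped, _PATCH_FLAG, True)
    setattr(target, attr, wrapped)

# Illustrative use: override a stand-in measure() method.
class MemorySnapshot:
    def measure(self):
        return "device-query"

def make_override(original):
    def measure(self):
        # Managed memory breaks the usual free/total device query, so
        # report a precomputed value instead of calling the original.
        return "patched"
    return measure

apply_patch(MemorySnapshot, "measure", make_override)
apply_patch(MemorySnapshot, "measure", make_override)  # second call: no-op
```

The sentinel attribute is what makes repeated loading harmless: the second `apply_patch` sees the flag on the already-wrapped function and returns without stacking another wrapper.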
This enables vLLM to use cudaMallocManaged for transparent page-fault
access to both HBM (~96 GiB) and LPDDR (EGM, up to 480 GiB additional)
on GH200 systems with Extended GPU Memory enabled.
Experimental branch: v0.19.0-cmm
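The shape of such an allocator, as a sketch rather than the repo's actual managed_alloc.cu: the entry-point signatures match what torch.cuda.memory.CUDAPluggableAllocator expects, function names are assumptions, and the int-device forms of cudaMemAdvise/cudaMemPrefetchAsync shown here are the pre-CUDA-13 signatures (CUDA 13 takes a cudaMemLocation instead).

```cuda
// managed_alloc sketch -- built as a shared library and loaded via
// torch.cuda.memory.CUDAPluggableAllocator("libmanaged_alloc.so",
//                                          "managed_alloc", "managed_free").
#include <cuda_runtime_api.h>

extern "C" {

void* managed_alloc(size_t size, int device, cudaStream_t stream) {
  void* ptr = nullptr;
  if (cudaMallocManaged(&ptr, size) != cudaSuccess) return nullptr;
  // Let the CPU read these pages over C2C NVLink without migrating them.
  cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
  // Migrate to HBM up front; otherwise first-touch can pin pages in the
  // small remaining system RAM and OOM the host on EGM systems.
  cudaMemPrefetchAsync(ptr, size, device, stream);
  return ptr;
}

void managed_free(void* ptr, size_t size, int device, cudaStream_t stream) {
  (void)size; (void)device; (void)stream;
  cudaFree(ptr);
}

}  // extern "C"
```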
- Add build-triton-kernels stage to fetch triton_kernels from Triton v3.6.0
- Install to site-packages for vLLM to find at runtime
- Resolves: No module named 'triton_kernels.matmul_ogs'
- Image tag: gh200-vllm-tfa:v0.19.0-tfa
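A sketch of such a stage (the v3.6.0 tag is from the message; the repo layout, Python version, and destination path are assumptions):

```dockerfile
# build-triton-kernels: fetch triton_kernels from the Triton v3.6.0 tree.
FROM base AS build-triton-kernels
RUN git clone --depth 1 --branch v3.6.0 \
    https://github.com/triton-lang/triton.git /src/triton

# Later stage: copy the package into site-packages so vLLM can import
# triton_kernels.matmul_ogs at runtime.
COPY --from=build-triton-kernels \
     /src/triton/python/triton_kernels/triton_kernels \
     /usr/local/lib/python3.12/site-packages/triton_kernels
```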
CUDA 13 removed the fail_idx parameter from cuMemcpyBatchAsync.
Patch cache_kernels.cu to match the new API signature instead of downgrading CUDA.
- Restore CUDA 13.0.1, PyTorch 2.9.0+cu130, flashinfer cu130
- Patch: remove fail_idx variable and parameter from cuMemcpyBatchAsync call
- Simplify error message to not reference fail_idx
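Illustratively, the patch has roughly this shape (variable names, the abbreviated argument list, and the CUDA_CHECK macro are assumptions, not the actual cache_kernels.cu code):

```diff
-  size_t fail_idx = 0;
-  CUresult status = cuMemcpyBatchAsync(dsts, srcs, sizes, count, attrs,
-                                       attr_idxs, num_attrs, &fail_idx, stream);
-  CUDA_CHECK(status, "cuMemcpyBatchAsync failed at index ", fail_idx);
+  CUresult status = cuMemcpyBatchAsync(dsts, srcs, sizes, count, attrs,
+                                       attr_idxs, num_attrs, stream);
+  CUDA_CHECK(status, "cuMemcpyBatchAsync failed");
```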
- Switch to official facebook/xformers (the johnnynunez fork sets TORCH_STABLE_ONLY, which requires PyTorch headers not shipped in 2.9.0)
- Increase MAX_JOBS from 2-4 to 8 for all builds (native GH200 has 97 GB HBM3)
- Increase NVCC_THREADS from 1 to 4 for flash-attention
- LMCache: reduce parallelism to avoid memory pressure
- vLLM: restore build from source (was using a PyPI wheel)
- Will test with docker --memory=24g limit
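In Dockerfile terms the parallelism bump amounts to something like this (exact placement and per-stage overrides are illustrative):

```dockerfile
# Build parallelism for the native GH200 builder (97 GB HBM3 plus LPDDR):
# more compile jobs overall, multi-threaded nvcc for flash-attention.
ENV MAX_JOBS=8 \
    NVCC_THREADS=4
```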
- vLLM 0.18.1 aarch64 wheel includes pre-compiled FA2, FA3, MoE kernels
- Original build-from-source code commented out for GH200 restoration
- CMake compiler ABI detection fails under QEMU emulation
- Pin setuptools>=77.0.3,<81.0.0 for LMCache compatibility
- Use 'uv pip wheel' instead of 'pip3 wheel' for flash-attention (torch is in venv)
- Add CLAWMINE.md with build pipeline documentation
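The setuptools pin and the uv-based wheel build could look roughly like this (package spelling and flags beyond those named in the messages are assumptions):

```dockerfile
# Pin setuptools into the range LMCache's build accepts.
RUN uv pip install "setuptools>=77.0.3,<81.0.0"

# Build the flash-attention wheel with uv instead of pip3 so the build
# resolves against the torch already installed in the venv.
RUN uv pip wheel --no-build-isolation flash-attn
```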