grace-gpu-containers

Author	SHA1	Message	Date
biondizzle	bcc872c2c3	Remove global allocator swap, use targeted KV cache managed allocation. sitecustomize.py: No longer swaps CUDAPluggableAllocator globally; sets VLLM_KV_CACHE_USE_MANAGED_MEMORY=1 instead. vllm_managed_mem.py: No global allocator swap, no torch.cuda patches.	2026-04-11 02:15:09 +00:00
biondizzle	07468031db	Sync managed_alloc.cu: selective prefetch (<2 GiB to GPU). Previously the Dockerfile was copying the old version that had cudaMemPrefetchAsync completely removed, which caused cublasGemmEx crashes because model weights stayed in LPDDR and cuBLAS couldn't access them. This version prefetches allocations <2 GiB (model weights) to GPU but skips large allocations (KV cache).	2026-04-10 18:37:11 +00:00
biondizzle	cdfd37c1e6	Fix Dockerfile: separate git clone and build RUN commands	2026-04-10 15:32:16 +00:00
biondizzle	c1b013234e	Fix cache-bust: embed VLLM_COMMIT in git clone RUN command	2026-04-10 15:01:29 +00:00
biondizzle	98b4ae6676	Add VLLM_COMMIT cache-bust arg to Dockerfile. Docker was caching the git clone layer because the command string didn't change. Adding VLLM_COMMIT arg forces cache invalidation when the source changes.	2026-04-10 06:01:20 +00:00
biondizzle	c583bcb4fc	Fix cudaMemPrefetchAsync for CUDA 13: use cudaMemLocation + flags=0 (no stream param)	2026-04-10 02:45:05 +00:00
biondizzle	6053e6d0ea	Fix cudaMemPrefetchAsync: use int device instead of cudaMemLocation struct	2026-04-10 01:48:01 +00:00
biondizzle	aadde3ddf9	CMM: Fix OOM and subprocess crashes for GH200 EGM. Key changes: - managed_alloc.cu: Add cudaMemPrefetchAsync to migrate pages to GPU immediately (prevents OOM from system RAM pinning on EGM systems where only ~102 GiB RAM remains). Add cudaMemAdviseSetAccessedBy for CPU so reads go over C2C NVLink without page migration. - vllm_managed_mem.py: Rewrite with idempotent patches, proper MemorySnapshot.measure() override, and torch.cuda tracking stubs for CUDAPluggableAllocator compatibility. - sitecustomize.py: Auto-loaded by Python in ALL subprocesses (including vLLM EngineCore). Applies allocator swap, torch patches, MemorySnapshot override, and request_memory override before any CUDA operations in spawned processes. - Dockerfile: Install sitecustomize.py into Python dist-packages. - README.md: Full rewrite with EGM problem statement, memory layout, architecture diagram, and build pipeline documentation.	2026-04-09 23:25:48 +00:00
biondizzle	079eb88d7d	Switch vLLM source to Gitea fork (cmm branch)	2026-04-09 22:05:40 +00:00
biondizzle	7c79fb4ee7	fix: Update cudaMemAdvise for CUDA 13 API. CUDA 13 changed cudaMemAdvise to take cudaMemLocation struct instead of int. Updated to use cudaMemLocation with type=cudaMemLocationTypeDevice.	2026-04-07 21:32:17 +00:00
biondizzle	2757bffcb6	Add cudaMallocManaged allocator for GH200 EGM support. - managed_alloc.cu: PyTorch pluggable allocator using cudaMallocManaged - vllm_managed_mem.py: Launcher that patches vLLM for managed memory - Dockerfile: Build and install managed memory components. This enables vLLM to use cudaMallocManaged for transparent page-fault access to both HBM (~96 GiB) and LPDDR (EGM, up to 480 GiB additional) on GH200 systems with Extended GPU Memory enabled. Experimental branch: v0.19.0-cmm	2026-04-07 21:19:39 +00:00
biondizzle	edf12f7996	Clean up: remove PLAN-triton-kernels.md (merged into main)	2026-04-06 17:25:06 +00:00
biondizzle	e6cc28a942	Add triton_kernels for MoE support (vLLM v0.19.0). - Add build-triton-kernels stage to fetch triton_kernels from Triton v3.6.0 - Install to site-packages for vLLM to find at runtime - Resolves: No module named 'triton_kernels.matmul_ogs' - Image tag: gh200-vllm-tfa:v0.19.0-tfa	2026-04-06 16:39:56 +00:00
biondizzle	643d5589a3	Switch flashinfer to v0.6.6 for vLLM v0.19.0 (v0.6.7 works with v0.18.2rc0) v0.19.0	2026-04-03 13:15:56 +00:00
biondizzle	3290adb0ac	Upgrade vLLM to v0.19.0 for Gemma 4 support (requires transformers>=5.5.0)	2026-04-03 11:55:16 +00:00
biondizzle	cd5d58a6f9	Patch vLLM torch_utils.py: remove hoist=True for NGC PyTorch 2.11 compatibility	2026-04-03 11:40:51 +00:00
biondizzle	659c79638c	✅ WORKING BUILD #43 - GH200 vLLM container builds successfully. Versions locked: - vLLM: v0.18.2rc0 - flashinfer: v0.6.7 - flash-attention: hopper branch - lmcache: dev branch - infinistore: main - triton: 3.6.0 (PyPI wheel) - Base: nvcr.io/nvidia/pytorch:26.03-py3 (PyTorch 2.11.0a0, CUDA 13.2.0). DO NOT MODIFY WITHOUT MIKE'S APPROVAL	2026-04-03 11:08:29 +00:00
biondizzle	2442906d95	Add -y flag to pip uninstall pynvml for non-interactive Docker build	2026-04-03 10:57:42 +00:00
biondizzle	5280a28205	Bump flashinfer from v0.6.6 to v0.6.7 (required by vLLM v0.18.2rc0)	2026-04-03 10:52:19 +00:00
biondizzle	dbca81bba2	Switch vLLM from main to v0.18.2rc0 for CUDA 13.2 compatibility	2026-04-03 09:19:01 +00:00
biondizzle	202b9c4e23	Add -y flag to pip uninstall infinistore for non-interactive Docker build	2026-04-03 09:06:42 +00:00
biondizzle	c2cebcf962	Add apache-tvm-ffi dependency for flashinfer build	2026-04-03 09:00:18 +00:00
biondizzle	beb26d3573	Fix python -m build flag: use --no-isolation instead of --no-build-isolation	2026-04-03 08:54:45 +00:00
biondizzle	4e8a765c72	Fix wheel install conflict, use python -m build instead of pip build	2026-04-03 08:52:43 +00:00
biondizzle	ce55e45db2	Fix NGC PyTorch image tag format (26.03-py3)	2026-04-03 08:46:43 +00:00
biondizzle	c92c4ec68a	Switch to NVIDIA NGC PyTorch 26.03 base image (PyTorch 2.11.0a0, CUDA 13.2.0, ARM SBSA support)	2026-04-03 08:44:36 +00:00
biondizzle	54e609b2c5	Update lmcache/Dockerfile to CUDA 13.0.1, PyTorch nightly, LMCache dev branch	2026-04-03 08:39:35 +00:00
biondizzle	4980d9e49a	Use PyTorch nightly with CUDA 13.0 (torch 2.11.0.dev)	2026-04-03 08:36:36 +00:00
biondizzle	6a97539682	Fix duplicate corrupted lines in Dockerfile	2026-04-03 08:31:56 +00:00
biondizzle	f55789c53b	Bump to CUDA 13.0.1 + PyTorch 2.9.0, add version output on git checkouts	2026-04-03 08:26:53 +00:00
biondizzle	e514e0cd1e	Revert my patches - try v0.18.2rc0	2026-04-03 08:09:05 +00:00
biondizzle	4860bcee41	Skip LMCache CUDA extensions (NO_CUDA_EXT=1). PyTorch 2.9.0+cu130 was compiled with CUDA 12.8 but container has CUDA 13.0. Skip CUDA extension build to avoid version mismatch.	2026-04-03 08:05:44 +00:00
biondizzle	360b0dea58	Restore CUDA 13.0.1 + patch vLLM for cuMemcpyBatchAsync API change. CUDA 13 removed the fail_idx parameter from cuMemcpyBatchAsync. Patch cache_kernels.cu to match new API signature instead of downgrading. - Restore CUDA 13.0.1, PyTorch 2.9.0+cu130, flashinfer cu130 - Patch: remove fail_idx variable and parameter from cuMemcpyBatchAsync call - Simplify error message to not reference fail_idx	2026-04-03 07:53:12 +00:00
biondizzle	6255c94359	Downgrade to CUDA 12.8.1 for vLLM compatibility. cuMemcpyBatchAsync API changed in CUDA 13 - removed fail_idx parameter. vLLM code targets CUDA 12.8 API. Downgrade to CUDA 12.8.1.	2026-04-03 07:43:19 +00:00
biondizzle	ceab7ada22	Update flashinfer to v0.6.6 to match vLLM 0.18.x requirements. vLLM 0.18.x depends on flashinfer-python==0.6.6, was building 0.4.1	2026-04-03 07:13:16 +00:00
biondizzle	9d88d4c7d8	Skip xformers - vLLM has built-in FlashAttention kernels. xformers requires TORCH_STABLE_ONLY which needs torch/csrc/stable/ headers not present in PyTorch 2.9.0. vLLM 0.18.1 includes its own FA2/FA3 kernels.	2026-04-03 05:50:02 +00:00
biondizzle	45b6109ee1	Fix xformers TORCH_STABLE_ONLY issue + ramp up MAX_JOBS for native GH200. - Switch to official facebook/xformers (johnnynunez fork has TORCH_STABLE_ONLY requiring PyTorch headers not in 2.9.0) - Increase MAX_JOBS from 2-4 to 8 for all builds (native GH200 has 97GB HBM3) - Increase NVCC_THREADS from 1 to 4 for flash-attention	2026-04-03 05:46:11 +00:00
biondizzle	b223c051de	move things	2026-04-03 04:27:21 +00:00
biondizzle	7f7ca4a742	move things	2026-04-03 04:26:52 +00:00
biondizzle	2dc2008475	move things	2026-04-03 04:26:26 +00:00
biondizzle	980cd1b749	move things	2026-04-03 04:26:08 +00:00
biondizzle	1540b0c54e	move things	2026-04-03 04:25:23 +00:00
biondizzle	0b4ede8047	Add .gitignore for internal docs	2026-04-03 03:49:41 +00:00
biondizzle	5c29d2bea7	Fix: LMCache default branch is 'dev' not 'main'	2026-04-03 03:34:58 +00:00
biondizzle	9259555802	Fix: Actually update LMCache to main branch (previous edit failed)	2026-04-03 03:16:45 +00:00
biondizzle	750906e649	Bleeding edge build: LMCache main, vLLM main, latest transformers	2026-04-03 03:14:01 +00:00
biondizzle	a399fbc8c6	Add MAX_JOBS=2 for LMCache, restore vLLM build from source. - LMCache: reduced parallelism to avoid memory pressure - vLLM: restored build from source (was using PyPI wheel) - Will test with docker --memory=24g limit	2026-04-03 02:49:43 +00:00
biondizzle	f8a9d372e5	Use PyPI vLLM wheel instead of building (QEMU cmake try_compile fails). - vLLM 0.18.1 aarch64 wheel includes pre-compiled FA2, FA3, MoE kernels - Original build-from-source code commented out for GH200 restoration - CMake compiler ABI detection fails under QEMU emulation	2026-04-03 00:05:56 +00:00
biondizzle	436214bb72	Use PyPI triton wheel instead of building (QEMU segfaults). Triton 3.6.0 has official aarch64 wheel on PyPI. Building triton from source causes segfaults under QEMU emulation.	2026-04-02 23:58:20 +00:00
biondizzle	e5445512aa	Reduce MAX_JOBS by half to reduce QEMU memory pressure. - xformers: 6 -> 3 - flash-attention: 8 -> 4 - vllm: 8 -> 4. Testing if lower parallelism helps avoid segfaults under emulation	2026-04-02 23:44:11 +00:00
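
Note on commit 07468031db: the selective prefetch rule it describes (allocations under 2 GiB are prefetched to the GPU, larger ones are left alone) can be sketched in Python as follows. This is only an illustration of the size threshold quoted in the commit message, not the code in managed_alloc.cu; the names GIB, PREFETCH_LIMIT_BYTES and should_prefetch_to_gpu are hypothetical and do not come from the repository.

```python
# Hypothetical sketch of the size-based prefetch decision described in
# commit 07468031db. The real logic lives in managed_alloc.cu (CUDA);
# every name below is an assumption made for illustration only.

GIB = 1024 ** 3

# Threshold quoted in the commit message: allocations below 2 GiB
# (model weights) are prefetched, larger ones (KV cache) are skipped.
PREFETCH_LIMIT_BYTES = 2 * GIB


def should_prefetch_to_gpu(size_bytes: int) -> bool:
    """Return True if an allocation of this size would be prefetched."""
    if size_bytes < 0:
        raise ValueError("allocation size must be non-negative")
    return size_bytes < PREFETCH_LIMIT_BYTES


if __name__ == "__main__":
    for size in (512 * 1024 ** 2, 2 * GIB, 40 * GIB):
        action = "prefetch" if should_prefetch_to_gpu(size) else "skip"
        print(f"{size / GIB:.2f} GiB: {action}")
```

Under this rule a 512 MiB weight tensor is migrated to the GPU up front, while an allocation of exactly 2 GiB or more stays unprefetched and is left to on-demand page migration.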


83 Commits