Commit Graph

83 Commits

bcc872c2c3 Remove global allocator swap, use targeted KV cache managed allocation
sitecustomize.py: No longer swaps CUDAPluggableAllocator globally.
Sets VLLM_KV_CACHE_USE_MANAGED_MEMORY=1 instead.
vllm_managed_mem.py: No global allocator swap, no torch.cuda patches.
2026-04-11 02:15:09 +00:00
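The commit above replaces a global allocator swap with a per-allocation gate. A minimal sketch of that pattern — only the env var name `VLLM_KV_CACHE_USE_MANAGED_MEMORY` comes from the commit; the helper names and stubbed return values below are hypothetical:

```python
import os

# Hypothetical sketch: gate the managed allocator to the KV cache only,
# instead of swapping the allocator for every CUDA allocation.
def use_managed_kv_cache() -> bool:
    """True when the KV cache should come from the managed-memory pool."""
    return os.environ.get("VLLM_KV_CACHE_USE_MANAGED_MEMORY", "0") == "1"

def allocate_kv_cache(nbytes: int):
    if use_managed_kv_cache():
        return ("managed", nbytes)  # cudaMallocManaged-backed pool (stubbed)
    return ("device", nbytes)       # default caching allocator (stubbed)
```

Everything else (model weights, activations) keeps using the default allocator, which is why the `sitecustomize.py` global swap and the `torch.cuda` patches could be dropped.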
07468031db Sync managed_alloc.cu: selective prefetch (<2 GiB to GPU)
Previously the Dockerfile was copying the old version that had
cudaMemPrefetchAsync completely removed, which caused cublasGemmEx
crashes because model weights stayed in LPDDR and cuBLAS couldn't
access them. This version prefetches allocations <2 GiB (model
weights) to GPU but skips large allocations (KV cache).
2026-04-10 18:37:11 +00:00
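The selective-prefetch policy described above (prefetch small allocations such as model weights to HBM, leave large KV-cache allocations in managed memory to migrate on demand) reduces to a size threshold. A sketch, with a hypothetical constant name — the 2 GiB cutoff is from the commit:

```python
# Allocations below this size (model weights) are prefetched to the GPU so
# cuBLAS can reach them from HBM; larger ones (KV cache) are left to
# page-fault migration. Constant name is hypothetical; the value is not.
PREFETCH_LIMIT_BYTES = 2 * 1024**3  # 2 GiB

def should_prefetch_to_gpu(nbytes: int) -> bool:
    return nbytes < PREFETCH_LIMIT_BYTES
```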
cdfd37c1e6 Fix Dockerfile: separate git clone and build RUN commands 2026-04-10 15:32:16 +00:00
c1b013234e Fix cache-bust: embed VLLM_COMMIT in git clone RUN command 2026-04-10 15:01:29 +00:00
98b4ae6676 Add VLLM_COMMIT cache-bust arg to Dockerfile
Docker was caching the git clone layer because the command
string didn't change. Adding VLLM_COMMIT arg forces cache
invalidation when the source changes.
2026-04-10 06:01:20 +00:00
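Docker keys a RUN layer's cache on the command string, so an ARG only invalidates the cache if it is actually referenced in that command — which is why the follow-up fixes embed `VLLM_COMMIT` in the clone step and split clone and build into separate RUN commands. A sketch of the pattern (repo URL, branch, and paths are placeholders, not the real Dockerfile):

```dockerfile
# VLLM_COMMIT is referenced inside the clone RUN so a new value changes the
# layer's cache key. URL/branch/path below are hypothetical.
ARG VLLM_COMMIT=main
RUN echo "vllm @ ${VLLM_COMMIT}" && \
    git clone --branch cmm https://gitea.example.internal/fork/vllm.git /opt/vllm && \
    git -C /opt/vllm checkout "${VLLM_COMMIT}"
# Build happens in a separate RUN so a clone-layer cache miss is isolated.
```

Invoked as something like `docker build --build-arg VLLM_COMMIT=$(git rev-parse HEAD) .` so the cache busts exactly when the source changes.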
c583bcb4fc Fix cudaMemPrefetchAsync for CUDA 13: use cudaMemLocation + flags=0 (no stream param) 2026-04-10 02:45:05 +00:00
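The CUDA 13 call shape described in this commit (a `cudaMemLocation` destination plus `flags = 0`, no stream parameter) can be illustrated with local stubs standing in for the runtime, so it compiles and runs without a GPU. The stub signature follows the commit's description, not the CUDA headers — real code includes `cuda_runtime.h` instead:

```cpp
#include <cassert>
#include <cstddef>

// Stubs mirroring the CUDA 13 API *as described in the commit*
// (cudaMemLocation destination, flags = 0, no stream parameter).
enum cudaMemLocationType { cudaMemLocationTypeDevice = 1 };
struct cudaMemLocation { cudaMemLocationType type; int id; };
using cudaError_t = int;

static cudaMemLocation g_last_loc;       // records what the stub was passed
static unsigned g_last_flags = 0xFFFF;

cudaError_t cudaMemPrefetchAsync(const void*, size_t /*count*/,
                                 cudaMemLocation loc, unsigned flags) {
    g_last_loc = loc;
    g_last_flags = flags;
    return 0;  // cudaSuccess
}

// Prefetch a managed allocation to a specific GPU, CUDA-13 style.
cudaError_t prefetch_to_gpu(const void* ptr, size_t bytes, int device) {
    cudaMemLocation loc{};
    loc.type = cudaMemLocationTypeDevice;  // destination is a GPU, not the host
    loc.id = device;
    return cudaMemPrefetchAsync(ptr, bytes, loc, /*flags=*/0);
}
```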
6053e6d0ea Fix cudaMemPrefetchAsync: use int device instead of cudaMemLocation struct 2026-04-10 01:48:01 +00:00
aadde3ddf9 CMM: Fix OOM and subprocess crashes for GH200 EGM
Key changes:
- managed_alloc.cu: Add cudaMemPrefetchAsync to migrate pages to GPU
  immediately (prevents OOM from system RAM pinning on EGM systems
  where only ~102 GiB RAM remains). Add cudaMemAdviseSetAccessedBy
  for CPU so reads go over C2C NVLink without page migration.
- vllm_managed_mem.py: Rewrite with idempotent patches, proper
  MemorySnapshot.measure() override, and torch.cuda tracking stubs
  for CUDAPluggableAllocator compatibility.
- sitecustomize.py: Auto-loaded by Python in ALL subprocesses
  (including vLLM EngineCore). Applies allocator swap, torch patches,
  MemorySnapshot override, and request_memory override before any
  CUDA operations in spawned processes.
- Dockerfile: Install sitecustomize.py into Python dist-packages.
- README.md: Full rewrite with EGM problem statement, memory layout,
  architecture diagram, and build pipeline documentation.
2026-04-09 23:25:48 +00:00
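The `sitecustomize.py` trick above leans on a standard interpreter behavior: at startup, `site` tries to import any module named `sitecustomize` found on `sys.path`, in every process — including spawned subprocesses like the vLLM EngineCore — before user code runs. A self-contained demo of the mechanism (temp dir instead of dist-packages, and an env-var marker instead of the real allocator patches):

```python
import os
import subprocess
import sys
import tempfile

def run_with_sitecustomize() -> str:
    """Spawn a fresh interpreter and show that sitecustomize ran first."""
    with tempfile.TemporaryDirectory() as d:
        # Stand-in for the real patches: just set a marker env var.
        with open(os.path.join(d, "sitecustomize.py"), "w") as f:
            f.write("import os; os.environ['PATCHED'] = '1'\n")
        env = dict(os.environ, PYTHONPATH=d)
        out = subprocess.run(
            [sys.executable, "-c",
             "import os; print(os.environ.get('PATCHED', '0'))"],
            env=env, capture_output=True, text=True)
        return out.stdout.strip()
```

Installing the file into dist-packages (as the Dockerfile change does) achieves the same thing without needing `PYTHONPATH`, which is why the patches apply before any CUDA operation in every subprocess.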
079eb88d7d Switch vLLM source to Gitea fork (cmm branch) 2026-04-09 22:05:40 +00:00
7c79fb4ee7 fix: Update cudaMemAdvise for CUDA 13 API
CUDA 13 changed cudaMemAdvise to take cudaMemLocation struct instead of int.
Updated to use cudaMemLocation with type=cudaMemLocationTypeDevice.
2026-04-07 21:32:17 +00:00
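The int-to-struct change above can likewise be shown with a local stub (advice enum value and signature follow the commit's description plus the `cudaMemLocation` layout used since CUDA 12; real code includes `cuda_runtime.h`):

```cpp
#include <cassert>
#include <cstddef>

// Stub mirroring CUDA 13's cudaMemAdvise, which takes a cudaMemLocation
// struct where older CUDA took a plain int device id (per the commit).
enum cudaMemoryAdvise { cudaMemAdviseSetPreferredLocation = 3 };
enum cudaMemLocationType { cudaMemLocationTypeDevice = 1 };
struct cudaMemLocation { cudaMemLocationType type; int id; };

static int g_advised_device = -1;  // records the stub's last location id

int cudaMemAdvise(const void*, size_t /*count*/, cudaMemoryAdvise,
                  cudaMemLocation loc) {
    g_advised_device = loc.id;
    return 0;  // cudaSuccess
}

// Advise the driver that `device` is the preferred home for this range.
int advise_preferred_gpu(const void* p, size_t n, int device) {
    cudaMemLocation loc{cudaMemLocationTypeDevice, device};
    return cudaMemAdvise(p, n, cudaMemAdviseSetPreferredLocation, loc);
}
```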
2757bffcb6 Add cudaMallocManaged allocator for GH200 EGM support
- managed_alloc.cu: PyTorch pluggable allocator using cudaMallocManaged
- vllm_managed_mem.py: Launcher that patches vLLM for managed memory
- Dockerfile: Build and install managed memory components

This enables vLLM to use cudaMallocManaged for transparent page-fault
access to both HBM (~96 GiB) and LPDDR (EGM, up to 480 GiB additional)
on GH200 systems with Extended GPU Memory enabled.

Experimental branch: v0.19.0-cmm
2026-04-07 21:19:39 +00:00
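A PyTorch pluggable allocator like `managed_alloc.cu` is a shared library exposing a malloc/free pair with C linkage. The sketch below stubs `cudaMallocManaged` with `malloc` so the shape runs anywhere; the entry-point names are placeholders, and the exact parameter types torch expects should be checked against the `CUDAPluggableAllocator` docs:

```cpp
#include <cstddef>
#include <cstdlib>

using cudaStream_t = void*;  // stand-in for the CUDA stream handle

// Stub for cudaMallocManaged; the real managed_alloc.cu would call
// cudaMallocManaged(&p, size, cudaMemAttachGlobal) instead.
static int cudaMallocManaged_stub(void** p, size_t size) {
    *p = std::malloc(size);
    return *p ? 0 : 2;  // cudaSuccess / cudaErrorMemoryAllocation
}

extern "C" void* managed_malloc(size_t size, int /*device*/, cudaStream_t) {
    void* p = nullptr;
    cudaMallocManaged_stub(&p, size);
    return p;
}

extern "C" void managed_free(void* ptr, size_t /*size*/, int /*device*/,
                             cudaStream_t) {
    std::free(ptr);  // real code: cudaFree
}
```

On the Python side this is wired up with `torch.cuda.memory.CUDAPluggableAllocator("libmanaged.so", "managed_malloc", "managed_free")` followed by `torch.cuda.memory.change_current_allocator(...)` (library and symbol names here are placeholders).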
edf12f7996 Clean up: remove PLAN-triton-kernels.md (merged into main) 2026-04-06 17:25:06 +00:00
e6cc28a942 Add triton_kernels for MoE support (vLLM v0.19.0)
- Add build-triton-kernels stage to fetch triton_kernels from Triton v3.6.0
- Install to site-packages for vLLM to find at runtime
- Resolves: No module named 'triton_kernels.matmul_ogs'
- Image tag: gh200-vllm-tfa:v0.19.0-tfa
2026-04-06 16:39:56 +00:00
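Since `triton_kernels` is a pure-Python package inside the Triton repo, a fetch-and-copy stage is enough. A hypothetical sketch of the stage described above — the in-repo path and the site-packages destination are assumptions, not taken from the actual Dockerfile:

```dockerfile
# Hypothetical sketch of the build-triton-kernels stage.
FROM base AS build-triton-kernels
RUN git clone --depth 1 --branch v3.6.0 \
    https://github.com/triton-lang/triton.git /opt/triton

FROM base AS final
# Path inside the Triton repo is an assumption; verify against v3.6.0.
COPY --from=build-triton-kernels \
     /opt/triton/python/triton_kernels/triton_kernels \
     /usr/local/lib/python3/dist-packages/triton_kernels
```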
643d5589a3 Switch flashinfer to v0.6.6 for vLLM v0.19.0 (v0.6.7 works with v0.18.2rc0) (tag: v0.19.0) 2026-04-03 13:15:56 +00:00
3290adb0ac Upgrade vLLM to v0.19.0 for Gemma 4 support (requires transformers>=5.5.0) 2026-04-03 11:55:16 +00:00
cd5d58a6f9 Patch vLLM torch_utils.py: remove hoist=True for NGC PyTorch 2.11 compatibility 2026-04-03 11:40:51 +00:00
659c79638c WORKING BUILD #43 - GH200 vLLM container builds successfully
Versions locked:
- vLLM: v0.18.2rc0
- flashinfer: v0.6.7
- flash-attention: hopper branch
- lmcache: dev branch
- infinistore: main
- triton: 3.6.0 (PyPI wheel)
- Base: nvcr.io/nvidia/pytorch:26.03-py3 (PyTorch 2.11.0a0, CUDA 13.2.0)

DO NOT MODIFY WITHOUT MIKE'S APPROVAL
2026-04-03 11:08:29 +00:00
2442906d95 Add -y flag to pip uninstall pynvml for non-interactive Docker build 2026-04-03 10:57:42 +00:00
5280a28205 Bump flashinfer from v0.6.6 to v0.6.7 (required by vLLM v0.18.2rc0) 2026-04-03 10:52:19 +00:00
dbca81bba2 Switch vLLM from main to v0.18.2rc0 for CUDA 13.2 compatibility 2026-04-03 09:19:01 +00:00
202b9c4e23 Add -y flag to pip uninstall infinistore for non-interactive Docker build 2026-04-03 09:06:42 +00:00
c2cebcf962 Add apache-tvm-ffi dependency for flashinfer build 2026-04-03 09:00:18 +00:00
beb26d3573 Fix python -m build flag: use --no-isolation instead of --no-build-isolation 2026-04-03 08:54:45 +00:00
4e8a765c72 Fix wheel install conflict, use python -m build instead of pip build 2026-04-03 08:52:43 +00:00
ce55e45db2 Fix NGC PyTorch image tag format (26.03-py3) 2026-04-03 08:46:43 +00:00
c92c4ec68a Switch to NVIDIA NGC PyTorch 26.03 base image (PyTorch 2.11.0a0, CUDA 13.2.0, ARM SBSA support) 2026-04-03 08:44:36 +00:00
54e609b2c5 Update lmcache/Dockerfile to CUDA 13.0.1, PyTorch nightly, LMCache dev branch 2026-04-03 08:39:35 +00:00
4980d9e49a Use PyTorch nightly with CUDA 13.0 (torch 2.11.0.dev) 2026-04-03 08:36:36 +00:00
6a97539682 Fix duplicate corrupted lines in Dockerfile 2026-04-03 08:31:56 +00:00
f55789c53b Bump to CUDA 13.0.1 + PyTorch 2.9.0, add version output on git checkouts 2026-04-03 08:26:53 +00:00
e514e0cd1e Revert my patches - try v0.18.2rc0 2026-04-03 08:09:05 +00:00
4860bcee41 Skip LMCache CUDA extensions (NO_CUDA_EXT=1)
PyTorch 2.9.0+cu130 was compiled with CUDA 12.8 but container has CUDA 13.0.
Skip CUDA extension build to avoid version mismatch.
2026-04-03 08:05:44 +00:00
360b0dea58 Restore CUDA 13.0.1 + patch vLLM for cuMemcpyBatchAsync API change
CUDA 13 removed the fail_idx parameter from cuMemcpyBatchAsync.
Patch cache_kernels.cu to match new API signature instead of downgrading.

- Restore CUDA 13.0.1, PyTorch 2.9.0+cu130, flashinfer cu130
- Patch: remove fail_idx variable and parameter from cuMemcpyBatchAsync call
- Simplify error message to not reference fail_idx
2026-04-03 07:53:12 +00:00
6255c94359 Downgrade to CUDA 12.8.1 for vLLM compatibility
cuMemcpyBatchAsync API changed in CUDA 13 - removed fail_idx parameter.
vLLM code targets CUDA 12.8 API. Downgrade to CUDA 12.8.1.
2026-04-03 07:43:19 +00:00
ceab7ada22 Update flashinfer to v0.6.6 to match vLLM 0.18.x requirements
vLLM 0.18.x depends on flashinfer-python==0.6.6, was building 0.4.1
2026-04-03 07:13:16 +00:00
9d88d4c7d8 Skip xformers - vLLM has built-in FlashAttention kernels
xformers requires TORCH_STABLE_ONLY which needs torch/csrc/stable/ headers
not present in PyTorch 2.9.0. vLLM 0.18.1 includes its own FA2/FA3 kernels.
2026-04-03 05:50:02 +00:00
45b6109ee1 Fix xformers TORCH_STABLE_ONLY issue + ramp up MAX_JOBS for native GH200
- Switch to official facebook/xformers (johnnynunez fork has TORCH_STABLE_ONLY requiring PyTorch headers not in 2.9.0)
- Increase MAX_JOBS from 2-4 to 8 for all builds (native GH200 has 97GB HBM3)
- Increase NVCC_THREADS from 1 to 4 for flash-attention
2026-04-03 05:46:11 +00:00
b223c051de move things 2026-04-03 04:27:21 +00:00
7f7ca4a742 move things 2026-04-03 04:26:52 +00:00
2dc2008475 move things 2026-04-03 04:26:26 +00:00
980cd1b749 move things 2026-04-03 04:26:08 +00:00
1540b0c54e move things 2026-04-03 04:25:23 +00:00
0b4ede8047 Add .gitignore for internal docs 2026-04-03 03:49:41 +00:00
5c29d2bea7 Fix: LMCache default branch is 'dev' not 'main' 2026-04-03 03:34:58 +00:00
9259555802 Fix: Actually update LMCache to main branch (previous edit failed) 2026-04-03 03:16:45 +00:00
750906e649 Bleeding edge build: LMCache main, vLLM main, latest transformers 2026-04-03 03:14:01 +00:00
a399fbc8c6 Add MAX_JOBS=2 for LMCache, restore vLLM build from source
- LMCache: reduced parallelism to avoid memory pressure
- vLLM: restored build from source (was using PyPI wheel)
- Will test with docker --memory=24g limit
2026-04-03 02:49:43 +00:00
f8a9d372e5 Use PyPI vLLM wheel instead of building (QEMU cmake try_compile fails)
- vLLM 0.18.1 aarch64 wheel includes pre-compiled FA2, FA3, MoE kernels
- Original build-from-source code commented out for GH200 restoration
- CMake compiler ABI detection fails under QEMU emulation
2026-04-03 00:05:56 +00:00
436214bb72 Use PyPI triton wheel instead of building (QEMU segfaults)
Triton 3.6.0 has official aarch64 wheel on PyPI.
Building triton from source causes segfaults under QEMU emulation.
2026-04-02 23:58:20 +00:00
e5445512aa Reduce MAX_JOBS by half to reduce QEMU memory pressure
- xformers: 6 -> 3
- flash-attention: 8 -> 4
- vllm: 8 -> 4

Testing if lower parallelism helps avoid segfaults under emulation
2026-04-02 23:44:11 +00:00