Commit Graph

64 Commits

Author SHA1 Message Date
cdfd37c1e6 Fix Dockerfile: separate git clone and build RUN commands 2026-04-10 15:32:16 +00:00
c1b013234e Fix cache-bust: embed VLLM_COMMIT in git clone RUN command 2026-04-10 15:01:29 +00:00
98b4ae6676 Add VLLM_COMMIT cache-bust arg to Dockerfile
Docker was caching the git clone layer because the command
string didn't change. Adding VLLM_COMMIT arg forces cache
invalidation when the source changes.
2026-04-10 06:01:20 +00:00
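The pattern these three cache-bust commits converge on can be sketched roughly as follows. This is a sketch, not the actual Dockerfile: the repository URL, paths, and the build command are assumptions; only the `VLLM_COMMIT` arg and the clone/build split come from the commit messages.

```dockerfile
# VLLM_COMMIT is interpolated into the RUN command string below, so a new
# commit hash changes the layer's cache key and forces a fresh git clone
# (Docker caches RUN layers by command string, not by what the command fetches).
ARG VLLM_COMMIT
RUN git clone --branch cmm https://gitea.example.com/fork/vllm.git /opt/vllm-src \
 && git -C /opt/vllm-src checkout "${VLLM_COMMIT}"

# Build in a separate RUN command (the fix in cdfd37c1e6), keeping the
# clone and the build as distinct layers so the cache-bust applies
# cleanly to the clone step.
RUN pip install --no-build-isolation /opt/vllm-src
```

Passed at build time with something like `docker build --build-arg VLLM_COMMIT="$(git ls-remote <fork-url> cmm | cut -f1)" .`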
aadde3ddf9 CMM: Fix OOM and subprocess crashes for GH200 EGM
Key changes:
- managed_alloc.cu: Add cudaMemPrefetchAsync to migrate pages to GPU
  immediately (prevents OOM from system RAM pinning on EGM systems
  where only ~102 GiB RAM remains). Add cudaMemAdviseSetAccessedBy
  for CPU so reads go over C2C NVLink without page migration.
- vllm_managed_mem.py: Rewrite with idempotent patches, proper
  MemorySnapshot.measure() override, and torch.cuda tracking stubs
  for CUDAPluggableAllocator compatibility.
- sitecustomize.py: Auto-loaded by Python in ALL subprocesses
  (including vLLM EngineCore). Applies allocator swap, torch patches,
  MemorySnapshot override, and request_memory override before any
  CUDA operations in spawned processes.
- Dockerfile: Install sitecustomize.py into Python dist-packages.
- README.md: Full rewrite with EGM problem statement, memory layout,
  architecture diagram, and build pipeline documentation.
2026-04-09 23:25:48 +00:00
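The "idempotent patches" called out for vllm_managed_mem.py matter because sitecustomize.py runs in every Python subprocess and may be applied on top of an already-patched module. A minimal sketch of the guard pattern in plain Python — the `_cmm_patched` marker, class, and factory names here are illustrative, not the actual identifiers:

```python
def patch_once(obj, name, wrapper_factory):
    """Replace obj.name with wrapper_factory(original), but only once.

    A marker attribute on the wrapper makes repeated application a no-op,
    so the same patch can run in the parent process and again in any
    spawned subprocess without double-wrapping.
    """
    current = getattr(obj, name)
    if getattr(current, "_cmm_patched", False):
        return current  # already patched; leave it alone
    wrapped = wrapper_factory(current)
    wrapped._cmm_patched = True
    setattr(obj, name, wrapped)
    return wrapped


class MemorySnapshot:
    def measure(self):
        return "unpatched"


def managed_measure_factory(original):
    def measure(self):
        # Call through to the original, then adjust (illustrative only).
        return "managed:" + original(self)
    return measure


patch_once(MemorySnapshot, "measure", managed_measure_factory)
patch_once(MemorySnapshot, "measure", managed_measure_factory)  # no-op
```

The second `patch_once` call returns the existing wrapper unchanged, which is the property the subprocess-wide sitecustomize hook relies on.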
079eb88d7d Switch vLLM source to Gitea fork (cmm branch) 2026-04-09 22:05:40 +00:00
2757bffcb6 Add cudaMallocManaged allocator for GH200 EGM support
- managed_alloc.cu: PyTorch pluggable allocator using cudaMallocManaged
- vllm_managed_mem.py: Launcher that patches vLLM for managed memory
- Dockerfile: Build and install managed memory components

This enables vLLM to use cudaMallocManaged for transparent page-fault
access to both HBM (~96 GiB) and LPDDR (EGM, up to 480 GiB additional)
on GH200 systems with Extended GPU Memory enabled.

Experimental branch: v0.19.0-cmm
2026-04-07 21:19:39 +00:00
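The allocator side of managed_alloc.cu could look roughly like this. A sketch under assumptions: the entry-point names are hypothetical, and the prefetch/advise calls are the ones commit aadde3ddf9 says were added later; only the use of `cudaMallocManaged` behind a PyTorch pluggable allocator is stated by this commit.

```cuda
// Sketch of a CUDAPluggableAllocator backend using cudaMallocManaged.
// Loaded from Python with something like:
//   torch.cuda.memory.CUDAPluggableAllocator(
//       "managed_alloc.so", "managed_malloc", "managed_free")
#include <cuda_runtime_api.h>

extern "C" {

void* managed_malloc(size_t size, int device, cudaStream_t stream) {
  void* ptr = nullptr;
  if (cudaMallocManaged(&ptr, size, cudaMemAttachGlobal) != cudaSuccess)
    return nullptr;
  // Migrate pages to the GPU immediately so they are not pinned in the
  // limited system RAM on EGM systems (the OOM fix from aadde3ddf9).
  cudaMemPrefetchAsync(ptr, size, device, stream);
  // Let the CPU read over C2C NVLink without faulting pages back to host.
  cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
  return ptr;
}

void managed_free(void* ptr, size_t size, int device, cudaStream_t stream) {
  (void)size; (void)device; (void)stream;
  cudaFree(ptr);
}

}  // extern "C"
```

With managed memory, allocations can exceed HBM capacity and spill transparently into LPDDR via page faults, which is what makes the ~480 GiB of EGM addressable.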
e6cc28a942 Add triton_kernels for MoE support (vLLM v0.19.0)
- Add build-triton-kernels stage to fetch triton_kernels from Triton v3.6.0
- Install to site-packages for vLLM to find at runtime
- Resolves: No module named 'triton_kernels.matmul_ogs'
- Image tag: gh200-vllm-tfa:v0.19.0-tfa
2026-04-06 16:39:56 +00:00
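A multi-stage fetch like the one this commit describes might look as follows. The in-tree path of `triton_kernels`, the base image variable, and the dist-packages path are assumptions; the commit only states that the package comes from the Triton v3.6.0 tree and is installed into site-packages.

```dockerfile
FROM ${BASE_IMAGE} AS build-triton-kernels
RUN git clone --depth 1 --branch v3.6.0 \
        https://github.com/triton-lang/triton.git /src/triton

FROM ${BASE_IMAGE} AS final
# triton_kernels is not published on PyPI; copying the package directory
# into dist-packages resolves "No module named 'triton_kernels.matmul_ogs'".
COPY --from=build-triton-kernels \
     /src/triton/python/triton_kernels/triton_kernels \
     /usr/local/lib/python3.12/dist-packages/triton_kernels
```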
643d5589a3 Switch flashinfer to v0.6.6 for vLLM v0.19.0 (v0.6.7 works with v0.18.2rc0) 2026-04-03 13:15:56 +00:00
3290adb0ac Upgrade vLLM to v0.19.0 for Gemma 4 support (requires transformers>=5.5.0) 2026-04-03 11:55:16 +00:00
cd5d58a6f9 Patch vLLM torch_utils.py: remove hoist=True for NGC PyTorch 2.11 compatibility 2026-04-03 11:40:51 +00:00
659c79638c WORKING BUILD #43 - GH200 vLLM container builds successfully
Versions locked:
- vLLM: v0.18.2rc0
- flashinfer: v0.6.7
- flash-attention: hopper branch
- lmcache: dev branch
- infinistore: main
- triton: 3.6.0 (PyPI wheel)
- Base: nvcr.io/nvidia/pytorch:26.03-py3 (PyTorch 2.11.0a0, CUDA 13.2.0)

DO NOT MODIFY WITHOUT MIKE'S APPROVAL
2026-04-03 11:08:29 +00:00
2442906d95 Add -y flag to pip uninstall pynvml for non-interactive Docker build 2026-04-03 10:57:42 +00:00
5280a28205 Bump flashinfer from v0.6.6 to v0.6.7 (required by vLLM v0.18.2rc0) 2026-04-03 10:52:19 +00:00
dbca81bba2 Switch vLLM from main to v0.18.2rc0 for CUDA 13.2 compatibility 2026-04-03 09:19:01 +00:00
202b9c4e23 Add -y flag to pip uninstall infinistore for non-interactive Docker build 2026-04-03 09:06:42 +00:00
c2cebcf962 Add apache-tvm-ffi dependency for flashinfer build 2026-04-03 09:00:18 +00:00
beb26d3573 Fix python -m build flag: use --no-isolation instead of --no-build-isolation 2026-04-03 08:54:45 +00:00
4e8a765c72 Fix wheel install conflict, use python -m build instead of pip build 2026-04-03 08:52:43 +00:00
ce55e45db2 Fix NGC PyTorch image tag format (26.03-py3) 2026-04-03 08:46:43 +00:00
c92c4ec68a Switch to NVIDIA NGC PyTorch 26.03 base image (PyTorch 2.11.0a0, CUDA 13.2.0, ARM SBSA support) 2026-04-03 08:44:36 +00:00
4980d9e49a Use PyTorch nightly with CUDA 13.0 (torch 2.11.0.dev) 2026-04-03 08:36:36 +00:00
6a97539682 Fix duplicate corrupted lines in Dockerfile 2026-04-03 08:31:56 +00:00
f55789c53b Bump to CUDA 13.0.1 + PyTorch 2.9.0, add version output on git checkouts 2026-04-03 08:26:53 +00:00
e514e0cd1e Revert my patches - try v0.18.2rc0 2026-04-03 08:09:05 +00:00
4860bcee41 Skip LMCache CUDA extensions (NO_CUDA_EXT=1)
PyTorch 2.9.0+cu130 was compiled with CUDA 12.8 but container has CUDA 13.0.
Skip CUDA extension build to avoid version mismatch.
2026-04-03 08:05:44 +00:00
360b0dea58 Restore CUDA 13.0.1 + patch vLLM for cuMemcpyBatchAsync API change
CUDA 13 removed the fail_idx parameter from cuMemcpyBatchAsync.
Patch cache_kernels.cu to match new API signature instead of downgrading.

- Restore CUDA 13.0.1, PyTorch 2.9.0+cu130, flashinfer cu130
- Patch: remove fail_idx variable and parameter from cuMemcpyBatchAsync call
- Simplify error message to not reference fail_idx
2026-04-03 07:53:12 +00:00
6255c94359 Downgrade to CUDA 12.8.1 for vLLM compatibility
cuMemcpyBatchAsync API changed in CUDA 13 - removed fail_idx parameter.
vLLM code targets CUDA 12.8 API. Downgrade to CUDA 12.8.1.
2026-04-03 07:43:19 +00:00
ceab7ada22 Update flashinfer to v0.6.6 to match vLLM 0.18.x requirements
vLLM 0.18.x depends on flashinfer-python==0.6.6, was building 0.4.1
2026-04-03 07:13:16 +00:00
9d88d4c7d8 Skip xformers - vLLM has built-in FlashAttention kernels
xformers requires TORCH_STABLE_ONLY which needs torch/csrc/stable/ headers
not present in PyTorch 2.9.0. vLLM 0.18.1 includes its own FA2/FA3 kernels.
2026-04-03 05:50:02 +00:00
45b6109ee1 Fix xformers TORCH_STABLE_ONLY issue + ramp up MAX_JOBS for native GH200
- Switch to official facebook/xformers (johnnynunez fork has TORCH_STABLE_ONLY requiring PyTorch headers not in 2.9.0)
- Increase MAX_JOBS from 2-4 to 8 for all builds (native GH200 has 97GB HBM3)
- Increase NVCC_THREADS from 1 to 4 for flash-attention
2026-04-03 05:46:11 +00:00
5c29d2bea7 Fix: LMCache default branch is 'dev' not 'main' 2026-04-03 03:34:58 +00:00
9259555802 Fix: Actually update LMCache to main branch (previous edit failed) 2026-04-03 03:16:45 +00:00
750906e649 Bleeding edge build: LMCache main, vLLM main, latest transformers 2026-04-03 03:14:01 +00:00
a399fbc8c6 Add MAX_JOBS=2 for LMCache, restore vLLM build from source
- LMCache: reduced parallelism to avoid memory pressure
- vLLM: restored build from source (was using PyPI wheel)
- Will test with docker --memory=24g limit
2026-04-03 02:49:43 +00:00
f8a9d372e5 Use PyPI vLLM wheel instead of building (QEMU cmake try_compile fails)
- vLLM 0.18.1 aarch64 wheel includes pre-compiled FA2, FA3, MoE kernels
- Original build-from-source code commented out for GH200 restoration
- CMake compiler ABI detection fails under QEMU emulation
2026-04-03 00:05:56 +00:00
436214bb72 Use PyPI triton wheel instead of building (QEMU segfaults)
Triton 3.6.0 has official aarch64 wheel on PyPI.
Building triton from source causes segfaults under QEMU emulation.
2026-04-02 23:58:20 +00:00
e5445512aa Reduce MAX_JOBS by half to reduce QEMU memory pressure
- xformers: 6 -> 3
- flash-attention: 8 -> 4
- vllm: 8 -> 4

Testing if lower parallelism helps avoid segfaults under emulation
2026-04-02 23:44:11 +00:00
4f94431af6 Revert CC/CXX to full paths, keep QEMU_CPU=max 2026-04-02 22:50:23 +00:00
866c9d9db8 Add QEMU_CPU=max for better emulation compatibility during cross-compilation 2026-04-02 22:47:53 +00:00
2ed1b1e2dd Fix: use CC=gcc CXX=g++ instead of full paths for QEMU compatibility 2026-04-02 22:47:22 +00:00
14467bef70 Fix: add --no-build-isolation to pip wheel for flash-attention
Without this flag, pip runs the build in an isolated environment
that doesn't have access to torch in the venv.
2026-04-02 20:55:32 +00:00
8f870921f8 Fix: use 'pip wheel' instead of 'uv pip wheel' (uv has no wheel subcommand) 2026-04-02 20:22:11 +00:00
9da93ec625 Fix setuptools pin and flash-attention build for GH200
- Pin setuptools>=77.0.3,<81.0.0 for LMCache compatibility
- Use 'uv pip wheel' instead of 'pip3 wheel' for flash-attention (torch is in venv)
- Add CLAWMINE.md with build pipeline documentation
2026-04-02 20:19:39 +00:00
Rajesh Shashi Kumar 0814f059f5 Updated to v0.11.1rc3 2025-10-23 18:11:41 +00:00
Rajesh Shashi Kumar 3c4796ed55 Updated for CUDA 13 2025-10-21 19:21:13 +00:00
Rajesh Shashi Kumar ebcdb4ab50 Updates for PyTorch 2.9, CUDA13 2025-10-20 20:16:06 +00:00
Rajesh Shashi Kumar 02430037ea Updated for v0.11.0 2025-10-16 01:08:21 +00:00
Rajesh Shashi Kumar 201bbf5379 v0.10.2 cleanup 2025-09-24 06:14:16 +00:00
Rajesh Shashi Kumar fc321295f1 Updated for vllm v0.10.2 2025-09-24 05:52:11 +00:00
Rajesh Shashi Kumar daf345024b Updated for v0.10.0 2025-08-20 21:02:46 +00:00