CMM: Fix OOM and subprocess crashes for GH200 EGM

Key changes:

- `managed_alloc.cu`: add `cudaMemPrefetchAsync` to migrate pages to the GPU immediately (prevents OOM from system-RAM pinning on EGM systems where only ~102 GiB of RAM remains); add `cudaMemAdviseSetAccessedBy` for the CPU so reads go over the C2C NVLink without page migration.
- `vllm_managed_mem.py`: rewrite with idempotent patches, a proper `MemorySnapshot.measure()` override, and `torch.cuda` tracking stubs for `CUDAPluggableAllocator` compatibility.
- `sitecustomize.py`: auto-loaded by Python in ALL subprocesses (including vLLM EngineCore); applies the allocator swap, torch patches, `MemorySnapshot` override, and `request_memory` override before any CUDA operations in spawned processes.
- `Dockerfile`: install `sitecustomize.py` into Python dist-packages.
- `README.md`: full rewrite with EGM problem statement, memory layout, architecture diagram, and build-pipeline documentation.
# Building vLLM containers for GH200 leveraging cudaMallocManaged when EGM (Extended GPU Memory) is enabled
## The Problem
The GH200 has 96 GiB of HBM (VRAM) and 480 GiB of LPDDR (system memory). The only way for the GPU to access system memory over the C2C NVLink at the full 900 GB/s — without going through the IOMMU — is to enable EGM in the BIOS.
This creates three issues:
1. **EGM requires a reserved-memory value of 8192** (this is the only value that works)
2. **The server loses most of its system memory**: instead of 480 GiB, the OS now sees only ~102 GiB; the rest has been handed over to the GPU
3. **vLLM still only sees 96 GiB of VRAM**: the GPU now has access to the full memory space, but the only way to leverage it is to convert all `cudaMalloc` calls to `cudaMallocManaged`; without this, vLLM's allocator only touches HBM
## The Goal
Force vLLM to use `cudaMallocManaged` so it can address the full memory space (HBM + EGM). Before we can do that, we have to make sure vLLM's preflight checks of available VRAM show the new fully allocated amount (~97 GiB HBM + ~378 GiB EGM = ~475 GiB total managed memory).
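The arithmetic behind that target can be sketched directly. The stub below mirrors the `(free, total)` shape of `torch.cuda.mem_get_info` purely for illustration; it is an assumption, not vLLM's actual hook (the real override lives in `vllm_managed_mem.py`):

```python
# Hypothetical probe illustrating the managed-pool accounting; not the
# actual vLLM override. Figures match the system state documented below.
GiB = 1024**3
HBM_BYTES = 97 * GiB    # what nvidia-smi reports as VRAM
EGM_BYTES = 378 * GiB   # LPDDR carved out for the GPU at boot

def managed_mem_get_info(used_bytes: int = 0) -> tuple[int, int]:
    """Return (free, total) against HBM + EGM instead of HBM alone."""
    total = HBM_BYTES + EGM_BYTES
    return total - used_bytes, total

free_b, total_b = managed_mem_get_info()
print(total_b // GiB)  # -> 475
```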
## GH200 System State with EGM Enabled
```
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================================================================|
|   0  NVIDIA GH200 480GB             On  |   00000009:01:00.0 Off |                  Off |
| N/A   31C    P0             83W /  700W |   14920MiB /  97871MiB |      0%      Default |
+-----------------------------------------------------------------------------------------+

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           102Gi       4.2Gi        83Gi        10Mi        15Gi        93Gi

$ dmidecode -t memory | grep -i size
	Size: 480 GB

$ numactl --hardware
available: 10 nodes (0-9)
node 0 cpus: 0-71
node 0 size: 8080 MB
node 2 size: 97280 MB
```
Key observations:

- **nvidia-smi** reports 97,871 MiB (~96 GiB): this is HBM only, NOT the EGM
- **OS** sees only ~102 GiB of system RAM: EGM carved out 378 GiB for the GPU
- **dmidecode** confirms the physical DIMM is 480 GiB
- **NUMA** node 2 has 97 GiB (the LPDDR that wasn't handed to EGM); node 0 has 8 GiB (local)
- The EGM memory appears as `System RAM (NVIDIA)` in `/proc/iomem` at addresses `0x400000000000+`
## Architecture
### Managed Memory Allocator
The approach uses a PyTorch pluggable allocator (`managed_alloc.cu`) that replaces `cudaMalloc` with `cudaMallocManaged`, enabling transparent page-fault access to both HBM and EGM.
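On the Python side, PyTorch's pluggable-allocator API can load such a `.so`. A minimal sketch of the wiring, assuming exported symbol names `managed_malloc`/`managed_free` (they must match whatever `managed_alloc.cu` actually exports):

```python
def install_managed_allocator(so_path: str = "./managed_alloc.so") -> None:
    """Swap PyTorch's caching allocator for the cudaMallocManaged-backed one.

    Sketch only: the .so path and exported symbol names are assumptions.
    """
    import torch  # deferred import so the sketch reads without a GPU env

    alloc = torch.cuda.memory.CUDAPluggableAllocator(
        so_path,
        "managed_malloc",  # wraps cudaMallocManaged + prefetch/advise
        "managed_free",    # wraps cudaFree
    )
    # Must run before the first CUDA allocation in this process.
    torch.cuda.memory.change_current_allocator(alloc)
```

Once swapped, every tensor PyTorch allocates lands in unified memory, so it can spill past the 96 GiB of HBM into EGM transparently.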
```
┌─────────────────────────────────────────────────┐
│                      vLLM                       │
│           (sees full managed memory)            │
├─────────────────────────────────────────────────┤
│           PyTorch Pluggable Allocator           │
│            (managed_alloc.cu / .so)             │
│     cudaMallocManaged → unified memory pool     │
├─────────────────────────────────────────────────┤
│               CUDA Unified Memory               │
├──────────────────┬──────────────────────────────┤
│ HBM (~96 GiB)    │ EGM (~378 GiB)               │
│ Fast / local     │ Page-fault over C2C NVLink   │
│                  │ 900 GB/s bandwidth           │
└──────────────────┴──────────────────────────────┘
```
### Launcher

`vllm_managed_mem.py` is the entry point that:

1. Loads the managed allocator `.so` before any CUDA operations
2. Swaps PyTorch's default allocator to `cudaMallocManaged`
3. Patches vLLM's memory validation to understand the larger managed memory space
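The launcher alone is not enough: vLLM spawns worker subprocesses (EngineCore), so the container also installs a `sitecustomize.py` into dist-packages. Python's `site` machinery imports that module automatically at every interpreter startup, which is why the patches reach spawned processes before any CUDA work. A minimal, self-contained sketch of the mechanism (the file contents and the marker variable are illustrative, not the real patch):

```python
# Demonstrates Python auto-importing sitecustomize.py in a child process.
# The env-var marker stands in for the real allocator/patching side effects.
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "sitecustomize.py"), "w") as f:
        f.write("import os; os.environ['MANAGED_MEM_PATCHED'] = '1'\n")
    env = dict(os.environ, PYTHONPATH=d)  # stand-in for dist-packages
    out = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ.get('MANAGED_MEM_PATCHED'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    print(out.stdout.strip())  # -> 1
```

Because the import happens during interpreter startup, the allocator swap is in place before the child executes any of its own code.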
## Source Repos

| Repo | URL | Branch | Purpose |
|------|-----|--------|---------|
| vLLM (our fork) | `https://sweetapi.com/biondizzle/vllm.git` | `cmm` | vLLM with cudaMallocManaged patches |
| grace-gpu-containers | `https://sweetapi.com/biondizzle/grace-gpu-containers.git` | `cuda-malloc-managed` | Dockerfile + build pipeline |

The `cmm` branch in our vLLM fork is based on tag `v0.19.0` (commit `2a69949bd`).
## Build Pipeline
### Jenkins: `gh200-vllm-build-cmm`
The build chain:

```
Jenkins → grace-gpu-containers (cuda-malloc-managed branch)
        → Dockerfile clones vLLM from Gitea fork (cmm branch)
        → Builds ARM64 image via buildx on remote GH200
        → Pushes to atl.vultrcr.com/vllm
```
**Default parameters:**

- `VLLM_VERSION`: `cmm` (our fork branch)
- `IMAGE_TAG`: `gh200-vllm-cmm`
- `IMAGE_SUFFIX`: `-cmm`

**Image:** `atl.vultrcr.com/vllm/gh200-vllm-cmm:cmm-cmm`
### Dockerfile Build Stages
| Stage | What it builds | Source |
|-------|----------------|--------|
| build-triton | Triton 3.6.0 | PyPI wheel (aarch64) |
| build-triton-kernels | triton_kernels v3.6.0 | Triton repo |
| build-flashinfer | flashinfer v0.6.6 | Source (apache-tvm-ffi required) |
| build-lmcache | LMCache dev | Source |
| build-flash-attention | FlashAttention hopper | Source |
| build-vllm | vLLM cmm branch | Our Gitea fork |
| build-infinistore | InfiniStore | Source |

**Base image:** `nvcr.io/nvidia/pytorch:26.03-py3`

- PyTorch 2.11.0a0, CUDA 13.2.0
- Multi-arch: x86 + ARM SBSA (GH200)
- Target: `9.0a` (Hopper)

## Key Files
| File | Description |
|------|-------------|
| `vllm/Dockerfile` | Multi-stage build for the CMM container |
| `vllm/managed_alloc.cu` | PyTorch pluggable allocator using `cudaMallocManaged` |
| `vllm/vllm_managed_mem.py` | Launcher that patches vLLM for managed memory |
| `lmcache/Dockerfile` | Standalone LMCache build |

## Local Development
```bash
# vLLM fork (working directory)
cd /home/openclaw/dev/vllm
git checkout cmm                    # our working branch, based on v0.19.0

# grace-gpu-containers
cd /home/openclaw/dev/grace-gpu-containers
git checkout cuda-malloc-managed    # CMM Dockerfile lives here
```
## Hard Rules
1. **No downgrades**: CUDA 13+, PyTorch 2.9+, vLLM 0.18.1+
2. **No skipping compilation**: build from source
3. **Clear all changes with Mike before making them**: no autonomous commits
4. **One build at a time**: when Mike reports a failure, assess, report back, and let Mike decide the next step