biondizzle bcc872c2c3 Remove global allocator swap, use targeted KV cache managed allocation
sitecustomize.py: No longer swaps CUDAPluggableAllocator globally.
Sets VLLM_KV_CACHE_USE_MANAGED_MEMORY=1 instead.
vllm_managed_mem.py: No global allocator swap, no torch.cuda patches.
2026-04-11 02:15:09 +00:00
2026-04-03 04:27:21 +00:00

Building vLLM containers for GH200 leveraging cudaMallocManaged when EGM (Extended GPU Memory) is enabled

The Problem

The GH200 has 96 GiB of HBM (VRAM) and 480 GiB of LPDDR (system memory). The only way for the GPU to access system memory over the C2C NVLink at the full 900 GB/s — without going through the IOMMU — is to enable EGM in the BIOS.

This creates three issues:

  1. EGM requires the BIOS reserved-memory value to be 8192; no other value works
  2. The server loses most of its system memory: the OS no longer sees 480 GiB, only ~102 GiB; the rest has been handed over to the GPU
  3. vLLM still sees only 96 GiB of VRAM: the GPU can now reach the full memory space, but the only way to leverage it is to convert all cudaMalloc calls to cudaMallocManaged; without that, vLLM's allocator only touches HBM

The Goal

Force vLLM to use cudaMallocManaged so it can address the full memory space (HBM + EGM). Before we can do that, we have to make sure vLLM's preflight checks of available VRAM show the new fully allocated amount (~97 GiB HBM + ~378 GiB EGM = ~475 GiB total managed memory).
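As a sanity check, the arithmetic above can be sketched in a few lines of Python. The GiB figures are the approximate values quoted in this document, not measured:

```python
# Approximate GiB figures quoted in this document; not measured values.
HBM_GIB = 97          # nvidia-smi reports 97,871 MiB of HBM
LPDDR_GIB = 480       # physical LPDDR per dmidecode
OS_VISIBLE_GIB = 102  # what `free -h` shows after the EGM carve-out

egm_gib = LPDDR_GIB - OS_VISIBLE_GIB   # memory handed to the GPU by EGM
managed_pool_gib = HBM_GIB + egm_gib   # addressable via cudaMallocManaged

print(f"EGM: ~{egm_gib} GiB, managed pool: ~{managed_pool_gib} GiB")
# → EGM: ~378 GiB, managed pool: ~475 GiB
```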

GH200 System State with EGM Enabled

$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================================================================|
|   0  NVIDIA GH200 480GB             On  |   00000009:01:00.0 Off |                  Off |
| N/A   31C    P0             83W /  700W |   14920MiB /  97871MiB |      0%      Default |
+-----------------------------------------------------------------------------------------+

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           102Gi       4.2Gi        83Gi        10Mi        15Gi        93Gi

$ dmidecode -t memory | grep -i size
	Size: 480 GB

$ numactl --hardware
available: 10 nodes (0-9)
node 0 cpus: 0-71
node 0 size: 8080 MB
node 2 size: 97280 MB

Key observations:

  • nvidia-smi reports 97,871 MiB (~96 GiB) — this is HBM only, NOT the EGM
  • OS sees only ~102 GiB system RAM — EGM carved out 378 GiB for the GPU
  • dmidecode confirms the physical DIMM is 480 GiB
  • NUMA node 2 has 97 GiB (the LPDDR that wasn't handed to EGM), node 0 has 8 GiB (local)
  • The EGM memory appears as System RAM (NVIDIA) in /proc/iomem at addresses 0x400000000000+
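The /proc/iomem claim can be verified with a short parser. The excerpt below is a hypothetical example of what a single 378 GiB EGM entry starting at 0x400000000000 would look like, not a capture from a real machine:

```python
import re

# Hypothetical /proc/iomem excerpt; on a real GH200 the EGM carve-out
# shows up as "System RAM (NVIDIA)" regions at 0x400000000000 and above.
IOMEM = """\
400000000000-405e7fffffff : System RAM (NVIDIA)
"""

total = 0
for line in IOMEM.splitlines():
    m = re.match(r"\s*([0-9a-f]+)-([0-9a-f]+) : System RAM \(NVIDIA\)", line)
    if m:
        start, end = (int(x, 16) for x in m.groups())
        total += end - start + 1  # iomem ranges are inclusive

print(f"EGM regions total ~{total / 2**30:.0f} GiB")
# → EGM regions total ~378 GiB
```

On a live system the same loop can read `open("/proc/iomem")` instead of the embedded string (requires root to see addresses).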

Architecture

Managed Memory Allocator

The approach uses a PyTorch pluggable allocator (managed_alloc.cu) that replaces cudaMalloc with cudaMallocManaged, enabling transparent page-fault access to both HBM and EGM.

┌─────────────────────────────────────────────────┐
│                      vLLM                       │
│           (sees full managed memory)            │
├─────────────────────────────────────────────────┤
│           PyTorch Pluggable Allocator           │
│            (managed_alloc.cu / .so)             │
│     cudaMallocManaged → unified memory pool     │
├─────────────────────────────────────────────────┤
│               CUDA Unified Memory               │
├──────────────────┬──────────────────────────────┤
│   HBM (~96 GiB)  │        EGM (~378 GiB)        │
│   Fast / local   │  Page-fault over C2C NVLink  │
│                  │      900 GB/s bandwidth      │
└──────────────────┴──────────────────────────────┘

Launcher

vllm_managed_mem.py is the entry point that:

  1. Loads the managed allocator .so before any CUDA operations
  2. Swaps PyTorch's default allocator to cudaMallocManaged
  3. Patches vLLM's memory validation to understand the larger managed memory space
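Step 2 can be sketched with PyTorch's pluggable-allocator API. This is a hedged sketch, not the actual launcher: the .so path and the managed_malloc/managed_free export names are assumptions, and the real vllm_managed_mem.py may differ.

```python
def load_managed_allocator(so_path="/opt/vllm/managed_alloc.so"):
    """Swap PyTorch's CUDA caching allocator for the managed one.

    Must run before the first CUDA allocation. The export names
    ("managed_malloc"/"managed_free") and the .so path are assumed,
    not taken from the real managed_alloc.cu.
    """
    import torch  # deferred import so the sketch loads without torch

    alloc = torch.cuda.memory.CUDAPluggableAllocator(
        so_path, "managed_malloc", "managed_free"
    )
    torch.cuda.memory.change_current_allocator(alloc)
    return alloc
```

Once the allocator is swapped, every `torch.empty(..., device="cuda")` goes through cudaMallocManaged and can spill from HBM into EGM on page fault.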

Source Repos

Repo                  URL                                                       Branch               Purpose
vLLM (our fork)       https://sweetapi.com/biondizzle/vllm.git                  cmm                  vLLM with cudaMallocManaged patches
grace-gpu-containers  https://sweetapi.com/biondizzle/grace-gpu-containers.git  cuda-malloc-managed  Dockerfile + build pipeline

The cmm branch in our vLLM fork is based on tag v0.19.0 (commit 2a69949bd).

Build Pipeline

Jenkins: gh200-vllm-build-cmm

The build chain:

Jenkins → grace-gpu-containers (cuda-malloc-managed branch)
        → Dockerfile clones vLLM from Gitea fork (cmm branch)
        → Builds ARM64 image via buildx on remote GH200
        → Pushes to atl.vultrcr.com/vllm

Default parameters:

  • VLLM_VERSION: cmm (our fork branch)
  • IMAGE_TAG: gh200-vllm-cmm
  • IMAGE_SUFFIX: -cmm

Image: atl.vultrcr.com/vllm/gh200-vllm-cmm:cmm-cmm
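The pushed reference appears to follow from the parameters above; a hedged sketch of that composition (the actual Jenkinsfile template may differ):

```python
REGISTRY = "atl.vultrcr.com/vllm"

def image_ref(vllm_version="cmm", image_tag="gh200-vllm-cmm", image_suffix="-cmm"):
    # Hypothetical composition rule inferred from the defaults above;
    # the real Jenkinsfile may assemble the reference differently.
    return f"{REGISTRY}/{image_tag}:{vllm_version}{image_suffix}"

print(image_ref())  # → atl.vultrcr.com/vllm/gh200-vllm-cmm:cmm-cmm
```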

Dockerfile Build Stages

Stage                  What it builds          Source
build-triton           Triton 3.6.0            PyPI wheel (aarch64)
build-triton-kernels   triton_kernels v3.6.0   Triton repo
build-flashinfer       flashinfer v0.6.6       Source (apache-tvm-ffi required)
build-lmcache          LMCache dev             Source
build-flash-attention  FlashAttention hopper   Source
build-vllm             vLLM cmm branch         Our Gitea fork
build-infinistore      InfiniStore             Source

Base image: nvcr.io/nvidia/pytorch:26.03-py3

  • PyTorch 2.11.0a0, CUDA 13.2.0
  • Multi-arch: x86 + ARM SBSA (GH200)
  • Target: 9.0a (Hopper)

Key Files

File                      Description
vllm/Dockerfile           Multi-stage build for the CMM container
vllm/managed_alloc.cu     PyTorch pluggable allocator using cudaMallocManaged
vllm/vllm_managed_mem.py  Launcher that patches vLLM for managed memory
lmcache/Dockerfile        Standalone LMCache build

Local Development

# vLLM fork (working directory)
cd /home/openclaw/dev/vllm
git checkout cmm  # our working branch, based on v0.19.0

# grace-gpu-containers
cd /home/openclaw/dev/grace-gpu-containers
git checkout cuda-malloc-managed  # CMM Dockerfile lives here

Hard Rules

  1. No downgrades — CUDA 13+, PyTorch 2.9+, vLLM 0.18.1+
  2. No skipping compilation — build from source
  3. Clear all changes with Mike before making them — no autonomous commits
  4. One build at a time — Mike reports failure → assess → report → Mike decides