# GH200 vLLM Container Build Pipeline

> Managed by Clawmine — `/home/openclaw/dev/grace-gpu-containers`

## Overview

Building vLLM containers for NVIDIA GH200 (Grace Hopper, ARM64 + H100 GPU). The challenge: prebuilt wheels for `aarch64` are limited, and NVIDIA doesn't publish NGC Dockerfiles.

## Jenkins Pipeline

**Server:** https://jenkins.sweetapi.com/
**Job:** `gh200-vllm-build`
**Status:** Build #18 (v0.18.1) running on a native GH200 builder — see Build History

### Jenkins Server (Build Machine)

```
Host: 66.135.24.21
User: root
Pass: Wy9,za7+8BL(v@ZT
```

**Setup:**

- Docker buildx `multiarch` builder configured
- QEMU user-static installed for ARM64 emulation
- 780GB free disk

### Build Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `VLLM_VERSION` | `v0.18.1` | vLLM git tag |
| `CUDA_VERSION` | `13.0.1` | CUDA version |
| `IMAGE_TAG` | `gh200-vllm` | Docker image name |
| `PUSH_TO_REGISTRY` | `true` | Push to Vultr CR after build |

### Container Registry

**URL:** `sjc.vultrcr.com/charizard`
**User:** `891294d0-df76-4c37-b41d-2b77f95a54c1`
**Pass:** `H3aE2NfqRLs5Aio6SCnnDKBJwnB6rsJfFZ7E`

**Images:**

- `sjc.vultrcr.com/charizard/gh200-vllm:v0.18.1`
- `sjc.vultrcr.com/charizard/gh200-vllm:latest`

### Build History

| Build # | Version | Status | Duration | Notes |
|---------|---------|--------|----------|-------|
| 18 | v0.18.1 | RUNNING | - | Started 2026-04-03 04:11 UTC — native GH200 builder |
| 17 | main | FAILED | ~4 min | vLLM main branch CUDA 13 API mismatch |
| 2 | v0.18.1 | FAILED | - | Started 2026-04-02 ~20:03 UTC — setuptools pinned |
| 1 | v0.18.1 | FAILED | ~15 min | setuptools 82.0.1 incompatible with LMCache |

### Monitoring

**Cron Job:** Every 30 minutes, checks build status and notifies on completion.
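The core of such a checker is interpreting Jenkins' `lastBuild` JSON, where `result` is `null` while a build is still running. A hypothetical sketch of that logic (not the contents of the real script, which may differ; assumes `jq` is installed, as the manual check uses it too):

```bash
# Hypothetical sketch of the status logic in a Jenkins build checker.
# Jenkins' lastBuild JSON has "result": null while a build is in progress.
build_status() {
  jq -r '.result // "BUILDING"'
}

# Against canned API responses (no network needed):
echo '{"result":null,"building":true}' | build_status       # prints BUILDING
echo '{"result":"FAILURE","building":false}' | build_status # prints FAILURE
```

A cron job would compare this string to the value saved on the previous run and notify only when it changes to a terminal state (`SUCCESS`, `FAILURE`, `ABORTED`).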
- Job ID: `e06d540f-899e-4358-92df-85fe036b05e2`
- Script: `~/.openclaw/scripts/check-jenkins-vllm-build.sh`

**Manual check:**

```bash
curl -s -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" \
  | jq '{result, building, duration}'
```

### Triggering a Build

```bash
# Via API
curl -X POST "https://jenkins.sweetapi.com/job/gh200-vllm-build/buildWithParameters" \
  -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
  -d "VLLM_VERSION=v0.18.1"

# Check status
curl -s -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" \
  | jq '.result'
```

## Current State

### Dockerfile (`vllm/Dockerfile`)

Builds everything from source:

- **triton** (release/3.5.x)
- **xformers** (johnnynunez fork)
- **flashinfer** (v0.4.1)
- **flash-attention** (hopper branch)
- **lmcache** (v0.3.7)
- **infinistore** (main)
- **vLLM** (configurable via `VLLM_REF`)

**Target Architecture:** `9.0a` (NVIDIA Hopper)

### Latest Versions (as of 2026-04-02)

| Package | Stable | Latest |
|---------|--------|--------|
| vLLM | v0.18.1 | v0.19.0rc1 |
| Triton | 3.6.0 | 3.6.0 |
| CUDA | 13.0.1 | 13.0.1 |

## PyPI Wheel Status for aarch64

| Package | aarch64 Wheel | Notes |
|---------|---------------|-------|
| vLLM 0.18.1 | ✅ Yes | Includes FA2, FA3, MoE kernels |
| Triton 3.6.0 | ✅ Yes | Official wheel |
| flashinfer | ❌ No | Must build from source |
| xformers | ❌ No | Must build from source |

**Key Finding:** The official vLLM aarch64 wheel includes pre-compiled CUDA kernels:

- `vllm/_C.abi3.so` (381MB)
- `vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so` (263MB)
- `vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so` (142MB)
- `vllm/_moe_C.abi3.so` (202MB)

This means **basic vLLM on GH200 can use PyPI wheels directly** without compilation.
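The wheel-status table can be reproduced mechanically: a wheel's platform tag is the last dash-separated field of its filename, so "does an aarch64 wheel exist" reduces to a filename test. A minimal sketch (the second filename is illustrative of the naming convention, not a live PyPI listing):

```bash
# Decide per wheel filename whether it covers aarch64 Linux (GH200's
# Grace CPU) or whether an x86-only wheel forces a source build.
needs_source_build() {
  case "$1" in
    *aarch64*.whl|*any.whl) return 1 ;;  # aarch64 or pure-Python: wheel usable
    *) return 0 ;;                       # platform-specific, not aarch64
  esac
}

for whl in "vllm-0.18.1-cp38-abi3-manylinux2014_aarch64.whl" \
           "example_pkg-1.0-cp312-cp312-manylinux_2_28_x86_64.whl"; do
  if needs_source_build "$whl"; then
    echo "source build: $whl"
  else
    echo "wheel OK:     $whl"
  fi
done
```

Candidate wheels for the target platform can be fetched on any machine with `pip download <pkg> --no-deps --only-binary=:all: --platform manylinux2014_aarch64 --python-version 3.12 -d wheels/` (pip requires `--no-deps` or `--only-binary=:all:` whenever `--platform` is set).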
## Build Options

### Option 1: QEMU Cross-Compilation (Current Setup)

**Pros:**

- Works on existing x86 Jenkins server
- No GH200 required

**Cons:**

- Very slow (hours)
- CUDA kernels may not be fully optimized
- flashinfer/xformers need source builds

**When to use:** When no GH200 is available

### Option 2: Native GH200 Build

**Pros:**

- Much faster compilation
- Properly optimized ARM64 + Hopper kernels
- All dependencies build correctly

**Cons:**

- Needs GH200 access

**When to use:** For production-quality builds

### Option 3: Hybrid (PyPI wheels + source builds)

**Approach:**

1. Use vLLM + Triton from PyPI (pre-compiled aarch64)
2. Build only flashinfer/xformers from source (if needed)

**Pros:**

- Fastest option
- Official wheels for core components

**Cons:**

- May miss some optimizations
- flashinfer/xformers still need source builds

## Recommended Path

1. **Immediate:** Try PyPI wheels on a GH200 — if they work, no build is needed
2. **If PyPI insufficient:** Run the Jenkins build with `VLLM_VERSION=v0.18.1`
3. **Production:** Get a GH200 for native builds

## Source Repo

- **Our fork:** https://sweetapi.com/biondizzle/grace-gpu-containers.git (Jenkins pulls from here)
- **Upstream:** https://github.com/rajesh-s/grace-gpu-containers

### Local repo: `/home/openclaw/dev/grace-gpu-containers`

```bash
git remote -v
origin    ssh://git@sweetapi.com:2222/biondizzle/grace-gpu-containers.git (push)
upstream  https://github.com/rajesh-s/grace-gpu-containers.git (fetch)
```

### Changes from upstream

1. **setuptools pin** — Pin setuptools to `<81.0.0` for LMCache compatibility
   - Changed: `RUN uv pip install -U build cmake ninja pybind11 setuptools wheel`
   - To: `RUN uv pip install -U build cmake ninja pybind11 "setuptools>=77.0.3,<81.0.0" wheel`
2. **pip for flash-attention** — Use `pip wheel` instead of `pip3 wheel` (the venv has pip, not pip3)
   - Changed: `pip3 wheel . -v --no-deps`
   - To: `pip wheel . -v --no-deps`
3.
**CLAWMINE.md** — This documentation file

---

*Last updated: 2026-04-03 by Clawmine*