# GH200 vLLM Container Build Pipeline

Managed by Clawmine — /home/openclaw/dev/grace-gpu-containers
## Overview

Building vLLM containers for NVIDIA GH200 (Grace Hopper: ARM64 CPU + H100 GPU). The challenge: prebuilt aarch64 wheels are limited, and NVIDIA doesn't publish the Dockerfiles behind its NGC images.
## Jenkins Pipeline

- Server: https://jenkins.sweetapi.com/
- Job: gh200-vllm-build
- Status: Configured, never run (ready for first build)
## Jenkins Server (Build Machine)

- Host: 66.135.24.21
- User: root
- Pass: Wy9,za7+8BL(v@ZT

Setup:
- Docker buildx `multiarchbuilder` builder configured
- QEMU user-static installed for ARM64 emulation
- 780 GB free disk
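The server is already configured, but the setup above can be reproduced on a fresh x86 host roughly as follows. This is a sketch requiring a running Docker daemon; the exact QEMU image and flags are assumptions about how the existing builder was created, not a record of it.

```shell
# Register QEMU binfmt handlers so the kernel can execute aarch64 binaries (assumed invocation)
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

# Create and activate the multi-arch buildx builder referenced above
docker buildx create --name multiarchbuilder --use
docker buildx inspect --bootstrap   # verify linux/arm64 appears under Platforms
```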
## Build Parameters

| Parameter | Default | Description |
|---|---|---|
| VLLM_VERSION | v0.18.1 | vLLM git tag |
| CUDA_VERSION | 13.0.1 | CUDA version |
| IMAGE_TAG | gh200-vllm | Docker image name |
| PUSH_TO_REGISTRY | true | Push to Vultr CR after build |
## Container Registry

- URL: sjc.vultrcr.com/charizard
- User: 891294d0-df76-4c37-b41d-2b77f95a54c1
- Pass: H3aE2NfqRLs5Aio6SCnnDKBJwnB6rsJfFZ7E

Images:
- sjc.vultrcr.com/charizard/gh200-vllm:v0.18.1
- sjc.vultrcr.com/charizard/gh200-vllm:latest
## Build History
| Build # | Version | Status | Duration | Notes |
|---|---|---|---|---|
| 2 | v0.18.1 | RUNNING | - | Started 2026-04-02 ~20:03 UTC — setuptools pinned |
| 1 | v0.18.1 | FAILED | ~15 min | setuptools 82.0.1 incompatible with LMCache |
## Monitoring

Cron Job: Every 30 minutes, checks build status and notifies on completion.
- Job ID: e06d540f-899e-4358-92df-85fe036b05e2
- Script: ~/.openclaw/scripts/check-jenkins-vllm-build.sh

Manual check:

```bash
curl -s -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '{result, building, duration}'
```
## Triggering a Build

```bash
# Via API
curl -X POST "https://jenkins.sweetapi.com/job/gh200-vllm-build/buildWithParameters" \
  -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
  -d "VLLM_VERSION=v0.18.1"

# Check status
curl -s -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '.result'
```
## Current State

### Dockerfile (vllm/Dockerfile)

Builds everything from source:
- triton (release/3.5.x)
- xformers (johnnynunez fork)
- flashinfer (v0.4.1)
- flash-attention (hopper branch)
- lmcache (v0.3.7)
- infinistore (main)
- vLLM (configurable via VLLM_REF)

Target Architecture: 9.0a (NVIDIA Hopper)
## Latest Versions (as of 2026-04-02)
| Package | Stable | Latest |
|---|---|---|
| vLLM | v0.18.1 | v0.19.0rc1 |
| Triton | 3.6.0 | 3.6.0 |
| CUDA | 13.0.1 | 13.0.1 |
## PyPI Wheel Status for aarch64
| Package | aarch64 Wheel | Notes |
|---|---|---|
| vLLM 0.18.1 | ✅ Yes | Includes FA2, FA3, MoE kernels |
| Triton 3.6.0 | ✅ Yes | Official wheel |
| flashinfer | ❌ No | Must build from source |
| xformers | ❌ No | Must build from source |
Key Finding: The official vLLM aarch64 wheel includes pre-compiled CUDA kernels:
- vllm/_C.abi3.so (381 MB)
- vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so (263 MB)
- vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so (142 MB)
- vllm/_moe_C.abi3.so (202 MB)

This means basic vLLM on GH200 can use PyPI wheels directly, without compilation.
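A quick way to exercise that path on a GH200 before committing to a container build (a sketch; assumes network access and a working Python, and pins the version from the table above):

```shell
# Install the official aarch64 wheel; no compilation step should run
pip install "vllm==0.18.1"

# Confirm the prebuilt kernel libraries listed above shipped with the wheel
python - <<'EOF'
import pathlib
import vllm

pkg = pathlib.Path(vllm.__file__).parent
print(sorted(p.name for p in pkg.glob("*_C.abi3.so")))
EOF
```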
## Build Options

### Option 1: QEMU Cross-Compilation (Current Setup)
Pros:
- Works on existing x86 Jenkins server
- No GH200 required
Cons:
- Very slow (hours)
- CUDA kernels may not be fully optimized
- flashinfer/xformers need source builds
When to use: when no GH200 is available
### Option 2: Native GH200 Build
Pros:
- Much faster compilation
- Properly optimized ARM64 + Hopper kernels
- All dependencies build correctly
Cons:
- Need GH200 access
When to use: For production-quality builds
### Option 3: Hybrid (PyPI wheels + source builds)
Approach:
- Use vLLM + Triton from PyPI (pre-compiled aarch64)
- Build only flashinfer/xformers from source (if needed)
Pros:
- Fastest option
- Official wheels for core components
Cons:
- May miss some optimizations
- flashinfer/xformers still need source builds
## Recommended Path

1. Immediate: Try PyPI wheels on a GH200 — if they work, no build is needed
2. If PyPI is insufficient: Run the Jenkins build with VLLM_VERSION=v0.18.1
3. Production: Get a GH200 for native builds
## Source Repo

- Our fork: https://sweetapi.com/biondizzle/grace-gpu-containers.git (Jenkins pulls from here)
- Upstream: https://github.com/rajesh-s/grace-gpu-containers
- Local repo: /home/openclaw/dev/grace-gpu-containers

```
git remote -v
origin    ssh://git@sweetapi.com:2222/biondizzle/grace-gpu-containers.git (push)
upstream  https://github.com/rajesh-s/grace-gpu-containers.git (fetch)
```
## Changes from upstream

1. setuptools pin — Pin setuptools to `<81.0.0` for LMCache compatibility
   - Changed: `RUN uv pip install -U build cmake ninja pybind11 setuptools wheel`
   - To: `RUN uv pip install -U build cmake ninja pybind11 "setuptools>=77.0.3,<81.0.0" wheel`
2. pip for flash-attention — Use `pip wheel` instead of `pip3 wheel` (the venv has pip, not pip3)
   - Changed: `pip3 wheel . -v --no-deps`
   - To: `pip wheel . -v --no-deps`
3. CLAWMINE.md — This documentation file

Last updated: 2026-04-02 by Clawmine