Last commit: 9da93ec625 by biondizzle, 2026-04-02 20:19:39 +00:00: Fix setuptools pin and flash-attention build for GH200

  • Pin setuptools>=77.0.3,<81.0.0 for LMCache compatibility
  • Use 'uv pip wheel' instead of 'pip3 wheel' for flash-attention (torch is in venv)
  • Add CLAWMINE.md with build pipeline documentation


GH200 vLLM Container Build Pipeline

Managed by Clawmine — /home/openclaw/dev/grace-gpu-containers

Overview

This repo builds vLLM containers for the NVIDIA GH200 (Grace Hopper: ARM64 Grace CPU + H100 GPU). The challenge: prebuilt wheels for aarch64 are limited, and NVIDIA doesn't publish its NGC Dockerfiles.

Jenkins Pipeline

Server: https://jenkins.sweetapi.com/
Job: gh200-vllm-build
Status: Build #2 in progress (see Build History)

Jenkins Server (Build Machine)

Host: 66.135.24.21
User: root
Pass: Wy9,za7+8BL(v@ZT

Setup:

  • Docker buildx multiarch builder configured
  • QEMU user-static installed for ARM64 emulation
  • 780GB free disk
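The setup above can be reproduced with standard buildx/QEMU commands; the builder name "gh200" below is an arbitrary choice, not taken from the server config:

```shell
# Register QEMU binfmt handlers so the x86 host can run arm64 build steps
docker run --privileged --rm tonistiigi/binfmt --install arm64

# Create and bootstrap a dedicated multiarch builder
docker buildx create --name gh200 --driver docker-container --use
docker buildx inspect --bootstrap   # output should list linux/arm64 among the platforms
```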

Build Parameters

| Parameter        | Default    | Description                   |
|------------------|------------|-------------------------------|
| VLLM_VERSION     | v0.18.1    | vLLM git tag                  |
| CUDA_VERSION     | 13.0.1     | CUDA version                  |
| IMAGE_TAG        | gh200-vllm | Docker image name             |
| PUSH_TO_REGISTRY | true       | Push to Vultr CR after build  |

Container Registry

URL: sjc.vultrcr.com/charizard
User: 891294d0-df76-4c37-b41d-2b77f95a54c1
Pass: H3aE2NfqRLs5Aio6SCnnDKBJwnB6rsJfFZ7E

Images:

  • sjc.vultrcr.com/charizard/gh200-vllm:v0.18.1
  • sjc.vultrcr.com/charizard/gh200-vllm:latest
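Pulling an image locally (sketch; the `VULTR_CR_PASS` variable is illustrative — it just keeps the password listed above off the command line):

```shell
# Authenticate against the Vultr registry, then pull the latest image
printf '%s' "$VULTR_CR_PASS" | docker login sjc.vultrcr.com \
  -u 891294d0-df76-4c37-b41d-2b77f95a54c1 --password-stdin
docker pull sjc.vultrcr.com/charizard/gh200-vllm:latest
```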

Build History

| Build # | Version | Status  | Duration | Notes                                              |
|---------|---------|---------|----------|----------------------------------------------------|
| 2       | v0.18.1 | RUNNING | -        | Started 2026-04-02 ~20:03 UTC; setuptools pinned   |
| 1       | v0.18.1 | FAILED  | ~15 min  | setuptools 82.0.1 incompatible with LMCache        |

Monitoring

Cron Job: Every 30 minutes, checks build status and notifies on completion.

  • Job ID: e06d540f-899e-4358-92df-85fe036b05e2
  • Script: ~/.openclaw/scripts/check-jenkins-vllm-build.sh
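The script itself isn't reproduced here; a minimal sketch of the status check it performs might look like the following (the `build_status` helper and the plain-shell JSON matching are assumptions, chosen to avoid a jq dependency):

```shell
#!/bin/sh
# Map a Jenkins lastBuild/api/json payload to a one-word status.
build_status() {
  case "$1" in
    *'"building":true'*)    echo "RUNNING" ;;
    *'"result":"SUCCESS"'*) echo "SUCCESS" ;;
    *'"result":"FAILURE"'*) echo "FAILED"  ;;
    *)                      echo "UNKNOWN" ;;
  esac
}

# In the real script the payload would come from:
#   curl -s -u "admin:<token>" \
#     "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json"
```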

Manual check:

```shell
curl -s -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '{result, building, duration}'
```

Triggering a Build

```shell
# Via API
curl -X POST "https://jenkins.sweetapi.com/job/gh200-vllm-build/buildWithParameters" \
  -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
  -d "VLLM_VERSION=v0.18.1"

# Check status
curl -s -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '.result'
```

Current State

Dockerfile (vllm/Dockerfile)

Builds everything from source:

  • triton (release/3.5.x)
  • xformers (johnnynunez fork)
  • flashinfer (v0.4.1)
  • flash-attention (hopper branch)
  • lmcache (v0.3.7)
  • infinistore (main)
  • vLLM (configurable via VLLM_REF)
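Each component follows the same pattern in the Dockerfile: clone a pinned ref, then build a wheel inside the venv. A hedged sketch for one component (the repo URL, flags, and `/wheels` output directory are illustrative, not lifted from the Dockerfile):

```shell
# Example: flashinfer at its pinned tag; the pattern repeats per component
git clone --depth 1 --branch v0.4.1 \
  https://github.com/flashinfer-ai/flashinfer.git
cd flashinfer
uv pip wheel . -v --no-deps -w /wheels   # wheel lands in /wheels for a later install
```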

Target Architecture: 9.0a (NVIDIA Hopper)
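The 9.0a target is typically pinned via environment variables before the source builds run; exact variable names vary by project (the first is the PyTorch convention, the second a common CMake convention — both are assumptions about this Dockerfile):

```shell
export TORCH_CUDA_ARCH_LIST="9.0a"      # PyTorch C++/CUDA extension builds
export CMAKE_CUDA_ARCHITECTURES="90a"   # CMake-driven builds
```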

Latest Versions (as of 2026-04-02)

| Package | Stable  | Latest     |
|---------|---------|------------|
| vLLM    | v0.18.1 | v0.19.0rc1 |
| Triton  | 3.6.0   | 3.6.0      |
| CUDA    | 13.0.1  | 13.0.1     |

PyPI Wheel Status for aarch64

| Package      | aarch64 Wheel | Notes                         |
|--------------|---------------|-------------------------------|
| vLLM 0.18.1  | Yes           | Includes FA2, FA3, MoE kernels |
| Triton 3.6.0 | Yes           | Official wheel                |
| flashinfer   | No            | Must build from source        |
| xformers     | No            | Must build from source        |

Key Finding: The official vLLM aarch64 wheel includes pre-compiled CUDA kernels:

  • vllm/_C.abi3.so (381MB)
  • vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so (263MB)
  • vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so (142MB)
  • vllm/_moe_C.abi3.so (202MB)

This means basic vLLM on GH200 can use PyPI wheels directly without compilation.
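Before committing to a source build, it is easy to confirm an installed wheel actually shipped those kernels. A small helper (name and usage are illustrative):

```shell
# List the pre-built extension modules under an installed package directory,
# e.g. .../site-packages/vllm on a GH200.
compiled_kernels() {
  find "$1" -name '*.abi3.so' | sort
}
```

Pointing it at the installed vllm package directory should show the four .abi3.so files listed above.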

Build Options

Option 1: QEMU Cross-Compilation (Current Setup)

Pros:

  • Works on existing x86 Jenkins server
  • No GH200 required

Cons:

  • Very slow (hours)
  • CUDA kernels may not be fully optimized
  • flashinfer/xformers need source builds

When to use: When no GH200 available

Option 2: Native GH200 Build

Pros:

  • Much faster compilation
  • Properly optimized ARM64 + Hopper kernels
  • All dependencies build correctly

Cons:

  • Need GH200 access

When to use: For production-quality builds

Option 3: Hybrid (PyPI wheels + source builds)

Approach:

  1. Use vLLM + Triton from PyPI (pre-compiled aarch64)
  2. Build only flashinfer/xformers from source (if needed)

Pros:

  • Fastest option
  • Official wheels for core components

Cons:

  • May miss some optimizations
  • flashinfer/xformers still need source builds

Recommended Path

  1. Immediate: Try PyPI wheels on a GH200 — if they work, no build needed
  2. If PyPI insufficient: Run Jenkins build with VLLM_VERSION=v0.18.1
  3. Production: Get a GH200 for native builds
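At its simplest, the hybrid path in option 3 reduces to the following (sketch; assumes uv is available and a CUDA 13 runtime is present on the target, versions per the tables above):

```shell
# Core components from pre-built aarch64 wheels
uv pip install "vllm==0.18.1" "triton==3.6.0"
# flashinfer / xformers: build from source only if the workload needs them
```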

Source Repo

Applied Patches

  1. setuptools pin — Pin setuptools to <81.0.0 for LMCache compatibility

    • Changed: RUN uv pip install -U build cmake ninja pybind11 setuptools wheel
    • To: RUN uv pip install -U build cmake ninja pybind11 "setuptools>=77.0.3,<81.0.0" wheel
  2. uv pip for flash-attention — Use uv pip instead of pip3 (torch is in venv)

    • Changed: pip3 wheel . -v --no-deps
    • To: uv pip wheel . -v --no-deps

Last updated: 2026-04-02 by Clawmine