Last commit: 9da93ec625 by biondizzle, 2026-04-02 20:19:39 +00:00: Fix setuptools pin and flash-attention build for GH200

  • Pin setuptools>=77.0.3,<81.0.0 for LMCache compatibility
  • Use 'uv pip wheel' instead of 'pip3 wheel' for flash-attention (torch is in venv)
  • Add CLAWMINE.md with build pipeline documentation


GH200 vLLM Container Build Pipeline

Managed by Clawmine — /home/openclaw/dev/grace-gpu-containers

Overview

This repo builds vLLM containers for the NVIDIA GH200 (Grace Hopper: ARM64 Grace CPU + H100 GPU). The challenge: prebuilt wheels for aarch64 are limited, and NVIDIA doesn't publish its NGC Dockerfiles.

Jenkins Pipeline

Server: https://jenkins.sweetapi.com/
Job: gh200-vllm-build
Status: Build #2 in progress (see Build History)

Jenkins Server (Build Machine)

Host: 66.135.24.21
User: root
Pass: Wy9,za7+8BL(v@ZT

Setup:

  • Docker buildx multiarch builder configured
  • QEMU user-static installed for ARM64 emulation
  • 780GB free disk
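The setup above can be reproduced with standard buildx/QEMU commands; the builder name "gh200" below is an arbitrary choice, not taken from the server config:

```shell
# Register QEMU binfmt handlers so the x86 host can run arm64 build steps
docker run --privileged --rm tonistiigi/binfmt --install arm64

# Create and bootstrap a dedicated multiarch builder
docker buildx create --name gh200 --driver docker-container --use
docker buildx inspect --bootstrap   # output should list linux/arm64 among the platforms
```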

Build Parameters

| Parameter        | Default    | Description                   |
|------------------|------------|-------------------------------|
| VLLM_VERSION     | v0.18.1    | vLLM git tag                  |
| CUDA_VERSION     | 13.0.1     | CUDA version                  |
| IMAGE_TAG        | gh200-vllm | Docker image name             |
| PUSH_TO_REGISTRY | true       | Push to Vultr CR after build  |

Container Registry

URL: sjc.vultrcr.com/charizard
User: 891294d0-df76-4c37-b41d-2b77f95a54c1
Pass: H3aE2NfqRLs5Aio6SCnnDKBJwnB6rsJfFZ7E

Images:

  • sjc.vultrcr.com/charizard/gh200-vllm:v0.18.1
  • sjc.vultrcr.com/charizard/gh200-vllm:latest
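Pulling an image locally (sketch; the `VULTR_CR_PASS` variable is illustrative — it just keeps the password listed above off the command line):

```shell
# Authenticate against the Vultr registry, then pull the latest image
printf '%s' "$VULTR_CR_PASS" | docker login sjc.vultrcr.com \
  -u 891294d0-df76-4c37-b41d-2b77f95a54c1 --password-stdin
docker pull sjc.vultrcr.com/charizard/gh200-vllm:latest
```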

Build History

| Build # | Version | Status  | Duration | Notes                                              |
|---------|---------|---------|----------|----------------------------------------------------|
| 2       | v0.18.1 | RUNNING | -        | Started 2026-04-02 ~20:03 UTC; setuptools pinned   |
| 1       | v0.18.1 | FAILED  | ~15 min  | setuptools 82.0.1 incompatible with LMCache        |

Monitoring

Cron Job: Every 30 minutes, checks build status and notifies on completion.

  • Job ID: e06d540f-899e-4358-92df-85fe036b05e2
  • Script: ~/.openclaw/scripts/check-jenkins-vllm-build.sh
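The script itself isn't reproduced here; a minimal sketch of the status check it performs might look like the following (the `build_status` helper and the plain-shell JSON matching are assumptions, chosen to avoid a jq dependency):

```shell
#!/bin/sh
# Map a Jenkins lastBuild/api/json payload to a one-word status.
build_status() {
  case "$1" in
    *'"building":true'*)    echo "RUNNING" ;;
    *'"result":"SUCCESS"'*) echo "SUCCESS" ;;
    *'"result":"FAILURE"'*) echo "FAILED"  ;;
    *)                      echo "UNKNOWN" ;;
  esac
}

# In the real script the payload would come from:
#   curl -s -u "admin:<token>" \
#     "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json"
```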

Manual check:

```shell
curl -s -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '{result, building, duration}'
```

Triggering a Build

```shell
# Via API
curl -X POST "https://jenkins.sweetapi.com/job/gh200-vllm-build/buildWithParameters" \
  -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
  -d "VLLM_VERSION=v0.18.1"

# Check status
curl -s -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '.result'
```

Current State

Dockerfile (vllm/Dockerfile)

Builds everything from source:

  • triton (release/3.5.x)
  • xformers (johnnynunez fork)
  • flashinfer (v0.4.1)
  • flash-attention (hopper branch)
  • lmcache (v0.3.7)
  • infinistore (main)
  • vLLM (configurable via VLLM_REF)
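Each component follows the same pattern in the Dockerfile: clone a pinned ref, then build a wheel inside the venv. A hedged sketch for one component (the repo URL, flags, and `/wheels` output directory are illustrative, not lifted from the Dockerfile):

```shell
# Example: flashinfer at its pinned tag; the pattern repeats per component
git clone --depth 1 --branch v0.4.1 \
  https://github.com/flashinfer-ai/flashinfer.git
cd flashinfer
uv pip wheel . -v --no-deps -w /wheels   # wheel lands in /wheels for a later install
```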

Target Architecture: 9.0a (NVIDIA Hopper)
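The 9.0a target is typically pinned via environment variables before the source builds run; exact variable names vary by project (the first is the PyTorch convention, the second a common CMake convention — both are assumptions about this Dockerfile):

```shell
export TORCH_CUDA_ARCH_LIST="9.0a"      # PyTorch C++/CUDA extension builds
export CMAKE_CUDA_ARCHITECTURES="90a"   # CMake-driven builds
```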

Latest Versions (as of 2026-04-02)

| Package | Stable  | Latest     |
|---------|---------|------------|
| vLLM    | v0.18.1 | v0.19.0rc1 |
| Triton  | 3.6.0   | 3.6.0      |
| CUDA    | 13.0.1  | 13.0.1     |

PyPI Wheel Status for aarch64

| Package      | aarch64 Wheel | Notes                         |
|--------------|---------------|-------------------------------|
| vLLM 0.18.1  | Yes           | Includes FA2, FA3, MoE kernels |
| Triton 3.6.0 | Yes           | Official wheel                |
| flashinfer   | No            | Must build from source        |
| xformers     | No            | Must build from source        |

Key Finding: The official vLLM aarch64 wheel includes pre-compiled CUDA kernels:

  • vllm/_C.abi3.so (381MB)
  • vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so (263MB)
  • vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so (142MB)
  • vllm/_moe_C.abi3.so (202MB)

This means basic vLLM on GH200 can use PyPI wheels directly without compilation.
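Before committing to a source build, it is easy to confirm an installed wheel actually shipped those kernels. A small helper (name and usage are illustrative):

```shell
# List the pre-built extension modules under an installed package directory,
# e.g. .../site-packages/vllm on a GH200.
compiled_kernels() {
  find "$1" -name '*.abi3.so' | sort
}
```

Pointing it at the installed vllm package directory should show the four .abi3.so files listed above.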

Build Options

Option 1: QEMU Cross-Compilation (Current Setup)

Pros:

  • Works on existing x86 Jenkins server
  • No GH200 required

Cons:

  • Very slow (hours)
  • CUDA kernels may not be fully optimized
  • flashinfer/xformers need source builds

When to use: When no GH200 available

Option 2: Native GH200 Build

Pros:

  • Much faster compilation
  • Properly optimized ARM64 + Hopper kernels
  • All dependencies build correctly

Cons:

  • Need GH200 access

When to use: For production-quality builds

Option 3: Hybrid (PyPI wheels + source builds)

Approach:

  1. Use vLLM + Triton from PyPI (pre-compiled aarch64)
  2. Build only flashinfer/xformers from source (if needed)

Pros:

  • Fastest option
  • Official wheels for core components

Cons:

  • May miss some optimizations
  • flashinfer/xformers still need source builds

Recommended Path

  1. Immediate: Try PyPI wheels on a GH200 — if they work, no build needed
  2. If PyPI insufficient: Run Jenkins build with VLLM_VERSION=v0.18.1
  3. Production: Get a GH200 for native builds
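At its simplest, the hybrid path in option 3 reduces to the following (sketch; assumes uv is available and a CUDA 13 runtime is present on the target, versions per the tables above):

```shell
# Core components from pre-built aarch64 wheels
uv pip install "vllm==0.18.1" "triton==3.6.0"
# flashinfer / xformers: build from source only if the workload needs them
```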

Source Repo

Applied Patches

  1. setuptools pin — Pin setuptools to <81.0.0 for LMCache compatibility

    • Changed: RUN uv pip install -U build cmake ninja pybind11 setuptools wheel
    • To: RUN uv pip install -U build cmake ninja pybind11 "setuptools>=77.0.3,<81.0.0" wheel
  2. uv pip for flash-attention — Use uv pip instead of pip3 (torch is in venv)

    • Changed: pip3 wheel . -v --no-deps
    • To: uv pip wheel . -v --no-deps

Last updated: 2026-04-02 by Clawmine