# GH200 vLLM Container Build Pipeline
> Managed by Clawmine — `/home/openclaw/dev/grace-gpu-containers`
## Overview
Building vLLM containers for NVIDIA GH200 (Grace Hopper, ARM64 + H100 GPU). The challenge: prebuilt wheels for `aarch64` are limited, and NVIDIA doesn't publish NGC Dockerfiles.
## Jenkins Pipeline
**Server:** https://jenkins.sweetapi.com/
**Job:** `gh200-vllm-build`
**Status:** Configured; first builds underway (see Build History below)
### Jenkins Server (Build Machine)
```
Host: 66.135.24.21
User: root
Pass: Wy9,za7+8BL(v@ZT
```
**Setup:**
- Docker buildx `multiarch` builder configured
- QEMU user-static installed for ARM64 emulation
- 780GB free disk
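The builder setup above can be reproduced on a fresh x86 host with roughly the following (a sketch: the builder name `multiarch` matches the Jenkins config; the rest are standard Docker/QEMU bootstrap commands):

```shell
# Register QEMU binfmt handlers so ARM64 binaries run under emulation
docker run --privileged --rm tonistiigi/binfmt --install arm64

# Create and activate a buildx builder that can target linux/arm64
docker buildx create --name multiarch --driver docker-container --use
docker buildx inspect --bootstrap

# Confirm linux/arm64 appears in the supported platforms list
docker buildx inspect | grep -i platforms
```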
### Build Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `VLLM_VERSION` | `v0.18.1` | vLLM git tag |
| `CUDA_VERSION` | `13.0.1` | CUDA version |
| `IMAGE_TAG` | `gh200-vllm` | Docker image name |
| `PUSH_TO_REGISTRY` | `true` | Push to Vultr CR after build |
### Container Registry
**URL:** `sjc.vultrcr.com/charizard`
**User:** `891294d0-df76-4c37-b41d-2b77f95a54c1`
**Pass:** `H3aE2NfqRLs5Aio6SCnnDKBJwnB6rsJfFZ7E`
**Images:**
- `sjc.vultrcr.com/charizard/gh200-vllm:v0.18.1`
- `sjc.vultrcr.com/charizard/gh200-vllm:latest`
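To pull a built image on a GH200 host, log in with the registry credentials above (a sketch; tags match the image list):

```shell
# Log in to the Vultr container registry
docker login sjc.vultrcr.com \
  -u "891294d0-df76-4c37-b41d-2b77f95a54c1" \
  -p "H3aE2NfqRLs5Aio6SCnnDKBJwnB6rsJfFZ7E"

# Pull the versioned image
docker pull sjc.vultrcr.com/charizard/gh200-vllm:v0.18.1
```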
### Build History
| Build # | Version | Status | Duration | Notes |
|---------|---------|--------|----------|-------|
| 2 | v0.18.1 | RUNNING | - | Started 2026-04-02 ~20:03 UTC — setuptools pinned |
| 1 | v0.18.1 | FAILED | ~15 min | setuptools 82.0.1 incompatible with LMCache |
### Monitoring
**Cron Job:** Every 30 minutes, checks build status and notifies on completion.
- Job ID: `e06d540f-899e-4358-92df-85fe036b05e2`
- Script: `~/.openclaw/scripts/check-jenkins-vllm-build.sh`
**Manual check:**
```bash
curl -s -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
"https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '{result, building, duration}'
```
### Triggering a Build
```bash
# Via API
curl -X POST "https://jenkins.sweetapi.com/job/gh200-vllm-build/buildWithParameters" \
-u "admin:1112cc255997ecb7c34d0089c4edddd976" \
-d "VLLM_VERSION=v0.18.1"
# Check status
curl -s -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
"https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '.result'
```
## Current State
### Dockerfile (`vllm/Dockerfile`)
Builds everything from source:
- **triton** (release/3.5.x)
- **xformers** (johnnynunez fork)
- **flashinfer** (v0.4.1)
- **flash-attention** (hopper branch)
- **lmcache** (v0.3.7)
- **infinistore** (main)
- **vLLM** (configurable via `VLLM_REF`)
**Target Architecture:** `9.0a` (NVIDIA Hopper)
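A local cross-build of this Dockerfile would look roughly like the following (a sketch: it assumes the Dockerfile exposes `VLLM_REF` as a build arg, per the notes above, and that the `multiarch` buildx builder is active):

```shell
# Cross-build for ARM64 under QEMU; VLLM_REF selects the vLLM git tag.
# --push sends the result straight to the Vultr registry on success.
docker buildx build \
  --platform linux/arm64 \
  --build-arg VLLM_REF=v0.18.1 \
  -t sjc.vultrcr.com/charizard/gh200-vllm:v0.18.1 \
  -f vllm/Dockerfile \
  --push \
  .
```

Expect this to take hours under emulation; the same command runs natively (and much faster) on a GH200.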
### Latest Versions (as of 2026-04-02)
| Package | Stable | Latest |
|---------|--------|--------|
| vLLM | v0.18.1 | v0.19.0rc1 |
| Triton | 3.6.0 | 3.6.0 |
| CUDA | 13.0.1 | 13.0.1 |
## PyPI Wheel Status for aarch64
| Package | aarch64 Wheel | Notes |
|---------|---------------|-------|
| vLLM 0.18.1 | ✅ Yes | Includes FA2, FA3, MoE kernels |
| Triton 3.6.0 | ✅ Yes | Official wheel |
| flashinfer | ❌ No | Must build from source |
| xformers | ❌ No | Must build from source |
**Key Finding:** The official vLLM aarch64 wheel includes pre-compiled CUDA kernels:
- `vllm/_C.abi3.so` (381MB)
- `vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so` (263MB)
- `vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so` (142MB)
- `vllm/_moe_C.abi3.so` (202MB)
This means **basic vLLM on GH200 can use PyPI wheels directly** without compilation.
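One way to verify this on an aarch64 host is to install the wheel and list the bundled kernels (a sketch; the glob pattern assumes the `.abi3.so` layout shown above):

```shell
# On a GH200 (or any aarch64 + CUDA host): install the official wheel
pip install vllm==0.18.1

# List the pre-compiled CUDA kernel libraries shipped inside the package
python - <<'EOF'
import pathlib
import vllm

pkg = pathlib.Path(vllm.__file__).parent
for so in sorted(pkg.rglob("*_C.abi3.so")):
    print(so.relative_to(pkg), f"{so.stat().st_size / 2**20:.0f} MiB")
EOF
```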
## Build Options
### Option 1: QEMU Cross-Compilation (Current Setup)
**Pros:**
- Works on existing x86 Jenkins server
- No GH200 required
**Cons:**
- Very slow (hours)
- CUDA kernels may not be fully optimized
- flashinfer/xformers need source builds
**When to use:** When no GH200 is available
### Option 2: Native GH200 Build
**Pros:**
- Much faster compilation
- Properly optimized ARM64 + Hopper kernels
- All dependencies build correctly
**Cons:**
- Need GH200 access
**When to use:** For production-quality builds
### Option 3: Hybrid (PyPI wheels + source builds)
**Approach:**
1. Use vLLM + Triton from PyPI (pre-compiled aarch64)
2. Build only flashinfer/xformers from source (if needed)
**Pros:**
- Fastest option
- Official wheels for core components
**Cons:**
- May miss some optimizations
- flashinfer/xformers still need source builds
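The hybrid approach might look like the following on a GH200 (a sketch with assumptions: the flashinfer GitHub repo URL, the `v0.4.1` tag, and the `TORCH_CUDA_ARCH_LIST` build flag are illustrative, not taken from the Dockerfile):

```shell
# Step 1: official aarch64 wheels for the core components
pip install vllm==0.18.1 triton==3.6.0

# Step 2: source build only where no aarch64 wheel exists.
# Repo URL, tag, and arch flag are assumptions for illustration.
git clone --branch v0.4.1 --recursive \
  https://github.com/flashinfer-ai/flashinfer.git
cd flashinfer
TORCH_CUDA_ARCH_LIST="9.0a" pip install --no-build-isolation -v .
```

The same pattern applies to xformers if it turns out to be needed.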
## Recommended Path
1. **Immediate:** Try PyPI wheels on a GH200 — if they work, no build needed
2. **If PyPI insufficient:** Run Jenkins build with `VLLM_VERSION=v0.18.1`
3. **Production:** Get a GH200 for native builds
## Source Repo
- **Our fork:** https://sweetapi.com/biondizzle/grace-gpu-containers.git (Jenkins pulls from here)
- **Upstream:** https://github.com/rajesh-s/grace-gpu-containers
### Local repo: `/home/openclaw/dev/grace-gpu-containers`
```bash
git remote -v
origin ssh://git@sweetapi.com:2222/biondizzle/grace-gpu-containers.git (push)
upstream https://github.com/rajesh-s/grace-gpu-containers.git (fetch)
```
### Changes from upstream
1. **setuptools pin** — Pin setuptools to `<81.0.0` for LMCache compatibility
- Changed: `RUN uv pip install -U build cmake ninja pybind11 setuptools wheel`
- To: `RUN uv pip install -U build cmake ninja pybind11 "setuptools>=77.0.3,<81.0.0" wheel`
2. **pip for flash-attention** — Use `pip wheel` instead of `pip3 wheel` (venv has pip, not pip3)
- Changed: `pip3 wheel . -v --no-deps`
- To: `pip wheel . -v --no-deps`
3. **CLAWMINE.md** — This documentation file
---
*Last updated: 2026-04-02 by Clawmine*