# GH200 vLLM Container Build Pipeline
> Managed by Clawmine — `/home/openclaw/dev/grace-gpu-containers`
## Overview
Building vLLM containers for NVIDIA GH200 (Grace Hopper, ARM64 + H100 GPU). The challenge: prebuilt wheels for `aarch64` are limited, and NVIDIA doesn't publish NGC Dockerfiles.
## Jenkins Pipeline
**Server:** https://jenkins.sweetapi.com/
**Job:** `gh200-vllm-build`
**Status:** Configured; first builds underway (see Build History below)
### Jenkins Server (Build Machine)
```
Host: 66.135.24.21
User: root
Pass: Wy9,za7+8BL(v@ZT
```
**Setup:**
- Docker buildx `multiarch` builder configured
- QEMU user-static installed for ARM64 emulation
- 780GB free disk
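The builder setup above can be reproduced on a fresh x86 host with roughly the following (a sketch: the builder name `multiarch` matches the Jenkins config; the rest are standard Docker/QEMU bootstrap commands):

```shell
# Register QEMU binfmt handlers so ARM64 binaries run under emulation
docker run --privileged --rm tonistiigi/binfmt --install arm64

# Create and activate a buildx builder that can target linux/arm64
docker buildx create --name multiarch --driver docker-container --use
docker buildx inspect --bootstrap

# Confirm linux/arm64 appears in the supported platforms list
docker buildx inspect | grep -i platforms
```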
### Build Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `VLLM_VERSION` | `v0.18.1` | vLLM git tag |
| `CUDA_VERSION` | `13.0.1` | CUDA version |
| `IMAGE_TAG` | `gh200-vllm` | Docker image name |
| `PUSH_TO_REGISTRY` | `true` | Push to Vultr CR after build |
### Container Registry
**URL:** `sjc.vultrcr.com/charizard`
**User:** `891294d0-df76-4c37-b41d-2b77f95a54c1`
**Pass:** `H3aE2NfqRLs5Aio6SCnnDKBJwnB6rsJfFZ7E`
**Images:**
- `sjc.vultrcr.com/charizard/gh200-vllm:v0.18.1`
- `sjc.vultrcr.com/charizard/gh200-vllm:latest`
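To pull a built image on a GH200 host, log in with the registry credentials above (a sketch; tags match the image list):

```shell
# Log in to the Vultr container registry
docker login sjc.vultrcr.com \
  -u "891294d0-df76-4c37-b41d-2b77f95a54c1" \
  -p "H3aE2NfqRLs5Aio6SCnnDKBJwnB6rsJfFZ7E"

# Pull the versioned image
docker pull sjc.vultrcr.com/charizard/gh200-vllm:v0.18.1
```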
### Build History
| Build # | Version | Status | Duration | Notes |
|---------|---------|--------|----------|-------|
| 2 | v0.18.1 | RUNNING | - | Started 2026-04-02 ~20:03 UTC — setuptools pinned |
| 1 | v0.18.1 | FAILED | ~15 min | setuptools 82.0.1 incompatible with LMCache |
### Monitoring
**Cron Job:** Every 30 minutes, checks build status and notifies on completion.
- Job ID: `e06d540f-899e-4358-92df-85fe036b05e2`
- Script: `~/.openclaw/scripts/check-jenkins-vllm-build.sh`
**Manual check:**
```bash
curl -s -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
"https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '{result, building, duration}'
```
### Triggering a Build
```bash
# Via API
curl -X POST "https://jenkins.sweetapi.com/job/gh200-vllm-build/buildWithParameters" \
-u "admin:1112cc255997ecb7c34d0089c4edddd976" \
-d "VLLM_VERSION=v0.18.1"
# Check status
curl -s -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
"https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '.result'
```
## Current State
### Dockerfile (`vllm/Dockerfile`)
Builds everything from source:
- **triton** (release/3.5.x)
- **xformers** (johnnynunez fork)
- **flashinfer** (v0.4.1)
- **flash-attention** (hopper branch)
- **lmcache** (v0.3.7)
- **infinistore** (main)
- **vLLM** (configurable via `VLLM_REF`)
**Target Architecture:** `9.0a` (NVIDIA Hopper)
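A local cross-build of this Dockerfile would look roughly like the following (a sketch: it assumes the Dockerfile exposes `VLLM_REF` as a build arg, per the notes above, and that the `multiarch` buildx builder is active):

```shell
# Cross-build for ARM64 under QEMU; VLLM_REF selects the vLLM git tag.
# --push sends the result straight to the Vultr registry on success.
docker buildx build \
  --platform linux/arm64 \
  --build-arg VLLM_REF=v0.18.1 \
  -t sjc.vultrcr.com/charizard/gh200-vllm:v0.18.1 \
  -f vllm/Dockerfile \
  --push \
  .
```

Expect this to take hours under emulation; the same command runs natively (and much faster) on a GH200.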
### Latest Versions (as of 2026-04-02)
| Package | Stable | Latest |
|---------|--------|--------|
| vLLM | v0.18.1 | v0.19.0rc1 |
| Triton | 3.6.0 | 3.6.0 |
| CUDA | 13.0.1 | 13.0.1 |
## PyPI Wheel Status for aarch64
| Package | aarch64 Wheel | Notes |
|---------|---------------|-------|
| vLLM 0.18.1 | ✅ Yes | Includes FA2, FA3, MoE kernels |
| Triton 3.6.0 | ✅ Yes | Official wheel |
| flashinfer | ❌ No | Must build from source |
| xformers | ❌ No | Must build from source |
**Key Finding:** The official vLLM aarch64 wheel includes pre-compiled CUDA kernels:
- `vllm/_C.abi3.so` (381MB)
- `vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so` (263MB)
- `vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so` (142MB)
- `vllm/_moe_C.abi3.so` (202MB)
This means **basic vLLM on GH200 can use PyPI wheels directly** without compilation.
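One way to verify this on an aarch64 host is to install the wheel and list the bundled kernels (a sketch; the glob pattern assumes the `.abi3.so` layout shown above):

```shell
# On a GH200 (or any aarch64 + CUDA host): install the official wheel
pip install vllm==0.18.1

# List the pre-compiled CUDA kernel libraries shipped inside the package
python - <<'EOF'
import pathlib
import vllm

pkg = pathlib.Path(vllm.__file__).parent
for so in sorted(pkg.rglob("*_C.abi3.so")):
    print(so.relative_to(pkg), f"{so.stat().st_size / 2**20:.0f} MiB")
EOF
```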
## Build Options
### Option 1: QEMU Cross-Compilation (Current Setup)
**Pros:**
- Works on existing x86 Jenkins server
- No GH200 required
**Cons:**
- Very slow (hours)
- CUDA kernels may not be fully optimized
- flashinfer/xformers need source builds
**When to use:** When no GH200 is available
### Option 2: Native GH200 Build
**Pros:**
- Much faster compilation
- Properly optimized ARM64 + Hopper kernels
- All dependencies build correctly
**Cons:**
- Need GH200 access
**When to use:** For production-quality builds
### Option 3: Hybrid (PyPI wheels + source builds)
**Approach:**
1. Use vLLM + Triton from PyPI (pre-compiled aarch64)
2. Build only flashinfer/xformers from source (if needed)
**Pros:**
- Fastest option
- Official wheels for core components
**Cons:**
- May miss some optimizations
- flashinfer/xformers still need source builds
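The hybrid approach might look like the following on a GH200 (a sketch with assumptions: the flashinfer GitHub repo URL, the `v0.4.1` tag, and the `TORCH_CUDA_ARCH_LIST` build flag are illustrative, not taken from the Dockerfile):

```shell
# Step 1: official aarch64 wheels for the core components
pip install vllm==0.18.1 triton==3.6.0

# Step 2: source build only where no aarch64 wheel exists.
# Repo URL, tag, and arch flag are assumptions for illustration.
git clone --branch v0.4.1 --recursive \
  https://github.com/flashinfer-ai/flashinfer.git
cd flashinfer
TORCH_CUDA_ARCH_LIST="9.0a" pip install --no-build-isolation -v .
```

The same pattern applies to xformers if it turns out to be needed.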
## Recommended Path
1. **Immediate:** Try PyPI wheels on a GH200 — if they work, no build needed
2. **If PyPI insufficient:** Run Jenkins build with `VLLM_VERSION=v0.18.1`
3. **Production:** Get a GH200 for native builds
## Source Repo
- **Our fork:** https://sweetapi.com/biondizzle/grace-gpu-containers.git (Jenkins pulls from here)
- **Upstream:** https://github.com/rajesh-s/grace-gpu-containers
### Local repo: `/home/openclaw/dev/grace-gpu-containers`
```bash
git remote -v
origin ssh://git@sweetapi.com:2222/biondizzle/grace-gpu-containers.git (push)
upstream https://github.com/rajesh-s/grace-gpu-containers.git (fetch)
```
### Changes from upstream
1. **setuptools pin** — Pin setuptools to `<81.0.0` for LMCache compatibility
- Changed: `RUN uv pip install -U build cmake ninja pybind11 setuptools wheel`
- To: `RUN uv pip install -U build cmake ninja pybind11 "setuptools>=77.0.3,<81.0.0" wheel`
2. **pip for flash-attention** — Use `pip wheel` instead of `pip3 wheel` (venv has pip, not pip3)
- Changed: `pip3 wheel . -v --no-deps`
- To: `pip wheel . -v --no-deps`
3. **CLAWMINE.md** — This documentation file
---
*Last updated: 2026-04-02 by Clawmine*