# GH200 vLLM Container Build Pipeline

> Managed by Clawmine — `/home/openclaw/dev/grace-gpu-containers`

## Overview

Building vLLM containers for NVIDIA GH200 (Grace Hopper: ARM64 CPU + H100 GPU). The challenge: prebuilt wheels for `aarch64` are limited, and NVIDIA does not publish its NGC Dockerfiles.

## Jenkins Pipeline

**Server:** https://jenkins.sweetapi.com/
**Job:** `gh200-vllm-build`
**Status:** Configured, never run (ready for first build)

### Jenkins Server (Build Machine)

```
Host: 66.135.24.21
User: root
Pass: Wy9,za7+8BL(v@ZT
```

**Setup:**

- Docker buildx `multiarch` builder configured
- QEMU user-static installed for ARM64 emulation
- 780GB free disk
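
The builder setup above can be reproduced with standard Docker commands; a sketch (the exact commands run on the server were not recorded, so the builder name and flags here are the usual defaults, not a transcript):

```shell
# Register QEMU binfmt handlers so the x86 host can execute ARM64 binaries
docker run --privileged --rm tonistiigi/binfmt --install arm64

# Create a buildx builder named "multiarch" and make it the default
docker buildx create --name multiarch --driver docker-container --use
docker buildx inspect --bootstrap   # starts the builder and lists supported platforms
```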

### Build Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `VLLM_VERSION` | `v0.18.1` | vLLM git tag |
| `CUDA_VERSION` | `13.0.1` | CUDA version |
| `IMAGE_TAG` | `gh200-vllm` | Docker image name |
| `PUSH_TO_REGISTRY` | `true` | Push to Vultr CR after build |

### Container Registry

**URL:** `sjc.vultrcr.com/charizard`
**User:** `891294d0-df76-4c37-b41d-2b77f95a54c1`
**Pass:** `H3aE2NfqRLs5Aio6SCnnDKBJwnB6rsJfFZ7E`

**Images:**

- `sjc.vultrcr.com/charizard/gh200-vllm:v0.18.1`
- `sjc.vultrcr.com/charizard/gh200-vllm:latest`
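
To use these images from another host, the standard Docker registry commands apply (sketch using the credentials above):

```shell
# Log in to the Vultr container registry
docker login sjc.vultrcr.com \
  -u "891294d0-df76-4c37-b41d-2b77f95a54c1" \
  -p "H3aE2NfqRLs5Aio6SCnnDKBJwnB6rsJfFZ7E"

# Pull the latest GH200 vLLM image
docker pull sjc.vultrcr.com/charizard/gh200-vllm:latest
```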

### Build History

| Build # | Version | Status | Duration | Notes |
|---------|---------|--------|----------|-------|
| 2 | v0.18.1 | RUNNING | - | Started 2026-04-02 ~20:03 UTC — setuptools pinned |
| 1 | v0.18.1 | FAILED | ~15 min | setuptools 82.0.1 incompatible with LMCache |

### Monitoring

**Cron Job:** Every 30 minutes, checks build status and notifies on completion.

- Job ID: `e06d540f-899e-4358-92df-85fe036b05e2`
- Script: `~/.openclaw/scripts/check-jenkins-vllm-build.sh`
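
The script itself is not reproduced here; its core decision presumably comes down to inspecting the `building` and `result` fields of the Jenkins `lastBuild` JSON, roughly like this (hypothetical sketch, not the actual script):

```shell
# Hypothetical sketch of the status logic in check-jenkins-vllm-build.sh.
# classify_build reads a lastBuild JSON payload and prints RUNNING while the
# build is in progress, otherwise the final result (SUCCESS, FAILURE, ...).
classify_build() {
  local json="$1"
  if [ "$(jq -r '.building' <<<"$json")" = "true" ]; then
    echo "RUNNING"
  else
    jq -r '.result' <<<"$json"
  fi
}

# Example with a finished build's payload:
classify_build '{"building": false, "result": "SUCCESS"}'   # prints SUCCESS
```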

**Manual check:**

```bash
curl -s -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '{result, building, duration}'
```

### Triggering a Build

```bash
# Via API
curl -X POST "https://jenkins.sweetapi.com/job/gh200-vllm-build/buildWithParameters" \
  -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
  -d "VLLM_VERSION=v0.18.1"

# Check status
curl -s -u "admin:1112cc255997ecb7c34d0089c4edddd976" \
  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '.result'
```

## Current State

### Dockerfile (`vllm/Dockerfile`)

Builds everything from source:

- **triton** (release/3.5.x)
- **xformers** (johnnynunez fork)
- **flashinfer** (v0.4.1)
- **flash-attention** (hopper branch)
- **lmcache** (v0.3.7)
- **infinistore** (main)
- **vLLM** (configurable via `VLLM_REF`)

**Target Architecture:** `9.0a` (NVIDIA Hopper)
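
In practice the `9.0a` target surfaces as architecture-list variables during the source builds; the variable names below are the standard PyTorch/CMake knobs and are an assumption, not copied from this Dockerfile:

```shell
# Assumed arch settings for Hopper (sm_90a); not taken verbatim from vllm/Dockerfile
export TORCH_CUDA_ARCH_LIST="9.0a"      # PyTorch C++/CUDA extension builds
export CMAKE_CUDA_ARCHITECTURES="90a"   # CMake-driven components
```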

### Latest Versions (as of 2026-04-02)

| Package | Stable | Latest |
|---------|--------|--------|
| vLLM | v0.18.1 | v0.19.0rc1 |
| Triton | 3.6.0 | 3.6.0 |
| CUDA | 13.0.1 | 13.0.1 |
## PyPI Wheel Status for aarch64
|
|
|
|
| Package | aarch64 Wheel | Notes |
|
|
|---------|---------------|-------|
|
|
| vLLM 0.18.1 | ✅ Yes | Includes FA2, FA3, MoE kernels |
|
|
| Triton 3.6.0 | ✅ Yes | Official wheel |
|
|
| flashinfer | ❌ No | Must build from source |
|
|
| xformers | ❌ No | Must build from source |
|
|
|
|
**Key Finding:** The official vLLM aarch64 wheel includes pre-compiled CUDA kernels:
|
|
- `vllm/_C.abi3.so` (381MB)
|
|
- `vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so` (263MB)
|
|
- `vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so` (142MB)
|
|
- `vllm/_moe_C.abi3.so` (202MB)
|
|
|
|
This means **basic vLLM on GH200 can use PyPI wheels directly** without compilation.
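
A minimal smoke test of the wheels-only path on a GH200 could look like this (sketch; assumes a recent Python and the NVIDIA driver are already present):

```shell
# Install vLLM straight from PyPI — the aarch64 wheel ships pre-compiled kernels
uv pip install "vllm==0.18.1"

# If the import succeeds, the bundled _C / FA2 / FA3 / MoE kernels loaded fine
python -c "import vllm; print(vllm.__version__)"
```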

## Build Options

### Option 1: QEMU Cross-Compilation (Current Setup)

**Pros:**

- Works on the existing x86 Jenkins server
- No GH200 required

**Cons:**

- Very slow (hours)
- CUDA kernels may not be fully optimized
- flashinfer/xformers need source builds

**When to use:** When no GH200 is available

### Option 2: Native GH200 Build

**Pros:**

- Much faster compilation
- Properly optimized ARM64 + Hopper kernels
- All dependencies build correctly

**Cons:**

- Needs GH200 access

**When to use:** For production-quality builds
### Option 3: Hybrid (PyPI wheels + source builds)
|
|
|
|
**Approach:**
|
|
1. Use vLLM + Triton from PyPI (pre-compiled aarch64)
|
|
2. Build only flashinfer/xformers from source (if needed)
|
|
|
|
**Pros:**
|
|
- Fastest option
|
|
- Official wheels for core components
|
|
|
|
**Cons:**
|
|
- May miss some optimizations
|
|
- flashinfer/xformers still need source builds
|
|
|
|
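
The hybrid approach can be sketched as shell steps run inside an ARM64 CUDA base image (the flashinfer repo URL is an assumption, not taken from the repo's Dockerfile):

```shell
# Hybrid build: wheels for the core stack, source builds only where needed.
# Assumed to run inside an aarch64 CUDA devel image.
uv pip install "vllm==0.18.1" "triton==3.6.0"

# flashinfer has no aarch64 wheel — build it from its v0.4.1 tag
# (repo URL assumed; adjust if vllm/Dockerfile uses a fork)
uv pip install --no-build-isolation \
  "git+https://github.com/flashinfer-ai/flashinfer.git@v0.4.1"
```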
## Recommended Path
|
|
|
|
1. **Immediate:** Try PyPI wheels on a GH200 — if they work, no build needed
|
|
2. **If PyPI insufficient:** Run Jenkins build with `VLLM_VERSION=v0.18.1`
|
|
3. **Production:** Get a GH200 for native builds
|
|
|
|
## Source Repo
|
|
|
|
- **Our fork:** https://sweetapi.com/biondizzle/grace-gpu-containers.git (Jenkins pulls from here)
|
|
- **Upstream:** https://github.com/rajesh-s/grace-gpu-containers
|
|
|
|
### Local repo: `/home/openclaw/dev/grace-gpu-containers`
|
|
|
|
```bash
|
|
git remote -v
|
|
origin ssh://git@sweetapi.com:2222/biondizzle/grace-gpu-containers.git (push)
|
|
upstream https://github.com/rajesh-s/grace-gpu-containers.git (fetch)
|
|
```
|
|
|
|
### Changes from upstream
|
|
|
|
1. **setuptools pin** — Pin setuptools to `<81.0.0` for LMCache compatibility
|
|
- Changed: `RUN uv pip install -U build cmake ninja pybind11 setuptools wheel`
|
|
- To: `RUN uv pip install -U build cmake ninja pybind11 "setuptools>=77.0.3,<81.0.0" wheel`
|
|
|
|
2. **pip for flash-attention** — Use `pip wheel` instead of `pip3 wheel` (venv has pip, not pip3)
|
|
- Changed: `pip3 wheel . -v --no-deps`
|
|
- To: `pip wheel . -v --no-deps`
|
|
|
|
3. **CLAWMINE.md** — This documentation file
|
|
|
|
---
|
|
|
|
*Last updated: 2026-04-02 by Clawmine*