Fix setuptools pin and flash-attention build for GH200

- Pin setuptools>=77.0.3,<81.0.0 for LMCache compatibility
- Use 'uv pip wheel' instead of 'pip3 wheel' for flash-attention (torch is in venv)
- Add CLAWMINE.md with build pipeline documentation
184
CLAWMINE.md
Normal file
@@ -0,0 +1,184 @@
# GH200 vLLM Container Build Pipeline

> Managed by Clawmine: `/home/openclaw/dev/grace-gpu-containers`

## Overview

This pipeline builds vLLM containers for the NVIDIA GH200 (Grace Hopper: ARM64 CPU + H100 GPU). The challenge: prebuilt wheels for `aarch64` are limited, and NVIDIA doesn't publish its NGC Dockerfiles.

## Jenkins Pipeline

**Server:** https://jenkins.sweetapi.com/
**Job:** `gh200-vllm-build`
**Status:** Build #2 running (see Build History)

### Jenkins Server (Build Machine)

```
Host: 66.135.24.21
User: root
Pass: <redacted>
```

**Setup:**
- Docker buildx `multiarch` builder configured
- QEMU user-static installed for ARM64 emulation
- 780 GB free disk

### Build Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `VLLM_VERSION` | `v0.18.1` | vLLM git tag |
| `CUDA_VERSION` | `13.0.1` | CUDA version |
| `IMAGE_TAG` | `gh200-vllm` | Docker image name |
| `PUSH_TO_REGISTRY` | `true` | Push to Vultr CR after build |

### Container Registry

**URL:** `sjc.vultrcr.com/charizard`
**User:** `891294d0-df76-4c37-b41d-2b77f95a54c1`
**Pass:** `<redacted>`

**Images:**
- `sjc.vultrcr.com/charizard/gh200-vllm:v0.18.1`
- `sjc.vultrcr.com/charizard/gh200-vllm:latest`
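
The two image references above follow from the registry URL, `IMAGE_TAG`, and `VLLM_VERSION`. A minimal sketch of that composition, assuming the pipeline tags each build with both the version and `latest`; `image_refs` is a hypothetical helper, not something taken from the Jenkins job:

```bash
#!/usr/bin/env bash
# Sketch only: image_refs is a hypothetical helper, not part of the Jenkins job.
# Prints the versioned and "latest" references for one build.
image_refs() {
  local registry="$1" image="$2" version="$3"
  printf '%s/%s:%s\n' "$registry" "$image" "$version"
  printf '%s/%s:latest\n' "$registry" "$image"
}

image_refs "sjc.vultrcr.com/charizard" "gh200-vllm" "v0.18.1"
```

Keeping the composition in one place avoids the version tag and the `latest` tag drifting apart when `IMAGE_TAG` or `VLLM_VERSION` changes.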

### Build History

| Build # | Version | Status | Duration | Notes |
|---------|---------|--------|----------|-------|
| 2 | v0.18.1 | RUNNING | - | Started 2026-04-02 ~20:03 UTC; setuptools pinned |
| 1 | v0.18.1 | FAILED | ~15 min | setuptools 82.0.1 incompatible with LMCache |

### Monitoring

**Cron Job:** Every 30 minutes, checks build status and notifies on completion.
- Job ID: `e06d540f-899e-4358-92df-85fe036b05e2`
- Script: `~/.openclaw/scripts/check-jenkins-vllm-build.sh`

**Manual check:**
```bash
# JENKINS_API_TOKEN is a placeholder: export the admin API token before running
curl -s -u "admin:${JENKINS_API_TOKEN}" \
  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '{result, building, duration}'
```
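
The `building` and `result` fields returned by the manual check can be folded into a single status word, which is roughly what a notification script needs. The sketch below is an assumption about how such a check might be written, not the contents of `check-jenkins-vllm-build.sh`; `build_state` is a hypothetical name, and the two values are passed in as arguments so the function can be tried without network access:

```bash
#!/usr/bin/env bash
# Sketch only: build_state is hypothetical, not the real monitoring script.
# Arguments: the "building" and "result" fields of lastBuild/api/json.
build_state() {
  local building="$1" result="$2"
  if [ "$building" = "true" ]; then
    echo "RUNNING"
  elif [ "$result" = "SUCCESS" ]; then
    echo "SUCCESS"
  elif [ -z "$result" ] || [ "$result" = "null" ]; then
    echo "UNKNOWN"
  else
    echo "FAILED ($result)"
  fi
}

# Against the live job, the two fields could be extracted with, for example:
#   curl -s -u "admin:${JENKINS_API_TOKEN}" "<job URL>/lastBuild/api/json" \
#     | jq -r '"\(.building) \(.result)"'
build_state "true" "null"
build_state "false" "FAILURE"
```

Jenkins reports `result` as `null` while a build is still in progress, which is why `building` is checked first.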

### Triggering a Build

```bash
# JENKINS_API_TOKEN is a placeholder: export the admin API token before running

# Via API
curl -X POST "https://jenkins.sweetapi.com/job/gh200-vllm-build/buildWithParameters" \
  -u "admin:${JENKINS_API_TOKEN}" \
  -d "VLLM_VERSION=v0.18.1"

# Check status
curl -s -u "admin:${JENKINS_API_TOKEN}" \
  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '.result'
```
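
The trigger call above sets only `VLLM_VERSION`; the other build parameters fall back to their defaults. A sketch of a dry-run helper that assembles the same request with any number of parameters may be useful for checking a command before sending it. `trigger_cmd` is a hypothetical name, and authentication is deliberately left out of the printed command:

```bash
#!/usr/bin/env bash
# Sketch only: trigger_cmd is hypothetical. It prints the curl command
# instead of running it, so nothing is sent to Jenkins.
trigger_cmd() {
  local base="$1" job="$2" cmd kv
  shift 2
  cmd="curl -X POST ${base}/job/${job}/buildWithParameters"
  for kv in "$@"; do
    cmd="$cmd --data $kv"
  done
  printf '%s\n' "$cmd"
}

trigger_cmd "https://jenkins.sweetapi.com" "gh200-vllm-build" \
  "VLLM_VERSION=v0.18.1" "CUDA_VERSION=13.0.1" \
  "IMAGE_TAG=gh200-vllm" "PUSH_TO_REGISTRY=true"
```

To send the request for real, add `-u "admin:${JENKINS_API_TOKEN}"` as in the examples above and run the printed command.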

## Current State

### Dockerfile (`vllm/Dockerfile`)

Builds everything from source:
- **triton** (release/3.5.x)
- **xformers** (johnnynunez fork)
- **flashinfer** (v0.4.1)
- **flash-attention** (hopper branch)
- **lmcache** (v0.3.7)
- **infinistore** (main)
- **vLLM** (configurable via `VLLM_REF`)

**Target Architecture:** `9.0a` (NVIDIA Hopper)
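
For readers less familiar with CUDA architecture strings, the `9.0a` target corresponds to `sm_90a` on the `nvcc` command line. The sketch below shows only that mapping; how the Dockerfile actually passes the target to each build (for example through an environment variable) is not stated here, and `gencode_flag` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Sketch only: gencode_flag is hypothetical and illustrates the naming
# convention, not how the Dockerfile configures its builds.
gencode_flag() {
  local sm
  sm="$(printf '%s' "$1" | tr -d '.')"   # "9.0a" -> "90a"
  printf '%s\n' "-gencode arch=compute_${sm},code=sm_${sm}"
}

gencode_flag "9.0a"
```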

### Latest Versions (as of 2026-04-02)

| Package | Stable | Latest |
|---------|--------|--------|
| vLLM | v0.18.1 | v0.19.0rc1 |
| Triton | 3.6.0 | 3.6.0 |
| CUDA | 13.0.1 | 13.0.1 |

## PyPI Wheel Status for aarch64

| Package | aarch64 Wheel | Notes |
|---------|---------------|-------|
| vLLM 0.18.1 | ✅ Yes | Includes FA2, FA3, MoE kernels |
| Triton 3.6.0 | ✅ Yes | Official wheel |
| flashinfer | ❌ No | Must build from source |
| xformers | ❌ No | Must build from source |

**Key Finding:** The official vLLM aarch64 wheel includes pre-compiled CUDA kernels:
- `vllm/_C.abi3.so` (381 MB)
- `vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so` (263 MB)
- `vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so` (142 MB)
- `vllm/_moe_C.abi3.so` (202 MB)

This means **basic vLLM on GH200 can use PyPI wheels directly** without compilation.

## Build Options

### Option 1: QEMU Emulation (Current Setup)

**Pros:**
- Works on the existing x86 Jenkins server
- No GH200 required

**Cons:**
- Very slow (hours)
- CUDA kernels may not be fully optimized
- flashinfer/xformers need source builds

**When to use:** When no GH200 is available
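
With the `multiarch` builder and QEMU in place, an emulated ARM64 build comes down to a single `docker buildx build` invocation. The sketch below prints such a command rather than running it. The build context (`.`), the `-f vllm/Dockerfile` path relative to it, and the use of `VLLM_REF` as the only build argument are assumptions; the real Jenkins job may differ. `buildx_cmd` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Sketch only: buildx_cmd is hypothetical; context path and build args
# are assumptions, not copied from the Jenkins job definition.
buildx_cmd() {
  local vllm_ref="$1" image="$2"
  printf '%s\n' "docker buildx build --builder multiarch --platform linux/arm64 --build-arg VLLM_REF=${vllm_ref} -f vllm/Dockerfile -t ${image}:${vllm_ref} ."
}

buildx_cmd "v0.18.1" "gh200-vllm"
```

On a native GH200 host the same command would run without emulation, since `linux/arm64` then matches the host platform.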

### Option 2: Native GH200 Build

**Pros:**
- Much faster compilation
- Properly optimized ARM64 + Hopper kernels
- All dependencies build correctly

**Cons:**
- Requires GH200 access

**When to use:** For production-quality builds

### Option 3: Hybrid (PyPI wheels + source builds)

**Approach:**
1. Use vLLM + Triton from PyPI (pre-compiled aarch64)
2. Build only flashinfer/xformers from source (if needed)

**Pros:**
- Fastest option
- Official wheels for core components

**Cons:**
- May miss some optimizations
- flashinfer/xformers still need source builds
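
The hybrid approach amounts to a per-package decision taken from the wheel status table: install from PyPI where an aarch64 wheel exists, build from source otherwise. A minimal sketch of that lookup, with `install_method` as a hypothetical helper that hard-codes the table as of 2026-04-02:

```bash
#!/usr/bin/env bash
# Sketch only: install_method is hypothetical and hard-codes the
# aarch64 wheel availability recorded in this document.
install_method() {
  case "$1" in
    vllm|triton)         echo "pypi" ;;
    flashinfer|xformers) echo "source" ;;
    *)                   echo "unknown" ;;
  esac
}

for pkg in vllm triton flashinfer xformers; do
  printf '%-10s %s\n' "$pkg" "$(install_method "$pkg")"
done
```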

## Recommended Path

1. **Immediate:** Try the PyPI wheels on a GH200. If they work, no build is needed
2. **If PyPI is insufficient:** Run the Jenkins build with `VLLM_VERSION=v0.18.1`
3. **Production:** Get a GH200 for native builds

## Source Repo

- Original: https://github.com/rajesh-s/grace-gpu-containers
- Jenkins clones from: https://github.com/rajesh-s/grace-gpu-containers.git
- **Note:** We apply patches locally since we don't have push access to the original repo

### Applied Patches

1. **setuptools pin**: Pin setuptools to `>=77.0.3,<81.0.0` for LMCache compatibility (build #1 failed with setuptools 82.0.1)
   - Changed: `RUN uv pip install -U build cmake ninja pybind11 setuptools wheel`
   - To: `RUN uv pip install -U build cmake ninja pybind11 "setuptools>=77.0.3,<81.0.0" wheel`

2. **uv pip for flash-attention**: Use `uv pip` instead of `pip3` (torch is in the venv)
   - Changed: `pip3 wheel . -v --no-deps`
   - To: `uv pip wheel . -v --no-deps`
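
The setuptools pin can be sanity-checked before a long build by testing a version string against the `>=77.0.3,<81.0.0` range. The sketch below uses GNU `sort -V` for the comparison, which handles plain `X.Y.Z` versions but does not implement full PEP 440 ordering (pre-releases, for instance). `version_ge` and `setuptools_ok` are hypothetical helpers:

```bash
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical. Relies on GNU sort -V,
# which is adequate for plain X.Y.Z versions but is not PEP 440 aware.

# version_ge A B: succeeds when A >= B
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# setuptools_ok V: succeeds when 77.0.3 <= V < 81.0.0
setuptools_ok() {
  version_ge "$1" "77.0.3" && ! version_ge "$1" "81.0.0"
}

for v in 82.0.1 80.9.0; do
  if setuptools_ok "$v"; then
    echo "setuptools $v: within pin"
  else
    echo "setuptools $v: outside pin"
  fi
done
```

Version 82.0.1, the one that broke build #1, falls outside the range, while any 77.0.3 to 80.x release passes.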

---

*Last updated: 2026-04-02 by Clawmine*
@@ -60,8 +60,8 @@ FROM base AS build-base
 RUN mkdir /wheels
 
 # Install build deps that aren't in project requirements files
-# Make sure to upgrade setuptools to avoid triton build bug
-RUN uv pip install -U build cmake ninja pybind11 setuptools wheel
+# Pin setuptools to <81 for LMCache compatibility (needs >=77.0.3,<81.0.0)
+RUN uv pip install -U build cmake ninja pybind11 "setuptools>=77.0.3,<81.0.0" wheel
 
 FROM build-base AS build-triton
 ARG TRITON_REF=release/3.5.x
@@ -122,7 +122,7 @@ RUN apt-get update && apt-get install -y build-essential cmake gcc && \
 FLASH_ATTENTION_FORCE_BUILD="TRUE" \
 FLASH_ATTENTION_FORCE_CXX11_ABI="FALSE" \
 FLASH_ATTENTION_SKIP_CUDA_BUILD="FALSE" \
-pip3 wheel . -v --no-deps -w ./wheels/ && \
+uv pip wheel . -v --no-deps -w ./wheels/ && \
 cp wheels/*.whl /wheels/
 
 FROM build-base AS build-vllm