Fix setuptools pin and flash-attention build for GH200

- Pin setuptools>=77.0.3,<81.0.0 for LMCache compatibility
- Use 'uv pip wheel' instead of 'pip3 wheel' for flash-attention (torch is in venv)
- Add CLAWMINE.md with build pipeline documentation
2026-04-02 20:19:39 +00:00
parent 5fa395825a
commit 9da93ec625
2 changed files with 187 additions and 3 deletions
--- a/CLAWMINE.md
+++ b/CLAWMINE.md
@@ -0,0 +1,184 @@
+# GH200 vLLM Container Build Pipeline
+
+> Managed by Clawmine — `/home/openclaw/dev/grace-gpu-containers`
+
+## Overview
+
+This pipeline builds vLLM containers for NVIDIA GH200 (Grace Hopper: ARM64 CPU + H100 GPU). The challenge: prebuilt `aarch64` wheels are limited, and NVIDIA doesn't publish the Dockerfiles for its NGC images.
+
+## Jenkins Pipeline
+
+**Server:** https://jenkins.sweetapi.com/
+**Job:** `gh200-vllm-build`
+**Status:** Build #2 running; build #1 failed (see Build History)
+
+### Jenkins Server (Build Machine)
+
+```
+Host: 66.135.24.21
+User: root
+Pass: <redacted>
+```
+
+**Setup:**
+- Docker buildx `multiarch` builder configured
+- QEMU user-static installed for ARM64 emulation
+- 780GB free disk
+
+### Build Parameters
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `VLLM_VERSION` | `v0.18.1` | vLLM git tag |
+| `CUDA_VERSION` | `13.0.1` | CUDA version |
+| `IMAGE_TAG` | `gh200-vllm` | Docker image name |
+| `PUSH_TO_REGISTRY` | `true` | Push to Vultr CR after build |
+
+### Container Registry
+
+**URL:** `sjc.vultrcr.com/charizard`
+**User:** `891294d0-df76-4c37-b41d-2b77f95a54c1`
+**Pass:** `<redacted>`
+
+**Images:**
+- `sjc.vultrcr.com/charizard/gh200-vllm:v0.18.1`
+- `sjc.vultrcr.com/charizard/gh200-vllm:latest`
+
+### Build History
+
+| Build # | Version | Status | Duration | Notes |
+|---------|---------|--------|----------|-------|
+| 2 | v0.18.1 | RUNNING | - | Started 2026-04-02 ~20:03 UTC — setuptools pinned |
+| 1 | v0.18.1 | FAILED | ~15 min | setuptools 82.0.1 incompatible with LMCache |
+
+### Monitoring
+
+**Cron Job:** Every 30 minutes, checks build status and notifies on completion.
+- Job ID: `e06d540f-899e-4358-92df-85fe036b05e2`
+- Script: `~/.openclaw/scripts/check-jenkins-vllm-build.sh`
+
+**Manual check:**
+```bash
+curl -s -u "admin:<JENKINS_API_TOKEN>" \
+  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '{result, building, duration}'
+```
+
+### Triggering a Build
+
+```bash
+# Via API
+curl -X POST "https://jenkins.sweetapi.com/job/gh200-vllm-build/buildWithParameters" \
+  -u "admin:<JENKINS_API_TOKEN>" \
+  -d "VLLM_VERSION=v0.18.1"
+
+# Check status
+curl -s -u "admin:<JENKINS_API_TOKEN>" \
+  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '.result'
+```
+
+## Current State
+
+### Dockerfile (`vllm/Dockerfile`)
+
+Builds everything from source:
+- **triton** (release/3.5.x)
+- **xformers** (johnnynunez fork)
+- **flashinfer** (v0.4.1)
+- **flash-attention** (hopper branch)
+- **lmcache** (v0.3.7)
+- **infinistore** (main)
+- **vLLM** (configurable via `VLLM_REF`)
+
+**Target Architecture:** `9.0a` (NVIDIA Hopper)
+
+### Latest Versions (as of 2026-04-02)
+
+| Package | Stable | Latest |
+|---------|--------|--------|
+| vLLM | v0.18.1 | v0.19.0rc1 |
+| Triton | 3.6.0 | 3.6.0 |
+| CUDA | 13.0.1 | 13.0.1 |
+
+## PyPI Wheel Status for aarch64
+
+| Package | aarch64 Wheel | Notes |
+|---------|---------------|-------|
+| vLLM 0.18.1 | ✅ Yes | Includes FA2, FA3, MoE kernels |
+| Triton 3.6.0 | ✅ Yes | Official wheel |
+| flashinfer | ❌ No | Must build from source |
+| xformers | ❌ No | Must build from source |
+
+**Key Finding:** The official vLLM aarch64 wheel includes pre-compiled CUDA kernels:
+- `vllm/_C.abi3.so` (381MB)
+- `vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so` (263MB)
+- `vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so` (142MB)
+- `vllm/_moe_C.abi3.so` (202MB)
+
+This means **basic vLLM on GH200 can use PyPI wheels directly** without compilation.
+
+## Build Options
+
+### Option 1: QEMU-Emulated ARM64 Build (Current Setup)
+
+**Pros:**
+- Works on existing x86 Jenkins server
+- No GH200 required
+
+**Cons:**
+- Very slow (hours)
+- CUDA kernels may not be fully optimized
+- flashinfer/xformers need source builds
+
+**When to use:** When no GH200 is available
+
+### Option 2: Native GH200 Build
+
+**Pros:**
+- Much faster compilation
+- Properly optimized ARM64 + Hopper kernels
+- All dependencies build correctly
+
+**Cons:**
+- Need GH200 access
+
+**When to use:** For production-quality builds
+
+### Option 3: Hybrid (PyPI wheels + source builds)
+
+**Approach:**
+1. Use vLLM + Triton from PyPI (pre-compiled aarch64)
+2. Build only flashinfer/xformers from source (if needed)
+
+**Pros:**
+- Fastest option
+- Official wheels for core components
+
+**Cons:**
+- May miss some optimizations
+- flashinfer/xformers still need source builds
+
+## Recommended Path
+
+1. **Immediate:** Try PyPI wheels on a GH200 — if they work, no build needed
+2. **If PyPI insufficient:** Run Jenkins build with `VLLM_VERSION=v0.18.1`
+3. **Production:** Get a GH200 for native builds
+
+## Source Repo
+
+- Original: https://github.com/rajesh-s/grace-gpu-containers
+- Jenkins clones from: https://github.com/rajesh-s/grace-gpu-containers.git
+- **Note:** We apply patches locally because we don't have push access to the original repo
+
+### Applied Patches
+
+1. **setuptools pin**: Pin setuptools to `>=77.0.3,<81.0.0` for LMCache compatibility
+   - Changed: `RUN uv pip install -U build cmake ninja pybind11 setuptools wheel`
+   - To: `RUN uv pip install -U build cmake ninja pybind11 "setuptools>=77.0.3,<81.0.0" wheel`
+
+2. **uv pip for flash-attention**: Use `uv pip wheel` instead of `pip3 wheel`, because torch is installed in the venv
+   - Changed: `pip3 wheel . -v --no-deps`
+   - To: `uv pip wheel . -v --no-deps`
+
+---
+
+*Last updated: 2026-04-02 by Clawmine*
--- a/vllm/Dockerfile
+++ b/vllm/Dockerfile
@@ -60,8 +60,8 @@ FROM base AS build-base
 RUN mkdir /wheels

 # Install build deps that aren't in project requirements files
-# Make sure to upgrade setuptools to avoid triton build bug
-RUN uv pip install -U build cmake ninja pybind11 setuptools wheel
+# Pin setuptools to <81 for LMCache compatibility (needs >=77.0.3,<81.0.0)
+RUN uv pip install -U build cmake ninja pybind11 "setuptools>=77.0.3,<81.0.0" wheel

 FROM build-base AS build-triton
 ARG TRITON_REF=release/3.5.x
@@ -122,7 +122,7 @@ RUN apt-get update && apt-get install -y build-essential cmake gcc && \
    FLASH_ATTENTION_FORCE_BUILD="TRUE" \
    FLASH_ATTENTION_FORCE_CXX11_ABI="FALSE" \
    FLASH_ATTENTION_SKIP_CUDA_BUILD="FALSE" \
-    pip3 wheel . -v --no-deps -w ./wheels/ && \
+    uv pip wheel . -v --no-deps -w ./wheels/ && \
    cp wheels/*.whl /wheels/

 FROM build-base AS build-vllm