diff --git a/CLAWMINE.md b/CLAWMINE.md
new file mode 100644
index 0000000..61a5312
--- /dev/null
+++ b/CLAWMINE.md
@@ -0,0 +1,184 @@
+# GH200 vLLM Container Build Pipeline
+
+> Managed by Clawmine: `/home/openclaw/dev/grace-gpu-containers`
+
+## Overview
+
+Building vLLM containers for NVIDIA GH200 (Grace Hopper, ARM64 + H100 GPU). The challenge: prebuilt wheels for `aarch64` are limited, and NVIDIA doesn't publish NGC Dockerfiles.
+
+## Jenkins Pipeline
+
+**Server:** https://jenkins.sweetapi.com/
+**Job:** `gh200-vllm-build`
+**Status:** Configured, never run (ready for first build)
+
+### Jenkins Server (Build Machine)
+
+```
+Host: 66.135.24.21
+User: root
+Pass: <redacted>
+```
+
+**Setup:**
+- Docker buildx `multiarch` builder configured
+- QEMU user-static installed for ARM64 emulation
+- 780GB free disk
+
+### Build Parameters
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `VLLM_VERSION` | `v0.18.1` | vLLM git tag |
+| `CUDA_VERSION` | `13.0.1` | CUDA version |
+| `IMAGE_TAG` | `gh200-vllm` | Docker image name |
+| `PUSH_TO_REGISTRY` | `true` | Push to Vultr CR after build |
+
+### Container Registry
+
+**URL:** `sjc.vultrcr.com/charizard`
+**User:** `891294d0-df76-4c37-b41d-2b77f95a54c1`
+**Pass:** `<redacted>`
+
+**Images:**
+- `sjc.vultrcr.com/charizard/gh200-vllm:v0.18.1`
+- `sjc.vultrcr.com/charizard/gh200-vllm:latest`
+
+### Build History
+
+| Build # | Version | Status | Duration | Notes |
+|---------|---------|--------|----------|-------|
+| 2 | v0.18.1 | RUNNING | - | Started 2026-04-02 ~20:03 UTC, setuptools pinned |
+| 1 | v0.18.1 | FAILED | ~15 min | setuptools 82.0.1 incompatible with LMCache |
+
+### Monitoring
+
+**Cron Job:** Every 30 minutes, checks build status and notifies on completion.
+- Job ID: `e06d540f-899e-4358-92df-85fe036b05e2`
+- Script: `~/.openclaw/scripts/check-jenkins-vllm-build.sh`
+
+**Manual check:**
+```bash
+curl -s -u "admin:<redacted>" \
+  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '{result, building, duration}'
+```
+
+### Triggering a Build
+
+```bash
+# Via API
+curl -X POST "https://jenkins.sweetapi.com/job/gh200-vllm-build/buildWithParameters" \
+  -u "admin:<redacted>" \
+  -d "VLLM_VERSION=v0.18.1"
+
+# Check status
+curl -s -u "admin:<redacted>" \
+  "https://jenkins.sweetapi.com/job/gh200-vllm-build/lastBuild/api/json" | jq '.result'
+```
+
+## Current State
+
+### Dockerfile (`vllm/Dockerfile`)
+
+Builds everything from source:
+- **triton** (release/3.5.x)
+- **xformers** (johnnynunez fork)
+- **flashinfer** (v0.4.1)
+- **flash-attention** (hopper branch)
+- **lmcache** (v0.3.7)
+- **infinistore** (main)
+- **vLLM** (configurable via `VLLM_REF`)
+
+**Target Architecture:** `9.0a` (NVIDIA Hopper)
+
+### Latest Versions (as of 2026-04-02)
+
+| Package | Stable | Latest |
+|---------|--------|--------|
+| vLLM | v0.18.1 | v0.19.0rc1 |
+| Triton | 3.6.0 | 3.6.0 |
+| CUDA | 13.0.1 | 13.0.1 |
+
+## PyPI Wheel Status for aarch64
+
+| Package | aarch64 Wheel | Notes |
+|---------|---------------|-------|
+| vLLM 0.18.1 | ✅ Yes | Includes FA2, FA3, MoE kernels |
+| Triton 3.6.0 | ✅ Yes | Official wheel |
+| flashinfer | ❌ No | Must build from source |
+| xformers | ❌ No | Must build from source |
+
+**Key Finding:** The official vLLM aarch64 wheel includes pre-compiled CUDA kernels:
+- `vllm/_C.abi3.so` (381MB)
+- `vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so` (263MB)
+- `vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so` (142MB)
+- `vllm/_moe_C.abi3.so` (202MB)
+
+This means **basic vLLM on GH200 can use PyPI wheels directly** without compilation.
+
+## Build Options
+
+### Option 1: QEMU Cross-Compilation (Current Setup)
+
+**Pros:**
+- Works on existing x86 Jenkins server
+- No GH200 required
+
+**Cons:**
+- Very slow (hours)
+- CUDA kernels may not be fully optimized
+- flashinfer/xformers need source builds
+
+**When to use:** When no GH200 available
+
+### Option 2: Native GH200 Build
+
+**Pros:**
+- Much faster compilation
+- Properly optimized ARM64 + Hopper kernels
+- All dependencies build correctly
+
+**Cons:**
+- Need GH200 access
+
+**When to use:** For production-quality builds
+
+### Option 3: Hybrid (PyPI wheels + source builds)
+
+**Approach:**
+1. Use vLLM + Triton from PyPI (pre-compiled aarch64)
+2. Build only flashinfer/xformers from source (if needed)
+
+**Pros:**
+- Fastest option
+- Official wheels for core components
+
+**Cons:**
+- May miss some optimizations
+- flashinfer/xformers still need source builds
+
+## Recommended Path
+
+1. **Immediate:** Try PyPI wheels on a GH200; if they work, no build needed
+2. **If PyPI insufficient:** Run Jenkins build with `VLLM_VERSION=v0.18.1`
+3. **Production:** Get a GH200 for native builds
+
+## Source Repo
+
+- Original: https://github.com/rajesh-s/grace-gpu-containers
+- Jenkins clones from: https://github.com/rajesh-s/grace-gpu-containers.git
+- **Note:** We apply patches locally since we don't have push access to original repo
+
+### Applied Patches
+
+1. **setuptools pin**: Pin setuptools to `<81.0.0` for LMCache compatibility
+   - Changed: `RUN uv pip install -U build cmake ninja pybind11 setuptools wheel`
+   - To: `RUN uv pip install -U build cmake ninja pybind11 "setuptools>=77.0.3,<81.0.0" wheel`
+
+2. **uv pip for flash-attention**: Use uv pip instead of pip3 (torch is in venv)
+   - Changed: `pip3 wheel . -v --no-deps`
+   - To: `uv pip wheel . -v --no-deps`
+
+---
+
+*Last updated: 2026-04-02 by Clawmine*
diff --git a/vllm/Dockerfile b/vllm/Dockerfile
index 732b4bc..eeb79e9 100644
--- a/vllm/Dockerfile
+++ b/vllm/Dockerfile
@@ -60,8 +60,8 @@ FROM base AS build-base
 RUN mkdir /wheels
 
 # Install build deps that aren't in project requirements files
-# Make sure to upgrade setuptools to avoid triton build bug
-RUN uv pip install -U build cmake ninja pybind11 setuptools wheel
+# Pin setuptools to <81 for LMCache compatibility (needs >=77.0.3,<81.0.0)
+RUN uv pip install -U build cmake ninja pybind11 "setuptools>=77.0.3,<81.0.0" wheel
 
 FROM build-base AS build-triton
 ARG TRITON_REF=release/3.5.x
@@ -122,7 +122,7 @@ RUN apt-get update && apt-get install -y build-essential cmake gcc && \
     FLASH_ATTENTION_FORCE_BUILD="TRUE" \
     FLASH_ATTENTION_FORCE_CXX11_ABI="FALSE" \
     FLASH_ATTENTION_SKIP_CUDA_BUILD="FALSE" \
-    pip3 wheel . -v --no-deps -w ./wheels/ && \
+    uv pip wheel . -v --no-deps -w ./wheels/ && \
     cp wheels/*.whl /wheels/
 
 FROM build-base AS build-vllm