diff --git a/PLAN-triton-kernels.md b/PLAN-triton-kernels.md
new file mode 100644
index 0000000..f30fcb9
--- /dev/null
+++ b/PLAN-triton-kernels.md
@@ -0,0 +1,129 @@
+# Plan: Add Triton Kernels Support for vLLM v0.19.0
+
+**Date:** 2026-04-06
+**Status:** Ready for execution
+**Branch:** `feature/triton-kernels` (to be created)
+
+---
+
+## Problem
+
+vLLM v0.19.0 container builds successfully but fails at runtime with:
+```
+No module named 'triton_kernels.matmul_ogs'
+```
+
+The error occurs because:
+- `triton_kernels` is a library of pre-written Triton kernels (separate from the `triton` compiler)
+- vLLM v0.19.0 requires it for MoE (Mixture of Experts) operations
+- Our Dockerfile builds vLLM via `pip build --wheel`, which skips the cmake step that normally fetches `triton_kernels`
+
+---
+
+## Root Cause Analysis
+
+### What vLLM's cmake build does (that we skip):
+1. Fetches `triton_kernels` from the Triton repo (tag v3.6.0) → copies to `vllm/third_party/triton_kernels/`
+2. Builds `flashmla`
+3. Builds `vllm_flash_attn` (we already handle this separately)
+
+### What's in `triton_kernels`:
+Located at `python/triton_kernels/triton_kernels/` in the Triton repo:
+- `matmul_ogs.py` — MoE kernels (the missing module)
+- `topk.py` — Top-k routing
+- `swiglu.py` — SwiGLU activation
+- `tensor.py` — Tensor utilities
+- `distributed.py` — Distributed ops
+
+### PyPI confusion:
+- `triton-kernels` on PyPI (v0.1.0) is from "Kernelize AI" — **NOT what vLLM needs**
+- The real `triton_kernels` lives in the Triton repo itself
+
+---
+
+## Solution: Option A — Pip install from Git
+
+Add a build stage that installs `triton_kernels` directly from the Triton repo:
+
+```dockerfile
+FROM build-base AS build-triton-kernels
+# Install triton_kernels from Triton repo (v3.6.0 matches vLLM's cmake default)
+RUN pip install git+https://github.com/triton-lang/triton.git@v3.6.0#subdirectory=python/triton_kernels
+# Copy only the installed triton_kernels package (not all of site-packages) to /wheels for the final stage
+RUN mkdir -p /wheels && \
+    pip show triton_kernels | grep Location | cut -d' ' -f2 | xargs -I {} cp -r {}/triton_kernels /wheels/
+```
+
+Then in the final `vllm-openai` stage:
+```dockerfile
+COPY --from=build-triton-kernels /wheels/triton_kernels /usr/local/lib/python3.12/dist-packages/triton_kernels
+```
+
+### Why this works:
+- vLLM's `import_utils.py` checks `site-packages` for `triton_kernels` first
+- Installing it there means vLLM will find it
+- Minimal change; doesn't affect existing working components
+
+---
+
+## Execution Steps
+
+### 1. Create new branch
+```bash
+cd /home/openclaw/dev/grace-gpu-containers
+git checkout -b feature/triton-kernels
+```
+
+### 2. Modify Dockerfile
+Edit `vllm/Dockerfile`:
+- Add `build-triton-kernels` stage after `build-triton`
+- Copy `triton_kernels` to the final stage
+- Update the header comment with new version info
+
+### 3. Update CLAWMINE.md
+Document the new build configuration.
+
+### 4. Commit and push
+```bash
+git add -A
+git commit -m "Add triton_kernels for MoE support (vLLM v0.19.0)"
+git push origin feature/triton-kernels
+```
+
+### 5. Create new Jenkins pipeline
+Create `gh200-vllm-tfa-build`:
+- Same as `gh200-vllm-build` but:
+  - Pulls from the `feature/triton-kernels` branch
+  - Default `IMAGE_TAG=gh200-vllm-tfa`
+  - Default `VLLM_VERSION=v0.19.0`
+
+### 6. Trigger build
+Wait for Mike's OK before triggering.
+
+---
+
+## Tag Strategy
+
+| Image | Tag | Purpose |
+|-------|-----|---------|
+| `gh200-vllm` | `v0.19.0` | Working fallback (no triton_kernels) |
+| `gh200-vllm-tfa` | `v0.19.0-tfa` | New build with triton_kernels |
+
+If successful, `gh200-vllm-tfa:v0.19.0-tfa` becomes the production image.
+
+---
+
+## Rollback Plan
+
+If the build fails or runtime issues occur:
+1. The existing `gh200-vllm:v0.19.0` image is untouched
+2. Just revert to using that tag
+3. No changes land on the main branch until verified working
+
+---
+
+## References
+
+- vLLM cmake config: `cmake/external_projects/triton_kernels.cmake`
+- vLLM import logic: `vllm/utils/import_utils.py`
+- Triton repo: https://github.com/triton-lang/triton
+- `matmul_ogs` location at the Triton v3.6.0 tag: `python/triton_kernels/triton_kernels/matmul_ogs.py`
diff --git a/vllm/Dockerfile b/vllm/Dockerfile
index 3c34d8b..398bdfe 100644
--- a/vllm/Dockerfile
+++ b/vllm/Dockerfile
@@ -1,13 +1,16 @@
 # ==============================================================================
-# ⚠️⚠️⚠️ WORKING BUILD - DO NOT TOUCH ⚠️⚠️⚠️
+# Triton Kernels Build (TFA) - vLLM v0.19.0 + triton_kernels
 # ==============================================================================
-# Build #43 succeeded on 2026-04-03 with these exact versions:
-# - vLLM: v0.18.2rc0
-# - flashinfer: v0.6.7
+# This branch adds triton_kernels from Triton v3.6.0 for MoE support.
+#
+# Based on working Build #43 (v0.18.2rc0) with vLLM upgraded to v0.19.0:
+# - vLLM: v0.19.0
+# - flashinfer: v0.6.6
 # - flash-attention: hopper branch
 # - lmcache: dev branch
 # - infinistore: main
 # - triton: 3.6.0 (PyPI wheel)
+# - triton_kernels: v3.6.0 (from Triton repo)
 # - Base: nvcr.io/nvidia/pytorch:26.03-py3 (PyTorch 2.11.0a0, CUDA 13.2.0)
 #
 # HARD RULES:
@@ -16,7 +19,7 @@
 # 3. CLEAR ALL CHANGES WITH MIKE BEFORE MAKING THEM
 # 4. ONE BUILD AT A TIME - Mike reports failure → I assess → I report
 #
-# If you need to modify this file, ask Mike first.
+# Image tag: gh200-vllm-tfa:v0.19.0-tfa
 # ==============================================================================
 
 # ---------- Builder Base ----------
@@ -79,6 +82,11 @@ FROM build-base AS build-triton
 RUN mkdir -p /wheels && \
     pip download triton==3.6.0 --platform manylinux_2_27_aarch64 --only-binary=:all: --no-deps -d /wheels
 
+# Install triton_kernels from Triton repo (v3.6.0) for MoE support
+# vLLM v0.19.0 requires this for the triton_kernels.matmul_ogs module
+FROM build-base AS build-triton-kernels
+RUN pip install --target=/wheels git+https://github.com/triton-lang/triton.git@v3.6.0#subdirectory=python/triton_kernels
+
 # Skip xformers - vLLM has built-in FlashAttention kernels
 # xformers requires TORCH_STABLE_ONLY which needs PyTorch headers not in 2.9.0
 # FROM build-base AS build-xformers
@@ -191,6 +199,7 @@ FROM base AS vllm-openai
 COPY --from=build-flash-attention /wheels/* wheels/
 COPY --from=build-flashinfer /wheels/* wheels/
 COPY --from=build-triton /wheels/* wheels/
+COPY --from=build-triton-kernels /wheels/triton_kernels /usr/local/lib/python3.12/dist-packages/triton_kernels
 COPY --from=build-vllm /wheels/* wheels/
 COPY --from=build-lmcache /wheels/* wheels/
 COPY --from=build-infinistore /wheels/* wheels/
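Once the image builds, the original runtime failure can be probed with a small import check before wiring the tag into production. This is a sketch, not vLLM's actual check: it only tests that `triton_kernels.matmul_ogs` resolves from `sys.path`, which is the condition the `COPY` into `dist-packages` is meant to satisfy. It would be run inside the built container (e.g. via `docker run --rm gh200-vllm-tfa:v0.19.0-tfa python3 check_triton_kernels.py`).

```python
import importlib.util


def moe_kernels_available() -> bool:
    """Return True if the MoE module vLLM v0.19.0 imports is locatable.

    Sketch of a post-build smoke test; probes the module spec without
    importing it (so no GPU or Triton runtime is needed).
    """
    try:
        # find_spec returns None when the submodule cannot be located.
        return importlib.util.find_spec("triton_kernels.matmul_ogs") is not None
    except ModuleNotFoundError:
        # Raised when the parent package 'triton_kernels' itself is absent.
        return False


if __name__ == "__main__":
    print("triton_kernels OK" if moe_kernels_available() else "triton_kernels MISSING")
```

Exiting non-zero on failure (instead of printing) would let a Jenkins stage gate on the result.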