Add triton_kernels for MoE support (vLLM v0.19.0)

- Add build-triton-kernels stage to fetch triton_kernels from Triton v3.6.0
- Install to site-packages for vLLM to find at runtime
- Resolves: No module named 'triton_kernels.matmul_ogs'
- Image tag: gh200-vllm-tfa:v0.19.0-tfa

PLAN-triton-kernels.md (new file, 129 lines)
# Plan: Add Triton Kernels Support for vLLM v0.19.0

**Date:** 2026-04-06
**Status:** Ready for execution
**Branch:** `feature/triton-kernels` (to be created)

---
## Problem

vLLM v0.19.0 container builds successfully but fails at runtime with:

```
No module named 'triton_kernels.matmul_ogs'
```

The error occurs because:

- `triton_kernels` is a library of pre-written Triton kernels (separate from the `triton` compiler)
- vLLM v0.19.0 requires it for MoE (Mixture of Experts) operations
- Our Dockerfile builds vLLM via `pip build --wheel`, which skips the cmake step that normally fetches `triton_kernels`
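A quick way to confirm the diagnosis inside a container is a sketch like the following; `importlib.util.find_spec` reports whether Python can locate a package without importing it:

```python
import importlib.util

def has_module(name: str) -> bool:
    """Report whether Python can locate a module without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # find_spec raises this when a parent package (e.g. triton_kernels) is absent
        return False

# In the failing container, both of these print False
print("triton_kernels:", has_module("triton_kernels"))
print("triton_kernels.matmul_ogs:", has_module("triton_kernels.matmul_ogs"))
```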

---

## Root Cause Analysis

### What vLLM's cmake build does (that we skip):

1. Fetches `triton_kernels` from the Triton repo (tag v3.6.0) → copies to `vllm/third_party/triton_kernels/`
2. Builds `flashmla`
3. Builds `vllm_flash_attn` (we already handle this separately)

### What's in `triton_kernels`:

Located at `python/triton_kernels/triton_kernels/` in the Triton repo:

- `matmul_ogs.py` — MoE kernels (the missing module)
- `topk.py` — Top-k routing
- `swiglu.py` — SwiGLU activation
- `tensor.py` — Tensor utilities
- `distributed.py` — Distributed ops

### PyPI confusion:

- `triton-kernels` on PyPI (v0.1.0) is from "Kernelize AI" — **NOT what vLLM needs**
- The real `triton_kernels` is in the Triton repo itself

---

## Solution: Option A — Pip install from Git

Add a build stage that installs `triton_kernels` directly from the Triton repo:

```dockerfile
FROM build-base AS build-triton-kernels
# Install triton_kernels from Triton repo (v3.6.0 matches vLLM's cmake default)
RUN pip install git+https://github.com/triton-lang/triton.git@v3.6.0#subdirectory=python/triton_kernels
# Copy only the installed package directory into /wheels for the final stage
# (copying all of site-packages would leave nothing at /wheels/triton_kernels)
RUN SITE=$(pip show triton_kernels | grep '^Location' | cut -d' ' -f2) && \
    mkdir -p /wheels && cp -r "$SITE/triton_kernels" /wheels/
```
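The `grep | cut` pipeline in the stage above only extracts the `Location:` field from `pip show` output; demonstrated here on canned metadata (the version and path are illustrative samples, not live output):

```shell
# Extract the install directory from pip show-style metadata (canned sample)
pip_show_output='Name: triton_kernels
Version: 3.6.0
Location: /usr/local/lib/python3.12/dist-packages'

printf '%s\n' "$pip_show_output" | grep '^Location' | cut -d' ' -f2
# Prints: /usr/local/lib/python3.12/dist-packages
```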

Then in the final `vllm-openai` stage:

```dockerfile
COPY --from=build-triton-kernels /wheels/triton_kernels /usr/local/lib/python3.12/dist-packages/triton_kernels
```

### Why this works:

- vLLM's `import_utils.py` checks `site-packages` for `triton_kernels` first
- Installing it there means vLLM will find it
- Minimal change, doesn't affect existing working components
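The lookup order described above amounts to a try-in-order import. A minimal sketch of the pattern (the helper name and the vendored fallback path are assumptions for illustration, not vLLM's actual code):

```python
import importlib

def import_first(*names):
    """Return the first importable module from names, or None if none resolve."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None

# Site-packages copy is tried before the (assumed) vendored fallback
tk = import_first("triton_kernels", "vllm.third_party.triton_kernels")
```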

---

## Execution Steps

### 1. Create new branch

```bash
cd /home/openclaw/dev/grace-gpu-containers
git checkout -b feature/triton-kernels
```

### 2. Modify Dockerfile

Edit `vllm/Dockerfile`:

- Add `build-triton-kernels` stage after `build-triton`
- Copy `triton_kernels` to final stage
- Update header comment with new version info

### 3. Update CLAWMINE.md

Document the new build configuration.

### 4. Commit and push

```bash
git add -A
git commit -m "Add triton_kernels for MoE support (vLLM v0.19.0)"
git push origin feature/triton-kernels
```

### 5. Create new Jenkins pipeline

Create `gh200-vllm-tfa-build`:

- Same as `gh200-vllm-build` but:
  - Pulls from `feature/triton-kernels` branch
  - Default `IMAGE_TAG=gh200-vllm-tfa`
  - Default `VLLM_VERSION=v0.19.0`

### 6. Trigger build

Wait for Mike's OK before triggering.

---

## Tag Strategy

| Image | Tag | Purpose |
|-------|-----|---------|
| `gh200-vllm` | `v0.19.0` | Working fallback (no triton_kernels) |
| `gh200-vllm-tfa` | `v0.19.0-tfa` | New build with triton_kernels |

If successful, `gh200-vllm-tfa:v0.19.0-tfa` becomes the production image.

---

## Rollback Plan

If the build fails or runtime issues occur:

1. The existing `gh200-vllm:v0.19.0` image is untouched
2. Just revert to using that tag
3. No changes to main branch until verified working

---

## References

- vLLM cmake config: `cmake/external_projects/triton_kernels.cmake`
- vLLM import logic: `vllm/utils/import_utils.py`
- Triton repo: https://github.com/triton-lang/triton
- Missing module in the Triton repo (tag v3.6.0): `python/triton_kernels/triton_kernels/matmul_ogs.py`
vllm/Dockerfile

@@ -1,13 +1,16 @@
 # ==============================================================================
-# ⚠️⚠️⚠️ WORKING BUILD - DO NOT TOUCH ⚠️⚠️⚠️
+# Triton Kernels Build (TFA) - vLLM v0.19.0 + triton_kernels
 # ==============================================================================
-# Build #43 succeeded on 2026-04-03 with these exact versions:
-# - vLLM: v0.18.2rc0
-# - flashinfer: v0.6.7
+# This branch adds triton_kernels from Triton v3.6.0 for MoE support.
+#
+# Based on working Build #43 (v0.18.2rc0) with vLLM upgraded to v0.19.0:
+# - vLLM: v0.19.0
+# - flashinfer: v0.6.6
 # - flash-attention: hopper branch
 # - lmcache: dev branch
 # - infinistore: main
 # - triton: 3.6.0 (PyPI wheel)
+# - triton_kernels: v3.6.0 (from Triton repo)
 # - Base: nvcr.io/nvidia/pytorch:26.03-py3 (PyTorch 2.11.0a0, CUDA 13.2.0)
 #
 # HARD RULES:
@@ -16,7 +19,7 @@
 # 3. CLEAR ALL CHANGES WITH MIKE BEFORE MAKING THEM
 # 4. ONE BUILD AT A TIME - Mike reports failure → I assess → I report
 #
-# If you need to modify this file, ask Mike first.
+# Image tag: gh200-vllm-tfa:v0.19.0-tfa
 # ==============================================================================

 # ---------- Builder Base ----------
@@ -79,6 +82,11 @@ FROM build-base AS build-triton
 RUN mkdir -p /wheels && \
     pip download triton==3.6.0 --platform manylinux_2_27_aarch64 --only-binary=:all: --no-deps -d /wheels
+
+# Install triton_kernels from Triton repo (v3.6.0) for MoE support
+# vLLM v0.19.0 requires this for the triton_kernels.matmul_ogs module
+FROM build-base AS build-triton-kernels
+RUN pip install --target=/wheels git+https://github.com/triton-lang/triton.git@v3.6.0#subdirectory=python/triton_kernels

 # Skip xformers - vLLM has built-in FlashAttention kernels
 # xformers requires TORCH_STABLE_ONLY which needs PyTorch headers not in 2.9.0
 # FROM build-base AS build-xformers
@@ -191,6 +199,7 @@ FROM base AS vllm-openai
 COPY --from=build-flash-attention /wheels/* wheels/
 COPY --from=build-flashinfer /wheels/* wheels/
 COPY --from=build-triton /wheels/* wheels/
+COPY --from=build-triton-kernels /wheels/triton_kernels /usr/local/lib/python3.12/dist-packages/triton_kernels
 COPY --from=build-vllm /wheels/* wheels/
 COPY --from=build-lmcache /wheels/* wheels/
 COPY --from=build-infinistore /wheels/* wheels/