Compare commits
4 Commits: v0.16.0rc2 ... v0.11.2

| Author | SHA1 | Date |
|---|---|---|
| | 275de34170 | |
| | fa3ffb4365 | |
| | 6d5974369c | |
| | 0ce9990d2c | |
@@ -1,25 +0,0 @@
name: vllm_ci
job_dirs:
  - ".buildkite/image_build"
  - ".buildkite/test_areas"
  - ".buildkite/hardware_tests"
run_all_patterns:
  - "docker/Dockerfile"
  - "CMakeLists.txt"
  - "requirements/common.txt"
  - "requirements/cuda.txt"
  - "requirements/build.txt"
  - "requirements/test.txt"
  - "setup.py"
  - "csrc/"
  - "cmake/"
run_all_exclude_patterns:
  - "docker/Dockerfile."
  - "csrc/cpu/"
  - "csrc/rocm/"
  - "cmake/hipify.py"
  - "cmake/cpu_extension.cmake"
registries: public.ecr.aws/q9t5s3a7
repositories:
  main: "vllm-ci-postmerge-repo"
  premerge: "vllm-ci-test-repo"
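The matching semantics of `run_all_patterns` / `run_all_exclude_patterns` are not spelled out in this diff; below is a minimal sketch of one plausible interpretation (substring match against changed paths, with excludes taking precedence). The helper name and the matching rule are assumptions, not the pipeline's actual code:

```python
# Hypothetical sketch: decide whether a changed file should trigger a full CI run.
# Assumes substring matching and exclude-over-include precedence.
def triggers_run_all(changed_file: str, patterns: list[str], excludes: list[str]) -> bool:
    if any(ex in changed_file for ex in excludes):
        return False
    return any(pat in changed_file for pat in patterns)

# "csrc/cpu/..." is excluded even though "csrc/" is a run-all pattern.
print(triggers_run_all("csrc/attention.cu", ["csrc/"], ["csrc/cpu/"]))  # True
print(triggers_run_all("csrc/cpu/shm.cpp", ["csrc/"], ["csrc/cpu/"]))   # False
```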
46 .buildkite/generate_index.py (Normal file)
@@ -0,0 +1,46 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import argparse
import os

template = """<!DOCTYPE html>
<html>
<body>
<h1>Links for vLLM</h1>
<a href="../{x86_wheel_html_escaped}">{x86_wheel}</a><br/>
<a href="../{arm_wheel_html_escaped}">{arm_wheel}</a><br/>
</body>
</html>
"""

parser = argparse.ArgumentParser()
parser.add_argument("--wheel", help="The wheel path.", required=True)
args = parser.parse_args()

filename = os.path.basename(args.wheel)

with open("index.html", "w") as f:
    print(f"Generated index.html for {args.wheel}")
    # sync the abi tag with .buildkite/scripts/upload-wheels.sh
    if "x86_64" in filename:
        x86_wheel = filename
        arm_wheel = filename.replace("x86_64", "aarch64").replace(
            "manylinux1", "manylinux2014"
        )
    elif "aarch64" in filename:
        x86_wheel = filename.replace("aarch64", "x86_64").replace(
            "manylinux2014", "manylinux1"
        )
        arm_wheel = filename
    else:
        raise ValueError(f"Unsupported wheel: {filename}")
    # cloudfront requires escaping the '+' character
    f.write(
        template.format(
            x86_wheel=x86_wheel,
            x86_wheel_html_escaped=x86_wheel.replace("+", "%2B"),
            arm_wheel=arm_wheel,
            arm_wheel_html_escaped=arm_wheel.replace("+", "%2B"),
        )
    )
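To illustrate the tag swap above, here is the same replace chain applied to a hypothetical wheel filename (the version string is made up for the example):

```python
# Hypothetical x86 wheel name; mirrors the x86_64 branch of the script above.
x86_wheel = "vllm-0.16.0-cp38-abi3-manylinux1_x86_64.whl"
arm_wheel = x86_wheel.replace("x86_64", "aarch64").replace("manylinux1", "manylinux2014")
print(arm_wheel)  # vllm-0.16.0-cp38-abi3-manylinux2014_aarch64.whl
```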
@@ -1,29 +0,0 @@
group: Hardware
steps:
- label: "AMD: :docker: build image"
  depends_on: []
  device: amd_cpu
  no_plugin: true
  commands:
    - >
      docker build
      --build-arg max_jobs=16
      --build-arg REMOTE_VLLM=1
      --build-arg ARG_PYTORCH_ROCM_ARCH='gfx90a;gfx942'
      --build-arg VLLM_BRANCH=$BUILDKITE_COMMIT
      --tag "rocm/vllm-ci:${BUILDKITE_COMMIT}"
      -f docker/Dockerfile.rocm
      --target test
      --no-cache
      --progress plain .
    - docker push "rocm/vllm-ci:${BUILDKITE_COMMIT}"
  env:
    DOCKER_BUILDKIT: "1"
  retry:
    automatic:
      - exit_status: -1  # Agent was lost
        limit: 1
      - exit_status: -10  # Agent was lost
        limit: 1
      - exit_status: 1  # Machine occasionally fail
        limit: 1
@@ -1,10 +0,0 @@
group: Hardware
depends_on: ~
steps:
- label: "Ascend NPU Test"
  soft_fail: true
  timeout_in_minutes: 20
  no_plugin: true
  device: ascend_npu
  commands:
    - bash .buildkite/scripts/hardware_ci/run-npu-test.sh
@@ -1,100 +0,0 @@
group: CPU
depends_on: []
steps:
- label: CPU-Kernel Tests
  depends_on: []
  soft_fail: true
  device: intel_cpu
  no_plugin: true
  source_file_dependencies:
    - csrc/cpu/
    - cmake/cpu_extension.cmake
    - CMakeLists.txt
    - vllm/_custom_ops.py
    - tests/kernels/attention/test_cpu_attn.py
    - tests/kernels/moe/test_cpu_fused_moe.py
    - tests/kernels/test_onednn.py
  commands:
    - |
      bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 20m "
      pytest -x -v -s tests/kernels/attention/test_cpu_attn.py
      pytest -x -v -s tests/kernels/moe/test_cpu_fused_moe.py
      pytest -x -v -s tests/kernels/test_onednn.py"

- label: CPU-Language Generation and Pooling Model Tests
  depends_on: []
  soft_fail: true
  device: intel_cpu
  no_plugin: true
  source_file_dependencies:
    - csrc/cpu/
    - vllm/
    - tests/models/language/generation/
    - tests/models/language/pooling/
  commands:
    - |
      bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "
      pytest -x -v -s tests/models/language/generation -m cpu_model
      pytest -x -v -s tests/models/language/pooling -m cpu_model"

- label: CPU-Quantization Model Tests
  depends_on: []
  soft_fail: true
  device: intel_cpu
  no_plugin: true
  source_file_dependencies:
    - csrc/cpu/
    - vllm/model_executor/layers/quantization/cpu_wna16.py
    - vllm/model_executor/layers/quantization/gptq_marlin.py
    - vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_int8.py
    - vllm/model_executor/layers/quantization/kernels/scaled_mm/cpu.py
    - vllm/model_executor/layers/quantization/kernels/mixed_precision/cpu.py
    - tests/quantization/test_compressed_tensors.py
    - tests/quantization/test_cpu_wna16.py
  commands:
    - |
      bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 20m "
      pytest -x -v -s tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs
      pytest -x -v -s tests/quantization/test_cpu_wna16.py"

- label: CPU-Distributed Tests
  depends_on: []
  soft_fail: true
  device: intel_cpu
  no_plugin: true
  source_file_dependencies:
    - csrc/cpu/shm.cpp
    - vllm/v1/worker/cpu_worker.py
    - vllm/v1/worker/gpu_worker.py
    - vllm/v1/worker/cpu_model_runner.py
    - vllm/v1/worker/gpu_model_runner.py
    - vllm/platforms/cpu.py
    - vllm/distributed/parallel_state.py
    - vllm/distributed/device_communicators/cpu_communicator.py
  commands:
    - |
      bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 10m "
      bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh"

- label: CPU-Multi-Modal Model Tests %N
  depends_on: []
  soft_fail: true
  device: intel_cpu
  no_plugin: true
  source_file_dependencies:
    # - vllm/
    - vllm/model_executor/layers/rotary_embedding
    - tests/models/multimodal/generation/
  commands:
    - |
      bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 45m "
      pytest -x -v -s tests/models/multimodal/generation --ignore=tests/models/multimodal/generation/test_pixtral.py -m cpu_model --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --shard-id=$$BUILDKITE_PARALLEL_JOB"
  parallelism: 2

- label: "Arm CPU Test"
  depends_on: []
  soft_fail: true
  device: arm_cpu
  no_plugin: true
  commands:
    - bash .buildkite/scripts/hardware_ci/run-cpu-test-arm.sh
@@ -1,10 +0,0 @@
group: Hardware
steps:
- label: "GH200 Test"
  soft_fail: true
  device: gh200
  no_plugin: true
  optional: true
  commands:
    - nvidia-smi
    - bash .buildkite/scripts/hardware_ci/run-gh200-test.sh
@@ -1,17 +0,0 @@
group: Hardware
depends_on: ~
steps:
- label: "Intel HPU Test"
  soft_fail: true
  device: intel_hpu
  no_plugin: true
  commands:
    - bash .buildkite/scripts/hardware_ci/run-hpu-test.sh

- label: "Intel GPU Test"
  depends_on: []
  soft_fail: true
  device: intel_gpu
  no_plugin: true
  commands:
    - bash .buildkite/scripts/hardware_ci/run-xpu-test.sh
@@ -1,256 +0,0 @@
#!/bin/bash
set -euo pipefail

# replace invalid characters in Docker image tags and truncate to 128 chars
clean_docker_tag() {
    local input="$1"
    echo "$input" | sed 's/[^a-zA-Z0-9._-]/_/g' | cut -c1-128
}

print_usage_and_exit() {
    echo "Usage: $0 <registry> <repo> <commit> <branch> <vllm_use_precompiled> <vllm_merge_base_commit> <cache_from> <cache_to>"
    exit 1
}

print_instance_info() {
    echo ""
    echo "=== Debug: Instance Information ==="
    # Get IMDSv2 token
    if TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 21600" 2>/dev/null); then
        AMI_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
            http://169.254.169.254/latest/meta-data/ami-id 2>/dev/null || echo "unknown")
        INSTANCE_TYPE=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
            http://169.254.169.254/latest/meta-data/instance-type 2>/dev/null || echo "unknown")
        INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
            http://169.254.169.254/latest/meta-data/instance-id 2>/dev/null || echo "unknown")
        AZ=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
            http://169.254.169.254/latest/meta-data/placement/availability-zone 2>/dev/null || echo "unknown")
        echo "AMI ID: ${AMI_ID}"
        echo "Instance Type: ${INSTANCE_TYPE}"
        echo "Instance ID: ${INSTANCE_ID}"
        echo "AZ: ${AZ}"
    else
        echo "Not running on EC2 or IMDS not available"
    fi
    # Check for warm cache AMI (marker file baked into custom AMI)
    if [[ -f /etc/vllm-ami-info ]]; then
        echo "Cache: warm (custom vLLM AMI)"
        cat /etc/vllm-ami-info
    else
        echo "Cache: cold (standard AMI)"
    fi
    echo "==================================="
    echo ""
}

setup_buildx_builder() {
    echo "--- :buildkite: Setting up buildx builder"
    if [[ -S "${BUILDKIT_SOCKET}" ]]; then
        # Custom AMI with standalone buildkitd - use remote driver for warm cache
        echo "✅ Found local buildkitd socket at ${BUILDKIT_SOCKET}"
        echo "Using remote driver to connect to buildkitd (warm cache available)"
        if docker buildx inspect baked-vllm-builder >/dev/null 2>&1; then
            echo "Using existing baked-vllm-builder"
            docker buildx use baked-vllm-builder
        else
            echo "Creating baked-vllm-builder with remote driver"
            docker buildx create \
                --name baked-vllm-builder \
                --driver remote \
                --use \
                "unix://${BUILDKIT_SOCKET}"
        fi
        docker buildx inspect --bootstrap
    elif docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then
        # Existing builder available
        echo "Using existing builder: ${BUILDER_NAME}"
        docker buildx use "${BUILDER_NAME}"
        docker buildx inspect --bootstrap
    else
        # No local buildkitd, no existing builder - create new docker-container builder
        echo "No local buildkitd found, using docker-container driver"
        docker buildx create --name "${BUILDER_NAME}" --driver docker-container --use
        docker buildx inspect --bootstrap
    fi

    # builder info
    echo "Active builder:"
    docker buildx ls | grep -E '^\*|^NAME' || docker buildx ls
}

check_and_skip_if_image_exists() {
    if [[ -n "${IMAGE_TAG:-}" ]]; then
        echo "--- :mag: Checking if image exists"
        if docker manifest inspect "${IMAGE_TAG}" >/dev/null 2>&1; then
            echo "Image already exists: ${IMAGE_TAG}"
            echo "Skipping build"
            exit 0
        fi
        echo "Image not found, proceeding with build"
    fi
}

ecr_login() {
    aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY"
    aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 936637512419.dkr.ecr.us-east-1.amazonaws.com
}

prepare_cache_tags() {
    # resolve and set: CACHE_TO, CACHE_FROM, CACHE_FROM_BASE_BRANCH, CACHE_FROM_MAIN
    TEST_CACHE_ECR="936637512419.dkr.ecr.us-east-1.amazonaws.com/vllm-ci-test-cache"
    MAIN_CACHE_ECR="936637512419.dkr.ecr.us-east-1.amazonaws.com/vllm-ci-postmerge-cache"

    if [[ "$BUILDKITE_PULL_REQUEST" == "false" ]]; then
        if [[ "$BUILDKITE_BRANCH" == "main" ]]; then
            cache="${MAIN_CACHE_ECR}:latest"
        else
            clean_branch=$(clean_docker_tag "$BUILDKITE_BRANCH")
            cache="${TEST_CACHE_ECR}:${clean_branch}"
        fi
        CACHE_TO="$cache"
        CACHE_FROM="$cache"
        CACHE_FROM_BASE_BRANCH="$cache"
    else
        CACHE_TO="${TEST_CACHE_ECR}:pr-${BUILDKITE_PULL_REQUEST}"
        CACHE_FROM="${TEST_CACHE_ECR}:pr-${BUILDKITE_PULL_REQUEST}"
        if [[ "$BUILDKITE_PULL_REQUEST_BASE_BRANCH" == "main" ]]; then
            CACHE_FROM_BASE_BRANCH="${MAIN_CACHE_ECR}:latest"
        else
            clean_base=$(clean_docker_tag "$BUILDKITE_PULL_REQUEST_BASE_BRANCH")
            CACHE_FROM_BASE_BRANCH="${TEST_CACHE_ECR}:${clean_base}"
        fi
    fi

    CACHE_FROM_MAIN="${MAIN_CACHE_ECR}:latest"
    export CACHE_TO CACHE_FROM CACHE_FROM_BASE_BRANCH CACHE_FROM_MAIN
}

resolve_parent_commit() {
    if [[ -z "${PARENT_COMMIT:-}" ]]; then
        PARENT_COMMIT=$(git rev-parse HEAD~1 2>/dev/null || echo "")
        if [[ -n "${PARENT_COMMIT}" ]]; then
            echo "Computed parent commit for cache fallback: ${PARENT_COMMIT}"
            export PARENT_COMMIT
        else
            echo "Could not determine parent commit (may be first commit in repo)"
        fi
    else
        echo "Using provided PARENT_COMMIT: ${PARENT_COMMIT}"
    fi
}

print_bake_config() {
    echo "--- :page_facing_up: Resolved bake configuration"
    BAKE_CONFIG_FILE="bake-config-build-${BUILDKITE_BUILD_NUMBER:-local}.json"
    docker buildx bake -f "${VLLM_BAKE_FILE_PATH}" -f "${CI_HCL_PATH}" --print "${TARGET}" | tee "${BAKE_CONFIG_FILE}" || true
    echo "Saved bake config to ${BAKE_CONFIG_FILE}"
    echo "--- :arrow_down: Uploading bake config to Buildkite"
    buildkite-agent artifact upload "${BAKE_CONFIG_FILE}"
}

#################################
#          Main Script          #
#################################
print_instance_info

if [[ $# -lt 7 ]]; then
    print_usage_and_exit
fi

# input args
REGISTRY=$1
REPO=$2
BUILDKITE_COMMIT=$3
BRANCH=$4
VLLM_USE_PRECOMPILED=$5
VLLM_MERGE_BASE_COMMIT=$6
IMAGE_TAG=$7
IMAGE_TAG_LATEST=${8:-}  # only used for main branch, optional

# build config
TARGET="test-ci"
VLLM_BAKE_FILE_PATH="${VLLM_BAKE_FILE_PATH:-docker/docker-bake.hcl}"
BUILDER_NAME="${BUILDER_NAME:-vllm-builder}"
CI_HCL_URL="${CI_HCL_URL:-https://raw.githubusercontent.com/vllm-project/ci-infra/main/docker/ci.hcl}"
CI_HCL_PATH="/tmp/ci.hcl"
BUILDKIT_SOCKET="/run/buildkit/buildkitd.sock"

prepare_cache_tags
ecr_login

# Environment info (for docs and human readers)
# VLLM_CI_BRANCH      - ci-infra branch to use (default: main)
# VLLM_BAKE_FILE_PATH - Path to vLLM's bake file (default: docker/docker-bake.hcl)
# BUILDER_NAME        - Name for buildx builder (default: vllm-builder)
#
# Build configuration (exported as environment variables for bake):
export BUILDKITE_COMMIT
export PARENT_COMMIT
export IMAGE_TAG
export IMAGE_TAG_LATEST
export CACHE_FROM
export CACHE_FROM_BASE_BRANCH
export CACHE_FROM_MAIN
export CACHE_TO
export VLLM_USE_PRECOMPILED
export VLLM_MERGE_BASE_COMMIT

# print args
echo "--- :mag: Arguments"
echo "REGISTRY: ${REGISTRY}"
echo "REPO: ${REPO}"
echo "BUILDKITE_COMMIT: ${BUILDKITE_COMMIT}"
echo "BRANCH: ${BRANCH}"
echo "VLLM_USE_PRECOMPILED: ${VLLM_USE_PRECOMPILED}"
echo "VLLM_MERGE_BASE_COMMIT: ${VLLM_MERGE_BASE_COMMIT}"
echo "IMAGE_TAG: ${IMAGE_TAG}"
echo "IMAGE_TAG_LATEST: ${IMAGE_TAG_LATEST}"

# print build configuration
echo "--- :mag: Build configuration"
echo "TARGET: ${TARGET}"
echo "vLLM bake file: ${VLLM_BAKE_FILE_PATH}"
echo "BUILDER_NAME: ${BUILDER_NAME}"
echo "CI_HCL_URL: ${CI_HCL_URL}"
echo "BUILDKIT_SOCKET: ${BUILDKIT_SOCKET}"

echo "--- :mag: Cache tags"
echo "CACHE_TO: ${CACHE_TO}"
echo "CACHE_FROM: ${CACHE_FROM}"
echo "CACHE_FROM_BASE_BRANCH: ${CACHE_FROM_BASE_BRANCH}"
echo "CACHE_FROM_MAIN: ${CACHE_FROM_MAIN}"

check_and_skip_if_image_exists

echo "--- :docker: Setting up Docker buildx bake"
echo "Target: ${TARGET}"
echo "vLLM bake file: ${VLLM_BAKE_FILE_PATH}"
echo "CI HCL path: ${CI_HCL_PATH}"

if [[ ! -f "${VLLM_BAKE_FILE_PATH}" ]]; then
    echo "Error: vLLM bake file not found at ${VLLM_BAKE_FILE_PATH}"
    echo "Make sure you're running from the vLLM repository root"
    exit 1
fi

echo "--- :arrow_down: Downloading ci.hcl"
curl -sSfL -o "${CI_HCL_PATH}" "${CI_HCL_URL}"
echo "Downloaded to ${CI_HCL_PATH}"

if [[ ! -f "${CI_HCL_PATH}" ]]; then
    echo "Error: ci.hcl not found at ${CI_HCL_PATH}"
    exit 1
fi

setup_buildx_builder

resolve_parent_commit
export PARENT_COMMIT

print_bake_config

echo "--- :docker: Building ${TARGET}"
docker --debug buildx bake -f "${VLLM_BAKE_FILE_PATH}" -f "${CI_HCL_PATH}" --progress plain "${TARGET}"

echo "--- :white_check_mark: Build complete"
@@ -1,59 +0,0 @@
group: Abuild
steps:
- label: ":docker: Build image"
  key: image-build
  depends_on: []
  timeout_in_minutes: 600
  commands:
    - if [[ "$BUILDKITE_BRANCH" != "main" ]]; then .buildkite/image_build/image_build.sh $REGISTRY $REPO $BUILDKITE_COMMIT $BRANCH $VLLM_USE_PRECOMPILED $VLLM_MERGE_BASE_COMMIT $IMAGE_TAG; fi
    - if [[ "$BUILDKITE_BRANCH" == "main" ]]; then .buildkite/image_build/image_build.sh $REGISTRY $REPO $BUILDKITE_COMMIT $BRANCH $VLLM_USE_PRECOMPILED $VLLM_MERGE_BASE_COMMIT $IMAGE_TAG $IMAGE_TAG_LATEST; fi
  retry:
    automatic:
      - exit_status: -1  # Agent was lost
        limit: 2
      - exit_status: -10  # Agent was lost
        limit: 2

- label: ":docker: Build CPU image"
  key: image-build-cpu
  depends_on: []
  commands:
    - .buildkite/image_build/image_build_cpu.sh $REGISTRY $REPO $BUILDKITE_COMMIT
  env:
    DOCKER_BUILDKIT: "1"
  retry:
    automatic:
      - exit_status: -1  # Agent was lost
        limit: 2
      - exit_status: -10  # Agent was lost
        limit: 2

- label: ":docker: Build HPU image"
  soft_fail: true
  depends_on: []
  key: image-build-hpu
  commands:
    - .buildkite/image_build/image_build_hpu.sh $REGISTRY $REPO $BUILDKITE_COMMIT
  env:
    DOCKER_BUILDKIT: "1"
  retry:
    automatic:
      - exit_status: -1  # Agent was lost
        limit: 2
      - exit_status: -10  # Agent was lost
        limit: 2

- label: ":docker: Build CPU arm64 image"
  key: cpu-arm64-image-build
  depends_on: []
  optional: true
  commands:
    - .buildkite/image_build/image_build_cpu_arm64.sh $REGISTRY $REPO $BUILDKITE_COMMIT
  env:
    DOCKER_BUILDKIT: "1"
  retry:
    automatic:
      - exit_status: -1  # Agent was lost
        limit: 2
      - exit_status: -10  # Agent was lost
        limit: 2
@@ -1,36 +0,0 @@
#!/bin/bash
set -e

if [[ $# -lt 3 ]]; then
    echo "Usage: $0 <registry> <repo> <commit>"
    exit 1
fi

REGISTRY=$1
REPO=$2
BUILDKITE_COMMIT=$3

# authenticate with AWS ECR
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin $REGISTRY

# skip build if image already exists
if [[ -z $(docker manifest inspect $REGISTRY/$REPO:$BUILDKITE_COMMIT-cpu) ]]; then
    echo "Image not found, proceeding with build..."
else
    echo "Image found"
    exit 0
fi

# build
docker build --file docker/Dockerfile.cpu \
    --build-arg max_jobs=16 \
    --build-arg buildkite_commit=$BUILDKITE_COMMIT \
    --build-arg VLLM_CPU_AVX512BF16=true \
    --build-arg VLLM_CPU_AVX512VNNI=true \
    --build-arg VLLM_CPU_AMXBF16=true \
    --tag $REGISTRY/$REPO:$BUILDKITE_COMMIT-cpu \
    --target vllm-test \
    --progress plain .

# push
docker push $REGISTRY/$REPO:$BUILDKITE_COMMIT-cpu
@@ -1,33 +0,0 @@
#!/bin/bash
set -e

if [[ $# -lt 3 ]]; then
    echo "Usage: $0 <registry> <repo> <commit>"
    exit 1
fi

REGISTRY=$1
REPO=$2
BUILDKITE_COMMIT=$3

# authenticate with AWS ECR
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin $REGISTRY

# skip build if image already exists
if [[ -z $(docker manifest inspect $REGISTRY/$REPO:$BUILDKITE_COMMIT-cpu) ]]; then
    echo "Image not found, proceeding with build..."
else
    echo "Image found"
    exit 0
fi

# build
docker build --file docker/Dockerfile.cpu \
    --build-arg max_jobs=16 \
    --build-arg buildkite_commit=$BUILDKITE_COMMIT \
    --tag $REGISTRY/$REPO:$BUILDKITE_COMMIT-cpu \
    --target vllm-test \
    --progress plain .

# push
docker push $REGISTRY/$REPO:$BUILDKITE_COMMIT-cpu
@@ -1,34 +0,0 @@
#!/bin/bash
set -e

if [[ $# -lt 3 ]]; then
    echo "Usage: $0 <registry> <repo> <commit>"
    exit 1
fi

REGISTRY=$1
REPO=$2
BUILDKITE_COMMIT=$3

# authenticate with AWS ECR
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin $REGISTRY

# skip build if image already exists
if [[ -z $(docker manifest inspect $REGISTRY/$REPO:$BUILDKITE_COMMIT-hpu) ]]; then
    echo "Image not found, proceeding with build..."
else
    echo "Image found"
    exit 0
fi

# build
docker build \
    --file tests/pytorch_ci_hud_benchmark/Dockerfile.hpu \
    --build-arg max_jobs=16 \
    --build-arg buildkite_commit=$BUILDKITE_COMMIT \
    --tag $REGISTRY/$REPO:$BUILDKITE_COMMIT-hpu \
    --progress plain \
    https://github.com/vllm-project/vllm-gaudi.git

# push
docker push $REGISTRY/$REPO:$BUILDKITE_COMMIT-hpu
@@ -8,4 +8,3 @@ tasks:
      value: 0.80
  limit: 250 # will run on 250 * 14 subjects = 3500 samples
  num_fewshot: 5
  rtol: 0.05
@@ -1,15 +0,0 @@
model_name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.695
  - name: "exact_match,flexible-extract"
    value: 0.447
limit: 1319
num_fewshot: 5
max_model_len: 262144
enforce_eager: false
apply_chat_template: true
fewshot_as_multiturn: true
trust_remote_code: true
@@ -1,19 +0,0 @@
model_name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.7142
  - name: "exact_match,flexible-extract"
    value: 0.4579
env_vars:
  VLLM_USE_FLASHINFER_MOE_FP8: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "throughput"
limit: 1319
num_fewshot: 5
max_model_len: 262144
kv_cache_dtype: fp8
enforce_eager: false
apply_chat_template: true
fewshot_as_multiturn: true
trust_remote_code: true
@@ -1,2 +1 @@
Qwen3-235B-A22B-Instruct-2507-FP8.yaml
NVIDIA-Nemotron-3-Nano-30B-A3B-FP8.yaml

@@ -1 +0,0 @@
Meta-Llama-4-Maverick-17B-128E-Instruct-FP8.yaml

@@ -3,4 +3,3 @@ Meta-Llama-3-70B-Instruct.yaml
Mixtral-8x7B-Instruct-v0.1.yaml
Qwen2-57B-A14-Instruct.yaml
DeepSeek-V2-Lite-Chat.yaml
NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.yaml

@@ -1,5 +0,0 @@
Qwen2.5-1.5B-Instruct.yaml
Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Qwen2.5-VL-3B-Instruct-FP8-dynamic.yaml
Qwen1.5-MoE-W4A16-compressed-tensors.yaml
@@ -2,7 +2,7 @@
# We can use this script to compute baseline accuracy on chartqa for vllm.
#
# Make sure you have lm-eval-harness installed:
# pip install "lm-eval[api]>=0.4.9.2"
# pip install lm-eval==0.4.9

usage() {
    echo

@@ -2,7 +2,7 @@
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
# pip install "lm-eval[api]>=0.4.9.2"
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]

usage() {
    echo

@@ -3,7 +3,7 @@
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install "lm-eval[api]>=0.4.9.2"
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]

usage() {
    echo

@@ -3,7 +3,7 @@
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install "lm-eval[api]>=0.4.9.2"
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]

usage() {
    echo
@@ -9,40 +9,11 @@ pytest -s -v test_lm_eval_correctness.py \
    --tp-size=1
"""

import os
from contextlib import contextmanager

import lm_eval
import numpy as np
import yaml

DEFAULT_RTOL = 0.08


@contextmanager
def scoped_env_vars(new_env: dict[str, str]):
    if not new_env:
        # Fast path: nothing to do
        yield
        return

    old_values = {}
    new_keys = []

    try:
        for key, value in new_env.items():
            if key in os.environ:
                old_values[key] = os.environ[key]
            else:
                new_keys.append(key)
            os.environ[key] = str(value)
        yield
    finally:
        # Restore / clean up
        for key, value in old_values.items():
            os.environ[key] = value
        for key in new_keys:
            os.environ.pop(key, None)

RTOL = 0.08


def launch_lm_eval(eval_config, tp_size):
@@ -60,28 +31,24 @@ def launch_lm_eval(eval_config, tp_size):
        f"add_bos_token=true,"
        f"trust_remote_code={trust_remote_code},"
        f"max_model_len={max_model_len},"
        "allow_deprecated_quantization=True,"
    )

    env_vars = eval_config.get("env_vars", None)
    with scoped_env_vars(env_vars):
        results = lm_eval.simple_evaluate(
            model=backend,
            model_args=model_args,
            tasks=[task["name"] for task in eval_config["tasks"]],
            num_fewshot=eval_config["num_fewshot"],
            limit=eval_config["limit"],
            # TODO(yeq): using chat template w/ fewshot_as_multiturn is supposed help
            # text models. however, this is regressing measured strict-match for
            # existing text models in CI, so only apply it for mm, or explicitly set
            apply_chat_template=eval_config.get(
                "apply_chat_template", backend == "vllm-vlm"
            ),
            fewshot_as_multiturn=eval_config.get("fewshot_as_multiturn", False),
            # Forward decoding and early-stop controls (e.g., max_gen_toks, until=...)
            gen_kwargs=eval_config.get("gen_kwargs"),
            batch_size=batch_size,
        )
    results = lm_eval.simple_evaluate(
        model=backend,
        model_args=model_args,
        tasks=[task["name"] for task in eval_config["tasks"]],
        num_fewshot=eval_config["num_fewshot"],
        limit=eval_config["limit"],
        # TODO(yeq): using chat template w/ fewshot_as_multiturn is supposed help
        # text models. however, this is regressing measured strict-match for
        # existing text models in CI, so only apply it for mm, or explicitly set
        apply_chat_template=eval_config.get(
            "apply_chat_template", backend == "vllm-vlm"
        ),
        fewshot_as_multiturn=eval_config.get("fewshot_as_multiturn", False),
        # Forward decoding and early-stop controls (e.g., max_gen_toks, until=...)
        gen_kwargs=eval_config.get("gen_kwargs"),
        batch_size=batch_size,
    )
    return results


@@ -90,8 +57,6 @@ def test_lm_eval_correctness_param(config_filename, tp_size):

    results = launch_lm_eval(eval_config, tp_size)

    rtol = eval_config.get("rtol", DEFAULT_RTOL)

    success = True
    for task in eval_config["tasks"]:
        for metric in task["metrics"]:
@@ -99,9 +64,8 @@ def test_lm_eval_correctness_param(config_filename, tp_size):
            measured_value = results["results"][task["name"]][metric["name"]]
            print(
                f"{task['name']} | {metric['name']}: "
                f"ground_truth={ground_truth:.3f} | "
                f"measured={measured_value:.3f} | rtol={rtol}"
                f"ground_truth={ground_truth} | measured={measured_value}"
            )
            success = success and np.isclose(ground_truth, measured_value, rtol=rtol)
            success = success and np.isclose(ground_truth, measured_value, rtol=RTOL)

    assert success
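For reference, the new `scoped_env_vars` context manager above is self-contained and can be exercised on its own; a minimal usage sketch (the variable name here is made up):

```python
import os

# Assumes scoped_env_vars as defined in the diff above.
with scoped_env_vars({"MY_FLAG": "1"}):
    assert os.environ["MY_FLAG"] == "1"   # set inside the block
assert "MY_FLAG" not in os.environ        # restored/removed afterwards
```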
@@ -7,7 +7,7 @@ vLLM also maintains a continuous performance benchmark under [perf.vllm.ai](http

## Performance benchmark quick overview

**Benchmarking Coverage**: latency, throughput and fix-qps serving on B200, A100, H100, Intel® Xeon® Processors, Intel® Gaudi® 3 Accelerators and Arm® Neoverse™ with different models.
**Benchmarking Coverage**: latency, throughput and fix-qps serving on B200, A100, H100, Intel® Xeon® Processors and Intel® Gaudi® 3 Accelerators with different models.

**Benchmarking Duration**: about 1hr.

@@ -23,7 +23,7 @@ bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh

Runtime environment variables:

- `ON_CPU`: set the value to '1' on Intel® Xeon® and Arm® Neoverse™ Processors. Default value is 0.
- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
- `THROUGHPUT_JSON`: JSON file to use for the throughput tests. Default value is empty string (use default file).
@@ -34,9 +34,8 @@ Runtime environment variables:

See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.

> NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead.
> For Intel® Gaudi® 3 Accelerators, use `tests/latency-tests-hpu.json`, `tests/throughput-tests-hpu.json`, `tests/serving-tests-hpu.json` instead.
> For Arm® Neoverse™, use `tests/latency-tests-arm64-cpu.json`, `tests/throughput-tests-arm64-cpu.json`, `tests/serving-tests-arm64-cpu.json` instead.

For Intel® Gaudi® 3 Accelerators, use `tests/latency-tests-hpu.json`, `tests/throughput-tests-hpu.json`, `tests/serving-tests-hpu.json` instead.
>

### Latency test

Here is an example of one test inside `latency-tests.json`:
@@ -109,65 +108,6 @@ The number of this test is less stable compared to the delay and latency benchmarks

WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.

#### Default Parameters Field

We can specify default parameters in a JSON field with key `defaults`. Parameters defined in the field are applied globally to all serving tests, and can be overridden in test case fields. Here is an example (and see the merge sketch after it):

<details>
<summary> An example of the default parameters field </summary>

```json
{
    "defaults": {
        "qps_list": ["inf"],
        "server_environment_variables": {
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1
        },
        "server_parameters": {
            "tensor_parallel_size": 1,
            "dtype": "bfloat16",
            "block_size": 128,
            "disable_log_stats": "",
            "load_format": "dummy"
        },
        "client_parameters": {
            "backend": "vllm",
            "dataset_name": "random",
            "random-input-len": 128,
            "random-output-len": 128,
            "num_prompts": 200,
            "ignore-eos": ""
        }
    },
    "tests": [
        {
            "test_name": "serving_llama3B_tp2_random_128_128",
            "server_parameters": {
                "model": "meta-llama/Llama-3.2-3B-Instruct",
                "tensor_parallel_size": 2
            },
            "client_parameters": {
                "model": "meta-llama/Llama-3.2-3B-Instruct"
            }
        },
        {
            "test_name": "serving_qwen3_tp4_random_128_128",
            "server_parameters": {
                "model": "Qwen/Qwen3-14B",
                "tensor_parallel_size": 4
            },
            "client_parameters": {
                "model": "Qwen/Qwen3-14B"
            }
        }
    ]
}
```

</details>
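The README does not show the merge code itself; below is a minimal sketch of the shallow-merge semantics it describes (per-test values override `defaults`). The function name is illustrative, not the benchmark harness's API:

```python
# Illustrative shallow merge: test-level server_parameters override the defaults.
def merge_params(defaults: dict, test_case: dict) -> dict:
    merged = dict(defaults.get("server_parameters", {}))
    merged.update(test_case.get("server_parameters", {}))
    return merged

defaults = {"server_parameters": {"tensor_parallel_size": 1, "dtype": "bfloat16"}}
test = {"server_parameters": {"tensor_parallel_size": 2}}
print(merge_params(defaults, test))
# {'tensor_parallel_size': 2, 'dtype': 'bfloat16'}
```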
### Visualizing the results

The `convert-results-json-to-markdown.py` script helps you put the benchmarking results inside a markdown table, by formatting [descriptions.md](performance-benchmarks-descriptions.md) with real benchmarking results.
@@ -176,6 +116,19 @@ If you do not see the table, please wait till the benchmark finishes running.
The json version of the table (together with the json version of the benchmark) will also be attached to the markdown file.
The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking run.

#### Performance Results Comparison

The `compare-json-results.py` script compares benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
When run, the benchmark script generates results under the `benchmark/results` folder, along with the `benchmark_results.md` and `benchmark_results.json`.
`compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratios, e.g. for Output Tput, Median TTFT and Median TPOT.
If only one `benchmark_results.json` is passed, `compare-json-results.py` compares the different TP and PP configurations within that file instead.

Follow the instructions in [performance results comparison](https://docs.vllm.ai/en/latest/benchmarking/dashboard/#performance-results-comparison) to analyze performance results and the sizing guide.
Here is an example using the script to compare result_a and result_b with Model, Dataset name, input/output length, max concurrency and qps:

`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`

| | Model | Dataset Name | Input Len | Output Len | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
|----|---------------------------------------|--------|-----|-----|------|-----|-----------|----------|----------|
| 0 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | 1 | 142.633982 | 156.526018 | 1.097396 |
| 1 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | inf | 241.620334 | 294.018783 | 1.216863 |
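For throughput metrics like Output Tput, the `perf_ratio` column is just the element-wise quotient of the two result columns (`current / base` in the script below); for the first row:

```python
# Row 0 from the table above: results_b / results_a for a throughput metric.
print(156.526018 / 142.633982)  # ~= 1.097396
```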
A comparison diagram will be generated below the table.
Here is an example comparing 96c/results_gnr_96c_091_tp2pp3 and 128c/results_gnr_128c_091_tp2pp3:

<img width="1886" height="828" alt="image" src="https://github.com/user-attachments/assets/c02a43ef-25d0-4fd6-90e5-2169a28682dd" />
@@ -1,13 +1,8 @@
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import html as _html
|
||||
import json
|
||||
import os
|
||||
from dataclasses import dataclass
|
||||
from importlib import util
|
||||
|
||||
import pandas as pd
|
||||
@@ -15,49 +10,27 @@ import pandas as pd
|
||||
pd.options.display.float_format = "{:.2f}".format
|
||||
plotly_found = util.find_spec("plotly.express") is not None
|
||||
|
||||
DEFAULT_INFO_COLS = [
|
||||
"Model",
|
||||
"Dataset Name",
|
||||
"Input Len",
|
||||
"Output Len",
|
||||
# "TP Size",
|
||||
# "PP Size",
|
||||
"# of max concurrency.",
|
||||
"qps",
|
||||
]
|
||||
|
||||
# Safety net: if any DataFrame leaks into to_html(), keep precision at 2.
|
||||
pd.set_option("display.precision", 2)
|
||||
pd.set_option("display.float_format", lambda x: f"{x:.2f}")
|
||||
|
||||
|
||||
# -----------------------------
|
||||
# Core data compare
|
||||
# -----------------------------
|
||||
def compare_data_columns(
|
||||
files: list[str],
|
||||
name_column: str,
|
||||
data_column: str,
|
||||
info_cols: list[str],
|
||||
drop_column: str,
|
||||
debug: bool = False,
|
||||
files, name_column, data_column, info_cols, drop_column, debug=False
|
||||
):
|
||||
"""
|
||||
Align concatenation by keys derived from info_cols instead of row order.
|
||||
- Pick one canonical key list: subset of info_cols present in ALL files.
|
||||
- For each file: set index to those keys, aggregate duplicates
|
||||
(mean for metric, first for names).
|
||||
- (mean for metric, first for names).
|
||||
- Concat along axis=1 (indexes align), then reset_index so callers can
|
||||
group by columns.
|
||||
- group by columns.
|
||||
- If --debug, add a <file_label>_name column per file.
|
||||
"""
|
||||
print("\ncompare_data_column:", data_column)
|
||||
|
||||
frames = []
|
||||
raw_data_cols: list[str] = []
|
||||
raw_data_cols = []
|
||||
compare_frames = []
|
||||
|
||||
cols_per_file: list[set] = []
|
||||
# 1) choose a canonical key list from info_cols that exists in ALL files
|
||||
cols_per_file = []
|
||||
for f in files:
|
||||
try:
|
||||
df_tmp = pd.read_json(f, orient="records")
|
||||
@@ -67,20 +40,24 @@ def compare_data_columns(
|
||||
|
||||
key_cols = [c for c in info_cols if all(c in cset for cset in cols_per_file)]
|
||||
if not key_cols:
|
||||
# soft fallback: use any info_cols present in the first file
|
||||
key_cols = [c for c in info_cols if c in list(cols_per_file[0])]
|
||||
if not key_cols:
|
||||
raise ValueError(
|
||||
"No common key columns found from info_cols across the input files."
|
||||
)
|
||||
|
||||
# 2) build a single "meta" block (keys as columns) once, aligned by the key index
|
||||
meta_added = False
|
||||
|
||||
for file in files:
|
||||
df = pd.read_json(file, orient="records")
|
||||
|
||||
# Keep rows that actually have the compared metric (same as original behavior)
|
||||
if drop_column in df.columns:
|
||||
df = df.dropna(subset=[drop_column], ignore_index=True)
|
||||
|
||||
# Stabilize numeric key columns (harmless if missing)
|
||||
for c in (
|
||||
"Input Len",
|
||||
"Output Len",
|
||||
@@ -92,26 +69,32 @@ def compare_data_columns(
|
||||
if c in df.columns:
|
||||
df[c] = pd.to_numeric(df[c], errors="coerce")
|
||||
|
||||
# Ensure all key columns exist
|
||||
for c in key_cols:
|
||||
if c not in df.columns:
|
||||
df[c] = pd.NA
|
||||
|
||||
# Set index = key_cols and aggregate duplicates → unique MultiIndex
|
||||
df_idx = df.set_index(key_cols, drop=False)
|
||||
|
||||
# meta (key columns), unique per key
|
||||
meta = df_idx[key_cols]
|
||||
if not meta.index.is_unique:
|
||||
meta = meta.groupby(level=key_cols, dropna=False).first()
|
||||
|
||||
# metric series for this file, aggregated to one row per key
|
||||
file_label = "/".join(file.split("/")[:-1]) or os.path.basename(file)
|
||||
s = df_idx[data_column]
|
||||
if not s.index.is_unique:
|
||||
s = s.groupby(level=key_cols, dropna=False).mean()
|
||||
s.name = file_label
|
||||
s.name = file_label # column label like original
|
||||
|
||||
# add meta once (from first file) so keys are the leftmost columns
|
||||
if not meta_added:
|
||||
frames.append(meta)
|
||||
meta_added = True
|
||||
|
||||
# (NEW) debug: aligned test-name column per file
|
||||
if debug and name_column in df_idx.columns:
|
||||
name_s = df_idx[name_column]
|
||||
if not name_s.index.is_unique:
|
||||
@@ -123,19 +106,26 @@ def compare_data_columns(
|
||||
raw_data_cols.append(file_label)
|
||||
compare_frames.append(s)
|
||||
|
||||
# Generalize ratio: for any file N>=2, add ratio (fileN / file1)
|
||||
if len(compare_frames) >= 2:
|
||||
base = compare_frames[0]
|
||||
current = compare_frames[-1]
|
||||
if "P99" in data_column or "Median" in data_column:
|
||||
ratio = base / current
|
||||
ratio = base / current # for latency
|
||||
else:
|
||||
ratio = current / base
|
||||
ratio = ratio.mask(base == 0)
|
||||
ratio = ratio.mask(base == 0) # avoid inf when baseline is 0
|
||||
ratio.name = f"Ratio 1 vs {len(compare_frames)}"
|
||||
frames.append(ratio)
|
||||
|
||||
concat_df = pd.concat(frames, axis=1).reset_index(drop=True)
|
||||
# 4) concat on columns with aligned MultiIndex;
|
||||
# then reset_index to return keys as columns
|
||||
concat_df = pd.concat(frames, axis=1)
|
||||
concat_df = concat_df.reset_index(drop=True).reset_index()
|
||||
if "index" in concat_df.columns:
|
||||
concat_df = concat_df.drop(columns=["index"])
|
||||
|
||||
# Ensure key/info columns appear first (in your info_cols order)
|
||||
front = [c for c in info_cols if c in concat_df.columns]
|
||||
rest = [c for c in concat_df.columns if c not in front]
|
||||
concat_df = concat_df[front + rest]
|
||||
@@ -144,15 +134,20 @@ def compare_data_columns(
|
||||
return concat_df, raw_data_cols
|
||||
|
||||
|
||||
# -----------------------------
|
||||
# Split helper
|
||||
# -----------------------------
|
||||
def split_json_by_tp_pp(
|
||||
input_file: str = "benchmark_results.json", output_root: str = "."
|
||||
) -> list[str]:
|
||||
"""
|
||||
Split a benchmark JSON into separate folders by (TP Size, PP Size).
|
||||
|
||||
Creates: <output_root>/tp{TP}_pp{PP}/benchmark_results.json
|
||||
Returns: list of file paths written.
|
||||
"""
|
||||
# Load JSON data into DataFrame
|
||||
with open(input_file, encoding="utf-8") as f:
|
||||
data = json.load(f)
|
||||
|
||||
# If the JSON is a dict with a list under common keys, use that list
|
||||
if isinstance(data, dict):
|
||||
for key in ("results", "serving_results", "benchmarks", "data"):
|
||||
if isinstance(data.get(key), list):
|
||||
@@ -161,6 +156,7 @@ def split_json_by_tp_pp(
|
||||
|
||||
df = pd.DataFrame(data)
|
||||
|
||||
# Keep only "serving" tests
|
||||
name_col = next(
|
||||
(c for c in ["Test name", "test_name", "Test Name"] if c in df.columns), None
|
||||
)
|
||||
@@ -169,6 +165,7 @@ def split_json_by_tp_pp(
|
||||
df[name_col].astype(str).str.contains(r"serving", case=False, na=False)
|
||||
].copy()
|
||||
|
||||
# Handle alias column names
|
||||
rename_map = {
|
||||
"tp_size": "TP Size",
|
||||
"tensor_parallel_size": "TP Size",
|
||||
@@ -179,14 +176,21 @@ def split_json_by_tp_pp(
|
||||
columns={k: v for k, v in rename_map.items() if k in df.columns}, inplace=True
|
||||
)
|
||||
|
||||
# Ensure TP/PP columns exist (default to 1 if missing)
|
||||
if "TP Size" not in df.columns:
|
||||
df["TP Size"] = 1
|
||||
if "PP Size" not in df.columns:
|
||||
df["PP Size"] = 1
|
||||
|
||||
df["TP Size"] = pd.to_numeric(df["TP Size"], errors="coerce").fillna(1).astype(int)
|
||||
df["PP Size"] = pd.to_numeric(df["PP Size"], errors="coerce").fillna(1).astype(int)
|
||||
# make sure TP/PP are numeric ints with no NaN
|
||||
df["TP Size"] = (
|
||||
pd.to_numeric(df.get("TP Size", 1), errors="coerce").fillna(1).astype(int)
|
||||
)
|
||||
df["PP Size"] = (
|
||||
pd.to_numeric(df.get("PP Size", 1), errors="coerce").fillna(1).astype(int)
|
||||
)
|
||||
|
||||
# Split into separate folders
|
||||
saved_paths: list[str] = []
|
||||
for (tp, pp), group_df in df.groupby(["TP Size", "PP Size"], dropna=False):
|
||||
folder_name = os.path.join(output_root, f"tp{int(tp)}_pp{int(pp)}")
|
||||
@@ -199,9 +203,32 @@ def split_json_by_tp_pp(
|
||||
return saved_paths
|
||||
|
||||
|
||||
# -----------------------------
|
||||
# Styling helpers
|
||||
# -----------------------------
|
||||
def _add_limit_line(fig, y_value, label):
|
||||
# Visible dashed line + annotation
|
||||
fig.add_hline(
|
||||
y=y_value,
|
||||
line_dash="dash",
|
||||
line_color="red" if "ttft" in label.lower() else "blue",
|
||||
annotation_text=f"{label}: {y_value} ms",
|
||||
annotation_position="top left",
|
||||
)
|
||||
# Optional: add a legend item (as a transparent helper trace)
|
||||
if plot and plotly_found:
|
||||
import plotly.graph_objects as go
|
||||
|
||||
fig.add_trace(
|
||||
go.Scatter(
|
||||
x=[None],
|
||||
y=[None],
|
||||
mode="lines",
|
||||
line=dict(
|
||||
dash="dash", color="red" if "ttft" in label.lower() else "blue"
|
||||
),
|
||||
name=f"{label}",
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
def _find_concurrency_col(df: pd.DataFrame) -> str:
|
||||
for c in [
|
||||
"# of max concurrency.",
|
||||
@@ -212,6 +239,7 @@ def _find_concurrency_col(df: pd.DataFrame) -> str:
|
||||
]:
|
||||
if c in df.columns:
|
||||
return c
|
||||
# Fallback: guess an integer-like column (harmless if unused)
|
||||
for c in df.columns:
|
||||
if df[c].dtype.kind in "iu" and df[c].nunique() > 1 and df[c].min() >= 1:
|
||||
return c
|
||||
@@ -220,7 +248,8 @@ def _find_concurrency_col(df: pd.DataFrame) -> str:
|
||||
|
||||
def _highlight_threshold(
|
||||
df: pd.DataFrame, threshold: float
|
||||
) -> pd.io.formats.style.Styler:
|
||||
) -> "pd.io.formats.style.Styler":
|
||||
"""Highlight numeric per-configuration columns with value <= threshold."""
|
||||
conc_col = _find_concurrency_col(df)
|
||||
key_cols = [
|
||||
c
|
||||
@@ -231,7 +260,6 @@ def _highlight_threshold(
|
||||
c for c in df.columns if c not in key_cols and not str(c).startswith("Ratio")
|
||||
]
|
||||
conf_cols = [c for c in conf_cols if pd.api.types.is_numeric_dtype(df[c])]
|
||||
|
||||
return df.style.map(
|
||||
lambda v: "background-color:#e6ffe6;font-weight:bold;"
|
||||
if pd.notna(v) and v <= threshold
|
||||
@@ -240,264 +268,7 @@ def _highlight_threshold(
|
||||
)
|
||||
|
||||
|
||||
def highlight_ratio_columns(styler: pd.io.formats.style.Styler):
|
||||
ratio_cols = [c for c in styler.data.columns if "ratio" in str(c).lower()]
|
||||
if not ratio_cols:
|
||||
return styler
|
||||
|
||||
styler = styler.apply(
|
||||
lambda _: ["background-color: #fff3b0"] * len(styler.data),
|
||||
subset=ratio_cols,
|
||||
axis=0,
|
||||
)
|
||||
|
||||
styler = styler.set_table_styles(
|
||||
[
|
||||
{
|
||||
"selector": f"th.col_heading.level0.col{i}",
|
||||
"props": [("background-color", "#fff3b0")],
|
||||
}
|
||||
for i, col in enumerate(styler.data.columns)
|
||||
if col in ratio_cols
|
||||
],
|
||||
overwrite=False,
|
||||
)
|
||||
return styler
|
||||
|
||||
|
||||
def _apply_two_decimals(
|
||||
styler: pd.io.formats.style.Styler,
|
||||
) -> pd.io.formats.style.Styler:
|
||||
df = styler.data
|
||||
num_cols = df.select_dtypes("number").columns
|
||||
if len(num_cols) == 0:
|
||||
return styler
|
||||
return styler.format({c: "{:.2f}" for c in num_cols}, na_rep="")
|
||||
|
||||
|
||||
# -----------------------------
|
||||
# Valid max concurrency summary helpers
|
||||
# -----------------------------
|
||||
def _config_value_columns(df: pd.DataFrame, conc_col: str) -> list[str]:
|
||||
key_cols = [
|
||||
c
|
||||
for c in ["Model", "Dataset Name", "Input Len", "Output Len"]
|
||||
if c in df.columns
|
||||
]
|
||||
exclude = set(key_cols + [conc_col, "qps", "QPS"])
|
||||
|
||||
cols: list[str] = []
|
||||
for c in df.columns:
|
||||
if c in exclude:
|
||||
continue
|
||||
lc = str(c).lower()
|
||||
if lc.startswith("ratio"):
|
||||
continue
|
||||
if lc.endswith("_name") or lc == "test name" or lc == "test_name":
|
||||
continue
|
||||
if pd.api.types.is_numeric_dtype(df[c]):
|
||||
cols.append(c)
|
||||
return cols
|
||||
|
||||
|
||||
def _max_concurrency_ok(
|
||||
df: pd.DataFrame, conc_col: str, cfg_col: str, threshold: float
|
||||
):
|
||||
if df is None or conc_col not in df.columns or cfg_col not in df.columns:
|
||||
return pd.NA
|
||||
|
||||
d = df[[conc_col, cfg_col]].copy()
|
||||
d[conc_col] = pd.to_numeric(d[conc_col], errors="coerce")
|
||||
d[cfg_col] = pd.to_numeric(d[cfg_col], errors="coerce")
|
||||
d = d.dropna(subset=[conc_col, cfg_col])
|
||||
|
||||
if d.empty:
|
||||
return pd.NA
|
||||
|
||||
ok = d[d[cfg_col] <= threshold]
|
||||
if ok.empty:
|
||||
return pd.NA
|
||||
|
||||
return ok[conc_col].max()
|
||||
|
||||
|
||||
def _value_at_concurrency(df: pd.DataFrame, conc_col: str, cfg_col: str, conc_value):
|
||||
if (
|
||||
df is None
|
||||
or conc_col not in df.columns
|
||||
or cfg_col not in df.columns
|
||||
or pd.isna(conc_value)
|
||||
):
|
||||
return pd.NA
|
||||
|
||||
d = df[[conc_col, cfg_col]].copy()
|
||||
d[conc_col] = pd.to_numeric(d[conc_col], errors="coerce")
|
||||
d[cfg_col] = pd.to_numeric(d[cfg_col], errors="coerce")
|
||||
|
||||
conc_value = pd.to_numeric(conc_value, errors="coerce")
|
||||
if pd.isna(conc_value):
|
||||
return pd.NA
|
||||
|
||||
hit = d[d[conc_col] == conc_value]
|
||||
if hit.empty:
|
||||
return pd.NA
|
||||
return hit[cfg_col].iloc[0]
|
||||
|
||||
|
||||
def build_valid_max_concurrency_summary_html(
|
||||
tput_group_df: pd.DataFrame | None,
|
||||
ttft_group_df: pd.DataFrame | None,
|
||||
tpot_group_df: pd.DataFrame | None,
|
||||
conc_col: str,
|
||||
args,
|
||||
) -> str:
|
||||
if ttft_group_df is None and tpot_group_df is None:
|
||||
return ""
|
||||
|
||||
ttft_cols = (
|
||||
_config_value_columns(ttft_group_df, conc_col)
|
||||
if ttft_group_df is not None
|
||||
else []
|
||||
)
|
||||
tpot_cols = (
|
||||
_config_value_columns(tpot_group_df, conc_col)
|
||||
if tpot_group_df is not None
|
||||
else []
|
||||
)
|
||||
tput_cols = (
|
||||
_config_value_columns(tput_group_df, conc_col)
|
||||
if tput_group_df is not None
|
||||
else []
|
||||
)
|
||||
|
||||
if ttft_group_df is not None and tpot_group_df is not None:
|
||||
cfg_cols = [c for c in ttft_cols if c in tpot_cols]
|
||||
if tput_group_df is not None:
|
||||
cfg_cols = [c for c in cfg_cols if c in tput_cols] or cfg_cols
|
||||
else:
|
||||
cfg_cols = ttft_cols or tpot_cols
|
||||
|
||||
if not cfg_cols:
|
||||
cfg_cols = sorted(set(ttft_cols) | set(tpot_cols) | set(tput_cols), key=str)
|
||||
|
||||
rows = []
|
||||
for cfg in cfg_cols:
|
||||
ttft_max = (
|
||||
_max_concurrency_ok(ttft_group_df, conc_col, cfg, args.ttft_max_ms)
|
||||
if ttft_group_df is not None
|
||||
else pd.NA
|
||||
)
|
||||
tpot_max = (
|
||||
_max_concurrency_ok(tpot_group_df, conc_col, cfg, args.tpot_max_ms)
|
||||
if tpot_group_df is not None
|
||||
else pd.NA
|
||||
)
|
||||
both = (
|
||||
pd.NA
|
||||
if (pd.isna(ttft_max) or pd.isna(tpot_max))
|
||||
else min(ttft_max, tpot_max)
|
||||
)
|
||||
|
||||
tput_at_both = (
|
||||
_value_at_concurrency(tput_group_df, conc_col, cfg, both)
|
||||
if tput_group_df is not None
|
||||
else pd.NA
|
||||
)
|
||||
ttft_at_both = (
|
||||
_value_at_concurrency(ttft_group_df, conc_col, cfg, both)
|
||||
if ttft_group_df is not None
|
||||
else pd.NA
|
||||
)
|
||||
tpot_at_both = (
|
||||
_value_at_concurrency(tpot_group_df, conc_col, cfg, both)
|
||||
if tpot_group_df is not None
|
||||
else pd.NA
|
||||
)
|
||||
|
||||
rows.append(
|
||||
{
|
||||
"Configuration": cfg,
|
||||
f"Max {conc_col} (TTFT ≤ {args.ttft_max_ms:g} ms)": ttft_max,
|
||||
f"Max {conc_col} (TPOT ≤ {args.tpot_max_ms:g} ms)": tpot_max,
|
||||
f"Max {conc_col} (Both)": both,
|
||||
"Output Tput @ Both (tok/s)": tput_at_both,
|
||||
"TTFT @ Both (ms)": ttft_at_both,
|
||||
"TPOT @ Both (ms)": tpot_at_both,
|
||||
}
|
||||
)
|
||||
|
||||
summary_df = pd.DataFrame(rows)
|
||||
|
||||
# --- Coerce numeric columns so Styler doesn't miss them due to object dtype ---
|
||||
for c in summary_df.columns:
|
||||
if c == "Configuration":
|
||||
continue
|
||||
summary_df[c] = pd.to_numeric(summary_df[c], errors="coerce")
|
||||
|
||||
both_col = f"Max {conc_col} (Both)"
|
||||
|
||||
# --- Strict 2-decimal formatting for ALL non-Configuration columns ---
|
||||
formatters = {}
|
||||
for c in summary_df.columns:
|
||||
if c == "Configuration":
|
||||
continue
|
||||
# default argument binds per-column formatter correctly
|
||||
formatters[c] = lambda v: "" if pd.isna(v) else f"{float(v):.2f}"
|
||||
|
||||
styler = summary_df.style.format(formatters)
|
||||
|
||||
def _green(v):
|
||||
return "background-color:#e6ffe6;font-weight:bold;" if pd.notna(v) else ""
|
||||
|
||||
if both_col in summary_df.columns:
|
||||
styler = styler.map(_green, subset=[both_col])
|
||||
|
||||
title = (
|
||||
'<div style="font-size: 1.15em; font-weight: 700; margin: 12px 0 6px 0;">'
|
||||
"Valid Max Concurrency Summary"
|
||||
"</div>\n"
|
||||
)
|
||||
return title + styler.to_html(table_attributes='border="1" class="dataframe"')
|
||||
|
||||
|
||||
# -----------------------------
|
||||
# Plot helper
|
||||
# -----------------------------
|
||||
def _add_limit_line(fig, y_value: float, label: str):
|
||||
fig.add_hline(
|
||||
y=y_value,
|
||||
line_dash="dash",
|
||||
line_color="red" if "ttft" in label.lower() else "blue",
|
||||
annotation_text=f"{label}: {y_value} ms",
|
||||
annotation_position="top left",
|
||||
)
|
||||
if plotly_found:
|
||||
import plotly.graph_objects as go
|
||||
|
||||
fig.add_trace(
|
||||
go.Scatter(
|
||||
x=[None],
|
||||
y=[None],
|
||||
mode="lines",
|
||||
line=dict(
|
||||
dash="dash",
|
||||
color="red" if "ttft" in label.lower() else "blue",
|
||||
),
|
||||
name=label,
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
# -----------------------------
|
||||
# Refactored main + group-first report
|
||||
# -----------------------------
|
||||
@dataclass(frozen=True)
|
||||
class MetricPlan:
|
||||
data_cols: list[str]
|
||||
drop_column: str
|
||||
|
||||
|
||||
def build_parser() -> argparse.ArgumentParser:
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument(
|
||||
"-f", "--file", action="append", type=str, help="input file name"
|
||||
@@ -537,289 +308,149 @@ def build_parser() -> argparse.ArgumentParser:
|
||||
default=100.0,
|
||||
help="Reference limit for TPOT plots (ms)",
|
||||
)
|
||||
return parser
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||

def choose_metrics(latency: str) -> MetricPlan:
    latency = (latency or "").lower()
    drop_column = "P99"
    name_column = "Test name"
    info_cols = [
        "Model",
        "Dataset Name",
        "Input Len",
        "Output Len",
        "TP Size",
        "PP Size",
        "# of max concurrency.",
        "qps",
    ]

    if "median" in latency:
        return MetricPlan(
            data_cols=["Output Tput (tok/s)", "Median TTFT (ms)", "Median"],
            drop_column=drop_column,
        )

    return MetricPlan(
        data_cols=["Output Tput (tok/s)", "P99 TTFT (ms)", "P99"],
        drop_column=drop_column,
    )


def prepare_input_files(args, info_cols: list[str]) -> tuple[list[str], list[str]]:
    if not args.file:
        raise ValueError("No input files provided. Use -f/--file.")
    if "median" in args.latency:
        data_cols_to_compare = ["Output Tput (tok/s)", "Median TTFT (ms)", "Median"]
        html_msgs_for_data_cols = [
            "Compare Output Tokens /n",
            "Median TTFT /n",
            "Median TPOT /n",
        ]
        drop_column = "P99"
    elif "p99" in args.latency:
        data_cols_to_compare = ["Output Tput (tok/s)", "P99 TTFT (ms)", "P99"]
        html_msgs_for_data_cols = [
            "Compare Output Tokens /n",
            "P99 TTFT /n",
            "P99 TPOT /n",
        ]

    if len(args.file) == 1:
        files = split_json_by_tp_pp(args.file[0], output_root="splits")
        info_cols = [c for c in info_cols if c not in ("TP Size", "PP Size")]
    else:
        files = args.file

    return files, info_cols


def get_y_axis_col(info_cols: list[str], xaxis: str) -> str:
    y_axis_index = info_cols.index(xaxis) if xaxis in info_cols else 6
    return info_cols[y_axis_index]


def get_group_cols(output_df: pd.DataFrame, info_cols: list[str]) -> list[str]:
    filtered_info_cols = info_cols[:4]
    group_cols = [c for c in filtered_info_cols if c in output_df.columns]
    if not group_cols:
        raise ValueError(
            f"No valid group-by columns. Expected subset: {filtered_info_cols}, "
            f"but DataFrame has: {list(output_df.columns)}"
        )
    return group_cols


def normalize_group_key(name):
    return name if isinstance(name, tuple) else (name,)


def group_filename(name, prefix: str = "perf_comparison_") -> str:
    name_vals = normalize_group_key(name)
    safe = ",".join(map(str, name_vals)).replace(",", "_").replace("/", "-")
    return f"{prefix}{safe}.html"


def build_group_suffix(group_cols: list[str], name) -> str:
    name_vals = normalize_group_key(name)
    return " , ".join(f"{col} : [ {val} ] " for col, val in zip(group_cols, name_vals))
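
# Illustrative example, assuming the group columns are Model and Dataset Name:
# for the group key ("llama-8B", "sharegpt"), group_filename() joins the
# values, swaps "," for "_" and "/" for "-", giving
# "perf_comparison_llama-8B_sharegpt.html", while build_group_suffix() renders
# the matching human-readable header "Model : [ llama-8B ]  , Dataset Name : [ sharegpt ] ".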

def render_metric_table_html(
    display_group: pd.DataFrame,
    metric_label: str,
    group_suffix: str,
    args,
) -> str:
    title = (
        f'<div style="font-size: 1.25em; font-weight: 600; margin: 12px 0;">'
        f"{_html.escape(metric_label)}"
        f" — {_html.escape(group_suffix)}"
        f"</div>\n"
    )

    metric_name = metric_label.lower()
    if "ttft" in metric_name:
        styler = _highlight_threshold(display_group, args.ttft_max_ms)
    elif ("tpot" in metric_name) or ("median" in metric_name) or ("p99" in metric_name):
        styler = _highlight_threshold(display_group, args.tpot_max_ms)
    else:
        styler = display_group.style

    styler = _apply_two_decimals(styler)
    styler = highlight_ratio_columns(styler)

    return title + styler.to_html(table_attributes='border="1" class="dataframe"')

def maybe_write_plot(
    main_fh,
    sub_fh,
    group_df: pd.DataFrame,
    raw_data_cols: list[str],
    metric_label: str,
    y_axis_col: str,
    args,
):
    if not (args.plot and plotly_found):
        return

    import plotly.express as px

    df = group_df[raw_data_cols].sort_values(by=y_axis_col)
    df_melted = df.melt(
        id_vars=y_axis_col,
        var_name="Configuration",
        value_name=metric_label,
    )

    fig = px.line(
        df_melted,
        x=y_axis_col,
        y=metric_label,
        color="Configuration",
        title=f"{metric_label} vs {y_axis_col}",
        markers=True,
    )

    # Ensure plot hover + y tick labels are also 2 decimals.
    fig.update_traces(hovertemplate="%{y:.2f}<extra></extra>")
    fig.update_yaxes(tickformat=".2f")

    metric_name = metric_label.lower()
    if "ttft" in metric_name:
        _add_limit_line(fig, args.ttft_max_ms, "TTFT limit")
    elif ("tpot" in metric_name) or ("median" in metric_name) or ("p99" in metric_name):
        _add_limit_line(fig, args.tpot_max_ms, "TPOT limit")

    html = fig.to_html(full_html=True, include_plotlyjs="cdn")
    main_fh.write(html)
    sub_fh.write(html)

def build_group_keys(
    df: pd.DataFrame, group_cols: list[str], sort_cols: list[str] | None = None
):
    if sort_cols:
        df = df.sort_values(by=sort_cols)
    gb = df.groupby(group_cols, dropna=False)
    return [k for k, _ in gb]

def write_report_group_first(
    files: list[str], info_cols: list[str], plan: MetricPlan, args
):
    name_column = "Test name"
    y_axis_col = get_y_axis_col(info_cols, args.xaxis)

    print("comparing : " + ", ".join(files))

    metric_cache: dict[str, tuple[pd.DataFrame, list[str]]] = {}
    group_cols_canonical: list[str] | None = None

    for metric_label in plan.data_cols:
        output_df, raw_data_cols = compare_data_columns(
            files,
            name_column,
            metric_label,
            info_cols,
            plan.drop_column,
            debug=args.debug,
        )

        raw_data_cols = list(raw_data_cols)
        raw_data_cols.insert(0, y_axis_col)

        group_cols = get_group_cols(output_df, info_cols)
        if group_cols_canonical is None:
            group_cols_canonical = group_cols
        else:
            group_cols_canonical = [c for c in group_cols_canonical if c in group_cols]

        metric_cache[metric_label] = (
            output_df.sort_values(by=args.xaxis),
            raw_data_cols,
        )

    if not group_cols_canonical:
        raise ValueError("No canonical group columns found across metrics.")

    first_metric = plan.data_cols[0]
    first_df_sorted, _ = metric_cache[first_metric]
    group_keys = build_group_keys(
        first_df_sorted, group_cols_canonical, sort_cols=[args.xaxis]
    )

    metric_groupbys = {
        metric_label: df.groupby(group_cols_canonical, dropna=False)
        for metric_label, (df, _) in metric_cache.items()
    }

    with open("perf_comparison.html", "w", encoding="utf-8") as main_fh:
        main_fh.write('<meta charset="utf-8">\n')
        for gkey in group_keys:
            gkey_tuple = normalize_group_key(gkey)
            suffix = build_group_suffix(group_cols_canonical, gkey_tuple)
            sub_path = group_filename(gkey_tuple)
            group_header = (
                '<div style="font-size: 1.4em; font-weight: 700; '
                'margin: 18px 0 10px 0;">'
                f"{_html.escape(suffix)}"
                "</div>\n"
    debug = args.debug
    plot = args.plot
    # For Plot feature, assign y axis from one of info_cols
    y_axis_index = info_cols.index(args.xaxis) if args.xaxis in info_cols else 6
    with open("perf_comparison.html", "w") as text_file:
        for i in range(len(data_cols_to_compare)):
            output_df, raw_data_cols = compare_data_columns(
                files,
                name_column,
                data_cols_to_compare[i],
                info_cols,
                drop_column,
                debug=debug,
            )

            main_fh.write(group_header)
            with open(sub_path, "w", encoding="utf-8") as sub_fh:
                sub_fh.write('<meta charset="utf-8">\n')
                sub_fh.write(group_header)
                tput_group_df = None
                ttft_group_df = None
                tpot_group_df = None
                conc_col = args.xaxis
            # For Plot feature, insert y axis from one of info_cols
            raw_data_cols.insert(0, info_cols[y_axis_index])

                for metric_label in plan.data_cols:
                    gb = metric_groupbys[metric_label]
                    df_sorted, raw_data_cols = metric_cache[metric_label]
            filtered_info_cols = info_cols[:-2]
            existing_group_cols = [
                c for c in filtered_info_cols if c in output_df.columns
            ]
            if not existing_group_cols:
                raise ValueError(
                    f"No valid group-by columns "
                    f"Expected subset: {filtered_info_cols}, "
                    f"but DataFrame has: {list(output_df.columns)}"
                )
            # output_df_sorted = output_df.sort_values(by=existing_group_cols)
            output_df_sorted = output_df.sort_values(by=args.xaxis)
            output_groups = output_df_sorted.groupby(existing_group_cols, dropna=False)
            for name, group in output_groups:
                group_name = (
                    ",".join(map(str, name)).replace(",", "_").replace("/", "-")
                )
                group_html_name = "perf_comparison_" + group_name + ".html"

                    try:
                        group_df = gb.get_group(gkey)
                    except KeyError:
                        missing = (
                            '<div style="font-size: 1.1em; font-weight: 600; '
                            'margin: 10px 0;">'
                            f"{_html.escape(metric_label)} — missing for this group"
                            "</div>\n"
                metric_name = str(data_cols_to_compare[i]).lower()
                if "tok/s" in metric_name:
                    html = group.to_html()
                elif "ttft" in metric_name:
                    styler = _highlight_threshold(group, args.ttft_max_ms).format(
                        {c: "{:.2f}" for c in group.select_dtypes("number").columns},
                        na_rep="—",
                    )
                    html = styler.to_html(
                        table_attributes='border="1" class="dataframe"'
                    )
                elif (
                    "tpot" in metric_name
                    or "median" in metric_name
                    or "p99" in metric_name
                ):
                    styler = _highlight_threshold(group, args.tpot_max_ms).format(
                        {c: "{:.2f}" for c in group.select_dtypes("number").columns},
                        na_rep="—",
                    )
                    html = styler.to_html(
                        table_attributes='border="1" class="dataframe"'
                    )

                text_file.write(html_msgs_for_data_cols[i])
                text_file.write(html)
                with open(group_html_name, "a+") as sub_text_file:
                    sub_text_file.write(html_msgs_for_data_cols[i])
                    sub_text_file.write(html)

                if plot and plotly_found:
                    import plotly.express as px

                    df = group[raw_data_cols]
                    df_sorted = df.sort_values(by=info_cols[y_axis_index])
                    # Melt DataFrame for plotting
                    df_melted = df_sorted.melt(
                        id_vars=info_cols[y_axis_index],
                        var_name="Configuration",
                        value_name=data_cols_to_compare[i],
                    )
                    title = (
                        data_cols_to_compare[i] + " vs " + info_cols[y_axis_index]
                    )
                    # Create Plotly line chart
                    fig = px.line(
                        df_melted,
                        x=info_cols[y_axis_index],
                        y=data_cols_to_compare[i],
                        color="Configuration",
                        title=title,
                        markers=True,
                    )

                        main_fh.write(missing)
                        sub_fh.write(missing)
                        continue
                    # ---- Add threshold lines based on metric name ----
                    if "ttft" in metric_name:
                        _add_limit_line(fig, args.ttft_max_ms, "TTFT limit")
                    elif (
                        "tpot" in metric_name
                        or "median" in metric_name
                        or "p99" in metric_name
                    ):
                        _add_limit_line(fig, args.tpot_max_ms, "TPOT limit")

                    if conc_col not in group_df.columns:
                        conc_col = _find_concurrency_col(group_df)

                    mn = metric_label.lower().strip()
                    if "tok/s" in mn:
                        tput_group_df = group_df
                    elif "ttft" in mn:
                        ttft_group_df = group_df
                    elif mn in ("p99", "median") or "tpot" in mn:
                        tpot_group_df = group_df

                    display_group = group_df.drop(
                        columns=group_cols_canonical, errors="ignore"
                    )

                    html = render_metric_table_html(
                        display_group, metric_label, suffix, args
                    )
                    main_fh.write(html)
                    sub_fh.write(html)

                    maybe_write_plot(
                        main_fh,
                        sub_fh,
                        group_df=group_df,
                        raw_data_cols=raw_data_cols,
                        metric_label=metric_label,
                        y_axis_col=y_axis_col,
                        args=args,
                    )

                summary_html = build_valid_max_concurrency_summary_html(
                    tput_group_df=tput_group_df,
                    ttft_group_df=ttft_group_df,
                    tpot_group_df=tpot_group_df,
                    conc_col=conc_col,
                    args=args,
                )
                if summary_html:
                    main_fh.write(summary_html)
                    sub_fh.write(summary_html)

def main():
    args = build_parser().parse_args()
    info_cols = list(DEFAULT_INFO_COLS)
    plan = choose_metrics(args.latency)
    files, info_cols = prepare_input_files(args, info_cols)
    write_report_group_first(files, info_cols, plan, args)


if __name__ == "__main__":
    main()
                    # Export to HTML
                    text_file.write(
                        fig.to_html(full_html=True, include_plotlyjs="cdn")
                    )
                    sub_text_file.write(
                        fig.to_html(full_html=True, include_plotlyjs="cdn")
                    )
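
# A hypothetical invocation (script name and threshold flag spellings are
# inferred from the argparse usage above, so treat this as a sketch rather
# than the documented CLI):
#
#     python compare-json-results.py -f before.json -f after.json \
#         --latency p99 --plot --ttft-max-ms 500 --tpot-max-ms 100
#
# Passing a single -f file triggers split_json_by_tp_pp(), which fans the
# results out per TP/PP combination under ./splits before comparison.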

@@ -393,7 +393,7 @@ if __name__ == "__main__":
        with open(results_folder / md_file, "w") as f:
            results = read_markdown(
                "../.buildkite/performance-benchmarks/"
                "performance-benchmarks-descriptions.md"
                + "performance-benchmarks-descriptions.md"
            )
            results = results.format(
                latency_tests_markdown_table=latency_md_table,
202
.buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
Executable file → Normal file
@@ -25,9 +25,9 @@ check_gpus() {
    echo "Need at least 1 GPU to run benchmarking."
    exit 1
  fi

  declare -g arch_suffix=''

  if command -v nvidia-smi; then
    declare -g gpu_type=$(nvidia-smi --query-gpu=name --format=csv,noheader | awk '{print $2}')
  elif command -v amd-smi; then
@@ -49,11 +49,7 @@ check_cpus() {
    echo "Need at least 1 NUMA to run benchmarking."
    exit 1
  fi
  if [[ "$(uname -m)" == "aarch64" ]] || [[ "$(uname -m)" == "arm64" ]]; then
    declare -g gpu_type="arm64-cpu"
  else
    declare -g gpu_type="cpu"
  fi
  declare -g gpu_type="cpu"
  echo "GPU type is $gpu_type"
}

@@ -114,8 +110,7 @@ json2envs() {
wait_for_server() {
  # wait for vllm server to start
  # return 1 if vllm server crashes
  local timeout_val="1200"
  timeout "$timeout_val" bash -c '
  timeout 1200 bash -c '
    until curl -X POST localhost:8000/v1/completions; do
      sleep 1
    done' && return 0 || return 1
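
# Note (descriptive): the curl loop above polls the completions endpoint once
# per second until any HTTP response comes back; the surrounding timeout
# bounds the wait at 1200 s, and its non-zero exit status is what becomes the
# function's `return 1`.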
@@ -181,20 +176,19 @@ upload_to_buildkite() {
  $BUILDKITE_AGENT_COMMAND artifact upload "$RESULTS_FOLDER/*"
}

run_benchmark_tests() {
  # run benchmark tests using `vllm bench <test_type>` command
  # $1: test type (latency or throughput)
  # $2: a json file specifying test cases
run_latency_tests() {
  # run latency tests using `vllm bench latency` command
  # $1: a json file specifying latency test cases

  local test_type=$1
  local test_file=$2
  local latency_test_file
  latency_test_file=$1

  # Iterate over tests
  jq -c '.[]' "$test_file" | while read -r params; do
  # Iterate over latency tests
  jq -c '.[]' "$latency_test_file" | while read -r params; do
    # get the test name, and append the GPU type back to it.
    test_name=$(echo "$params" | jq -r '.test_name')
    if [[ ! "$test_name" =~ ^${test_type}_ ]]; then
      echo "In ${test_type}-test.json, test_name must start with \"${test_type}_\"."
    if [[ ! "$test_name" =~ ^latency_ ]]; then
      echo "In latency-test.json, test_name must start with \"latency_\"."
      exit 1
    fi

@@ -205,15 +199,15 @@ run_benchmark_tests() {
    fi

    # get arguments
    bench_params=$(echo "$params" | jq -r '.parameters')
    bench_args=$(json2args "$bench_params")
    bench_environment_variables=$(echo "$params" | jq -r '.environment_variables')
    bench_envs=$(json2envs "$bench_environment_variables")
    latency_params=$(echo "$params" | jq -r '.parameters')
    latency_args=$(json2args "$latency_params")
    latency_environment_variables=$(echo "$params" | jq -r '.environment_variables')
    latency_envs=$(json2envs "$latency_environment_variables")

    # check if there is enough GPU to run the test
    tp=$(echo "$bench_params" | jq -r '.tensor_parallel_size')
    if [[ "$ON_CPU" == "1" ]]; then
      pp=$(echo "$bench_params" | jq -r '.pipeline_parallel_size // 1')
    tp=$(echo "$latency_params" | jq -r '.tensor_parallel_size')
    if [ "$ON_CPU" == "1" ]; then
      pp=$(echo "$latency_params" | jq -r '.pipeline_parallel_size')
      world_size=$(($tp*$pp))
      if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
        echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
@@ -226,85 +220,108 @@ run_benchmark_tests() {
      fi
    fi

    bench_command=" $bench_envs vllm bench $test_type \
    latency_command=" $latency_envs vllm bench latency \
      --output-json $RESULTS_FOLDER/${test_name}.json \
      $bench_args"
      $latency_args"

    echo "Running test case $test_name"
    echo "${test_type^} command: $bench_command"
    echo "Latency command: $latency_command"

    # recording benchmarking command and GPU command
    # recoding benchmarking command ang GPU command
    jq_output=$(jq -n \
      --arg command "$bench_command" \
      --arg latency "$latency_command" \
      --arg gpu "$gpu_type" \
      --arg test_type "$test_type" \
      '{
        ($test_type + "_command"): $command,
        latency_command: $latency,
        gpu_type: $gpu
      }')
    echo "$jq_output" >"$RESULTS_FOLDER/$test_name.commands"

    # run the benchmark
    eval "$bench_command"
    eval "$latency_command"

    kill_gpu_processes

  done
}
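
# Note (illustrative, inferred from how the helpers are used above rather than
# from their definitions, which sit outside this hunk): json2args is expected
# to turn a parameters object such as {"tensor_parallel_size": 1,
# "enforce_eager": ""} into CLI flags like "--tensor-parallel-size 1
# --enforce-eager", and json2envs to turn {"VLLM_CPU_KVCACHE_SPACE": 40}
# into an environment prefix like "VLLM_CPU_KVCACHE_SPACE=40".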

run_latency_tests() {
  run_benchmark_tests "latency" "$1"
}

run_startup_tests() {
  run_benchmark_tests "startup" "$1"
}

run_throughput_tests() {
  run_benchmark_tests "throughput" "$1"
  # run throughput tests using `vllm bench throughput`
  # $1: a json file specifying throughput test cases

  local throughput_test_file
  throughput_test_file=$1

  # Iterate over throughput tests
  jq -c '.[]' "$throughput_test_file" | while read -r params; do
    # get the test name, and append the GPU type back to it.
    test_name=$(echo "$params" | jq -r '.test_name')
    if [[ ! "$test_name" =~ ^throughput_ ]]; then
      echo "In throughput-test.json, test_name must start with \"throughput_\"."
      exit 1
    fi

    # if TEST_SELECTOR is set, only run the test cases that match the selector
    if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
      echo "Skip test case $test_name."
      continue
    fi

    # get arguments
    throughput_params=$(echo "$params" | jq -r '.parameters')
    throughput_args=$(json2args "$throughput_params")
    throughput_environment_variables=$(echo "$params" | jq -r '.environment_variables')
    throughput_envs=$(json2envs "$throughput_environment_variables")

    # check if there is enough GPU to run the test
    tp=$(echo "$throughput_params" | jq -r '.tensor_parallel_size')
    if [ "$ON_CPU" == "1" ]; then
      pp=$(echo "$throughput_params" | jq -r '.pipeline_parallel_size')
      world_size=$(($tp*$pp))
      if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
        echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
        continue
      fi
    else
      if [[ $gpu_count -lt $tp ]]; then
        echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $test_name."
        continue
      fi
    fi

    throughput_command=" $throughput_envs vllm bench throughput \
      --output-json $RESULTS_FOLDER/${test_name}.json \
      $throughput_args"

    echo "Running test case $test_name"
    echo "Throughput command: $throughput_command"
    # recording benchmarking command and GPU command
    jq_output=$(jq -n \
      --arg command "$throughput_command" \
      --arg gpu "$gpu_type" \
      '{
        throughput_command: $command,
        gpu_type: $gpu
      }')
    echo "$jq_output" >"$RESULTS_FOLDER/$test_name.commands"

    # run the benchmark
    eval "$throughput_command"

    kill_gpu_processes

  done
}

run_serving_tests() {
  # run serving tests using `vllm bench serve` command
  # $1: a json file specifying serving test cases
  #
  # Supported JSON formats:
  # 1) Plain format: top-level array
  #    [ { "test_name": "...", "server_parameters": {...}, ... }, ... ]
  #
  # 2) Default parameters field + plain format tests
  #    {
  #      "defaults": { ... },
  #      "tests": [ { "test_name": "...", "server_parameters": {...}, ... }, ... ]
  #    }

  local serving_test_file
  serving_test_file=$1

  # Iterate over serving tests
  jq -c '
    if type == "array" then
      # Plain format: test cases array
      .[]
    elif (type == "object" and has("tests")) then
      # merge the default parameters into each test case
      . as $root
      | ($root.defaults // {}) as $d
      | ($root.tests // [])[]
      # default qps / max_concurrency from defaults if missing
      | .qps_list = (.qps_list // $d.qps_list)
      | .max_concurrency_list = (.max_concurrency_list // $d.max_concurrency_list)
      # merge envs / params: test overrides defaults
      | .server_environment_variables =
          (($d.server_environment_variables // {}) + (.server_environment_variables // {}))
      | .server_parameters =
          (($d.server_parameters // {}) + (.server_parameters // {}))
      | .client_parameters =
          (($d.client_parameters // {}) + (.client_parameters // {}))
    else
      error("Unsupported serving test file format: must be array or object with .tests")
    end
  ' "$serving_test_file" | while read -r params; do
  jq -c '.[]' "$serving_test_file" | while read -r params; do
    # get the test name, and append the GPU type back to it.
    test_name=$(echo "$params" | jq -r '.test_name')
    if [[ ! "$test_name" =~ ^serving_ ]]; then
@@ -318,33 +335,28 @@ run_serving_tests() {
      continue
    fi
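
# Illustrative example of the merge above (hypothetical values): with
#   defaults.server_parameters = {"tensor_parallel_size": 1, "dtype": "bfloat16"}
# and a test entry carrying
#   server_parameters = {"tensor_parallel_size": 4},
# jq's `+` object addition yields {"tensor_parallel_size": 4, "dtype": "bfloat16"}:
# per-test keys win and untouched defaults survive. qps_list and
# max_concurrency_list use `//` instead, so defaults apply only when the test
# omits them entirely.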

    # get client and server arguments (after merging the default parameters)
    # get client and server arguments
    server_params=$(echo "$params" | jq -r '.server_parameters')
    server_envs=$(echo "$params" | jq -r '.server_environment_variables')
    client_params=$(echo "$params" | jq -r '.client_parameters')

    server_args=$(json2args "$server_params")
    server_envs=$(json2envs "$server_envs")
    client_args=$(json2args "$client_params")

    # qps_list
    qps_list=$(echo "$params" | jq -r '.qps_list')
    qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
    echo "Running over qps list $qps_list"

    # max_concurrency_list (fallback to num_prompts if missing)
    max_concurrency_list=$(echo "$params" | jq -r '.max_concurrency_list')
    if [[ -z "$max_concurrency_list" || "$max_concurrency_list" == "null" ]]; then
      num_prompts=$(echo "$client_params" | jq -r '.num_prompts')
      max_concurrency_list="[$num_prompts]"
      num_prompts=$(echo "$client_params" | jq -r '.num_prompts')
      max_concurrency_list="[$num_prompts]"
    fi
    max_concurrency_list=$(echo "$max_concurrency_list" | jq -r '.[] | @sh')
    echo "Running over max concurrency list $max_concurrency_list"

    # check if there are enough resources to run the test
    tp=$(echo "$server_params" | jq -r '.tensor_parallel_size')
    if [[ "$ON_CPU" == "1" ]]; then
      pp=$(echo "$server_params" | jq -r '.pipeline_parallel_size // 1')
    if [ "$ON_CPU" == "1" ]; then
      pp=$(echo "$server_params" | jq -r '.pipeline_parallel_size')
      world_size=$(($tp*$pp))
      if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
        echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
@@ -393,11 +405,6 @@ run_serving_tests() {
      fi
    fi

    # save the compilation mode and optimization level on the serving results
    # whenever they are set
    compilation_config_mode=$(echo "$server_params" | jq -r '."compilation_config.mode" // empty')
    optimization_level=$(echo "$server_params" | jq -r '.optimization_level // empty')

    # iterate over different QPS
    for qps in $qps_list; do
      # remove the surrounding single quote from qps
@@ -411,15 +418,15 @@ run_serving_tests() {
      for max_concurrency in $max_concurrency_list; do
        new_test_name=$test_name"_qps_"$qps"_concurrency_"$max_concurrency
        echo " new test name $new_test_name"
        # pass the tensor parallel size, the compilation mode, and the optimization
        # level to the client so that they can be used on the benchmark dashboard
        # pass the tensor parallel size to the client so that it can be displayed
        # on the benchmark dashboard
        client_command="vllm bench serve \
          --save-result \
          --result-dir $RESULTS_FOLDER \
          --result-filename ${new_test_name}.json \
          --request-rate $qps \
          --max-concurrency $max_concurrency \
          --metadata tensor_parallel_size=$tp compilation_config.mode=$compilation_config_mode optimization_level=$optimization_level \
          --metadata "tensor_parallel_size=$tp" \
          $client_args $client_remote_args "

        echo "Running test case $test_name with qps $qps"
@@ -451,9 +458,9 @@
main() {
  local ARCH
  ARCH=''
  if [[ "$ON_CPU" == "1" ]]; then
    check_cpus
    ARCH="-$gpu_type"
  if [ "$ON_CPU" == "1" ];then
    check_cpus
    ARCH='-cpu'
  else
    check_gpus
    ARCH="$arch_suffix"
@@ -485,7 +492,6 @@ main() {
  # benchmarking
  run_serving_tests $QUICK_BENCHMARK_ROOT/tests/"${SERVING_JSON:-serving-tests$ARCH.json}"
  run_latency_tests $QUICK_BENCHMARK_ROOT/tests/"${LATENCY_JSON:-latency-tests$ARCH.json}"
  run_startup_tests $QUICK_BENCHMARK_ROOT/tests/"${STARTUP_JSON:-startup-tests$ARCH.json}"
  run_throughput_tests $QUICK_BENCHMARK_ROOT/tests/"${THROUGHPUT_JSON:-throughput-tests$ARCH.json}"

  # postprocess benchmarking results

@@ -1,26 +0,0 @@
[
  {
    "test_name": "latency_llama8B_tp1",
    "environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 1,
      "load_format": "dummy",
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "num_iters_warmup": 5,
      "num_iters": 15
    }
  }
]

@@ -1,130 +0,0 @@
{
  "defaults": {
    "qps_list": [
      "inf"
    ],
    "max_concurrency_list": [
      12,
      16,
      24,
      32,
      64,
      128,
      200
    ],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 1,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "ignore-eos": "",
      "num_prompts": 200
    }
  },
  "tests": [
    {
      "test_name": "serving_llama8B_tp1_sharegpt",
      "server_parameters": {
        "tensor_parallel_size": 1
      },
      "client_parameters": {
        "dataset_name": "sharegpt",
        "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json"
      }
    },
    {
      "test_name": "serving_llama8B_tp2_sharegpt",
      "server_parameters": {
        "tensor_parallel_size": 2
      },
      "client_parameters": {
        "dataset_name": "sharegpt",
        "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json"
      }
    },
    {
      "test_name": "serving_llama8B_tp1_random_128_128",
      "server_parameters": {
        "tensor_parallel_size": 1
      },
      "client_parameters": {
        "dataset_name": "random",
        "random-input-len": 128,
        "random-output-len": 128
      }
    },
    {
      "test_name": "serving_llama8B_tp2_random_128_128",
      "server_parameters": {
        "tensor_parallel_size": 2
      },
      "client_parameters": {
        "dataset_name": "random",
        "random-input-len": 128,
        "random-output-len": 128
      }
    },
    {
      "test_name": "serving_llama8B_tp1_random_128_2048",
      "server_parameters": {
        "tensor_parallel_size": 1
      },
      "client_parameters": {
        "dataset_name": "random",
        "random-input-len": 128,
        "random-output-len": 2048
      }
    },
    {
      "test_name": "serving_llama8B_tp2_random_128_2048",
      "server_parameters": {
        "tensor_parallel_size": 2
      },
      "client_parameters": {
        "dataset_name": "random",
        "random-input-len": 128,
        "random-output-len": 2048
      }
    },
    {
      "test_name": "serving_llama8B_tp1_random_2048_128",
      "server_parameters": {
        "tensor_parallel_size": 1
      },
      "client_parameters": {
        "dataset_name": "random",
        "random-input-len": 2048,
        "random-output-len": 128
      }
    },
    {
      "test_name": "serving_llama8B_tp2_random_2048_128",
      "server_parameters": {
        "tensor_parallel_size": 2
      },
      "client_parameters": {
        "dataset_name": "random",
        "random-input-len": 2048,
        "random-output-len": 128
      }
    }
  ]
}

@@ -0,0 +1,610 @@
[
  {
    "test_name": "serving_llama8B_bf16_tp1_sharegpt",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 1,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "dataset_name": "sharegpt",
      "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
      "num_prompts": 200
    }
  },
  {
    "test_name": "serving_llama8B_bf16_tp2_sharegpt",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 2,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "dataset_name": "sharegpt",
      "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
      "num_prompts": 200
    }
  },
  {
    "test_name": "serving_llama8B_bf16_tp4_sharegpt",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 4,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "dataset_name": "sharegpt",
      "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
      "num_prompts": 200
    }
  },
  {
    "test_name": "serving_llama8B_bf16_tp1_random_128_128",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 1,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "enable_chunked_prefill": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "dataset_name": "random",
      "random-input-len": 128,
      "random-output-len": 128,
      "ignore-eos": "",
      "num_prompts": 1000
    }
  },
  {
    "test_name": "serving_llama8B_bf16_tp2_random_128_128",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 2,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "enable_chunked_prefill": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "dataset_name": "random",
      "random-input-len": 128,
      "random-output-len": 128,
      "ignore-eos": "",
      "num_prompts": 1000
    }
  },
  {
    "test_name": "serving_llama8B_bf16_tp4_random_128_128",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 4,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "enable_chunked_prefill": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "dataset_name": "random",
      "random-input-len": 128,
      "random-output-len": 128,
      "num_prompts": 1000
    }
  },
  {
    "test_name": "serving_llama8B_int8_tp1_sharegpt",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
      "tensor_parallel_size": 1,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
      "backend": "vllm",
      "dataset_name": "sharegpt",
      "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
      "num_prompts": 200
    }
  },
  {
    "test_name": "serving_llama8B_int8_tp2_sharegpt",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
      "tensor_parallel_size": 2,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
      "backend": "vllm",
      "dataset_name": "sharegpt",
      "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
      "num_prompts": 200
    }
  },
  {
    "test_name": "serving_llama8B_int8_tp4_sharegpt",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
      "tensor_parallel_size": 4,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
      "backend": "vllm",
      "dataset_name": "sharegpt",
      "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
      "num_prompts": 200
    }
  },
  {
    "test_name": "serving_llama8B_int8_tp1_random_128_128",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
      "tensor_parallel_size": 1,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "enable_chunked_prefill": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
      "backend": "vllm",
      "dataset_name": "random",
      "random-input-len": 128,
      "random-output-len": 128,
      "ignore-eos": "",
      "num_prompts": 1000
    }
  },
  {
    "test_name": "serving_llama8B_int8_tp2_random_128_128",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
      "tensor_parallel_size": 2,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "enable_chunked_prefill": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
      "backend": "vllm",
      "dataset_name": "random",
      "random-input-len": 128,
      "random-output-len": 128,
      "ignore-eos": "",
      "num_prompts": 1000
    }
  },
  {
    "test_name": "serving_llama8B_int8_tp4_random_128_128",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
      "tensor_parallel_size": 4,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "enable_chunked_prefill": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
      "backend": "vllm",
      "dataset_name": "random",
      "random-input-len": 128,
      "random-output-len": 128,
      "ignore-eos": "",
      "num_prompts": 1000
    }
  },
  {
    "test_name": "serving_llama8B_int4_tp1_sharegpt",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
      "quantization": "awq",
      "tensor_parallel_size": 1,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
      "backend": "vllm",
      "dataset_name": "sharegpt",
      "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
      "num_prompts": 200
    }
  },
  {
    "test_name": "serving_llama8B_int4_tp2_sharegpt",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
      "quantization": "awq",
      "tensor_parallel_size": 2,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
      "backend": "vllm",
      "dataset_name": "sharegpt",
      "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
      "num_prompts": 200
    }
  },
  {
    "test_name": "serving_llama8B_int4_tp4_sharegpt",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
      "quantization": "awq",
      "tensor_parallel_size": 4,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
      "backend": "vllm",
      "dataset_name": "sharegpt",
      "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
      "num_prompts": 200
    }
  },
  {
    "test_name": "serving_llama8B_int4_tp1_random_128_128",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
      "quantization": "awq",
      "tensor_parallel_size": 1,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "enable_chunked_prefill": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
      "backend": "vllm",
      "dataset_name": "random",
      "random-input-len": 128,
      "random-output-len": 128,
      "ignore-eos": "",
      "num_prompts": 1000
    }
  },
  {
    "test_name": "serving_llama8B_int4_tp2_random_128_128",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
      "quantization": "awq",
      "tensor_parallel_size": 2,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "enable_chunked_prefill": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
      "backend": "vllm",
      "dataset_name": "random",
      "random-input-len": 128,
      "random-output-len": 128,
      "ignore-eos": "",
      "num_prompts": 1000
    }
  },
  {
    "test_name": "serving_llama8B_int4_tp4_random_128_128",
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
      "quantization": "awq",
      "tensor_parallel_size": 4,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "enable_chunked_prefill": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
      "backend": "vllm",
      "dataset_name": "random",
      "random-input-len": 128,
      "random-output-len": 128,
      "ignore-eos": "",
      "num_prompts": 1000
    }
  }
]
1023
.buildkite/performance-benchmarks/tests/serving-tests-cpu-snc3.json
Normal file
File diff suppressed because it is too large
@@ -1,283 +1,276 @@
{
  "defaults": {
    "qps_list": [
      "inf"
    ],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
[
  {
    "test_name": "serving_llama8B_tp1_sharegpt",
    "qps_list": [1, 4, 16, "inf"],
    "max_concurrency_list": [32],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 1,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "dataset_name": "sharegpt",
      "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
      "num_prompts": 32
    }
  },
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 1,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "disable_log_stats": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256
  {
    "test_name": "serving_llama8B_tp2_sharegpt",
    "qps_list": [1, 4, 16, "inf"],
    "max_concurrency_list": [32],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 2,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "dataset_name": "sharegpt",
      "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
      "num_prompts": 32
    }
  },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "ignore-eos": "",
      "num_prompts": 200
  {
    "test_name": "serving_llama8B_tp1_random_128_128",
    "qps_list": [1, 4, 16, "inf"],
    "max_concurrency_list": [32],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 1,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "enable_chunked_prefill": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "dataset_name": "random",
      "random-input-len": 128,
      "random-output-len": 128,
      "ignore-eos": "",
      "num_prompts": 32
    }
  },
  {
    "test_name": "serving_llama8B_tp2_random_128_128",
    "qps_list": [1, 4, 16, "inf"],
    "max_concurrency_list": [32],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 2,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "enable_chunked_prefill": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "dataset_name": "random",
      "random-input-len": 128,
      "random-output-len": 128,
      "ignore-eos": "",
      "num_prompts": 32
    }
  },
  {
    "test_name": "serving_llama8B_tp1_random_128_2048",
    "qps_list": [1, 4, 16, "inf"],
    "max_concurrency_list": [32],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 1,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "enable_chunked_prefill": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "dataset_name": "random",
      "random-input-len": 128,
      "random-output-len": 2048,
      "ignore-eos": "",
      "num_prompts": 32
    }
  },
  {
    "test_name": "serving_llama8B_tp2_random_128_2048",
    "qps_list": [1, 4, 16, "inf"],
    "max_concurrency_list": [32],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 2,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "enable_chunked_prefill": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "dataset_name": "random",
      "random-input-len": 128,
      "random-output-len": 2048,
      "ignore-eos": "",
      "num_prompts": 32
    }
  },
  {
    "test_name": "serving_llama8B_tp1_random_2048_128",
    "qps_list": [1, 4, 16, "inf"],
    "max_concurrency_list": [32],
    "server_environment_variables": {
      "VLLM_RPC_TIMEOUT": 100000,
      "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
      "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
      "VLLM_CPU_SGL_KERNEL": 1,
      "VLLM_CPU_KVCACHE_SPACE": 40
    },
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 1,
      "dtype": "bfloat16",
      "distributed_executor_backend": "mp",
      "block_size": 128,
      "trust_remote_code": "",
      "enable_chunked_prefill": "",
      "disable_log_stats": "",
      "enforce_eager": "",
      "max_num_batched_tokens": 2048,
      "max_num_seqs": 256,
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "backend": "vllm",
      "dataset_name": "random",
      "random-input-len": 2048,
      "random-output-len": 128,
|
||||
"ignore-eos": "",
|
||||
"num_prompts": 32
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_llama8B_tp2_random_2048_128",
|
||||
"qps_list": [1, 4, 16, "inf"],
|
||||
"max_concurrency_list": [32],
|
||||
"server_environment_variables": {
|
||||
"VLLM_RPC_TIMEOUT": 100000,
|
||||
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||
"VLLM_CPU_SGL_KERNEL": 1,
|
||||
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||
},
|
||||
"server_parameters": {
|
||||
"model": "meta-llama/Llama-3.1-8B-Instruct",
|
||||
"tensor_parallel_size": 2,
|
||||
"dtype": "bfloat16",
|
||||
"distributed_executor_backend": "mp",
|
||||
"block_size": 128,
|
||||
"trust_remote_code": "",
|
||||
"enable_chunked_prefill": "",
|
||||
"disable_log_stats": "",
|
||||
"enforce_eager": "",
|
||||
"max_num_batched_tokens": 2048,
|
||||
"max_num_seqs": 256,
|
||||
"load_format": "dummy"
|
||||
},
|
||||
"client_parameters": {
|
||||
"model": "meta-llama/Llama-3.1-8B-Instruct",
|
||||
"backend": "vllm",
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 2048,
|
||||
"random-output-len": 128,
|
||||
"ignore-eos": "",
|
||||
"num_prompts": 32
|
||||
}
|
||||
}
|
||||
},
|
||||
"tests": [
|
||||
{
|
||||
"test_name": "serving_llama8B_tp1_sharegpt",
|
||||
"server_parameters": {
|
||||
"tensor_parallel_size": 1
|
||||
},
|
||||
"client_parameters": {
|
||||
"dataset_name": "sharegpt",
|
||||
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json"
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_llama8B_tp2_sharegpt",
|
||||
"server_parameters": {
|
||||
"tensor_parallel_size": 2
|
||||
},
|
||||
"client_parameters": {
|
||||
"dataset_name": "sharegpt",
|
||||
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json"
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_llama8B_tp1_random_128_128",
|
||||
"server_parameters": {
|
||||
"tensor_parallel_size": 1
|
||||
},
|
||||
"client_parameters": {
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 128
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_llama8B_tp2_random_128_128",
|
||||
"server_parameters": {
|
||||
"tensor_parallel_size": 2
|
||||
},
|
||||
"client_parameters": {
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 128
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_llama8B_tp4_random_128_128",
|
||||
"server_parameters": {
|
||||
"tensor_parallel_size": 4
|
||||
},
|
||||
"client_parameters": {
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 128
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_llama8B_tp1_random_128_2048",
|
||||
"server_parameters": {
|
||||
"tensor_parallel_size": 1
|
||||
},
|
||||
"client_parameters": {
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 2048
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_llama8B_tp2_random_128_2048",
|
||||
"server_parameters": {
|
||||
"tensor_parallel_size": 2
|
||||
},
|
||||
"client_parameters": {
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 2048
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_llama8B_tp4_random_128_2048",
|
||||
"server_parameters": {
|
||||
"tensor_parallel_size": 4
|
||||
},
|
||||
"client_parameters": {
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 2048
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_llama8B_tp1_random_2048_128",
|
||||
"server_parameters": {
|
||||
"tensor_parallel_size": 1
|
||||
},
|
||||
"client_parameters": {
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 2048,
|
||||
"random-output-len": 128
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_llama8B_tp2_random_2048_128",
|
||||
"server_parameters": {
|
||||
"tensor_parallel_size": 2
|
||||
},
|
||||
"client_parameters": {
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 2048,
|
||||
"random-output-len": 128
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_llama8B_tp4_random_2048_128",
|
||||
"server_parameters": {
|
||||
"tensor_parallel_size": 4
|
||||
},
|
||||
"client_parameters": {
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 2048,
|
||||
"random-output-len": 128
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_llama8B_int4_tp1_random_128_128",
|
||||
"server_parameters": {
|
||||
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
|
||||
"tensor_parallel_size": 1
|
||||
},
|
||||
"client_parameters": {
|
||||
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 128
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_llama8B_int4_tp2_random_128_128",
|
||||
"server_parameters": {
|
||||
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
|
||||
"tensor_parallel_size": 2
|
||||
},
|
||||
"client_parameters": {
|
||||
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 128
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_llama8B_int4_tp4_random_128_128",
|
||||
"server_parameters": {
|
||||
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
|
||||
"tensor_parallel_size": 4
|
||||
},
|
||||
"client_parameters": {
|
||||
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 128
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_llama3B_tp1_random_128_128",
|
||||
"server_parameters": {
|
||||
"model": "meta-llama/Llama-3.2-3B-Instruct",
|
||||
"tensor_parallel_size": 1
|
||||
},
|
||||
"client_parameters": {
|
||||
"model": "meta-llama/Llama-3.2-3B-Instruct",
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 128
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_granite2B_tp1_random_128_128",
|
||||
"server_parameters": {
|
||||
"model": "ibm-granite/granite-3.2-2b-instruct",
|
||||
"tensor_parallel_size": 1
|
||||
},
|
||||
"client_parameters": {
|
||||
"model": "ibm-granite/granite-3.2-2b-instruct",
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 128
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_qwen1.7B_tp1_random_128_128",
|
||||
"server_parameters": {
|
||||
"model": "Qwen/Qwen3-1.7B",
|
||||
"tensor_parallel_size": 1
|
||||
},
|
||||
"client_parameters": {
|
||||
"model": "Qwen/Qwen3-1.7B",
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 128
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_qwen4B_tp1_random_128_128",
|
||||
"server_parameters": {
|
||||
"model": "Qwen/Qwen3-4B",
|
||||
"tensor_parallel_size": 1
|
||||
},
|
||||
"client_parameters": {
|
||||
"model": "Qwen/Qwen3-4B",
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 128
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_qwen8B_tp1_random_128_128",
|
||||
"server_parameters": {
|
||||
"model": "Qwen/Qwen3-8B",
|
||||
"tensor_parallel_size": 1
|
||||
},
|
||||
"client_parameters": {
|
||||
"model": "Qwen/Qwen3-8B",
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 128
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_glm9B_tp1_random_128_128",
|
||||
"server_parameters": {
|
||||
"model": "zai-org/glm-4-9b-hf",
|
||||
"tensor_parallel_size": 1
|
||||
},
|
||||
"client_parameters": {
|
||||
"model": "zai-org/glm-4-9b-hf",
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 128
|
||||
}
|
||||
},
|
||||
{
|
||||
"test_name": "serving_gemma7B_tp1_random_128_128",
|
||||
"server_parameters": {
|
||||
"model": "google/gemma-7b",
|
||||
"tensor_parallel_size": 1
|
||||
},
|
||||
"client_parameters": {
|
||||
"model": "google/gemma-7b",
|
||||
"dataset_name": "random",
|
||||
"random-input-len": 128,
|
||||
"random-output-len": 128
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
|
||||
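The JSON above drives the CPU serving benchmarks: each entry pairs `server_parameters` (flags for the vLLM server) with `client_parameters` (flags for the load generator), and empty-string values conventionally denote bare boolean flags. A minimal sketch of that expansion, assuming `jq` is available and an illustrative file name of `serving-tests.json` (both assumptions; the runner script itself is not part of this diff):

```bash
# Hypothetical expansion of the first test's server_parameters into CLI flags.
# Empty-string values (e.g. "trust_remote_code": "") become bare flags;
# underscores are mapped to hyphens.
jq -r '
  .[0].server_parameters
  | to_entries
  | map(if .value == "" then "--\(.key | gsub("_"; "-"))"
        else "--\(.key | gsub("_"; "-")) \(.value)" end)
  | join(" ")
' serving-tests.json
# -> --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 1 ...
```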
@@ -1,27 +0,0 @@
[
    {
        "test_name": "throughput_llama8B_tp1",
        "environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
            "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
            "block_size": 128,
            "trust_remote_code": "",
            "disable_log_stats": "",
            "enforce_eager": "",
            "max_num_batched_tokens": 2048,
            "max_num_seqs": 256,
            "dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200,
            "backend": "vllm"
        }
    }
]
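In the throughput config above, `environment_variables` are applied to the process environment before the run, while `parameters` become CLI arguments. A sketch of the environment half, under the same assumptions as the previous example (`jq` available, illustrative file name):

```bash
# Export each key/value in environment_variables, then confirm one of them.
eval "$(jq -r '.[0].environment_variables | to_entries[]
               | "export \(.key)=\(.value)"' throughput-tests.json)"
echo "$VLLM_CPU_KVCACHE_SPACE"   # prints 40 for the config above
```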
@@ -1,713 +1,183 @@
steps:
  # aarch64 + CUDA builds
  - label: "Build arm64 wheel - CUDA 12.9"
    depends_on: ~
    id: build-wheel-arm64-cuda-12-9
    agents:
      queue: arm64_cpu_queue_postmerge
    commands:
      # #NOTE: torch_cuda_arch_list is derived from upstream PyTorch build files here:
      # https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/aarch64_ci_build.sh#L7
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg VLLM_MAIN_CUDA_VERSION=12.9 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
      - "mkdir artifacts"
      - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
      - "bash .buildkite/scripts/upload-wheels.sh"
    env:
      DOCKER_BUILDKIT: "1"

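Every wheel-build step in this pipeline repeats the pattern of the step above: build the `build` stage of the Dockerfile, copy `dist/` out through a throwaway container, then hand the result to an upload script. The pattern can be reproduced locally almost verbatim (a sketch; tag and paths are taken from the step's own commands):

```bash
# Build the wheel-producing stage and extract dist/ without running the image.
DOCKER_BUILDKIT=1 docker build --target build -t vllm-ci:build-image \
  -f docker/Dockerfile .
mkdir -p artifacts
docker run --rm -v "$(pwd)/artifacts:/artifacts_host" vllm-ci:build-image \
  bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'
ls artifacts/dist/*.whl   # the wheel that upload-wheels.sh would publish
```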
  # aarch64 build
  - label: "Build arm64 CPU wheel"
    depends_on: ~
    id: build-wheel-arm64-cpu
    agents:
      queue: arm64_cpu_queue_postmerge
    commands:
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --build-arg VLLM_BUILD_ACL=ON --tag vllm-ci:build-image --target vllm-build --progress plain -f docker/Dockerfile.cpu ."
      - "mkdir artifacts"
      - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
      - "bash .buildkite/scripts/upload-wheels.sh"
    env:
      DOCKER_BUILDKIT: "1"

  # x86 + CUDA builds
  - label: "Build wheel - CUDA 12.8"
    depends_on: ~
    id: build-wheel-cuda-12-8
    agents:
      queue: cpu_queue_postmerge
    commands:
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
      - "mkdir artifacts"
      - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
      - "bash .buildkite/scripts/upload-wheels.sh"
    env:
      DOCKER_BUILDKIT: "1"

  - label: "Build wheel - CUDA 12.9"
    depends_on: ~
    id: build-wheel-cuda-12-9
    agents:
      queue: cpu_queue_postmerge
    commands:
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
      - "mkdir artifacts"
      - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
      - "bash .buildkite/scripts/upload-wheels.sh"
    env:
      DOCKER_BUILDKIT: "1"

  - label: "Build wheel - CUDA 13.0"
    depends_on: ~
    id: build-wheel-cuda-13-0
    agents:
      queue: cpu_queue_postmerge
    commands:
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
      - "mkdir artifacts"
      - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
      - "bash .buildkite/scripts/upload-wheels.sh"
    env:
      DOCKER_BUILDKIT: "1"

  # Build release images (12.9)
  - label: "Build release image (x86)"
    depends_on: ~
    id: build-release-image-x86
    agents:
      queue: cpu_queue_postmerge
    commands:
      - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ."
      - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)"
      # re-tag to default image tag and push, just in case arm64 build fails
      - "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
      - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"

  - label: "Build release image (arm64)"
    depends_on: ~
    id: build-release-image-arm64
    agents:
      queue: arm64_cpu_queue_postmerge
    commands:
      - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ."
      - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)"

  # Add job to create multi-arch manifest
  - label: "Create multi-arch manifest"
    depends_on:
      - build-release-image-x86
      - build-release-image-arm64
    id: create-multi-arch-manifest
    agents:
      queue: cpu_queue_postmerge
    commands:
      - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
      - "docker manifest create public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-x86_64 public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-aarch64 --amend"
      - "docker manifest push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"

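Because the x86_64 and aarch64 images are pushed under arch-suffixed tags and only joined afterwards, the combined tag can be verified after the push (a quick check, assuming `jq`; not part of the pipeline itself):

```bash
# List the platforms the multi-arch manifest now covers.
docker manifest inspect public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT \
  | jq -r '.manifests[].platform | "\(.os)/\(.architecture)"'
# Expect both linux/amd64 and linux/arm64 in the output.
```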
- label: "Annotate release workflow"
|
||||
depends_on:
|
||||
- create-multi-arch-manifest
|
||||
- build-wheel-cuda-12-8
|
||||
id: annotate-release-workflow
|
||||
agents:
|
||||
queue: cpu_queue_postmerge
|
||||
commands:
|
||||
- "bash .buildkite/scripts/annotate-release.sh"
|
||||
|
||||
- input: "Provide Release version here"
|
||||
id: input-release-version
|
||||
fields:
|
||||
- text: "What is the release version?"
|
||||
key: release-version
|
||||
|
||||
- group: "Build Python wheels"
|
||||
key: "build-wheels"
|
||||
steps:
|
||||
- label: "Build wheel - aarch64 - CUDA 12.9"
|
||||
depends_on: ~
|
||||
id: build-wheel-arm64-cuda-12-9
|
||||
agents:
|
||||
queue: arm64_cpu_queue_postmerge
|
||||
commands:
|
||||
# #NOTE: torch_cuda_arch_list is derived from upstream PyTorch build files here:
|
||||
# https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/aarch64_ci_build.sh#L7
|
||||
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
|
||||
- "mkdir artifacts"
|
||||
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
|
||||
- "bash .buildkite/scripts/upload-nightly-wheels.sh"
|
||||
env:
|
||||
DOCKER_BUILDKIT: "1"
|
||||
|
||||
- label: "Build wheel - aarch64 - CUDA 13.0"
|
||||
depends_on: ~
|
||||
id: build-wheel-arm64-cuda-13-0
|
||||
agents:
|
||||
queue: arm64_cpu_queue_postmerge
|
||||
commands:
|
||||
# #NOTE: torch_cuda_arch_list is derived from upstream PyTorch build files here:
|
||||
# https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/aarch64_ci_build.sh#L7
|
||||
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
|
||||
- "mkdir artifacts"
|
||||
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
|
||||
- "bash .buildkite/scripts/upload-nightly-wheels.sh manylinux_2_35"
|
||||
env:
|
||||
DOCKER_BUILDKIT: "1"
|
||||
|
||||
- label: "Build wheel - aarch64 - CPU"
|
||||
depends_on: ~
|
||||
id: build-wheel-arm64-cpu
|
||||
agents:
|
||||
queue: arm64_cpu_queue_postmerge
|
||||
commands:
|
||||
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --build-arg VLLM_BUILD_ACL=ON --tag vllm-ci:build-image --target vllm-build --progress plain -f docker/Dockerfile.cpu ."
|
||||
- "mkdir artifacts"
|
||||
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
|
||||
- "bash .buildkite/scripts/upload-nightly-wheels.sh manylinux_2_35"
|
||||
env:
|
||||
DOCKER_BUILDKIT: "1"
|
||||
|
||||
- label: "Build wheel - x86_64 - CUDA 12.9"
|
||||
depends_on: ~
|
||||
id: build-wheel-x86-cuda-12-9
|
||||
agents:
|
||||
queue: cpu_queue_postmerge
|
||||
commands:
|
||||
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
|
||||
- "mkdir artifacts"
|
||||
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
|
||||
- "bash .buildkite/scripts/upload-nightly-wheels.sh manylinux_2_31"
|
||||
env:
|
||||
DOCKER_BUILDKIT: "1"
|
||||
|
||||
- label: "Build wheel - x86_64 - CUDA 13.0"
|
||||
depends_on: ~
|
||||
id: build-wheel-x86-cuda-13-0
|
||||
agents:
|
||||
queue: cpu_queue_postmerge
|
||||
commands:
|
||||
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
|
||||
- "mkdir artifacts"
|
||||
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
|
||||
- "bash .buildkite/scripts/upload-nightly-wheels.sh manylinux_2_35"
|
||||
env:
|
||||
DOCKER_BUILDKIT: "1"
|
||||
|
||||
- label: "Build wheel - x86_64 - CPU"
|
||||
depends_on: ~
|
||||
id: build-wheel-x86-cpu
|
||||
agents:
|
||||
queue: cpu_queue_postmerge
|
||||
commands:
|
||||
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --build-arg VLLM_CPU_AMXBF16=true --tag vllm-ci:build-image --target vllm-build --progress plain -f docker/Dockerfile.cpu ."
|
||||
- "mkdir artifacts"
|
||||
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
|
||||
- "bash .buildkite/scripts/upload-nightly-wheels.sh manylinux_2_35"
|
||||
env:
|
||||
DOCKER_BUILDKIT: "1"
|
||||
|
||||
- group: "Build release Docker images"
|
||||
key: "build-release-images"
|
||||
steps:
|
||||
- label: "Build release image - x86_64 - CUDA 12.9"
|
||||
depends_on: ~
|
||||
id: build-release-image-x86
|
||||
agents:
|
||||
queue: cpu_queue_postmerge
|
||||
commands:
|
||||
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
|
||||
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ."
|
||||
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)"
|
||||
# re-tag to default image tag and push, just in case arm64 build fails
|
||||
- "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
|
||||
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
|
||||
|
||||
- label: "Build release image - aarch64 - CUDA 12.9"
|
||||
depends_on: ~
|
||||
id: build-release-image-arm64
|
||||
agents:
|
||||
queue: arm64_cpu_queue_postmerge
|
||||
commands:
|
||||
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
|
||||
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ."
|
||||
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)"
|
||||
|
||||
- label: "Build release image - x86_64 - CUDA 13.0"
|
||||
depends_on: ~
|
||||
id: build-release-image-x86-cuda-13-0
|
||||
agents:
|
||||
queue: cpu_queue_postmerge
|
||||
commands:
|
||||
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
|
||||
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg INSTALL_KV_CONNECTORS=true --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cu130 --target vllm-openai --progress plain -f docker/Dockerfile ."
|
||||
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cu130"
|
||||
# re-tag to default image tag and push, just in case arm64 build fails
|
||||
- "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cu130 public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cu130"
|
||||
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cu130"
|
||||
|
||||
- label: "Build release image - aarch64 - CUDA 13.0"
|
||||
depends_on: ~
|
||||
id: build-release-image-arm64-cuda-13-0
|
||||
agents:
|
||||
queue: arm64_cpu_queue_postmerge
|
||||
commands:
|
||||
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
|
||||
# compute capability 12.0 for RTX-50 series / RTX PRO 6000 Blackwell, 12.1 for DGX Spark
|
||||
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0 12.1' --build-arg INSTALL_KV_CONNECTORS=true --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cu130 --target vllm-openai --progress plain -f docker/Dockerfile ."
|
||||
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cu130"
|
||||
|
||||
- block: "Build release image for x86_64 CPU"
|
||||
key: block-cpu-release-image-build
|
||||
depends_on: ~
|
||||
|
||||
- label: "Build release image - x86_64 - CPU"
|
||||
depends_on:
|
||||
- block-cpu-release-image-build
|
||||
- input-release-version
|
||||
agents:
|
||||
queue: cpu_queue_postmerge
|
||||
commands:
|
||||
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
|
||||
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --build-arg VLLM_CPU_AMXBF16=true --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest --progress plain --target vllm-openai -f docker/Dockerfile.cpu ."
|
||||
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest"
|
||||
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
|
||||
env:
|
||||
DOCKER_BUILDKIT: "1"
|
||||
|
||||
- block: "Build release image for arm64 CPU"
|
||||
key: block-arm64-cpu-release-image-build
|
||||
depends_on: ~
|
||||
|
||||
- label: "Build release image - arm64 - CPU"
|
||||
depends_on:
|
||||
- block-arm64-cpu-release-image-build
|
||||
- input-release-version
|
||||
agents:
|
||||
queue: arm64_cpu_queue_postmerge
|
||||
commands:
|
||||
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
|
||||
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:latest --progress plain --target vllm-openai -f docker/Dockerfile.cpu ."
|
||||
- "docker push public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:latest"
|
||||
- "docker push public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
|
||||
env:
|
||||
DOCKER_BUILDKIT: "1"
|
||||
|
||||
- group: "Publish release images"
|
||||
key: "publish-release-images"
|
||||
steps:
|
||||
- label: "Create multi-arch manifest - CUDA 12.9"
|
||||
depends_on:
|
||||
- build-release-image-x86
|
||||
- build-release-image-arm64
|
||||
id: create-multi-arch-manifest
|
||||
agents:
|
||||
queue: small_cpu_queue_postmerge
|
||||
commands:
|
||||
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
|
||||
- "docker manifest create public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-x86_64 public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-aarch64 --amend"
|
||||
- "docker manifest push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
|
||||
|
||||
- label: "Annotate release workflow - CUDA 12.9"
|
||||
depends_on:
|
||||
- create-multi-arch-manifest
|
||||
id: annotate-release-workflow
|
||||
agents:
|
||||
queue: small_cpu_queue_postmerge
|
||||
commands:
|
||||
- "bash .buildkite/scripts/annotate-release.sh"
|
||||
|
||||
- label: "Create multi-arch manifest - CUDA 13.0"
|
||||
depends_on:
|
||||
- build-release-image-x86-cuda-13-0
|
||||
- build-release-image-arm64-cuda-13-0
|
||||
id: create-multi-arch-manifest-cuda-13-0
|
||||
agents:
|
||||
queue: small_cpu_queue_postmerge
|
||||
commands:
|
||||
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
|
||||
- "docker manifest create public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cu130 public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-x86_64-cu130 public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-aarch64-cu130 --amend"
|
||||
- "docker manifest push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cu130"
|
||||
|
||||
- label: "Publish nightly multi-arch image to DockerHub"
|
||||
depends_on:
|
||||
- create-multi-arch-manifest
|
||||
if: build.env("NIGHTLY") == "1"
|
||||
agents:
|
||||
queue: small_cpu_queue_postmerge
|
||||
commands:
|
||||
- "bash .buildkite/scripts/push-nightly-builds.sh"
|
||||
# Clean up old nightly builds (keep only last 14)
|
||||
- "bash .buildkite/scripts/cleanup-nightly-builds.sh"
|
||||
plugins:
|
||||
- docker-login#v3.0.0:
|
||||
username: vllmbot
|
||||
password-env: DOCKERHUB_TOKEN
|
||||
env:
|
||||
DOCKER_BUILDKIT: "1"
|
||||
DOCKERHUB_USERNAME: "vllmbot"
|
||||
|
||||
- label: "Publish nightly multi-arch image to DockerHub - CUDA 13.0"
|
||||
depends_on:
|
||||
- create-multi-arch-manifest-cuda-13-0
|
||||
if: build.env("NIGHTLY") == "1"
|
||||
agents:
|
||||
queue: small_cpu_queue_postmerge
|
||||
commands:
|
||||
- "bash .buildkite/scripts/push-nightly-builds.sh cu130"
|
||||
# Clean up old nightly builds (keep only last 14)
|
||||
- "bash .buildkite/scripts/cleanup-nightly-builds.sh cu130-nightly-"
|
||||
plugins:
|
||||
- docker-login#v3.0.0:
|
||||
username: vllmbot
|
||||
password-env: DOCKERHUB_TOKEN
|
||||
env:
|
||||
DOCKER_BUILDKIT: "1"
|
||||
DOCKERHUB_USERNAME: "vllmbot"
|
||||
|
||||
- group: "Publish wheels"
|
||||
key: "publish-wheels"
|
||||
steps:
|
||||
- block: "Confirm update release wheels to PyPI (experimental, use with caution)?"
|
||||
key: block-upload-release-wheels
|
||||
depends_on:
|
||||
- input-release-version
|
||||
- build-wheels
|
||||
|
||||
- label: "Upload release wheels to PyPI"
|
||||
depends_on:
|
||||
- block-upload-release-wheels
|
||||
id: upload-release-wheels
|
||||
agents:
|
||||
queue: small_cpu_queue_postmerge
|
||||
commands:
|
||||
- "bash .buildkite/scripts/upload-release-wheels-pypi.sh"
|
||||
|
||||
# =============================================================================
|
||||
# ROCm Release Pipeline (x86_64 only)
|
||||
# =============================================================================
|
||||
#
|
||||
# vLLM version is determined by the Buildkite checkout (like CUDA pipeline).
|
||||
# To build a specific version, trigger the build from that branch/tag.
|
||||
#
|
||||
# Environment variables for ROCm builds (set via Buildkite UI or schedule):
|
||||
# ROCM_PYTHON_VERSION: Python version (default: 3.12)
|
||||
# PYTORCH_ROCM_ARCH: GPU architectures (default: gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151)
|
||||
# ROCM_UPLOAD_WHEELS: Upload to S3 (default: false for nightly, true for releases)
|
||||
# ROCM_FORCE_REBUILD: Force rebuild base wheels, ignore S3 cache (default: false)
|
||||
#
|
||||
# Note: ROCm version is determined by BASE_IMAGE in docker/Dockerfile.rocm_base
|
||||
# (currently rocm/dev-ubuntu-22.04:7.1-complete)
|
||||
#
|
||||
# =============================================================================
|
||||
|
||||
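The caching flow that ROCm Job 1 below implements, condensed (the subcommands are the ones the step actually calls; the key derivation noted in the comment is an assumption):

```bash
# key/path/check/download/upload are subcommands of the cache helper script.
CACHE_KEY=$(.buildkite/scripts/cache-rocm-base-wheels.sh key)   # presumably hashes Dockerfile.rocm_base inputs + arch list
if [ "$(.buildkite/scripts/cache-rocm-base-wheels.sh check)" = "hit" ]; then
  .buildkite/scripts/cache-rocm-base-wheels.sh download         # wheels land in artifacts/rocm-base-wheels/
else
  # ...build the base wheels from scratch, then repopulate the cache:
  .buildkite/scripts/cache-rocm-base-wheels.sh upload
fi
```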
  # ROCm Input Step - Collect build configuration (manual trigger only)
  - input: "ROCm Wheel Release Build Configuration"
    key: input-rocm-config
  - block: "Build CPU release image"
    key: block-cpu-release-image-build
    depends_on: ~
    if: build.source == "ui"
    fields:
      - text: "Python Version"
        key: "rocm-python-version"
        default: "3.12"
        hint: "Python version (e.g., 3.12)"
      - text: "GPU Architectures"
        key: "rocm-pytorch-rocm-arch"
        default: "gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151"
        hint: "Semicolon-separated GPU architectures"
      - select: "Upload Wheels to S3"
        key: "rocm-upload-wheels"
        default: "true"
        options:
          - label: "No - Build only (nightly/dev)"
            value: "false"
          - label: "Yes - Upload to S3 (release)"
            value: "true"
      - select: "Force Rebuild Base Wheels"
        key: "rocm-force-rebuild"
        default: "false"
        hint: "Ignore S3 cache and rebuild base wheels from scratch"
        options:
          - label: "No - Use cached wheels if available"
            value: "false"
          - label: "Yes - Rebuild even if cache exists"
            value: "true"

  # ROCm Job 1: Build ROCm Base Wheels (with S3 caching)
  - label: ":rocm: Build ROCm Base Wheels"
    id: build-rocm-base-wheels
    depends_on:
      - step: input-rocm-config
        allow_failure: true # Allow failure so non-UI builds can proceed (input step is skipped)
  - label: "Build and publish CPU release image"
    depends_on: block-cpu-release-image-build
    agents:
      queue: cpu_queue_postmerge
    commands:
      # Set configuration and check cache
      - |
        set -euo pipefail

        # Get values from meta-data (set by input step) or use defaults
        PYTHON_VERSION="$$(buildkite-agent meta-data get rocm-python-version 2>/dev/null || echo '')"
        export PYTHON_VERSION="$${PYTHON_VERSION:-3.12}"

        PYTORCH_ROCM_ARCH="$$(buildkite-agent meta-data get rocm-pytorch-rocm-arch 2>/dev/null || echo '')"
        export PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH:-gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151}"

        # Check for force rebuild flag
        ROCM_FORCE_REBUILD="$${ROCM_FORCE_REBUILD:-}"
        if [ -z "$${ROCM_FORCE_REBUILD}" ]; then
          ROCM_FORCE_REBUILD="$$(buildkite-agent meta-data get rocm-force-rebuild 2>/dev/null || echo '')"
        fi

        echo "========================================"
        echo "ROCm Base Wheels Build Configuration"
        echo "========================================"
        echo " PYTHON_VERSION: $${PYTHON_VERSION}"
        echo " PYTORCH_ROCM_ARCH: $${PYTORCH_ROCM_ARCH}"
        echo " ROCM_FORCE_REBUILD: $${ROCM_FORCE_REBUILD:-false}"
        echo "========================================"

        # Save resolved config for later jobs
        buildkite-agent meta-data set "rocm-python-version" "$${PYTHON_VERSION}"
        buildkite-agent meta-data set "rocm-pytorch-rocm-arch" "$${PYTORCH_ROCM_ARCH}"

        # Check S3 cache for pre-built wheels
        CACHE_KEY=$$(.buildkite/scripts/cache-rocm-base-wheels.sh key)
        CACHE_PATH=$$(.buildkite/scripts/cache-rocm-base-wheels.sh path)
        echo ""
        echo "Cache key: $${CACHE_KEY}"
        echo "Cache path: $${CACHE_PATH}"

        # Save cache key for downstream jobs
        buildkite-agent meta-data set "rocm-cache-key" "$${CACHE_KEY}"

        CACHE_STATUS="miss"
        if [ "$${ROCM_FORCE_REBUILD}" != "true" ]; then
          CACHE_STATUS=$$(.buildkite/scripts/cache-rocm-base-wheels.sh check)
        else
          echo "Force rebuild requested, skipping cache check"
        fi

        if [ "$${CACHE_STATUS}" = "hit" ]; then
          echo ""
          echo "CACHE HIT! Downloading pre-built wheels..."
          echo ""
          .buildkite/scripts/cache-rocm-base-wheels.sh download

          # Set the S3 path for the cached Docker image (for Job 2 to download)
          S3_ARTIFACT_PATH="s3://$${S3_BUCKET}/rocm/cache/$${CACHE_KEY}"
          buildkite-agent meta-data set "rocm-docker-image-s3-path" "$${S3_ARTIFACT_PATH}/rocm-base-image.tar.gz"

          # Mark that we used cache (for Docker image handling)
          buildkite-agent meta-data set "rocm-used-cache" "true"

          echo ""
          echo "Cache download complete. Skipping Docker build."
          echo "Docker image will be downloaded from: $${S3_ARTIFACT_PATH}/rocm-base-image.tar.gz"
        else
          echo ""
          echo "CACHE MISS. Building from scratch..."
          echo ""

          # Build full base image (for later vLLM build)
          DOCKER_BUILDKIT=1 docker buildx build \
            --file docker/Dockerfile.rocm_base \
            --tag rocm/vllm-dev:base-$${BUILDKITE_BUILD_NUMBER} \
            --build-arg PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH}" \
            --build-arg PYTHON_VERSION="$${PYTHON_VERSION}" \
            --build-arg USE_SCCACHE=1 \
            --build-arg SCCACHE_BUCKET_NAME=vllm-build-sccache \
            --build-arg SCCACHE_REGION_NAME=us-west-2 \
            --build-arg SCCACHE_S3_NO_CREDENTIALS=0 \
            --load \
            .

          # Build debs_wheel_release stage for wheel extraction
          DOCKER_BUILDKIT=1 docker buildx build \
            --file docker/Dockerfile.rocm_base \
            --tag rocm-base-debs:$${BUILDKITE_BUILD_NUMBER} \
            --target debs_wheel_release \
            --build-arg PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH}" \
            --build-arg PYTHON_VERSION="$${PYTHON_VERSION}" \
            --build-arg USE_SCCACHE=1 \
            --build-arg SCCACHE_BUCKET_NAME=vllm-build-sccache \
            --build-arg SCCACHE_REGION_NAME=us-west-2 \
            --build-arg SCCACHE_S3_NO_CREDENTIALS=0 \
            --load \
            .

          # Extract wheels from Docker image
          mkdir -p artifacts/rocm-base-wheels
          container_id=$$(docker create rocm-base-debs:$${BUILDKITE_BUILD_NUMBER})
          docker cp $${container_id}:/app/debs/. artifacts/rocm-base-wheels/
          docker rm $${container_id}
          echo "Extracted base wheels:"
          ls -lh artifacts/rocm-base-wheels/

          # Upload wheels to S3 cache for future builds
          echo ""
          echo "Uploading wheels to S3 cache..."
          .buildkite/scripts/cache-rocm-base-wheels.sh upload

          # Export base Docker image for reuse in vLLM build
          mkdir -p artifacts/rocm-docker-image
          docker save rocm/vllm-dev:base-$${BUILDKITE_BUILD_NUMBER} | gzip > artifacts/rocm-docker-image/rocm-base-image.tar.gz
          echo "Docker image size:"
          ls -lh artifacts/rocm-docker-image/

          # Upload large Docker image to S3 (also cached by cache key)
          S3_ARTIFACT_PATH="s3://$${S3_BUCKET}/rocm/cache/$${CACHE_KEY}"
          echo "Uploading Docker image to $${S3_ARTIFACT_PATH}/"
          aws s3 cp artifacts/rocm-docker-image/rocm-base-image.tar.gz "$${S3_ARTIFACT_PATH}/rocm-base-image.tar.gz"

          # Save the S3 path for downstream jobs
          buildkite-agent meta-data set "rocm-docker-image-s3-path" "$${S3_ARTIFACT_PATH}/rocm-base-image.tar.gz"

          # Mark that we did NOT use cache
          buildkite-agent meta-data set "rocm-used-cache" "false"

          echo ""
          echo "Build complete. Wheels cached for future builds."
        fi
    artifact_paths:
      - "artifacts/rocm-base-wheels/*.whl"
      - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --build-arg VLLM_CPU_AMXBF16=true --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest --progress plain --target vllm-openai -f docker/Dockerfile.cpu ."
      - "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest"
      - "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
    env:
      DOCKER_BUILDKIT: "1"
      S3_BUCKET: "vllm-wheels"

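The base image is too large for Buildkite artifacts, so Jobs 1 and 2 hand it off through S3 as a gzipped `docker save` tarball. The round trip, condensed from the commands above and below (the S3 path is the one the step sets in meta-data; the streaming form on the consumer side is a sketch):

```bash
# Producer (Job 1): serialize the image and upload it next to the cached wheels.
docker save "rocm/vllm-dev:base-${BUILDKITE_BUILD_NUMBER}" | gzip > rocm-base-image.tar.gz
aws s3 cp rocm-base-image.tar.gz "s3://${S3_BUCKET}/rocm/cache/${CACHE_KEY}/rocm-base-image.tar.gz"

# Consumer (Job 2): stream it back and load it; docker load reads the tar from stdin.
aws s3 cp "s3://${S3_BUCKET}/rocm/cache/${CACHE_KEY}/rocm-base-image.tar.gz" - | gunzip | docker load
```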
  # ROCm Job 2: Build vLLM ROCm Wheel
  - label: ":python: Build vLLM ROCm Wheel - x86_64"
    id: build-rocm-vllm-wheel
    depends_on:
      - step: build-rocm-base-wheels
        allow_failure: false
  - block: "Build arm64 CPU release image"
    key: block-arm64-cpu-release-image-build
    depends_on: ~

  - label: "Build and publish arm64 CPU release image"
    depends_on: block-arm64-cpu-release-image-build
    agents:
      queue: cpu_queue_postmerge
    timeout_in_minutes: 180
      queue: arm64_cpu_queue_postmerge
    commands:
      # Download artifacts and prepare Docker image
      - |
        set -euo pipefail

        # Ensure git tags are up-to-date (Buildkite's default fetch doesn't update tags)
        # This fixes version detection when tags are moved/force-pushed
        echo "Fetching latest tags from origin..."
        git fetch --tags --force origin

        # Log tag information for debugging version detection
        echo "========================================"
        echo "Git Tag Verification"
        echo "========================================"
        echo "Current HEAD: $(git rev-parse HEAD)"
        echo "git describe --tags: $(git describe --tags 2>/dev/null || echo 'No tags found')"
        echo ""
        echo "Recent tags (pointing to commits near HEAD):"
        git tag -l --sort=-creatordate | head -5
        echo "setuptools_scm version detection:"
        pip install -q setuptools_scm 2>/dev/null || true
        python3 -c "import setuptools_scm; print(' Detected version:', setuptools_scm.get_version())" 2>/dev/null || echo " (setuptools_scm not available in this environment)"
        echo "========================================"

        # Download wheel artifacts from current build
        echo "Downloading wheel artifacts from current build"
        buildkite-agent artifact download "artifacts/rocm-base-wheels/*.whl" .

        # Download Docker image from S3 (too large for Buildkite artifacts)
        DOCKER_IMAGE_S3_PATH="$$(buildkite-agent meta-data get rocm-docker-image-s3-path 2>/dev/null || echo '')"
        if [ -z "$${DOCKER_IMAGE_S3_PATH}" ]; then
          echo "ERROR: rocm-docker-image-s3-path metadata not found"
          echo "This should have been set by the build-rocm-base-wheels job"
          exit 1
        fi
        echo "Downloading Docker image from $${DOCKER_IMAGE_S3_PATH}"
        mkdir -p artifacts/rocm-docker-image
        aws s3 cp "$${DOCKER_IMAGE_S3_PATH}" artifacts/rocm-docker-image/rocm-base-image.tar.gz

        # Load base Docker image and capture the tag
        echo "Loading base Docker image..."
        LOAD_OUTPUT=$$(gunzip -c artifacts/rocm-docker-image/rocm-base-image.tar.gz | docker load)
        echo "$${LOAD_OUTPUT}"
        # Extract the actual loaded image tag from "Loaded image: <tag>" output
        # This avoids picking up stale images (like rocm/vllm-dev:nightly) already on the agent
        BASE_IMAGE_TAG=$$(echo "$${LOAD_OUTPUT}" | grep "Loaded image:" | sed 's/Loaded image: //')
        if [ -z "$${BASE_IMAGE_TAG}" ]; then
          echo "ERROR: Failed to extract image tag from docker load output"
          echo "Load output was: $${LOAD_OUTPUT}"
          exit 1
        fi
        echo "Loaded base image: $${BASE_IMAGE_TAG}"

        # Prepare base wheels for Docker build context
        mkdir -p docker/context/base-wheels
        touch docker/context/base-wheels/.keep
        cp artifacts/rocm-base-wheels/*.whl docker/context/base-wheels/
        echo "Base wheels for vLLM build:"
        ls -lh docker/context/base-wheels/

        # Get GPU architectures from meta-data
        PYTORCH_ROCM_ARCH="$$(buildkite-agent meta-data get rocm-pytorch-rocm-arch 2>/dev/null || echo '')"
        PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH:-gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151}"

        echo "========================================"
        echo "Building vLLM wheel with:"
        echo " BUILDKITE_COMMIT: $${BUILDKITE_COMMIT}"
        echo " BUILDKITE_BRANCH: $${BUILDKITE_BRANCH}"
        echo " PYTORCH_ROCM_ARCH: $${PYTORCH_ROCM_ARCH}"
        echo " BASE_IMAGE: $${BASE_IMAGE_TAG}"
        echo "========================================"

        # Build vLLM wheel using local checkout (REMOTE_VLLM=0)
        DOCKER_BUILDKIT=1 docker build \
          --file docker/Dockerfile.rocm \
          --target export_vllm_wheel_release \
          --output type=local,dest=rocm-dist \
          --build-arg BASE_IMAGE="$${BASE_IMAGE_TAG}" \
          --build-arg ARG_PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH}" \
          --build-arg REMOTE_VLLM=0 \
          --build-arg GIT_REPO_CHECK=1 \
          --build-arg USE_SCCACHE=1 \
          --build-arg SCCACHE_BUCKET_NAME=vllm-build-sccache \
          --build-arg SCCACHE_REGION_NAME=us-west-2 \
          --build-arg SCCACHE_S3_NO_CREDENTIALS=0 \
          .

        echo "Built vLLM wheel:"
        ls -lh rocm-dist/*.whl

        # Copy wheel to artifacts directory
        mkdir -p artifacts/rocm-vllm-wheel
        cp rocm-dist/*.whl artifacts/rocm-vllm-wheel/
        echo "Final vLLM wheel:"
        ls -lh artifacts/rocm-vllm-wheel/
    artifact_paths:
      - "artifacts/rocm-vllm-wheel/*.whl"
      - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:latest --progress plain --target vllm-openai -f docker/Dockerfile.cpu ."
      - "docker push public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:latest"
      - "docker push public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
    env:
      DOCKER_BUILDKIT: "1"
      S3_BUCKET: "vllm-wheels"

  # ROCm Job 3: Upload Wheels to S3
  - label: ":s3: Upload ROCm Wheels to S3"
    id: upload-rocm-wheels
  - label: "Build and publish nightly multi-arch image to DockerHub"
    depends_on:
      - step: build-rocm-vllm-wheel
        allow_failure: false
      - create-multi-arch-manifest
    if: build.env("NIGHTLY") == "1"
    agents:
      queue: cpu_queue_postmerge
    timeout_in_minutes: 60
    commands:
      # Download all wheel artifacts and run upload
      - |
        set -euo pipefail

        # Check if upload is enabled (from env var, meta-data, or release branch)
        ROCM_UPLOAD_WHEELS="$${ROCM_UPLOAD_WHEELS:-}"
        if [ -z "$${ROCM_UPLOAD_WHEELS}" ]; then
          # Try to get from meta-data (input form)
          ROCM_UPLOAD_WHEELS="$$(buildkite-agent meta-data get rocm-upload-wheels 2>/dev/null || echo '')"
        fi

        echo "========================================"
        echo "Upload check:"
        echo " ROCM_UPLOAD_WHEELS: $${ROCM_UPLOAD_WHEELS}"
        echo " BUILDKITE_BRANCH: $${BUILDKITE_BRANCH}"
        echo "========================================"

        # Skip upload if not enabled
        if [ "$${ROCM_UPLOAD_WHEELS}" != "true" ]; then
          echo "Skipping S3 upload (ROCM_UPLOAD_WHEELS != true, NIGHTLY != 1, not a release branch)"
          echo "To enable upload, set 'Upload Wheels to S3' to 'Yes' in the build configuration"
          exit 0
        fi

        echo "Upload enabled, proceeding..."

        # Download artifacts from current build
        echo "Downloading artifacts from current build"
        buildkite-agent artifact download "artifacts/rocm-base-wheels/*.whl" .
        buildkite-agent artifact download "artifacts/rocm-vllm-wheel/*.whl" .

        # Run upload script
        bash .buildkite/scripts/upload-rocm-wheels.sh
      - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
      - "docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-x86_64"
      - "docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-aarch64"
      - "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-x86_64 vllm/vllm-openai:nightly-x86_64"
      - "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-aarch64 vllm/vllm-openai:nightly-aarch64"
      - "docker push vllm/vllm-openai:nightly-x86_64"
      - "docker push vllm/vllm-openai:nightly-aarch64"
      - "docker manifest create vllm/vllm-openai:nightly vllm/vllm-openai:nightly-x86_64 vllm/vllm-openai:nightly-aarch64 --amend"
      - "docker manifest create vllm/vllm-openai:nightly-$BUILDKITE_COMMIT vllm/vllm-openai:nightly-x86_64 vllm/vllm-openai:nightly-aarch64 --amend"
      - "docker manifest push vllm/vllm-openai:nightly"
      - "docker manifest push vllm/vllm-openai:nightly-$BUILDKITE_COMMIT"
      # Clean up old nightly builds (keep only last 14)
      - "bash .buildkite/scripts/cleanup-nightly-builds.sh"
    plugins:
      - docker-login#v3.0.0:
          username: vllmbot
          password-env: DOCKERHUB_TOKEN
    env:
      DOCKER_BUILDKIT: "1"
      S3_BUCKET: "vllm-wheels"

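For reference, the tags the nightly publish commands above produce are consumed with a plain pull (the multi-arch `nightly` tag resolves to the right architecture automatically):

```bash
docker pull vllm/vllm-openai:nightly            # multi-arch manifest
docker pull vllm/vllm-openai:nightly-aarch64    # arch-specific tag, if needed explicitly
```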
  # ROCm Job 4: Annotate ROCm Wheel Release
  - label: ":memo: Annotate ROCm wheel release"
    id: annotate-rocm-release
    depends_on:
      - step: upload-rocm-wheels
        allow_failure: true
      - step: input-release-version
        allow_failure: true
    agents:
      queue: cpu_queue_postmerge
    commands:
      - "bash .buildkite/scripts/annotate-rocm-release.sh"
    env:
      S3_BUCKET: "vllm-wheels"

  # ROCm Job 5: Generate Root Index for ROCm Wheels (for release only)
  # This is the job to create https://wheels.vllm.ai/rocm/ index allowing
  # users to install with `uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/`
  - block: "Generate Root Index for ROCm Wheels for Release"
    key: block-generate-root-index-rocm-wheels
    depends_on: upload-rocm-wheels

  - label: ":package: Generate Root Index for ROCm Wheels for Release"
    depends_on: block-generate-root-index-rocm-wheels
    id: generate-root-index-rocm-wheels
    agents:
      queue: cpu_queue_postmerge
    commands:
      - "bash tools/vllm-rocm/generate-rocm-wheels-root-index.sh"
    env:
      S3_BUCKET: "vllm-wheels"
      VARIANT: "rocm700"

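The end-user flow this index enables, taken directly from the comment above:

```bash
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
```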
# ROCm Job 5: Build ROCm Release Docker Image
|
||||
- label: ":docker: Build release image - x86_64 - ROCm"
|
||||
id: build-rocm-release-image
|
||||
depends_on:
|
||||
- step: build-rocm-base-wheels
|
||||
allow_failure: false
|
||||
agents:
|
||||
queue: cpu_queue_postmerge
|
||||
timeout_in_minutes: 60
|
||||
commands:
|
||||
- |
|
||||
set -euo pipefail
|
||||
|
||||
# Login to ECR
|
||||
aws ecr-public get-login-password --region us-east-1 | \
|
||||
docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7
|
||||
|
||||
# Download Docker image from S3 (set by build-rocm-base-wheels)
|
||||
DOCKER_IMAGE_S3_PATH="$$(buildkite-agent meta-data get rocm-docker-image-s3-path 2>/dev/null || echo '')"
        if [ -z "$${DOCKER_IMAGE_S3_PATH}" ]; then
          echo "ERROR: rocm-docker-image-s3-path metadata not found"
          exit 1
        fi

        echo "Downloading base image from $${DOCKER_IMAGE_S3_PATH}"
        mkdir -p artifacts/rocm-docker-image
        aws s3 cp "$${DOCKER_IMAGE_S3_PATH}" artifacts/rocm-docker-image/rocm-base-image.tar.gz

        # Load base Docker image
        echo "Loading base Docker image..."
        LOAD_OUTPUT=$$(gunzip -c artifacts/rocm-docker-image/rocm-base-image.tar.gz | docker load)
        BASE_IMAGE_TAG=$$(echo "$${LOAD_OUTPUT}" | grep "Loaded image:" | sed 's/Loaded image: //')
        echo "Loaded base image: $${BASE_IMAGE_TAG}"

        # Tag and push the base image to ECR
        docker tag "$${BASE_IMAGE_TAG}" public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm-base
        docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm-base
        echo "Pushed base image: public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm-base"

        # Get GPU architectures from meta-data
        PYTORCH_ROCM_ARCH="$$(buildkite-agent meta-data get rocm-pytorch-rocm-arch 2>/dev/null || echo '')"
        PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH:-gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151}"

        # Build vLLM ROCm release image using cached base
        DOCKER_BUILDKIT=1 docker build \
          --build-arg max_jobs=16 \
          --build-arg BASE_IMAGE="$${BASE_IMAGE_TAG}" \
          --build-arg ARG_PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH}" \
          --build-arg USE_SCCACHE=1 \
          --build-arg SCCACHE_BUCKET_NAME=vllm-build-sccache \
          --build-arg SCCACHE_REGION_NAME=us-west-2 \
          --build-arg SCCACHE_S3_NO_CREDENTIALS=0 \
          --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm \
          --target vllm-openai \
          --progress plain \
          -f docker/Dockerfile.rocm .

        # Push to ECR
        docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm
        echo "Pushed: public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm"
    env:
      DOCKER_BUILDKIT: "1"
      S3_BUCKET: "vllm-wheels"
      DOCKERHUB_USERNAME: "vllmbot"
@@ -11,36 +11,27 @@ fi
buildkite-agent annotate --style 'info' --context 'release-workflow' << EOF
To download the wheel (by commit):
\`\`\`
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux_2_31_x86_64.whl .
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux_2_31_aarch64.whl .
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux1_x86_64.whl .
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux2014_aarch64.whl .

(Optional) For CUDA 13.0:
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cu130-cp38-abi3-manylinux_2_35_x86_64.whl .
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cu130-cp38-abi3-manylinux_2_35_aarch64.whl .

(Optional) For CPU:
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl .
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cpu-cp38-abi3-manylinux_2_35_aarch64.whl .
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cu129-cp38-abi3-manylinux1_x86_64.whl .
\`\`\`

To download the wheel (by version):
\`\`\`
aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux1_x86_64.whl .
aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux2014_aarch64.whl .

aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}+cu126/vllm-${RELEASE_VERSION}+cu126-cp38-abi3-manylinux1_x86_64.whl .
aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}+cu129/vllm-${RELEASE_VERSION}+cu129-cp38-abi3-manylinux1_x86_64.whl .
\`\`\`

To download and upload the image:

\`\`\`
# Download images:

docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64-cu130
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64-cu130
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm-base
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm
docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v${RELEASE_VERSION}
docker pull public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:v${RELEASE_VERSION}

# Tag and push images:

## CUDA

docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64 vllm/vllm-openai:x86_64
docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:latest-x86_64
@@ -48,70 +39,15 @@ docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64
docker push vllm/vllm-openai:latest-x86_64
docker push vllm/vllm-openai:v${RELEASE_VERSION}-x86_64

docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64-cu130 vllm/vllm-openai:x86_64-cu130
docker tag vllm/vllm-openai:x86_64-cu130 vllm/vllm-openai:latest-x86_64-cu130
docker tag vllm/vllm-openai:x86_64-cu130 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64-cu130
docker push vllm/vllm-openai:latest-x86_64-cu130
docker push vllm/vllm-openai:v${RELEASE_VERSION}-x86_64-cu130

docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64 vllm/vllm-openai:aarch64
docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:latest-aarch64
docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
docker push vllm/vllm-openai:latest-aarch64
docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64

docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64-cu130 vllm/vllm-openai:aarch64-cu130
docker tag vllm/vllm-openai:aarch64-cu130 vllm/vllm-openai:latest-aarch64-cu130
docker tag vllm/vllm-openai:aarch64-cu130 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64-cu130
docker push vllm/vllm-openai:latest-aarch64-cu130
docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64-cu130

## ROCm

docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT} vllm/vllm-openai-rocm:latest
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT} vllm/vllm-openai-rocm:v${RELEASE_VERSION}
docker push vllm/vllm-openai-rocm:latest
docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}

docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm-base vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:latest-base
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base
docker push vllm/vllm-openai-rocm:latest-base
docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base

## CPU

docker tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v${RELEASE_VERSION} vllm/vllm-openai-cpu:x86_64
docker tag vllm/vllm-openai-cpu:x86_64 vllm/vllm-openai-cpu:latest-x86_64
docker tag vllm/vllm-openai-cpu:x86_64 vllm/vllm-openai-cpu:v${RELEASE_VERSION}-x86_64
docker push vllm/vllm-openai-cpu:latest-x86_64
docker push vllm/vllm-openai-cpu:v${RELEASE_VERSION}-x86_64

docker tag public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:v${RELEASE_VERSION} vllm/vllm-openai-cpu:arm64
docker tag vllm/vllm-openai-cpu:arm64 vllm/vllm-openai-cpu:latest-arm64
docker tag vllm/vllm-openai-cpu:arm64 vllm/vllm-openai-cpu:v${RELEASE_VERSION}-arm64
docker push vllm/vllm-openai-cpu:latest-arm64
docker push vllm/vllm-openai-cpu:v${RELEASE_VERSION}-arm64

# Create multi-arch manifest:

docker manifest rm vllm/vllm-openai:latest
docker manifest create vllm/vllm-openai:latest vllm/vllm-openai:latest-x86_64 vllm/vllm-openai:latest-aarch64
docker manifest create vllm/vllm-openai:v${RELEASE_VERSION} vllm/vllm-openai:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
docker manifest create vllm/vllm-openai:latest vllm/vllm-openai:latest-x86_64 vllm/vllm-openai:latest-aarch64 --amend
docker manifest create vllm/vllm-openai:v${RELEASE_VERSION} vllm/vllm-openai:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64 --amend
docker manifest push vllm/vllm-openai:latest
docker manifest push vllm/vllm-openai:v${RELEASE_VERSION}

docker manifest rm vllm/vllm-openai:latest-cu130
docker manifest create vllm/vllm-openai:latest-cu130 vllm/vllm-openai:latest-x86_64-cu130 vllm/vllm-openai:latest-aarch64-cu130
docker manifest create vllm/vllm-openai:v${RELEASE_VERSION}-cu130 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64-cu130 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64-cu130
docker manifest push vllm/vllm-openai:latest-cu130
docker manifest push vllm/vllm-openai:v${RELEASE_VERSION}-cu130

docker manifest rm vllm/vllm-openai-cpu:latest || true
docker manifest create vllm/vllm-openai-cpu:latest vllm/vllm-openai-cpu:latest-x86_64 vllm/vllm-openai-cpu:latest-arm64
docker manifest create vllm/vllm-openai-cpu:v${RELEASE_VERSION} vllm/vllm-openai-cpu:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai-cpu:v${RELEASE_VERSION}-arm64
docker manifest push vllm/vllm-openai-cpu:latest
docker manifest push vllm/vllm-openai-cpu:v${RELEASE_VERSION}
\`\`\`
EOF
@@ -1,112 +0,0 @@
#!/bin/bash
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
#
# Generate Buildkite annotation for ROCm wheel release
set -ex

# Get build configuration from meta-data
# Extract ROCm version dynamically from Dockerfile.rocm_base
# BASE_IMAGE format: rocm/dev-ubuntu-22.04:7.0-complete -> extracts "7.0"
ROCM_VERSION=$(grep -E '^ARG BASE_IMAGE=' docker/Dockerfile.rocm_base | sed -E 's/.*:([0-9]+\.[0-9]+).*/\1/' || echo "unknown")
PYTHON_VERSION=$(buildkite-agent meta-data get rocm-python-version 2>/dev/null || echo "3.12")
PYTORCH_ROCM_ARCH=$(buildkite-agent meta-data get rocm-pytorch-rocm-arch 2>/dev/null || echo "gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151")

# TODO: Enable the nightly build for ROCm
# Get release version, default to 1.0.0.dev for nightly/per-commit builds
RELEASE_VERSION=$(buildkite-agent meta-data get release-version 2>/dev/null || echo "")
if [ -z "${RELEASE_VERSION}" ]; then
    RELEASE_VERSION="1.0.0.dev"
fi

# S3 URLs
S3_BUCKET="${S3_BUCKET:-vllm-wheels}"
S3_REGION="${AWS_DEFAULT_REGION:-us-west-2}"
S3_URL="http://${S3_BUCKET}.s3-website-${S3_REGION}.amazonaws.com"
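# (Illustrative: with the defaults above this resolves to
#  http://vllm-wheels.s3-website-us-west-2.amazonaws.com)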

# Format ROCm version for path (e.g., "7.1" -> "rocm71")
ROCM_VERSION_PATH="rocm$(echo ${ROCM_VERSION} | tr -d '.')"
ROCM_PATH="rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}"
buildkite-agent annotate --style 'success' --context 'rocm-release-workflow' << EOF
## ROCm Wheel and Docker Image Releases
### Build Configuration
| Setting | Value |
|---------|-------|
| **ROCm Version** | ${ROCM_VERSION} |
| **Python Version** | ${PYTHON_VERSION} |
| **GPU Architectures** | ${PYTORCH_ROCM_ARCH} |
| **Branch** | \`${BUILDKITE_BRANCH}\` |
| **Commit** | \`${BUILDKITE_COMMIT}\` |

### :package: Installation

**Install from this build (by commit):**

\`\`\`bash
pip install vllm --extra-index-url ${S3_URL}/${ROCM_PATH}/ --trusted-host ${S3_BUCKET}.s3-website-${S3_REGION}.amazonaws.com

# Example for ROCm ${ROCM_VERSION}:
pip install vllm --extra-index-url ${S3_URL}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/ --trusted-host ${S3_BUCKET}.s3-website-${S3_REGION}.amazonaws.com
\`\`\`

**Install from nightly (if published):**

\`\`\`bash
pip install vllm --extra-index-url ${S3_URL}/rocm/nightly/ --trusted-host ${S3_BUCKET}.s3-website-${S3_REGION}.amazonaws.com
\`\`\`

### :floppy_disk: Download Wheels Directly

\`\`\`bash
# List all ROCm wheels
aws s3 ls s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/
# Download specific wheels
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/vllm-*.whl .
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/torch-*.whl .
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/triton-*.whl .
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/triton-kernels-*.whl .
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/torchvision-*.whl .
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/torchaudio-*.whl .
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/amdsmi-*.whl .
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/aiter-*.whl .
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/flash-attn-*.whl .
\`\`\`

### :gear: Included Packages
- **vllm**: vLLM with ROCm support
- **torch**: PyTorch built for ROCm ${ROCM_VERSION}
- **triton**: Triton
- **triton-kernels**: Triton kernels
- **torchvision**: TorchVision for ROCm PyTorch
- **torchaudio**: Torchaudio for ROCm PyTorch
- **amdsmi**: AMD SMI Python bindings
- **aiter**: Aiter for ROCm
- **flash-attn**: Flash Attention for ROCm

### :warning: Notes
- These wheels are built for **ROCm ${ROCM_VERSION}** and will NOT work with CUDA GPUs
- Supported GPU architectures: ${PYTORCH_ROCM_ARCH}
- Platform: Linux x86_64 only

### :package: Docker Image Release

To download and upload the image:

\`\`\`
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm-base
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm

docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm-base vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:latest-base
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base
docker push vllm/vllm-openai-rocm:latest-base
docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base

docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT} vllm/vllm-openai-rocm:latest
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT} vllm/vllm-openai-rocm:v${RELEASE_VERSION}
docker push vllm/vllm-openai-rocm:latest
docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}
\`\`\`

EOF
@@ -1,140 +0,0 @@
#!/usr/bin/env bash
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
#
# Cache helper for ROCm base wheels
#
# This script manages caching of pre-built ROCm base wheels (torch, triton, etc.)
# to avoid rebuilding them when Dockerfile.rocm_base hasn't changed.
#
# Usage:
#   cache-rocm-base-wheels.sh check     - Check if cache exists, outputs "hit" or "miss"
#   cache-rocm-base-wheels.sh upload    - Upload wheels to cache
#   cache-rocm-base-wheels.sh download  - Download wheels from cache
#   cache-rocm-base-wheels.sh key       - Output the cache key
#
# Environment variables:
#   S3_BUCKET         - S3 bucket name (default: vllm-wheels)
#   PYTHON_VERSION    - Python version (affects cache key)
#   PYTORCH_ROCM_ARCH - GPU architectures (affects cache key)
#
# Note: ROCm version is determined by BASE_IMAGE in Dockerfile.rocm_base,
# so changes to ROCm version are captured by the Dockerfile hash.

set -euo pipefail

BUCKET="${S3_BUCKET:-vllm-wheels}"
DOCKERFILE="docker/Dockerfile.rocm_base"
CACHE_PREFIX="rocm/cache"

# Generate hash from Dockerfile content + build args
generate_cache_key() {
    # Include Dockerfile content
    if [[ ! -f "$DOCKERFILE" ]]; then
        echo "ERROR: Dockerfile not found: $DOCKERFILE" >&2
        exit 1
    fi
    local dockerfile_hash=$(sha256sum "$DOCKERFILE" | cut -c1-16)

    # Include key build args that affect the output
    # These should match the ARGs in Dockerfile.rocm_base that change the build output
    # Note: ROCm version is determined by BASE_IMAGE in the Dockerfile, so it's captured by dockerfile_hash
    local args_string="${PYTHON_VERSION:-}|${PYTORCH_ROCM_ARCH:-}"
    local args_hash=$(echo "$args_string" | sha256sum | cut -c1-8)

    echo "${dockerfile_hash}-${args_hash}"
}
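
# Illustrative usage sketch (not part of this script): a pipeline step could
# gate the expensive base-wheel build on the cache state, e.g.
#
#   if [[ "$(cache-rocm-base-wheels.sh check)" == "hit" ]]; then
#       cache-rocm-base-wheels.sh download
#   else
#       build_base_wheels  # hypothetical build step
#       cache-rocm-base-wheels.sh upload
#   fi
#
# ("check" prints its diagnostics to stderr, so the command substitution above
# captures only the final "hit"/"miss" verdict.)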

CACHE_KEY=$(generate_cache_key)
CACHE_PATH="s3://${BUCKET}/${CACHE_PREFIX}/${CACHE_KEY}/"

case "${1:-}" in
    check)
        echo "Checking cache for key: ${CACHE_KEY}" >&2
        echo "Cache path: ${CACHE_PATH}" >&2
        echo "Variables used in cache key:" >&2
        echo "  PYTHON_VERSION: ${PYTHON_VERSION:-<not set>}" >&2
        echo "  PYTORCH_ROCM_ARCH: ${PYTORCH_ROCM_ARCH:-<not set>}" >&2

        # Check if cache exists by listing objects
        # We look for at least one .whl file
        echo "Running: aws s3 ls ${CACHE_PATH}" >&2
        S3_OUTPUT=$(aws s3 ls "${CACHE_PATH}" 2>&1) || true
        echo "S3 ls output:" >&2
        echo "$S3_OUTPUT" | head -5 >&2

        if echo "$S3_OUTPUT" | grep -q "\.whl"; then
            echo "hit"
        else
            echo "miss"
        fi
        ;;

    upload)
        echo "========================================"
        echo "Uploading wheels to cache"
        echo "========================================"
        echo "Cache key: ${CACHE_KEY}"
        echo "Cache path: ${CACHE_PATH}"
        echo ""

        if [[ ! -d "artifacts/rocm-base-wheels" ]]; then
            echo "ERROR: artifacts/rocm-base-wheels directory not found" >&2
            exit 1
        fi

        WHEEL_COUNT=$(ls artifacts/rocm-base-wheels/*.whl 2>/dev/null | wc -l)
        if [[ "$WHEEL_COUNT" -eq 0 ]]; then
            echo "ERROR: No wheels found in artifacts/rocm-base-wheels/" >&2
            exit 1
        fi

        echo "Uploading $WHEEL_COUNT wheels..."
        aws s3 cp --recursive artifacts/rocm-base-wheels/ "${CACHE_PATH}"

        echo ""
        echo "Cache upload complete!"
        echo "========================================"
        ;;

    download)
        echo "========================================"
        echo "Downloading wheels from cache"
        echo "========================================"
        echo "Cache key: ${CACHE_KEY}"
        echo "Cache path: ${CACHE_PATH}"
        echo ""

        mkdir -p artifacts/rocm-base-wheels
        aws s3 cp --recursive "${CACHE_PATH}" artifacts/rocm-base-wheels/

        echo ""
        echo "Downloaded wheels:"
        ls -lh artifacts/rocm-base-wheels/

        WHEEL_COUNT=$(ls artifacts/rocm-base-wheels/*.whl 2>/dev/null | wc -l)
        echo ""
        echo "Total: $WHEEL_COUNT wheels"
        echo "========================================"
        ;;

    key)
        echo "${CACHE_KEY}"
        ;;

    path)
        echo "${CACHE_PATH}"
        ;;

    *)
        echo "Usage: $0 {check|upload|download|key|path}" >&2
        echo "" >&2
        echo "Commands:" >&2
        echo "  check    - Check if cache exists, outputs 'hit' or 'miss'" >&2
        echo "  upload   - Upload wheels from artifacts/rocm-base-wheels/ to cache" >&2
        echo "  download - Download wheels from cache to artifacts/rocm-base-wheels/" >&2
        echo "  key      - Output the cache key" >&2
        echo "  path     - Output the full S3 cache path" >&2
        exit 1
        ;;
esac
@@ -1,242 +0,0 @@
#!/bin/bash
#
# cherry-pick-from-milestone.sh
# Find commits from a GitHub milestone that are missing from the current branch
# and output them in chronological order for cherry-picking.
#
# Usage: ./cherry-pick-from-milestone.sh <milestone> [--dry-run] [--execute]
#

set -euo pipefail

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

usage() {
    cat <<EOF
Usage: $(basename "$0") <milestone> [options]

Find commits from a GitHub milestone that need to be cherry-picked into the current branch.

Arguments:
  milestone       The GitHub milestone name (e.g., v0.14.0)

Options:
  --dry-run       Show the cherry-pick commands without executing (default)
  --execute       Actually execute the cherry-picks
  --main-branch   Specify the main branch name (default: main)
  --help          Show this help message

Examples:
  $(basename "$0") v0.14.0
  $(basename "$0") v0.14.0 --dry-run
  $(basename "$0") v0.14.0 --execute
  $(basename "$0") v0.14.0 --main-branch master
EOF
    exit 1
}

log_info() {
    echo -e "${BLUE}[INFO]${NC} $1"
}

log_success() {
    echo -e "${GREEN}[OK]${NC} $1"
}

log_warn() {
    echo -e "${YELLOW}[WARN]${NC} $1"
}

log_error() {
    echo -e "${RED}[ERROR]${NC} $1" >&2
}

# Default values
MILESTONE=""
DRY_RUN=true
MAIN_BRANCH="main"

# Parse arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --dry-run)
            DRY_RUN=true
            shift
            ;;
        --execute)
            DRY_RUN=false
            shift
            ;;
        --main-branch)
            MAIN_BRANCH="$2"
            shift 2
            ;;
        --help|-h)
            usage
            ;;
        -*)
            log_error "Unknown option: $1"
            usage
            ;;
        *)
            if [[ -z "$MILESTONE" ]]; then
                MILESTONE="$1"
            else
                log_error "Unexpected argument: $1"
                usage
            fi
            shift
            ;;
    esac
done

# Validate milestone argument
if [[ -z "$MILESTONE" ]]; then
    log_error "Milestone is required"
    usage
fi

# Check if we're in a git repository
if ! git rev-parse --is-inside-work-tree &>/dev/null; then
    log_error "Not in a git repository"
    exit 1
fi

# Check if gh CLI is available
if ! command -v gh &>/dev/null; then
    log_error "GitHub CLI (gh) is not installed"
    exit 1
fi

# Check if authenticated with gh
if ! gh auth status &>/dev/null; then
    log_error "Not authenticated with GitHub CLI. Run 'gh auth login' first."
    exit 1
fi

CURRENT_BRANCH=$(git branch --show-current)
log_info "Current branch: ${CURRENT_BRANCH}"
log_info "Main branch: ${MAIN_BRANCH}"
log_info "Milestone: ${MILESTONE}"
echo ""

# Fetch latest from remote
log_info "Fetching latest from remote..."
git fetch origin "$MAIN_BRANCH" --quiet

# Get merged PRs from the milestone, sorted by merge date
log_info "Fetching merged PRs from milestone '${MILESTONE}'..."

# Store PR data in a temp file
PR_DATA=$(mktemp)
trap "rm -f $PR_DATA" EXIT

if ! gh pr list --state merged --search "milestone:${MILESTONE}" \
    --limit 1000 \
    --json number,title,mergeCommit,mergedAt \
    --jq 'sort_by(.mergedAt) | .[] | "\(.mergeCommit.oid)\t\(.number)\t\(.title)"' > "$PR_DATA" 2>/dev/null; then
    log_error "Failed to fetch PRs from milestone '${MILESTONE}'"
    log_error "This could be due to:"
    log_error "  - Milestone does not exist"
    log_error "  - Network/authentication issues"
    log_error "  - Invalid milestone name format"
    exit 1
fi
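
# At this point "$PR_DATA" holds one tab-separated line per merged PR, sorted
# by merge time, of the form: <merge-commit-sha>\t<pr-number>\t<pr-title>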

if [[ ! -s "$PR_DATA" ]]; then
    log_warn "No merged PRs found for milestone '${MILESTONE}'"
    exit 0
fi

TOTAL_PRS=$(wc -l < "$PR_DATA")
log_info "Found ${TOTAL_PRS} merged PR(s) in milestone"
echo ""

# Find commits that are missing from current branch
MISSING_COMMITS=()
MISSING_INFO=()

while IFS=$'\t' read -r sha pr_number title; do
    # Skip if SHA is empty or null
    if [[ -z "$sha" || "$sha" == "null" ]]; then
        log_warn "PR #${pr_number} has no merge commit SHA, skipping"
        continue
    fi

    # Check if this commit is already in the current branch
    if git merge-base --is-ancestor "$sha" HEAD 2>/dev/null; then
        log_success "PR #${pr_number} already in branch: ${title:0:60}"
    else
        log_warn "PR #${pr_number} MISSING: ${title:0:60}"
        MISSING_COMMITS+=("$sha")
        MISSING_INFO+=("$sha PR #${pr_number}: ${title}")
    fi
done < "$PR_DATA"

echo ""

if [[ ${#MISSING_COMMITS[@]} -eq 0 ]]; then
    log_success "All PRs from milestone '${MILESTONE}' are already in the current branch!"
    exit 0
fi

log_info "Found ${#MISSING_COMMITS[@]} missing commit(s) to cherry-pick"
echo ""

# Output the cherry-pick commands
echo "=========================================="
echo "Cherry-pick commands (in chronological order):"
echo "=========================================="
echo ""

for info in "${MISSING_INFO[@]}"; do
    echo "# $info"
done
echo ""

echo "# Run these commands to cherry-pick all missing commits:"
echo "git cherry-pick ${MISSING_COMMITS[*]}"
echo ""

# Or one by one
echo "# Or cherry-pick one at a time:"
for sha in "${MISSING_COMMITS[@]}"; do
    echo "git cherry-pick $sha"
done
echo ""

# Execute if requested
if [[ "$DRY_RUN" == false ]]; then
    echo "=========================================="
    log_info "Executing cherry-picks..."
    echo "=========================================="

    for i in "${!MISSING_COMMITS[@]}"; do
        sha="${MISSING_COMMITS[$i]}"
        info="${MISSING_INFO[$i]}"

        echo ""
        log_info "Cherry-picking: $info"

        if git cherry-pick "$sha"; then
            log_success "Successfully cherry-picked $sha"
        else
            log_error "Failed to cherry-pick $sha"
            log_error "Resolve conflicts and run 'git cherry-pick --continue', or 'git cherry-pick --abort' to cancel"
            exit 1
        fi
    done

    echo ""
    log_success "All cherry-picks completed successfully!"
else
    echo "=========================================="
    echo -e "${YELLOW}Dry run mode - no changes made${NC}"
    echo "Run with --execute to perform the cherry-picks"
    echo "=========================================="
fi
@@ -3,14 +3,7 @@
set -ex

# Clean up old nightly builds from DockerHub, keeping only the last 14 builds
# This script uses the DockerHub API to list and delete old tags with the specified prefix
# Usage: cleanup-nightly-builds.sh [TAG_PREFIX]
# Example: cleanup-nightly-builds.sh "nightly-" or cleanup-nightly-builds.sh "cu130-nightly-"

# Get tag prefix from argument, default to "nightly-" if not provided
TAG_PREFIX="${1:-nightly-}"

echo "Cleaning up tags with prefix: $TAG_PREFIX"
# This script uses the DockerHub API to list and delete old tags with the "nightly-" prefix

# DockerHub API endpoint for vllm/vllm-openai repository
REPO_API_URL="https://hub.docker.com/v2/repositories/vllm/vllm-openai/tags"
@@ -52,7 +45,7 @@ get_all_tags() {
    set -x

    # Get both last_updated timestamp and tag name, separated by |
    local tags=$(echo "$response" | jq -r --arg prefix "$TAG_PREFIX" '.results[] | select(.name | startswith($prefix)) | "\(.last_updated)|\(.name)"')
    local tags=$(echo "$response" | jq -r '.results[] | select(.name | startswith("nightly-")) | "\(.last_updated)|\(.name)"')

    if [ -z "$tags" ]; then
        break

@@ -1,468 +0,0 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

# do not complain about line length (for docstring)
# ruff: noqa: E501

import argparse
import json
import sys
from dataclasses import asdict, dataclass
from datetime import datetime
from pathlib import Path
from typing import Any
from urllib.parse import quote

import regex as re


def normalize_package_name(name: str) -> str:
    """
    Normalize package name according to PEP 503.
    https://peps.python.org/pep-0503/#normalized-names

    Replace runs of underscores, hyphens, and periods with a single hyphen,
    and lowercase the result.
    """
    return re.sub(r"[-_.]+", "-", name).lower()
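

# Illustrative example (not in the original file):
#   normalize_package_name("Triton_Kernels") == "triton-kernels"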

if not sys.version_info >= (3, 12):
    raise RuntimeError("This script requires Python 3.12 or higher.")

INDEX_HTML_TEMPLATE = """<!DOCTYPE html>
<html>
<!-- {comment} -->
<meta name="pypi:repository-version" content="1.0">
<body>
{items}
</body>
</html>
"""


@dataclass
class WheelFileInfo:
    package_name: str
    version: str
    build_tag: str | None
    python_tag: str
    abi_tag: str
    platform_tag: str
    variant: str | None
    filename: str


def parse_from_filename(file: str) -> WheelFileInfo:
    """
    Parse wheel file name to extract metadata.

    The format of wheel names:
    {package_name}-{version}(-{build_tag})?-{python_tag}-{abi_tag}-{platform_tag}.whl
    All versions could contain a variant like '+cu129' or '.cpu' or '.rocm' (or not).
    Example:
        vllm-0.11.0-cp38-abi3-manylinux1_x86_64.whl
        vllm-0.10.2rc2+cu129-cp38-abi3-manylinux2014_aarch64.whl
        vllm-0.11.1rc8.dev14+gaa384b3c0-cp38-abi3-manylinux2014_aarch64.whl
        vllm-0.11.1rc8.dev14+gaa384b3c0.cu130-cp38-abi3-manylinux1_x86_64.whl
    """
    wheel_file_re = re.compile(
        r"^(?P<package_name>.+)-(?P<version>[^-]+?)(-(?P<build_tag>[^-]+))?-(?P<python_tag>[^-]+)-(?P<abi_tag>[^-]+)-(?P<platform_tag>[^-]+)\.whl$"
    )
    match = wheel_file_re.match(file)
    if not match:
        raise ValueError(f"Invalid wheel file name: {file}")

    package_name = match.group("package_name")
    version = match.group("version")
    build_tag = match.group("build_tag")
    python_tag = match.group("python_tag")
    abi_tag = match.group("abi_tag")
    platform_tag = match.group("platform_tag")

    # extract variant from version
    variant = None
    if "dev" in version:
        ver_after_dev = version.split("dev")[-1]
        if "." in ver_after_dev:
            variant = ver_after_dev.split(".")[-1]
            version = version.removesuffix("." + variant)
    else:
        if "+" in version:
            version_part, suffix = version.split("+", 1)
            # Only treat known patterns as variants (rocmXXX, cuXXX, cpu)
            # Git hashes and other suffixes are NOT variants
            if suffix.startswith(("rocm", "cu", "cpu")):
                variant = suffix
                version = version_part
            # Otherwise keep the full version string (variant stays None)

    return WheelFileInfo(
        package_name=package_name,
        version=version,
        build_tag=build_tag,
        python_tag=python_tag,
        abi_tag=abi_tag,
        platform_tag=platform_tag,
        variant=variant,
        filename=file,
    )
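

# Illustrative parses, matching the docstring examples above (not in the
# original file):
#   parse_from_filename("vllm-0.10.2rc2+cu129-cp38-abi3-manylinux2014_aarch64.whl")
#       -> version "0.10.2rc2", variant "cu129"
#   parse_from_filename("vllm-0.11.1rc8.dev14+gaa384b3c0-cp38-abi3-manylinux2014_aarch64.whl")
#       -> version "0.11.1rc8.dev14+gaa384b3c0", variant None (git hash, not a variant)
#   parse_from_filename("vllm-0.11.1rc8.dev14+gaa384b3c0.cu130-cp38-abi3-manylinux1_x86_64.whl")
#       -> version "0.11.1rc8.dev14+gaa384b3c0", variant "cu130"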


def generate_project_list(subdir_names: list[str], comment: str = "") -> str:
    """
    Generate project list HTML content linking to each project & variant subdirectory.
    """
    href_tags = []
    for name in sorted(subdir_names):
        name = name.strip("/").strip(".")
        href_tags.append(f' <a href="{name}/">{name}/</a><br/>')
    return INDEX_HTML_TEMPLATE.format(items="\n".join(href_tags), comment=comment)


def generate_package_index_and_metadata(
    wheel_files: list[WheelFileInfo],
    wheel_base_dir: Path,
    index_base_dir: Path,
    comment: str = "",
) -> tuple[str, str]:
    """
    Generate package index HTML content for a specific package, linking to actual wheel files.
    """
    href_tags = []
    metadata = []
    for file in sorted(wheel_files, key=lambda x: x.filename):
        relative_path = (
            wheel_base_dir.relative_to(index_base_dir, walk_up=True) / file.filename
        )
        # handle '+' in URL, and avoid double-encoding '/' and already-encoded '%2B'
        # NOTE: this is AWS S3 specific behavior!
        file_path_quoted = quote(relative_path.as_posix(), safe=":%/")
        href_tags.append(f' <a href="{file_path_quoted}">{file.filename}</a><br/>')
        file_meta = asdict(file)
        file_meta["path"] = file_path_quoted
        metadata.append(file_meta)
    index_str = INDEX_HTML_TEMPLATE.format(items="\n".join(href_tags), comment=comment)
    metadata_str = json.dumps(metadata, indent=2)
    return index_str, metadata_str


def generate_index_and_metadata(
    whl_files: list[str],
    wheel_base_dir: Path,
    index_base_dir: Path,
    default_variant: str | None = None,
    alias_to_default: str | None = None,
    comment: str = "",
):
    """
    Generate index for all wheel files.

    Args:
        whl_files (list[str]): List of wheel files (must be directly under `wheel_base_dir`).
        wheel_base_dir (Path): Base directory for wheel files.
        index_base_dir (Path): Base directory to store index files.
        default_variant (str | None): The default variant name, if any.
        alias_to_default (str | None): Alias variant name for the default variant, if any.
        comment (str | None): Optional comment to include in the generated HTML files.

    First, parse all wheel files to extract metadata.
    We need to collect all wheel files for each variant, and generate an index for it (in a subdirectory).
    The index for the default variant (if any) is generated in the root index directory.

    If `default_variant` is provided, all wheels must have variant suffixes, and the default variant index
    is purely a copy of the corresponding variant index, with only the links adjusted.
    Otherwise, all wheels without variant suffixes are treated as the default variant.

    If `alias_to_default` is provided, an additional alias subdirectory is created; it has the same content
    as the default variant index, but the links are adjusted accordingly.

    Index directory structure:
        index_base_dir/  (hosted at wheels.vllm.ai/{nightly,$commit,$version}/)
            index.html  # project list, linking to "vllm/" and other packages, and all variant subdirectories
            vllm/
                index.html  # package index, pointing to actual files in wheel_base_dir (relative path)
                metadata.json  # machine-readable metadata for all wheels in this package
            cpu/  # cpu variant subdirectory
                index.html
                vllm/
                    index.html
                    metadata.json
            cu129/  # cu129 is actually the alias to default variant
                index.html
                vllm/
                    index.html
                    metadata.json
            cu130/  # cu130 variant subdirectory
                index.html
                vllm/
                    index.html
                    metadata.json
            ...

    metadata.json stores a dump of all wheel files' metadata in a machine-readable format:
    [
        {
            "package_name": "vllm",
            "version": "0.10.2rc2",
            "build_tag": null,
            "python_tag": "cp38",
            "abi_tag": "abi3",
            "platform_tag": "manylinux2014_aarch64",
            "variant": "cu129",
            "filename": "vllm-0.10.2rc2+cu129-cp38-abi3-manylinux2014_aarch64.whl",
            "path": "../vllm-0.10.2rc2%2Bcu129-cp38-abi3-manylinux2014_aarch64.whl"  # to be concatenated with the directory URL and URL-encoded
        },
        ...
    ]
    """

    parsed_files = [parse_from_filename(f) for f in whl_files]

    if not parsed_files:
        print("No wheel files found, skipping index generation.")
        return

    # For ROCm builds: inherit variant from vllm wheel
    # All ROCm wheels should share the same variant as vllm
    rocm_variant = None
    for file in parsed_files:
        if (
            file.package_name == "vllm"
            and file.variant
            and file.variant.startswith("rocm")
        ):
            rocm_variant = file.variant
            print(f"Detected ROCm variant from vllm: {rocm_variant}")
            break

    # Apply ROCm variant to all wheels without a variant
    if rocm_variant:
        for file in parsed_files:
            if file.variant is None:
                file.variant = rocm_variant
                print(f"Inherited variant '{rocm_variant}' for {file.filename}")

    # Group by variant
    variant_to_files: dict[str, list[WheelFileInfo]] = {}
    for file in parsed_files:
        variant = file.variant or "default"
        if variant not in variant_to_files:
            variant_to_files[variant] = []
        variant_to_files[variant].append(file)

    print(f"Found variants: {list(variant_to_files.keys())}")

    # sanity check for default variant
    if default_variant:
        if "default" in variant_to_files:
            raise ValueError(
                "All wheel files must have variant suffixes when `default_variant` is specified."
            )
        if default_variant not in variant_to_files:
            raise ValueError(
                f"Default variant '{default_variant}' not found among wheel files."
            )

    if alias_to_default:
        if "default" not in variant_to_files:
            # e.g. only some wheels are uploaded to S3 currently
            print(
                "[WARN] Alias to default variant specified, but no default variant found."
            )
        elif alias_to_default in variant_to_files:
            raise ValueError(
                f"Alias variant name '{alias_to_default}' already exists among wheel files."
            )
        else:
            variant_to_files[alias_to_default] = variant_to_files["default"].copy()
            print(f"Alias variant '{alias_to_default}' created for default variant.")

    # Generate comment in HTML header
    comment_str = f" ({comment})" if comment else ""
    comment_tmpl = f"Generated on {datetime.now().isoformat()}{comment_str}"

    # Generate index for each variant
    subdir_names = set()
    for variant, files in variant_to_files.items():
        if variant == "default":
            variant_dir = index_base_dir
        else:
            variant_dir = index_base_dir / variant
            subdir_names.add(variant)

        variant_dir.mkdir(parents=True, exist_ok=True)

        # gather all package names in this variant (normalized per PEP 503)
        packages = set(normalize_package_name(f.package_name) for f in files)
        if variant == "default":
            # these packages should also appear in the "project list"
            # generate after all variants are processed
            subdir_names = subdir_names.union(packages)
        else:
            # generate project list for this variant directly
            project_list_str = generate_project_list(sorted(packages), comment_tmpl)
            with open(variant_dir / "index.html", "w") as f:
                f.write(project_list_str)

        for package in packages:
            # filter files belonging to this package only (compare normalized names)
            package_files = [
                f for f in files if normalize_package_name(f.package_name) == package
            ]
            package_dir = variant_dir / package
            package_dir.mkdir(parents=True, exist_ok=True)
            index_str, metadata_str = generate_package_index_and_metadata(
                package_files, wheel_base_dir, package_dir, comment
            )
            with open(package_dir / "index.html", "w") as f:
                f.write(index_str)
            with open(package_dir / "metadata.json", "w") as f:
                f.write(metadata_str)

    # Generate top-level project list index
    project_list_str = generate_project_list(sorted(subdir_names), comment_tmpl)
    with open(index_base_dir / "index.html", "w") as f:
        f.write(project_list_str)


if __name__ == "__main__":
    """
    Arguments:
        --version <version>                      : version string for the current build (e.g., commit hash)
        --wheel-dir <wheel_directory>            : directory containing wheel files (default to be same as `version`)
        --current-objects <path_to_json>         : path to JSON file containing current S3 objects listing in this version directory
        --output-dir <output_directory>          : directory to store generated index files
        --alias-to-default <alias_variant_name>  : (optional) alias variant name for the default variant
        --comment <comment_string>               : (optional) comment string to include in generated HTML files
    """

    parser = argparse.ArgumentParser(
        description="Process nightly build wheel files to generate indices."
    )
    parser.add_argument(
        "--version",
        type=str,
        required=True,
        help="Version string for the current build (e.g., commit hash)",
    )
    parser.add_argument(
        "--current-objects",
        type=str,
        required=True,
        help="Path to JSON file containing current S3 objects listing in this version directory",
    )
    parser.add_argument(
        "--output-dir",
        type=str,
        required=True,
        help="Directory to store generated index files",
    )
    parser.add_argument(
        "--wheel-dir",
        type=str,
        default=None,
        help="Directory containing wheel files (default to be same as `version`)",
    )
    parser.add_argument(
        "--alias-to-default",
        type=str,
        default=None,
        help="Alias variant name for the default variant",
    )
    parser.add_argument(
        "--comment",
        type=str,
        default="",
        help="Optional comment string to include in generated HTML files",
    )

    args = parser.parse_args()

    version = args.version
    # Allow rocm/ prefix, reject other slashes and all backslashes
    if "\\" in version:
        raise ValueError("Version string must not contain backslashes.")
    if "/" in version and not version.startswith("rocm/"):
        raise ValueError(
            "Version string must not contain slashes (except for the 'rocm/' prefix)."
        )
    current_objects_path = Path(args.current_objects)
    output_dir = Path(args.output_dir)
    if not output_dir.exists():
        output_dir.mkdir(parents=True, exist_ok=True)

    # Read current objects JSON
    with open(current_objects_path) as f:
        current_objects: dict[str, list[dict[str, Any]]] = json.load(f)

    # current_objects looks like the output of the list_objects_v2 S3 API:
    """
    "Contents": [
        {
            "Key": "e2f56c309d2a28899c68975a7e104502d56deb8f/vllm-0.11.2.dev363+ge2f56c309-cp38-abi3-manylinux1_x86_64.whl",
            "LastModified": "2025-11-28T14:00:32+00:00",
            "ETag": "\"37a38339c7cdb61ca737021b968075df-52\"",
            "ChecksumAlgorithm": [
                "CRC64NVME"
            ],
            "ChecksumType": "FULL_OBJECT",
            "Size": 435649349,
            "StorageClass": "STANDARD"
        },
        ...
    ]
    """

    # Extract wheel file keys
    wheel_files = []
    for item in current_objects.get("Contents", []):
        key: str = item["Key"]
        if key.endswith(".whl"):
            wheel_files.append(key.split("/")[-1])  # only the filename is used

    print(f"Found {len(wheel_files)} wheel files for version {version}: {wheel_files}")

    # keep only "official" files for a non-nightly version (specified by cli args)
    PY_VERSION_RE = re.compile(r"^\d+\.\d+\.\d+([a-zA-Z0-9.+-]*)?$")
    if PY_VERSION_RE.match(version):
        # upload-wheels.sh ensures no "dev" is in args.version
        wheel_files = list(
            filter(lambda x: version in x and "dev" not in x, wheel_files)
        )
        print(f"Non-nightly version detected, wheel files used: {wheel_files}")
    else:
        print("Nightly version detected, keeping all wheel files.")

    # Generate index and metadata, assuming wheels and indices are stored as:
    #   s3://vllm-wheels/{wheel_dir}/<wheel files>
    #   s3://vllm-wheels/<anything>/<index files>
    #
    # For ROCm builds, version is "rocm/{commit}" and indices are uploaded to:
    #   - rocm/{commit}/ (same as wheels)
    #   - rocm/nightly/
    #   - rocm/{version}/
    # All these are under the "rocm/" prefix, so relative paths should be
    # relative to "rocm/", not the bucket root.
    if args.wheel_dir:
        # Explicit wheel-dir provided (e.g., for version-specific indices pointing to commit dir)
        wheel_dir = args.wheel_dir.strip().rstrip("/")
    elif version.startswith("rocm/"):
        # For rocm/commit, wheel_base_dir should be just the commit part
        # so relative path from rocm/0.12.0/rocm710/vllm/ -> ../../../{commit}/
        wheel_dir = version.split("/", 1)[1]
    else:
        wheel_dir = version
    wheel_base_dir = Path(output_dir).parent / wheel_dir
    index_base_dir = Path(output_dir)

    generate_index_and_metadata(
        whl_files=wheel_files,
        wheel_base_dir=wheel_base_dir,
        index_base_dir=index_base_dir,
        default_variant=None,
        alias_to_default=args.alias_to_default,
        comment=args.comment.strip(),
    )
    print(f"Successfully generated index and metadata in {output_dir}")
@@ -44,17 +44,6 @@ cleanup_docker() {
    fi
}

cleanup_network() {
    for node in $(seq 0 $((NUM_NODES-1))); do
        if docker ps -a -q -f name="node${node}" | grep -q .; then
            docker stop "node${node}"
        fi
    done
    if docker network ls | grep docker-net; then
        docker network rm docker-net
    fi
}

# Call the cleanup docker function
cleanup_docker

@@ -87,7 +76,7 @@ mkdir -p "${HF_CACHE}"
HF_MOUNT="/root/.cache/huggingface"

commands=$@
echo "Raw commands: $commands"
echo "Commands:$commands"

commands=${commands//"pytest -v -s basic_correctness/test_basic_correctness.py"/"pytest -v -s basic_correctness/test_basic_correctness.py"}

@@ -152,6 +141,7 @@ if [[ $commands == *" entrypoints/openai "* ]]; then
    --ignore=entrypoints/openai/test_audio.py \
    --ignore=entrypoints/openai/test_shutdown.py \
    --ignore=entrypoints/openai/test_completion.py \
    --ignore=entrypoints/openai/test_sleep.py \
    --ignore=entrypoints/openai/test_models.py \
    --ignore=entrypoints/openai/test_lora_adapters.py \
    --ignore=entrypoints/openai/test_return_tokens_as_ids.py \
@@ -169,9 +159,6 @@ if [[ $commands == *" entrypoints/llm "* ]]; then
    --ignore=entrypoints/llm/test_prompt_validation.py "}
fi

commands=$(echo "$commands" | sed 's/ \\ / /g')
echo "Final commands: $commands"

# --ignore=entrypoints/openai/test_encoder_decoder.py \
# --ignore=entrypoints/openai/test_embedding.py \
# --ignore=entrypoints/openai/test_oot_registration.py
@@ -179,6 +166,7 @@ echo "Final commands: $commands"
# --ignore=entrypoints/openai/test_models.py <= Fails on MI250 but passes on MI300 as of 2025-03-13


PARALLEL_JOB_COUNT=8
MYPYTHONPATH=".."

# Test that we're launching on the machine that has
@@ -189,34 +177,45 @@ if [[ -z "$render_gid" ]]; then
    exit 1
fi

if [[ $commands == *"VLLM_TEST_GROUP_NAME=mi325_4-2-node-tests-4-gpus-in-total"* ]]; then

    export DCKR_VER=$(docker --version | sed 's/Docker version \(.*\), build .*/\1/')

    if [[ "$commands" =~ ^(.*)"["(.*)"] && ["(.*)"]"$ ]]; then
        prefix=$( echo "${BASH_REMATCH[1]}" | sed 's/;//g')
        echo "PREFIX: ${prefix}"
        export composite_command="(command rocm-smi || true)"
        myIFS=$IFS
        IFS=','
        read -ra node0 <<< ${BASH_REMATCH[2]}
        read -ra node1 <<< ${BASH_REMATCH[3]}
        IFS=$myIFS
        for i in "${!node0[@]}"; do
            command_node_0=$(echo ${node0[i]} | sed 's/\"//g')
            command_node_1=$(echo ${node1[i]} | sed 's/\"//g')

            export commands="./.buildkite/scripts/run-multi-node-test.sh /vllm-workspace/tests 2 2 ${image_name} '${command_node_0}' '${command_node_1}'"
            echo "COMMANDS: ${commands}"
            composite_command=$(echo "${composite_command} && ${commands}")
        done
        /bin/bash -c "${composite_command}"
        cleanup_network
    else
        echo "Failed to parse node commands! Exiting."
        cleanup_network
        exit 111
    fi
# Check if the command contains a shard flag; if so, run all shards in parallel because the host has 8 GPUs.
if [[ $commands == *"--shard-id="* ]]; then
    # assign job count as the number of shards used
    commands=$(echo "$commands" | sed -E "s/--num-shards[[:blank:]]*=[[:blank:]]*[0-9]*/--num-shards=${PARALLEL_JOB_COUNT} /g" | sed 's/ \\ / /g')
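    # e.g. (illustrative): "pytest ... --num-shards=2 --shard-id=0" becomes
    # "pytest ... --num-shards=8 --shard-id=${GPU}" for each GPU 0..7 in the loop below.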
    for GPU in $(seq 0 $(($PARALLEL_JOB_COUNT-1))); do
        # assign shard-id for each shard
        commands_gpu=$(echo "$commands" | sed -E "s/--shard-id[[:blank:]]*=[[:blank:]]*[0-9]*/--shard-id=${GPU} /g" | sed 's/ \\ / /g')
        echo "Shard ${GPU} commands:$commands_gpu"
        echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES"
        docker run \
            --device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
            --network=host \
            --shm-size=16gb \
            --group-add "$render_gid" \
            --rm \
            -e HIP_VISIBLE_DEVICES="${GPU}" \
            -e HF_TOKEN \
            -e AWS_ACCESS_KEY_ID \
            -e AWS_SECRET_ACCESS_KEY \
            -v "${HF_CACHE}:${HF_MOUNT}" \
            -e "HF_HOME=${HF_MOUNT}" \
            -e "PYTHONPATH=${MYPYTHONPATH}" \
            --name "${container_name}_${GPU}" \
            "${image_name}" \
            /bin/bash -c "${commands_gpu}" \
            |& while read -r line; do echo ">>Shard $GPU: $line"; done &
        PIDS+=($!)
    done
    # wait for all processes to finish and collect exit codes
    for pid in "${PIDS[@]}"; do
        wait "${pid}"
        STATUS+=($?)
    done
    for st in "${STATUS[@]}"; do
        if [[ ${st} -ne 0 ]]; then
            echo "One of the processes failed with $st"
            exit "${st}"
        fi
    done
else
    echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES"
    docker run \

@@ -1,26 +0,0 @@
#!/bin/bash
set -euox pipefail

echo "--- PP+TP"
vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -pp=2 &
server_pid=$!
timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
vllm bench serve \
    --backend vllm \
    --dataset-name random \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --num-prompts 20 \
    --endpoint /v1/completions
kill -s SIGTERM $server_pid &

echo "--- DP+TP"
vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -dp=2 &
server_pid=$!
timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
vllm bench serve \
    --backend vllm \
    --dataset-name random \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --num-prompts 20 \
    --endpoint /v1/completions
kill -s SIGTERM $server_pid &
@@ -1,68 +0,0 @@
#!/bin/bash

# This script builds the CPU docker image and runs the offline inference inside the container.
# It serves as a sanity check for compilation and basic model usage.
set -ex

# allow binding to different cores
CORE_RANGE=${CORE_RANGE:-0-16}
OMP_CORE_RANGE=${OMP_CORE_RANGE:-0-16}

export CMAKE_BUILD_PARALLEL_LEVEL=16

# Setup cleanup
remove_docker_container() {
    set -e;
    docker rm -f cpu-test || true;
}
trap remove_docker_container EXIT
remove_docker_container

# Try building the docker image
docker build --tag cpu-test --target vllm-test -f docker/Dockerfile.cpu .

# Run the image
docker run -itd --cpuset-cpus="$CORE_RANGE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=16 --env VLLM_CPU_CI_ENV=1 -e E2E_OMP_THREADS="$OMP_CORE_RANGE" --shm-size=4g --name cpu-test cpu-test

function cpu_tests() {
    set -e

    docker exec cpu-test bash -c "
        set -e
        pip list"

    # offline inference
    docker exec cpu-test bash -c "
        set -e
        python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m"

    # Run model tests
    docker exec cpu-test bash -c "
        set -e
        pytest -x -v -s tests/models/multimodal/generation/test_whisper.py -m cpu_model"

    # Run kernel tests
    docker exec cpu-test bash -c "
        set -e
        pytest -x -v -s tests/kernels/test_onednn.py
        pytest -x -v -s tests/kernels/attention/test_cpu_attn.py
        pytest -x -v -s tests/kernels/moe/test_moe.py -k test_cpu_fused_moe_basic"

    # basic online serving
    docker exec cpu-test bash -c '
        set -e
        VLLM_CPU_OMP_THREADS_BIND=$E2E_OMP_THREADS vllm serve Qwen/Qwen3-0.6B --max-model-len 2048 &
        server_pid=$!
        timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
        vllm bench serve \
            --backend vllm \
            --dataset-name random \
            --model Qwen/Qwen3-0.6B \
            --num-prompts 20 \
            --endpoint /v1/completions
        kill -s SIGTERM $server_pid &'
}

# All CPU tests are expected to finish in less than 40 minutes.
export -f cpu_tests
timeout 2h bash -c cpu_tests
@@ -25,22 +25,20 @@ function cpu_tests() {

    # offline inference
    podman exec -it "$container_id" bash -c "
        export TORCH_COMPILE_DISABLE=1
        set -xve
        python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m" >> $HOME/test_basic.log

    # Run basic model test
    podman exec -it "$container_id" bash -c "
        export TORCH_COMPILE_DISABLE=1
        set -evx
        pip install pytest pytest-asyncio einops peft Pillow soundfile transformers_stream_generator matplotlib
        pip install sentence-transformers datamodel_code_generator tblib
        pip install sentence-transformers datamodel_code_generator

        # Note: disable Bart until it supports V1
        # pytest -v -s tests/models/language/generation/test_bart.py -m cpu_model
        pytest -v -s tests/models/language/generation/test_common.py::test_models[False-False-5-32-openai-community/gpt2]
        pytest -v -s tests/models/language/generation/test_common.py::test_models[False-False-5-32-facebook/opt-125m]
        pytest -v -s tests/models/language/generation/test_common.py::test_models[False-False-5-32-google/gemma-1.1-2b-it]
        pytest -v -s tests/models/language/generation/test_common.py::test_models[False-5-32-openai-community/gpt2]
        pytest -v -s tests/models/language/generation/test_common.py::test_models[False-5-32-facebook/opt-125m]
        pytest -v -s tests/models/language/generation/test_common.py::test_models[False-5-32-google/gemma-1.1-2b-it]
        pytest -v -s tests/models/language/pooling/test_classification.py::test_models[float-jason9693/Qwen2.5-1.5B-apeach]
        # TODO: The test case tests/models/language/pooling/test_embedding.py::test_models[True-ssmits/Qwen2-7B-Instruct-embed-base] fails on ppc64le. Disabling it for the time being.
        # pytest -v -s tests/models/language/pooling/test_embedding.py -m cpu_model" >> $HOME/test_rest.log

@@ -2,19 +2,119 @@

# This script builds the CPU docker image and runs the offline inference inside the container.
# It serves as a sanity check for compilation and basic model usage.
set -euox pipefail
set -ex

# allow binding to different cores
CORE_RANGE=${CORE_RANGE:-48-95}
# used for TP/PP E2E test
OMP_CORE_RANGE=${OMP_CORE_RANGE:-48-95}
NUMA_NODE=${NUMA_NODE:-1}
IMAGE_NAME="cpu-test-$NUMA_NODE"
TIMEOUT_VAL=$1
TEST_COMMAND=$2
|
||||
|
||||
# building the docker image
|
||||
echo "--- :docker: Building Docker image"
|
||||
docker build --progress plain --tag "$IMAGE_NAME" --target vllm-test -f docker/Dockerfile.cpu .
|
||||
export CMAKE_BUILD_PARALLEL_LEVEL=32
|
||||
|
||||
# Setup cleanup
|
||||
remove_docker_container() {
|
||||
set -e;
|
||||
docker rm -f cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE"-avx2 || true;
|
||||
}
|
||||
trap remove_docker_container EXIT
|
||||
remove_docker_container
|
||||
|
||||
# Try building the docker image
|
||||
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --tag cpu-test-"$NUMA_NODE" --target vllm-test -f docker/Dockerfile.cpu .
|
||||
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" --tag cpu-test-"$NUMA_NODE"-avx2 --target vllm-test -f docker/Dockerfile.cpu .
|
||||
|
||||
# Run the image, setting --shm-size=4g for tensor parallel.
|
||||
docker run --rm --cpuset-cpus=$CORE_RANGE --cpuset-mems=$NUMA_NODE -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN -e VLLM_CPU_KVCACHE_SPACE=16 -e VLLM_CPU_CI_ENV=1 -e VLLM_CPU_SIM_MULTI_NUMA=1 --shm-size=4g $IMAGE_NAME \
|
||||
timeout $TIMEOUT_VAL bash -c "set -euox pipefail; echo \"--- Print packages\"; pip list; echo \"--- Running tests\"; ${TEST_COMMAND}"
|
||||
docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=16 --env VLLM_CPU_CI_ENV=1 -e E2E_OMP_THREADS="$OMP_CORE_RANGE" --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE"
|
||||
docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=16 --env VLLM_CPU_CI_ENV=1 -e E2E_OMP_THREADS="$OMP_CORE_RANGE" --shm-size=4g --name cpu-test-"$NUMA_NODE"-avx2 cpu-test-"$NUMA_NODE"-avx2
|
||||
|
||||
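The NUMA pinning above keeps both the build and the test containers on one socket. A hedged standalone check (core range and node number are placeholders for this particular CI host):

# Pin a throwaway container to cores 48-95 on NUMA node 1 and confirm the
# visible CPU count matches the requested range (should print 48 here).
numactl -C 48-95 -N 1 docker run --rm --cpuset-cpus=48-95 --cpuset-mems=1 ubuntu:22.04 nproc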
function cpu_tests() {
  set -e
  export NUMA_NODE=$2

  # list packages
  docker exec cpu-test-"$NUMA_NODE"-avx2 bash -c "
    set -e
    pip list"

  docker exec cpu-test-"$NUMA_NODE" bash -c "
    set -e
    pip list"

  # offline inference
  docker exec cpu-test-"$NUMA_NODE"-avx2 bash -c "
    set -e
    python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m"

  # Run kernel tests
  docker exec cpu-test-"$NUMA_NODE" bash -c "
    set -e
    pytest -x -v -s tests/kernels/attention/test_cpu_attn.py
    pytest -x -v -s tests/kernels/test_onednn.py"

  # Run basic model test
  docker exec cpu-test-"$NUMA_NODE" bash -c "
    set -e
    # Note: disabled until V1 is supported
    # pytest -x -v -s tests/kernels/attention/test_cache.py -m cpu_model
    # pytest -x -v -s tests/kernels/attention/test_mla_decode_cpu.py -m cpu_model

    pytest -x -v -s tests/models/language/generation -m cpu_model
    VLLM_CPU_SGL_KERNEL=1 pytest -x -v -s tests/models/language/generation -m cpu_model

    pytest -x -v -s tests/models/language/pooling -m cpu_model
    pytest -x -v -s tests/models/multimodal/generation \
      --ignore=tests/models/multimodal/generation/test_pixtral.py \
      -m cpu_model"

  # Run compressed-tensor test
  docker exec cpu-test-"$NUMA_NODE" bash -c "
    set -e
    pytest -x -s -v \
      tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs"

  # Note: disabled until V1 is supported
  # Run AWQ test
  # docker exec cpu-test-"$NUMA_NODE" bash -c "
  #   set -e
  #   pytest -x -s -v \
  #     tests/quantization/test_ipex_quant.py"

  # Run multi-lora tests
  docker exec cpu-test-"$NUMA_NODE" bash -c "
    set -e
    pytest -x -s -v \
      tests/lora/test_qwen2vl.py"

  # online serving: tp+pp
  docker exec cpu-test-"$NUMA_NODE" bash -c '
    set -e
    VLLM_CPU_OMP_THREADS_BIND=$E2E_OMP_THREADS VLLM_CPU_SGL_KERNEL=1 vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -pp=2 &
    server_pid=$!
    timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
    vllm bench serve \
      --backend vllm \
      --dataset-name random \
      --model meta-llama/Llama-3.2-3B-Instruct \
      --num-prompts 20 \
      --endpoint /v1/completions
    kill -s SIGTERM $server_pid &'

  # online serving: tp+dp
  docker exec cpu-test-"$NUMA_NODE" bash -c '
    set -e
    VLLM_CPU_OMP_THREADS_BIND=$E2E_OMP_THREADS VLLM_CPU_SGL_KERNEL=1 vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -dp=2 &
    server_pid=$!
    timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
    vllm bench serve \
      --backend vllm \
      --dataset-name random \
      --model meta-llama/Llama-3.2-3B-Instruct \
      --num-prompts 20 \
      --endpoint /v1/completions
    kill -s SIGTERM $server_pid &'
}

# All CPU tests are expected to finish in under 40 minutes.
export -f cpu_tests
timeout 2.5h bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"
@@ -5,9 +5,7 @@
set -exuo pipefail

# Try building the docker image
image_name="hpu/upstream-vllm-ci:${BUILDKITE_COMMIT}"
container_name="hpu-upstream-vllm-ci-${BUILDKITE_COMMIT}-container"
cat <<EOF | docker build -t ${image_name} -f - .
cat <<EOF | docker build -t hpu-plugin-v1-test-env -f - .
FROM gaudi-base-image:latest

COPY ./ /workspace/vllm

@@ -17,8 +15,7 @@ WORKDIR /workspace/vllm
ENV no_proxy=localhost,127.0.0.1
ENV PT_HPU_ENABLE_LAZY_COLLECTIVES=true

RUN bash -c 'pip install -r <(sed "/^torch/d" requirements/build.txt)'
RUN VLLM_TARGET_DEVICE=empty pip install --no-build-isolation -e .
RUN VLLM_TARGET_DEVICE=empty pip install .
RUN pip install git+https://github.com/vllm-project/vllm-gaudi.git

# install development dependencies (for testing)
@@ -39,20 +36,15 @@ EOF
# functions, while other platforms only need one remove_docker_container
# function.
EXITCODE=1
remove_docker_containers() { docker rm -f ${container_name} || true; }
remove_docker_containers() { docker rm -f hpu-plugin-v1-test || true; }
trap 'remove_docker_containers; exit $EXITCODE;' EXIT
remove_docker_containers

echo "Running HPU plugin v1 test"
docker run --rm --runtime=habana --name=${container_name} --network=host \
docker run --rm --runtime=habana --name=hpu-plugin-v1-test --network=host \
  -e HABANA_VISIBLE_DEVICES=all \
  -e VLLM_SKIP_WARMUP=true \
  -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
  -e PT_HPU_LAZY_MODE=1 \
  "${image_name}" \
  /bin/bash -c '
    cd vllm; timeout 120s python -u examples/offline_inference/basic/generate.py --model facebook/opt-125m
  '
  hpu-plugin-v1-test-env \
  /bin/bash "/workspace/vllm-gaudi/tests/upstream_tests/ci_tests.sh"

EXITCODE=$?
if [ $EXITCODE -eq 0 ]; then
@@ -74,7 +74,6 @@ FROM ${BASE_IMAGE_NAME}

# Define environments
ENV DEBIAN_FRONTEND=noninteractive
ENV SOC_VERSION="ascend910b1"

RUN pip config set global.index-url http://cache-service-vllm.nginx-pypi-cache.svc.cluster.local:${PYPI_CACHE_PORT}/pypi/simple && \
    pip config set global.trusted-host cache-service-vllm.nginx-pypi-cache.svc.cluster.local && \
@@ -61,7 +61,7 @@ echo "Results will be stored in: $RESULTS_DIR"
echo "--- Installing Python dependencies ---"
python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \
  && python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \
  && python3 -m pip install --progress-bar off "lm-eval[api]>=0.4.9.2" \
  && python3 -m pip install --progress-bar off "lm-eval @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d" \
  && python3 -m pip install --progress-bar off hf-transfer tblib==3.1.0
echo "--- Python dependencies installed ---"

@@ -61,7 +61,7 @@ echo "Results will be stored in: $RESULTS_DIR"
echo "--- Installing Python dependencies ---"
python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \
  && python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \
  && python3 -m pip install --progress-bar off "lm-eval[api]>=0.4.9.2" \
  && python3 -m pip install --progress-bar off "lm-eval @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d" \
  && python3 -m pip install --progress-bar off hf-transfer tblib==3.1.0
echo "--- Python dependencies installed ---"
@@ -35,21 +35,17 @@ docker run \
  echo $ZE_AFFINITY_MASK
  pip install tblib==3.1.0
  python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager
  python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 -O3 -cc.cudagraph_mode=NONE
  python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 -O3 -O.cudagraph_mode=NONE
  python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager -tp 2 --distributed-executor-backend ray
  python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager -tp 2 --distributed-executor-backend mp
  python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager --attention-backend=TRITON_ATTN
  python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager --quantization fp8
  python3 examples/offline_inference/basic/generate.py --model superjob/Qwen3-4B-Instruct-2507-GPTQ-Int4 --block-size 64 --enforce-eager
  python3 examples/offline_inference/basic/generate.py --model ibm-research/PowerMoE-3b --block-size 64 --enforce-eager -tp 2
  python3 examples/offline_inference/basic/generate.py --model ibm-research/PowerMoE-3b --block-size 64 --enforce-eager -tp 2 --enable-expert-parallel
  VLLM_ATTENTION_BACKEND=TRITON_ATTN python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager
  cd tests
  pytest -v -s v1/core --ignore=v1/core/test_reset_prefix_cache_e2e.py
  pytest -v -s v1/core
  pytest -v -s v1/engine
  pytest -v -s v1/sample --ignore=v1/sample/test_logprobs.py --ignore=v1/sample/test_logprobs_e2e.py
  pytest -v -s v1/worker --ignore=v1/worker/test_gpu_model_runner.py
  pytest -v -s v1/structured_output
  pytest -v -s v1/spec_decode --ignore=v1/spec_decode/test_max_len.py --ignore=v1/spec_decode/test_tree_attention.py --ignore=v1/spec_decode/test_speculators_eagle3.py --ignore=v1/spec_decode/test_acceptance_length.py
  pytest -v -s v1/kv_connector/unit --ignore=v1/kv_connector/unit/test_multi_connector.py --ignore=v1/kv_connector/unit/test_nixl_connector.py --ignore=v1/kv_connector/unit/test_example_connector.py --ignore=v1/kv_connector/unit/test_lmcache_integration.py
  pytest -v -s v1/spec_decode --ignore=v1/spec_decode/test_max_len.py --ignore=v1/spec_decode/test_tree_attention.py --ignore=v1/spec_decode/test_speculators_eagle3.py
  pytest -v -s v1/kv_connector/unit --ignore=v1/kv_connector/unit/test_multi_connector.py --ignore=v1/kv_connector/unit/test_nixl_connector.py --ignore=v1/kv_connector/unit/test_shared_storage_connector.py --ignore=v1/kv_connector/unit/test_lmcache_integration.py
  pytest -v -s v1/test_serial_utils.py
'
@@ -1,36 +0,0 @@
#!/bin/bash

set -ex

# Get the tag variant from the first argument; defaults to empty if not provided.
# Expected to be something like "cu130". Due to limitations in the cleanup script,
# variants must use separate tags like "cu130-nightly", otherwise they would be
# cleaned up together with the main "nightly" tags.

TAG_VARIANT="$1"
if [ -n "$TAG_VARIANT" ]; then
  ORIG_TAG_SUFFIX="-$TAG_VARIANT"
  TAG_NAME="$TAG_VARIANT-nightly"
else
  ORIG_TAG_SUFFIX=""
  TAG_NAME="nightly"
fi

ORIG_TAG_NAME="$BUILDKITE_COMMIT"

echo "Pushing original tag $ORIG_TAG_NAME$ORIG_TAG_SUFFIX to new nightly tag name: $TAG_NAME"

# pull original arch-dependent images from AWS ECR Public
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:$ORIG_TAG_NAME-x86_64$ORIG_TAG_SUFFIX
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:$ORIG_TAG_NAME-aarch64$ORIG_TAG_SUFFIX
# tag arch-dependent images
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$ORIG_TAG_NAME-x86_64$ORIG_TAG_SUFFIX vllm/vllm-openai:$TAG_NAME-x86_64
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$ORIG_TAG_NAME-aarch64$ORIG_TAG_SUFFIX vllm/vllm-openai:$TAG_NAME-aarch64
# push arch-dependent images to DockerHub
docker push vllm/vllm-openai:$TAG_NAME-x86_64
docker push vllm/vllm-openai:$TAG_NAME-aarch64
# push arch-independent manifest to DockerHub
docker manifest create vllm/vllm-openai:$TAG_NAME vllm/vllm-openai:$TAG_NAME-x86_64 vllm/vllm-openai:$TAG_NAME-aarch64 --amend
docker manifest create vllm/vllm-openai:$TAG_NAME-$BUILDKITE_COMMIT vllm/vllm-openai:$TAG_NAME-x86_64 vllm/vllm-openai:$TAG_NAME-aarch64 --amend
docker manifest push vllm/vllm-openai:$TAG_NAME
docker manifest push vllm/vllm-openai:$TAG_NAME-$BUILDKITE_COMMIT
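For reference, the tag-variant logic in the deleted script above maps its argument to DockerHub tags roughly as follows; hypothetical invocations (the script's filename is not shown in this diff, so it is assumed here):

bash push_nightly_tags.sh          # retags <commit>        -> vllm/vllm-openai:nightly
bash push_nightly_tags.sh cu130    # retags <commit>-cu130  -> vllm/vllm-openai:cu130-nightly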
@@ -2,17 +2,6 @@

set -euox pipefail

# To detect ROCm, check multiple indicators:
if [ -e /dev/kfd ] || \
   [ -d /opt/rocm ] || \
   command -v rocm-smi &> /dev/null || \
   [ -n "${ROCM_HOME:-}" ]; then
  IS_ROCM=1
else
  IS_ROCM=0
fi

if [[ $# -lt 4 ]]; then
  echo "Usage: .buildkite/scripts/run-multi-node-test.sh WORKING_DIR NUM_NODES NUM_GPUS DOCKER_IMAGE COMMAND1 COMMAND2 ... COMMANDN"
  exit 1
@@ -37,18 +26,13 @@ for command in "${COMMANDS[@]}"; do
  echo "$command"
done

start_network() {
  docker network create --subnet=192.168.10.0/24 docker-net
}

start_nodes() {
  for node in $(seq 0 $(($NUM_NODES-1))); do
    if [ "$IS_ROCM" -eq 1 ]; then
      GPU_DEVICES='--device /dev/kfd --device /dev/dri -e HIP_VISIBLE_DEVICES='
    else
      GPU_DEVICES='--gpus "device='
    fi
    GPU_DEVICES='"device='
    for node_gpu in $(seq 0 $(($NUM_GPUS - 1))); do
      DEVICE_NUM=$(($node * $NUM_GPUS + $node_gpu))
      GPU_DEVICES+=$(($DEVICE_NUM))
@@ -56,9 +40,7 @@ start_nodes() {
        GPU_DEVICES+=','
      fi
    done
    if [ "$IS_ROCM" -eq 0 ]; then
      GPU_DEVICES+='"'
    fi
    GPU_DEVICES+='"'

    # start the container in detached mode
    # things to note:
@@ -67,7 +49,7 @@ start_nodes() {
    # 3. map the huggingface cache directory to the container
    # 4. assign ip addresses to the containers (head node: 192.168.10.10, worker nodes:
    #    starting from 192.168.10.11)
    docker run -d $GPU_DEVICES --shm-size=10.24gb -e HF_TOKEN \
    docker run -d --gpus "$GPU_DEVICES" --shm-size=10.24gb -e HF_TOKEN \
      -v ~/.cache/huggingface:/root/.cache/huggingface --name "node$node" \
      --network docker-net --ip 192.168.10.$((10 + $node)) --rm "$DOCKER_IMAGE" \
      /bin/bash -c "tail -f /dev/null"
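Concretely, the loop above concatenates global device indices per node. A hedged trace (NUM_GPUS=2, CUDA branch; the exact separator condition is elided by the hunk above, so the `if` below is an assumption):

# For node 1 with NUM_GPUS=2 the CUDA branch should build: "device=2,3"
node=1; NUM_GPUS=2
GPU_DEVICES='"device='
for node_gpu in $(seq 0 $((NUM_GPUS - 1))); do
  DEVICE_NUM=$((node * NUM_GPUS + node_gpu))
  GPU_DEVICES+=$DEVICE_NUM
  # assumed: append a comma between indices but not after the last one
  if [ "$node_gpu" -lt $((NUM_GPUS - 1)) ]; then
    GPU_DEVICES+=','
  fi
done
GPU_DEVICES+='"'
echo "$GPU_DEVICES"   # -> "device=2,3"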
@@ -12,11 +12,6 @@ REPO_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"
PRIME_RL_REPO="https://github.com/PrimeIntellect-ai/prime-rl.git"
PRIME_RL_DIR="${REPO_ROOT}/prime-rl"

if command -v rocm-smi &> /dev/null || command -v rocminfo &> /dev/null; then
  echo "AMD GPU detected. Prime-RL currently only supports NVIDIA. Skipping..."
  exit 0
fi

echo "Setting up Prime-RL integration test environment..."

# Clean up any existing Prime-RL directory
@@ -17,17 +17,7 @@ wait_for_server() {
}

MODEL="deepseek-ai/DeepSeek-V2-lite"

# Set BACKENDS based on platform
if command -v rocm-smi &> /dev/null || [[ -d /opt/rocm ]] || [[ -n "${ROCM_PATH:-}" ]]; then
  # ROCm platform
  BACKENDS=("allgather_reducescatter")
  # Disable MOE padding for ROCm since it is causing eplb to fail
  export VLLM_ROCM_MOE_PADDING=0
else
  # Non-ROCm platform (CUDA/other)
  BACKENDS=("deepep_high_throughput" "deepep_low_latency")
fi
BACKENDS=("deepep_high_throughput" "deepep_low_latency")

cleanup() {
  if [[ -n "${SERVER_PID:-}" ]] && kill -0 "${SERVER_PID}" 2>/dev/null; then
@@ -43,6 +33,7 @@ trap cleanup EXIT

for BACK in "${BACKENDS[@]}"; do
  VLLM_DEEP_GEMM_WARMUP=skip \
  VLLM_ALL2ALL_BACKEND=$BACK \
  vllm serve "$MODEL" \
    --enforce-eager \
    --tensor-parallel-size 2 \
@@ -51,7 +42,6 @@ for BACK in "${BACKENDS[@]}"; do
    --enable-eplb \
    --trust-remote-code \
    --max-model-len 2048 \
    --all2all-backend $BACK \
    --port $PORT &
  SERVER_PID=$!
  wait_for_server $PORT
@@ -1,12 +1,10 @@
#!/usr/bin/env bash
set -euxo pipefail

# args: [THRESHOLD] [NUM_QUESTIONS] [START_PORT] [DATA_PARALLEL_SIZE] [TENSOR_PARALLEL_SIZE]
# args: [THRESHOLD] [NUM_QUESTIONS] [START_PORT]
THRESHOLD=${1:-0.8}
NUM_Q=${2:-1319}
PORT=${3:-8020}
DATA_PARALLEL_SIZE=${4:-2}
TENSOR_PARALLEL_SIZE=${5:-2}
OUT_DIR=${OUT_DIR:-/tmp/vllm-scheduled}
mkdir -p "${OUT_DIR}"

@@ -19,16 +17,7 @@ wait_for_server() {
}

MODEL="QWen/Qwen3-30B-A3B-FP8"
# Set BACKENDS based on platform
if command -v rocm-smi &> /dev/null || [[ -d /opt/rocm ]] || [[ -n "${ROCM_PATH:-}" ]]; then
  # ROCm platform
  BACKENDS=("allgather_reducescatter")
  # Disable MOE padding for ROCm since it is causing eplb to fail
  export VLLM_ROCM_MOE_PADDING=0
else
  # Non-ROCm platform (CUDA/other)
  BACKENDS=("deepep_high_throughput" "deepep_low_latency")
fi
BACKENDS=("deepep_high_throughput" "deepep_low_latency")

cleanup() {
  if [[ -n "${SERVER_PID:-}" ]] && kill -0 "${SERVER_PID}" 2>/dev/null; then
@@ -44,13 +33,11 @@ trap cleanup EXIT

for BACK in "${BACKENDS[@]}"; do
  VLLM_DEEP_GEMM_WARMUP=skip \
  VLLM_ALL2ALL_BACKEND=$BACK \
  vllm serve "$MODEL" \
    --enforce-eager \
    --enable-eplb \
    --all2all-backend $BACK \
    --eplb-config '{"window_size":10, "step_interval":100, "num_redundant_experts":0, "log_balancedness":true}' \
    --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} \
    --data-parallel-size ${DATA_PARALLEL_SIZE} \
    --tensor-parallel-size 2 \
    --data-parallel-size 2 \
    --enable-expert-parallel \
    --trust-remote-code \
    --max-model-len 2048 \
@@ -1,78 +0,0 @@
#!/usr/bin/env bash
set -euxo pipefail

# args: [THRESHOLD] [NUM_QUESTIONS] [START_PORT]
THRESHOLD=${1:-0.25}
NUM_Q=${2:-1319}
PORT=${3:-8040}
OUT_DIR=${OUT_DIR:-/tmp/vllm-scheduled}
mkdir -p "${OUT_DIR}"

wait_for_server() {
  local port=$1
  timeout 600 bash -c '
    until curl -sf "http://127.0.0.1:'"$port"'/health" > /dev/null; do
      sleep 1
    done'
}

MODEL="Qwen/Qwen3-Next-80B-A3B-Instruct"

# Set BACKENDS and platform-specific args based on platform
if command -v rocm-smi &> /dev/null || [[ -d /opt/rocm ]] || [[ -n "${ROCM_PATH:-}" ]]; then
  # ROCm platform
  BACKENDS=("allgather_reducescatter")
  # Disable MOE padding for ROCm since it is causing eplb to fail
  export VLLM_ROCM_MOE_PADDING=0
  PLATFORM_ARGS=("--no-async-scheduling")
  echo "Disabled async scheduling for ROCm platform due to issues with spec decode."
else
  # Non-ROCm platform (CUDA/other)
  BACKENDS=("deepep_high_throughput" "deepep_low_latency")
  PLATFORM_ARGS=()
fi

cleanup() {
  if [[ -n "${SERVER_PID:-}" ]] && kill -0 "${SERVER_PID}" 2>/dev/null; then
    kill "${SERVER_PID}" 2>/dev/null || true
    for _ in {1..20}; do
      kill -0 "${SERVER_PID}" 2>/dev/null || break
      sleep 0.5
    done
    kill -9 "${SERVER_PID}" 2>/dev/null || true
  fi
}
trap cleanup EXIT

for BACK in "${BACKENDS[@]}"; do
  VLLM_DEEP_GEMM_WARMUP=skip \
  vllm serve "$MODEL" \
    --enforce-eager \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --enable-eplb \
    --all2all-backend $BACK \
    --eplb-config '{"window_size":200,"step_interval":600,"use_async":true}' \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}' \
    --trust-remote-code \
    --max-model-len 2048 \
    --gpu-memory-utilization 0.9 \
    "${PLATFORM_ARGS[@]}" \
    --port $PORT &
  SERVER_PID=$!
  wait_for_server $PORT

  TAG=$(echo "$MODEL" | tr '/: \\n' '_____')
  OUT="${OUT_DIR}/${TAG}_${BACK}.json"
  python3 tests/evals/gsm8k/gsm8k_eval.py --host http://127.0.0.1 --port $PORT --num-questions ${NUM_Q} --save-results ${OUT}
  python3 - <<PY
import json; acc=json.load(open('${OUT}'))['accuracy']
print(f"${MODEL} ${BACK}: accuracy {acc:.3f}")
assert acc >= ${THRESHOLD}, f"${MODEL} ${BACK} accuracy {acc}"
PY

  cleanup
  SERVER_PID=
  sleep 1
  PORT=$((PORT+1))
done
@@ -1,227 +0,0 @@
#!/bin/bash
#
# trigger-ci-build.sh
# Trigger a Buildkite CI build using the bk CLI for the current commit and branch
# with RUN_ALL=1 and NIGHTLY=1 environment variables.
#
# Usage: ./trigger-ci-build.sh [options]
#
# Requires: bk CLI (https://buildkite.com/docs/platform/cli)
#
# SAFETY: Dry-run by default. Use --execute to actually trigger a build.
#

set -euo pipefail

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Default configuration
PIPELINE="ci"
DRY_RUN=true

usage() {
  cat <<EOF
Usage: $(basename "$0") [options]

Trigger a Buildkite CI build using the bk CLI for the current commit and branch.
Sets RUN_ALL=1 and NIGHTLY=1 environment variables.

SAFETY: Dry-run by default. Use --execute to actually trigger a build.

Options:
  --execute     Actually trigger the build (default: dry-run)
  --pipeline    Buildkite pipeline slug (default: ${PIPELINE})
  --commit      Override commit SHA (default: current HEAD)
  --branch      Override branch name (default: current branch)
  --message     Custom build message (default: auto-generated)
  --help        Show this help message

Prerequisites:
  - bk CLI installed: brew tap buildkite/buildkite && brew install buildkite/buildkite/bk
  - bk configured: bk configure

Examples:
  $(basename "$0")                       # Dry-run, show what would happen
  $(basename "$0") --execute             # Actually trigger the build
  $(basename "$0") --pipeline ci-shadow  # Dry-run with a different pipeline
EOF
  exit 1
}

log_info() {
  echo -e "${BLUE}[INFO]${NC} $1"
}

log_success() {
  echo -e "${GREEN}[OK]${NC} $1"
}

log_warn() {
  echo -e "${YELLOW}[WARN]${NC} $1"
}

log_error() {
  echo -e "${RED}[ERROR]${NC} $1" >&2
}

# Parse arguments
COMMIT=""
BRANCH=""
MESSAGE=""

while [[ $# -gt 0 ]]; do
  case $1 in
    --execute)
      DRY_RUN=false
      shift
      ;;
    --pipeline)
      PIPELINE="$2"
      shift 2
      ;;
    --commit)
      COMMIT="$2"
      shift 2
      ;;
    --branch)
      BRANCH="$2"
      shift 2
      ;;
    --message)
      MESSAGE="$2"
      shift 2
      ;;
    --help|-h)
      usage
      ;;
    -*)
      log_error "Unknown option: $1"
      usage
      ;;
    *)
      log_error "Unexpected argument: $1"
      usage
      ;;
  esac
done

# Check if bk CLI is installed
if ! command -v bk &>/dev/null; then
  log_error "Buildkite CLI (bk) is not installed"
  echo ""
  echo "Install with:"
  echo "  brew tap buildkite/buildkite && brew install buildkite/buildkite/bk"
  echo ""
  echo "Then configure:"
  echo "  bk configure"
  exit 1
fi

# Check if we're in a git repository
if ! git rev-parse --is-inside-work-tree &>/dev/null; then
  log_error "Not in a git repository"
  exit 1
fi

# Get current commit and branch if not overridden
if [[ -z "$COMMIT" ]]; then
  COMMIT=$(git rev-parse HEAD)
fi

if [[ -z "$BRANCH" ]]; then
  BRANCH=$(git branch --show-current)
  if [[ -z "$BRANCH" ]]; then
    # Detached HEAD state - try to get branch from ref
    BRANCH=$(git rev-parse --abbrev-ref HEAD)
  fi
fi

# Generate default message if not provided
if [[ -z "$MESSAGE" ]]; then
  COMMIT_MSG=$(git log -1 --pretty=format:"%s" "$COMMIT" 2>/dev/null || echo "Manual build")
  MESSAGE="[Manual] ${COMMIT_MSG}"
fi

# Safety check: Verify the commit exists on the remote
log_info "Verifying commit exists on remote..."
git fetch origin --quiet 2>/dev/null || true

# Check if commit is reachable from any remote branch
REMOTE_BRANCHES=$(git branch -r --contains "$COMMIT" 2>/dev/null || true)
if [[ -z "$REMOTE_BRANCHES" ]]; then
  log_error "Commit ${COMMIT} does not exist on any remote branch!"
  echo ""
  echo "The CI system will fail to checkout this commit."
  echo "Please push your changes first:"
  echo ""
  echo "  git push origin ${BRANCH}"
  echo ""
  exit 1
fi

log_success "Commit found on remote branches:"
echo "$REMOTE_BRANCHES" | head -5 | sed 's/^/  /'
if [[ $(echo "$REMOTE_BRANCHES" | wc -l) -gt 5 ]]; then
  echo "  ... and more"
fi
echo ""

log_info "Pipeline: ${PIPELINE}"
log_info "Branch:   ${BRANCH}"
log_info "Commit:   ${COMMIT}"
log_info "Message:  ${MESSAGE}"
log_info "Environment: RUN_ALL=1, NIGHTLY=1"
echo ""

# Build the command
CMD=(bk build create
  -y
  -w
  -i
  --pipeline "${PIPELINE}"
  --commit "${COMMIT}"
  --branch "${BRANCH}"
  --message "${MESSAGE}"
  --env "RUN_ALL=1"
  --env "NIGHTLY=1"
)

if [[ "$DRY_RUN" == true ]]; then
  echo "=========================================="
  log_warn "DRY-RUN MODE - No build will be triggered"
  echo "=========================================="
  echo ""
  echo "Command that would be executed:"
  echo ""
  # Escape single quotes in values for safe shell display
  escape_for_shell() {
    printf '%s' "$1" | sed "s/'/'\\\\''/g"
  }
  echo "  bk build create \\"
  echo "    -y \\"
  echo "    -w \\"
  echo "    -i \\"
  echo "    --pipeline '$(escape_for_shell "${PIPELINE}")' \\"
  echo "    --commit '$(escape_for_shell "${COMMIT}")' \\"
  echo "    --branch '$(escape_for_shell "${BRANCH}")' \\"
  echo "    --message '$(escape_for_shell "${MESSAGE}")' \\"
  echo "    --env 'RUN_ALL=1' \\"
  echo "    --env 'NIGHTLY=1'"
  echo ""
  echo "=========================================="
  echo -e "${YELLOW}To actually trigger this build, run:${NC}"
  echo ""
  echo "  $0 --execute"
  echo "=========================================="
  exit 0
fi

log_info "Triggering build..."

# Execute the command - bk will print the URL and open the browser
"${CMD[@]}"
@@ -1,108 +0,0 @@
#!/usr/bin/env bash

set -ex

# ======== part 0: setup ========

BUCKET="vllm-wheels"
INDICES_OUTPUT_DIR="indices"
DEFAULT_VARIANT_ALIAS="cu129" # align with vLLM_MAIN_CUDA_VERSION in vllm/envs.py
PYTHON=${PYTHON_PROG:=python3} # try to read from env var, otherwise use python3
SUBPATH=$BUILDKITE_COMMIT
S3_COMMIT_PREFIX="s3://$BUCKET/$SUBPATH/"

# detect if python3.12+ is available
has_new_python=$($PYTHON -c "print(1 if __import__('sys').version_info >= (3,12) else 0)")
if [[ "$has_new_python" -eq 0 ]]; then
  # use new python from docker
  docker pull python:3-slim
  PYTHON="docker run --rm -v $(pwd):/app -w /app python:3-slim python3"
fi

echo "Using python interpreter: $PYTHON"
echo "Python version: $($PYTHON --version)"

# ========= part 1: collect, rename & upload the wheel ==========

# Assume wheels are in artifacts/dist/*.whl
wheel_files=(artifacts/dist/*.whl)

# Check that exactly one wheel is found
if [[ ${#wheel_files[@]} -ne 1 ]]; then
  echo "Error: Expected exactly one wheel file in artifacts/dist/, but found ${#wheel_files[@]}"
  exit 1
fi
wheel="${wheel_files[0]}"

# The default build image uses Ubuntu 20.04, which corresponds to manylinux_2_31.
# We also accept the manylinux tag as a parameter.
# Refer to https://github.com/mayeut/pep600_compliance?tab=readme-ov-file#acceptable-distros-to-build-wheels
manylinux_version="${1:-manylinux_2_31}"

# Rename 'linux' to the appropriate manylinux version in the wheel filename
if [[ "$wheel" != *"linux"* ]]; then
  echo "Error: Wheel filename does not contain 'linux': $wheel"
  exit 1
fi
new_wheel="${wheel/linux/$manylinux_version}"
mv -- "$wheel" "$new_wheel"
wheel="$new_wheel"
echo "Renamed wheel to: $wheel"

# Extract the version from the wheel
version=$(unzip -p "$wheel" '**/METADATA' | grep '^Version: ' | cut -d' ' -f2)
echo "Version in wheel: $version"
pure_version="${version%%+*}"
echo "Pure version (without variant): $pure_version"

# copy wheel to its own bucket
aws s3 cp "$wheel" "$S3_COMMIT_PREFIX"

# ========= part 2: generate and upload indices ==========
# Generate indices for all existing wheels in the commit directory.
# This script might be run multiple times if there are multiple variants being built,
# so we need to guarantee there is little chance for "TOCTOU" issues,
# i.e., one process generating indices while another is uploading a new wheel.
# Therefore we need to ensure no time-consuming operations happen below.

# list all wheels in the commit directory
echo "Existing wheels on S3:"
aws s3 ls "$S3_COMMIT_PREFIX"
obj_json="objects.json"
aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$SUBPATH/" --delimiter / --output json > "$obj_json"
mkdir -p "$INDICES_OUTPUT_DIR"

# Call the script to generate indices for all existing wheels.
# These indices use relative paths, so they work as long as they sit next to the
# wheel directory in S3, i.e., the wheels are always in s3://vllm-wheels/<commit>/
# and indices can be placed in /<commit>/, /nightly/, or /<version>/.
if [[ ! -z "$DEFAULT_VARIANT_ALIAS" ]]; then
  alias_arg="--alias-to-default $DEFAULT_VARIANT_ALIAS"
else
  alias_arg=""
fi

# HACK: we do not need the regex module here, but it is required by a pre-commit hook.
# To avoid any external dependency, we simply replace it with the stdlib re module.
sed -i 's/import regex as re/import re/g' .buildkite/scripts/generate-nightly-index.py
$PYTHON .buildkite/scripts/generate-nightly-index.py --version "$SUBPATH" --current-objects "$obj_json" --output-dir "$INDICES_OUTPUT_DIR" --comment "commit $BUILDKITE_COMMIT" $alias_arg

# copy indices to /<commit>/ unconditionally
echo "Uploading indices to $S3_COMMIT_PREFIX"
aws s3 cp --recursive "$INDICES_OUTPUT_DIR/" "$S3_COMMIT_PREFIX"

# copy to /nightly/ only if it is on the main branch and not a PR
if [[ "$BUILDKITE_BRANCH" == "main" && "$BUILDKITE_PULL_REQUEST" == "false" ]]; then
  echo "Uploading indices to overwrite /nightly/"
  aws s3 cp --recursive "$INDICES_OUTPUT_DIR/" "s3://$BUCKET/nightly/"
fi

# re-generate and copy to /<pure_version>/ only if the version does not contain "dev"
if [[ "$version" != *"dev"* ]]; then
  echo "Re-generating indices for /$pure_version/"
  rm -rf "$INDICES_OUTPUT_DIR"/*
  mkdir -p "$INDICES_OUTPUT_DIR"
  # wheel-dir is overridden to be the commit directory, so that the indices point to the correct wheel path
  $PYTHON .buildkite/scripts/generate-nightly-index.py --version "$pure_version" --wheel-dir "$SUBPATH" --current-objects "$obj_json" --output-dir "$INDICES_OUTPUT_DIR" --comment "version $pure_version" $alias_arg
  aws s3 cp --recursive "$INDICES_OUTPUT_DIR/" "s3://$BUCKET/$pure_version/"
fi
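For illustration, the linux-to-manylinux rename above is a plain bash substitution; a hedged example (the wheel filename is invented):

wheel="vllm-0.11.2+cu129-cp38-abi3-linux_x86_64.whl"   # illustrative filename
echo "${wheel/linux/manylinux_2_31}"
# -> vllm-0.11.2+cu129-cp38-abi3-manylinux_2_31_x86_64.whl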
@@ -1,70 +0,0 @@
#!/usr/bin/env bash

set -e

BUCKET="vllm-wheels"
SUBPATH=$BUILDKITE_COMMIT
S3_COMMIT_PREFIX="s3://$BUCKET/$SUBPATH/"

RELEASE_VERSION=$(buildkite-agent meta-data get release-version)
GIT_VERSION=$(git describe --exact-match --tags $BUILDKITE_COMMIT 2>/dev/null)

echo "Release version from Buildkite: $RELEASE_VERSION"

if [[ -z "$GIT_VERSION" ]]; then
  echo "[FATAL] Not on a git tag, cannot create release."
  exit 1
else
  echo "Git version for commit $BUILDKITE_COMMIT: $GIT_VERSION"
fi
# sanity check for version mismatch
if [[ "$RELEASE_VERSION" != "$GIT_VERSION" ]]; then
  if [[ "$FORCE_RELEASE_IGNORE_VERSION_MISMATCH" == "true" ]]; then
    echo "[WARNING] Force release and ignore version mismatch"
  else
    echo "[FATAL] Release version from Buildkite does not match Git version."
    exit 1
  fi
fi
PURE_VERSION=${RELEASE_VERSION#v} # remove leading 'v'

# check pypi token
if [[ -z "$PYPI_TOKEN" ]]; then
  echo "[FATAL] PYPI_TOKEN is not set."
  exit 1
else
  export TWINE_USERNAME="__token__"
  export TWINE_PASSWORD="$PYPI_TOKEN"
fi

set -x # enable tracing only now, after the secrets above have been set, so they are not printed

# install twine from pypi
python3 -m venv /tmp/vllm-release-env
source /tmp/vllm-release-env/bin/activate
pip install twine
python3 -m twine --version

# copy release wheels to local directory
DIST_DIR=/tmp/vllm-release-dist
echo "Existing wheels on S3:"
aws s3 ls "$S3_COMMIT_PREFIX"
echo "Copying wheels to local directory"
mkdir -p $DIST_DIR
# include only wheels for the release version; ignore all files with "dev" or "rc" in the name (without excluding 'aarch64')
aws s3 cp --recursive --exclude "*" --include "vllm-${PURE_VERSION}*.whl" --exclude "*dev*" --exclude "*rc[0-9]*" "$S3_COMMIT_PREFIX" $DIST_DIR
echo "Wheels copied to local directory"
# generate source tarball
git archive --format=tar.gz --output="$DIST_DIR/vllm-${PURE_VERSION}.tar.gz" $BUILDKITE_COMMIT
ls -la $DIST_DIR

# upload wheels to PyPI (only default variant, i.e. files without '+' in the name)
PYPI_WHEEL_FILES=$(find $DIST_DIR -name "vllm-${PURE_VERSION}*.whl" -not -name "*+*")
if [[ -z "$PYPI_WHEEL_FILES" ]]; then
  echo "No default variant wheels found, quitting..."
  exit 1
fi

python3 -m twine check $PYPI_WHEEL_FILES
python3 -m twine upload --non-interactive --verbose $PYPI_WHEEL_FILES
echo "Wheels uploaded to PyPI"
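A hedged trace of the version handling above (the values are illustrative):

RELEASE_VERSION="v0.11.2"              # as returned by buildkite-agent meta-data
PURE_VERSION="${RELEASE_VERSION#v}"    # -> 0.11.2, used to match vllm-0.11.2*.whl
# Default-variant wheels carry no '+' local-version tag (e.g. vllm-0.11.2-...whl),
# while variant wheels like vllm-0.11.2+cu129-...whl are excluded by
# find ... -not -name "*+*".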
@@ -1,151 +0,0 @@
#!/usr/bin/env bash
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
#
# Upload ROCm wheels to S3 with proper index generation
#
# Required environment variables:
#   AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY (or IAM role)
#   S3_BUCKET (default: vllm-wheels)
#
# S3 path structure:
#   s3://vllm-wheels/rocm/{commit}/  - All wheels for this commit
#   s3://vllm-wheels/rocm/nightly/   - Index pointing to latest nightly
#   s3://vllm-wheels/rocm/{version}/ - Index for release versions

set -ex

# ======== Configuration ========
BUCKET="${S3_BUCKET:-vllm-wheels}"
ROCM_SUBPATH="rocm/${BUILDKITE_COMMIT}"
S3_COMMIT_PREFIX="s3://$BUCKET/$ROCM_SUBPATH/"
INDICES_OUTPUT_DIR="rocm-indices"
PYTHON="${PYTHON_PROG:-python3}"

# ROCm uses manylinux_2_35 (Ubuntu 22.04 based)
MANYLINUX_VERSION="manylinux_2_35"

echo "========================================"
echo "ROCm Wheel Upload Configuration"
echo "========================================"
echo "S3 Bucket: $BUCKET"
echo "S3 Path: $ROCM_SUBPATH"
echo "Commit: $BUILDKITE_COMMIT"
echo "Branch: $BUILDKITE_BRANCH"
echo "========================================"

# ======== Part 0: Setup Python ========

# Detect if python3.12+ is available
has_new_python=$($PYTHON -c "print(1 if __import__('sys').version_info >= (3,12) else 0)" 2>/dev/null || echo 0)
if [[ "$has_new_python" -eq 0 ]]; then
  # Use new python from docker
  # Use --user to ensure files are created with correct ownership (not root)
  docker pull python:3-slim
  PYTHON="docker run --rm --user $(id -u):$(id -g) -v $(pwd):/app -w /app python:3-slim python3"
fi

echo "Using python interpreter: $PYTHON"
echo "Python version: $($PYTHON --version)"

# ======== Part 1: Collect and prepare wheels ========

# Collect all wheels
mkdir -p all-rocm-wheels
cp artifacts/rocm-base-wheels/*.whl all-rocm-wheels/ 2>/dev/null || true
cp artifacts/rocm-vllm-wheel/*.whl all-rocm-wheels/ 2>/dev/null || true

WHEEL_COUNT=$(ls all-rocm-wheels/*.whl 2>/dev/null | wc -l)
echo "Total wheels to upload: $WHEEL_COUNT"

if [ "$WHEEL_COUNT" -eq 0 ]; then
  echo "ERROR: No wheels found to upload!"
  exit 1
fi

# Rename linux to manylinux in wheel filenames
for wheel in all-rocm-wheels/*.whl; do
  if [[ "$wheel" == *"linux"* ]] && [[ "$wheel" != *"manylinux"* ]]; then
    new_wheel="${wheel/linux/$MANYLINUX_VERSION}"
    mv -- "$wheel" "$new_wheel"
    echo "Renamed: $(basename "$wheel") -> $(basename "$new_wheel")"
  fi
done

echo ""
echo "Wheels to upload:"
ls -lh all-rocm-wheels/

# ======== Part 2: Upload wheels to S3 ========

echo ""
echo "Uploading wheels to $S3_COMMIT_PREFIX"
for wheel in all-rocm-wheels/*.whl; do
  aws s3 cp "$wheel" "$S3_COMMIT_PREFIX"
done

# ======== Part 3: Generate and upload indices ========

# List existing wheels in commit directory
echo ""
echo "Generating indices..."
obj_json="rocm-objects.json"
aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$ROCM_SUBPATH/" --delimiter / --output json > "$obj_json"

mkdir -p "$INDICES_OUTPUT_DIR"

# Use the existing generate-nightly-index.py
# HACK: Replace regex module with stdlib re (same as CUDA script)
sed -i 's/import regex as re/import re/g' .buildkite/scripts/generate-nightly-index.py

$PYTHON .buildkite/scripts/generate-nightly-index.py \
  --version "$ROCM_SUBPATH" \
  --current-objects "$obj_json" \
  --output-dir "$INDICES_OUTPUT_DIR" \
  --comment "ROCm commit $BUILDKITE_COMMIT"

# Upload indices to commit directory
echo "Uploading indices to $S3_COMMIT_PREFIX"
aws s3 cp --recursive "$INDICES_OUTPUT_DIR/" "$S3_COMMIT_PREFIX"

# Update rocm/nightly/ if on main branch and not a PR
if [[ "$BUILDKITE_BRANCH" == "main" && "$BUILDKITE_PULL_REQUEST" == "false" ]] || [[ "$NIGHTLY" == "1" ]]; then
  echo "Updating rocm/nightly/ index..."
  aws s3 cp --recursive "$INDICES_OUTPUT_DIR/" "s3://$BUCKET/rocm/nightly/"
fi

# Extract version from vLLM wheel and update version-specific index
VLLM_WHEEL=$(ls all-rocm-wheels/vllm*.whl 2>/dev/null | head -1)
if [ -n "$VLLM_WHEEL" ]; then
  VERSION=$(unzip -p "$VLLM_WHEEL" '**/METADATA' | grep '^Version: ' | cut -d' ' -f2)
  echo "Version in wheel: $VERSION"
  PURE_VERSION="${VERSION%%+*}"
  PURE_VERSION="${PURE_VERSION%%.rocm}"
  echo "Pure version: $PURE_VERSION"

  if [[ "$VERSION" != *"dev"* ]]; then
    echo "Updating rocm/$PURE_VERSION/ index..."
    aws s3 cp --recursive "$INDICES_OUTPUT_DIR/" "s3://$BUCKET/rocm/$PURE_VERSION/"
  fi
fi

# ======== Part 4: Summary ========

echo ""
echo "========================================"
echo "ROCm Wheel Upload Complete!"
echo "========================================"
echo ""
echo "Wheels available at:"
echo "  s3://$BUCKET/$ROCM_SUBPATH/"
echo ""
echo "Install command (by commit):"
echo "  pip install vllm --extra-index-url https://${BUCKET}.s3.amazonaws.com/$ROCM_SUBPATH/"
echo ""
if [[ "$BUILDKITE_BRANCH" == "main" ]] || [[ "$NIGHTLY" == "1" ]]; then
  echo "Install command (nightly):"
  echo "  pip install vllm --extra-index-url https://${BUCKET}.s3.amazonaws.com/rocm/nightly/"
fi
echo ""
echo "Wheel count: $WHEEL_COUNT"
echo "========================================"
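A hedged trace of the ROCm version stripping above (the wheel version is invented):

VERSION="0.11.2.rocm+rocm624"          # illustrative value from wheel METADATA
PURE_VERSION="${VERSION%%+*}"          # -> "0.11.2.rocm" (drop the local-version tag)
PURE_VERSION="${PURE_VERSION%%.rocm}"  # -> "0.11.2" (drop a trailing ".rocm" if present)
echo "$PURE_VERSION"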
83
.buildkite/scripts/upload-wheels.sh
Normal file
@@ -0,0 +1,83 @@
#!/usr/bin/env bash

set -ex

# Assume wheels are in artifacts/dist/*.whl
wheel_files=(artifacts/dist/*.whl)

# Check that exactly one wheel is found
if [[ ${#wheel_files[@]} -ne 1 ]]; then
  echo "Error: Expected exactly one wheel file in artifacts/dist/, but found ${#wheel_files[@]}"
  exit 1
fi

# Get the single wheel file
wheel="${wheel_files[0]}"

# Detect architecture and rename 'linux' to the appropriate manylinux version
arch=$(uname -m)
if [[ $arch == "x86_64" ]]; then
  manylinux_version="manylinux1"
elif [[ $arch == "aarch64" ]]; then
  manylinux_version="manylinux2014"
else
  echo "Warning: Unknown architecture $arch, using manylinux1 as default"
  manylinux_version="manylinux1"
fi

# Rename 'linux' to the appropriate manylinux version in the wheel filename
new_wheel="${wheel/linux/$manylinux_version}"
mv -- "$wheel" "$new_wheel"
wheel="$new_wheel"

# Extract the version from the wheel
version=$(unzip -p "$wheel" '**/METADATA' | grep '^Version: ' | cut -d' ' -f2)
echo "Version: $version"

normal_wheel="$wheel" # Save the original wheel filename

# If the version contains "dev", rename it to 1.0.0.dev for consistency
if [[ $version == *dev* ]]; then
  suffix="${version##*.}"
  if [[ $suffix == cu* ]]; then
    new_version="1.0.0.dev+${suffix}"
  else
    new_version="1.0.0.dev"
  fi
  new_wheel="${wheel/$version/$new_version}"
  # use cp to keep both files in the artifacts directory
  cp -- "$wheel" "$new_wheel"
  wheel="$new_wheel"
  version="$new_version"
fi

# Generate the index for this commit
python3 .buildkite/generate_index.py --wheel "$normal_wheel"

# Upload the wheels to S3
aws s3 cp "$wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"
aws s3 cp "$normal_wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"

if [[ $normal_wheel == *"cu129"* ]]; then
  # only upload index.html for cu129 wheels (default wheels), as they
  # are available on both x86 and arm64
  aws s3 cp index.html "s3://vllm-wheels/$BUILDKITE_COMMIT/vllm/index.html"
  aws s3 cp "s3://vllm-wheels/nightly/index.html" "s3://vllm-wheels/$BUILDKITE_COMMIT/index.html"
else
  echo "Skipping index files for non-cu129 wheels"
fi

# upload the wheels and index for nightly
aws s3 cp "$wheel" "s3://vllm-wheels/nightly/"
aws s3 cp "$normal_wheel" "s3://vllm-wheels/nightly/"

if [[ $normal_wheel == *"cu129"* ]]; then
  # only upload index.html for cu129 wheels (default wheels), as they
  # are available on both x86 and arm64
  aws s3 cp index.html "s3://vllm-wheels/nightly/vllm/index.html"
else
  echo "Skipping index files for non-cu129 wheels"
fi

aws s3 cp "$wheel" "s3://vllm-wheels/$version/"
aws s3 cp index.html "s3://vllm-wheels/$version/vllm/index.html"
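A hedged trace of the dev-version renaming above (the version string is invented; the real scheme may differ):

version="0.11.3.dev42+gdeadbeef.cu129"  # illustrative dev version
suffix="${version##*.}"                 # -> "cu129" (text after the last dot)
# suffix matches cu*, so new_version becomes "1.0.0.dev+cu129" and the wheel
# is copied under that name alongside the original.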
File diff suppressed because it is too large
@@ -57,18 +57,14 @@ steps:
|
||||
- pytest -v -s -m 'not cpu_test' multimodal
|
||||
- pytest -v -s utils_
|
||||
|
||||
- label: Async Engine, Inputs, Utils, Worker, Config Test (CPU) # 20min
|
||||
timeout_in_minutes: 30
|
||||
- label: Async Engine, Inputs, Utils, Worker, Config Test (CPU) # 4 mins
|
||||
timeout_in_minutes: 10
|
||||
source_file_dependencies:
|
||||
- vllm/
|
||||
- tests/test_inputs.py
|
||||
- tests/test_outputs.py
|
||||
- tests/test_pooling_params.py
|
||||
- tests/multimodal
|
||||
- tests/renderers
|
||||
- tests/standalone_tests/lazy_imports.py
|
||||
- tests/tokenizers_
|
||||
- tests/tool_parsers
|
||||
- tests/transformers_utils
|
||||
- tests/config
|
||||
no_gpu: true
|
||||
@@ -76,11 +72,7 @@ steps:
|
||||
- python3 standalone_tests/lazy_imports.py
|
||||
- pytest -v -s test_inputs.py
|
||||
- pytest -v -s test_outputs.py
|
||||
- pytest -v -s test_pooling_params.py
|
||||
- pytest -v -s -m 'cpu_test' multimodal
|
||||
- pytest -v -s renderers
|
||||
- pytest -v -s tokenizers_
|
||||
- pytest -v -s tool_parsers
|
||||
- pytest -v -s transformers_utils
|
||||
- pytest -v -s config
|
||||
|
||||
@@ -118,7 +110,7 @@ steps:
|
||||
- tests/entrypoints/
|
||||
commands:
|
||||
- pytest -v -s entrypoints/openai/tool_parsers
|
||||
- pytest -v -s entrypoints/ --ignore=entrypoints/llm --ignore=entrypoints/rpc --ignore=entrypoints/sleep --ignore=entrypoints/instrumentator --ignore=entrypoints/openai --ignore=entrypoints/offline_mode --ignore=entrypoints/test_chat_utils.py --ignore=entrypoints/pooling
|
||||
- pytest -v -s entrypoints/ --ignore=entrypoints/llm --ignore=entrypoints/openai --ignore=entrypoints/offline_mode --ignore=entrypoints/test_chat_utils.py --ignore=entrypoints/pooling
|
||||
|
||||
- label: Entrypoints Integration Test (LLM) # 30min
|
||||
timeout_in_minutes: 40
|
||||
@@ -136,7 +128,7 @@ steps:
|
||||
- pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
|
||||
- pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests
|
||||
|
||||
- label: Entrypoints Integration Test (API Server 1) # 100min
|
||||
- label: Entrypoints Integration Test (API Server) # 100min
|
||||
timeout_in_minutes: 130
|
||||
mirror_hardwares: [amdexperimental]
|
||||
working_dir: "/vllm-workspace/tests"
|
||||
@@ -148,26 +140,10 @@ steps:
|
||||
- tests/entrypoints/test_chat_utils
|
||||
commands:
|
||||
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
|
||||
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/ --ignore=entrypoints/openai/responses
|
||||
- PYTHONPATH=/vllm-workspace pytest -v -s entrypoints/openai/test_collective_rpc.py # PYTHONPATH is needed to import custom Worker extension
|
||||
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/test_collective_rpc.py --ignore=entrypoints/openai/tool_parsers/
|
||||
- pytest -v -s entrypoints/test_chat_utils.py
|
||||
|
||||
- label: Entrypoints Integration Test (API Server 2)
|
||||
timeout_in_minutes: 50
|
||||
mirror_hardwares: [amdexperimental]
|
||||
working_dir: "/vllm-workspace/tests"
|
||||
fast_check: true
|
||||
torch_nightly: true
|
||||
source_file_dependencies:
|
||||
- vllm/
|
||||
- tests/entrypoints/sleep
|
||||
- tests/entrypoints/rpc
|
||||
- tests/tool_use
|
||||
commands:
|
||||
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
|
||||
- pytest -v -s entrypoints/sleep
|
||||
- PYTHONPATH=/vllm-workspace pytest -v -s entrypoints/rpc
|
||||
- pytest -v -s tool_use
|
||||
|
||||
- label: Entrypoints Integration Test (Pooling)
|
||||
timeout_in_minutes: 50
|
||||
mirror_hardwares: [amdexperimental]
|
||||
@@ -181,18 +157,6 @@ steps:
|
||||
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
|
||||
- pytest -v -s entrypoints/pooling
|
||||
|
||||
- label: Entrypoints Integration Test (Responses API)
|
||||
timeout_in_minutes: 50
|
||||
mirror_hardwares: [amdexperimental]
|
||||
working_dir: "/vllm-workspace/tests"
|
||||
fast_check: true
|
||||
torch_nightly: true
|
||||
source_file_dependencies:
|
||||
- vllm/
|
||||
- tests/entrypoints/openai/responses
|
||||
commands:
|
||||
- pytest -v -s entrypoints/openai/responses
|
||||
|
||||
- label: Distributed Tests (4 GPUs) # 35min
|
||||
timeout_in_minutes: 50
|
||||
mirror_hardwares: [amdexperimental]
|
||||
@@ -203,10 +167,9 @@ steps:
|
||||
- tests/distributed/test_utils
|
||||
- tests/distributed/test_pynccl
|
||||
- tests/distributed/test_events
|
||||
- tests/compile/fullgraph/test_basic_correctness.py
|
||||
- tests/compile/test_basic_correctness
|
||||
- examples/offline_inference/rlhf.py
|
||||
- examples/offline_inference/rlhf_colocate.py
|
||||
- examples/offline_inference/new_weight_syncing/
|
||||
- tests/examples/offline_inference/data_parallel.py
|
||||
- tests/v1/distributed
|
||||
- tests/v1/engine/test_engine_core_client.py
|
||||
@@ -229,28 +192,21 @@ steps:
|
||||
# test with internal dp
|
||||
- python3 ../examples/offline_inference/data_parallel.py --enforce-eager
|
||||
- TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/distributed/test_async_llm_dp.py
|
||||
- TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/distributed/test_eagle_dp.py
|
||||
- TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/distributed/test_external_lb_dp.py
|
||||
- TP_SIZE=1 DP_SIZE=4 pytest -v -s v1/distributed/test_internal_lb_dp.py
|
||||
- TP_SIZE=1 DP_SIZE=4 pytest -v -s v1/distributed/test_hybrid_lb_dp.py
|
||||
- pytest -v -s v1/engine/test_engine_core_client.py::test_kv_cache_events_dp
|
||||
- pytest -v -s distributed/test_utils.py
|
||||
- pytest -v -s compile/fullgraph/test_basic_correctness.py
|
||||
- pytest -v -s compile/test_basic_correctness.py
|
||||
- pytest -v -s distributed/test_pynccl.py
|
||||
- pytest -v -s distributed/test_events.py
|
||||
- pytest -v -s distributed/test_symm_mem_allreduce.py
|
||||
# TODO: create a dedicated test section for multi-GPU example tests
|
||||
# when we have multiple distributed example tests
# OLD rlhf examples
- pushd ../examples/offline_inference
- VLLM_ALLOW_INSECURE_SERIALIZATION=1 python3 rlhf.py
- VLLM_ALLOW_INSECURE_SERIALIZATION=1 RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py
- popd
# NEW rlhf examples
- pushd ../examples/offline_inference/new_weight_syncing
- VLLM_ALLOW_INSECURE_SERIALIZATION=1 python3 rlhf.py
- VLLM_ALLOW_INSECURE_SERIALIZATION=1 python3 rlhf_async_new_apis.py
- popd

- label: Distributed Tests (8 GPUs) # 4min
timeout_in_minutes: 10
@@ -319,18 +275,21 @@ steps:
- pytest -v -s test_regression.py
working_dir: "/vllm-workspace/tests" # optional

- label: Engine Test # 9min
timeout_in_minutes: 15
- label: Engine Test # 25min
timeout_in_minutes: 40
mirror_hardwares: [amdexperimental]
source_file_dependencies:
- vllm/
- tests/engine
- tests/tokenization
- tests/test_sequence
- tests/test_config
- tests/test_logger
- tests/test_vllm_port
commands:
- pytest -v -s engine test_sequence.py test_config.py test_logger.py test_vllm_port.py
# OOM in the CI unless we run this separately
- pytest -v -s tokenization

- label: V1 Test e2e + engine # 30min
timeout_in_minutes: 45
@@ -342,10 +301,7 @@ steps:
# TODO: accuracy does not match, whether setting
# VLLM_USE_FLASHINFER_SAMPLER or not on H100.
- pytest -v -s v1/e2e
# Run this test standalone for now;
# need to untangle the (implicit) use of spawn/fork across the tests.
- pytest -v -s v1/engine/test_preprocess_error_handling.py
- pytest -v -s v1/engine --ignore v1/engine/test_preprocess_error_handling.py
- pytest -v -s v1/engine

- label: V1 Test entrypoints # 35min
timeout_in_minutes: 50
@@ -371,7 +327,7 @@ steps:
- pytest -v -s v1/sample
- pytest -v -s v1/logits_processors
- pytest -v -s v1/worker
- pytest -v -s -m 'not slow_test' v1/spec_decode
- pytest -v -s v1/spec_decode
- pytest -v -s -m 'not cpu_test' v1/kv_connector/unit
- pytest -v -s -m 'not cpu_test' v1/metrics
- pytest -v -s v1/test_oracle.py
@@ -385,36 +341,19 @@ steps:
timeout_in_minutes: 30
gpu: h100
source_file_dependencies:
- vllm/config/attention.py
- vllm/model_executor/layers/attention
- vllm/v1/attention
- tests/v1/attention
commands:
- pytest -v -s v1/attention

- label: Batch Invariance Tests (H100) # 10min
timeout_in_minutes: 25
gpu: h100
source_file_dependencies:
- vllm/v1/attention
- vllm/model_executor/layers
- tests/v1/determinism/
commands:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pip install pytest-timeout pytest-forked
- pytest -v -s v1/determinism/test_batch_invariance.py
- pytest -v -s v1/determinism/test_rms_norm_batch_invariant.py

- label: V1 Test attention (B200) # 10min
timeout_in_minutes: 30
gpu: b200
source_file_dependencies:
- vllm/config/attention.py
- vllm/model_executor/layers/attention
- vllm/v1/attention
- tests/v1/attention
commands:
- pytest -v -s v1/attention
- VLLM_DISABLE_FLASHINFER_PREFILL=1 pytest -v -s v1/attention # TODO: FI prefill is bugged and causes incorrectness, fix this

- label: V1 Test others (CPU) # 5 mins
source_file_dependencies:
@@ -436,28 +375,23 @@ steps:
working_dir: "/vllm-workspace/examples"
source_file_dependencies:
- vllm/entrypoints
- vllm/multimodal
- examples/
commands:
- pip install tensorizer # for tensorizer test
# for basic
- python3 offline_inference/basic/chat.py
- python3 offline_inference/basic/generate.py --model facebook/opt-125m
- python3 offline_inference/basic/generate.py --model meta-llama/Llama-2-13b-chat-hf --cpu-offload-gb 10
- python3 offline_inference/basic/chat.py
- python3 offline_inference/prefix_caching.py
- python3 offline_inference/llm_engine_example.py
- python3 offline_inference/audio_language.py --seed 0
- python3 offline_inference/vision_language.py --seed 0
- python3 offline_inference/vision_language_pooling.py --seed 0
- python3 offline_inference/vision_language_multi_image.py --seed 0
- python3 others/tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 others/tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- python3 offline_inference/encoder_decoder_multimodal.py --model-type whisper --seed 0
- python3 offline_inference/basic/classify.py
- python3 offline_inference/basic/embed.py
- python3 offline_inference/basic/score.py
# for multi-modal models
- python3 offline_inference/audio_language.py --seed 0
- python3 offline_inference/vision_language.py --seed 0
- python3 offline_inference/vision_language_multi_image.py --seed 0
- python3 offline_inference/encoder_decoder_multimodal.py --model-type whisper --seed 0
# for pooling models
- python3 pooling/embed/vision_embedding_offline.py --seed 0
# for features demo
- python3 offline_inference/prefix_caching.py
- python3 offline_inference/llm_engine_example.py
- python3 others/tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 others/tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- python3 offline_inference/spec_decode.py --test --method eagle --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
# https://github.com/vllm-project/vllm/pull/26682 uses slightly more memory in PyTorch 2.9+ causing this test to OOM in 1xL4 GPU
- python3 offline_inference/spec_decode.py --test --method eagle3 --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 1536
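The tensorize_vllm_model round-trip above chains two invocations with `&&`. Split out for readability, it is a minimal sketch using the same flags and paths as the CI command; nothing new is assumed:

# Step 1: serialize the weights; the suffix places them under /tmp/vllm/facebook/opt-125m/v1/
python3 others/tensorize_vllm_model.py --model facebook/opt-125m \
    serialize --serialized-directory /tmp/ --suffix v1
# Step 2: deserialize from the path produced by step 1; with `&&` this only
# runs when serialization succeeded, so the step tests the full round trip.
python3 others/tensorize_vllm_model.py --model facebook/opt-125m \
    deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors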
@@ -511,15 +445,18 @@ steps:
- vllm/
- tests/compile
commands:
# Run unit tests defined directly under compile/,
# not including subdirectories, which are usually heavier
# tests covered elsewhere.
# Use `find` to launch multiple instances of pytest so that
# they do not suffer from https://github.com/vllm-project/vllm/issues/28965
# However, find does not normally propagate error codes, so we combine it with xargs
# (using -0 for proper path handling)
- "find compile/ -maxdepth 1 -name 'test_*.py' -print0 | xargs -0 -n1 -I{} pytest -s -v '{}'"
- pytest -s -v compile/passes --ignore compile/passes/distributed
- pytest -v -s compile/test_graph_partition.py
- pytest -v -s compile/test_config.py
- pytest -v -s compile/test_pass_manager.py
- pytest -v -s compile/test_fusion.py
- pytest -v -s compile/test_fusion_attn.py
- pytest -v -s compile/test_functionalization.py
- pytest -v -s compile/test_silu_mul_quant_fusion.py
- pytest -v -s compile/test_fusion_all_reduce.py
- pytest -v -s compile/test_decorator.py
- pytest -v -s compile/test_noop_elimination.py
- pytest -v -s compile/test_aot_compile.py
- pytest -v -s compile/test_qk_norm_rope_fusion.py
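A note on the `find ... -print0 | xargs -0` pattern used in this step: `find -exec` discards the exit status of the commands it launches, while `xargs` exits with status 123 when any invocation fails (GNU findutils behavior, which is an assumption about the CI image). A minimal sketch:

# One pytest process per test file; -print0/-0 keep arbitrary paths intact.
find compile/ -maxdepth 1 -name 'test_*.py' -print0 \
    | xargs -0 -n1 -I{} pytest -s -v '{}'
# xargs exits 0 only if every pytest run exited 0 (123 otherwise), so a
# failing test file still fails the CI step.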

- label: PyTorch Fullgraph Smoke Test # 15min
timeout_in_minutes: 30
@@ -529,13 +466,9 @@ steps:
- vllm/
- tests/compile
commands:
# Run smoke tests under fullgraph directory, except test_full_graph.py
# as it is a heavy test that is covered in other steps.
# Use `find` to launch multiple instances of pytest so that
# they do not suffer from https://github.com/vllm-project/vllm/issues/28965
# However, find does not normally propagate error codes, so we combine it with xargs
# (using -0 for proper path handling)
- "find compile/fullgraph -maxdepth 1 -name 'test_*.py' -not -name 'test_full_graph.py' -print0 | xargs -0 -n1 -I{} pytest -s -v '{}'"
- pytest -v -s compile/test_basic_correctness.py
- pytest -v -s compile/test_multimodal_compile.py
- pytest -v -s compile/piecewise/

- label: PyTorch Fullgraph Test # 27min
timeout_in_minutes: 40
@@ -546,12 +479,10 @@ steps:
- tests/compile
commands:
# fp8 kv scales not supported on sm89, tested on Blackwell instead
- pytest -v -s compile/fullgraph/test_full_graph.py -k 'not test_fp8_kv_scale_compile'
# # Limit to no custom ops to reduce running time
# # Wrap with quotes to escape yaml and avoid starting -k string with a -
# - "pytest -v -s compile/distributed/test_fusions_e2e.py -k 'TRITON and not +quant_fp8 and not Llama-4'"
# Old E2E tests were removed in https://github.com/vllm-project/vllm/pull/33293
# in favor of new tests in fusions_e2e. We avoid replicating the new jobs in this file as it's deprecated.
- pytest -v -s compile/test_full_graph.py -k 'not test_fp8_kv_scale_compile'
# Limit to no custom ops to reduce running time
# Wrap with quotes to escape yaml and avoid starting -k string with a -
- "pytest -v -s compile/test_fusions_e2e.py -k 'TRITON and not +quant_fp8 and not Llama-4'"

- label: Cudagraph test
timeout_in_minutes: 20
@@ -580,9 +511,8 @@ steps:
mirror_hardwares: [amdexperimental]
source_file_dependencies:
- csrc/attention/
- vllm/attention
- vllm/v1/attention
# TODO: remove this dependency (https://github.com/vllm-project/vllm/issues/32267)
- vllm/model_executor/layers/attention
- tests/kernels/attention
commands:
- pytest -v -s kernels/attention --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
@@ -624,75 +554,6 @@ steps:
commands:
- pytest -v -s kernels/mamba

- label: Kernels DeepGEMM Test (H100)
timeout_in_minutes: 45
gpu: h100
num_gpus: 1
source_file_dependencies:
- tools/install_deepgemm.sh
- vllm/utils/deep_gemm.py
- vllm/model_executor/layers/fused_moe
- vllm/model_executor/layers/quantization
- tests/kernels/quantization/test_block_fp8.py
- tests/kernels/moe/test_deepgemm.py
- tests/kernels/moe/test_batched_deepgemm.py
- tests/kernels/attention/test_deepgemm_attention.py
commands:
- pytest -v -s kernels/quantization/test_block_fp8.py -k deep_gemm
- pytest -v -s kernels/moe/test_deepgemm.py
- pytest -v -s kernels/moe/test_batched_deepgemm.py
- pytest -v -s kernels/attention/test_deepgemm_attention.py

- label: Kernels Helion Test
timeout_in_minutes: 30
gpu: h100
source_file_dependencies:
- vllm/utils/import_utils.py
- tests/kernels/helion/
commands:
- pip install helion
- pytest -v -s kernels/helion/


- label: Kernels FP8 MoE Test (1 H100)
timeout_in_minutes: 90
gpu: h100
num_gpus: 1
optional: true
commands:
- pytest -v -s kernels/moe/test_cutlass_moe.py
- pytest -v -s kernels/moe/test_flashinfer.py
- pytest -v -s kernels/moe/test_gpt_oss_triton_kernels.py
- pytest -v -s kernels/moe/test_modular_oai_triton_moe.py
- pytest -v -s kernels/moe/test_moe.py
# - pytest -v -s kernels/moe/test_block_fp8.py - failing on main
- pytest -v -s kernels/moe/test_block_int8.py
- pytest -v -s kernels/moe/test_triton_moe_no_act_mul.py
- pytest -v -s kernels/moe/test_triton_moe_ptpc_fp8.py

- label: Kernels FP8 MoE Test (2 H100s)
timeout_in_minutes: 90
gpu: h100
num_gpus: 2
optional: true
commands:
- pytest -v -s kernels/moe/test_deepep_deepgemm_moe.py
- pytest -v -s kernels/moe/test_deepep_moe.py
- pytest -v -s kernels/moe/test_pplx_cutlass_moe.py
# - pytest -v -s kernels/moe/test_pplx_moe.py - failing on main

- label: Kernels Fp4 MoE Test (B200)
timeout_in_minutes: 60
gpu: b200
num_gpus: 1
optional: true
commands:
- pytest -v -s kernels/moe/test_cutedsl_moe.py
- pytest -v -s kernels/moe/test_flashinfer_moe.py
- pytest -v -s kernels/moe/test_nvfp4_moe.py
- pytest -v -s kernels/moe/test_ocp_mx_moe.py


- label: Model Executor Test # 23min
timeout_in_minutes: 35
torch_nightly: true
@@ -742,8 +603,7 @@ steps:
# https://github.com/pytorch/ao/issues/2919, we'll have to skip new torchao tests for now
# we can only upgrade after this is resolved
# TODO(jerryzh168): resolve the above comment
- uv pip install --system torchao==0.14.1 --index-url https://download.pytorch.org/whl/cu129
- uv pip install --system conch-triton-kernels
- uv pip install --system torchao==0.13.0 --index-url https://download.pytorch.org/whl/cu129
- VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/ --ignore quantization/test_blackwell_moe.py

- label: LM Eval Small Models # 53min
@@ -754,7 +614,7 @@ steps:
- vllm/model_executor/layers/quantization
autorun_on_main: true
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt --tp-size=1

- label: OpenAI API correctness # 22min
timeout_in_minutes: 30
@@ -766,6 +626,25 @@ steps:
commands: # LMEval+Transcription WER check
- pytest -s entrypoints/openai/correctness/

- label: OpenAI-Compatible Tool Use # 23 min
timeout_in_minutes: 35
mirror_hardwares: [amdexperimental]
fast_check: false
source_file_dependencies:
- vllm/
- tests/tool_use
commands:
- pytest -v -s -m 'not cpu_test' tool_use

- label: OpenAI-Compatible Tool Use (CPU) # 5 mins
timeout_in_minutes: 10
source_file_dependencies:
- vllm/
- tests/tool_use
no_gpu: true
commands:
- pytest -v -s -m 'cpu_test' tool_use

##### models test #####

- label: Basic Models Tests (Initialization)
@@ -775,7 +654,6 @@ steps:
source_file_dependencies:
- vllm/
- tests/models/test_initialization.py
- tests/models/registry.py
commands:
# Run a subset of model initialization tests
- pytest -v -s models/test_initialization.py::test_can_initialize_small_subset
@@ -786,9 +664,7 @@ steps:
torch_nightly: true
source_file_dependencies:
- vllm/model_executor/models/
- vllm/transformers_utils/
- tests/models/test_initialization.py
- tests/models/registry.py
commands:
# Only when vLLM model source is modified - test initialization of a large
# subset of supported models (the complement of the small subset in the above
@@ -805,11 +681,10 @@ steps:
torch_nightly: true
source_file_dependencies:
- vllm/
- tests/models/test_terratorch.py
- tests/models/test_transformers.py
- tests/models/test_registry.py
commands:
- pytest -v -s models/test_terratorch.py models/test_transformers.py models/test_registry.py
- pytest -v -s models/test_transformers.py models/test_registry.py

- label: Basic Models Test (Other CPU) # 5min
timeout_in_minutes: 10
@@ -915,24 +790,14 @@ steps:
commands:
- pytest -v -s models/language/pooling_mteb_test

- label: Multi-Modal Processor Test (CPU)
timeout_in_minutes: 60
source_file_dependencies:
- vllm/
- tests/models/multimodal
no_gpu: true
commands:
- "pip install git+https://github.com/TIGER-AI-Lab/Mantis.git || echo 'Mantis installation skipped (decord not available on CPU-only environment)'"
- pytest -v -s models/multimodal/processing --ignore models/multimodal/processing/test_tensor_schema.py

- label: Multi-Modal Processor Test
- label: Multi-Modal Processor Test # 44min
timeout_in_minutes: 60
source_file_dependencies:
- vllm/
- tests/models/multimodal
commands:
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/multimodal/processing/test_tensor_schema.py
- pytest -v -s models/multimodal/processing

- label: Multi-Modal Models Test (Standard) # 60min
timeout_in_minutes: 80
@@ -1009,35 +874,35 @@ steps:
- label: Transformers Nightly Models Test
working_dir: "/vllm-workspace/"
optional: true
soft_fail: true
commands:
- pip install --upgrade git+https://github.com/huggingface/transformers
- pytest -v -s tests/models/test_initialization.py
- pytest -v -s tests/models/test_initialization.py -k 'not (Gemma3 or ModernBert or Qwen2_5_VL or Qwen2_5vl or Qwen2VL or TransformersMultiModalEmbeddingModel or TransformersMultiModalForSequenceClassification or Ultravox or Phi4Multimodal or LlavaNextVideo or MiniCPMO or Lfm2Moe or PaliGemma or RobertaForSequenceClassification or Ovis2_5 or Fuyu or DeepseekOCR or KimiVL)'
- pytest -v -s tests/models/test_transformers.py
- pytest -v -s tests/models/multimodal/processing/
- pytest -v -s tests/models/multimodal/test_mapping.py
# - pytest -v -s tests/models/multimodal/processing/
- pytest -v -s tests/models/multimodal/test_mapping.py -k 'not (Gemma3 or Qwen2VL or Qwen2_5_VL)'
- python3 examples/offline_inference/basic/chat.py
- python3 examples/offline_inference/vision_language.py --model-type qwen2_5_vl
# - python3 examples/offline_inference/vision_language.py --model-type qwen2_5_vl
# Whisper needs spawn method to avoid deadlock
- VLLM_WORKER_MULTIPROC_METHOD=spawn python3 examples/offline_inference/audio_language.py --model-type whisper

- label: Blackwell Test # 23 min
- label: Blackwell Test # 21 min
timeout_in_minutes: 30
working_dir: "/vllm-workspace/"
gpu: b200
# optional: true
source_file_dependencies:
- csrc/quantization/fp4/
- csrc/attention/mla/
- csrc/quantization/cutlass_w8a8/moe/
- vllm/model_executor/layers/fused_moe/cutlass_moe.py
- vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py
- vllm/model_executor/layers/fused_moe/flashinfer_a2a_prepare_finalize.py
- vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
- vllm/v1/attention/backends/flashinfer.py
- vllm/v1/attention/backends/mla/cutlass_mla.py
- vllm/v1/attention/backends/mla/flashinfer_mla.py
- vllm/v1/attention/selector.py
- vllm/platforms/cuda.py
- vllm/attention/selector.py
commands:
- nvidia-smi
- python3 examples/offline_inference/basic/chat.py
@@ -1060,9 +925,6 @@ steps:
- pytest -v -s tests/kernels/moe/test_nvfp4_moe.py
- pytest -v -s tests/kernels/moe/test_ocp_mx_moe.py
- pytest -v -s tests/kernels/moe/test_flashinfer.py
- pytest -v -s tests/kernels/moe/test_cutedsl_moe.py
# e2e
- pytest -v -s tests/models/quantization/test_nvfp4.py

- label: Blackwell Fusion and Compile Tests # 30 min
timeout_in_minutes: 40
@@ -1072,31 +934,44 @@ steps:
- csrc/quantization/fp4/
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
- vllm/v1/attention/backends/flashinfer.py
- vllm/v1/worker/
- vllm/v1/cudagraph_dispatcher.py
- vllm/compilation/
# can affect pattern matching
- vllm/model_executor/layers/layernorm.py
- vllm/model_executor/layers/activation.py
- vllm/model_executor/layers/quantization/input_quant_fp8.py
- tests/compile/test_fusion_attn.py
- tests/compile/test_silu_mul_quant_fusion.py
- tests/compile/passes/distributed/test_fusion_all_reduce.py
- tests/compile/fullgraph/test_full_graph.py
commands:
- nvidia-smi
- pytest -v -s tests/compile/test_fusion_attn.py
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
# this runner has 2 GPUs available even though num_gpus=2 is not set
- pytest -v -s tests/compile/passes/distributed/test_fusion_all_reduce.py
# # Limit to Inductor partition, no custom ops, and allreduce & attn fusion to reduce running time
# # Wrap with quotes to escape yaml
# - "pytest -v -s tests/compile/distributed/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm -k 'True and not +quant_fp8 and not +rms_norm'"
# Old E2E tests were removed in https://github.com/vllm-project/vllm/pull/33293
# in favor of new tests in fusions_e2e. We avoid replicating the new jobs in this file as it's deprecated.

- pytest -v -s tests/compile/test_fusion_all_reduce.py
# Limit to Inductor partition, no custom ops, and allreduce & attn fusion to reduce running time
# Wrap with quotes to escape yaml
- "pytest -v -s tests/compile/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm -k 'True and not +quant_fp8 and not +rms_norm'"
# test_fp8_kv_scale_compile requires FlashAttention (not supported on default L4/L40)
- pytest -v -s tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile
- pytest -v -s tests/compile/test_full_graph.py::test_fp8_kv_scale_compile

- label: Blackwell Fusion E2E Tests # 30 min
timeout_in_minutes: 40
working_dir: "/vllm-workspace/"
gpu: b200
optional: true
num_gpus: 2
source_file_dependencies:
- csrc/quantization/fp4/
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
- vllm/v1/attention/backends/flashinfer.py
- vllm/compilation/
# can affect pattern matching
- vllm/model_executor/layers/layernorm.py
- vllm/model_executor/layers/activation.py
- vllm/model_executor/layers/quantization/input_quant_fp8.py
- tests/compile/test_fusions_e2e.py
- tests/compile/test_full_graph.py
commands:
- nvidia-smi
# Run all e2e fusion tests
- pytest -v -s tests/compile/test_fusions_e2e.py

- label: Blackwell GPT-OSS Eval
timeout_in_minutes: 60
@@ -1137,7 +1012,7 @@ steps:
- csrc/
- vllm/model_executor/layers/quantization
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-blackwell.txt
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-blackwell.txt --tp-size=1

##### 1 GPU test #####
##### multi gpus test #####
@@ -1155,8 +1030,6 @@ steps:
- pytest -v -s distributed/test_shm_broadcast.py
- pytest -v -s distributed/test_shm_buffer.py
- pytest -v -s distributed/test_shm_storage.py
- pytest -v -s distributed/test_packed_tensor.py
- pytest -v -s distributed/test_weight_transfer.py

- label: 2 Node Tests (4 GPUs in total) # 16min
timeout_in_minutes: 30
@@ -1171,18 +1044,17 @@ steps:
- vllm/model_executor/models/
- tests/distributed/
- tests/examples/offline_inference/data_parallel.py
- .buildkite/scripts/run-multi-node-test.sh
commands:
- # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
- python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=0 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code
- python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=0 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
- # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
- python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=1 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code
- python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=1 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code
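The data_parallel.py invocations above show a flag rename between the old and new commands. A sketch of the new-style two-node launch, with the mapping spelled out (taken directly from the paired lines above; nothing beyond them is assumed):

# Old flag            -> new flag
#   -dp / -tp            -> --dp-size / --tp-size
#   --dp-num-nodes       -> --node-size
#   --dp-node-rank       -> --node-rank
#   --dp-master-addr     -> --master-addr
#   --dp-master-port     -> --master-port
# Node 0 (192.168.10.10); node 1 runs the same command with --node-rank=1:
python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 \
    --node-size=2 --node-rank=0 --master-addr=192.168.10.10 --master-port=12345 \
    --enforce-eager --trust-remote-code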

- label: Distributed Tests (2 GPUs) # 68min
timeout_in_minutes: 90
@@ -1197,7 +1069,7 @@ steps:
- vllm/worker/worker_base.py
- vllm/v1/engine/
- vllm/v1/worker/
- tests/compile/fullgraph/test_basic_correctness.py
- tests/compile/test_basic_correctness.py
- tests/compile/test_wrapper.py
- tests/distributed/
- tests/entrypoints/llm/test_collective_rpc.py
@@ -1209,11 +1081,10 @@ steps:
# https://github.com/NVIDIA/nccl/issues/1838
- export NCCL_CUMEM_HOST_ENABLE=0
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_async_llm_dp.py
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_eagle_dp.py
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_external_lb_dp.py
- DP_SIZE=2 pytest -v -s v1/entrypoints/openai/test_multi_api_servers.py
- pytest -v -s entrypoints/llm/test_collective_rpc.py
- pytest -v -s ./compile/fullgraph/test_basic_correctness.py
- pytest -v -s ./compile/test_basic_correctness.py
- pytest -v -s ./compile/test_wrapper.py
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
- VLLM_TEST_SAME_HOST=1 VLLM_TEST_WITH_DEFAULT_DEVICE_SET=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
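The `torchrun ... | grep 'Same node test passed'` idiom above turns an output check into the step's pass/fail signal: assuming the CI shell runs without `set -o pipefail` (the usual default), the pipeline's exit status is grep's, which is non-zero whenever the success line never appears. A minimal sketch of the mechanism:

# Passes only if the success line is printed; a run that exits 0 without
# printing it still fails the step via grep's exit status of 1.
torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'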
@@ -1271,7 +1142,7 @@ steps:
- pytest -v -s distributed/test_distributed_oot.py
- pytest -v -s entrypoints/openai/test_oot_registration.py # it needs a clean process
- pytest -v -s models/test_oot_registration.py # it needs a clean process
- pytest -v -s plugins/lora_resolvers # unit tests for lora resolver plugins
- pytest -v -s plugins/lora_resolvers # unit tests for in-tree lora resolver plugins

- label: Pipeline + Context Parallelism Test # 45min
timeout_in_minutes: 60
@@ -1299,8 +1170,6 @@ steps:
# FIXIT: find out which code initializes cuda before running the test
# before the fix, we need to use spawn to test it
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
# A lot of these tests are on the edge of OOMing
- export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# There is some Tensor Parallelism related processing logic in LoRA that
# requires multi-GPU testing for validation.
- pytest -v -s -x lora/test_chatglm3_tp.py
@@ -1334,8 +1203,8 @@ steps:
commands:
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt

- label: NixlConnector PD accuracy tests (Distributed) # 40min
timeout_in_minutes: 40
- label: NixlConnector PD accuracy tests (Distributed) # 30min
timeout_in_minutes: 30
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
@@ -1343,18 +1212,7 @@ steps:
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
- bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh

- label: DP EP NixlConnector PD accuracy tests (Distributed) # 15min
timeout_in_minutes: 15
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
- vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
- DP_EP=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh
- bash v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh


##### multi gpus test #####
@@ -1374,20 +1232,6 @@ steps:
- TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
- pytest -v -s -x lora/test_mixtral.py

- label: Acceptance Length Test (Large Models) # optional
timeout_in_minutes: 120
gpu: h100
optional: true
num_gpus: 1
working_dir: "/vllm-workspace/tests"
source_file_dependencies:
- vllm/v1/spec_decode/
- vllm/model_executor/models/mlp_speculator.py
- tests/v1/spec_decode/test_acceptance_length.py
commands:
- export VLLM_ALLOW_INSECURE_SERIALIZATION=1
- pytest -v -s v1/spec_decode/test_acceptance_length.py -m slow_test

- label: LM Eval Large Models # optional
gpu: a100
optional: true
@@ -1413,38 +1257,21 @@ steps:
- export VLLM_USE_DEEP_GEMM=0 # We found Triton is faster than DeepGEMM for H100
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large-hopper.txt --tp-size=4

- label: Sequence Parallel Tests (H100) # 60 min
timeout_in_minutes: 60
working_dir: "/vllm-workspace/"
gpu: h100
optional: true
num_gpus: 2
commands:
- export VLLM_TEST_CLEAN_GPU_MEMORY=1
# Run sequence parallel tests
- pytest -v -s tests/compile/correctness_e2e/test_sequence_parallel.py
- pytest -v -s tests/compile/passes/distributed/test_sequence_parallelism.py

- label: Distributed Tests (H100) # optional
gpu: h100
optional: true
working_dir: "/vllm-workspace/"
num_gpus: 2
commands:
- VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/passes/distributed/test_async_tp.py
- pytest -v -s tests/distributed/test_context_parallel.py
- VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput
- pytest -v -s tests/v1/distributed/test_dbo.py

##### H200 test #####

- label: LM Eval Large Models (H200) # optional
timeout_in_minutes: 60
- label: Distributed Tests (H200) # optional
gpu: h200
optional: true
num_gpus: 8
working_dir: "/vllm-workspace/"
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-h200.txt
- pytest -v -s tests/compile/test_async_tp.py
- pytest -v -s tests/compile/test_sequence_parallelism.py
- pytest -v -s tests/compile/test_fusion_all_reduce.py
- "pytest -v -s tests/compile/test_fusions_e2e.py -k 'not Llama-4'"
- pytest -v -s tests/distributed/test_sequence_parallel.py
- pytest -v -s tests/distributed/test_context_parallel.py
- CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
- pytest -v -s tests/v1/distributed/test_dbo.py

##### B200 test #####
- label: Distributed Tests (B200) # optional
@@ -1461,14 +1288,12 @@ steps:
- label: Prime-RL Integration Test # 15min
timeout_in_minutes: 30
optional: true
soft_fail: true
num_gpus: 2
working_dir: "/vllm-workspace"
source_file_dependencies:
- vllm/
- .buildkite/scripts/run-prime-rl-test.sh
commands:
- nvidia-smi
- bash .buildkite/scripts/run-prime-rl-test.sh

- label: DeepSeek V2-Lite Accuracy
@@ -1480,43 +1305,11 @@ steps:
commands:
- bash .buildkite/scripts/scheduled_integration_test/deepseek_v2_lite_ep_eplb.sh 0.25 200 8010

- label: Qwen3-30B-A3B-FP8-block Accuracy (H100)
- label: Qwen3-30B-A3B-FP8-block Accuracy
timeout_in_minutes: 60
gpu: h100
optional: true
num_gpus: 4
working_dir: "/vllm-workspace"
commands:
- bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh 0.8 200 8020

- label: Qwen3-30B-A3B-FP8-block Accuracy (B200)
timeout_in_minutes: 60
gpu: b200
optional: true
num_gpus: 2
working_dir: "/vllm-workspace"
commands:
- bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh 0.8 200 8020 2 1

##### MoE Refactor (Temporary) Tests #####

- label: MoE Refactor Integration Test (H100 - TEMPORARY) # optional
gpu: h100
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-h100.txt

- label: MoE Refactor Integration Test (B200 - TEMPORARY) # optional
gpu: b200
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-b200.txt

- label: MoE Refactor Integration Test (B200 DP - TEMPORARY) # optional
gpu: b200
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor-dp-ep/config-b200.txt
- bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep.sh 0.8 200 8020

@@ -1,25 +0,0 @@
group: Attention
depends_on:
- image-build
steps:
- label: V1 attention (H100)
timeout_in_minutes: 30
device: h100
source_file_dependencies:
- vllm/config/attention.py
- vllm/model_executor/layers/attention
- vllm/v1/attention
- tests/v1/attention
commands:
- pytest -v -s v1/attention

- label: V1 attention (B200)
timeout_in_minutes: 30
device: b200
source_file_dependencies:
- vllm/config/attention.py
- vllm/model_executor/layers/attention
- vllm/v1/attention
- tests/v1/attention
commands:
- pytest -v -s v1/attention
@@ -1,16 +0,0 @@
group: Basic Correctness
depends_on:
- image-build
steps:
- label: Basic Correctness
timeout_in_minutes: 30
source_file_dependencies:
- vllm/
- tests/basic_correctness/test_basic_correctness
- tests/basic_correctness/test_cpu_offload
- tests/basic_correctness/test_cumem.py
commands:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s basic_correctness/test_cumem.py
- pytest -v -s basic_correctness/test_basic_correctness.py
- pytest -v -s basic_correctness/test_cpu_offload.py
@@ -1,19 +0,0 @@
group: Benchmarks
depends_on:
- image-build
steps:
- label: Benchmarks
timeout_in_minutes: 20
working_dir: "/vllm-workspace/.buildkite"
source_file_dependencies:
- benchmarks/
commands:
- bash scripts/run-benchmarks.sh

- label: Benchmarks CLI Test
timeout_in_minutes: 20
source_file_dependencies:
- vllm/
- tests/benchmarks/
commands:
- pytest -v -s benchmarks/
@@ -1,203 +0,0 @@
group: Compile
depends_on:
- image-build
steps:
- label: Sequence Parallel Correctness Tests (2 GPUs)
timeout_in_minutes: 50
working_dir: "/vllm-workspace/"
num_devices: 2
source_file_dependencies:
- vllm/model_executor/layers/
- vllm/compilation/
- vllm/v1/worker/
- vllm/v1/cudagraph_dispatcher.py
- tests/compile/correctness_e2e/test_sequence_parallel.py
commands:
- export VLLM_TEST_CLEAN_GPU_MEMORY=1
- pytest -v -s tests/compile/correctness_e2e/test_sequence_parallel.py

- label: Sequence Parallel Correctness Tests (2xH100)
timeout_in_minutes: 50
working_dir: "/vllm-workspace/"
device: h100
optional: true
num_devices: 2
commands:
- export VLLM_TEST_CLEAN_GPU_MEMORY=1
- pytest -v -s tests/compile/correctness_e2e/test_sequence_parallel.py

- label: AsyncTP Correctness Tests (2xH100)
timeout_in_minutes: 50
working_dir: "/vllm-workspace/"
device: h100
optional: true
num_devices: 2
commands:
- export VLLM_TEST_CLEAN_GPU_MEMORY=1
- pytest -v -s tests/compile/correctness_e2e/test_async_tp.py

- label: Distributed Compile Unit Tests (2xH100)
timeout_in_minutes: 20
working_dir: "/vllm-workspace/"
device: h100
num_devices: 2
source_file_dependencies:
- vllm/compilation/
- vllm/model_executor/layers
- tests/compile/passes/distributed/
commands:
- export VLLM_TEST_CLEAN_GPU_MEMORY=1
- pytest -s -v tests/compile/passes/distributed

- label: Fusion and Compile Unit Tests (B200)
timeout_in_minutes: 20
working_dir: "/vllm-workspace/"
device: b200
source_file_dependencies:
- csrc/quantization/fp4/
- vllm/model_executor/layers/quantization/
- vllm/model_executor/layers/layernorm.py
- vllm/model_executor/layers/activation.py
- vllm/model_executor/layers/attention/attention.py
- vllm/v1/attention/backends/flashinfer.py
- vllm/compilation/ # TODO(luka) limit to vllm/compilation/passes
- tests/compile/passes/test_fusion_attn.py
- tests/compile/passes/test_silu_mul_quant_fusion.py
- tests/compile/passes/distributed/test_fusion_all_reduce.py
- tests/compile/fullgraph/test_full_graph.py
commands:
# b200 runners are limited, so we limit the tests to the minimal set that is only supported on Blackwell
- nvidia-smi
- pytest -v -s tests/compile/passes/test_fusion_attn.py -k FLASHINFER
- pytest -v -s tests/compile/passes/test_silu_mul_quant_fusion.py
# this runner has 2 GPUs available even though num_devices=2 is not set
- pytest -v -s tests/compile/passes/distributed/test_fusion_all_reduce.py
# test_fp8_kv_scale_compile requires FlashAttention (not supported on default L4/L40)
# TODO(luka) move to H100 once pass tests run on H100
- pytest -v -s tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile

- label: Fusion E2E Quick (H100)
timeout_in_minutes: 15
working_dir: "/vllm-workspace/"
device: h100
num_devices: 1
source_file_dependencies:
- csrc/quantization/
- vllm/model_executor/
- vllm/v1/attention/
- vllm/compilation/
- tests/compile/fusions_e2e/
commands:
- nvidia-smi
# Run all models and attn backends but only Inductor partition and native custom ops
- pytest -v -s tests/compile/fusions_e2e/test_tp1_quant.py -k "inductor_partition and not +rms_norm and not +quant_fp8"
# Qwen requires +quant_fp8 as -quant_fp8 rms+quant fusion is not supported
- pytest -v -s tests/compile/fusions_e2e/test_tp1_quant.py -k "inductor_partition and not +rms_norm and +quant_fp8 and qwen3"

- label: Fusion E2E Config Sweep (H100)
timeout_in_minutes: 30
working_dir: "/vllm-workspace/"
device: h100
num_devices: 1
source_file_dependencies:
- csrc/quantization/
- vllm/compilation/
# can affect pattern matching
- vllm/model_executor/layers/layernorm.py
- vllm/model_executor/layers/activation.py
- vllm/model_executor/layers/attention/attention.py
- vllm/model_executor/layers/quantization/input_quant_fp8.py
- tests/compile/fusions_e2e/
commands:
- nvidia-smi
# Run just llama3 (fp8) for all config combinations
- pytest -v -s tests/compile/fusions_e2e/test_tp1_quant.py -k "llama-3"

- label: Fusion E2E Config Sweep (B200)
timeout_in_minutes: 30
working_dir: "/vllm-workspace/"
device: b200
num_devices: 1
optional: true
commands:
- nvidia-smi
# Run all models and attn backends but only Inductor partition and native custom ops
# -k "inductor_partition and not +rms_norm and not +quant_fp8"
# Qwen requires +quant_fp8 as -quant_fp8 rms+quant fusion is not supported
# -k "inductor_partition and not +rms_norm and +quant_fp8 and qwen3"
# Run just llama3 (fp8 & fp4) for all config combinations
# -k "llama-3"
- pytest -v -s tests/compile/fusions_e2e/test_tp1_quant.py -k "inductor_partition and not +rms_norm and not +quant_fp8" -k "inductor_partition and not +rms_norm and +quant_fp8 and qwen3" -k "llama-3"
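One caveat about the combined command above: pytest stores a single `-k` expression, so when `-k` is repeated the last one wins rather than all three applying (this reading of pytest's option handling is an assumption; the three commented expressions above suggest three separate selections were intended). A sketch of running them independently:

# Hypothetical split of the combined command into one run per -k expression:
pytest -v -s tests/compile/fusions_e2e/test_tp1_quant.py -k "inductor_partition and not +rms_norm and not +quant_fp8"
pytest -v -s tests/compile/fusions_e2e/test_tp1_quant.py -k "inductor_partition and not +rms_norm and +quant_fp8 and qwen3"
pytest -v -s tests/compile/fusions_e2e/test_tp1_quant.py -k "llama-3"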

- label: Fusion E2E TP2 Quick (H100)
timeout_in_minutes: 20
working_dir: "/vllm-workspace/"
device: h100
num_devices: 2
source_file_dependencies:
- csrc/quantization/
- vllm/model_executor/
- vllm/v1/attention/
- vllm/compilation/
- tests/compile/fusions_e2e/
commands:
- nvidia-smi
# Run all models and attn backends but only Inductor partition and native custom ops
- pytest -v -s tests/compile/fusions_e2e/test_tp2_ar_rms.py -k "inductor_partition and not +rms_norm and not +quant_fp8"
- pytest -v -s tests/compile/fusions_e2e/test_tp2_async_tp.py -k "inductor_partition and not +rms_norm and not +quant_fp8"

- label: Fusion E2E TP2 AR-RMS Config Sweep (H100)
timeout_in_minutes: 40
working_dir: "/vllm-workspace/"
device: h100
num_devices: 2
source_file_dependencies:
- csrc/quantization/
- vllm/compilation/
# can affect pattern matching
- vllm/model_executor/layers/layernorm.py
- vllm/model_executor/layers/activation.py
- vllm/model_executor/layers/attention/attention.py
- vllm/model_executor/layers/quantization/input_quant_fp8.py
- tests/compile/fusions_e2e/
commands:
- nvidia-smi
# Run just llama3 (fp4 & fp8 & bf16) for all config combinations
- pytest -v -s tests/compile/fusions_e2e/test_tp2_ar_rms.py -k "llama-3"

- label: Fusion E2E TP2 AsyncTP Config Sweep (H100)
timeout_in_minutes: 40
working_dir: "/vllm-workspace/"
device: h100
num_devices: 2
source_file_dependencies:
- csrc/quantization/
- vllm/compilation/
# can affect pattern matching
- vllm/model_executor/layers/layernorm.py
- vllm/model_executor/layers/activation.py
- vllm/model_executor/layers/attention/attention.py
- vllm/model_executor/layers/quantization/input_quant_fp8.py
- tests/compile/fusions_e2e/
commands:
- nvidia-smi
# Run just llama3 (fp8 & bf16) for all config combinations
- pytest -v -s tests/compile/fusions_e2e/test_tp2_async_tp.py -k "llama-3"

- label: Fusion E2E TP2 (B200)
timeout_in_minutes: 20
working_dir: "/vllm-workspace/"
device: b200
num_devices: 2
source_file_dependencies:
- csrc/quantization/
- vllm/model_executor/
- vllm/v1/attention/
- vllm/compilation/
- tests/compile/fusions_e2e/
commands:
- nvidia-smi
# Run all models and attn backends but only Inductor partition and native custom ops
# for ar-rms-quant-fp4, also sweep llama3
- pytest -v -s tests/compile/fusions_e2e/test_tp2_ar_rms.py -k "inductor_partition and not +rms_norm and not +quant_fp8" -k "Llama-3.1-8B-Instruct-FP4"
- pytest -v -s tests/compile/fusions_e2e/test_tp2_async_tp.py -k "inductor_partition and not +rms_norm and not +quant_fp8"
@@ -1,23 +0,0 @@
group: CUDA
depends_on:
- image-build
steps:
- label: Platform Tests (CUDA)
timeout_in_minutes: 15
source_file_dependencies:
- vllm/
- tests/cuda
commands:
- pytest -v -s cuda/test_cuda_context.py
- pytest -v -s cuda/test_platform_no_cuda_init.py

- label: Cudagraph
timeout_in_minutes: 20
source_file_dependencies:
- tests/v1/cudagraph
- vllm/v1/cudagraph_dispatcher.py
- vllm/config/compilation.py
- vllm/compilation
commands:
- pytest -v -s v1/cudagraph/test_cudagraph_dispatch.py
- pytest -v -s v1/cudagraph/test_cudagraph_mode.py
@@ -1,212 +0,0 @@
group: Distributed
depends_on:
- image-build
steps:
- label: Distributed Comm Ops
timeout_in_minutes: 20
working_dir: "/vllm-workspace/tests"
num_devices: 2
source_file_dependencies:
- vllm/distributed
- tests/distributed
commands:
- pytest -v -s distributed/test_comm_ops.py
- pytest -v -s distributed/test_shm_broadcast.py
- pytest -v -s distributed/test_shm_buffer.py
- pytest -v -s distributed/test_shm_storage.py

- label: Distributed (2 GPUs)
timeout_in_minutes: 60
working_dir: "/vllm-workspace/tests"
num_devices: 2
source_file_dependencies:
- vllm/compilation/
- vllm/distributed/
- vllm/engine/
- vllm/executor/
- vllm/worker/worker_base.py
- vllm/v1/engine/
- vllm/v1/worker/
- tests/compile/fullgraph/test_basic_correctness.py
- tests/compile/test_wrapper.py
- tests/distributed/
- tests/entrypoints/llm/test_collective_rpc.py
- tests/v1/distributed
- tests/v1/entrypoints/openai/test_multi_api_servers.py
- tests/v1/shutdown
- tests/v1/worker/test_worker_memory_snapshot.py
commands:
# https://github.com/NVIDIA/nccl/issues/1838
- export NCCL_CUMEM_HOST_ENABLE=0
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_async_llm_dp.py
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_eagle_dp.py
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_external_lb_dp.py
- DP_SIZE=2 pytest -v -s v1/entrypoints/openai/test_multi_api_servers.py
- pytest -v -s entrypoints/llm/test_collective_rpc.py
- pytest -v -s ./compile/fullgraph/test_basic_correctness.py
- pytest -v -s ./compile/test_wrapper.py
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
- VLLM_TEST_SAME_HOST=1 VLLM_TEST_WITH_DEFAULT_DEVICE_SET=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown
- pytest -v -s v1/worker/test_worker_memory_snapshot.py

- label: Distributed Tests (4 GPUs)
timeout_in_minutes: 50
working_dir: "/vllm-workspace/tests"
num_devices: 4
source_file_dependencies:
- vllm/distributed/
- tests/distributed/test_utils
- tests/distributed/test_pynccl
- tests/distributed/test_events
- tests/compile/fullgraph/test_basic_correctness.py
- examples/offline_inference/rlhf.py
- examples/offline_inference/rlhf_colocate.py
- examples/offline_inference/new_weight_syncing/
- tests/examples/offline_inference/data_parallel.py
- tests/v1/distributed
- tests/v1/engine/test_engine_core_client.py
- tests/distributed/test_symm_mem_allreduce.py
commands:
# https://github.com/NVIDIA/nccl/issues/1838
- export NCCL_CUMEM_HOST_ENABLE=0
# test with torchrun tp=2 and external_dp=2
- torchrun --nproc-per-node=4 distributed/test_torchrun_example.py
# test with torchrun tp=2 and pp=2
- PP_SIZE=2 torchrun --nproc-per-node=4 distributed/test_torchrun_example.py
# test with torchrun tp=4 and dp=1
- TP_SIZE=4 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py
# test with torchrun tp=2, pp=2 and dp=1
- PP_SIZE=2 TP_SIZE=2 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py
# test with torchrun tp=1 and dp=4 with ep
- DP_SIZE=4 ENABLE_EP=1 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py
# test with torchrun tp=2 and dp=2 with ep
- TP_SIZE=2 DP_SIZE=2 ENABLE_EP=1 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py
# test with internal dp
- python3 ../examples/offline_inference/data_parallel.py --enforce-eager
- TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/distributed/test_async_llm_dp.py
- TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/distributed/test_eagle_dp.py
- TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/distributed/test_external_lb_dp.py
- TP_SIZE=1 DP_SIZE=4 pytest -v -s v1/distributed/test_internal_lb_dp.py
- TP_SIZE=1 DP_SIZE=4 pytest -v -s v1/distributed/test_hybrid_lb_dp.py
- pytest -v -s v1/engine/test_engine_core_client.py::test_kv_cache_events_dp
- pytest -v -s distributed/test_utils.py
- pytest -v -s compile/fullgraph/test_basic_correctness.py
- pytest -v -s distributed/test_pynccl.py
- pytest -v -s distributed/test_events.py
- pytest -v -s distributed/test_symm_mem_allreduce.py
# TODO: create a dedicated test section for multi-GPU example tests
# when we have multiple distributed example tests
# OLD rlhf examples
- cd ../examples/offline_inference
- VLLM_ALLOW_INSECURE_SERIALIZATION=1 python3 rlhf.py
- VLLM_ALLOW_INSECURE_SERIALIZATION=1 RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py
# NEW rlhf examples
- cd new_weight_syncing
- VLLM_ALLOW_INSECURE_SERIALIZATION=1 python3 rlhf.py
- VLLM_ALLOW_INSECURE_SERIALIZATION=1 python3 rlhf_async_new_apis.py

- label: Distributed Tests (8 GPUs)(H100)
timeout_in_minutes: 10
device: h100
num_devices: 8
working_dir: "/vllm-workspace/tests"
source_file_dependencies:
- examples/offline_inference/torchrun_dp_example.py
- vllm/config/parallel.py
- vllm/distributed/
- vllm/v1/engine/llm_engine.py
- vllm/v1/executor/uniproc_executor.py
- vllm/v1/worker/gpu_worker.py
commands:
# https://github.com/NVIDIA/nccl/issues/1838
- export NCCL_CUMEM_HOST_ENABLE=0
# test with torchrun tp=2 and dp=4 with ep
- torchrun --nproc-per-node=8 ../examples/offline_inference/torchrun_dp_example.py --tp-size=2 --pp-size=1 --dp-size=4 --enable-ep
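As a sanity check on the launch above (assuming the example maps one process per rank, so world size = tp x pp x dp, and that expert parallelism regroups existing ranks rather than adding processes):

# 2 (tp) * 1 (pp) * 4 (dp) = 8 ranks, matching --nproc-per-node=8.
torchrun --nproc-per-node=8 ../examples/offline_inference/torchrun_dp_example.py \
    --tp-size=2 --pp-size=1 --dp-size=4 --enable-ep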

- label: Distributed Tests (4 GPUs)(A100)
device: a100
optional: true
num_devices: 4
source_file_dependencies:
- vllm/
commands:
# NOTE: don't test llama model here, it seems hf implementation is buggy
# see https://github.com/vllm-project/vllm/pull/5689 for details
- pytest -v -s distributed/test_custom_all_reduce.py
- torchrun --nproc_per_node=2 distributed/test_ca_buffer_sharing.py
- TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
- pytest -v -s -x lora/test_mixtral.py

- label: Distributed Tests (2 GPUs)(H100)
timeout_in_minutes: 15
device: h100
optional: true
working_dir: "/vllm-workspace/"
num_devices: 2
commands:
- pytest -v -s tests/distributed/test_context_parallel.py
- VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput
- pytest -v -s tests/v1/distributed/test_dbo.py

- label: Distributed Tests (2 GPUs)(B200)
device: b200
optional: true
working_dir: "/vllm-workspace/"
num_devices: 2
commands:
- pytest -v -s tests/distributed/test_context_parallel.py
- pytest -v -s tests/distributed/test_nccl_symm_mem_allreduce.py
- pytest -v -s tests/v1/distributed/test_dbo.py

- label: 2 Node Test (4 GPUs)
timeout_in_minutes: 30
working_dir: "/vllm-workspace/tests"
num_devices: 2
num_nodes: 2
no_plugin: true
source_file_dependencies:
- vllm/distributed/
- vllm/engine/
- vllm/executor/
- vllm/model_executor/models/
- tests/distributed/
- tests/examples/offline_inference/data_parallel.py
commands:
- ./.buildkite/scripts/run-multi-node-test.sh /vllm-workspace/tests 2 2 $IMAGE_TAG "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=0 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py" "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=1 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code"
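The run-multi-node-test.sh invocation above packs both nodes' command strings onto one line. A sketch of its shape, with argument roles inferred from the values (the script's interface is not documented in this file, so treat the annotations as assumptions, and NODE0_COMMANDS/NODE1_COMMANDS as hypothetical variables holding the two quoted strings):

# Positional arguments, as inferred from the command above:
#   1: /vllm-workspace/tests  - working directory on each node
#   2: 2                      - number of nodes
#   3: 2                      - GPUs per node
#   4: $IMAGE_TAG             - container image for every node
#   5: node-0 command string  - same-node check && node-count check &&
#                               data_parallel.py --dp-node-rank=0 && multi-node pytest suites
#   6: node-1 command string  - same checks with --dp-node-rank=1 (no pytest suites)
./.buildkite/scripts/run-multi-node-test.sh /vllm-workspace/tests 2 2 "$IMAGE_TAG" \
    "$NODE0_COMMANDS" "$NODE1_COMMANDS"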
|
||||
|
||||
- label: Distributed NixlConnector PD accuracy (4 GPUs)
|
||||
timeout_in_minutes: 30
|
||||
working_dir: "/vllm-workspace/tests"
|
||||
num_devices: 4
|
||||
source_file_dependencies:
|
||||
- vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
|
||||
- tests/v1/kv_connector/nixl_integration/
|
||||
commands:
|
||||
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
|
||||
- bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh
|
||||
|
||||
- label: DP EP Distributed NixlConnector PD accuracy tests (4 GPUs)
|
||||
timeout_in_minutes: 30
|
||||
working_dir: "/vllm-workspace/tests"
|
||||
num_devices: 4
|
||||
source_file_dependencies:
|
||||
- vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
|
||||
- tests/v1/kv_connector/nixl_integration/
|
||||
commands:
|
||||
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
|
||||
- DP_EP=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh
|
||||
|
||||
- label: Pipeline + Context Parallelism (4 GPUs))
|
||||
timeout_in_minutes: 60
|
||||
working_dir: "/vllm-workspace/tests"
|
||||
num_devices: 4
|
||||
source_file_dependencies:
|
||||
- vllm/distributed/
|
||||
- vllm/engine/
|
||||
- vllm/executor/
|
||||
- vllm/model_executor/models/
|
||||
- tests/distributed/
|
||||
commands:
|
||||
- pytest -v -s distributed/test_pp_cudagraph.py
|
||||
- pytest -v -s distributed/test_pipeline_parallel.py
|
||||
@@ -1,43 +0,0 @@
|
||||
group: E2E Integration
|
||||
depends_on:
|
||||
- image-build
|
||||
steps:
|
||||
- label: DeepSeek V2-Lite Accuracy
|
||||
timeout_in_minutes: 60
|
||||
device: h100
|
||||
optional: true
|
||||
num_devices: 4
|
||||
working_dir: "/vllm-workspace"
|
||||
commands:
|
||||
- bash .buildkite/scripts/scheduled_integration_test/deepseek_v2_lite_ep_eplb.sh 0.25 200 8010
|
||||
|
||||
- label: Qwen3-30B-A3B-FP8-block Accuracy
|
||||
timeout_in_minutes: 60
|
||||
device: h100
|
||||
optional: true
|
||||
num_devices: 4
|
||||
working_dir: "/vllm-workspace"
|
||||
commands:
|
||||
- bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh 0.8 200 8020
|
||||
|
||||
- label: Qwen3-30B-A3B-FP8-block Accuracy (B200)
|
||||
timeout_in_minutes: 60
|
||||
device: b200
|
||||
optional: true
|
||||
num_devices: 2
|
||||
working_dir: "/vllm-workspace"
|
||||
commands:
|
||||
- bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh 0.8 200 8020 2 1
|
||||
|
||||
- label: Prime-RL Integration (2 GPUs)
|
||||
timeout_in_minutes: 30
|
||||
optional: true
|
||||
soft_fail: true
|
||||
num_devices: 2
|
||||
working_dir: "/vllm-workspace"
|
||||
source_file_dependencies:
|
||||
- vllm/
|
||||
- .buildkite/scripts/run-prime-rl-test.sh
|
||||
commands:
|
||||
- nvidia-smi
|
||||
- bash .buildkite/scripts/run-prime-rl-test.sh
|
||||
@@ -1,30 +0,0 @@
|
||||
group: Engine
|
||||
depends_on:
|
||||
- image-build
|
||||
steps:
|
||||
- label: Engine
|
||||
timeout_in_minutes: 15
|
||||
source_file_dependencies:
|
||||
- vllm/
|
||||
- tests/engine
|
||||
- tests/test_sequence
|
||||
- tests/test_config
|
||||
- tests/test_logger
|
||||
- tests/test_vllm_port
|
||||
commands:
|
||||
- pytest -v -s engine test_sequence.py test_config.py test_logger.py test_vllm_port.py
|
||||
|
||||
- label: V1 e2e + engine
|
||||
timeout_in_minutes: 45
|
||||
source_file_dependencies:
|
||||
- vllm/
|
||||
- tests/v1
|
||||
commands:
|
||||
# TODO: accuracy does not match, whether setting
|
||||
# VLLM_USE_FLASHINFER_SAMPLER or not on H100.
|
||||
- pytest -v -s v1/e2e
|
||||
# Run this test standalone for now;
|
||||
# need to untangle use (implicit) use of spawn/fork across the tests.
|
||||
- pytest -v -s v1/engine/test_preprocess_error_handling.py
|
||||
# Run the rest of v1/engine tests
|
||||
- pytest -v -s v1/engine --ignore v1/engine/test_preprocess_error_handling.py
|
||||
@@ -1,90 +0,0 @@
group: Entrypoints
depends_on:
  - image-build
steps:
  - label: Entrypoints Unit Tests
    timeout_in_minutes: 10
    working_dir: "/vllm-workspace/tests"
    source_file_dependencies:
      - vllm/entrypoints
      - tests/entrypoints/
    commands:
      - pytest -v -s entrypoints/openai/tool_parsers
      - pytest -v -s entrypoints/ --ignore=entrypoints/llm --ignore=entrypoints/rpc --ignore=entrypoints/sleep --ignore=entrypoints/instrumentator --ignore=entrypoints/openai --ignore=entrypoints/offline_mode --ignore=entrypoints/test_chat_utils.py --ignore=entrypoints/pooling

  - label: Entrypoints Integration (LLM)
    timeout_in_minutes: 40
    working_dir: "/vllm-workspace/tests"
    source_file_dependencies:
      - vllm/
      - tests/entrypoints/llm
      - tests/entrypoints/offline_mode
    commands:
      - export VLLM_WORKER_MULTIPROC_METHOD=spawn
      - pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_collective_rpc.py
      - pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
      - pytest -v -s entrypoints/offline_mode # needs to avoid interference with other tests

  - label: Entrypoints Integration (API Server 1)
    timeout_in_minutes: 130
    working_dir: "/vllm-workspace/tests"
    source_file_dependencies:
      - vllm/
      - tests/entrypoints/openai
      - tests/entrypoints/test_chat_utils
    commands:
      - export VLLM_WORKER_MULTIPROC_METHOD=spawn
      - pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/ --ignore=entrypoints/openai/responses
      - pytest -v -s entrypoints/test_chat_utils.py

  - label: Entrypoints Integration (API Server 2)
    timeout_in_minutes: 130
    working_dir: "/vllm-workspace/tests"
    source_file_dependencies:
      - vllm/
      - tests/tool_use
      - tests/entrypoints/sleep
      - tests/entrypoints/instrumentator
      - tests/entrypoints/rpc
    commands:
      - export VLLM_WORKER_MULTIPROC_METHOD=spawn
      - PYTHONPATH=/vllm-workspace pytest -v -s entrypoints/rpc
      - pytest -v -s entrypoints/instrumentator
      - pytest -v -s entrypoints/sleep
      - pytest -v -s tool_use

  - label: Entrypoints Integration (Pooling)
    timeout_in_minutes: 50
    working_dir: "/vllm-workspace/tests"
    source_file_dependencies:
      - vllm/
      - tests/entrypoints/pooling
    commands:
      - export VLLM_WORKER_MULTIPROC_METHOD=spawn
      - pytest -v -s entrypoints/pooling

  - label: Entrypoints Integration (Responses API)
    timeout_in_minutes: 50
    working_dir: "/vllm-workspace/tests"
    source_file_dependencies:
      - vllm/
      - tests/entrypoints/openai/responses
    commands:
      - pytest -v -s entrypoints/openai/responses

  - label: Entrypoints V1
    timeout_in_minutes: 50
    source_file_dependencies:
      - vllm/
      - tests/v1
    commands:
      - pytest -v -s v1/entrypoints

  - label: OpenAI API Correctness
    timeout_in_minutes: 30
    source_file_dependencies:
      - csrc/
      - vllm/entrypoints/openai/
      - vllm/model_executor/models/whisper.py
    commands: # LMEval + Transcription WER check
      - pytest -s entrypoints/openai/correctness/
@@ -1,23 +0,0 @@
group: Expert Parallelism
depends_on:
  - image-build
steps:
  - label: EPLB Algorithm
    timeout_in_minutes: 15
    working_dir: "/vllm-workspace/tests"
    source_file_dependencies:
      - vllm/distributed/eplb
      - tests/distributed/test_eplb_algo.py
    commands:
      - pytest -v -s distributed/test_eplb_algo.py

  - label: EPLB Execution
    timeout_in_minutes: 20
    working_dir: "/vllm-workspace/tests"
    num_devices: 4
    source_file_dependencies:
      - vllm/distributed/eplb
      - tests/distributed/test_eplb_execute.py
    commands:
      - pytest -v -s distributed/test_eplb_execute.py
      - pytest -v -s distributed/test_eplb_spec_decode.py
@@ -1,169 +0,0 @@
group: Kernels
depends_on:
  - image-build
steps:
  - label: Kernels Core Operation Test
    timeout_in_minutes: 75
    source_file_dependencies:
      - csrc/
      - tests/kernels/core
      - tests/kernels/test_top_k_per_row.py
    commands:
      - pytest -v -s kernels/core kernels/test_top_k_per_row.py

  - label: Kernels Attention Test %N
    timeout_in_minutes: 35
    source_file_dependencies:
      - csrc/attention/
      - vllm/v1/attention
      # TODO: remove this dependency (https://github.com/vllm-project/vllm/issues/32267)
      - vllm/model_executor/layers/attention
      - tests/kernels/attention
    commands:
      - pytest -v -s kernels/attention --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
    parallelism: 2
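The `%N` steps shard one pytest suite across parallel Buildkite jobs. Assuming the `--shard-id`/`--num-shards` flags come from the pytest-shard plugin (their usual source), each parallel job runs one disjoint shard; the doubled `$$` in the YAML defers variable expansion to the agent at runtime instead of pipeline-upload time:

```bash
# What each of the 2 parallel agents effectively executes
# (BUILDKITE_PARALLEL_JOB is 0 or 1, BUILDKITE_PARALLEL_JOB_COUNT is 2):
pytest -v -s kernels/attention \
  --shard-id="$BUILDKITE_PARALLEL_JOB" \
  --num-shards="$BUILDKITE_PARALLEL_JOB_COUNT"
```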
  - label: Kernels Quantization Test %N
    timeout_in_minutes: 90
    source_file_dependencies:
      - csrc/quantization/
      - vllm/model_executor/layers/quantization
      - tests/kernels/quantization
    commands:
      - pytest -v -s kernels/quantization --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
    parallelism: 2

  - label: Kernels MoE Test %N
    timeout_in_minutes: 60
    source_file_dependencies:
      - csrc/quantization/cutlass_w8a8/moe/
      - csrc/moe/
      - tests/kernels/moe
      - vllm/model_executor/layers/fused_moe/
      - vllm/distributed/device_communicators/
      - vllm/envs.py
      - vllm/config
    commands:
      - pytest -v -s kernels/moe --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
    parallelism: 2

  - label: Kernels Mamba Test
    timeout_in_minutes: 45
    source_file_dependencies:
      - csrc/mamba/
      - tests/kernels/mamba
      - vllm/model_executor/layers/mamba/ops
    commands:
      - pytest -v -s kernels/mamba

  - label: Kernels DeepGEMM Test (H100)
    timeout_in_minutes: 45
    device: h100
    num_devices: 1
    source_file_dependencies:
      - tools/install_deepgemm.sh
      - vllm/utils/deep_gemm.py
      - vllm/model_executor/layers/fused_moe
      - vllm/model_executor/layers/quantization
      - tests/kernels/quantization/test_block_fp8.py
      - tests/kernels/moe/test_deepgemm.py
      - tests/kernels/moe/test_batched_deepgemm.py
      - tests/kernels/attention/test_deepgemm_attention.py
    commands:
      - pytest -v -s kernels/quantization/test_block_fp8.py -k deep_gemm
      - pytest -v -s kernels/moe/test_deepgemm.py
      - pytest -v -s kernels/moe/test_batched_deepgemm.py
      - pytest -v -s kernels/attention/test_deepgemm_attention.py

  - label: Kernels (B200)
    timeout_in_minutes: 30
    working_dir: "/vllm-workspace/"
    device: b200
    # optional: true
    source_file_dependencies:
      - csrc/quantization/fp4/
      - csrc/attention/mla/
      - csrc/quantization/cutlass_w8a8/moe/
      - vllm/model_executor/layers/fused_moe/cutlass_moe.py
      - vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py
      - vllm/model_executor/layers/fused_moe/flashinfer_a2a_prepare_finalize.py
      - vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
      - vllm/v1/attention/backends/flashinfer.py
      - vllm/v1/attention/backends/mla/cutlass_mla.py
      - vllm/v1/attention/backends/mla/flashinfer_mla.py
      - vllm/v1/attention/selector.py
      - vllm/platforms/cuda.py
    commands:
      - nvidia-smi
      - python3 examples/offline_inference/basic/chat.py
      # Attention
      # num_heads2 broken by https://github.com/flashinfer-ai/flashinfer/issues/1353
      - pytest -v -s tests/kernels/attention/test_attention_selector.py
      - pytest -v -s tests/kernels/attention/test_flashinfer.py -k 'not num_heads2'
      - pytest -v -s tests/kernels/attention/test_flashinfer_trtllm_attention.py
      - pytest -v -s tests/kernels/attention/test_cutlass_mla_decode.py
      - pytest -v -s tests/kernels/attention/test_flashinfer_mla_decode.py
      # Quantization
      - pytest -v -s tests/kernels/quantization/test_cutlass_scaled_mm.py -k 'fp8'
      - pytest -v -s tests/kernels/quantization/test_nvfp4_quant.py
      - pytest -v -s tests/kernels/quantization/test_silu_mul_nvfp4_quant.py
      - pytest -v -s tests/kernels/quantization/test_nvfp4_scaled_mm.py
      - pytest -v -s tests/kernels/quantization/test_flashinfer_scaled_mm.py
      - pytest -v -s tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py
      - pytest -v -s tests/kernels/quantization/test_nvfp4_qutlass.py
      - pytest -v -s tests/kernels/quantization/test_mxfp4_qutlass.py
      - pytest -v -s tests/kernels/moe/test_nvfp4_moe.py
      - pytest -v -s tests/kernels/moe/test_ocp_mx_moe.py
      - pytest -v -s tests/kernels/moe/test_flashinfer.py
      - pytest -v -s tests/kernels/moe/test_cutedsl_moe.py
      # e2e
      - pytest -v -s tests/models/quantization/test_nvfp4.py

  - label: Kernels Helion Test
    timeout_in_minutes: 30
    device: h100
    source_file_dependencies:
      - vllm/utils/import_utils.py
      - tests/kernels/helion/
    commands:
      - pip install helion
      - pytest -v -s kernels/helion/

  - label: Kernels FP8 MoE Test (1 H100)
    timeout_in_minutes: 90
    device: h100
    num_devices: 1
    optional: true
    commands:
      - pytest -v -s kernels/moe/test_cutlass_moe.py
      - pytest -v -s kernels/moe/test_flashinfer.py
      - pytest -v -s kernels/moe/test_gpt_oss_triton_kernels.py
      - pytest -v -s kernels/moe/test_modular_oai_triton_moe.py
      - pytest -v -s kernels/moe/test_moe.py
      # - pytest -v -s kernels/moe/test_block_fp8.py - failing on main
      - pytest -v -s kernels/moe/test_block_int8.py
      - pytest -v -s kernels/moe/test_triton_moe_no_act_mul.py
      - pytest -v -s kernels/moe/test_triton_moe_ptpc_fp8.py

  - label: Kernels FP8 MoE Test (2 H100s)
    timeout_in_minutes: 90
    device: h100
    num_devices: 2
    optional: true
    commands:
      - pytest -v -s kernels/moe/test_deepep_deepgemm_moe.py
      - pytest -v -s kernels/moe/test_deepep_moe.py
      - pytest -v -s kernels/moe/test_pplx_cutlass_moe.py
      # - pytest -v -s kernels/moe/test_pplx_moe.py - failing on main

  - label: Kernels Fp4 MoE Test (B200)
    timeout_in_minutes: 60
    device: b200
    num_devices: 1
    optional: true
    commands:
      - pytest -v -s kernels/moe/test_cutedsl_moe.py
      - pytest -v -s kernels/moe/test_flashinfer_moe.py
      - pytest -v -s kernels/moe/test_nvfp4_moe.py
      - pytest -v -s kernels/moe/test_ocp_mx_moe.py
@@ -1,75 +0,0 @@
group: LM Eval
depends_on:
  - image-build
steps:
  - label: LM Eval Small Models
    timeout_in_minutes: 75
    source_file_dependencies:
      - csrc/
      - vllm/model_executor/layers/quantization
    autorun_on_main: true
    commands:
      - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt

  - label: LM Eval Large Models (4 GPUs) (A100)
    device: a100
    optional: true
    num_devices: 4
    working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
    source_file_dependencies:
      - csrc/
      - vllm/model_executor/layers/quantization
    commands:
      - export VLLM_WORKER_MULTIPROC_METHOD=spawn
      - pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large.txt --tp-size=4

  - label: LM Eval Large Models (4 GPUs) (H100)
    device: h100
    optional: true
    num_devices: 4
    working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
    source_file_dependencies:
      - csrc/
      - vllm/model_executor/layers/quantization
    commands:
      - export VLLM_USE_DEEP_GEMM=0 # We found Triton to be faster than DeepGEMM on H100
      - pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large-hopper.txt --tp-size=4

  - label: LM Eval Small Models (B200)
    timeout_in_minutes: 120
    device: b200
    optional: true
    source_file_dependencies:
      - csrc/
      - vllm/model_executor/layers/quantization
    commands:
      - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-blackwell.txt

  - label: LM Eval Large Models (H200)
    timeout_in_minutes: 60
    device: h200
    optional: true
    num_devices: 8
    commands:
      - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-h200.txt

  - label: MoE Refactor Integration Test (H100 - TEMPORARY)
    device: h100
    optional: true
    num_devices: 2
    commands:
      - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-h100.txt

  - label: MoE Refactor Integration Test (B200 - TEMPORARY)
    device: b200
    optional: true
    num_devices: 2
    commands:
      - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-b200.txt

  - label: MoE Refactor Integration Test (B200 DP - TEMPORARY)
    device: b200
    optional: true
    num_devices: 2
    commands:
      - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor-dp-ep/config-b200.txt
@@ -1,33 +0,0 @@
group: LoRA
depends_on:
  - image-build
steps:
  - label: LoRA %N
    timeout_in_minutes: 30
    source_file_dependencies:
      - vllm/lora
      - tests/lora
    commands:
      - pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py --ignore=lora/test_llm_with_multi_loras.py --ignore=lora/test_olmoe_tp.py --ignore=lora/test_deepseekv2_tp.py --ignore=lora/test_gptoss_tp.py --ignore=lora/test_qwen3moe_tp.py
    parallelism: 4

  - label: LoRA TP (Distributed)
    timeout_in_minutes: 30
    num_devices: 4
    source_file_dependencies:
      - vllm/lora
      - tests/lora
    commands:
      # FIXME: find out which code initializes CUDA before running the test;
      # until that is fixed, we need to use spawn to run these tests.
      - export VLLM_WORKER_MULTIPROC_METHOD=spawn
      # A lot of these tests are on the edge of OOMing
      - export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      # There is some Tensor Parallelism-related processing logic in LoRA that
      # requires multi-GPU testing for validation.
      - pytest -v -s -x lora/test_chatglm3_tp.py
      - pytest -v -s -x lora/test_llama_tp.py
      - pytest -v -s -x lora/test_llm_with_multi_loras.py
      - pytest -v -s -x lora/test_olmoe_tp.py
      - pytest -v -s -x lora/test_gptoss_tp.py
@@ -1,185 +0,0 @@
group: Miscellaneous
depends_on:
  - image-build
steps:
  - label: V1 Others
    timeout_in_minutes: 60
    source_file_dependencies:
      - vllm/
      - tests/v1
    commands:
      - uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
      # split the tests to avoid interference
      - pytest -v -s -m 'not cpu_test' v1/core
      - pytest -v -s v1/executor
      - pytest -v -s v1/kv_offload
      - pytest -v -s v1/sample
      - pytest -v -s v1/logits_processors
      - pytest -v -s v1/worker
      - pytest -v -s -m 'not slow_test' v1/spec_decode
      - pytest -v -s -m 'not cpu_test' v1/kv_connector/unit
      - pytest -v -s -m 'not cpu_test' v1/metrics
      - pytest -v -s v1/test_oracle.py
      - pytest -v -s v1/test_request.py
      - pytest -v -s v1/test_outputs.py
      # Integration test for streaming correctness (requires special branch).
      - pip install -U git+https://github.com/robertgshaw2-redhat/lm-evaluation-harness.git@streaming-api
      - pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine

  - label: V1 Others (CPU)
    depends_on:
      - image-build-cpu
    source_file_dependencies:
      - vllm/
      - tests/v1
    device: cpu
    commands:
      # split the tests to avoid interference
      - pytest -v -s -m 'cpu_test' v1/core
      - pytest -v -s v1/structured_output
      - pytest -v -s v1/test_serial_utils.py
      - pytest -v -s -m 'cpu_test' v1/kv_connector/unit
      - pytest -v -s -m 'cpu_test' v1/metrics

  - label: Regression
    timeout_in_minutes: 20
    source_file_dependencies:
      - vllm/
      - tests/test_regression
    commands:
      - pip install modelscope
      - pytest -v -s test_regression.py
    working_dir: "/vllm-workspace/tests" # optional

  - label: Examples
    timeout_in_minutes: 45
    working_dir: "/vllm-workspace/examples"
    source_file_dependencies:
      - vllm/entrypoints
      - vllm/multimodal
      - examples/
    commands:
      - pip install tensorizer # for tensorizer test
      - python3 offline_inference/basic/chat.py # for basic
      - python3 offline_inference/basic/generate.py --model facebook/opt-125m
      - python3 offline_inference/basic/generate.py --model meta-llama/Llama-2-13b-chat-hf --cpu-offload-gb 10
      - python3 offline_inference/basic/classify.py
      - python3 offline_inference/basic/embed.py
      - python3 offline_inference/basic/score.py
      # for multi-modal models
      - python3 offline_inference/audio_language.py --seed 0
      - python3 offline_inference/vision_language.py --seed 0
      - python3 offline_inference/vision_language_multi_image.py --seed 0
      - python3 offline_inference/encoder_decoder_multimodal.py --model-type whisper --seed 0
      # for pooling models
      - python3 pooling/embed/vision_embedding_offline.py --seed 0
      # for features demo
      - python3 offline_inference/prefix_caching.py
      - python3 offline_inference/llm_engine_example.py
      - python3 others/tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 others/tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
      - python3 offline_inference/spec_decode.py --test --method eagle --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
      # https://github.com/vllm-project/vllm/pull/26682 uses slightly more memory in PyTorch 2.9+, causing this test to OOM on a 1xL4 GPU
      - python3 offline_inference/spec_decode.py --test --method eagle3 --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 1536

  - label: Metrics, Tracing (2 GPUs)
    timeout_in_minutes: 20
    num_devices: 2
    source_file_dependencies:
      - vllm/
      - tests/v1/tracing
    commands:
      - "pip install \
        'opentelemetry-sdk>=1.26.0' \
        'opentelemetry-api>=1.26.0' \
        'opentelemetry-exporter-otlp>=1.26.0' \
        'opentelemetry-semantic-conventions-ai>=0.4.1'"
      - pytest -v -s v1/tracing

  - label: Python-only Installation
    depends_on: ~
    timeout_in_minutes: 20
    source_file_dependencies:
      - tests/standalone_tests/python_only_compile.sh
      - setup.py
    commands:
      - bash standalone_tests/python_only_compile.sh

  - label: Async Engine, Inputs, Utils, Worker
    timeout_in_minutes: 50
    source_file_dependencies:
      - vllm/
      - tests/multimodal
      - tests/utils_
    commands:
      - pytest -v -s -m 'not cpu_test' multimodal
      - pytest -v -s utils_

  - label: Async Engine, Inputs, Utils, Worker, Config (CPU)
    depends_on:
      - image-build-cpu
    timeout_in_minutes: 30
    source_file_dependencies:
      - vllm/
      - tests/test_inputs.py
      - tests/test_outputs.py
      - tests/test_pooling_params.py
      - tests/multimodal
      - tests/renderers
      - tests/standalone_tests/lazy_imports.py
      - tests/tokenizers_
      - tests/tool_parsers
      - tests/transformers_utils
      - tests/config
    device: cpu
    commands:
      - python3 standalone_tests/lazy_imports.py
      - pytest -v -s test_inputs.py
      - pytest -v -s test_outputs.py
      - pytest -v -s test_pooling_params.py
      - pytest -v -s -m 'cpu_test' multimodal
      - pytest -v -s renderers
      - pytest -v -s tokenizers_
      - pytest -v -s tool_parsers
      - pytest -v -s transformers_utils
      - pytest -v -s config

  - label: GPT-OSS Eval (B200)
    timeout_in_minutes: 60
    working_dir: "/vllm-workspace/"
    device: b200
    optional: true
    source_file_dependencies:
      - tests/evals/gpt_oss
      - vllm/model_executor/models/gpt_oss.py
      - vllm/model_executor/layers/quantization/mxfp4.py
      - vllm/v1/attention/backends/flashinfer.py
    commands:
      - uv pip install --system 'gpt-oss[eval]==0.0.5'
      - pytest -s -v tests/evals/gpt_oss/test_gpqa_correctness.py --model openai/gpt-oss-20b --metric 0.58

  - label: Batch Invariance (H100)
    timeout_in_minutes: 25
    device: h100
    source_file_dependencies:
      - vllm/v1/attention
      - vllm/model_executor/layers
      - tests/v1/determinism/
    commands:
      - export VLLM_WORKER_MULTIPROC_METHOD=spawn
      - pip install pytest-timeout pytest-forked
      - pytest -v -s v1/determinism/test_batch_invariance.py
      - pytest -v -s v1/determinism/test_rms_norm_batch_invariant.py

  - label: Acceptance Length Test (Large Models) # optional
    timeout_in_minutes: 25
    gpu: h100
    optional: true
    num_gpus: 1
    working_dir: "/vllm-workspace/tests"
    source_file_dependencies:
      - vllm/v1/spec_decode/
      - vllm/model_executor/models/mlp_speculator.py
      - tests/v1/spec_decode/test_acceptance_length.py
    commands:
      - export VLLM_ALLOW_INSECURE_SERIALIZATION=1
      - pytest -v -s v1/spec_decode/test_acceptance_length.py -m slow_test
@@ -1,17 +0,0 @@
group: Model Executor
depends_on:
  - image-build
steps:
  - label: Model Executor
    timeout_in_minutes: 35
    source_file_dependencies:
      - vllm/engine/arg_utils.py
      - vllm/config/model.py
      - vllm/model_executor
      - tests/model_executor
      - tests/entrypoints/openai/test_tensorizer_entrypoint.py
    commands:
      - apt-get update && apt-get install -y curl libsodium23
      - export VLLM_WORKER_MULTIPROC_METHOD=spawn
      - pytest -v -s model_executor
      - pytest -v -s entrypoints/openai/test_tensorizer_entrypoint.py
@@ -1,67 +0,0 @@
group: Models - Basic
depends_on:
  - image-build
steps:
  - label: Basic Models Tests (Initialization)
    timeout_in_minutes: 45
    mirror_hardwares: [amdexperimental]
    torch_nightly: true
    source_file_dependencies:
      - vllm/
      - tests/models/test_initialization.py
      - tests/models/registry.py
    commands:
      # Run a subset of model initialization tests
      - pytest -v -s models/test_initialization.py::test_can_initialize_small_subset

  - label: Basic Models Tests (Extra Initialization) %N
    timeout_in_minutes: 45
    mirror_hardwares: [amdexperimental]
    torch_nightly: true
    source_file_dependencies:
      - vllm/model_executor/models/
      - tests/models/test_initialization.py
      - tests/models/registry.py
    commands:
      # Only when vLLM model source is modified, test initialization of a large
      # subset of supported models (the complement of the small subset in the
      # above test). Also run if the model initialization test file is modified.
      - pytest -v -s models/test_initialization.py -k 'not test_can_initialize_small_subset' --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --shard-id=$$BUILDKITE_PARALLEL_JOB
    parallelism: 2

  - label: Basic Models Tests (Other)
    timeout_in_minutes: 45
    source_file_dependencies:
      - vllm/
      - tests/models/test_terratorch.py
      - tests/models/test_transformers.py
      - tests/models/test_registry.py
    commands:
      - pytest -v -s models/test_terratorch.py models/test_transformers.py models/test_registry.py

  - label: Basic Models Test (Other CPU) # 5min
    depends_on:
      - image-build-cpu
    timeout_in_minutes: 10
    source_file_dependencies:
      - vllm/
      - tests/models/test_utils.py
      - tests/models/test_vision.py
    device: cpu
    commands:
      - pytest -v -s models/test_utils.py models/test_vision.py

  - label: Transformers Nightly Models
    working_dir: "/vllm-workspace/"
    optional: true
    soft_fail: true
    commands:
      - pip install --upgrade git+https://github.com/huggingface/transformers
      - pytest -v -s tests/models/test_initialization.py
      - pytest -v -s tests/models/test_transformers.py
      - pytest -v -s tests/models/multimodal/processing/
      - pytest -v -s tests/models/multimodal/test_mapping.py
      - python3 examples/offline_inference/basic/chat.py
      - python3 examples/offline_inference/vision_language.py --model-type qwen2_5_vl
      # Whisper needs the spawn method to avoid deadlock
      - VLLM_WORKER_MULTIPROC_METHOD=spawn python3 examples/offline_inference/audio_language.py --model-type whisper
@@ -1,22 +0,0 @@
group: Models - Distributed
depends_on:
  - image-build
steps:
  - label: Distributed Model Tests (2 GPUs)
    timeout_in_minutes: 50
    working_dir: "/vllm-workspace/tests"
    num_devices: 2
    source_file_dependencies:
      - vllm/model_executor/model_loader/sharded_state_loader.py
      - vllm/model_executor/models/
      - tests/basic_correctness/
      - tests/model_executor/model_loader/test_sharded_state_loader.py
      - tests/models/
    commands:
      - TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
      - CUDA_VISIBLE_DEVICES=0,1 pytest -v -s model_executor/model_loader/test_sharded_state_loader.py
      # Avoid importing model tests that cause CUDA reinitialization errors
      - pytest models/test_transformers.py -v -s -m 'distributed(num_gpus=2)'
      - pytest models/language -v -s -m 'distributed(num_gpus=2)'
      - pytest models/multimodal -v -s -m 'distributed(num_gpus=2)' --ignore models/multimodal/generation/test_whisper.py
      - VLLM_WORKER_MULTIPROC_METHOD=spawn pytest models/multimodal/generation/test_whisper.py -v -s -m 'distributed(num_gpus=2)'
@@ -1,91 +0,0 @@
group: Models - Language
depends_on:
  - image-build
steps:
  - label: Language Models Tests (Standard)
    timeout_in_minutes: 25
    mirror_hardwares: [amdexperimental]
    torch_nightly: true
    source_file_dependencies:
      - vllm/
      - tests/models/language
    commands:
      # Test standard language models, excluding a subset of slow tests
      - pip freeze | grep -E 'torch'
      - pytest -v -s models/language -m 'core_model and (not slow_test)'

  - label: Language Models Tests (Extra Standard) %N
    timeout_in_minutes: 45
    mirror_hardwares: [amdexperimental]
    torch_nightly: true
    source_file_dependencies:
      - vllm/model_executor/models/
      - tests/models/language/pooling/test_embedding.py
      - tests/models/language/generation/test_common.py
      - tests/models/language/pooling/test_classification.py
    commands:
      # Shard the slow subset of the standard language model tests. Only run
      # when model source is modified, or when the specified test files are
      # modified.
      - pip freeze | grep -E 'torch'
      - pytest -v -s models/language -m 'core_model and slow_test' --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --shard-id=$$BUILDKITE_PARALLEL_JOB
    parallelism: 2

  - label: Language Models Tests (Hybrid) %N
    timeout_in_minutes: 75
    mirror_hardwares: [amdexperimental]
    torch_nightly: true
    source_file_dependencies:
      - vllm/
      - tests/models/language/generation
    commands:
      # Install fast-path packages for testing against transformers.
      # Note: these are also needed to run the plamo2 model in vLLM.
      - uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5'
      - uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.2'
      # Shard hybrid language model tests
      - pytest -v -s models/language/generation -m hybrid_model --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --shard-id=$$BUILDKITE_PARALLEL_JOB
    parallelism: 2

  - label: Language Models Test (Extended Generation) # 80min
    timeout_in_minutes: 110
    mirror_hardwares: [amdexperimental]
    optional: true
    source_file_dependencies:
      - vllm/
      - tests/models/language/generation
    commands:
      # Install fast-path packages for testing against transformers.
      # Note: these are also needed to run the plamo2 model in vLLM.
      - uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5'
      - uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.2'
      - pytest -v -s models/language/generation -m '(not core_model) and (not hybrid_model)'

  - label: Language Models Test (PPL)
    timeout_in_minutes: 110
    mirror_hardwares: [amdexperimental]
    optional: true
    source_file_dependencies:
      - vllm/
      - tests/models/language/generation_ppl_test
    commands:
      - pytest -v -s models/language/generation_ppl_test

  - label: Language Models Test (Extended Pooling) # 36min
    timeout_in_minutes: 50
    mirror_hardwares: [amdexperimental]
    optional: true
    source_file_dependencies:
      - vllm/
      - tests/models/language/pooling
    commands:
      - pytest -v -s models/language/pooling -m 'not core_model'

  - label: Language Models Test (MTEB)
    timeout_in_minutes: 110
    mirror_hardwares: [amdexperimental]
    optional: true
    source_file_dependencies:
      - vllm/
      - tests/models/language/pooling_mteb_test
    commands:
      - pytest -v -s models/language/pooling_mteb_test
@@ -1,81 +0,0 @@
group: Models - Multimodal
depends_on:
  - image-build
steps:
  - label: Multi-Modal Models (Standard) # 60min
    timeout_in_minutes: 80
    source_file_dependencies:
      - vllm/
      - tests/models/multimodal
    commands:
      - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
      - pip freeze | grep -E 'torch'
      - pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing
      - cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work

  - label: Multi-Modal Processor Test (CPU)
    depends_on:
      - image-build-cpu
    timeout_in_minutes: 60
    source_file_dependencies:
      - vllm/
      - tests/models/multimodal
    device: cpu
    commands:
      - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
      - pytest -v -s models/multimodal/processing --ignore models/multimodal/processing/test_tensor_schema.py

  - label: Multi-Modal Processor # 44min
    timeout_in_minutes: 60
    source_file_dependencies:
      - vllm/
      - tests/models/multimodal
    commands:
      - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
      - pytest -v -s models/multimodal/processing/test_tensor_schema.py

  - label: Multi-Modal Accuracy Eval (Small Models) # 50min
    timeout_in_minutes: 70
    working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
    source_file_dependencies:
      - vllm/multimodal/
      - vllm/inputs/
      - vllm/v1/core/
    commands:
      - pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-mm-small.txt --tp-size=1

  - label: Multi-Modal Models (Extended) 1
    optional: true
    source_file_dependencies:
      - vllm/
      - tests/models/multimodal
    commands:
      - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
      - pytest -v -s models/multimodal -m 'not core_model' --ignore models/multimodal/generation/test_common.py --ignore models/multimodal/processing

  - label: Multi-Modal Models (Extended) 2
    optional: true
    source_file_dependencies:
      - vllm/
      - tests/models/multimodal
    commands:
      - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
      - pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=0) and not core_model'

  - label: Multi-Modal Models (Extended) 3
    optional: true
    source_file_dependencies:
      - vllm/
      - tests/models/multimodal
    commands:
      - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
      - pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=1) and not core_model'

  # This test is used only in the PR development phase to test individual models; it should never run on main
  - label: Custom Models
    optional: true
    commands:
      - echo 'Testing custom models...'
      # PR authors can temporarily add commands below to test individual models
      # e.g. pytest -v -s models/encoder_decoder/vision_language/test_mllama.py
      # *To avoid merge conflicts, remember to REMOVE (not just comment out) them before merging the PR*
@@ -1,34 +0,0 @@
group: Plugins
depends_on:
  - image-build
steps:
  - label: Plugin Tests (2 GPUs)
    timeout_in_minutes: 60
    working_dir: "/vllm-workspace/tests"
    num_devices: 2
    source_file_dependencies:
      - vllm/plugins/
      - tests/plugins/
    commands:
      # begin platform plugin and general plugin tests; all the code in between runs on the dummy platform
      - pip install -e ./plugins/vllm_add_dummy_platform
      - pytest -v -s plugins_tests/test_platform_plugins.py
      - pip uninstall vllm_add_dummy_platform -y
      # end platform plugin tests
      # begin io_processor plugins test; all the code in between uses the prithvi_io_processor plugin
      - pip install -e ./plugins/prithvi_io_processor_plugin
      - pytest -v -s plugins_tests/test_io_processor_plugins.py
      - pip uninstall prithvi_io_processor_plugin -y
      # end io_processor plugins test
      # begin stat_logger plugins test
      - pip install -e ./plugins/vllm_add_dummy_stat_logger
      - pytest -v -s plugins_tests/test_stats_logger_plugins.py
      - pip uninstall dummy_stat_logger -y
      # end stat_logger plugins test
      # other tests continue here:
      - pytest -v -s plugins_tests/test_scheduler_plugins.py
      - pip install -e ./plugins/vllm_add_dummy_model
      - pytest -v -s distributed/test_distributed_oot.py
      - pytest -v -s entrypoints/openai/test_oot_registration.py # it needs a clean process
      - pytest -v -s models/test_oot_registration.py # it needs a clean process
      - pytest -v -s plugins/lora_resolvers # unit tests for in-tree lora resolver plugins
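The recurring install, test, uninstall pattern above keeps each plugin from leaking into the next step's environment. A small helper capturing that pattern (the helper itself is illustrative; the pipeline runs the commands inline):

```bash
# Sketch of the install/test/uninstall cycle used above.
test_plugin() {
  local plugin_dir=$1 dist_name=$2 test_path=$3 rc
  pip install -e "$plugin_dir"
  pytest -v -s "$test_path"
  rc=$?                          # remember pytest's status across the uninstall
  pip uninstall -y "$dist_name"  # leave a plugin-free environment behind
  return "$rc"
}

test_plugin ./plugins/vllm_add_dummy_platform vllm_add_dummy_platform \
  plugins_tests/test_platform_plugins.py
```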
@@ -1,57 +0,0 @@
group: PyTorch
depends_on:
  - image-build
steps:
  - label: PyTorch Compilation Unit Tests
    timeout_in_minutes: 10
    source_file_dependencies:
      - vllm/
      - tests/compile
    commands:
      # Run the unit tests defined directly under compile/, not including
      # subdirectories, which are usually heavier tests covered elsewhere.
      # Use `find` to launch multiple instances of pytest so that they do not
      # suffer from https://github.com/vllm-project/vllm/issues/28965
      # However, find does not normally propagate error codes, so we combine
      # it with xargs (using -0 for proper path handling).
      - "find compile/ -maxdepth 1 -name 'test_*.py' -print0 | xargs -0 -n1 -I{} pytest -s -v '{}'"
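The comment is worth unpacking: with `find ... -exec cmd {} \;`, a failing command does not change find's exit status, so the step would pass even if pytest failed; piping NUL-separated paths into `xargs -0` makes the pipeline exit non-zero when any pytest invocation fails. A quick demonstration of the difference:

```bash
# -exec with ';' swallows the command's exit status:
find . -maxdepth 0 -exec false {} \; ; echo "find exit: $?"   # prints 0
# xargs propagates failures (exit code 123 when any invocation fails):
printf '.\0' | xargs -0 -n1 false ; echo "xargs exit: $?"     # prints 123
```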
  - label: PyTorch Compilation Passes Unit Tests
    timeout_in_minutes: 20
    source_file_dependencies:
      - vllm/
      - tests/compile/passes
    commands:
      - pytest -s -v compile/passes --ignore compile/passes/distributed

  - label: PyTorch Fullgraph Smoke Test
    timeout_in_minutes: 35
    source_file_dependencies:
      - vllm/
      - tests/compile
    commands:
      # Run the smoke tests under the fullgraph directory, except
      # test_full_graph.py, as it is a heavy test that is covered in other
      # steps. Use `find` to launch multiple instances of pytest so that they
      # do not suffer from https://github.com/vllm-project/vllm/issues/28965
      - "find compile/fullgraph/ -name 'test_*.py' -not -name 'test_full_graph.py' -exec pytest -s -v {} \\;"

  - label: PyTorch Fullgraph
    timeout_in_minutes: 30
    source_file_dependencies:
      - vllm/
      - tests/compile
    commands:
      # fp8 kv scales are not supported on sm89; tested on Blackwell instead
      - pytest -v -s compile/fullgraph/test_full_graph.py -k 'not test_fp8_kv_scale_compile'

  - label: Pytorch Nightly Dependency Override Check # 2min
    # If this test fails, it means the nightly torch version is not compatible
    # with some of the dependencies. Please check the error message and add the
    # package to the whitelist in /vllm/tools/pre_commit/generate_nightly_torch_test.py
    soft_fail: true
    source_file_dependencies:
      - requirements/nightly_torch_test.txt
    commands:
      - bash standalone_tests/pytorch_nightly_dependency.sh
@@ -1,46 +0,0 @@
group: Quantization
depends_on:
  - image-build
steps:
  - label: Quantization
    timeout_in_minutes: 90
    source_file_dependencies:
      - csrc/
      - vllm/model_executor/layers/quantization
      - tests/quantization
    commands:
      # Temporary install here since we need nightly; this will move to
      # requirements/test.in after the torchao 0.12 release, pinning a working
      # torchao nightly version here.

      # Since torchao nightly is currently only compatible with torch nightly
      # (https://github.com/pytorch/ao/issues/2919), we have to skip new
      # torchao tests for now; we can only upgrade after this is resolved.
      # TODO(jerryzh168): resolve the above comment
      - uv pip install --system torchao==0.14.1 --index-url https://download.pytorch.org/whl/cu129
      - uv pip install --system conch-triton-kernels
      - VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/ --ignore quantization/test_blackwell_moe.py

  - label: Quantized MoE Test (B200)
    timeout_in_minutes: 60
    working_dir: "/vllm-workspace/"
    device: b200
    source_file_dependencies:
      - tests/quantization/test_blackwell_moe.py
      - vllm/model_executor/models/deepseek_v2.py
      - vllm/model_executor/models/gpt_oss.py
      - vllm/model_executor/models/llama4.py
      - vllm/model_executor/layers/fused_moe
      - vllm/model_executor/layers/quantization/compressed_tensors
      - vllm/model_executor/layers/quantization/modelopt.py
      - vllm/model_executor/layers/quantization/mxfp4.py
      - vllm/v1/attention/backends/flashinfer.py
    commands:
      - pytest -s -v tests/quantization/test_blackwell_moe.py

  - label: Quantized Models Test
    timeout_in_minutes: 60
    source_file_dependencies:
      - vllm/model_executor/layers/quantization
      - tests/models/quantization
    commands:
      - pytest -v -s models/quantization
@@ -1,14 +0,0 @@
group: Samplers
depends_on:
  - image-build
steps:
  - label: Samplers Test
    timeout_in_minutes: 75
    source_file_dependencies:
      - vllm/model_executor/layers
      - vllm/sampling_metadata.py
      - tests/samplers
      - tests/conftest.py
    commands:
      - pytest -v -s samplers
      - VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers
@@ -1,25 +0,0 @@
group: Weight Loading
depends_on:
  - image-build
steps:
  - label: Weight Loading Multiple GPU # 33min
    timeout_in_minutes: 45
    working_dir: "/vllm-workspace/tests"
    num_devices: 2
    optional: true
    source_file_dependencies:
      - vllm/
      - tests/weight_loading
    commands:
      - bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models.txt

  - label: Weight Loading Multiple GPU - Large Models # optional
    working_dir: "/vllm-workspace/tests"
    num_devices: 2
    device: a100
    optional: true
    source_file_dependencies:
      - vllm/
      - tests/weight_loading
    commands:
      - bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt
35
.github/CODEOWNERS
vendored
@@ -2,21 +2,20 @@
# for more info about CODEOWNERS file

# This list covers the "core" components of vLLM that require careful review
/vllm/attention @LucasWilkinson
/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @njhill
/vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @njhill @22quinn
/vllm/model_executor/layers/attention @LucasWilkinson
/vllm/model_executor/layers/fused_moe @mgoin @pavanimajety
/vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth @yewentao256 @pavanimajety
/vllm/model_executor/layers/mamba @tdoublep
/vllm/model_executor/model_loader @22quinn
/vllm/model_executor/layers/batch_invariant.py @yewentao256
/vllm/multimodal @DarkLight1337 @ywang96 @NickLucche @tjtanaa
/vllm/vllm_flash_attn @LucasWilkinson
/vllm/lora @jeejeelee
/vllm/reasoning @aarnphm @chaunceyjiang
/vllm/entrypoints @aarnphm @chaunceyjiang
/vllm/tool_parsers @aarnphm @chaunceyjiang
/vllm/compilation @zou3519 @youkaichao @ProExpertProg
/vllm/distributed/kv_transfer @NickLucche @ApostaC @orozery
/vllm/distributed/kv_transfer @NickLucche @ApostaC
CMakeLists.txt @tlrmchlsmth @LucasWilkinson

# Any change to the VllmConfig can have a large user-facing impact,
@@ -26,21 +25,15 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson

# vLLM V1
/vllm/v1/attention @LucasWilkinson
/vllm/v1/attention/backend.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @njhill
/vllm/v1/attention/backends/mla @pavanimajety
/vllm/v1/attention/backends/flashinfer.py @mgoin @pavanimajety
/vllm/v1/attention/backends/triton_attn.py @tdoublep
/vllm/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @alexm-redhat @heheda12345 @ApostaC @orozery
/vllm/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @alexm-redhat @heheda12345 @ApostaC
/vllm/v1/sample @22quinn @houseroad @njhill
/vllm/v1/spec_decode @benchislett @luccafong
/vllm/v1/structured_output @mgoin @russellb @aarnphm @benchislett
/vllm/v1/kv_cache_interface.py @heheda12345
/vllm/v1/kv_offload @ApostaC @orozery
/vllm/v1/worker/gpu/kv_connector.py @orozery
/vllm/v1/worker/kv_connector_model_runner_mixin.py @orozery

# Model runner V2
/vllm/v1/worker/gpu @WoosukKwon
/vllm/v1/offloading @ApostaC

# Test ownership
/.buildkite/lm-eval-harness @mgoin
@@ -56,14 +49,13 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
/tests/test_inputs.py @DarkLight1337 @ywang96
/tests/v1/entrypoints/llm/test_struct_output_generate.py @mgoin @russellb @aarnphm
/tests/v1/structured_output @mgoin @russellb @aarnphm
/tests/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @alexm-redhat @heheda12345 @ApostaC @orozery
/tests/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @alexm-redhat @heheda12345 @ApostaC
/tests/weight_loading @mgoin @youkaichao @yewentao256
/tests/lora @jeejeelee
/tests/models/language/generation/test_hybrid.py @tdoublep
/tests/v1/kv_connector/nixl_integration @NickLucche
/tests/v1/kv_connector @ApostaC @orozery
/tests/v1/kv_offload @ApostaC @orozery
/tests/v1/determinism @yewentao256
/tests/v1/kv_connector @ApostaC
/tests/v1/offloading @ApostaC

# Transformers modeling backend
/vllm/model_executor/models/transformers @hmellor
@@ -119,15 +111,15 @@ mkdocs.yaml @hmellor
/vllm/transformers_utils/tokenizers/mistral.py @patrickvonplaten

# Kernels
/vllm/v1/attention/ops/chunked_prefill_paged_decode.py @tdoublep
/vllm/v1/attention/ops/triton_unified_attention.py @tdoublep
/vllm/attention/ops/chunked_prefill_paged_decode.py @tdoublep
/vllm/attention/ops/triton_unified_attention.py @tdoublep

# ROCm related: specify owner with write access to notify AMD folks for careful code review
/vllm/**/*rocm* @tjtanaa
/docker/Dockerfile.rocm* @gshtras @tjtanaa
/vllm/v1/attention/backends/rocm*.py @gshtras @tjtanaa
/vllm/v1/attention/backends/mla/rocm*.py @gshtras @tjtanaa
/vllm/v1/attention/ops/rocm*.py @gshtras @tjtanaa
/vllm/attention/ops/rocm*.py @gshtras @tjtanaa
/vllm/model_executor/layers/fused_moe/rocm*.py @gshtras @tjtanaa
/csrc/rocm @gshtras @tjtanaa
/requirements/*rocm* @tjtanaa
@@ -149,13 +141,12 @@ mkdocs.yaml @hmellor
/requirements/kv_connectors.txt @NickLucche

# Pooling models
/examples/pooling @noooop
/examples/*/pooling/ @noooop
/tests/models/*/pooling* @noooop
/tests/entrypoints/pooling @noooop
/vllm/entrypoints/pooling @noooop
/vllm/config/pooler.py @noooop
/vllm/pooling_params.py @noooop
/vllm/model_executor/layers/pooler @noooop
/vllm/model_executor/layers/pooler.py @noooop

# Security guide and policies
/docs/usage/security.md @russellb
92
.github/mergify.yml
vendored
@@ -14,52 +14,6 @@ pull_request_rules:
      comment:
        message: "Documentation preview: https://vllm--{{number}}.org.readthedocs.build/en/{{number}}/"

  - name: comment-pre-commit-failure
    description: Comment on PR when pre-commit check fails
    conditions:
      - status-failure=pre-commit
      - -closed
      - -draft
    actions:
      comment:
        message: |
          Hi @{{author}}, the pre-commit checks have failed. Please run:

          ```bash
          uv pip install pre-commit
          pre-commit install
          pre-commit run --all-files
          ```

          Then, commit the changes and push to your branch.

          For future commits, `pre-commit` will run automatically on changed files before each commit.

          > [!TIP]
          > <details>
          > <summary>Is <code>mypy</code> or <code>markdownlint</code> failing?</summary>
          > <br/>
          > <code>mypy</code> and <code>markdownlint</code> are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
          >
          > ```bash
          > # For mypy (substitute "3.10" with the failing version if needed)
          > pre-commit run --hook-stage manual mypy-3.10
          > # For markdownlint
          > pre-commit run --hook-stage manual markdownlint
          > ```
          > </details>

  - name: comment-dco-failure
    description: Comment on PR when DCO check fails
    conditions:
      - status-failure=dco
      - -closed
      - -draft
    actions:
      comment:
        message: |
          Hi @{{author}}, the DCO check has failed. Please click on DCO in the Checks section for instructions on how to resolve this.

  - name: label-ci-build
    description: Automatically apply ci/build label
    conditions:
@@ -186,7 +140,7 @@ pull_request_rules:
      - files~=^tests/entrypoints/test_context.py
      - files~=^vllm/model_executor/models/.*gpt[-_]?oss.*\.py
      - files~=^vllm/model_executor/layers/.*gpt[-_]?oss.*\.py
      - files~=^vllm/entrypoints/openai/parser/harmony_utils.py
      - files~=^vllm/entrypoints/harmony_utils.py
      - files~=^vllm/entrypoints/tool_server.py
      - files~=^vllm/entrypoints/tool.py
      - files~=^vllm/entrypoints/context.py
@@ -222,10 +176,10 @@ pull_request_rules:
      - files~=^csrc/rocm/
      - files~=^docker/Dockerfile.rocm
      - files~=^requirements/rocm.*\.txt
      - files~=^vllm/attention/backends/rocm.*\.py
      - files~=^vllm/attention/ops/rocm.*\.py
      - files~=^vllm/model_executor/layers/fused_moe/rocm.*\.py
      - files~=^vllm/v1/attention/backends/rocm.*\.py
      - files~=^vllm/v1/attention/backends/mla/rocm.*\.py
      - files~=^vllm/v1/attention/ops/rocm.*\.py
      - files~=^tests/kernels/.*_rocm.*\.py
      - files=vllm/platforms/rocm.py
      - title~=(?i)AMD
@@ -235,20 +189,6 @@ pull_request_rules:
        add:
          - rocm

  - name: label-cpu
    description: Automatically apply cpu label
    conditions:
      - label != stale
      - files~=^(?!.*kv_offload)(?!.*cpu_offload).*\bcpu.*
    actions:
      label:
        add:
          - cpu
      assign:
        users:
          - "fadara01"
          - "aditew01"
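The `label-cpu` file pattern leans on PCRE negative lookaheads: any path containing `cpu` matches unless it also contains `kv_offload` or `cpu_offload`. Where GNU grep is available, `grep -P` accepts the same syntax, which makes the rule easy to sanity-check:

```bash
pattern='^(?!.*kv_offload)(?!.*cpu_offload).*\bcpu.*'
grep -P "$pattern" <<'EOF'
csrc/cpu/activation.cpp
vllm/v1/kv_offload/cpu.py
docs/cpu_offload.md
EOF
# Only csrc/cpu/activation.cpp is printed; the other two paths are
# excluded by the lookaheads.
```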
  - name: label-structured-output
    description: Automatically apply structured-output label
    conditions:
@@ -349,18 +289,6 @@ pull_request_rules:
        add:
          - tool-calling

  - name: auto-rebase if approved, ready, and 40 commits behind main
    conditions:
      - base = main
      - label=ready
      - "#approved-reviews-by >= 1"
      - "#commits-behind >= 40"
      - -closed
      - -draft
      - -conflict
    actions:
      rebase: {}

  - name: ping author on conflicts and add 'needs-rebase' label
    conditions:
      - label != stale
@@ -414,18 +342,6 @@ pull_request_rules:
        remove:
          - needs-rebase

  - name: label-bug
    description: Automatically apply bug label
    conditions:
      - label != stale
      - or:
        - title~=(?i)\bbug\b
        - title~=(?i)\bbugfix\b
    actions:
      label:
        add:
          - bug

  - name: label-kv-connector
    description: Automatically apply kv-connector label
    conditions:
@@ -442,4 +358,4 @@ pull_request_rules:
    actions:
      label:
        add:
          - kv-connector
          - kv-connector
4
.github/workflows/cleanup_pr_body.yml
vendored
@@ -13,10 +13,10 @@ jobs:

    steps:
      - name: Checkout repository
        uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8 # v6.0.1
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

      - name: Set up Python
        uses: actions/setup-python@83679a892e2d95755f2dac6acb0bfd1e9ac5d548 # v6.1.0
        uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
        with:
          python-version: '3.12'
25
.github/workflows/issue_autolabel.yml
vendored
@@ -105,31 +105,6 @@ jobs:
                }
              ],
            },
            cpu: {
              // Keyword search - matches whole words only (with word boundaries)
              keywords: [
                {
                  term: "CPU Backend",
                  searchIn: "title"
                },
                {
                  term: "x86",
                  searchIn: "title"
                },
                {
                  term: "ARM",
                  searchIn: "title"
                },
                {
                  term: "Apple Silicon",
                  searchIn: "title"
                },
                {
                  term: "IBM Z",
                  searchIn: "title"
                },
              ],
            },
            // Add more label configurations here as needed
            // example: {
            //   keywords: [...],
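"Whole words only" matters here: a title like "false alarm in scheduler" must not trigger the ARM keyword. The workflow's matcher is JavaScript, but the same word-boundary behavior is easy to check from a shell:

```bash
# -w restricts matches to whole words; -i makes them case-insensitive.
grep -qiw 'ARM' <<<'[Bug] ARM build fails' && echo "match"     # prints "match"
grep -qiw 'ARM' <<<'false alarm in scheduler' && echo "match"  # no output
```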
12
.github/workflows/macos-smoke-test.yml
vendored
@@ -9,10 +9,10 @@ on:
jobs:
  macos-m1-smoke-test:
    runs-on: macos-latest
    timeout-minutes: 30
    timeout-minutes: 20

    steps:
      - uses: actions/checkout@v6.0.1
      - uses: actions/checkout@v4

      - uses: astral-sh/setup-uv@v7
        with:
@@ -29,23 +29,23 @@ jobs:

      - name: Install dependencies and build vLLM
        run: |
          uv pip install -r requirements/cpu-build.txt --index-strategy unsafe-best-match
          uv pip install -r requirements/cpu.txt --index-strategy unsafe-best-match
          uv pip install -e . --no-build-isolation
          uv pip install -e .
        env:
          CMAKE_BUILD_PARALLEL_LEVEL: 4

      - name: Verify installation
        run: |
          python -c "import vllm; print(f'vLLM version: {vllm.__version__}')"
          python -c "import torch; print(f'PyTorch: {torch.__version__}')"

      - name: Smoke test vllm serve
        timeout-minutes: 10
        run: |
          # Start server in background
          vllm serve Qwen/Qwen3-0.6B \
            --max-model-len=2K \
            --max-model-len=2048 \
            --load-format=dummy \
            --hf-overrides '{"num_hidden_layers": 2}' \
            --enforce-eager \
            --port 8000 &
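The server is started in the background, so the part of the step that follows (outside this hunk) presumably waits for it to come up before issuing requests. vLLM's OpenAI-compatible server exposes a `/health` endpoint, so a typical readiness loop looks like this sketch:

```bash
# Poll /health until the backgrounded server answers (up to ~5 minutes).
for _ in $(seq 1 60); do
  curl -sf http://localhost:8000/health >/dev/null && break
  sleep 5
done
curl -sf http://localhost:8000/health >/dev/null \
  || { echo "server never became ready"; exit 1; }
```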
4
.github/workflows/pre-commit.yml
vendored
@@ -16,8 +16,8 @@ jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8 # v6.0.1
      - uses: actions/setup-python@83679a892e2d95755f2dac6acb0bfd1e9ac5d548 # v6.1.0
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
        with:
          python-version: "3.12"
      - run: echo "::add-matcher::.github/workflows/matchers/actionlint.json"
4
.github/workflows/stale.yml
vendored
@@ -7,15 +7,13 @@ on:

jobs:
  close-issues-and-pull-requests:
    # Prevents triggering on forks or other repos
    if: github.repository == 'vllm-project/vllm'
    permissions:
      issues: write
      pull-requests: write
      actions: write
    runs-on: ubuntu-latest
    steps:
      - uses: actions/stale@997185467fa4f803885201cee163a9f38240193d # v10.1.1
      - uses: actions/stale@5f858e3efba33a5ca4407a664cc011ad407f2008 # v10.1.0
        with:
          # Increasing this value ensures that changes to this workflow
          # propagate to all issues and PRs in days rather than months
14
.gitignore
vendored
@@ -4,12 +4,6 @@
# vllm-flash-attn built from source
vllm/vllm_flash_attn/*

# OpenAI triton kernels copied from source
vllm/third_party/triton_kernels/*

# FlashMLA interface copied from source
vllm/third_party/flashmla/flash_mla_interface.py

# triton jit
.triton

@@ -194,9 +188,6 @@ CLAUDE.md
AGENTS.md
.codex/

# Cursor
.cursor/

# DS Store
.DS_Store

@@ -233,8 +224,3 @@ ep_kernels_workspace/

# Allow tracked library source folders under submodules (e.g., benchmarks/lib)
!vllm/benchmarks/lib/

# Generated gRPC protobuf files (compiled at build time from vllm_engine.proto)
vllm/grpc/vllm_engine_pb2.py
vllm/grpc/vllm_engine_pb2_grpc.py
vllm/grpc/vllm_engine_pb2.pyi
.pre-commit-config.yaml
@@ -121,9 +121,24 @@ repos:
       name: Update Dockerfile dependency graph
       entry: tools/pre_commit/update-dockerfile-graph.sh
       language: script
-    - id: check-forbidden-imports
-      name: Check for forbidden imports
-      entry: python tools/pre_commit/check_forbidden_imports.py
+    - id: enforce-import-regex-instead-of-re
+      name: Enforce import regex as re
+      entry: python tools/pre_commit/enforce_regex_import.py
       language: python
       types: [python]
       pass_filenames: false
+      additional_dependencies: [regex]
+    # forbid directly import triton
+    - id: forbid-direct-triton-import
+      name: "Forbid direct 'import triton'"
+      entry: python tools/pre_commit/check_triton_import.py
+      language: python
+      types: [python]
+      pass_filenames: false
+      additional_dependencies: [regex]
+    - id: check-pickle-imports
+      name: Prevent new pickle/cloudpickle imports
+      entry: python tools/pre_commit/check_pickle_imports.py
+      language: python
+      types: [python]
+      additional_dependencies: [regex]
@@ -132,17 +147,6 @@
       entry: python tools/pre_commit/validate_config.py
       language: python
       additional_dependencies: [regex]
-    - id: validate-docker-versions
-      name: Validate docker/versions.json matches Dockerfile
-      entry: python tools/generate_versions_json.py --check
-      language: python
-      files: ^docker/(Dockerfile|versions\.json)$
-      pass_filenames: false
-      additional_dependencies: [dockerfile-parse]
-    - id: attention-backend-docs
-      name: Check attention backend documentation is up to date
-      entry: python tools/pre_commit/generate_attention_backend_docs.py --check
-      language: python
     # Keep `suggestion` last
     - id: suggestion
       name: Suggestion
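These hooks are thin source scans driven by the `regex` package listed in `additional_dependencies`. The actual `tools/pre_commit/check_triton_import.py` is not part of this diff, so the following is only an illustrative sketch of the idea behind a hook like `forbid-direct-triton-import`, not its real implementation:

import sys
from pathlib import Path

import regex  # matches the hooks' additional_dependencies: [regex]

# Flag top-level `import triton` / `from triton import ...` statements.
PATTERN = regex.compile(r"^\s*(?:import|from)\s+triton\b", regex.MULTILINE)

def main(root: str = "vllm") -> int:
    offenders = []
    for path in Path(root).rglob("*.py"):
        if PATTERN.search(path.read_text(encoding="utf-8", errors="ignore")):
            offenders.append(path)
    for path in offenders:
        print(f"{path}: direct 'import triton' is forbidden")
    return 1 if offenders else 0

if __name__ == "__main__":
    sys.exit(main())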
231 CMakeLists.txt
@@ -56,8 +56,8 @@ endif()
 # requirements.txt files and should be kept consistent. The ROCm torch
 # versions are derived from docker/Dockerfile.rocm
 #
-set(TORCH_SUPPORTED_VERSION_CUDA "2.10.0")
-set(TORCH_SUPPORTED_VERSION_ROCM "2.10.0")
+set(TORCH_SUPPORTED_VERSION_CUDA "2.9.0")
+set(TORCH_SUPPORTED_VERSION_ROCM "2.9.0")

 #
 # Try to find python package with an executable that exactly matches
@@ -136,7 +136,7 @@ elseif(HIP_FOUND)

   # ROCm 5.X and 6.X
   if (ROCM_VERSION_DEV_MAJOR GREATER_EQUAL 5 AND
-      Torch_VERSION VERSION_LESS ${TORCH_SUPPORTED_VERSION_ROCM})
+      NOT Torch_VERSION VERSION_EQUAL ${TORCH_SUPPORTED_VERSION_ROCM})
     message(WARNING "Pytorch version >= ${TORCH_SUPPORTED_VERSION_ROCM} "
       "expected for ROCm build, saw ${Torch_VERSION} instead.")
   endif()
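The guard above changes from warning when Torch is older than the pin (`VERSION_LESS`) to warning whenever it is not exactly the pin (`NOT ... VERSION_EQUAL`). A tiny sketch of the behavioral difference, using illustrative version tuples:

pin = (2, 9, 0)   # TORCH_SUPPORTED_VERSION_ROCM
seen = (2, 9, 1)  # hypothetical installed Torch

warn_if_less = seen < pin        # VERSION_LESS semantics: no warning for 2.9.1
warn_if_not_equal = seen != pin  # NOT VERSION_EQUAL semantics: warns for 2.9.1
print(warn_if_less, warn_if_not_equal)  # False True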
@@ -282,7 +282,6 @@ endif()
 set(VLLM_EXT_SRC
   "csrc/mamba/mamba_ssm/selective_scan_fwd.cu"
   "csrc/cache_kernels.cu"
-  "csrc/cache_kernels_fused.cu"
   "csrc/attention/paged_attention_v1.cu"
   "csrc/attention/paged_attention_v2.cu"
   "csrc/attention/merge_attn_states.cu"
@@ -308,7 +307,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   SET(CUTLASS_ENABLE_HEADERS_ONLY ON CACHE BOOL "Enable only the header library")

   # Set CUTLASS_REVISION. Used for FetchContent. Also fixes some bogus messages when building.
-  set(CUTLASS_REVISION "v4.2.1")
+  set(CUTLASS_REVISION "v4.2.1" CACHE STRING "CUTLASS revision to use")

   # Use the specified CUTLASS source directory for compilation if VLLM_CUTLASS_SRC_DIR is provided
   if (DEFINED ENV{VLLM_CUTLASS_SRC_DIR})
@@ -355,42 +354,27 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   # Only build Marlin kernels if we are building for at least some compatible archs.
   # Keep building Marlin for 9.0 as there are some group sizes and shapes that
   # are not supported by Machete yet.
-  # marlin arches for fp16 output
-  cuda_archs_loose_intersection(MARLIN_ARCHS "8.0+PTX" "${CUDA_ARCHS}")
-  # marlin has limited support for turing
-  cuda_archs_loose_intersection(MARLIN_SM75_ARCHS "7.5" "${CUDA_ARCHS}")
-  # marlin arches for bf16 output (we need 9.0 for bf16 atomicAdd PTX)
-  cuda_archs_loose_intersection(MARLIN_BF16_ARCHS "8.0+PTX;9.0+PTX" "${CUDA_ARCHS}")
-  # marlin arches for fp8 input
-  # - sm80 doesn't support fp8 computation
-  # - sm90 and sm100 don't support QMMA.16832.F32.E4M3.E4M3 SAAS instruction
-  # so we only enable fp8 computation for SM89 (e.g. RTX 40x0) and 12.0 (e.g. RTX 50x0)
-  cuda_archs_loose_intersection(MARLIN_FP8_ARCHS "8.9;12.0" "${CUDA_ARCHS}")
-  # marlin arches for other files
-  cuda_archs_loose_intersection(MARLIN_OTHER_ARCHS "7.5;8.0+PTX" "${CUDA_ARCHS}")
-
-  if (MARLIN_OTHER_ARCHS)
+  # 9.0 for latest bf16 atomicAdd PTX
+  cuda_archs_loose_intersection(MARLIN_ARCHS "8.0+PTX;9.0+PTX" "${CUDA_ARCHS}")
+  if (MARLIN_ARCHS)

     #
     # For the Marlin kernels we automatically generate sources for various
     # preselected input type pairs and schedules.
     # Generate sources:
     set(MARLIN_GEN_SCRIPT
-      ${CMAKE_CURRENT_SOURCE_DIR}/csrc/quantization/marlin/generate_kernels.py)
+      ${CMAKE_CURRENT_SOURCE_DIR}/csrc/quantization/gptq_marlin/generate_kernels.py)
     file(MD5 ${MARLIN_GEN_SCRIPT} MARLIN_GEN_SCRIPT_HASH)
-    list(JOIN CUDA_ARCHS "," CUDA_ARCHS_STR)
-    set(MARLIN_GEN_SCRIPT_HASH_AND_ARCH "${MARLIN_GEN_SCRIPT_HASH}(ARCH:${CUDA_ARCHS_STR})")

-    message(STATUS "Marlin generation script hash: ${MARLIN_GEN_SCRIPT_HASH_AND_ARCH}")
-    message(STATUS "Last run Marlin generate script hash: $CACHE{MARLIN_GEN_SCRIPT_HASH_AND_ARCH}")
+    message(STATUS "Marlin generation script hash: ${MARLIN_GEN_SCRIPT_HASH}")
+    message(STATUS "Last run Marlin generate script hash: $CACHE{MARLIN_GEN_SCRIPT_HASH}")

-    if (NOT DEFINED CACHE{MARLIN_GEN_SCRIPT_HASH_AND_ARCH}
-      OR NOT $CACHE{MARLIN_GEN_SCRIPT_HASH_AND_ARCH} STREQUAL ${MARLIN_GEN_SCRIPT_HASH_AND_ARCH})
+    if (NOT DEFINED CACHE{MARLIN_GEN_SCRIPT_HASH}
+      OR NOT $CACHE{MARLIN_GEN_SCRIPT_HASH} STREQUAL ${MARLIN_GEN_SCRIPT_HASH})
       execute_process(
         COMMAND ${CMAKE_COMMAND} -E env
-          PYTHONPATH=$ENV{PYTHONPATH}
-          ${Python_EXECUTABLE} ${MARLIN_GEN_SCRIPT} ${CUDA_ARCHS_STR}
+          PYTHONPATH=$PYTHONPATH
+          ${Python_EXECUTABLE} ${MARLIN_GEN_SCRIPT}
         RESULT_VARIABLE marlin_generation_result
         OUTPUT_VARIABLE marlin_generation_result
         OUTPUT_FILE ${CMAKE_CURRENT_BINARY_DIR}/marlin_generation.log
@@ -403,75 +387,40 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
           "\nCheck the log for details: "
           "${CMAKE_CURRENT_BINARY_DIR}/marlin_generation.log")
       else()
-        set(MARLIN_GEN_SCRIPT_HASH_AND_ARCH ${MARLIN_GEN_SCRIPT_HASH_AND_ARCH}
-          CACHE STRING "Last run Marlin generate script hash and arch" FORCE)
+        set(MARLIN_GEN_SCRIPT_HASH ${MARLIN_GEN_SCRIPT_HASH}
+          CACHE STRING "Last run Marlin generate script hash" FORCE)
         message(STATUS "Marlin generation completed successfully.")
       endif()
     else()
       message(STATUS "Marlin generation script has not changed, skipping generation.")
     endif()

-    if (MARLIN_ARCHS)
-      file(GLOB MARLIN_TEMPLATE_KERNEL_SRC "csrc/quantization/marlin/sm80_kernel_*_float16.cu")
-      set_gencode_flags_for_srcs(
-        SRCS "${MARLIN_TEMPLATE_KERNEL_SRC}"
-        CUDA_ARCHS "${MARLIN_ARCHS}")
-      if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8)
-        set_source_files_properties(${MARLIN_TEMPLATE_KERNEL_SRC}
-          PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false")
-      endif()
-      list(APPEND VLLM_EXT_SRC ${MARLIN_TEMPLATE_KERNEL_SRC})
-
-      file(GLOB MARLIN_TEMPLATE_BF16_KERNEL_SRC "csrc/quantization/marlin/sm80_kernel_*_bfloat16.cu")
-      set_gencode_flags_for_srcs(
-        SRCS "${MARLIN_TEMPLATE_BF16_KERNEL_SRC}"
-        CUDA_ARCHS "${MARLIN_BF16_ARCHS}")
-      if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8)
-        set_source_files_properties(${MARLIN_TEMPLATE_BF16_KERNEL_SRC}
-          PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false")
-      endif()
-      list(APPEND VLLM_EXT_SRC ${MARLIN_TEMPLATE_BF16_KERNEL_SRC})
-    endif()
-
-    if (MARLIN_SM75_ARCHS)
-      file(GLOB MARLIN_TEMPLATE_SM75_KERNEL_SRC "csrc/quantization/marlin/sm75_kernel_*.cu")
-      set_gencode_flags_for_srcs(
-        SRCS "${MARLIN_TEMPLATE_SM75_KERNEL_SRC}"
-        CUDA_ARCHS "${MARLIN_SM75_ARCHS}")
-      if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8)
-        set_source_files_properties(${MARLIN_TEMPLATE_SM75_KERNEL_SRC}
-          PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false")
-      endif()
-      list(APPEND VLLM_EXT_SRC ${MARLIN_TEMPLATE_SM75_KERNEL_SRC})
-    endif()
-
-    if (MARLIN_FP8_ARCHS)
-      file(GLOB MARLIN_TEMPLATE_FP8_KERNEL_SRC "csrc/quantization/marlin/sm89_kernel_*.cu")
-      set_gencode_flags_for_srcs(
-        SRCS "${MARLIN_TEMPLATE_FP8_KERNEL_SRC}"
-        CUDA_ARCHS "${MARLIN_FP8_ARCHS}")
-      if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8)
-        set_source_files_properties(${MARLIN_TEMPLATE_FP8_KERNEL_SRC}
-          PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false")
-      endif()
-      list(APPEND VLLM_EXT_SRC ${MARLIN_TEMPLATE_FP8_KERNEL_SRC})
+    file(GLOB MARLIN_TEMPLATE_KERNEL_SRC "csrc/quantization/gptq_marlin/kernel_*.cu")
+    set_gencode_flags_for_srcs(
+      SRCS "${MARLIN_TEMPLATE_KERNEL_SRC}"
+      CUDA_ARCHS "${MARLIN_ARCHS}")
+    if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8)
+      set_source_files_properties(${MARLIN_TEMPLATE_KERNEL_SRC}
+        PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false")
     endif()
+    list(APPEND VLLM_EXT_SRC ${MARLIN_TEMPLATE_KERNEL_SRC})

     set(MARLIN_SRCS
       "csrc/quantization/marlin/marlin.cu"
-      "csrc/quantization/marlin/marlin_int4_fp8_preprocess.cu"
-      "csrc/quantization/marlin/gptq_marlin_repack.cu"
-      "csrc/quantization/marlin/awq_marlin_repack.cu")
+      "csrc/quantization/marlin/sparse/marlin_24_cuda_kernel.cu"
+      "csrc/quantization/gptq_marlin/gptq_marlin.cu"
+      "csrc/quantization/gptq_marlin/gptq_marlin_repack.cu"
+      "csrc/quantization/gptq_marlin/awq_marlin_repack.cu")
     set_gencode_flags_for_srcs(
       SRCS "${MARLIN_SRCS}"
-      CUDA_ARCHS "${MARLIN_OTHER_ARCHS}")
+      CUDA_ARCHS "${MARLIN_ARCHS}")
     if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8)
-      set_source_files_properties(${MARLIN_SRCS}
+      set_source_files_properties("csrc/quantization/gptq_marlin/gptq_marlin.cu"
        PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false")
     endif()
     list(APPEND VLLM_EXT_SRC "${MARLIN_SRCS}")

-    message(STATUS "Building Marlin kernels for archs: ${MARLIN_OTHER_ARCHS}")
+    message(STATUS "Building Marlin kernels for archs: ${MARLIN_ARCHS}")
   else()
     message(STATUS "Not building Marlin kernels as no compatible archs found"
       " in CUDA target architectures")
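Both sides cache a fingerprint of the generator script so regeneration is skipped when nothing changed; the left side additionally folds the CUDA arch list into the cache key, so switching target arches also retriggers generation. A minimal sketch of that composite-key idea in Python (names and the persisted value are illustrative):

import hashlib
from pathlib import Path

def generation_cache_key(script: Path, cuda_archs: list[str]) -> str:
    """MD5 of the generator script plus the arch list, shaped like
    MARLIN_GEN_SCRIPT_HASH_AND_ARCH = "<md5>(ARCH:<archs>)" above."""
    digest = hashlib.md5(script.read_bytes()).hexdigest()
    return f"{digest}(ARCH:{','.join(cuda_archs)})"

# Regenerate only when the composite key differs from the cached one.
cached_key = ""  # CMake persists this via CACHE STRING ... FORCE
key = generation_cache_key(Path(__file__), ["8.0+PTX", "8.9", "12.0"])
if key != cached_key:
    print("generator or arch list changed: running kernel generation")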
@@ -655,15 +604,12 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
     set(SRCS
       "csrc/quantization/fp4/nvfp4_quant_kernels.cu"
       "csrc/quantization/fp4/activation_nvfp4_quant_fusion_kernels.cu"
       "csrc/quantization/fp4/nvfp4_experts_quant.cu"
-      "csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu"
-      "csrc/quantization/fp4/nvfp4_blockwise_moe_kernel.cu")
+      "csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu")
     set_gencode_flags_for_srcs(
       SRCS "${SRCS}"
       CUDA_ARCHS "${FP4_ARCHS}")
     list(APPEND VLLM_EXT_SRC "${SRCS}")
     list(APPEND VLLM_GPU_FLAGS "-DENABLE_NVFP4_SM120=1")
-    list(APPEND VLLM_GPU_FLAGS "-DENABLE_CUTLASS_MOE_SM120=1")
     message(STATUS "Building NVFP4 for archs: ${FP4_ARCHS}")
   else()
     message(STATUS "Not building NVFP4 as no compatible archs were found.")
@@ -799,6 +745,24 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   else()
     cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a" "${CUDA_ARCHS}")
   endif()
+  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
+    set(SRCS "csrc/quantization/w8a8/cutlass/moe/blockwise_scaled_group_mm_sm100.cu")
+    set_gencode_flags_for_srcs(
+      SRCS "${SRCS}"
+      CUDA_ARCHS "${SCALED_MM_ARCHS}")
+    list(APPEND VLLM_EXT_SRC "${SRCS}")
+    list(APPEND VLLM_GPU_FLAGS "-DENABLE_CUTLASS_MOE_SM100=1")
+    message(STATUS "Building blockwise_scaled_group_mm_sm100 for archs: ${SCALED_MM_ARCHS}")
+  else()
+    if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
+      message(STATUS "Not building blockwise_scaled_group_mm_sm100 kernels as CUDA Compiler version is "
+        "not >= 12.8, we recommend upgrading to CUDA 12.8 or later "
+        "if you intend on running FP8 quantized MoE models on Blackwell.")
+    else()
+      message(STATUS "Not building blockwise_scaled_group_mm_sm100 as no compatible archs found "
+        "in CUDA target architectures")
+    endif()
+  endif()

   #
   # Machete kernels
@@ -822,7 +786,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
         OR NOT $CACHE{MACHETE_GEN_SCRIPT_HASH} STREQUAL ${MACHETE_GEN_SCRIPT_HASH})
       execute_process(
         COMMAND ${CMAKE_COMMAND} -E env
-          PYTHONPATH=${CMAKE_CURRENT_SOURCE_DIR}/csrc/cutlass_extensions/:${CUTLASS_DIR}/python/:${VLLM_PYTHON_PATH}:$ENV{PYTHONPATH}
+          PYTHONPATH=${CMAKE_CURRENT_SOURCE_DIR}/csrc/cutlass_extensions/:${CUTLASS_DIR}/python/:${VLLM_PYTHON_PATH}:$PYTHONPATH
           ${Python_EXECUTABLE} ${MACHETE_GEN_SCRIPT}
         RESULT_VARIABLE machete_generation_result
         OUTPUT_VARIABLE machete_generation_output
@@ -874,10 +838,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   cuda_archs_loose_intersection(W4A8_ARCHS "9.0a" "${CUDA_ARCHS}")
   if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.0 AND W4A8_ARCHS)
     set(SRCS
-      "csrc/quantization/cutlass_w4a8/w4a8_mm_entry.cu"
-      "csrc/quantization/cutlass_w4a8/w4a8_grouped_mm_entry.cu"
-      "csrc/quantization/cutlass_w4a8/w4a8_utils.cu"
-    )
+      "csrc/quantization/cutlass_w4a8/w4a8_mm_entry.cu")

     set_gencode_flags_for_srcs(
       SRCS "${SRCS}"
@@ -947,6 +908,7 @@ target_compile_definitions(_C PRIVATE CUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL=1)
 set(VLLM_MOE_EXT_SRC
   "csrc/moe/torch_bindings.cpp"
   "csrc/moe/moe_align_sum_kernels.cu"
+  "csrc/moe/moe_lora_align_sum_kernels.cu"
   "csrc/moe/topk_softmax_kernels.cu")

 if(VLLM_GPU_LANG STREQUAL "CUDA")
@@ -976,20 +938,9 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
     CUDA_ARCHS "${CUDA_ARCHS}")

   list(APPEND VLLM_MOE_EXT_SRC "${VLLM_MOE_WNA16_SRC}")
-  # moe marlin arches
-  # note that we always set `use_atomic_add=False` for moe marlin now,
-  # so we don't need 9.0 for bf16 atomicAdd PTX
-  cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0+PTX" "${CUDA_ARCHS}")
-  # moe marlin has limited support for turing
-  cuda_archs_loose_intersection(MARLIN_MOE_SM75_ARCHS "7.5" "${CUDA_ARCHS}")
-  # moe marlin arches for fp8 input
-  # - sm80 doesn't support fp8 computation
-  # - sm90 and sm100 don't support QMMA.16832.F32.E4M3.E4M3 SAAS instruction
-  # so we only enable fp8 computation for SM89 (e.g. RTX 40x0) and 12.0 (e.g. RTX 50x0)
-  cuda_archs_loose_intersection(MARLIN_MOE_FP8_ARCHS "8.9;12.0" "${CUDA_ARCHS}")
-  # moe marlin arches for other files
-  cuda_archs_loose_intersection(MARLIN_MOE_OTHER_ARCHS "7.5;8.0+PTX" "${CUDA_ARCHS}")
-  if (MARLIN_MOE_OTHER_ARCHS)
+  # 9.0 for latest bf16 atomicAdd PTX
+  cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0+PTX;9.0+PTX" "${CUDA_ARCHS}")
+  if (MARLIN_MOE_ARCHS)

     #
     # For the Marlin MOE kernels we automatically generate sources for various
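The `cuda_archs_loose_intersection` calls above select which kernel variants to compile per input/output dtype (e.g. fp8 only on 8.9 and 12.0). Its actual CMake implementation is not shown in this diff; the sketch below is only a rough approximation of the concept of intersecting a requested arch list with the build's target arches, where a `+PTX` suffix is treated as forward-compatible:

def loose_intersection(requested: list[str], targets: list[str]) -> list[str]:
    """Keep each requested arch that appears in the targets; a '+PTX' entry
    also matches any newer target, since PTX can be JIT-compiled forward."""
    def num(arch: str) -> float:
        return float(arch.partition("+")[0].rstrip("af"))

    target_bases = {t.partition("+")[0] for t in targets}
    kept = []
    for req in requested:
        base, _, suffix = req.partition("+")
        if base in target_bases:
            kept.append(req)
        elif suffix == "PTX" and any(num(t) >= num(req) for t in targets):
            kept.append(req)
    return kept

print(loose_intersection(["8.9", "12.0"], ["8.0", "8.9", "9.0a"]))  # ['8.9']
print(loose_intersection(["8.0+PTX"], ["9.0a"]))                    # ['8.0+PTX']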
@@ -998,18 +949,16 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
     set(MOE_MARLIN_GEN_SCRIPT
       ${CMAKE_CURRENT_SOURCE_DIR}/csrc/moe/marlin_moe_wna16/generate_kernels.py)
     file(MD5 ${MOE_MARLIN_GEN_SCRIPT} MOE_MARLIN_GEN_SCRIPT_HASH)
-    list(JOIN CUDA_ARCHS "," CUDA_ARCHS_STR)
-    set(MOE_MARLIN_GEN_SCRIPT_HASH_AND_ARCH "${MOE_MARLIN_GEN_SCRIPT_HASH}(ARCH:${CUDA_ARCHS_STR})")

-    message(STATUS "Marlin MOE generation script hash with arch: ${MOE_MARLIN_GEN_SCRIPT_HASH_AND_ARCH}")
-    message(STATUS "Last run Marlin MOE generate script hash with arch: $CACHE{MOE_MARLIN_GEN_SCRIPT_HASH_AND_ARCH}")
+    message(STATUS "Marlin MOE generation script hash: ${MOE_MARLIN_GEN_SCRIPT_HASH}")
+    message(STATUS "Last run Marlin MOE generate script hash: $CACHE{MOE_MARLIN_GEN_SCRIPT_HASH}")

-    if (NOT DEFINED CACHE{MOE_MARLIN_GEN_SCRIPT_HASH_AND_ARCH}
-      OR NOT $CACHE{MOE_MARLIN_GEN_SCRIPT_HASH_AND_ARCH} STREQUAL ${MOE_MARLIN_GEN_SCRIPT_HASH_AND_ARCH})
+    if (NOT DEFINED CACHE{MOE_MARLIN_GEN_SCRIPT_HASH}
+      OR NOT $CACHE{MOE_MARLIN_GEN_SCRIPT_HASH} STREQUAL ${MOE_MARLIN_GEN_SCRIPT_HASH})
       execute_process(
         COMMAND ${CMAKE_COMMAND} -E env
-          PYTHONPATH=$ENV{PYTHONPATH}
-          ${Python_EXECUTABLE} ${MOE_MARLIN_GEN_SCRIPT} ${CUDA_ARCHS_STR}
+          PYTHONPATH=$PYTHONPATH
+          ${Python_EXECUTABLE} ${MOE_MARLIN_GEN_SCRIPT}
         RESULT_VARIABLE moe_marlin_generation_result
         OUTPUT_VARIABLE moe_marlin_generation_output
         OUTPUT_FILE ${CMAKE_CURRENT_BINARY_DIR}/moe_marlin_generation.log
@@ -1022,7 +971,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
           "\nCheck the log for details: "
           "${CMAKE_CURRENT_BINARY_DIR}/moe_marlin_generation.log")
       else()
-        set(MOE_MARLIN_GEN_SCRIPT_HASH_AND_ARCH ${MOE_MARLIN_GEN_SCRIPT_HASH_AND_ARCH}
+        set(MOE_MARLIN_GEN_SCRIPT_HASH ${MOE_MARLIN_GEN_SCRIPT_HASH}
          CACHE STRING "Last run Marlin MOE generate script hash" FORCE)
         message(STATUS "Marlin MOE generation completed successfully.")
       endif()
@@ -1030,53 +979,18 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
       message(STATUS "Marlin MOE generation script has not changed, skipping generation.")
     endif()

-    if (MARLIN_MOE_ARCHS)
-      file(GLOB MARLIN_MOE_SRC "csrc/moe/marlin_moe_wna16/sm80_kernel_*.cu")
-      set_gencode_flags_for_srcs(
-        SRCS "${MARLIN_MOE_SRC}"
-        CUDA_ARCHS "${MARLIN_MOE_ARCHS}")
-      if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8)
-        set_source_files_properties(${MARLIN_MOE_SRC}
-          PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false")
-      endif()
-      list(APPEND VLLM_MOE_EXT_SRC ${MARLIN_MOE_SRC})
-    endif()
-
-    if (MARLIN_MOE_SM75_ARCHS)
-      file(GLOB MARLIN_MOE_SM75_SRC "csrc/moe/marlin_moe_wna16/sm75_kernel_*.cu")
-      set_gencode_flags_for_srcs(
-        SRCS "${MARLIN_MOE_SM75_SRC}"
-        CUDA_ARCHS "${MARLIN_MOE_SM75_ARCHS}")
-      if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8)
-        set_source_files_properties(${MARLIN_MOE_SM75_SRC}
-          PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false")
-      endif()
-      list(APPEND VLLM_MOE_EXT_SRC ${MARLIN_MOE_SM75_SRC})
-    endif()
-
-    if (MARLIN_MOE_FP8_ARCHS)
-      file(GLOB MARLIN_MOE_FP8_SRC "csrc/moe/marlin_moe_wna16/sm89_kernel_*.cu")
-      set_gencode_flags_for_srcs(
-        SRCS "${MARLIN_MOE_FP8_SRC}"
-        CUDA_ARCHS "${MARLIN_MOE_FP8_ARCHS}")
-      if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8)
-        set_source_files_properties(${MARLIN_MOE_FP8_SRC}
-          PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false")
-      endif()
-      list(APPEND VLLM_MOE_EXT_SRC ${MARLIN_MOE_FP8_SRC})
-    endif()
-
-    set(MARLIN_MOE_OTHER_SRC "csrc/moe/marlin_moe_wna16/ops.cu")
+    file(GLOB MOE_WNAA16_MARLIN_SRC "csrc/moe/marlin_moe_wna16/*.cu")
     set_gencode_flags_for_srcs(
-      SRCS "${MARLIN_MOE_OTHER_SRC}"
-      CUDA_ARCHS "${MARLIN_MOE_OTHER_ARCHS}")
+      SRCS "${MOE_WNAA16_MARLIN_SRC}"
+      CUDA_ARCHS "${MARLIN_MOE_ARCHS}")
     if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8)
-      set_source_files_properties(${MARLIN_MOE_OTHER_SRC}
+      set_source_files_properties(${MOE_WNAA16_MARLIN_SRC}
        PROPERTIES COMPILE_FLAGS "-static-global-template-stub=false")
     endif()
-    list(APPEND VLLM_MOE_EXT_SRC "${MARLIN_MOE_OTHER_SRC}")
-
-    message(STATUS "Building Marlin MOE kernels for archs: ${MARLIN_MOE_OTHER_ARCHS}")
+    list(APPEND VLLM_MOE_EXT_SRC ${MOE_WNAA16_MARLIN_SRC})

+    message(STATUS "Building Marlin MOE kernels for archs: ${MARLIN_MOE_ARCHS}")
   else()
     message(STATUS "Not building Marlin MOE kernels as no compatible archs found"
       " in CUDA target architectures")
@@ -1116,11 +1030,6 @@ if(VLLM_GPU_LANG STREQUAL "HIP")
     WITH_SOABI)
 endif()

-# For CUDA and HIP builds also build the triton_kernels external package.
-if(VLLM_GPU_LANG STREQUAL "CUDA" OR VLLM_GPU_LANG STREQUAL "HIP")
-  include(cmake/external_projects/triton_kernels.cmake)
-endif()
-
 # For CUDA we also build and ship some external projects.
 if (VLLM_GPU_LANG STREQUAL "CUDA")
   include(cmake/external_projects/flashmla.cmake)
89 README.md
@@ -14,8 +14,50 @@ Easy, fast, and cheap LLM serving for everyone
 | <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
 </p>

-🔥 We have built a vllm website to help you get started with vllm. Please visit [vllm.ai](https://vllm.ai) to learn more.
-For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.
 ---

+Join us at the [PyTorch Conference, October 22-23](https://events.linuxfoundation.org/pytorch-conference/) and [Ray Summit, November 3-5](https://www.anyscale.com/ray-summit/2025) in San Francisco for our latest updates on vLLM and to meet the vLLM team! Register now for the largest vLLM community events of the year!
+
+---
+
+*Latest News* 🔥
+
+- [2025/11] We hosted [the first vLLM Europe Meetup in Zurich](https://luma.com/0gls27kb) focused on quantization, distributed inference, and reinforcement learning at scale with speakers from Mistral, IBM, and Red Hat. Please find the meetup slides [here](https://docs.google.com/presentation/d/1UC9PTLCHYXQpOmJDSFg6Sljra3iVXzc09DeEI7dnxMc/edit?usp=sharing) and recording [here](https://www.youtube.com/watch?v=6m6ZE6yVEDI)
+- [2025/11] We hosted [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/xSrYXjNgr1HbCP4ExYNG1w) focusing on distributed inference and diverse accelerator support with vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1nQJ8ZkLSjKxvu36sSHaceVXtttbLvvu-?usp=drive_link).
+- [2025/10] We hosted [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/__xb4OyOsImz-9eAVrdlcg) focused on hands-on vLLM inference optimization! Please find the meetup slides [here](https://drive.google.com/drive/folders/1KqwjsFJLfEsC8wlDugnrR61zsWHt94Q6).
+- [2025/09] We hosted [vLLM Toronto Meetup](https://luma.com/e80e0ymm) focused on tackling inference at scale and speculative decoding with speakers from NVIDIA and Red Hat! Please find the meetup slides [here](https://docs.google.com/presentation/d/1IYJYmJcu9fLpID5N5RbW_vO0XLo0CGOR14IXOjB61V8/edit?usp=sharing).
+- [2025/08] We hosted [vLLM Shenzhen Meetup](https://mp.weixin.qq.com/s/k8ZBO1u2_2odgiKWH_GVTQ) focusing on the ecosystem around vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Ua2SVKVSu-wp5vou_6ElraDt2bnKhiEA).
+- [2025/08] We hosted [vLLM Singapore Meetup](https://www.sginnovate.com/event/vllm-sg-meet). We shared V1 updates, disaggregated serving and MLLM speedups with speakers from Embedded LLM, AMD, WekaIO, and A*STAR. Please find the meetup slides [here](https://drive.google.com/drive/folders/1ncf3GyqLdqFaB6IeB834E5TZJPLAOiXZ?usp=sharing).
+- [2025/08] We hosted [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg) focusing on building, developing, and integrating with vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH).
+- [2025/05] vLLM is now a hosted project under PyTorch Foundation! Please find the announcement [here](https://pytorch.org/blog/pytorch-foundation-welcomes-vllm/).
+- [2025/01] We are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more. Please check out our blog post [here](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html).
+
+<details>
+<summary>Previous News</summary>
+
+- [2025/08] We hosted [vLLM Korea Meetup](https://luma.com/cgcgprmh) with Red Hat and Rebellions! We shared the latest advancements in vLLM along with project spotlights from the vLLM Korea community. Please find the meetup slides [here](https://drive.google.com/file/d/1bcrrAE1rxUgx0mjIeOWT6hNe2RefC5Hm/view).
+- [2025/08] We hosted [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/dgkWg1WFpWGO2jCdTqQHxA) focusing on large-scale LLM deployment! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF) and the recording [here](https://www.chaspark.com/#/live/1166916873711665152).
+- [2025/05] We hosted [NYC vLLM Meetup](https://lu.ma/c1rqyf1f)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing).
+- [2025/04] We hosted [Asia Developer Day](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/19cp6Qu8u48ihB91A064XfaXruNYiBOUKrBxAmDOllOo/edit?usp=sharing).
+- [2025/03] We hosted [vLLM x Ollama Inference Night](https://lu.ma/vllm-ollama)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/16T2PDD1YwRnZ4Tu8Q5r6n53c5Lr5c73UV9Vd2_eBo4U/edit?usp=sharing).
+- [2025/03] We hosted [the first vLLM China Meetup](https://mp.weixin.qq.com/s/n77GibL2corAtQHtVEAzfg)! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1REHvfQMKGnvz6p3Fd23HhSO4c8j5WPGZV0bKYLwnHyQ/edit?usp=sharing).
+- [2025/03] We hosted [the East Coast vLLM Meetup](https://lu.ma/7mu4k4xx)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1NHiv8EUFF1NLd3fEYODm56nDmL26lEeXCaDgyDlTsRs/edit#slide=id.g31441846c39_0_0).
+- [2025/02] We hosted [the ninth vLLM meetup](https://lu.ma/h7g3kuj9) with Meta! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1jzC_PZVXrVNSFVCW-V4cFXb6pn7zZ2CyP_Flwo05aqg/edit?usp=sharing) and AMD [here](https://drive.google.com/file/d/1Zk5qEJIkTmlQ2eQcXQZlljAx3m9s7nwn/view?usp=sharing). The slides from Meta will not be posted.
+- [2025/01] We hosted [the eighth vLLM meetup](https://lu.ma/zep56hui) with Google Cloud! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1epVkt4Zu8Jz_S5OhEHPc798emsYh2BwYfRuDDVEF7u4/edit?usp=sharing), and Google Cloud team [here](https://drive.google.com/file/d/1h24pHewANyRL11xy5dXUbvRC9F9Kkjix/view?usp=sharing).
+- [2024/12] vLLM joins [pytorch ecosystem](https://pytorch.org/blog/vllm-joins-pytorch)! Easy, Fast, and Cheap LLM Serving for Everyone!
+- [2024/11] We hosted [the seventh vLLM meetup](https://lu.ma/h0qvrajz) with Snowflake! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit?usp=sharing), and Snowflake team [here](https://docs.google.com/presentation/d/1qF3RkDAbOULwz9WK5TOltt2fE9t6uIc_hVNLFAaQX6A/edit?usp=sharing).
+- [2024/10] We have just created a developer slack ([slack.vllm.ai](https://slack.vllm.ai)) focusing on coordinating contributions and discussing features. Please feel free to join us there!
+- [2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team [here](https://docs.google.com/presentation/d/1B_KQxpHBTRa_mDF-tR6i8rWdOU5QoTZNcEg2MKZxEHM/edit?usp=sharing). Learn more from the [talks](https://www.youtube.com/playlist?list=PLzTswPQNepXl6AQwifuwUImLPFRVpksjR) from other vLLM contributors and users!
+- [2024/09] We hosted [the sixth vLLM meetup](https://lu.ma/87q3nvnh) with NVIDIA! Please find the meetup slides [here](https://docs.google.com/presentation/d/1wrLGwytQfaOTd5wCGSPNhoaW3nq0E-9wqyP7ny93xRs/edit?usp=sharing).
+- [2024/07] We hosted [the fifth vLLM meetup](https://lu.ma/lp0gyjqr) with AWS! Please find the meetup slides [here](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing).
+- [2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post [here](https://blog.vllm.ai/2024/07/23/llama31.html).
+- [2024/06] We hosted [the fourth vLLM meetup](https://lu.ma/agivllm) with Cloudflare and BentoML! Please find the meetup slides [here](https://docs.google.com/presentation/d/1iJ8o7V2bQEi0BFEljLTwc5G1S10_Rhv3beed5oB0NJ4/edit?usp=sharing).
+- [2024/04] We hosted [the third vLLM meetup](https://robloxandvllmmeetup2024.splashthat.com/) with Roblox! Please find the meetup slides [here](https://docs.google.com/presentation/d/1A--47JAK4BJ39t954HyTkvtfwn0fkqtsL8NGFuslReM/edit?usp=sharing).
+- [2024/01] We hosted [the second vLLM meetup](https://lu.ma/ygxbpzhl) with IBM! Please find the meetup slides [here](https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit?usp=sharing).
+- [2023/10] We hosted [the first vLLM meetup](https://lu.ma/first-vllm-meetup) with a16z! Please find the meetup slides [here](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing).
+- [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM.
+- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).
+
+</details>
+
 ---
@@ -75,6 +117,47 @@ Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
 We welcome and value any contributions and collaborations.
 Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.

+## Sponsors
+
+vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support!
+
+<!-- Note: Please sort them in alphabetical order. -->
+<!-- Note: Please keep these consistent with docs/community/sponsors.md -->
+Cash Donations:
+
+- a16z
+- Dropbox
+- Sequoia Capital
+- Skywork AI
+- ZhenFund
+
+Compute Resources:
+
+- Alibaba Cloud
+- AMD
+- Anyscale
+- AWS
+- Crusoe Cloud
+- Databricks
+- DeepInfra
+- Google Cloud
+- Intel
+- Lambda Lab
+- Nebius
+- Novita AI
+- NVIDIA
+- Replicate
+- Roblox
+- RunPod
+- Trainy
+- UC Berkeley
+- UC San Diego
+- Volcengine
+
+Slack Sponsor: Anyscale
+
+We also have an official fundraising venue through [OpenCollective](https://opencollective.com/vllm). We plan to use the fund to support the development, maintenance, and adoption of vLLM.
+
 ## Citation

 If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
@@ -95,7 +178,7 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs
 - For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
 - For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
 - For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
-- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
+- For collaborations and partnerships, please contact us at [vllm-questions@lists.berkeley.edu](mailto:vllm-questions@lists.berkeley.edu)
 <!-- --8<-- [end:contact-us] -->

 ## Media Kit
Some files were not shown because too many files have changed in this diff.