[torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding (#33624)

Signed-off-by: Richard Zou <zou3519@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
(cherry picked from commit 5eac9a1b34)
[Misc][Build] Lazy load cv2 in nemotron_parse.py (#33189)
2026-02-03 00:07:18 -08:00 · 2026-02-03 00:03:56 -08:00 · 2026-02-03 00:03:14 -08:00 · 2026-02-03 00:02:39 -08:00 · 2026-02-02 16:19:05 -08:00 · 2026-02-02 14:54:08 -08:00
62 changed files with 4469 additions and 710 deletions
--- a/.buildkite/release-pipeline.yaml
+++ b/.buildkite/release-pipeline.yaml
@@ -274,14 +274,14 @@ steps:
          - input-release-version
          - build-wheels
-      - label: "Upload release wheels to PyPI and GitHub"
+      - label: "Upload release wheels to PyPI"
        depends_on:
          - block-upload-release-wheels
        id: upload-release-wheels
        agents:
          queue: small_cpu_queue_postmerge
        commands:
-          - "bash .buildkite/scripts/upload-release-wheels.sh"
+          - "bash .buildkite/scripts/upload-release-wheels-pypi.sh"
  # =============================================================================
  # ROCm Release Pipeline (x86_64 only)
@@ -638,9 +638,93 @@ steps:
    depends_on:
      - step: upload-rocm-wheels
        allow_failure: true
      - step: input-release-version
        allow_failure: true
    agents:
      queue: cpu_queue_postmerge
    commands:
      - "bash .buildkite/scripts/annotate-rocm-release.sh"
    env:
      S3_BUCKET: "vllm-wheels"
  # ROCm Job 5: Generate Root Index for ROCm Wheels (for release only)
  # This is the job to create https://wheels.vllm.ai/rocm/ index allowing
  # users to install with `uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/`
  - block: "Generate Root Index for ROCm Wheels for Release"
    key: block-generate-root-index-rocm-wheels
    depends_on: upload-rocm-wheels
  - label: ":package: Generate Root Index for ROCm Wheels for Release"
    depends_on: block-generate-root-index-rocm-wheels
    id: generate-root-index-rocm-wheels
    agents:
      queue: cpu_queue_postmerge
    commands:
      - "bash tools/vllm-rocm/generate-rocm-wheels-root-index.sh"
    env:
      S3_BUCKET: "vllm-wheels"
      VARIANT: "rocm700"
  # ROCm Job 5: Build ROCm Release Docker Image
  - label: ":rocm: :docker: Build ROCm Release Docker Image"
    id: build-rocm-release-image
    depends_on:
      - step: build-rocm-base-wheels
        allow_failure: false
    agents:
      queue: cpu_queue_postmerge
    timeout_in_minutes: 60
    commands:
      - |
        set -euo pipefail
        # Login to ECR
        aws ecr-public get-login-password --region us-east-1 | \
          docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7
        # Download Docker image from S3 (set by build-rocm-base-wheels)
        DOCKER_IMAGE_S3_PATH="$$(buildkite-agent meta-data get rocm-docker-image-s3-path 2>/dev/null || echo '')"
        if [ -z "$${DOCKER_IMAGE_S3_PATH}" ]; then
          echo "ERROR: rocm-docker-image-s3-path metadata not found"
          exit 1
        fi
        echo "Downloading base image from $${DOCKER_IMAGE_S3_PATH}"
        mkdir -p artifacts/rocm-docker-image
        aws s3 cp "$${DOCKER_IMAGE_S3_PATH}" artifacts/rocm-docker-image/rocm-base-image.tar.gz
        # Load base Docker image
        echo "Loading base Docker image..."
        LOAD_OUTPUT=$$(gunzip -c artifacts/rocm-docker-image/rocm-base-image.tar.gz | docker load)
        BASE_IMAGE_TAG=$$(echo "$${LOAD_OUTPUT}" | grep "Loaded image:" | sed 's/Loaded image: //')
        echo "Loaded base image: $${BASE_IMAGE_TAG}"
        # Tag and push the base image to ECR
        docker tag "$${BASE_IMAGE_TAG}" public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm-base
        docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm-base
        echo "Pushed base image: public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm-base"
        # Get GPU architectures from meta-data
        PYTORCH_ROCM_ARCH="$$(buildkite-agent meta-data get rocm-pytorch-rocm-arch 2>/dev/null || echo '')"
        PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH:-gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151}"
        # Build vLLM ROCm release image using cached base
        DOCKER_BUILDKIT=1 docker build \
          --build-arg max_jobs=16 \
          --build-arg BASE_IMAGE="$${BASE_IMAGE_TAG}" \
          --build-arg ARG_PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH}" \
          --build-arg USE_SCCACHE=1 \
          --build-arg SCCACHE_BUCKET_NAME=vllm-build-sccache \
          --build-arg SCCACHE_REGION_NAME=us-west-2 \
          --build-arg SCCACHE_S3_NO_CREDENTIALS=0 \
          --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm \
          --target vllm-openai \
          --progress plain \
          -f docker/Dockerfile.rocm .
        # Push to ECR
        docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm
        echo "Pushed: public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm"
    env:
      DOCKER_BUILDKIT: "1"
      S3_BUCKET: "vllm-wheels"
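The Docker image step above recovers the base image tag by filtering the output of `docker load` for the `Loaded image:` line and stripping that prefix. The same parsing can be sketched in Python for illustration. The function name `parse_loaded_image` and the sample tag in the usage note are hypothetical and are not part of the pipeline.

```python
def parse_loaded_image(load_output: str) -> str:
    """Return the tag from the first 'Loaded image: <tag>' line.

    Illustrative counterpart of:
        echo "${LOAD_OUTPUT}" | grep "Loaded image:" | sed 's/Loaded image: //'
    """
    prefix = "Loaded image: "
    for line in load_output.splitlines():
        if line.startswith(prefix):
            # Drop the prefix and any trailing whitespace.
            return line[len(prefix):].strip()
    # Fail loudly instead of continuing with an empty tag.
    raise ValueError("no 'Loaded image:' line found in docker load output")
```

For example, `parse_loaded_image("Loaded image: example/rocm-base:tag\n")` yields `example/rocm-base:tag`. Unlike the `grep | sed` form, this sketch returns only the first matching line, which may be preferable if the archive contains more than one image.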
--- a/.buildkite/scripts/annotate-release.sh
+++ b/.buildkite/scripts/annotate-release.sh
@@ -11,51 +11,80 @@ fi
 buildkite-agent annotate --style 'info' --context 'release-workflow' << EOF
 To download the wheel (by commit):
 \`\`\`
-aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux1_x86_64.whl .
+aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux_2_31_x86_64.whl .
-aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux2014_aarch64.whl .
+aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux_2_31_aarch64.whl .
-aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cu129-cp38-abi3-manylinux1_x86_64.whl .
+(Optional) For CUDA 13.0:
-aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cu129-cp38-abi3-manylinux1_x86_64.whl .
+aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cu130-cp38-abi3-manylinux_2_35_x86_64.whl .
 aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cu130-cp38-abi3-manylinux_2_35_aarch64.whl .
 (Optional) For CPU:
 aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl .
 aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cpu-cp38-abi3-manylinux_2_35_aarch64.whl .
 \`\`\`
 To download the wheel (by version):
 \`\`\`
 aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux1_x86_64.whl .
 aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux2014_aarch64.whl .
 aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}+cu129/vllm-${RELEASE_VERSION}+cu129-cp38-abi3-manylinux1_x86_64.whl .
 aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}+cu130/vllm-${RELEASE_VERSION}+cu130-cp38-abi3-manylinux1_x86_64.whl .
 \`\`\`
 To download and upload the image:
 \`\`\`
 Download images:
 docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64
 docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64
 docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64-cu130
 docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64-cu130
 docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm-base
 docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm
 Tag and push images:
 docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64 vllm/vllm-openai:x86_64
 docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:latest-x86_64
 docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64
 docker push vllm/vllm-openai:latest-x86_64
 docker push vllm/vllm-openai:v${RELEASE_VERSION}-x86_64
 docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64-cu130 vllm/vllm-openai:x86_64-cu130
 docker tag vllm/vllm-openai:x86_64-cu130 vllm/vllm-openai:latest-x86_64-cu130
 docker tag vllm/vllm-openai:x86_64-cu130 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64-cu130
 docker push vllm/vllm-openai:latest-x86_64-cu130
 docker push vllm/vllm-openai:v${RELEASE_VERSION}-x86_64-cu130
 docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64 vllm/vllm-openai:aarch64
 docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:latest-aarch64
 docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
 docker push vllm/vllm-openai:latest-aarch64
 docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
-docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai:rocm
+docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64-cu130 vllm/vllm-openai:aarch64-cu130
-docker tag vllm/vllm-openai:rocm vllm/vllm-openai:latest-rocm
+docker tag vllm/vllm-openai:aarch64-cu130 vllm/vllm-openai:latest-aarch64-cu130
-docker tag vllm/vllm-openai:rocm vllm/vllm-openai:v${RELEASE_VERSION}-rocm
+docker tag vllm/vllm-openai:aarch64-cu130 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64-cu130
-docker push vllm/vllm-openai:latest-rocm
+docker push vllm/vllm-openai:latest-aarch64-cu130
-docker push vllm/vllm-openai:v${RELEASE_VERSION}-rocm
+docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64-cu130
 docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-rocm
 docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai-rocm:latest
 docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai-rocm:v${RELEASE_VERSION}-rocm
 docker push vllm/vllm-openai-rocm:latest
 docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}-rocm
 Create multi-arch manifest:
 docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm-base vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base
 docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:latest-base
 docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base
 docker push vllm/vllm-openai-rocm:latest-base
 docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base
 docker manifest rm vllm/vllm-openai:latest
 docker manifest create vllm/vllm-openai:latest vllm/vllm-openai:latest-x86_64 vllm/vllm-openai:latest-aarch64
 docker manifest create vllm/vllm-openai:v${RELEASE_VERSION} vllm/vllm-openai:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
 docker manifest push vllm/vllm-openai:latest
 docker manifest push vllm/vllm-openai:v${RELEASE_VERSION}
 docker manifest rm vllm/vllm-openai:latest-cu130
 docker manifest create vllm/vllm-openai:latest-cu130 vllm/vllm-openai:latest-x86_64-cu130 vllm/vllm-openai:latest-aarch64-cu130
 docker manifest create vllm/vllm-openai:v${RELEASE_VERSION}-cu130 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64-cu130 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64-cu130
 docker manifest push vllm/vllm-openai:latest-cu130
 docker manifest push vllm/vllm-openai:v${RELEASE_VERSION}-cu130
 \`\`\`
 EOF
--- a/.buildkite/scripts/annotate-rocm-release.sh
+++ b/.buildkite/scripts/annotate-rocm-release.sh
@@ -3,25 +3,32 @@
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 #
 # Generate Buildkite annotation for ROCm wheel release
 set -ex
 # Get build configuration from meta-data
 # Extract ROCm version dynamically from Dockerfile.rocm_base
-# BASE_IMAGE format: rocm/dev-ubuntu-22.04:7.1-complete -> extracts "7.1"
+# BASE_IMAGE format: rocm/dev-ubuntu-22.04:7.0-complete -> extracts "7.0"
 ROCM_VERSION=$(grep -E '^ARG BASE_IMAGE=' docker/Dockerfile.rocm_base | sed -E 's/.*:([0-9]+\.[0-9]+).*/\1/' || echo "unknown")
 PYTHON_VERSION=$(buildkite-agent meta-data get rocm-python-version 2>/dev/null || echo "3.12")
 PYTORCH_ROCM_ARCH=$(buildkite-agent meta-data get rocm-pytorch-rocm-arch 2>/dev/null || echo "gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151")
 # TODO: Enable the nightly build for ROCm
 # Get release version, default to 1.0.0.dev for nightly/per-commit builds
 RELEASE_VERSION=$(buildkite-agent meta-data get release-version 2>/dev/null || echo "")
 if [ -z "${RELEASE_VERSION}" ]; then
  RELEASE_VERSION="1.0.0.dev"
 fi
 # S3 URLs
 S3_BUCKET="${S3_BUCKET:-vllm-wheels}"
 S3_REGION="${AWS_DEFAULT_REGION:-us-west-2}"
-S3_URL="https://${S3_BUCKET}.s3.${S3_REGION}.amazonaws.com"
+S3_URL="http://${S3_BUCKET}.s3-website-${S3_REGION}.amazonaws.com"
 ROCM_PATH="rocm/${BUILDKITE_COMMIT}"
 # Format ROCm version for path (e.g., "7.0" -> "rocm70")
 ROCM_VERSION_PATH="rocm$(echo ${ROCM_VERSION} | tr -d '.')"
 ROCM_PATH="rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}"
 buildkite-agent annotate --style 'success' --context 'rocm-release-workflow' << EOF
-## :rocm: ROCm Wheel Release
+## ROCm Wheel and Docker Image Releases
 ### Build Configuration
 | Setting | Value |
 |---------|-------|
@@ -34,41 +41,72 @@ buildkite-agent annotate --style 'success' --context 'rocm-release-workflow' <<
 ### :package: Installation
 **Install from this build (by commit):**
 \`\`\`bash
 uv pip install vllm --extra-index-url ${S3_URL}/${ROCM_PATH}/{rocm_variant}/
-# Example:
+\`\`\`bash
-uv pip install vllm --extra-index-url ${S3_URL}/${ROCM_PATH}/rocm700/
+pip install vllm --extra-index-url ${S3_URL}/${ROCM_PATH}/ --trusted-host ${S3_BUCKET}.s3-website-${S3_REGION}.amazonaws.com
 # Example for ROCm ${ROCM_VERSION}:
 pip install vllm --extra-index-url ${S3_URL}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/ --trusted-host ${S3_BUCKET}.s3-website-${S3_REGION}.amazonaws.com
 \`\`\`
 **Install from nightly (if published):**
 \`\`\`bash
-uv pip install vllm --extra-index-url ${S3_URL}/rocm/nightly/
+pip install vllm --extra-index-url ${S3_URL}/rocm/nightly/ --trusted-host ${S3_BUCKET}.s3-website-${S3_REGION}.amazonaws.com
 \`\`\`
 ### :floppy_disk: Download Wheels Directly
 \`\`\`bash
 # List all ROCm wheels
-aws s3 ls s3://${S3_BUCKET}/${ROCM_PATH}/
+aws s3 ls s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/
 # Download specific wheels
-aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/vllm-*.whl .
+aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/vllm-*.whl .
-aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/torch-*.whl .
+aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/torch-*.whl .
-aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/triton_rocm-*.whl .
+aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/triton-*.whl .
-aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/torchvision-*.whl .
+aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/triton-kernels-*.whl .
-aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/amdsmi-*.whl .
+aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/torchvision-*.whl .
 aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/torchaudio-*.whl .
 aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/amdsmi-*.whl .
 aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/aiter-*.whl .
 aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/flash-attn-*.whl .
 \`\`\`
 ### :gear: Included Packages
 - **vllm**: vLLM with ROCm support
 - **torch**: PyTorch built for ROCm ${ROCM_VERSION}
-- **triton_rocm**: Triton built for ROCm
+- **triton**: Triton
 - **triton-kernels**: Triton kernels
 - **torchvision**: TorchVision for ROCm PyTorch
 - **torchaudio**: Torchaudio for ROCm PyTorch
 - **amdsmi**: AMD SMI Python bindings
 - **aiter**: Aiter for ROCm
 - **flash-attn**: Flash Attention for ROCm
 ### :warning: Notes
 - These wheels are built for **ROCm ${ROCM_VERSION}** and will NOT work with CUDA GPUs
 - Supported GPU architectures: ${PYTORCH_ROCM_ARCH}
 - Platform: Linux x86_64 only
 ### :package: Docker Image Release
 To download and upload the image:
 \`\`\`
 docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm-base
 docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm
 docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm-base vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base
 docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:latest-base
 docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base
 docker push vllm/vllm-openai-rocm:latest-base
 docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base
 docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}
 docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT} vllm/vllm-openai-rocm:latest
 docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT} vllm/vllm-openai-rocm:v${RELEASE_VERSION}
 docker push vllm/vllm-openai-rocm:latest
 docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}
 \`\`\`
 EOF
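The annotation script above derives the ROCm version from the `ARG BASE_IMAGE=` line of `docker/Dockerfile.rocm_base` and then removes the dot to build the S3 path component. A minimal Python sketch of that derivation follows. The function names are hypothetical, and the `"unknown"` fallback reflects the intent of the script rather than its exact shell semantics.

```python
import re


def extract_rocm_version(dockerfile_text: str) -> str:
    """Pull 'MAJOR.MINOR' from the tag of the ARG BASE_IMAGE= line."""
    for line in dockerfile_text.splitlines():
        if line.startswith("ARG BASE_IMAGE="):
            # Greedy '.*:' anchors on the last colon followed by digits.digits,
            # as in the sed expression 's/.*:([0-9]+\.[0-9]+).*/\1/'.
            match = re.search(r".*:([0-9]+\.[0-9]+)", line)
            if match:
                return match.group(1)
    return "unknown"


def rocm_version_path(version: str) -> str:
    """Counterpart of: rocm$(echo ${ROCM_VERSION} | tr -d '.')."""
    return "rocm" + version.replace(".", "")
```

With the base image `rocm/dev-ubuntu-22.04:7.0-complete`, this gives the version `7.0` and the path component `rocm70`. That differs from the `rocm700` variant named in the release pipeline; the diff does not say whether the two are meant to match.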
--- a/.buildkite/scripts/upload-release-wheels-pypi.sh
+++ b/.buildkite/scripts/upload-release-wheels-pypi.sh
@@ -7,17 +7,19 @@ SUBPATH=$BUILDKITE_COMMIT
 S3_COMMIT_PREFIX="s3://$BUCKET/$SUBPATH/"
 RELEASE_VERSION=$(buildkite-agent meta-data get release-version)
 echo "Release version from Buildkite: $RELEASE_VERSION"
 GIT_VERSION=$(git describe --exact-match --tags $BUILDKITE_COMMIT 2>/dev/null)
-if [ -z "$GIT_VERSION" ]; then
+
 echo "Release version from Buildkite: $RELEASE_VERSION"
+if [[ -z "$GIT_VERSION" ]]; then
    echo "[FATAL] Not on a git tag, cannot create release."
    exit 1
 else
    echo "Git version for commit $BUILDKITE_COMMIT: $GIT_VERSION"
 fi
 # sanity check for version mismatch
-if [ "$RELEASE_VERSION" != "$GIT_VERSION" ]; then
+if [[ "$RELEASE_VERSION" != "$GIT_VERSION" ]]; then
-  if [ "$FORCE_RELEASE_IGNORE_VERSION_MISMATCH" == "true" ]; then
+  if [[ "$FORCE_RELEASE_IGNORE_VERSION_MISMATCH" == "true" ]]; then
    echo "[WARNING] Force release and ignore version mismatch"
  else
    echo "[FATAL] Release version from Buildkite does not match Git version."
@@ -27,7 +29,7 @@ fi
 PURE_VERSION=${RELEASE_VERSION#v} # remove leading 'v'
 # check pypi token
-if [ -z "$PYPI_TOKEN" ]; then
+if [[ -z "$PYPI_TOKEN" ]]; then
  echo "[FATAL] PYPI_TOKEN is not set."
  exit 1
 else
@@ -35,41 +37,8 @@ else
  export TWINE_PASSWORD="$PYPI_TOKEN"
 fi
 # check github token
 if [ -z "$GITHUB_TOKEN" ]; then
  echo "[FATAL] GITHUB_TOKEN is not set."
  exit 1
 else
  export GH_TOKEN="$GITHUB_TOKEN"
 fi
 set -x # avoid printing secrets above
 # download gh CLI from github
 # Get latest gh CLI version from GitHub API
 GH_VERSION=$(curl -s https://api.github.com/repos/cli/cli/releases/latest | grep '"tag_name":' | sed -E 's/.*"([^"]+)".*/\1/' | sed 's/^v//')
 if [ -z "$GH_VERSION" ]; then
  echo "[FATAL] Failed to get latest gh CLI version from GitHub"
  exit 1
 fi
 echo "Downloading gh CLI version: $GH_VERSION"
 GH_TARBALL="gh_${GH_VERSION}_linux_amd64.tar.gz"
 GH_URL="https://github.com/cli/cli/releases/download/v${GH_VERSION}/${GH_TARBALL}"
 GH_INSTALL_DIR="/tmp/gh-install"
 mkdir -p "$GH_INSTALL_DIR"
 pushd "$GH_INSTALL_DIR"
 curl -L -o "$GH_TARBALL" "$GH_URL"
 tar -xzf "$GH_TARBALL"
 GH_BIN=$(realpath $(find . -name "gh" -type f -executable | head -n 1))
 if [ -z "$GH_BIN" ]; then
  echo "[FATAL] Failed to find gh CLI executable"
  exit 1
 fi
 echo "gh CLI downloaded successfully, version: $($GH_BIN --version)"
 echo "Last 5 releases on GitHub:" # as a sanity check of gh and GH_TOKEN
 command "$GH_BIN" release list --limit 5
 popd
 # install twine from pypi
 python3 -m venv /tmp/vllm-release-env
 source /tmp/vllm-release-env/bin/activate
@@ -89,16 +58,13 @@ echo "Wheels copied to local directory"
 git archive --format=tar.gz --output="$DIST_DIR/vllm-${PURE_VERSION}.tar.gz" $BUILDKITE_COMMIT
 ls -la $DIST_DIR
 # upload wheels to PyPI (only default variant, i.e. files without '+' in the name)
 PYPI_WHEEL_FILES=$(find $DIST_DIR -name "vllm-${PURE_VERSION}*.whl" -not -name "*+*")
-if [ -z "$PYPI_WHEEL_FILES" ]; then
+if [[ -z "$PYPI_WHEEL_FILES" ]]; then
  echo "No default variant wheels found, quitting..."
  exit 1
 fi
 python3 -m twine check $PYPI_WHEEL_FILES
-python3 -m twine --non-interactive --verbose upload $PYPI_WHEEL_FILES
+python3 -m twine upload --non-interactive --verbose $PYPI_WHEEL_FILES
 echo "Wheels uploaded to PyPI"
-# create release on GitHub with the release version and all wheels
-command "$GH_BIN" release create $GIT_VERSION -d --latest --notes-from-tag --verify-tag $DIST_DIR/*.whl
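The upload script strips a leading `v` from the release version and then selects only the default-variant wheels, meaning files whose names contain no `+` local-version suffix. That selection logic can be sketched in Python as below. The helper names and the version number in the usage note are hypothetical and serve only as illustration.

```python
from fnmatch import fnmatchcase


def strip_leading_v(release_version: str) -> str:
    """Counterpart of ${RELEASE_VERSION#v}: drop one leading 'v' if present."""
    if release_version.startswith("v"):
        return release_version[1:]
    return release_version


def default_variant_wheels(filenames: list[str], pure_version: str) -> list[str]:
    """Keep wheels for this version whose names carry no '+' variant suffix.

    Mirrors: find -name "vllm-${PURE_VERSION}*.whl" -not -name "*+*"
    """
    pattern = f"vllm-{pure_version}*.whl"
    return [
        name
        for name in filenames
        if fnmatchcase(name, pattern) and "+" not in name
    ]
```

Given a hypothetical version `v0.0.1`, a file such as `vllm-0.0.1-cp38-abi3-manylinux_2_31_x86_64.whl` is kept, while `vllm-0.0.1+cu130-cp38-abi3-manylinux_2_35_x86_64.whl` and the source tarball are skipped.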
--- a/docker/Dockerfile.rocm
+++ b/docker/Dockerfile.rocm
@@ -227,7 +227,7 @@ RUN if [ "$GIT_REPO_CHECK" != "0" ]; then \
 # This ensures setuptools_scm sees clean repo state for version detection
 RUN --mount=type=bind,source=.git,target=vllm/.git \
    cd vllm \
-    && pip install setuptools_scm \
+    && pip install setuptools_scm regex \
    && VLLM_VERSION=$(python3 -c "import setuptools_scm; print(setuptools_scm.get_version())") \
    && echo "Detected vLLM version: ${VLLM_VERSION}" \
    && echo "${VLLM_VERSION}" > /tmp/vllm_version.txt
@@ -342,6 +342,19 @@ RUN mkdir src && mv vllm src/vllm
 FROM base AS final
 RUN python3 -m pip install --upgrade pip && rm -rf /var/lib/apt/lists/*
 # Clean up sccache from release image (not needed at runtime)
 # This removes the binary and wrappers that may have been installed during build
 RUN rm -f /usr/bin/sccache || true \
    && rm -rf /opt/sccache-wrappers || true
 # Unset sccache environment variables for the release image
 # This prevents S3 bucket config from leaking into production images
 ENV SCCACHE_BUCKET=
 ENV SCCACHE_REGION=
 ENV SCCACHE_S3_NO_CREDENTIALS=
 ENV SCCACHE_IDLE_TIMEOUT=
 # Error related to odd state for numpy 1.20.3 where there is no METADATA etc, but an extra LICENSES_bundled.txt.
 # Manually remove it so that later steps of numpy upgrade can continue
 RUN case "$(which python3)" in \
--- a/docs/features/nixl_connector_usage.md
+++ b/docs/features/nixl_connector_usage.md
@@ -184,15 +184,6 @@ Support use case: Prefill with 'HND' and decode with 'NHD' with experimental con
 --kv-transfer-config '{..., "enable_permute_local_kv":"True"}'
 ```
 ### Cross layers blocks
 By default, this feature is disabled. On attention backends that support this feature, each logical block is contiguous in physical memory. This reduces the number of buffers that need to be transferred.
 To enable this feature:
 ```bash
 --kv-transfer-config '{..., "kv_connector_extra_config": {"enable_cross_layers_blocks": "True"}}'
 ```
 ## Example Scripts/Code
 Refer to these example scripts in the vLLM repository:
--- a/docs/models/supported_models.md
+++ b/docs/models/supported_models.md
@@ -456,6 +456,7 @@ th {
 | `StableLmForCausalLM` | StableLM | `stabilityai/stablelm-3b-4e1t`, `stabilityai/stablelm-base-alpha-7b-v2`, etc. | | |
 | `Starcoder2ForCausalLM` | Starcoder2 | `bigcode/starcoder2-3b`, `bigcode/starcoder2-7b`, `bigcode/starcoder2-15b`, etc. | | ✅︎ |
 | `Step1ForCausalLM` | Step-Audio | `stepfun-ai/Step-Audio-EditX`, etc. | ✅︎ | ✅︎ |
 | `Step3p5ForCausalLM` | Step-3.5-flash | `stepfun-ai/step-3.5-flash`, etc. |  | ✅︎ |
 | `TeleChat2ForCausalLM` | TeleChat2 | `Tele-AI/TeleChat2-3B`, `Tele-AI/TeleChat2-7B`, `Tele-AI/TeleChat2-35B`, etc. | ✅︎ | ✅︎ |
 | `TeleFLMForCausalLM` | TeleFLM | `CofeAI/FLM-2-52B-Instruct-2407`, `CofeAI/Tele-FLM`, etc. | ✅︎ | ✅︎ |
 | `XverseForCausalLM` | XVERSE | `xverse/XVERSE-7B-Chat`, `xverse/XVERSE-13B-Chat`, `xverse/XVERSE-65B-Chat`, etc. | ✅︎ | ✅︎ |
--- a/examples/pooling/score/vision_rerank_api_online.py
+++ b/examples/pooling/score/vision_rerank_api_online.py
@@ -18,48 +18,32 @@ e.g.
 """
 import argparse
-import base64
+import pprint
 import json
 import requests
-
+from vllm.multimodal.utils import encode_image_url, fetch_image
 def encode_base64_content_from_url(content_url: str) -> dict[str, str]:
    """Encode a content retrieved from a remote url to base64 format."""
    with requests.get(content_url, headers=headers) as response:
        response.raise_for_status()
        result = base64.b64encode(response.content).decode("utf-8")
    return {"url": f"data:image/jpeg;base64,{result}"}
 headers = {"accept": "application/json", "Content-Type": "application/json"}
 query = "A woman playing with her dog on a beach at sunset."
-documents = {
+document = (
    "content": [
        {
            "type": "text",
            "text": (
    "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, "
    "as the dog offers its paw in a heartwarming display of companionship and trust."
-            ),
+)
 image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
 documents = [
    {
        "type": "text",
        "text": document,
    },
    {
        "type": "image_url",
-            "image_url": {
+        "image_url": {"url": image_url},
                "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
            },
    },
    {
        "type": "image_url",
-            "image_url": encode_base64_content_from_url(
+        "image_url": {"url": encode_image_url(fetch_image(image_url))},
                "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
            ),
    },
 ]
 }
 def parse_args():
@@ -74,23 +58,36 @@ def main(args):
    models_url = base_url + "/v1/models"
    rerank_url = base_url + "/rerank"
-    response = requests.get(models_url, headers=headers)
+    response = requests.get(models_url)
    model = response.json()["data"][0]["id"]
-    data = {
+    print("Query: string & Document: list of string")
    prompt = {"model": model, "query": query, "documents": [document]}
    response = requests.post(rerank_url, json=prompt)
    pprint.pprint(response.json())
    print("Query: string & Document: text")
    prompt = {"model": model, "query": query, "documents": {"content": [documents[0]]}}
    response = requests.post(rerank_url, json=prompt)
    pprint.pprint(response.json())
    print("Query: string & Document: image url")
    prompt = {
        "model": model,
        "query": query,
-        "documents": documents,
+        "documents": {"content": [documents[1]]},
    }
-    response = requests.post(rerank_url, headers=headers, json=data)
+    response = requests.post(rerank_url, json=prompt)
    pprint.pprint(response.json())
-    # Check the response
+    print("Query: string & Document: image base64")
-    if response.status_code == 200:
+    prompt = {
-        print("Request successful!")
+        "model": model,
-        print(json.dumps(response.json(), indent=2))
+        "query": query,
-    else:
+        "documents": {"content": [documents[2]]},
-        print(f"Request failed with status code: {response.status_code}")
+    }
-        print(response.text)
+    response = requests.post(rerank_url, json=prompt)
    pprint.pprint(response.json())
 if __name__ == "__main__":
--- a/examples/pooling/score/vision_score_api_online.py
+++ b/examples/pooling/score/vision_score_api_online.py
@@ -17,48 +17,32 @@ e.g.
 """
 import argparse
 import base64
 import json
 import pprint
 import requests
 from vllm.multimodal.utils import encode_image_url, fetch_image
-def encode_base64_content_from_url(content_url: str) -> dict[str, str]:
+query = "A woman playing with her dog on a beach at sunset."
-    """Encode a content retrieved from a remote url to base64 format."""
+document = (
-
+    "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, "
-    with requests.get(content_url, headers=headers) as response:
+    "as the dog offers its paw in a heartwarming display of companionship and trust."
-        response.raise_for_status()
+)
-        result = base64.b64encode(response.content).decode("utf-8")
+image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
-
+documents = [
    return {"url": f"data:image/jpeg;base64,{result}"}
 headers = {"accept": "application/json", "Content-Type": "application/json"}
 queries = "slm markdown"
 documents = {
    "content": [
    {
-            "type": "image_url",
+        "type": "text",
-            "image_url": {
+        "text": document,
                "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
            },
    },
    {
        "type": "image_url",
-            "image_url": {
+        "image_url": {"url": image_url},
                "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
            },
    },
    {
        "type": "image_url",
-            "image_url": encode_base64_content_from_url(
+        "image_url": {"url": encode_image_url(fetch_image(image_url))},
                "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
            ),
    },
 ]
 }
 def parse_args():
@@ -73,15 +57,40 @@ def main(args):
    models_url = base_url + "/v1/models"
    score_url = base_url + "/score"
-    response = requests.get(models_url, headers=headers)
+    response = requests.get(models_url)
    model = response.json()["data"][0]["id"]
-    prompt = {"model": model, "queries": queries, "documents": documents}
+    print("Query: string & Document: string")
-    response = requests.post(score_url, headers=headers, json=prompt)
+    prompt = {"model": model, "queries": query, "documents": document}
-    print("\nPrompt when queries is string and documents is a image list:")
+    response = requests.post(score_url, json=prompt)
-    pprint.pprint(prompt)
+    pprint.pprint(response.json())
-    print("\nScore Response:")
+
-    print(json.dumps(response.json(), indent=2))
+    print("Query: string & Document: text")
    prompt = {
        "model": model,
        "queries": query,
        "documents": {"content": [documents[0]]},
    }
    response = requests.post(score_url, json=prompt)
    pprint.pprint(response.json())
    print("Query: string & Document: image url")
    prompt = {
        "model": model,
        "queries": query,
        "documents": {"content": [documents[1]]},
    }
    response = requests.post(score_url, json=prompt)
    pprint.pprint(response.json())
    print("Query: string & Document: image base64")
    prompt = {
        "model": model,
        "queries": query,
        "documents": {"content": [documents[2]]},
    }
    response = requests.post(score_url, json=prompt)
    pprint.pprint(response.json())
 if __name__ == "__main__":
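Both example scripts send three kinds of document content: plain text, an image referenced by URL, and an image embedded as base64. The helper removed from the examples built the base64 form as a `data:` URL. The sketch below shows that encoding with only the standard library. It is not a description of how `encode_image_url` in vLLM is implemented, and the function names here are hypothetical.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Embed raw image bytes in a data: URL using base64."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def image_content_part(url: str) -> dict:
    """Build an 'image_url' content entry shaped like the ones in the examples."""
    return {"type": "image_url", "image_url": {"url": url}}


def text_content_part(text: str) -> dict:
    """Build a 'text' content entry shaped like the ones in the examples."""
    return {"type": "text", "text": text}
```

A request body for the image case could then be assembled as `{"model": model, "queries": query, "documents": {"content": [image_content_part(to_data_url(data))]}}`, following the payload shape used in the score example above.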
--- a/requirements/build.txt
+++ b/requirements/build.txt
@@ -9,5 +9,5 @@ wheel
 jinja2>=3.1.6
 regex
 build
-protobuf
+protobuf >= 6.33.5
 grpcio-tools
--- a/requirements/common.txt
+++ b/requirements/common.txt
@@ -9,9 +9,9 @@ blake3
 py-cpuinfo
 transformers >= 4.56.0, < 5
 tokenizers >= 0.21.1  # Required for fast incremental detokenization.
-protobuf # Required by LlamaTokenizer, gRPC.
+protobuf >= 6.33.5 # Required by LlamaTokenizer, gRPC. CVE-2026-0994
 fastapi[standard] >= 0.115.0 # Required by FastAPI's form models in the OpenAI API server's audio transcriptions endpoint.
-aiohttp
+aiohttp >= 3.13.3
 openai >= 1.99.1  # For Responses API with reasoning content
 pydantic >= 2.12.0
 prometheus_client >= 0.18.0
--- a/requirements/kv_connectors.txt
+++ b/requirements/kv_connectors.txt
@@ -1,2 +1,2 @@
-lmcache
+lmcache >= 0.3.9
 nixl >= 0.7.1 # Required for disaggregated prefill
--- a/requirements/rocm-test.txt
+++ b/requirements/rocm-test.txt
@@ -14,7 +14,7 @@ pytest-shard==0.1.2
 # Async/HTTP dependencies
 anyio==4.6.2.post1
    # via httpx, starlette
-aiohttp==3.13.0
+aiohttp==3.13.3
    # via gpt-oss
 httpx==0.27.2
    # HTTP testing
--- a/requirements/test.txt
+++ b/requirements/test.txt
@@ -12,7 +12,7 @@ affine==2.4.0
    # via rasterio
 aiohappyeyeballs==2.6.1
    # via aiohttp
-aiohttp==3.13.0
+aiohttp==3.13.3
    # via
    #   aiohttp-cors
    #   datasets
--- a/tests/compile/test_cold_start.py
+++ b/tests/compile/test_cold_start.py
@@ -0,0 +1,48 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 from torch._dynamo.utils import counters
 from vllm import LLM
 from vllm.config import CompilationConfig, CompilationMode, CUDAGraphMode
 def test_moe_compilation_cold_start(monkeypatch, use_fresh_inductor_cache):
    # Run in same process so we can access PyTorch's internal counters
    monkeypatch.setenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0")
    # Disable AOT compile; it is unclear whether it would affect the counters
    monkeypatch.setenv("VLLM_USE_AOT_COMPILE", "0")
    # Force cold compilation
    monkeypatch.setenv("VLLM_DISABLE_COMPILE_CACHE", "1")
    compilation_config = CompilationConfig(
        mode=CompilationMode.VLLM_COMPILE,
        cudagraph_mode=CUDAGraphMode.NONE,  # make the model loading faster
    )
    counters.clear()
    _ = LLM(
        model="microsoft/Phi-tiny-MoE-instruct",
        max_model_len=256,
        load_format="dummy",  # make the model loading faster
        compilation_config=compilation_config,
        num_gpu_blocks_override=8,  # make the model loading faster
    )
    # vLLM-compile cold start is special. By default, we do
    # one full dynamo capture of the entire forward pass.
    # The forward pass consists of 32 transformer layers.
    # Then, we split on the attention operation. This results in
    # 33 subgraphs (not including the attention operation).
    # The 33 subgraphs then get standalone_compile'd.
    #
    # There are actually only 3 unique subgraphs for this model
    # (all of its transformer layers are the same modulo weights);
    # this is true for most vLLM models.
    # So we test that during cold start, the aot_autograd cache
    # misses for 3 subgraphs and hits for the rest.
    assert counters["aot_autograd"]["autograd_cache_miss"] == 3
    assert counters["aot_autograd"]["autograd_cache_hit"] == 30
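The expected counter values follow from the arithmetic in the comments above. A minimal sketch, assuming one split per attention operation and a known number of unique subgraphs; `expected_cache_counts` is a hypothetical helper for illustration, not part of vLLM or PyTorch:

```python
def expected_cache_counts(
    num_layers: int, num_unique_subgraphs: int = 3
) -> tuple[int, int]:
    """Return (misses, hits) expected on a cold start.

    Splitting on one attention op per layer leaves num_layers + 1 subgraphs.
    Each unique subgraph is compiled once (a miss); every repeat is a hit.
    """
    num_subgraphs = num_layers + 1
    misses = min(num_unique_subgraphs, num_subgraphs)
    hits = num_subgraphs - misses
    return misses, hits


misses, hits = expected_cache_counts(num_layers=32)
print(misses, hits)
```

With 32 layers this gives 33 subgraphs, which is where the 3 and 30 in the assertions come from.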
--- a/tests/compile/test_graph_partition.py
+++ b/tests/compile/test_graph_partition.py
@@ -8,6 +8,10 @@ import torch
 from torch.fx.experimental.proxy_tensor import make_fx
 from vllm.compilation.backends import split_graph
 from vllm.compilation.fx_utils import find_op_nodes
 # This import automatically registers `torch.ops.silly.attention`
 from . import silly_attention  # noqa: F401
 def test_getitem_moved_to_producer_subgraph():
@@ -122,3 +126,61 @@ def test_no_tuple_inputs_with_multiple_consumers():
    output_split = split_gm(new_x)
    assert torch.allclose(output_original, output_split), "Output mismatch after split"
 def test_consecutive_ops_in_split():
    """
    Test that consecutive splitting operations are grouped into the same subgraph
    """
    def model_fn(x: torch.Tensor) -> torch.Tensor:
        """
        Define a simple model where consecutive operations create opportunities
        for splitting subgraphs.
        """
        # Apply silly attention followed by consecutive operations
        intermediate = torch.relu(x)
        attn_inout = torch.sqrt(intermediate)
        torch.ops.silly.attention(intermediate, intermediate, attn_inout, attn_inout)
        final_result = torch.sigmoid(attn_inout)
        return final_result
    torch.set_default_device("cuda")
    # Create the traced FX graph for the model
    x = torch.randn(8, 4)
    gm = make_fx(model_fn)(x)
    # Assert presence of the expected operations in the setup
    assert (
        len(list(find_op_nodes(torch.ops.aten.relu, gm.graph))) == 1
        and len(list(find_op_nodes(torch.ops.aten.sqrt, gm.graph))) == 1
    ), "Test setup failed: Expected sqrt and relu operations in the graph."
    # Configure split operations to test
    splitting_ops = ["silly::attention", "aten::sqrt"]
    split_gm, split_items = split_graph(gm, splitting_ops)
    # Validate the number of partitions
    assert len(split_items) == 3, (
        "Consecutive splitting operations were not grouped correctly."
    )
    # Validate that correctness is preserved
    new_x = torch.randn(8, 4)
    output_original = gm(new_x)
    output_split = split_gm(new_x)
    assert torch.allclose(output_original, output_split), (
        "Output mismatch after splitting."
    )
    # Check the splitting item has 2 nodes exactly (sqrt and attn)
    splitting_items = list(s for s in split_items if s.is_splitting_graph)
    assert len(splitting_items) == 1, "Expecting a single splitting graph"
    print(splitting_items[0].graph.graph)
    splitting_gm = splitting_items[0].graph
    assert len(splitting_gm.graph.nodes) == 4, "Expecting 4 nodes in splitting graph"
    assert [node.op for node in splitting_gm.graph.nodes] == ["placeholder"] + 2 * [
        "call_function"
    ] + ["output"]
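The grouping behavior this test checks can be illustrated on a flat list of op names. This is only a sketch of the idea, assuming a linear sequence of ops; `partition_ops` is a hypothetical helper and does not reflect how `split_graph` is implemented:

```python
from itertools import groupby


def partition_ops(ops: list[str], splitting_ops: set[str]) -> list[tuple[bool, list[str]]]:
    """Group consecutive ops by whether they are splitting ops.

    Adjacent splitting ops land in the same partition instead of
    each one starting a new partition.
    """
    return [
        (is_splitting, list(group))
        for is_splitting, group in groupby(ops, key=lambda op: op in splitting_ops)
    ]


ops = ["relu", "sqrt", "attention", "sigmoid"]
partitions = partition_ops(ops, {"sqrt", "attention"})
for is_splitting, group in partitions:
    print("splitting" if is_splitting else "regular", group)
```

For the op order used in `model_fn`, this yields three partitions, with `sqrt` and `attention` sharing the single splitting partition.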
--- a/tests/entrypoints/pooling/classify/test_online_vision.py
+++ b/tests/entrypoints/pooling/classify/test_online_vision.py
@@ -5,9 +5,9 @@ import json
 import pytest
 import requests
 from tests.entrypoints.test_utils import encode_base64_content_from_url
 from tests.utils import RemoteOpenAIServer
 from vllm.entrypoints.pooling.classify.protocol import ClassificationResponse
 from vllm.multimodal.utils import encode_image_url, fetch_image
 MODEL_NAME = "muziyongshixin/Qwen2.5-VL-7B-for-VideoCls"
 MAXIMUM_VIDEOS = 1
@@ -19,7 +19,7 @@ HF_OVERRIDES = {
 }
 input_text = "This product was excellent and exceeded my expectations"
 image_url = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/cat_snow.jpg"
-image_base64 = encode_base64_content_from_url(image_url)
+image_base64 = {"url": encode_image_url(fetch_image(image_url))}
 video_url = "https://www.bogotobogo.com/python/OpenCV_Python/images/mean_shift_tracking/slow_traffic_small.mp4"
--- a/tests/entrypoints/pooling/score/test_online_score_vision.py
+++ b/tests/entrypoints/pooling/score/test_online_score_vision.py
@@ -0,0 +1,122 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 import pytest
 import requests
 from tests.utils import VLLM_PATH, RemoteOpenAIServer
 from vllm.entrypoints.pooling.score.protocol import ScoreResponse
 from vllm.multimodal.utils import encode_image_url, fetch_image
 MODEL_NAME = "Qwen/Qwen3-VL-Reranker-2B"
 HF_OVERRIDES = {
    "architectures": ["Qwen3VLForSequenceClassification"],
    "classifier_from_token": ["no", "yes"],
    "is_original_qwen3_reranker": True,
 }
 query = "A cat standing in the snow."
 image_url = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/cat_snow.jpg"
 documents = [
    {
        "type": "text",
        "text": query,
    },
    {
        "type": "image_url",
        "image_url": {"url": image_url},
    },
    {
        "type": "image_url",
        "image_url": {"url": encode_image_url(fetch_image(image_url))},
    },
 ]
@pytest.fixture(scope="module")
 def server():
    args = [
        "--enforce-eager",
        "--max-model-len",
        "8192",
        "--chat-template",
        str(VLLM_PATH / "examples/pooling/score/template/qwen3_vl_reranker.jinja"),
    ]
    with RemoteOpenAIServer(
        MODEL_NAME, args, override_hf_configs=HF_OVERRIDES
    ) as remote_server:
        yield remote_server
 def test_score_api_queries_str_documents_str(server: RemoteOpenAIServer):
    queries = "What is the capital of France?"
    documents = "The capital of France is Paris."
    score_response = requests.post(
        server.url_for("score"),
        json={
            "model": MODEL_NAME,
            "queries": queries,
            "documents": documents,
        },
    )
    score_response.raise_for_status()
    score = ScoreResponse.model_validate(score_response.json())
    assert score.id is not None
    assert score.data is not None
    assert len(score.data) == 1
 def test_score_api_queries_str_documents_text_content(server: RemoteOpenAIServer):
    score_response = requests.post(
        server.url_for("score"),
        json={
            "model": MODEL_NAME,
            "queries": query,
            "documents": {"content": [documents[0]]},
        },
    )
    score_response.raise_for_status()
    score = ScoreResponse.model_validate(score_response.json())
    assert score.id is not None
    assert score.data is not None
    assert len(score.data) == 1
 def test_score_api_queries_str_documents_image_url_content(server: RemoteOpenAIServer):
    score_response = requests.post(
        server.url_for("score"),
        json={
            "model": MODEL_NAME,
            "queries": query,
            "documents": {"content": [documents[1]]},
        },
    )
    score_response.raise_for_status()
    score = ScoreResponse.model_validate(score_response.json())
    assert score.id is not None
    assert score.data is not None
    assert len(score.data) == 1
 def test_score_api_queries_str_documents_image_base64_content(
    server: RemoteOpenAIServer,
 ):
    score_response = requests.post(
        server.url_for("score"),
        json={
            "model": MODEL_NAME,
            "queries": query,
            "documents": {"content": [documents[2]]},
        },
    )
    score_response.raise_for_status()
    score = ScoreResponse.model_validate(score_response.json())
    assert score.id is not None
    assert score.data is not None
    assert len(score.data) == 1
--- a/tests/entrypoints/test_utils.py
+++ b/tests/entrypoints/test_utils.py
@@ -1,9 +1,5 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 import base64
 import requests
 from vllm.entrypoints.utils import sanitize_message
@@ -12,11 +8,3 @@ def test_sanitize_message():
        sanitize_message("<_io.BytesIO object at 0x7a95e299e750>")
        == "<_io.BytesIO object>"
    )
 def encode_base64_content_from_url(content_url: str) -> dict[str, str]:
    with requests.get(content_url) as response:
        response.raise_for_status()
        result = base64.b64encode(response.content).decode("utf-8")
    return {"url": f"data:image/jpeg;base64,{result}"}
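The helper shown above wrapped fetched bytes in a base64 data URL. The encoding step alone can be sketched with the standard library and no network call; `to_data_url` is a hypothetical name for illustration and is not the `encode_image_url` utility that the tests now import:

```python
import base64


def to_data_url(raw: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw bytes as a base64 data URL with the given MIME type."""
    encoded = base64.b64encode(raw).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# The result can be placed in an "image_url" content item as {"url": ...}.
url = to_data_url(b"abc")
print(url)
```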
--- a/tests/kernels/core/test_activation.py
+++ b/tests/kernels/core/test_activation.py
@@ -17,6 +17,8 @@ from vllm.model_executor.layers.activation import (
    QuickGELU,
    SiluAndMul,
    SwigluOAIAndMul,
    SwigluStepAndMul,
    swiglustep_and_mul_triton,
 )
 from vllm.utils.torch_utils import set_random_seed
@@ -36,6 +38,7 @@ CUDA_DEVICES = [f"cuda:{i}" for i in range(1 if torch.cuda.device_count() == 1 e
        "gelu_tanh",
        "fatrelu",
        "swigluoai_and_mul",
        "swiglustep_and_mul",
    ],
 )
@pytest.mark.parametrize("num_tokens", NUM_TOKENS)
@@ -75,9 +78,12 @@ def test_act_and_mul(
    elif activation == "swigluoai_and_mul":
        layer = SwigluOAIAndMul()
        fn = torch.ops._C.swigluoai_and_mul
    elif activation == "swiglustep_and_mul":
        layer = SwigluStepAndMul()
        fn = swiglustep_and_mul_triton
    out = layer(x)
    ref_out = layer.forward_native(x)
-    if activation == "swigluoai_and_mul":
+    if activation in ["swigluoai_and_mul", "swiglustep_and_mul"]:
        rtol = {
            # For fp16, change the relative tolerance from 1e-3 to 2e-3
            torch.float16: 2e-3,
@@ -104,7 +110,7 @@ def test_act_and_mul(
        opcheck(fn, (out, x, threshold))
    elif activation == "swigluoai_and_mul":
        opcheck(fn, (out, x, layer.alpha, layer.limit))
-    else:
+    elif activation != "swiglustep_and_mul":
        opcheck(fn, (out, x))
--- a/tests/kernels/quantization/test_rocm_skinny_gemms.py
+++ b/tests/kernels/quantization/test_rocm_skinny_gemms.py
@@ -87,6 +87,13 @@ NKM_FACTORS_WVSPLITK_FP8 = [
 SEEDS = [0]
 def pad_weights_fp8(weight):
    num_pad = 256 // weight.element_size()
    import torch.nn.functional as F
    return F.pad(weight, (0, num_pad), "constant", 0)[..., :-num_pad]
@pytest.mark.parametrize("n,k,m", NKM_FACTORS_WVSPLITKRC)
@pytest.mark.parametrize("dtype", DTYPES)
@pytest.mark.parametrize("seed", SEEDS)
@@ -191,11 +198,12 @@ def test_rocm_wvsplitk_bias2D_kernel(n, k, m, dtype, seed):
@pytest.mark.parametrize("n,k,m", NKM_FACTORS_WVSPLITK_FP8)
@pytest.mark.parametrize("dtype", DTYPES)
@pytest.mark.parametrize("seed", SEEDS)
@pytest.mark.parametrize("padded", [False, True])
@pytest.mark.skipif(
    not (current_platform.is_rocm() and current_platform.supports_fp8()),
    reason="only test for rocm fp8",
 )
-def test_rocm_wvsplitk_fp8_kernel(n, k, m, dtype, seed):
+def test_rocm_wvsplitk_fp8_kernel(n, k, m, dtype, seed, padded):
    torch.manual_seed(seed)
    A = torch.rand(n, k, device="cuda") - 0.5
@@ -203,6 +211,8 @@ def test_rocm_wvsplitk_fp8_kernel(n, k, m, dtype, seed):
    A, scale_a = ref_dynamic_per_tensor_fp8_quant(A)
    B, scale_b = ref_dynamic_per_tensor_fp8_quant(B)
    if padded:
        B = pad_weights_fp8(B)
    ref_out = torch._scaled_mm(
        A, B.t(), out_dtype=dtype, scale_a=scale_a, scale_b=scale_b
@@ -222,11 +232,12 @@ def test_rocm_wvsplitk_fp8_kernel(n, k, m, dtype, seed):
@pytest.mark.parametrize("n,k,m", NKM_FACTORS_WVSPLITK_FP8)
@pytest.mark.parametrize("dtype", DTYPES)
@pytest.mark.parametrize("seed", SEEDS)
@pytest.mark.parametrize("padded", [False, True])
@pytest.mark.skipif(
    not (current_platform.is_rocm() and current_platform.supports_fp8()),
    reason="only test for rocm fp8",
 )
-def test_rocm_wvsplitk_fp8_bias1D_kernel(n, k, m, dtype, seed):
+def test_rocm_wvsplitk_fp8_bias1D_kernel(n, k, m, dtype, seed, padded):
    torch.manual_seed(seed)
    xavier = math.sqrt(2 / k)  # normalize to avoid large output-bias deltas
@@ -236,6 +247,8 @@ def test_rocm_wvsplitk_fp8_bias1D_kernel(n, k, m, dtype, seed):
    A, scale_a = ref_dynamic_per_tensor_fp8_quant(A)
    B, scale_b = ref_dynamic_per_tensor_fp8_quant(B)
    if padded:
        B = pad_weights_fp8(B)
    ref_out = torch._scaled_mm(
        A, B.t(), out_dtype=dtype, scale_a=scale_a, scale_b=scale_b, bias=BIAS
--- a/tests/models/registry.py
+++ b/tests/models/registry.py
@@ -480,6 +480,9 @@ _TEXT_GENERATION_EXAMPLE_MODELS = {
    "Step1ForCausalLM": _HfExamplesInfo(
        "stepfun-ai/Step-Audio-EditX", trust_remote_code=True
    ),
    "Step3p5ForCausalLM": _HfExamplesInfo(
        "stepfun-ai/step-3.5-flash", is_available_online=False
    ),
    "SmolLM3ForCausalLM": _HfExamplesInfo("HuggingFaceTB/SmolLM3-3B"),
    "StableLMEpochForCausalLM": _HfExamplesInfo("stabilityai/stablelm-zephyr-3b"),
    "StableLmForCausalLM": _HfExamplesInfo("stabilityai/stablelm-3b-4e1t"),
@@ -1081,6 +1084,12 @@ _SPECULATIVE_DECODING_EXAMPLE_MODELS = {
    "Qwen3NextMTP": _HfExamplesInfo(
        "Qwen/Qwen3-Next-80B-A3B-Instruct", min_transformers_version="4.56.3"
    ),
    "Step3p5MTP": _HfExamplesInfo(
        "stepfun-ai/Step-3.5-Flash",
        trust_remote_code=True,
        speculative_model="stepfun-ai/Step-3.5-Flash",
        is_available_online=False,
    ),
 }
 _TRANSFORMERS_BACKEND_MODELS = {
--- a/tests/v1/core/test_prefix_caching.py
+++ b/tests/v1/core/test_prefix_caching.py
@@ -107,7 +107,10 @@ def make_kv_cache_config(block_size: int, num_blocks: int) -> KVCacheConfig:
 def make_kv_cache_config_hybrid_model(
-    block_size: int, num_blocks: int, second_spec_type: str = "sliding_window"
+    block_size: int,
    num_blocks: int,
    sliding_window_blocks: int,
    second_spec_type: str = "sliding_window",
 ) -> KVCacheConfig:
    if second_spec_type == "sliding_window":
        second_spec = SlidingWindowSpec(
@@ -115,7 +118,7 @@ def make_kv_cache_config_hybrid_model(
            num_kv_heads=1,
            head_size=1,
            dtype=torch.float32,
-            sliding_window=2 * block_size,
+            sliding_window=sliding_window_blocks * block_size,
        )
    elif second_spec_type == "mamba":
        second_spec = MambaSpec(
@@ -325,7 +328,7 @@ def test_prefill(hash_fn):
 def test_prefill_hybrid_model():
    block_size = 16
    manager = KVCacheManager(
-        make_kv_cache_config_hybrid_model(block_size, 21),
+        make_kv_cache_config_hybrid_model(block_size, 21, 2),
        max_model_len=8192,
        enable_caching=True,
        hash_block_size=block_size,
@@ -334,7 +337,8 @@ def test_prefill_hybrid_model():
    hash_fn = sha256
    # Complete 3 blocks (48 tokens)
-    common_token_ids = [i for i in range(3) for _ in range(block_size)]
+    num_full_blocks = 3
    common_token_ids = [i for i in range(num_full_blocks) for _ in range(block_size)]
    # Fully cache miss
    # Incomplete 1 block (7 tokens)
@@ -375,6 +379,7 @@ def test_prefill_hybrid_model():
    # Cache hit in the common prefix
    # Incomplete 1 block (5 tokens)
    unique_token_ids = [3] * 5
    all_token_ids = common_token_ids + unique_token_ids
    req1 = make_request("1", common_token_ids + unique_token_ids, block_size, hash_fn)
    computed_blocks, num_computed_tokens = manager.get_computed_blocks(req1)
    assert len(req1.block_hashes) == 3
@@ -394,34 +399,13 @@ def test_prefill_hybrid_model():
    manager.free(req0)
    manager.free(req1)
    cached_block_hash_to_block_bak = copy.copy(
        manager.block_pool.cached_block_hash_to_block._cache
    )
    def test_partial_request_hit(
        request_id: str,
        hash_to_evict: list[BlockHashWithGroupId],
        expect_hit_length: int,
    ):
        req = make_request(
            request_id, common_token_ids + unique_token_ids, block_size, sha256
        )
        for hash_with_group_id in hash_to_evict:
            manager.block_pool.cached_block_hash_to_block._cache.pop(hash_with_group_id)
        computed_blocks, num_computed_tokens = manager.get_computed_blocks(req)
        assert len(req.block_hashes) == 3
        assert num_computed_tokens == expect_hit_length * block_size
        for block_per_group in computed_blocks.blocks:
            assert len(block_per_group) == num_computed_tokens // block_size
        for hash_with_group_id in hash_to_evict:
            manager.block_pool.cached_block_hash_to_block._cache[hash_with_group_id] = (
                cached_block_hash_to_block_bak[hash_with_group_id]
            )
        manager.free(req)
    # Evict the blocks outside sliding window, does not affect the hit length.
-    test_partial_request_hit(
+    _test_partial_request_hit(
        manager,
        block_size,
        num_full_blocks,
        "2",
        all_token_ids,
        [
            make_block_hash_with_group_id(block_hashes[0], 1),
            make_block_hash_with_group_id(block_hashes[0], 2),
@@ -430,13 +414,23 @@ def test_prefill_hybrid_model():
    )
    # Evict the first block of full attention, makes total cache miss.
-    test_partial_request_hit(
+    _test_partial_request_hit(
-        "3", [make_block_hash_with_group_id(block_hashes[0], 0)], 0
+        manager,
        block_size,
        num_full_blocks,
        "3",
        all_token_ids,
        [make_block_hash_with_group_id(block_hashes[0], 0)],
        0,
    )
    # Evict the last block of all layers, reduces the hit length to 2.
-    test_partial_request_hit(
+    _test_partial_request_hit(
        manager,
        block_size,
        num_full_blocks,
        "4",
        all_token_ids,
        [
            make_block_hash_with_group_id(block_hashes[2], 0),
            make_block_hash_with_group_id(block_hashes[2], 1),
@@ -446,18 +440,36 @@ def test_prefill_hybrid_model():
    )
    # Evict the last block of full attention, reduces the hit length to 2.
-    test_partial_request_hit(
+    _test_partial_request_hit(
-        "5", [make_block_hash_with_group_id(block_hashes[2], 0)], 2
+        manager,
        block_size,
        num_full_blocks,
        "5",
        all_token_ids,
        [make_block_hash_with_group_id(block_hashes[2], 0)],
        2,
    )
    # Evict the last block of sliding window, reduces the hit length to 2.
-    test_partial_request_hit(
+    _test_partial_request_hit(
-        "6", [make_block_hash_with_group_id(block_hashes[2], 1)], 2
+        manager,
        block_size,
        num_full_blocks,
        "6",
        all_token_ids,
        [make_block_hash_with_group_id(block_hashes[2], 1)],
        2,
    )
    # Evict the last block of sliding window, reduces the hit length to 2.
-    test_partial_request_hit(
+    _test_partial_request_hit(
-        "7", [make_block_hash_with_group_id(block_hashes[2], 2)], 2
+        manager,
        block_size,
        num_full_blocks,
        "7",
        all_token_ids,
        [make_block_hash_with_group_id(block_hashes[2], 2)],
        2,
    )
    # Evict different set of blocks for full attention and sliding window makes
@@ -466,8 +478,12 @@ def test_prefill_hybrid_model():
    # The cache hit length of sliding window is 2 * block_size.
    # This is then a cache miss, because the two types of layers
    # have different hit lengths.
-    test_partial_request_hit(
+    _test_partial_request_hit(
        manager,
        block_size,
        num_full_blocks,
        "8",
        all_token_ids,
        [
            make_block_hash_with_group_id(block_hashes[2], 0),
            make_block_hash_with_group_id(block_hashes[0], 1),
@@ -477,6 +493,214 @@ def test_prefill_hybrid_model():
    )
 def test_prefill_hybrid_model_eagle():
    block_size = 16
    kv_cache_config = make_kv_cache_config_hybrid_model(block_size, 31, 3)
    manager = KVCacheManager(
        kv_cache_config,
        max_model_len=8192,
        enable_caching=True,
        hash_block_size=block_size,
        use_eagle=True,
    )
    hash_fn = sha256
    # Complete 6 blocks (96 tokens)
    num_full_blocks = 6
    common_token_ids = [i for i in range(num_full_blocks) for _ in range(block_size)]
    # Fully cache miss
    # Incomplete 1 block (7 tokens)
    unique_token_ids = [6] * 7
    all_token_ids = common_token_ids + unique_token_ids
    req0 = make_request("0", all_token_ids, block_size, hash_fn)
    computed_blocks, num_computed_tokens = manager.get_computed_blocks(req0)
    assert len(req0.block_hashes) == len(all_token_ids) // block_size
    assert not computed_blocks.blocks[0]
    assert num_computed_tokens == 0
    blocks = manager.allocate_slots(
        req0, len(all_token_ids), num_computed_tokens, computed_blocks
    )
    block_ids = (
        [1, 2, 3, 4, 5, 6, 7],
        [8, 9, 10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19, 20, 21],
    )
    assert blocks is not None and blocks.get_block_ids() == block_ids
    # Check full block metadata
    parent_block_hash = None
    for i, full_block_ids in enumerate(zip(*(row[:-1] for row in block_ids))):
        block_tokens = tuple(all_token_ids[i * block_size : (i + 1) * block_size])
        block_hash = hash_block_tokens(hash_fn, parent_block_hash, block_tokens)
        for group_id, block_id in enumerate(full_block_ids):
            blk_hash = manager.block_pool.blocks[block_id].block_hash
            assert blk_hash is not None
            assert get_block_hash(blk_hash) == block_hash
            assert get_group_id(blk_hash) == group_id
            assert manager.block_pool.blocks[block_id].ref_cnt == 1
        parent_block_hash = block_hash
    # Check partial block metadata
    for partial_block_id in (row[-1] for row in block_ids):
        assert manager.block_pool.blocks[partial_block_id].block_hash is None
        assert manager.block_pool.blocks[partial_block_id].ref_cnt == 1
    # Cache hit in the common prefix
    # Incomplete 1 block (5 tokens)
    unique_token_ids = [6] * 5
    all_token_ids = common_token_ids + unique_token_ids
    req1 = make_request("1", all_token_ids, block_size, hash_fn)
    computed_blocks, num_computed_tokens = manager.get_computed_blocks(req1)
    assert len(req1.block_hashes) == num_full_blocks
    assert computed_blocks.get_block_ids() == (
        [1, 2, 3, 4],
        [0, 9, 10, 11],
        [0, 16, 17, 18],
    )
    assert num_computed_tokens == 4 * block_size
    num_new_tokens = len(all_token_ids) - num_computed_tokens
    blocks = manager.allocate_slots(
        req1, num_new_tokens, num_computed_tokens, computed_blocks
    )
    assert blocks is not None and blocks.get_block_ids() == (
        [22, 23, 24],
        [25, 26, 27],
        [28, 29, 30],
    )
    for block_per_group in computed_blocks.blocks:
        for block in block_per_group:
            if block != manager.block_pool.null_block:
                assert block.ref_cnt == 2
    block_hashes = req1.block_hashes
    manager.free(req0)
    manager.free(req1)
    # Evict the blocks outside sliding window, does not affect the hit length.
    _test_partial_request_hit(
        manager,
        block_size,
        num_full_blocks,
        "2",
        all_token_ids,
        [
            make_block_hash_with_group_id(block_hashes[0], 1),
            make_block_hash_with_group_id(block_hashes[0], 2),
        ],
        4,
    )
    # Evict the first block of full attention, makes total cache miss.
    _test_partial_request_hit(
        manager,
        block_size,
        num_full_blocks,
        "3",
        all_token_ids,
        [make_block_hash_with_group_id(block_hashes[0], 0)],
        0,
    )
    # Evict the last block of all layers, reduces the hit length to 3.
    _test_partial_request_hit(
        manager,
        block_size,
        num_full_blocks,
        "4",
        all_token_ids,
        [
            make_block_hash_with_group_id(block_hashes[-1], 0),
            make_block_hash_with_group_id(block_hashes[-1], 1),
            make_block_hash_with_group_id(block_hashes[-1], 2),
        ],
        3,
    )
    # Evict the last block of full attention, reduces the hit length to 3.
    _test_partial_request_hit(
        manager,
        block_size,
        num_full_blocks,
        "5",
        all_token_ids,
        [make_block_hash_with_group_id(block_hashes[-1], 0)],
        3,
    )
    # Since the last block of full attention is dropped for eagle, evict
    # the second last block of sliding window, reduces the hit length to 3.
    _test_partial_request_hit(
        manager,
        block_size,
        num_full_blocks,
        "6",
        all_token_ids,
        [make_block_hash_with_group_id(block_hashes[-2], 1)],
        3,
    )
    # Since the last block of full attention is dropped for eagle, evict
    # the second last block of sliding window, reduces the hit length to 3.
    _test_partial_request_hit(
        manager,
        block_size,
        num_full_blocks,
        "7",
        all_token_ids,
        [make_block_hash_with_group_id(block_hashes[-2], 2)],
        3,
    )
    # Evicting different sets of blocks for full attention and sliding window
    # causes a total cache miss.
    # The cache hit length of full attention is 4 * block_size.
    # The cache hit length of sliding window is 3 * block_size.
    # This is then a cache miss, because the two types of layers
    # have different hit lengths.
    _test_partial_request_hit(
        manager,
        block_size,
        num_full_blocks,
        "8",
        all_token_ids,
        [
            make_block_hash_with_group_id(block_hashes[-1], 0),
            make_block_hash_with_group_id(block_hashes[0], 1),
            make_block_hash_with_group_id(block_hashes[0], 2),
        ],
        0,
    )
 def _test_partial_request_hit(
    manager: KVCacheManager,
    block_size: int,
    num_full_blocks,
    request_id: str,
    prompt_token_ids: list[int],
    hash_to_evict: list[BlockHashWithGroupId],
    expect_hit_length: int,
 ):
    cached_block_hash_to_block_bak = copy.copy(
        manager.block_pool.cached_block_hash_to_block._cache
    )
    req = make_request(request_id, prompt_token_ids, block_size, sha256)
    for hash_with_group_id in hash_to_evict:
        manager.block_pool.cached_block_hash_to_block._cache.pop(hash_with_group_id)
    computed_blocks, num_computed_tokens = manager.get_computed_blocks(req)
    assert len(req.block_hashes) == num_full_blocks
    assert num_computed_tokens == expect_hit_length * block_size
    for block_per_group in computed_blocks.blocks:
        assert len(block_per_group) == num_computed_tokens // block_size
    for hash_with_group_id in hash_to_evict:
        manager.block_pool.cached_block_hash_to_block._cache[hash_with_group_id] = (
            cached_block_hash_to_block_bak[hash_with_group_id]
        )
    manager.free(req)
 def _make_hybrid_kv_cache_config(
    block_size: int, num_blocks: int, spec_types: list[str]
 ) -> KVCacheConfig:
@@ -655,6 +879,85 @@ def test_prefill_hybrid_model_combinations(spec_types: list[str]):
    manager.free(req1)
 # Test cases with eagle enabled: Only test a single simple case for now.
 # - 2 groups: 1 full + 1 other
 _EAGLE_HYBRID_MODEL_TEST_CASES = [
    # 2 groups: 1 full + 1 other
    pytest.param(["full", "sliding_window"], 2, id="2g-full+sw"),
 ]
@pytest.mark.parametrize("spec_types,expect_hit_length", _EAGLE_HYBRID_MODEL_TEST_CASES)
 def test_prefill_hybrid_model_combinations_eagle(
    spec_types: list[str], expect_hit_length: int
 ):
    """
    Test prefix caching with hybrid models (1 full attn + 1 other) with EAGLE.
    More complex hybrid models with EAGLE are not yet supported (see issue #32802).
    """
    block_size = 16
    num_groups = len(spec_types)
    # Allocate enough blocks for all groups
    num_blocks = 10 * num_groups
    kv_cache_config = _make_hybrid_kv_cache_config(block_size, num_blocks, spec_types)
    manager = KVCacheManager(
        kv_cache_config,
        max_model_len=8192,
        enable_caching=True,
        hash_block_size=block_size,
        use_eagle=True,
    )
    hash_fn = sha256
    # Complete 4 blocks (64 tokens)
    num_full_blocks = 4
    common_token_ids = [i for i in range(num_full_blocks) for _ in range(block_size)]
    unique_token_ids = [4] * 7
    all_token_ids = common_token_ids + unique_token_ids
    # First request: no cache hit initially
    req0 = make_request("0", all_token_ids, block_size, hash_fn)
    computed_blocks, num_computed_tokens = manager.get_computed_blocks(req0)
    assert len(req0.block_hashes) == num_full_blocks
    assert not computed_blocks.blocks[0]  # No cache hit initially
    assert num_computed_tokens == 0
    blocks = manager.allocate_slots(
        req0, len(all_token_ids), num_computed_tokens, computed_blocks
    )
    assert blocks is not None
    # Should have blocks for all groups
    assert len(blocks.get_block_ids()) == num_groups
    # Second request: should hit cached blocks for common prefix
    all_token_ids = common_token_ids + [6] * 5
    req1 = make_request("1", all_token_ids, block_size, hash_fn)
    computed_blocks, num_computed_tokens = manager.get_computed_blocks(req1)
    # Should hit cached blocks for all groups
    assert num_computed_tokens == expect_hit_length * block_size
    assert len(computed_blocks.blocks) == num_groups
    # Verify each group has the correct number of computed blocks
    for block_per_group in computed_blocks.blocks:
        assert len(block_per_group) == expect_hit_length
    # Allocate and verify blocks for second request
    blocks = manager.allocate_slots(
        req1,
        len(all_token_ids) - num_computed_tokens,
        num_computed_tokens,
        computed_blocks,
    )
    assert blocks is not None
    assert len(blocks.get_block_ids()) == num_groups
    manager.free(req0)
    manager.free(req1)
 def test_prefill_plp():
    """Test prefill with APC and some prompt logprobs (plp) requests.
--- a/tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh
+++ b/tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh
@@ -34,18 +34,11 @@ else
  KV_CONFIG_HETERO_LAYOUT=''
 fi
 CROSS_LAYERS_BLOCKS=${CROSS_LAYERS_BLOCKS:-"False"} # Default to non cross layers
 if [[ "$CROSS_LAYERS_BLOCKS" == "True" ]]; then
  KV_EXTRA_CONFIG=',"kv_connector_extra_config":{"cross_layers_blocks": "True"}'
 else
  KV_EXTRA_CONFIG=''
 fi
 # Build the kv-transfer-config once
 if [[ "$KV_BUFFER_DEVICE" == "cuda" ]]; then
-  KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both"'${KV_CONFIG_HETERO_LAYOUT}${KV_EXTRA_CONFIG}'}'
+  KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both"'${KV_CONFIG_HETERO_LAYOUT}'}'
 else
-  KV_CONFIG="{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"$KV_BUFFER_DEVICE\""${KV_CONFIG_HETERO_LAYOUT}${KV_EXTRA_CONFIG}"}"
+  KV_CONFIG="{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"$KV_BUFFER_DEVICE\""${KV_CONFIG_HETERO_LAYOUT}"}"
 fi
 # Models to run
--- a/tests/v1/kv_connector/unit/test_nixl_connector.py
+++ b/tests/v1/kv_connector/unit/test_nixl_connector.py
@@ -18,12 +18,8 @@ import ray
 import torch
 from vllm import LLM
-from vllm.config import KVTransferConfig, set_current_vllm_config
+from vllm.config import KVTransferConfig
-from vllm.distributed.kv_transfer.kv_connector.utils import (
-    KVOutputAggregator,
-    TpKVTopology,
-    get_current_attn_backend,
-)
+from vllm.distributed.kv_transfer.kv_connector.utils import KVOutputAggregator
 from vllm.distributed.kv_transfer.kv_connector.v1 import nixl_connector
 from vllm.distributed.kv_transfer.kv_connector.v1.metrics import KVConnectorStats
 from vllm.distributed.kv_transfer.kv_connector.v1.multi_connector import (
@@ -52,11 +48,8 @@ from vllm.sampling_params import SamplingParams
 from vllm.v1.attention.backends.flash_attn import FlashAttentionBackend
 from vllm.v1.engine import EngineCoreRequest
 from vllm.v1.engine.output_processor import OutputProcessor
 from vllm.v1.kv_cache_interface import AttentionSpec, KVCacheConfig, KVCacheTensor
 from vllm.v1.outputs import KVConnectorOutput, ModelRunnerOutput
 from vllm.v1.request import RequestStatus
 from vllm.v1.worker.kv_connector_model_runner_mixin import KVConnectorModelRunnerMixin
 from vllm.v1.worker.utils import AttentionGroup
 from .utils import create_request, create_scheduler, create_vllm_config
@@ -373,7 +366,6 @@ def test_kv_transfer_handshake(dist_init):
        # Decode connector will be able to create handshake with the prefill connector.
        decode_connector = NixlConnector(vllm_config, KVConnectorRole.WORKER)
        decode_connector.register_kv_caches(kv_caches)
        # Here we are testing the retrieval of NIXLAgentMetadata.
        # Knowing the implementation detail, we override the add_remote_agent
@@ -410,23 +402,6 @@ class FakeNixlConnectorWorker(NixlConnectorWorker):
        self.kv_cache_layout = kv_cache_layout
        # Mock register_kv_caches attribute needed for tests that do not call it.
        self.src_xfer_handles_by_block_size = {self.block_size: 1}
        test_shape = self.attn_backend.get_kv_cache_shape(
            num_blocks=1, block_size=16, num_kv_heads=1, head_size=1
        )
        self.kv_topo = TpKVTopology(
            tp_rank=self.tp_rank,
            engine_id=self.engine_id,
            remote_tp_size=self._tp_size,  # shared state
            remote_block_size=self._block_size,  # shared state
            is_mla=self.use_mla,
            total_num_kv_heads=self.model_config.get_total_num_kv_heads(),
            attn_backend=self.attn_backend,
            tensor_shape=test_shape,
        )
        self.compat_hash = compute_nixl_compatibility_hash(
            self.vllm_config, self.backend_name, self.kv_topo.cross_layers_blocks
        )
    def _nixl_handshake(
        self, host: str, port: int, remote_tp_size: int, expected_engine_id: str
@@ -1395,7 +1370,6 @@ def _run_abort_timeout_test(llm: LLM, timeout: int):
            ),
        ),
        "TRITON_ATTN",
        "FLASHINFER",
    ],
 )
 def test_register_kv_caches(default_vllm_config, dist_init, attn_backend):
@@ -1412,11 +1386,6 @@ def test_register_kv_caches(default_vllm_config, dist_init, attn_backend):
    vllm_config = create_vllm_config(attention_backend=attn_backend)
    # Enable cross layers blocks
    vllm_config.kv_transfer_config.kv_connector_extra_config[
        "enable_cross_layers_blocks"
    ] = True
    # Import the appropriate backend based on the parameter
    if attn_backend == "FLASH_ATTN":
        from vllm.v1.attention.backends.flash_attn import FlashAttentionBackend
@@ -1426,11 +1395,49 @@ def test_register_kv_caches(default_vllm_config, dist_init, attn_backend):
        from vllm.v1.attention.backends.rocm_attn import RocmAttentionBackend
        backend_cls = RocmAttentionBackend
-    else:  # TRITON
+    else:  # TRITON_ATTN
        from vllm.v1.attention.backends.triton_attn import TritonAttentionBackend
        backend_cls = TritonAttentionBackend
    # Create test kv cache tensors using proper backend shape
    kv_cache_shape = backend_cls.get_kv_cache_shape(
        num_blocks=2, block_size=16, num_kv_heads=4, head_size=64
    )
    shared_tensor = torch.zeros(*kv_cache_shape, dtype=torch.float16)
    unique_tensor = torch.zeros(*kv_cache_shape, dtype=torch.float16)
    kv_caches = {
        "layer0": shared_tensor,
        "layer1": unique_tensor,
        "layer2": shared_tensor,
    }
    # Store tensor info for validation
    test_shape = backend_cls.get_kv_cache_shape(
        num_blocks=1, block_size=16, num_kv_heads=1, head_size=1
    )
    is_blocks_first = len(test_shape) == 5 and test_shape[0] == 1
    if is_blocks_first:
        expected_tensor_size = shared_tensor.element_size() * shared_tensor.numel()
        expected_base_addrs = [
            shared_tensor.data_ptr(),
            unique_tensor.data_ptr(),
        ]
        expected_num_entries = 2
    else:
        expected_tensor_size = (
            shared_tensor[0].element_size() * shared_tensor[0].numel()
        )
        expected_base_addrs = [
            shared_tensor[0].data_ptr(),
            shared_tensor[1].data_ptr(),
            unique_tensor[0].data_ptr(),
            unique_tensor[1].data_ptr(),
        ]
        expected_num_entries = 4
    nixl_module = "vllm.distributed.kv_transfer.kv_connector.v1.nixl_connector"
    with (
        patch(f"{nixl_module}.NixlWrapper") as mock_nixl_wrapper,
@@ -1459,107 +1466,6 @@ def test_register_kv_caches(default_vllm_config, dist_init, attn_backend):
        # Reassure the shutdown() check that the thread is terminated
        mock_thread.return_value.is_alive.return_value = False
        expected_tensor_size: int
        expected_base_addrs: list[int]
        expected_num_entries: int
        kv_caches: dict[str, torch.Tensor]
        if connector.prefer_cross_layer_blocks:
            num_layers = 32
            block_size = 16
            num_blocks = 8
            kv_cache_spec = AttentionSpec(
                block_size=block_size,
                num_kv_heads=4,
                head_size=64,
                dtype=torch.bfloat16,
            )
            kv_cache_config = KVCacheConfig(
                num_blocks=num_blocks,
                kv_cache_tensors=[
                    KVCacheTensor(
                        size=kv_cache_spec.page_size_bytes * num_blocks,
                        shared_by=["dummy-layer"],
                    )
                    for i in range(num_layers)
                ],
                # allocate_uniform_kv_caches does not use this
                kv_cache_groups=[],
            )
            with set_current_vllm_config(vllm_config):
                _, cross_layers_kv_cache, _ = (
                    KVConnectorModelRunnerMixin.allocate_uniform_kv_caches(
                        kv_cache_config=kv_cache_config,
                        attn_groups=[
                            [
                                AttentionGroup(
                                    backend=backend_cls,
                                    layer_names=[],
                                    kv_cache_spec=kv_cache_spec,
                                    kv_cache_group_id=0,
                                )
                            ]
                        ],
                        cache_dtype=torch.bfloat16,
                        device=torch.cuda.current_device(),
                        kernel_block_sizes=[block_size],
                    )
                )
            # Store tensor info for validation
            expected_tensor_size = (
                cross_layers_kv_cache.element_size() * cross_layers_kv_cache.numel()
            )
            expected_base_addrs = [
                cross_layers_kv_cache.data_ptr(),
            ]
            expected_num_entries = 1
            expected_blocks_count = 8
            kv_caches = {"all-layers": cross_layers_kv_cache}
        else:
            # Create test kv cache tensors using proper backend shape
            kv_cache_shape = backend_cls.get_kv_cache_shape(
                num_blocks=2, block_size=16, num_kv_heads=4, head_size=64
            )
            shared_tensor = torch.zeros(*kv_cache_shape, dtype=torch.float16)
            unique_tensor = torch.zeros(*kv_cache_shape, dtype=torch.float16)
            kv_caches = {
                "layer0": shared_tensor,
                "layer1": unique_tensor,
                "layer2": shared_tensor,
            }
            # Store tensor info for validation
            test_shape = backend_cls.get_kv_cache_shape(
                num_blocks=1, block_size=16, num_kv_heads=1, head_size=1
            )
            is_blocks_first = len(test_shape) == 5 and test_shape[0] == 1
            if is_blocks_first:
                expected_tensor_size = (
                    shared_tensor.element_size() * shared_tensor.numel()
                )
                expected_base_addrs = [
                    shared_tensor.data_ptr(),
                    unique_tensor.data_ptr(),
                ]
                expected_num_entries = 2
            else:
                expected_tensor_size = (
                    shared_tensor[0].element_size() * shared_tensor[0].numel()
                )
                expected_base_addrs = [
                    shared_tensor[0].data_ptr(),
                    shared_tensor[1].data_ptr(),
                    unique_tensor[0].data_ptr(),
                    unique_tensor[1].data_ptr(),
                ]
                expected_num_entries = 4
            expected_blocks_count = 8
        # Execute register_kv_caches
        connector.register_kv_caches(kv_caches)
@@ -1583,14 +1489,11 @@ def test_register_kv_caches(default_vllm_config, dist_init, attn_backend):
        blocks_data, _ = mock_wrapper_instance.get_xfer_descs.call_args[0]
        # Validate blocks_data structure and size
        expected_blocks_count = 8
        assert len(blocks_data) == expected_blocks_count, (
            f"Expected {expected_blocks_count} blocks, got {len(blocks_data)}"
        )
        if connector.prefer_cross_layer_blocks:
            num_blocks = 8
            expected_block_len = expected_tensor_size // num_blocks
        else:
        num_blocks = 2
        if is_blocks_first:
            expected_block_len = expected_tensor_size // num_blocks // 2
@@ -2146,17 +2049,6 @@ def test_compatibility_hash_validation(
    )
    decode_connector = NixlConnector(local_vllm_config, KVConnectorRole.WORKER)
    decode_worker = decode_connector.connector_worker
    kv_cache_shape = decode_worker.attn_backend.get_kv_cache_shape(
        num_blocks=2, block_size=16, num_kv_heads=4, head_size=64
    )
    shared_tensor = torch.zeros(*kv_cache_shape, dtype=torch.float16)
    unique_tensor = torch.zeros(*kv_cache_shape, dtype=torch.float16)
    kv_caches = {
        "layer0": shared_tensor,
        "layer1": unique_tensor,
        "layer2": shared_tensor,
    }
    decode_connector.register_kv_caches(kv_caches)
    remote_config_params: dict[str, Any] = {
        "model": "facebook/opt-125m",
@@ -2179,9 +2071,7 @@ def test_compatibility_hash_validation(
                )
            )
        remote_hash = compute_nixl_compatibility_hash(
-            remote_vllm_config,
-            decode_worker.backend_name,
-            decode_worker.kv_topo.cross_layers_blocks,
+            remote_vllm_config, decode_worker.backend_name
        )
    prefill_block_size = config_overrides.get("block_size", 16)
@@ -2260,27 +2150,6 @@ def test_handshake_decode_errors(default_vllm_config, dist_init, error_scenario)
    decode_connector = NixlConnector(local_vllm_config, KVConnectorRole.WORKER)
    decode_worker = decode_connector.connector_worker
    backend = get_current_attn_backend(local_vllm_config)
    test_shape = backend.get_kv_cache_shape(
        num_blocks=1, block_size=16, num_kv_heads=1, head_size=1
    )
    decode_worker.kv_topo = TpKVTopology(
        tp_rank=decode_worker.tp_rank,
        engine_id=decode_worker.engine_id,
        remote_tp_size=decode_worker._tp_size,  # shared state
        remote_block_size=decode_worker._block_size,  # shared state
        is_mla=decode_worker.use_mla,
        total_num_kv_heads=decode_worker.model_config.get_total_num_kv_heads(),
        attn_backend=backend,
        tensor_shape=test_shape,
    )
    decode_worker.compat_hash = compute_nixl_compatibility_hash(
        decode_worker.vllm_config,
        decode_worker.backend_name,
        decode_worker.kv_topo.cross_layers_blocks,
    )
    if error_scenario == "handshake_decode_error":
        msg_bytes = b"this is not valid msgpack data"
    elif error_scenario == "handshake_validation_error":
--- a/tests/weight_loading/models.txt
+++ b/tests/weight_loading/models.txt
@@ -19,7 +19,6 @@ compressed-tensors, nm-testing/tinyllama-oneshot-w8a16-per-channel, main
 compressed-tensors, nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test, main
 compressed-tensors, nm-testing/Phi-3-mini-128k-instruct-FP8, main
 compressed-tensors, neuralmagic/Phi-3-medium-128k-instruct-quantized.w4a16, main
 compressed-tensors, nm-testing/TinyLlama-1.1B-Chat-v1.0-actorder-group, main
 #compressed-tensors, mgoin/DeepSeek-Coder-V2-Lite-Instruct-FP8, main
 compressed-tensors, nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-FP8-Dynamic-testing, main, 90
 compressed-tensors, nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-W8A8-testing, main, 90
--- a/tools/vllm-rocm/generate-rocm-wheels-root-index.sh
+++ b/tools/vllm-rocm/generate-rocm-wheels-root-index.sh
@@ -0,0 +1,233 @@
 #!/usr/bin/env bash
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 #
 # Generate S3 PyPI Root Index for Latest Version
 #
 # Creates a PEP 503 compatible index.html at rocm/ pointing to the latest
 # semantic version's packages. This enables users to install with:
 #   uv pip install vllm --extra-index-url s3://vllm-wheels/rocm
 #
 # Usage:
 #   generate-root-index.sh [options]
 #
 # Options:
 #   --dry-run      Preview changes without uploading
 #   --version VER  Use specific version instead of auto-detecting latest
 #
 # Environment variables:
 #   S3_BUCKET   - Bucket name (default: vllm-wheels)
 #   VARIANT     - ROCm variant (default: rocm700)
 #   DRY_RUN     - Set to 1 for preview mode (same as --dry-run)
 set -euo pipefail
 # ======== Configuration ========
 BUCKET="${S3_BUCKET:-vllm-wheels}"
 VARIANT="${VARIANT:-rocm700}"
 DRY_RUN="${DRY_RUN:-0}"
 FORCE_VERSION=""
 # Parse command line arguments
 while [[ $# -gt 0 ]]; do
    case $1 in
        --dry-run)
            DRY_RUN=1
            shift
            ;;
        --version)
            FORCE_VERSION="$2"
            shift 2
            ;;
        *)
            echo "Unknown option: $1"
            exit 1
            ;;
    esac
 done
 # Working directory for generated files
 WORK_DIR=$(mktemp -d)
 trap 'rm -rf "$WORK_DIR"' EXIT
 echo "========================================"
 echo "Generate Root Index for Latest Version"
 echo "========================================"
 echo "S3 Bucket: $BUCKET"
 echo "ROCm Variant: $VARIANT"
 echo "Dry Run: $DRY_RUN"
 echo "========================================"
 echo ""
 # ======== Step 1: Find latest semantic version ========
 echo "Step 1: Finding latest semantic version..."
 # List all directories under rocm/
 aws s3api list-objects-v2 \
    --bucket "$BUCKET" \
    --prefix "rocm/" \
    --delimiter "/" \
    --query 'CommonPrefixes[].Prefix' \
    --output text | tr '\t' '\n' > "$WORK_DIR/all_prefixes.txt"
 # Filter for semantic versions (x.y.z pattern)
 grep -oE 'rocm/[0-9]+\.[0-9]+\.[0-9]+/' "$WORK_DIR/all_prefixes.txt" | \
    sed 's|rocm/||; s|/||' | \
    sort -V > "$WORK_DIR/versions.txt" || true
 if [[ ! -s "$WORK_DIR/versions.txt" ]]; then
    echo "ERROR: No semantic versions found under s3://$BUCKET/rocm/"
    exit 1
 fi
 echo "Found versions:"
 cat "$WORK_DIR/versions.txt"
 echo ""
 if [[ -n "$FORCE_VERSION" ]]; then
    LATEST_VERSION="$FORCE_VERSION"
    echo "Using forced version: $LATEST_VERSION"
 else
    LATEST_VERSION=$(tail -1 "$WORK_DIR/versions.txt")
    echo "Latest version (auto-detected): $LATEST_VERSION"
 fi
 # Verify the version exists
 if ! grep -qx "$LATEST_VERSION" "$WORK_DIR/versions.txt"; then
    echo "ERROR: Version $LATEST_VERSION not found in bucket"
    exit 1
 fi
 # ======== Step 2: List packages from latest version ========
 echo ""
 echo "Step 2: Listing packages from rocm/$LATEST_VERSION/$VARIANT/..."
 VERSION_PREFIX="rocm/$LATEST_VERSION/$VARIANT/"
 # List package directories
 aws s3api list-objects-v2 \
    --bucket "$BUCKET" \
    --prefix "$VERSION_PREFIX" \
    --delimiter "/" \
    --query 'CommonPrefixes[].Prefix' \
    --output text | tr '\t' '\n' > "$WORK_DIR/package_prefixes.txt" || true
 if [[ ! -s "$WORK_DIR/package_prefixes.txt" ]]; then
    echo "ERROR: No packages found under s3://$BUCKET/$VERSION_PREFIX"
    exit 1
 fi
 # Extract package names
 sed "s|${VERSION_PREFIX}||; s|/||g" "$WORK_DIR/package_prefixes.txt" | \
    grep -v '^$' > "$WORK_DIR/packages.txt"
 echo "Found packages:"
 cat "$WORK_DIR/packages.txt"
 echo ""
 # ======== Step 3: Generate root index.html ========
 echo "Step 3: Generating root index.html..."
 mkdir -p "$WORK_DIR/output"
 {
    cat <<'EOF'
 <!DOCTYPE html>
 <html>
 <head>
    <meta name="pypi:repository-version" content="1.0">
 </head>
 <body>
 EOF
    while read -r pkg; do
        echo "    <a href=\"$pkg/\">$pkg</a><br>"
    done < "$WORK_DIR/packages.txt"
    cat <<'EOF'
 </body>
 </html>
 EOF
 } > "$WORK_DIR/output/index.html"
 echo "Generated root index.html:"
 cat "$WORK_DIR/output/index.html"
 echo ""
 # ======== Step 4: Copy and adjust package index files ========
 echo "Step 4: Copying and adjusting package index files..."
 while read -r pkg; do
    echo "Processing package: $pkg"
    # Download existing index.html from versioned path
    SOURCE_INDEX="s3://$BUCKET/$VERSION_PREFIX$pkg/index.html"
    mkdir -p "$WORK_DIR/output/$pkg"
    if aws s3 cp "$SOURCE_INDEX" "$WORK_DIR/output/$pkg/index.html" 2>/dev/null; then
        # Adjust relative paths:
        # Original: href="../../../{commit}/wheel.whl" (from rocm/0.13.0/rocm710/vllm/)
        # New:      href="../{commit}/wheel.whl"       (from rocm/vllm/)
        sed -i 's|href="\.\./\.\./\.\./|href="../|g' "$WORK_DIR/output/$pkg/index.html"
        echo "  - Downloaded and adjusted: $pkg/index.html"
    else
        echo "  - WARNING: Could not download index for $pkg"
    fi
 done < "$WORK_DIR/packages.txt"
 echo ""
 # ======== Step 5: Upload to S3 ========
 echo "Step 5: Uploading to s3://$BUCKET/rocm/..."
 echo ""
 # List what would be uploaded
 echo "Files to upload:"
 find "$WORK_DIR/output" -name "*.html" -type f | while read -r file; do
    rel_path="${file#$WORK_DIR/output/}"
    echo "  rocm/$rel_path"
 done
 echo ""
 if [[ "$DRY_RUN" == "1" ]]; then
    echo "DRY RUN - Skipping upload"
    echo ""
    echo "Preview of generated files:"
    echo "----------------------------------------"
    echo "rocm/index.html:"
    cat "$WORK_DIR/output/index.html"
    echo ""
    echo "----------------------------------------"
    echo "Sample package index (first package):"
    FIRST_PKG=$(head -1 "$WORK_DIR/packages.txt")
    if [[ -f "$WORK_DIR/output/$FIRST_PKG/index.html" ]]; then
        echo "rocm/$FIRST_PKG/index.html:"
        cat "$WORK_DIR/output/$FIRST_PKG/index.html"
    fi
 else
    # Upload all generated files
    aws s3 cp --recursive "$WORK_DIR/output/" "s3://$BUCKET/rocm/" \
        --content-type "text/html"
    echo "Upload complete!"
 fi
 # ======== Summary ========
 echo ""
 echo "========================================"
 echo "Root Index Generation Complete!"
 echo "========================================"
 echo ""
 echo "Latest version: $LATEST_VERSION"
 echo "Packages indexed: $(wc -l < "$WORK_DIR/packages.txt")"
 echo ""
 echo "Install command:"
 echo "  uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/"
 echo "========================================"
--- a/vllm/_custom_ops.py
+++ b/vllm/_custom_ops.py
@@ -900,6 +900,8 @@ def cutlass_sparse_scaled_mm_supported(cuda_device_capability: int) -> bool:
 def cutlass_group_gemm_supported(cuda_device_capability: int) -> bool:
    if cuda_device_capability < 90 or cuda_device_capability >= 110:
        return False
    try:
        return torch.ops._C.cutlass_group_gemm_supported(cuda_device_capability)
    except AttributeError:
@@ -2032,35 +2034,20 @@ def selective_scan_fwd(
    )
 # NOTE: The wvSplitK kernel (and all of the kernels in skinny_gemms.cu)
 # are unable to properly handle non-contiguous
 # tensors.  It might be a good TODO(rasmith) to augment these kernels
 # to be able to handle non-contiguous kernels for better performance.
 def rocm_enforce_contiguous_skinny_gemm_inputs(
    a: torch.Tensor, b: torch.Tensor
 ) -> tuple[torch.Tensor, torch.Tensor]:
    a = a.contiguous()  # no-op if already contiguous, else clone
    b = b.contiguous()  # no-op if already contiguous, else clone
    return a, b
 # ROCm skinny gemms
 def LLMM1(a: torch.Tensor, b: torch.Tensor, rows_per_block: int) -> torch.Tensor:
    a, b = rocm_enforce_contiguous_skinny_gemm_inputs(a, b)
    return torch.ops._rocm_C.LLMM1(a, b, rows_per_block)
 def wvSplitK(
    a: torch.Tensor, b: torch.Tensor, cu_count: int, bias: torch.Tensor = None
 ) -> torch.Tensor:
    a, b = rocm_enforce_contiguous_skinny_gemm_inputs(a, b)
    return torch.ops._rocm_C.wvSplitK(a, b, bias, cu_count)
 def wvSplitKrc(
    a: torch.Tensor, b: torch.Tensor, cu_count: int, bias: torch.Tensor = None
 ) -> torch.Tensor:
    a, b = rocm_enforce_contiguous_skinny_gemm_inputs(a, b)
    return torch.ops._rocm_C.wvSplitKrc(a, b, bias, cu_count)
@@ -2073,7 +2060,6 @@ def wvSplitKQ(
    cu_count: int,
    bias: torch.Tensor = None,
 ) -> torch.Tensor:
    a, b = rocm_enforce_contiguous_skinny_gemm_inputs(a, b)
    out = torch.empty((b.shape[0], a.shape[0]), dtype=out_dtype, device=b.device)
    torch.ops._rocm_C.wvSplitKQ(a, b, bias, out, scale_a, scale_b, cu_count)
    return out
--- a/vllm/compilation/backends.py
+++ b/vllm/compilation/backends.py
@@ -361,6 +361,13 @@ def split_graph(
            subgraph_id += 1
            node_to_subgraph_id[node] = subgraph_id
            split_op_graphs.append(subgraph_id)
            # keep consecutive splitting ops together
            # (we know node.next exists because node isn't the last (output) node)
            if should_split(node.next, splitting_ops):
                # this will get incremented by the next node
                subgraph_id -= 1
            else:
                subgraph_id += 1
        else:
            node_to_subgraph_id[node] = subgraph_id
--- a/vllm/config/compilation.py
+++ b/vllm/config/compilation.py
@@ -581,6 +581,24 @@ class CompilationConfig:
    local_cache_dir: str = field(default=None, init=False)  # type: ignore
    """local cache dir for each rank"""
    fast_moe_cold_start = True
    """Optimization for fast MOE cold start.
    This is a bit of a hack that assumes that:
    1. the only decoder forward pass being run is the current model
    2. the decoder forward pass runs all of the MOEs in the order in which they
       are initialized
    When the above two conditions hold, this option greatly decreases cold start
    time for MOE models.
    If the above two conditions don't hold, then this option will lead to silent
    incorrectness. The only condition in which this doesn't hold is speculative
    decoding, where there is a draft model that may have MOEs in them.
    NB: We're working on a longer-term solution that doesn't need these assumptions.
    """
    # keep track of enabled and disabled custom ops
    enabled_custom_ops: Counter[str] = field(default_factory=Counter, init=False)
    """custom ops that are enabled"""
@@ -925,6 +943,15 @@ class CompilationConfig:
                # for details. Make a copy to avoid mutating the class-level
                # list via reference.
                self.splitting_ops = list(self._attention_ops)
                # unified_kv_cache_update has a string param that prevents Inductor
                # from reusing piecewise graphs. Remove it from the compiled graph.
                # This has the side-effect of excluding cache from cudagraphs but
                # that doesn't seem to affect performance.
                # https://github.com/vllm-project/vllm/issues/33267
                if not self.use_inductor_graph_partition:
                    self.splitting_ops.append("vllm::unified_kv_cache_update")
            elif len(self.splitting_ops) == 0:
                if (
                    self.cudagraph_mode == CUDAGraphMode.PIECEWISE
--- a/vllm/config/speculative.py
+++ b/vllm/config/speculative.py
@@ -40,6 +40,7 @@ MTPModelTypes = Literal[
    "longcat_flash_mtp",
    "mtp",
    "pangu_ultra_moe_mtp",
    "step3p5_mtp",
 ]
 EagleModelTypes = Literal["eagle", "eagle3", MTPModelTypes]
 SpeculativeMethod = Literal[
@@ -252,6 +253,11 @@ class SpeculativeConfig:
                {"n_predict": n_predict, "architectures": ["LongCatFlashMTPModel"]}
            )
        if hf_config.model_type == "step3p5":
            hf_config.model_type = "step3p5_mtp"
            n_predict = getattr(hf_config, "num_nextn_predict_layers", 1)
            hf_config.update({"n_predict": n_predict, "architectures": ["Step3p5MTP"]})
        if initial_architecture == "MistralLarge3ForCausalLM":
            hf_config.update({"architectures": ["EagleMistralLarge3ForCausalLM"]})
--- a/vllm/distributed/kv_transfer/kv_connector/utils.py
+++ b/vllm/distributed/kv_transfer/kv_connector/utils.py
@@ -316,7 +316,6 @@ class TpKVTopology:
    attn_backend: type[AttentionBackend]
    engine_id: EngineId
    remote_block_size: dict[EngineId, int]
    tensor_shape: torch.Size | None = None
    def __post_init__(self):
        # Figure out whether the first dimension of the cache is K/V
@@ -330,32 +329,6 @@ class TpKVTopology:
            len(kv_cache_shape) == 5 and kv_cache_shape[0] == 1
        )
        self._kv_heads_position: int | None = None
        self._cross_layers_blocks = False
        if self.tensor_shape is not None:
            self._cross_layers_blocks = (
                len(self.tensor_shape) == len(kv_cache_shape) + 1
            )
            if self._cross_layers_blocks:
                # prepend layers dimension
                kv_cache_shape = (80,) + kv_cache_shape
            try:
                kv_cache_stride_order = self.attn_backend.get_kv_cache_stride_order(
                    include_num_layers_dimension=self._cross_layers_blocks
                )
            except (AttributeError, NotImplementedError):
                kv_cache_stride_order = tuple(range(len(self.tensor_shape)))
            # permute kv_cache_shape according to stride_order
            kv_cache_shape = tuple(kv_cache_shape[i] for i in kv_cache_stride_order)
            physical_block_size_position = kv_cache_shape.index(16)
            assert physical_block_size_position is not None
            self._physical_block_size_position = -(
                len(kv_cache_shape) - physical_block_size_position
            )
    @property
    def is_kv_layout_blocks_first(self) -> bool:
        return self._is_kv_layout_blocks_first
@@ -363,9 +336,7 @@ class TpKVTopology:
    @property
    def split_k_and_v(self) -> bool:
        # Whether to register regions for K and V separately (when present).
-        return not (
-            self._cross_layers_blocks or self.is_mla or self.is_kv_layout_blocks_first
-        )
+        return not (self.is_mla or self.is_kv_layout_blocks_first)
    @property
    def tp_size(self) -> int:
@@ -375,14 +346,6 @@ class TpKVTopology:
    def block_size(self) -> int:
        return self.remote_block_size[self.engine_id]
    @property
    def cross_layers_blocks(self) -> bool:
        return self._cross_layers_blocks
    @property
    def block_size_position(self) -> int:
        return self._physical_block_size_position
    def tp_ratio(
        self,
        remote_tp_size: int,
--- a/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
+++ b/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
@@ -54,7 +54,7 @@ from vllm.forward_context import ForwardContext
 from vllm.logger import init_logger
 from vllm.platforms import current_platform
 from vllm.utils.network_utils import make_zmq_path, make_zmq_socket
-from vllm.v1.attention.backend import AttentionBackend, AttentionMetadata
+from vllm.v1.attention.backend import AttentionMetadata
 from vllm.v1.attention.backends.utils import get_kv_cache_layout
 from vllm.v1.core.sched.output import SchedulerOutput
 from vllm.v1.worker.block_table import BlockTable
@@ -173,7 +173,7 @@ class NixlHandshakePayload(KVConnectorHandshakeMetadata):
 def compute_nixl_compatibility_hash(
-    vllm_config: VllmConfig, attn_backend_name: str, cross_layers_blocks: bool
+    vllm_config: VllmConfig, attn_backend_name: str
 ) -> str:
    """
    Compute compatibility hash for NIXL KV transfer.
@@ -216,7 +216,6 @@ def compute_nixl_compatibility_hash(
        # Attention backend and KV cache dtype affect memory layout
        "attn_backend_name": attn_backend_name,
        "cache_dtype": str(cache_config.cache_dtype),
        "cross_layers_blocks": cross_layers_blocks,
    }
    compat_hash = hash_factors(factors)
@@ -299,20 +298,6 @@ class NixlConnectorMetadata(KVConnectorMetadata):
 class NixlConnector(KVConnectorBase_V1):
    @property
    def prefer_cross_layer_blocks(self) -> bool:
        backend = get_current_attn_backend(self._vllm_config)
        if backend().get_name() not in (
            "FLASH_ATTN",
            "FLASHINFER",
        ):
            # For now there is no benefit to run cross layers when backend
            # does not support on HND
            return False
        extra_config = self.kv_transfer_config.kv_connector_extra_config
        return bool(str(extra_config.get("enable_cross_layers_blocks", "False")))
    def __init__(
        self,
        vllm_config: VllmConfig,
@@ -324,7 +309,6 @@ class NixlConnector(KVConnectorBase_V1):
        assert vllm_config.kv_transfer_config is not None
        assert vllm_config.kv_transfer_config.engine_id is not None
        self.engine_id: EngineId = vllm_config.kv_transfer_config.engine_id
        self.kv_transfer_config = vllm_config.kv_transfer_config
        if role == KVConnectorRole.SCHEDULER:
            self.connector_scheduler: NixlConnectorScheduler | None = (
@@ -411,16 +395,6 @@ class NixlConnector(KVConnectorBase_V1):
        assert self.connector_worker is not None
        self.connector_worker.register_kv_caches(kv_caches)
    def register_cross_layers_kv_cache(
        self, kv_cache: torch.Tensor, attn_backend: type[AttentionBackend]
    ):
        assert self.connector_worker is not None
        cross_layer_name = "ALL_LAYERS"
        kv_caches = {cross_layer_name: kv_cache}
        self.connector_worker.register_kv_caches(kv_caches)
    def set_host_xfer_buffer_ops(self, copy_operation: CopyBlocksOp):
        assert self.connector_worker is not None
        self.connector_worker.set_host_xfer_buffer_ops(copy_operation)
@@ -1002,17 +976,20 @@ class NixlConnectorWorker:
        # Get the attention backend from the first layer
        # NOTE (NickLucche) models with multiple backends are not supported yet
-        self.attn_backend = get_current_attn_backend(vllm_config)
+        backend = get_current_attn_backend(vllm_config)
-        self.backend_name = self.attn_backend.get_name()
+        self.backend_name = backend.get_name()
        self.kv_cache_layout = get_kv_cache_layout()
        self.host_buffer_kv_cache_layout = self.kv_cache_layout
        logger.debug("Detected attention backend %s", self.backend_name)
        logger.debug("Detected kv cache layout %s", self.kv_cache_layout)
-        # lazy initialized in register_kv_caches
-        self.compat_hash: str | None = None
-        self.kv_topo: TpKVTopology | None = None
+        self.compat_hash = compute_nixl_compatibility_hash(
+            self.vllm_config, self.backend_name
+        )
+        self.enforce_compat_hash = self.kv_transfer_config.get_from_extra_config(
+            "enforce_handshake_compat", True
+        )
        self._tp_size: dict[EngineId, int] = {self.engine_id: self.world_size}
        self._block_size: dict[EngineId, int] = {self.engine_id: self.block_size}
@@ -1021,11 +998,16 @@ class NixlConnectorWorker:
        self.consumer_notification_counts_by_req = defaultdict[ReqId, int](int)
        self.xfer_stats = NixlKVConnectorStats()
-        self._physical_blocks_per_logical_kv_block = 1
-
-        self.enforce_compat_hash = self.kv_transfer_config.get_from_extra_config(
-            "enforce_handshake_compat", True
-        )
+        self.kv_topo = TpKVTopology(
+            tp_rank=self.tp_rank,
+            engine_id=self.engine_id,
+            remote_tp_size=self._tp_size,  # shared state
+            remote_block_size=self._block_size,  # shared state
+            is_mla=self.use_mla,
+            total_num_kv_heads=self.model_config.get_total_num_kv_heads(),
+            attn_backend=backend,
+        )
+        self._physical_blocks_per_logical_kv_block = 1
    def _nixl_handshake(
        self,
@@ -1040,7 +1022,6 @@ class NixlConnectorWorker:
        # Regardless, only handshake with the remote TP rank(s) that current
        # local rank will read from. Note that with homogeneous TP,
        # this happens to be the same single rank_i.
        assert self.kv_topo is not None
        p_remote_ranks = self.kv_topo.get_target_remote_ranks(remote_tp_size)
        remote_rank_to_agent_name = {}
        path = make_zmq_path("tcp", host, port)
@@ -1078,7 +1059,6 @@ class NixlConnectorWorker:
                )
                # Check compatibility hash BEFORE decoding agent metadata
                assert self.compat_hash is not None
                if (
                    self.enforce_compat_hash
                    and handshake_payload.compatibility_hash != self.compat_hash
@@ -1287,20 +1267,6 @@ class NixlConnectorWorker:
    def register_kv_caches(self, kv_caches: dict[str, torch.Tensor]):
        """Register the KV Cache data in nixl."""
        self.kv_topo = TpKVTopology(
            tp_rank=self.tp_rank,
            engine_id=self.engine_id,
            remote_tp_size=self._tp_size,  # shared state
            remote_block_size=self._block_size,  # shared state
            is_mla=self.use_mla,
            total_num_kv_heads=self.model_config.get_total_num_kv_heads(),
            attn_backend=self.attn_backend,
            tensor_shape=next(iter(kv_caches.values())).shape,
        )
        self.compat_hash = compute_nixl_compatibility_hash(
            self.vllm_config, self.backend_name, self.kv_topo.cross_layers_blocks
        )
        if self.use_host_buffer:
            self.initialize_host_xfer_buffer(kv_caches=kv_caches)
            assert len(self.host_xfer_buffers) == len(kv_caches), (
@@ -1335,21 +1301,29 @@ class NixlConnectorWorker:
        # (roughly 8KB vs 5KB).
        # Conversely for FlashInfer, K and V are registered in the same region
        # to better exploit the memory layout (ie num_blocks is the first dim).
        split_k_and_v = self.kv_topo.split_k_and_v
        tensor_size_bytes = None
        # TODO (NickLucche): Get kernel_block_size in a cleaner way
        # NHD default "view" for non-MLA cache
        if self.device_type == "cpu":
            block_size_position = -2
        else:
            block_size_position = -2 if self.use_mla else -3
        # Enable different block lengths for different layers when MLA is used.
        self.block_len_per_layer = list[int]()
        self.slot_size_per_layer = list[int]()  # HD bytes in kv terms
        for layer_name, cache_or_caches in xfer_buffers.items():
-            cache_list = (
-                cache_or_caches if self.kv_topo.split_k_and_v else [cache_or_caches]
-            )
+            cache_list = cache_or_caches if split_k_and_v else [cache_or_caches]
            for cache in cache_list:
                base_addr = cache.data_ptr()
                if base_addr in seen_base_addresses:
                    continue
-                kernel_block_size = cache.shape[self.kv_topo.block_size_position]
+                kernel_block_size = cache.shape[block_size_position]
                if self.block_size != kernel_block_size:
                    logger.info_once(
                        "User-specified logical block size (%s) does not match"
@@ -1411,7 +1385,6 @@ class NixlConnectorWorker:
        self.device_kv_caches = kv_caches
        self.dst_num_blocks[self.engine_id] = self.num_blocks
        if self.kv_topo.is_kv_layout_blocks_first:
            for i in range(len(self.slot_size_per_layer)):
                assert self.slot_size_per_layer[i] % 2 == 0
@@ -1467,7 +1440,6 @@ class NixlConnectorWorker:
            block_size=self.block_size,
        )
        # Wrap metadata in payload with hash for defensive decoding
        assert self.compat_hash is not None
        encoder = msgspec.msgpack.Encoder()
        self.xfer_handshake_metadata = NixlHandshakePayload(
            compatibility_hash=self.compat_hash,
@@ -1489,8 +1461,6 @@ class NixlConnectorWorker:
        register another local_xfer_handler using remote block len to ensure
        data copy correctness.
        """
        assert self.kv_topo is not None
        block_size_ratio = self.block_size // block_size
        blocks_data = []
        for i, base_addr in enumerate(self.seen_base_addresses):
@@ -1603,7 +1573,6 @@ class NixlConnectorWorker:
        # remote:               | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|
        # local origin:|          0|          1|          8|         12|
        # local mapped:| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
        assert self.kv_topo is not None
        block_size_ratio = self.kv_topo.block_size_ratio_from_engine_id(engine_id)
        if engine_id not in self.dst_num_blocks:
@@ -1731,10 +1700,7 @@ class NixlConnectorWorker:
        """
        remote_engine_id = nixl_agent_meta.engine_id
-        assert (
-            self._tp_size[remote_engine_id] == remote_tp_size
-            and self.kv_topo is not None
-        )
+        assert self._tp_size[remote_engine_id] == remote_tp_size
        tp_ratio = self.kv_topo.tp_ratio_from_engine_id(remote_engine_id)
        block_size_ratio = self.kv_topo.block_size_ratio_from_engine_id(
@@ -1871,7 +1837,6 @@ class NixlConnectorWorker:
        if len(self.device_kv_caches) == 0:
            return
        assert block_size_ratio >= 1, "Only nP < nD supported currently."
        assert self.kv_topo is not None
        if self.enable_permute_local_kv and block_size_ratio > 1:
            logger.debug(
                "Post-processing device kv cache on receive by converting "
@@ -1891,7 +1856,7 @@ class NixlConnectorWorker:
                block_size_ratio,
            )
-        split_k_and_v = self.kv_topo.split_k_and_v
+        split_k_and_v = not (self.use_mla or self.kv_topo.is_kv_layout_blocks_first)
        for block_ids in block_ids_list:
            indices = torch.tensor(block_ids, device=self.device_type, dtype=torch.long)
@@ -1916,7 +1881,6 @@ class NixlConnectorWorker:
        The scheduler process (via the MultiprocExecutor) will use this output
        to track which workers are done.
        """
        assert self.kv_topo is not None
        done_sending = self._get_new_notifs()
        done_recving = self._pop_done_transfers(self._recving_transfers)
@@ -1986,7 +1950,6 @@ class NixlConnectorWorker:
        are reading from the same producer (heterogeneous TP scenario), wait
        for all consumers to be done pulling.
        """
        assert self.kv_topo is not None
        notified_req_ids: set[str] = set()
        for notifs in self.nixl_wrapper.get_new_notifs().values():
            for notif in notifs:
@@ -2146,7 +2109,7 @@ class NixlConnectorWorker:
                self._reqs_to_send[req_id] = expiration_time
    def _read_blocks_for_req(self, req_id: str, meta: ReqMeta):
-        assert meta.remote is not None and self.kv_topo is not None
+        assert meta.remote is not None
        remote_ranks = self.kv_topo.get_target_remote_ranks_from_engine_id(
            meta.remote.engine_id
        )
@@ -2215,7 +2178,10 @@ class NixlConnectorWorker:
        local_xfer_side_handle: int,
        remote_xfer_side_handle: int,
    ):
-        assert self.kv_topo is not None
+        """
+        Post a READ point-to-point xfer request from a single local worker to
+        a single remote worker.
+        """
        block_size_ratio = self.kv_topo.block_size_ratio_from_engine_id(dst_engine_id)
        if block_size_ratio > 1:
            local_block_ids = self.get_mapped_blocks(
@@ -2448,7 +2414,6 @@ class NixlConnectorWorker:
        For FlashInfer, this is half the length of the whole block, as K and V
        share the same region.
        """
        assert self.kv_topo is not None
        if self.kv_topo.is_kv_layout_blocks_first:
            # For indexing only half (either just the K or V part).
            block_len = self.block_len_per_layer[layer_idx] // 2
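A side note on the removed `prefer_cross_layer_blocks` property in this file: it wrapped the config value in `bool(str(...))`, and in Python any non-empty string is truthy, so the string "False" also evaluates to `True`. The sketch below reproduces that expression next to a stricter parser. `parse_flag_buggy` and `parse_flag_strict` are hypothetical helper names used only for illustration; they are not part of vLLM.

```python
def parse_flag_buggy(extra_config: dict) -> bool:
    # Mirrors the removed expression: bool() of any non-empty string is True,
    # so "False" (and the default) both come back as True.
    return bool(str(extra_config.get("enable_cross_layers_blocks", "False")))


def parse_flag_strict(extra_config: dict) -> bool:
    # Hypothetical alternative: accept real booleans, otherwise compare the
    # text against an explicit list of "true" spellings.
    value = extra_config.get("enable_cross_layers_blocks", False)
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("1", "true", "yes")
```

With an empty config, the first helper reports the option as enabled while the second reports it as disabled.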
--- a/vllm/forward_context.py
+++ b/vllm/forward_context.py
@@ -271,17 +271,22 @@ def create_forward_context(
    additional_kwargs: dict[str, Any] | None = None,
    skip_compiled: bool = False,
 ):
-    no_compile_layers = vllm_config.compilation_config.static_forward_context
-    from vllm.model_executor.layers.fused_moe.layer import FusedMoE
-
-    remaining_moe_layers = [
-        name for name, layer in no_compile_layers.items() if isinstance(layer, FusedMoE)
-    ]
-    remaining_moe_layers.reverse()
+    if vllm_config.compilation_config.fast_moe_cold_start:
+        if vllm_config.speculative_config is None:
+            all_moe_layers = vllm_config.compilation_config.static_all_moe_layers
+        else:
+            logger.warning_once(
+                "vllm_config.compilation_config.fast_moe_cold_start is not "
+                "compatible with speculative decoding so we are ignoring "
+                "fast_moe_cold_start."
+            )
+            all_moe_layers = None
+    else:
+        all_moe_layers = None
    return ForwardContext(
-        no_compile_layers=no_compile_layers,
+        no_compile_layers=vllm_config.compilation_config.static_forward_context,
-        remaining_moe_layers=remaining_moe_layers,
+        all_moe_layers=all_moe_layers,
        virtual_engine=virtual_engine,
        attn_metadata=attn_metadata,
        slot_mapping=slot_mapping or {},
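The gating in `create_forward_context` can be read as a small decision function: the MoE layer list is passed along only when `fast_moe_cold_start` is on and no speculative config is present. The sketch below restates that rule with stand-in types. `CompilationConfigSketch` and `select_all_moe_layers` are hypothetical names for illustration, and the warning log is omitted.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class CompilationConfigSketch:
    # Hypothetical stand-in for the two compilation_config fields the hunk reads.
    fast_moe_cold_start: bool
    static_all_moe_layers: list = field(default_factory=list)


def select_all_moe_layers(
    compilation_config: CompilationConfigSketch,
    speculative_config: Optional[Any],
) -> Optional[list]:
    # Optimization switched off: nothing to pass along.
    if not compilation_config.fast_moe_cold_start:
        return None
    # Speculative decoding present: the optimization is ignored.
    if speculative_config is not None:
        return None
    return compilation_config.static_all_moe_layers
```

Returning `None` in both fallback cases matches the two `all_moe_layers = None` branches in the hunk.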
--- a/vllm/model_executor/layers/activation.py
+++ b/vllm/model_executor/layers/activation.py
@@ -17,11 +17,63 @@ from vllm.logger import init_logger
 from vllm.model_executor.custom_op import CustomOp
 from vllm.model_executor.utils import set_weight_attrs
 from vllm.platforms import current_platform
 from vllm.triton_utils import tl, triton
 from vllm.utils.collection_utils import LazyDict
 logger = init_logger(__name__)
@triton.jit
 def _swiglustep_and_mul_kernel(
    o_ptr,
    o_stride,
    x_ptr,
    x_stride,
    limit: tl.constexpr,
    d: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
 ) -> None:
    i = tl.program_id(axis=0).to(tl.int64)
    j = tl.program_id(axis=1)
    o_row_ptr = o_ptr + o_stride * i
    x_row_ptr = x_ptr + x_stride * i
    offsets = j * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < d
    gate = tl.load(x_row_ptr + offsets, mask=mask).to(tl.float32)
    up = tl.load(x_row_ptr + offsets + d, mask=mask).to(tl.float32)
    gate_silu = tl.sigmoid(gate) * gate
    gate_clamped = tl.minimum(gate_silu, limit)
    up_clamped = tl.minimum(tl.maximum(up, -limit), limit)
    result = gate_clamped * up_clamped
    result = result.to(x_ptr.dtype.element_ty)
    tl.store(o_row_ptr + offsets, result, mask=mask)
 def swiglustep_and_mul_triton(
    output: torch.Tensor, input: torch.Tensor, limit: float = 7.0
 ):
    b, n = input.shape
    assert input.ndim == 2
    assert n % 2 == 0
    d = n // 2
    def grid(meta):
        return (b, triton.cdiv(d, meta["BLOCK_SIZE"]))
    _swiglustep_and_mul_kernel[grid](
        output,
        output.stride(0),
        input,
        input.stride(0),
        limit=limit,
        d=d,
        BLOCK_SIZE=1024,
    )
 # --8<-- [start:fatrelu_and_mul]
@CustomOp.register("fatrelu_and_mul")
 class FatreluAndMul(CustomOp):
@@ -304,6 +356,44 @@ class SwigluOAIAndMul(CustomOp):
        return f"alpha={repr(self.alpha)}, limit={repr(self.limit)}"
 # --8<-- [start:swiglustep_and_mul]
@CustomOp.register("swiglustep_and_mul")
 class SwigluStepAndMul(CustomOp):
    """An activation function for SwiGLU with clamping.
    Computes x -> silu(x[:d]).clamp(max=limit) * x[d:].clamp(-limit, limit)
    where d = x.shape[-1] // 2.
    Shapes:
        x: (num_tokens, 2 * d) or (batch_size, seq_len, 2 * d)
        return: (num_tokens, d) or (batch_size, seq_len, d)
    """
    def __init__(self, limit: float = 7.0):
        super().__init__()
        if limit is None:
            raise ValueError("SwigluStepAndMul requires limit to be set.")
        self.limit = limit
    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        """PyTorch-native implementation equivalent to forward()."""
        gate, up = x.chunk(2, dim=-1)
        gate = F.silu(gate)
        gate = gate.clamp(max=self.limit)
        up = up.clamp(min=-self.limit, max=self.limit)
        return gate * up
    def forward_cuda(self, x: torch.Tensor) -> torch.Tensor:
        d = x.shape[-1] // 2
        output_shape = x.shape[:-1] + (d,)
        out = torch.empty(output_shape, dtype=x.dtype, device=x.device)
        swiglustep_and_mul_triton(out, x, self.limit)
        return out
    def extra_repr(self) -> str:
        return f"limit={repr(self.limit)}"
 # --8<-- [start:gelu_new]
@CustomOp.register("gelu_new")
 class NewGELU(CustomOp):
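For readers who want to check the `swiglustep` arithmetic without a GPU, here is a dependency-free sketch of the per-row computation done by `_swiglustep_and_mul_kernel` and `SwigluStepAndMul.forward_native`: SiLU on the gate half, clamped from above at `limit`, multiplied by the up half clamped to `[-limit, limit]`. The function names are hypothetical, and the sketch works on plain Python floats, so it ignores dtype casting.

```python
import math


def _sigmoid(x: float) -> float:
    # Numerically stable logistic function for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swiglustep_and_mul_reference(row: list, limit: float = 7.0) -> list:
    # row holds [gate_0..gate_{d-1}, up_0..up_{d-1}], as in the kernel.
    assert len(row) % 2 == 0
    d = len(row) // 2
    out = []
    for gate, up in zip(row[:d], row[d:]):
        gate_silu = _sigmoid(gate) * gate          # silu(gate)
        gate_clamped = min(gate_silu, limit)       # upper clamp only
        up_clamped = max(-limit, min(up, limit))   # symmetric clamp
        out.append(gate_clamped * up_clamped)
    return out
```

Note the asymmetry: the gate is clamped only from above, while the up projection is clamped on both sides.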
--- a/vllm/model_executor/layers/fused_moe/cutlass_moe.py
+++ b/vllm/model_executor/layers/fused_moe/cutlass_moe.py
@@ -649,7 +649,12 @@ class CutlassExpertsFp4(mk.FusedMoEPermuteExpertsUnpermute):
    @staticmethod
    def _supports_current_device() -> bool:
-        return current_platform.has_device_capability((10, 0))
+        p = current_platform
+        return p.is_cuda() and (
+            p.is_device_capability_family(100)
+            or p.is_device_capability_family(110)
+            or p.is_device_capability_family(120)
+        )
    @staticmethod
    def _supports_no_act_and_mul() -> bool:
--- a/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py
+++ b/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py
@@ -144,7 +144,7 @@ class DeepGemmExperts(mk.FusedMoEPermuteExpertsUnpermute):
    @staticmethod
    def _supports_activation(activation: str) -> bool:
-        return activation in ["silu"]
+        return activation in ["silu", "swiglustep"]
    @staticmethod
    def _supports_parallel_config(moe_parallel_config: FusedMoEParallelConfig) -> bool:
--- a/vllm/model_executor/layers/fused_moe/flashinfer_cutedsl_moe.py
+++ b/vllm/model_executor/layers/fused_moe/flashinfer_cutedsl_moe.py
@@ -54,7 +54,8 @@ class FlashInferCuteDSLExperts(mk.FusedMoEPermuteExpertsUnpermute):
    @staticmethod
    def _supports_current_device() -> bool:
-        return current_platform.is_device_capability_family(100)
+        p = current_platform
+        return p.is_cuda() and p.is_device_capability_family(100)
    @staticmethod
    def _supports_no_act_and_mul() -> bool:
--- a/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py
+++ b/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py
@@ -91,11 +91,14 @@ class FlashInferExperts(mk.FusedMoEPermuteExpertsUnpermute):
    @staticmethod
    def _supports_current_device() -> bool:
+        p = current_platform
        return (
-            current_platform.is_cuda()
+            p.is_cuda()
            and (
-                current_platform.is_device_capability((9, 0))
+                p.is_device_capability(90)
-                or current_platform.is_device_capability_family(100)
+                or p.is_device_capability_family(100)
+                or p.is_device_capability_family(110)
+                or p.is_device_capability_family(120)
            )
            and has_flashinfer_cutlass_fused_moe()
        )
@@ -109,29 +112,27 @@ class FlashInferExperts(mk.FusedMoEPermuteExpertsUnpermute):
        weight_key: QuantKey | None,
        activation_key: QuantKey | None,
    ) -> bool:
        # The following are supported by FlashInferExperts:
        #   * unquantized
        #   * fp8 static per-tensor on 9.0+
        #   * fp8 block on 9.0
        #   * nvfp4 on 10.0+
        p = current_platform
        scheme = (weight_key, activation_key)
        # The following are supported by FlashInferExperts:
        return (
            # unquantized and fp8 static per-tensor on 9.0+
            (
                scheme
                in [
                    (None, None),
                    (kFp8StaticTensorSym, kFp8StaticTensorSym),
                ]
                and p.has_device_capability(90)
            )
            # fp8 block-scale on 9.0
            or (
-                (scheme == (kFp8Static128BlockSym, kFp8Dynamic128Sym))
+                scheme == (kFp8Static128BlockSym, kFp8Dynamic128Sym)
-                and (p.is_device_capability((9, 0)))
+                and p.is_device_capability(90)
            )
            # nvfp4 on 10.0+
            or (
-                (scheme == (kNvfp4Static, kNvfp4Dynamic))
-                and (p.is_device_capability_family(100))
+                scheme == (kNvfp4Static, kNvfp4Dynamic) and p.has_device_capability(100)
            )
        )
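The hunks above mix three capability predicates, and the difference matters for the nvfp4 branch, which moves from `is_device_capability_family(100)` to `has_device_capability(100)`. The sketch below models them on plain integers (major * 10 + minor). The semantics are an assumption inferred from the method names and from how the diff uses them (at-least, exact match, same major version); these free functions are hypothetical and are not the `current_platform` API.

```python
def has_device_capability(current: int, required: int) -> bool:
    # Assumed semantics: current capability is at least the required one.
    return current >= required


def is_device_capability(current: int, target: int) -> bool:
    # Assumed semantics: exact match only.
    return current == target


def is_device_capability_family(current: int, family: int) -> bool:
    # Assumed semantics: same major version, any minor version.
    return current // 10 == family // 10


# Which of 9.0, 10.0, 10.3 and 12.0 pass each style of check.
for cap in (90, 100, 103, 120):
    print(
        cap,
        has_device_capability(cap, 100),
        is_device_capability(cap, 90),
        is_device_capability_family(cap, 100),
    )
```

Under these assumptions a 12.0 device fails a family-100 check but passes an at-least-100 check, which is consistent with the explicit 110 and 120 family entries added to `_supports_current_device`.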
--- a/vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py
+++ b/vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py
@@ -30,7 +30,6 @@ from vllm.utils.torch_utils import direct_register_custom_op
 def _supports_current_device() -> bool:
    """Supports only Blackwell-family GPUs."""
    p = current_platform
    # Add check flashinfer trtllm is available
    return p.is_cuda() and p.is_device_capability_family(100)
--- a/vllm/model_executor/layers/fused_moe/fused_batched_moe.py
+++ b/vllm/model_executor/layers/fused_moe/fused_batched_moe.py
@@ -927,6 +927,7 @@ class BatchedTritonExperts(mk.FusedMoEPermuteExpertsUnpermute):
        SUPPORTED_W_A_FP8 = [
            (kFp8Static128BlockSym, kFp8Dynamic128Sym),
            (kFp8StaticChannelSym, kFp8DynamicTokenSym),
            (kFp8StaticTensorSym, kFp8DynamicTokenSym),
            (kFp8StaticTensorSym, kFp8StaticTensorSym),
            (kFp8StaticTensorSym, kFp8DynamicTensorSym),
        ]
--- a/vllm/model_executor/layers/fused_moe/fused_moe.py
+++ b/vllm/model_executor/layers/fused_moe/fused_moe.py
@@ -45,6 +45,7 @@ from vllm.model_executor.layers.quantization.utils.ocp_mx_utils import OCP_MX_Sc
 from vllm.model_executor.layers.quantization.utils.quant_utils import (
    QuantKey,
    kFp8Dynamic128Sym,
    kFp8DynamicTensorSym,
    kFp8DynamicTokenSym,
    kFp8Static128BlockSym,
    kFp8StaticChannelSym,
@@ -1942,12 +1943,13 @@ class TritonExperts(mk.FusedMoEPermuteExpertsUnpermute):
            (kFp8StaticChannelSym, kFp8DynamicTokenSym),
            (kFp8StaticTensorSym, kFp8DynamicTokenSym),
            (kFp8StaticTensorSym, kFp8StaticTensorSym),
            (kFp8StaticTensorSym, kFp8DynamicTensorSym),
        ]
        return (weight_key, activation_key) in SUPPORTED_W_A
    @staticmethod
    def _supports_activation(activation: str) -> bool:
-        return activation in ["silu", "gelu", "swigluoai"]
+        return activation in ["silu", "gelu", "swigluoai", "swiglustep"]
    @staticmethod
    def _supports_parallel_config(moe_parallel_config: FusedMoEParallelConfig) -> bool:
--- a/vllm/model_executor/layers/fused_moe/utils.py
+++ b/vllm/model_executor/layers/fused_moe/utils.py
@@ -358,6 +358,11 @@ def apply_moe_activation(
        torch.ops._C.gelu_and_mul(output, input)
    elif activation == "swigluoai":
        torch.ops._C.swigluoai_and_mul(output, input)
    elif activation == "swiglustep":
        from vllm.model_executor.layers.activation import swiglustep_and_mul_triton
        swiglustep_and_mul_triton(output, input)
    # Activations without gated multiplication
    elif activation == SILU_NO_MUL:
        output.copy_(F.silu(input))
--- a/vllm/model_executor/layers/quantization/kernels/scaled_mm/rocm.py
+++ b/vllm/model_executor/layers/quantization/kernels/scaled_mm/rocm.py
@@ -28,6 +28,7 @@ def rocm_per_tensor_float_w8a8_scaled_mm_impl(
        A.shape[0] == 1
        and B.shape[1] % 16 == 0
        and ((bias is None) or (bias.dtype == out_dtype))
        and A.is_contiguous()
    ):
        output = ops.wvSplitKQ(
            B.t(),
--- a/vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py
+++ b/vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py
@@ -6,7 +6,6 @@ from typing import TYPE_CHECKING
 import torch
 import vllm.envs as envs
 import vllm.model_executor.layers.fused_moe.modular_kernel as mk
 from vllm import _custom_ops as ops
 from vllm.logger import init_logger
@@ -25,10 +24,6 @@ from vllm.model_executor.layers.quantization.utils.quant_utils import (
    swizzle_blockscale,
 )
 from vllm.platforms import current_platform
 from vllm.utils.flashinfer import (
    has_flashinfer_cutedsl_grouped_gemm_nt_masked,
    has_flashinfer_cutlass_fused_moe,
 )
 if TYPE_CHECKING:
    from vllm.model_executor.layers.fused_moe.oracle.nvfp4 import (
@@ -39,8 +34,6 @@ logger = init_logger(__name__)
 __all__ = [
    "is_flashinfer_fp4_cutlass_moe_available",
    "is_flashinfer_fp4_cutedsl_moe_available",
    "reorder_w1w3_to_w3w1",
    "build_flashinfer_fp4_cutlass_moe_prepare_finalize",
 ]
@@ -126,26 +119,6 @@ def is_supported_config_trtllm(
    return True, None
 def is_flashinfer_fp4_cutlass_moe_available() -> bool:
    """Return `True` when FlashInfer CUTLASS NV-FP4 kernels can be used."""
    return (
        envs.VLLM_USE_FLASHINFER_MOE_FP4
        and has_flashinfer_cutlass_fused_moe()
        and current_platform.is_cuda()
        and current_platform.has_device_capability(100)
    )
 def is_flashinfer_fp4_cutedsl_moe_available() -> bool:
    """Return ``True`` when FlashInfer CUTEDSL NV-FP4 kernels can be used."""
    return (
        envs.VLLM_USE_FLASHINFER_MOE_FP4
        and has_flashinfer_cutedsl_grouped_gemm_nt_masked()
        and current_platform.is_cuda()
        and current_platform.is_device_capability_family(100)
    )
 def reorder_w1w3_to_w3w1(
    weight: torch.Tensor, scale: torch.Tensor, dim: int = -2
 ) -> tuple[torch.Tensor, torch.Tensor]:
--- a/vllm/model_executor/layers/quantization/utils/nvfp4_moe_support.py
+++ b/vllm/model_executor/layers/quantization/utils/nvfp4_moe_support.py
@@ -1,67 +0,0 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 from dataclasses import dataclass
 import vllm.envs as envs
 from vllm.logger import init_logger
 from vllm.model_executor.layers.quantization.utils.flashinfer_fp4_moe import (
    is_flashinfer_fp4_cutedsl_moe_available,
    is_flashinfer_fp4_cutlass_moe_available,
 )
 from vllm.model_executor.layers.quantization.utils.marlin_utils_fp4 import (
    is_fp4_marlin_supported,
 )
 from vllm.model_executor.layers.quantization.utils.quant_utils import (
    cutlass_fp4_supported,
 )
 __all__ = ["detect_nvfp4_moe_support", "NvFp4Support"]
 _logger = init_logger(__name__)
@dataclass(frozen=True)
 class NvFp4Support:
    """Result container for NV-FP4 capability probing."""
    cutlass_supported: bool
    allow_flashinfer: bool
    use_marlin: bool
 def detect_nvfp4_moe_support(class_name: str = "") -> NvFp4Support:
    """Detect platform support for NV-FP4 fused-MoE path"""
    cutlass_supported = cutlass_fp4_supported()
    allow_flashinfer = cutlass_supported and (
        is_flashinfer_fp4_cutlass_moe_available()
        or is_flashinfer_fp4_cutedsl_moe_available()
    )
    if allow_flashinfer:
        _logger.info_once(
            "Using FlashInfer kernels for %s.", class_name or "NVFP4 path"
        )
    else:
        if envs.VLLM_USE_FLASHINFER_MOE_FP4:
            _logger.warning_once(
                "FlashInfer kernels unavailable for %s on current platform.",
                class_name or "NVFP4 path",
            )
    use_marlin = False
    if not cutlass_supported:
        if is_fp4_marlin_supported():
            use_marlin = True
            _logger.info_once("Falling back to Marlin FP4 MoE kernel.")
        else:
            raise ValueError(
                "Current platform does not support NVFP4 quantization. "
                "Please use Blackwell GPUs or enable FlashInfer."
            )
    return NvFp4Support(
        cutlass_supported=cutlass_supported,
        allow_flashinfer=allow_flashinfer,
        use_marlin=use_marlin,
    )
--- a/vllm/model_executor/layers/utils.py
+++ b/vllm/model_executor/layers/utils.py
@@ -146,6 +146,7 @@ def rocm_unquantized_gemm_impl(
            and n <= 128
            and k > 512
            and math.ceil(k / 512) * math.ceil(m / 16) < get_cu_count()
            and x.is_contiguous()
        )
        # k == 2880 and (m == 640 or m == 128))
    )
@@ -165,6 +166,7 @@ def rocm_unquantized_gemm_impl(
        and on_gfx9()
        and x.dtype in [torch.float16, torch.bfloat16]
        and k % 8 == 0
        and x.is_contiguous()
    )
    if use_skinny is not True:
--- a/vllm/model_executor/models/adapters.py
+++ b/vllm/model_executor/models/adapters.py
@@ -466,6 +466,7 @@ def load_weights_using_from_2_way_softmax(
    language_model = _get_language_model_for_seq_cls(model)
    is_vlm = language_model is not model
    using_vlm_head = is_vlm and hasattr(language_model, "score")
    language_model.lm_head = ParallelLMHead(
        text_config.vocab_size, text_config.hidden_size, quant_config=quant_config
@@ -506,14 +507,16 @@ def load_weights_using_from_2_way_softmax(
        torch.float32
    ) - lm_head_weight.data[[false_id]].to(torch.float32)
-    score_layer = language_model.score if is_vlm else model.score
+    score_layer = language_model.score if using_vlm_head else model.score
    param = score_layer.weight
    weight_loader = getattr(param, "weight_loader", default_weight_loader)
    weight_loader(param, score_weight)
    del language_model.lm_head
-    score_weight_name = "language_model.score.weight" if is_vlm else "score.weight"
+    score_weight_name = (
+        "language_model.score.weight" if using_vlm_head else "score.weight"
+    )
    loaded_weights.add(score_weight_name)
    lm_head_name = "lm_head.weight"
@@ -537,6 +540,7 @@ def load_weights_no_post_processing(model, weights: Iterable[tuple[str, torch.Te
    language_model = _get_language_model_for_seq_cls(model)
    is_vlm = language_model is not model
    using_vlm_head = is_vlm and hasattr(language_model, "score")
    language_model.lm_head = ParallelLMHead(
        text_config.vocab_size, text_config.hidden_size, quant_config=quant_config
@@ -572,14 +576,16 @@ def load_weights_no_post_processing(model, weights: Iterable[tuple[str, torch.Te
    token_ids = [tokenizer.convert_tokens_to_ids(t) for t in tokens]
    score_weight = language_model.lm_head.weight.data[token_ids]
-    score_layer = language_model.score if is_vlm else model.score
+    score_layer = language_model.score if using_vlm_head else model.score
    param = score_layer.weight
    weight_loader = getattr(param, "weight_loader", default_weight_loader)
    weight_loader(param, score_weight)
    del language_model.lm_head
-    score_weight_name = "language_model.score.weight" if is_vlm else "score.weight"
+    score_weight_name = (
+        "language_model.score.weight" if using_vlm_head else "score.weight"
+    )
    loaded_weights.add(score_weight_name)
    lm_head_name = "lm_head.weight"
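The `using_vlm_head` change in both loaders amounts to one rule: use the inner language model's `score` layer only when the model is a wrapper and the inner model actually has that attribute; otherwise fall back to the outer model's `score`. A minimal sketch with stand-in objects follows. `pick_score_layer` is a hypothetical helper, not a function in `adapters.py`.

```python
from types import SimpleNamespace


def pick_score_layer(model, language_model):
    # Same two flags as in the diff.
    is_vlm = language_model is not model
    using_vlm_head = is_vlm and hasattr(language_model, "score")
    score_layer = language_model.score if using_vlm_head else model.score
    weight_name = (
        "language_model.score.weight" if using_vlm_head else "score.weight"
    )
    return score_layer, weight_name


# Wrapper model whose inner language model has no `score` attribute:
inner = SimpleNamespace()
outer = SimpleNamespace(score="outer-score", language_model=inner)
print(pick_score_layer(outer, inner))
```

Before the change, the wrapper case above would have tried `language_model.score` and raised `AttributeError`.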
--- a/vllm/model_executor/models/nemotron_parse.py
+++ b/vllm/model_executor/models/nemotron_parse.py
@@ -11,7 +11,6 @@ import math
 from collections.abc import Iterable, Mapping, Sequence
 from typing import Annotated, Literal
 import cv2
 import numpy as np
 import torch
 import torch.nn as nn
@@ -416,6 +415,8 @@ class NemotronParseImageProcessor:
        else:
            self.target_height = self.target_width = int(self.final_size)
        import cv2
        self.transform = A.Compose(
            [
                A.PadIfNeeded(
@@ -457,6 +458,8 @@ class NemotronParseImageProcessor:
            new_height = int(new_width / aspect_ratio)
        # Use cv2.INTER_LINEAR like the original
        import cv2
        return cv2.resize(
            image, (new_width, new_height), interpolation=cv2.INTER_LINEAR
        )
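The `nemotron_parse.py` hunks move `import cv2` from module level into the methods that use it, so importing the module no longer requires OpenCV. The pattern is sketched below with a standard-library module standing in for the heavy dependency; `scaled_size` is a hypothetical function used only to show the deferred import.

```python
def scaled_size(width: int, height: int, scale: str) -> tuple:
    # Deferred import: the module is loaded on first call, not when this
    # file is imported. `fractions` stands in for an optional dependency
    # such as cv2.
    from fractions import Fraction

    factor = Fraction(scale)
    return int(width * factor), int(height * factor)


print(scaled_size(640, 480, "1/2"))
```

Repeated calls are cheap because Python caches the module in `sys.modules` after the first import.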
--- a/vllm/model_executor/models/registry.py
+++ b/vllm/model_executor/models/registry.py
@@ -188,6 +188,7 @@ _TEXT_GENERATION_MODELS = {
    "SeedOssForCausalLM": ("seed_oss", "SeedOssForCausalLM"),
    "Step1ForCausalLM": ("step1", "Step1ForCausalLM"),
    "Step3TextForCausalLM": ("step3_text", "Step3TextForCausalLM"),
    "Step3p5ForCausalLM": ("step3p5", "Step3p5ForCausalLM"),
    "StableLMEpochForCausalLM": ("stablelm", "StablelmForCausalLM"),
    "StableLmForCausalLM": ("stablelm", "StablelmForCausalLM"),
    "Starcoder2ForCausalLM": ("starcoder2", "Starcoder2ForCausalLM"),
@@ -476,6 +477,7 @@ _SPECULATIVE_DECODING_MODELS = {
    "MedusaModel": ("medusa", "Medusa"),
    "OpenPanguMTPModel": ("openpangu_mtp", "OpenPanguMTP"),
    "Qwen3NextMTP": ("qwen3_next_mtp", "Qwen3NextMTP"),
    "Step3p5MTP": ("step3p5_mtp", "Step3p5MTP"),
    # Temporarily disabled.
    # # TODO(woosuk): Re-enable this once the MLP Speculator is supported in V1.
    # "MLPSpeculatorPreTrainedModel": ("mlp_speculator", "MLPSpeculator"),
--- a/vllm/model_executor/models/step3p5.py
+++ b/vllm/model_executor/models/step3p5.py
@@ -0,0 +1,894 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 """Inference-only Jurassic model."""
 from collections.abc import Iterable
 from typing import Any
 import torch
 from torch import nn
 from torch.nn.parameter import Parameter
 from vllm.attention.layer import Attention
 from vllm.compilation.decorators import support_torch_compile
 from vllm.config import CacheConfig, ModelConfig, VllmConfig
 from vllm.distributed import (
    get_dp_group,
    get_ep_group,
    get_pp_group,
    get_tensor_model_parallel_rank,
    get_tensor_model_parallel_world_size,
    get_tp_group,
 )
 from vllm.logger import init_logger
 from vllm.model_executor.layers.activation import SiluAndMul, SwigluStepAndMul
 from vllm.model_executor.layers.fused_moe import FusedMoE
 from vllm.model_executor.layers.fused_moe.shared_fused_moe import SharedFusedMoE
 from vllm.model_executor.layers.layernorm import GemmaRMSNorm
 from vllm.model_executor.layers.linear import (
    ColumnParallelLinear,
    MergedColumnParallelLinear,
    QKVParallelLinear,
    ReplicatedLinear,
    RowParallelLinear,
 )
 from vllm.model_executor.layers.logits_processor import LogitsProcessor
 from vllm.model_executor.layers.quantization.base_config import QuantizationConfig
 from vllm.model_executor.layers.rotary_embedding import get_rope
 from vllm.model_executor.layers.vocab_parallel_embedding import (
    DEFAULT_VOCAB_PADDING_SIZE,
    ParallelLMHead,
    VocabParallelEmbedding,
 )
 from vllm.model_executor.model_loader.weight_utils import default_weight_loader
 from vllm.sequence import IntermediateTensors
 from vllm.v1.attention.backend import AttentionType
 from .interfaces import MixtureOfExperts, SupportsPP
 from .utils import (
    AutoWeightsLoader,
    PPMissingLayer,
    WeightsMapper,
    extract_layer_index,
    is_pp_missing_parameter,
    make_empty_intermediate_tensors_factory,
    make_layers,
    maybe_prefix,
 )
 logger = init_logger(__name__)
 class FP32ReplicatedLinear(ReplicatedLinear):
    """
    Use FP32 for higher precision.
    """
    def forward(
        self,
        x: torch.Tensor,
    ) -> torch.Tensor | tuple[torch.Tensor, Parameter | None]:
        assert self.params_dtype == torch.float32
        return super().forward(x.to(torch.float32))
 class Step3p5MLP(nn.Module):
    def __init__(
        self,
        config: ModelConfig,
        hidden_size: int,
        intermediate_size: int,
        hidden_act: str,
        quant_config: QuantizationConfig | None = None,
        reduce_results: bool = True,
        prefix: str = "",
    ) -> None:
        super().__init__()
        self.gate_up_proj = MergedColumnParallelLinear(
            hidden_size,
            [intermediate_size] * 2,
            bias=False,
            quant_config=quant_config,
            prefix=f"{prefix}.gate_up_proj",
        )
        self.down_proj = RowParallelLinear(
            intermediate_size,
            hidden_size,
            bias=False,
            quant_config=quant_config,
            reduce_results=reduce_results,
            prefix=f"{prefix}.down_proj",
        )
        if hidden_act != "silu":
            raise ValueError(
                f"Unsupported activation: {hidden_act}. Only silu is supported for now."
            )
        self.act_fn = SiluAndMul()
        self.prefix = prefix
        self.hidden_size = hidden_size
        self.limit = None
        layer_idx = extract_layer_index(prefix)
        if (
            config.swiglu_limits_shared
            and config.swiglu_limits_shared[layer_idx] is not None
            and config.swiglu_limits_shared[layer_idx] != 0
        ):
            self.limit = config.swiglu_limits_shared[layer_idx]
            self.act_fn = SwigluStepAndMul(limit=self.limit)
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        gate_up, _ = self.gate_up_proj(hidden_states)
        intermediate_act = self.act_fn(gate_up)
        output, _ = self.down_proj(intermediate_act)
        return output
 class Step3p5Attention(nn.Module):
    def __init__(
        self,
        hidden_size: int,
        num_heads: int,
        num_kv_heads: int,
        max_position: int = 4096 * 32,
        head_dim: int | None = None,
        rms_norm_eps: float = 1e-06,
        qkv_bias: bool = False,
        rope_theta: float | list[float] | None = 10000,
        cache_config: CacheConfig | None = None,
        quant_config: QuantizationConfig | None = None,
        rope_scaling: dict[str, Any] | None = None,
        prefix: str = "",
        attn_type: str = AttentionType.DECODER,
        # Step3p5 specific args
        sliding_window: int | None = None,
        use_head_wise_attn_gate: bool = False,
        layer_types: list | None = None,
        use_rope_layers: list | None = None,
        yarn_only_types: list | None = None,
        swa_num_attention_heads: int | None = None,
        partial_rotary_factor: float = 1.0,
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.total_num_heads = num_heads
        tp_size = get_tensor_model_parallel_world_size()
        self.layer_idx = extract_layer_index(prefix)
        if layer_types:
            enable_sliding_window = layer_types[self.layer_idx] == "sliding_attention"
        else:
            enable_sliding_window = self.layer_idx % 2 == 0
        if yarn_only_types and layer_types[self.layer_idx] not in yarn_only_types:
            rope_scaling = None
        if sliding_window is not None and enable_sliding_window:
            if swa_num_attention_heads is not None:
                num_heads = swa_num_attention_heads
                self.total_num_heads = swa_num_attention_heads
        else:
            sliding_window = None
        if isinstance(rope_theta, list):
            rope_theta = rope_theta[self.layer_idx]
        self.rank = get_tensor_model_parallel_rank()
        self.partial_rotary_factor = partial_rotary_factor
        assert self.total_num_heads % tp_size == 0
        self.num_heads = self.total_num_heads // tp_size
        self.total_num_kv_heads = num_kv_heads
        if self.total_num_kv_heads >= tp_size:
            # Number of KV heads is greater than or equal to TP size, so we
            # partition the KV heads across multiple tensor parallel GPUs.
            assert self.total_num_kv_heads % tp_size == 0
        else:
            # Number of KV heads is less than TP size, so we replicate
            # the KV heads across multiple tensor parallel GPUs.
            assert tp_size % self.total_num_kv_heads == 0
        self.num_kv_heads = max(1, self.total_num_kv_heads // tp_size)
        self.head_dim = head_dim or hidden_size // self.total_num_heads
        self.q_size = self.num_heads * self.head_dim
        self.kv_size = self.num_kv_heads * self.head_dim
        self.scaling = self.head_dim**-0.5
        self.rope_theta = rope_theta
        self.qkv_proj = QKVParallelLinear(
            hidden_size,
            self.head_dim,
            self.total_num_heads,
            self.total_num_kv_heads,
            bias=qkv_bias,
            quant_config=quant_config,
            prefix=f"{prefix}.qkv_proj",
        )
        self.o_proj = RowParallelLinear(
            self.total_num_heads * self.head_dim,
            hidden_size,
            bias=False,
            quant_config=quant_config,
            prefix=f"{prefix}.o_proj",
        )
        if rope_scaling is not None and not isinstance(rope_scaling, dict):
            raise ValueError("rope_scaling must be a dict for Step3p5Attention.")
        rope_parameters: dict[str, Any] = (
            dict(rope_scaling) if rope_scaling is not None else {}
        )
        rope_parameters.setdefault("rope_type", "default")
        rope_parameters["rope_theta"] = self.rope_theta
        rope_parameters["partial_rotary_factor"] = partial_rotary_factor
        self.rotary_emb = get_rope(
            head_size=self.head_dim,
            max_position=max_position,
            rope_parameters=rope_parameters,
        )
        self.q_norm = GemmaRMSNorm(self.head_dim, rms_norm_eps)
        self.k_norm = GemmaRMSNorm(self.head_dim, rms_norm_eps)
        self.use_head_wise_attn_gate = use_head_wise_attn_gate
        if use_head_wise_attn_gate:
            self.g_proj = ColumnParallelLinear(
                hidden_size,
                self.total_num_heads,
                bias=False,
                prefix=f"{prefix}.g_proj",
            )
        self.use_rope = True
        if use_rope_layers:
            self.use_rope = use_rope_layers[self.layer_idx]
        self.attn = Attention(
            self.num_heads,
            self.head_dim,
            self.scaling,
            num_kv_heads=self.num_kv_heads,
            cache_config=cache_config,
            quant_config=quant_config,
            prefix=f"{prefix}.attn",
            per_layer_sliding_window=sliding_window,
            attn_type=attn_type,
        )
        self.max_position_embeddings = max_position
        assert self.partial_rotary_factor == 1 or self.partial_rotary_factor == 0.5
        self.rotary_dim = (
            self.head_dim if self.partial_rotary_factor == 1 else self.head_dim // 2
        )
    def forward(
        self,
        positions: torch.Tensor,
        hidden_states: torch.Tensor,
    ) -> torch.Tensor:
        qkv, _ = self.qkv_proj(hidden_states)
        q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
        # Add qk-norm inline similar to Qwen3 MOE attention
        q_by_head = q.view(*q.shape[:-1], q.shape[-1] // self.head_dim, self.head_dim)
        q_by_head = self.q_norm(q_by_head.contiguous())
        q = q_by_head.view(q.shape)
        k_by_head = k.view(*k.shape[:-1], k.shape[-1] // self.head_dim, self.head_dim)
        k_by_head = self.k_norm(k_by_head.contiguous())
        k = k_by_head.view(k.shape)
        if self.use_rope:
            q, k = self.rotary_emb(positions, q, k)
        attn_output = self.attn(q, k, v)
        if self.use_head_wise_attn_gate:
            extra_dims, _ = self.g_proj(hidden_states)
            output = (
                attn_output.view(*attn_output.shape[:-1], self.num_heads, self.head_dim)
                * extra_dims.unsqueeze(-1).sigmoid()
            )
            attn_output = output.view(*attn_output.shape)
        output, _ = self.o_proj(attn_output)
        return output
 class FusedMoEBlock(nn.Module):
    def __init__(
        self,
        vllm_config: VllmConfig,
        prefix: str = "",
    ):
        super().__init__()
        self.tp_size = get_tensor_model_parallel_world_size()
        self.layer_idx = extract_layer_index(prefix)
        self.ep_size = get_ep_group().device_group.size()
        self.ep_rank = get_ep_group().device_group.rank()
        config = vllm_config.model_config.hf_config
        quant_config = vllm_config.quant_config
        parallel_config = vllm_config.parallel_config
        self.hidden_size = config.hidden_size
        self.enable_eplb = parallel_config.enable_eplb
        self.n_routed_experts = config.moe_num_experts
        self.n_logical_experts = self.n_routed_experts
        self.n_redundant_experts = parallel_config.eplb_config.num_redundant_experts
        self.n_physical_experts = self.n_logical_experts + self.n_redundant_experts
        self.n_local_physical_experts = self.n_physical_experts // self.ep_size
        self.physical_expert_start = self.ep_rank * self.n_local_physical_experts
        self.physical_expert_end = (
            self.physical_expert_start + self.n_local_physical_experts
        )
        if self.tp_size > config.moe_num_experts:
            raise ValueError(
                f"Tensor parallel size {self.tp_size} is greater than "
                f"the number of experts {config.moe_num_experts}."
            )
        self.gate = FP32ReplicatedLinear(
            config.hidden_size,
            config.moe_num_experts,
            bias=False,
            quant_config=None,
            params_dtype=torch.float32,  # Use FP32 for higher precision.
            prefix=f"{prefix}.gate",
        )
        self.use_moe_router_bias = config.use_moe_router_bias
        assert self.use_moe_router_bias, "Only use_moe_router_bias=True is supported."
        self.routed_scaling_factor = config.moe_router_scaling_factor
        self.router_bias = nn.Parameter(
            torch.zeros(config.moe_num_experts, dtype=torch.float32),
            requires_grad=False,
        )
        self.need_fp32_gate = config.need_fp32_gate
        assert self.need_fp32_gate, (
            "Router logits must use FP32 precision for numerical stability."
        )
        activation = "silu"
        swiglu_limits = config.swiglu_limits or []
        swiglu_limit = (
            swiglu_limits[self.layer_idx]
            if self.layer_idx < len(swiglu_limits)
            else None
        )
        if swiglu_limit not in (None, 0):
            swiglu_limit = float(swiglu_limit)
            assert swiglu_limit == 7.0, (
                "Swiglu limit in fused moe block only supports 7.0 now."
            )
            activation = "swiglustep"
            logger.debug(
                "step3p5 layer_idx: %s, activation: %s, limit: %s",
                self.layer_idx,
                activation,
                swiglu_limit,
            )
        self.share_expert = Step3p5MLP(
            config=config,
            hidden_size=self.hidden_size,
            intermediate_size=config.share_expert_dim,
            hidden_act="silu",
            reduce_results=False,
            quant_config=quant_config,
            prefix=f"{prefix}.share_expert",
        )
        self.experts = SharedFusedMoE(
            shared_experts=self.share_expert,
            gate=self.gate,
            num_experts=config.moe_num_experts,
            top_k=config.moe_top_k,
            hidden_size=config.hidden_size,
            intermediate_size=config.moe_intermediate_size,
            reduce_results=False,
            renormalize=config.norm_expert_weight,
            quant_config=quant_config,
            activation=activation,
            prefix=f"{prefix}.experts",
            scoring_func=getattr(config, "moe_router_activation", "sigmoid"),
            e_score_correction_bias=self.router_bias,
            routed_scaling_factor=config.moe_router_scaling_factor,
            enable_eplb=self.enable_eplb,
            num_redundant_experts=self.n_redundant_experts,
        )
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        num_tokens, hidden_dim = hidden_states.shape
        hidden_states = hidden_states.view(-1, hidden_dim)
        if self.experts.is_internal_router:
            # In this case, the gate/router runs inside the FusedMoE class
            fused_moe_out = self.experts(
                hidden_states=hidden_states, router_logits=hidden_states
            )
        else:
            # router_logits: (num_tokens, n_experts)
            router_logits, _ = self.gate(hidden_states)
            fused_moe_out = self.experts(
                hidden_states=hidden_states, router_logits=router_logits
            )
        shared_output, final_hidden_states = fused_moe_out
        if self.share_expert is None:
            assert shared_output is None
        else:
            assert shared_output is not None
            final_hidden_states += shared_output
        if self.tp_size > 1:
            final_hidden_states = self.experts.maybe_all_reduce_tensor_model_parallel(
                final_hidden_states
            )
        return final_hidden_states.view(num_tokens, hidden_dim)
 class Step3p5DecoderLayer(nn.Module):
    def __init__(
        self,
        vllm_config: VllmConfig,
        prefix: str = "",
    ) -> None:
        super().__init__()
        config = vllm_config.model_config.hf_config
        self.hidden_size = config.hidden_size
        layer_idx = extract_layer_index(prefix)
        self.layer_idx = layer_idx
        cache_config = vllm_config.cache_config
        quant_config = vllm_config.quant_config
        if cache_config is not None:
            cache_config.sliding_window = None
        if config.att_impl_type == "GQA":
            num_attention_heads = None
            num_attention_groups = None
            head_dim = None
            if (
                getattr(config, "attention_other_setting", None)
                and getattr(config, "layer_types", [])
                and config.layer_types[layer_idx]
                == config.attention_other_setting["attention_type"]
            ):
                num_attention_heads = config.attention_other_setting[
                    "num_attention_heads"
                ]
                num_attention_groups = config.attention_other_setting[
                    "num_attention_groups"
                ]
                head_dim = config.attention_other_setting["head_dim"]
            partial_rotary_factors = getattr(config, "partial_rotary_factors", [])
            self.self_attn = Step3p5Attention(
                hidden_size=self.hidden_size,
                num_heads=num_attention_heads
                if num_attention_heads
                else config.num_attention_heads,
                max_position=config.max_position_embeddings,
                num_kv_heads=num_attention_groups
                if num_attention_groups
                else config.num_attention_groups,
                rope_theta=config.rope_theta,
                rms_norm_eps=config.rms_norm_eps,
                qkv_bias=getattr(config, "attention_bias", False),
                head_dim=head_dim if head_dim else getattr(config, "head_dim", None),
                cache_config=cache_config,
                quant_config=quant_config,
                rope_scaling=getattr(config, "rope_scaling", None),
                sliding_window=getattr(config, "sliding_window", None),
                use_head_wise_attn_gate=getattr(
                    config, "use_head_wise_attn_gate", False
                ),
                layer_types=getattr(config, "layer_types", []),
                use_rope_layers=getattr(config, "use_rope_layers", []),
                yarn_only_types=getattr(config, "yarn_only_types", []),
                partial_rotary_factor=partial_rotary_factors[layer_idx]
                if partial_rotary_factors
                else 1.0,
                prefix=f"{prefix}.self_attn",
            )
        else:
            raise ValueError(
                f"Unsupported attention implementation: {config.att_impl_type}"
            )
        self.use_moe = False
        self.tp_group = get_tp_group()
        self.use_fused_all_reduce = (
            get_tensor_model_parallel_world_size() > 1
            and get_dp_group().world_size == 1
        )
        if self.use_fused_all_reduce:
            logger.warning_once("Enable custom fused all reduce...")
        else:
            logger.warning_once("Disable custom fused all reduce...")
        moe_layers_enum = getattr(config, "moe_layers_enum", None)
        if moe_layers_enum is not None:
            moe_layers_idx = [int(i) for i in moe_layers_enum.strip().split(",")]
        else:
            moe_layers_idx = list(range(1, config.num_hidden_layers))
        if layer_idx in moe_layers_idx:
            self.moe = FusedMoEBlock(
                vllm_config,
                prefix=f"{prefix}.moe",
            )
            self.use_moe = True
        else:
            self.mlp = Step3p5MLP(
                config=config,
                hidden_size=config.hidden_size,
                intermediate_size=config.intermediate_size,
                hidden_act="silu",
                quant_config=quant_config,
                reduce_results=True,
                prefix=f"{prefix}.mlp",
            )
        self.input_layernorm = GemmaRMSNorm(config.hidden_size, config.rms_norm_eps)
        self.post_attention_layernorm = GemmaRMSNorm(
            config.hidden_size, config.rms_norm_eps
        )
        self.prefix = prefix
    def add_and_maybe_inplace_all_reduce(
        self, in1: torch.Tensor, in2: torch.Tensor
    ) -> torch.Tensor:
        if not self.use_fused_all_reduce:
            return in1 + in2
        return self.tp_group.all_reduce(in1 + in2)
    def forward(
        self, positions: torch.Tensor, hidden_states: torch.Tensor
    ) -> torch.Tensor:
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states = self.self_attn(
            positions=positions,
            hidden_states=hidden_states,
        )
        hidden_states += residual
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        if self.use_moe:
            ffn_output = self.moe(hidden_states)
        else:
            ffn_output = self.mlp(hidden_states)
        hidden_states = ffn_output + residual
        return hidden_states
@support_torch_compile
 class Step3p5Model(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str = "") -> None:
        super().__init__()
        self.vllm_config = vllm_config
        config = vllm_config.model_config.hf_config
        self.vocab_size = config.vocab_size
        self.config = config
        self.moe_num_experts = config.moe_num_experts
        if get_pp_group().is_first_rank or (
            config.tie_word_embeddings and get_pp_group().is_last_rank
        ):
            self.embed_tokens = VocabParallelEmbedding(
                self.vocab_size,
                config.hidden_size,
            )
        else:
            self.embed_tokens = PPMissingLayer()
        self.start_layer, self.end_layer, self.layers = make_layers(
            config.num_hidden_layers,
            lambda prefix: Step3p5DecoderLayer(
                vllm_config,
                prefix=prefix,
            ),
            prefix=f"{prefix}.layers",
        )
        if get_pp_group().is_last_rank:
            self.norm = GemmaRMSNorm(config.hidden_size, config.rms_norm_eps)
        else:
            self.norm = PPMissingLayer()
        self.make_empty_intermediate_tensors = make_empty_intermediate_tensors_factory(
            ["hidden_states"], config.hidden_size
        )
    def embed_input_ids(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.embed_tokens(input_ids)
    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        intermediate_tensors: IntermediateTensors | None = None,
        inputs_embeds: torch.Tensor | None = None,
    ) -> torch.Tensor:
        if get_pp_group().is_first_rank:
            if inputs_embeds is not None:
                hidden_states = inputs_embeds
            else:
                hidden_states = self.embed_input_ids(input_ids)
        else:
            assert intermediate_tensors is not None
            hidden_states = intermediate_tensors["hidden_states"]
        for i in range(self.start_layer, self.end_layer):
            layer = self.layers[i]
            hidden_states = layer(positions, hidden_states)
        if not get_pp_group().is_last_rank:
            return IntermediateTensors(
                {
                    "hidden_states": hidden_states,
                }
            )
        return hidden_states
    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        config = self.config
        assert config.num_attention_groups > 1, "Only support GQA"
        qkv_params_mapping = []
        stacked_params_mapping = [
            # (param_name, shard_name, shard_id)
            ("qkv_proj", "q_proj", "q"),
            ("qkv_proj", "k_proj", "k"),
            ("qkv_proj", "v_proj", "v"),
            ("gate_up_proj", "gate_proj", 0),
            ("gate_up_proj", "up_proj", 1),
        ]
        params_dict = dict(self.named_parameters())
        loaded_params: set[str] = set()
        expert_params_mapping = [
            (".moe.experts.w13_weight", ".moe.gate_proj.weight", "w1"),
            (".moe.experts.w13_weight", ".moe.up_proj.weight", "w3"),
            (".moe.experts.w2_weight", ".moe.down_proj.weight", "w2"),
        ]
        disable_moe_stacked_params = [data[1] for data in expert_params_mapping]
        for name, loaded_weight in weights:
            if name.startswith("model."):
                local_name = name[len("model.") :]
                full_name = name
            else:
                local_name = name
                full_name = f"model.{name}" if name else "model"
            spec_layer = get_spec_layer_idx_from_weight_name(config, full_name)
            if spec_layer is not None:
                continue  # skip spec decode layers for main model
            # Skip any layers beyond the main model's depth (e.g., MTP layers)
            if full_name.startswith("model.layers."):
                parts = full_name.split(".")
                if len(parts) > 2 and parts[2].isdigit():
                    layer_idx = int(parts[2])
                    if layer_idx >= config.num_hidden_layers:
                        continue
            for param_name, weight_name, shard_id in stacked_params_mapping:
                if weight_name not in local_name:
                    continue
                if any(
                    disable_moe_stacked_param in local_name
                    for disable_moe_stacked_param in disable_moe_stacked_params
                ):
                    continue
                replaced_name = local_name.replace(weight_name, param_name)
                if is_pp_missing_parameter(replaced_name, self):
                    continue
                if replaced_name not in params_dict:
                    continue
                param = params_dict[replaced_name]
                weight_loader = param.weight_loader
                weight_loader(param, loaded_weight, shard_id)
                loaded_params.add(replaced_name)
                break
            else:
                for param_name, weight_name, shard_id in expert_params_mapping:
                    if weight_name not in local_name:
                        continue
                    replaced_name = local_name.replace(weight_name, param_name)
                    if is_pp_missing_parameter(replaced_name, self):
                        continue
                    if replaced_name not in params_dict:
                        continue
                    param = params_dict[replaced_name]
                    weight_loader = param.weight_loader
                    moe_expert_num = self.moe_num_experts
                    assert loaded_weight.shape[0] == moe_expert_num
                    for expert_id in range(moe_expert_num):
                        loaded_weight_expert = loaded_weight[expert_id]
                        weight_loader(
                            param,
                            loaded_weight_expert,
                            replaced_name,
                            shard_id=shard_id,
                            expert_id=expert_id,
                        )
                    loaded_params.add(replaced_name)
                    break
                else:
                    for (
                        param_name,
                        weight_name,
                        start_idx,
                        end_idx,
                    ) in qkv_params_mapping:
                        if weight_name not in local_name:
                            continue
                        replaced_name = local_name.replace(weight_name, param_name)
                        if is_pp_missing_parameter(replaced_name, self):
                            continue
                        if replaced_name not in params_dict:
                            continue
                        param = params_dict[replaced_name]
                        dim = param.shape[param.output_dim]
                        begin_idx = int(start_idx * dim)
                        end_idx = int(end_idx * dim)
                        param_slice = param.narrow(
                            param.output_dim, begin_idx, end_idx - begin_idx
                        )
                        param_slice.copy_(loaded_weight)
                        loaded_params.add(replaced_name)
                        break
                    else:
                        if is_pp_missing_parameter(local_name, self):
                            continue
                        if "expert_bias" in local_name:
                            logger.warning_once("ignore expert_bias")
                            continue
                        if local_name not in params_dict:
                            continue
                        param = params_dict[local_name]
                        weight_loader = getattr(
                            param, "weight_loader", default_weight_loader
                        )
                        weight_loader(param, loaded_weight)
                        loaded_params.add(local_name)
        return loaded_params
 class Step3p5ForCausalLM(nn.Module, SupportsPP, MixtureOfExperts):
    hf_to_vllm_mapper = WeightsMapper(
        orig_to_new_substr={".share_expert.": ".moe.share_expert."}
    )
    def __init__(
        self,
        *,
        vllm_config: VllmConfig,
        prefix: str = "",
    ):
        super().__init__()
        config = vllm_config.model_config.hf_config
        lora_config = vllm_config.lora_config
        self.config = config
        self.vllm_config = vllm_config
        self.model = Step3p5Model(
            vllm_config=vllm_config, prefix=maybe_prefix(prefix, "model")
        )
        self.moe_layers: list[FusedMoEBlock] = []
        for layer in self.model.layers:
            if isinstance(layer, PPMissingLayer):
                continue
            assert isinstance(layer, Step3p5DecoderLayer)
            if hasattr(layer, "moe") and isinstance(layer.moe, FusedMoEBlock):
                self.moe_layers.append(layer.moe)
        if get_pp_group().is_last_rank:
            self.unpadded_vocab_size = config.vocab_size
            if lora_config:
                self.unpadded_vocab_size += lora_config.lora_extra_vocab_size
            self.lm_head = ParallelLMHead(
                self.unpadded_vocab_size,
                config.hidden_size,
                org_num_embeddings=config.vocab_size,
                padding_size=DEFAULT_VOCAB_PADDING_SIZE
                if not lora_config
                else lora_config.lora_vocab_padding_size,
            )
            self.logits_processor = LogitsProcessor(
                self.unpadded_vocab_size, config.vocab_size
            )
        else:
            self.lm_head = PPMissingLayer()
        self.make_empty_intermediate_tensors = (
            self.model.make_empty_intermediate_tensors
        )
        # Set MoE hyperparameters
        self.expert_weights = []
        assert len(self.moe_layers) > 0, "No MoE layers found in the model."
        example_layer = self.moe_layers[0]
        self.num_moe_layers = len(self.moe_layers)
        self.num_expert_groups = 1
        self.num_shared_experts = 0
        self.num_logical_experts = example_layer.n_logical_experts
        self.num_physical_experts = example_layer.n_physical_experts
        self.num_local_physical_experts = example_layer.n_local_physical_experts
        self.num_routed_experts = example_layer.n_routed_experts
        self.num_redundant_experts = example_layer.n_redundant_experts
    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        intermediate_tensors: IntermediateTensors | None = None,
        inputs_embeds: torch.Tensor | None = None,
    ):
        hidden_states = self.model(
            input_ids, positions, intermediate_tensors, inputs_embeds
        )
        return hidden_states
    def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.model.norm(hidden_states)
        logits = self.logits_processor(self.lm_head, hidden_states)
        return logits
    def embed_input_ids(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.model.embed_tokens(input_ids)
    def set_eplb_state(
        self,
        expert_load_view: torch.Tensor,
        logical_to_physical_map: torch.Tensor,
        logical_replica_count: torch.Tensor,
    ) -> None:
        for layer_idx, layer in enumerate(self.moe_layers):
            experts = layer.experts
            assert isinstance(experts, FusedMoE)
            # Register the expert weights.
            self.expert_weights.append(experts.get_expert_weights())
            experts.set_eplb_state(
                moe_layer_idx=layer_idx,
                expert_load_view=expert_load_view,
                logical_to_physical_map=logical_to_physical_map,
                logical_replica_count=logical_replica_count,
            )
    def update_physical_experts_metadata(
        self,
        num_physical_experts: int,
        num_local_physical_experts: int,
    ) -> None:
        assert self.num_local_physical_experts == num_local_physical_experts
        self.num_physical_experts = num_physical_experts
        self.num_local_physical_experts = num_local_physical_experts
        self.num_redundant_experts = num_physical_experts - self.num_logical_experts
        for layer in self.moe_layers:
            assert isinstance(layer, FusedMoEBlock)
            layer.n_local_physical_experts = num_local_physical_experts
            layer.n_physical_experts = num_physical_experts
            layer.n_redundant_experts = self.num_redundant_experts
            layer.experts.update_expert_map()
    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        loader = AutoWeightsLoader(self)
        return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
 def get_spec_layer_idx_from_weight_name(
    config: ModelConfig, weight_name: str
 ) -> int | None:
    if hasattr(config, "num_nextn_predict_layers") and (
        config.num_nextn_predict_layers > 0
    ):
        layer_idx = config.num_hidden_layers
        for i in range(config.num_nextn_predict_layers):
            if weight_name.startswith(
                f"layers.{layer_idx + i}."  # Step3p5Model
            ) or weight_name.startswith(f"model.layers.{layer_idx + i}."):  # Step3p5MTP
                return layer_idx + i
    return None
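The lookup above can be tried in isolation. The following is a minimal sketch, assuming a stand-in config object; the `spec_layer_idx` name, the layer counts and the weight names are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from types import SimpleNamespace


def spec_layer_idx(config, weight_name: str) -> int | None:
    """Standalone copy of the lookup in the diff, for illustration only."""
    num_mtp = getattr(config, "num_nextn_predict_layers", 0)
    if num_mtp > 0:
        # MTP layers are numbered right after the regular decoder layers.
        first = config.num_hidden_layers
        for i in range(num_mtp):
            prefixes = (f"layers.{first + i}.", f"model.layers.{first + i}.")
            if weight_name.startswith(prefixes):
                return first + i
    return None


# Hypothetical values: 48 decoder layers followed by 2 MTP layers.
cfg = SimpleNamespace(num_hidden_layers=48, num_nextn_predict_layers=2)
results = {
    name: spec_layer_idx(cfg, name)
    for name in (
        "model.layers.48.enorm.weight",
        "layers.49.eh_proj.weight",
        "model.layers.47.self_attn.q_proj.weight",
        "model.embed_tokens.weight",
    )
}
```

Only names under layer indices 48 and 49 resolve to a spec layer here; regular decoder layers and the embedding return `None`.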
--- a/vllm/model_executor/models/step3p5_mtp.py
+++ b/vllm/model_executor/models/step3p5_mtp.py
@@ -0,0 +1,315 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 from collections.abc import Iterable
 import torch
 import torch.nn as nn
 from transformers import PretrainedConfig
 from vllm.config import VllmConfig
 from vllm.logger import init_logger
 from vllm.model_executor.layers.layernorm import GemmaRMSNorm
 from vllm.model_executor.layers.logits_processor import LogitsProcessor
 from vllm.model_executor.layers.quantization import QuantizationConfig
 from vllm.model_executor.layers.vocab_parallel_embedding import (
    ParallelLMHead,
    VocabParallelEmbedding,
 )
 from vllm.model_executor.model_loader.weight_utils import default_weight_loader
 from vllm.sequence import IntermediateTensors
 from .step3p5 import Step3p5DecoderLayer, get_spec_layer_idx_from_weight_name
 from .utils import maybe_prefix
 logger = init_logger(__name__)
 class SharedHead(nn.Module):
    def __init__(
        self,
        config: PretrainedConfig,
        quant_config: QuantizationConfig | None = None,
    ) -> None:
        super().__init__()
        self.norm = GemmaRMSNorm(config.hidden_size, config.rms_norm_eps)
        self.head = ParallelLMHead(
            config.vocab_size, config.hidden_size, quant_config=quant_config
        )
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.norm(hidden_states)
 class Step3p5AMultiTokenPredictorLayer(nn.Module):
    def __init__(
        self,
        vllm_config: VllmConfig,
        prefix: str,
    ) -> None:
        super().__init__()
        config = vllm_config.model_config.hf_config
        quant_config = vllm_config.quant_config
        self.enorm = GemmaRMSNorm(config.hidden_size, config.rms_norm_eps)
        self.hnorm = GemmaRMSNorm(config.hidden_size, config.rms_norm_eps)
        self.eh_proj = nn.Linear(config.hidden_size * 2, config.hidden_size, bias=False)
        self.shared_head = SharedHead(config=config, quant_config=quant_config)
        self.mtp_block = Step3p5DecoderLayer(
            vllm_config,
            prefix=f"{prefix}.mtp_block",
        )
    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        previous_hidden_states: torch.Tensor,
        inputs_embeds: torch.Tensor | None = None,
        spec_step_index: int = 0,
    ) -> torch.Tensor:
        assert inputs_embeds is not None
        inputs_embeds = self.enorm(inputs_embeds)
        previous_hidden_states = self.hnorm(previous_hidden_states)
        hidden_states = self.eh_proj(
            torch.cat([inputs_embeds, previous_hidden_states], dim=-1)
        )
        hidden_states = self.mtp_block(positions=positions, hidden_states=hidden_states)
        return hidden_states
 class Step3p5AMultiTokenPredictor(nn.Module):
    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        config = vllm_config.model_config.hf_config
        self.embed_tokens = VocabParallelEmbedding(
            config.vocab_size,
            config.hidden_size,
        )
        self.mtp_start_layer_idx = config.num_hidden_layers
        self.num_mtp_layers = config.num_nextn_predict_layers
        # to map the exact layer index from weights
        self.layers = torch.nn.ModuleDict(
            {
                str(idx): Step3p5AMultiTokenPredictorLayer(
                    vllm_config,
                    f"{prefix}.layers.{idx}",
                )
                for idx in range(
                    self.mtp_start_layer_idx,
                    self.mtp_start_layer_idx + self.num_mtp_layers,
                )
            }
        )
        self.logits_processor = LogitsProcessor(config.vocab_size)
    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        previous_hidden_states: torch.Tensor,
        inputs_embeds: torch.Tensor | None = None,
        spec_step_idx: int = 0,
    ) -> torch.Tensor:
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        current_step_idx = spec_step_idx % self.num_mtp_layers
        return self.layers[str(self.mtp_start_layer_idx + current_step_idx)](
            input_ids,
            positions,
            previous_hidden_states,
            inputs_embeds,
            current_step_idx,
        )
    def compute_logits(
        self,
        hidden_states: torch.Tensor,
        spec_step_idx: int = 0,
    ) -> torch.Tensor:
        current_step_idx = spec_step_idx % self.num_mtp_layers
        mtp_layer = self.layers[str(self.mtp_start_layer_idx + current_step_idx)]
        logits = self.logits_processor(
            mtp_layer.shared_head.head, mtp_layer.shared_head(hidden_states)
        )
        return logits
    def embed_input_ids(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.embed_tokens(input_ids)
 class Step3p5MTP(nn.Module):
    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        self.config = vllm_config.model_config.hf_config
        self.vllm_config = vllm_config
        self.model = Step3p5AMultiTokenPredictor(
            vllm_config=vllm_config, prefix=maybe_prefix(prefix, "model")
        )
    def embed_input_ids(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.model.embed_input_ids(input_ids)
    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        hidden_states: torch.Tensor,
        intermediate_tensors: IntermediateTensors | None = None,
        inputs_embeds: torch.Tensor | None = None,
        spec_step_idx: int = 0,
    ) -> torch.Tensor:
        hidden_states = self.model(
            input_ids, positions, hidden_states, inputs_embeds, spec_step_idx
        )
        return hidden_states
    def compute_logits(
        self,
        hidden_states: torch.Tensor,
        spec_step_idx: int = 0,
    ) -> torch.Tensor | None:
        return self.model.compute_logits(hidden_states, spec_step_idx)
    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        stacked_params_mapping = [
            # (param_name, shard_name, shard_id)
            ("qkv_proj", "q_proj", "q"),
            ("qkv_proj", "k_proj", "k"),
            ("qkv_proj", "v_proj", "v"),
            ("gate_up_proj", "gate_proj", 0),
            ("gate_up_proj", "up_proj", 1),
        ]
        expert_params_mapping = [
            (".moe.experts.w13_weight", ".moe.gate_proj.weight", "w1"),
            (".moe.experts.w13_weight", ".moe.up_proj.weight", "w3"),
            (".moe.experts.w2_weight", ".moe.down_proj.weight", "w2"),
        ]
        params_dict = dict(self.named_parameters())
        loaded_params: set[str] = set()
        for name, loaded_weight in weights:
            if "rotary_emb.inv_freq" in name:
                continue
            spec_layer = get_spec_layer_idx_from_weight_name(self.config, name)
            if "embed_tokens" not in name and spec_layer is None:
                continue
            name = self._rewrite_spec_layer_name(spec_layer, name)
            for param_name, weight_name, shard_id in stacked_params_mapping:
                # Skip non-stacked layers and experts (experts handled below).
                if weight_name not in name:
                    continue
                # We have mlp.experts[0].gate_proj in the checkpoint.
                # Since we handle the experts below in expert_params_mapping,
                # we need to skip here BEFORE we update the name, otherwise
                # name will be updated to mlp.experts[0].gate_up_proj, which
                # will then be updated below in expert_params_mapping
                # for mlp.experts[0].gate_gate_up_proj, which breaks load.
                if ("mlp.experts." in name) and name not in params_dict:
                    continue
                if "experts" in name or "moe" in name:
                    continue
                name = name.replace(weight_name, param_name)
                # Skip loading extra bias for GPTQ models.
                if name.endswith(".bias") and name not in params_dict:
                    continue
                param = params_dict[name]
                weight_loader = param.weight_loader
                weight_loader(param, loaded_weight, shard_id)
                break
            else:
                for mapping in expert_params_mapping:
                    param_name, weight_name, shard_id = mapping
                    if weight_name not in name:
                        continue
                    name = name.replace(weight_name, param_name)
                    # Skip loading extra bias for GPTQ models.
                    if (
                        name.endswith(".bias") or name.endswith("_bias")
                    ) and name not in params_dict:
                        continue
                    param = params_dict[name]
                    weight_loader = param.weight_loader
                    for expert_id in range(loaded_weight.shape[0]):
                        loaded_weight_expert = loaded_weight[expert_id]
                        weight_loader(
                            param,
                            loaded_weight_expert,
                            name,
                            shard_id=shard_id,
                            expert_id=expert_id,
                        )
                    loaded_params.add(name)
                    break
                else:
                    # Skip loading extra bias for GPTQ models.
                    if (
                        name.endswith(".bias")
                        and name not in params_dict
                        or "tok_embeddings" in name
                    ):
                        continue
                    if spec_layer is not None and ".transformer." in name:
                        name = name.replace(".transformer.", ".")
                    if "shared_head" in name:
                        name = name.replace("shared_head.output", "shared_head.head")
                    if "embed_tokens" in name:
                        assert (
                            hasattr(self.config, "num_nextn_predict_layers")
                            and self.config.num_nextn_predict_layers > 0
                        )
                        name = "model.embed_tokens.weight"
                    param = params_dict[name]
                    weight_loader = getattr(
                        param, "weight_loader", default_weight_loader
                    )
                    weight_loader(param, loaded_weight)
            loaded_params.add(name)
        params_need_to_load = set(params_dict.keys())
        # Some KV cache scales are optional: checkpoints may omit them and vLLM
        # will fall back to default scales during initialization.
        optional_params = {
            name
            for name, param in params_dict.items()
            if name.endswith((".k_scale", ".v_scale", ".q_scale", ".prob_scale"))
            and getattr(param, "numel", lambda: 0)() == 1
            and getattr(param, "requires_grad", False) is False
        }
        params_need_to_load -= optional_params
        if params_need_to_load != loaded_params:
            missing_params = list(params_need_to_load - loaded_params)
            param_name_example = missing_params[0]
            raise RuntimeError(
                "Some parameters like "
                f"{param_name_example} are not in the checkpoint and will falsely "
                "use random initialization"
            )
        return loaded_params
    def _rewrite_spec_layer_name(self, spec_layer: int, name: str) -> str:
        """
        Rewrite the weight name to match the format of the original model.
        Add .mtp_block for modules in transformer layer block for spec layer
        """
        spec_layer_weight_names = [
            "embed_tokens",
            "enorm",
            "hnorm",
            "eh_proj",
            "shared_head",
        ]
        spec_layer_weight = False
        for weight_name in spec_layer_weight_names:
            if weight_name in name:
                spec_layer_weight = True
                break
        if not spec_layer_weight:
            # Treat the remaining weights as belonging to the transformer layer block.
            name = name.replace(
                f"model.layers.{spec_layer}.", f"model.layers.{spec_layer}.mtp_block."
            )
        return name
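The renaming rule in `_rewrite_spec_layer_name` can be checked without loading a model. This is a minimal sketch written as a free function; the checkpoint names used below are hypothetical examples, not names taken from a real checkpoint.

```python
# Weights that live directly on the MTP layer and keep their names.
SPEC_LAYER_WEIGHT_NAMES = ("embed_tokens", "enorm", "hnorm", "eh_proj", "shared_head")


def rewrite_spec_layer_name(spec_layer: int, name: str) -> str:
    """Insert ``mtp_block`` for weights of the wrapped decoder layer."""
    if any(key in name for key in SPEC_LAYER_WEIGHT_NAMES):
        return name
    return name.replace(
        f"model.layers.{spec_layer}.", f"model.layers.{spec_layer}.mtp_block."
    )


# Hypothetical weight names for spec layer 48.
attn_name = rewrite_spec_layer_name(48, "model.layers.48.self_attn.q_proj.weight")
norm_name = rewrite_spec_layer_name(48, "model.layers.48.enorm.weight")
```

Decoder-layer weights gain the `.mtp_block` component, while the spec-layer-specific weights pass through unchanged.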
--- a/vllm/reasoning/__init__.py
+++ b/vllm/reasoning/__init__.py
@@ -84,6 +84,10 @@ _REASONING_PARSERS_TO_REGISTER = {
        "step3_reasoning_parser",
        "Step3ReasoningParser",
    ),
    "step3p5": (
        "step3p5_reasoning_parser",
        "Step3p5ReasoningParser",
    ),
 }
--- a/vllm/reasoning/step3p5_reasoning_parser.py
+++ b/vllm/reasoning/step3p5_reasoning_parser.py
@@ -0,0 +1,153 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 from collections.abc import Sequence
 from vllm.entrypoints.openai.chat_completion.protocol import (
    ChatCompletionRequest,
 )
 from vllm.entrypoints.openai.engine.protocol import DeltaMessage
 from vllm.entrypoints.openai.responses.protocol import (
    ResponsesRequest,
 )
 from vllm.reasoning.basic_parsers import BaseThinkingReasoningParser
 from vllm.tokenizers import TokenizerLike
 class Step3p5ReasoningParser(BaseThinkingReasoningParser):
    """
    Reasoning parser for Step3p5 model.
    Step3p5 uses the <think>...</think> format, but it tends to emit an extra
    newline immediately before and/or after the </think> token. This parser trims:
      - the newline right before </think>
      - the newline right after </think>
    """
    @property
    def start_token(self) -> str:
        return "<think>"
    @property
    def end_token(self) -> str:
        return "</think>"
    def __init__(self, tokenizer: TokenizerLike, *args, **kwargs):
        super().__init__(tokenizer, *args, **kwargs)
        # Used to hold a trailing "\n" from reasoning content so we can decide
        # whether it is immediately before </think>.
        self._pending_reasoning_newline = False
        # Used to delay the reasoning end detection.
        # This is necessary to remove the newline that appears immediately after
        # </think>, which may cause the end detection to be delayed by one round.
        self.end_offset = 1
    def is_reasoning_end(self, input_ids: Sequence[int]) -> bool:
        if self.end_token_id in input_ids and self.end_offset > 0:
            self.end_offset -= 1
            return False
        return self.end_offset < 1
    def is_reasoning_end_streaming(
        self, input_ids: Sequence[int], delta_ids: Sequence[int]
    ) -> bool:
        if self.end_token_id in input_ids and self.end_offset > 0:
            self.end_offset -= 1
            return False
        return self.end_offset < 1
    def extract_reasoning(
        self,
        model_output: str,
        request: ChatCompletionRequest | ResponsesRequest,
    ) -> tuple[str | None, str | None]:
        reasoning, content = super().extract_reasoning(model_output, request)
        if reasoning is not None:
            reasoning = reasoning.removesuffix("\n")
        if content is not None:
            content = content.removeprefix("\n")
        return reasoning or None, content or None
    def extract_reasoning_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
    ) -> DeltaMessage | None:
        # Drop the immediate newline that models often emit after </think>.
        if previous_text.endswith(self.end_token) and delta_text:
            if delta_text == "\n":
                return None
            elif delta_text.startswith("\n"):
                remaining = delta_text.removeprefix("\n")
                return DeltaMessage(content=remaining) if remaining else None
        ret = super().extract_reasoning_streaming(
            previous_text,
            current_text,
            delta_text,
            previous_token_ids,
            current_token_ids,
            delta_token_ids,
        )
        if ret is None:
            return None
        # Compatibility path for models that don't generate the start token:
        # treat everything before </think> as reasoning and everything after
        # as content.
        if (
            self.start_token_id not in previous_token_ids
            and self.start_token_id not in delta_token_ids
        ):
            if self.end_token_id in delta_token_ids:
                end_index = delta_text.find(self.end_token)
                reasoning = delta_text[:end_index]
                content = delta_text[end_index + len(self.end_token) :]
                ret = DeltaMessage(reasoning=reasoning, content=content or None)
            elif self.end_token_id in previous_token_ids:
                ret = DeltaMessage(content=delta_text)
            else:
                ret = DeltaMessage(reasoning=delta_text)
        reasoning_to_output = ret.reasoning
        content_to_output = ret.content
        # Reasoning: handle the newline immediately before </think>.
        if reasoning_to_output is not None:
            if self._pending_reasoning_newline:
                reasoning_to_output = "\n" + reasoning_to_output
                self._pending_reasoning_newline = False
            if reasoning_to_output.endswith("\n"):
                reasoning_to_output = reasoning_to_output.removesuffix("\n")
                if self.end_token in delta_text:
                    # Trailing "\n" is right before </think>, drop it.
                    self._pending_reasoning_newline = False
                else:
                    # Hold the trailing "\n" until we know whether </think> follows.
                    self._pending_reasoning_newline = True
        # Content: handle the newline immediately after </think>.
        if content_to_output is not None:
            # No need to get into parser again to remove newline after </think>.
            self.end_offset -= 1
            # If we have content, reasoning must have ended.
            self._pending_reasoning_newline = False
            if self.end_token in delta_text and content_to_output.startswith("\n"):
                content_to_output = content_to_output.removeprefix("\n")
        reasoning_to_output = reasoning_to_output or None
        content_to_output = content_to_output or None
        if reasoning_to_output is None and content_to_output is None:
            return None
        return DeltaMessage(reasoning=reasoning_to_output, content=content_to_output)
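The non-streaming trimming rule can be illustrated on plain strings. The sketch below is an assumption-laden stand-in: `split_reasoning` is a hypothetical helper that only approximates the base parser's split on `</think>`; it is not the `BaseThinkingReasoningParser` implementation.

```python
from __future__ import annotations

START, END = "<think>", "</think>"


def split_reasoning(model_output: str) -> tuple[str | None, str | None]:
    # Simplified stand-in for the base parser: drop an optional start token,
    # then split on the first end token.
    text = model_output.removeprefix(START)
    if END not in text:
        reasoning, content = text, None
    else:
        reasoning, _, content = text.partition(END)
    # Step3p5-specific part: trim at most one newline on each side of </think>.
    reasoning = reasoning.removesuffix("\n")
    if content is not None:
        content = content.removeprefix("\n")
    # Empty strings are reported as None, as in extract_reasoning above.
    return reasoning or None, content or None


typical = split_reasoning("<think>plan\n</think>\nanswer")
double = split_reasoning("<think>a\n\n</think>\n\nb")
```

Because `removesuffix` and `removeprefix` strip a single occurrence, a doubled newline on either side of `</think>` keeps one newline.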
--- a/vllm/tool_parsers/__init__.py
+++ b/vllm/tool_parsers/__init__.py
@@ -134,6 +134,10 @@ _TOOL_PARSERS_TO_REGISTER = {
        "step3_tool_parser",
        "Step3ToolParser",
    ),
    "step3p5": (
        "step3p5_tool_parser",
        "Step3p5ToolParser",
    ),
    "xlam": (
        "xlam_tool_parser",
        "xLAMToolParser",
--- a/vllm/tool_parsers/step3p5_tool_parser.py
+++ b/vllm/tool_parsers/step3p5_tool_parser.py
--- a/vllm/transformers_utils/config.py
+++ b/vllm/transformers_utils/config.py
@@ -96,6 +96,8 @@ _CONFIG_REGISTRY: dict[str, type[PretrainedConfig]] = LazyConfigDict(
    ultravox="UltravoxConfig",
    step3_vl="Step3VLConfig",
    step3_text="Step3TextConfig",
    step3p5="Step3p5Config",
    qwen3_asr="Qwen3ASRConfig",
    qwen3_next="Qwen3NextConfig",
    lfm2_moe="Lfm2MoeConfig",
    tarsier2="Tarsier2Config",
--- a/vllm/transformers_utils/configs/__init__.py
+++ b/vllm/transformers_utils/configs/__init__.py
@@ -50,6 +50,8 @@ _CLASS_TO_MODULE: dict[str, str] = {
    "Step3VLConfig": "vllm.transformers_utils.configs.step3_vl",
    "Step3VisionEncoderConfig": "vllm.transformers_utils.configs.step3_vl",
    "Step3TextConfig": "vllm.transformers_utils.configs.step3_vl",
    "Step3p5Config": "vllm.transformers_utils.configs.step3p5",
    "Qwen3ASRConfig": "vllm.transformers_utils.configs.qwen3_asr",
    "Qwen3NextConfig": "vllm.transformers_utils.configs.qwen3_next",
    "Tarsier2Config": "vllm.transformers_utils.configs.tarsier2",
    # Special case: DeepseekV3Config is from HuggingFace Transformers
@@ -90,6 +92,8 @@ __all__ = [
    "Step3VLConfig",
    "Step3VisionEncoderConfig",
    "Step3TextConfig",
    "Step3p5Config",
    "Qwen3ASRConfig",
    "Qwen3NextConfig",
    "Tarsier2Config",
 ]
--- a/vllm/transformers_utils/configs/step3p5.py
+++ b/vllm/transformers_utils/configs/step3p5.py
@@ -0,0 +1,100 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 from typing import Any
 from transformers.configuration_utils import PretrainedConfig
 class Step3p5Config(PretrainedConfig):
    model_type = "step3p5"
    def __init__(
        self,
        hidden_size: int = 5120,
        intermediate_size: int = 13312,
        num_attention_heads: int = 40,
        num_attention_groups: int = 8,
        num_hidden_layers: int = 48,
        max_seq_len: int = 4096,
        vocab_size: int = 65536,
        rms_norm_eps: float = 1e-5,
        moe_every_n_layer: int = 2,
        use_moe: bool = False,
        moe_intermediate_size: int = 10240,
        moe_num_experts: int = 16,
        moe_top_k: int = 4,
        moe_layer_offset: int = 0,
        rope_theta: float | list[float] | None = 500000,
        rope_scaling: dict[str, Any] | None = None,
        head_dim: int | None = None,
        share_expert_dim: int | None = None,
        norm_expert_weight: bool = True,
        bos_token_id: list[int] | int | None = None,
        eos_token_id: list[int] | int | None = None,
        moe_router_activation: str = "softmax",
        moe_router_scaling_factor: float = 1.0,
        att_impl_type: str = "GQA",
        use_head_wise_attn_gate: bool = False,
        use_moe_router_bias: bool = True,
        need_fp32_gate: bool = True,
        layer_types: list[str] | None = None,
        use_rope_layers: list[bool] | None = None,
        yarn_only_types: list[str] | None = None,
        attention_other_setting: dict[str, Any] | None = None,
        num_nextn_predict_layers: int = 0,
        swiglu_limits: list[float] | None = None,
        swiglu_limits_shared: list[float] | None = None,
        max_position_embeddings: int | None = None,
        **kwargs,
    ):
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_attention_heads = num_attention_heads
        self.num_attention_groups = num_attention_groups
        self.num_hidden_layers = num_hidden_layers
        self.max_seq_len = max_seq_len
        self.vocab_size = vocab_size
        self.rms_norm_eps = rms_norm_eps
        self.use_moe = use_moe
        self.moe_intermediate_size = moe_intermediate_size
        self.moe_every_n_layer = moe_every_n_layer
        self.moe_num_experts = moe_num_experts
        self.num_experts_per_tok = moe_top_k
        self.moe_top_k = moe_top_k
        self.moe_layer_offset = moe_layer_offset
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        self.head_dim = head_dim
        if share_expert_dim is None:
            self.share_expert_dim = self.moe_intermediate_size * self.moe_top_k
        else:
            self.share_expert_dim = share_expert_dim
        self.norm_expert_weight = norm_expert_weight
        self.max_position_embeddings = max_position_embeddings
        self.moe_router_activation = moe_router_activation
        self.moe_router_scaling_factor = moe_router_scaling_factor
        self.use_moe_router_bias = use_moe_router_bias
        self.need_fp32_gate = need_fp32_gate
        self.att_impl_type = att_impl_type
        self.use_head_wise_attn_gate = use_head_wise_attn_gate
        self.layer_types = layer_types
        self.use_rope_layers = use_rope_layers
        self.yarn_only_types = yarn_only_types
        self.attention_other_setting = attention_other_setting
        self.num_nextn_predict_layers = num_nextn_predict_layers
        self.swiglu_limits = swiglu_limits
        self.swiglu_limits_shared = swiglu_limits_shared
        resolved_bos_token_id = 1 if bos_token_id is None else bos_token_id
        resolved_eos_token_id = [2, 3] if eos_token_id is None else eos_token_id
        self.bos_token_id = resolved_bos_token_id
        self.eos_token_id = resolved_eos_token_id
        super().__init__(
            bos_token_id=resolved_bos_token_id,
            eos_token_id=resolved_eos_token_id,
            **kwargs,
        )
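The derived defaults in `Step3p5Config.__init__` can be summarized in a few lines. This is a minimal sketch; `resolve_defaults` is a hypothetical helper that mirrors only the `is None` fallbacks shown above and does not depend on `transformers`.

```python
from __future__ import annotations


def resolve_defaults(
    moe_intermediate_size: int = 10240,
    moe_top_k: int = 4,
    share_expert_dim: int | None = None,
    bos_token_id: list[int] | int | None = None,
    eos_token_id: list[int] | int | None = None,
) -> dict:
    return {
        # Falls back to moe_intermediate_size * moe_top_k when unset.
        "share_expert_dim": (
            moe_intermediate_size * moe_top_k
            if share_expert_dim is None
            else share_expert_dim
        ),
        # Explicit None checks keep a legitimate token id of 0 intact.
        "bos_token_id": 1 if bos_token_id is None else bos_token_id,
        "eos_token_id": [2, 3] if eos_token_id is None else eos_token_id,
    }


defaults = resolve_defaults()
explicit = resolve_defaults(share_expert_dim=2048, eos_token_id=0)
```

With the signature defaults, `share_expert_dim` resolves to 10240 * 4 = 40960, and an explicit `eos_token_id=0` is preserved rather than replaced by `[2, 3]`.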
--- a/vllm/v1/attention/backends/flash_attn.py
+++ b/vllm/v1/attention/backends/flash_attn.py
@@ -263,18 +263,6 @@ class FlashAttentionMetadataBuilder(AttentionMetadataBuilder[FlashAttentionMetad
        vllm_config: "VllmConfig",
        kv_cache_spec: "AttentionSpec",
    ) -> AttentionCGSupport:
-        # FA2 does not support CUDA graphs with encoder-decoder models due to
-        # accuracy issues reported in https://github.com/vllm-project/vllm/issues/33091
-        if (
-            vllm_config.model_config.is_encoder_decoder
-            and get_flash_attn_version() == 2
-        ):
-            logger.warning_once(
-                "FlashAttention2 does not support CUDA graphs with "
-                "encoder-decoder models due to accuracy issues reported in #33091. "
-                "Disabling CUDA graph."
-            )
-            return AttentionCGSupport.NEVER
        return cls._cudagraph_support
    def __init__(
--- a/vllm/v1/core/kv_cache_coordinator.py
+++ b/vllm/v1/core/kv_cache_coordinator.py
@@ -479,6 +479,16 @@ class HybridKVCacheCoordinator(KVCacheCoordinator):
        hit_length = max_cache_hit_length
        hit_blocks_by_group: list[list[KVCacheBlock] | None] = [None] * num_groups
        # Simple hybrid (1 full attn + 1 other): one iteration suffices.
        # Full attn is always first if it exists. This avoids EAGLE drops
        # being applied multiple times to non-full-attn groups.
        # FIXME (yifan): However, for complex hybrid models with multiple attn
        # groups, we still have the EAGLE spiral block dropping problem. See
        # discussion in issue https://github.com/vllm-project/vllm/issues/32802.
        is_simple_hybrid = len(self.attention_groups) == 2 and isinstance(
            self.attention_groups[0][0], FullAttentionSpec
        )
        while True:
            curr_hit_length = hit_length
@@ -495,10 +505,6 @@ class HybridKVCacheCoordinator(KVCacheCoordinator):
                    # the last iteration.
                    num_blocks = curr_hit_length // spec.block_size
                    curr_hit_length = num_blocks * spec.block_size
-                    for group_id in group_ids:
-                        blocks = hit_blocks_by_group[group_id]
-                        assert blocks is not None
-                        del blocks[num_blocks:]
                else:
                    hit_blocks = manager_cls.find_longest_cache_hit(
                        block_hashes=_get_block_hashes(spec),
@@ -513,10 +519,20 @@ class HybridKVCacheCoordinator(KVCacheCoordinator):
                    for group_id, blocks in zip(group_ids, hit_blocks):
                        hit_blocks_by_group[group_id] = blocks
-            if curr_hit_length < hit_length:
+            if curr_hit_length >= hit_length:
-                hit_length = curr_hit_length
-            else:
                break
            hit_length = curr_hit_length
            # Simple hybrid: exit after one iteration
            if is_simple_hybrid:
                break
        # Truncate full attention blocks to final hit_length (if present)
        spec, group_ids, _ = self.attention_groups[0]
        if isinstance(spec, FullAttentionSpec):
            num_blocks = hit_length // spec.block_size
            for group_id in group_ids:
                if (blks := hit_blocks_by_group[group_id]) is not None:
                    del blks[num_blocks:]
        return tuple(
            blocks if blocks is not None else [] for blocks in hit_blocks_by_group
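The control flow of the updated loop can be illustrated with a toy model. The sketch below is not the vLLM API: each attention group is reduced to a hypothetical function that maps an upper bound to the longest cache hit it can serve, and all helper names and numbers are made up for the example.

```python
def full_attention(cached_tokens: int, block_size: int):
    # Any block-aligned prefix up to the cached length is a valid hit.
    def find(limit: int) -> int:
        return min(limit, cached_tokens) // block_size * block_size
    return find


def sparse_hits(valid_lengths: set):
    # Stand-in for groups where only some prefix lengths are valid hits.
    def find(limit: int) -> int:
        return max((n for n in valid_lengths if n <= limit), default=0)
    return find


def converge_hit_length(groups, max_len: int, simple_hybrid: bool = False) -> int:
    hit_length = max_len
    while True:
        curr_hit_length = hit_length
        for find in groups:
            curr_hit_length = find(curr_hit_length)
        if curr_hit_length >= hit_length:
            break  # no group shrank the hit, so the length is stable
        hit_length = curr_hit_length
        if simple_hybrid:
            break  # one pass suffices when full attention comes first
    return hit_length


simple = converge_hit_length(
    [full_attention(64, 16), sparse_hits({16, 48})], 100, simple_hybrid=True
)
complex_case = converge_hit_length([sparse_hits({16, 48}), sparse_hits({16, 32})], 100)
```

In the toy simple-hybrid case one pass yields 48, which the full-attention group can always serve after truncation. In the second case the first pass yields 32, which the first group cannot serve, so further passes are needed before the length settles at 16.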
--- a/vllm/v1/worker/gpu_model_runner.py
+++ b/vllm/v1/worker/gpu_model_runner.py
@@ -1382,12 +1382,14 @@ class GPUModelRunner(
        num_scheduled_tokens: dict[str, int],
        kv_cache_spec: KVCacheSpec,
        num_reqs: int,
        for_cudagraph_capture: bool = False,
    ) -> tuple[torch.Tensor | None, np.ndarray | None]:
        if not isinstance(kv_cache_spec, CrossAttentionSpec):
            return None, None
        # Zero out buffer for padding requests that are not actually scheduled (CGs)
        self.encoder_seq_lens.np[:num_reqs] = 0
        # Build encoder_seq_lens array mapping request indices to
        # encoder lengths for inputs scheduled in this batch
        for req_id in num_scheduled_tokens:
@@ -1404,6 +1406,15 @@ class GPUModelRunner(
                feature.mm_position.length for feature in req_state.mm_features
            )
            self.encoder_seq_lens.np[req_index] = encoder_input_tokens
        if for_cudagraph_capture:
            # During CUDA graph capture, we need to use realistic encoder lengths
            # so that max_seqlen_k is captured with the correct value.
            max_encoder_len = getattr(
                self.model_config.hf_config,
                "max_source_positions",
                self.max_encoder_len,
            )
            self.encoder_seq_lens.np[:num_reqs] = max_encoder_len
        self.encoder_seq_lens.copy_to_gpu(num_reqs)
        encoder_seq_lens = self.encoder_seq_lens.gpu[:num_reqs]
@@ -1821,6 +1832,7 @@ class GPUModelRunner(
                num_scheduled_tokens or {},
                kv_cache_group.kv_cache_spec,
                num_reqs_padded,
                for_cudagraph_capture=for_cudagraph_capture,
            )
            if kv_cache_gid > 0:
                cm.block_table_tensor = _get_block_table(kv_cache_gid)
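The effect of the new `for_cudagraph_capture` flag on the encoder length buffer can be sketched with plain lists. This is an illustration only: `build_encoder_seq_lens`, the request indices and the length of 1500 are hypothetical, and the real code writes into a preallocated buffer rather than building a list.

```python
def build_encoder_seq_lens(
    scheduled: dict,
    num_reqs: int,
    for_cudagraph_capture: bool = False,
    max_encoder_len: int = 1500,  # hypothetical value for the example
) -> list:
    # Padding requests that are not actually scheduled get length 0.
    lens = [0] * num_reqs
    for req_index, encoder_tokens in scheduled.items():
        lens[req_index] = encoder_tokens
    if for_cudagraph_capture:
        # During capture every slot uses the maximum encoder length so the
        # captured max_seqlen_k is large enough for later replays.
        lens = [max_encoder_len] * num_reqs
    return lens


runtime = build_encoder_seq_lens({0: 300, 2: 120}, num_reqs=4)
capture = build_encoder_seq_lens({}, num_reqs=4, for_cudagraph_capture=True)
```

At runtime only scheduled requests carry a non-zero length, while during capture all padded slots are filled with the maximum.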
Author	SHA1	Message	Date
Richard Zou	9cd2cce17d	[torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding (#33624) Signed-off-by: Richard Zou <zou3519@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> (cherry picked from commit `5eac9a1b34`)	2026-02-03 00:07:18 -08:00
Kiersten Stokes	eec3546bba	[Misc][Build] Lazy load cv2 in nemotron_parse.py (#33189) Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com> (cherry picked from commit `9e138cb01d`)	2026-02-03 00:03:56 -08:00
zaristei2	7c023baf58	Patch Protobuf for CVE 2026-0994 (#33619) Signed-off-by: Zachary Aristei <zaristei@nvidia.com> Co-authored-by: Zachary Aristei <zaristei@nvidia.com>	2026-02-03 00:03:14 -08:00
zaristei2	099a787ee2	Patch aiohttp for CVE-2025-69223 (#33621) Signed-off-by: Zachary Aristei <zaristei@nvidia.com> Co-authored-by: Zachary Aristei <zaristei@nvidia.com>	2026-02-03 00:02:39 -08:00
Zhewen Li	31a64c63a8	[Release] Fix format and cherry-pick (#33618) Signed-off-by: zhewenli <zhewen@inferact.ai> Co-authored-by: zhewenli <zhewen@inferact.ai>	2026-02-02 16:19:05 -08:00
Zhewen Li	57eae2f891	[Release] patch step3p5 attention class in v0.15.1 release (#33602) Signed-off-by: zhewenli <zhewen@inferact.ai> Co-authored-by: zhewenli <zhewen@inferact.ai>	2026-02-02 14:54:08 -08:00
Yifan Qiao	f0d005864a	[Fix] prefix cache hit rate == 0 bug with gpt-oss style models (#33524) Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu> (cherry picked from commit `a01ef3fa51`)	2026-02-02 10:31:50 -08:00
Robert Shaw	94cbe0a328	[Nightly CI] Remove CT Model (#33530) Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com> (cherry picked from commit `318b120766`)	2026-02-02 02:17:42 -08:00
csy0225	8b45c58fe9	[Models] Step-3.5-Flash (#33523) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: i-zhangmingming <i-zhangmingming@stepfun.com> Co-authored-by: xiewuxun <xiewuxun@stepfun.com> Co-authored-by: zetaohong <i-hongzetao@stepfun.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> (cherry picked from commit `c3b40dc3e7`)	2026-02-02 02:16:23 -08:00
Greg Pereira	c7039a80b8	pin LMCache to v0.3.9 or greater with vLLM v0.15.0 (#33440) Signed-off-by: greg pereira <grpereir@redhat.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> (cherry picked from commit `d6416fdde9`)	2026-02-02 00:17:01 -08:00
René Honig	15ebd0cedf	fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 MoE kernels (#33417) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> (cherry picked from commit `079781177a`)	2026-02-02 00:15:22 -08:00
Luka Govedič	2915268369	[fix][torch.compile] Fix cold-start compilation time increase by adding kv cache update to splitting ops (#33441) Signed-off-by: Luka Govedič <lgovedic@redhat.com> Co-authored-by: Richard Zou <zou3519@gmail.com> (cherry picked from commit `15f40b20aa`)	2026-02-02 00:14:07 -08:00
Lucas Wilkinson	d984d664cc	[BugFix] Fix whisper FA2 + full cudagraphs (#33360) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> (cherry picked from commit `0a3c71e7e5`)	2026-02-02 00:13:57 -08:00
Gregory Shtrasberg	5f45b0b7e0	[Bugfix][ROCm] Fixing the skinny gemm dispatch logic from #32831 (#33366) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> (cherry picked from commit `31aedfe7d6`)	2026-02-02 00:13:45 -08:00
Kevin H. Luu	a2dba556db	[release] Minor fixes to release annotation and wheel upload (#33129) Signed-off-by: khluu <khluu000@gmail.com> (cherry picked from commit `2284461d02`)	2026-02-02 00:13:34 -08:00
Michael Goin	6ff16b77f8	[Bugfix] Enable Triton MoE for FP8 per-tensor dynamic (#33300) Signed-off-by: mgoin <mgoin64@gmail.com> (cherry picked from commit `bfb9bdaf3f`)	2026-02-02 00:13:23 -08:00
wang.yuqi	1ed963d43a	[Bugfix] Fix Qwen3-VL-Reranker load. (#33298) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> (cherry picked from commit `abb34ac43a`)	2026-02-02 00:13:12 -08:00
Michael Goin	39e8b49378	[Bugfix] Register fp8 cutlass_group_gemm as supported for only SM90+SM100 (#33285) Signed-off-by: mgoin <mgoin64@gmail.com> (cherry picked from commit `1bd47d6e5a`)	2026-02-02 00:12:58 -08:00
TJian	f176443446	[Release] [CI] Optim release pipeline (#33156) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> (cherry picked from commit `f9d03599ef`)	2026-01-28 22:47:10 -08:00
Or Ozeri	fe18ce4d3f	Revert "Enable Cross layers KV cache layout at NIXL Connector (#30207)" (#33241) Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Kevin H. Luu <khluu000@gmail.com> (cherry picked from commit `2e8de86777`)	2026-01-28 11:44:59 -08:00
@@ -1,2 +1,2 @@
-lmcache
+lmcache >= 0.3.9
 nixl >= 0.7.1 # Required for disaggregated prefill