Compare commits

555 Commits

Author SHA1 Message Date
Shengqi Chen
d7de043d55 [CI] fix version comparsion and exclusion patterns in upload-release-wheels.sh (#32971)
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
(cherry picked from commit 136c499f6e)
2026-01-23 14:22:49 -08:00
Nicolò Lucchesi
4dc11b06d3 [Bugfix] Fix Whisper/encoder-decoder GPU memory leak (#32789)
Signed-off-by: NickLucche <nlucches@redhat.com>
(cherry picked from commit ea6102b85d)
2026-01-23 02:53:12 -08:00
Isotr0py
2bd95d803a [Misc] Bump opencv-python dependecy version to 4.13 (#32668)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
(cherry picked from commit 444e2e7e1f)
2026-01-23 02:52:47 -08:00
Isotr0py
f46d576c54 [Misc] Replace urllib's urlparse with urllib3's parse_url (#32746)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
(cherry picked from commit 8ebf271bb6)
2026-01-23 02:51:53 -08:00
Shengqi Chen
d68209402d [build] fix cu130 related release pipeline steps and publish as nightly image (#32522)
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
(cherry picked from commit 965765aef9)
2026-01-17 18:38:46 -08:00
Shengqi Chen
b17039bccc [CI] Implement uploading to PyPI and GitHub in the release pipeline, enable release image building for CUDA 13.0 (#31032)
(cherry picked from commit 8e61425ee6)
2026-01-16 21:04:48 -08:00
Cyrus Leung
48b67ba75f [Frontend] Standardize use of create_error_response (#32319)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-16 11:35:10 +00:00
TJian
09f4264a55 [Bugfix] Fix ROCm dockerfiles (#32447)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2026-01-16 10:50:00 +08:00
Matthew Bonanni
7f42dc20bb [CI] Fix LM Eval Large Models (H100) (#32423)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
(cherry picked from commit bcf2333cd6)
2026-01-15 18:00:21 -08:00
TJian
c2a37a3cf8 Cherry pick [ROCm] [CI] [Release] Rocm wheel pipeline with sccache #32264
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
2026-01-15 17:59:58 -08:00
Michael Goin
0e31fc7996 [UX] Use kv_offloading_backend=native by default (#32421)
Signed-off-by: mgoin <mgoin64@gmail.com>
(cherry picked from commit 1be5a73571)
2026-01-15 17:55:20 -08:00
Pleaplusone
6ac0fcf416 [ROCm][Bugfix] Disable hip sampler to fix deepseek's accuracy issue on ROCm (#32413)
Signed-off-by: ganyi <ygan@amd.com>
(cherry picked from commit 77c16df31d)
2026-01-15 17:55:06 -08:00
Douglas Lehr
b62249725c [ROCM] Add ROCm image build to release pipeline (#31995)
Signed-off-by: Doug Lehr <douglehr@amd.com>
Co-authored-by: Doug Lehr <douglehr@amd.com>
(cherry picked from commit c5891b5430)
2026-01-15 17:54:47 -08:00
vllmellm
1b57275207 [Bugfix][ROCm][performance] Resolve the performance regression issue of the Qwen3-Next-80B-A3B-Thinking under rocm_atten (#32336)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
(cherry picked from commit e27078ea80)
2026-01-15 17:54:01 -08:00
Martin Hickey
2c24bc6996 [BugFix] [KVConnector] Fix KV events for LMCache connector (#32169)
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-13 10:56:23 -08:00
Cyrus Leung
0aa8c40552 [Bugfix] Replace PoolingParams.normalize with use_activation (#32243)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-13 10:56:23 -08:00
Andreas Karatzas
11b6af5280 [ROCm][Bugfix] Fix Mamba batched decode producing incorrect output (#32099)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-13 05:46:53 +00:00
Wentao Ye
2a719e0865 [Perf] Optimize requests abort (#32211)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-13 04:11:37 +00:00
Andrew Bennett
f243abc92d Fix various typos found in docs (#32212)
Signed-off-by: Andrew Bennett <potatosaladx@meta.com>
2026-01-13 03:41:47 +00:00
Sanghoon Yoon
60b77e1463 [Frontend] Add reasoning_effort to OpenAIServing._preprocess_chat() (#31956)
Signed-off-by: Sanghoon Yoon <seanyoon@kakao.com>
2026-01-13 03:21:49 +00:00
cjackal
15b33ff064 [Misc] improve warning/assert messages (#32226)
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
2026-01-13 03:11:23 +00:00
Nick Hill
c6bb5b5603 [BugFix] Fix engine crash caused by chat tools + response_format (#32127)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-13 10:33:14 +08:00
Nick Hill
9273a427b5 [Misc] Allow enabling NCCL for DP sync when async scheduling (#32197)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-13 02:03:08 +00:00
Cyrus Leung
78d13ea9de [Model] Handle trust_remote_code for transformers backend (#32194)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-13 09:30:12 +08:00
Andrew Xia
a307ac0734 [responsesAPI] add unit test for optional function tool call id (#32036)
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
2026-01-12 16:14:54 -08:00
Divakar Verma
a28d9f4470 [ROCm][CI] Handle pytest status code 5 when a shard isn't allocated any tests (#32040)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2026-01-12 17:35:49 -05:00
xuebwang-amd
629584bfc9 [Kernel][MoE] fix computation order of MoE weight multiplication and improve flow (#31962)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
2026-01-12 17:17:30 -05:00
Woosuk Kwon
0a7dd23754 [Model Runner V2] Add support for M-RoPE (#32143)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-12 13:37:43 -08:00
Woosuk Kwon
dec28688c5 [Model Runner V2] Minor refactor for logit_bias (#32209)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-12 13:08:30 -08:00
Vadim Gimpelson
9f430c94bd [BUGFIX] Add missed remaping of the names of fp8 kv-scale (#32199)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-01-12 20:42:06 +00:00
Nicolò Lucchesi
f8bd8394e3 [NIXL][Bugfix] Failure logging overhaul + early metadata free on failure (#32031)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-12 20:38:49 +00:00
Woosuk Kwon
ca81811bfe [Model Runner V2] Support logit_bias, allowed_token_ids, min_tokens (#32163)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-12 11:31:10 -08:00
Lucas Kabela
ad8818bb5e [Misc][BE] Type coverage for vllm/compilation [3/3] (#31748)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2026-01-12 19:24:38 +00:00
Nicolò Lucchesi
08e8e99ce7 [Misc] Change log level for batch queue log (#32192)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-12 18:59:31 +00:00
Or Ozeri
2be765b68a [BugFix] scheduler: Fix ordering preserving of skipped requests (#32173)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-12 18:39:38 +00:00
Roger Wang
16abe6b85a [Misc] Set default torch num threads for input processing (#31879)
Signed-off-by: Roger Wang <hey@rogerw.io>
2026-01-12 10:28:16 -08:00
Ilya Markov
1eb61ab34b [Refactor] EPLB rebalance algo to NumPy (#30697)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
2026-01-12 18:13:23 +00:00
Kyungmin Lee
3d962d72ab [BugFix] fix FusedMoE.make_expert_params_mapping in EXAONE-MoE (#32196)
Signed-off-by: lkm2835 <lkm2835@gmail.com>
2026-01-12 10:00:45 -08:00
Matthew Bonanni
20228cb851 [3/N][Attention] Move AttentionMetadata-related code from utils.py to backend.py (#32054)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-12 09:13:56 -08:00
Cyrus Leung
7c0d3c5152 [Benchmark] Share data between SLA runs (#32184)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-13 01:12:22 +08:00
Nicolò Lucchesi
5b68107411 [Misc][PD] Fix get_attn_backend usage in transfer connectors (#31988)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-12 18:10:05 +01:00
Asaf Joseph Gardin
8fb2c135be [Bugfix] Fix stale SSM state for new Mamba requests scheduled as decode (#32118)
Signed-off-by: Josephasafg <ajgard7@gmail.com>
2026-01-12 17:02:38 +00:00
Cyrus Leung
8863c2b25c [Model] Standardize pooling heads (#32148)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-12 17:01:49 +00:00
danielafrimi
3f72639d36 [FIX] Add NO_MUL activation support for modular kernel path (#31528)
Signed-off-by: dafrimi <dafrimi@nvidia.com>
Signed-off-by: <>
Co-authored-by: root <root@gpu-267.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: root <root@gpu-537.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: root <root@pool0-01777.cm.cluster>
2026-01-12 11:55:49 -05:00
Jaehyun An
6bc9c8473e [MODEL] New model support for kakaocorp/kanana-1.5-v-3b-instruct (#29384)
Signed-off-by: Jaehyun An <steve.ai@kakaocorp.com>
2026-01-12 16:39:02 +00:00
Kyungmin Lee
63ed2409e8 Add K-EXAONE-236B-A23B (#31621)
Signed-off-by: lkm2835 <lkm2835@gmail.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: lgai-exaone <exaonemodels@lgresearch.ai>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-12 16:30:50 +00:00
Andy Zhang
95e53d907c doc: Update model references in supported_models.md (#32188)
2026-01-12 08:15:28 -08:00
TJian
0346396e94 [ROCm] [Bugfix] Fix order of mori build in Dockerfile.rocm_base (#32179)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2026-01-12 15:33:21 +00:00
Andy Zhang
e68b0dad8b doc: Update model name for Qwen3-Coder in documentation (#32185)
Signed-off-by: Andy Zhang <xiazhang@microsoft.com>
2026-01-12 07:10:50 -08:00
Or Ozeri
9cddbdba6d OffloadingConnector: Add cpu_bytes_to_use configuration (#24498)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-12 15:00:43 +00:00
Hongxin Xu
49e6b86c91 [Feature] Support recording expert indices for rollout router replay (#28284)
Signed-off-by: xhx1022 <1737006628@qq.com>
Signed-off-by: Hongxin Xu <70438206+xhx1022@users.noreply.github.com>
Signed-off-by: arlenxu <arlenxu@tencent.com>
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: arlenxu <arlenxu@tencent.com>
2026-01-12 06:23:04 -08:00
dtc
0565f1fdec [P/D] Refactor mooncake connector sender thread using async coroutines (#31573)
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
2026-01-12 12:35:35 +00:00
Isotr0py
9dbe1fe960 [Bugfix] Fix missing scale passing for encoder Triton Attention implementation (#32149)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-12 11:13:41 +00:00
RickyChen / 陳昭儒
a5f89ae296 [Doc] Add documentation for offline API docs feature (#32134)
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com>
2026-01-12 10:33:48 +00:00
Jee Jee Li
05e8981234 [Doc] Improve LoRA docs (#32159)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-12 02:19:17 -08:00
XlKsyt
899541bdb1 [doc] fix broken links (#32158)
Signed-off-by: minimAluminiumalism <caixuesen@outlook.com>
2026-01-12 10:18:38 +00:00
daniel-salib
d7b2e57097 [Frontend] Fix Flaky MCP Streaming Test (#32153)
Signed-off-by: Daniel Salib <danielsalib@meta.com>
2026-01-12 18:03:32 +08:00
Andika Rachman
5e034f2e3d [cpu][bench] Add Fused MoE Micro Benchmark for CPU Backend (#32092)
Signed-off-by: andikarachman <andika.rachman.y@gmail.com>
2026-01-12 10:03:28 +00:00
Nicolò Lucchesi
22970c1626 [Misc] Disable default --ready-check-timeout-sec extra call in vllm bench (#30975)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-12 01:58:21 -08:00
Cyrus Leung
600aaab8d6 [Model] Remove incorrect SupportsPP from MTP models (#32150)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-12 01:19:30 -08:00
wang.yuqi
60446cd684 [Model] Improve multimodal pooling examples (#32085)
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-01-12 07:54:09 +00:00
Cyrus Leung
9101dc756c [Model] Avoid hardcoding pooling type (#32119)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-11 21:28:12 -08:00
Woosuk Kwon
025a32f9ed [Model Runner V2] Remove async barrier (#32083)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-11 20:24:30 -08:00
Woosuk Kwon
19504ac07f [Model Runner V2] Skip building deprecated fields in attn metadata (#32132)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-11 14:31:04 -08:00
Jiangyun Zhu
3df619ac94 [CI] fix test_concat_and_cache_mla_rope_fused (#32117)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2026-01-11 15:11:11 +00:00
Ning Xie
d74132ca3b fix offline inference chat response prompt (#32088)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-11 14:01:18 +00:00
maang
a34abc49b7 [FixBug] Improve exception string in tensorizer.py (#31680)
Signed-off-by: maang <maang_h@163.com>
Signed-off-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-11 05:01:53 -08:00
rongfu.leng
d70249e2e9 [Misc] fix this log format not space (#32112)
Signed-off-by: lengrongfu <lenronfu@gmail.com>
2026-01-11 05:01:16 -08:00
Cyrus Leung
a374532111 [CI/Build] Separate out flaky responses API tests (#32110)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-11 05:01:12 -08:00
Isotr0py
cee7436a26 [Misc] Make scipy as optional audio/benchmark dependency (#32096)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-11 00:18:57 -08:00
Or Ozeri
4c16ba617f [KVConnector] OffloadingConnector: Fix bug in handling of preemptions (#29870)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-11 08:05:36 +00:00
Matt
bde57ab2ed [Hardware][AMD][CI][Bugfix] Fix AMD Quantization test group (#31713)
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
2026-01-10 23:19:46 -08:00
Fadi Arafeh
9103ed1696 [CPU][BugFix] Disable AOT Compile for CPU (#32037)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2026-01-10 23:15:49 -08:00
Laith Sakka
46eb30f519 make assume_32_bit_indexing configurable (#32044)
Signed-off-by: Laith Sakka <lsakka@meta.com>
2026-01-10 23:15:46 -08:00
Andy Liu
0dd63639be [MTP][GLM][Bugfix] Fixed .weight_scale loading logic that dropped MTP prediction accuracy with fp8+mtp (#32101)
Signed-off-by: Andy Liu <andyliu@roblox.com>
2026-01-10 23:14:54 -08:00
Cyrus Leung
ef96fa3f1f [Benchmark][2/2] Use spline interpolation to tune SLA variables (#32095)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-10 20:27:27 -08:00
Or Ozeri
2a4dbe24ea [BugFix] Wait for compute before offloading KV to CPU (#31341)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-10 22:25:08 +00:00
RickyChen / 陳昭儒
8020a60402 [Bugfix] Fix Qwen3-VL-Reranker model loading for sequence classification (#32089)
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-10 12:40:09 -08:00
Vadim Gimpelson
e15a5ff07b [MISC] Add strict contiguity check for FlashInfer attention tensors (#32008)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
2026-01-10 12:40:05 -08:00
Vensen
6ea001cfb7 [Bugfix][Quantization] Ensure input contiguity in per_token_quant_int8 (#31637)
Signed-off-by: vensen <vensenmu@gmail.com>
2026-01-10 12:40:02 -08:00
shyeh25
1c46dea001 Revert "[Kernels][FI] Skip trtllm attention when num_kv_heads=1 (#308… (#31617)
Signed-off-by: shyeh25 <206795756+shyeh25@users.noreply.github.com>
2026-01-10 12:39:59 -08:00
Or Ozeri
028599739d [BugFix] scheduler: Fix resuming of preempted requests after async load (#31583)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-10 12:39:25 -08:00
gnovack
d1fd802fa3 fused_moe_kernel - cast accumulator after applying router weights (#32002)
Signed-off-by: gnovack <gnovack@amazon.com>
2026-01-11 04:36:45 +08:00
Xin Yang
543c23be78 [LoRA][Perf] Improve FusedMoE LoRA performance for small rank (#32019)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-01-10 11:04:18 -08:00
jvlunteren
b8bf5c45bb [Kernel] Optimize Sliding Window Attention in 3D Triton Kernel (#31984)
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>
2026-01-10 18:13:44 +00:00
Michael Goin
e6c6f2c79d [Quant] Support MXFP4 W4A16 for compressed-tensors dense models (#31926)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2026-01-10 06:44:35 -08:00
Jeremy Teboul
07286ec5a6 [Bugfix] Fix integer overflow in Gemma3n audio processing (#31657)
Signed-off-by: Jeremy Teboul <jeremyte@meta.com>
2026-01-10 17:52:53 +08:00
Ning Xie
14fc7a68c7 [Bugfix] fix offline chat output prompt (#32076)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-10 07:50:57 +00:00
Cyrus Leung
5f2385a4c8 [Benchmark][1/2] Generalize SLA criterion validation from binary flags to margins (#32075)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-10 07:11:03 +00:00
Frelam
a01a1c0d69 [Bugfix] fix encoder cache leak of waiting requests in scheduler to solve stuck in CPU scheduling (#31857)
Signed-off-by: frelam <frelam112233@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-01-10 06:27:58 +00:00
Lucas Wilkinson
da6709c9fe [Misc] Delay deprecation of CommonAttentionMetadata properties (#32074)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-09 21:06:44 -08:00
Andreas Karatzas
d83becd503 [ROCm][CI] Fix flaky test_function_calling_with_stream and reduce schema test examples (#32063)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-10 05:02:35 +00:00
roikoren755
0c9614876e Update modelopt KV cache quantization resolution to new scheme (#31895)
Signed-off-by: Roi Koren <roik@nvidia.com>
2026-01-10 04:54:13 +00:00
Cyrus Leung
583a90e005 [Refactor] Separate sequence and token pooling types (#32026)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-10 04:53:24 +00:00
maang
52d428295d [Core] Refactor ColumnParallelLinear: remove unused parameter and optimize forward (#31939)
Signed-off-by: maang <maang_h@163.com>
2026-01-10 04:19:49 +00:00
Kevin McKay
c60578de0a [Bugfix][Hardware][AMD] Use dynamic WARP_SIZE in sampler vectorized_process (#31295)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2026-01-10 03:57:38 +00:00
PatrykSaffer
80fead8bf6 Fuse RoPE and MLA KV-cache write (#25774)
Signed-off-by: Patryk Saffer <patryk.saffer99@gmail.com>
Signed-off-by: PatrykSaffer <patryk.saffer@mistral.ai>
Co-authored-by: Patryk Saffer <patryk.saffer99@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-01-09 19:18:37 -08:00
Akshat Shrivastava
e45946bd91 feature/issac 0.2 (#31550)
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-01-10 03:18:05 +00:00
Lucas Kabela
ea6d067a2a [Misc][LLaMa4] Compile LLaMa Vision Encoder (#30709)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2026-01-09 22:01:38 -05:00
Ning Xie
abd9224280 resolve pydantic error in startup benchmark (#31348)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-10 02:41:27 +00:00
Kevin McKay
4dc0d606b7 [Bugfix] Narrow broad exceptions in compilation backends (#31616)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-01-09 21:39:22 -05:00
Micah Williamson
ac0675ff6b [CI] Allow Deprecated Quantization For LM Eval Tests (#32065)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2026-01-09 19:10:47 -07:00
Wentao Ye
e18464a57d [Perf] Optimize async scheduling placeholder using empty (#32056)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-10 00:46:11 +00:00
Russell Bryant
1963245ed1 [Core] Use weights_only=True with torch.load (#32045)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2026-01-10 00:28:57 +00:00
Matthew Bonanni
0308901975 [2/N][Attention] Fix pre-commit errors (#32052)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-10 00:27:15 +00:00
Lucas Kabela
aaf4b70aae [Misc][BE] Type coverage for vllm/compilation [2/3] (#31744)
2026-01-09 18:30:38 -05:00
Nick Hill
3adffd5b90 [Misc] Enable async scheduling by default with spec decoding (#31998)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-09 23:09:34 +00:00
zhrrr
97ba96fbe9 [perf][async] support non cpu sync get logprob tensors for spec (#31336)
Signed-off-by: izhuhaoran <izhuhaoran@qq.com>
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
2026-01-09 21:24:51 +00:00
Chendi.Xue
94578127a4 [NIXL] refine decoder side post process for heterogeneous BlockSize and kv_layout (#30275)
2026-01-09 21:22:19 +00:00
Matthew Bonanni
2612ba9285 [1/N][Attention] Restructure attention: move files (#31916)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-09 13:10:24 -08:00
Andrew Xia
1f8b7c536b [responsesAPI] fix incomplete_messages for simple/parsable context (#31836)
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
2026-01-09 21:00:57 +00:00
Lucas Wilkinson
0a0aa07747 [Quant] Make static quant support all group shapes (#30833)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-09 12:49:27 -08:00
jiahanc
f9e2a75a1e [fix] add cutedsl to global sf (#32001)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
2026-01-09 12:03:02 -08:00
Runkai Tao
a4d5d663e2 Add unpermute-aware fused MoE path and small-batch fallback (#29354)
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-09 12:58:39 -07:00
Jeremy Teboul
657e9c0e18 [Fix] Introduce audio channels spec (#31595)
Signed-off-by: Jeremy Teboul <jeremyte@meta.com>
2026-01-09 19:34:51 +00:00
Wentao Ye
308feab33f [Perf] Optimize cutlass moe problem size calculation, 5.3% E2E Throughput improvement, 2.2% TTFT improvement (#31830)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2026-01-09 11:13:43 -08:00
Wentao Ye
28ae32a5d3 [Refactor] Remove numpy split in async scheduling (#32034)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-09 19:09:02 +00:00
Andrew Xia
f32c629eb4 [Frontend][gpt-oss] Allow system message to overwrite model identity (#31737)
Signed-off-by: lacora <hyelacora@gmail.com>
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: lacora <hyelacora@gmail.com>
Co-authored-by: Andrew Xia <axia@fb.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-01-09 14:03:57 -05:00
Yifan Qiao
cd4a95e3aa [Feat][Core] Support multiple KV cache groups in Hybrid KV Coordinator (#31707)
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
2026-01-09 10:53:20 -08:00
Michael Goin
d5ec6c056f [UX] Add vLLM model inspection view (#29450)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-09 10:12:35 -07:00
Shanshan Shen
08d954f036 [Doc] Add developer guide for CustomOp (#30886)
Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-09 16:21:11 +00:00
Kevin Šuc
ac9f9330e6 Rename --exclude-log-deltas to --enable-log-deltas (#32020)
Signed-off-by: Catacomba <kevinsuc16@gmail.com>
2026-01-09 15:30:40 +00:00
Isotr0py
2d0c5b630e [Doc] Remove hardcoded Whisper in example openai translation client (#32027)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-09 14:44:52 +00:00
Michael Goin
34cd32fe30 [Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe (#31832)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2026-01-09 07:40:33 -07:00
R3hankhan
8e27663b6a [CPU] Add head sizes 80 and 112 with vec16 fallback (#31968)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
2026-01-09 22:14:46 +08:00
maang
7cdf7e2fe0 [Model] Remove redundant None check in DeepSeekOCR image input processing (#32016)
Signed-off-by: maang <maang_h@163.com>
2026-01-09 06:12:44 -08:00
Adolfo Victoria
bbf80ede43 Fix type error (#31999)
Signed-off-by: Adolfo Victoria <adolfokarim@gmail.com>
Co-authored-by: Adolfo Victoria <adovi@meta.com>
2026-01-09 22:03:32 +08:00
inkcherry
4505849b30 [ROCm][PD] add moriio kv connector. (#29304)
Signed-off-by: inkcherry <mingzhi.liu@amd.com>
2026-01-09 14:01:57 +00:00
Roger Wang
db07433ce5 [Misc] Skip hashing kwargs if value is None (#32025)
Signed-off-by: Roger Wang <hey@rogerw.io>
2026-01-09 13:20:59 +00:00
Andreas Karatzas
e02706d2d2 [ROCm][CI][V1] Fix nixl_connector test failure and achieve CUDA parity in test_async_scheduling (#32000)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-09 20:48:32 +08:00
Sophie du Couédic
b474782ad7 [Feature][Benchmarks] Custom dataset: read output length from dataset (#31881)
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>
2026-01-09 12:40:59 +00:00
Bofeng Xue
55212c1404 fix: remove duplicate engine_id check in nixl_connector (#31948)
Signed-off-by: Bofeng BF1 Xue <xuebf1@Lenovo.com>
Co-authored-by: Bofeng BF1 Xue <xuebf1@Lenovo.com>
2026-01-09 12:13:17 +00:00
Xin Yang
e7b68f4d6c [Bugfix] Fix Triton FusedMoE LoRA (#30585)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-01-09 11:46:59 +00:00
vllmellm
1a19e9cd87 [Bugfix][ROCm]Fix Qwen3-Next-80B-A3B-Thinking inference and optimize non-standard block size (544) support under rocm_atten (#31380)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2026-01-09 19:28:02 +08:00
Cyrus Leung
c8ed39b9dd [Model] Reorganize pooling layers (#31973)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-09 11:02:14 +00:00
Andreas Karatzas
020732800c [Bugfix] Fix OpenAPI schema test failures (#31921)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-09 10:56:20 +00:00
Alex Brooks
dc77cb7129 [Bugfix] Fix Var Length Batched Padding in Granite Speech (#31906)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2026-01-09 10:28:43 +00:00
gnovack
bde38c11df fix lora moe sharding when rank < max_lora_rank (#31994)
Signed-off-by: gnovack <gnovack@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-09 14:43:25 +08:00
Xin Yang
707b240d7e [Bugfix] Fix FusedMoE LoRA w2_output_size (#31949)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-01-09 00:54:05 -05:00
Nick Hill
29ce48221c [Cleanup] Remove obsolete spec decoding compatibility logic (#32003)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-09 05:44:18 +00:00
TJian
7a05d2dc65 [CI] [ROCm] Fix tests/entrypoints/test_grpc_server.py on ROCm (#31970)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2026-01-09 12:54:20 +08:00
Divakar Verma
a1648c4045 [ROCm][CI] Fix test_token_classification.py::test_bert_models (#31993)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2026-01-09 04:04:33 +00:00
RioS
e2d49ec2a4 [Bugfix] missing tokens occur in harmony streaming (#30437)
Signed-off-by: RioS <aa248424@gmail.com>
Signed-off-by: Ri0S <aa248424@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2026-01-09 03:59:34 +00:00
Xin Yang
8413868dab [Bugfix] Fix typo in FusedMoE LoRA reshape comment (#31992)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-01-08 18:46:05 -08:00
zhrrr
8ff4a99566 [Async][Feat] support apply penalty or bad_words for async + spec (#30495)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: izhuhaoran <izhuhaoran@qq.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
2026-01-09 02:31:50 +00:00
daniel-salib
a4ec0c5595 [Frontend] Add MCP tool streaming support to Responses API (#31761)
Signed-off-by: Daniel Salib <danielsalib@meta.com>
2026-01-09 09:19:34 +08:00
Robert Shaw
0fa8dd24d2 [Bugfix] Fix Typo from NVFP4 Refactor (#31977)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-01-08 16:18:50 -08:00
Max Hu
6ebe34d6fa [Feature] Add iteration level logging and enhance nvtx marker (#31193)
Signed-off-by: Max Hu <maxhu@nvidia.com>
Signed-off-by: Max Hu <hyoung2991@gmail.com>
Co-authored-by: Max Hu <maxhu@nvidia.com>
2026-01-09 00:13:39 +00:00
Nick Hill
11cec296dd [BugFix] Add spec-decode-incompatible request param validation (#31982)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-09 00:08:21 +00:00
Robert Shaw
5825bbc1f7 [Quantization] Deprecate Long Tail of Schemes (#31688)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-08 19:07:45 -05:00
Yongye Zhu
d62cfe546d [MoE Refactoring][Bugfix]Wrap WNA16 Triton kernel into mk and change compressed tensor kernel selection (#31752)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-01-08 19:01:30 -05:00
Lucas Wilkinson
6cdf015c3c [Misc] Fix Current vLLM config is not set. warnings, assert to avoid issues in the future (#31747)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2026-01-08 15:20:49 -08:00
Dipika Sikka
5d3b6097ad [Compressed-Tensors] Simplify NVFP4 Conditions, enable marlin support for NVFP4A16 MoEs (#30881)
2026-01-08 17:45:17 -05:00
bnellnm
e74698c27a [Misc][Refactor] Add FusedMoERouter object (#30519)
Signed-off-by: Bill Nell <bnell@redhat.com>
2026-01-08 20:52:55 +00:00
Cyrus Leung
aa125ecf0e [Frontend] Improve error message (#31987)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-08 20:07:03 +00:00
Lucas Kabela
f16bfbe5bc [Documentation][torch.compile] Add documentation for torch.compile + multimodal encoders (#31627)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2026-01-08 14:33:24 -05:00
Michael Goin
87e07a6b46 Revert "feat(moe): Add is_act_and_mul=False support for Triton MoE kernels" (#31978)
2026-01-08 11:31:53 -08:00
Woosuk Kwon
7508243249 [Model Runner V2] Simplify BlockTables with UVA (#31965)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-08 10:24:26 -08:00
Nicolò Lucchesi
83e1c76dbe [CI][ROCm] Fix NIXL tests on ROCm (#31728)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-09 01:34:43 +08:00
Nishidha Panpaliya
a563866b48 Fix ijson build for Power. (#31702)
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
2026-01-08 17:12:33 +00:00
Nick Hill
a3d909ad2b [Misc] Tidy up some spec decode logic in GPUModelRunner (#31591)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-08 09:10:07 -08:00
Jee Jee Li
49568d5cf9 [Doc] Improve MM models LoRA notes (#31979)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-08 08:55:22 -08:00
danisereb
b8112c1d85 [Bugfix] Fix vllm serve failure with Nemotron Nano V3 FP8 (#31960)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
2026-01-08 16:08:37 +00:00
Chauncey
eaba8ece77 [Bugfix]: Fix Step3ReasoningParser missing is_reasoning_end_streaming (#31969)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-01-08 15:28:13 +00:00
yxing-bj
fe86be66c5 [Model] Support IQuestCoder model (#31575)
Signed-off-by: yxing <yxing@iquestlab.com>
2026-01-08 14:42:57 +00:00
Chauncey
1da3a5441a [Docs]: update claude code url (#31971)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-01-08 14:04:55 +00:00
TJian
72c068b8e0 [CI] [Bugfix] Fix unbounded variable in run-multi-node-test.sh (#31967)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2026-01-08 05:42:01 -08:00
Mary
7645bc524b [OpenAI] Fix tool_choice=required streaming when output has trailing extra data (#31610)
Signed-off-by: maylikenoother <ogedengbemary19@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2026-01-08 21:01:42 +08:00
Ce Zhao
1123a87892 [Model] Enable LoRA support for Pixtral (#31724)
Signed-off-by: <>
Signed-off-by: 赵策 <alcor@zhaocedeMacBook-Air.local>
Signed-off-by: 赵策 <alcor@mac.mynetworksettings.com>
Co-authored-by: 赵策 <alcor@mac.mynetworksettings.com>
2026-01-08 05:00:57 -08:00
tianshu-Michael-yu
03fd76c570 [Model] Add LFM2-VL model support (#31758)
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-08 05:00:27 -08:00
Bijaya Dangol
59d260f5e4 [Model] Add Grok-2 (#31847)
Signed-off-by: dangoldbj <dangoldbj23@gmail.com>
2026-01-08 04:59:48 -08:00
Patrick von Platen
18d4e481d0 [Voxtral] Fix speech transcription api (#31388)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: bk-201 <joy25810@foxmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: prashanth058 <prashanth.dannamaneni@uipath.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: bk-201 <joy25810@foxmail.com>
Co-authored-by: prashanth058 <prashanth.dannamaneni@uipath.com>
Co-authored-by: Anexdeus <5142168@mail.ru>
Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
2026-01-08 18:34:19 +08:00
Isotr0py
2972a05473 [MM Encoder]: Make MMEncoderAttention's scale takes effect properly (#31950)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-08 02:33:48 -08:00
Cyrus Leung
5576227bc1 [Model] Standardize common vision encoders (#31947)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-08 02:33:16 -08:00
Cyrus Leung
d1b6fe007f [Chore] Further cleanup pooler (#31951)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-08 02:16:21 -08:00
omer-dayan
04a49669d1 RayLLM Bugfix - Preserve obj store URL for multi engine_config creation (#30803)
Signed-off-by: Omer Dayan <omdayan@nvidia.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-08 10:00:25 +00:00
BingjiaWang
96fcd3c267 [Misc] Support qwen3-next lora (#31719) 2026-01-08 09:27:50 +00:00
DevByteAI
1f214290d6 fix(compile): apply partition wrapper when loading AOT cached functions (#31536)
Signed-off-by: Devbyteai <abud6673@gmail.com>
Signed-off-by: DevByteAI <161969603+devbyteai@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2026-01-08 17:27:26 +08:00
Ryan Rock
8cbdc7eb94 [CI/Build] Enable test_kv_cache_events_dp for AMD (#31834)
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
2026-01-08 09:00:24 +00:00
Lumosis
b634e619bb Decouple page_size_bytes calculation in AttentionSpec for TPU/RPA Compatibility. (#31635)
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com>
Signed-off-by: Lumosis <30372757+Lumosis@users.noreply.github.com>
2026-01-08 09:00:07 +00:00
Isotr0py
eac3b96ec0 [Models] Allow converting Qwen3-VL into Reranker model (#31890)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-08 08:10:15 +00:00
Zhiwei
573a1d1119 [ROCm]Skip test_torchao.py::test_pre_quantized_model on CDNA3 arch (#31905)
Signed-off-by: ZhiweiYan-96 <zhiwei.yan@amd.com>
2026-01-08 15:47:44 +08:00
Shang Wang
33156f56e0 [docker] A follow-up patch to fix #30913: [docker] install cuda13 version of lmcache and nixl (#31775)
Signed-off-by: Shang Wang <shangw@nvidia.com>
2026-01-07 23:47:02 -08:00
Rabi Mishra
107cf8e92f fix(rocm): Add get_supported_kernel_block_sizes() to ROCM_ATTN (#31712)
Signed-off-by: rabi <ramishra@redhat.com>
2026-01-08 15:46:07 +08:00
Zyyeric
63baa28cf5 [Model] Enable LoRA support for tower and connector in GLM4-V (#31652)
Signed-off-by: Zyyeric <eric1976808123@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-08 15:45:53 +08:00
Andy Liu
e5173d3bac [Bugfix] Remove the num_hidden_layers override for glm4_moe (#31745) 2026-01-08 15:45:10 +08:00
prashanth058
d3235cb503 [Fix] Enable mm_processor_cache with vision LoRA (#31927)
Signed-off-by: prashanth058 <prashanth.dannamaneni@uipath.com>
2026-01-08 15:31:51 +08:00
Nick Hill
287b37cda4 [BugFix] Fix spec decoding edge case bugs (#31944)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-08 15:31:03 +08:00
Chang Su
791b2fc30a [grpc] Support gRPC server entrypoint (#30190)
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Signed-off-by: njhill <nickhill123@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: njhill <nickhill123@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2026-01-07 23:24:46 -08:00
Lucas Wilkinson
be6a81f31b [chore] Update FA commit (#30460)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-07 23:24:18 -08:00
Ronald
2ab441befe [platform] add dp_metadata arg to set_additional_forward_context (#31942)
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-01-08 06:56:44 +00:00
ShaanveerS
9572f74f15 [Model] Enable LoRA support for tower and connector in DotsOCR (#31825)
Signed-off-by: ShaanveerS <shaanver.singh@gmail.com>
2026-01-08 14:50:16 +08:00
Andreas Karatzas
5f2a473ff3 [ROCm][CI] v1 cpu offloading attention backend fix (#31833)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-08 14:37:50 +08:00
Michael Goin
6b2a672e47 [Doc] Add Claude code usage example (#31188)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-08 13:50:23 +08:00
rasmith
f1b1bea5c3 [CI][BugFix][AMD] Actually skip tests marked @pytest.mark.skip_v1 (#31873)
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2026-01-08 13:06:09 +08:00
Charlie Fu
cddbc2b4b2 [ROCm][CI] Add rocm support for run-multi-node-test.sh (#31922)
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-08 04:36:39 +00:00
Andreas Karatzas
087a138963 [ROCm][CI] Fix attention backend test flakiness from uninitialized KV cache memory (#31928)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-08 04:35:25 +00:00
Andreas Karatzas
c4041f37a4 [ROCm][LoRA] Fix MoE accuracy regression by preserving float32 router weight scaling (#31931)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-08 04:17:56 +00:00
Richard Zou
a79079feef [BugFix] Fix flakiness in test_eagle_dp for PyTorch 2.10 (#31915)
Signed-off-by: Richard Zou <zou3519@gmail.com>
2026-01-08 04:04:58 +00:00
Robert Shaw
9f6dcb71ae [MoE Refactor][16/N] Apply Refactor to NVFP4 (#31692)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
2026-01-08 03:46:27 +00:00
Andreas Karatzas
8dd2419fa9 [CI] Skip Qwen-VL in multimodal processing tests due to flaky external dependency (#31932)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-08 02:58:01 +00:00
Rabi Mishra
39d82005f7 fix(rocm): add early return in get_flash_attn_version for ROCm (#31286)
Signed-off-by: rabi <ramishra@redhat.com>
2026-01-08 10:28:07 +08:00
Rabi Mishra
25eef3dc2e feat(moe): Add is_act_and_mul=False support for Triton MoE kernels (#31645)
Signed-off-by: rabi <ramishra@redhat.com>
2026-01-08 10:27:09 +08:00
Matthew Bonanni
0d7667419f [0/N][Attention] Fix miscellaneous pre-commit issues (#31924)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-08 01:15:17 +00:00
Robert Shaw
5dcd7ef1f2 [MoE Refactor][15/N] Apply Refactor to Fp8 (#31415) 2026-01-07 19:42:33 -05:00
Elvir Crnčević
ffc0a2798b Add back missing DeepEP LL params (#31911)
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
2026-01-07 17:47:54 -05:00
Nick Hill
10ef65eded [BugFix] Fix bad words with speculative decoding (#31908)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-07 15:46:42 -05:00
Ilya Markov
6170d47d22 [EPLB] Optimize EPLB with numpy (#29499)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2026-01-07 15:21:35 -05:00
Xin Yang
0ada960a20 [Kernel] Support bias type in grouped_topk kernel (#31781)
Signed-off-by: Xin Yang <xyangx@amazon.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-01-07 12:16:32 -08:00
Ning Xie
c907d22158 [refactor] refactor memory constants usage (#31865)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-07 18:37:31 +00:00
Michael Goin
f347ac6c34 [Perf] Fuse stride preparation for NVFP4 cutlass_moe (#31837)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-07 13:31:26 -05:00
Festus Ayobami Owumi
05f47bd8d2 [Doc] Fix: Correct vLLM announcing blog post link in docs (#31868)
Signed-off-by: enfinity <festusowumi@gmail.com>
2026-01-07 10:06:42 -08:00
roikoren755
bf184a6621 Enable quantized attention in NemotronH models (#31898)
Signed-off-by: Roi Koren <roik@nvidia.com>
2026-01-07 17:37:19 +00:00
Jee Jee Li
30399cc725 UX: add vLLM env info in '/server_info' (#31899)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-07 17:13:02 +00:00
Kfir Toledo
b89443b8d9 [KVConnector]: Enable Cross-layers KV cache layout for MultiConnector (#30761)
Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
2026-01-07 16:59:43 +00:00
Marko Rosenmueller
1d9e9ae8a4 [Bugfix]: prevent leaking tokens in crash log (#30751)
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
2026-01-07 16:15:19 +00:00
Cyrus Leung
b7036c87a1 [Refactor] Clean up pooler modules (#31897)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-08 00:07:43 +08:00
Kate Cheng
cc6dafaef2 [Perf][Kernels] Enable FlashInfer DeepGEMM swapAB on SM90 (for W8A8 Linear Op) (#29213)
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Co-authored-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2026-01-07 10:53:54 -05:00
R3hankhan
1ab055efe6 [OpenAI] Extend VLLMValidationError to additional validation parameters (#31870)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
2026-01-07 14:45:49 +00:00
Cyrus Leung
b665bbc2d4 [Chore] Migrate V0 attention utils (#31891)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-07 13:44:36 +00:00
Jared Wen
974138751b [Refactor] GLM-ASR Modeling (#31779)
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-07 13:08:29 +00:00
vllmellm
41cfa50632 [ROCm][AITER] fix wrong argument passed to AITER flash_attn_varlen_func (#31880)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2026-01-07 11:25:03 +00:00
Andy Liu
d111bc53ad [Bugfix][MTP] Fix GLM4 MoE fp8 loading with MTP on (#31757)
Signed-off-by: Andy Liu <andyliu@roblox.com>
2026-01-07 09:18:52 +00:00
BlankR
0790f07695 [Misc] Improve error messages for unsupported types and parameters (#30593)
Signed-off-by: BlankR <hjyblanche@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-07 09:00:16 +00:00
maang
1f33e38e81 [Model] Cleanup: Remove redundant manual definition of make_empty_intermediate_tensors in GLM-4-MoE (#31869)
Signed-off-by: maang <maang_h@163.com>
2026-01-07 08:18:28 +00:00
sihao_li
59fe6f298e [XPU]fallback to TRITON_ATTN on xpu when use float32 dtype (#31762)
Signed-off-by: sihao.li <sihao.li@intel.com>
2026-01-07 08:10:29 +00:00
weiyu
e7596371a4 [Refactor][TPU] Remove torch_xla path and use tpu-inference (#30808)
Signed-off-by: Wei-Yu Lin <weiyulin@google.com>
Signed-off-by: weiyu <62784299+weiyu0824@users.noreply.github.com>
2026-01-07 16:07:16 +08:00
xuebwang-amd
0dd5dee9b9 [Bugfix][Kernel] fix bias adding in triton kernel implemented fused moe (#31676)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
2026-01-07 07:36:13 +00:00
Kevin McKay
4614c5a539 [Bugfix][Hardware][AMD] Consolidate FP8 min/max values helper function (#31106)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
Signed-off-by: Kevin McKay <kevin@example.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-07 06:55:03 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
482914849c [BugFix] LoRA: Support loading base_layer of experts (#31104)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2026-01-07 14:49:39 +08:00
tianshu-Michael-yu
efeaac92f2 [Bugfix] Fix race condition in async-scheduling for vlm model (#31841)
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com>
2026-01-07 06:45:10 +00:00
tjp_zju
55caa6051d refactor: find_loaded_library (#31866)
Signed-off-by: tjp_zju <tanjianpingzju1990@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-07 06:42:20 +00:00
Lucas Wilkinson
c7a79d41a0 [Attention][3/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties (#31850)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-07 13:31:34 +08:00
vllmellm
6409004b26 [ROCm][AITER] bugfix accuracy regression in ROCM_AITER_TRITON_MLA backend (#31816)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2026-01-07 05:04:53 +00:00
Cyrus Leung
aafd4d2354 [Chore] Try remove init_cached_hf_modules (#31786)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-07 12:34:04 +08:00
Jack Yang
0a2c2dc3f1 fixed mypy warnings for files vllm/v1/attention with TEMPORARY workaround (#31465)
Signed-off-by: Zhuohao Yang <zy242@cornell.edu>
Co-authored-by: Zhuohao Yang <zy242@cornell.edu>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-07 04:08:47 +00:00
Tyler Michael Smith
f09c5feb7c Change warning in get_current_vllm_config to report caller's line number (#31855)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
2026-01-07 03:48:13 +00:00
Cyrus Leung
1b8af957f6 [Doc] Update release docs (#31799)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-07 03:27:40 +00:00
Ce Zhao
a051525e07 [Model] Enable LoRA support for PaliGemma (#31656)
Signed-off-by: 赵策 <alcor@mac.mynetworksettings.com>
Signed-off-by: Alcor <alcor_zhao@outlook.com>
Co-authored-by: 赵策 <alcor@mac.mynetworksettings.com>
2026-01-07 10:09:32 +08:00
Yihua Cheng
5b833be49e [1/2][lmcache connector] clean up lmcache multi-process adapter (#31838)
Signed-off-by: ApostaC <yihua98@uchicago.edu>
2026-01-07 02:02:42 +00:00
Lucas Kabela
873480d133 [Misc][BE] Type coverage for vllm/compilation [1/3] (#31554)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2026-01-06 20:37:51 -05:00
vSeamar
6f351548b2 [Frontend] Implement robust video frame recovery for corrupted videos (#29197)
Signed-off-by: cmartinez <cmartinez@roblox.com>
Signed-off-by: vSeamar <cmartinez@roblox.com>
2026-01-07 01:13:24 +00:00
Andreas Karatzas
364a8bc6dc [ROCm][CI] Fix plugin tests (2 GPUs) failures on ROCm and removing VLLM_FLOAT32_MATMUL_PRECISION from all ROCm tests (#31829)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-07 01:12:23 +00:00
Angela Yi
9a1d20a89c [CI] Add warmup run in test_fusion_attn (#31183)
Signed-off-by: angelayi <yiangela7@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2026-01-07 00:31:52 +00:00
Cyrus Leung
309a8f66ee [Bugfix] Handle mistral tokenizer in get_hf_processor (#31817)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-07 07:46:56 +08:00
Andreas Karatzas
e5d427e93a [ROCm][CI] Pinning timm lib version to fix ImportError in Multi-Modal Tests (Nemotron) (#31835)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-06 23:23:11 +00:00
Andreas Karatzas
2a42ae790d [ROCm][CI] Fix ModernBERT token classification test numerical accuracy on ROCm (#31820)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-06 23:21:15 +00:00
Matthew Bonanni
d49899732e [Spec Decode][UX] Add acceptance stats to vllm bench serve report (#31739)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
2026-01-06 21:21:42 +00:00
Elvir Crnčević
dba95378a6 Report error log after vllm bench serve (#31808)
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
2026-01-06 20:24:19 +00:00
Nikhil G
ada6f91d56 Fix RecursionError in MediaWithBytes unpickling (#31191)
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
2026-01-06 20:11:26 +00:00
Li, Jiang
8becf146bd [Quantization][Refactor] Move CPU GPTQ kernel into MP linear (#31801)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Li, Jiang <bigpyj64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-06 19:10:18 +00:00
Charlie Fu
c07163663d [ROCm][CI] Fix tests/compile unit tests (#28895)
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2026-01-06 18:50:43 +00:00
Benjamin Chislett
f7008ce1c4 [Perf] Async Scheduling + Speculative Decoding + Structured Outputs (#29821)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
2026-01-06 18:50:37 +00:00
Yakine Tahtah
4e67a8f616 [Bugfix] Fix GLM-4 MoE router logits dtype for data parallel chunking (#31055)
Signed-off-by: ReinforcedKnowledge <reinforced.knowledge@gmail.com>
2026-01-06 17:57:56 +00:00
Masataro Asai
142c4d1738 make 500: InternalServerError more informative (#20610)
Signed-off-by: Masataro Asai <guicho2.71828@gmail.com>
2026-01-06 17:36:24 +00:00
Ning Xie
6f5e653383 [Log] add log about gpu worker init snapshot and requested memory (#29493)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-06 17:32:55 +00:00
Vadim Gimpelson
22dffca982 [PERF] Speed-up of GDN attention decode part (Qwen3-Next) (#31722)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-01-06 17:32:46 +00:00
Lucas Wilkinson
4c73be14e0 [Attention][2/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties (#31774)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-06 17:32:14 +00:00
Jinzhen Lin
2f4bdee61e [Quantization][MoE] remove unused ep logic from moe marlin (#31571)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-01-06 09:07:19 -08:00
roikoren755
28c94770ad [NemotronH] Use ReplicatedLinear for fc1_latent_proj (#31807)
Signed-off-by: Roi Koren <roik@nvidia.com>
2026-01-06 16:00:40 +00:00
Robert Shaw
af8fd73051 [MoE Refactor][14/N] Clean Up FI Quant Config Smuggling (#31593)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-01-06 15:47:04 +00:00
Robert Shaw
d3e477c013 [MoE Refactor] Add Temporary Integration Tests - H100/B200 (#31759)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-01-06 10:34:17 -05:00
Isotr0py
02809af1e7 [Bugfix]: Fix cross attention backend selection for Turing GPU (#31806)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-06 23:15:56 +08:00
Jee Jee Li
cbd4690a03 [LoRA]Disable linear LoRA kernel PDL (#31777)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-06 23:12:25 +08:00
wang.yuqi
96860af655 [Model] rename use_pad_token to use_sep_token (#31784)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-01-06 14:16:04 +00:00
Chauncey
0202971a48 [Frontend] Support GLM-4.5 / GLM-4.7 with enable_thinking: false (#31788)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-01-06 13:53:21 +00:00
Jzz1943
2c1a4f2488 [Bugfix]: avoid overriding audio/text kwargs (Qwen3-Omni) (#31790)
Signed-off-by: Zhongze Jiang <jiangzhongze.jzz@ant-intl.com>
2026-01-06 12:59:17 +00:00
Cyrus Leung
6444824873 [Misc] Implement TokenizerLike.convert_tokens_to_ids (#31796)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-06 12:08:22 +00:00
kzwrime
bf0f3a4638 [Bugfix] Fix torch.compile error for DP + MoE on CPU Backend (#31650)
Signed-off-by: kunzh <zhikun.wu@outlook.com>
2026-01-06 12:06:20 +00:00
Lucas Wilkinson
e0327c9db2 [Attention][1/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties (#31773)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-06 04:05:17 -08:00
Cyrus Leung
14df02b4e1 [Chore] Cleanup mem_utils.py (#31793)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-06 19:55:59 +08:00
BlankR
6ebb66ccea [Doc] Fix format of multimodal_inputs.md (#31800)
Signed-off-by: BlankR <hjyblanche@gmail.com>
2026-01-06 03:30:24 -08:00
wang.yuqi
43d384bab4 [CI] Increase the MTEB_EMBED_TOL threshold to 5e-4. (#31797)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-01-06 19:30:05 +08:00
Cyrus Leung
db318326a5 [Misc] Use deprecated for seed_everything (#31780)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-06 11:29:55 +00:00
Fadi Arafeh
799b5721f6 [cpu][bench] Add CPU paged attention benchmarks (#31720)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2026-01-06 10:57:57 +00:00
Cyrus Leung
97ca4c3b60 [Chore] Remove more V0 dead code from sequence.py (#31783)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-06 10:25:14 +00:00
Isotr0py
ee2e69d6cd [Bugfix][CI/Build] Fix failing pooling models test due to Triton kernel accuracy diff (#31776)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-06 00:44:22 -08:00
Isotr0py
7101e0851f [Models]: Use MMEncoderAttention for MoonViT (#31738)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: h100 <h100@inferact.ai>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: h100 <h100@inferact.ai>
2026-01-06 08:00:25 +00:00
vllmellm
e9717801bd [Bugfix][ROCm] Fix Unsupported attention metadata type for speculative decoding in eagle.py (#31714)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2026-01-06 07:53:22 +00:00
Cyrus Leung
da71d44410 [Doc] Show that use_audio_in_video is supported in docs (#30837)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-05 23:27:19 -08:00
Kevin McKay
1fb0209bbc [Bugfix][Hardware][AMD] Fix exception types in AITER MLA FP8 check (#31177)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-06 14:10:59 +08:00
Robert Shaw
81323ea221 [CI] Fix CPU MM PRocessor Test (#31764)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-01-06 04:22:18 +00:00
Michael Goin
e1cd7a5faf [Bugfix] Add init_workspace_manager to moe kernel benchmarks (#31042)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-05 19:14:33 -08:00
Michael Goin
a68e703c32 [UX] Add -ep shorthand for --enable-expert-parallel (#30890)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-05 19:13:36 -08:00
maang
cd1245a184 [Cleanup] Remove redundant decoder_layer_type assignment in Qwen2 (#31760)
Signed-off-by: maang <maang_h@163.com>
2026-01-05 18:09:18 -08:00
Wentao Ye
ffec815422 [Perf] Optimize additional fill(0) in cutlass moe, 2.9% E2E throughput improvement, 10.8% TTFT improvement (#31754)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-05 18:01:13 -08:00
maang
d386ab1412 [Docs] Improve malformed exception caused by backslash line continuations (#31694)
Signed-off-by: maang <maang_h@163.com>
Signed-off-by: maang <55082429+maang-h@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-05 17:51:54 -08:00
Michael Goin
ccb309a964 Revert "[CI Failure] Disable B200 tests while runner is broken" (#31750)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2026-01-05 17:26:33 -08:00
John Calderon
2f4e6548ef [Bugfix] vLLM produces invalid UTF-8 tokens and “�” (#28874)
Signed-off-by: John Calderon <jcalderon@nvidia.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
2026-01-06 00:23:00 +00:00
Seiji Eicher
3c98c2d21b [CI/Build] Allow user to configure NVSHMEM version via ENV or command line (#30732)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-01-05 15:56:08 -08:00
Michael Goin
9513029898 [Bugfix] Properly apply v_scale for mimo_v2_flash (#31175)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-05 23:20:46 +00:00
Robert Shaw
f6c0009afa [Bugfix] Fix Broken ModelOpt NVFP4 MoE (#31742)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-01-05 23:18:38 +00:00
Yongye Zhu
776ca1e187 [MoE Refactor] Aiter Experts for BF16 MoE (#31542)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-01-05 14:52:59 -08:00
Wentao Ye
af9a7ec255 [Bug] Revert torch warning fix (#31585)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-05 22:31:21 +00:00
Matthew Bonanni
276e03b92c [CI][DeepSeek] Add nightly DeepSeek R1 lm_eval tests on H200 (#30356)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-01-05 17:17:59 -05:00
Nick Hill
32f4e4db00 [Cleanup] Remove deprecated fields from CachedRequestData class (#31734)
Signed-off-by: njhill <nickhill123@gmail.com>
2026-01-05 21:07:14 +00:00
amitz-nv
ee21291825 [Model] Nemotron Parse 1.1 Support (#30864)
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-01-05 13:00:14 -08:00
Qidong Su
af1b07b0c5 [docker] install cuda13 version of lmcache and nixl (#30913)
Signed-off-by: Qidong Su <soodoshll@gmail.com>
2026-01-05 12:50:39 -08:00
gnovack
c77a993cc2 pin lora_b moe weights on cpu (#31317)
Signed-off-by: gnovack <gnovack@amazon.com>
2026-01-05 12:15:40 -08:00
Roberto L. Castro
fdcc5176be [BugFix] Fix architecture flags to prevent issues on SM103 (#31150)
Signed-off-by: LopezCastroRoberto <robertol.c510@gmail.com>
2026-01-05 20:11:35 +00:00
Wang Kunpeng
5708297e4e [Misc][Model][Refactor] Pass the prefix into Linear layers (#31669)
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2026-01-05 20:03:18 +00:00
baonudesifeizhai
02dbb933cb Fix GLM-4.6v flash tool calling in transformers 5.x (#31622)
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
2026-01-05 11:32:43 -08:00
Isotr0py
51e38a8e30 [Misc] Enable Paligemma's PrefixLM attention mask computation (#31725)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-06 03:31:49 +08:00
Or Ozeri
d8e38d4939 Triton Attention: Support cross-layers blocks (#30687)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-05 19:29:16 +00:00
kzwrime
21156ff199 [Bugfix] Add missing extra_tensors arg to DeviceCommunicatorBase.disp… (#31644)
Signed-off-by: kunzh <zhikun.wu@outlook.com>
2026-01-06 01:26:09 +08:00
RickyChen / 陳昭儒
c455b771fd [Bugfix][CPU] Fix RotaryEmbedding fallback causing gibberish with --enforce-eager (#31643)
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com>
2026-01-06 01:25:38 +08:00
Michael Goin
eefa713a66 [CI Failure] Disable B200 tests while runner is broken (#31732)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-05 08:50:51 -08:00
Kevin Šuc
79ed460dd5 [Frontend] [Doc] Exclude log deltas feature (#30322)
Signed-off-by: Catacomba <kevinsuc16@gmail.com>
Signed-off-by: Kevin Šuc <kevinsuc16@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-05 16:34:35 +00:00
Isotr0py
6aa5b18e1d [v1] Add encoder-only/cross attention support to Triton Attention backend (#31406)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-06 00:00:23 +08:00
wang.yuqi
911d38ed99 [Model] Let more models to support the score template. (#31335)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-05 11:54:26 +00:00
zzzzwwjj
caaa482aca [platform] Support additional forward context for OOT (#31674)
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-05 10:25:13 +00:00
Yihua Cheng
b471aad41f [KVconnector][LMCache] remove the import of legacy LMCache code (#31704)
Signed-off-by: ApostaC <yihua98@uchicago.edu>
2026-01-05 10:11:01 +00:00
Jee Jee Li
d5503ca7f9 [LoRA] LoRA PDL improvement (#31660)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-05 08:28:46 +00:00
Qiping Pan
a2ad15c070 [Model] Enable LoRA support for BLIP2 (#31620)
Signed-off-by: Qiping Pan <panqiping@outlook.com>
2026-01-05 08:02:24 +00:00
Tres
3133c192a3 [ROCM] Reorder arguments and rename parameters for rope_cached_thd_positions_2c_fwd_inplace (#29993)
Signed-off-by: Tres Popp <tres.popp@amd.com>
2026-01-05 15:37:57 +08:00
wang.yuqi
76fd458aa7 [CI] Bump sentence-transformer from 3.2.1 to 5.2.0 (#31664)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-01-04 21:45:01 -08:00
cjackal
e2701cc525 [Frontend] [Bugfix] respect server-level default chat template kwargs in reasoning parser (#31581)
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2026-01-05 05:42:47 +00:00
Tyler Michael Smith
fe8a9fbd2e [Bugfix] Fix EPLB state logging error (#31455)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
2026-01-05 04:06:28 +00:00
Ning Xie
98b8b3abaa [log] enable max_log_len trim only when needed (#31482)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-05 03:55:43 +00:00
CHENYUE
346e56455a Add chat prefix completion feature to DeepSeek v3.2 (#31147) 2026-01-05 11:20:25 +08:00
wang.yuqi
8be6432bda [CI Failure] Fix NomicBert max_model_len validation (#31662)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-01-05 11:06:52 +08:00
Nick Hill
43e3f8e4a9 [Misc] Various code simplifications (#31666)
Signed-off-by: njhill <nickhill123@gmail.com>
2026-01-04 18:35:56 -08:00
wangxiyuan
bb4337b34c [Platform] Deprecate seed_everything (#31659)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-04 18:34:04 -08:00
Isotr0py
367856de14 [CI/Build] Revive skipped reward models e2e test (#31665)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-05 02:33:46 +00:00
Nick Hill
da436f868a [Minor] Small pooler output processing optimization (#31667)
Signed-off-by: njhill <nickhill123@gmail.com>
2026-01-04 18:33:12 -08:00
Jee Jee Li
f099cd557a [Bugfix] Fix AttributeError: 'Stream' object has no attribute 'dp_size' (#31663)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-05 02:31:31 +00:00
Andreas Karatzas
f2b6dfd237 [ROCm][CI] Fix language generation test accuracy by disabling HF flash_sdp and mem_efficient_sdp (#31597)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-05 02:17:05 +00:00
Andreas Karatzas
89f1f25310 [CI] Skip Phi-MoE test due to old API util (#31632)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-05 08:52:07 +08:00
Nick Hill
b53b89fdb3 [BugFix] Async scheduling: handle model forward errors more cleanly (#31611)
Signed-off-by: njhill <nickhill123@gmail.com>
2026-01-04 11:04:37 -08:00
Ning Xie
6522721d17 [misc] Sort uvicorn log level description according to verbosity (#31137)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-04 18:45:37 +00:00
Yuxuan Zhang
0d4044edd8 fix no think of GLM-4.5 / GLM-4.7 (#31449)
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
2026-01-04 11:43:00 +08:00
Reagan Lee
41ab179738 [Docs] Fix argparse include path for mm-processor benchmark (#31654)
Signed-off-by: Reagan <reaganjlee@gmail.com>
2026-01-04 03:31:29 +00:00
Robert Shaw
268b1c55ad [MoE Refactor][13/N] Convert FI to Use PFNoEP (#31533)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-03 12:26:36 -08:00
Andreas Karatzas
4f9ce35afe [CI][Bugfix] Fix token counting in chunked prefill compl test (#31630)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-03 14:28:49 +08:00
jeremyteboul
97a01308e9 Improve HF qwen3_omni: preserve audio_sample_rate in kwargs restructuring (#29255)
Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com>
Co-authored-by: Jeremy Teboul <jeremyteboul@fb.com>
2026-01-03 04:31:09 +00:00
Xingyu Liu
0eee877f67 [Core] Parse vLLM engine required fields from hf_config to model_arch_config (#28454)
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com>
Signed-off-by: Xingyu Liu <38244988+charlotte12l@users.noreply.github.com>
2026-01-02 15:13:15 -08:00
Alfred
a0e9ee83c7 [Benchmark] Fix OOM during MoE kernel tuning for large models (#31604)
Signed-off-by: Alfred <massif0601@gmail.com>
2026-01-02 22:24:51 +00:00
Yongye Zhu
a3f2f40947 [MoE Refactor] Explicit construct mk for flashinfer bf16 kernel (#31504)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-01-02 13:54:50 -08:00
Yongye Zhu
5a468ff7c7 [MoE Refactor] Split invoke_fused_moe_kernel (#31050)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-01-02 13:47:15 -08:00
Andreas Karatzas
6ef770df7c [MoE] Fix output_shape calculation in Attention layer to handle 3D query inputs (#31596)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-02 15:46:23 +00:00
Nick Hill
bd877162eb [BugFix] Support online dense model DP without overhead (#30739)
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: njhill <nickhill123@gmail.com>
2026-01-02 23:36:38 +08:00
Xinyu Chen
08f425bad1 CustomOp: test forward dispatch for grouped_topk (#31530)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
2026-01-02 10:04:01 -05:00
labAxiaoming
a01f2faedf Add multimodal input method in the documentation (#31601)
Signed-off-by: xiaoming <1259730330@qq.com>
2026-01-02 12:43:30 +00:00
Kyuyeun Kim
cc410e8644 [Bugfix] Fix weight_loader v1 block scale (#31103)
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com>
2026-01-02 13:14:10 +08:00
Kevin McKay
825c2dc133 [Bugfix][Hardware][AMD] Fix last_page_len calculation in AITER MLA decode (#31282)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2026-01-01 21:14:00 -08:00
Vaibhav Sourirajan
1f43c121d5 Remove unused use_marlin variable in Mxfp4MoEMethod (#31549)
Signed-off-by: vaibhav sourirajan <vs2787@columbia.edu>
2026-01-01 21:13:36 -08:00
Tmn07
ca179d0f64 [Bugfix] Fix activation quantization for compressed-tensors W4A16 (#31572)
Signed-off-by: Tmn07 <tmn0796@gmail.com>
2026-01-01 21:13:22 -08:00
Andreas Karatzas
013b54088c [ROCm][CI] Fix ModernBERT token classification test (#31612)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-02 04:19:08 +00:00
Jay Hemnani
5ac55eb30f [Model] Enable LoRA support for tower and connector in LLaVA (#31513)
Signed-off-by: Jay Hemnani <jayhemnani9910@gmail.com>
Co-authored-by: Jay Hemnani <jayhemnani9910@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-01 19:32:39 -08:00
Benjamin Chislett
ea53ca5e85 [Bugfix] Fix block size used in EAGLE slot mapping (#31540)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2026-01-01 19:32:30 -08:00
zhima771
27864a851c feat: support LoRA for DeepSeek-OCR(Language Model part) (#31569)
Signed-off-by: zhima771 <15836938703@163.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-01 19:32:11 -08:00
Andreas Karatzas
5cc4876630 [ROCm][CI] Fix failure in Language Models Tests (Extra Standard) by reducing agent pool size (#31553)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-01 19:29:42 -08:00
Kevin McKay
5fff44064b [Bugfix] Replace BaseException with specific exceptions in FLA utils (#31590)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2026-01-01 19:27:54 -08:00
Reagan Lee
1f5b7c41c3 Add Multimodal Processor Benchmark (#29105)
Signed-off-by: Reagan Lee <reaganjlee@gmail.com>
Signed-off-by: Reagan <reaganjlee@gmail.com>
2026-01-01 19:26:53 -08:00
Ekagra Ranjan
adcf682fc7 [Audio] Improve Audio Inference Scripts (offline/online) (#29279)
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
2025-12-31 23:34:18 +00:00
Andreas Karatzas
21de6d4b02 [CI][Bugfix] Fix token counting in chunked prefill streaming test (#31565)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-31 23:05:14 +00:00
Nick Hill
6c2cfb62ff [BugFix] Fix async scheduling for pooling models (#31584)
Signed-off-by: njhill <nickhill123@gmail.com>
2025-12-31 14:48:51 -08:00
Fanjiang Ye
d8da76f3b7 [Bugfix] Fix BAGEL online serving for text and image understanding (#31546)
Signed-off-by: Dylan1229 <yvanphys@gmail.com>
Signed-off-by: UED <zxr3611244710@gmail.com>
Signed-off-by: mr-ye-cao <yecaoyc2019@gmail.com>
Co-authored-by: UED <zxr3611244710@gmail.com>
Co-authored-by: mr-ye-cao <yecaoyc2019@gmail.com>
Co-authored-by: Mr-Ye-Cao <60802056+Mr-Ye-Cao@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-12-31 14:46:10 -08:00
baonudesifeizhai
d722e9e614 Add GLM-ASR multimodal support (#31436)
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-31 23:12:24 +08:00
Andreas Karatzas
cf16342d43 [ROCm][CI] Update MiniCPM model test: MiniCPM3-4B to MiniCPM4.1-8B and simplify attention backend testing (#31551)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-31 00:12:01 -08:00
Wentao Ye
357d435c54 [Bug] Fix log issue with \n (#31390)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2025-12-30 21:16:55 -08:00
danisereb
108a2728f7 Add get_expert_mapping to NemotronHModel (for LoRA support) (#31539)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
2025-12-30 21:09:03 -08:00
TJian
578c8f51f6 [CI] [Critical] [CUDA] Fix duplicated test name (#31562)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-12-30 21:01:09 -08:00
maang-h
b4bb5f312f [Core] Remove unused num_tokens parameter from _init_model_kwargs (#31517)
Signed-off-by: maang <maang_h@163.com>
2025-12-30 20:47:23 -08:00
SameerAsal
70e1acefcd [BugFix] Fix NUMA node validation in CPU platform (#31520)
Signed-off-by: SameerAsal <SameerAsal@users.noreply.github.com>
Co-authored-by: SameerAsal <SameerAsal@users.noreply.github.com>
2025-12-31 04:06:49 +00:00
Qiu
84f6cd741b [Misc] add pcp basic support to MoE model (#31003) 2025-12-30 20:01:29 -08:00
B-201
ecd49ce7e6 [Fix] Align fused moe lora_b shape with peft (#31534)
Signed-off-by: bk-201 <joy25810@foxmail.com>
2025-12-31 09:44:59 +08:00
Amr Mahdi
e1ee11b2a5 Add docker buildx bake configuration (#31477)
Signed-off-by: Amr Mahdi <amrmahdi@meta.com>
2025-12-31 01:08:54 +00:00
vintipandey
04147dcfa7 [Bugfix] Fix pooling model always disabled due to incorrect PP rank check (#31505)
Signed-off-by: vintipandey <vinti.pandey@gmail.com>
2025-12-30 11:27:10 -08:00
JartX
07728bf5cd [BugFix] add select_gemm_impl on CompressedTensorsWNA16MoEMethod to support LoRA (#31453)
Signed-off-by: JartX <sagformas@epdcenter.es>
2025-12-30 11:20:15 -08:00
yt0428
3f52fa5aa2 [Model] Add support for openPangu moe model (#28775)
Signed-off-by: yuantao <2422264527@qq.com>
Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-12-30 08:11:38 -08:00
Li, Jiang
7157596103 [CPU] Disable async schedule on CPU (#31525)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-12-30 12:34:08 +00:00
Nicolò Lucchesi
ab1af6aa3e [CI][NIXL] Split DPEP tests (#31491)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-12-30 07:26:12 -05:00
Pleaplusone
1a834df2d4 [ROCm][Bugfix] Fix accuracy issue on fmoe when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS enabled (#31523)
Signed-off-by: ganyi <ygan@amd.com>
2025-12-30 09:21:49 +00:00
Kevin
51085c2aeb [Frontend] add continue_final_message parameter to /embeddings endpoint (#31497)
Signed-off-by: Kevin P-W <140451262+kevin-pw@users.noreply.github.com>
2025-12-30 07:21:13 +00:00
Roger Feng
3d973764ce [xpu] [bugfix] upgrade to latest oneccl in dockerfile (#31522)
Signed-off-by: roger feng <roger.feng@intel.com>
2025-12-30 14:52:28 +08:00
Nick Hill
3b312fb792 [Minor] Various small code cleanups/simplifications (#31508)
Signed-off-by: njhill <nickhill123@gmail.com>
2025-12-29 22:42:06 -08:00
ZT-AIA
f84bf7d79b Add Loraconfig parameter to get_punica_wrapper function (#31408)
Signed-off-by: ZT-AIA <1028681969@qq.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-29 22:27:31 -08:00
Roy Wang
99dcf5dcc5 Migrate meetups & sponsors [2/N] (#31500)
Signed-off-by: esmeetu <jasonailu87@gmail.com>
2025-12-30 04:26:15 +00:00
Hojin Yang
dc837bc23e feat(frontend): add --default-chat-template-kwargs CLI argument (#31343)
Signed-off-by: effortprogrammer <yhjhoward7@gmail.com>
2025-12-30 03:38:47 +00:00
Nick Hill
e54ee3ea33 [Core] Deduplicate generate/encode logic in AsyncLLM (#31510)
Signed-off-by: njhill <nickhill123@gmail.com>
2025-12-30 10:42:45 +08:00
wangln19
358bfd315c fix: update kimi k2 tool parser logic (#31207)
Signed-off-by: wangln19 <wanglinian@dev.wanglinian.msh-dev.svc.cluster.local>
Signed-off-by: Wang Linian <wanglinian@stu.pku.edu.cn>
Co-authored-by: wangln19 <wanglinian@dev.wanglinian.msh-dev.svc.cluster.local>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2025-12-30 10:01:58 +08:00
Sage
39512aba72 [Prefix Cache] Include lora_name in BlockStored event for deterministic KV-cache reconstruction (#27577)
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Co-authored-by: Sage <80211083+sagiahrac@users.noreply.github.com>
2025-12-30 00:17:16 +00:00
qli88
0f35429a0c [CI] Test Group 'NixlConnector PD accuracy tests' is fixed (#31460)
Signed-off-by: qli88 <qiang.li2@amd.com>
2025-12-29 23:48:56 +00:00
Alexei-V-Ivanov-AMD
d63b969675 [CI/ROCm] Fixing "V1 Test attention (H100)" test group. (#31187)
Signed-off-by: DCCS-4560 <alivanov@chi-mi325x-pod1-108.ord.vultr.cpe.ice.amd.com>
Signed-off-by: <>
Co-authored-by: DCCS-4560 <alivanov@chi-mi325x-pod1-108.ord.vultr.cpe.ice.amd.com>
Co-authored-by: root <root@chi-mi325x-pod1-108.ord.vultr.cpe.ice.amd.com>
2025-12-29 16:53:59 -05:00
Robert Shaw
56f516254c [Bugfix][ROCm] Fix Static Quant Issue (#31502)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2025-12-29 13:27:55 -08:00
Robert Shaw
9152a30d8f [MoE Refactor][12/N] Marlin Fp8 MoE Pure Function (#31499)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2025-12-29 13:27:00 -08:00
Nick Hill
c2ff33cc8c [Core] Enable async scheduling by default (#27614)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2025-12-29 13:20:55 -07:00
chunxiaozheng
b12cb38398 Implement register kv caches in lmcache connector (#31397)
Signed-off-by: idellzheng <idellzheng@tencent.com>
2025-12-29 11:13:42 -08:00
Roger Young
5bc664110f Optimize QKNorm for MiniMax-M2/M2.1 (#31493)
Signed-off-by: xuebi <xuebi@minimaxi.com>
Co-authored-by: xuebi <xuebi@minimaxi.com>
2025-12-29 16:30:18 +00:00
RickyChen / 陳昭儒
b3a2bdf1ac [Feature] Add offline FastAPI documentation support for air-gapped environments (#30184)
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com>
Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-29 16:22:39 +00:00
Harry Mellor
e37e7349e6 Replace nn.ConvNd with vLLM's ConvNdLayer for Transformers modeling backend (#31498)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-29 16:20:01 +00:00
Roy Wang
b5d2d71d26 Migrate doc to website: Hardware Plugins (1/N) (#31496)
Signed-off-by: esmeetu <jasonailu87@gmail.com>
2025-12-29 15:55:20 +00:00
Harry Mellor
decc244767 [Docs] Use relative md links instead of absolute html links for cross referencing (#31494)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-29 13:33:44 +00:00
amittell
9c884faa95 [Bugfix] Preserve tool call id/type/name in streaming finish chunk (#31438)
Signed-off-by: amittell <mittell@me.com>
Signed-off-by: Alex Mittell <mittell@me.com>
2025-12-29 21:10:52 +08:00
Chauncey
48d5ca4e8b [CI] fix test_chat_truncation_content_not_null test (#31488)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-12-29 12:47:08 +00:00
twj
bf73a3e4d7 [Bugfix][Frontend] Fix Jina reranker multimodal input compatibility (#31445)
Signed-off-by: tianwenjing <tianwenjing@jfgenius.com>
Signed-off-by: twj <151701930+twjww@users.noreply.github.com>
Co-authored-by: tianwenjing <tianwenjing@jfgenius.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-29 01:13:18 -08:00
Andreas Karatzas
3ecfdc3776 [ROCm][GPTQ][Bugfix] Fix GPTQ GEMM kernel output zeroing race condition (#30719)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-29 01:13:14 -08:00
Andreas Karatzas
45c1ca1ca1 [ROCm][CI] Skip DeepGemm-dependent test on ROCm platform (#31462)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-29 16:31:10 +09:00
Li, Jiang
17347daaa2 [CI/Build][CPU] Update CPU CI test cases (#31466)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-12-29 14:17:52 +08:00
Mamy Ratsimbazafy
b9793e6a8c Add Fused MoE Triton kernels for GLM-4.5-Air, GLM-4.5v, GLM-4.6v on 2x RTX Pro 6000 (#31407)
Signed-off-by: Mamy Ratsimbazafy <mamy_github@numforge.co>
2025-12-28 08:38:33 -08:00
Jzz1943
0b6b701050 [Model] Add tuned triton fused_moe configs for Qwen3Moe on B200 (#31448)
Signed-off-by: Zhongze Jiang <jiangzhongze.jzz@ant-intl.com>
2025-12-28 08:38:07 -08:00
Nick Hill
094fcce250 [BugFix] Re-fix async multimodal cpu tensor race condition (#31373)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: njhill <nickhill123@gmail.com>
2025-12-28 03:05:08 -08:00
Andreas Karatzas
573dd0e6f0 [ROCm] Migrate xgrammar to upstream release (#31327)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-28 00:08:29 -08:00
Andreas Karatzas
f70368867e [ROCm][CI] Add TorchCodec source build for transcription tests (#31323)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-28 16:06:05 +08:00
Andreas Karatzas
96142f2094 [ROCm][CI] Added perceptron lib in requirements for isaac multi-modal test (#31441)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-28 04:15:14 +00:00
Boyuan Feng
62def07d67 [BugFix] register quant scale tensors as buffer (#31395)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-12-28 11:20:02 +08:00
yitingdc
b326598e97 add tip for VLLM_USE_PRECOMPILED arg to reduce docker build time (#31385)
Signed-off-by: yiting.jiang <yiting.jiang@daocloud.io>
2025-12-28 03:19:47 +00:00
Robert Shaw
727c41f3fd [MoE Refactor][10/N] Cleanup Fp8 Process Weights After Loading (#31169)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2025-12-27 20:22:48 +00:00
Boyuan Feng
2f12cd32c0 [BugFix] Fix cache issue in compilation_config (#31376)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-12-27 09:30:39 -05:00
Isotr0py
40a8756224 [Chore]: Remove HF format Phi4-MM examples (#31405)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-12-27 13:42:02 +00:00
Isotr0py
3d024985ab [CI/Build] Ignore max transformers version for more common tests (#31401)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-12-27 13:06:26 +00:00
baonudesifeizhai
8711b21676 Fix/get raw stream patch #30905 (#30912)
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-12-26 20:08:47 -08:00
Yifan Qiao
52bf066516 [Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector (#30166)
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Co-authored-by: KuntaiDu <kuntai@uchicago.edu>
2025-12-26 18:25:46 -08:00
Kunshang Ji
5326c89803 [XPU][CI]skip test_preprocess_error_handling due to fork/spawn issue (#31381)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-12-26 21:40:44 +00:00
Xinyu Chen
87f1b8ca2c CustomOp: Unify aiter impl into GroupedTopk (#31221)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
2025-12-26 12:44:29 -05:00
rongfu.leng
887e900b77 [Docs] Add profiler user docs for http request (#31370)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-12-26 23:48:15 +08:00
Patrick von Platen
48e744976c [Mistral common] Ensure all functions are imported from the top & only use public methods (#31138)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-12-26 04:48:24 -08:00
Jee Jee Li
ce1eafd1a5 [Core] Initialize LoRA support for tower and connector in multi-modal models (#26674)
Signed-off-by: bk-201 <joy25810@foxmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: prashanth058 <prashanth.dannamaneni@uipath.com>
Co-authored-by: bk-201 <joy25810@foxmail.com>
Co-authored-by: prashanth058 <prashanth.dannamaneni@uipath.com>
Co-authored-by: Anexdeus <5142168@mail.ru>
2025-12-26 04:48:20 -08:00
Harry Mellor
0b544e6476 [Docs] Fix some snippets (#31378)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-26 12:47:41 +00:00
Jee Jee Li
c3666f56fd [Misc] Fix Qwen2-MoE shared_expert_gate (#31339)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-26 05:10:39 +00:00
Andreas Karatzas
c79dbfa9ad [CI] Fix flaky vision beam search test with flexible semantic validation (#31324)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-26 04:39:32 +00:00
Shinichi Hemmi
9ee05cbe7f Support LoRA and GPTQModel for PLaMo 2/3 (#31322)
Signed-off-by: Shinichi Hemmi <50256998+Alnusjaponica@users.noreply.github.com>
2025-12-26 11:41:33 +08:00
Ning Xie
3b8f31b362 [benchmark] use model card root instead of id (#31329)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-12-26 10:55:56 +08:00
Isotr0py
2cd94259c8 [CI/Build] Ignore max transformers version skipping for initialization tests (#30619)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-12-26 10:50:32 +08:00
oscardev256
b7165d53c6 Feature/isaac 0.1 (#28367)
Signed-off-by: oscardev256 <42308241+oscardev256@users.noreply.github.com>
Signed-off-by: Oscar Gonzalez <ogonzal6@alumni.jh.edu>
Signed-off-by: Yang <lymailforjob@gmail.com>
Co-authored-by: Yang <lymailforjob@gmail.com>
2025-12-25 18:49:11 -08:00
Nick Hill
81786c8774 [BugFix] Fix async scheduling + reasoning with struct output (#31332)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2025-12-25 23:01:02 +00:00
Stan Wozniak
f1531d9f2a [Hybrid] Mamba2 prefix cache blocks freeing for running requests (#28047)
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2025-12-25 20:54:06 +00:00
SongHe
2d6001f491 [Model][Ernie4.5-VL] Support video metadata for timestamp rendering (#31274)
Signed-off-by: dengsonghe <dengsonghe@baidu.com>
Co-authored-by: dengsonghe <dengsonghe@baidu.com>
2025-12-25 14:07:15 +00:00
Amir Samani
030fc44914 use the same stream for cuda graph capture and replay for NCCL (#29207)
Signed-off-by: Amir Samani <asamani@nvidia.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-12-25 19:10:03 +08:00
Isotr0py
2532f437ee [Doc] Add troubleshooting for Triton PTX error about undefined gpu-name (#31338)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-12-25 02:26:34 -08:00
Louie Tsai
f15185fbdb [Benchmark Suite] improve cpu Benchmark Suite tests and comparison report for 0.12.0 (#30994)
Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
2025-12-25 08:51:45 +00:00
Mark Gatere
ba25a65992 [Frontend] add FunctionGemma tool parser support (#31218)
Signed-off-by: gateremark <gateremg@gmail.com>
2025-12-25 15:29:25 +08:00
Amith KK
42826bbccd [Doc] Add tool call parser documentation for GPT-OSS models (#31212)
Signed-off-by: Amith KK <amithkumaran@gmail.com>
2025-12-25 05:29:10 +00:00
Richard Zou
254f6b9867 [Bugfix] Fix eagle dp tests on A100 (#31241)
Signed-off-by: Richard Zou <zou3519@gmail.com>
2025-12-25 00:05:04 +00:00
Michael Goin
bc5ef333e0 [Perf] Add skip_clone to SamplingParams for internal request handling (#31041)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-24 14:35:57 -08:00
Cyrus Leung
09dc7c690c [Chore][1/2] Drop v0.14 deprecations (#31285)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-24 09:54:01 -08:00
ゆり
506eb0f454 [Bugfix] Remove dead block_quant_to_tensor_quant function (#31294)
Co-authored-by: yurekami <yurekami@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-24 17:22:48 +00:00
Ning Xie
5d93089686 [cli] complete vllm cli help message (#31226)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-12-24 15:45:47 +00:00
Kevin McKay
66c9887440 [Bugfix][Hardware][AMD] Fix FP8 dtype in silu_mul quantization (#31179)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2025-12-24 10:37:11 -05:00
wang.yuqi
1ff67df182 [CI] Reorganize pooling_mteb_test (#31265)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-12-24 23:36:20 +08:00
skaraban3807
7cd288a4b3 [PERF] Add interleaved memory allocation to NUMA module (#30800) 2025-12-24 13:47:49 +00:00
Cyrus Leung
d201807339 [Chore] Bump lm-eval version (#31264)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-24 05:39:13 -08:00
Cyrus Leung
aa3868ecfe [Chore] Remove unused noqas (#31263)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-24 05:38:46 -08:00
Cyrus Leung
7adeb4bfa8 [Bugfix] Fix max_model_len="auto" handling (#31260)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-24 19:15:27 +08:00
wang.yuqi
bd89ce16d2 [Model] Introduce verify_and_update_model_config for VerifyAndUpdateConfig. (#31131)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
2025-12-24 09:54:57 +00:00
Pleaplusone
b41aeb3468 [Bugfix][ROCm] Fix load issue on deepseek quark quantization when shared expert enabled (#31261)
Signed-off-by: ganyi <ygan@amd.com>
2025-12-24 16:47:44 +08:00
Ryan Rock
ddfac7034e [CI/Build] Ignore data_parallel_size_local (#30281)
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
2025-12-24 07:40:54 +00:00
Micah Williamson
6559d96796 [ROCm][CI] Set TORCH_NCCL_BLOCKING_WAIT Distributed Tests On ROCm (#31259)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2025-12-24 07:19:07 +00:00
kliuae
1c74150bca [ROCm][CI] Fix "Distributed Tests (H200)" Test (#31227)
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
2025-12-24 06:56:30 +00:00
Andreas Karatzas
0247a91e00 [ROCm][CI] Fix entrypoints tests and Python-only installation test on ROCm (#28979)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-23 22:42:30 -08:00
Michael Goin
8ee90c83f8 Add --max-model-len auto to auto-fit context to available memory (#29431)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-23 21:37:14 -08:00
Nick Cao
d7e05ac743 [docker] Fix downloading sccache on aarch64 platform (#30070)
Signed-off-by: Nick Cao <nickcao@nichi.co>
2025-12-23 21:36:33 -08:00
sihao_li
471ddb99a0 [XPU] Remove distributed_executor_backend check (#30760)
Signed-off-by: sihao.li <sihao.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2025-12-23 21:34:33 -08:00
Xiong Wang
bb24592d13 [Qwen3-Omni] fixed _get_feat_extract_output_lengths function (#31007)
Signed-off-by: Xiong Wang <wangxiongts@163.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-12-23 21:33:54 -08:00
Matthew Bonanni
369f47aa0f [DeepSeek v3.2] Remove unnecessary syncwarps (#31047)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-12-23 21:33:30 -08:00
zejunchen-zejun
dabff12ed3 [Bugfix][ROCm][Dynamo][DS 3.1][FP8] fix unsupported hasattr call when Dynamo tracing for ROCm device (#31149)
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
2025-12-23 21:32:19 -08:00
Ming Yang
3bb9561928 Revert "[bench] Support common prefix len config (for decode-only bench)" (#31240)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-12-23 21:17:23 -08:00
Micah Williamson
3ce791ac77 [ROCm][CI] Set VLLM_FLOAT32_MATMUL_PRECISION="tf32" For terratorch Tests In AMD CI (#31242)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2025-12-24 03:21:50 +00:00
Andreas Karatzas
e42894f5b5 [ROCm][CI][Bugfix] Fix Siglip2 rotary embedding dispatch and InternVL video test tolerance (#31235)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-24 02:56:58 +00:00
Wentao Ye
76e6a95192 [Bug] Fix "Number of dimensions of tensors must match" for DeepSeek V3.2 (#31160)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-24 10:41:09 +08:00
Chao Lei
8b59753cdb [P/D] Mooncake connector support more protocols (#30133)
Signed-off-by: LCAIZJ <leichao139636@163.com>
2025-12-24 10:24:07 +08:00
Chen Zhang
538e830caa [KVEvent] Use request.block_hash for parent block_hash (#30544)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Co-authored-by: Yifan Qiao <yifanqiao@berkeley.edu>
2025-12-23 18:23:43 -08:00
rongfu.leng
4ed11105d7 [Misc] Remove unused custom ops copy_blocks and copy_blocks_mla (#30967)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-12-23 18:22:35 -08:00
Cyrus Leung
dd424571c8 [Bugfix] Enable dynamic_dims for different embeds shape (#31223)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-24 10:15:47 +08:00
Cyrus Leung
ca6a95ba25 [Chore] Simplify logic of _execute_mm_encoder (#31222)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-23 18:15:16 -08:00
Vadim Gimpelson
bc0a5a0c08 [CI] Add Qwen3-Next-FP8 to Blackwell model tests (#31049)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-12-23 17:21:50 -08:00
Andreas Karatzas
bfa2c0bbb9 [ROCm][Bugfix] Fix RuntimeError in MMEncoderAttention by replacing .view() with .reshape() (#31203)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-23 21:48:01 +00:00
Mark McLoughlin
f790068600 [Core] Add a random suffix to frontend-provided request IDs (#27987)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-12-23 13:05:39 -08:00
Asaf Joseph Gardin
34916ae37f [Mamba] - Consolidate Mamba's Attention Logic (#28133) 2025-12-23 21:57:00 +01:00
Yuan Tang
0736f901e7 docs: Add llm-d integration to the website (#31234)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-12-23 20:27:22 +00:00
Harry Mellor
c016c95b45 Use helper function instead of looping through attribute names (#29788)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-23 17:31:56 +00:00
Harry Mellor
1339878e13 Only patch original_max_position_embeddings for Transformers v4 (#31214)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-23 16:46:32 +00:00
danielafrimi
b94f80ffb8 [FIX] FP4 quantization kernel padding initialization bug (#31097)
Signed-off-by: <>
Co-authored-by: root <root@gpu-193.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: root <root@gpu-951.slurm-workers-slurm.slurm.svc.cluster.local>
2025-12-23 08:45:18 -08:00
Joachim Studnia
38c361f99d Fix edge case Mistral tool parser (#30724)
Signed-off-by: Joachim Studnia <joachim@mistral.ai>
Signed-off-by: Joachim Studnia <studniajoachim@gmail.com>
Signed-off-by: juliendenize <julien.denize@mistral.ai>
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: juliendenize <julien.denize@mistral.ai>
Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2025-12-23 14:19:58 +00:00
Cyrus Leung
bb62dda2c3 [Misc] Introduce encode_*_url utility function (#31208)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-23 13:45:21 +00:00
Patrick von Platen
3faa8bee57 adapt voxtral (#31095)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
2025-12-23 05:31:55 -08:00
Harry Mellor
b10d47e0e0 Add util function for checking nesting of rope parameters (#31146)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-23 11:41:49 +00:00
R3hankhan
769f27e701 [OpenAI] Add parameter metadata to validation errors (#30134)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
2025-12-23 11:30:12 +00:00
Jakub Zakrzewski
23daef548d [Frontend] Support using chat template as custom score template for reranking models (#30550)
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
2025-12-23 11:19:16 +00:00
Jee Jee Li
27c6c2f98c [Bugfix] Fix MoE LoRA bin/pt loading (#31161)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-23 19:09:15 +08:00
Weida Hong
73cfb7a722 Correct position of docstring of class attributes (#31209)
Signed-off-by: Weida Hong <wdhongtw@google.com>
2025-12-23 02:08:58 -08:00
vllmellm
f32cfd7d97 [ROCm][FEAT] Support AITER RMSNorm quantization fusion pass (#26575)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2025-12-23 02:07:54 -08:00
Jee Jee Li
6b16fff01b [Bugfix] Fix Jais2ForCausalLM (#31198)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-23 07:44:01 +00:00
Yan Ma
f1c2c20136 [XPU] decrease IGC_ForceOCLSIMDWidth for speculative decoding triton-xpu kernel compilation (#30538)
Signed-off-by: Yan Ma <yan.ma@intel.com>
2025-12-23 05:22:15 +00:00
Cyrus Leung
8cef137689 [Chore] Update more locations to use attention_config.backend (#31153)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-22 19:19:50 -08:00
quanliu
a37328fc5c [Feature] Batch invariant: Lora (#30097)
Signed-off-by: quanliu <18646313696@163.com>
2025-12-23 10:32:47 +08:00
Pavani Majety
3e10262356 Revert "[SM100] Enable fp8 compute for prefill MLA (#30746)" (#31197)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-12-22 18:15:33 -08:00
Angela Yi
612d5ffdab [ci] Fix Pytorch compilation test oom in 2.10 (#31194)
Signed-off-by: angelayi <yiangela7@gmail.com>
2025-12-23 01:56:47 +00:00
Divakar Verma
78e5e62bbf [AMD][CI] fix v1/engine test_preprocess_error_handling (#31192)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-12-23 01:28:19 +00:00
Robert Shaw
b57b967386 [MoE Refactor][7/N] AITER MK (#31102)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2025-12-22 16:42:58 -07:00
Michael Goin
6d518ffbaa [CI Failure] Disable mosaicml/mpt-7b and databricks/dbrx-instruct tests (#31182)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-22 15:40:35 -08:00
Benjamin Chislett
85aff45e24 [Perf] Remove blocking copy in GDN Attention (#31167)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-12-22 14:25:22 -08:00
Wentao Ye
5312a7284e [Bug] Fix 'CutlassMLAImpl' object has no attribute '_workspace_buffer' (#31173)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-22 14:24:27 -08:00
Lucas Wilkinson
de71747655 [SpecDecode] Simplified alternative padded-speculation acceptance rate fix (#29845)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-22 13:06:10 -08:00
Michael Goin
9586354053 [Doc] Add vllm-metal to hardware plugin documentation (#31174)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-22 20:06:29 +00:00
Pavani Majety
b10f41c894 [SM100] Enable fp8 compute for prefill MLA (#30746)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-12-22 19:15:57 +00:00
Yongye Zhu
7b926e8901 [MoE Refactor][9/N] Use modular kernel for unquantized Triton MoE (#31052)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
2025-12-22 17:34:19 +00:00
Gregory Shtrasberg
ab3a85fd68 [ROCm][CI/Build] Fix triton version to one that has triton_kernels required for gpt-oss to run (#31159)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-12-22 17:19:27 +00:00
Boyuan Feng
8dd0db687b [UX] improve profiler error message (#31125)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-12-22 08:45:59 -08:00
TJian
022f3cea53 [ROCm] [Critical]: Remove unused variable (#31156)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-12-22 08:28:22 -08:00
Micah Williamson
a5bc77c253 [AMD][CI] Add "V1 Test e2e + engine" to mi325_8 Agent Pool (#31040)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2025-12-22 10:41:56 -05:00
Nicolò Lucchesi
b1c3f96ae3 [CI][Bugfix] Fix entrypoints/openai/test_audio.py (#31151)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-12-22 07:21:40 -08:00
dengyunyang
8f8f469b1b [BugFix] skip language model in Encoder (#30242)
Signed-off-by: dengyunyang <584797741@qq.com>
2025-12-22 05:25:59 -08:00
Shengqi Chen
2cf91c2ea4 [CI] add polling for precompiled wheel in python_only_compile.sh, fix index generation for releases (#30781)
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
2025-12-22 13:24:21 +00:00
AlonKejzman
bd6d5a7475 [gpt-oss] Fix harmony parser in streaming responses (#30205)
Signed-off-by: AlonKejzman <alonkeizman@gmail.com>
2025-12-22 20:56:06 +08:00
Li Wang
256a33ecb4 [Model] Fix bagel failed to run (#31132)
Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-22 02:15:54 -08:00
Roger Young
c02a2705f9 Update MiniMax-M2 ToolCall and add MiniMax-M2.1 in Docs (#31083)
Signed-off-by: xuebi <xuebi@minimaxi.com>
Co-authored-by: xuebi <xuebi@minimaxi.com>
2025-12-22 05:28:40 +00:00
Kevin McKay
cf8eed7bef [Bugfix][ROCm] Fix typo: is_linear_fp8_enaled -> is_linear_fp8_enabled (#31109)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-21 21:14:58 -08:00
Kevin McKay
44ae85f725 [Misc] Fix typo: 'occured' -> 'occurred' (#31120)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2025-12-21 21:14:27 -08:00
Kevin McKay
14c3e6ade3 [Misc] Fix spelling typos in model comments (#31117)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2025-12-21 21:14:14 -08:00
Kevin McKay
42b42824ae [Misc] Fix grammar errors in comments and messages (#31115)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2025-12-21 21:14:02 -08:00
Kevin McKay
ec58c10ce1 [Misc] Fix quantization-related typos (#31116)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2025-12-21 21:13:48 -08:00
Kevin McKay
8c084de59d [Misc] Fix spelling typos in comments (#31114)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2025-12-21 21:13:14 -08:00
CedricHuang
19cc9468fd [Feature]: Support NVIDIA ModelOpt HF FP8 variants FP8_PER_CHANNEL_PER_TOKEN and FP8_PB_WO in vLLM (#30957) 2025-12-21 22:34:49 -05:00
Jee Jee Li
097978a15d [Kernel] Enable fused_qknorm_rope_kernel supports partial rope (#30821)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-21 18:39:22 -08:00
Lucas Wilkinson
7e065eba59 [CI] Fix "2 Node Tests (4 GPUs in total)" (#31090)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-22 10:32:40 +08:00
Steve Westerhouse
9d701e90d8 [Doc] Clarify FP8 KV cache computation workflow (#31071)
Signed-off-by: westers <steve.westerhouse@origami-analytics.com>
2025-12-22 08:41:37 +08:00
Michael Goin
06d490282f [NVFP4][Perf] Tune NVFP4 input quant kernel for small batch size (#30897)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-21 09:41:57 -08:00
Robert Shaw
b471092d3a [MoE Refactor][4/N] Marlin Fp8 Mk (#31036) 2025-12-21 12:37:42 -05:00
Ameen Patel
93cabc417c ci: add nvidia-smi warmup before Prime-RL integration test (#31093)
Signed-off-by: AmeenP <ameenp360@gmail.com>
2025-12-21 15:43:01 +00:00
Chauncey
bb80f69bc9 add aarnphm and chaunceyjiang to the new tool_parser directory (#31088)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-12-21 03:24:34 +00:00
汪志鹏
3e92b2b7ac [BugFix]fix gpt-oss v1/completions response bug (#30608)
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: bbrowning <bbrownin@redhat.com>
2025-12-21 10:39:31 +08:00
Jinzhen Lin
7c73ceb581 [Quantization] add marlin w4a8/w8a8 check (#31061)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
2025-12-20 21:58:11 +00:00
Lucas Wilkinson
ae0770fa6b [CI] Fix H200 Distributed test (#31054)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-20 16:48:49 -05:00
Jinzhen Lin
ee52d9901d [Quantization] support logical_widths for fp8 marlin (#30962)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-20 12:02:57 -08:00
baonudesifeizhai
54c8924384 [MoE Refactor][5/N] Isolate zero expert to LongCatFlash (#28891)
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com>
Signed-off-by: Dongjie Zou <85092850+baonudesifeizhai@users.noreply.github.com>
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robertgshaw2@gmail.com>
2025-12-20 18:22:04 +00:00
Yan Ma
560ae9638c [XPU] enable fp8 online streaming quantization (#30944)
Signed-off-by: Yan Ma <yan.ma@intel.com>
2025-12-20 13:45:27 +00:00
Jeffrey Wang
1501a4070e [Bugfix] Read truncate_prompt_tokens from pooling_params in AsyncLLM.encode() (#31013)
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
2025-12-20 10:29:31 +00:00
Lucas Wilkinson
ff2168bca3 [CI] FIx fixture 'siglip_attention_config' not found (#31053)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-20 03:46:15 +00:00
Gregory Shtrasberg
0be149524c [ROCm][CI/Build] Update ROCm dockerfiles (#30991)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-12-20 03:19:12 +00:00
zejunchen-zejun
d52c5096d7 [Bugfix] fix the alias bug of AttentionBackendEnum when register CUSTOM attention backend to vllm (#30869)
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
2025-12-20 09:03:35 +08:00
Yuxuan Zhang
8a7a414374 GLM-4.7 Tool Parser and Doc Update (#30876)
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
2025-12-20 00:09:58 +00:00
Robert Shaw
95befecc18 [MoE Refactor][2/N] Use Modular Kernels for Fp8 (#30825)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2025-12-19 23:36:38 +00:00
Wentao Ye
4cf9429897 [Bug] Fix error 'Dynamo failed to run FX node with fake tensors for Deepseek V3.2 (#31046)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-19 23:31:31 +00:00
Robert Shaw
83a317f650 [MoE Refactor][3/N] Deprecate cutlass block quant fp8 (b200) (#30990)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2025-12-19 13:09:54 -08:00
Lucas Wilkinson
5f6477d1d0 [BugFix] Fix TypeError: unhashable type: 'dict' when serving deepseek32 (#30924)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-19 16:07:54 -05:00
Wentao Ye
3bd8335bd0 [Refactor] Refactor for DeepGemmQuantScaleFMT using cache (#30898)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-19 13:50:39 -07:00
Seiji Eicher
1ab5213531 Make engine core client handshake timeout configurable (#27444)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
2025-12-19 20:38:30 +00:00
Zhonghua Deng
969bbc7c61 [Model] Add MiMo-V2-Flash support (#30836)
Signed-off-by: Abatom <abzhonghua@gmail.com>
Signed-off-by: Jumiar <liuanqim10@126.com>
Signed-off-by: Zyann7 <zyann7@outlook.com>
Co-authored-by: Jumiar <liuanqim10@126.com>
Co-authored-by: Zyann7 <zyann7@outlook.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-19 17:17:03 +00:00
Andrey Talman
268a972c62 Update Pytorch version update docs (#30982) 2025-12-19 16:08:53 +00:00
Jinzhen Lin
5fbfa8d9ef [Quantization] fix marlin w8a8 check (#30961)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
2025-12-19 07:33:22 -08:00
Shanshan Shen
23a1946e3b [CustomOp][Refactor] Extract common methods for ApplyRotaryEmb CustomOp (#31021)
Signed-off-by: shen-shanshan <467638484@qq.com>
2025-12-19 22:16:09 +08:00
Thomas Parnell
b5545d9d5c [Bugfix] [Kernel] Triton attention kernels: mask out V blocks that fall outside sliding window (#30887)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-12-19 21:39:54 +08:00
Nishidha Panpaliya
bd2b52fc2d [CPU][Bugfix] Fix ppc64le CPU build (#30871)
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
2025-12-19 12:26:35 +00:00
Li, Jiang
420ba2dbb6 Enable aarch64 CPU performance benchmarks (#26494)
Signed-off-by: Ioana Ghiban <ioana.ghiban@arm.com>
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
Co-authored-by: Ioana Ghiban <ioana.ghiban@arm.com>
Co-authored-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-12-19 12:16:18 +00:00
Marko Rosenmueller
455949675d [Frontend][Bug] allow tool calls in analysis channel (#28139)
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2025-12-19 10:47:44 +00:00
lif
086b96339f [Bugfix] Add validation for tool requests when tool_parser is unavailable (#30613)
Signed-off-by: majiayu000 <1835304752@qq.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-19 18:23:28 +08:00
Jinzhen Lin
9187de9fac [Quantization] enable compressed-tensors marlin support for turing (2) (#31008)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
2025-12-19 08:56:35 +00:00
Isotr0py
ac1c934276 [Bugfix] Fix incorrect tiles creation for mm prefix triton attention (#30974)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-12-19 16:00:33 +08:00
Wenqi Glantz
4924ac582c Add hidden dimension validation for multimodal embedding inputs (#30968)
Signed-off-by: Wenqi Glantz <wglantz@nvidia.com>
2025-12-19 07:59:36 +00:00
Li, Jiang
096b25c9ed [Doc][CPU] Fix index link for CPU regular release wheels (#31015)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-12-19 07:29:52 +00:00
Jinzhen Lin
de08b8f61b [Quantization] enable compressed-tensors marlin support for turing (#31000)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
2025-12-18 20:29:48 -08:00
Nick Hill
2ac85a4544 [BugFix] Fix logprobs with spec decode and modified logits (#30846)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-12-18 19:58:28 -08:00
Andreas Karatzas
7b43db210c [ROCm][CI][Bugfix] Multi-Modal Model Support Fixes and Attention Backend Improvements (#30270)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-19 02:17:27 +00:00
1237 changed files with 59594 additions and 25739 deletions

View File

@@ -2,7 +2,7 @@
 # We can use this script to compute baseline accuracy on chartqa for vllm.
 #
 # Make sure you have lm-eval-harness installed:
-# pip install lm-eval==0.4.9
+# pip install "lm-eval[api]>=0.4.9.2"
 usage() {
     echo``

View File

@@ -2,7 +2,7 @@
 # We can use this script to compute baseline accuracy on GSM for transformers.
 #
 # Make sure you have lm-eval-harness installed:
-# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]
+# pip install "lm-eval[api]>=0.4.9.2"
 usage() {
     echo``

View File

@@ -3,7 +3,7 @@
 # We use this for fp8, which HF does not support.
 #
 # Make sure you have lm-eval-harness installed:
-# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]
+# pip install "lm-eval[api]>=0.4.9.2"
 usage() {
     echo``

View File

@@ -3,7 +3,7 @@
 # We use this for fp8, which HF does not support.
 #
 # Make sure you have lm-eval-harness installed:
-# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]
+# pip install "lm-eval[api]>=0.4.9.2"
 usage() {
     echo``

View File

@@ -60,6 +60,7 @@ def launch_lm_eval(eval_config, tp_size):
         f"add_bos_token=true,"
         f"trust_remote_code={trust_remote_code},"
         f"max_model_len={max_model_len},"
+        "allow_deprecated_quantization=True,"
     )
     env_vars = eval_config.get("env_vars", None)
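The hunk above appends one more key to a comma-joined `model_args` string. A minimal standalone sketch of that assembly pattern (the `pretrained=` and `tensor_parallel_size=` keys are assumptions for illustration, not shown in the hunk):

```python
def build_model_args(model: str, tp_size: int, max_model_len: int,
                     trust_remote_code: bool) -> str:
    # Mirrors the f-string concatenation in launch_lm_eval: every key=value
    # pair ends with a comma, so adding a new flag is a one-line change.
    return (
        f"pretrained={model},"
        f"tensor_parallel_size={tp_size},"
        f"add_bos_token=true,"
        f"trust_remote_code={trust_remote_code},"
        f"max_model_len={max_model_len},"
        "allow_deprecated_quantization=True,"
    )


args = build_model_args("facebook/opt-125m", 1, 2048, False)
print(args)
```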

View File

@@ -7,7 +7,7 @@ vLLM also maintains a continuous performance benchmark under [perf.vllm.ai](http
 ## Performance benchmark quick overview
-**Benchmarking Coverage**: latency, throughput and fix-qps serving on B200, A100, H100, Intel® Xeon® Processors and Intel® Gaudi® 3 Accelerators with different models.
+**Benchmarking Coverage**: latency, throughput and fix-qps serving on B200, A100, H100, Intel® Xeon® Processors, Intel® Gaudi® 3 Accelerators and Arm® Neoverse™ with different models.
 **Benchmarking Duration**: about 1hr.
@@ -23,7 +23,7 @@ bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
 Runtime environment variables:
-- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
+- `ON_CPU`: set the value to '1' on Intel® Xeon® and Arm® Neoverse™ Processors. Default value is 0.
 - `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
 - `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
 - `THROUGHPUT_JSON`: JSON file to use for the throughput tests. Default value is empty string (use default file).
@@ -34,8 +34,9 @@ Runtime environment variables:
 See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
 > NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead.
-For Intel® Gaudi® 3 Accelerators, use `tests/latency-tests-hpu.json`, `tests/throughput-tests-hpu.json`, `tests/serving-tests-hpu.json` instead.
->
+> For Intel® Gaudi® 3 Accelerators, use `tests/latency-tests-hpu.json`, `tests/throughput-tests-hpu.json`, `tests/serving-tests-hpu.json` instead.
+> For Arm® Neoverse™, use `tests/latency-tests-arm64-cpu.json`, `tests/throughput-tests-arm64-cpu.json`, `tests/serving-tests-arm64-cpu.json` instead.
 ### Latency test
 Here is an example of one test inside `latency-tests.json`:
@@ -175,19 +176,6 @@ If you do not see the table, please wait till the benchmark finish running.
 The json version of the table (together with the json version of the benchmark) will be also attached to the markdown file.
 The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking.
-The `compare-json-results.py` helps to compare benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
-When run, benchmark script generates results under `benchmark/results` folder, along with the `benchmark_results.md` and `benchmark_results.json`.
-`compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratio e.g. for Output Tput, Median TTFT and Median TPOT.
-If only one benchmark_results.json is passed, `compare-json-results.py` compares different TP and PP configurations in the benchmark_results.json instead.
-Here is an example using the script to compare result_a and result_b with Model, Dataset name, input/output length, max concurrency and qps.
-`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`
-| | Model | Dataset Name | Input Len | Output Len | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
-|----|---------------------------------------|--------|-----|-----|------|-----|-----------|----------|----------|
-| 0 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | 1 | 142.633982 | 156.526018 | 1.097396 |
-| 1 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | inf| 241.620334 | 294.018783 | 1.216863 |
-A comparison diagram will be generated below the table.
-Here is an example to compare between 96c/results_gnr_96c_091_tp2pp3 and 128c/results_gnr_128c_091_tp2pp3
-<img width="1886" height="828" alt="image" src="https://github.com/user-attachments/assets/c02a43ef-25d0-4fd6-90e5-2169a28682dd" />
+#### Performance Results Comparison
+Follow the instructions in [performance results comparison](https://docs.vllm.ai/en/latest/benchmarking/dashboard/#performance-results-comparison) to analyze performance results and the sizing guide.
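For readers who want the gist of what `compare-json-results.py` computes, here is a minimal, self-contained sketch with made-up numbers: align two `benchmark_results.json` runs on their key columns and report the throughput ratio of the second run over the first (the real script additionally inverts the ratio for latency metrics and masks a zero baseline, which is mirrored here).

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical benchmark records; real files carry many more columns.
keys = ["Model", "Input Len", "Output Len"]
run_a = [{"Model": "m", "Input Len": 128, "Output Len": 128, "Output Tput": 100.0}]
run_b = [{"Model": "m", "Input Len": 128, "Output Len": 128, "Output Tput": 120.0}]

with tempfile.TemporaryDirectory() as d:
    paths = []
    for name, rows in [("a.json", run_a), ("b.json", run_b)]:
        p = os.path.join(d, name)
        with open(p, "w") as f:
            json.dump(rows, f)
        paths.append(p)

    # Index each run by the shared key columns so rows align by key,
    # not by row order, then divide the metric series.
    frames = [
        pd.read_json(p, orient="records").set_index(keys)["Output Tput"]
        for p in paths
    ]
    ratio = (frames[1] / frames[0]).mask(frames[0] == 0)  # higher is better
    print(float(ratio.iloc[0]))
```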

View File

@@ -1,8 +1,13 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+from __future__ import annotations
+
 import argparse
+import html as _html
 import json
 import os
+from dataclasses import dataclass
 from importlib import util
 
 import pandas as pd
@@ -10,27 +15,49 @@ import pandas as pd
 pd.options.display.float_format = "{:.2f}".format
 plotly_found = util.find_spec("plotly.express") is not None
 
+DEFAULT_INFO_COLS = [
+    "Model",
+    "Dataset Name",
+    "Input Len",
+    "Output Len",
+    # "TP Size",
+    # "PP Size",
+    "# of max concurrency.",
+    "qps",
+]
+
+# Safety net: if any DataFrame leaks into to_html(), keep precision at 2.
+pd.set_option("display.precision", 2)
+pd.set_option("display.float_format", lambda x: f"{x:.2f}")
+
+
+# -----------------------------
+# Core data compare
+# -----------------------------
 def compare_data_columns(
-    files, name_column, data_column, info_cols, drop_column, debug=False
+    files: list[str],
+    name_column: str,
+    data_column: str,
+    info_cols: list[str],
+    drop_column: str,
+    debug: bool = False,
 ):
     """
     Align concatenation by keys derived from info_cols instead of row order.
     - Pick one canonical key list: subset of info_cols present in ALL files.
     - For each file: set index to those keys, aggregate duplicates
-    - (mean for metric, first for names).
+      (mean for metric, first for names).
     - Concat along axis=1 (indexes align), then reset_index so callers can
-    - group by columns.
+      group by columns.
     - If --debug, add a <file_label>_name column per file.
     """
     print("\ncompare_data_column:", data_column)
     frames = []
-    raw_data_cols = []
+    raw_data_cols: list[str] = []
     compare_frames = []
-    # 1) choose a canonical key list from info_cols that exists in ALL files
-    cols_per_file = []
+    cols_per_file: list[set] = []
     for f in files:
         try:
             df_tmp = pd.read_json(f, orient="records")
@@ -40,24 +67,20 @@ def compare_data_columns(
     key_cols = [c for c in info_cols if all(c in cset for cset in cols_per_file)]
     if not key_cols:
-        # soft fallback: use any info_cols present in the first file
         key_cols = [c for c in info_cols if c in list(cols_per_file[0])]
     if not key_cols:
         raise ValueError(
             "No common key columns found from info_cols across the input files."
         )
 
-    # 2) build a single "meta" block (keys as columns) once, aligned by the key index
     meta_added = False
     for file in files:
         df = pd.read_json(file, orient="records")
-        # Keep rows that actually have the compared metric (same as original behavior)
         if drop_column in df.columns:
             df = df.dropna(subset=[drop_column], ignore_index=True)
-        # Stabilize numeric key columns (harmless if missing)
         for c in (
             "Input Len",
             "Output Len",
@@ -69,32 +92,26 @@ def compare_data_columns(
             if c in df.columns:
                 df[c] = pd.to_numeric(df[c], errors="coerce")
-        # Ensure all key columns exist
         for c in key_cols:
             if c not in df.columns:
                 df[c] = pd.NA
-        # Set index = key_cols and aggregate duplicates → unique MultiIndex
         df_idx = df.set_index(key_cols, drop=False)
-        # meta (key columns), unique per key
         meta = df_idx[key_cols]
         if not meta.index.is_unique:
             meta = meta.groupby(level=key_cols, dropna=False).first()
-        # metric series for this file, aggregated to one row per key
         file_label = "/".join(file.split("/")[:-1]) or os.path.basename(file)
         s = df_idx[data_column]
         if not s.index.is_unique:
             s = s.groupby(level=key_cols, dropna=False).mean()
-        s.name = file_label  # column label like original
+        s.name = file_label
-        # add meta once (from first file) so keys are the leftmost columns
         if not meta_added:
             frames.append(meta)
             meta_added = True
-        # (NEW) debug: aligned test-name column per file
         if debug and name_column in df_idx.columns:
             name_s = df_idx[name_column]
             if not name_s.index.is_unique:
@@ -106,26 +123,19 @@ def compare_data_columns(
         raw_data_cols.append(file_label)
         compare_frames.append(s)
-        # Generalize ratio: for any file N>=2, add ratio (fileN / file1)
         if len(compare_frames) >= 2:
             base = compare_frames[0]
             current = compare_frames[-1]
             if "P99" in data_column or "Median" in data_column:
-                ratio = base / current  # for latency
+                ratio = base / current
             else:
                 ratio = current / base
-            ratio = ratio.mask(base == 0)  # avoid inf when baseline is 0
+            ratio = ratio.mask(base == 0)
             ratio.name = f"Ratio 1 vs {len(compare_frames)}"
             frames.append(ratio)
 
-    # 4) concat on columns with aligned MultiIndex;
-    # then reset_index to return keys as columns
-    concat_df = pd.concat(frames, axis=1)
-    concat_df = concat_df.reset_index(drop=True).reset_index()
-    if "index" in concat_df.columns:
-        concat_df = concat_df.drop(columns=["index"])
+    concat_df = pd.concat(frames, axis=1).reset_index(drop=True)
 
-    # Ensure key/info columns appear first (in your info_cols order)
     front = [c for c in info_cols if c in concat_df.columns]
     rest = [c for c in concat_df.columns if c not in front]
     concat_df = concat_df[front + rest]
@@ -134,20 +144,15 @@
     return concat_df, raw_data_cols
 
 
+# -----------------------------
+# Split helper
+# -----------------------------
 def split_json_by_tp_pp(
     input_file: str = "benchmark_results.json", output_root: str = "."
 ) -> list[str]:
-    """
-    Split a benchmark JSON into separate folders by (TP Size, PP Size).
-    Creates: <output_root>/tp{TP}_pp{PP}/benchmark_results.json
-    Returns: list of file paths written.
-    """
-    # Load JSON data into DataFrame
     with open(input_file, encoding="utf-8") as f:
         data = json.load(f)
 
-    # If the JSON is a dict with a list under common keys, use that list
     if isinstance(data, dict):
         for key in ("results", "serving_results", "benchmarks", "data"):
             if isinstance(data.get(key), list):
@@ -156,7 +161,6 @@ def split_json_by_tp_pp(
     df = pd.DataFrame(data)
 
-    # Keep only "serving" tests
     name_col = next(
         (c for c in ["Test name", "test_name", "Test Name"] if c in df.columns), None
     )
@@ -165,7 +169,6 @@ def split_json_by_tp_pp(
         df[name_col].astype(str).str.contains(r"serving", case=False, na=False)
     ].copy()
 
-    # Handle alias column names
     rename_map = {
         "tp_size": "TP Size",
         "tensor_parallel_size": "TP Size",
@@ -176,21 +179,14 @@ def split_json_by_tp_pp(
         columns={k: v for k, v in rename_map.items() if k in df.columns}, inplace=True
     )
 
-    # Ensure TP/PP columns exist (default to 1 if missing)
     if "TP Size" not in df.columns:
         df["TP Size"] = 1
     if "PP Size" not in df.columns:
         df["PP Size"] = 1
-    # make sure TP/PP are numeric ints with no NaN
-    df["TP Size"] = (
-        pd.to_numeric(df.get("TP Size", 1), errors="coerce").fillna(1).astype(int)
-    )
-    df["PP Size"] = (
-        pd.to_numeric(df.get("PP Size", 1), errors="coerce").fillna(1).astype(int)
-    )
+    df["TP Size"] = pd.to_numeric(df["TP Size"], errors="coerce").fillna(1).astype(int)
+    df["PP Size"] = pd.to_numeric(df["PP Size"], errors="coerce").fillna(1).astype(int)
 
-    # Split into separate folders
     saved_paths: list[str] = []
     for (tp, pp), group_df in df.groupby(["TP Size", "PP Size"], dropna=False):
         folder_name = os.path.join(output_root, f"tp{int(tp)}_pp{int(pp)}")
@@ -203,32 +199,9 @@ def split_json_by_tp_pp(
     return saved_paths
 
 
-def _add_limit_line(fig, y_value, label):
-    # Visible dashed line + annotation
-    fig.add_hline(
-        y=y_value,
-        line_dash="dash",
-        line_color="red" if "ttft" in label.lower() else "blue",
-        annotation_text=f"{label}: {y_value} ms",
-        annotation_position="top left",
-    )
-    # Optional: add a legend item (as a transparent helper trace)
-    if plot and plotly_found:
-        import plotly.graph_objects as go
-
-        fig.add_trace(
-            go.Scatter(
-                x=[None],
-                y=[None],
-                mode="lines",
-                line=dict(
-                    dash="dash", color="red" if "ttft" in label.lower() else "blue"
-                ),
-                name=f"{label}",
-            )
-        )
+# -----------------------------
+# Styling helpers
+# -----------------------------
 def _find_concurrency_col(df: pd.DataFrame) -> str:
     for c in [
         "# of max concurrency.",
@@ -239,7 +212,6 @@ def _find_concurrency_col(df: pd.DataFrame) -> str:
     ]:
         if c in df.columns:
             return c
-    # Fallback: guess an integer-like column (harmless if unused)
     for c in df.columns:
         if df[c].dtype.kind in "iu" and df[c].nunique() > 1 and df[c].min() >= 1:
             return c
@@ -248,8 +220,7 @@
 def _highlight_threshold(
     df: pd.DataFrame, threshold: float
-) -> "pd.io.formats.style.Styler":
+) -> pd.io.formats.style.Styler:
-    """Highlight numeric per-configuration columns with value <= threshold."""
     conc_col = _find_concurrency_col(df)
     key_cols = [
         c
@@ -260,6 +231,7 @@ def _highlight_threshold(
         c for c in df.columns if c not in key_cols and not str(c).startswith("Ratio")
     ]
     conf_cols = [c for c in conf_cols if pd.api.types.is_numeric_dtype(df[c])]
+
     return df.style.map(
         lambda v: "background-color:#e6ffe6;font-weight:bold;"
         if pd.notna(v) and v <= threshold
@@ -268,7 +240,264 @@ def _highlight_threshold(
) )
if __name__ == "__main__": def highlight_ratio_columns(styler: pd.io.formats.style.Styler):
ratio_cols = [c for c in styler.data.columns if "ratio" in str(c).lower()]
if not ratio_cols:
return styler
styler = styler.apply(
lambda _: ["background-color: #fff3b0"] * len(styler.data),
subset=ratio_cols,
axis=0,
)
styler = styler.set_table_styles(
[
{
"selector": f"th.col_heading.level0.col{i}",
"props": [("background-color", "#fff3b0")],
}
for i, col in enumerate(styler.data.columns)
if col in ratio_cols
],
overwrite=False,
)
return styler
def _apply_two_decimals(
styler: pd.io.formats.style.Styler,
) -> pd.io.formats.style.Styler:
df = styler.data
num_cols = df.select_dtypes("number").columns
if len(num_cols) == 0:
return styler
return styler.format({c: "{:.2f}" for c in num_cols}, na_rep="")
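The strict two-decimal formatting used throughout the report can be sanity-checked in isolation. A minimal sketch with a toy DataFrame (column names are invented for illustration):

```python
import pandas as pd

# Toy frame standing in for one of the comparison tables above.
df = pd.DataFrame({"Config": ["a", "b"], "TTFT (ms)": [1.234, None]})
num_cols = df.select_dtypes("number").columns

def two_decimals(v):
    # Same per-cell rule as the strict formatters: 2 decimals, blank for NA.
    return "" if pd.isna(v) else f"{float(v):.2f}"

print(df[num_cols[0]].map(two_decimals).tolist())  # ['1.23', '']
```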
# -----------------------------
# Valid max concurrency summary helpers
# -----------------------------
def _config_value_columns(df: pd.DataFrame, conc_col: str) -> list[str]:
key_cols = [
c
for c in ["Model", "Dataset Name", "Input Len", "Output Len"]
if c in df.columns
]
exclude = set(key_cols + [conc_col, "qps", "QPS"])
cols: list[str] = []
for c in df.columns:
if c in exclude:
continue
lc = str(c).lower()
if lc.startswith("ratio"):
continue
if lc.endswith("_name") or lc == "test name" or lc == "test_name":
continue
if pd.api.types.is_numeric_dtype(df[c]):
cols.append(c)
return cols
def _max_concurrency_ok(
df: pd.DataFrame, conc_col: str, cfg_col: str, threshold: float
):
if df is None or conc_col not in df.columns or cfg_col not in df.columns:
return pd.NA
d = df[[conc_col, cfg_col]].copy()
d[conc_col] = pd.to_numeric(d[conc_col], errors="coerce")
d[cfg_col] = pd.to_numeric(d[cfg_col], errors="coerce")
d = d.dropna(subset=[conc_col, cfg_col])
if d.empty:
return pd.NA
ok = d[d[cfg_col] <= threshold]
if ok.empty:
return pd.NA
return ok[conc_col].max()
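A toy illustration of the scan `_max_concurrency_ok` performs (the data and the threshold value here are invented): keep the rows at or under the latency limit, then report the largest concurrency among them.

```python
import pandas as pd

d = pd.DataFrame({
    "# of max concurrency.": [12, 24, 64],
    "TTFT (ms)": [80.0, 150.0, 900.0],
})
threshold = 500.0  # hypothetical --ttft-max-ms value

# Rows satisfying the limit; max concurrency among them is the answer.
ok = d[d["TTFT (ms)"] <= threshold]
best = ok["# of max concurrency."].max() if not ok.empty else pd.NA
print(best)  # 24
```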
def _value_at_concurrency(df: pd.DataFrame, conc_col: str, cfg_col: str, conc_value):
if (
df is None
or conc_col not in df.columns
or cfg_col not in df.columns
or pd.isna(conc_value)
):
return pd.NA
d = df[[conc_col, cfg_col]].copy()
d[conc_col] = pd.to_numeric(d[conc_col], errors="coerce")
d[cfg_col] = pd.to_numeric(d[cfg_col], errors="coerce")
conc_value = pd.to_numeric(conc_value, errors="coerce")
if pd.isna(conc_value):
return pd.NA
hit = d[d[conc_col] == conc_value]
if hit.empty:
return pd.NA
return hit[cfg_col].iloc[0]
def build_valid_max_concurrency_summary_html(
tput_group_df: pd.DataFrame | None,
ttft_group_df: pd.DataFrame | None,
tpot_group_df: pd.DataFrame | None,
conc_col: str,
args,
) -> str:
if ttft_group_df is None and tpot_group_df is None:
return ""
ttft_cols = (
_config_value_columns(ttft_group_df, conc_col)
if ttft_group_df is not None
else []
)
tpot_cols = (
_config_value_columns(tpot_group_df, conc_col)
if tpot_group_df is not None
else []
)
tput_cols = (
_config_value_columns(tput_group_df, conc_col)
if tput_group_df is not None
else []
)
if ttft_group_df is not None and tpot_group_df is not None:
cfg_cols = [c for c in ttft_cols if c in tpot_cols]
if tput_group_df is not None:
cfg_cols = [c for c in cfg_cols if c in tput_cols] or cfg_cols
else:
cfg_cols = ttft_cols or tpot_cols
if not cfg_cols:
cfg_cols = sorted(set(ttft_cols) | set(tpot_cols) | set(tput_cols), key=str)
rows = []
for cfg in cfg_cols:
ttft_max = (
_max_concurrency_ok(ttft_group_df, conc_col, cfg, args.ttft_max_ms)
if ttft_group_df is not None
else pd.NA
)
tpot_max = (
_max_concurrency_ok(tpot_group_df, conc_col, cfg, args.tpot_max_ms)
if tpot_group_df is not None
else pd.NA
)
both = (
pd.NA
if (pd.isna(ttft_max) or pd.isna(tpot_max))
else min(ttft_max, tpot_max)
)
tput_at_both = (
_value_at_concurrency(tput_group_df, conc_col, cfg, both)
if tput_group_df is not None
else pd.NA
)
ttft_at_both = (
_value_at_concurrency(ttft_group_df, conc_col, cfg, both)
if ttft_group_df is not None
else pd.NA
)
tpot_at_both = (
_value_at_concurrency(tpot_group_df, conc_col, cfg, both)
if tpot_group_df is not None
else pd.NA
)
rows.append(
{
"Configuration": cfg,
f"Max {conc_col} (TTFT ≤ {args.ttft_max_ms:g} ms)": ttft_max,
f"Max {conc_col} (TPOT ≤ {args.tpot_max_ms:g} ms)": tpot_max,
f"Max {conc_col} (Both)": both,
"Output Tput @ Both (tok/s)": tput_at_both,
"TTFT @ Both (ms)": ttft_at_both,
"TPOT @ Both (ms)": tpot_at_both,
}
)
summary_df = pd.DataFrame(rows)
# --- Coerce numeric columns so Styler doesn't miss them due to object dtype ---
for c in summary_df.columns:
if c == "Configuration":
continue
summary_df[c] = pd.to_numeric(summary_df[c], errors="coerce")
both_col = f"Max {conc_col} (Both)"
# --- Strict 2-decimal formatting for ALL non-Configuration columns ---
formatters = {}
for c in summary_df.columns:
if c == "Configuration":
continue
        # the lambda does not capture the loop variable, so no default-argument binding is needed
        formatters[c] = lambda v: "" if pd.isna(v) else f"{float(v):.2f}"
styler = summary_df.style.format(formatters)
def _green(v):
return "background-color:#e6ffe6;font-weight:bold;" if pd.notna(v) else ""
if both_col in summary_df.columns:
styler = styler.map(_green, subset=[both_col])
title = (
'<div style="font-size: 1.15em; font-weight: 700; margin: 12px 0 6px 0;">'
"Valid Max Concurrency Summary"
"</div>\n"
)
return title + styler.to_html(table_attributes='border="1" class="dataframe"')
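The "Max (Both)" column produced above is the smaller of the two per-metric maxima, and stays NA whenever either side is NA (no concurrency satisfied that limit). A minimal sketch of that rule:

```python
import pandas as pd

def combined_max(ttft_max, tpot_max):
    # NA on either side means that limit was never satisfied,
    # so the combined column must also be NA.
    if pd.isna(ttft_max) or pd.isna(tpot_max):
        return pd.NA
    return min(ttft_max, tpot_max)

print(combined_max(64, 32))              # 32
print(pd.isna(combined_max(64, pd.NA)))  # True
```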
# -----------------------------
# Plot helper
# -----------------------------
def _add_limit_line(fig, y_value: float, label: str):
fig.add_hline(
y=y_value,
line_dash="dash",
line_color="red" if "ttft" in label.lower() else "blue",
annotation_text=f"{label}: {y_value} ms",
annotation_position="top left",
)
if plotly_found:
import plotly.graph_objects as go
fig.add_trace(
go.Scatter(
x=[None],
y=[None],
mode="lines",
line=dict(
dash="dash",
color="red" if "ttft" in label.lower() else "blue",
),
name=label,
)
)
# -----------------------------
# Refactored main + group-first report
# -----------------------------
@dataclass(frozen=True)
class MetricPlan:
data_cols: list[str]
drop_column: str
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-f", "--file", action="append", type=str, help="input file name"
@@ -308,149 +537,289 @@ if __name__ == "__main__":
        default=100.0,
        help="Reference limit for TPOT plots (ms)",
    )
    return parser


def choose_metrics(latency: str) -> MetricPlan:
    latency = (latency or "").lower()
    drop_column = "P99"
    if "median" in latency:
        return MetricPlan(
            data_cols=["Output Tput (tok/s)", "Median TTFT (ms)", "Median"],
            drop_column=drop_column,
        )
    return MetricPlan(
        data_cols=["Output Tput (tok/s)", "P99 TTFT (ms)", "P99"],
        drop_column=drop_column,
    )


def prepare_input_files(args, info_cols: list[str]) -> tuple[list[str], list[str]]:
    if not args.file:
        raise ValueError("No input files provided. Use -f/--file.")
    if len(args.file) == 1:
        files = split_json_by_tp_pp(args.file[0], output_root="splits")
        info_cols = [c for c in info_cols if c not in ("TP Size", "PP Size")]
    else:
        files = args.file
    return files, info_cols
def get_y_axis_col(info_cols: list[str], xaxis: str) -> str:
y_axis_index = info_cols.index(xaxis) if xaxis in info_cols else 6
return info_cols[y_axis_index]
def get_group_cols(output_df: pd.DataFrame, info_cols: list[str]) -> list[str]:
filtered_info_cols = info_cols[:4]
group_cols = [c for c in filtered_info_cols if c in output_df.columns]
if not group_cols:
raise ValueError(
f"No valid group-by columns. Expected subset: {filtered_info_cols}, "
f"but DataFrame has: {list(output_df.columns)}"
)
return group_cols
def normalize_group_key(name):
return name if isinstance(name, tuple) else (name,)
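pandas yields scalar group keys for a single group-by column but tuples for several (and this behavior shifted across pandas versions), which is why `normalize_group_key` coerces every key to a tuple. A quick check on toy data:

```python
import pandas as pd

def normalize_group_key(name):
    # Same helper as above: scalar keys become 1-tuples.
    return name if isinstance(name, tuple) else (name,)

df = pd.DataFrame({"Model": ["m1", "m1", "m2"], "v": [1, 2, 3]})
keys = [normalize_group_key(k) for k, _ in df.groupby(["Model"], dropna=False)]
print(keys)  # [('m1',), ('m2',)]
```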
def group_filename(name, prefix: str = "perf_comparison_") -> str:
name_vals = normalize_group_key(name)
safe = ",".join(map(str, name_vals)).replace(",", "_").replace("/", "-")
return f"{prefix}{safe}.html"
def build_group_suffix(group_cols: list[str], name) -> str:
name_vals = normalize_group_key(name)
return " , ".join(f"{col} : [ {val} ] " for col, val in zip(group_cols, name_vals))
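`group_filename` sanitises group keys so that values such as model IDs containing `/` still produce valid file names. A worked example with a hypothetical group key:

```python
name_vals = ("meta-llama/Llama-3.1-8B", "random", 128)
# Mirror group_filename above: join on ",", then swap the separators.
safe = ",".join(map(str, name_vals)).replace(",", "_").replace("/", "-")
print(f"perf_comparison_{safe}.html")
# perf_comparison_meta-llama-Llama-3.1-8B_random_128.html
```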
def render_metric_table_html(
display_group: pd.DataFrame,
metric_label: str,
group_suffix: str,
args,
) -> str:
title = (
f'<div style="font-size: 1.25em; font-weight: 600; margin: 12px 0;">'
f"{_html.escape(metric_label)}"
f"{_html.escape(group_suffix)}"
f"</div>\n"
)
metric_name = metric_label.lower()
if "ttft" in metric_name:
styler = _highlight_threshold(display_group, args.ttft_max_ms)
elif ("tpot" in metric_name) or ("median" in metric_name) or ("p99" in metric_name):
styler = _highlight_threshold(display_group, args.tpot_max_ms)
else:
styler = display_group.style
styler = _apply_two_decimals(styler)
styler = highlight_ratio_columns(styler)
return title + styler.to_html(table_attributes='border="1" class="dataframe"')
def maybe_write_plot(
main_fh,
sub_fh,
group_df: pd.DataFrame,
raw_data_cols: list[str],
metric_label: str,
y_axis_col: str,
args,
):
if not (args.plot and plotly_found):
return
import plotly.express as px
df = group_df[raw_data_cols].sort_values(by=y_axis_col)
df_melted = df.melt(
id_vars=y_axis_col,
var_name="Configuration",
value_name=metric_label,
)
fig = px.line(
df_melted,
x=y_axis_col,
y=metric_label,
color="Configuration",
title=f"{metric_label} vs {y_axis_col}",
markers=True,
)
# Ensure plot hover + y tick labels are also 2 decimals.
fig.update_traces(hovertemplate="%{y:.2f}<extra></extra>")
fig.update_yaxes(tickformat=".2f")
metric_name = metric_label.lower()
if "ttft" in metric_name:
_add_limit_line(fig, args.ttft_max_ms, "TTFT limit")
elif ("tpot" in metric_name) or ("median" in metric_name) or ("p99" in metric_name):
_add_limit_line(fig, args.tpot_max_ms, "TPOT limit")
html = fig.to_html(full_html=True, include_plotlyjs="cdn")
main_fh.write(html)
sub_fh.write(html)
def build_group_keys(
df: pd.DataFrame, group_cols: list[str], sort_cols: list[str] | None = None
):
if sort_cols:
df = df.sort_values(by=sort_cols)
gb = df.groupby(group_cols, dropna=False)
return [k for k, _ in gb]
def write_report_group_first(
files: list[str], info_cols: list[str], plan: MetricPlan, args
):
name_column = "Test name"
y_axis_col = get_y_axis_col(info_cols, args.xaxis)
    print("comparing : " + ", ".join(files))

    metric_cache: dict[str, tuple[pd.DataFrame, list[str]]] = {}
    group_cols_canonical: list[str] | None = None

    for metric_label in plan.data_cols:
        output_df, raw_data_cols = compare_data_columns(
            files,
            name_column,
            metric_label,
            info_cols,
            plan.drop_column,
            debug=args.debug,
        )
raw_data_cols = list(raw_data_cols)
raw_data_cols.insert(0, y_axis_col)
group_cols = get_group_cols(output_df, info_cols)
if group_cols_canonical is None:
group_cols_canonical = group_cols
else:
group_cols_canonical = [c for c in group_cols_canonical if c in group_cols]
metric_cache[metric_label] = (
output_df.sort_values(by=args.xaxis),
raw_data_cols,
)
if not group_cols_canonical:
raise ValueError("No canonical group columns found across metrics.")
first_metric = plan.data_cols[0]
first_df_sorted, _ = metric_cache[first_metric]
group_keys = build_group_keys(
first_df_sorted, group_cols_canonical, sort_cols=[args.xaxis]
)
metric_groupbys = {
metric_label: df.groupby(group_cols_canonical, dropna=False)
for metric_label, (df, _) in metric_cache.items()
}
with open("perf_comparison.html", "w", encoding="utf-8") as main_fh:
main_fh.write('<meta charset="utf-8">\n')
for gkey in group_keys:
gkey_tuple = normalize_group_key(gkey)
suffix = build_group_suffix(group_cols_canonical, gkey_tuple)
sub_path = group_filename(gkey_tuple)
group_header = (
'<div style="font-size: 1.4em; font-weight: 700; '
'margin: 18px 0 10px 0;">'
f"{_html.escape(suffix)}"
"</div>\n"
) )
            main_fh.write(group_header)

            with open(sub_path, "w", encoding="utf-8") as sub_fh:
sub_fh.write('<meta charset="utf-8">\n')
sub_fh.write(group_header)
tput_group_df = None
ttft_group_df = None
tpot_group_df = None
conc_col = args.xaxis
                for metric_label in plan.data_cols:
                    gb = metric_groupbys[metric_label]
                    df_sorted, raw_data_cols = metric_cache[metric_label]

                    try:
                        group_df = gb.get_group(gkey)
                    except KeyError:
                        missing = (
                            '<div style="font-size: 1.1em; font-weight: 600; '
'margin: 10px 0;">'
f"{_html.escape(metric_label)} — missing for this group"
"</div>\n"
)
main_fh.write(missing)
sub_fh.write(missing)
continue
if conc_col not in group_df.columns:
conc_col = _find_concurrency_col(group_df)
mn = metric_label.lower().strip()
if "tok/s" in mn:
tput_group_df = group_df
elif "ttft" in mn:
ttft_group_df = group_df
elif mn in ("p99", "median") or "tpot" in mn:
tpot_group_df = group_df
display_group = group_df.drop(
columns=group_cols_canonical, errors="ignore"
)
html = render_metric_table_html(
display_group, metric_label, suffix, args
)
main_fh.write(html)
sub_fh.write(html)
maybe_write_plot(
main_fh,
sub_fh,
group_df=group_df,
raw_data_cols=raw_data_cols,
metric_label=metric_label,
y_axis_col=y_axis_col,
args=args,
)
summary_html = build_valid_max_concurrency_summary_html(
tput_group_df=tput_group_df,
ttft_group_df=ttft_group_df,
tpot_group_df=tpot_group_df,
conc_col=conc_col,
args=args,
) )
                if summary_html:
                    main_fh.write(summary_html)
                    sub_fh.write(summary_html)


def main():
    args = build_parser().parse_args()
    info_cols = list(DEFAULT_INFO_COLS)
    plan = choose_metrics(args.latency)
    files, info_cols = prepare_input_files(args, info_cols)
    write_report_group_first(files, info_cols, plan, args)


if __name__ == "__main__":
    main()


@@ -49,7 +49,11 @@ check_cpus() {
    echo "Need at least 1 NUMA to run benchmarking."
    exit 1
  fi
  if [[ "$(uname -m)" == "aarch64" ]] || [[ "$(uname -m)" == "arm64" ]]; then
    declare -g gpu_type="arm64-cpu"
  else
    declare -g gpu_type="cpu"
  fi
  echo "GPU type is $gpu_type"
}
@@ -207,8 +211,8 @@ run_latency_tests() {
    # check if there is enough GPU to run the test
    tp=$(echo "$latency_params" | jq -r '.tensor_parallel_size')
    if [[ "$ON_CPU" == "1" ]]; then
      pp=$(echo "$latency_params" | jq -r '.pipeline_parallel_size // 1')
      world_size=$(($tp*$pp))
      if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
        echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
@@ -276,8 +280,8 @@ run_throughput_tests() {
    # check if there is enough GPU to run the test
    tp=$(echo "$throughput_params" | jq -r '.tensor_parallel_size')
    if [[ "$ON_CPU" == "1" ]]; then
      pp=$(echo "$throughput_params" | jq -r '.pipeline_parallel_size // 1')
      world_size=$(($tp*$pp))
      if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
        echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
@@ -393,8 +397,8 @@ run_serving_tests() {
    # check if there is enough resources to run the test
    tp=$(echo "$server_params" | jq -r '.tensor_parallel_size')
    if [[ "$ON_CPU" == "1" ]]; then
      pp=$(echo "$server_params" | jq -r '.pipeline_parallel_size // 1')
      world_size=$(($tp*$pp))
      if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
        echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
@@ -496,9 +500,9 @@ run_serving_tests() {
main() {
  local ARCH
  ARCH=''
  if [[ "$ON_CPU" == "1" ]]; then
    check_cpus
    ARCH="-$gpu_type"
  else
    check_gpus
    ARCH="$arch_suffix"


@@ -0,0 +1,26 @@
[
{
"test_name": "latency_llama8B_tp1",
"environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 1,
"load_format": "dummy",
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"num_iters_warmup": 5,
"num_iters": 15
}
}
]


@@ -0,0 +1,130 @@
{
"defaults": {
"qps_list": [
"inf"
],
"max_concurrency_list": [
12,
16,
24,
32,
64,
128,
200
],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_SGL_KERNEL": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"server_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 1,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"load_format": "dummy"
},
"client_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"backend": "vllm",
"ignore-eos": "",
"num_prompts": 200
}
},
"tests": [
{
"test_name": "serving_llama8B_tp1_sharegpt",
"server_parameters": {
"tensor_parallel_size": 1
},
"client_parameters": {
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json"
}
},
{
"test_name": "serving_llama8B_tp2_sharegpt",
"server_parameters": {
"tensor_parallel_size": 2
},
"client_parameters": {
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json"
}
},
{
"test_name": "serving_llama8B_tp1_random_128_128",
"server_parameters": {
"tensor_parallel_size": 1
},
"client_parameters": {
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 128
}
},
{
"test_name": "serving_llama8B_tp2_random_128_128",
"server_parameters": {
"tensor_parallel_size": 2
},
"client_parameters": {
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 128
}
},
{
"test_name": "serving_llama8B_tp1_random_128_2048",
"server_parameters": {
"tensor_parallel_size": 1
},
"client_parameters": {
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 2048
}
},
{
"test_name": "serving_llama8B_tp2_random_128_2048",
"server_parameters": {
"tensor_parallel_size": 2
},
"client_parameters": {
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 2048
}
},
{
"test_name": "serving_llama8B_tp1_random_2048_128",
"server_parameters": {
"tensor_parallel_size": 1
},
"client_parameters": {
"dataset_name": "random",
"random-input-len": 2048,
"random-output-len": 128
}
},
{
"test_name": "serving_llama8B_tp2_random_2048_128",
"server_parameters": {
"tensor_parallel_size": 2
},
"client_parameters": {
"dataset_name": "random",
"random-input-len": 2048,
"random-output-len": 128
}
}
]
}


@@ -19,10 +19,8 @@
        "block_size": 128,
        "trust_remote_code": "",
        "disable_log_stats": "",
        "max_num_batched_tokens": 2048,
        "max_num_seqs": 256
      },
      "client_parameters": {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
@@ -151,6 +149,45 @@
          "random-output-len": 128
        }
      },
{
"test_name": "serving_llama8B_int4_tp1_random_128_128",
"server_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"tensor_parallel_size": 1
},
"client_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 128
}
},
{
"test_name": "serving_llama8B_int4_tp2_random_128_128",
"server_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"tensor_parallel_size": 2
},
"client_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 128
}
},
{
"test_name": "serving_llama8B_int4_tp4_random_128_128",
"server_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"tensor_parallel_size": 4
},
"client_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 128
}
},
    {
      "test_name": "serving_llama3B_tp1_random_128_128",
      "server_parameters": {


@@ -0,0 +1,27 @@
[
{
"test_name": "throughput_llama8B_tp1",
"environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 1,
"load_format": "dummy",
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200,
"backend": "vllm"
}
}
]


@@ -1,6 +1,6 @@
steps:
  # aarch64 + CUDA builds
  - label: "Build wheel - aarch64 - CUDA 12.9"
    depends_on: ~
    id: build-wheel-arm64-cuda-12-9
    agents:
@@ -11,11 +11,11 @@ steps:
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
      - "mkdir artifacts"
      - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
      - "bash .buildkite/scripts/upload-nightly-wheels.sh"
    env:
      DOCKER_BUILDKIT: "1"
  - label: "Build wheel - aarch64 - CUDA 13.0"
    depends_on: ~
    id: build-wheel-arm64-cuda-13-0
    agents:
@@ -26,12 +26,12 @@ steps:
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
      - "mkdir artifacts"
      - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
      - "bash .buildkite/scripts/upload-nightly-wheels.sh manylinux_2_35"
    env:
      DOCKER_BUILDKIT: "1"
  # aarch64 build
  - label: "Build wheel - aarch64 - CPU"
    depends_on: ~
    id: build-wheel-arm64-cpu
    agents:
@@ -40,39 +40,39 @@ steps:
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --build-arg VLLM_BUILD_ACL=ON --tag vllm-ci:build-image --target vllm-build --progress plain -f docker/Dockerfile.cpu ."
      - "mkdir artifacts"
      - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
      - "bash .buildkite/scripts/upload-nightly-wheels.sh manylinux_2_35"
    env:
      DOCKER_BUILDKIT: "1"
  # x86 + CUDA builds
  - label: "Build wheel - x86_64 - CUDA 12.9"
    depends_on: ~
    id: build-wheel-x86-cuda-12-9
    agents:
      queue: cpu_queue_postmerge
    commands:
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
      - "mkdir artifacts"
      - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
      - "bash .buildkite/scripts/upload-nightly-wheels.sh manylinux_2_31"
    env:
      DOCKER_BUILDKIT: "1"
  - label: "Build wheel - x86_64 - CUDA 13.0"
    depends_on: ~
    id: build-wheel-x86-cuda-13-0
    agents:
      queue: cpu_queue_postmerge
    commands:
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
      - "mkdir artifacts"
      - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
      - "bash .buildkite/scripts/upload-nightly-wheels.sh manylinux_2_35"
    env:
      DOCKER_BUILDKIT: "1"
  # x86 CPU wheel build
  - label: "Build wheel - x86_64 - CPU"
    depends_on: ~
    id: build-wheel-x86-cpu
    agents:
@@ -81,12 +81,12 @@ steps:
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --build-arg VLLM_CPU_AMXBF16=true --tag vllm-ci:build-image --target vllm-build --progress plain -f docker/Dockerfile.cpu ."
      - "mkdir artifacts"
      - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
      - "bash .buildkite/scripts/upload-nightly-wheels.sh manylinux_2_35"
    env:
      DOCKER_BUILDKIT: "1"
  # Build release images (CUDA 12.9)
  - label: "Build release image - x86_64 - CUDA 12.9"
    depends_on: ~
    id: build-release-image-x86
    agents:
@@ -99,7 +99,7 @@ steps:
      - "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
- "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT" - "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT" - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
- label: "Build release image (arm64)" - label: "Build release image - aarch64 - CUDA 12.9"
depends_on: ~ depends_on: ~
id: build-release-image-arm64 id: build-release-image-arm64
agents: agents:
@@ -109,34 +109,93 @@ steps:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ." - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)" - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)"
# Add job to create multi-arch manifest - label: "Create multi-arch manifest - CUDA 12.9"
- label: "Create multi-arch manifest"
depends_on: depends_on:
- build-release-image-x86 - build-release-image-x86
- build-release-image-arm64 - build-release-image-arm64
id: create-multi-arch-manifest id: create-multi-arch-manifest
agents: agents:
queue: cpu_queue_postmerge queue: small_cpu_queue_postmerge
commands: commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7" - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "docker manifest create public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-x86_64 public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-aarch64 --amend" - "docker manifest create public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-x86_64 public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-aarch64 --amend"
- "docker manifest push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT" - "docker manifest push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
- label: "Annotate release workflow" - label: "Annotate release workflow - CUDA 12.9"
depends_on: depends_on:
- create-multi-arch-manifest - create-multi-arch-manifest
id: annotate-release-workflow id: annotate-release-workflow
agents: agents:
queue: cpu_queue_postmerge queue: small_cpu_queue_postmerge
commands: commands:
- "bash .buildkite/scripts/annotate-release.sh" - "bash .buildkite/scripts/annotate-release.sh"
- block: "Build CUDA 13.0 release images"
key: block-release-image-build-cuda-13-0
depends_on: ~
- label: "Build release image - x86_64 - CUDA 13.0"
depends_on: block-release-image-build-cuda-13-0
id: build-release-image-x86-cuda-13-0
agents:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg INSTALL_KV_CONNECTORS=true --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cu130 --target vllm-openai --progress plain -f docker/Dockerfile ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cu130"
# re-tag to default image tag and push, just in case arm64 build fails
- "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cu130 public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cu130"
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cu130"
- label: "Build release image - aarch64 - CUDA 13.0"
depends_on: block-release-image-build-cuda-13-0
id: build-release-image-arm64-cuda-13-0
agents:
queue: arm64_cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
# compute capability 12.0 for RTX-50 series / RTX PRO 6000 Blackwell, 12.1 for DGX Spark
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0 12.1' --build-arg INSTALL_KV_CONNECTORS=true --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cu130 --target vllm-openai --progress plain -f docker/Dockerfile ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cu130"
- label: "Create multi-arch manifest - CUDA 13.0"
depends_on:
- build-release-image-x86-cuda-13-0
- build-release-image-arm64-cuda-13-0
id: create-multi-arch-manifest-cuda-13-0
agents:
queue: small_cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "docker manifest create public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cu130 public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-x86_64-cu130 public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-aarch64-cu130 --amend"
- "docker manifest push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cu130"
- input: "Provide Release version here" - input: "Provide Release version here"
id: input-release-version id: input-release-version
fields: fields:
- text: "What is the release version?" - text: "What is the release version?"
key: release-version key: release-version
- block: "Confirm update release wheels to PyPI (experimental, use with caution)?"
key: block-upload-release-wheels
depends_on:
- input-release-version
- build-wheel-x86-cuda-12-9
- build-wheel-x86-cuda-13-0
- build-wheel-x86-cpu
- build-wheel-arm64-cuda-12-9
- build-wheel-arm64-cuda-13-0
- build-wheel-arm64-cpu
- label: "Upload release wheels to PyPI and GitHub"
depends_on:
- block-upload-release-wheels
id: upload-release-wheels
agents:
queue: small_cpu_queue_postmerge
commands:
- "bash .buildkite/scripts/upload-release-wheels.sh"
- block: "Build CPU release image" - block: "Build CPU release image"
key: block-cpu-release-image-build key: block-cpu-release-image-build
depends_on: ~ depends_on: ~
@@ -169,24 +228,31 @@ steps:
env:
DOCKER_BUILDKIT: "1"
- block: "Build ROCm release image"
key: block-rocm-release-image-build
depends_on: ~
- label: "Build release image (ROCm)"
depends_on: block-rocm-release-image-build
id: build-release-image-rocm
agents:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
# Build base image first
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --tag rocm/vllm-dev:base-$BUILDKITE_COMMIT --target final --progress plain -f docker/Dockerfile.rocm_base ."
# Build vLLM ROCm image using the base
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg BASE_IMAGE=rocm/vllm-dev:base-$BUILDKITE_COMMIT --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-rocm --target vllm-openai --progress plain -f docker/Dockerfile.rocm ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-rocm"
- label: "Build and publish nightly multi-arch image to DockerHub" - label: "Build and publish nightly multi-arch image to DockerHub"
depends_on: depends_on:
- create-multi-arch-manifest - create-multi-arch-manifest
if: build.env("NIGHTLY") == "1" if: build.env("NIGHTLY") == "1"
agents: agents:
queue: cpu_queue_postmerge queue: small_cpu_queue_postmerge
commands: commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7" - "bash .buildkite/scripts/push-nightly-builds.sh"
- "docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-x86_64"
- "docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-aarch64"
- "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-x86_64 vllm/vllm-openai:nightly-x86_64"
- "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-aarch64 vllm/vllm-openai:nightly-aarch64"
- "docker push vllm/vllm-openai:nightly-x86_64"
- "docker push vllm/vllm-openai:nightly-aarch64"
- "docker manifest create vllm/vllm-openai:nightly vllm/vllm-openai:nightly-x86_64 vllm/vllm-openai:nightly-aarch64 --amend"
- "docker manifest create vllm/vllm-openai:nightly-$BUILDKITE_COMMIT vllm/vllm-openai:nightly-x86_64 vllm/vllm-openai:nightly-aarch64 --amend"
- "docker manifest push vllm/vllm-openai:nightly"
- "docker manifest push vllm/vllm-openai:nightly-$BUILDKITE_COMMIT"
# Clean up old nightly builds (keep only last 14) # Clean up old nightly builds (keep only last 14)
- "bash .buildkite/scripts/cleanup-nightly-builds.sh" - "bash .buildkite/scripts/cleanup-nightly-builds.sh"
plugins: plugins:
@@ -196,3 +262,384 @@ steps:
env:
DOCKER_BUILDKIT: "1"
DOCKERHUB_USERNAME: "vllmbot"
- label: "Build and publish nightly multi-arch image to DockerHub - CUDA 13.0"
depends_on:
- create-multi-arch-manifest-cuda-13-0
if: build.env("NIGHTLY") == "1"
agents:
queue: small_cpu_queue_postmerge
commands:
- "bash .buildkite/scripts/push-nightly-builds.sh cu130"
# Clean up old nightly builds (keep only last 14)
- "bash .buildkite/scripts/cleanup-nightly-builds.sh cu130-nightly-"
plugins:
- docker-login#v3.0.0:
username: vllmbot
password-env: DOCKERHUB_TOKEN
env:
DOCKER_BUILDKIT: "1"
DOCKERHUB_USERNAME: "vllmbot"
# =============================================================================
# ROCm Release Pipeline (x86_64 only)
# =============================================================================
#
# vLLM version is determined by the Buildkite checkout (like CUDA pipeline).
# To build a specific version, trigger the build from that branch/tag.
#
# Environment variables for ROCm builds (set via Buildkite UI or schedule):
# ROCM_PYTHON_VERSION: Python version (default: 3.12)
# PYTORCH_ROCM_ARCH: GPU architectures (default: gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151)
# ROCM_UPLOAD_WHEELS: Upload to S3 (default: false for nightly, true for releases)
# ROCM_FORCE_REBUILD: Force rebuild base wheels, ignore S3 cache (default: false)
#
# Note: ROCm version is determined by BASE_IMAGE in docker/Dockerfile.rocm_base
# (currently rocm/dev-ubuntu-22.04:7.1-complete)
#
# =============================================================================
# ROCm Input Step - Collect build configuration (manual trigger only)
- input: "ROCm Wheel Release Build Configuration"
key: input-rocm-config
depends_on: ~
if: build.source == "ui"
fields:
- text: "Python Version"
key: "rocm-python-version"
default: "3.12"
hint: "Python version (e.g., 3.12)"
- text: "GPU Architectures"
key: "rocm-pytorch-rocm-arch"
default: "gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151"
hint: "Semicolon-separated GPU architectures"
- select: "Upload Wheels to S3"
key: "rocm-upload-wheels"
default: "true"
options:
- label: "No - Build only (nightly/dev)"
value: "false"
- label: "Yes - Upload to S3 (release)"
value: "true"
- select: "Force Rebuild Base Wheels"
key: "rocm-force-rebuild"
default: "false"
hint: "Ignore S3 cache and rebuild base wheels from scratch"
options:
- label: "No - Use cached wheels if available"
value: "false"
- label: "Yes - Rebuild even if cache exists"
value: "true"
# ROCm Job 1: Build ROCm Base Wheels (with S3 caching)
- label: ":rocm: Build ROCm Base Wheels"
id: build-rocm-base-wheels
depends_on:
- step: input-rocm-config
allow_failure: true # Allow failure so non-UI builds can proceed (input step is skipped)
agents:
queue: cpu_queue_postmerge
commands:
# Set configuration and check cache
- |
set -euo pipefail
# Get values from meta-data (set by input step) or use defaults
PYTHON_VERSION="$$(buildkite-agent meta-data get rocm-python-version 2>/dev/null || echo '')"
export PYTHON_VERSION="$${PYTHON_VERSION:-3.12}"
PYTORCH_ROCM_ARCH="$$(buildkite-agent meta-data get rocm-pytorch-rocm-arch 2>/dev/null || echo '')"
export PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH:-gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151}"
# Check for force rebuild flag
ROCM_FORCE_REBUILD="$${ROCM_FORCE_REBUILD:-}"
if [ -z "$${ROCM_FORCE_REBUILD}" ]; then
ROCM_FORCE_REBUILD="$$(buildkite-agent meta-data get rocm-force-rebuild 2>/dev/null || echo '')"
fi
echo "========================================"
echo "ROCm Base Wheels Build Configuration"
echo "========================================"
echo " PYTHON_VERSION: $${PYTHON_VERSION}"
echo " PYTORCH_ROCM_ARCH: $${PYTORCH_ROCM_ARCH}"
echo " ROCM_FORCE_REBUILD: $${ROCM_FORCE_REBUILD:-false}"
echo "========================================"
# Save resolved config for later jobs
buildkite-agent meta-data set "rocm-python-version" "$${PYTHON_VERSION}"
buildkite-agent meta-data set "rocm-pytorch-rocm-arch" "$${PYTORCH_ROCM_ARCH}"
# Check S3 cache for pre-built wheels
CACHE_KEY=$$(.buildkite/scripts/cache-rocm-base-wheels.sh key)
CACHE_PATH=$$(.buildkite/scripts/cache-rocm-base-wheels.sh path)
echo ""
echo "Cache key: $${CACHE_KEY}"
echo "Cache path: $${CACHE_PATH}"
# Save cache key for downstream jobs
buildkite-agent meta-data set "rocm-cache-key" "$${CACHE_KEY}"
CACHE_STATUS="miss"
if [ "$${ROCM_FORCE_REBUILD}" != "true" ]; then
CACHE_STATUS=$$(.buildkite/scripts/cache-rocm-base-wheels.sh check)
else
echo "Force rebuild requested, skipping cache check"
fi
if [ "$${CACHE_STATUS}" = "hit" ]; then
echo ""
echo "CACHE HIT! Downloading pre-built wheels..."
echo ""
.buildkite/scripts/cache-rocm-base-wheels.sh download
# Set the S3 path for the cached Docker image (for Job 2 to download)
S3_ARTIFACT_PATH="s3://$${S3_BUCKET}/rocm/cache/$${CACHE_KEY}"
buildkite-agent meta-data set "rocm-docker-image-s3-path" "$${S3_ARTIFACT_PATH}/rocm-base-image.tar.gz"
# Mark that we used cache (for Docker image handling)
buildkite-agent meta-data set "rocm-used-cache" "true"
echo ""
echo "Cache download complete. Skipping Docker build."
echo "Docker image will be downloaded from: $${S3_ARTIFACT_PATH}/rocm-base-image.tar.gz"
else
echo ""
echo "CACHE MISS. Building from scratch..."
echo ""
# Build full base image (for later vLLM build)
DOCKER_BUILDKIT=1 docker buildx build \
--file docker/Dockerfile.rocm_base \
--tag rocm/vllm-dev:base-$${BUILDKITE_BUILD_NUMBER} \
--build-arg PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH}" \
--build-arg PYTHON_VERSION="$${PYTHON_VERSION}" \
--build-arg USE_SCCACHE=1 \
--build-arg SCCACHE_BUCKET_NAME=vllm-build-sccache \
--build-arg SCCACHE_REGION_NAME=us-west-2 \
--build-arg SCCACHE_S3_NO_CREDENTIALS=0 \
--load \
.
# Build debs_wheel_release stage for wheel extraction
DOCKER_BUILDKIT=1 docker buildx build \
--file docker/Dockerfile.rocm_base \
--tag rocm-base-debs:$${BUILDKITE_BUILD_NUMBER} \
--target debs_wheel_release \
--build-arg PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH}" \
--build-arg PYTHON_VERSION="$${PYTHON_VERSION}" \
--build-arg USE_SCCACHE=1 \
--build-arg SCCACHE_BUCKET_NAME=vllm-build-sccache \
--build-arg SCCACHE_REGION_NAME=us-west-2 \
--build-arg SCCACHE_S3_NO_CREDENTIALS=0 \
--load \
.
# Extract wheels from Docker image
mkdir -p artifacts/rocm-base-wheels
container_id=$$(docker create rocm-base-debs:$${BUILDKITE_BUILD_NUMBER})
docker cp $${container_id}:/app/debs/. artifacts/rocm-base-wheels/
docker rm $${container_id}
echo "Extracted base wheels:"
ls -lh artifacts/rocm-base-wheels/
# Upload wheels to S3 cache for future builds
echo ""
echo "Uploading wheels to S3 cache..."
.buildkite/scripts/cache-rocm-base-wheels.sh upload
# Export base Docker image for reuse in vLLM build
mkdir -p artifacts/rocm-docker-image
docker save rocm/vllm-dev:base-$${BUILDKITE_BUILD_NUMBER} | gzip > artifacts/rocm-docker-image/rocm-base-image.tar.gz
echo "Docker image size:"
ls -lh artifacts/rocm-docker-image/
# Upload large Docker image to S3 (also cached by cache key)
S3_ARTIFACT_PATH="s3://$${S3_BUCKET}/rocm/cache/$${CACHE_KEY}"
echo "Uploading Docker image to $${S3_ARTIFACT_PATH}/"
aws s3 cp artifacts/rocm-docker-image/rocm-base-image.tar.gz "$${S3_ARTIFACT_PATH}/rocm-base-image.tar.gz"
# Save the S3 path for downstream jobs
buildkite-agent meta-data set "rocm-docker-image-s3-path" "$${S3_ARTIFACT_PATH}/rocm-base-image.tar.gz"
# Mark that we did NOT use cache
buildkite-agent meta-data set "rocm-used-cache" "false"
echo ""
echo "Build complete. Wheels cached for future builds."
fi
artifact_paths:
- "artifacts/rocm-base-wheels/*.whl"
env:
DOCKER_BUILDKIT: "1"
S3_BUCKET: "vllm-wheels"
# ROCm Job 2: Build vLLM ROCm Wheel
- label: ":python: Build vLLM ROCm Wheel"
id: build-rocm-vllm-wheel
depends_on:
- step: build-rocm-base-wheels
allow_failure: false
agents:
queue: cpu_queue_postmerge
timeout_in_minutes: 180
commands:
# Download artifacts and prepare Docker image
- |
set -euo pipefail
# Ensure git tags are up-to-date (Buildkite's default fetch doesn't update tags)
# This fixes version detection when tags are moved/force-pushed
echo "Fetching latest tags from origin..."
git fetch --tags --force origin
# Log tag information for debugging version detection
echo "========================================"
echo "Git Tag Verification"
echo "========================================"
echo "Current HEAD: $(git rev-parse HEAD)"
echo "git describe --tags: $(git describe --tags 2>/dev/null || echo 'No tags found')"
echo ""
echo "Recent tags (pointing to commits near HEAD):"
git tag -l --sort=-creatordate | head -5
echo "setuptools_scm version detection:"
pip install -q setuptools_scm 2>/dev/null || true
python3 -c "import setuptools_scm; print(' Detected version:', setuptools_scm.get_version())" 2>/dev/null || echo " (setuptools_scm not available in this environment)"
echo "========================================"
# Download wheel artifacts from current build
echo "Downloading wheel artifacts from current build"
buildkite-agent artifact download "artifacts/rocm-base-wheels/*.whl" .
# Download Docker image from S3 (too large for Buildkite artifacts)
DOCKER_IMAGE_S3_PATH="$$(buildkite-agent meta-data get rocm-docker-image-s3-path 2>/dev/null || echo '')"
if [ -z "$${DOCKER_IMAGE_S3_PATH}" ]; then
echo "ERROR: rocm-docker-image-s3-path metadata not found"
echo "This should have been set by the build-rocm-base-wheels job"
exit 1
fi
echo "Downloading Docker image from $${DOCKER_IMAGE_S3_PATH}"
mkdir -p artifacts/rocm-docker-image
aws s3 cp "$${DOCKER_IMAGE_S3_PATH}" artifacts/rocm-docker-image/rocm-base-image.tar.gz
# Load base Docker image and capture the tag
echo "Loading base Docker image..."
LOAD_OUTPUT=$$(gunzip -c artifacts/rocm-docker-image/rocm-base-image.tar.gz | docker load)
echo "$${LOAD_OUTPUT}"
# Extract the actual loaded image tag from "Loaded image: <tag>" output
# This avoids picking up stale images (like rocm/vllm-dev:nightly) already on the agent
BASE_IMAGE_TAG=$$(echo "$${LOAD_OUTPUT}" | grep "Loaded image:" | sed 's/Loaded image: //')
if [ -z "$${BASE_IMAGE_TAG}" ]; then
echo "ERROR: Failed to extract image tag from docker load output"
echo "Load output was: $${LOAD_OUTPUT}"
exit 1
fi
echo "Loaded base image: $${BASE_IMAGE_TAG}"
# Prepare base wheels for Docker build context
mkdir -p docker/context/base-wheels
touch docker/context/base-wheels/.keep
cp artifacts/rocm-base-wheels/*.whl docker/context/base-wheels/
echo "Base wheels for vLLM build:"
ls -lh docker/context/base-wheels/
# Get GPU architectures from meta-data
PYTORCH_ROCM_ARCH="$$(buildkite-agent meta-data get rocm-pytorch-rocm-arch 2>/dev/null || echo '')"
PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH:-gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151}"
echo "========================================"
echo "Building vLLM wheel with:"
echo " BUILDKITE_COMMIT: $${BUILDKITE_COMMIT}"
echo " BUILDKITE_BRANCH: $${BUILDKITE_BRANCH}"
echo " PYTORCH_ROCM_ARCH: $${PYTORCH_ROCM_ARCH}"
echo " BASE_IMAGE: $${BASE_IMAGE_TAG}"
echo "========================================"
# Build vLLM wheel using local checkout (REMOTE_VLLM=0)
DOCKER_BUILDKIT=1 docker build \
--file docker/Dockerfile.rocm \
--target export_vllm_wheel_release \
--output type=local,dest=rocm-dist \
--build-arg BASE_IMAGE="$${BASE_IMAGE_TAG}" \
--build-arg ARG_PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH}" \
--build-arg REMOTE_VLLM=0 \
--build-arg GIT_REPO_CHECK=1 \
--build-arg USE_SCCACHE=1 \
--build-arg SCCACHE_BUCKET_NAME=vllm-build-sccache \
--build-arg SCCACHE_REGION_NAME=us-west-2 \
--build-arg SCCACHE_S3_NO_CREDENTIALS=0 \
.
echo "Built vLLM wheel:"
ls -lh rocm-dist/*.whl
# Copy wheel to artifacts directory
mkdir -p artifacts/rocm-vllm-wheel
cp rocm-dist/*.whl artifacts/rocm-vllm-wheel/
echo "Final vLLM wheel:"
ls -lh artifacts/rocm-vllm-wheel/
artifact_paths:
- "artifacts/rocm-vllm-wheel/*.whl"
env:
DOCKER_BUILDKIT: "1"
S3_BUCKET: "vllm-wheels"
# ROCm Job 3: Upload Wheels to S3
- label: ":s3: Upload ROCm Wheels to S3"
id: upload-rocm-wheels
depends_on:
- step: build-rocm-vllm-wheel
allow_failure: false
agents:
queue: cpu_queue_postmerge
timeout_in_minutes: 60
commands:
# Download all wheel artifacts and run upload
- |
set -euo pipefail
# Check if upload is enabled (from env var or Buildkite meta-data)
ROCM_UPLOAD_WHEELS="$${ROCM_UPLOAD_WHEELS:-}"
if [ -z "$${ROCM_UPLOAD_WHEELS}" ]; then
# Try to get from meta-data (input form)
ROCM_UPLOAD_WHEELS="$$(buildkite-agent meta-data get rocm-upload-wheels 2>/dev/null || echo '')"
fi
echo "========================================"
echo "Upload check:"
echo " ROCM_UPLOAD_WHEELS: $${ROCM_UPLOAD_WHEELS}"
echo " BUILDKITE_BRANCH: $${BUILDKITE_BRANCH}"
echo "========================================"
# Skip upload if not enabled
if [ "$${ROCM_UPLOAD_WHEELS}" != "true" ]; then
echo "Skipping S3 upload (ROCM_UPLOAD_WHEELS != true, NIGHTLY != 1, not a release branch)"
echo "To enable upload, set 'Upload Wheels to S3' to 'Yes' in the build configuration"
exit 0
fi
echo "Upload enabled, proceeding..."
# Download artifacts from current build
echo "Downloading artifacts from current build"
buildkite-agent artifact download "artifacts/rocm-base-wheels/*.whl" .
buildkite-agent artifact download "artifacts/rocm-vllm-wheel/*.whl" .
# Run upload script
bash .buildkite/scripts/upload-rocm-wheels.sh
env:
DOCKER_BUILDKIT: "1"
S3_BUCKET: "vllm-wheels"
# ROCm Job 4: Annotate ROCm Wheel Release
- label: ":memo: Annotate ROCm wheel release"
id: annotate-rocm-release
depends_on:
- step: upload-rocm-wheels
allow_failure: true
agents:
queue: cpu_queue_postmerge
commands:
- "bash .buildkite/scripts/annotate-rocm-release.sh"
env:
S3_BUCKET: "vllm-wheels"

View File

@@ -32,6 +32,7 @@ To download and upload the image:
\`\`\` \`\`\`
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64 docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64 docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64 vllm/vllm-openai:x86_64 docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64 vllm/vllm-openai:x86_64
docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:latest-x86_64 docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:latest-x86_64
@@ -45,6 +46,12 @@ docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
docker push vllm/vllm-openai:latest-aarch64
docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai:rocm
docker tag vllm/vllm-openai:rocm vllm/vllm-openai:latest-rocm
docker tag vllm/vllm-openai:rocm vllm/vllm-openai:v${RELEASE_VERSION}-rocm
docker push vllm/vllm-openai:latest-rocm
docker push vllm/vllm-openai:v${RELEASE_VERSION}-rocm
docker manifest rm vllm/vllm-openai:latest
docker manifest create vllm/vllm-openai:latest vllm/vllm-openai:latest-x86_64 vllm/vllm-openai:latest-aarch64
docker manifest create vllm/vllm-openai:v${RELEASE_VERSION} vllm/vllm-openai:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64

View File

@@ -0,0 +1,74 @@
#!/bin/bash
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
#
# Generate Buildkite annotation for ROCm wheel release
set -ex
# Get build configuration from meta-data
# Extract ROCm version dynamically from Dockerfile.rocm_base
# BASE_IMAGE format: rocm/dev-ubuntu-22.04:7.1-complete -> extracts "7.1"
ROCM_VERSION=$(grep -E '^ARG BASE_IMAGE=' docker/Dockerfile.rocm_base | sed -E 's/.*:([0-9]+\.[0-9]+).*/\1/' || echo "unknown")
PYTHON_VERSION=$(buildkite-agent meta-data get rocm-python-version 2>/dev/null || echo "3.12")
PYTORCH_ROCM_ARCH=$(buildkite-agent meta-data get rocm-pytorch-rocm-arch 2>/dev/null || echo "gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151")
# S3 URLs
S3_BUCKET="${S3_BUCKET:-vllm-wheels}"
S3_REGION="${AWS_DEFAULT_REGION:-us-west-2}"
S3_URL="https://${S3_BUCKET}.s3.${S3_REGION}.amazonaws.com"
ROCM_PATH="rocm/${BUILDKITE_COMMIT}"
buildkite-agent annotate --style 'success' --context 'rocm-release-workflow' << EOF
## :rocm: ROCm Wheel Release
### Build Configuration
| Setting | Value |
|---------|-------|
| **ROCm Version** | ${ROCM_VERSION} |
| **Python Version** | ${PYTHON_VERSION} |
| **GPU Architectures** | ${PYTORCH_ROCM_ARCH} |
| **Branch** | \`${BUILDKITE_BRANCH}\` |
| **Commit** | \`${BUILDKITE_COMMIT}\` |
### :package: Installation
**Install from this build (by commit):**
\`\`\`bash
uv pip install vllm --extra-index-url ${S3_URL}/${ROCM_PATH}/{rocm_variant}/
# Example:
uv pip install vllm --extra-index-url ${S3_URL}/${ROCM_PATH}/rocm700/
\`\`\`
**Install from nightly (if published):**
\`\`\`bash
uv pip install vllm --extra-index-url ${S3_URL}/rocm/nightly/
\`\`\`
### :floppy_disk: Download Wheels Directly
\`\`\`bash
# List all ROCm wheels
aws s3 ls s3://${S3_BUCKET}/${ROCM_PATH}/
# Download specific wheels
aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/vllm-*.whl .
aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/torch-*.whl .
aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/triton_rocm-*.whl .
aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/torchvision-*.whl .
aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/amdsmi-*.whl .
\`\`\`
### :gear: Included Packages
- **vllm**: vLLM with ROCm support
- **torch**: PyTorch built for ROCm ${ROCM_VERSION}
- **triton_rocm**: Triton built for ROCm
- **torchvision**: TorchVision for ROCm PyTorch
- **amdsmi**: AMD SMI Python bindings
### :warning: Notes
- These wheels are built for **ROCm ${ROCM_VERSION}** and will NOT work with CUDA GPUs
- Supported GPU architectures: ${PYTORCH_ROCM_ARCH}
- Platform: Linux x86_64 only
EOF
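The version extraction at the top of this script can be checked in isolation; the sample line below is the default BASE_IMAGE noted in the pipeline comments:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Same sed expression as annotate-rocm-release.sh: pull "7.1" out of the tag.
line='ARG BASE_IMAGE=rocm/dev-ubuntu-22.04:7.1-complete'
version=$(echo "$line" | sed -E 's/.*:([0-9]+\.[0-9]+).*/\1/')
echo "$version"   # -> 7.1
```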

View File

@@ -0,0 +1,140 @@
#!/usr/bin/env bash
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
#
# Cache helper for ROCm base wheels
#
# This script manages caching of pre-built ROCm base wheels (torch, triton, etc.)
# to avoid rebuilding them when Dockerfile.rocm_base hasn't changed.
#
# Usage:
# cache-rocm-base-wheels.sh check - Check if cache exists, outputs "hit" or "miss"
# cache-rocm-base-wheels.sh upload - Upload wheels to cache
# cache-rocm-base-wheels.sh download - Download wheels from cache
# cache-rocm-base-wheels.sh key - Output the cache key
#
# Environment variables:
# S3_BUCKET - S3 bucket name (default: vllm-wheels)
# PYTHON_VERSION - Python version (affects cache key)
# PYTORCH_ROCM_ARCH - GPU architectures (affects cache key)
#
# Note: ROCm version is determined by BASE_IMAGE in Dockerfile.rocm_base,
# so changes to ROCm version are captured by the Dockerfile hash.
set -euo pipefail
BUCKET="${S3_BUCKET:-vllm-wheels}"
DOCKERFILE="docker/Dockerfile.rocm_base"
CACHE_PREFIX="rocm/cache"
# Generate hash from Dockerfile content + build args
generate_cache_key() {
# Include Dockerfile content
if [[ ! -f "$DOCKERFILE" ]]; then
echo "ERROR: Dockerfile not found: $DOCKERFILE" >&2
exit 1
fi
local dockerfile_hash=$(sha256sum "$DOCKERFILE" | cut -c1-16)
# Include key build args that affect the output
# These should match the ARGs in Dockerfile.rocm_base that change the build output
# Note: ROCm version is determined by BASE_IMAGE in the Dockerfile, so it's captured by dockerfile_hash
local args_string="${PYTHON_VERSION:-}|${PYTORCH_ROCM_ARCH:-}"
local args_hash=$(echo "$args_string" | sha256sum | cut -c1-8)
echo "${dockerfile_hash}-${args_hash}"
}
CACHE_KEY=$(generate_cache_key)
CACHE_PATH="s3://${BUCKET}/${CACHE_PREFIX}/${CACHE_KEY}/"
case "${1:-}" in
check)
echo "Checking cache for key: ${CACHE_KEY}" >&2
echo "Cache path: ${CACHE_PATH}" >&2
echo "Variables used in cache key:" >&2
echo " PYTHON_VERSION: ${PYTHON_VERSION:-<not set>}" >&2
echo " PYTORCH_ROCM_ARCH: ${PYTORCH_ROCM_ARCH:-<not set>}" >&2
# Check if cache exists by listing objects
# We look for at least one .whl file
echo "Running: aws s3 ls ${CACHE_PATH}" >&2
S3_OUTPUT=$(aws s3 ls "${CACHE_PATH}" 2>&1) || true
echo "S3 ls output:" >&2
echo "$S3_OUTPUT" | head -5 >&2
if echo "$S3_OUTPUT" | grep -q "\.whl"; then
echo "hit"
else
echo "miss"
fi
;;
upload)
echo "========================================"
echo "Uploading wheels to cache"
echo "========================================"
echo "Cache key: ${CACHE_KEY}"
echo "Cache path: ${CACHE_PATH}"
echo ""
if [[ ! -d "artifacts/rocm-base-wheels" ]]; then
echo "ERROR: artifacts/rocm-base-wheels directory not found" >&2
exit 1
fi
WHEEL_COUNT=$(ls artifacts/rocm-base-wheels/*.whl 2>/dev/null | wc -l)
if [[ "$WHEEL_COUNT" -eq 0 ]]; then
echo "ERROR: No wheels found in artifacts/rocm-base-wheels/" >&2
exit 1
fi
echo "Uploading $WHEEL_COUNT wheels..."
aws s3 cp --recursive artifacts/rocm-base-wheels/ "${CACHE_PATH}"
echo ""
echo "Cache upload complete!"
echo "========================================"
;;
download)
echo "========================================"
echo "Downloading wheels from cache"
echo "========================================"
echo "Cache key: ${CACHE_KEY}"
echo "Cache path: ${CACHE_PATH}"
echo ""
mkdir -p artifacts/rocm-base-wheels
aws s3 cp --recursive "${CACHE_PATH}" artifacts/rocm-base-wheels/
echo ""
echo "Downloaded wheels:"
ls -lh artifacts/rocm-base-wheels/
WHEEL_COUNT=$(ls artifacts/rocm-base-wheels/*.whl 2>/dev/null | wc -l)
echo ""
echo "Total: $WHEEL_COUNT wheels"
echo "========================================"
;;
key)
echo "${CACHE_KEY}"
;;
path)
echo "${CACHE_PATH}"
;;
*)
echo "Usage: $0 {check|upload|download|key|path}" >&2
echo "" >&2
echo "Commands:" >&2
echo " check - Check if cache exists, outputs 'hit' or 'miss'" >&2
echo " upload - Upload wheels from artifacts/rocm-base-wheels/ to cache" >&2
echo " download - Download wheels from cache to artifacts/rocm-base-wheels/" >&2
echo " key - Output the cache key" >&2
echo " path - Output the full S3 cache path" >&2
exit 1
;;
esac
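The two-part key computed by `generate_cache_key` can be sketched in Python (a minimal model of the shell pipeline above; `hashlib` stands in for `sha256sum`, and the function name is illustrative):

```python
import hashlib

def cache_key(dockerfile_text: str, python_version: str, rocm_arch: str) -> str:
    """Model of generate_cache_key(): 16 hex chars of the Dockerfile hash
    joined with 8 hex chars of the build-arg hash."""
    dockerfile_hash = hashlib.sha256(dockerfile_text.encode()).hexdigest()[:16]
    # `echo "$args_string" | sha256sum` hashes the line *with* its trailing newline
    args_hash = hashlib.sha256(f"{python_version}|{rocm_arch}\n".encode()).hexdigest()[:8]
    return f"{dockerfile_hash}-{args_hash}"

key = cache_key("FROM rocm/dev-ubuntu-22.04", "3.12", "gfx90a;gfx942")
print(key)  # 25 characters: <16 hex>-<8 hex>
```

Any change to the Dockerfile or to either build arg yields a new key, so a stale cache is simply never hit rather than invalidated in place.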

View File

@@ -3,7 +3,14 @@
 set -ex
 # Clean up old nightly builds from DockerHub, keeping only the last 14 builds
-# This script uses DockerHub API to list and delete old tags with "nightly-" prefix
+# This script uses DockerHub API to list and delete old tags with specified prefix
+# Usage: cleanup-nightly-builds.sh [TAG_PREFIX]
+# Example: cleanup-nightly-builds.sh "nightly-" or cleanup-nightly-builds.sh "cu130-nightly-"
+
+# Get tag prefix from argument, default to "nightly-" if not provided
+TAG_PREFIX="${1:-nightly-}"
+echo "Cleaning up tags with prefix: $TAG_PREFIX"
 # DockerHub API endpoint for vllm/vllm-openai repository
 REPO_API_URL="https://hub.docker.com/v2/repositories/vllm/vllm-openai/tags"
@@ -45,7 +52,7 @@ get_all_tags() {
         set -x
         # Get both last_updated timestamp and tag name, separated by |
-        local tags=$(echo "$response" | jq -r '.results[] | select(.name | startswith("nightly-")) | "\(.last_updated)|\(.name)"')
+        local tags=$(echo "$response" | jq -r --arg prefix "$TAG_PREFIX" '.results[] | select(.name | startswith($prefix)) | "\(.last_updated)|\(.name)"')
         if [ -z "$tags" ]; then
             break

View File

@@ -16,6 +16,18 @@ from urllib.parse import quote
 import regex as re
+
+
+def normalize_package_name(name: str) -> str:
+    """
+    Normalize package name according to PEP 503.
+    https://peps.python.org/pep-0503/#normalized-names
+    Replace runs of underscores, hyphens, and periods with a single hyphen,
+    and lowercase the result.
+    """
+    return re.sub(r"[-_.]+", "-", name).lower()
+
+
 if not sys.version_info >= (3, 12):
     raise RuntimeError("This script requires Python 3.12 or higher.")
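The added helper implements the PEP 503 normalization rule; a quick standalone check (using stdlib `re` rather than the `regex` package the script imports):

```python
import re

def normalize_package_name(name: str) -> str:
    # PEP 503: collapse runs of "-", "_", "." to one hyphen, then lowercase
    return re.sub(r"[-_.]+", "-", name).lower()

print(normalize_package_name("vllm_flash_attn"))  # vllm-flash-attn
print(normalize_package_name("ruamel.yaml"))      # ruamel-yaml
print(normalize_package_name("Pillow"))           # pillow
```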
@@ -78,7 +90,13 @@ def parse_from_filename(file: str) -> WheelFileInfo:
             version = version.removesuffix("." + variant)
     else:
         if "+" in version:
-            version, variant = version.split("+")
+            version_part, suffix = version.split("+", 1)
+            # Only treat known patterns as variants (rocmXXX, cuXXX, cpu)
+            # Git hashes and other suffixes are NOT variants
+            if suffix.startswith(("rocm", "cu", "cpu")):
+                variant = suffix
+                version = version_part
+            # Otherwise keep the full version string (variant stays None)
     return WheelFileInfo(
         package_name=package_name,
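The effect of the tightened `+`-suffix handling, in isolation (hypothetical helper name; only the rocm/cu/cpu prefixes come from the diff):

```python
def split_variant(version: str):
    # Only rocm*/cu*/cpu local-version suffixes count as wheel variants;
    # anything else (e.g. a git hash) stays part of the version string.
    if "+" in version:
        base, suffix = version.split("+", 1)
        if suffix.startswith(("rocm", "cu", "cpu")):
            return base, suffix
    return version, None

print(split_variant("0.12.0+rocm710"))    # ('0.12.0', 'rocm710')
print(split_variant("0.12.0+g1a2b3c4d"))  # ('0.12.0+g1a2b3c4d', None)
```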
@@ -206,6 +224,26 @@ def generate_index_and_metadata(
         print("No wheel files found, skipping index generation.")
         return
+    # For ROCm builds: inherit variant from vllm wheel
+    # All ROCm wheels should share the same variant as vllm
+    rocm_variant = None
+    for file in parsed_files:
+        if (
+            file.package_name == "vllm"
+            and file.variant
+            and file.variant.startswith("rocm")
+        ):
+            rocm_variant = file.variant
+            print(f"Detected ROCm variant from vllm: {rocm_variant}")
+            break
+
+    # Apply ROCm variant to all wheels without a variant
+    if rocm_variant:
+        for file in parsed_files:
+            if file.variant is None:
+                file.variant = rocm_variant
+                print(f"Inherited variant '{rocm_variant}' for {file.filename}")
+
     # Group by variant
     variant_to_files: dict[str, list[WheelFileInfo]] = {}
     for file in parsed_files:
@@ -256,8 +294,8 @@ def generate_index_and_metadata(
         variant_dir.mkdir(parents=True, exist_ok=True)
-        # gather all package names in this variant
-        packages = set(f.package_name for f in files)
+        # gather all package names in this variant (normalized per PEP 503)
+        packages = set(normalize_package_name(f.package_name) for f in files)
         if variant == "default":
             # these packages should also appear in the "project list"
             # generate after all variants are processed
@@ -269,8 +307,10 @@ def generate_index_and_metadata(
             f.write(project_list_str)
     for package in packages:
-        # filter files belonging to this package only
-        package_files = [f for f in files if f.package_name == package]
+        # filter files belonging to this package only (compare normalized names)
+        package_files = [
+            f for f in files if normalize_package_name(f.package_name) == package
+        ]
         package_dir = variant_dir / package
         package_dir.mkdir(parents=True, exist_ok=True)
         index_str, metadata_str = generate_package_index_and_metadata(
@@ -291,6 +331,7 @@ if __name__ == "__main__":
    """
    Arguments:
    --version <version> : version string for the current build (e.g., commit hash)
+   --wheel-dir <wheel_directory> : directory containing wheel files (default to be same as `version`)
    --current-objects <path_to_json> : path to JSON file containing current S3 objects listing in this version directory
    --output-dir <output_directory> : directory to store generated index files
    --alias-to-default <alias_variant_name> : (optional) alias variant name for the default variant
@@ -318,6 +359,12 @@ if __name__ == "__main__":
         required=True,
         help="Directory to store generated index files",
     )
+    parser.add_argument(
+        "--wheel-dir",
+        type=str,
+        default=None,
+        help="Directory containing wheel files (default to be same as `version`)",
+    )
     parser.add_argument(
         "--alias-to-default",
         type=str,
@@ -334,8 +381,13 @@ if __name__ == "__main__":
     args = parser.parse_args()
     version = args.version
-    if "/" in version or "\\" in version:
-        raise ValueError("Version string must not contain slashes.")
+    # Allow rocm/ prefix, reject other slashes and all backslashes
+    if "\\" in version:
+        raise ValueError("Version string must not contain backslashes.")
+    if "/" in version and not version.startswith("rocm/"):
+        raise ValueError(
+            "Version string must not contain slashes (except for 'rocm/' prefix)."
+        )
     current_objects_path = Path(args.current_objects)
     output_dir = Path(args.output_dir)
     if not output_dir.exists():
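The relaxed check above admits exactly one slash pattern. A standalone sketch of the accepted and rejected shapes (helper name is illustrative):

```python
def version_ok(version: str) -> bool:
    # Backslashes never allowed; slashes only via a leading "rocm/" prefix
    if "\\" in version:
        return False
    if "/" in version and not version.startswith("rocm/"):
        return False
    return True

print(version_ok("0.12.0"))       # True
print(version_ok("rocm/abc123"))  # True
print(version_ok("cuda/abc123"))  # False
```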
@@ -372,7 +424,7 @@ if __name__ == "__main__":
     print(f"Found {len(wheel_files)} wheel files for version {version}: {wheel_files}")
-    # keep only "official" files for a non-nightly version (specifed by cli args)
+    # keep only "official" files for a non-nightly version (specified by cli args)
     PY_VERSION_RE = re.compile(r"^\d+\.\d+\.\d+([a-zA-Z0-9.+-]*)?$")
     if PY_VERSION_RE.match(version):
         # upload-wheels.sh ensures no "dev" is in args.version
@@ -384,9 +436,25 @@ if __name__ == "__main__":
         print("Nightly version detected, keeping all wheel files.")
     # Generate index and metadata, assuming wheels and indices are stored as:
-    # s3://vllm-wheels/{version}/<wheel files>
+    # s3://vllm-wheels/{wheel_dir}/<wheel files>
     # s3://vllm-wheels/<anything>/<index files>
-    wheel_base_dir = Path(output_dir).parent / version
+    #
+    # For ROCm builds, version is "rocm/{commit}" and indices are uploaded to:
+    # - rocm/{commit}/ (same as wheels)
+    # - rocm/nightly/
+    # - rocm/{version}/
+    # All these are under the "rocm/" prefix, so relative paths should be
+    # relative to "rocm/", not the bucket root.
+    if args.wheel_dir:
+        # Explicit wheel-dir provided (e.g., for version-specific indices pointing to commit dir)
+        wheel_dir = args.wheel_dir.strip().rstrip("/")
+    elif version.startswith("rocm/"):
+        # For rocm/commit, wheel_base_dir should be just the commit part
+        # so relative path from rocm/0.12.0/rocm710/vllm/ -> ../../../{commit}/
+        wheel_dir = version.split("/", 1)[1]
+    else:
+        wheel_dir = version
+    wheel_base_dir = Path(output_dir).parent / wheel_dir
     index_base_dir = Path(output_dir)
     generate_index_and_metadata(
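Why `wheel_base_dir` must drop the `rocm/` prefix: index pages reference wheels by relative path, and both the index and the wheels live under `rocm/`. A quick sketch with hypothetical directory names:

```python
import posixpath

# An index generated at rocm/0.12.0/rocm710/vllm/ must point at wheels in
# rocm/{commit}/. Relative to the shared rocm/ root, the hop count works out:
index_dir = "rocm/0.12.0/rocm710/vllm"
wheel_dir = "rocm/abc123"  # version "rocm/abc123" -> wheel_dir "abc123"
print(posixpath.relpath(wheel_dir, index_dir))  # ../../../abc123
```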

View File

@@ -209,12 +209,21 @@ if [[ $commands == *"--shard-id="* ]]; then
     wait "${pid}"
     STATUS+=($?)
   done
+  at_least_one_shard_with_tests=0
   for st in "${STATUS[@]}"; do
-    if [[ ${st} -ne 0 ]]; then
+    if [[ ${st} -ne 0 ]] && [[ ${st} -ne 5 ]]; then
       echo "One of the processes failed with $st"
       exit "${st}"
+    elif [[ ${st} -eq 5 ]]; then
+      echo "Shard exited with status 5 (no tests collected) - treating as success"
+    else # This means st is 0
+      at_least_one_shard_with_tests=1
     fi
   done
+  if [[ ${#STATUS[@]} -gt 0 && ${at_least_one_shard_with_tests} -eq 0 ]]; then
+    echo "All shards reported no tests collected. Failing the build."
+    exit 1
+  fi
 else
   echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES"
   docker run \
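The loop above relies on pytest's exit-code convention (0 = tests passed, 5 = no tests collected). Its decision table, modeled in Python for clarity (a sketch of the shell logic, not part of the script):

```python
def shards_verdict(statuses: list[int]) -> str:
    """Mirror of the shell loop: any real failure aborts; all-empty shards
    fail the build; at least one shard that actually ran tests passes."""
    for st in statuses:
        if st not in (0, 5):
            return f"fail: shard exited {st}"
    if statuses and all(st == 5 for st in statuses):
        return "fail: no shard collected any tests"
    return "pass"

print(shards_verdict([0, 5, 0]))  # pass
print(shards_verdict([5, 5]))     # fail: no shard collected any tests
print(shards_verdict([0, 1]))     # fail: shard exited 1
```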

View File

@@ -84,7 +84,7 @@ function cpu_tests() {
   docker exec cpu-test-"$NUMA_NODE" bash -c "
     set -e
     pytest -x -s -v \
-      tests/lora/test_qwen2vl.py"
+      tests/lora/test_qwenvl.py"
   # online serving: tp+pp
   docker exec cpu-test-"$NUMA_NODE" bash -c '

View File

@@ -61,7 +61,7 @@ echo "Results will be stored in: $RESULTS_DIR"
 echo "--- Installing Python dependencies ---"
 python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \
     && python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \
-    && python3 -m pip install --progress-bar off "lm-eval @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d" \
+    && python3 -m pip install --progress-bar off "lm-eval[api]>=0.4.9.2" \
     && python3 -m pip install --progress-bar off hf-transfer tblib==3.1.0
 echo "--- Python dependencies installed ---"

View File

@@ -61,7 +61,7 @@ echo "Results will be stored in: $RESULTS_DIR"
 echo "--- Installing Python dependencies ---"
 python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \
     && python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \
-    && python3 -m pip install --progress-bar off "lm-eval @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d" \
+    && python3 -m pip install --progress-bar off "lm-eval[api]>=0.4.9.2" \
     && python3 -m pip install --progress-bar off hf-transfer tblib==3.1.0
 echo "--- Python dependencies installed ---"

View File

@@ -0,0 +1,36 @@
#!/bin/bash
set -ex
# Get tag variant from argument, default to empty if not provided, should be something like "cu130".
# Due to limits in cleanup script, we must move variants to use separate tags like "cu130-nightly",
# otherwise they will be cleaned up together with the main "nightly" tags.
TAG_VARIANT="$1"
if [ -n "$TAG_VARIANT" ]; then
ORIG_TAG_SUFFIX="-$TAG_VARIANT"
TAG_NAME="$TAG_VARIANT-nightly"
else
ORIG_TAG_SUFFIX=""
TAG_NAME="nightly"
fi
ORIG_TAG_NAME="$BUILDKITE_COMMIT"
echo "Pushing original tag $ORIG_TAG_NAME$ORIG_TAG_SUFFIX to new nightly tag name: $TAG_NAME"
# pull original arch-dependent images from AWS ECR Public
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:$ORIG_TAG_NAME-x86_64$ORIG_TAG_SUFFIX
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:$ORIG_TAG_NAME-aarch64$ORIG_TAG_SUFFIX
# tag arch-dependent images
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$ORIG_TAG_NAME-x86_64$ORIG_TAG_SUFFIX vllm/vllm-openai:$TAG_NAME-x86_64
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$ORIG_TAG_NAME-aarch64$ORIG_TAG_SUFFIX vllm/vllm-openai:$TAG_NAME-aarch64
# push arch-dependent images to DockerHub
docker push vllm/vllm-openai:$TAG_NAME-x86_64
docker push vllm/vllm-openai:$TAG_NAME-aarch64
# push arch-independent manifest to DockerHub
docker manifest create vllm/vllm-openai:$TAG_NAME vllm/vllm-openai:$TAG_NAME-x86_64 vllm/vllm-openai:$TAG_NAME-aarch64 --amend
docker manifest create vllm/vllm-openai:$TAG_NAME-$BUILDKITE_COMMIT vllm/vllm-openai:$TAG_NAME-x86_64 vllm/vllm-openai:$TAG_NAME-aarch64 --amend
docker manifest push vllm/vllm-openai:$TAG_NAME
docker manifest push vllm/vllm-openai:$TAG_NAME-$BUILDKITE_COMMIT

View File

@@ -2,6 +2,17 @@
 set -euox pipefail
+
+# To detect ROCm
+# Check multiple indicators:
+if [ -e /dev/kfd ] || \
+   [ -d /opt/rocm ] || \
+   command -v rocm-smi &> /dev/null || \
+   [ -n "${ROCM_HOME:-}" ]; then
+  IS_ROCM=1
+else
+  IS_ROCM=0
+fi
 if [[ $# -lt 4 ]]; then
     echo "Usage: .buildkite/scripts/run-multi-node-test.sh WORKING_DIR NUM_NODES NUM_GPUS DOCKER_IMAGE COMMAND1 COMMAND2 ... COMMANDN"
     exit 1
@@ -26,13 +37,18 @@ for command in "${COMMANDS[@]}"; do
     echo "$command"
 done
 start_network() {
     docker network create --subnet=192.168.10.0/24 docker-net
 }
 start_nodes() {
     for node in $(seq 0 $(($NUM_NODES-1))); do
-        GPU_DEVICES='"device='
+        if [ "$IS_ROCM" -eq 1 ]; then
+            GPU_DEVICES='--device /dev/kfd --device /dev/dri -e HIP_VISIBLE_DEVICES='
+        else
+            GPU_DEVICES='--gpus "device='
+        fi
         for node_gpu in $(seq 0 $(($NUM_GPUS - 1))); do
             DEVICE_NUM=$(($node * $NUM_GPUS + $node_gpu))
             GPU_DEVICES+=$(($DEVICE_NUM))
@@ -40,7 +56,9 @@ start_nodes() {
                 GPU_DEVICES+=','
             fi
         done
-        GPU_DEVICES+='"'
+        if [ "$IS_ROCM" -eq 0 ]; then
+            GPU_DEVICES+='"'
+        fi
         # start the container in detached mode
         # things to note:
@@ -49,7 +67,7 @@ start_nodes() {
         # 3. map the huggingface cache directory to the container
         # 3. assign ip addresses to the containers (head node: 192.168.10.10, worker nodes:
         # starting from 192.168.10.11)
-        docker run -d --gpus "$GPU_DEVICES" --shm-size=10.24gb -e HF_TOKEN \
+        docker run -d $GPU_DEVICES --shm-size=10.24gb -e HF_TOKEN \
             -v ~/.cache/huggingface:/root/.cache/huggingface --name "node$node" \
             --network docker-net --ip 192.168.10.$((10 + $node)) --rm "$DOCKER_IMAGE" \
             /bin/bash -c "tail -f /dev/null"

View File

@@ -102,6 +102,7 @@ if [[ "$version" != *"dev"* ]]; then
     echo "Re-generating indices for /$pure_version/"
     rm -rf "$INDICES_OUTPUT_DIR/*"
     mkdir -p "$INDICES_OUTPUT_DIR"
-    $PYTHON .buildkite/scripts/generate-nightly-index.py --version "$pure_version" --current-objects "$obj_json" --output-dir "$INDICES_OUTPUT_DIR" --comment "version $pure_version" $alias_arg
+    # wheel-dir is overridden to be the commit directory, so that the indices point to the correct wheel path
+    $PYTHON .buildkite/scripts/generate-nightly-index.py --version "$pure_version" --wheel-dir "$SUBPATH" --current-objects "$obj_json" --output-dir "$INDICES_OUTPUT_DIR" --comment "version $pure_version" $alias_arg
     aws s3 cp --recursive "$INDICES_OUTPUT_DIR/" "s3://$BUCKET/$pure_version/"
 fi

View File

@@ -0,0 +1,104 @@
#!/usr/bin/env bash
set -e
BUCKET="vllm-wheels"
SUBPATH=$BUILDKITE_COMMIT
S3_COMMIT_PREFIX="s3://$BUCKET/$SUBPATH/"
RELEASE_VERSION=$(buildkite-agent meta-data get release-version)
echo "Release version from Buildkite: $RELEASE_VERSION"
GIT_VERSION=$(git describe --exact-match --tags $BUILDKITE_COMMIT 2>/dev/null || true)
if [ -z "$GIT_VERSION" ]; then
echo "[FATAL] Not on a git tag, cannot create release."
exit 1
else
echo "Git version for commit $BUILDKITE_COMMIT: $GIT_VERSION"
fi
# sanity check for version mismatch
if [ "$RELEASE_VERSION" != "$GIT_VERSION" ]; then
if [ "$FORCE_RELEASE_IGNORE_VERSION_MISMATCH" == "true" ]; then
echo "[WARNING] Force release and ignore version mismatch"
else
echo "[FATAL] Release version from Buildkite does not match Git version."
exit 1
fi
fi
PURE_VERSION=${RELEASE_VERSION#v} # remove leading 'v'
# check pypi token
if [ -z "$PYPI_TOKEN" ]; then
echo "[FATAL] PYPI_TOKEN is not set."
exit 1
else
export TWINE_USERNAME="__token__"
export TWINE_PASSWORD="$PYPI_TOKEN"
fi
# check github token
if [ -z "$GITHUB_TOKEN" ]; then
echo "[FATAL] GITHUB_TOKEN is not set."
exit 1
else
export GH_TOKEN="$GITHUB_TOKEN"
fi
set -x # avoid printing secrets above
# download gh CLI from github
# Get latest gh CLI version from GitHub API
GH_VERSION=$(curl -s https://api.github.com/repos/cli/cli/releases/latest | grep '"tag_name":' | sed -E 's/.*"([^"]+)".*/\1/' | sed 's/^v//')
if [ -z "$GH_VERSION" ]; then
echo "[FATAL] Failed to get latest gh CLI version from GitHub"
exit 1
fi
echo "Downloading gh CLI version: $GH_VERSION"
GH_TARBALL="gh_${GH_VERSION}_linux_amd64.tar.gz"
GH_URL="https://github.com/cli/cli/releases/download/v${GH_VERSION}/${GH_TARBALL}"
GH_INSTALL_DIR="/tmp/gh-install"
mkdir -p "$GH_INSTALL_DIR"
pushd "$GH_INSTALL_DIR"
curl -L -o "$GH_TARBALL" "$GH_URL"
tar -xzf "$GH_TARBALL"
GH_BIN=$(realpath $(find . -name "gh" -type f -executable | head -n 1))
if [ -z "$GH_BIN" ]; then
echo "[FATAL] Failed to find gh CLI executable"
exit 1
fi
echo "gh CLI downloaded successfully, version: $($GH_BIN --version)"
echo "Last 5 releases on GitHub:" # as a sanity check of gh and GH_TOKEN
command "$GH_BIN" release list --limit 5
popd
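The grep/sed pipeline above is just extracting `tag_name` from the GitHub API's JSON response and dropping the leading `v`. The equivalent, shown on a hypothetical payload:

```python
import json

# Illustrative fragment of the /releases/latest response
payload = '{"tag_name": "v2.63.0", "name": "GitHub CLI 2.63.0"}'
version = json.loads(payload)["tag_name"].removeprefix("v")
print(version)  # 2.63.0
```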
# install twine from pypi
python3 -m venv /tmp/vllm-release-env
source /tmp/vllm-release-env/bin/activate
pip install twine
python3 -m twine --version
# copy release wheels to local directory
DIST_DIR=/tmp/vllm-release-dist
echo "Existing wheels on S3:"
aws s3 ls "$S3_COMMIT_PREFIX"
echo "Copying wheels to local directory"
mkdir -p $DIST_DIR
# include only wheels for the release version, ignore all files with "dev" or "rc" in the name (without excluding 'aarch64')
aws s3 cp --recursive --exclude "*" --include "vllm-${PURE_VERSION}*.whl" --exclude "*dev*" --exclude "*rc[0-9]*" "$S3_COMMIT_PREFIX" $DIST_DIR
echo "Wheels copied to local directory"
# generate source tarball
git archive --format=tar.gz --output="$DIST_DIR/vllm-${PURE_VERSION}.tar.gz" $BUILDKITE_COMMIT
ls -la $DIST_DIR
# upload wheels to PyPI (only default variant, i.e. files without '+' in the name)
PYPI_WHEEL_FILES=$(find $DIST_DIR -name "vllm-${PURE_VERSION}*.whl" -not -name "*+*")
if [ -z "$PYPI_WHEEL_FILES" ]; then
echo "No default variant wheels found, quitting..."
exit 1
fi
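The `-not -name "*+*"` filter keeps only default-variant wheels, since variant wheels carry a `+local` version tag in their filename (the filenames below are illustrative):

```python
wheels = [
    "vllm-0.12.0-cp38-abi3-manylinux1_x86_64.whl",
    "vllm-0.12.0+cu130-cp38-abi3-manylinux1_x86_64.whl",
    "vllm-0.12.0+rocm710-cp312-cp312-manylinux_2_35_x86_64.whl",
]
# PyPI rejects local version tags, so only the "+"-free wheel may be uploaded
pypi_wheels = [w for w in wheels if "+" not in w]
print(pypi_wheels)  # ['vllm-0.12.0-cp38-abi3-manylinux1_x86_64.whl']
```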
python3 -m twine check $PYPI_WHEEL_FILES
python3 -m twine upload --non-interactive --verbose $PYPI_WHEEL_FILES
echo "Wheels uploaded to PyPI"
# create release on GitHub with the release version and all wheels
command "$GH_BIN" release create $GIT_VERSION -d --latest --notes-from-tag --verify-tag $DIST_DIR/*.whl

View File

@@ -0,0 +1,151 @@
#!/usr/bin/env bash
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
#
# Upload ROCm wheels to S3 with proper index generation
#
# Required environment variables:
# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY (or IAM role)
# S3_BUCKET (default: vllm-wheels)
#
# S3 path structure:
# s3://vllm-wheels/rocm/{commit}/ - All wheels for this commit
# s3://vllm-wheels/rocm/nightly/ - Index pointing to latest nightly
# s3://vllm-wheels/rocm/{version}/ - Index for release versions
set -ex
# ======== Configuration ========
BUCKET="${S3_BUCKET:-vllm-wheels}"
ROCM_SUBPATH="rocm/${BUILDKITE_COMMIT}"
S3_COMMIT_PREFIX="s3://$BUCKET/$ROCM_SUBPATH/"
INDICES_OUTPUT_DIR="rocm-indices"
PYTHON="${PYTHON_PROG:-python3}"
# ROCm uses manylinux_2_35 (Ubuntu 22.04 based)
MANYLINUX_VERSION="manylinux_2_35"
echo "========================================"
echo "ROCm Wheel Upload Configuration"
echo "========================================"
echo "S3 Bucket: $BUCKET"
echo "S3 Path: $ROCM_SUBPATH"
echo "Commit: $BUILDKITE_COMMIT"
echo "Branch: $BUILDKITE_BRANCH"
echo "========================================"
# ======== Part 0: Setup Python ========
# Detect if python3.12+ is available
has_new_python=$($PYTHON -c "print(1 if __import__('sys').version_info >= (3,12) else 0)" 2>/dev/null || echo 0)
if [[ "$has_new_python" -eq 0 ]]; then
# Use new python from docker
# Use --user to ensure files are created with correct ownership (not root)
docker pull python:3-slim
PYTHON="docker run --rm --user $(id -u):$(id -g) -v $(pwd):/app -w /app python:3-slim python3"
fi
echo "Using python interpreter: $PYTHON"
echo "Python version: $($PYTHON --version)"
# ======== Part 1: Collect and prepare wheels ========
# Collect all wheels
mkdir -p all-rocm-wheels
cp artifacts/rocm-base-wheels/*.whl all-rocm-wheels/ 2>/dev/null || true
cp artifacts/rocm-vllm-wheel/*.whl all-rocm-wheels/ 2>/dev/null || true
WHEEL_COUNT=$(ls all-rocm-wheels/*.whl 2>/dev/null | wc -l)
echo "Total wheels to upload: $WHEEL_COUNT"
if [ "$WHEEL_COUNT" -eq 0 ]; then
echo "ERROR: No wheels found to upload!"
exit 1
fi
# Rename linux to manylinux in wheel filenames
for wheel in all-rocm-wheels/*.whl; do
if [[ "$wheel" == *"linux"* ]] && [[ "$wheel" != *"manylinux"* ]]; then
new_wheel="${wheel/linux/$MANYLINUX_VERSION}"
mv -- "$wheel" "$new_wheel"
echo "Renamed: $(basename "$wheel") -> $(basename "$new_wheel")"
fi
done
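The bash substitution `${wheel/linux/$MANYLINUX_VERSION}` replaces only the first occurrence of `linux`, which here sits inside the platform tag; the preceding `*"manylinux"*` guard prevents double renaming. In Python terms (the example filename is hypothetical):

```python
wheel = "torch-2.9.0+rocm710-cp312-cp312-linux_x86_64.whl"
# count=1 mirrors bash's single-replacement ${var/pat/repl} form
renamed = wheel.replace("linux", "manylinux_2_35", 1)
print(renamed)  # torch-2.9.0+rocm710-cp312-cp312-manylinux_2_35_x86_64.whl
```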
echo ""
echo "Wheels to upload:"
ls -lh all-rocm-wheels/
# ======== Part 2: Upload wheels to S3 ========
echo ""
echo "Uploading wheels to $S3_COMMIT_PREFIX"
for wheel in all-rocm-wheels/*.whl; do
aws s3 cp "$wheel" "$S3_COMMIT_PREFIX"
done
# ======== Part 3: Generate and upload indices ========
# List existing wheels in commit directory
echo ""
echo "Generating indices..."
obj_json="rocm-objects.json"
aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$ROCM_SUBPATH/" --delimiter / --output json > "$obj_json"
mkdir -p "$INDICES_OUTPUT_DIR"
# Use the existing generate-nightly-index.py
# HACK: Replace regex module with stdlib re (same as CUDA script)
sed -i 's/import regex as re/import re/g' .buildkite/scripts/generate-nightly-index.py
$PYTHON .buildkite/scripts/generate-nightly-index.py \
--version "$ROCM_SUBPATH" \
--current-objects "$obj_json" \
--output-dir "$INDICES_OUTPUT_DIR" \
--comment "ROCm commit $BUILDKITE_COMMIT"
# Upload indices to commit directory
echo "Uploading indices to $S3_COMMIT_PREFIX"
aws s3 cp --recursive "$INDICES_OUTPUT_DIR/" "$S3_COMMIT_PREFIX"
# Update rocm/nightly/ if on main branch and not a PR
if [[ "$BUILDKITE_BRANCH" == "main" && "$BUILDKITE_PULL_REQUEST" == "false" ]] || [[ "$NIGHTLY" == "1" ]]; then
echo "Updating rocm/nightly/ index..."
aws s3 cp --recursive "$INDICES_OUTPUT_DIR/" "s3://$BUCKET/rocm/nightly/"
fi
# Extract version from vLLM wheel and update version-specific index
VLLM_WHEEL=$(ls all-rocm-wheels/vllm*.whl 2>/dev/null | head -1)
if [ -n "$VLLM_WHEEL" ]; then
VERSION=$(unzip -p "$VLLM_WHEEL" '**/METADATA' | grep '^Version: ' | cut -d' ' -f2)
echo "Version in wheel: $VERSION"
PURE_VERSION="${VERSION%%+*}"
PURE_VERSION="${PURE_VERSION%%.rocm}"
echo "Pure version: $PURE_VERSION"
if [[ "$VERSION" != *"dev"* ]]; then
echo "Updating rocm/$PURE_VERSION/ index..."
aws s3 cp --recursive "$INDICES_OUTPUT_DIR/" "s3://$BUCKET/rocm/$PURE_VERSION/"
fi
fi
# ======== Part 4: Summary ========
echo ""
echo "========================================"
echo "ROCm Wheel Upload Complete!"
echo "========================================"
echo ""
echo "Wheels available at:"
echo " s3://$BUCKET/$ROCM_SUBPATH/"
echo ""
echo "Install command (by commit):"
echo " pip install vllm --extra-index-url https://${BUCKET}.s3.amazonaws.com/$ROCM_SUBPATH/"
echo ""
if [[ "$BUILDKITE_BRANCH" == "main" ]] || [[ "$NIGHTLY" == "1" ]]; then
echo "Install command (nightly):"
echo " pip install vllm --extra-index-url https://${BUCKET}.s3.amazonaws.com/rocm/nightly/"
fi
echo ""
echo "Wheel count: $WHEEL_COUNT"
echo "========================================"

View File

@@ -162,7 +162,7 @@ steps:
     - tests/entrypoints/test_chat_utils
   commands:
     - export VLLM_WORKER_MULTIPROC_METHOD=spawn
-    - pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/
+    - pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/ --ignore=entrypoints/openai/responses
     - pytest -v -s entrypoints/test_chat_utils.py

 - label: Entrypoints Integration Test (API Server 2)
@@ -199,6 +199,21 @@ steps:
     - export VLLM_WORKER_MULTIPROC_METHOD=spawn
     - pytest -v -s entrypoints/pooling
+
+- label: Entrypoints Integration Test (Responses API)
+  timeout_in_minutes: 50
+  mirror_hardwares: [amdexperimental]
+  agent_pool: mi325_1
+  # grade: Blocking
+  working_dir: "/vllm-workspace/tests"
+  fast_check: true
+  torch_nightly: true
+  source_file_dependencies:
+    - vllm/
+    - tests/entrypoints/openai/responses
+  commands:
+    - export VLLM_WORKER_MULTIPROC_METHOD=spawn
+    - pytest -v -s entrypoints/openai/responses
 - label: Distributed Tests (4 GPUs) # 35min
   timeout_in_minutes: 50
   mirror_hardwares: [amdexperimental]
@@ -219,6 +234,9 @@ steps:
     - tests/v1/engine/test_engine_core_client.py
     - tests/distributed/test_symm_mem_allreduce.py
   commands:
+    # Work around HIP bug tracked here: https://github.com/ROCm/hip/issues/3876
+    # TODO: Remove when the bug is fixed in a future ROCm release
+    - export TORCH_NCCL_BLOCKING_WAIT=1
     # test with torchrun tp=2 and external_dp=2
     - torchrun --nproc-per-node=4 distributed/test_torchrun_example.py
     # test with torchrun tp=2 and pp=2
@@ -267,9 +285,10 @@ steps:
     - vllm/v1/executor/uniproc_executor.py
     - vllm/v1/worker/gpu_worker.py
   commands:
-    # https://github.com/NVIDIA/nccl/issues/1838
-    #- export NCCL_CUMEM_HOST_ENABLE=0
     # test with torchrun tp=2 and dp=4 with ep
+    # Work around HIP bug tracked here: https://github.com/ROCm/hip/issues/3876
+    # TODO: Remove when the bug is fixed in a future ROCm release
+    - export TORCH_NCCL_BLOCKING_WAIT=1
     - torchrun --nproc-per-node=8 ../examples/offline_inference/torchrun_dp_example.py --tp-size=2 --pp-size=1 --dp-size=4 --enable-ep

 - label: EPLB Algorithm Test # 5min
@@ -349,7 +368,9 @@ steps:
 - label: V1 Test e2e + engine # 65min
   timeout_in_minutes: 90
   mirror_hardwares: [amdexperimental]
-  agent_pool: mi325_4
+  # The test uses 4 GPUs, but we schedule it on 8-GPU machines for stability.
+  # See discussion here: https://github.com/vllm-project/vllm/pull/31040
+  agent_pool: mi325_8
   # grade: Blocking
   source_file_dependencies:
     - vllm/
@@ -510,8 +531,7 @@ steps:
- tests/samplers
- tests/conftest.py
commands:
-- pytest -v -s samplers
-- VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers
+- pytest -v -s -m 'not skip_v1' samplers
- label: LoRA Test %N # 20min each
timeout_in_minutes: 30
@@ -725,7 +745,7 @@ steps:
- label: Quantization Test # 70min
timeout_in_minutes: 90
-mirror_hardwares: [amdexperimental]
+mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
source_file_dependencies:
@@ -765,8 +785,9 @@ steps:
- csrc/
- vllm/entrypoints/openai/
- vllm/model_executor/models/whisper.py
+- tools/
commands: # LMEval+Transcription WER check
-# Transcription WER check is skipped because encoder-decoder models are not supported on ROCm, see https://github.com/vllm-project/vllm/issues/27442
+- bash ../tools/install_torchcodec_rocm.sh || exit 1
- pytest -s entrypoints/openai/correctness/
@@ -861,6 +882,7 @@ steps:
# Shard slow subset of standard language models tests. Only run when model
# source is modified, or when specified test files are modified
- pip freeze | grep -E 'torch'
+- export TORCH_NCCL_BLOCKING_WAIT=1
- pytest -v -s models/language -m 'core_model and slow_test' \
  --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT \
  --shard-id=$$BUILDKITE_PARALLEL_JOB
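The `--num-shards`/`--shard-id` flags above split the collected tests across Buildkite's parallel jobs (`$$BUILDKITE_PARALLEL_JOB_COUNT` jobs, each with its own `$$BUILDKITE_PARALLEL_JOB` index). One plausible assignment scheme, round-robin by collection index, is sketched below; the actual pytest plugin may hash node IDs instead, so treat this only as an illustration of the partitioning idea:

```python
def shard(items, num_shards: int, shard_id: int):
    """Round-robin split: shard i keeps every item whose index % num_shards == i."""
    return [item for idx, item in enumerate(items) if idx % num_shards == shard_id]

tests = [f"test_{i}" for i in range(10)]
# Every test lands in exactly one of the 4 shards, so the union equals the full suite.
all_assigned = sorted(sum((shard(tests, 4, i) for i in range(4)), []))
assert all_assigned == sorted(tests)
```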
@@ -878,7 +900,7 @@ steps:
commands:
# Install fast path packages for testing against transformers
# Note: also needed to run plamo2 model in vLLM
-- uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5'
+- uv pip install --system --no-build-isolation 'git+https://github.com/AndreasKaratzas/mamba@fix-rocm-7.0-warp-size-constexpr'
- uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.2'
# Shard hybrid language model tests
- pytest -v -s models/language/generation \
@@ -899,7 +921,7 @@ steps:
commands:
# Install fast path packages for testing against transformers
# Note: also needed to run plamo2 model in vLLM
-- uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5'
+- uv pip install --system --no-build-isolation 'git+https://github.com/AndreasKaratzas/mamba@fix-rocm-7.0-warp-size-constexpr'
- uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.2'
- pytest -v -s models/language/generation -m '(not core_model) and (not hybrid_model)'
@@ -964,7 +986,7 @@ steps:
- pytest -v -s models/multimodal/processing
- label: Multi-Modal Models Test (Standard) # 60min
-timeout_in_minutes: 80
+timeout_in_minutes: 100
mirror_hardwares: [amdexperimental]
agent_pool: mi325_1
# grade: Blocking
@@ -973,13 +995,16 @@ steps:
- vllm/
- tests/models/multimodal
commands:
+- export MIOPEN_DEBUG_CONV_DIRECT=0
+- export MIOPEN_DEBUG_CONV_GEMM=0
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pip freeze | grep -E 'torch'
-- pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing
+- pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing --ignore models/multimodal/pooling/test_prithvi_mae.py
+- pytest -v -s models/multimodal/pooling/test_prithvi_mae.py -m core_model
- cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work
-- label: Multi-Modal Accuracy Eval (Small Models) # 150min - 180min
+- label: Multi-Modal Accuracy Eval (Small Models) # 5min
-timeout_in_minutes: 180
+timeout_in_minutes: 10
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
@@ -989,7 +1014,9 @@ steps:
- vllm/inputs/
- vllm/v1/core/
commands:
-- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-mm-small.txt --tp-size=1
+- export MIOPEN_DEBUG_CONV_DIRECT=0
+- export MIOPEN_DEBUG_CONV_GEMM=0
+- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-mm-small.txt
- label: Multi-Modal Models Test (Extended) 1 # 60min
timeout_in_minutes: 120
@@ -1001,10 +1028,13 @@ steps:
- vllm/
- tests/models/multimodal
commands:
+- export MIOPEN_DEBUG_CONV_DIRECT=0
+- export MIOPEN_DEBUG_CONV_GEMM=0
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/multimodal -m 'not core_model' --ignore models/multimodal/generation/test_common.py --ignore models/multimodal/processing
-- label: Multi-Modal Models Test (Extended) 2
+- label: Multi-Modal Models Test (Extended) 2 #60min
+timeout_in_minutes: 120
mirror_hardwares: [amdexperimental]
agent_pool: mi325_1
# grade: Blocking
@@ -1013,6 +1043,8 @@ steps:
- vllm/
- tests/models/multimodal
commands:
+- export MIOPEN_DEBUG_CONV_DIRECT=0
+- export MIOPEN_DEBUG_CONV_GEMM=0
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=0) and not core_model'
@@ -1026,6 +1058,8 @@ steps:
- vllm/
- tests/models/multimodal
commands:
+- export MIOPEN_DEBUG_CONV_DIRECT=0
+- export MIOPEN_DEBUG_CONV_GEMM=0
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=1) and not core_model'
@@ -1085,8 +1119,8 @@ steps:
- vllm/v1/attention/backends/flashinfer.py
- vllm/v1/attention/backends/mla/cutlass_mla.py
- vllm/v1/attention/backends/mla/flashinfer_mla.py
+- vllm/v1/attention/selector.py
- vllm/platforms/cuda.py
-- vllm/attention/selector.py
commands:
- nvidia-smi
- python3 examples/offline_inference/basic/chat.py
@@ -1243,13 +1277,13 @@ steps:
- # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
-- python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=0 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code
+- python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=0 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
- # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
-- python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=1 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code
+- python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=1 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code
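The two `data_parallel.py` invocations above are identical except for the node rank; rank 0 doubles as the rendezvous master. A small sketch of how the per-node command line could be assembled (the helper name is ours for illustration; the flag spelling follows the `-dp`/`--dp-node-rank` form shown in the newer side of the diff):

```python
def data_parallel_cmd(dp_node_rank: int, master_addr: str = "192.168.10.10") -> list[str]:
    """Build the per-node command; only --dp-node-rank varies between the two nodes."""
    return [
        "python3", "../examples/offline_inference/data_parallel.py",
        "-dp=2", "-tp=1", "--dp-num-nodes=2",
        f"--dp-node-rank={dp_node_rank}",
        f"--dp-master-addr={master_addr}", "--dp-master-port=12345",
        "--enforce-eager", "--trust-remote-code",
    ]

# Node 0 and node 1 run the same command except for the rank flag.
assert data_parallel_cmd(0) != data_parallel_cmd(1)
```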
- label: Distributed Tests (2 GPUs) # 68min
timeout_in_minutes: 90
@@ -1275,6 +1309,9 @@ steps:
- tests/v1/shutdown
- tests/v1/worker/test_worker_memory_snapshot.py
commands:
+# Work around HIP bug tracked here: https://github.com/ROCm/hip/issues/3876
+# TODO: Remove when the bug is fixed in a future ROCm release
+- export TORCH_NCCL_BLOCKING_WAIT=1
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_async_llm_dp.py
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_eagle_dp.py
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_external_lb_dp.py
@@ -1424,8 +1461,22 @@ steps:
- vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
- tests/v1/kv_connector/nixl_integration/
commands:
-- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
+- uv pip install --system -r /vllm-workspace/requirements/kv_connectors_rocm.txt
-- bash v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh
+- VLLM_ATTENTION_BACKEND=ROCM_ATTN bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh
- label: DP EP NixlConnector PD accuracy tests (Distributed) # 15min
mirror_hardwares: [amdexperimental]
agent_pool: mi325_4
# grade: Blocking
timeout_in_minutes: 15
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
- vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors_rocm.txt
- VLLM_ATTENTION_BACKEND=ROCM_ATTN DP_EP=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh
##### multi gpus test #####
##### A100 test #####
@@ -1497,7 +1548,7 @@ steps:
- "VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'" - "VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'"
- VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/distributed/test_sequence_parallel.py - VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/distributed/test_sequence_parallel.py
- pytest -v -s tests/distributed/test_context_parallel.py - pytest -v -s tests/distributed/test_context_parallel.py
- HIP_VISIBLE_DEVICES=0,1 VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048 --all2all-backend deepep_high_throughput - HIP_VISIBLE_DEVICES=0,1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=allgather_reducescatter --disable-nccl-for-dp-synchronization
- pytest -v -s tests/v1/distributed/test_dbo.py - pytest -v -s tests/v1/distributed/test_dbo.py
##### B200 test ##### ##### B200 test #####
@@ -1576,6 +1627,8 @@ steps:
- .buildkite/scripts/run-prime-rl-test.sh
commands:
- bash .buildkite/scripts/run-prime-rl-test.sh
+##### EPLB Accuracy Tests #####
- label: DeepSeek V2-Lite Accuracy
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_4


@@ -144,7 +144,7 @@ steps:
- tests/entrypoints/test_chat_utils
commands:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
-- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/
+- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/ --ignore=entrypoints/openai/responses
- pytest -v -s entrypoints/test_chat_utils.py
- label: Entrypoints Integration Test (API Server 2)
@@ -177,6 +177,18 @@ steps:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s entrypoints/pooling
- label: Entrypoints Integration Test (Responses API)
timeout_in_minutes: 50
mirror_hardwares: [amdexperimental]
working_dir: "/vllm-workspace/tests"
fast_check: true
torch_nightly: true
source_file_dependencies:
- vllm/
- tests/entrypoints/openai/responses
commands:
- pytest -v -s entrypoints/openai/responses
- label: Distributed Tests (4 GPUs) # 35min
timeout_in_minutes: 50
mirror_hardwares: [amdexperimental]
@@ -943,7 +955,6 @@ steps:
timeout_in_minutes: 30
working_dir: "/vllm-workspace/"
gpu: b200
-# optional: true
source_file_dependencies:
- csrc/quantization/fp4/
- csrc/attention/mla/
@@ -955,8 +966,8 @@ steps:
- vllm/v1/attention/backends/flashinfer.py
- vllm/v1/attention/backends/mla/cutlass_mla.py
- vllm/v1/attention/backends/mla/flashinfer_mla.py
+- vllm/v1/attention/selector.py
- vllm/platforms/cuda.py
-- vllm/attention/selector.py
commands:
- nvidia-smi
- python3 examples/offline_inference/basic/chat.py
@@ -1105,17 +1116,18 @@ steps:
- vllm/model_executor/models/
- tests/distributed/
- tests/examples/offline_inference/data_parallel.py
+- .buildkite/scripts/run-multi-node-test.sh
commands:
- # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
-- python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=0 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code
+- python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=0 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
- # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
-- python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=1 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code
+- python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=1 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code
- label: Distributed Tests (2 GPUs) # 68min
timeout_in_minutes: 90
@@ -1267,8 +1279,8 @@ steps:
commands:
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt
-- label: NixlConnector PD accuracy tests (Distributed) # 30min
+- label: NixlConnector PD accuracy tests (Distributed) # 40min
-timeout_in_minutes: 30
+timeout_in_minutes: 40
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
@@ -1276,7 +1288,18 @@ steps:
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
-- bash v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh
+- bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh
- label: DP EP NixlConnector PD accuracy tests (Distributed) # 15min
timeout_in_minutes: 15
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
- vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
- DP_EP=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh
##### multi gpus test #####
@@ -1334,9 +1357,17 @@ steps:
- "VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'" - "VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'"
- VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/distributed/test_sequence_parallel.py - VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/distributed/test_sequence_parallel.py
- pytest -v -s tests/distributed/test_context_parallel.py - pytest -v -s tests/distributed/test_context_parallel.py
- CUDA_VISIBLE_DEVICES=1,2 VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048 --all2all-backend deepep_high_throughput - CUDA_VISIBLE_DEVICES=1,2 VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput
- pytest -v -s tests/v1/distributed/test_dbo.py - pytest -v -s tests/v1/distributed/test_dbo.py
- label: LM Eval Large Models (H200) # optional
timeout_in_minutes: 60
gpu: h200
optional: true
num_gpus: 8
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-h200.txt
##### B200 test #####
- label: Distributed Tests (B200) # optional
gpu: b200
@@ -1359,6 +1390,7 @@ steps:
- vllm/
- .buildkite/scripts/run-prime-rl-test.sh
commands:
+- nvidia-smi
- bash .buildkite/scripts/run-prime-rl-test.sh
- label: DeepSeek V2-Lite Accuracy
@@ -1387,3 +1419,26 @@ steps:
working_dir: "/vllm-workspace" working_dir: "/vllm-workspace"
commands: commands:
- bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh 0.8 200 8020 2 1 - bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh 0.8 200 8020 2 1
##### MoE Refactor (Temporary) Tests #####
- label: MoE Refactor Integration Test (H100 - TEMPORARY) # optional
gpu: h100
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-h100.txt
- label: MoE Refactor Integration Test (B200 - TEMPORARY) # optional
gpu: b200
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-b200.txt
- label: MoE Refactor Integration Test (B200 DP - TEMPORARY) # optional
gpu: b200
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor-dp-ep/config-b200.txt


@@ -145,7 +145,7 @@ steps:
- VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'
- VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/distributed/test_sequence_parallel.py
- pytest -v -s tests/distributed/test_context_parallel.py
-- CUDA_VISIBLE_DEVICES=1,2 VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048 --all2all-backend deepep_high_throughput
+- CUDA_VISIBLE_DEVICES=1,2 VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput
- pytest -v -s tests/v1/distributed/test_dbo.py
- label: Distributed Tests (2 GPUs)(B200)
@@ -171,7 +171,7 @@ steps:
- tests/distributed/
- tests/examples/offline_inference/data_parallel.py
commands:
-- ./.buildkite/scripts/run-multi-node-test.sh /vllm-workspace/tests 2 2 public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:0bec63fa317e1fbd62e19b0fc31c43c81bf89077 "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=0 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py" "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=1 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code"
+- ./.buildkite/scripts/run-multi-node-test.sh /vllm-workspace/tests 2 2 public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:0bec63fa317e1fbd62e19b0fc31c43c81bf89077 "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=0 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py" "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=1 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code"
- label: Distributed NixlConnector PD accuracy (4 GPUs)
timeout_in_minutes: 30
@@ -182,7 +182,7 @@ steps:
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
-- bash v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh
+- bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh
- label: Pipeline + Context Parallelism (4 GPUs)
timeout_in_minutes: 60


@@ -34,10 +34,9 @@ steps:
- tests/entrypoints/test_chat_utils
commands:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
-- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/
+- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/ --ignore=entrypoints/openai/responses
- pytest -v -s entrypoints/test_chat_utils.py
- label: Entrypoints Integration (API Server 2)
timeout_in_minutes: 130
working_dir: "/vllm-workspace/tests"
@@ -64,6 +63,14 @@ steps:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s entrypoints/pooling
- label: Entrypoints Integration (Responses API)
timeout_in_minutes: 50
working_dir: "/vllm-workspace/tests"
source_file_dependencies:
- vllm/
- tests/entrypoints/openai/responses
commands:
- pytest -v -s entrypoints/openai/responses
- label: Entrypoints V1 - label: Entrypoints V1
timeout_in_minutes: 50 timeout_in_minutes: 50


@@ -90,8 +90,8 @@ steps:
- vllm/v1/attention/backends/flashinfer.py - vllm/v1/attention/backends/flashinfer.py
- vllm/v1/attention/backends/mla/cutlass_mla.py - vllm/v1/attention/backends/mla/cutlass_mla.py
- vllm/v1/attention/backends/mla/flashinfer_mla.py - vllm/v1/attention/backends/mla/flashinfer_mla.py
- vllm/v1/attention/selector.py
- vllm/platforms/cuda.py - vllm/platforms/cuda.py
- vllm/attention/selector.py
commands: commands:
- nvidia-smi - nvidia-smi
- python3 examples/offline_inference/basic/chat.py - python3 examples/offline_inference/basic/chat.py

.github/CODEOWNERS vendored

@@ -3,7 +3,6 @@
 # This lists cover the "core" components of vLLM that require careful review
 /vllm/attention @LucasWilkinson
-/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @njhill
 /vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @njhill @22quinn
 /vllm/model_executor/layers/fused_moe @mgoin @pavanimajety
 /vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth @yewentao256 @pavanimajety
@@ -15,6 +14,7 @@
 /vllm/lora @jeejeelee
 /vllm/reasoning @aarnphm @chaunceyjiang
 /vllm/entrypoints @aarnphm @chaunceyjiang
+/vllm/tool_parsers @aarnphm @chaunceyjiang
 /vllm/compilation @zou3519 @youkaichao @ProExpertProg
 /vllm/distributed/kv_transfer @NickLucche @ApostaC
 CMakeLists.txt @tlrmchlsmth @LucasWilkinson
@@ -26,6 +26,7 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
 # vLLM V1
 /vllm/v1/attention @LucasWilkinson
+/vllm/v1/attention/backend.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @njhill
 /vllm/v1/attention/backends/mla @pavanimajety
 /vllm/v1/attention/backends/flashinfer.py @mgoin @pavanimajety
 /vllm/v1/attention/backends/triton_attn.py @tdoublep
@@ -116,15 +117,15 @@ mkdocs.yaml @hmellor
 /vllm/transformers_utils/tokenizers/mistral.py @patrickvonplaten
 # Kernels
-/vllm/attention/ops/chunked_prefill_paged_decode.py @tdoublep
+/vllm/v1/attention/ops/chunked_prefill_paged_decode.py @tdoublep
-/vllm/attention/ops/triton_unified_attention.py @tdoublep
+/vllm/v1/attention/ops/triton_unified_attention.py @tdoublep
 # ROCm related: specify owner with write access to notify AMD folks for careful code review
 /vllm/**/*rocm* @tjtanaa
 /docker/Dockerfile.rocm* @gshtras @tjtanaa
 /vllm/v1/attention/backends/rocm*.py @gshtras @tjtanaa
 /vllm/v1/attention/backends/mla/rocm*.py @gshtras @tjtanaa
-/vllm/attention/ops/rocm*.py @gshtras @tjtanaa
+/vllm/v1/attention/ops/rocm*.py @gshtras @tjtanaa
 /vllm/model_executor/layers/fused_moe/rocm*.py @gshtras @tjtanaa
 /csrc/rocm @gshtras @tjtanaa
 /requirements/*rocm* @tjtanaa
@@ -152,7 +153,7 @@ mkdocs.yaml @hmellor
 /vllm/entrypoints/pooling @noooop
 /vllm/config/pooler.py @noooop
 /vllm/pooling_params.py @noooop
-/vllm/model_executor/layers/pooler.py @noooop
+/vllm/model_executor/layers/pooler @noooop
 # Security guide and policies
 /docs/usage/security.md @russellb

.github/mergify.yml vendored

@@ -222,10 +222,10 @@ pull_request_rules:
       - files~=^csrc/rocm/
       - files~=^docker/Dockerfile.rocm
       - files~=^requirements/rocm.*\.txt
-      - files~=^vllm/attention/backends/rocm.*\.py
-      - files~=^vllm/attention/ops/rocm.*\.py
       - files~=^vllm/model_executor/layers/fused_moe/rocm.*\.py
+      - files~=^vllm/v1/attention/backends/rocm.*\.py
       - files~=^vllm/v1/attention/backends/mla/rocm.*\.py
+      - files~=^vllm/v1/attention/ops/rocm.*\.py
       - files~=^tests/kernels/.*_rocm.*\.py
       - files=vllm/platforms/rocm.py
       - title~=(?i)AMD
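The `files~=` conditions above are anchored regular expressions that Mergify evaluates against each changed file path. As a minimal illustrative sketch of how the updated v1 patterns classify paths (the pattern subset and example filenames here are for demonstration only; Mergify evaluates the real rule server-side):

```python
import re

# Subset of the ROCm path patterns from the mergify rule above.
ROCM_PATTERNS = [
    r"^vllm/v1/attention/backends/rocm.*\.py",
    r"^vllm/v1/attention/backends/mla/rocm.*\.py",
    r"^vllm/v1/attention/ops/rocm.*\.py",
]

def touches_rocm_paths(path: str) -> bool:
    """Return True if a changed-file path matches any ROCm pattern."""
    return any(re.match(p, path) for p in ROCM_PATTERNS)

# The new v1 layout matches; the removed pre-v1 layout no longer does.
print(touches_rocm_paths("vllm/v1/attention/ops/rocm_attn.py"))  # True
print(touches_rocm_paths("vllm/attention/ops/rocm_attn.py"))     # False
```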

.gitignore vendored

@@ -227,3 +227,8 @@ ep_kernels_workspace/
 # Allow tracked library source folders under submodules (e.g., benchmarks/lib)
 !vllm/benchmarks/lib/
+# Generated gRPC protobuf files (compiled at build time from vllm_engine.proto)
+vllm/grpc/vllm_engine_pb2.py
+vllm/grpc/vllm_engine_pb2_grpc.py
+vllm/grpc/vllm_engine_pb2.pyi


@@ -282,6 +282,7 @@ endif()
 set(VLLM_EXT_SRC
   "csrc/mamba/mamba_ssm/selective_scan_fwd.cu"
   "csrc/cache_kernels.cu"
+  "csrc/cache_kernels_fused.cu"
   "csrc/attention/paged_attention_v1.cu"
   "csrc/attention/paged_attention_v2.cu"
   "csrc/attention/merge_attn_states.cu"
@@ -799,24 +800,6 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   else()
     cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a" "${CUDA_ARCHS}")
   endif()
-  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
-    set(SRCS "csrc/quantization/w8a8/cutlass/moe/blockwise_scaled_group_mm_sm100.cu")
-    set_gencode_flags_for_srcs(
-      SRCS "${SRCS}"
-      CUDA_ARCHS "${SCALED_MM_ARCHS}")
-    list(APPEND VLLM_EXT_SRC "${SRCS}")
-    list(APPEND VLLM_GPU_FLAGS "-DENABLE_CUTLASS_MOE_SM100=1")
-    message(STATUS "Building blockwise_scaled_group_mm_sm100 for archs: ${SCALED_MM_ARCHS}")
-  else()
-    if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
-      message(STATUS "Not building blockwise_scaled_group_mm_sm100 kernels as CUDA Compiler version is "
-        "not >= 12.8, we recommend upgrading to CUDA 12.8 or later "
-        "if you intend on running FP8 quantized MoE models on Blackwell.")
-    else()
-      message(STATUS "Not building blockwise_scaled_group_mm_sm100 as no compatible archs found "
-        "in CUDA target architectures")
-    endif()
-  endif()
   #
   # Machete kernels


@@ -14,51 +14,8 @@ Easy, fast, and cheap LLM serving for everyone
 | <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
 </p>
----
+🔥 We have built a vllm website to help you get started with vllm. Please visit [vllm.ai](https://vllm.ai) to learn more.
-Join us at the [PyTorch Conference, October 22-23](https://events.linuxfoundation.org/pytorch-conference/) and [Ray Summit, November 3-5](https://www.anyscale.com/ray-summit/2025) in San Francisco for our latest updates on vLLM and to meet the vLLM team! Register now for the largest vLLM community events of the year!
+For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.
----
-*Latest News* 🔥
-- [2025/11] We hosted [vLLM Bangkok Meetup](https://luma.com/v0f647nv). We explored vLLM and LMCache inference and low-resource language adaptation with speakers from Embedded LLM, AMD, and Red Hat. Please find the meetup slides [here](https://drive.google.com/drive/folders/1H0DS57F8HQ5q3kSOSoRmucPJWL3E0A_X?usp=sharing).
-- [2025/11] We hosted [the first vLLM Europe Meetup in Zurich](https://luma.com/0gls27kb) focused on quantization, distributed inference, and reinforcement learning at scale with speakers from Mistral, IBM, and Red Hat. Please find the meetup slides [here](https://docs.google.com/presentation/d/1UC9PTLCHYXQpOmJDSFg6Sljra3iVXzc09DeEI7dnxMc/edit?usp=sharing) and recording [here](https://www.youtube.com/watch?v=6m6ZE6yVEDI)
-- [2025/11] We hosted [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/xSrYXjNgr1HbCP4ExYNG1w) focusing on distributed inference and diverse accelerator support with vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1nQJ8ZkLSjKxvu36sSHaceVXtttbLvvu-?usp=drive_link).
-- [2025/10] We hosted [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/__xb4OyOsImz-9eAVrdlcg) focused on hands-on vLLM inference optimization! Please find the meetup slides [here](https://drive.google.com/drive/folders/1KqwjsFJLfEsC8wlDugnrR61zsWHt94Q6).
-- [2025/09] We hosted [vLLM Toronto Meetup](https://luma.com/e80e0ymm) focused on tackling inference at scale and speculative decoding with speakers from NVIDIA and Red Hat! Please find the meetup slides [here](https://docs.google.com/presentation/d/1IYJYmJcu9fLpID5N5RbW_vO0XLo0CGOR14IXOjB61V8/edit?usp=sharing).
-- [2025/08] We hosted [vLLM Shenzhen Meetup](https://mp.weixin.qq.com/s/k8ZBO1u2_2odgiKWH_GVTQ) focusing on the ecosystem around vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Ua2SVKVSu-wp5vou_6ElraDt2bnKhiEA).
-- [2025/08] We hosted [vLLM Singapore Meetup](https://www.sginnovate.com/event/vllm-sg-meet). We shared V1 updates, disaggregated serving and MLLM speedups with speakers from Embedded LLM, AMD, WekaIO, and A*STAR. Please find the meetup slides [here](https://drive.google.com/drive/folders/1ncf3GyqLdqFaB6IeB834E5TZJPLAOiXZ?usp=sharing).
-- [2025/08] We hosted [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg) focusing on building, developing, and integrating with vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH).
-- [2025/05] vLLM is now a hosted project under PyTorch Foundation! Please find the announcement [here](https://pytorch.org/blog/pytorch-foundation-welcomes-vllm/).
-- [2025/01] We are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more. Please check out our blog post [here](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html).
-<details>
-<summary>Previous News</summary>
-- [2025/08] We hosted [vLLM Korea Meetup](https://luma.com/cgcgprmh) with Red Hat and Rebellions! We shared the latest advancements in vLLM along with project spotlights from the vLLM Korea community. Please find the meetup slides [here](https://drive.google.com/file/d/1bcrrAE1rxUgx0mjIeOWT6hNe2RefC5Hm/view).
-- [2025/08] We hosted [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/dgkWg1WFpWGO2jCdTqQHxA) focusing on large-scale LLM deployment! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF) and the recording [here](https://www.chaspark.com/#/live/1166916873711665152).
-- [2025/05] We hosted [NYC vLLM Meetup](https://lu.ma/c1rqyf1f)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing).
-- [2025/04] We hosted [Asia Developer Day](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/19cp6Qu8u48ihB91A064XfaXruNYiBOUKrBxAmDOllOo/edit?usp=sharing).
-- [2025/03] We hosted [vLLM x Ollama Inference Night](https://lu.ma/vllm-ollama)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/16T2PDD1YwRnZ4Tu8Q5r6n53c5Lr5c73UV9Vd2_eBo4U/edit?usp=sharing).
-- [2025/03] We hosted [the first vLLM China Meetup](https://mp.weixin.qq.com/s/n77GibL2corAtQHtVEAzfg)! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1REHvfQMKGnvz6p3Fd23HhSO4c8j5WPGZV0bKYLwnHyQ/edit?usp=sharing).
-- [2025/03] We hosted [the East Coast vLLM Meetup](https://lu.ma/7mu4k4xx)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1NHiv8EUFF1NLd3fEYODm56nDmL26lEeXCaDgyDlTsRs/edit#slide=id.g31441846c39_0_0).
-- [2025/02] We hosted [the ninth vLLM meetup](https://lu.ma/h7g3kuj9) with Meta! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1jzC_PZVXrVNSFVCW-V4cFXb6pn7zZ2CyP_Flwo05aqg/edit?usp=sharing) and AMD [here](https://drive.google.com/file/d/1Zk5qEJIkTmlQ2eQcXQZlljAx3m9s7nwn/view?usp=sharing). The slides from Meta will not be posted.
-- [2025/01] We hosted [the eighth vLLM meetup](https://lu.ma/zep56hui) with Google Cloud! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1epVkt4Zu8Jz_S5OhEHPc798emsYh2BwYfRuDDVEF7u4/edit?usp=sharing), and Google Cloud team [here](https://drive.google.com/file/d/1h24pHewANyRL11xy5dXUbvRC9F9Kkjix/view?usp=sharing).
-- [2024/12] vLLM joins [pytorch ecosystem](https://pytorch.org/blog/vllm-joins-pytorch)! Easy, Fast, and Cheap LLM Serving for Everyone!
-- [2024/11] We hosted [the seventh vLLM meetup](https://lu.ma/h0qvrajz) with Snowflake! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit?usp=sharing), and Snowflake team [here](https://docs.google.com/presentation/d/1qF3RkDAbOULwz9WK5TOltt2fE9t6uIc_hVNLFAaQX6A/edit?usp=sharing).
-- [2024/10] We have just created a developer slack ([slack.vllm.ai](https://slack.vllm.ai)) focusing on coordinating contributions and discussing features. Please feel free to join us there!
-- [2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team [here](https://docs.google.com/presentation/d/1B_KQxpHBTRa_mDF-tR6i8rWdOU5QoTZNcEg2MKZxEHM/edit?usp=sharing). Learn more from the [talks](https://www.youtube.com/playlist?list=PLzTswPQNepXl6AQwifuwUImLPFRVpksjR) from other vLLM contributors and users!
-- [2024/09] We hosted [the sixth vLLM meetup](https://lu.ma/87q3nvnh) with NVIDIA! Please find the meetup slides [here](https://docs.google.com/presentation/d/1wrLGwytQfaOTd5wCGSPNhoaW3nq0E-9wqyP7ny93xRs/edit?usp=sharing).
-- [2024/07] We hosted [the fifth vLLM meetup](https://lu.ma/lp0gyjqr) with AWS! Please find the meetup slides [here](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing).
-- [2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post [here](https://blog.vllm.ai/2024/07/23/llama31.html).
-- [2024/06] We hosted [the fourth vLLM meetup](https://lu.ma/agivllm) with Cloudflare and BentoML! Please find the meetup slides [here](https://docs.google.com/presentation/d/1iJ8o7V2bQEi0BFEljLTwc5G1S10_Rhv3beed5oB0NJ4/edit?usp=sharing).
-- [2024/04] We hosted [the third vLLM meetup](https://robloxandvllmmeetup2024.splashthat.com/) with Roblox! Please find the meetup slides [here](https://docs.google.com/presentation/d/1A--47JAK4BJ39t954HyTkvtfwn0fkqtsL8NGFuslReM/edit?usp=sharing).
-- [2024/01] We hosted [the second vLLM meetup](https://lu.ma/ygxbpzhl) with IBM! Please find the meetup slides [here](https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit?usp=sharing).
-- [2023/10] We hosted [the first vLLM meetup](https://lu.ma/first-vllm-meetup) with a16z! Please find the meetup slides [here](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing).
-- [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM.
-- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).
-</details>
 ---
@@ -118,50 +75,6 @@ Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
 We welcome and value any contributions and collaborations.
 Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.
-## Sponsors
-vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support!
-<!-- Note: Please sort them in alphabetical order. -->
-<!-- Note: Please keep these consistent with docs/community/sponsors.md -->
-Cash Donations:
-- a16z
-- Dropbox
-- Sequoia Capital
-- Skywork AI
-- ZhenFund
-Compute Resources:
-- Alibaba Cloud
-- AMD
-- Anyscale
-- Arm
-- AWS
-- Crusoe Cloud
-- Databricks
-- DeepInfra
-- Google Cloud
-- IBM
-- Intel
-- Lambda Lab
-- Nebius
-- Novita AI
-- NVIDIA
-- Red Hat
-- Replicate
-- Roblox
-- RunPod
-- Trainy
-- UC Berkeley
-- UC San Diego
-- Volcengine
-Slack Sponsor: Anyscale
-We also have an official fundraising venue through [OpenCollective](https://opencollective.com/vllm). We plan to use the fund to support the development, maintenance, and adoption of vLLM.
 ## Citation
 If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
@@ -182,7 +95,7 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs
 - For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
 - For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
 - For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
-- For collaborations and partnerships, please contact us at [vllm-questions@lists.berkeley.edu](mailto:vllm-questions@lists.berkeley.edu)
+- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
 <!-- --8<-- [end:contact-us] -->
 ## Media Kit


@@ -1,47 +1,30 @@
 # Releasing vLLM
-vLLM releases offer a reliable version of the code base, packaged into a binary format that can be conveniently accessed via PyPI. These releases also serve as key milestones for the development team to communicate with the community about newly available features, improvements, and upcoming changes that could affect users, including potential breaking changes.
+vLLM releases offer a reliable version of the code base, packaged into a binary format that can be conveniently accessed via [PyPI](https://pypi.org/project/vllm). These releases also serve as key milestones for the development team to communicate with the community about newly available features, improvements, and upcoming changes that could affect users, including potential breaking changes.
-## Release Versioning
+## Release Cadence and Versioning
-vLLM uses a “right-shifted” versioning scheme where a new patch release is out every 2 weeks. And patch releases contain features and bug fixes (as opposed to semver where patch release contains only backwards-compatible bug fixes). When critical fixes need to be made, special release post1 is released.
+We aim to have a regular release every 2 weeks. Since v0.12.0, regular releases increment the minor version rather than patch version. The list of past releases can be found [here](https://vllm.ai/releases).
-* _major_ major architectural milestone and when incompatible API changes are made, similar to PyTorch 2.0.
-* _minor_ major features
-* _patch_ features and backwards-compatible bug fixes
-* _post1_ or _patch-1_ backwards-compatible bug fixes, either explicit or implicit post release
-## Release Cadence
+Our version numbers are expressed in the form `vX.Y.Z`, where `X` is the major version, `Y` is the minor version, and `Z` is the patch version. They are incremented according to the following rules:
+* _Major_ releases are reserved for architectural milestones involving sweeping API changes, similar to PyTorch 2.0.
+* _Minor_ releases correspond to regular releases, which include new features, bug fixes and other backwards-compatible changes.
+* _Patch_ releases correspond to special releases for new models, as well as emergency patches for critical performance, functionality and security issues.
-Patch release is released on bi-weekly basis. Post release 1-3 days after patch release and uses same branch as patch release.
+This versioning scheme is similar to [SemVer](https://semver.org/) for compatibility purposes, except that backwards compatibility is only guaranteed for a limited number of minor releases (see our [deprecation policy](https://docs.vllm.ai/en/latest/contributing/deprecation_policy) for details).
-Following is the release cadence for year 2025. All future release dates below are tentative. Please note: Post releases are optional.
-| Release Date | Patch release versions | Post Release versions |
-| --- | --- | --- |
-| Jan 2025 | 0.7.0 | --- |
-| Feb 2025 | 0.7.1, 0.7.2, 0.7.3 | --- |
-| Mar 2025 | 0.7.4, 0.7.5 | --- |
-| Apr 2025 | 0.7.6, 0.7.7 | --- |
-| May 2025 | 0.7.8, 0.7.9 | --- |
-| Jun 2025 | 0.7.10, 0.7.11 | --- |
-| Jul 2025 | 0.7.12, 0.7.13 | --- |
-| Aug 2025 | 0.7.14, 0.7.15 | --- |
-| Sep 2025 | 0.7.16, 0.7.17 | --- |
-| Oct 2025 | 0.7.18, 0.7.19 | --- |
-| Nov 2025 | 0.7.20, 0.7.21 | --- |
-| Dec 2025 | 0.7.22, 0.7.23 | --- |
-## Release branch
+## Release Branch
 Each release is built from a dedicated release branch.
-* For _major_, _minor_, _patch_ releases, the release branch cut is performed 1-2 days before release is live.
+* For _major_ and _minor_ releases, the release branch cut is performed 1-2 days before release is live.
-* For post releases, previously cut release branch is reused
+* For _patch_ releases, previously cut release branch is reused.
-* Release builds are triggered via push to RC tag like vX.Y.Z-rc1 . This enables us to build and test multiple RCs for each release.
+* Release builds are triggered via push to RC tag like `vX.Y.Z-rc1`. This enables us to build and test multiple RCs for each release.
-* Final tag : vX.Y.Z does not trigger the build but used for Release notes and assets.
+* Final tag: `vX.Y.Z` does not trigger the build but used for Release notes and assets.
-* After branch cut is created we monitor the main branch for any reverts and apply these reverts to a release branch.
+* After branch cut is created, we monitor the main branch for any reverts and apply these reverts to a release branch.
-## Release Cherry-Pick Criteria
+### Cherry-Pick Criteria
 After branch cut, we approach finalizing the release branch with clear criteria on what cherry picks are allowed in. Note: a cherry pick is a process to land a PR in the release branch after branch cut. These are typically limited to ensure that the team has sufficient time to complete a thorough round of testing on a stable code base.
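The tag scheme described above, where one or more `vX.Y.Z-rcN` candidates precede a final `vX.Y.Z` tag, implies a natural ordering in which release candidates sort before their final release. A minimal illustrative sketch of that ordering (this is not the actual release tooling, just the scheme as described):

```python
import re

def parse_tag(tag: str):
    """Parse a release tag like 'v0.12.0' or 'v0.12.0-rc1' into a sortable key.

    A final tag has no rc component, so it is given an infinite rc key and
    therefore sorts after every one of its release candidates.
    """
    m = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)(?:-rc(\d+))?", tag)
    if not m:
        raise ValueError(f"not a release tag: {tag}")
    major, minor, patch, rc = m.groups()
    rc_key = int(rc) if rc is not None else float("inf")
    return (int(major), int(minor), int(patch), rc_key)

tags = ["v0.12.0-rc2", "v0.12.0", "v0.11.1", "v0.12.0-rc1"]
print(sorted(tags, key=parse_tag))
# ['v0.11.1', 'v0.12.0-rc1', 'v0.12.0-rc2', 'v0.12.0']
```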


@@ -104,7 +104,6 @@ def run_benchmark_with_batch_invariant(
     random.seed(seed)
     # Set environment variables
-    os.environ["VLLM_ATTENTION_BACKEND"] = backend
     if batch_invariant:
         os.environ["VLLM_BATCH_INVARIANT"] = "1"
     else:
@@ -140,6 +139,7 @@ def run_benchmark_with_batch_invariant(
         max_model_len=max_model_len,
         dtype="bfloat16",
         tensor_parallel_size=tp_size,
+        attention_config={"backend": backend},
         enable_prefix_caching=False,
     )
     init_time = time.perf_counter() - start_init


@@ -135,7 +135,6 @@ def benchmark_batched_propose(args):
         block_sizes=[16],
     )
     dummy_input_batch._req_ids = list(str(id) for id in range(args.num_req))
-    dummy_input_batch.spec_decode_unsupported_reqs = ()
     dummy_input_batch.num_tokens_no_spec = [args.num_token] * args.num_req
     dummy_input_batch.token_ids_cpu = np.random.randint(
         0, 20, (args.num_req, args.num_token)
@@ -151,10 +150,8 @@ def benchmark_batched_propose(args):
     start = time.time()
     runner.drafter.propose(
         sampled_token_ids,
-        dummy_input_batch.req_ids,
         dummy_input_batch.num_tokens_no_spec,
         dummy_input_batch.token_ids_cpu,
-        dummy_input_batch.spec_decode_unsupported_reqs,
     )
     end = time.time()
     print(f"Iteration time (s): {end - start}")


@@ -343,7 +343,9 @@ def bench(
         return bench_int8(dtype, m, k, n, label, sub_label)
     if dtype == torch.float8_e4m3fn:
         return bench_fp8(dtype, m, k, n, label, sub_label)
-    raise ValueError("unsupported type")
+    raise ValueError(
+        f"Unsupported dtype {dtype}: should be one of torch.int8, torch.float8_e4m3fn."
+    )
 # runner


@@ -0,0 +1,177 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import argparse
import copy
import itertools
import torch
from weight_shapes import WEIGHT_SHAPES
from vllm import _custom_ops as ops
from vllm.platforms import current_platform
from vllm.scalar_type import scalar_types
from vllm.triton_utils import triton
from vllm.utils.flashinfer import flashinfer_fp4_quantize
if not current_platform.has_device_capability(100):
raise RuntimeError("NVFP4 requires compute capability of 10.0 (Blackwell)")
FLOAT4_E2M1_MAX = scalar_types.float4_e2m1f.max()
FLOAT8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max
PROVIDER_CFGS = {
"vllm": dict(backend="vllm", enabled=True),
"flashinfer": dict(backend="flashinfer", enabled=True),
}
_enabled = [k for k, v in PROVIDER_CFGS.items() if v["enabled"]]
def compute_global_scale(tensor: torch.Tensor) -> torch.Tensor:
"""Compute global scale for FP4 quantization."""
amax = torch.abs(tensor).max().to(torch.float32)
return FLOAT8_E4M3_MAX * FLOAT4_E2M1_MAX / amax
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=["batch_size"],
x_vals=[1, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096],
x_log=False,
line_arg="provider",
line_vals=_enabled,
line_names=_enabled,
ylabel="us (lower is better)",
plot_name="NVFP4 Input Quantization Latency (us)",
args={},
)
)
def benchmark(batch_size, provider, N, K):
M = batch_size
device = "cuda"
dtype = torch.bfloat16
# Create input tensor
a = torch.randn((M, K), device=device, dtype=dtype)
# Compute global scale for activation
a_global_scale = compute_global_scale(a)
quantiles = [0.5, 0.2, 0.8]
cfg = PROVIDER_CFGS[provider]
if cfg["backend"] == "vllm":
# vLLM's FP4 quantization
ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
lambda: ops.scaled_fp4_quant(a, a_global_scale),
quantiles=quantiles,
)
elif cfg["backend"] == "flashinfer":
# FlashInfer's FP4 quantization
# Use is_sf_swizzled_layout=True to match vLLM's output format
ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
lambda: flashinfer_fp4_quantize(
a, a_global_scale, is_sf_swizzled_layout=True
),
quantiles=quantiles,
)
# Convert ms to us for better readability at small batch sizes
to_us = lambda t_ms: t_ms * 1000
return to_us(ms), to_us(max_ms), to_us(min_ms)
def prepare_shapes(args):
out = []
for model, tp_size in itertools.product(args.models, args.tp_sizes):
for KN, tp_dim in copy.deepcopy(WEIGHT_SHAPES[model]):
KN[tp_dim] //= tp_size
KN.append(model)
out.append(KN)
return out
def _test_accuracy_once(M: int, K: int, dtype: torch.dtype, device: str):
"""Test accuracy between vLLM and FlashInfer FP4 quantization."""
# Create input tensor
a = torch.randn((M, K), device=device, dtype=dtype)
# Compute global scale
a_global_scale = compute_global_scale(a)
# vLLM quantization
vllm_fp4, vllm_scale = ops.scaled_fp4_quant(a, a_global_scale)
# FlashInfer quantization (with swizzled layout to match vLLM's output)
flashinfer_fp4, flashinfer_scale = flashinfer_fp4_quantize(
a, a_global_scale, is_sf_swizzled_layout=True
)
flashinfer_scale = flashinfer_scale.view(torch.float8_e4m3fn)
# Compare outputs
torch.testing.assert_close(
vllm_fp4,
flashinfer_fp4,
)
print(f"M={M}, K={K}, dtype={dtype}: PASSED")
def test_accuracy():
"""Run accuracy tests across various shapes."""
print("\n" + "=" * 60)
print("Running accuracy tests: vLLM vs FlashInfer")
print("=" * 60)
device = "cuda"
dtype = torch.bfloat16
# Test various batch sizes and hidden dimensions
Ms = [1, 1024]
Ks = [4096]
for M in Ms:
for K in Ks:
_test_accuracy_once(M, K, dtype, device)
print("\nAll accuracy tests passed!")
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Benchmark NVFP4 quantization: vLLM vs FlashInfer"
)
parser.add_argument(
"--models",
nargs="+",
type=str,
default=["meta-llama/Llama-3.1-8B-Instruct"],
choices=list(WEIGHT_SHAPES.keys()),
)
parser.add_argument("--tp-sizes", nargs="+", type=int, default=[1])
parser.add_argument(
"--save-path",
type=str,
default=None,
help="Path to save benchmark results",
)
parser.add_argument(
"--accuracy",
action="store_true",
help="Run accuracy tests",
)
args = parser.parse_args()
if args.accuracy:
test_accuracy()
for K, N, model in prepare_shapes(args):
print(f"\n{model}, N={N} K={K}")
benchmark.run(
print_data=True,
save_path=args.save_path,
N=N,
K=K,
)
print("\nBenchmark finished!")

View File

@@ -8,10 +8,9 @@ import torch

 import vllm.model_executor.layers.activation  # noqa F401
 from vllm.model_executor.custom_op import CustomOp
-from vllm.platforms import current_platform
 from vllm.triton_utils import triton
 from vllm.utils.argparse_utils import FlexibleArgumentParser
-from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
+from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE, set_random_seed

 batch_size_range = [1, 16, 128]
 seq_len_range = [1, 16, 64, 1024, 4096]
@@ -30,7 +29,7 @@ def benchmark_activation(
     device = "cuda"
     num_tokens = batch_size * seq_len
     dim = intermediate_size
-    current_platform.seed_everything(42)
+    set_random_seed(42)
     torch.set_default_device(device)
     if func_name == "gelu_and_mul":
View File

@@ -6,15 +6,19 @@ kernel. Both kernels take in fp8 quantized weights and 16-bit activations,
 but use different quantization strategies and backends.
 """

+import nvtx
 import torch

+import vllm.model_executor.layers.fused_moe.modular_kernel as mk
 from vllm import _custom_ops as ops
 from vllm.model_executor.layers.fused_moe.config import fp8_w8a8_moe_quant_config
-from vllm.model_executor.layers.fused_moe.cutlass_moe import cutlass_moe_fp8
+from vllm.model_executor.layers.fused_moe.cutlass_moe import CutlassExpertsFp8
 from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts, fused_topk
+from vllm.model_executor.layers.fused_moe.prepare_finalize import (
+    MoEPrepareAndFinalizeNoEP,
+)
 from vllm.platforms import current_platform
 from vllm.utils.argparse_utils import FlexibleArgumentParser
+from vllm.v1.worker.workspace import init_workspace_manager

 # Weight shapes for different models: [num_experts, topk, hidden_size,
 # intermediate_size]
@@ -58,6 +62,7 @@ def bench_run(
     per_out_ch: bool,
     mkn: tuple[int, int, int],
 ):
+    init_workspace_manager(torch.cuda.current_device())
     (m, k, n) = mkn

     dtype = torch.half
@@ -120,85 +125,6 @@ def bench_run(
     # Force per-tensor quantization for all cases
     per_act_token = False

-    # Create stride tensors for CUTLASS
-    ab_strides1 = torch.full((num_experts,), k, dtype=torch.int64, device=device)
-    ab_strides2 = torch.full((num_experts,), n, dtype=torch.int64, device=device)
-    c_strides1 = torch.full((num_experts,), 2 * n, dtype=torch.int64, device=device)
-    c_strides2 = torch.full((num_experts,), k, dtype=torch.int64, device=device)
-
-    def run_triton_moe(
-        a: torch.Tensor,
-        w1: torch.Tensor,
-        w2: torch.Tensor,
-        topk_weights: torch.Tensor,
-        topk_ids: torch.Tensor,
-        w1_scale: torch.Tensor,
-        w2_scale: torch.Tensor,
-        a1_scale: torch.Tensor,
-        a2_scale: torch.Tensor,
-        num_repeats: int,
-    ):
-        quant_config = fp8_w8a8_moe_quant_config(
-            w1_scale=w1_scale,
-            w2_scale=w2_scale,
-            a1_scale=a1_scale,
-            a2_scale=a2_scale,
-            per_act_token_quant=per_act_token,
-            per_out_ch_quant=per_out_ch,
-        )
-
-        for _ in range(num_repeats):
-            fused_experts(
-                a,
-                w1,
-                w2,
-                topk_weights,
-                topk_ids,
-                quant_config=quant_config,
-            )
-
-    def run_cutlass_moe_fp8(
-        a: torch.Tensor,
-        w1: torch.Tensor,
-        w2: torch.Tensor,
-        topk_weights: torch.Tensor,
-        topk_ids: torch.Tensor,
-        ab_strides1: torch.Tensor,
-        ab_strides2: torch.Tensor,
-        c_strides1: torch.Tensor,
-        c_strides2: torch.Tensor,
-        w1_scale: torch.Tensor,
-        w2_scale: torch.Tensor,
-        a1_scale: torch.Tensor,
-        a2_scale: torch.Tensor,
-        num_repeats: int,
-    ):
-        quant_config = fp8_w8a8_moe_quant_config(
-            w1_scale=w1_scale,
-            w2_scale=w2_scale,
-            a1_scale=a1_scale,
-            a2_scale=a2_scale,
-            per_act_token_quant=per_act_token,
-            per_out_ch_quant=per_out_ch,
-        )
-
-        for _ in range(num_repeats):
-            with nvtx.annotate("cutlass_moe_fp8", color="blue"):
-                cutlass_moe_fp8(
-                    a=a,
-                    w1_q=w1,
-                    w2_q=w2,
-                    topk_weights=topk_weights,
-                    topk_ids=topk_ids,
-                    ab_strides1=ab_strides1,
-                    ab_strides2=ab_strides2,
-                    c_strides1=c_strides1,
-                    c_strides2=c_strides2,
-                    quant_config=quant_config,
-                    activation="silu",
-                    global_num_experts=num_experts,
-                )
-
     # Pre-create quantization config to avoid creating it inside CUDA graph
     quant_config = fp8_w8a8_moe_quant_config(
         w1_scale=w1_scale,
@@ -209,23 +135,30 @@ def bench_run(
         per_out_ch_quant=per_out_ch,
     )

+    fn = mk.FusedMoEModularKernel(
+        MoEPrepareAndFinalizeNoEP(),
+        CutlassExpertsFp8(
+            out_dtype=a.dtype,
+            e=num_experts,
+            n=n,
+            k=k,
+            quant_config=quant_config,
+            device=w1.device,
+        ),
+    )
+
     # Create CUDA graphs for CUTLASS (match benchmark_moe.py pattern exactly)
     cutlass_stream = torch.cuda.Stream()
     cutlass_graph = torch.cuda.CUDAGraph()
     with torch.cuda.graph(cutlass_graph, stream=cutlass_stream):
         # Capture 10 invocations like benchmark_moe.py
         for _ in range(10):
-            cutlass_moe_fp8(
-                a=a,
-                w1_q=w1_fp8q_cutlass,
-                w2_q=w2_fp8q_cutlass,
-                topk_weights=topk_weights,
-                topk_ids=topk_ids,
-                ab_strides1=ab_strides1,
-                ab_strides2=ab_strides2,
-                c_strides1=c_strides1,
-                c_strides2=c_strides2,
-                quant_config=quant_config,
+            fn(
+                a,
+                w1_fp8q_cutlass,
+                w2_fp8q_cutlass,
+                topk_weights,
+                topk_ids,
                 activation="silu",
                 global_num_experts=num_experts,
             )
@@ -297,6 +230,10 @@ def bench_run(

 def main(args):
+    # Initialize workspace manager (required for CUTLASS MoE kernels)
+    device = torch.device("cuda:0")
+    init_workspace_manager(device)
     print("Benchmarking models:")
     for i, model in enumerate(args.models):
         print(f"[{i}] {model}")

View File

@@ -11,16 +11,23 @@ import nvtx
 import torch
 import torch.utils.benchmark as benchmark

+import vllm.model_executor.layers.fused_moe.modular_kernel as mk
 from vllm import _custom_ops as ops
 from vllm.config import ParallelConfig, VllmConfig, set_current_vllm_config
 from vllm.model_executor.layers.fused_moe.config import (
     fp8_w8a8_moe_quant_config,
     nvfp4_moe_quant_config,
 )
-from vllm.model_executor.layers.fused_moe.cutlass_moe import cutlass_moe_fp4
+from vllm.model_executor.layers.fused_moe.cutlass_moe import (
+    CutlassExpertsFp4,
+)
 from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts, fused_topk
+from vllm.model_executor.layers.fused_moe.prepare_finalize import (
+    MoEPrepareAndFinalizeNoEP,
+)
 from vllm.scalar_type import scalar_types
 from vllm.utils.argparse_utils import FlexibleArgumentParser
+from vllm.v1.worker.workspace import init_workspace_manager

 WEIGHT_SHAPES_MOE = {
     "nvidia/DeepSeek-R1-FP4": [
@@ -187,19 +194,24 @@ def bench_run(
         g1_alphas=w1_gs,
         g2_alphas=w2_gs,
     )
+    kernel = mk.FusedMoEModularKernel(
+        MoEPrepareAndFinalizeNoEP(defer_input_quant=True),
+        CutlassExpertsFp4(
+            out_dtype=dtype,
+            max_experts_per_worker=e,
+            quant_config=quant_config,
+        ),
+    )
     for _ in range(num_repeats):
         with nvtx.annotate("cutlass_moe_fp4", color="green"):
-            cutlass_moe_fp4(
-                a=a,
-                w1_fp4=w1_fp4,
-                w2_fp4=w2_fp4,
+            kernel(
+                hidden_states=a,
+                w1=w1_fp4,
+                w2=w2_fp4,
                 topk_weights=topk_weights,
                 topk_ids=topk_ids,
-                m=m,
-                n=n,
-                k=k,
-                e=num_experts,
-                quant_config=quant_config,
             )

 def run_cutlass_from_graph(
@@ -229,20 +241,24 @@ def bench_run(
         g2_alphas=w2_gs,
     )
+    kernel = mk.FusedMoEModularKernel(
+        MoEPrepareAndFinalizeNoEP(defer_input_quant=True),
+        CutlassExpertsFp4(
+            out_dtype=dtype,
+            max_experts_per_worker=e,
+            quant_config=quant_config,
+        ),
+    )
     with set_current_vllm_config(
         VllmConfig(parallel_config=ParallelConfig(pipeline_parallel_size=1))
     ):
-        return cutlass_moe_fp4(
-            a=a,
-            w1_fp4=w1_fp4,
-            w2_fp4=w2_fp4,
+        return kernel(
+            hidden_states=a,
+            w1=w1_fp4,
+            w2=w2_fp4,
             topk_weights=topk_weights,
             topk_ids=topk_ids,
-            m=m,
-            n=n,
-            k=k,
-            e=num_experts,
-            quant_config=quant_config,
         )

 def run_triton_from_graph(
@@ -441,6 +457,10 @@ def bench_run(

 def main(args):
+    # Initialize workspace manager (required for CUTLASS MoE kernels)
+    device = torch.device("cuda:0")
+    init_workspace_manager(device)
     print("Benchmarking models:")
     for i, model in enumerate(args.models):
         print(f"[{i}] {model}")

View File

@@ -293,7 +293,7 @@ class CommunicatorBenchmark:
         graph = torch.cuda.CUDAGraph()
         graph_pool = torch.cuda.graph_pool_handle()
         set_graph_pool_id(graph_pool)
-        with torch.cuda.graph(graph, pool=graph_pool):
+        with torch.cuda.graph(graph, pool=graph_pool, stream=stream):
             for _ in range(CUDA_GRAPH_CAPTURE_CYCLES):
                 allreduce_fn(graph_input)

View File

@@ -5,15 +5,20 @@ import torch
 import torch.utils.benchmark as benchmark
 from benchmark_shapes import WEIGHT_SHAPES_MOE

+import vllm.model_executor.layers.fused_moe.modular_kernel as mk
 from vllm import _custom_ops as ops
 from vllm.config import ParallelConfig, VllmConfig, set_current_vllm_config
 from vllm.model_executor.layers.fused_moe.config import fp8_w8a8_moe_quant_config
-from vllm.model_executor.layers.fused_moe.cutlass_moe import cutlass_moe_fp8
+from vllm.model_executor.layers.fused_moe.cutlass_moe import CutlassExpertsFp8
 from vllm.model_executor.layers.fused_moe.fused_moe import (
     fused_experts,
     fused_topk,
 )
+from vllm.model_executor.layers.fused_moe.prepare_finalize import (
+    MoEPrepareAndFinalizeNoEP,
+)
 from vllm.utils.argparse_utils import FlexibleArgumentParser
+from vllm.v1.worker.workspace import init_workspace_manager

 DEFAULT_MODELS = [
     "mistralai/Mixtral-8x7B-Instruct-v0.1",
@@ -44,6 +49,7 @@ def bench_run(
     per_out_ch: bool,
     mkn: tuple[int, int, int],
 ):
+    init_workspace_manager(torch.cuda.current_device())
     label = "Quant Matmul"

     sub_label = (
@@ -81,11 +87,6 @@ def bench_run(
         a, score, topk, renormalize=False
     )

-    ab_strides1 = torch.full((num_experts,), k, device="cuda", dtype=torch.int64)
-    ab_strides2 = torch.full((num_experts,), n, device="cuda", dtype=torch.int64)
-    c_strides1 = torch.full((num_experts,), 2 * n, device="cuda", dtype=torch.int64)
-    c_strides2 = torch.full((num_experts,), k, device="cuda", dtype=torch.int64)
-
     def run_triton_moe(
         a: torch.Tensor,
         w1: torch.Tensor,
@@ -119,10 +120,6 @@ def bench_run(
         w2: torch.Tensor,
         w1_scale: torch.Tensor,
         w2_scale: torch.Tensor,
-        ab_strides1: torch.Tensor,
-        ab_strides2: torch.Tensor,
-        c_strides1: torch.Tensor,
-        c_strides2: torch.Tensor,
         topk_weights: torch.Tensor,
         topk_ids: torch.Tensor,
         per_act_token: bool,
@@ -134,31 +131,29 @@ def bench_run(
             per_act_token_quant=per_act_token,
         )

-        for _ in range(num_repeats):
-            cutlass_moe_fp8(
-                a,
-                w1,
-                w2,
-                topk_weights,
-                topk_ids,
-                ab_strides1,
-                ab_strides2,
-                c_strides1,
-                c_strides2,
-                quant_config=quant_config,
-            )
+        fn = mk.FusedMoEModularKernel(
+            MoEPrepareAndFinalizeNoEP(),
+            CutlassExpertsFp8(
+                out_dtype=a.dtype,
+                # NOTE(rob): w2 is shaped as [E, hidden, intermediate]
+                e=w2.shape[0],
+                n=w2.shape[2],
+                k=w2.shape[1],
+                quant_config=quant_config,
+                device=w1.device,
+            ),
+        )
+        for _ in range(num_repeats):
+            fn(a, w1, w2, topk_weights, topk_ids)

     def run_cutlass_from_graph(
         a: torch.Tensor,
         a_scale: torch.Tensor,
-        w1_q: torch.Tensor,
-        w2_q: torch.Tensor,
+        w1: torch.Tensor,
+        w2: torch.Tensor,
         w1_scale: torch.Tensor,
         w2_scale: torch.Tensor,
-        ab_strides1: torch.Tensor,
-        ab_strides2: torch.Tensor,
-        c_strides1: torch.Tensor,
-        c_strides2: torch.Tensor,
         topk_weights: torch.Tensor,
         topk_ids: torch.Tensor,
     ):
@@ -168,21 +163,23 @@ def bench_run(
             per_act_token_quant=per_act_token,
         )

+        fn = mk.FusedMoEModularKernel(
+            MoEPrepareAndFinalizeNoEP(),
+            CutlassExpertsFp8(
+                out_dtype=a.dtype,
+                # NOTE(rob): w2 is shaped as [E, hidden, intermediate]
+                e=w2.shape[0],
+                n=w2.shape[2],
+                k=w2.shape[1],
+                quant_config=quant_config,
+                device=w1.device,
+            ),
+        )
         with set_current_vllm_config(
             VllmConfig(parallel_config=ParallelConfig(pipeline_parallel_size=1))
         ):
-            return cutlass_moe_fp8(
-                a,
-                w1_q,
-                w2_q,
-                topk_weights,
-                topk_ids,
-                ab_strides1,
-                ab_strides2,
-                c_strides1,
-                c_strides2,
-                quant_config=quant_config,
-            )
+            return fn(a, w1, w2, topk_weights, topk_ids)

     def run_triton_from_graph(
         a: torch.Tensor,
@@ -226,10 +223,6 @@ def bench_run(
             w2_q,
             w1_scale,
             w2_scale,
-            ab_strides1,
-            ab_strides2,
-            c_strides1,
-            c_strides2,
             topk_weights,
             topk_ids,
         )
@@ -267,10 +260,6 @@ def bench_run(
         "w1_scale": w1_scale,
         "w2_scale": w2_scale,
         "per_act_token": per_act_token,
-        "ab_strides1": ab_strides1,
-        "ab_strides2": ab_strides2,
-        "c_strides1": c_strides1,
-        "c_strides2": c_strides2,
         # cuda graph params
         "cutlass_graph": cutlass_graph,
         "triton_graph": triton_graph,
@@ -329,10 +318,6 @@ def bench_run(
             w2_q,
             w1_scale,
             w2_scale,
-            ab_strides1,
-            ab_strides2,
-            c_strides1,
-            c_strides2,
             topk_weights,
             topk_ids,
             per_act_token,
@@ -341,7 +326,7 @@ def bench_run(
     results.append(
         benchmark.Timer(
-            stmt="run_cutlass_moe(a, a_scale, w1_q, w2_q, w1_scale, w2_scale, ab_strides1, ab_strides2, c_strides1, c_strides2, topk_weights, topk_ids, per_act_token, num_runs)",  # noqa: E501
+            stmt="run_cutlass_moe(a, a_scale, w1_q, w2_q, w1_scale, w2_scale, topk_weights, topk_ids, per_act_token, num_runs)",  # noqa: E501
             globals=globals,
             label=label,
             sub_label=sub_label,
@@ -364,6 +349,10 @@ def bench_run(

 def main(args):
+    # Initialize workspace manager (required for CUTLASS MoE kernels)
+    device = torch.device("cuda:0")
+    init_workspace_manager(device)
     print("Benchmarking models:")
     for i, model in enumerate(args.models):
         print(f"[{i}] {model}")
print(f"[{i}] {model}") print(f"[{i}] {model}")

View File

@@ -6,9 +6,8 @@ import time
 import torch

 from vllm.model_executor.layers.layernorm import RMSNorm
-from vllm.platforms import current_platform
 from vllm.utils.argparse_utils import FlexibleArgumentParser
-from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
+from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE, set_random_seed

 @torch.inference_mode()
@@ -22,7 +21,7 @@ def main(
     num_warmup_iters: int = 5,
     num_iters: int = 100,
 ) -> None:
-    current_platform.seed_everything(seed)
+    set_random_seed(seed)
     torch.set_default_device("cuda")

     layer = RMSNorm(hidden_size).to(dtype=dtype)

View File

@@ -2,6 +2,7 @@
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project

 import argparse
+import gc
 import json
 import os
 import time
@@ -23,9 +24,48 @@ from vllm.platforms import current_platform
 from vllm.transformers_utils.config import get_config
 from vllm.triton_utils import triton
 from vllm.utils.argparse_utils import FlexibleArgumentParser
+from vllm.utils.torch_utils import set_random_seed

 FP8_DTYPE = current_platform.fp8_dtype()

+# Default interval for clearing Triton JIT cache during tuning
+# Set to 0 to disable automatic cache clearing
+_CACHE_CLEAR_INTERVAL_ENV = "VLLM_MOE_TUNE_CACHE_CLEAR_INTERVAL"
+TRITON_CACHE_CLEAR_INTERVAL = int(os.environ.get(_CACHE_CLEAR_INTERVAL_ENV, "50"))
+
+
+def clear_triton_cache():
+    """Clear Triton JIT compilation cache and Python/CUDA memory.
+
+    This helps prevent OOM during tuning with large models (many experts).
+    """
+    # Force Python garbage collection
+    gc.collect()
+
+    # Clear CUDA memory cache
+    if torch.cuda.is_available():
+        torch.cuda.empty_cache()
+
+    # Try to clear Triton's runtime cache
+    try:
+        if (
+            hasattr(triton, "runtime")
+            and hasattr(triton.runtime, "cache")
+            and hasattr(triton.runtime.cache, "clear")
+        ):
+            triton.runtime.cache.clear()
+    except ImportError:
+        # Triton not installed, skip cache clearing
+        pass
+    except AttributeError:
+        # Triton version doesn't have expected cache API
+        pass
+    except Exception as e:
+        print(f"Warning: Failed to clear Triton cache: {e}")
+
+    # Additional garbage collection after clearing caches
+    gc.collect()
+

 def ensure_divisibility(numerator, denominator, text):
     """Ensure that numerator is divisible by the denominator."""
@@ -390,7 +430,7 @@ def merge_unique_dicts(list1, list2):
 class BenchmarkWorker:
     def __init__(self, seed: int) -> None:
         torch.set_default_device("cuda")
-        current_platform.seed_everything(seed)
+        set_random_seed(seed)
         self.seed = seed
         # Get the device ID to allocate tensors and kernels
         # on the respective GPU. This is required for Ray to work
@@ -410,7 +450,7 @@ class BenchmarkWorker:
         block_quant_shape: list[int] = None,
         use_deep_gemm: bool = False,
     ) -> tuple[dict[str, int], float]:
-        current_platform.seed_everything(self.seed)
+        set_random_seed(self.seed)
         dtype_str = _get_config_dtype_str(
             dtype, use_int8_w8a16=use_int8_w8a16, use_fp8_w8a8=use_fp8_w8a8
         )
@@ -483,7 +523,7 @@ class BenchmarkWorker:
         need_device_guard = True

         with torch.cuda.device(self.device_id) if need_device_guard else nullcontext():
-            for config in tqdm(search_space):
+            for idx, config in enumerate(tqdm(search_space)):
                 try:
                     kernel_time = benchmark_config(
                         config,
@@ -506,6 +546,19 @@ class BenchmarkWorker:
                 if kernel_time < best_time:
                     best_time = kernel_time
                     best_config = config
+
+                # Periodically clear Triton JIT cache to prevent OOM
+                # This is especially important for large models with many experts
+                if (
+                    TRITON_CACHE_CLEAR_INTERVAL > 0
+                    and idx > 0
+                    and idx % TRITON_CACHE_CLEAR_INTERVAL == 0
+                ):
+                    clear_triton_cache()
+
+        # Final cleanup after tuning completes
+        clear_triton_cache()
+
         now = datetime.now()
         print(f"{now.ctime()}] Completed tuning for batch_size={num_tokens}")
         assert best_config is not None
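The tuning loop above clears the Triton JIT cache every `TRITON_CACHE_CLEAR_INTERVAL` configs (skipping index 0, and disabled entirely when the interval is 0). A minimal sketch of just that guard, with a recording stub standing in for `clear_triton_cache()` so it runs without a GPU or Triton:

```python
# Sketch of the periodic-clear guard used in the tuning loop above.
# `interval` plays the role of TRITON_CACHE_CLEAR_INTERVAL; the return
# value records which loop indices would call clear_triton_cache().
def clear_indices(num_configs: int, interval: int) -> list[int]:
    cleared = []
    for idx in range(num_configs):
        if interval > 0 and idx > 0 and idx % interval == 0:
            cleared.append(idx)
    return cleared


print(clear_indices(120, 50))  # clears at idx 50 and 100
print(clear_indices(120, 0))   # interval 0 disables clearing
```

Note that idx 0 never triggers a clear, so the first cleanup happens only after a full interval of configs has been benchmarked.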

View File

@@ -18,6 +18,7 @@ from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import (
 from vllm.model_executor.layers.fused_moe.utils import _fp8_quantize
 from vllm.platforms import current_platform
 from vllm.utils.argparse_utils import FlexibleArgumentParser
+from vllm.utils.torch_utils import set_random_seed

 FP8_DTYPE = current_platform.fp8_dtype()
@@ -261,7 +262,7 @@ def benchmark_unpermute(
 class BenchmarkWorker:
     def __init__(self, seed: int) -> None:
         torch.set_default_device("cuda")
-        current_platform.seed_everything(seed)
+        set_random_seed(seed)
         self.seed = seed
         # Get the device ID to allocate tensors and kernels
         # on the respective GPU. This is required for Ray to work
@@ -279,7 +280,7 @@ class BenchmarkWorker:
         use_int8_w8a16: bool,
         use_customized_permute: bool = False,
     ) -> tuple[dict[str, int], float]:
-        current_platform.seed_everything(self.seed)
+        set_random_seed(self.seed)
         permute_time = benchmark_permute(
             num_tokens,

View File

@@ -37,9 +37,9 @@ import numpy as np
 import torch

 from vllm.model_executor.layers.rotary_embedding import get_rope
-from vllm.platforms import current_platform
 from vllm.transformers_utils.config import get_config
 from vllm.utils.argparse_utils import FlexibleArgumentParser
+from vllm.utils.torch_utils import set_random_seed

 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
@@ -94,7 +94,7 @@ def benchmark_mrope(
     benchmark_iter: int = 100,
     csv_writer=None,
 ):
-    current_platform.seed_everything(seed)
+    set_random_seed(seed)
     torch.set_default_device(device)
     # the parameters to compute the q k v size based on tp_size
     mrope_helper_class = get_rope(

View File

@@ -13,6 +13,7 @@ from vllm.utils.argparse_utils import FlexibleArgumentParser
 from vllm.utils.torch_utils import (
     STR_DTYPE_TO_TORCH_DTYPE,
     create_kv_caches_with_random,
+    set_random_seed,
 )

 logger = init_logger(__name__)
@@ -38,7 +39,7 @@ def main(
     device: str = "cuda",
     kv_cache_dtype: str | None = None,
 ) -> None:
-    current_platform.seed_everything(seed)
+    set_random_seed(seed)
     scale = float(1.0 / (head_size**0.5))
     query = torch.empty(

View File

@@ -6,9 +6,8 @@ import time
 import torch

 from vllm import _custom_ops as ops
-from vllm.platforms import current_platform
 from vllm.utils.argparse_utils import FlexibleArgumentParser
-from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
+from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE, set_random_seed

 @torch.inference_mode()
@@ -23,7 +22,7 @@ def main(
     num_warmup_iters: int = 5,
     num_iters: int = 100,
 ) -> None:
-    current_platform.seed_everything(seed)
+    set_random_seed(seed)
     torch.set_default_device("cuda")

     x = torch.randn(num_tokens, hidden_size, dtype=dtype)

View File

@@ -8,11 +8,11 @@ from tabulate import tabulate

 from vllm import _custom_ops as ops
 from vllm.logger import init_logger
-from vllm.platforms import current_platform
 from vllm.utils.argparse_utils import FlexibleArgumentParser
 from vllm.utils.torch_utils import (
     STR_DTYPE_TO_TORCH_DTYPE,
     create_kv_caches_with_random,
+    set_random_seed,
 )

 logger = init_logger(__name__)
@@ -36,7 +36,7 @@ def run_benchmark(
     if kv_cache_dtype == "fp8" and head_size % 16:
         raise ValueError("fp8 kv-cache requires head_size to be a multiple of 16.")

-    current_platform.seed_everything(42)
+    set_random_seed(42)
     torch.set_default_device(device)

     # create random key / value tensors [T, H, D].
# create random key / value tensors [T, H, D]. # create random key / value tensors [T, H, D].

View File

@@ -7,15 +7,15 @@ import torch
 from tabulate import tabulate

 from vllm import _custom_ops as ops
-from vllm.attention.ops.triton_reshape_and_cache_flash import (
-    triton_reshape_and_cache_flash,
-)
 from vllm.logger import init_logger
-from vllm.platforms import current_platform
 from vllm.utils.argparse_utils import FlexibleArgumentParser
 from vllm.utils.torch_utils import (
     STR_DTYPE_TO_TORCH_DTYPE,
     create_kv_caches_with_random_flash,
+    set_random_seed,
+)
+from vllm.v1.attention.ops.triton_reshape_and_cache_flash import (
+    triton_reshape_and_cache_flash,
 )

 logger = init_logger(__name__)
@@ -49,7 +49,7 @@ def run_benchmark(
     if implementation == "triton" and kv_cache_layout == "HND":
         return float("nan")  # Triton does not support HND layout yet.

-    current_platform.seed_everything(42)
+    set_random_seed(42)
     torch.set_default_device(device)

     # create random key / value tensors [T, H, D].
# create random key / value tensors [T, H, D]. # create random key / value tensors [T, H, D].

View File

@@ -23,9 +23,9 @@ import torch
 from vllm.model_executor.layers.fused_moe.batched_deep_gemm_moe import (
     persistent_masked_m_silu_mul_quant,
 )
-from vllm.platforms import current_platform
 from vllm.triton_utils import tl, triton
 from vllm.utils.deep_gemm import is_deep_gemm_e8m0_used
+from vllm.utils.torch_utils import set_random_seed

 @triton.jit
@@ -207,7 +207,7 @@ def benchmark(
 ):
     def generate_data(seed_offset=0):
         """Generate input data with given seed offset"""
-        current_platform.seed_everything(42 + seed_offset)
+        set_random_seed(42 + seed_offset)
         y = torch.rand((E, T, 2 * H), dtype=torch.bfloat16, device="cuda").contiguous()

         if gen_strategy == "random_imbalanced":

View File

@@ -0,0 +1,272 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import functools
import time
import numpy as np
import torch
from vllm._custom_ops import (
cpu_attention_with_kv_cache,
cpu_attn_get_scheduler_metadata,
cpu_attn_reshape_and_cache,
)
from vllm.platforms import CpuArchEnum, current_platform
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
from vllm.v1.attention.backends.cpu_attn import CPUAttentionBackend, _get_attn_isa
def get_attn_isa(
block_size: int | None = None,
dtype: torch.dtype | None = None,
):
if block_size and dtype:
return _get_attn_isa(dtype, block_size)
else:
if current_platform.get_cpu_architecture() == CpuArchEnum.ARM:
return "neon"
elif torch._C._cpu._is_amx_tile_supported():
return "amx"
else:
return "vec"
# rand number generation takes too much time, cache rand tensors
@functools.lru_cache(maxsize=128, typed=False)
def tensor_cache(
elem_num: int,
dtype: torch.dtype,
) -> torch.Tensor:
tensor = torch.randn(elem_num, dtype=dtype)
return tensor
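Because `tensor_cache` is memoized on `(elem_num, dtype)`, repeated calls with the same arguments return the very same tensor object instead of regenerating random data. The identity behavior can be shown with a plain-Python analogue of the `lru_cache` pattern (no torch required):

```python
import functools


# Plain-Python analogue of tensor_cache above: the expensive "generation"
# step is memoized, so equal arguments return the very same object.
@functools.lru_cache(maxsize=128, typed=False)
def data_cache(elem_num: int) -> list[float]:
    # stands in for torch.randn(elem_num, dtype=dtype)
    return [0.0] * elem_num


a = data_cache(1024)
b = data_cache(1024)
print(a is b)  # True: the second call is a cache hit, no regeneration
```

One consequence worth noting: every caller with the same shape shares the same underlying storage (the benchmark's `.view(...)` calls do not copy), which is fine for read-only benchmark inputs but would be unsafe if callers mutated them.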
@torch.inference_mode()
def main(
seq_lens: list[tuple[int, int]],
num_heads: tuple[int, int],
head_size: int,
sliding_window: int | None = None,
dtype: torch.dtype = torch.bfloat16,
block_size: int = 128,
num_blocks: int = 4096,
use_sink: bool = False,
enable_kv_split: bool = False,
isa: str | None = None,
seed: int = 0,
iters: int = 20,
) -> None:
current_platform.seed_everything(seed)
num_seqs = len(seq_lens)
query_lens = [x[0] for x in seq_lens]
kv_lens = [x[1] for x in seq_lens]
num_query_heads = num_heads[0]
num_kv_heads = num_heads[1]
assert num_query_heads % num_kv_heads == 0
max_kv_len = max(kv_lens)
window_size = (sliding_window - 1, 0) if sliding_window is not None else (-1, -1)
scale = head_size**-0.5
token_num = sum(query_lens)
if isa is None:
isa = get_attn_isa(block_size, dtype)
s_aux = (
15 * torch.rand((num_query_heads,), dtype=torch.bfloat16) if use_sink else None
)
query = tensor_cache(
elem_num=token_num * num_query_heads * head_size,
dtype=dtype,
)
query = query.view(
token_num,
num_query_heads,
head_size,
)
key_value = tensor_cache(
elem_num=2 * num_blocks * num_kv_heads * block_size * head_size,
dtype=dtype,
)
key_value = key_value.view(
2,
num_blocks,
block_size,
num_kv_heads,
head_size,
)
key_cache, value_cache = key_value.unbind(0)
# KV cache for CPU attention
packed_key_cache = torch.empty(
num_blocks, num_kv_heads, block_size, head_size, dtype=dtype
)
packed_value_cache = torch.empty_like(packed_key_cache)
cu_query_lens = torch.tensor([0] + query_lens, dtype=torch.int32).cumsum(
dim=0, dtype=torch.int32
)
kv_lens_tensor = torch.tensor(kv_lens, dtype=torch.int32)
max_num_blocks_per_seq = (max_kv_len + block_size - 1) // block_size
block_tables = torch.randint(
0, num_blocks, (num_seqs, max_num_blocks_per_seq), dtype=torch.int32
)
# use reshape_and_cache to pack key_cache and value_cache
slot_mapping = torch.arange(0, num_blocks * block_size, dtype=torch.int64)
cpu_attn_reshape_and_cache(
key=key_cache.view(-1, num_kv_heads, head_size),
value=value_cache.view(-1, num_kv_heads, head_size),
key_cache=packed_key_cache,
value_cache=packed_value_cache,
slot_mapping=slot_mapping,
isa=isa,
)
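The identity mapping `slot_mapping = arange(num_blocks * block_size)` above packs every cache slot back-to-back: slot `i` lands in block `i // block_size` at offset `i % block_size`. A quick sketch of that index arithmetic (the `slot_to_block` helper is illustrative, not a vLLM API):

```python
def slot_to_block(slot: int, block_size: int) -> tuple:
    # A flat slot index encodes (block index, offset within block).
    return slot // block_size, slot % block_size

block_size = 128
# With the identity slot_mapping, token i is written to block i // 128
# at entry i % 128, i.e. the caches are filled contiguously.
assert slot_to_block(0, block_size) == (0, 0)
assert slot_to_block(127, block_size) == (0, 127)
assert slot_to_block(128, block_size) == (1, 0)
assert slot_to_block(5000, block_size) == (39, 8)
```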
    metadata = cpu_attn_get_scheduler_metadata(
        num_reqs=num_seqs,
        num_heads=num_query_heads,
        num_kv_heads=num_kv_heads,
        head_dim=head_size,
        seq_lens=kv_lens_tensor,
        dtype=dtype,
        query_start_loc=cu_query_lens,
        causal=True,
        sliding_window_size=sliding_window if sliding_window is not None else -1,
        isa=isa,
        enable_kv_split=enable_kv_split,
    )

    out_with_split = torch.empty_like(query)

    def run_benchmark(iters: int) -> list[float]:
        times = []
        for _ in range(iters):
            start_time = time.perf_counter_ns()
            cpu_attention_with_kv_cache(
                query=query,
                key_cache=packed_key_cache,
                value_cache=packed_value_cache,
                output=out_with_split,
                query_start_loc=cu_query_lens,
                seq_lens=kv_lens_tensor,
                scale=scale,
                causal=True,
                alibi_slopes=None,
                sliding_window=window_size,
                block_table=block_tables,
                softcap=0,
                scheduler_metadata=metadata,
                s_aux=s_aux,
            )
            end_time = time.perf_counter_ns()
            times.append((end_time - start_time) / 1e6)
        return times

    # warmup
    run_benchmark(5)
    # benchmark
    times = run_benchmark(iters)
    time_min = min(times)
    time_max = max(times)
    time_mean = np.mean(times)
    time_std = np.std(times)
    print("\tmin (ms) = ", time_min)
    print("\tmax (ms) = ", time_max)
    print("\tmean (ms) = ", time_mean)
    print("\tstd = ", time_std)
    print("\tmedian (ms) = ", np.median(times))
def generate_seq_lens(
    batch_size: int,
    q_len_min: int,
    q_len_max: int,
    kv_len_min: int,
    kv_len_max: int,
    seed: int = 0,
) -> list[tuple[int, int]]:
    assert 1 <= q_len_min <= q_len_max
    assert 1 <= kv_len_min <= kv_len_max
    assert kv_len_max >= q_len_min
    g = torch.Generator(device="cpu").manual_seed(seed)

    def rint(lo: int, hi: int) -> int:
        return torch.randint(lo, hi + 1, (1,), generator=g).item()

    seq_lens: list[tuple[int, int]] = []
    for _ in range(batch_size):
        # ensure q <= kv
        kv = rint(max(kv_len_min, q_len_min), kv_len_max)
        q = rint(q_len_min, min(q_len_max, kv))
        seq_lens.append((q, kv))
    return seq_lens
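`generate_seq_lens` keeps the `q <= kv` invariant by sampling `kv` first and then clamping the upper bound for `q`. The same clamping logic can be exercised torch-free with the stdlib (this standalone re-implementation is a sketch for checking the invariant, not the benchmark's own code):

```python
import random

def generate_seq_lens(batch_size, q_min, q_max, kv_min, kv_max, seed=0):
    # Same scheme as above: draw kv, then draw q with its upper bound clamped to kv.
    g = random.Random(seed)
    out = []
    for _ in range(batch_size):
        kv = g.randint(max(kv_min, q_min), kv_max)
        q = g.randint(q_min, min(q_max, kv))
        out.append((q, kv))
    return out

pairs = generate_seq_lens(1000, 1, 512, 1, 512)
assert all(1 <= q <= kv <= 512 for q, kv in pairs)
```

Seeding the generator makes batches reproducible across runs, which matters when comparing kernel variants on identical workloads.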
if __name__ == "__main__":
    parser = FlexibleArgumentParser(description="Benchmark the paged attention kernel.")
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--q-len-min", type=int, default=512)
    parser.add_argument("--q-len-max", type=int, default=512)
    parser.add_argument("--kv-len-min", type=int, default=512)
    parser.add_argument("--kv-len-max", type=int, default=512)
    parser.add_argument("--num-blocks", type=int, default=4096)
    parser.add_argument("--sliding-window", type=int, default=None)
    parser.add_argument("--num-query-heads", type=int, default=32)
    parser.add_argument("--num-kv-heads", type=int, default=8)
    parser.add_argument(
        "--head-size",
        type=int,
        choices=CPUAttentionBackend.get_supported_head_sizes(),
        default=128,
    )
    parser.add_argument("--enable-kv-split", action="store_true")
    parser.add_argument("--block-size", type=int, choices=[32, 64, 128], default=128)
    parser.add_argument(
        "--dtype", type=str, choices=["half", "bfloat16", "float"], default="bfloat16"
    )
    parser.add_argument("--use-sink", action="store_true")
    parser.add_argument(
        "--isa", type=str, choices=["vec", "neon", "amx", "vec16"], default=None
    )
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--iters", type=int, default=20)
    args = parser.parse_args()
    print(args)

    seq_lens = generate_seq_lens(
        args.batch_size,
        args.q_len_min,
        args.q_len_max,
        args.kv_len_min,
        args.kv_len_max,
        args.seed,
    )
    print("batch (query len, kv len) = ", seq_lens)

    main(
        seq_lens=seq_lens,
        num_heads=(args.num_query_heads, args.num_kv_heads),
        head_size=args.head_size,
        sliding_window=args.sliding_window,
        dtype=STR_DTYPE_TO_TORCH_DTYPE[args.dtype],
        block_size=args.block_size,
        num_blocks=args.num_blocks,
        use_sink=args.use_sink,
        enable_kv_split=args.enable_kv_split,
        isa=args.isa
        if args.isa is not None
        else get_attn_isa(args.block_size, STR_DTYPE_TO_TORCH_DTYPE[args.dtype]),
        seed=args.seed,
        iters=args.iters,
    )


@@ -0,0 +1,175 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import sys
import time

import numpy as np
import torch

from vllm.platforms import current_platform
from vllm.utils.argparse_utils import FlexibleArgumentParser

# Check if CPU MoE operations are available
try:
    from vllm._custom_ops import cpu_fused_moe, cpu_prepack_moe_weight
except (ImportError, AttributeError) as e:
    print("ERROR: CPU fused MoE operations are not available on this platform.")
    print("This benchmark requires x86 CPU with proper vLLM CPU extensions compiled.")
    print(
        "The cpu_fused_moe kernel is typically available on Linux x86_64 "
        "with AVX2/AVX512."
    )
    print(f"Import error: {e}")
    sys.exit(1)

# ISA selection following test_cpu_fused_moe.py pattern
ISA_CHOICES = ["amx", "vec"] if torch._C._cpu._is_amx_tile_supported() else ["vec"]
@torch.inference_mode()
def main(
    batch_size: int,
    expert_num: int,
    hidden_size: int,
    intermediate_size: int,
    topk_num: int,
    use_bias: bool = False,
    dtype: torch.dtype = torch.bfloat16,
    activation: str = "silu",
    isa: str = "vec",
    seed: int = 0,
    iters: int = 20,
) -> None:
    current_platform.seed_everything(seed)

    # up_dim = 2 * intermediate_size for gate + up projection
    up_dim = 2 * intermediate_size
    input_tensor = torch.randn((batch_size, hidden_size), dtype=dtype) / (
        0.5 * hidden_size**0.5
    )
    w13 = torch.randn((expert_num, up_dim, hidden_size), dtype=dtype) / (
        0.5 * hidden_size**0.5
    )
    w2 = torch.randn((expert_num, hidden_size, intermediate_size), dtype=dtype) / (
        0.5 * intermediate_size**0.5
    )
    w13_bias = None
    w2_bias = None
    if use_bias:
        w13_bias = torch.randn((expert_num, up_dim), dtype=dtype) / (0.5 * up_dim**0.5)
        w2_bias = torch.randn((expert_num, hidden_size), dtype=dtype) / (
            0.5 * hidden_size**0.5
        )

    router_logits = torch.randn((batch_size, expert_num), dtype=dtype)
    score = torch.softmax(router_logits, dim=-1, dtype=torch.float32)
    topk_weights, topk_ids = torch.topk(score, topk_num)
    topk_ids = topk_ids.to(torch.int32)

    packed_w13 = cpu_prepack_moe_weight(w13, isa)
    packed_w2 = cpu_prepack_moe_weight(w2, isa)

    def run_benchmark(iters: int) -> list[float]:
        times = []
        for _ in range(iters):
            start_time = time.perf_counter_ns()
            _ = cpu_fused_moe(
                input_tensor,
                packed_w13,
                packed_w2,
                w13_bias,
                w2_bias,
                topk_weights,
                topk_ids,
                activation,
                isa,
            )
            end_time = time.perf_counter_ns()
            times.append((end_time - start_time) / 1e6)
        return times

    # warmup
    run_benchmark(5)
    # benchmark
    times = run_benchmark(iters)
    if not times:
        print("No iterations to measure. Set --iters > 0.")
        return
    time_min = min(times)
    time_max = max(times)
    time_mean = np.mean(times)
    time_std = np.std(times)
    print("\tmin (ms) = ", time_min)
    print("\tmax (ms) = ", time_max)
    print("\tmean (ms) = ", time_mean)
    print("\tstd = ", time_std)
    print("\tmedian (ms) = ", np.median(times))

    # Calculate throughput metrics
    # FLOPs estimation: 2 * batch * topk * (hidden * up_dim + intermediate * hidden)
    flops_per_token = (
        2 * topk_num * (hidden_size * up_dim + intermediate_size * hidden_size)
    )
    total_flops = batch_size * flops_per_token
    tflops = total_flops / (time_mean * 1e-3) / 1e12
    print(f"\tthroughput (TFLOP/s) = {tflops:.4f}")
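The FLOP estimate above counts 2 FLOPs per multiply-accumulate across the two GEMMs each token performs per routed expert: `hidden x up_dim` for the gate/up projection and `intermediate x hidden` for the down projection. A standalone check of that arithmetic for the benchmark's default shapes (the helper name is illustrative):

```python
def moe_flops_per_token(hidden: int, intermediate: int, topk: int) -> int:
    # 2 FLOPs per MAC; w13 is hidden x (2*intermediate), w2 is intermediate x hidden.
    up_dim = 2 * intermediate
    return 2 * topk * (hidden * up_dim + intermediate * hidden)

# Default shapes: hidden=2880, intermediate=2880, topk = expert_num // 2 = 4.
f = moe_flops_per_token(2880, 2880, 4)
assert f == 2 * 4 * (2880 * 5760 + 2880 * 2880)
assert f == 199_065_600
```

Note this counts only the expert GEMMs; routing softmax, top-k, and activation costs are deliberately excluded, so the reported TFLOP/s slightly understates total work.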
if __name__ == "__main__":
    parser = FlexibleArgumentParser(description="Benchmark the CPU fused MoE kernel.")
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--expert-num", type=int, default=8)
    parser.add_argument("--hidden-size", type=int, default=2880)
    parser.add_argument("--intermediate-size", type=int, default=2880)
    parser.add_argument(
        "--topk-num",
        type=int,
        default=None,
        help="Number of experts to route each token to (default: expert_num // 2)",
    )
    parser.add_argument("--use-bias", action="store_true")
    parser.add_argument(
        "--activation",
        type=str,
        choices=["silu", "swigluoai"],
        default="silu",
        help="Activation function",
    )
    parser.add_argument(
        "--isa",
        type=str,
        choices=ISA_CHOICES,
        default=ISA_CHOICES[0],
        help=f"ISA to use (available: {ISA_CHOICES})",
    )
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--iters", type=int, default=20)
    args = parser.parse_args()

    # Default topk_num to expert_num // 2, minimum 1
    topk_num = (
        args.topk_num if args.topk_num is not None else max(args.expert_num // 2, 1)
    )
    print(args)

    main(
        batch_size=args.batch_size,
        expert_num=args.expert_num,
        hidden_size=args.hidden_size,
        intermediate_size=args.intermediate_size,
        topk_num=topk_num,
        use_bias=args.use_bias,
        dtype=torch.bfloat16,  # Following test_cpu_fused_moe.py
        activation=args.activation,
        isa=args.isa,
        seed=args.seed,
        iters=args.iters,
    )


@@ -31,10 +31,15 @@ if(NOT qutlass_SOURCE_DIR)
 endif()

 message(STATUS "[QUTLASS] QuTLASS is available at ${qutlass_SOURCE_DIR}")

-cuda_archs_loose_intersection(QUTLASS_ARCHS "12.0a;10.0a" "${CUDA_ARCHS}")
-if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND QUTLASS_ARCHS)
-  if(QUTLASS_ARCHS MATCHES "10\\.0a")
+if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
+  cuda_archs_loose_intersection(QUTLASS_ARCHS "12.0a;10.0f" "${CUDA_ARCHS}")
+else()
+  cuda_archs_loose_intersection(QUTLASS_ARCHS "12.0a;10.0a;10.3a" "${CUDA_ARCHS}")
+endif()
+if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND QUTLASS_ARCHS)
+  if(QUTLASS_ARCHS MATCHES "10\\.(0a|3a|0f)")
     set(QUTLASS_TARGET_CC 100)
   elseif(QUTLASS_ARCHS MATCHES "12\\.0a")
     set(QUTLASS_TARGET_CC 120)


@@ -38,7 +38,7 @@ else()
   FetchContent_Declare(
     vllm-flash-attn
     GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
-    GIT_TAG 86f8f157cf82aa2342743752b97788922dd7de43
+    GIT_TAG 188be16520ceefdc625fdf71365585d2ee348fe2
     GIT_PROGRESS TRUE
     # Don't share the vllm-flash-attn build between build types
     BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn


@@ -9,16 +9,6 @@
 void swap_blocks(torch::Tensor& src, torch::Tensor& dst,
                  const torch::Tensor& block_mapping);

-// Note: the key_caches and value_caches vectors are constant but
-// not the Tensors they contain. The vectors need to be const refs
-// in order to satisfy pytorch's C++ operator registration code.
-void copy_blocks(std::vector<torch::Tensor> const& key_caches,
-                 std::vector<torch::Tensor> const& value_caches,
-                 const torch::Tensor& block_mapping);
-
-void copy_blocks_mla(std::vector<torch::Tensor> const& kv_caches,
-                     const torch::Tensor& block_mapping);
-
 void reshape_and_cache(torch::Tensor& key, torch::Tensor& value,
                        torch::Tensor& key_cache, torch::Tensor& value_cache,
                        torch::Tensor& slot_mapping,
@@ -37,6 +27,13 @@ void concat_and_cache_mla(torch::Tensor& kv_c, torch::Tensor& k_pe,
                           const std::string& kv_cache_dtype,
                           torch::Tensor& scale);

+// NOTE: k_pe and kv_c order is flipped compared to concat_and_cache_mla
+void concat_and_cache_mla_rope_fused(
+    torch::Tensor& positions, torch::Tensor& q_pe, torch::Tensor& k_pe,
+    torch::Tensor& kv_c, torch::Tensor& rope_cos_sin_cache, bool rope_is_neox,
+    torch::Tensor& kv_cache_slot_mapping, torch::Tensor& kv_cache,
+    const std::string& kv_cache_dtype, torch::Tensor& kv_cache_quant_scale);
+
 // Just for unittest
 void convert_fp8(torch::Tensor& dst_cache, torch::Tensor& src_cache,
                  const double scale, const std::string& kv_cache_dtype);


@@ -119,94 +119,6 @@ __global__ void copy_blocks_mla_kernel(
 } // namespace vllm

-// Note: the key_caches and value_caches vectors are constant but
-// not the Tensors they contain. The vectors need to be const refs
-// in order to satisfy pytorch's C++ operator registration code.
-void copy_blocks(std::vector<torch::Tensor> const& key_caches,
-                 std::vector<torch::Tensor> const& value_caches,
-                 const torch::Tensor& block_mapping) {
-  int num_layers = key_caches.size();
-  TORCH_CHECK(num_layers == value_caches.size());
-  if (num_layers == 0) {
-    return;
-  }
-  torch::Device cache_device = key_caches[0].device();
-  TORCH_CHECK(cache_device.is_cuda());
-
-  // Create data structures for the kernel.
-  // Create an array of pointers to the key and value caches.
-  int64_t key_cache_ptrs[num_layers];
-  int64_t value_cache_ptrs[num_layers];
-  for (int layer_idx = 0; layer_idx < num_layers; ++layer_idx) {
-    key_cache_ptrs[layer_idx] =
-        reinterpret_cast<int64_t>(key_caches[layer_idx].data_ptr());
-    value_cache_ptrs[layer_idx] =
-        reinterpret_cast<int64_t>(value_caches[layer_idx].data_ptr());
-  }
-
-  // block_mapping is a 2D tensor with shape (num_pairs, 2).
-  int num_pairs = block_mapping.size(0);
-
-  // Move the data structures to the GPU.
-  // NOTE: This synchronizes the CPU and GPU.
-  torch::Tensor key_cache_ptrs_tensor =
-      torch::from_blob(key_cache_ptrs, {num_layers}, torch::kInt64)
-          .to(cache_device);
-  torch::Tensor value_cache_ptrs_tensor =
-      torch::from_blob(value_cache_ptrs, {num_layers}, torch::kInt64)
-          .to(cache_device);
-
-  // Launch the kernel.
-  const int numel_per_block = key_caches[0][0].numel();
-  dim3 grid(num_layers, num_pairs);
-  dim3 block(std::min(1024, numel_per_block));
-  const at::cuda::OptionalCUDAGuard device_guard(cache_device);
-  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
-  VLLM_DISPATCH_FLOATING_AND_BYTE_TYPES(
-      key_caches[0].scalar_type(), "copy_blocks_kernel", ([&] {
-        vllm::copy_blocks_kernel<scalar_t><<<grid, block, 0, stream>>>(
-            key_cache_ptrs_tensor.data_ptr<int64_t>(),
-            value_cache_ptrs_tensor.data_ptr<int64_t>(),
-            block_mapping.data_ptr<int64_t>(), numel_per_block);
-      }));
-}
-
-// copy blocks kernel for MLA (assumes a joint KV-cache)
-void copy_blocks_mla(std::vector<torch::Tensor> const& kv_caches,
-                     const torch::Tensor& block_mapping) {
-  int num_layers = kv_caches.size();
-  if (num_layers == 0) {
-    return;
-  }
-  torch::Device cache_device = kv_caches[0].device();
-  TORCH_CHECK(cache_device.is_cuda(), "kv_cache must be on CUDA");
-  std::vector<int64_t> cache_ptrs(num_layers);
-  for (int layer_idx = 0; layer_idx < num_layers; ++layer_idx) {
-    cache_ptrs[layer_idx] =
-        reinterpret_cast<int64_t>(kv_caches[layer_idx].data_ptr());
-  }
-  torch::Tensor cache_ptrs_tensor =
-      torch::from_blob(cache_ptrs.data(), {num_layers}, torch::kInt64)
-          .to(cache_device);
-
-  int num_pairs = block_mapping.size(0);
-  // We use the stride instead of numel in case the cache is padded for memory
-  // alignment reasons, we assume the blocks data (inclusive of any padding)
-  // is contiguous in memory
-  int mem_footprint_per_block = kv_caches[0].stride(0);
-  dim3 grid(num_layers, num_pairs);
-  dim3 block(std::min(1024, mem_footprint_per_block));
-  const at::cuda::OptionalCUDAGuard device_guard(cache_device);
-  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
-  VLLM_DISPATCH_FLOATING_AND_BYTE_TYPES(
-      kv_caches[0].scalar_type(), "copy_blocks_mla_kernel", ([&] {
-        vllm::copy_blocks_mla_kernel<scalar_t><<<grid, block, 0, stream>>>(
-            cache_ptrs_tensor.data_ptr<int64_t>(),
-            block_mapping.data_ptr<int64_t>(), mem_footprint_per_block);
-      }));
-}
-
 namespace vllm {

 // Used to copy/convert one element
@@ -539,9 +451,6 @@ __global__ void indexer_k_quant_and_cache_kernel(
     for (int i = 0; i < VEC_SIZE; i++) {
       amax = fmaxf(amax, fabsf(float(k_val_ptr[i])));
     }
-#ifndef USE_ROCM
-    __syncwarp();
-#endif

     // Reduced amax
     for (int mask = 16; mask > 0; mask /= 2) {
@@ -551,9 +460,7 @@ __global__ void indexer_k_quant_and_cache_kernel(
       amax = fmaxf(amax, __shfl_xor_sync(unsigned(-1), amax, mask));
 #endif
     }
-#ifndef USE_ROCM
-    __syncwarp();
-#endif

 #if defined(__gfx942__)
     float scale = fmaxf(amax, 1e-4) / 224.0f;
 #else

csrc/cache_kernels_fused.cu (new file, 279 lines)

@@ -0,0 +1,279 @@
#include <torch/all.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include "cuda_compat.h"
#include "dispatch_utils.h"
#include "quantization/w8a8/fp8/common.cuh"
#ifdef USE_ROCM
#include "quantization/w8a8/fp8/amd/quant_utils.cuh"
#else
#include "quantization/w8a8/fp8/nvidia/quant_utils.cuh"
#endif
#ifdef USE_ROCM
#include <hip/hip_bf16.h>
typedef __hip_bfloat16 __nv_bfloat16;
#endif
namespace vllm {
// NOTE Be EXTRA careful with raw_kv_scalar_t, for __half and __nv_bfloat16 it's
// using u16 as the backing type.
template <typename qk_t, bool IS_NEOX, typename raw_kv_scalar_t,
typename cache_t, Fp8KVCacheDataType kv_dt>
__global__ void concat_and_cache_mla_rope_fused_kernel(
const int64_t* __restrict__ positions, // [num_tokens]
qk_t* __restrict__ q_pe, // [num_tokens, num_q_heads, rot_dim]
qk_t* __restrict__ k_pe, // [num_tokens, rot_dim]
const qk_t* __restrict__ kv_c, // [num_tokens, kv_lora_rank]
const qk_t* __restrict__ rope_cos_sin_cache, // [max_position, 2,
// rot_dim // 2]
const int rot_dim, const int64_t q_pe_stride_token,
const int64_t q_pe_stride_head, const int64_t k_pe_stride,
const int64_t kv_c_stride, const int num_q_heads,
cache_t* __restrict__ kv_cache, // [num_blocks, block_size, (kv_lora_rank +
// rot_dim)]
const int64_t* __restrict__ kv_cache_slot_mapping, // [num_tokens]
const int block_stride, const int entry_stride, const int kv_lora_rank,
const int block_size, const float* kv_cache_quant_scale) {
// Each thread block is responsible for one token.
const int64_t token_idx = blockIdx.x;
const int64_t pos = positions[token_idx];
const qk_t* cos_sin_ptr = rope_cos_sin_cache + pos * rot_dim;
const int embed_dim = rot_dim / 2;
// Q ROPE
const int nq = num_q_heads * embed_dim;
for (int i = threadIdx.x; i < nq; i += blockDim.x) {
int head_idx = i / embed_dim;
int pair_idx = i % embed_dim;
// NOTE: Would be nice to have interleaved sin/cos so we could just load
// both at the same time.
qk_t cos = VLLM_LDG(cos_sin_ptr + pair_idx);
qk_t sin = VLLM_LDG(cos_sin_ptr + pair_idx + embed_dim);
qk_t* q_pe_head_ptr =
q_pe + token_idx * q_pe_stride_token + head_idx * q_pe_stride_head;
int pair_idx_x, pair_idx_y;
if constexpr (IS_NEOX) {
// GPT-NeoX style rotary embedding.
pair_idx_x = pair_idx;
pair_idx_y = embed_dim + pair_idx;
} else {
// GPT-J style rotary embedding.
pair_idx_x = pair_idx * 2;
pair_idx_y = pair_idx * 2 + 1;
}
qk_t x_src = q_pe_head_ptr[pair_idx_x];
qk_t y_src = q_pe_head_ptr[pair_idx_y];
qk_t x_dst = x_src * cos - y_src * sin;
qk_t y_dst = y_src * cos + x_src * sin;
q_pe_head_ptr[pair_idx_x] = x_dst;
q_pe_head_ptr[pair_idx_y] = y_dst;
}
const int64_t slot_idx = kv_cache_slot_mapping[token_idx];
const int64_t block_idx = slot_idx / block_size;
const int64_t entry_idx = slot_idx % block_size;
// NOTE: slot_idx can be -1 if the token is padded
if (slot_idx < 0) {
return;
}
// K with 1 HEAD
for (int i = threadIdx.x; i < embed_dim; i += blockDim.x) {
int pair_idx = i;
qk_t cos = VLLM_LDG(cos_sin_ptr + pair_idx);
qk_t sin = VLLM_LDG(cos_sin_ptr + pair_idx + embed_dim);
qk_t* k_pe_head_ptr = k_pe + token_idx * k_pe_stride;
int pair_idx_x, pair_idx_y;
if constexpr (IS_NEOX) {
// GPT-NeoX style rotary embedding.
pair_idx_x = pair_idx;
pair_idx_y = embed_dim + pair_idx;
} else {
// GPT-J style rotary embedding.
pair_idx_x = pair_idx * 2;
pair_idx_y = pair_idx * 2 + 1;
}
qk_t x_src = k_pe_head_ptr[pair_idx_x];
qk_t y_src = k_pe_head_ptr[pair_idx_y];
qk_t x_dst = x_src * cos - y_src * sin;
qk_t y_dst = y_src * cos + x_src * sin;
k_pe_head_ptr[pair_idx_x] = x_dst;
k_pe_head_ptr[pair_idx_y] = y_dst;
// NOTE Why is this monster necessary?
// When K is of type float16, the actual template replacement for
// raw_kv_scalar_t will be u16. That's why it's used at the last moment,
// otherwise the CUDA ALU would break.
const raw_kv_scalar_t raw_x_value =
*reinterpret_cast<const raw_kv_scalar_t*>(&x_dst);
const raw_kv_scalar_t raw_y_value =
*reinterpret_cast<const raw_kv_scalar_t*>(&y_dst);
cache_t* kv_cache_ptr = kv_cache + block_idx * block_stride +
entry_idx * entry_stride + kv_lora_rank;
// MLA Cache Store
if constexpr (kv_dt == Fp8KVCacheDataType::kAuto) {
kv_cache_ptr[pair_idx_x] = raw_x_value;
kv_cache_ptr[pair_idx_y] = raw_y_value;
} else {
kv_cache_ptr[pair_idx_x] =
fp8::scaled_convert<cache_t, raw_kv_scalar_t, kv_dt>(
raw_x_value, *kv_cache_quant_scale);
kv_cache_ptr[pair_idx_y] =
fp8::scaled_convert<cache_t, raw_kv_scalar_t, kv_dt>(
raw_y_value, *kv_cache_quant_scale);
}
}
// NOPE
for (int i = threadIdx.x; i < kv_lora_rank; i += blockDim.x) {
const qk_t* src_ptr = kv_c + token_idx * kv_c_stride + i;
const raw_kv_scalar_t src_value =
*reinterpret_cast<const raw_kv_scalar_t*>(src_ptr);
cache_t* kv_cache_ptr =
kv_cache + block_idx * block_stride + entry_idx * entry_stride;
if constexpr (kv_dt == Fp8KVCacheDataType::kAuto) {
kv_cache_ptr[i] = src_value;
} else {
kv_cache_ptr[i] = fp8::scaled_convert<cache_t, raw_kv_scalar_t, kv_dt>(
src_value, *kv_cache_quant_scale);
}
}
}
} // namespace vllm
#define CALL_CONCAT_AND_CACHE_MLA_ROPE_FUSED(RAW_KV_T, CACHE_T, KV_DTYPE) \
do { \
VLLM_DISPATCH_FLOATING_TYPES(q_pe.scalar_type(), "qk_scalar_type", [&] { \
using qk_t = scalar_t; \
if (rope_is_neox) { \
vllm::concat_and_cache_mla_rope_fused_kernel<qk_t, true, RAW_KV_T, \
CACHE_T, KV_DTYPE> \
<<<grid, block, 0, stream>>>( \
positions.data_ptr<int64_t>(), q_pe.data_ptr<qk_t>(), \
k_pe.data_ptr<qk_t>(), kv_c.data_ptr<qk_t>(), \
rope_cos_sin_cache.data_ptr<qk_t>(), rot_dim, \
q_pe_stride_token, q_pe_stride_head, k_pe_stride, kv_c_stride, \
num_q_heads, reinterpret_cast<CACHE_T*>(kv_cache.data_ptr()), \
kv_cache_slot_mapping.data_ptr<int64_t>(), block_stride, \
entry_stride, kv_lora_rank, block_size, \
kv_cache_quant_scale.data_ptr<float>()); \
} else { \
vllm::concat_and_cache_mla_rope_fused_kernel<qk_t, false, RAW_KV_T, \
CACHE_T, KV_DTYPE> \
<<<grid, block, 0, stream>>>( \
positions.data_ptr<int64_t>(), q_pe.data_ptr<qk_t>(), \
k_pe.data_ptr<qk_t>(), kv_c.data_ptr<qk_t>(), \
rope_cos_sin_cache.data_ptr<qk_t>(), rot_dim, \
q_pe_stride_token, q_pe_stride_head, k_pe_stride, kv_c_stride, \
num_q_heads, reinterpret_cast<CACHE_T*>(kv_cache.data_ptr()), \
kv_cache_slot_mapping.data_ptr<int64_t>(), block_stride, \
entry_stride, kv_lora_rank, block_size, \
kv_cache_quant_scale.data_ptr<float>()); \
} \
}); \
} while (false)
// Executes RoPE on q_pe and k_pe, then writes k_pe and kv_c in the kv cache.
// q_pe and k_pe are modified in place.
// Replaces DeepseekScalingRotaryEmbedding.self.rotary_emb and
// concat_and_cache_mla.
void concat_and_cache_mla_rope_fused(
torch::Tensor& positions, // [num_tokens]
torch::Tensor& q_pe, // [num_tokens, num_q_heads, rot_dim]
torch::Tensor& k_pe, // [num_tokens, rot_dim]
torch::Tensor& kv_c, // [num_tokens, kv_lora_rank]
torch::Tensor& rope_cos_sin_cache, // [max_position, rot_dim]
bool rope_is_neox,
torch::Tensor&
kv_cache_slot_mapping, // [num_tokens] or [num_actual_tokens]
torch::Tensor&
kv_cache, // [num_blocks, block_size, (kv_lora_rank + rot_dim)]
const std::string& kv_cache_dtype, torch::Tensor& kv_cache_quant_scale) {
const int64_t num_tokens = q_pe.size(0);
const int num_q_heads = q_pe.size(1);
const int rot_dim = q_pe.size(2);
const int kv_lora_rank = kv_c.size(1);
TORCH_CHECK(positions.size(0) >=
num_tokens); // CUDA Graphs might pad this for us
TORCH_CHECK_EQ(positions.dim(), 1);
TORCH_CHECK_EQ(positions.scalar_type(), c10::ScalarType::Long);
TORCH_CHECK_EQ(q_pe.size(0), num_tokens);
TORCH_CHECK_EQ(q_pe.size(1), num_q_heads);
TORCH_CHECK_EQ(q_pe.size(2), rot_dim);
TORCH_CHECK_EQ(q_pe.dim(), 3);
TORCH_CHECK_EQ(k_pe.size(0), num_tokens);
TORCH_CHECK_EQ(k_pe.size(1), rot_dim);
TORCH_CHECK_EQ(k_pe.dim(), 2);
TORCH_CHECK_EQ(k_pe.scalar_type(), q_pe.scalar_type());
TORCH_CHECK_EQ(kv_c.size(0), num_tokens);
TORCH_CHECK_EQ(kv_c.size(1), kv_lora_rank);
TORCH_CHECK_EQ(kv_c.dim(), 2);
TORCH_CHECK_EQ(kv_c.scalar_type(), q_pe.scalar_type());
TORCH_CHECK_EQ(kv_c.dtype(), q_pe.dtype());
TORCH_CHECK_EQ(rope_cos_sin_cache.size(1), rot_dim);
TORCH_CHECK_EQ(rope_cos_sin_cache.scalar_type(), q_pe.scalar_type());
TORCH_CHECK_EQ(kv_cache_slot_mapping.size(0), num_tokens);
TORCH_CHECK_EQ(kv_cache_slot_mapping.scalar_type(), c10::ScalarType::Long);
TORCH_CHECK_EQ(kv_cache.size(2), kv_lora_rank + rot_dim);
TORCH_CHECK_EQ(kv_cache.dim(), 3);
TORCH_CHECK_EQ(kv_cache_quant_scale.numel(), 1);
TORCH_CHECK_EQ(kv_cache_quant_scale.scalar_type(), c10::ScalarType::Float);
int64_t q_pe_stride_token = q_pe.stride(0);
int64_t q_pe_stride_head = q_pe.stride(1);
int64_t k_pe_stride = k_pe.stride(0);
int64_t kv_c_stride = kv_c.stride(0);
int block_size = kv_cache.size(1);
int block_stride = kv_cache.stride(0);
int entry_stride = kv_cache.stride(1);
int rope_block_size = std::min(num_q_heads * rot_dim / 2, 512);
int mla_block_size = kv_lora_rank;
int thread_block_size =
std::min(std::max(rope_block_size, mla_block_size), 512);
dim3 grid(num_tokens, 1, 1);
dim3 block(thread_block_size, 1, 1);
const at::cuda::OptionalCUDAGuard device_guard(device_of(positions));
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
DISPATCH_BY_KV_CACHE_DTYPE(kv_c.dtype(), kv_cache_dtype,
CALL_CONCAT_AND_CACHE_MLA_ROPE_FUSED);
}
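The kernel above applies the same rotation in both RoPE layouts but pairs elements differently: GPT-NeoX pairs index `p` with `p + rot_dim/2` (split halves), while GPT-J pairs the adjacent indices `2p` and `2p+1` (interleaved). A small Python reference sketch of that pairing and the rotation itself (`rope_rotate` is illustrative, not a vLLM API):

```python
import math

def rope_rotate(vec, cos, sin, neox):
    # vec has rot_dim elements; cos/sin each have rot_dim // 2 entries.
    out = list(vec)
    embed_dim = len(vec) // 2
    for p in range(embed_dim):
        if neox:
            ix, iy = p, embed_dim + p   # GPT-NeoX: pair across the two halves
        else:
            ix, iy = 2 * p, 2 * p + 1   # GPT-J: pair adjacent elements
        x, y = vec[ix], vec[iy]
        out[ix] = x * cos[p] - y * sin[p]
        out[iy] = y * cos[p] + x * sin[p]
    return out

# Rotating (1, 0) by 90 degrees gives (0, 1); for rot_dim=2 both layouts agree.
r = rope_rotate([1.0, 0.0], [math.cos(math.pi / 2)], [math.sin(math.pi / 2)], True)
assert abs(r[0]) < 1e-9 and abs(r[1] - 1.0) < 1e-9
```

The fused kernel performs exactly this per-pair rotation for q_pe and k_pe, then writes the rotated k_pe plus kv_c into the joint MLA cache in one pass.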


@@ -15,6 +15,7 @@
 #ifdef __aarch64__
 #include "cpu_attn_neon.hpp"

+// NEON requires head_dim to be a multiple of 32
 #define NEON_DISPATCH(...)                                                    \
   case cpu_attention::ISA::NEON: {                                            \
     using attn_impl = cpu_attention::AttentionImpl<cpu_attention::ISA::NEON,  \
@@ -36,7 +37,9 @@
   switch (HEAD_DIM) {                        \
     CPU_ATTN_DISPATCH_CASE(32, __VA_ARGS__)  \
     CPU_ATTN_DISPATCH_CASE(64, __VA_ARGS__)  \
+    CPU_ATTN_DISPATCH_CASE(80, __VA_ARGS__)  \
     CPU_ATTN_DISPATCH_CASE(96, __VA_ARGS__)  \
+    CPU_ATTN_DISPATCH_CASE(112, __VA_ARGS__) \
     CPU_ATTN_DISPATCH_CASE(128, __VA_ARGS__) \
     CPU_ATTN_DISPATCH_CASE(160, __VA_ARGS__) \
     CPU_ATTN_DISPATCH_CASE(192, __VA_ARGS__) \


@@ -377,7 +377,7 @@ class AttentionImpl<ISA::AMX, scalar_t, head_dim> {
       const int32_t q_heads_per_kv, const int64_t q_num_stride,
       const int64_t q_head_stride, const float scale) {
     constexpr int64_t bytes_per_head = head_dim * sizeof(scalar_t);
-    static_assert(bytes_per_head % AMX_TILE_ROW_BYTES == 0);
+    // static_assert(bytes_per_head % AMX_TILE_ROW_BYTES == 0);
     constexpr int64_t head_size_block_num = bytes_per_head / AMX_TILE_ROW_BYTES;
     constexpr int64_t head_elem_num_pre_block =
         AMX_TILE_ROW_BYTES / sizeof(scalar_t);


@@ -264,7 +264,7 @@ class AttentionImpl<ISA::NEON, scalar_t, head_dim> {
   constexpr static ISA ISAType = ISA::NEON;
   constexpr static bool scale_on_logits = false;  // apply scale on q_buffer
-  static_assert(HeadDim % HeadDimAlignment == 0);
+  // static_assert(HeadDim % HeadDimAlignment == 0);
   // the gemm micro kernel is Mx8
   static_assert(HeadDimAlignment % 8 == 0);
   static_assert(BlockSizeAlignment % 8 == 0);


@@ -24,6 +24,8 @@ std::string init_cpu_threads_env(const std::string& cpu_ids) {
#ifndef VLLM_NUMA_DISABLED #ifndef VLLM_NUMA_DISABLED
std::string init_cpu_threads_env(const std::string& cpu_ids) { std::string init_cpu_threads_env(const std::string& cpu_ids) {
 bitmask* omp_cpu_mask = numa_parse_cpustring_all(cpu_ids.c_str());
+TORCH_CHECK(omp_cpu_mask != nullptr,
+            "Failed to parse CPU string: " + cpu_ids);
 TORCH_CHECK(omp_cpu_mask->size > 0);
 std::vector<int> omp_cpu_ids;
 omp_cpu_ids.reserve(omp_cpu_mask->size);
@@ -44,20 +46,12 @@ std::string init_cpu_threads_env(const std::string& cpu_ids) {
   // Memory node binding
   if (numa_available() != -1) {
-    int mem_node_id = numa_node_of_cpu(omp_cpu_ids.front());
     std::set<int> node_ids;
     for (const auto& cpu_id : omp_cpu_ids) {
       int node_id = numa_node_of_cpu(cpu_id);
       if (node_id != -1) {
         node_ids.insert(node_id);
       }
-      if (node_id != mem_node_id) {
-        TORCH_WARN("CPU ", cpu_id, " is on NUMA node ", node_id, ", but CPU ",
-                   omp_cpu_ids.front(), " is on NUMA node ", mem_node_id,
-                   ". All CPUs should be on the same NUMA node for optimal "
-                   "performance. Memory will be bound to NUMA node ",
-                   mem_node_id, ".");
-      }
     }
     // Concatenate all node_ids into a single comma-separated string
     if (!node_ids.empty()) {
@@ -70,7 +64,7 @@ std::string init_cpu_threads_env(const std::string& cpu_ids) {
     }
     bitmask* mask = numa_parse_nodestring(node_ids_str.c_str());
-    bitmask* src_mask = numa_get_membind();
+    bitmask* src_mask = numa_get_mems_allowed();
     int pid = getpid();
@@ -83,15 +77,46 @@ std::string init_cpu_threads_env(const std::string& cpu_ids) {
                  std::to_string(errno));
     }
-    // restrict memory allocation node.
-    numa_set_membind(mask);
+    // Restrict memory allocation to the selected NUMA node(s).
+    // Enhances memory locality for the threads bound to those NUMA CPUs.
+    if (node_ids.size() > 1) {
+      errno = 0;
+      numa_set_interleave_mask(mask);
+      if (errno != 0) {
+        TORCH_WARN("numa_set_interleave_mask failed. errno: " +
+                   std::to_string(errno));
+      } else {
+        TORCH_WARN(
+            "NUMA binding: Using INTERLEAVE policy for memory "
+            "allocation across multiple NUMA nodes (nodes: " +
+            node_ids_str +
+            "). Memory allocations will be "
+            "interleaved across the specified NUMA nodes.");
+      }
+    } else {
+      errno = 0;
+      numa_set_membind(mask);
+      if (errno != 0) {
+        TORCH_WARN("numa_set_membind failed. errno: " +
+                   std::to_string(errno));
+      } else {
+        TORCH_WARN(
+            "NUMA binding: Using MEMBIND policy for memory "
+            "allocation on the NUMA nodes (" +
+            node_ids_str +
+            "). Memory allocations will be "
+            "strictly bound to these NUMA nodes.");
+      }
+    }
     numa_set_strict(1);
     numa_free_nodemask(mask);
     numa_free_nodemask(src_mask);
   } else {
-    TORCH_WARN("numa_parse_nodestring or numa_get_membind failed. errno: " +
-               std::to_string(errno));
+    TORCH_WARN(
+        "numa_parse_nodestring or numa_get_run_node_mask failed. errno: " +
+        std::to_string(errno));
   }
 }
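The hunk above replaces the unconditional `numa_set_membind` with a policy choice: MEMBIND when the parsed CPU set spans a single NUMA node, INTERLEAVE when it spans several. A minimal standalone C++ sketch of that decision logic, without libnuma (the type and function names here are illustrative, not from the patch):

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

enum class MemPolicy { Membind, Interleave };

// Sketch of the policy choice in the patch: memory is strictly bound when
// all worker CPUs sit on one NUMA node, and interleaved across nodes when
// the CPU set spans several.
MemPolicy choose_policy(const std::set<int>& node_ids) {
  return node_ids.size() > 1 ? MemPolicy::Interleave : MemPolicy::Membind;
}

// Mirrors the "concatenate all node_ids into a single comma-separated
// string" step that feeds numa_parse_nodestring() in the real code.
std::string node_string(const std::set<int>& node_ids) {
  std::ostringstream os;
  bool first = true;
  for (int id : node_ids) {
    if (!first) os << ',';
    os << id;
    first = false;
  }
  return os.str();
}
```

Interleaving when the thread pool straddles nodes avoids concentrating all allocations on one node's memory controller, at the cost of strict locality.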

View File

@@ -37,10 +37,12 @@
 };
 #endif
+#if !defined(__powerpc__)
 template <>
 struct VecTypeTrait<c10::Half> {
   using vec_t = vec_op::FP16Vec16;
 };
+#endif
 struct Counter {
   std::atomic<int64_t> counter;
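The guard above compiles the `c10::Half` specialization out on powerpc, where the vec_op layer has no `FP16Vec16`. A self-contained sketch of the same trait pattern with stand-in types (`FP32Vec8`, `FP16Vec16`, `half_t` are illustrative; the primary-template default is added only to keep the example compilable and is an assumption, not something this hunk shows):

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Illustrative stand-ins for the vectorized types in the patch.
struct FP32Vec8 { float v[8]; };
struct FP16Vec16 { uint16_t v[16]; };
struct half_t {};  // plays the role of c10::Half

// Primary trait: scalar type -> preferred SIMD vector type.
template <typename T>
struct VecTypeTrait {
  using vec_t = FP32Vec8;  // assumed safe default for the sketch
};

// As in the patch, the half specialization is compiled out on targets
// (e.g. powerpc) that lack an FP16 vector type; other targets get the
// specialized mapping.
#if !defined(__powerpc__)
template <>
struct VecTypeTrait<half_t> {
  using vec_t = FP16Vec16;
};
#endif
```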

View File

@@ -107,7 +107,8 @@ __global__ void fusedQKNormRopeKernel(
     void const* k_weight_void,       // RMSNorm weights for key
     void const* cos_sin_cache_void,  // Pre-computed cos/sin cache
     int64_t const* position_ids,     // Position IDs for RoPE
-    int const num_tokens             // Number of tokens
+    int const num_tokens,            // Number of tokens
+    int const rotary_dim             // Dimension for RoPE
 ) {
 #if (!defined(__CUDA_ARCH__) || __CUDA_ARCH__ < 800) && !defined(USE_ROCM)
   if constexpr ((std::is_same_v<scalar_t_in, c10::BFloat16>) ||
@@ -227,56 +228,59 @@ __global__ void fusedQKNormRopeKernel(
     // Calculate cache pointer for this position - similar to
     // pos_encoding_kernels.cu
-    T_cache const* cache_ptr = cos_sin_cache + pos_id * head_dim;
-    int const embed_dim = head_dim / 2;
+    T_cache const* cache_ptr = cos_sin_cache + pos_id * rotary_dim;
+    int const embed_dim = rotary_dim / 2;
     T_cache const* cos_ptr = cache_ptr;
     T_cache const* sin_ptr = cache_ptr + embed_dim;
-    if constexpr (interleave) {
-      // Perform interleaving. Use pre-computed cos/sin values.
-#pragma unroll
-      for (int i = 0; i < numElemsPerThread / 2; ++i) {
-        int const idx0 = 2 * i;
-        int const idx1 = 2 * i + 1;
-        float const val0 = elements[idx0];
-        float const val1 = elements[idx1];
-        int const dim_idx = laneId * numElemsPerThread + idx0;
-        int const half_dim = dim_idx / 2;
-        float const cos_val =
-            CacheConverter::convert(VLLM_LDG(cos_ptr + half_dim));
-        float const sin_val =
-            CacheConverter::convert(VLLM_LDG(sin_ptr + half_dim));
-        elements[idx0] = val0 * cos_val - val1 * sin_val;
-        elements[idx1] = val0 * sin_val + val1 * cos_val;
-      }
-    } else {
-      // Before data exchange with in warp, we need to sync.
-      __syncwarp();
-      // Get the data from the other half of the warp. Use pre-computed cos/sin
-      // values.
-#pragma unroll
-      for (int i = 0; i < numElemsPerThread; i++) {
-        elements2[i] = __shfl_xor_sync(FINAL_MASK, elements[i], 16);
-        if (laneId < 16) {
-          elements2[i] = -elements2[i];
-        }
-        int dim_idx = laneId * numElemsPerThread + i;
-        dim_idx = (dim_idx * 2) % head_dim;
-        int half_dim = dim_idx / 2;
-        // Use pre-computed cos/sin from cache
-        float cos_val = CacheConverter::convert(VLLM_LDG(cos_ptr + half_dim));
-        float sin_val = CacheConverter::convert(VLLM_LDG(sin_ptr + half_dim));
-        elements[i] = elements[i] * cos_val + elements2[i] * sin_val;
-      }
-      // __shfl_xor_sync does not provide memfence. Need to sync again.
-      __syncwarp();
-    }
+    int const rotary_lanes = rotary_dim / numElemsPerThread;  // rotary range
+    if (laneId < rotary_lanes) {
+      if constexpr (interleave) {
+        // Perform interleaving. Use pre-computed cos/sin values.
+#pragma unroll
+        for (int i = 0; i < numElemsPerThread / 2; ++i) {
+          int const idx0 = 2 * i;
+          int const idx1 = 2 * i + 1;
+          // Global dimension index in the head
+          int const dim_idx = laneId * numElemsPerThread + idx0;
+          float const val0 = elements[idx0];
+          float const val1 = elements[idx1];
+          int const half_dim = dim_idx / 2;
+          float const cos_val =
+              CacheConverter::convert(VLLM_LDG(cos_ptr + half_dim));
+          float const sin_val =
+              CacheConverter::convert(VLLM_LDG(sin_ptr + half_dim));
+          elements[idx0] = val0 * cos_val - val1 * sin_val;
+          elements[idx1] = val0 * sin_val + val1 * cos_val;
+        }
+      } else {
+        // Before data exchange with in warp, we need to sync.
+        __syncwarp();
+        int pairOffset = (rotary_dim / 2) / numElemsPerThread;
+        // Get the data from the other half of the warp. Use pre-computed
+        // cos/sin values.
+#pragma unroll
+        for (int i = 0; i < numElemsPerThread; i++) {
+          elements2[i] = __shfl_xor_sync(FINAL_MASK, elements[i], pairOffset);
+          if (laneId < pairOffset) {
+            elements2[i] = -elements2[i];
+          }
+          int dim_idx = laneId * numElemsPerThread + i;
+          dim_idx = (dim_idx * 2) % rotary_dim;
+          int half_dim = dim_idx / 2;
+          float cos_val = CacheConverter::convert(VLLM_LDG(cos_ptr + half_dim));
+          float sin_val = CacheConverter::convert(VLLM_LDG(sin_ptr + half_dim));
+          elements[i] = elements[i] * cos_val + elements2[i] * sin_val;
+        }
+        // __shfl_xor_sync does not provide memfence. Need to sync again.
+        __syncwarp();
+      }
+    }
     // Store.
     {
       vec_T vec;
@@ -312,10 +316,10 @@ template <typename scalar_t_in, typename scalar_t_cache>
 void launchFusedQKNormRope(void* qkv, int const num_tokens,
                            int const num_heads_q, int const num_heads_k,
                            int const num_heads_v, int const head_dim,
-                           float const eps, void const* q_weight,
-                           void const* k_weight, void const* cos_sin_cache,
-                           bool const interleave, int64_t const* position_ids,
-                           cudaStream_t stream) {
+                           int const rotary_dim, float const eps,
+                           void const* q_weight, void const* k_weight,
+                           void const* cos_sin_cache, bool const interleave,
+                           int64_t const* position_ids, cudaStream_t stream) {
   constexpr int blockSize = 256;
   int const warpsPerBlock = blockSize / 32;
@@ -332,7 +336,7 @@ void launchFusedQKNormRope(void* qkv, int const num_tokens,
         fusedQKNormRopeKernel<scalar_t_in, scalar_t_cache, 64, INTERLEAVE>
             <<<gridDim, blockDim, 0, stream>>>(
                 qkv, num_heads_q, num_heads_k, num_heads_v, eps, q_weight,
-                k_weight, cos_sin_cache, position_ids, num_tokens);
+                k_weight, cos_sin_cache, position_ids, num_tokens, rotary_dim);
       });
       break;
     case 128:
@@ -340,7 +344,7 @@ void launchFusedQKNormRope(void* qkv, int const num_tokens,
         fusedQKNormRopeKernel<scalar_t_in, scalar_t_cache, 128, INTERLEAVE>
             <<<gridDim, blockDim, 0, stream>>>(
                 qkv, num_heads_q, num_heads_k, num_heads_v, eps, q_weight,
-                k_weight, cos_sin_cache, position_ids, num_tokens);
+                k_weight, cos_sin_cache, position_ids, num_tokens, rotary_dim);
       });
       break;
     case 256:
@@ -348,7 +352,7 @@ void launchFusedQKNormRope(void* qkv, int const num_tokens,
         fusedQKNormRopeKernel<scalar_t_in, scalar_t_cache, 256, INTERLEAVE>
            <<<gridDim, blockDim, 0, stream>>>(
                 qkv, num_heads_q, num_heads_k, num_heads_v, eps, q_weight,
-                k_weight, cos_sin_cache, position_ids, num_tokens);
+                k_weight, cos_sin_cache, position_ids, num_tokens, rotary_dim);
       });
       break;
     default:
@@ -392,8 +396,11 @@ void fused_qk_norm_rope(
               "Query weights size must match head dimension");
   TORCH_CHECK(k_weight.size(0) == head_dim,
               "Key weights size must match head dimension");
-  TORCH_CHECK(cos_sin_cache.size(1) == head_dim,
-              "Cos/sin cache dimension must match head_dim");
+  TORCH_CHECK(cos_sin_cache.size(1) % 2 == 0, "rotary_dim must be even");
+  TORCH_CHECK(cos_sin_cache.size(1) <= head_dim,
+              "rotary_dim must be less than or equal to head_dim");
   TORCH_CHECK(qkv.scalar_type() == q_weight.scalar_type() &&
                   qkv.scalar_type() == k_weight.scalar_type(),
               "qkv, q_weight and k_weight must have the same dtype");
@@ -419,7 +426,8 @@ void fused_qk_norm_rope(
       qkv.data_ptr(), static_cast<int>(num_tokens),
       static_cast<int>(num_heads_q), static_cast<int>(num_heads_k),
       static_cast<int>(num_heads_v), static_cast<int>(head_dim),
-      static_cast<float>(eps), q_weight.data_ptr(), k_weight.data_ptr(),
+      static_cast<int>(cos_sin_cache.size(1)), static_cast<float>(eps),
+      q_weight.data_ptr(), k_weight.data_ptr(),
       cos_sin_cache.data_ptr(), !is_neox,
       reinterpret_cast<int64_t const*>(position_ids.data_ptr()),
       stream);
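The changes above generalize the kernel from full-head RoPE to partial RoPE: the cos/sin cache is now indexed by `rotary_dim` (taken from `cos_sin_cache.size(1)`), and only lanes covering the first `rotary_dim` elements of each head rotate their values; the remaining dimensions pass through unchanged. A CPU reference of the interleaved partial rotation, as a sketch of the intended semantics rather than the kernel's exact lane/vector layout (function name is illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// CPU reference for partial RoPE with interleaved pairing (the
// interleave=true path): only the first rotary_dim elements of a head are
// rotated; the tail is passed through. cos_sin holds rotary_dim/2 cos
// values followed by rotary_dim/2 sin values, mirroring the assumed
// cos/sin cache row layout.
void apply_partial_rope_interleaved(std::vector<float>& head,
                                    const std::vector<float>& cos_sin,
                                    int rotary_dim) {
  int const embed_dim = rotary_dim / 2;
  for (int p = 0; p < embed_dim; ++p) {
    float const c = cos_sin[p];
    float const s = cos_sin[embed_dim + p];
    float const x0 = head[2 * p];
    float const x1 = head[2 * p + 1];
    head[2 * p] = x0 * c - x1 * s;
    head[2 * p + 1] = x0 * s + x1 * c;
  }
  // Elements at index >= rotary_dim are intentionally left untouched.
}
```

This matches the two new TORCH_CHECKs: `rotary_dim` must be even (pairs) and at most `head_dim` (the rotated prefix fits in the head).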

View File

@@ -457,8 +457,8 @@ __device__ inline T apply_scoring(T val) {
   }
 }
-template <typename T, ScoringFunc SF>
-__device__ void topk_with_k2(T* output, T const* input, T const* bias,
+template <typename T, typename BiasT, ScoringFunc SF>
+__device__ void topk_with_k2(T* output, T const* input, BiasT const* bias,
                              cg::thread_block_tile<32> const& tile,
                              int32_t const lane_id,
                              int const num_experts_per_group) {
@@ -469,7 +469,7 @@ __device__ void topk_with_k2(T* output, T const* input, T const* bias,
   if (num_experts_per_group > WARP_SIZE) {
     for (int i = lane_id; i < num_experts_per_group; i += WARP_SIZE) {
       T value = apply_scoring<SF>(input[i]);
-      value = value + bias[i];
+      value = value + static_cast<T>(bias[i]);
       if (value > largest) {
         second_largest = largest;
@@ -481,7 +481,7 @@ __device__ void topk_with_k2(T* output, T const* input, T const* bias,
   } else {
     for (int i = lane_id; i < num_experts_per_group; i += WARP_SIZE) {
       T value = apply_scoring<SF>(input[i]);
-      value = value + bias[i];
+      value = value + static_cast<T>(bias[i]);
       largest = value;
     }
   }
@@ -503,8 +503,8 @@ __device__ void topk_with_k2(T* output, T const* input, T const* bias,
   }
 }
-template <typename T, ScoringFunc SF>
-__global__ void topk_with_k2_kernel(T* output, T* input, T const* bias,
+template <typename T, typename BiasT, ScoringFunc SF>
+__global__ void topk_with_k2_kernel(T* output, T* input, BiasT const* bias,
                                     int64_t const num_tokens,
                                     int64_t const num_cases,
                                     int64_t const n_group,
@@ -517,7 +517,7 @@ __global__ void topk_with_k2_kernel(T* output, T* input, T const* bias,
     input += case_id * num_experts_per_group;
     // bias is per expert group, offset to current group
     int32_t group_id = case_id % n_group;
-    T const* group_bias = bias + group_id * num_experts_per_group;
+    BiasT const* group_bias = bias + group_id * num_experts_per_group;
     output += case_id;
     cg::thread_block block = cg::this_thread_block();
@@ -526,18 +526,19 @@ __global__ void topk_with_k2_kernel(T* output, T* input, T const* bias,
 #if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
     asm volatile("griddepcontrol.wait;");
 #endif
-    topk_with_k2<T, SF>(output, input, group_bias, tile, lane_id,
-                        num_experts_per_group);
+    topk_with_k2<T, BiasT, SF>(output, input, group_bias, tile, lane_id,
+                               num_experts_per_group);
   }
 #if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
   asm volatile("griddepcontrol.launch_dependents;");
 #endif
 }
-template <typename T, typename IdxT, ScoringFunc SF, int NGroup = -1>
+template <typename T, typename BiasT, typename IdxT, ScoringFunc SF,
+          int NGroup = -1>
 __global__ void group_idx_and_topk_idx_kernel(
     T* scores, T const* group_scores, float* topk_values, IdxT* topk_indices,
-    T const* bias, int64_t const num_tokens, int64_t const n_group,
+    BiasT const* bias, int64_t const num_tokens, int64_t const n_group,
     int64_t const topk_group, int64_t const topk, int64_t const num_experts,
     int64_t const num_experts_per_group, bool renormalize,
     double routed_scaling_factor) {
@@ -623,7 +624,7 @@ __global__ void group_idx_and_topk_idx_kernel(
       T input = scores[offset + i];
       if (is_finite(input)) {
         T score = apply_scoring<SF>(input);
-        candidates = score + bias[offset + i];
+        candidates = score + static_cast<T>(bias[offset + i]);
       }
     }
     queue.add(candidates, offset + i);
@@ -715,36 +716,36 @@ inline void launch_group_idx_and_topk_kernel(
   switch (n_group) {
     case 4: {
-      launch(&group_idx_and_topk_idx_kernel<T, IdxT, SF, 4>);
+      launch(&group_idx_and_topk_idx_kernel<T, BiasT, IdxT, SF, 4>);
       break;
     }
     case 8: {
-      launch(&group_idx_and_topk_idx_kernel<T, IdxT, SF, 8>);
+      launch(&group_idx_and_topk_idx_kernel<T, BiasT, IdxT, SF, 8>);
       break;
     }
     case 16: {
-      launch(&group_idx_and_topk_idx_kernel<T, IdxT, SF, 16>);
+      launch(&group_idx_and_topk_idx_kernel<T, BiasT, IdxT, SF, 16>);
       break;
     }
     case 32: {
-      launch(&group_idx_and_topk_idx_kernel<T, IdxT, SF, 32>);
+      launch(&group_idx_and_topk_idx_kernel<T, BiasT, IdxT, SF, 32>);
       break;
     }
     default: {
-      launch(&group_idx_and_topk_idx_kernel<T, IdxT, SF>);
+      launch(&group_idx_and_topk_idx_kernel<T, BiasT, IdxT, SF>);
       break;
     }
   }
 }
-template <typename T, typename IdxT>
-void invokeNoAuxTc(T* scores, T* group_scores, float* topk_values,
-                   IdxT* topk_indices, T const* bias, int64_t const num_tokens,
-                   int64_t const num_experts, int64_t const n_group,
-                   int64_t const topk_group, int64_t const topk,
-                   bool const renormalize, double const routed_scaling_factor,
-                   int const scoring_func, bool enable_pdl = false,
-                   cudaStream_t const stream = 0) {
+template <typename T, typename BiasT, typename IdxT>
+void invokeNoAuxTc(T* scores, T* group_scores, float* topk_values,
+                   IdxT* topk_indices, BiasT const* bias,
+                   int64_t const num_tokens, int64_t const num_experts,
+                   int64_t const n_group, int64_t const topk_group,
+                   int64_t const topk, bool const renormalize,
+                   double const routed_scaling_factor, int const scoring_func,
+                   bool enable_pdl = false, cudaStream_t const stream = 0) {
   int64_t num_cases = num_tokens * n_group;
   int64_t topk_with_k2_num_blocks = (num_cases - 1) / NUM_WARPS_PER_BLOCK + 1;
   cudaLaunchConfig_t config;
@@ -765,12 +766,12 @@ void invokeNoAuxTc(T* scores, T* group_scores, float* topk_values,
   };
   switch (sf) {
     case SCORING_NONE: {
-      auto* kernel_instance1 = &topk_with_k2_kernel<T, SCORING_NONE>;
+      auto* kernel_instance1 = &topk_with_k2_kernel<T, BiasT, SCORING_NONE>;
       launch_topk_with_k2(kernel_instance1);
       break;
     }
     case SCORING_SIGMOID: {
-      auto* kernel_instance1 = &topk_with_k2_kernel<T, SCORING_SIGMOID>;
+      auto* kernel_instance1 = &topk_with_k2_kernel<T, BiasT, SCORING_SIGMOID>;
       launch_topk_with_k2(kernel_instance1);
       break;
     }
@@ -794,14 +795,14 @@ void invokeNoAuxTc(T* scores, T* group_scores, float* topk_values,
   config.attrs = attrs;
   switch (sf) {
     case SCORING_NONE: {
-      launch_group_idx_and_topk_kernel<T, IdxT, SCORING_NONE>(
+      launch_group_idx_and_topk_kernel<T, BiasT, IdxT, SCORING_NONE>(
           config, scores, group_scores, topk_values, topk_indices, bias,
           num_tokens, n_group, topk_group, topk, num_experts,
           num_experts_per_group, renormalize, routed_scaling_factor);
       break;
     }
     case SCORING_SIGMOID: {
-      launch_group_idx_and_topk_kernel<T, IdxT, SCORING_SIGMOID>(
+      launch_group_idx_and_topk_kernel<T, BiasT, IdxT, SCORING_SIGMOID>(
          config, scores, group_scores, topk_values, topk_indices, bias,
          num_tokens, n_group, topk_group, topk, num_experts,
          num_experts_per_group, renormalize, routed_scaling_factor);
@@ -812,17 +813,23 @@ void invokeNoAuxTc(T* scores, T* group_scores, float* topk_values,
   }
 }
-#define INSTANTIATE_NOAUX_TC(T, IdxT)                                         \
-  template void invokeNoAuxTc<T, IdxT>(                                       \
-      T * scores, T * group_scores, float* topk_values, IdxT* topk_indices,   \
-      T const* bias, int64_t const num_tokens, int64_t const num_experts,     \
-      int64_t const n_group, int64_t const topk_group, int64_t const topk,    \
-      bool const renormalize, double const routed_scaling_factor,             \
-      int const scoring_func, bool enable_pdl, cudaStream_t const stream);
+#define INSTANTIATE_NOAUX_TC(T, BiasT, IdxT)                                  \
+  template void invokeNoAuxTc<T, BiasT, IdxT>(                                \
+      T * scores, T * group_scores, float* topk_values, IdxT* topk_indices,   \
+      BiasT const* bias, int64_t const num_tokens, int64_t const num_experts, \
+      int64_t const n_group, int64_t const topk_group, int64_t const topk,    \
+      bool const renormalize, double const routed_scaling_factor,             \
+      int const scoring_func, bool enable_pdl, cudaStream_t const stream);
-INSTANTIATE_NOAUX_TC(float, int32_t);
-INSTANTIATE_NOAUX_TC(half, int32_t);
-INSTANTIATE_NOAUX_TC(__nv_bfloat16, int32_t);
+INSTANTIATE_NOAUX_TC(float, float, int32_t);
+INSTANTIATE_NOAUX_TC(float, half, int32_t);
+INSTANTIATE_NOAUX_TC(float, __nv_bfloat16, int32_t);
+INSTANTIATE_NOAUX_TC(half, float, int32_t);
+INSTANTIATE_NOAUX_TC(half, half, int32_t);
+INSTANTIATE_NOAUX_TC(half, __nv_bfloat16, int32_t);
+INSTANTIATE_NOAUX_TC(__nv_bfloat16, float, int32_t);
+INSTANTIATE_NOAUX_TC(__nv_bfloat16, half, int32_t);
+INSTANTIATE_NOAUX_TC(__nv_bfloat16, __nv_bfloat16, int32_t);
 }  // end namespace moe
 }  // namespace vllm
@@ -831,6 +838,7 @@ std::tuple<torch::Tensor, torch::Tensor> grouped_topk(
     int64_t topk, bool renormalize, double routed_scaling_factor,
     torch::Tensor const& bias, int64_t scoring_func = 0) {
   auto data_type = scores.scalar_type();
+  auto bias_type = bias.scalar_type();
   auto input_size = scores.sizes();
   int64_t num_tokens = input_size[0];
   int64_t num_experts = input_size[1];
@@ -854,39 +862,62 @@ std::tuple<torch::Tensor, torch::Tensor> grouped_topk(
   auto stream = c10::cuda::getCurrentCUDAStream(scores.get_device());
+#define LAUNCH_KERNEL(T, IdxT)                                                \
+  do {                                                                        \
+    switch (bias_type) {                                                      \
+      case torch::kFloat16:                                                   \
+        vllm::moe::invokeNoAuxTc<T, half, IdxT>(                              \
+            reinterpret_cast<T*>(scores.mutable_data_ptr()),                  \
+            reinterpret_cast<T*>(group_scores.mutable_data_ptr()),            \
+            reinterpret_cast<float*>(topk_values.mutable_data_ptr()),         \
+            reinterpret_cast<IdxT*>(topk_indices.mutable_data_ptr()),         \
+            reinterpret_cast<half const*>(bias.data_ptr()), num_tokens,       \
+            num_experts, n_group, topk_group, topk, renormalize,              \
+            routed_scaling_factor, static_cast<int>(scoring_func), false,     \
+            stream);                                                          \
+        break;                                                                \
+      case torch::kFloat32:                                                   \
+        vllm::moe::invokeNoAuxTc<T, float, IdxT>(                             \
+            reinterpret_cast<T*>(scores.mutable_data_ptr()),                  \
+            reinterpret_cast<T*>(group_scores.mutable_data_ptr()),            \
+            reinterpret_cast<float*>(topk_values.mutable_data_ptr()),         \
+            reinterpret_cast<IdxT*>(topk_indices.mutable_data_ptr()),         \
+            reinterpret_cast<float const*>(bias.data_ptr()), num_tokens,      \
+            num_experts, n_group, topk_group, topk, renormalize,              \
+            routed_scaling_factor, static_cast<int>(scoring_func), false,     \
+            stream);                                                          \
+        break;                                                                \
+      case torch::kBFloat16:                                                  \
+        vllm::moe::invokeNoAuxTc<T, __nv_bfloat16, IdxT>(                     \
+            reinterpret_cast<T*>(scores.mutable_data_ptr()),                  \
+            reinterpret_cast<T*>(group_scores.mutable_data_ptr()),            \
+            reinterpret_cast<float*>(topk_values.mutable_data_ptr()),         \
+            reinterpret_cast<IdxT*>(topk_indices.mutable_data_ptr()),         \
+            reinterpret_cast<__nv_bfloat16 const*>(bias.data_ptr()),          \
+            num_tokens, num_experts, n_group, topk_group, topk, renormalize,  \
+            routed_scaling_factor, static_cast<int>(scoring_func), false,     \
+            stream);                                                          \
+        break;                                                                \
+      default:                                                                \
+        throw std::invalid_argument(                                          \
+            "Invalid bias dtype, only supports float16, float32, and "        \
+            "bfloat16");                                                      \
+        break;                                                                \
+    }                                                                         \
+  } while (0)
   switch (data_type) {
     case torch::kFloat16:
       // Handle Float16
-      vllm::moe::invokeNoAuxTc<half, int32_t>(
-          reinterpret_cast<half*>(scores.mutable_data_ptr()),
-          reinterpret_cast<half*>(group_scores.mutable_data_ptr()),
-          reinterpret_cast<float*>(topk_values.mutable_data_ptr()),
-          reinterpret_cast<int32_t*>(topk_indices.mutable_data_ptr()),
-          reinterpret_cast<half const*>(bias.data_ptr()), num_tokens,
-          num_experts, n_group, topk_group, topk, renormalize,
-          routed_scaling_factor, static_cast<int>(scoring_func), false, stream);
+      LAUNCH_KERNEL(half, int32_t);
       break;
     case torch::kFloat32:
       // Handle Float32
-      vllm::moe::invokeNoAuxTc<float, int32_t>(
-          reinterpret_cast<float*>(scores.mutable_data_ptr()),
-          reinterpret_cast<float*>(group_scores.mutable_data_ptr()),
-          reinterpret_cast<float*>(topk_values.mutable_data_ptr()),
-          reinterpret_cast<int32_t*>(topk_indices.mutable_data_ptr()),
-          reinterpret_cast<float const*>(bias.data_ptr()), num_tokens,
-          num_experts, n_group, topk_group, topk, renormalize,
-          routed_scaling_factor, static_cast<int>(scoring_func), false, stream);
+      LAUNCH_KERNEL(float, int32_t);
       break;
     case torch::kBFloat16:
       // Handle BFloat16
-      vllm::moe::invokeNoAuxTc<__nv_bfloat16, int32_t>(
-          reinterpret_cast<__nv_bfloat16*>(scores.mutable_data_ptr()),
-          reinterpret_cast<__nv_bfloat16*>(group_scores.mutable_data_ptr()),
-          reinterpret_cast<float*>(topk_values.mutable_data_ptr()),
-          reinterpret_cast<int32_t*>(topk_indices.mutable_data_ptr()),
-          reinterpret_cast<__nv_bfloat16 const*>(bias.data_ptr()), num_tokens,
-          num_experts, n_group, topk_group, topk, renormalize,
-          routed_scaling_factor, static_cast<int>(scoring_func), false, stream);
+      LAUNCH_KERNEL(__nv_bfloat16, int32_t);
       break;
     default:
       // Handle other data types
@@ -894,5 +925,6 @@ std::tuple<torch::Tensor, torch::Tensor> grouped_topk(
           "Invalid dtype, only supports float16, float32, and bfloat16");
       break;
   }
+#undef LAUNCH_KERNEL
   return {topk_values, topk_indices};
 }

View File

@@ -7,20 +7,20 @@
 #include "quantization/gptq_marlin/marlin_dtypes.cuh"
 #include "core/scalar_type.hpp"
 #define MARLIN_KERNEL_PARAMS                                                  \
     const int4 *__restrict__ A, const int4 *__restrict__ B,                   \
     int4 *__restrict__ C, int4 *__restrict__ C_tmp,                           \
     const int4 *__restrict__ b_bias_ptr,                                      \
     const float *__restrict__ a_scales_ptr,                                   \
     const int4 *__restrict__ scales_ptr,                                      \
     const uint16_t *__restrict__ global_scale_ptr,                            \
     const int4 *__restrict__ zp_ptr, const int *__restrict__ g_idx,           \
     const int32_t *__restrict__ sorted_token_ids_ptr,                         \
     const int32_t *__restrict__ expert_ids_ptr,                               \
     const int32_t *__restrict__ num_tokens_past_padded_ptr,                   \
     const float *__restrict__ topk_weights_ptr, int top_k,                    \
-    bool mul_topk_weights, bool is_ep, int num_groups, int prob_m,            \
-    int prob_n, int prob_k, int *locks, bool has_bias, bool use_atomic_add,   \
+    bool mul_topk_weights, int num_groups, int prob_m, int prob_n,            \
+    int prob_k, int *locks, bool has_bias, bool use_atomic_add,               \
     bool use_fp32_reduce
 namespace MARLIN_NAMESPACE_NAME {

View File

@@ -71,7 +71,6 @@ __global__ void Marlin(
     const float* __restrict__ topk_weights_ptr,  // moe top weights
     int top_k,              // num of experts per token
     bool mul_topk_weights,  // mul topk weights or not
-    bool is_ep,             // expert parallelism
     int num_groups,  // number of scale groups per output channel
     int prob_m,      // batch dimension m
     int prob_n,      // output dimension n
@@ -273,7 +272,6 @@ __global__ void Marlin(
     const float* __restrict__ topk_weights_ptr,  // moe top weights
     int top_k,              // num of experts per token
     bool mul_topk_weights,  // mul topk weights or not
-    bool is_ep,             // expert parallelism
     int num_groups,  // number of scale groups per output channel
     int prob_m,      // batch dimension m
     int prob_n,      // output dimension n
@@ -376,14 +374,6 @@ __global__ void Marlin(
   // parallel: num valid moe blocks
   int parallel = num_tokens_past_padded / moe_block_size;
-  int num_valid_blocks = parallel;
-  if (is_ep) {
-    for (int i = 0; i < parallel; i++) {
-      if (expert_ids_ptr[i] == -1) num_valid_blocks--;
-    }
-  }
-  int num_invalid_blocks = parallel - num_valid_blocks;
-  parallel = num_valid_blocks;
   int k_tiles = prob_k / 16 / thread_k_blocks;
   int n_tiles = prob_n / 16 / thread_n_blocks;
@@ -538,22 +528,8 @@ __global__ void Marlin(
     if (par_id >= parallel) return;
     old_expert_id = expert_id;
-    if (num_invalid_blocks > 0) {
-      int skip_count = par_id;
-      for (int i = 0; i < num_tokens_past_padded / moe_block_size; i++) {
-        expert_id = expert_ids_ptr[i];
-        if (expert_id != -1) {
-          if (skip_count == 0) {
-            block_id = i;
-            break;
-          };
-          skip_count--;
-        };
-      }
-    } else {
-      block_id = par_id;
-      expert_id = expert_ids_ptr[block_id];
-    }
+    block_id = par_id;
+    expert_id = expert_ids_ptr[block_id];
     if constexpr (b_type == vllm::kFE2M1f && s_type == vllm::kFE4M3fn) {
       uint16_t val = global_scale_ptr[expert_id];

View File

@@ -336,14 +336,14 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias,
               void* perm, void* a_tmp, void* sorted_token_ids,
               void* expert_ids, void* num_tokens_past_padded,
               void* topk_weights, int moe_block_size, int num_experts,
-              int top_k, bool mul_topk_weights, bool is_ep, int prob_m,
-              int prob_n, int prob_k, void* workspace,
-              vllm::ScalarType const& a_type, vllm::ScalarType const& b_type,
-              vllm::ScalarType const& c_type, vllm::ScalarType const& s_type,
-              bool has_bias, bool has_act_order, bool is_k_full, bool has_zp,
-              int num_groups, int group_size, int dev, cudaStream_t stream,
-              int thread_k, int thread_n, int sms, int blocks_per_sm,
-              bool use_atomic_add, bool use_fp32_reduce, bool is_zp_float) {
+              int top_k, bool mul_topk_weights, int prob_m, int prob_n,
+              int prob_k, void* workspace, vllm::ScalarType const& a_type,
+              vllm::ScalarType const& b_type, vllm::ScalarType const& c_type,
+              vllm::ScalarType const& s_type, bool has_bias,
+              bool has_act_order, bool is_k_full, bool has_zp, int num_groups,
+              int group_size, int dev, cudaStream_t stream, int thread_k,
+              int thread_n, int sms, int blocks_per_sm, bool use_atomic_add,
+              bool use_fp32_reduce, bool is_zp_float) {
  int thread_m_blocks = div_ceil(moe_block_size, 16);
  bool m_block_size_8 = moe_block_size == 8;
  bool is_a_8bit = a_type.size_bits() == 8;
@@ -523,7 +523,7 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias,
  kernel<<<blocks, num_threads, max_shared_mem, stream>>>(
      A_ptr, B_ptr, C_ptr, C_tmp_ptr, bias_ptr, a_s_ptr, b_s_ptr, g_s_ptr, zp_ptr, g_idx_ptr,
      sorted_token_ids_ptr, expert_ids_ptr, num_tokens_past_padded_ptr,
-     topk_weights_ptr, top_k, mul_topk_weights, is_ep, num_groups, prob_m,
+     topk_weights_ptr, top_k, mul_topk_weights, num_groups, prob_m,
      prob_n, prob_k, locks, has_bias, use_atomic_add, use_fp32_reduce);
  // clang-format on
}
@@ -541,7 +541,7 @@ torch::Tensor moe_wna16_marlin_gemm(
    std::optional<torch::Tensor> const& perm_or_none, torch::Tensor& workspace,
    torch::Tensor& sorted_token_ids, torch::Tensor& expert_ids,
    torch::Tensor& num_tokens_past_padded, torch::Tensor& topk_weights,
-   int64_t moe_block_size, int64_t top_k, bool mul_topk_weights, bool is_ep,
+   int64_t moe_block_size, int64_t top_k, bool mul_topk_weights,
    vllm::ScalarTypeId const& b_type_id, int64_t size_m, int64_t size_n,
    int64_t size_k, bool is_k_full, bool use_atomic_add, bool use_fp32_reduce,
    bool is_zp_float, int64_t thread_k, int64_t thread_n,
@@ -855,9 +855,9 @@ torch::Tensor moe_wna16_marlin_gemm(
      perm.data_ptr(), a_tmp.data_ptr(), sorted_token_ids.data_ptr(),
      expert_ids.data_ptr(), num_tokens_past_padded.data_ptr(),
      topk_weights.data_ptr(), moe_block_size, num_experts, top_k,
-     mul_topk_weights, is_ep, size_m, size_n, size_k, workspace.data_ptr(),
-     a_type, b_type, c_type, s_type, has_bias, has_act_order, is_k_full,
-     has_zp, num_groups, group_size, dev, at::cuda::getCurrentCUDAStream(dev),
+     mul_topk_weights, size_m, size_n, size_k, workspace.data_ptr(), a_type,
+     b_type, c_type, s_type, has_bias, has_act_order, is_k_full, has_zp,
+     num_groups, group_size, dev, at::cuda::getCurrentCUDAStream(dev),
      thread_k, thread_n, sms, blocks_per_sm, use_atomic_add, use_fp32_reduce,
      is_zp_float);
@@ -866,4 +866,4 @@ torch::Tensor moe_wna16_marlin_gemm(
TORCH_LIBRARY_IMPL_EXPAND(TORCH_EXTENSION_NAME, CUDA, m) {
  m.impl("moe_wna16_marlin_gemm", &moe_wna16_marlin_gemm);
}


@@ -71,7 +71,7 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, m) {
      "Tensor sorted_token_ids,"
      "Tensor! expert_ids, Tensor! num_tokens_past_padded,"
      "Tensor! topk_weights, int moe_block_size, int top_k, "
-     "bool mul_topk_weights, bool is_ep, int b_type_id,"
+     "bool mul_topk_weights, int b_type_id,"
      "int size_m, int size_n, int size_k,"
      "bool is_full_k, bool use_atomic_add,"
      "bool use_fp32_reduce, bool is_zp_float,"


@@ -2,6 +2,7 @@
#include <optional>
#include <torch/library.h>
+#include <tuple>
#include "core/scalar_type.hpp"
@@ -265,6 +266,11 @@ void get_cutlass_moe_mm_problem_sizes(
    const int64_t k, const std::optional<torch::Tensor>& blockscale_offsets,
    std::optional<bool> force_swap_ab = std::nullopt);
+void get_cutlass_moe_mm_problem_sizes_from_expert_offsets(
+    const torch::Tensor& expert_first_token_offset,
+    torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2,
+    const int64_t n, const int64_t k, const bool swap_ab);
void get_cutlass_pplx_moe_mm_data(torch::Tensor& expert_offsets,
                                  torch::Tensor& problem_sizes1,
                                  torch::Tensor& problem_sizes2,
@@ -301,6 +307,12 @@ void scaled_fp4_experts_quant(
    torch::Tensor const& input_offset_by_experts,
    torch::Tensor const& output_scale_offset_by_experts);
+void silu_and_mul_scaled_fp4_experts_quant(
+    torch::Tensor& output, torch::Tensor& output_scale,
+    torch::Tensor const& input, torch::Tensor const& input_global_scale,
+    torch::Tensor const& input_offset_by_experts,
+    torch::Tensor const& output_scale_offset_by_experts);
void per_token_group_quant_fp8(const torch::Tensor& input,
                               torch::Tensor& output_q, torch::Tensor& output_s,
                               int64_t group_size, double eps, double fp8_min,
@@ -335,8 +347,9 @@ torch::Tensor gptq_gemm(torch::Tensor a, torch::Tensor b_q_weight,
void gptq_shuffle(torch::Tensor q_weight, torch::Tensor q_perm, int64_t bit);
-void static_scaled_fp8_quant(torch::Tensor& out, torch::Tensor const& input,
-                             torch::Tensor const& scale);
+void static_scaled_fp8_quant(
+    torch::Tensor& out, torch::Tensor const& input, torch::Tensor const& scale,
+    std::optional<std::tuple<int64_t, int64_t>> group_shape = std::nullopt);
void dynamic_scaled_fp8_quant(torch::Tensor& out, torch::Tensor const& input,
                              torch::Tensor& scale);


@@ -31,37 +31,6 @@
namespace vllm {

-// silu in float32
-__device__ __forceinline__ float silu(float x) {
-  return __fdividef(x, (1.f + __expf(-x)));
-}
-
-__device__ __forceinline__ float2 silu2(float2 x) {
-  return make_float2(silu(x.x), silu(x.y));
-}
-
-template <class Type>
-__inline__ __device__ PackedVec<Type> compute_silu_mul(PackedVec<Type>& vec,
-                                                       PackedVec<Type>& vec2) {
-  PackedVec<Type> result;
-  using packed_type = typename TypeConverter<Type>::Type;
-
-#pragma unroll
-  for (int i = 0; i < CVT_FP4_ELTS_PER_THREAD / 2; ++i) {
-    // silu_mul in float32
-    if constexpr (std::is_same_v<Type, half>) {
-      float2 silu_vec = silu2(__half22float2(vec.elts[i]));
-      result.elts[i] =
-          __float22half2_rn(__fmul2_rn(silu_vec, __half22float2(vec2.elts[i])));
-    } else {
-      float2 silu_vec = silu2(__bfloat1622float2(vec.elts[i]));
-      result.elts[i] = __float22bfloat162_rn(
-          __fmul2_rn(silu_vec, __bfloat1622float2(vec2.elts[i])));
-    }
-  }
-  return result;
-}
-
// Use UE4M3 by default.
template <class Type, bool UE8M0_SF = false>
__global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
@@ -74,6 +43,9 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
  static_assert(sizeof(PackedVec) == sizeof(Type) * CVT_FP4_ELTS_PER_THREAD,
                "Vec size is not matched.");

+  // Precompute SF layout parameter (constant for entire kernel).
+  int32_t const numKTiles = (numCols + 63) / 64;
+
  // Get the global scaling factor, which will be applied to the SF.
  // Note SFScale is the same as next GEMM's alpha, which is
  // (448.f / (Alpha_A / 6.f)).
@@ -101,7 +73,7 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
    auto sf_out =
        cvt_quant_to_fp4_get_sf_out_offset<uint32_t,
                                           CVT_FP4_NUM_THREADS_PER_SF>(
-           rowIdx, colIdx, numCols, SFout);
+           rowIdx, colIdx, numKTiles, SFout);
    out_pos = cvt_warp_fp16_to_fp4<Type, UE8M0_SF>(out_silu_mul, SFScaleVal,
                                                   sf_out);
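The `numKTiles` precompute hoists the padded-column arithmetic out of the per-element loop: with 16-element FP4 scale-factor vectors packed 4 per int32, one K tile of the SF layout covers 64 input columns, so the tile count is a ceiling division by 64. A small C++ sketch of that relation (the function name is ours):

```cpp
#include <cassert>
#include <cstdint>

// One K tile of the scale-factor layout covers CVT_FP4_SF_VEC_SIZE (16)
// columns times 4 scale values packed per int32, i.e. 64 input columns.
int32_t num_k_tiles(int32_t num_cols) { return (num_cols + 63) / 64; }
```

Since the value depends only on `numCols`, computing it once per kernel (rather than per element, as the removed `numCols_padded` math did) is free and shrinks the inner loop.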


@@ -62,7 +62,9 @@ __global__ void __get_group_gemm_starts(
    ElementSF* a_scales_base_as_int, ElementSF* b_scales_base_as_int,
    ElementAccumulator* alphas_base_as_int, const int32_t* expert_offsets,
    const int32_t* sf_offsets, const int32_t* problem_sizes_as_shapes,
-   const int K, const int N) {
+   int64_t* a_strides, int64_t* b_strides, int64_t* c_strides,
+   const int64_t a_stride_val, const int64_t b_stride_val,
+   const int64_t c_stride_val, const int K, const int N) {
  int64_t expert_id = threadIdx.x;
  if (expert_id >= gridDim.x * blockDim.x) {
    return;
@@ -103,6 +105,11 @@ __global__ void __get_group_gemm_starts(
  // Shape of alpha = [E]
  alpha_offsets[expert_id] = alphas_base_as_int + expert_id;

+  // Initialize strides (constant across all experts, avoids separate kernels)
+  a_strides[expert_id] = a_stride_val;
+  b_strides[expert_id] = b_stride_val;
+  c_strides[expert_id] = c_stride_val;
+
  LayoutSFA* layout_sfa_ptr = layout_sfa_base_as_int + expert_id;
  LayoutSFB* layout_sfb_ptr = layout_sfb_base_as_int + expert_id;
@@ -135,7 +142,11 @@ __global__ void __get_group_gemm_starts(
        static_cast<float*>(alphas.data_ptr()),                 \
        static_cast<int32_t*>(expert_offsets.data_ptr()),       \
        static_cast<int32_t*>(sf_offsets.data_ptr()),           \
-       static_cast<int32_t*>(problem_sizes.data_ptr()), K, N); \
+       static_cast<int32_t*>(problem_sizes.data_ptr()),        \
+       static_cast<int64_t*>(a_strides.data_ptr()),            \
+       static_cast<int64_t*>(b_strides.data_ptr()),            \
+       static_cast<int64_t*>(c_strides.data_ptr()), a_stride_val, \
+       b_stride_val, c_stride_val, K, N);                      \
  }

template <typename LayoutSFA, typename LayoutSFB, typename ScaleConfig>
@@ -144,6 +155,9 @@ void run_get_group_gemm_starts(
    const torch::Tensor& out_starts, const torch::Tensor& a_scales_starts,
    const torch::Tensor& b_scales_starts, const torch::Tensor& alpha_starts,
    const torch::Tensor& layout_sfa, const torch::Tensor& layout_sfb,
+   const torch::Tensor& a_strides, const torch::Tensor& b_strides,
+   const torch::Tensor& c_strides, int64_t a_stride_val, int64_t b_stride_val,
+   int64_t c_stride_val,
    /*these are used for their base addresses*/
    torch::Tensor const& a_tensors, torch::Tensor const& b_tensors,
    torch::Tensor const& out_tensors, torch::Tensor const& a_scales,
@@ -269,17 +283,16 @@ void run_fp4_blockwise_scaled_group_mm_sm100(
  torch::Tensor alpha_ptrs = torch::empty(num_experts, options_int);
  torch::Tensor layout_sfa = torch::empty({num_experts, 5}, options_int);
  torch::Tensor layout_sfb = torch::empty({num_experts, 5}, options_int);
-  torch::Tensor c_strides1 =
-      torch::full({num_experts}, output.stride(0), options_int);
-  torch::Tensor a_strides1 =
-      torch::full({num_experts}, a.stride(0) * 2, options_int);
-  torch::Tensor b_strides1 =
-      torch::full({num_experts}, b.stride(1) * 2, options_int);
+  torch::Tensor a_strides1 = torch::empty(num_experts, options_int);
+  torch::Tensor b_strides1 = torch::empty(num_experts, options_int);
+  torch::Tensor c_strides1 = torch::empty(num_experts, options_int);

  run_get_group_gemm_starts<LayoutSFA, LayoutSFB, ScaleConfig>(
      a_ptrs, b_ptrs, out_ptrs, a_scales_ptrs, b_scales_ptrs, alpha_ptrs,
-     layout_sfa, layout_sfb, a, b, output, a_blockscale, b_blockscales, alphas,
-     expert_offsets, sf_offsets, problem_sizes, M, N, K);
+     layout_sfa, layout_sfb, a_strides1, b_strides1, c_strides1,
+     a.stride(0) * 2, b.stride(1) * 2, output.stride(0), a, b, output,
+     a_blockscale, b_blockscales, alphas, expert_offsets, sf_offsets,
+     problem_sizes, M, N, K);

  // Create an instance of the GEMM
  Gemm gemm_op;
@@ -444,17 +457,16 @@ void run_fp4_blockwise_scaled_group_mm_sm120(
  torch::Tensor alpha_ptrs = torch::empty(num_experts, options_int);
  torch::Tensor layout_sfa = torch::empty({num_experts, 5}, options_int);
  torch::Tensor layout_sfb = torch::empty({num_experts, 5}, options_int);
-  torch::Tensor c_strides1 =
-      torch::full({num_experts}, output.stride(0), options_int);
-  torch::Tensor a_strides1 =
-      torch::full({num_experts}, a.stride(0) * 2, options_int);
-  torch::Tensor b_strides1 =
-      torch::full({num_experts}, b.stride(1) * 2, options_int);
+  torch::Tensor a_strides1 = torch::empty(num_experts, options_int);
+  torch::Tensor b_strides1 = torch::empty(num_experts, options_int);
+  torch::Tensor c_strides1 = torch::empty(num_experts, options_int);

  run_get_group_gemm_starts<LayoutSFA, LayoutSFB, ScaleConfig>(
      a_ptrs, b_ptrs, out_ptrs, a_scales_ptrs, b_scales_ptrs, alpha_ptrs,
-     layout_sfa, layout_sfb, a, b, output, a_blockscale, b_blockscales, alphas,
-     expert_offsets, sf_offsets, problem_sizes, M, N, K);
+     layout_sfa, layout_sfb, a_strides1, b_strides1, c_strides1,
+     a.stride(0) * 2, b.stride(1) * 2, output.stride(0), a, b, output,
+     a_blockscale, b_blockscales, alphas, expert_offsets, sf_offsets,
+     problem_sizes, M, N, K);

  // Create an instance of the GEMM
  Gemm gemm_op;


@@ -25,13 +25,18 @@
#include <cuda_fp8.h>

#include "dispatch_utils.h"
+#include "cuda_utils.h"
#include "nvfp4_utils.cuh"
#include "launch_bounds_utils.h"

namespace vllm {

+// NVFP4 quantization kernel for experts (low-latency path).
+// When FUSE_SILU_MUL=true, expects input with gate||up layout and fuses
+// SiLU(gate)*up before quantization.
// Use UE4M3 by default.
-template <class Type, bool UE8M0_SF = false, bool SMALL_NUM_EXPERTS = false>
+template <class Type, bool FUSE_SILU_MUL = false, bool UE8M0_SF = false,
+          bool SMALL_NUM_EXPERTS = false>
__global__ void __launch_bounds__(512, VLLM_BLOCKS_PER_SM(512))
    cvt_fp16_to_fp4(int32_t numRows, int32_t numCols, Type const* in,
                    float const* SFScale, uint32_t* out, uint32_t* SFout,
@@ -44,8 +49,13 @@ __global__ void __launch_bounds__(512, VLLM_BLOCKS_PER_SM(512))
  static_assert(sizeof(PackedVec) == sizeof(Type) * CVT_FP4_ELTS_PER_THREAD,
                "Vec size is not matched.");

+  // Precompute SF layout parameter (constant for entire kernel).
+  int32_t const numKTiles = (numCols + 63) / 64;
+
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int colsPerRow = numCols / CVT_FP4_ELTS_PER_THREAD;
+  // When fusing SiLU+Mul, input has gate || up layout (doubled width)
+  int inColsPerRow = FUSE_SILU_MUL ? colsPerRow * 2 : colsPerRow;

  // Each global thread processes one element
  for (int globalIdx = tid; globalIdx < numRows * colsPerRow;
@@ -54,13 +64,6 @@ __global__ void __launch_bounds__(512, VLLM_BLOCKS_PER_SM(512))
    int rowIdx = globalIdx / colsPerRow;
    int colIdx = globalIdx % colsPerRow;

-    int64_t inOffset = rowIdx * colsPerRow + colIdx;
-    PackedVec in_vec = reinterpret_cast<PackedVec const*>(in)[inOffset];
-    // Get the output tensor offset.
-    // Same as inOffset because 8 elements are packed into one uint32_t.
-    int64_t outOffset = inOffset;
-    auto& out_pos = out[outOffset];
-
    // Find index within the experts using different strategies based on expert
    // count
    int rowIdx_in_expert = 0;
@@ -107,29 +110,46 @@ __global__ void __launch_bounds__(512, VLLM_BLOCKS_PER_SM(512))
      }
    }

+    // Load input and optionally apply fused SiLU+Mul
+    int64_t inOffset = rowIdx * inColsPerRow + colIdx;
+    PackedVec in_vec = reinterpret_cast<PackedVec const*>(in)[inOffset];
+    PackedVec quant_input;
+    if constexpr (FUSE_SILU_MUL) {
+      PackedVec in_vec_up =
+          reinterpret_cast<PackedVec const*>(in)[inOffset + colsPerRow];
+      quant_input = compute_silu_mul(in_vec, in_vec_up);
+    } else {
+      quant_input = in_vec;
+    }
+
+    // Get the output tensor offset.
+    // Same as inOffset because 8 elements are packed into one uint32_t.
+    int64_t outOffset = rowIdx * colsPerRow + colIdx;
+    auto& out_pos = out[outOffset];
+
    // Get the global scaling factor, which will be applied to the SF.
    // Note SFScale is the same as next GEMM's alpha, which is
    // (448.f / (Alpha_A / 6.f)).
    float const SFScaleVal = SFScale == nullptr ? 1.0f : SFScale[expert_idx];

-    int factor = CVT_FP4_SF_VEC_SIZE * 4;
-    // The actual output_scales dim is computed from the padded numCols.
-    int32_t numCols_padded = (numCols + factor - 1) / factor * factor;
-    int numCols_SFout = numCols_padded / CVT_FP4_SF_VEC_SIZE / 4;
    uint32_t* SFout_in_expert =
-        SFout + output_scale_offset_by_experts[expert_idx] * numCols_SFout;
+        SFout + output_scale_offset_by_experts[expert_idx] * numKTiles;

    auto sf_out =
        cvt_quant_to_fp4_get_sf_out_offset<uint32_t,
                                           CVT_FP4_NUM_THREADS_PER_SF>(
-            rowIdx_in_expert, colIdx, numCols, SFout_in_expert);
+            rowIdx_in_expert, colIdx, numKTiles, SFout_in_expert);

-    out_pos = cvt_warp_fp16_to_fp4<Type, UE8M0_SF>(in_vec, SFScaleVal, sf_out);
+    out_pos =
+        cvt_warp_fp16_to_fp4<Type, UE8M0_SF>(quant_input, SFScaleVal, sf_out);
  }
}

-// Kernel for LARGE_M_TOPK = true (large m_topk optimized version)
-template <class Type, bool UE8M0_SF = false, bool SMALL_NUM_EXPERTS = false>
+// NVFP4 quantization kernel for LARGE_M_TOPK = true (large m_topk optimized
+// version). When FUSE_SILU_MUL=true, expects input with gate||up layout and
+// fuses SiLU(gate)*up before quantization.
+template <class Type, bool FUSE_SILU_MUL = false, bool UE8M0_SF = false,
+          bool SMALL_NUM_EXPERTS = false>
__global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
    cvt_fp16_to_fp4(int32_t numRows, int32_t numCols, Type const* in,
                    float const* SFScale, uint32_t* out, uint32_t* SFout,
@@ -140,6 +160,10 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
                (CVT_FP4_SF_VEC_SIZE / CVT_FP4_ELTS_PER_THREAD);
  static_assert(sizeof(PackedVec) == sizeof(Type) * CVT_FP4_ELTS_PER_THREAD,
                "Vec size is not matched.");

+  // Precompute SF layout parameter (constant for entire kernel).
+  int32_t const numKTiles = (numCols + 63) / 64;
+
  extern __shared__ uint32_t shared_input_offsets[];

  // Load input offsets into shared memory.
@@ -163,6 +187,8 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int colsPerRow = numCols / CVT_FP4_ELTS_PER_THREAD;
+  // When fusing SiLU+Mul, input has gate || up layout (doubled width)
+  int inColsPerRow = FUSE_SILU_MUL ? colsPerRow * 2 : colsPerRow;

  // Each global thread processes one element
  for (int globalIdx = tid; globalIdx < numRows * colsPerRow;
@@ -171,11 +197,6 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
    int rowIdx = globalIdx / colsPerRow;
    int colIdx = globalIdx % colsPerRow;

-    int64_t inOffset = rowIdx * colsPerRow + colIdx;
-    PackedVec in_vec = reinterpret_cast<PackedVec const*>(in)[inOffset];
-    int64_t outOffset = inOffset;
-    auto& out_pos = out[outOffset];
-
    // Find expert using binary search for better performance with large m_topk
    int rowIdx_in_expert = 0;
    int expert_idx = 0;
@@ -200,34 +221,43 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
      }
    }

+    // Load input and optionally apply fused SiLU+Mul
+    int64_t inOffset = rowIdx * inColsPerRow + colIdx;
+    PackedVec in_vec = reinterpret_cast<PackedVec const*>(in)[inOffset];
+    PackedVec quant_input;
+    if constexpr (FUSE_SILU_MUL) {
+      PackedVec in_vec_up =
+          reinterpret_cast<PackedVec const*>(in)[inOffset + colsPerRow];
+      quant_input = compute_silu_mul(in_vec, in_vec_up);
+    } else {
+      quant_input = in_vec;
+    }
+
+    int64_t outOffset = rowIdx * colsPerRow + colIdx;
+    auto& out_pos = out[outOffset];
+
    float const SFScaleVal = SFScale == nullptr ? 1.0f : SFScale[expert_idx];

-    int factor = CVT_FP4_SF_VEC_SIZE * 4;
-    int32_t numCols_padded = (numCols + factor - 1) / factor * factor;
-    int numCols_SFout = numCols_padded / CVT_FP4_SF_VEC_SIZE / 4;
    uint32_t* SFout_in_expert =
-        SFout + output_scale_offset_by_experts[expert_idx] * numCols_SFout;
+        SFout + output_scale_offset_by_experts[expert_idx] * numKTiles;

    auto sf_out =
        cvt_quant_to_fp4_get_sf_out_offset<uint32_t,
                                           CVT_FP4_NUM_THREADS_PER_SF>(
-            rowIdx_in_expert, colIdx, numCols, SFout_in_expert);
+            rowIdx_in_expert, colIdx, numKTiles, SFout_in_expert);

-    out_pos = cvt_warp_fp16_to_fp4<Type, UE8M0_SF>(in_vec, SFScaleVal, sf_out);
+    out_pos =
+        cvt_warp_fp16_to_fp4<Type, UE8M0_SF>(quant_input, SFScaleVal, sf_out);
  }
}
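With `FUSE_SILU_MUL` the input row is twice as wide (gate || up), so the kernel reads the gate vector at `rowIdx * inColsPerRow + colIdx` and the matching up vector half a row later, at `+ colsPerRow`, while the quantized output keeps the un-doubled width. The index math in packed-vector units, sketched in plain C++ (function names are ours):

```cpp
#include <cassert>
#include <cstdint>

// Offsets in packed-vector units for a fused gate||up row of width
// 2 * cols_per_row; the quantized output keeps width cols_per_row.
int64_t gate_offset(int64_t row, int64_t col, int64_t cols_per_row) {
  return row * (2 * cols_per_row) + col;  // input row is doubled
}
int64_t up_offset(int64_t row, int64_t col, int64_t cols_per_row) {
  return gate_offset(row, col, cols_per_row) + cols_per_row;
}
int64_t out_offset(int64_t row, int64_t col, int64_t cols_per_row) {
  return row * cols_per_row + col;  // output row is not doubled
}
```

When the fusion is disabled, `inColsPerRow == colsPerRow` and the input and output offsets coincide, which is exactly the pre-change behavior.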
-template <typename T>
+template <typename T, bool FUSE_SILU_MUL = false>
void quant_impl(void* output, void* output_scale, void* input,
                void* input_global_scale, void* input_offset_by_experts,
                void* output_scale_offset_by_experts, int m_topk, int k,
                int n_experts, cudaStream_t stream) {
-  // TODO: this multiProcessorCount should be cached.
-  int device;
-  cudaGetDevice(&device);
-  int multiProcessorCount;
-  cudaDeviceGetAttribute(&multiProcessorCount, cudaDevAttrMultiProcessorCount,
-                         device);
+  int multiProcessorCount =
+      get_device_attribute(cudaDevAttrMultiProcessorCount, -1);

  // Grid, Block size.
  // Each thread converts 8 values.
@@ -249,7 +279,7 @@ void quant_impl(void* output, void* output_scale, void* input,
  if (blockRepeat > 1) {
    size_t shared_mem_size = (n_experts + 1) * sizeof(uint32_t);
    if (n_experts >= 4) {
-      cvt_fp16_to_fp4<T, false, false>
+      cvt_fp16_to_fp4<T, FUSE_SILU_MUL, false, false>
          <<<grid, block, shared_mem_size, stream>>>(
              m_topk, k, reinterpret_cast<T*>(input),
              reinterpret_cast<float*>(input_global_scale),
              reinterpret_cast<uint32_t*>(output),
              reinterpret_cast<uint32_t*>(output_scale),
              reinterpret_cast<uint32_t*>(input_offset_by_experts),
              reinterpret_cast<uint32_t*>(output_scale_offset_by_experts),
              n_experts);
    } else {
-      cvt_fp16_to_fp4<T, false, true><<<grid, block, shared_mem_size, stream>>>(
-          m_topk, k, reinterpret_cast<T*>(input),
-          reinterpret_cast<float*>(input_global_scale),
-          reinterpret_cast<uint32_t*>(output),
-          reinterpret_cast<uint32_t*>(output_scale),
-          reinterpret_cast<uint32_t*>(input_offset_by_experts),
-          reinterpret_cast<uint32_t*>(output_scale_offset_by_experts),
-          n_experts);
+      cvt_fp16_to_fp4<T, FUSE_SILU_MUL, false, true>
+          <<<grid, block, shared_mem_size, stream>>>(
+              m_topk, k, reinterpret_cast<T*>(input),
+              reinterpret_cast<float*>(input_global_scale),
+              reinterpret_cast<uint32_t*>(output),
+              reinterpret_cast<uint32_t*>(output_scale),
+              reinterpret_cast<uint32_t*>(input_offset_by_experts),
+              reinterpret_cast<uint32_t*>(output_scale_offset_by_experts),
+              n_experts);
    }
  } else {
    if (n_experts >= 16) {
-      cvt_fp16_to_fp4<T, false, false><<<grid, block, 0, stream>>>(
-          m_topk, k, reinterpret_cast<T*>(input),
-          reinterpret_cast<float*>(input_global_scale),
-          reinterpret_cast<uint32_t*>(output),
-          reinterpret_cast<uint32_t*>(output_scale),
-          reinterpret_cast<uint32_t*>(input_offset_by_experts),
-          reinterpret_cast<uint32_t*>(output_scale_offset_by_experts),
-          n_experts, /* bool low_latency */ true);
+      cvt_fp16_to_fp4<T, FUSE_SILU_MUL, false, false>
+          <<<grid, block, 0, stream>>>(
+              m_topk, k, reinterpret_cast<T*>(input),
+              reinterpret_cast<float*>(input_global_scale),
+              reinterpret_cast<uint32_t*>(output),
+              reinterpret_cast<uint32_t*>(output_scale),
+              reinterpret_cast<uint32_t*>(input_offset_by_experts),
+              reinterpret_cast<uint32_t*>(output_scale_offset_by_experts),
+              n_experts, /* bool low_latency */ true);
    } else {
-      cvt_fp16_to_fp4<T, false, true><<<grid, block, 0, stream>>>(
-          m_topk, k, reinterpret_cast<T*>(input),
-          reinterpret_cast<float*>(input_global_scale),
-          reinterpret_cast<uint32_t*>(output),
-          reinterpret_cast<uint32_t*>(output_scale),
-          reinterpret_cast<uint32_t*>(input_offset_by_experts),
-          reinterpret_cast<uint32_t*>(output_scale_offset_by_experts),
-          n_experts, /* bool low_latency */ true);
+      cvt_fp16_to_fp4<T, FUSE_SILU_MUL, false, true>
+          <<<grid, block, 0, stream>>>(
+              m_topk, k, reinterpret_cast<T*>(input),
+              reinterpret_cast<float*>(input_global_scale),
+              reinterpret_cast<uint32_t*>(output),
+              reinterpret_cast<uint32_t*>(output_scale),
+              reinterpret_cast<uint32_t*>(input_offset_by_experts),
+              reinterpret_cast<uint32_t*>(output_scale_offset_by_experts),
+              n_experts, /* bool low_latency */ true);
    }
  }
}
@@ -307,19 +340,19 @@ constexpr auto FLOAT = at::ScalarType::Float;
constexpr auto INT = at::ScalarType::Int;
constexpr auto UINT8 = at::ScalarType::Byte;

-void scaled_fp4_experts_quant_sm1xxa(
-    torch::Tensor& output, torch::Tensor& output_scale,
+// Common validation for fp4 experts quantization entry points.
+static void validate_fp4_experts_quant_inputs(
+    torch::Tensor const& output, torch::Tensor const& output_scale,
    torch::Tensor const& input, torch::Tensor const& input_global_scale,
    torch::Tensor const& input_offset_by_experts,
-    torch::Tensor const& output_scale_offset_by_experts) {
-  CHECK_INPUT(output, "output must be a CUDA tensor");
-  CHECK_INPUT(output_scale, "output_scale must be a CUDA tensor");
-  CHECK_INPUT(input, "input must be a CUDA tensor");
-  CHECK_INPUT(input_global_scale, "input_global_scale must be a CUDA tensor");
-  CHECK_INPUT(input_offset_by_experts,
-              "input_offset_by_experts must be a CUDA tensor");
-  CHECK_INPUT(output_scale_offset_by_experts,
-              "output_scale_offset_by_experts must be a CUDA tensor");
+    torch::Tensor const& output_scale_offset_by_experts, int64_t m_topk,
+    int64_t k) {
+  CHECK_INPUT(output, "output");
+  CHECK_INPUT(output_scale, "output_scale");
+  CHECK_INPUT(input, "input");
+  CHECK_INPUT(input_global_scale, "input_global_scale");
+  CHECK_INPUT(input_offset_by_experts, "input_offset_by_experts");
+  CHECK_INPUT(output_scale_offset_by_experts, "output_scale_offset_by_experts");

  TORCH_CHECK(output.dim() == 2);
  TORCH_CHECK(output_scale.dim() == 2);
@@ -338,8 +371,6 @@ void scaled_fp4_experts_quant_sm1xxa(
  TORCH_CHECK(output_scale.scalar_type() == INT);

  const int BLOCK_SIZE = 16;
-  auto m_topk = input.size(0);
-  auto k = input.size(1);
  TORCH_CHECK(k % BLOCK_SIZE == 0, "k must be a multiple of 16");
  auto n_experts = input_global_scale.size(0);
  TORCH_CHECK(input_offset_by_experts.size(0) == n_experts + 1);
@@ -351,7 +382,21 @@ void scaled_fp4_experts_quant_sm1xxa(
  int padded_k = (scales_k + (4 - 1)) / 4 * 4;
  // 4 means 4 fp8 values are packed into one int32
  TORCH_CHECK(output_scale.size(1) * 4 == padded_k);
+}
+
+void scaled_fp4_experts_quant_sm1xxa(
+    torch::Tensor& output, torch::Tensor& output_scale,
+    torch::Tensor const& input, torch::Tensor const& input_global_scale,
+    torch::Tensor const& input_offset_by_experts,
+    torch::Tensor const& output_scale_offset_by_experts) {
+  auto m_topk = input.size(0);
+  auto k = input.size(1);
+  validate_fp4_experts_quant_inputs(output, output_scale, input,
+                                    input_global_scale, input_offset_by_experts,
+                                    output_scale_offset_by_experts, m_topk, k);
+  auto n_experts = input_global_scale.size(0);

  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  const cudaStream_t stream =
      at::cuda::getCurrentCUDAStream(input.get_device());
@@ -359,7 +404,38 @@ void scaled_fp4_experts_quant_sm1xxa(
  VLLM_DISPATCH_HALF_TYPES(
      input.scalar_type(), "nvfp4_experts_quant_kernel", [&] {
        using cuda_type = vllm::CUDATypeConverter<scalar_t>::Type;
        vllm::quant_impl<cuda_type, /*FUSE_SILU_MUL=*/false>(
output.data_ptr(), output_scale.data_ptr(), input.data_ptr(),
input_global_scale.data_ptr(), input_offset_by_experts.data_ptr(),
output_scale_offset_by_experts.data_ptr(), m_topk, k, n_experts,
stream);
});
}
void silu_and_mul_scaled_fp4_experts_quant_sm1xxa(
torch::Tensor& output, torch::Tensor& output_scale,
torch::Tensor const& input, torch::Tensor const& input_global_scale,
torch::Tensor const& input_offset_by_experts,
torch::Tensor const& output_scale_offset_by_experts) {
auto m_topk = input.size(0);
// Input has gate || up layout, so k = input.size(1) / 2
auto k_times_2 = input.size(1);
TORCH_CHECK(k_times_2 % 2 == 0, "input width must be even (gate || up)");
auto k = k_times_2 / 2;
validate_fp4_experts_quant_inputs(output, output_scale, input,
input_global_scale, input_offset_by_experts,
output_scale_offset_by_experts, m_topk, k);
auto n_experts = input_global_scale.size(0);
const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
const cudaStream_t stream =
at::cuda::getCurrentCUDAStream(input.get_device());
VLLM_DISPATCH_HALF_TYPES(
input.scalar_type(), "silu_mul_nvfp4_experts_quant_kernel", [&] {
using cuda_type = vllm::CUDATypeConverter<scalar_t>::Type;
vllm::quant_impl<cuda_type, /*FUSE_SILU_MUL=*/true>(
            output.data_ptr(), output_scale.data_ptr(), input.data_ptr(),
            input_global_scale.data_ptr(), input_offset_by_experts.data_ptr(),
            output_scale_offset_by_experts.data_ptr(), m_topk, k, n_experts,

View File

@@ -41,6 +41,15 @@ void silu_and_mul_nvfp4_quant_sm1xxa(torch::Tensor& output,
                                  torch::Tensor& input_sf);
#endif
#if (defined(ENABLE_NVFP4_SM100) && ENABLE_NVFP4_SM100) || \
(defined(ENABLE_NVFP4_SM120) && ENABLE_NVFP4_SM120)
void silu_and_mul_scaled_fp4_experts_quant_sm1xxa(
torch::Tensor& output, torch::Tensor& output_scale,
torch::Tensor const& input, torch::Tensor const& input_global_scale,
torch::Tensor const& input_offset_by_experts,
torch::Tensor const& output_scale_offset_by_experts);
#endif
void scaled_fp4_quant(torch::Tensor& output, torch::Tensor const& input,
                      torch::Tensor& output_sf, torch::Tensor const& input_sf) {
#if (defined(ENABLE_NVFP4_SM100) && ENABLE_NVFP4_SM100) || \
@@ -74,3 +83,18 @@ void silu_and_mul_nvfp4_quant(torch::Tensor& output, torch::Tensor& output_sf,
  TORCH_CHECK_NOT_IMPLEMENTED(
      false, "No compiled silu_and_mul nvfp4 quantization kernel");
}
void silu_and_mul_scaled_fp4_experts_quant(
torch::Tensor& output, torch::Tensor& output_scale,
torch::Tensor const& input, torch::Tensor const& input_global_scale,
torch::Tensor const& input_offset_by_experts,
torch::Tensor const& output_scale_offset_by_experts) {
#if (defined(ENABLE_NVFP4_SM100) && ENABLE_NVFP4_SM100) || \
(defined(ENABLE_NVFP4_SM120) && ENABLE_NVFP4_SM120)
return silu_and_mul_scaled_fp4_experts_quant_sm1xxa(
output, output_scale, input, input_global_scale, input_offset_by_experts,
output_scale_offset_by_experts);
#endif
TORCH_CHECK_NOT_IMPLEMENTED(
false, "No compiled silu_and_mul nvfp4 experts quantization kernel");
}

View File

@@ -35,7 +35,13 @@ template <typename Int>
__host__ __device__ inline Int round_up(Int x, Int y) {
  static_assert(std::is_integral_v<Int>,
                "round_up argument must be integral type");
  return ((x + y - 1) / y) * y;
}
// Compute effective rows for grid configuration with swizzled SF layouts.
inline int computeEffectiveRows(int m) {
constexpr int ROW_TILE = 128;
return round_up(m, ROW_TILE);
}

// Use UE4M3 by default.
@@ -49,81 +55,57 @@ __global__ void __launch_bounds__(512, VLLM_BLOCKS_PER_SM(512))
  static_assert(sizeof(PackedVec) == sizeof(Type) * CVT_FP4_ELTS_PER_THREAD,
                "Vec size is not matched.");
  // Precompute SF layout parameter (constant for entire kernel).
  int32_t const numKTiles = (numCols + 63) / 64;
  int sf_m = round_up<int>(numRows, 128);
  int sf_n_unpadded = numCols / CVT_FP4_SF_VEC_SIZE;
  int sf_n_int = round_up<int>(sf_n_unpadded, 4) / 4;
  int num_padded_cols = sf_n_int * 4 * CVT_FP4_SF_VEC_SIZE;
  // Get the global scaling factor, which will be applied to the SF.
  // Note SFScale is the same as next GEMM's alpha, which is
  // (448.f / (Alpha_A / 6.f)).
  float const global_scale = SFScale == nullptr ? 1.0f : SFScale[0];
  // Iterate over all rows and cols including padded ones -
  // ensures we visit every single scale factor address to initialize it.
  for (int rowIdx = blockIdx.x; rowIdx < sf_m; rowIdx += gridDim.x) {
    for (int colIdx = threadIdx.x;
         colIdx < num_padded_cols / CVT_FP4_ELTS_PER_THREAD;
         colIdx += blockDim.x) {
      int elem_idx = colIdx * CVT_FP4_ELTS_PER_THREAD;
      PackedVec in_vec;
      int64_t inOffset = rowIdx * (numCols / CVT_FP4_ELTS_PER_THREAD) + colIdx;

      // If we are outside valid rows OR outside valid columns -> Use Zeros
      if (rowIdx >= numRows || elem_idx >= numCols) {
        memset(&in_vec, 0, sizeof(PackedVec));
      } else {
        // Valid Region: Load actual data
        in_vec = reinterpret_cast<PackedVec const*>(in)[inOffset];
      }

      auto sf_out =
          cvt_quant_to_fp4_get_sf_out_offset<uint32_t,
                                             CVT_FP4_NUM_THREADS_PER_SF>(
              rowIdx, colIdx, numKTiles, SFout);
      auto out_val =
          cvt_warp_fp16_to_fp4<Type, UE8M0_SF>(in_vec, global_scale, sf_out);

      // We do NOT write output for padding because the 'out' tensor is not
      // padded.
      if (rowIdx < numRows && elem_idx < numCols) {
        // Same as inOffset because 8 elements are packed into one uint32_t.
        out[inOffset] = out_val;
      }
    }
  }
}
template <typename T>
void invokeFP4Quantization(int m, int n, T const* input, float const* SFScale,
int64_t* output, int32_t* SFOuput, bool useUE8M0,
int multiProcessorCount, cudaStream_t stream) {
// Grid, Block size.
// Each thread converts 8 values.
dim3 block(std::min(int(n / ELTS_PER_THREAD), 512));
// Get number of blocks per SM
int const numBlocksPerSM =
vllm_runtime_blocks_per_sm(static_cast<int>(block.x));
dim3 grid(std::min(int(m), multiProcessorCount * numBlocksPerSM));
// Launch the cvt kernel.
if (useUE8M0) {
cvt_fp16_to_fp4<T, true><<<grid, block, 0, stream>>>(
m, n, input, SFScale, reinterpret_cast<uint32_t*>(output),
reinterpret_cast<uint32_t*>(SFOuput));
} else {
cvt_fp16_to_fp4<T, false><<<grid, block, 0, stream>>>(
m, n, input, SFScale, reinterpret_cast<uint32_t*>(output),
reinterpret_cast<uint32_t*>(SFOuput));
}
}
// Instantiate the function.
template void invokeFP4Quantization(int m, int n, half const* input,
float const* SFScale, int64_t* output,
int32_t* SFOuput, bool useUE8M0,
int multiProcessorCount,
cudaStream_t stream);
template void invokeFP4Quantization(int m, int n, __nv_bfloat16 const* input,
float const* SFScale, int64_t* output,
int32_t* SFOuput, bool useUE8M0,
int multiProcessorCount,
cudaStream_t stream);
}  // namespace vllm

void scaled_fp4_quant_sm1xxa(torch::Tensor const& output,
@@ -147,13 +129,19 @@ void scaled_fp4_quant_sm1xxa(torch::Tensor const& output,
  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  auto stream = at::cuda::getCurrentCUDAStream(input.get_device());

  // Grid, Block size. Each thread converts 8 values.
  dim3 block(std::min(int(n / ELTS_PER_THREAD), 512));
  int const numBlocksPerSM =
      vllm_runtime_blocks_per_sm(static_cast<int>(block.x));
  int effectiveRows = vllm::computeEffectiveRows(m);
  dim3 grid(std::min(effectiveRows, multiProcessorCount * numBlocksPerSM));

  VLLM_DISPATCH_HALF_TYPES(input.scalar_type(), "nvfp4_quant_kernel", [&] {
    using cuda_type = vllm::CUDATypeConverter<scalar_t>::Type;
    auto input_ptr = static_cast<cuda_type const*>(input.data_ptr());
    // NOTE: We don't support e8m0 scales at this moment.
    vllm::cvt_fp16_to_fp4<cuda_type, false><<<grid, block, 0, stream>>>(
        m, n, input_ptr, input_sf_ptr, reinterpret_cast<uint32_t*>(output_ptr),
        reinterpret_cast<uint32_t*>(sf_out));
  });
}

View File

@@ -128,51 +128,42 @@ inline __device__ float reciprocal_approximate_ftz(float a) {
  return b;
}
// Compute SF output offset for swizzled tensor core layout.
// SF layout: [numMTiles, numKTiles, 32, 4, 4]
// Caller must precompute: numKTiles = (numCols + 63) / 64
template <class SFType, int CVT_FP4_NUM_THREADS_PER_SF>
__device__ __forceinline__ uint8_t* cvt_quant_to_fp4_get_sf_out_offset(
    int rowIdx, int colIdx, int32_t numKTiles, SFType* SFout) {
  static_assert(CVT_FP4_NUM_THREADS_PER_SF == 1 ||
                CVT_FP4_NUM_THREADS_PER_SF == 2);

  // One pair of threads write one SF to global memory.
  // TODO: stage through smem for packed STG.32
  // is it better than STG.8 from 4 threads ?
  if (threadIdx.x % CVT_FP4_NUM_THREADS_PER_SF != 0) {
    return nullptr;
  }

  // SF vector index (16 elements share one SF in the K dimension).
  int32_t kIdx = colIdx / CVT_FP4_NUM_THREADS_PER_SF;
  int32_t mIdx = rowIdx;

  // Decompose indices using bitwise ops (all divisors are powers of 2).
  // SF layout [numMTiles, numKTiles, 32 (mTile), 4 (mTile), 4(kTile)]
  int32_t mTileIdx = mIdx >> 7;         // mIdx / 128
  int32_t outerMIdx = mIdx & 31;        // mIdx % 32
  int32_t innerMIdx = (mIdx >> 5) & 3;  // (mIdx / 32) % 4
  int32_t kTileIdx = kIdx >> 2;         // kIdx / 4
  int32_t innerKIdx = kIdx & 3;         // kIdx % 4

  // Compute global SF offset: mTileIdx * (numKTiles * 512) + kTileIdx * 512 +
  // outerMIdx * 16 + innerMIdx * 4 + innerKIdx
  // Use bitwise OR for non-overlapping lower bits.
  int64_t SFOffset = (static_cast<int64_t>(mTileIdx) * numKTiles + kTileIdx)
                             << 9 |
                     (outerMIdx << 4) | (innerMIdx << 2) | innerKIdx;

  return reinterpret_cast<uint8_t*>(SFout) + SFOffset;
}
// Quantizes the provided PackedVec into the uint32_t output
@@ -248,4 +239,34 @@ __device__ uint32_t cvt_warp_fp16_to_fp4(PackedVec<Type>& vec, float SFScaleVal,
  return e2m1Vec;
}
// silu in float32
__device__ __forceinline__ float silu(float x) {
return __fdividef(x, (1.f + __expf(-x)));
}
__device__ __forceinline__ float2 silu2(float2 x) {
return make_float2(silu(x.x), silu(x.y));
}
template <class Type>
__inline__ __device__ PackedVec<Type> compute_silu_mul(
const PackedVec<Type>& x_vec, const PackedVec<Type>& y_vec) {
PackedVec<Type> result;
#pragma unroll
for (int i = 0; i < CVT_FP4_ELTS_PER_THREAD / 2; ++i) {
// silu_mul in float32
if constexpr (std::is_same_v<Type, half>) {
float2 silu_vec = silu2(__half22float2(x_vec.elts[i]));
result.elts[i] = __float22half2_rn(
__fmul2_rn(silu_vec, __half22float2(y_vec.elts[i])));
} else {
float2 silu_vec = silu2(__bfloat1622float2(x_vec.elts[i]));
result.elts[i] = __float22bfloat162_rn(
__fmul2_rn(silu_vec, __bfloat1622float2(y_vec.elts[i])));
}
}
return result;
}
}  // namespace vllm

View File

@@ -233,11 +233,6 @@ __global__ void gemm_half_q_half_gptq_4bit_kernel(
  // Zero output
  if (n >= size_n) return;
if (blockIdx.z == 0) {
for (int m = 0; m < m_count; m++)
*((uint64_t*)c_.item_ptr(offset_m + m, n)) = 0;
}
  __syncthreads();

  // Find initial group
@@ -372,11 +367,6 @@ __global__ void gemm_half_q_half_gptq_2bit_kernel(
  // Zero output
  if (n >= size_n) return;
if (blockIdx.z == 0) {
for (int m = 0; m < m_count; m++)
*((uint64_t*)c_.item_ptr(offset_m + m, n)) = 0;
}
  __syncthreads();

  // Find initial group
@@ -494,11 +484,6 @@ __global__ void gemm_half_q_half_gptq_3bit_kernel(
  // Zero output
  if (n >= size_n) return;
if (blockIdx.z == 0) {
for (int m = 0; m < m_count; m++)
*((uint64_t*)c_.item_ptr(offset_m + m, n)) = 0;
}
  __syncthreads();

  // Find initial group
@@ -623,11 +608,6 @@ __global__ void gemm_half_q_half_gptq_8bit_kernel(
  // Zero output
  if (n >= size_n) return;
if (blockIdx.z == 0) {
for (int m = 0; m < m_count; m++)
*((uint64_t*)c_.item_ptr(offset_m + m, n)) = 0;
}
  __syncthreads();

  // Find initial group
@@ -1224,9 +1204,6 @@ __global__ void gemm_half_q_half_alt_4bit_kernel(
        __halves2half2(__int2half_rn(val & 0xF), __int2half_rn(val >> 4));
  }
if (blockIdx.z == 0) {
for (int m = 0; m < b_end; m++) mul[(b + m) * width + w] = __int2half_rn(0);
}
  __syncthreads();

  int i = width * h + w;
@@ -1319,9 +1296,6 @@ __global__ void gemm_half_q_half_alt_8bit_kernel(
    }
  }
if (blockIdx.z == 0) {
for (int m = 0; m < b_end; m++) mul[(b + m) * width + w] = __int2half_rn(0);
}
  __syncthreads();

  int i = width * h + w;
@@ -1857,7 +1831,7 @@ torch::Tensor gptq_gemm(torch::Tensor a, torch::Tensor b_q_weight,
                        bool use_exllama, bool use_v2_format, int64_t bit) {
  const at::cuda::OptionalCUDAGuard device_guard(device_of(a));
  auto options = torch::TensorOptions().dtype(a.dtype()).device(a.device());
  at::Tensor c = torch::zeros({a.size(0), b_q_weight.size(1)}, options);
  at::Tensor temp_dq = torch::empty(
      {b_q_weight.size(0) * 32 / bit, b_q_weight.size(1)}, options);

View File

@@ -1,373 +0,0 @@
#include "core/registration.h"
#include <torch/all.h>
#include <cutlass/arch/arch.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>
#include "cute/tensor.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/collective/default_epilogue.hpp"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/group_array_problem_shape.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_fill.h"
#include "cutlass/util/reference/host/gett.hpp"
#include "cutlass/util/reference/host/tensor_norm.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include <cassert>
using namespace cute;
template <typename ElementAB, typename ElementC, typename ElementAccumulator,
typename LayoutSFA, typename LayoutSFB, typename ScaleConfig>
__global__ void get_ggemm_starts(
int32_t* expert_offsets, ElementAB** a_offsets, ElementAB** b_offsets,
ElementC** out_offsets, ElementAccumulator** a_scale_offsets,
ElementAccumulator** b_scale_offsets, ElementAB* a_base_as_int,
ElementAB* b_base_as_int, ElementC* out_base_as_int,
ElementAccumulator* a_scale_base_as_int,
ElementAccumulator* b_scale_base_as_int, LayoutSFA* layout_sfa_base_as_int,
LayoutSFB* layout_sfb_base_as_int, int* problem_sizes) {
int expert_id = threadIdx.x;
if (expert_id >= gridDim.x * blockDim.x) {
return;
}
int m = problem_sizes[expert_id * 3];
int n = problem_sizes[expert_id * 3 + 1];
int k = problem_sizes[expert_id * 3 + 2];
int32_t expert_offset = expert_offsets[expert_id];
int a_stride = expert_offset * k;
int b_stride = expert_id * k * n;
int a_scale_stride = expert_offset * k / 128;
int b_scale_stride = expert_id * k * n / 128 / 128;
a_offsets[expert_id] = a_base_as_int + a_stride;
b_offsets[expert_id] = b_base_as_int + b_stride;
out_offsets[expert_id] = out_base_as_int + expert_offset * n;
a_scale_offsets[expert_id] = a_scale_base_as_int + a_scale_stride;
b_scale_offsets[expert_id] = b_scale_base_as_int + b_scale_stride;
LayoutSFA* layout_sfa_ptr = layout_sfa_base_as_int + expert_id;
LayoutSFB* layout_sfb_ptr = layout_sfb_base_as_int + expert_id;
*layout_sfa_ptr =
ScaleConfig::tile_atom_to_shape_SFA(cute::make_shape(m, n, k, 1));
*layout_sfb_ptr =
ScaleConfig::tile_atom_to_shape_SFB(cute::make_shape(m, n, k, 1));
}
#define __CALL_GET_STARTS_KERNEL(TENSOR_C_TYPE, C_TYPE, LayoutSFA, LayoutSFB, \
ScaleConfig) \
else if (out_tensors.dtype() == TENSOR_C_TYPE) { \
get_ggemm_starts<cutlass::float_e4m3_t, C_TYPE, float, LayoutSFA, \
LayoutSFB, ScaleConfig><<<1, num_experts, 0, stream>>>( \
static_cast<int32_t*>(expert_offsets.data_ptr()), \
static_cast<cutlass::float_e4m3_t**>(a_ptrs.data_ptr()), \
static_cast<cutlass::float_e4m3_t**>(b_ptrs.data_ptr()), \
static_cast<C_TYPE**>(out_ptrs.data_ptr()), \
static_cast<float**>(a_scales_ptrs.data_ptr()), \
static_cast<float**>(b_scales_ptrs.data_ptr()), \
static_cast<cutlass::float_e4m3_t*>(a_tensors.data_ptr()), \
static_cast<cutlass::float_e4m3_t*>(b_tensors.data_ptr()), \
static_cast<C_TYPE*>(out_tensors.data_ptr()), \
static_cast<float*>(a_scales.data_ptr()), \
static_cast<float*>(b_scales.data_ptr()), \
reinterpret_cast<LayoutSFA*>(layout_sfa.data_ptr()), \
reinterpret_cast<LayoutSFB*>(layout_sfb.data_ptr()), \
static_cast<int*>(problem_sizes.data_ptr())); \
}
template <typename LayoutSFA, typename LayoutSFB, typename ScaleConfig>
void run_get_ggemm_starts(
torch::Tensor const& expert_offsets, torch::Tensor& a_ptrs,
torch::Tensor& b_ptrs, torch::Tensor& out_ptrs,
torch::Tensor& a_scales_ptrs, torch::Tensor& b_scales_ptrs,
torch::Tensor const& a_tensors, torch::Tensor const& b_tensors,
torch::Tensor out_tensors, torch::Tensor const& a_scales,
torch::Tensor const& b_scales, torch::Tensor const& layout_sfa,
torch::Tensor const& layout_sfb, torch::Tensor const& problem_sizes) {
TORCH_CHECK(a_tensors.dtype() == torch::kFloat8_e4m3fn);
TORCH_CHECK(b_tensors.dtype() == torch::kFloat8_e4m3fn);
TORCH_CHECK(a_scales.dtype() == torch::kFloat32);
TORCH_CHECK(b_scales.dtype() == torch::kFloat32);
TORCH_CHECK(out_tensors.size(1) % 128 == 0 or out_tensors.size(0) % 128 == 0);
TORCH_CHECK(a_tensors.size(1) % 128 == 0 or a_tensors.size(0) % 128 == 0);
int num_experts = (int)expert_offsets.size(0);
auto stream = at::cuda::getCurrentCUDAStream(a_tensors.device().index());
if (false) {
}
__CALL_GET_STARTS_KERNEL(torch::kBFloat16, cutlass::bfloat16_t, LayoutSFA,
LayoutSFB, ScaleConfig)
__CALL_GET_STARTS_KERNEL(torch::kFloat16, cutlass::half_t, LayoutSFA,
LayoutSFB, ScaleConfig)
else {
TORCH_CHECK(false, "Unsupported output tensor type");
}
}
template <typename OutType, typename ScheduleConfig, typename LayoutD>
void run_blockwise_scaled_group_mm(
torch::Tensor& out_ptrs, const torch::Tensor& a_ptrs,
const torch::Tensor& b_ptrs, const torch::Tensor& a_scales_ptrs,
const torch::Tensor& b_scales_ptrs, const torch::Tensor& stride_a,
const torch::Tensor& stride_b, const torch::Tensor& stride_c,
const torch::Tensor& layout_sfa, const torch::Tensor& layout_sfb,
const torch::Tensor& problem_sizes, const torch::Tensor& expert_offsets) {
using ProblemShape = cutlass::gemm::GroupProblemShape<Shape<int, int, int>>;
// Types
using ElementA = cutlass::float_e4m3_t;
using ElementB = cutlass::float_e4m3_t;
using ElementC = OutType;
using ElementD = ElementC;
using ElementAccumulator = float;
using LayoutA = cutlass::layout::RowMajor;
using LayoutB = cutlass::layout::ColumnMajor;
using LayoutC = LayoutD;
// Alignments
static constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value;
static constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value;
static constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value;
using ArchTag = cutlass::arch::Sm100;
using OperatorClass = cutlass::arch::OpClassTensorOp;
using CollectiveEpilogue =
typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, OperatorClass, typename ScheduleConfig::MmaTileShape,
typename ScheduleConfig::ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto, ElementAccumulator,
ElementAccumulator, void, LayoutC*, AlignmentC, ElementD, LayoutC*,
AlignmentC, typename ScheduleConfig::EpilogueSchedule>::CollectiveOp;
using CollectiveMainloop =
typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass, ElementA,
cute::tuple<LayoutA*, typename ScheduleConfig::LayoutSFA*>,
AlignmentA, ElementB,
cute::tuple<LayoutB*, typename ScheduleConfig::LayoutSFB*>,
AlignmentB, ElementAccumulator, typename ScheduleConfig::MmaTileShape,
typename ScheduleConfig::ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(
sizeof(typename CollectiveEpilogue::SharedStorage))>,
typename ScheduleConfig::KernelSchedule>::CollectiveOp;
using GemmKernel =
cutlass::gemm::kernel::GemmUniversal<ProblemShape, CollectiveMainloop,
CollectiveEpilogue, void>;
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
using StrideA = typename Gemm::GemmKernel::InternalStrideA;
using StrideB = typename Gemm::GemmKernel::InternalStrideB;
using StrideC = typename Gemm::GemmKernel::InternalStrideC;
using StrideD = typename Gemm::GemmKernel::InternalStrideD;
using UnderlyingProblemShape = ProblemShape::UnderlyingProblemShape;
int num_experts = (int)expert_offsets.size(0);
Gemm gemm_op;
// Mainloop Arguments
typename GemmKernel::MainloopArguments mainloop_args{
static_cast<const ElementA**>(a_ptrs.data_ptr()),
static_cast<StrideA*>(stride_a.data_ptr()),
static_cast<const ElementB**>(b_ptrs.data_ptr()),
static_cast<StrideB*>(stride_b.data_ptr()),
static_cast<const ElementAccumulator**>(a_scales_ptrs.data_ptr()),
reinterpret_cast<typename ScheduleConfig::LayoutSFA*>(
layout_sfa.data_ptr()),
static_cast<const ElementAccumulator**>(b_scales_ptrs.data_ptr()),
reinterpret_cast<typename ScheduleConfig::LayoutSFB*>(
layout_sfb.data_ptr())};
int device_id = a_ptrs.device().index();
static const cutlass::KernelHardwareInfo hw_info{
device_id, cutlass::KernelHardwareInfo::query_device_multiprocessor_count(
device_id)};
// Epilogue Arguments
typename GemmKernel::EpilogueArguments epilogue_args{
{}, // epilogue.thread
nullptr,
static_cast<StrideC*>(stride_c.data_ptr()),
static_cast<ElementD**>(out_ptrs.data_ptr()),
static_cast<StrideC*>(stride_c.data_ptr())};
UnderlyingProblemShape* problem_sizes_as_shapes =
static_cast<UnderlyingProblemShape*>(problem_sizes.data_ptr());
// Gemm Arguments
typename GemmKernel::Arguments args{
cutlass::gemm::GemmUniversalMode::kGrouped,
{num_experts, problem_sizes_as_shapes, nullptr},
mainloop_args,
epilogue_args,
hw_info};
at::cuda::CUDAGuard device_guard{(char)a_ptrs.device().index()};
const cudaStream_t stream =
at::cuda::getCurrentCUDAStream(a_ptrs.get_device());
auto can_implement_status = gemm_op.can_implement(args);
TORCH_CHECK(can_implement_status == cutlass::Status::kSuccess,
"Failed to implement GEMM");
size_t workspace_size = gemm_op.get_workspace_size(args);
auto const workspace_options =
torch::TensorOptions().dtype(torch::kUInt8).device(a_ptrs.device());
auto workspace = torch::empty(workspace_size, workspace_options);
auto status = gemm_op.initialize(args, workspace.data_ptr(), stream);
TORCH_CHECK(status == cutlass::Status::kSuccess, "Failed to initialize GEMM");
status = gemm_op.run(stream);
TORCH_CHECK(status == cutlass::Status::kSuccess, "Failed to run GEMM");
}
template <typename OutType>
void blockwise_scaled_group_mm_dispatch_shape(
torch::Tensor& output, const torch::Tensor& a, const torch::Tensor& b,
const torch::Tensor& scales_a, const torch::Tensor& scales_b,
const torch::Tensor& problem_sizes, const torch::Tensor& expert_offsets) {
struct MmaConfig {
using ElementA = cutlass::float_e4m3_t;
using KernelSchedule =
cutlass::gemm::KernelPtrArrayTmaWarpSpecializedBlockwise1SmSm100;
using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm;
using ScaleConfig = cutlass::detail::Sm100BlockwiseScaleConfig<
1, 128, 128, cute::UMMA::Major::K, cute::UMMA::Major::K>;
using LayoutSFA = decltype(ScaleConfig::deduce_layoutSFA());
using LayoutSFB = decltype(ScaleConfig::deduce_layoutSFB());
using LayoutC = cutlass::layout::RowMajor;
using MmaTileShape = Shape<_128, _128, _128>;
using ClusterShape = Shape<_1, _1, _1>;
};
int num_experts = (int)expert_offsets.size(0);
auto a_ptrs = torch::empty(
{num_experts},
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
auto b_ptrs = torch::empty(
{num_experts},
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
auto out_ptrs = torch::empty(
{num_experts},
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
auto a_scales_ptrs = torch::empty(
{num_experts},
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
auto b_scales_ptrs = torch::empty(
{num_experts},
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
auto layout_sfa = torch::empty(
{num_experts, 5},
torch::TensorOptions().dtype(torch::kInt32).device(a.device()));
auto layout_sfb = torch::empty(
{num_experts, 5},
torch::TensorOptions().dtype(torch::kInt32).device(a.device()));
auto stride_a = torch::full(
{num_experts}, a.size(1),
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
auto stride_b = torch::full(
{num_experts}, a.size(1),
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
auto stride_c = torch::full(
{num_experts}, output.size(1),
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
torch::TensorOptions options_int =
torch::TensorOptions().dtype(torch::kInt64).device(a.device());
run_get_ggemm_starts<typename MmaConfig::LayoutSFA,
typename MmaConfig::LayoutSFB,
typename MmaConfig::ScaleConfig>(
expert_offsets, a_ptrs, b_ptrs, out_ptrs, a_scales_ptrs, b_scales_ptrs, a,
b, output, scales_a, scales_b, layout_sfa, layout_sfb, problem_sizes);
run_blockwise_scaled_group_mm<OutType, MmaConfig,
typename MmaConfig::LayoutC>(
out_ptrs, a_ptrs, b_ptrs, a_scales_ptrs, b_scales_ptrs, stride_a,
stride_b, stride_c, layout_sfa, layout_sfb, problem_sizes,
expert_offsets);
}
void cutlass_blockwise_scaled_grouped_mm(
torch::Tensor& output, const torch::Tensor& a, const torch::Tensor& b,
const torch::Tensor& scales_a, const torch::Tensor& scales_b,
const torch::Tensor& problem_sizes, const torch::Tensor& expert_offsets) {
TORCH_CHECK(problem_sizes.dim() == 2, "problem_sizes must be 2D tensor");
TORCH_CHECK(problem_sizes.size(1) == 3,
"problem_sizes must have shape (num_experts, 3)");
TORCH_CHECK(problem_sizes.size(0) == expert_offsets.size(0),
"Number of experts in problem_sizes must match expert_offsets");
TORCH_CHECK(problem_sizes.dtype() == torch::kInt32,
"problem_sizes must be int32");
TORCH_CHECK(a.scalar_type() == torch::kFloat8_e4m3fn,
"a must be kFloat8_e4m3fn");
TORCH_CHECK(b.scalar_type() == torch::kFloat8_e4m3fn,
"b must be kFloat8_e4m3fn");
TORCH_CHECK(output.scalar_type() == torch::kBFloat16 ||
output.scalar_type() == torch::kHalf,
"output must be bfloat16 or half");
TORCH_CHECK(scales_a.scalar_type() == torch::kFloat32,
"scales_a must be float32");
TORCH_CHECK(scales_b.scalar_type() == torch::kFloat32,
"scales_b must be float32");
TORCH_CHECK(expert_offsets.scalar_type() == torch::kInt32,
"expert_offsets must be int32");
TORCH_CHECK(output.dim() == 2, "output must be 2D tensor");
TORCH_CHECK(a.dim() == 2, "a must be 2D tensor");
TORCH_CHECK(b.dim() == 3, "b must be 3D tensor");
TORCH_CHECK(scales_a.dim() == 2, "scales_a must be 2D tensor");
TORCH_CHECK(scales_b.dim() == 3, "scales_b must be 3D tensor");
TORCH_CHECK(problem_sizes.dim() == 2, "problem_sizes must be 2D tensor");
TORCH_CHECK(problem_sizes.size(1) == 3,
"problem_sizes must have shape (num_experts, 3)");
TORCH_CHECK(problem_sizes.size(0) == expert_offsets.size(0),
"Number of experts in problem_sizes must match expert_offsets");
TORCH_CHECK(problem_sizes.dtype() == torch::kInt32,
"problem_sizes must be int32");
TORCH_CHECK(expert_offsets.dim() == 1, "expert_offsets must be 1D tensor");
#if defined(ENABLE_CUTLASS_MOE_SM100) && ENABLE_CUTLASS_MOE_SM100
if (output.scalar_type() == torch::kBFloat16) {
blockwise_scaled_group_mm_dispatch_shape<cutlass::bfloat16_t>(
output, a, b, scales_a, scales_b, problem_sizes, expert_offsets);
} else if (output.scalar_type() == torch::kFloat16) {
blockwise_scaled_group_mm_dispatch_shape<cutlass::half_t>(
output, a, b, scales_a, scales_b, problem_sizes, expert_offsets);
} else {
TORCH_CHECK(false, "Unsupported output tensor type");
}
#endif
}
TORCH_LIBRARY_IMPL_EXPAND(TORCH_EXTENSION_NAME, CUDA, m) {
m.impl("cutlass_blockwise_scaled_grouped_mm",
&cutlass_blockwise_scaled_grouped_mm);
}

View File

@@ -3,6 +3,8 @@
 #include <c10/cuda/CUDAGuard.h>
 #include <torch/all.h>
+#include "dispatch_utils.h"
 #include <iostream>
 constexpr uint64_t THREADS_PER_EXPERT = 512;
@@ -114,22 +116,17 @@ inline void launch_compute_problem_sizes(const torch::Tensor& topk_ids,
     const bool swap_ab) {
   int num_threads = min(THREADS_PER_EXPERT, topk_ids.numel());
-  const int32_t* topk_ptr = static_cast<const int32_t*>(topk_ids.data_ptr());
-  int32_t* ps1_ptr = static_cast<int32_t*>(problem_sizes1.data_ptr());
-  int32_t* ps2_ptr = static_cast<int32_t*>(problem_sizes2.data_ptr());
-  int32_t* atomic_ptr = static_cast<int32_t*>(atomic_buffer.data_ptr());
-  if (swap_ab) {
-    compute_problem_sizes<true><<<num_experts, num_threads, 0, stream>>>(
-        topk_ptr, ps1_ptr, ps2_ptr, atomic_ptr,
-        static_cast<int>(topk_ids.numel()), static_cast<int>(n),
-        static_cast<int>(k));
-  } else {
-    compute_problem_sizes<false><<<num_experts, num_threads, 0, stream>>>(
-        topk_ptr, ps1_ptr, ps2_ptr, atomic_ptr,
-        static_cast<int>(topk_ids.numel()), static_cast<int>(n),
-        static_cast<int>(k));
-  }
+  auto const* topk_ptr = topk_ids.data_ptr<int32_t>();
+  auto* ps1_ptr = problem_sizes1.data_ptr<int32_t>();
+  auto* ps2_ptr = problem_sizes2.data_ptr<int32_t>();
+  auto* atomic_ptr = atomic_buffer.data_ptr<int32_t>();
+  VLLM_DISPATCH_BOOL(swap_ab, SwapAB, [&] {
+    compute_problem_sizes<SwapAB><<<num_experts, num_threads, 0, stream>>>(
+        topk_ptr, ps1_ptr, ps2_ptr, atomic_ptr,
+        static_cast<int>(topk_ids.numel()), static_cast<int>(n),
+        static_cast<int>(k));
+  });
 }
 }  // namespace
@@ -153,6 +150,93 @@ void get_cutlass_moe_mm_problem_sizes_caller(
       may_swap_ab);
 }
+
+template <bool SWAP_AB>
+__global__ void compute_problem_sizes_from_expert_offsets(
+    const int64_t* __restrict__ expert_first_token_offset,
+    int32_t* __restrict__ problem_sizes1, int32_t* __restrict__ problem_sizes2,
+    const int num_experts, const int n, const int k) {
+  int const expert_id = blockIdx.x * blockDim.x + threadIdx.x;
+  if (expert_id >= num_experts) {
+    return;
+  }
+  int64_t const m64 = expert_first_token_offset[expert_id + 1] -
+                      expert_first_token_offset[expert_id];
+  int32_t const m = static_cast<int32_t>(m64);
+  int32_t* ps1 = problem_sizes1 + expert_id * 3;
+  int32_t* ps2 = problem_sizes2 + expert_id * 3;
+  if constexpr (!SWAP_AB) {
+    // [M, 2*N, K]
+    ps1[0] = m;
+    ps1[1] = 2 * n;
+    ps1[2] = k;
+    // [M, K, N]
+    ps2[0] = m;
+    ps2[1] = k;
+    ps2[2] = n;
+  } else {
+    // swap logical M/N in the problem shape
+    // [2*N, M, K]
+    ps1[0] = 2 * n;
+    ps1[1] = m;
+    ps1[2] = k;
+    // [K, M, N]
+    ps2[0] = k;
+    ps2[1] = m;
+    ps2[2] = n;
+  }
+}
+
+void get_cutlass_moe_mm_problem_sizes_from_expert_offsets_caller(
+    const torch::Tensor& expert_first_token_offset,
+    torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2,
+    const int64_t n, const int64_t k, const bool swap_ab) {
+  TORCH_CHECK(expert_first_token_offset.is_cuda(),
+              "expert_first_token_offset must be a CUDA tensor");
+  TORCH_CHECK(expert_first_token_offset.dtype() == torch::kInt64,
+              "expert_first_token_offset must be int64");
+  TORCH_CHECK(problem_sizes1.is_cuda() && problem_sizes2.is_cuda(),
+              "problem_sizes must be CUDA tensors");
+  TORCH_CHECK(problem_sizes1.dtype() == torch::kInt32 &&
+                  problem_sizes2.dtype() == torch::kInt32,
+              "problem_sizes must be int32");
+  TORCH_CHECK(problem_sizes1.is_contiguous() && problem_sizes2.is_contiguous(),
+              "problem_sizes must be contiguous");
+  TORCH_CHECK(problem_sizes1.dim() == 2 && problem_sizes2.dim() == 2,
+              "problem_sizes must be 2D tensors");
+  TORCH_CHECK(problem_sizes1.size(1) == 3 && problem_sizes2.size(1) == 3,
+              "problem_sizes second dim must be 3");
+  TORCH_CHECK(problem_sizes1.sizes() == problem_sizes2.sizes(),
+              "problem_sizes1 and problem_sizes2 must have same shape");
+  int64_t const num_experts64 = problem_sizes1.size(0);
+  TORCH_CHECK(expert_first_token_offset.numel() == num_experts64 + 1,
+              "expert_first_token_offset must have num_experts + 1 elements");
+  TORCH_CHECK(num_experts64 <= INT32_MAX, "num_experts must fit in int32");
+  TORCH_CHECK(n <= INT32_MAX && k <= INT32_MAX, "n and k must fit in int32");
+  int const num_experts = static_cast<int>(num_experts64);
+  auto stream = at::cuda::getCurrentCUDAStream(
+      expert_first_token_offset.device().index());
+  int const threads = (num_experts < 256) ? num_experts : 256;
+  int const blocks = (num_experts + threads - 1) / threads;
+  auto const* offsets_ptr = expert_first_token_offset.data_ptr<int64_t>();
+  auto* ps1_ptr = problem_sizes1.data_ptr<int32_t>();
+  auto* ps2_ptr = problem_sizes2.data_ptr<int32_t>();
+  VLLM_DISPATCH_BOOL(swap_ab, SwapAB, [&] {
+    compute_problem_sizes_from_expert_offsets<SwapAB>
+        <<<blocks, threads, 0, stream>>>(offsets_ptr, ps1_ptr, ps2_ptr,
+                                         num_experts, static_cast<int>(n),
+                                         static_cast<int>(k));
+  });
+}
+
 void get_cutlass_moe_mm_data_caller(
     const torch::Tensor& topk_ids, torch::Tensor& expert_offsets,
     torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2,
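The per-expert shape rule this kernel encodes is easy to sanity-check on the host. A minimal Python sketch of the same mapping (a hypothetical reference helper, not part of the PR):

```python
def problem_sizes_from_offsets(offsets, n, k, swap_ab=False):
    """Reference model of compute_problem_sizes_from_expert_offsets:
    expert e owns tokens [offsets[e], offsets[e+1]), so M is the difference."""
    ps1, ps2 = [], []
    for e in range(len(offsets) - 1):
        m = offsets[e + 1] - offsets[e]
        if not swap_ab:
            ps1.append((m, 2 * n, k))  # [M, 2N, K]
            ps2.append((m, k, n))      # [M, K, N]
        else:
            ps1.append((2 * n, m, k))  # [2N, M, K]
            ps2.append((k, m, n))      # [K, M, N]
    return ps1, ps2

# Example: 3 experts receiving 4, 0, and 2 tokens
ps1, ps2 = problem_sizes_from_offsets([0, 4, 4, 6], n=128, k=256)
print(ps1[0], ps2[0])  # (4, 256, 256) (4, 256, 128)
```

Note that an expert with no routed tokens simply gets an M of 0, matching the kernel's unconditional write.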

View File

@@ -83,6 +83,11 @@ void get_cutlass_moe_mm_problem_sizes_caller(
     const int64_t k, const std::optional<torch::Tensor>& blockscale_offsets,
     std::optional<bool> force_swap_ab = std::nullopt);
+
+void get_cutlass_moe_mm_problem_sizes_from_expert_offsets_caller(
+    const torch::Tensor& expert_first_token_offset,
+    torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2,
+    const int64_t n, const int64_t k, const bool swap_ab);
+
 void get_cutlass_pplx_moe_mm_data_caller(torch::Tensor& expert_offsets,
                                          torch::Tensor& problem_sizes1,
                                          torch::Tensor& problem_sizes2,
@@ -322,6 +327,25 @@ void get_cutlass_moe_mm_problem_sizes(
       version_num, ". Required capability: 90, 100, or 120");
 }
+
+void get_cutlass_moe_mm_problem_sizes_from_expert_offsets(
+    const torch::Tensor& expert_first_token_offset,
+    torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2,
+    const int64_t n, const int64_t k, const bool swap_ab) {
+  int32_t version_num = get_sm_version_num();
+#if (defined ENABLE_CUTLASS_MOE_SM90 && ENABLE_CUTLASS_MOE_SM90) ||   \
+    (defined ENABLE_CUTLASS_MOE_SM100 && ENABLE_CUTLASS_MOE_SM100) || \
+    (defined ENABLE_CUTLASS_MOE_SM120 && ENABLE_CUTLASS_MOE_SM120)
+  get_cutlass_moe_mm_problem_sizes_from_expert_offsets_caller(
+      expert_first_token_offset, problem_sizes1, problem_sizes2, n, k, swap_ab);
+  return;
+#endif
+  TORCH_CHECK_NOT_IMPLEMENTED(
+      false,
+      "No compiled get_cutlass_moe_mm_problem_sizes_from_expert_offsets: "
+      "no cutlass_scaled_mm kernel for CUDA device capability: ",
+      version_num, ". Required capability: 90, 100, or 120");
+}
 void get_cutlass_pplx_moe_mm_data(torch::Tensor& expert_offsets,
                                   torch::Tensor& problem_sizes1,
                                   torch::Tensor& problem_sizes2,

View File

@@ -4,28 +4,77 @@
 #include "quantization/vectorization_utils.cuh"
 #include <c10/cuda/CUDAGuard.h>
 #include <ATen/cuda/Exceptions.h>
+#include <tuple>
 namespace vllm {
-template <typename scalar_t, typename fp8_type>
-__global__ void scaled_fp8_quant_kernel_strided(
+// STRIDE_I_ZERO: true if scale_stride_i == 0 (per-tensor or per-channel)
+// STRIDE_J_ZERO: true if scale_stride_j == 0 (per-tensor or per-token)
+template <typename scalar_t, typename fp8_type, bool STRIDE_I_ZERO,
+          bool STRIDE_J_ZERO>
+__global__ void scaled_fp8_quant_kernel_strided_group_shape(
     fp8_type* __restrict__ out, const scalar_t* __restrict__ input,
     const float* __restrict__ scale, int hidden_size, int64_t in_row_stride,
-    int64_t out_row_stride) {
-  const int64_t token_idx = blockIdx.x;  // one token per block
+    int64_t out_row_stride, int group_m, int group_n, int64_t scale_stride_i,
+    int64_t scale_stride_j) {
+  const int64_t token_idx = blockIdx.x;
   const int tid = threadIdx.x;
   const scalar_t* token_in = input + token_idx * in_row_stride;
   fp8_type* token_out = out + token_idx * out_row_stride;
-  const float inv_scale = 1.0f / (*scale);
-  vectorize_with_alignment<16>(
-      token_in, token_out, hidden_size, tid, blockDim.x,
-      [=] __device__(fp8_type & dst, const scalar_t& src) {
-        dst = scaled_fp8_conversion<true, fp8_type>(static_cast<float>(src),
-                                                    inv_scale);
-      });
+  // Precompute row-level base offset for scale access (compile-time eliminated
+  // when STRIDE_I_ZERO)
+  const int64_t scale_row_base =
+      STRIDE_I_ZERO ? 0
+                    : static_cast<int>(token_idx) / group_m * scale_stride_i;
+
+  auto get_inv_scale = [&](int gj) {
+    return 1.0f / scale[scale_row_base + gj * scale_stride_j];
+  };
+
+  int cached_gj = -1;
+  float cached_inv_scale = 0.0f;
+  auto get_inv_scale_cached = [&](int gj) {
+    if (gj != cached_gj) {
+      cached_inv_scale = 1.0f / scale[scale_row_base + gj * scale_stride_j];
+      cached_gj = gj;
+    }
+    return cached_inv_scale;
+  };
+
+  constexpr int VEC_SIZE = 16;  // FP8 so vectorize to 128 bits
+  auto scaled_fp8_conversion_vectorized = [&](const scalar_t* in, fp8_type* out,
+                                              int size, float inv_scale) {
+    vectorize_with_alignment<VEC_SIZE>(
+        in, out, size, tid, blockDim.x,
+        [=] __device__(fp8_type & dst, const scalar_t& src) {
+          dst = scaled_fp8_conversion<true, fp8_type>(static_cast<float>(src),
+                                                      inv_scale);
+        });
+  };
+
+  if (STRIDE_J_ZERO && hidden_size % VEC_SIZE == 0) {
+    // Per-tensor or per-token: single scale per row, vectorize full row
+    scaled_fp8_conversion_vectorized(token_in, token_out, hidden_size,
+                                     get_inv_scale(0));
+  } else if (group_n % VEC_SIZE == 0) {
+    // Multiple column groups with vectorization
+    const int num_groups_n = hidden_size / group_n;
+    for (int gj = 0; gj < num_groups_n; gj++) {
+      scaled_fp8_conversion_vectorized(token_in + gj * group_n,
+                                       token_out + gj * group_n, group_n,
+                                       get_inv_scale(gj));
+    }
+  } else {
+    // Scalar path for small column groups (group_n < VEC_SIZE)
+    for (int n = tid; n < hidden_size; n += blockDim.x) {
+      const int gj = n / group_n;
+      token_out[n] = scaled_fp8_conversion<true, fp8_type>(
+          static_cast<float>(token_in[n]), get_inv_scale_cached(gj));
+    }
+  }
 }
 template <typename scalar_t, typename fp8_type>
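The scale lookup in the new kernel reduces to a single index formula. A small Python model of it (hypothetical helper names, illustrating only the stride convention):

```python
def quantize_with_group_scales(x, scale, group_m, group_n, stride_i, stride_j):
    """Element (i, j) is divided by
    scale[(i // group_m) * stride_i + (j // group_n) * stride_j].
    A zero stride collapses that axis, giving per-tensor, per-token,
    or per-channel behavior from the same formula."""
    out = []
    for i, row in enumerate(x):
        out_row = []
        for j, v in enumerate(row):
            s = scale[(i // group_m) * stride_i + (j // group_n) * stride_j]
            out_row.append(v / s)
        out.append(out_row)
    return out

# Per-token scaling: one scale per row, so stride_j == 0
x = [[2.0, 4.0], [3.0, 9.0]]
print(quantize_with_group_scales(x, [2.0, 3.0], 1, 2, 1, 0))
```

The STRIDE_I_ZERO / STRIDE_J_ZERO template flags in the kernel simply let the compiler delete the corresponding term of this index expression.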
@@ -133,17 +182,116 @@ __global__ void dynamic_per_token_scaled_fp8_quant_kernel_strided(
 }  // namespace vllm
-void static_scaled_fp8_quant(torch::Tensor& out,         // [..., d]
-                             torch::Tensor const& input,  // [..., d]
-                             torch::Tensor const& scale)  // [1]
+void static_scaled_fp8_quant(
+    torch::Tensor& out,          // [..., d]
+    torch::Tensor const& input,  // [..., d]
+    torch::Tensor const& scale,  // various shapes
+    std::optional<std::tuple<int64_t, int64_t>>
+        opt_group_shape)  // optional explicit (group_m, group_n)
 {
   TORCH_CHECK(input.stride(-1) == 1,
               "last dimension of input must be contiguous");
   TORCH_CHECK(out.stride(-1) == 1,
               "last dimension of output must be contiguous");
-  const int hidden_size = input.size(-1);
-  const int num_tokens = input.numel() / hidden_size;
+  const int hidden_size = input.size(-1);              // N (columns)
+  const int num_tokens = input.numel() / hidden_size;  // M (rows)
+
+  // Determine group_m, group_n, and scale strides from scale shape
+  // Scale indexing: scale[gi * scale_stride_i + gj * scale_stride_j]
+  // where gi = m / group_m, gj = n / group_n
+  int group_m, group_n;
+  int64_t scale_stride_i, scale_stride_j;
+  if (scale.dim() == 0 || scale.numel() == 1) {
+    // Per-tensor: one scale for the entire tensor
+    group_m = num_tokens;
+    group_n = hidden_size;
+    scale_stride_i = 0;
+    scale_stride_j = 0;
+  } else if (scale.dim() == 1) {
+    // 1D scale: require explicit group_shape to disambiguate per-channel vs
+    // per-token (avoids edge case where num_tokens == hidden_size)
+    TORCH_CHECK(opt_group_shape.has_value(),
+                "1D scale requires explicit group_shape to disambiguate "
+                "per-channel vs per-token quantization. "
+                "Use group_shape=(-1, 1) for per-channel or group_shape=(1, "
+                "-1) for per-token.");
+    const auto& [opt_group_m, opt_group_n] = opt_group_shape.value();
+    group_m = opt_group_m == -1 ? num_tokens : static_cast<int>(opt_group_m);
+    group_n = opt_group_n == -1 ? hidden_size : static_cast<int>(opt_group_n);
+
+    // Validate the explicit group shape matches the 1D scale
+    const int64_t scale_len = scale.numel();
+    const int64_t expected_scale_m = num_tokens / group_m;
+    const int64_t expected_scale_n = hidden_size / group_n;
+    const int64_t expected_scale_numel = expected_scale_m * expected_scale_n;
+    TORCH_CHECK(scale_len == expected_scale_numel, "1D scale length (",
+                scale_len, ") does not match expected size (",
+                expected_scale_numel, ") for group_shape (", opt_group_m, ", ",
+                opt_group_n, ") with input shape (", num_tokens, ", ",
+                hidden_size, ")");
+
+    // For 1D scale, determine strides based on which dim is trivial
+    // Scale indexing: scale[gi * scale_stride_i + gj * scale_stride_j]
+    // where gi = m / group_m (row group), gj = n / group_n (col group)
+    if (expected_scale_m == 1) {
+      // Per-channel style: one scale in M dim, scale varies along N
+      // gi = 0 always, gj varies, so stride_1 traverses the scale
+      scale_stride_i = 0;
+      scale_stride_j = scale.stride(0);
+    } else if (expected_scale_n == 1) {
+      // Per-token style: one scale in N dim, scale varies along M
+      // gj = 0 always, gi varies, so stride_0 traverses the scale
+      scale_stride_i = scale.stride(0);
+      scale_stride_j = 0;
+    } else {
+      TORCH_CHECK(
+          false,
+          "1D scale can only be used when one of the scale dimensions is 1. "
+          "For 2D group scaling, use a 2D scale tensor.");
+    }
+  } else if (scale.dim() == 2) {
+    // 2D scale: infer group sizes from scale dimensions (or use explicit if
+    // provided)
+    const int64_t scale_size_0 = scale.size(0);
+    const int64_t scale_size_1 = scale.size(1);
+    TORCH_CHECK(num_tokens % scale_size_0 == 0, "num_tokens (", num_tokens,
+                ") must be divisible by scale.size(0) (", scale_size_0, ")");
+    TORCH_CHECK(hidden_size % scale_size_1 == 0, "hidden_size (", hidden_size,
+                ") must be divisible by scale.size(1) (", scale_size_1, ")");
+
+    // Infer from 2D scale shape
+    int inferred_group_m = num_tokens / scale_size_0;
+    int inferred_group_n = hidden_size / scale_size_1;
+
+    // Use explicit if provided, otherwise use inferred
+    if (opt_group_shape.has_value()) {
+      const auto& [opt_group_m, opt_group_n] = opt_group_shape.value();
+      group_m = opt_group_m == -1 ? num_tokens : static_cast<int>(opt_group_m);
+      group_n = opt_group_n == -1 ? hidden_size : static_cast<int>(opt_group_n);
+      // Validate explicit matches inferred
+      TORCH_CHECK(group_m == inferred_group_m && group_n == inferred_group_n,
+                  "Explicit group_shape (", opt_group_m, ", ", opt_group_n,
+                  ") does not match inferred group shape (", inferred_group_m,
+                  ", ", inferred_group_n, ") from 2D scale tensor shape (",
+                  scale_size_0, ", ", scale_size_1, ")");
+    } else {
+      group_m = inferred_group_m;
+      group_n = inferred_group_n;
+    }
+    scale_stride_i = scale.stride(0);
+    scale_stride_j = scale.stride(1);
+  } else {
+    TORCH_CHECK(false, "scale must be 0D, 1D, or 2D tensor, but got ",
+                scale.dim(), "D");
+  }
+
   const int block_size = 256;
   dim3 grid(num_tokens);
   dim3 block(block_size);
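The host-side inference above boils down to: the group shape is the input shape divided elementwise by the scale shape, with 1D scales needing an explicit hint. A simplified Python sketch of the 0-D and 2-D cases (a hypothetical helper, mirroring but not reproducing the full C++ logic):

```python
def infer_group_scales(num_tokens, hidden_size, scale_shape, scale_strides):
    """Return (group_m, group_n, scale_stride_i, scale_stride_j)."""
    if len(scale_shape) == 0:
        # Per-tensor: one scale for everything, both strides collapse to 0
        return num_tokens, hidden_size, 0, 0
    if len(scale_shape) == 2:
        s0, s1 = scale_shape
        assert num_tokens % s0 == 0 and hidden_size % s1 == 0
        return (num_tokens // s0, hidden_size // s1,
                scale_strides[0], scale_strides[1])
    # 1D is ambiguous (per-channel vs per-token) without group_shape
    raise ValueError("1D scales need an explicit group_shape")

# 128x4096 input with a (1, 32) scale -> column groups of 128
print(infer_group_scales(128, 4096, (1, 32), (32, 1)))  # (128, 128, 32, 1)
```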
@@ -153,15 +301,23 @@ void static_scaled_fp8_quant(torch::Tensor& out,    // [..., d]
   const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
   const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
+  // Dispatch to template-specialized kernel based on stride pattern
   VLLM_DISPATCH_FLOATING_TYPES(
       input.scalar_type(), "scaled_fp8_quant_kernel_scalar_type", [&] {
         VLLM_DISPATCH_FP8_TYPES(
             out.scalar_type(), "scaled_fp8_quant_kernel_fp8_type", [&] {
-              vllm::scaled_fp8_quant_kernel_strided<scalar_t, fp8_t>
-                  <<<grid, block, 0, stream>>>(
-                      out.data_ptr<fp8_t>(), input.data_ptr<scalar_t>(),
-                      scale.data_ptr<float>(), hidden_size, in_row_stride,
-                      out_row_stride);
+              VLLM_DISPATCH_BOOL(scale_stride_i == 0, S0_ZERO, [&] {
+                VLLM_DISPATCH_BOOL(scale_stride_j == 0, S1_ZERO, [&] {
+                  vllm::scaled_fp8_quant_kernel_strided_group_shape<
+                      scalar_t, fp8_t, S0_ZERO, S1_ZERO>
+                      <<<grid, block, 0, stream>>>(
+                          out.data_ptr<fp8_t>(), input.data_ptr<scalar_t>(),
+                          scale.data_ptr<float>(), hidden_size, in_row_stride,
+                          out_row_stride, group_m, group_n, scale_stride_i,
+                          scale_stride_j);
+                });
+              });
             });
       });
 }

View File

@@ -1,3 +1,4 @@
+#include "cuda_compat.h"
 #include "dispatch_utils.h"
 #include <torch/cuda.h>
@@ -97,7 +98,9 @@ static inline __device__ bool isPartialMatch(float x, uint32_t pattern) {
 template <typename T, typename idxT, typename Func>
 __device__ void vectorized_process(size_t thread_rank, size_t num_threads,
                                    const T* in, idxT len, Func f) {
-  constexpr int WARP_SIZE = 32;
+  // Use dynamic WARP_SIZE from cuda_compat.h to support both
+  // Wave64 (MI300X/gfx942) and Wave32 (Strix Halo/gfx1151) architectures
+  constexpr int kWarpSize = WARP_SIZE;
   using WideT = float4;
   if constexpr (sizeof(T) >= sizeof(WideT)) {
     for (idxT i = thread_rank; i < len; i += num_threads) {
@@ -132,8 +135,8 @@ __device__ void vectorized_process(size_t thread_rank, size_t num_threads,
     }
   }
-  static_assert(WARP_SIZE >= items_per_scalar);
-  // and because items_per_scalar > skip_cnt, WARP_SIZE > skip_cnt
+  static_assert(kWarpSize >= items_per_scalar);
+  // and because items_per_scalar > skip_cnt, kWarpSize > skip_cnt
   // no need to use loop
   if (thread_rank < skip_cnt) {
     f(in[thread_rank], thread_rank);
@@ -142,7 +145,7 @@ __device__ void vectorized_process(size_t thread_rank, size_t num_threads,
   // len_cast * items_per_scalar + items_per_scalar > len - skip_cnt;
   // and so
   // len - (skip_cnt + len_cast * items_per_scalar) < items_per_scalar <=
-  // WARP_SIZE no need to use loop
+  // kWarpSize no need to use loop
   const idxT remain_i = skip_cnt + len_cast * items_per_scalar + thread_rank;
   if (remain_i < len) {
     f(in[remain_i], remain_i);
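The bound discussed in those comments can be checked off-device. A Python sketch (hypothetical, warp details omitted) of how vectorized_process partitions a buffer into an unaligned head, full wide loads, and a remainder:

```python
def split_for_vectorization(addr, length, elem_size=2, wide_size=16):
    """Return (skip_cnt, len_cast, tail): scalar head elements until the
    pointer is wide-aligned, number of full wide loads, and leftover
    scalar elements. Assumes addr is a multiple of elem_size."""
    items_per_scalar = wide_size // elem_size
    skip_cnt = min(((-addr) % wide_size) // elem_size, length)
    len_cast = (length - skip_cnt) // items_per_scalar
    tail = length - skip_cnt - len_cast * items_per_scalar
    # The invariant from the kernel comment: tail < items_per_scalar,
    # so one extra pass (no loop) suffices for the remainder.
    assert tail < items_per_scalar
    return skip_cnt, len_cast, tail

# fp16 buffer starting 4 bytes past a 16-byte boundary, 100 elements
print(split_for_vectorization(addr=4, length=100))  # (6, 11, 6)
```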

View File

@@ -416,13 +416,6 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
       " Tensor alpha) -> ()");
   ops.impl("cutlass_scaled_fp4_mm", torch::kCUDA, &cutlass_scaled_fp4_mm);
-  // cutlass blockwise scaledgroup GEMM
-  ops.def(
-      "cutlass_blockwise_scaled_grouped_mm(Tensor! output, Tensor a, Tensor b, "
-      "Tensor scales_a, Tensor scales_b, "
-      "Tensor problem_sizes, Tensor expert_offsets) -> ()");
-  // conditionally compiled so impl registration is in source file
-
   // cutlass nvfp4 block scaled group GEMM
   ops.def(
       "cutlass_fp4_group_mm(Tensor! out, Tensor a, Tensor b,"
@@ -494,6 +487,17 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
   ops.impl("get_cutlass_moe_mm_problem_sizes", torch::kCUDA,
            &get_cutlass_moe_mm_problem_sizes);
+
+  // compute per-expert problem sizes from expert_first_token_offset
+  // produced by vLLM's moe_permute kernel
+  ops.def(
+      "get_cutlass_moe_mm_problem_sizes_from_expert_offsets("
+      " Tensor expert_first_token_offset, "
+      " Tensor! problem_sizes1, "
+      " Tensor! problem_sizes2, "
+      " int n, int k, bool swap_ab) -> ()");
+  ops.impl("get_cutlass_moe_mm_problem_sizes_from_expert_offsets", torch::kCUDA,
+           &get_cutlass_moe_mm_problem_sizes_from_expert_offsets);
+
   // A function that computes data required to run fused MoE with w8a8 grouped
   // GEMM and PPLX. It takes expert_num_tokens and non_zero_expert_idxs
   // as an input, and computes expert_offsets (token start indices of each
@@ -565,6 +569,15 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
       "Tensor output_scale_offset_by_experts) -> ()");
   ops.impl("scaled_fp4_experts_quant", torch::kCUDA, &scaled_fp4_experts_quant);
+
+  // Fused SiLU+Mul+NVFP4 experts quantization.
+  ops.def(
+      "silu_and_mul_scaled_fp4_experts_quant(Tensor! output, Tensor! "
+      "output_scale,"
+      "Tensor input, Tensor input_global_scale, Tensor input_offset_by_experts,"
+      "Tensor output_scale_offset_by_experts) -> ()");
+  ops.impl("silu_and_mul_scaled_fp4_experts_quant", torch::kCUDA,
+           &silu_and_mul_scaled_fp4_experts_quant);
   // Check if cutlass_scaled_mm_fp4 is supported for CUDA devices
   // of the given capability
   ops.def("cutlass_scaled_mm_supports_fp4(int cuda_device_capability) -> bool");
@@ -586,9 +599,12 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
   ops.impl("gptq_shuffle", torch::kCUDA, &gptq_shuffle);
   // Compute FP8 quantized tensor for given scaling factor.
+  // Supports per-tensor, per-channel, per-token, and arbitrary 2D group
+  // scaling. Optional group_m/group_n specify the group shape explicitly;
+  // required for 1D scales to disambiguate per-channel vs per-token.
   ops.def(
-      "static_scaled_fp8_quant(Tensor! result, Tensor input, Tensor scale) -> "
-      "()");
+      "static_scaled_fp8_quant(Tensor! result, Tensor input, Tensor scale, "
+      "(int, int)? group_shape=None) -> ()");
   ops.impl("static_scaled_fp8_quant", torch::kCUDA, &static_scaled_fp8_quant);
@@ -692,16 +708,6 @@ TORCH_LIBRARY_EXPAND(CONCAT(TORCH_EXTENSION_NAME, _cache_ops), cache_ops) {
       "swap_blocks(Tensor src, Tensor! dst, Tensor block_mapping) -> ()");
   cache_ops.impl("swap_blocks", torch::kCUDA, &swap_blocks);
-  // Copy the cache blocks from src to dst.
-  cache_ops.def(
-      "copy_blocks(Tensor(a!)[] key_caches, Tensor[](b!) value_caches, "
-      "Tensor block_mapping) -> ()");
-  cache_ops.impl("copy_blocks", torch::kCUDA, &copy_blocks);
-
-  cache_ops.def(
-      "copy_blocks_mla(Tensor(a!)[] kv_caches, Tensor block_mapping) -> ()");
-  cache_ops.impl("copy_blocks_mla", torch::kCUDA, &copy_blocks_mla);
-
   // Reshape the key and value tensors and cache them.
   cache_ops.def(
       "reshape_and_cache(Tensor key, Tensor value,"
@@ -731,6 +737,22 @@ TORCH_LIBRARY_EXPAND(CONCAT(TORCH_EXTENSION_NAME, _cache_ops), cache_ops) {
       " Tensor scale) -> ()");
   cache_ops.impl("concat_and_cache_mla", torch::kCUDA, &concat_and_cache_mla);
+
+  // Rotate Q and K, then write to kv cache for MLA
+  cache_ops.def(
+      "concat_and_cache_mla_rope_fused("
+      " Tensor positions,"
+      " Tensor! q_pe,"
+      " Tensor! k_pe,"
+      " Tensor kv_c,"
+      " Tensor cos_sin_cache,"
+      " bool is_neox,"
+      " Tensor slot_mapping,"
+      " Tensor! kv_cache,"
+      " str kv_cache_dtype,"
+      " Tensor kv_cache_scale) -> ()");
+  cache_ops.impl("concat_and_cache_mla_rope_fused", torch::kCUDA,
+                 &concat_and_cache_mla_rope_fused);
   // Convert the key and value cache to fp8 data type.
   cache_ops.def(
       "convert_fp8(Tensor! dst_cache, Tensor src_cache, float scale, "

View File

@@ -183,7 +183,7 @@ ARG nvcc_threads=8
 ENV NVCC_THREADS=$nvcc_threads
 ARG USE_SCCACHE
-ARG SCCACHE_DOWNLOAD_URL=https://github.com/mozilla/sccache/releases/download/v0.8.1/sccache-v0.8.1-x86_64-unknown-linux-musl.tar.gz
+ARG SCCACHE_DOWNLOAD_URL
 ARG SCCACHE_ENDPOINT
 ARG SCCACHE_BUCKET_NAME=vllm-build-sccache
 ARG SCCACHE_REGION_NAME=us-west-2
@@ -201,10 +201,16 @@ ENV SETUPTOOLS_SCM_PRETEND_VERSION="0.0.0+csrc.build"
 RUN --mount=type=cache,target=/root/.cache/uv \
     if [ "$USE_SCCACHE" = "1" ]; then \
         echo "Installing sccache..." \
+        && case "${TARGETPLATFORM}" in \
+            linux/arm64) SCCACHE_ARCH="aarch64" ;; \
+            linux/amd64) SCCACHE_ARCH="x86_64" ;; \
+            *) echo "Unsupported TARGETPLATFORM for sccache: ${TARGETPLATFORM}" >&2; exit 1 ;; \
+        esac \
+        && export SCCACHE_DOWNLOAD_URL="${SCCACHE_DOWNLOAD_URL:-https://github.com/mozilla/sccache/releases/download/v0.8.1/sccache-v0.8.1-${SCCACHE_ARCH}-unknown-linux-musl.tar.gz}" \
         && curl -L -o sccache.tar.gz ${SCCACHE_DOWNLOAD_URL} \
         && tar -xzf sccache.tar.gz \
-        && sudo mv sccache-v0.8.1-x86_64-unknown-linux-musl/sccache /usr/bin/sccache \
-        && rm -rf sccache.tar.gz sccache-v0.8.1-x86_64-unknown-linux-musl \
+        && sudo mv sccache-v0.8.1-${SCCACHE_ARCH}-unknown-linux-musl/sccache /usr/bin/sccache \
+        && rm -rf sccache.tar.gz sccache-v0.8.1-${SCCACHE_ARCH}-unknown-linux-musl \
        && if [ ! -z ${SCCACHE_ENDPOINT} ] ; then export SCCACHE_ENDPOINT=${SCCACHE_ENDPOINT} ; fi \
        && export SCCACHE_BUCKET=${SCCACHE_BUCKET_NAME} \
        && export SCCACHE_REGION=${SCCACHE_REGION_NAME} \
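The arch selection added in that RUN step is plain POSIX shell and can be exercised stand-alone; a sketch of the same mapping and default-URL logic (TARGETPLATFORM hard-coded here for illustration):

```shell
#!/bin/sh
# Same case statement as the Dockerfile: map buildx platform -> sccache arch.
TARGETPLATFORM="linux/arm64"
case "${TARGETPLATFORM}" in
    linux/arm64) SCCACHE_ARCH="aarch64" ;;
    linux/amd64) SCCACHE_ARCH="x86_64" ;;
    *) echo "Unsupported TARGETPLATFORM: ${TARGETPLATFORM}" >&2; exit 1 ;;
esac
# The build arg now has no default; the URL is only synthesized when empty.
SCCACHE_DOWNLOAD_URL="${SCCACHE_DOWNLOAD_URL:-https://github.com/mozilla/sccache/releases/download/v0.8.1/sccache-v0.8.1-${SCCACHE_ARCH}-unknown-linux-musl.tar.gz}"
echo "$SCCACHE_DOWNLOAD_URL"
```

Passing SCCACHE_DOWNLOAD_URL explicitly still overrides the synthesized URL, which is what lets the release pipeline pin a mirror.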
@@ -267,6 +273,7 @@ RUN mkdir -p /tmp/deepgemm/dist && touch /tmp/deepgemm/dist/.deepgemm_skipped
 COPY tools/ep_kernels/install_python_libraries.sh /tmp/install_python_libraries.sh
 ARG PPLX_COMMIT_HASH
 ARG DEEPEP_COMMIT_HASH
+ARG NVSHMEM_VER
 RUN --mount=type=cache,target=/root/.cache/uv \
     mkdir -p /tmp/ep_kernels_workspace/dist && \
     export TORCH_CUDA_ARCH_LIST='9.0a 10.0a' && \
@@ -274,7 +281,8 @@ RUN --mount=type=cache,target=/root/.cache/uv \
         --workspace /tmp/ep_kernels_workspace \
         --mode wheel \
         ${PPLX_COMMIT_HASH:+--pplx-ref "$PPLX_COMMIT_HASH"} \
-        ${DEEPEP_COMMIT_HASH:+--deepep-ref "$DEEPEP_COMMIT_HASH"} && \
+        ${DEEPEP_COMMIT_HASH:+--deepep-ref "$DEEPEP_COMMIT_HASH"} \
+        ${NVSHMEM_VER:+--nvshmem-ver "$NVSHMEM_VER"} && \
     find /tmp/ep_kernels_workspace/nvshmem -name '*.a' -delete
 #################### EXTENSIONS BUILD IMAGE ####################
@@ -609,6 +617,7 @@ RUN mv vllm src/vllm
 FROM vllm-base AS vllm-openai-base
 ARG TARGETPLATFORM
 ARG INSTALL_KV_CONNECTORS=false
+ARG CUDA_VERSION
 ARG PIP_INDEX_URL UV_INDEX_URL
 ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL
@@ -618,10 +627,30 @@ ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL
 ENV UV_HTTP_TIMEOUT=500
 # install kv_connectors if requested
+ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'
+ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}
 RUN --mount=type=cache,target=/root/.cache/uv \
     --mount=type=bind,source=requirements/kv_connectors.txt,target=/tmp/kv_connectors.txt,ro \
+    CUDA_MAJOR="${CUDA_VERSION%%.*}"; \
+    CUDA_VERSION_DASH=$(echo $CUDA_VERSION | cut -d. -f1,2 | tr '.' '-'); \
+    CUDA_HOME=/usr/local/cuda; \
+    # lmcache requires explicit specifying CUDA_HOME
+    BUILD_PKGS="libcusparse-dev-${CUDA_VERSION_DASH} \
+        libcublas-dev-${CUDA_VERSION_DASH} \
+        libcusolver-dev-${CUDA_VERSION_DASH}"; \
     if [ "$INSTALL_KV_CONNECTORS" = "true" ]; then \
-        uv pip install --system -r /tmp/kv_connectors.txt || true; \
+        if [ "$CUDA_MAJOR" -ge 13 ]; then \
+            uv pip install --system nixl-cu13; \
+        fi; \
+        uv pip install --system -r /tmp/kv_connectors.txt --no-build || ( \
+            # if the above fails, install from source
+            apt-get update -y && \
+            apt-get install -y --no-install-recommends ${BUILD_PKGS} && \
+            uv pip install --system -r /tmp/kv_connectors.txt --no-build-isolation && \
+            apt-get purge -y ${BUILD_PKGS} && \
+            # clean up -dev packages, keep runtime libraries
+            rm -rf /var/lib/apt/lists/* \
+        ); \
     fi
 ENV VLLM_USAGE_SOURCE production-docker-image
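The version mangling used in that RUN step is ordinary parameter expansion plus `cut`/`tr`, runnable stand-alone (CUDA_VERSION hard-coded here for illustration):

```shell
#!/bin/sh
# Derive the pieces the Dockerfile needs from a full CUDA version string.
CUDA_VERSION="13.0.1"
CUDA_MAJOR="${CUDA_VERSION%%.*}"                                        # "13"
CUDA_VERSION_DASH=$(echo "$CUDA_VERSION" | cut -d. -f1,2 | tr '.' '-')  # "13-0"
# CUDA_VERSION_DASH matches the apt package naming, e.g.:
echo "libcublas-dev-${CUDA_VERSION_DASH}"
```

CUDA_MAJOR gates the nixl-cu13 install, while CUDA_VERSION_DASH selects the matching `-dev` packages for the source-build fallback.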

View File

@@ -22,13 +22,13 @@ RUN microdnf install -y dnf && dnf install -y gcc-toolset-14 make wget unzip \
 ###############################################################
 FROM registry.access.redhat.com/ubi9/ubi-minimal:${BASE_UBI_IMAGE_TAG} AS centos-deps-builder
 RUN microdnf install -y dnf && \
-    dnf install -y https://mirror.stream.centos.org/9-stream/BaseOS/`arch`/os/Packages/centos-gpg-keys-9.0-24.el9.noarch.rpm \
-    https://mirror.stream.centos.org/9-stream/BaseOS/`arch`/os/Packages/centos-stream-repos-9.0-24.el9.noarch.rpm \
+    dnf install -y https://mirror.stream.centos.org/9-stream/BaseOS/`arch`/os/Packages/centos-gpg-keys-9.0-26.el9.noarch.rpm \
+    https://mirror.stream.centos.org/9-stream/BaseOS/`arch`/os/Packages/centos-stream-repos-9.0-26.el9.noarch.rpm \
     https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
     dnf config-manager --set-enabled crb
-RUN dnf install -y openjpeg2-devel lcms2-devel tcl-devel tk-devel fribidi-devel && \
-    dnf remove -y centos-gpg-keys-9.0-24.el9.noarch centos-stream-repos-9.0-24.el9.noarch
+RUN dnf install -y openjpeg2-devel lcms2-devel tcl-devel tk-devel fribidi-devel yajl-devel && \
+    dnf remove -y centos-gpg-keys-9.0-24.el9.noarch centos-stream-repos-9.0-26.el9.noarch
 ###############################################################
@@ -346,4 +346,4 @@ WORKDIR /workspace/
RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks
ENTRYPOINT ["vllm", "serve"]


@@ -3,6 +3,14 @@ ARG REMOTE_VLLM="0"
ARG COMMON_WORKDIR=/app
ARG BASE_IMAGE=rocm/vllm-dev:base
# Sccache configuration (only used in release pipeline)
ARG USE_SCCACHE
ARG SCCACHE_DOWNLOAD_URL
ARG SCCACHE_ENDPOINT
ARG SCCACHE_BUCKET_NAME=vllm-build-sccache
ARG SCCACHE_REGION_NAME=us-west-2
ARG SCCACHE_S3_NO_CREDENTIALS=0
FROM ${BASE_IMAGE} AS base
ARG ARG_PYTORCH_ROCM_ARCH
@@ -14,9 +22,14 @@ ENV RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1
RUN apt-get update -q -y && apt-get install -q -y \
sqlite3 libsqlite3-dev libfmt-dev libmsgpack-dev libsuitesparse-dev \
apt-transport-https ca-certificates wget curl
-# Remove sccache
RUN python3 -m pip install --upgrade pip
-RUN apt-get purge -y sccache; python3 -m pip uninstall -y sccache; rm -f "$(which sccache)"
+# Remove sccache only if not using sccache (it exists in base image from Dockerfile.rocm_base)
ARG USE_SCCACHE
RUN if [ "$USE_SCCACHE" != "1" ]; then \
apt-get purge -y sccache || true; \
python3 -m pip uninstall -y sccache || true; \
rm -f "$(which sccache)" || true; \
fi
# Install UV
RUN curl -LsSf https://astral.sh/uv/install.sh | env UV_INSTALL_DIR="/usr/local/bin" sh
@@ -28,6 +41,39 @@ ENV UV_INDEX_STRATEGY="unsafe-best-match"
# Use copy mode to avoid hardlink failures with Docker cache mounts
ENV UV_LINK_MODE=copy
# Install sccache if USE_SCCACHE is enabled (for release builds)
ARG USE_SCCACHE
ARG SCCACHE_DOWNLOAD_URL
ARG SCCACHE_ENDPOINT
ARG SCCACHE_BUCKET_NAME
ARG SCCACHE_REGION_NAME
ARG SCCACHE_S3_NO_CREDENTIALS
RUN if [ "$USE_SCCACHE" = "1" ]; then \
if command -v sccache >/dev/null 2>&1; then \
echo "sccache already installed, skipping installation"; \
sccache --version; \
else \
echo "Installing sccache..." \
&& SCCACHE_ARCH="x86_64" \
&& SCCACHE_VERSION="v0.8.1" \
&& SCCACHE_DL_URL="${SCCACHE_DOWNLOAD_URL:-https://github.com/mozilla/sccache/releases/download/${SCCACHE_VERSION}/sccache-${SCCACHE_VERSION}-${SCCACHE_ARCH}-unknown-linux-musl.tar.gz}" \
&& curl -L -o /tmp/sccache.tar.gz ${SCCACHE_DL_URL} \
&& tar -xzf /tmp/sccache.tar.gz -C /tmp \
&& mv /tmp/sccache-${SCCACHE_VERSION}-${SCCACHE_ARCH}-unknown-linux-musl/sccache /usr/bin/sccache \
&& chmod +x /usr/bin/sccache \
&& rm -rf /tmp/sccache.tar.gz /tmp/sccache-${SCCACHE_VERSION}-${SCCACHE_ARCH}-unknown-linux-musl \
&& sccache --version; \
fi; \
fi
# Set sccache environment variables only when USE_SCCACHE=1
# This prevents S3 config from leaking into images when sccache is not used
ARG USE_SCCACHE
ENV SCCACHE_BUCKET=${USE_SCCACHE:+${SCCACHE_BUCKET_NAME}}
ENV SCCACHE_REGION=${USE_SCCACHE:+${SCCACHE_REGION_NAME}}
ENV SCCACHE_S3_NO_CREDENTIALS=${USE_SCCACHE:+${SCCACHE_S3_NO_CREDENTIALS}}
ENV SCCACHE_IDLE_TIMEOUT=${USE_SCCACHE:+0}
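The `ENV` lines above rely on the POSIX `${VAR:+word}` alternate-value expansion (which Dockerfile `ENV` also supports): the right-hand side expands to the bucket/region value only when `USE_SCCACHE` is set and non-empty, so the S3 settings stay empty otherwise. A standalone sketch with illustrative values:

```shell
# ${VAR:+word} expands to "word" only when VAR is set and non-empty.
USE_SCCACHE=1
SCCACHE_BUCKET_NAME=vllm-build-sccache
echo "enabled:  '${USE_SCCACHE:+${SCCACHE_BUCKET_NAME}}'"   # bucket name
unset USE_SCCACHE
echo "disabled: '${USE_SCCACHE:+${SCCACHE_BUCKET_NAME}}'"   # empty string
```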
ARG COMMON_WORKDIR
WORKDIR ${COMMON_WORKDIR}
@@ -39,6 +85,8 @@ ONBUILD COPY ./ vllm/
FROM base AS fetch_vllm_1
ARG VLLM_REPO="https://github.com/vllm-project/vllm.git"
ARG VLLM_BRANCH="main"
ENV VLLM_REPO=${VLLM_REPO}
ENV VLLM_BRANCH=${VLLM_BRANCH}
ONBUILD RUN git clone ${VLLM_REPO} \
&& cd vllm \
&& git fetch -v --prune -- origin ${VLLM_BRANCH} \
@@ -51,7 +99,7 @@ FROM fetch_vllm_${REMOTE_VLLM} AS fetch_vllm
# -----------------------
# vLLM build stages
FROM fetch_vllm AS build_vllm
-# Build vLLM
+# Build vLLM (setup.py auto-detects sccache in PATH)
RUN cd vllm \
&& python3 -m pip install -r requirements/rocm.txt \
&& python3 setup.py clean --all \
@@ -67,6 +115,178 @@ COPY --from=build_vllm ${COMMON_WORKDIR}/vllm/docker/Dockerfile.rocm /docker/
COPY --from=build_vllm ${COMMON_WORKDIR}/vllm/.buildkite /.buildkite
COPY --from=build_vllm ${COMMON_WORKDIR}/vllm/vllm/v1 /vllm_v1
# RIXL/UCX build stages
FROM base AS build_rixl
ARG RIXL_BRANCH="f33a5599"
ARG RIXL_REPO="https://github.com/ROCm/RIXL.git"
ARG UCX_BRANCH="da3fac2a"
ARG UCX_REPO="https://github.com/ROCm/ucx.git"
ENV ROCM_PATH=/opt/rocm
ENV UCX_HOME=/usr/local/ucx
ENV RIXL_HOME=/usr/local/rixl
ENV RIXL_BENCH_HOME=/usr/local/rixl_bench
# RIXL build system dependencies and RDMA support
RUN apt-get -y update && apt-get -y install autoconf libtool pkg-config \
libgrpc-dev \
libgrpc++-dev \
libprotobuf-dev \
protobuf-compiler-grpc \
libcpprest-dev \
libaio-dev \
librdmacm1 \
librdmacm-dev \
libibverbs1 \
libibverbs-dev \
ibverbs-utils \
rdmacm-utils \
ibverbs-providers \
&& rm -rf /var/lib/apt/lists/*
RUN uv pip install --system meson auditwheel patchelf tomlkit
RUN cd /usr/local/src && \
git clone ${UCX_REPO} && \
cd ucx && \
git checkout ${UCX_BRANCH} && \
./autogen.sh && \
mkdir build && cd build && \
../configure \
--prefix=/usr/local/ucx \
--enable-shared \
--disable-static \
--disable-doxygen-doc \
--enable-optimizations \
--enable-devel-headers \
--with-rocm=/opt/rocm \
--with-verbs \
--with-dm \
--enable-mt && \
make -j && \
make install
ENV PATH=/usr/local/ucx/bin:$PATH
ENV LD_LIBRARY_PATH=${UCX_HOME}/lib:${LD_LIBRARY_PATH}
RUN git clone ${RIXL_REPO} /opt/rixl && \
cd /opt/rixl && \
git checkout ${RIXL_BRANCH} && \
meson setup build --prefix=${RIXL_HOME} \
-Ducx_path=${UCX_HOME} \
-Drocm_path=${ROCM_PATH} && \
cd build && \
ninja && \
ninja install
# Generate RIXL wheel
RUN cd /opt/rixl && mkdir -p /app/install && \
./contrib/build-wheel.sh \
--output-dir /app/install \
--rocm-dir ${ROCM_PATH} \
--ucx-plugins-dir ${UCX_HOME}/lib/ucx \
--nixl-plugins-dir ${RIXL_HOME}/lib/x86_64-linux-gnu/plugins
# -----------------------
# vLLM wheel release build stage (for building distributable wheels)
# This stage pins dependencies to custom ROCm wheel versions and handles version detection
FROM fetch_vllm AS build_vllm_wheel_release
ARG COMMON_WORKDIR
# Create /install directory for custom wheels
RUN mkdir -p /install
# Copy custom ROCm wheels from docker/context if they exist
# COPY ensures Docker cache is invalidated when wheels change
# .keep file ensures directory always exists for COPY to work
COPY docker/context/base-wheels/ /tmp/base-wheels/
# This is how we know whether we are building for a wheel release or not.
# If no wheels are found there, we are not building for a wheel release,
# so we exit with an error to skip this stage.
RUN if [ -n "$(ls /tmp/base-wheels/*.whl 2>/dev/null)" ]; then \
echo "Found custom wheels - copying to /install"; \
cp /tmp/base-wheels/*.whl /install/ && \
echo "Copied custom wheels:"; \
ls -lh /install/; \
else \
echo "ERROR: No custom wheels found in docker/context/base-wheels/"; \
echo "Wheel releases require pre-built ROCm wheels."; \
exit 1; \
fi
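The guard above uses a common shell idiom: a glob inside `$(ls ... 2>/dev/null)` stays empty when nothing matches, so `[ -n ... ]` detects "at least one wheel present" without the script failing on an unmatched glob. A self-contained sketch with a throwaway directory (paths are illustrative, not the build context):

```shell
# Detect whether any *.whl files exist in a directory.
dir=$(mktemp -d)
if [ -n "$(ls "$dir"/*.whl 2>/dev/null)" ]; then
  echo "wheels found"
else
  echo "no wheels"
fi
touch "$dir/example-0.1-py3-none-any.whl"
if [ -n "$(ls "$dir"/*.whl 2>/dev/null)" ]; then
  echo "wheels found"
fi
rm -rf "$dir"
```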
# GIT_REPO_CHECK: Verify repo is clean and tags are available (for release builds)
# This matches CUDA's Dockerfile behavior for proper version detection via setuptools_scm
ARG GIT_REPO_CHECK=0
RUN if [ "$GIT_REPO_CHECK" != "0" ]; then \
echo "Running repository checks..."; \
cd vllm && bash tools/check_repo.sh; \
fi
# Extract version from git BEFORE any modifications (pin_rocm_dependencies.py modifies requirements/rocm.txt)
# This ensures setuptools_scm sees clean repo state for version detection
RUN --mount=type=bind,source=.git,target=vllm/.git \
cd vllm \
&& pip install setuptools_scm \
&& VLLM_VERSION=$(python3 -c "import setuptools_scm; print(setuptools_scm.get_version())") \
&& echo "Detected vLLM version: ${VLLM_VERSION}" \
&& echo "${VLLM_VERSION}" > /tmp/vllm_version.txt
# Fail if git-based package dependencies are found in requirements files
# (uv doesn't handle git+ URLs well, and packages should be distributed on PyPI)
# Note: pip install can handle git+ URLs, but uv cannot.
RUN echo "Checking for git-based packages in requirements files..." \
&& echo "Checking common.txt for git-based packages:" \
&& if grep -q 'git+' ${COMMON_WORKDIR}/vllm/requirements/common.txt; then \
echo "ERROR: Git-based packages found in common.txt:"; \
grep 'git+' ${COMMON_WORKDIR}/vllm/requirements/common.txt; \
echo "Please publish these packages to PyPI instead of using git dependencies."; \
exit 1; \
else \
echo " ✓ No git-based packages found in common.txt"; \
fi \
&& echo "Checking rocm.txt for git-based packages:" \
&& if grep -q 'git+' ${COMMON_WORKDIR}/vllm/requirements/rocm.txt; then \
echo "ERROR: Git-based packages found in rocm.txt:"; \
grep 'git+' ${COMMON_WORKDIR}/vllm/requirements/rocm.txt; \
echo "Please publish these packages to PyPI instead of using git dependencies."; \
exit 1; \
else \
echo " ✓ No git-based packages found in rocm.txt"; \
fi \
&& echo "All requirements files are clean - no git-based packages found"
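The check above hinges on `grep -q`, which exits 0 when a match exists (triggering the error branch) and 1 otherwise. A standalone sketch against a throwaway requirements file (package names are illustrative):

```shell
# Fail when a requirements file pins a git-based dependency.
req=$(mktemp)
printf 'numpy==1.26.4\nsomepkg @ git+https://github.com/example/somepkg.git\n' > "$req"
if grep -q 'git+' "$req"; then
  echo "ERROR: git-based package found:"
  grep 'git+' "$req"
fi
rm -f "$req"
```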
# Pin vLLM dependencies to exact versions of custom ROCm wheels
# This ensures 'pip install vllm' automatically installs correct torch/triton/torchvision/amdsmi
COPY tools/vllm-rocm/pin_rocm_dependencies.py /tmp/pin_rocm_dependencies.py
RUN echo "Pinning vLLM dependencies to custom wheel versions..." \
&& python3 /tmp/pin_rocm_dependencies.py /install ${COMMON_WORKDIR}/vllm/requirements/rocm.txt
# Install dependencies using custom wheels from /install
RUN cd vllm \
&& echo "Building vLLM with custom wheels from /install" \
&& python3 -m pip install --find-links /install -r requirements/rocm.txt \
&& python3 setup.py clean --all
# Build wheel using pre-extracted version to avoid dirty state from modified requirements/rocm.txt
# (setup.py auto-detects sccache in PATH)
RUN --mount=type=bind,source=.git,target=vllm/.git \
cd vllm \
&& export SETUPTOOLS_SCM_PRETEND_VERSION=$(cat /tmp/vllm_version.txt) \
&& echo "Building wheel with version: ${SETUPTOOLS_SCM_PRETEND_VERSION}" \
&& python3 setup.py bdist_wheel --dist-dir=dist
FROM scratch AS export_vllm_wheel_release
ARG COMMON_WORKDIR
COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/dist/*.whl /
COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/requirements /requirements
COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/benchmarks /benchmarks
COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/tests /tests
COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/examples /examples
COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/docker/Dockerfile.rocm /docker/
COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/.buildkite /.buildkite
COPY --from=build_vllm_wheel_release ${COMMON_WORKDIR}/vllm/vllm/v1 /vllm_v1
# -----------------------
# Test vLLM image
FROM base AS test
@@ -83,6 +303,10 @@ RUN --mount=type=bind,from=export_vllm,src=/,target=/install \
&& pip uninstall -y vllm \
&& uv pip install --system *.whl
# Install RIXL wheel
RUN --mount=type=bind,from=build_rixl,src=/app/install,target=/rixl_install \
uv pip install --system /rixl_install/*.whl
WORKDIR /vllm-workspace
ARG COMMON_WORKDIR
COPY --from=build_vllm ${COMMON_WORKDIR}/vllm /vllm-workspace
@@ -97,6 +321,14 @@ RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system hf_transfer
ENV HF_HUB_ENABLE_HF_TRANSFER=1
# install audio decode package `torchcodec` from source (required due to
# ROCm and torch version mismatch) for tests with datasets package
COPY tools/install_torchcodec_rocm.sh /tmp/install_torchcodec.sh
RUN bash /tmp/install_torchcodec.sh \
&& rm /tmp/install_torchcodec.sh \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Copy in the v1 package (for python-only install test group)
COPY --from=export_vllm /vllm_v1 /usr/local/lib/python${PYTHON_VERSION}/dist-packages/vllm/v1
@@ -130,6 +362,7 @@ RUN --mount=type=bind,from=export_vllm,src=/,target=/install \
&& uv pip install --system *.whl
ARG COMMON_WORKDIR
ARG BASE_IMAGE
# Copy over the benchmark scripts as well
COPY --from=export_vllm /benchmarks ${COMMON_WORKDIR}/vllm/benchmarks
@@ -144,4 +377,13 @@ ENV SAFETENSORS_FAST_GPU=1
# Performance environment variable.
ENV HIP_FORCE_DEV_KERNARG=1
# Workaround for ROCm profiler limits
RUN echo "ROCTRACER_MAX_EVENTS=10000000" > ${COMMON_WORKDIR}/libkineto.conf
ENV KINETO_CONFIG="${COMMON_WORKDIR}/libkineto.conf"
RUN echo "VLLM_BASE_IMAGE=${BASE_IMAGE}" >> ${COMMON_WORKDIR}/versions.txt
CMD ["/bin/bash"]
# Set entrypoint for vllm-openai official images
FROM final AS vllm-openai
ENTRYPOINT ["vllm", "serve"]


@@ -1,16 +1,26 @@
-ARG BASE_IMAGE=rocm/dev-ubuntu-22.04:7.1-complete
+ARG BASE_IMAGE=rocm/dev-ubuntu-22.04:7.0-complete
ARG TRITON_BRANCH="57c693b6"
ARG TRITON_REPO="https://github.com/ROCm/triton.git"
-ARG PYTORCH_BRANCH="1c57644d"
+ARG PYTORCH_BRANCH="89075173"
-ARG PYTORCH_VISION_BRANCH="v0.23.0"
ARG PYTORCH_REPO="https://github.com/ROCm/pytorch.git"
+ARG PYTORCH_VISION_BRANCH="v0.24.1"
ARG PYTORCH_VISION_REPO="https://github.com/pytorch/vision.git"
ARG PYTORCH_AUDIO_BRANCH="v2.9.0"
ARG PYTORCH_AUDIO_REPO="https://github.com/pytorch/audio.git"
ARG FA_BRANCH="0e60e394"
ARG FA_REPO="https://github.com/Dao-AILab/flash-attention.git"
-ARG AITER_BRANCH="59bd8ff2"
+ARG AITER_BRANCH="6af8b687"
ARG AITER_REPO="https://github.com/ROCm/aiter.git"
ARG MORI_BRANCH="2d02c6a9"
ARG MORI_REPO="https://github.com/ROCm/mori.git"
# Sccache configuration (only used in release pipeline)
ARG USE_SCCACHE
ARG SCCACHE_DOWNLOAD_URL
ARG SCCACHE_ENDPOINT
ARG SCCACHE_BUCKET_NAME=vllm-build-sccache
ARG SCCACHE_REGION_NAME=us-west-2
ARG SCCACHE_S3_NO_CREDENTIALS=0
FROM ${BASE_IMAGE} AS base
@@ -20,6 +30,7 @@ ENV LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
ARG PYTORCH_ROCM_ARCH=gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151
ENV PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}
ENV AITER_ROCM_ARCH=gfx942;gfx950
ENV MORI_GPU_ARCHS=gfx942;gfx950
# Required for RCCL in ROCm7.1
ENV HSA_NO_SCRATCH_RECLAIM=1
@@ -33,7 +44,7 @@ ENV DEBIAN_FRONTEND=noninteractive
# Install Python and other dependencies
RUN apt-get update -y \
-&& apt-get install -y software-properties-common git curl sudo vim less libgfortran5 \
+&& apt-get install -y software-properties-common git curl sudo vim less libgfortran5 libopenmpi-dev libpci-dev \
&& for i in 1 2 3; do \
add-apt-repository -y ppa:deadsnakes/ppa && break || \
{ echo "Attempt $i failed, retrying in 5s..."; sleep 5; }; \
@@ -50,6 +61,53 @@ RUN apt-get update -y \
RUN pip install -U packaging 'cmake<4' ninja wheel 'setuptools<80' pybind11 Cython
RUN apt-get update && apt-get install -y libjpeg-dev libsox-dev libsox-fmt-all sox && rm -rf /var/lib/apt/lists/*
# Install sccache if USE_SCCACHE is enabled (for release builds)
ARG USE_SCCACHE
ARG SCCACHE_DOWNLOAD_URL
ARG SCCACHE_ENDPOINT
ARG SCCACHE_BUCKET_NAME
ARG SCCACHE_REGION_NAME
ARG SCCACHE_S3_NO_CREDENTIALS
RUN if [ "$USE_SCCACHE" = "1" ]; then \
echo "Installing sccache..." \
&& SCCACHE_ARCH="x86_64" \
&& SCCACHE_VERSION="v0.8.1" \
&& SCCACHE_DL_URL="${SCCACHE_DOWNLOAD_URL:-https://github.com/mozilla/sccache/releases/download/${SCCACHE_VERSION}/sccache-${SCCACHE_VERSION}-${SCCACHE_ARCH}-unknown-linux-musl.tar.gz}" \
&& curl -L -o /tmp/sccache.tar.gz ${SCCACHE_DL_URL} \
&& tar -xzf /tmp/sccache.tar.gz -C /tmp \
&& mv /tmp/sccache-${SCCACHE_VERSION}-${SCCACHE_ARCH}-unknown-linux-musl/sccache /usr/bin/sccache \
&& chmod +x /usr/bin/sccache \
&& rm -rf /tmp/sccache.tar.gz /tmp/sccache-${SCCACHE_VERSION}-${SCCACHE_ARCH}-unknown-linux-musl \
&& sccache --version; \
fi
# Setup sccache for HIP compilation via HIP_CLANG_PATH
# This creates wrapper scripts in a separate directory and points HIP to use them
# This avoids modifying the original ROCm binaries which can break detection
# NOTE: HIP_CLANG_PATH is NOT set as ENV to avoid affecting downstream images (Dockerfile.rocm)
# Instead, each build stage should export HIP_CLANG_PATH=/opt/sccache-wrappers if USE_SCCACHE=1
RUN if [ "$USE_SCCACHE" = "1" ]; then \
echo "Setting up sccache wrappers for HIP compilation..." \
&& mkdir -p /opt/sccache-wrappers \
&& printf '#!/bin/bash\nexec sccache /opt/rocm/lib/llvm/bin/clang++ "$@"\n' > /opt/sccache-wrappers/clang++ \
&& chmod +x /opt/sccache-wrappers/clang++ \
&& printf '#!/bin/bash\nexec sccache /opt/rocm/lib/llvm/bin/clang "$@"\n' > /opt/sccache-wrappers/clang \
&& chmod +x /opt/sccache-wrappers/clang \
&& echo "sccache wrappers created in /opt/sccache-wrappers"; \
fi
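The wrapper trick above (a `printf`-generated script that prepends a launcher before the real compiler) can be exercised with any stand-in command; here `echo` plays the role of `sccache clang++`, and all paths are illustrative:

```shell
# Build a compiler-wrapper directory the same way the Dockerfile does:
# a tiny executable script that forwards all arguments to a launcher.
wrapdir=$(mktemp -d)
printf '#!/bin/sh\necho wrapped "$@"\n' > "$wrapdir/clang++"
chmod +x "$wrapdir/clang++"
"$wrapdir/clang++" -O2 hello.cpp   # prints: wrapped -O2 hello.cpp
rm -rf "$wrapdir"
```

Pointing `HIP_CLANG_PATH` at such a directory makes HIP pick up the wrappers without touching the original ROCm binaries.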
# Set sccache environment variables only when USE_SCCACHE=1
# This prevents S3 config from leaking into images when sccache is not used
ARG USE_SCCACHE
ENV SCCACHE_BUCKET=${USE_SCCACHE:+${SCCACHE_BUCKET_NAME}}
ENV SCCACHE_REGION=${USE_SCCACHE:+${SCCACHE_REGION_NAME}}
ENV SCCACHE_S3_NO_CREDENTIALS=${USE_SCCACHE:+${SCCACHE_S3_NO_CREDENTIALS}}
ENV SCCACHE_IDLE_TIMEOUT=${USE_SCCACHE:+0}
###
### Triton Build
###
FROM base AS build_triton
ARG TRITON_BRANCH
ARG TRITON_REPO
@@ -62,11 +120,19 @@ RUN cd triton \
RUN if [ -d triton/python/triton_kernels ]; then pip install build && cd triton/python/triton_kernels \
&& python3 -m build --wheel && cp dist/*.whl /app/install; fi
###
### AMD SMI Build
###
FROM base AS build_amdsmi
RUN cd /opt/rocm/share/amd_smi \
&& pip wheel . --wheel-dir=dist
RUN mkdir -p /app/install && cp /opt/rocm/share/amd_smi/dist/*.whl /app/install
###
### Pytorch build
###
FROM base AS build_pytorch
ARG PYTORCH_BRANCH
ARG PYTORCH_VISION_BRANCH
@@ -74,42 +140,93 @@ ARG PYTORCH_AUDIO_BRANCH
ARG PYTORCH_REPO
ARG PYTORCH_VISION_REPO
ARG PYTORCH_AUDIO_REPO
ARG USE_SCCACHE
RUN git clone ${PYTORCH_REPO} pytorch
RUN cd pytorch && git checkout ${PYTORCH_BRANCH} \
&& pip install -r requirements.txt && git submodule update --init --recursive \
&& python3 tools/amd_build/build_amd.py \
&& if [ "$USE_SCCACHE" = "1" ]; then \
export HIP_CLANG_PATH=/opt/sccache-wrappers \
&& export CMAKE_C_COMPILER_LAUNCHER=sccache \
&& export CMAKE_CXX_COMPILER_LAUNCHER=sccache \
&& sccache --show-stats; \
fi \
&& CMAKE_PREFIX_PATH=$(python3 -c 'import sys; print(sys.prefix)') python3 setup.py bdist_wheel --dist-dir=dist \
&& if [ "$USE_SCCACHE" = "1" ]; then sccache --show-stats; fi \
&& pip install dist/*.whl
RUN git clone ${PYTORCH_VISION_REPO} vision
RUN cd vision && git checkout ${PYTORCH_VISION_BRANCH} \
&& if [ "$USE_SCCACHE" = "1" ]; then \
export HIP_CLANG_PATH=/opt/sccache-wrappers \
&& export CMAKE_C_COMPILER_LAUNCHER=sccache \
&& export CMAKE_CXX_COMPILER_LAUNCHER=sccache; \
fi \
&& python3 setup.py bdist_wheel --dist-dir=dist \
&& if [ "$USE_SCCACHE" = "1" ]; then sccache --show-stats; fi \
&& pip install dist/*.whl
RUN git clone ${PYTORCH_AUDIO_REPO} audio
RUN cd audio && git checkout ${PYTORCH_AUDIO_BRANCH} \
&& git submodule update --init --recursive \
&& pip install -r requirements.txt \
&& if [ "$USE_SCCACHE" = "1" ]; then \
export HIP_CLANG_PATH=/opt/sccache-wrappers \
&& export CMAKE_C_COMPILER_LAUNCHER=sccache \
&& export CMAKE_CXX_COMPILER_LAUNCHER=sccache; \
fi \
&& python3 setup.py bdist_wheel --dist-dir=dist \
&& if [ "$USE_SCCACHE" = "1" ]; then sccache --show-stats; fi \
&& pip install dist/*.whl
RUN mkdir -p /app/install && cp /app/pytorch/dist/*.whl /app/install \
&& cp /app/vision/dist/*.whl /app/install \
&& cp /app/audio/dist/*.whl /app/install
###
### MORI Build
###
FROM base AS build_mori
ARG MORI_BRANCH
ARG MORI_REPO
RUN --mount=type=bind,from=build_pytorch,src=/app/install/,target=/install \
pip install /install/*.whl
RUN git clone ${MORI_REPO}
RUN cd mori \
&& git checkout ${MORI_BRANCH} \
&& git submodule update --init --recursive \
&& python3 setup.py bdist_wheel --dist-dir=dist && ls /app/mori/dist/*.whl
RUN mkdir -p /app/install && cp /app/mori/dist/*.whl /app/install
###
### FlashAttention Build
###
FROM base AS build_fa
ARG FA_BRANCH
ARG FA_REPO
ARG USE_SCCACHE
RUN --mount=type=bind,from=build_pytorch,src=/app/install/,target=/install \
pip install /install/*.whl
RUN git clone ${FA_REPO}
RUN cd flash-attention \
&& git checkout ${FA_BRANCH} \
&& git submodule update --init \
-&& GPU_ARCHS=$(echo ${PYTORCH_ROCM_ARCH} | sed -e 's/;gfx1[0-9]\{3\}//g') python3 setup.py bdist_wheel --dist-dir=dist
+&& if [ "$USE_SCCACHE" = "1" ]; then \
export HIP_CLANG_PATH=/opt/sccache-wrappers \
&& sccache --show-stats; \
fi \
&& GPU_ARCHS=$(echo ${PYTORCH_ROCM_ARCH} | sed -e 's/;gfx1[0-9]\{3\}//g') python3 setup.py bdist_wheel --dist-dir=dist \
&& if [ "$USE_SCCACHE" = "1" ]; then sccache --show-stats; fi
RUN mkdir -p /app/install && cp /app/flash-attention/dist/*.whl /app/install
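The `GPU_ARCHS` filter in the flash-attention build strips every `gfx1xxx` (consumer RDNA) target from `PYTORCH_ROCM_ARCH`, leaving only CDNA parts. A standalone sketch with a shortened illustrative arch list:

```shell
# Drop all gfx1xxx entries from a semicolon-separated arch list.
PYTORCH_ROCM_ARCH="gfx90a;gfx942;gfx950;gfx1100;gfx1201"
GPU_ARCHS=$(echo ${PYTORCH_ROCM_ARCH} | sed -e 's/;gfx1[0-9]\{3\}//g')
echo "$GPU_ARCHS"   # prints: gfx90a;gfx942;gfx950
```

Note the pattern anchors on the leading `;`, so it relies on the first list entry not being a `gfx1xxx` arch.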
###
### AITER Build
###
FROM base AS build_aiter
ARG AITER_BRANCH
ARG AITER_REPO
ARG USE_SCCACHE
RUN --mount=type=bind,from=build_pytorch,src=/app/install/,target=/install \
pip install /install/*.whl
RUN git clone --recursive ${AITER_REPO}
@@ -117,9 +234,37 @@ RUN cd aiter \
&& git checkout ${AITER_BRANCH} \
&& git submodule update --init --recursive \
&& pip install -r requirements.txt
-RUN pip install pyyaml && cd aiter && PREBUILD_KERNELS=1 GPU_ARCHS=${AITER_ROCM_ARCH} python3 setup.py bdist_wheel --dist-dir=dist && ls /app/aiter/dist/*.whl
+RUN pip install pyyaml && cd aiter \
&& if [ "$USE_SCCACHE" = "1" ]; then \
export HIP_CLANG_PATH=/opt/sccache-wrappers \
&& sccache --show-stats; \
fi \
&& PREBUILD_KERNELS=1 GPU_ARCHS=${AITER_ROCM_ARCH} python3 setup.py bdist_wheel --dist-dir=dist \
&& if [ "$USE_SCCACHE" = "1" ]; then sccache --show-stats; fi \
&& ls /app/aiter/dist/*.whl
RUN mkdir -p /app/install && cp /app/aiter/dist/*.whl /app/install
###
### Final Build
###
# Wheel release stage -
# only includes dependencies used by wheel release pipeline
FROM base AS debs_wheel_release
RUN mkdir /app/debs
RUN --mount=type=bind,from=build_triton,src=/app/install/,target=/install \
cp /install/*.whl /app/debs
RUN --mount=type=bind,from=build_fa,src=/app/install/,target=/install \
cp /install/*.whl /app/debs
RUN --mount=type=bind,from=build_amdsmi,src=/app/install/,target=/install \
cp /install/*.whl /app/debs
RUN --mount=type=bind,from=build_pytorch,src=/app/install/,target=/install \
cp /install/*.whl /app/debs
RUN --mount=type=bind,from=build_aiter,src=/app/install/,target=/install \
cp /install/*.whl /app/debs
# Full debs stage - includes Mori (used by Docker releases)
FROM base AS debs
RUN mkdir /app/debs
RUN --mount=type=bind,from=build_triton,src=/app/install/,target=/install \
@@ -132,6 +277,8 @@ RUN --mount=type=bind,from=build_pytorch,src=/app/install/,target=/install \
cp /install/*.whl /app/debs
RUN --mount=type=bind,from=build_aiter,src=/app/install/,target=/install \
cp /install/*.whl /app/debs
RUN --mount=type=bind,from=build_mori,src=/app/install/,target=/install \
cp /install/*.whl /app/debs
FROM base AS final
RUN --mount=type=bind,from=debs,src=/app/debs,target=/install \
@@ -150,6 +297,8 @@ ARG FA_BRANCH
ARG FA_REPO
ARG AITER_BRANCH
ARG AITER_REPO
ARG MORI_BRANCH
ARG MORI_REPO
RUN echo "BASE_IMAGE: ${BASE_IMAGE}" > /app/versions.txt \
&& echo "TRITON_BRANCH: ${TRITON_BRANCH}" >> /app/versions.txt \
&& echo "TRITON_REPO: ${TRITON_REPO}" >> /app/versions.txt \
@@ -162,4 +311,6 @@ RUN echo "BASE_IMAGE: ${BASE_IMAGE}" > /app/versions.txt \
&& echo "FA_BRANCH: ${FA_BRANCH}" >> /app/versions.txt \
&& echo "FA_REPO: ${FA_REPO}" >> /app/versions.txt \
&& echo "AITER_BRANCH: ${AITER_BRANCH}" >> /app/versions.txt \
-&& echo "AITER_REPO: ${AITER_REPO}" >> /app/versions.txt
+&& echo "AITER_REPO: ${AITER_REPO}" >> /app/versions.txt \
&& echo "MORI_BRANCH: ${MORI_BRANCH}" >> /app/versions.txt \
&& echo "MORI_REPO: ${MORI_REPO}" >> /app/versions.txt


@@ -2,7 +2,7 @@ FROM intel/deep-learning-essentials:2025.2.2-0-devel-ubuntu24.04 AS vllm-base
RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null && \
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | tee /etc/apt/sources.list.d/oneAPI.list && \
-add-apt-repository -y ppa:kobuk-team/intel-graphics
+add-apt-repository -y ppa:kobuk-team/intel-graphics-staging
RUN apt clean && apt-get update -y && \
apt-get install -y --no-install-recommends --fix-missing \
@@ -28,10 +28,14 @@ RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.12 1
RUN apt install -y libze1 libze-dev libze-intel-gpu1 intel-opencl-icd libze-intel-gpu-raytracing intel-ocloc
# This oneccl contains the BMG support which is not the case for default version of oneapi 2025.2.
RUN wget https://github.com/uxlfoundation/oneCCL/releases/download/2021.15.6/intel-oneccl-2021.15.6.9_offline.sh ARG ONECCL_INSTALLER="intel-oneccl-2021.15.7.6_offline.sh"
RUN bash intel-oneccl-2021.15.6.9_offline.sh -a --silent --eula accept && \ RUN wget "https://github.com/uxlfoundation/oneCCL/releases/download/2021.15.7/${ONECCL_INSTALLER}" && \
bash "${ONECCL_INSTALLER}" -a --silent --eula accept && \
rm "${ONECCL_INSTALLER}" && \
echo "source /opt/intel/oneapi/setvars.sh --force" >> /root/.bashrc && \ echo "source /opt/intel/oneapi/setvars.sh --force" >> /root/.bashrc && \
echo "source /opt/intel/oneapi/ccl/2021.15/env/vars.sh --force" >> /root/.bashrc echo "source /opt/intel/oneapi/ccl/2021.15/env/vars.sh --force" >> /root/.bashrc
RUN rm -f /opt/intel/oneapi/ccl/latest && \
ln -s /opt/intel/oneapi/ccl/2021.15 /opt/intel/oneapi/ccl/latest
SHELL ["bash", "-c"] SHELL ["bash", "-c"]
CMD ["bash", "-c", "source /root/.bashrc && exec bash"] CMD ["bash", "-c", "source /root/.bashrc && exec bash"]
@@ -47,6 +51,11 @@ RUN --mount=type=cache,target=/root/.cache/pip \
pip install --no-cache-dir \ pip install --no-cache-dir \
-r requirements/xpu.txt -r requirements/xpu.txt
# arctic-inference is built from source which needs torch-xpu properly installed
# used for suffix method speculative decoding
RUN --mount=type=cache,target=/root/.cache/pip \
pip install --no-cache-dir arctic-inference==0.1.1
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/lib/" ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/lib/"
COPY . . COPY . .

docker/docker-bake.hcl (new file)

@@ -0,0 +1,76 @@
# docker-bake.hcl - vLLM Docker build configuration
#
# This file lives in vLLM repo at docker/docker-bake.hcl
#
# Usage:
#   cd docker && docker buildx bake        # Build default target (openai)
#   cd docker && docker buildx bake test   # Build test target
#   docker buildx bake --print             # Show resolved config
#
# Reference: https://docs.docker.com/build/bake/reference/

# Build configuration
variable "MAX_JOBS" {
  default = 16
}

variable "NVCC_THREADS" {
  default = 8
}

variable "TORCH_CUDA_ARCH_LIST" {
  default = "8.0 8.9 9.0 10.0"
}

variable "COMMIT" {
  default = ""
}

# Groups
group "default" {
  targets = ["openai"]
}

# Base targets
target "_common" {
  dockerfile = "docker/Dockerfile"
  context    = "."
  args = {
    max_jobs             = MAX_JOBS
    nvcc_threads         = NVCC_THREADS
    torch_cuda_arch_list = TORCH_CUDA_ARCH_LIST
  }
}

target "_labels" {
  labels = {
    "org.opencontainers.image.source"      = "https://github.com/vllm-project/vllm"
    "org.opencontainers.image.vendor"      = "vLLM"
    "org.opencontainers.image.title"       = "vLLM"
    "org.opencontainers.image.description" = "vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs"
    "org.opencontainers.image.licenses"    = "Apache-2.0"
    "org.opencontainers.image.revision"    = COMMIT
  }
  annotations = [
    "index,manifest:org.opencontainers.image.revision=${COMMIT}",
  ]
}

# Build targets
target "test" {
  inherits = ["_common", "_labels"]
  target   = "test"
  tags     = ["vllm:test"]
  output   = ["type=docker"]
}

target "openai" {
  inherits = ["_common", "_labels"]
  target   = "vllm-openai"
  tags     = ["vllm:openai"]
  output   = ["type=docker"]
}
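Since `MAX_JOBS`, `NVCC_THREADS`, `TORCH_CUDA_ARCH_LIST`, and `COMMIT` are declared as bake variables, they can be overridden from the environment at build time; a sketch of such an invocation (values are illustrative):

```shell
# Override bake variables from the environment, then inspect the
# resolved configuration before actually building.
cd docker
MAX_JOBS=32 NVCC_THREADS=4 TORCH_CUDA_ARCH_LIST="9.0" \
    docker buildx bake --print openai
```

`--print` only resolves and prints the configuration, so it is a cheap way to verify the overrides took effect before starting a long build.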


@@ -62,7 +62,7 @@ vLLM is flexible and easy to use with:
For more information, check out the following:
-- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
+- [vLLM announcing blog post](https://blog.vllm.ai/2023/06/20/vllm.html) (intro to PagedAttention)
- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
- [vLLM Meetups](community/meetups.md)


@@ -72,7 +72,6 @@ Internal data structures.
- [vllm.multimodal.inputs.MultiModalFieldConfig][]
- [vllm.multimodal.inputs.MultiModalKwargsItem][]
- [vllm.multimodal.inputs.MultiModalKwargsItems][]
-- [vllm.multimodal.inputs.MultiModalKwargs][]
- [vllm.multimodal.inputs.MultiModalInputs][]
### Data Parsing

Binary file not shown (image added, 1.1 MiB).


@@ -8,12 +8,19 @@ The results are automatically published to the public [vLLM Performance Dashboar
## Manually Trigger the benchmark
Use [vllm-ci-test-repo images](https://gallery.ecr.aws/q9t5s3a7/vllm-ci-test-repo) with vLLM benchmark suite.
-For CPU environment, please use the image with "-cpu" postfix.
+For x86 CPU environment, please use the image with "-cpu" postfix. For AArch64 CPU environment, please use the image with "-arm64-cpu" postfix.
-Here is an example for docker run command for CPU.
+Here is an example docker run command for CPU. For GPUs, skip setting the `ON_CPU` env var.
```bash
-docker run -it --entrypoint /bin/bash -v /data/huggingface:/root/.cache/huggingface -e HF_TOKEN='' --shm-size=16g --name vllm-cpu-ci public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:1da94e673c257373280026f75ceb4effac80e892-cpu
+export VLLM_COMMIT=1da94e673c257373280026f75ceb4effac80e892 # use full commit hash from the main branch
+export HF_TOKEN=<valid Hugging Face token>
+if [[ "$(uname -m)" == aarch64 || "$(uname -m)" == arm64 ]]; then
+    IMG_SUFFIX="arm64-cpu"
+else
+    IMG_SUFFIX="cpu"
+fi
+docker run -it --entrypoint /bin/bash -v /data/huggingface:/root/.cache/huggingface -e HF_TOKEN=$HF_TOKEN -e ON_ARM64_CPU=1 --shm-size=16g --name vllm-cpu-ci public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:${VLLM_COMMIT}-${IMG_SUFFIX}
```
Then, run the command below inside the docker instance.
@@ -26,14 +33,65 @@ When run, benchmark script generates results under **benchmark/results** folder,
### Runtime environment variables
-- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
+- `ON_CPU`: set the value to '1' on Intel® Xeon® and Arm® Neoverse™ Processors. Default value is 0.
- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
- `THROUGHPUT_JSON`: JSON file to use for the throughput tests. Default value is empty string (use default file).
- `REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string.
- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
-For more results visualization, check the [visualizing the results](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md#visualizing-the-results).
+### Visualization
The `convert-results-json-to-markdown.py` script puts the benchmarking results into a markdown table.
You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
If you do not see the table, please wait until the benchmark finishes running.
The json version of the table (together with the json version of the benchmark) will also be attached to the markdown file.
The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking run.
#### Performance Results Comparison
The `compare-json-results.py` script compares benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
When run, the benchmark script generates results under the `benchmark/results` folder, along with `benchmark_results.md` and `benchmark_results.json`.
`compare-json-results.py` compares two `benchmark_results.json` files and reports performance ratios, e.g. for Output Tput, Median TTFT, and Median TPOT.
If only one `benchmark_results.json` is passed, `compare-json-results.py` instead compares the different TP and PP configurations within that file.
Here is an example using the script to compare result_a and result_b with max concurrency and qps for the same Model, Dataset Name, and Input/Output Length:
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`
***Output Tput (tok/s) — Model: [ meta-llama/Llama-3.1-8B-Instruct ], Dataset Name: [ random ], Input Len: [ 2048.0 ], Output Len: [ 2048.0 ]***

|    | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
|----|----------------------|-----|----------------------------------|----------------------------------|------------|
| 0  | 12                   | inf | 24.98                            | 186.03                           | 7.45       |
| 1  | 16                   | inf | 25.49                            | 246.92                           | 9.69       |
| 2  | 24                   | inf | 27.74                            | 293.34                           | 10.57      |
| 3  | 32                   | inf | 28.61                            | 306.69                           | 10.72      |
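The `perf_ratio` column above is simply the second result divided by the first at each matching concurrency level; a minimal sketch of that computation (the flat dicts below are a hypothetical simplification of `benchmark_results.json`, not the script's actual data layout):

```python
# Compute perf_ratio = result_b / result_a at each matching concurrency level.
# concurrency -> Output Tput (tok/s), taken from the example table above.
results_a = {12: 24.98, 16: 25.49, 24: 27.74, 32: 28.61}
results_b = {12: 186.03, 16: 246.92, 24: 293.34, 32: 306.69}

ratios = {c: round(results_b[c] / results_a[c], 2) for c in results_a}
print(ratios)  # {12: 7.45, 16: 9.69, 24: 10.57, 32: 10.72}
```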
***compare-json-results.py Command-Line Parameters***
`compare-json-results.py` provides configurable parameters to compare one or more `benchmark_results.json` files and generate summary tables and plots.
In most cases, users only need to specify `--file` to parse the desired benchmark results.
| Parameter | Type | Default Value | Description |
| ---------------------- | ------------------ | ----------------------- | ----------------------------------------------------------------------------------------------------- |
| `--file` | `str` (appendable) | *None* | Input JSON result file(s). Can be specified multiple times to compare multiple benchmark outputs. |
| `--debug` | `bool` | `False` | Enables debug mode. When set, prints all available information to aid troubleshooting and validation. |
| `--plot` / `--no-plot` | `bool` | `True` | Controls whether performance plots are generated. Use `--no-plot` to disable graph generation. |
| `--xaxis` | `str` | `# of max concurrency.` | Column name used as the X-axis in comparison plots (for example, concurrency or batch size). |
| `--latency` | `str` | `p99` | Latency aggregation method used for TTFT/TPOT. Supported values: `median` or `p99`. |
| `--ttft-max-ms` | `float` | `3000.0` | Reference upper bound (milliseconds) for TTFT plots, typically used to visualize SLA thresholds. |
| `--tpot-max-ms` | `float` | `100.0` | Reference upper bound (milliseconds) for TPOT plots, typically used to visualize SLA thresholds. |
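Putting these parameters together, a typical invocation that compares two runs using median latency, a relaxed TTFT bound, and no plot generation might look like the following (the result paths are illustrative):

```shell
# Compare two runs with median latency aggregation and custom SLA bounds;
# --no-plot skips graph generation for a faster, text-only comparison.
python3 compare-json-results.py \
    -f results_a/benchmark_results.json \
    -f results_b/benchmark_results.json \
    --latency median \
    --ttft-max-ms 10000 \
    --no-plot
```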
***Valid Max Concurrency Summary***
Based on the configured TTFT and TPOT SLA thresholds, compare-json-results.py computes the maximum valid concurrency for each benchmark result.
The “Max # of max concurrency. (Both)” column represents the highest concurrency level that satisfies both TTFT and TPOT constraints simultaneously.
This value is typically used in capacity planning and sizing guides.
| # | Configuration | Max # of max concurrency. (TTFT ≤ 10000 ms) | Max # of max concurrency. (TPOT ≤ 100 ms) | Max # of max concurrency. (Both) | Output Tput @ Both (tok/s) | TTFT @ Both (ms) | TPOT @ Both (ms) |
| - | -------------- | ------------------------------------------- | ----------------------------------------- | -------------------------------- | -------------------------- | ---------------- | ---------------- |
| 0 | results-a | 128.00 | 12.00 | 12.00 | 127.76 | 3000.82 | 93.24 |
| 1 | results-b | 128.00 | 32.00 | 32.00 | 371.42 | 2261.53 | 81.74 |
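The "Both" column can be derived by filtering each configuration's rows on both SLA thresholds and taking the largest surviving concurrency; a minimal sketch under assumed row data (hypothetical numbers, not the script's actual logic):

```python
# For one configuration, find the highest concurrency whose TTFT and TPOT
# both stay within the SLA thresholds. Rows are (concurrency, ttft_ms, tpot_ms).
rows = [(12, 3000.82, 93.24), (32, 5210.0, 101.3), (128, 9800.0, 240.0)]
TTFT_MAX_MS, TPOT_MAX_MS = 10000.0, 100.0

valid = [c for c, ttft, tpot in rows if ttft <= TTFT_MAX_MS and tpot <= TPOT_MAX_MS]
max_both = max(valid) if valid else None
print(max_both)  # 12 — only concurrency 12 satisfies both bounds
```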
More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](../../.buildkite/performance-benchmarks/performance-benchmarks-descriptions.md).


@@ -129,10 +129,10 @@ vllm bench sweep serve_sla \
The algorithm for adjusting the SLA variable is as follows:
-1. Run the benchmark with infinite QPS, and use the corresponding metrics to determine the initial value of the variable.
-    - For example, the initial request rate is set to the concurrency under infinite QPS.
-2. If the SLA is still satisfied, keep doubling the value until the SLA is no longer satisfied. This gives a relatively narrow window that contains the point where the SLA is barely satisfied.
-3. Apply binary search over the window to find the maximum value that still satisfies the SLA.
+1. Run the benchmark once with maximum possible QPS, and once with minimum possible QPS. For each run, calculate the distance of the SLA metrics from their targets, resulting in data points of QPS vs SLA distance.
+2. Perform spline interpolation between the data points to estimate the QPS that results in zero SLA distance.
+3. Run the benchmark with the estimated QPS and add the resulting data point to the history.
+4. Repeat Steps 2 and 3 until the maximum QPS that passes SLA and the minimum QPS that fails SLA in the history are close enough to each other.
!!! important
    SLA tuning is applied over each combination of `--serve-params`, `--bench-params`, and `--sla-params`.
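The interpolation loop in the new steps 1–4 can be sketched as follows. This is a simplified stand-in, not the actual `vllm bench sweep` implementation: `run_benchmark` is a hypothetical function returning the SLA "distance" (positive means the SLA passes with margin, negative means it is violated), and linear interpolation stands in for the spline interpolation the real algorithm uses:

```python
def run_benchmark(qps: float) -> float:
    # Hypothetical stand-in for a real benchmark run. Here latency grows
    # linearly with QPS and the SLA distance crosses zero at qps == 40.
    return 40.0 - qps

def find_max_qps(qps_min: float = 1.0, qps_max: float = 1000.0, tol: float = 0.5) -> float:
    # Step 1: probe the extremes to bracket the zero crossing of SLA distance.
    history = [(qps_min, run_benchmark(qps_min)), (qps_max, run_benchmark(qps_max))]
    while True:
        lo = max(q for q, d in history if d >= 0)   # best QPS that passes SLA
        hi = min(q for q, d in history if d < 0)    # worst QPS that fails SLA
        # Step 4: stop once the pass/fail bracket is tight enough.
        if hi - lo <= tol:
            return lo
        # Step 2: interpolate between the bracket endpoints to estimate
        # where the SLA distance hits zero (the real code uses splines).
        d_lo, d_hi = dict(history)[lo], dict(history)[hi]
        q_est = lo + (hi - lo) * d_lo / (d_lo - d_hi)
        # Keep the estimate strictly inside the bracket to guarantee progress.
        q_est = min(max(q_est, lo + tol / 4), hi - tol / 4)
        # Step 3: run at the estimate and record the new data point.
        history.append((q_est, run_benchmark(q_est)))

print(round(find_max_qps(), 2))  # 40.0
```

Because every iteration adds a point strictly inside the current pass/fail bracket, the bracket shrinks monotonically and the loop terminates once it is within `tol`.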


@@ -6,4 +6,4 @@
## Arguments
---8<-- "docs/argparse/bench_latency.inc.md"
+--8<-- "docs/generated/argparse/bench_latency.inc.md"

Some files were not shown because too many files have changed in this diff.