[Bugfix] Disable cross-layer KV cache for MLA attention backends (#37090)

Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
Author: haosdent
Date: 2026-03-17 01:03:10 +08:00
Committed by: GitHub
Parent: 55e6d3d5c0
Commit: ca1954d58c
5 changed files with 56 additions and 8 deletions

@@ -63,6 +63,9 @@ class DeepseekV32IndexerBackend(AttentionBackend):
         include_num_layers_dimension: bool = False,
     ) -> tuple[int, ...]:
         if include_num_layers_dimension:
+            # DeepseekV32Indexer kernels do not support the cross-layer
+            # KV cache layout. Returning the identity permutation keeps
+            # num_layers as the leading dimension, which signals that
+            # this backend is incompatible with cross-layer caching.
             return (0, 1, 2, 3)
         return (0, 1, 2)
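
The stride order returned here doubles as a capability flag: an identity permutation (num_layers left as the leading dimension) tells the cache planner that the backend's kernels cannot share one cross-layer KV cache tensor. Below is a minimal sketch of how a consumer might read that signal, assuming only the convention stated in the comment above; supports_cross_layer_kv_cache is a hypothetical helper for illustration, not part of vLLM's API.

# Hypothetical illustration (not vLLM's actual API) of consuming the
# convention from this hunk: stride_order[0] == 0 means num_layers
# stays the leading dimension, i.e. no cross-layer sharing.
def supports_cross_layer_kv_cache(stride_order: tuple[int, ...]) -> bool:
    return stride_order[0] != 0

# DeepseekV32IndexerBackend returns the identity permutation
# (0, 1, 2, 3), so cross-layer KV cache is disabled for it.
assert supports_cross_layer_kv_cache((0, 1, 2, 3)) is False
# A backend that moves num_layers off the leading axis would opt in.
assert supports_cross_layer_kv_cache((1, 0, 2, 3)) is True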