# NixlConnector Compatibility Matrix This page documents the feature compatibility of **disaggregated prefilling with the NixlConnector**. For general usage instructions, see the [NixlConnector Usage Guide](nixl_connector_usage.md). For an overview of disaggregated prefilling, see [Disaggregated Prefilling](disagg_prefill.md). !!! note This page reflects the current state of the codebase and is subject to change as features evolve. Entries marked 🟠 or ❌ may link to tracking issues. See the [NIXL connector roadmap](https://github.com/vllm-project/vllm/issues/33702) for upcoming feature development. **Legend:** - ✅ = Fully supported - 🟠 = Partial support (see footnotes) - ❌ = Not supported - ❔ = Unknown / not yet validated - 🚧 = Work in progress !!! info "Universally supported features" The following features work with **all** model architectures when using NixlConnector PD disaggregated serving: [Chunked Prefill](../configuration/optimization.md#chunked-prefill) | [APC (Prefix Caching)](automatic_prefix_caching.md) | [Data Parallel](../serving/data_parallel_deployment.md) | CUDA graph | Logprobs | Prompt Logprobs | [Prompt Embeds](prompt_embeds.md) | Multiple NIXL backends (UCX, GDS, LIBFABRIC, etc.) ## Model Architecture x Capability | Model type | Basic PD | Spec Decode | Hetero TP | Cross-layer blocks | SWA | Host buffer | Hetero block size | | - | - | - | - | - | - | - | - | | Dense Transformers | ✅ | ✅¹ | ✅ | ✅² | ✅ | ✅ | 🟠³ | | MLA (e.g. DeepSeek-V2/V3) | ✅ | ✅¹ | 🟠⁴ | ✅² | ✅ | ✅ | 🟠³ | | Sparse MLA (e.g. DeepSeek-V3.2) | ✅ | ✅¹ | 🟠⁴ | ✅² | ✅ | ✅ | 🟠³ | | Hybrid SSM / Mamba | ✅ | ❔ | 🚧⁵ | ❌ | ✅ | ✅ | ❌⁶ | | MoE | ✅ | ✅¹ | ✅ | ✅² | ✅ | ✅ | 🟠³ | | Multimodal | ❔ | ❔ | ❔ | ❔ | ❔ | ❔ | ❔ | | Encoder-Decoder | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ¹ P and D instances must use the same speculation configuration. ² Requires `FLASH_ATTN` or `FLASHINFER` backend **and** `HND` KV cache layout. Enable via `--kv-transfer-config '{"kv_connector_extra_config": {"enable_cross_layers_blocks": "True"}}'`. ³ Supported only when HMA is **not** required (i.e., non-hybrid models). Block IDs are remapped automatically. Only P block size < D block size is supported. ⁴ MLA KV cache is replicated across TP workers, so heterogeneous TP works but there is no head-splitting. When P TP > D TP, only a single read is executed (redundant ranks are skipped). D TP > P TP also works. ⁵ Hybrid SSM (Mamba) models require **homogeneous TP** (`P TP == D TP`). Heterogeneous TP is not yet supported for Mamba layers. ⁶ HMA (required by hybrid models) does not support different remote block sizes. ## Configuration Notes ### What must match between P and D By default, a **compatibility hash** is checked during handshake. P and D instances must agree on: - vLLM version and NIXL connector version - Model (architecture, dtype, number of KV heads, head size, number of hidden layers) - Attention backend - KV cache dtype (`cache_dtype`) !!! warning Disable the hash check with `--kv-transfer-config '{"kv_connector_extra_config": {"enforce_handshake_compat": false}}'` at your own risk. ### What can safely differ between P and D - `tensor-parallel-size` (heterogeneous TP, subject to model restrictions above) - `block-size` (heterogeneous block size, subject to restrictions above) - Number of KV cache blocks (determined by available memory on each instance) ### KV cache layout - NixlConnector defaults to **`HND`** layout for optimal transfer performance (non-MLA models). - `NHD` layout is supported but does **not** allow heterogeneous TP head splitting. - Experimental `HND` ↔ `NHD` permute: enable via `--kv-transfer-config '{"enable_permute_local_kv": true}'`. Not supported with HMA. ### Quantized KV cache [Quantized KV cache](quantization/quantized_kvcache.md) (e.g., FP8) requires both P and D instances to use the **same** `cache_dtype`. Mismatched cache dtypes will fail the compatibility hash check during handshake. - **Static quantization** (scales loaded from checkpoint): ✅ Supported. Scales are loaded independently by each instance from the model checkpoint. - **Dynamic quantization** (scales computed at runtime): ❌ Not supported. Per-block scales are not transferred alongside KV cache data. - **Packed-layout scales** (scales stored inline with weights): ✅ Supported. Scales are transferred together with the KV cache blocks.