vllm/docs/features/nixl_connector_compatibility.md
Nicolò Lucchesi 7337ff7f03 [Docs] PD with Nixl compat matrix (#38628)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-03-31 15:01:21 +00:00


# NixlConnector Compatibility Matrix

This page documents the feature compatibility of disaggregated prefilling with the NixlConnector. For general usage instructions, see the NixlConnector Usage Guide. For an overview of disaggregated prefilling, see Disaggregated Prefilling.

!!! note
    This page reflects the current state of the codebase and is subject to change as features evolve. Entries marked 🟠 or 🚧 may link to tracking issues. See the NIXL connector roadmap for upcoming feature development.

Legend:

- ✅ = Fully supported
- 🟠 = Partial support (see footnotes)
- ❌ = Not supported
- ❓ = Unknown / not yet validated
- 🚧 = Work in progress

!!! info "Universally supported features"
    The following features work with all model architectures when using NixlConnector PD disaggregated serving:

    [Chunked Prefill](../configuration/optimization.md#chunked-prefill) |
    [APC (Prefix Caching)](automatic_prefix_caching.md) |
    [Data Parallel](../serving/data_parallel_deployment.md) |
    CUDA graph |
    Logprobs |
    Prompt Logprobs |
    [Prompt Embeds](prompt_embeds.md) |
    Multiple NIXL backends (UCX, GDS, LIBFABRIC, etc.)

## Model Architecture x Capability

<style> td:not(:first-child) { text-align: center !important; } td { padding: 0.5rem !important; white-space: nowrap; } th { padding: 0.5rem !important; min-width: 0 !important; } th:not(:first-child) { writing-mode: vertical-lr; transform: rotate(180deg) } </style>
| Model type | Basic PD | Spec Decode | Hetero TP | Cross-layer blocks | SWA | Host buffer | Hetero block size |
|---|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Dense Transformers | ✅ | ✅1 | ✅ | ✅2 | ✅ | ✅ | 🟠3 |
| MLA (e.g. DeepSeek-V2/V3) | ✅ | ✅1 | 🟠4 | ✅2 | ❓ | ✅ | 🟠3 |
| Sparse MLA (e.g. DeepSeek-V3.2) | ✅ | ✅1 | 🟠4 | ✅2 | ❓ | ✅ | 🟠3 |
| Hybrid SSM / Mamba | ✅ | ❓ | 🚧5 | ❌ | ❓ | ✅ | ❌6 |
| MoE | ✅ | ✅1 | ✅ | ✅2 | ✅ | ✅ | 🟠3 |
| Multimodal | ✅ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Encoder-Decoder | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |

1 P and D instances must use the same speculation configuration.

2 Requires the FLASH_ATTN or FLASHINFER backend and the HND KV cache layout. Enable via `--kv-transfer-config '{"kv_connector_extra_config": {"enable_cross_layers_blocks": "True"}}'`.

3 Supported only when HMA is not required (i.e., non-hybrid models). Block IDs are remapped automatically. Only P block size < D block size is supported.

4 MLA KV cache is replicated across TP workers, so heterogeneous TP works but there is no head-splitting. When P TP > D TP, only a single read is executed (redundant ranks are skipped). D TP > P TP also works.

5 Hybrid SSM (Mamba) models require homogeneous TP (P TP == D TP). Heterogeneous TP is not yet supported for Mamba layers.

6 HMA (required by hybrid models) does not support different remote block sizes.
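The block-ID remapping from footnote 3 can be pictured as simple index arithmetic. The sketch below is an illustrative model, not vLLM's implementation: it assumes both instances index their blocks from the start of the request and that the P block size divides the D block size (a stricter condition than the footnote's "P block size < D block size", chosen to keep the arithmetic simple).

```python
# Illustrative only: how block IDs on a decode (D) instance with a larger
# block size can be expanded into the prefill (P) block IDs that cover the
# same token range. Assumes both sides number blocks from token 0.

def remap_block_ids(decode_block_ids, p_block_size, d_block_size):
    """Expand decode-side block IDs into the prefill-side IDs they cover."""
    assert d_block_size % p_block_size == 0, "P block size must divide D block size"
    factor = d_block_size // p_block_size
    # Decode block b holds the same tokens as prefill blocks
    # b * factor, b * factor + 1, ..., b * factor + factor - 1.
    return [b * factor + i for b in decode_block_ids for i in range(factor)]

# Decode blocks 0 and 2 with a 32-token block size map onto
# prefill blocks [0, 1] and [4, 5] with a 16-token block size.
mapped = remap_block_ids([0, 2], p_block_size=16, d_block_size=32)
```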

## Configuration Notes

### What must match between P and D

By default, a compatibility hash is checked during handshake. P and D instances must agree on:

- vLLM version and NIXL connector version
- Model (architecture, dtype, number of KV heads, head size, number of hidden layers)
- Attention backend
- KV cache dtype (`cache_dtype`)

!!! warning
    Disable the hash check with `--kv-transfer-config '{"kv_connector_extra_config": {"enforce_handshake_compat": false}}'` at your own risk.
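The handshake check can be pictured as both sides digesting the fields that must match and comparing the result. The sketch below is hypothetical: the field names, values, and digest scheme are illustrative, not vLLM's actual wire format.

```python
# Toy model of a handshake compatibility hash (not vLLM's implementation):
# each instance serializes the fields that must agree and compares digests.
import hashlib
import json

def compat_hash(meta: dict) -> str:
    # Sort keys so both sides serialize the same dict identically.
    payload = json.dumps(meta, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

prefill = {
    "vllm_version": "0.10.0",          # assumed version string
    "connector_version": 1,            # assumed connector version
    "model_arch": "LlamaForCausalLM",
    "dtype": "bfloat16",
    "num_kv_heads": 8,
    "head_size": 128,
    "num_layers": 32,
    "attn_backend": "FLASH_ATTN",
    "cache_dtype": "auto",
}
# A decoder with a different KV cache dtype would fail the check.
decode = dict(prefill, cache_dtype="fp8")

matching = compat_hash(prefill) == compat_hash(dict(prefill))
mismatched = compat_hash(prefill) == compat_hash(decode)
```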

### What can safely differ between P and D

- `tensor-parallel-size` (heterogeneous TP, subject to the model restrictions above)
- `block-size` (heterogeneous block size, subject to the restrictions above)
- Number of KV cache blocks (determined by the available memory on each instance)
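Why the block counts naturally differ: each instance sizes its own KV cache from its free GPU memory, so a prefill node and a decode node with different memory budgets end up with different block counts. A back-of-the-envelope sketch with assumed model dimensions (32 layers, 8 KV heads, head size 128, bf16; the formula is the standard KV-cache sizing arithmetic, not code from vLLM):

```python
# Illustrative KV cache sizing: the block count is derived independently
# on each instance from its free memory, so P and D need not agree.

def num_kv_blocks(free_bytes, block_size, num_layers, num_kv_heads,
                  head_size, dtype_bytes=2):
    # One block stores block_size tokens of both K and V for every layer.
    block_bytes = 2 * num_layers * num_kv_heads * head_size * block_size * dtype_bytes
    return free_bytes // block_bytes

# A prefill GPU with 40 GiB free vs a decode GPU with 20 GiB free:
p_blocks = num_kv_blocks(40 * 2**30, block_size=16, num_layers=32,
                         num_kv_heads=8, head_size=128)
d_blocks = num_kv_blocks(20 * 2**30, block_size=16, num_layers=32,
                         num_kv_heads=8, head_size=128)
```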

### KV cache layout

- NixlConnector defaults to the HND layout for optimal transfer performance (non-MLA models).
- The NHD layout is supported but does not allow heterogeneous-TP head splitting.
- Experimental HND↔NHD permute: enable via `--kv-transfer-config '{"enable_permute_local_kv": true}'`. Not supported with HMA.
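The practical difference between the layouts is which dimension is contiguous. The sketch below models the addressing of a single KV block with N tokens, H heads, and D head dims (illustrative only; real cache blocks carry extra dimensions and strides). In HND, each head's data is one contiguous range, which is what makes per-head transfers for heterogeneous TP possible; in NHD, a head's data is scattered across tokens.

```python
# Flat offsets into one KV block under the two layouts (illustrative).

def nhd_offset(n, h, d, H, D):
    return (n * H + h) * D + d   # token-major: heads interleaved per token

def hnd_offset(n, h, d, N, D):
    return (h * N + n) * D + d   # head-major: each head is contiguous

N, H, D = 4, 2, 8  # assumed toy dimensions

# All elements of head 1 under each layout:
head1_hnd = [hnd_offset(n, 1, d, N, D) for n in range(N) for d in range(D)]
head1_nhd = [nhd_offset(n, 1, d, H, D) for n in range(N) for d in range(D)]
```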

### Quantized KV cache

Quantized KV cache (e.g., FP8) requires both P and D instances to use the same `cache_dtype`. Mismatched cache dtypes fail the compatibility hash check during handshake.

- **Static quantization** (scales loaded from checkpoint): supported. Scales are loaded independently by each instance from the model checkpoint.
- **Dynamic quantization** (scales computed at runtime): not supported. Per-block scales are not transferred alongside KV cache data.
- **Packed-layout scales** (scales stored inline with weights): supported. Scales are transferred together with the KV cache blocks.
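Why static scales survive disaggregation while dynamic ones do not can be seen with a toy symmetric quantizer (illustrative only: the scale value and integer rounding below stand in for FP8 semantics). With a static scale from the checkpoint, both instances hold the same scale, so the decode side can dequantize blocks it receives; with a per-block scale computed at runtime on the prefill side, the raw quantized block alone is not enough to recover the values.

```python
# Toy symmetric quantization (not real FP8): round(v / scale) on write,
# q * scale on read. The scale is the piece both sides must share.

STATIC_SCALE = 0.05  # hypothetical scale shipped in the checkpoint

def quantize(values, scale):
    return [round(v / scale) for v in values]

def dequantize(qvalues, scale):
    return [q * scale for q in qvalues]

kv = [0.1, -0.25, 0.4]
q = quantize(kv, STATIC_SCALE)           # transferred P -> D
recovered = dequantize(q, STATIC_SCALE)  # D reuses the same checkpoint scale
```

With dynamic quantization, `STATIC_SCALE` would instead be a per-block value computed on P and never transferred, leaving D unable to run the `dequantize` step.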