# NixlConnector Usage Guide

NixlConnector is a high-performance KV cache transfer connector for vLLM's disaggregated prefilling feature. It provides fully asynchronous send/receive operations using the NIXL library for efficient cross-process KV cache transfer.

## Prerequisites

### Installation

For a quick start on NVIDIA platforms, install the NIXL library with `uv pip install nixl`.

- Refer to [NIXL official repository](https://github.com/ai-dynamo/nixl) for more installation instructions
- The required NIXL version is specified in [requirements/kv_connectors.txt](../../requirements/kv_connectors.txt) and other relevant config files

On ROCm platforms, the [base ROCm Dockerfile](../../docker/Dockerfile.rocm_base) already includes RIXL and UCX.

- Refer to [RIXL official repository](https://github.com/rocm/rixl) for more information
- The supporting libraries for RIXL are listed in [requirements/kv_connectors_rocm.txt](../../requirements/kv_connectors_rocm.txt)
- In the future, RIXL may be removed from the Docker image so that users can install it from pre-compiled binary packages

On non-CUDA platforms, build and install NIXL with UCX from source using the provided script:

```bash
python tools/install_nixl_from_source_ubuntu.py
```

### Transport Configuration

NixlConnector uses the NIXL library for underlying communication, which supports multiple transport backends. UCX (Unified Communication X) is the default transport library used by NIXL. Configure it through UCX environment variables:

```bash
# Example UCX configuration, adjust according to your environment
export UCX_TLS=all  # or list specific transports, e.g. "rc,ud,sm,^cuda_ipc"
export UCX_NET_DEVICES=all  # or specify network devices like "mlx5_0:1,mlx5_1:1"
```

!!! tip
    When using UCX as the transport backend, NCCL environment variables (like `NCCL_IB_HCA`, `NCCL_SOCKET_IFNAME`) are not applicable to NixlConnector, so configure UCX-specific environment variables instead of NCCL variables.
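
As a quick sanity check before launching, a small helper can flag NCCL network variables that are set while no UCX variable is. This is only an illustrative sketch: the helper name `ignored_nccl_vars` is hypothetical, and the variable lists simply mirror the examples above rather than any exhaustive set.

```python
import os

# NCCL settings named in the tip above; they do not apply to NixlConnector over UCX.
NCCL_NETWORK_VARS = ("NCCL_IB_HCA", "NCCL_SOCKET_IFNAME")
# UCX settings shown in the example configuration above.
UCX_NETWORK_VARS = ("UCX_TLS", "UCX_NET_DEVICES")


def ignored_nccl_vars(env):
    """Return NCCL network variables that are set while no UCX variable is.

    Hypothetical helper: an empty list means either no NCCL variable is set
    or at least one UCX variable has been configured explicitly.
    """
    if any(name in env for name in UCX_NETWORK_VARS):
        return []
    return [name for name in NCCL_NETWORK_VARS if name in env]


if __name__ == "__main__":
    for name in ignored_nccl_vars(os.environ):
        print(f"warning: {name} is set but does not apply to NixlConnector; "
              "configure UCX_TLS / UCX_NET_DEVICES instead")
```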

#### Selecting a NIXL transport backend (plugin)

NixlConnector can use different NIXL transport backends (plugins). By default, NixlConnector uses UCX as the transport backend.

To select a different backend, set `kv_connector_extra_config.backends` in `--kv-transfer-config`.

##### Example: using the LIBFABRIC backend

```bash
vllm serve <MODEL> \
  --kv-transfer-config '{
    "kv_connector":"NixlConnector",
    "kv_role":"kv_both",
    "kv_connector_extra_config":{"backends":["LIBFABRIC"]}
  }'
```

You can also pass JSON keys individually using dotted arguments, and you can append list elements using `+`:

```bash
vllm serve <MODEL> \
  --kv-transfer-config.kv_connector NixlConnector \
  --kv-transfer-config.kv_role kv_both \
  --kv-transfer-config.kv_connector_extra_config.backends+ LIBFABRIC
```

!!! note
    Backend availability depends on how NIXL was built and what plugins are present in your environment. Refer to the [NIXL repository](https://github.com/ai-dynamo/nixl) for available backends and build instructions.
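
Hand-written JSON inside shell quotes is easy to get wrong, so launch scripts may prefer to generate the value. The following is a minimal sketch, assuming a Python launcher; the function names `kv_transfer_config` and `serve_command` are hypothetical, and only keys shown in this section are emitted.

```python
import json
import shlex


def kv_transfer_config(kv_role="kv_both", backends=None):
    """Build the JSON string passed to --kv-transfer-config."""
    config = {"kv_connector": "NixlConnector", "kv_role": kv_role}
    if backends:
        # Omitting this key leaves the default (UCX) backend in place.
        config["kv_connector_extra_config"] = {"backends": list(backends)}
    return json.dumps(config, separators=(",", ":"))


def serve_command(model, config_json):
    """Return a shell-safe `vllm serve` command line as a single string."""
    argv = ["vllm", "serve", model, "--kv-transfer-config", config_json]
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    print(serve_command("<MODEL>", kv_transfer_config(backends=["LIBFABRIC"])))
```

Quoting with `shlex.quote` keeps the JSON intact as a single shell argument regardless of the characters it contains.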

## Basic Usage (on the same host)

### Producer (Prefiller) Configuration

Start a prefiller instance that produces KV caches:

```bash
# 1st GPU as prefiller
CUDA_VISIBLE_DEVICES=0 \
UCX_NET_DEVICES=all \
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
vllm serve Qwen/Qwen3-0.6B \
  --port 8100 \
  --enforce-eager \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_load_failure_policy":"fail"}'
```

### Consumer (Decoder) Configuration

Start a decoder instance that consumes KV caches:

```bash
# 2nd GPU as decoder
CUDA_VISIBLE_DEVICES=1 \
UCX_NET_DEVICES=all \
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
vllm serve Qwen/Qwen3-0.6B \
  --port 8200 \
  --enforce-eager \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_load_failure_policy":"fail"}'
```

### Proxy Server

Use a proxy server to route requests between prefiller and decoder:

```bash
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
  --port 8192 \
  --prefiller-hosts localhost \
  --prefiller-ports 8100 \
  --decoder-hosts localhost \
  --decoder-ports 8200
```
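
Once the prefiller, decoder, and proxy are running, clients talk only to the proxy. The sketch below builds such a request using the standard library. It assumes the proxy exposes an OpenAI-style `/v1/completions` path on the port chosen above; that path and the helper names are assumptions, so check `toy_proxy_server.py` for the routes it actually serves.

```python
import json
import urllib.request

# Assumed endpoint: proxy port from the command above plus an OpenAI-style path.
PROXY_URL = "http://localhost:8192/v1/completions"


def build_completion_request(prompt, model="Qwen/Qwen3-0.6B", max_tokens=32):
    """Build (but do not send) a completion request aimed at the proxy."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        PROXY_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With all three processes up, `send(build_completion_request("The capital of France is"))` returns the parsed response body.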

## Environment Variables

- `VLLM_NIXL_SIDE_CHANNEL_PORT`: Port for NIXL handshake communication
    - Default: 5600
    - **Required for both prefiller and decoder instances**
    - Each vLLM worker needs a unique port on its host; using the same port number across different hosts is fine
    - For TP/DP deployments, each worker's port on a node is computed as `base_port + dp_rank` (e.g., with `--data-parallel-size=2` and a base port of 5600, dp_rank 0 and 1 use ports 5600 and 5601 on that node).
    - Used for the initial NIXL handshake between the prefiller and the decoder

- `VLLM_NIXL_SIDE_CHANNEL_HOST`: Host for side channel communication
    - Default: "localhost"
    - Set when prefiller and decoder are on different machines
    - Connection info is passed via KVTransferParams from prefiller to decoder for handshake

- `VLLM_NIXL_ABORT_REQUEST_TIMEOUT`: Timeout (in seconds) for automatically releasing the prefiller’s KV cache for a particular request. (Optional)
    - Default: 480
    - If a request is aborted and the decoder has not yet read the KV cache blocks through the NIXL channel, the prefill instance releases those blocks after this timeout to avoid holding them indefinitely.
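
The `base_port + dp_rank` rule makes it easy to check a planned deployment for side-channel port collisions on a single host. The sketch below is illustrative: both function names and the instance labels are hypothetical, and it assumes every DP rank of an instance runs on the host being checked.

```python
def side_channel_ports(base_port, data_parallel_size):
    """Ports used on one node: base_port + dp_rank for each dp_rank."""
    return [base_port + dp_rank for dp_rank in range(data_parallel_size)]


def find_port_conflicts(instances):
    """Return ports claimed by more than one instance on the same host.

    `instances` maps a hypothetical instance label to
    (base_port, data_parallel_size).
    """
    owners = {}
    for name, (base_port, dp_size) in instances.items():
        for port in side_channel_ports(base_port, dp_size):
            owners.setdefault(port, []).append(name)
    return {port: names for port, names in owners.items() if len(names) > 1}
```

For example, a prefiller with base port 5600 and `--data-parallel-size=2` occupies 5600 and 5601, so a decoder on the same host with base port 5601 would collide and should start at 5602 or higher.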

## Multi-Instance Setup

### Multiple Prefiller Instances on Different Machines

```bash
# Prefiller 1 on Machine A (example IP: ${IP1})
VLLM_NIXL_SIDE_CHANNEL_HOST=${IP1} \
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
UCX_NET_DEVICES=all \
vllm serve Qwen/Qwen3-0.6B --port 8000 \
  --tensor-parallel-size 8 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer","kv_load_failure_policy":"fail"}'

# Prefiller 2 on Machine B (example IP: ${IP2})
VLLM_NIXL_SIDE_CHANNEL_HOST=${IP2} \
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
UCX_NET_DEVICES=all \
vllm serve Qwen/Qwen3-0.6B --port 8000 \
  --tensor-parallel-size 8 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer","kv_load_failure_policy":"fail"}'
```

### Multiple Decoder Instances on Different Machines

```bash
# Decoder 1 on Machine C (example IP: ${IP3})
VLLM_NIXL_SIDE_CHANNEL_HOST=${IP3} \
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
UCX_NET_DEVICES=all \
vllm serve Qwen/Qwen3-0.6B --port 8000 \
  --tensor-parallel-size 8 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer","kv_load_failure_policy":"fail"}'

# Decoder 2 on Machine D (example IP: ${IP4})
VLLM_NIXL_SIDE_CHANNEL_HOST=${IP4} \
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
UCX_NET_DEVICES=all \
vllm serve Qwen/Qwen3-0.6B --port 8000 \
  --tensor-parallel-size 8 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer","kv_load_failure_policy":"fail"}'
```

### Proxy for Multiple Instances

```bash
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
  --port 8192 \
  --prefiller-hosts ${IP1} ${IP2} \
  --prefiller-ports 8000 8000 \
  --decoder-hosts ${IP3} ${IP4} \
  --decoder-ports 8000 8000
```

For multi-host DP deployments, you only need to provide the host and port of the head instances.
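
With several prefillers and decoders behind one proxy, each request has to be assigned one of each. A simple policy is round-robin over both lists, sketched below. This is one plausible routing scheme and is not claimed to be what `toy_proxy_server.py` implements; the function name and the addresses are placeholders.

```python
import itertools


def round_robin_pairs(prefillers, decoders):
    """Yield (prefiller, decoder) endpoint pairs, cycling through both lists."""
    prefill_cycle = itertools.cycle(prefillers)
    decode_cycle = itertools.cycle(decoders)
    while True:
        yield next(prefill_cycle), next(decode_cycle)


# Placeholder addresses standing in for ${IP1}..${IP4} above.
prefillers = [("10.0.0.1", 8000), ("10.0.0.2", 8000)]
decoders = [("10.0.0.3", 8000), ("10.0.0.4", 8000)]

pairs = round_robin_pairs(prefillers, decoders)
first_three = [next(pairs) for _ in range(3)]
```

Here the third request wraps around to the first prefiller and first decoder again.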

### KV Role Options

- **kv_producer**: For prefiller instances that generate KV caches
- **kv_consumer**: For decoder instances that consume KV caches from prefiller
- **kv_both**: Enables symmetric functionality where the connector can act as both producer and consumer. This provides flexibility for experimental setups and scenarios where the role distinction is not predetermined.

!!! tip
    NixlConnector currently does not distinguish `kv_role`; the actual prefiller/decoder roles are determined by the upper-level proxy (e.g., `toy_proxy_server.py` using `--prefiller-hosts` and `--decoder-hosts`).
    Therefore, `kv_role` in `--kv-transfer-config` is effectively a placeholder and does not affect NixlConnector's behavior.

### KV Load Failure Policy

The `kv_load_failure_policy` setting controls how the system handles failures when the decoder instance loads KV cache blocks from the prefiller instance:

- **fail** (recommended): Immediately fail the request with an error when KV load fails. This prevents performance degradation by avoiding recomputation of prefill work on the decode instance.
- **recompute** (default): Recompute failed blocks locally on the decode instance. This may cause performance _jitter_ on decode instances, because the scheduled prefill work delays and interferes with other decodes. Decode instances are also typically configured with low-latency optimizations, so they run prefill work inefficiently.

!!! warning
    Using `kv_load_failure_policy="recompute"` can lead to performance degradation in production deployments. When KV loads fail, the decode instance will execute prefill work with decode-optimized configurations, which is inefficient and defeats the purpose of disaggregated prefilling. This also increases tail latency for other ongoing decode requests.
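
The difference between the two policies can be summarized as a small decision function. This is a conceptual model only: the enum, the function, and the returned dictionary are hypothetical and do not mirror vLLM's internal types.

```python
from enum import Enum


class KVLoadFailurePolicy(Enum):
    FAIL = "fail"
    RECOMPUTE = "recompute"


def on_kv_load_failure(policy, failed_block_ids):
    """Describe what happens to a request whose KV load failed.

    Conceptual model: returns a description instead of doing any work.
    """
    policy = KVLoadFailurePolicy(policy)
    if policy is KVLoadFailurePolicy.FAIL:
        # The request fails with an error; no prefill runs on the decoder.
        return {"request": "failed", "recomputed_blocks": []}
    # The decoder recomputes the missing blocks, which competes with
    # the decode steps of other in-flight requests.
    return {"request": "continues", "recomputed_blocks": sorted(failed_block_ids)}
```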

## Experimental Features

### Heterogeneous KV Layout support

Supported use case: prefill with the `HND` layout and decode with the `NHD` layout, enabled with this experimental configuration:

```bash
--kv-transfer-config '{..., "enable_permute_local_kv":"True"}'
```
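
To make the two layouts concrete, the toy example below reorders one block from `HND` to `NHD` using nested lists. It assumes the usual reading of the letters (H = heads, N = tokens in the block, D = head dimension) and only illustrates the index permutation; the connector operates on device tensors, and `hnd_to_nhd` is a hypothetical name.

```python
def hnd_to_nhd(block):
    """Reorder one block from [heads][tokens][dim] to [tokens][heads][dim]."""
    num_heads = len(block)
    num_tokens = len(block[0])
    return [
        [block[head][token] for head in range(num_heads)]
        for token in range(num_tokens)
    ]


# Toy block: 2 heads, 3 tokens, head dimension 1.
hnd_block = [
    [["h0t0"], ["h0t1"], ["h0t2"]],
    [["h1t0"], ["h1t1"], ["h1t2"]],
]
nhd_block = hnd_to_nhd(hnd_block)
```

After the permutation, `nhd_block[token][head]` holds the same vector that `hnd_block[head][token]` held before.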

### Cross layers blocks

This feature is disabled by default. On attention backends that support it, enabling it makes each logical block contiguous in physical memory, which reduces the number of buffers that need to be transferred.
To enable this feature:

```bash
--kv-transfer-config '{..., "kv_connector_extra_config": {"enable_cross_layers_blocks": "True"}}'
```
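
A rough way to see the effect is to count the memory regions a transfer has to describe. The model below is a deliberate simplification and an assumption on our part: it supposes that without this feature each layer stores its part of a block separately, and it ignores details such as separate K and V buffers. The function name is hypothetical.

```python
def regions_to_transfer(num_blocks, num_layers, cross_layers_blocks):
    """Count memory regions needed to move `num_blocks` logical blocks.

    Simplified model: without cross-layer blocks every layer contributes
    its own region per block; with them a block is one contiguous region.
    """
    if cross_layers_blocks:
        return num_blocks
    return num_blocks * num_layers
```

Under this model, moving 16 blocks of a 28-layer model takes 448 regions with the feature off and 16 with it on.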

## Example Scripts/Code

Refer to these example scripts in the vLLM repository:

- [run_accuracy_test.sh](../../tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh)
- [toy_proxy_server.py](../../tests/v1/kv_connector/nixl_integration/toy_proxy_server.py)
- [test_accuracy.py](../../tests/v1/kv_connector/nixl_integration/test_accuracy.py)