[Docs] Nixl Usage recommend fail kv_load_failure_policy (#32198)
Signed-off-by: NickLucche <nlucches@redhat.com>
@@ -50,7 +50,7 @@ VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
 vllm serve Qwen/Qwen3-0.6B \
   --port 8100 \
   --enforce-eager \
-  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
+  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_load_failure_policy":"fail"}'
 ```
 
 ### Consumer (Decoder) Configuration
@@ -65,7 +65,7 @@ VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
 vllm serve Qwen/Qwen3-0.6B \
   --port 8200 \
   --enforce-eager \
-  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
+  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_load_failure_policy":"fail"}'
 ```
 
 ### Proxy Server
@@ -110,7 +110,7 @@ VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
 UCX_NET_DEVICES=all \
 vllm serve Qwen/Qwen3-0.6B --port 8000 \
   --tensor-parallel-size 8 \
-  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
+  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer","kv_load_failure_policy":"fail"}'
 
 # Prefiller 2 on Machine B (example IP: ${IP2})
 VLLM_NIXL_SIDE_CHANNEL_HOST=${IP2} \
@@ -118,7 +118,7 @@ VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
 UCX_NET_DEVICES=all \
 vllm serve Qwen/Qwen3-0.6B --port 8000 \
   --tensor-parallel-size 8 \
-  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
+  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer","kv_load_failure_policy":"fail"}'
 ```
 
 ### Multiple Decoder Instances on Different Machines
@@ -130,7 +130,7 @@ VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
 UCX_NET_DEVICES=all \
 vllm serve Qwen/Qwen3-0.6B --port 8000 \
   --tensor-parallel-size 8 \
-  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
+  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer","kv_load_failure_policy":"fail"}'
 
 # Decoder 2 on Machine D (example IP: ${IP4})
 VLLM_NIXL_SIDE_CHANNEL_HOST=${IP4} \
@@ -138,7 +138,7 @@ VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
 UCX_NET_DEVICES=all \
 vllm serve Qwen/Qwen3-0.6B --port 8000 \
   --tensor-parallel-size 8 \
-  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
+  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer","kv_load_failure_policy":"fail"}'
 ```
 
 ### Proxy for Multiple Instances
@@ -164,6 +164,16 @@ For multi-host DP deployment, only need to provide the host/port of the head ins
 NixlConnector currently does not distinguish `kv_role`; the actual prefiller/decoder roles are determined by the upper-level proxy (e.g., `toy_proxy_server.py` using `--prefiller-hosts` and `--decoder-hosts`).
 Therefore, `kv_role` in `--kv-transfer-config` is effectively a placeholder and does not affect NixlConnector's behavior.
 
+### KV Load Failure Policy
+
+The `kv_load_failure_policy` setting controls how the system handles failures when the decoder instance loads KV cache blocks from the prefiller instance:
+
+- **fail** (recommended): Immediately fail the request with an error when a KV load fails. This prevents performance degradation by avoiding recomputation of prefill work on the decode instance.
+- **recompute** (default): Recompute the failed blocks locally on the decode instance. This may cause performance _jitter_ on decode instances, because the scheduled prefill delays and interferes with other decodes. Decode instances are also typically configured with low-latency optimizations that are poorly suited to prefill work.
+
+!!! warning
+    Using `kv_load_failure_policy="recompute"` can lead to performance degradation in production deployments. When KV loads fail, the decode instance executes prefill work with decode-optimized configurations, which is inefficient and defeats the purpose of disaggregated prefilling. It also increases tail latency for other ongoing decode requests.
+
 ## Experimental Feature
 
 ### Heterogeneous KV Layout support
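Every command touched by this diff passes the same JSON shape to `--kv-transfer-config`, so launch scripts may find it convenient to generate that string rather than hand-edit it. The following is a minimal sketch only: `build_kv_transfer_config` and `build_serve_args` are hypothetical helper names that are not part of vLLM, and the set of accepted policy values is assumed to be exactly the two described above (`fail` and `recompute`).

```python
import json
import shlex
from typing import List

# Assumption: only the two policy values described in the docs are valid.
VALID_POLICIES = ("fail", "recompute")


def build_kv_transfer_config(kv_role: str = "kv_both", policy: str = "fail") -> str:
    """Return the JSON string for --kv-transfer-config (hypothetical helper)."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown kv_load_failure_policy: {policy!r}")
    config = {
        "kv_connector": "NixlConnector",
        "kv_role": kv_role,
        "kv_load_failure_policy": policy,
    }
    # Compact separators match the spacing used in the documented commands.
    return json.dumps(config, separators=(",", ":"))


def build_serve_args(model: str, port: int, kv_role: str = "kv_both") -> List[str]:
    """Assemble the argument list for a `vllm serve` launch (hypothetical helper)."""
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--kv-transfer-config", build_kv_transfer_config(kv_role),
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line for inspection; nothing is launched.
    print(shlex.join(build_serve_args("Qwen/Qwen3-0.6B", 8100)))
```

Defaulting `policy` to `"fail"` in the helper mirrors the recommendation in this change; the default applied when the key is omitted from the config remains `recompute`, as the documentation states.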