A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in a process that is separate from the pre-fill / decoder stage. Deploying these two stages in independent vLLM instances brings three practical benefits:
1.**Independent, fine-grained scaling**
2.**Lower time-to-first-token (TTFT)**
3.**Cross-process reuse and caching of encoder outputs**
Please refer to the directories `tests/v1/ec_connector`
## 4 Development
Disaggregated encoding is implemented by running two parts:
* **Encoder instance** – a vLLM instance to performs vision encoding.
* **Prefill/Decode (PD) instance(s)** – runs language pre-fill and decode.
* PD can be in either a single normal instance with `disagg_encoder_example.sh` (E->PD) or in disaggregated instances with `disagg_epd_example.sh` (E->P->D)
A connector transfers encoder-cache (EC) embeddings from the encoder instance to the PD instance.
All related code is under `vllm/distributed/ec_transfer`.
### Key abstractions
* **ECConnector** – interface for retrieving EC caches produced by the encoder.
* *Scheduler role* – checks cache existence and schedules loads.
* *Worker role* – loads the embeddings into memory.
Here is a figure illustrating disaggregate encoder flow:
For the PD disaggregation part, the Prefill instance receives cache exactly the same as the disaggregated encoder flow above. Prefill instance executes 1 step (prefill -> 1 token output) and then transfers KV cache to the Decode instance for the remaining execution. The KV transfer part purely happens after the execution of the PD instance.
`docs/features/disagg_prefill.md` shows the brief idea about the disaggregated prefill (v0)
We create the example setup with the **NixlConnector** from `vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py` and referred to the `tests/v1/kv_connector/nixl_integration/toy_proxy_server.py` to facilitate the kv transfer between P and D;