docs/serving/expert_parallel_deployment.md

# Expert Parallel Deployment

vLLM supports Expert Parallelism (EP), which allows experts in Mixture-of-Experts (MoE) models to be deployed on separate GPUs, increasing locality, efficiency, and throughput overall.

EP is typically coupled with Data Parallelism (DP). While DP can be used independently of EP, EP is more efficient when used in conjunction with DP. You can read more about data parallelism [here](data_parallel_deployment.md).

## Prerequisites

Before using EP, you need to install the necessary dependencies. We are actively working on making this easier in the future:

1. **Install DeepEP**: Set up host environment following vLLM's guide for EP kernels [here](../../tools/ep_kernels).
2. **Install DeepGEMM library**: Follow the [official instructions](https://github.com/deepseek-ai/DeepGEMM#installation).
3. **For disaggregated serving**: Install `gdrcopy` by running the [`install_gdrcopy.sh`](../../tools/install_gdrcopy.sh) script (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/).

### Backend Selection Guide

vLLM provides multiple communication backends for EP. Use `--all2all-backend` to select one:

| Backend | Use Case | Features | Best For |
| ------- | -------- | -------- | -------- |
| `allgather_reducescatter` | Default backend | Standard all2all using allgather/reducescatter primitives | General purpose, works with any EP+DP configuration |
| `deepep_high_throughput` | Multi-node prefill | Grouped GEMM with continuous layout, optimized for prefill | Prefill-dominated workloads, high-throughput scenarios |
| `deepep_low_latency` | Multi-node decode | CUDA graph support, masked layout, optimized for decode | Decode-dominated workloads, low-latency scenarios |
| `flashinfer_nvlink_one_sided` | MNNVL systems | FlashInfer's one-sided A2A strategy for multi-node NVLink | High-throughput workloads |
| `flashinfer_nvlink_two_sided` | MNNVL systems | FlashInfer's two-sided A2A strategy for multi-node NVLink | Systems with NVLink across nodes |

## Single Node Deployment

!!! warning
    EP is an experimental feature. Argument names and default values may change in the future.

### Configuration

Enable EP by setting the `--enable-expert-parallel` flag. The EP size is automatically calculated as:

```text
EP_SIZE = TP_SIZE × DP_SIZE
```

Where:

- `TP_SIZE`: Tensor parallel size
- `DP_SIZE`: Data parallel size
- `EP_SIZE`: Expert parallel size (computed automatically)

### Layer Behavior with EP Enabled

When EP is enabled, different layers in MoE models behave differently:

| Layer Type | Behavior | Parallelism Used |
| ---------- | -------- | ---------------- |
| **Expert (MoE) Layers** | Sharded across all EP ranks | Expert Parallel (EP) of size `TP × DP` |
| **Attention Layers** | Behavior depends on TP size | See below |

**Attention layer parallelism:**

- **When `TP = 1`**: Attention weights are **replicated** across all DP ranks (data parallelism)
- **When `TP > 1`**: Attention weights are **sharded** using tensor parallelism across TP ranks within each DP group

For example, with `TP=2, DP=4` (8 GPUs total):

- Expert layers form an EP group of size 8, with experts distributed across all GPUs
- Attention layers use TP=2 within each of the 4 DP groups

!!! note "Key Difference from Data Parallel Deployment"
    Without `--enable-expert-parallel`, MoE layers would use tensor parallelism (forming a TP group of size `TP × DP`), similar to dense models. With EP enabled, expert layers switch to expert parallelism, which can provide better efficiency and locality for MoE models.

### Example Command

The following command serves a `DeepSeek-V3-0324` model with 1-way tensor parallel, 8-way (attention) data parallel, and 8-way expert parallel. The attention weights are replicated across all GPUs, while the expert weights are split across GPUs. It will work on a H200 (or H20) node with 8 GPUs. For H100, you can try to serve a smaller model or refer to the multi-node deployment section.

```bash
# Single node EP deployment
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 1 \       # Tensor parallelism across 1 GPU
    --data-parallel-size 8 \         # Data parallelism across 8 processes
    --enable-expert-parallel         # Enable expert parallelism
```

## Multi-Node Deployment

For multi-node deployment, use the DeepEP communication kernel with one of two modes (see [Backend Selection Guide](#backend-selection-guide) above).

### Deployment Steps

1. **Run one command per node** - Each node requires its own launch command
2. **Configure networking** - Ensure proper IP addresses and port configurations
3. **Set node roles** - First node handles requests, additional nodes run in headless mode

### Example: 2-Node Deployment

The following example deploys `DeepSeek-V3-0324` across 2 nodes using `deepep_low_latency` mode:

```bash
# Node 1 (Primary - handles incoming requests)
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --all2all-backend deepep_low_latency \
    --tensor-parallel-size 1 \               # TP size per node
    --enable-expert-parallel \               # Enable EP
    --data-parallel-size 16 \                # Total DP size across all nodes
    --data-parallel-size-local 8 \           # Local DP size on this node (8 GPUs per node)
    --data-parallel-address 192.168.1.100 \  # Replace with actual IP of Node 1
    --data-parallel-rpc-port 13345 \         # RPC communication port, can be any port as long as reachable by all nodes
    --api-server-count=8                     # Number of API servers for load handling (scaling this out to # local ranks is recommended)

# Node 2 (Secondary - headless mode, no API server)
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --all2all-backend deepep_low_latency \
    --tensor-parallel-size 1 \               # TP size per node
    --enable-expert-parallel \               # Enable EP
    --data-parallel-size 16 \                # Total DP size across all nodes
    --data-parallel-size-local 8 \           # Local DP size on this node
    --data-parallel-start-rank 8 \           # Starting rank offset for this node
    --data-parallel-address 192.168.1.100 \  # IP of primary node (Node 1)
    --data-parallel-rpc-port 13345 \         # Same RPC port as primary
    --headless                               # No API server, worker only
```

### Key Configuration Notes

- **Headless mode**: Secondary nodes run with `--headless` flag, meaning all client requests are handled by the primary node
- **Rank calculation**: `--data-parallel-start-rank` should equal the cumulative local DP size of previous nodes
- **Load scaling**: Adjust `--api-server-count` on the primary node to handle higher request loads

### Network Configuration

!!! important "InfiniBand Clusters"
    On InfiniBand networked clusters, set this environment variable to prevent initialization hangs:
    ```bash
    export GLOO_SOCKET_IFNAME=eth0
    ```
    This ensures torch distributed group discovery uses Ethernet instead of InfiniBand for initial setup.

## Expert Parallel Load Balancer (EPLB)

While MoE models are typically trained so that each expert receives a similar number of tokens, in practice the distribution of tokens across experts can be highly skewed. vLLM provides an Expert Parallel Load Balancer (EPLB) to redistribute expert mappings across EP ranks, evening the load across experts.

### Configuration

Enable EPLB with the `--enable-eplb` flag.

When enabled, vLLM collects load statistics with every forward pass and periodically rebalances expert distribution.

### EPLB Parameters

Configure EPLB with the `--eplb-config` argument, which accepts a JSON string. The available keys and their descriptions are:

| Parameter | Description | Default |
| --------- | ----------- | ------- |
| `window_size` | Number of engine steps to track for rebalancing decisions | 1000 |
| `step_interval` | Frequency of rebalancing (every N engine steps) | 3000 |
| `log_balancedness` | Log balancedness metrics (avg tokens per expert ÷ max tokens per expert) | `false` |
| `num_redundant_experts` | Additional global experts per EP rank beyond equal distribution | `0` |
| `use_async` | Use non-blocking EPLB for reduced latency overhead | `false` |
| `policy` | The policy type for expert parallel load balancing | `"default"` |

For example:

```bash
vllm serve Qwen/Qwen3-30B-A3B \
  --enable-eplb \
  --eplb-config '{"window_size":1000,"step_interval":3000,"num_redundant_experts":2,"log_balancedness":true}'
```

??? tip "Prefer individual arguments instead of JSON?"

    ```bash
    vllm serve Qwen/Qwen3-30B-A3B \
            --enable-eplb \
            --eplb-config.window_size 1000 \
            --eplb-config.step_interval 3000 \
            --eplb-config.num_redundant_experts 2 \
            --eplb-config.log_balancedness true
    ```

### Expert Distribution Formula

- **Default**: Each EP rank has `NUM_TOTAL_EXPERTS ÷ NUM_EP_RANKS` experts
- **With redundancy**: Each EP rank has `(NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS` experts

### Memory Footprint Overhead

EPLB uses redundant experts that need to fit in GPU memory. This means that EPLB may not be a good fit for memory constrained environments or when KV cache space is at a premium.

This overhead equals `NUM_MOE_LAYERS * BYTES_PER_EXPERT * (NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS`.
For DeepSeekV3, this is approximately `2.4 GB` for one redundant expert per EP rank.

### Example Command

Single node deployment with EPLB enabled:

```bash
# Single node with EPLB load balancing
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 1 \       # Tensor parallelism
    --data-parallel-size 8 \         # Data parallelism
    --enable-expert-parallel \       # Enable EP
    --enable-eplb \                  # Enable load balancer
    --eplb-config '{"window_size":1000,"step_interval":3000,"num_redundant_experts":2,"log_balancedness":true}'
```

For multi-node deployment, add these EPLB flags to each node's command. We recommend setting `--eplb-config '{"num_redundant_experts":32}'` to 32 in large scale use cases so the most popular experts are always available.

## Advanced Configuration

### Performance Optimization

- **DeepEP kernels**: The `high_throughput` and `low_latency` kernels are optimized for disaggregated serving and may show poor performance for mixed workloads
- **Dual Batch Overlap**: Use `--enable-dbo` to overlap all-to-all communication with compute. See [Dual Batch Overlap](../design/dbo.md) for more details.
- **Async scheduling (experimental)**: Try `--async-scheduling` to overlap scheduling with model execution.

### Troubleshooting

- **`non-zero status: 7 cannot register cq buf`**: When using Infiniband/RoCE, make sure host VM and pods show `ulimit -l` "unlimited".
- **`init failed for transport: IBGDA`**: The InfiniBand GDA kernel modules are missing. Run `tools/ep_kernels/configure_system_drivers.sh` on each GPU node and reboot. Also fixes error `NVSHMEM API called before NVSHMEM initialization has completed`.
- **NVSHMEM peer disconnect**: Usually a networking misconfiguration. If deploying via Kubernetes, verify that every pod runs with `hostNetwork: true`, `securityContext.privileged: true` to access Infiniband.

### Benchmarking

- Use simulator flags `VLLM_MOE_ROUTING_SIMULATION_STRATEGY=uniform_random` and `VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1` so token routing is balanced across EP ranks.

- Increasing `VLLM_MOE_DP_CHUNK_SIZE` may increase throughput by increasing the maximum batch size for inter-rank token transfers. This may cause DeepEP  to throw `assert self.nvshmem_qp_depth >= (num_max_dispatch_tokens_per_rank + 1) * 2`, which can be fixed by increasing environment variable `NVSHMEM_QP_DEPTH`.

## Disaggregated Serving (Prefill/Decode Split)

For production deployments requiring strict SLA guarantees for time-to-first-token and inter-token latency, disaggregated serving allows independent scaling of prefill and decode operations.

### Architecture Overview

- **Prefill Instance**: Uses `deepep_high_throughput` backend for optimal prefill performance
- **Decode Instance**: Uses `deepep_low_latency` backend for minimal decode latency  
- **KV Cache Transfer**: Connects instances via NIXL or other KV connectors

### Setup Steps

1. **Install gdrcopy/ucx/nixl**: For maximum performance, run the [install_gdrcopy.sh](../../tools/install_gdrcopy.sh) script to install `gdrcopy` (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). If `gdrcopy` is not installed, things will still work with a plain `pip install nixl`, just with lower performance. `nixl` and `ucx` are installed as dependencies via pip. For non-cuda platform to install nixl with non-cuda UCX build, run the [install_nixl_from_source_ubuntu.py](../../tools/install_nixl_from_source_ubuntu.py) script.

2. **Configure Both Instances**: Add this flag to both prefill and decode instances `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}`. Noted, you may also specify one or multiple NIXL_Backend. Such as: `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'`

3. **Client Orchestration**: Use the client-side script below to coordinate prefill/decode operations. We are actively working on routing solutions.

### Client Orchestration Example

```python
from openai import OpenAI
import uuid

try:
    # 1: Set up clients for prefill and decode instances
    openai_api_key = "EMPTY"  # vLLM doesn't require a real API key
    
    # Replace these IP addresses with your actual instance addresses
    prefill_client = OpenAI(
        api_key=openai_api_key,
        base_url="http://192.168.1.100:8000/v1",  # Prefill instance URL
    )
    decode_client = OpenAI(
        api_key=openai_api_key,
        base_url="http://192.168.1.101:8001/v1",  # Decode instance URL  
    )
    
    # Get model name from prefill instance
    models = prefill_client.models.list()
    model = models.data[0].id
    print(f"Using model: {model}")

    # 2: Prefill Phase
    # Generate unique request ID to link prefill and decode operations
    request_id = str(uuid.uuid4())
    print(f"Request ID: {request_id}")
    
    prefill_response = prefill_client.completions.create(
        model=model,
        # Prompt must exceed vLLM's block size (16 tokens) for PD to work
        prompt="Write a detailed explanation of Paged Attention for Transformers works including the management of KV cache for multi-turn conversations",
        max_tokens=1,  # Force prefill-only operation
        extra_body={
            "kv_transfer_params": {
                "do_remote_decode": True,     # Enable remote decode
                "do_remote_prefill": False,   # This is the prefill instance
                "remote_engine_id": None,     # Will be populated by vLLM
                "remote_block_ids": None,     # Will be populated by vLLM
                "remote_host": None,          # Will be populated by vLLM
                "remote_port": None,          # Will be populated by vLLM
            }
        },
        extra_headers={"X-Request-Id": request_id},
    )
    
    print("-" * 50)
    print("✓ Prefill completed successfully")
    print(f"Prefill response: {prefill_response.choices[0].text}")
    
    # 3: Decode Phase
    # Transfer KV cache parameters from prefill to decode instance
    decode_response = decode_client.completions.create(
        model=model,
        prompt="This prompt is ignored during decode",  # Original prompt not needed
        max_tokens=150,  # Generate up to 150 tokens
        extra_body={
            "kv_transfer_params": prefill_response.kv_transfer_params  # Pass KV cache info
        },
        extra_headers={"X-Request-Id": request_id},  # Same request ID
    )
    
    print("-" * 50)
    print("✓ Decode completed successfully")
    print(f"Final response: {decode_response.choices[0].text}")

except Exception as e:
    print(f"❌ Error during disaggregated serving: {e}")
    print("Check that both prefill and decode instances are running and accessible")
```

### Benchmarking

- To simulate the decode deployment of disaggregated serving, pass `--kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}'` to the `vllm serve` invocation. The connector populates KV cache with random values so decode can be profiled in isolation.

- **CUDAGraph capture**: Use `--compilation_config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'` to enable CUDA graph capture for decode only and save KV cache.
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+								# Expert Parallel Deployment
 								vLLM supports Expert Parallelism (EP), which allows experts in Mixture-of-Experts (MoE) models to be deployed on separate GPUs, increasing locality, efficiency, and throughput overall.
 								EP is typically coupled with Data Parallelism (DP). While DP can be used independently of EP, EP is more efficient when used in conjunction with DP. You can read more about data parallelism [here](data_parallel_deployment.md).
 								## Prerequisites
 								Before using EP, you need to install the necessary dependencies. We are actively working on making this easier in the future:
-												[WideEP] Remove pplx all2all backend (#33724)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
											
										
										
											2026-02-26 17:30:10 -05:00
+. **Install DeepEP**: Set up host environment following vLLM's guide for EP kernels [here](../../tools/ep_kernels).
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+. **Install DeepGEMM library**: Follow the [official instructions](https://github.com/deepseek-ai/DeepGEMM#installation).
-												[Docs] Reduce custom syntax used in docs (#27009)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-10-17 04:05:34 +01:00
+. **For disaggregated serving**: Install `gdrcopy` by running the [`install_gdrcopy.sh`](../../tools/install_gdrcopy.sh) script (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/).
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
 								### Backend Selection Guide
-												[UX] Replace VLLM_ALL2ALL_BACKEND with --all2all-backend (#26732)

Signed-off-by: mgoin <mgoin64@gmail.com>
											
										
										
											2025-10-13 21:12:52 -04:00
+								vLLM provides multiple communication backends for EP. Use `--all2all-backend` to select one:
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
 								| Backend | Use Case | Features | Best For |
-												Allow `markdownlint` to run locally (#36398)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2026-03-09 03:05:24 +00:00
+								| ------- | -------- | -------- | -------- |
-												[UX] Replace VLLM_ALL2ALL_BACKEND with --all2all-backend (#26732)

Signed-off-by: mgoin <mgoin64@gmail.com>
											
										
										
											2025-10-13 21:12:52 -04:00
+								| `allgather_reducescatter` | Default backend | Standard all2all using allgather/reducescatter primitives | General purpose, works with any EP+DP configuration |
 								| `deepep_high_throughput` | Multi-node prefill | Grouped GEMM with continuous layout, optimized for prefill | Prefill-dominated workloads, high-throughput scenarios |
 								| `deepep_low_latency` | Multi-node decode | CUDA graph support, masked layout, optimized for decode | Decode-dominated workloads, low-latency scenarios |
-												[Kernel] Add FlashInfer MoE A2A Kernel (#36022)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Leo Tian <lctian@nvidia.com>
Co-authored-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Stefano Castagnetta <scastagnetta@nvidia.com>
Co-authored-by: root <root@lyris0267.lyris.clusters.nvidia.com>
											
										
										
											2026-03-16 02:45:32 -04:00
+								| `flashinfer_nvlink_one_sided` | MNNVL systems | FlashInfer's one-sided A2A strategy for multi-node NVLink | High-throughput workloads |
 								| `flashinfer_nvlink_two_sided` | MNNVL systems | FlashInfer's two-sided A2A strategy for multi-node NVLink | Systems with NVLink across nodes |
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
 								## Single Node Deployment
 								!!! warning
 								    EP is an experimental feature. Argument names and default values may change in the future.
 								### Configuration
 								Enable EP by setting the `--enable-expert-parallel` flag. The EP size is automatically calculated as:
-												[Docs] Switch to better markdown linting pre-commit hook (#21851)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-30 03:45:08 +01:00
+								```text
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+								EP_SIZE = TP_SIZE × DP_SIZE
 								```
 								Where:
-												[Docs] Switch to better markdown linting pre-commit hook (#21851)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-30 03:45:08 +01:00
-												[Docs] Update EPLB docs (#30426)

Signed-off-by: mgoin <mgoin64@gmail.com>
											
										
										
											2025-12-10 15:56:51 -05:00
+								- `TP_SIZE`: Tensor parallel size
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+								- `DP_SIZE`: Data parallel size
 								- `EP_SIZE`: Expert parallel size (computed automatically)
-												[Docs] Clarify Expert Parallel behavior for attention and MoE layers (#30615)

Signed-off-by: majiayu000 <1835304752@qq.com>
											
										
										
											2025-12-14 01:37:59 +08:00
+								### Layer Behavior with EP Enabled
 								When EP is enabled, different layers in MoE models behave differently:
 								| Layer Type | Behavior | Parallelism Used |
-												Allow `markdownlint` to run locally (#36398)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2026-03-09 03:05:24 +00:00
+								| ---------- | -------- | ---------------- |
-												[Docs] Clarify Expert Parallel behavior for attention and MoE layers (#30615)

Signed-off-by: majiayu000 <1835304752@qq.com>
											
										
										
											2025-12-14 01:37:59 +08:00
+								| **Expert (MoE) Layers** | Sharded across all EP ranks | Expert Parallel (EP) of size `TP × DP` |
 								| **Attention Layers** | Behavior depends on TP size | See below |
 								**Attention layer parallelism:**
 								- **When `TP = 1`**: Attention weights are **replicated** across all DP ranks (data parallelism)
 								- **When `TP > 1`**: Attention weights are **sharded** using tensor parallelism across TP ranks within each DP group
 								For example, with `TP=2, DP=4` (8 GPUs total):
 								- Expert layers form an EP group of size 8, with experts distributed across all GPUs
 								- Attention layers use TP=2 within each of the 4 DP groups
 								!!! note "Key Difference from Data Parallel Deployment"
 								    Without `--enable-expert-parallel`, MoE layers would use tensor parallelism (forming a TP group of size `TP × DP`), similar to dense models. With EP enabled, expert layers switch to expert parallelism, which can provide better efficiency and locality for MoE models.
-												[Docs] Update EPLB docs (#30426)

Signed-off-by: mgoin <mgoin64@gmail.com>
											
										
										
											2025-12-10 15:56:51 -05:00
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+								### Example Command
 								The following command serves a `DeepSeek-V3-0324` model with 1-way tensor parallel, 8-way (attention) data parallel, and 8-way expert parallel. The attention weights are replicated across all GPUs, while the expert weights are split across GPUs. It will work on a H200 (or H20) node with 8 GPUs. For H100, you can try to serve a smaller model or refer to the multi-node deployment section.
 								```bash
-												[WideEP] Remove pplx all2all backend (#33724)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
											
										
										
											2026-02-26 17:30:10 -05:00
+								# Single node EP deployment
-												[UX] Replace VLLM_ALL2ALL_BACKEND with --all2all-backend (#26732)

Signed-off-by: mgoin <mgoin64@gmail.com>
											
										
										
											2025-10-13 21:12:52 -04:00
+								vllm serve deepseek-ai/DeepSeek-V3-0324 \
 								    --tensor-parallel-size 1 \       # Tensor parallelism across 1 GPU
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+								    --data-parallel-size 8 \         # Data parallelism across 8 processes
-												[WideEP] Remove pplx all2all backend (#33724)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
											
										
										
											2026-02-26 17:30:10 -05:00
+								    --enable-expert-parallel         # Enable expert parallelism
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+								```
 								## Multi-Node Deployment
 								For multi-node deployment, use the DeepEP communication kernel with one of two modes (see [Backend Selection Guide](#backend-selection-guide) above).
 								### Deployment Steps
 . **Run one command per node** - Each node requires its own launch command
 . **Configure networking** - Ensure proper IP addresses and port configurations
 . **Set node roles** - First node handles requests, additional nodes run in headless mode
 								### Example: 2-Node Deployment
 								The following example deploys `DeepSeek-V3-0324` across 2 nodes using `deepep_low_latency` mode:
 								```bash
 								# Node 1 (Primary - handles incoming requests)
-												[UX] Replace VLLM_ALL2ALL_BACKEND with --all2all-backend (#26732)

Signed-off-by: mgoin <mgoin64@gmail.com>
											
										
										
											2025-10-13 21:12:52 -04:00
+								vllm serve deepseek-ai/DeepSeek-V3-0324 \
 								    --all2all-backend deepep_low_latency \
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+								    --tensor-parallel-size 1 \               # TP size per node
 								    --enable-expert-parallel \               # Enable EP
 								    --data-parallel-size 16 \                # Total DP size across all nodes
 								    --data-parallel-size-local 8 \           # Local DP size on this node (8 GPUs per node)
 								    --data-parallel-address 192.168.1.100 \  # Replace with actual IP of Node 1
 								    --data-parallel-rpc-port 13345 \         # RPC communication port, can be any port as long as reachable by all nodes
-												[docs] Improve wide-EP performance + benchmarking documentation (#27933)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
											
										
										
											2025-12-10 17:15:54 -05:00
+								    --api-server-count=8                     # Number of API servers for load handling (scaling this out to # local ranks is recommended)
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
 								# Node 2 (Secondary - headless mode, no API server)
-												[UX] Replace VLLM_ALL2ALL_BACKEND with --all2all-backend (#26732)

Signed-off-by: mgoin <mgoin64@gmail.com>
											
										
										
											2025-10-13 21:12:52 -04:00
+								vllm serve deepseek-ai/DeepSeek-V3-0324 \
 								    --all2all-backend deepep_low_latency \
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+								    --tensor-parallel-size 1 \               # TP size per node
 								    --enable-expert-parallel \               # Enable EP
 								    --data-parallel-size 16 \                # Total DP size across all nodes
 								    --data-parallel-size-local 8 \           # Local DP size on this node
 								    --data-parallel-start-rank 8 \           # Starting rank offset for this node
 								    --data-parallel-address 192.168.1.100 \  # IP of primary node (Node 1)
 								    --data-parallel-rpc-port 13345 \         # Same RPC port as primary
 								    --headless                               # No API server, worker only
 								```
 								### Key Configuration Notes
 								- **Headless mode**: Secondary nodes run with `--headless` flag, meaning all client requests are handled by the primary node
 								- **Rank calculation**: `--data-parallel-start-rank` should equal the cumulative local DP size of previous nodes
 								- **Load scaling**: Adjust `--api-server-count` on the primary node to handle higher request loads
 								### Network Configuration
 								!!! important "InfiniBand Clusters"
 								    On InfiniBand networked clusters, set this environment variable to prevent initialization hangs:
 								    ```bash
 								    export GLOO_SOCKET_IFNAME=eth0
 								    ```
 								    This ensures torch distributed group discovery uses Ethernet instead of InfiniBand for initial setup.
 								## Expert Parallel Load Balancer (EPLB)
 								While MoE models are typically trained so that each expert receives a similar number of tokens, in practice the distribution of tokens across experts can be highly skewed. vLLM provides an Expert Parallel Load Balancer (EPLB) to redistribute expert mappings across EP ranks, evening the load across experts.
 								### Configuration
 								Enable EPLB with the `--enable-eplb` flag.
 								When enabled, vLLM collects load statistics with every forward pass and periodically rebalances expert distribution.
 								### EPLB Parameters
-												[Docs]add eplb_config param use docs (#24213)

Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
											
										
										
											2025-09-09 00:36:57 +08:00
+								Configure EPLB with the `--eplb-config` argument, which accepts a JSON string. The available keys and their descriptions are:
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+								| Parameter | Description | Default |
-												Allow `markdownlint` to run locally (#36398)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2026-03-09 03:05:24 +00:00
+								| --------- | ----------- | ------- |
 								| `window_size` | Number of engine steps to track for rebalancing decisions | 1000 |
 								| `step_interval` | Frequency of rebalancing (every N engine steps) | 3000 |
-												[Docs]add eplb_config param use docs (#24213)

Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
											
										
										
											2025-09-09 00:36:57 +08:00
+								| `log_balancedness` | Log balancedness metrics (avg tokens per expert ÷ max tokens per expert) | `false` |
 								| `num_redundant_experts` | Additional global experts per EP rank beyond equal distribution | `0` |
-												[Docs] Update EPLB docs (#30426)

Signed-off-by: mgoin <mgoin64@gmail.com>
											
										
										
											2025-12-10 15:56:51 -05:00
+								| `use_async` | Use non-blocking EPLB for reduced latency overhead | `false` |
 								| `policy` | The policy type for expert parallel load balancing | `"default"` |
-												[Docs]add eplb_config param use docs (#24213)

Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
											
										
										
											2025-09-09 00:36:57 +08:00
 								For example:
 								```bash
 								vllm serve Qwen/Qwen3-30B-A3B \
 								  --enable-eplb \
 								  --eplb-config '{"window_size":1000,"step_interval":3000,"num_redundant_experts":2,"log_balancedness":true}'
 								```
 								??? tip "Prefer individual arguments instead of JSON?"
 								    ```bash
 								    vllm serve Qwen/Qwen3-30B-A3B \
 								            --enable-eplb \
 								            --eplb-config.window_size 1000 \
 								            --eplb-config.step_interval 3000 \
 								            --eplb-config.num_redundant_experts 2 \
 								            --eplb-config.log_balancedness true
 								    ```
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
 								### Expert Distribution Formula
 								- **Default**: Each EP rank has `NUM_TOTAL_EXPERTS ÷ NUM_EP_RANKS` experts
 								- **With redundancy**: Each EP rank has `(NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS` experts
-												[Docs] Document the extra memory footprint overhead when using EPLB (#24537)

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
											
										
										
											2025-09-10 09:09:49 -04:00
+								### Memory Footprint Overhead
-												[Docs] Fix typos in EP deployment doc (#24669)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-09-11 17:07:23 +01:00
+								EPLB uses redundant experts that need to fit in GPU memory. This means that EPLB may not be a good fit for memory constrained environments or when KV cache space is at a premium.
-												[Docs] Document the extra memory footprint overhead when using EPLB (#24537)

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
											
										
										
											2025-09-10 09:09:49 -04:00
 								This overhead equals `NUM_MOE_LAYERS * BYTES_PER_EXPERT * (NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS`.
-												[Docs] Fix typos in EP deployment doc (#24669)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-09-11 17:07:23 +01:00
+								For DeepSeekV3, this is approximately `2.4 GB` for one redundant expert per EP rank.
-												[Docs] Document the extra memory footprint overhead when using EPLB (#24537)

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
											
										
										
											2025-09-10 09:09:49 -04:00
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+								### Example Command
 								Single node deployment with EPLB enabled:
 								```bash
 								# Single node with EPLB load balancing
-												[UX] Replace VLLM_ALL2ALL_BACKEND with --all2all-backend (#26732)

Signed-off-by: mgoin <mgoin64@gmail.com>
											
										
										
											2025-10-13 21:12:52 -04:00
+								vllm serve deepseek-ai/DeepSeek-V3-0324 \
 								    --tensor-parallel-size 1 \       # Tensor parallelism
 								    --data-parallel-size 8 \         # Data parallelism
 								    --enable-expert-parallel \       # Enable EP
 								    --enable-eplb \                  # Enable load balancer
-												[Docs]add eplb_config param use docs (#24213)

Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
											
										
										
											2025-09-09 00:36:57 +08:00
+								    --eplb-config '{"window_size":1000,"step_interval":3000,"num_redundant_experts":2,"log_balancedness":true}'
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+								```
-												[Docs]add eplb_config param use docs (#24213)

Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
											
										
										
											2025-09-09 00:36:57 +08:00
+								For multi-node deployment, add these EPLB flags to each node's command. We recommend setting `--eplb-config '{"num_redundant_experts":32}'` to 32 in large scale use cases so the most popular experts are always available.
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
-												[docs] Improve wide-EP performance + benchmarking documentation (#27933)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
											
										
										
											2025-12-10 17:15:54 -05:00
+								## Advanced Configuration
 								### Performance Optimization
 								- **DeepEP kernels**: The `high_throughput` and `low_latency` kernels are optimized for disaggregated serving and may show poor performance for mixed workloads
 								- **Dual Batch Overlap**: Use `--enable-dbo` to overlap all-to-all communication with compute. See [Dual Batch Overlap](../design/dbo.md) for more details.
 								- **Async scheduling (experimental)**: Try `--async-scheduling` to overlap scheduling with model execution.
 								### Troubleshooting
 								- **`non-zero status: 7 cannot register cq buf`**: When using Infiniband/RoCE, make sure host VM and pods show `ulimit -l` "unlimited".
 								- **`init failed for transport: IBGDA`**: The InfiniBand GDA kernel modules are missing. Run `tools/ep_kernels/configure_system_drivers.sh` on each GPU node and reboot. Also fixes error `NVSHMEM API called before NVSHMEM initialization has completed`.
 								- **NVSHMEM peer disconnect**: Usually a networking misconfiguration. If deploying via Kubernetes, verify that every pod runs with `hostNetwork: true`, `securityContext.privileged: true` to access Infiniband.
 								### Benchmarking
 								- Use simulator flags `VLLM_MOE_ROUTING_SIMULATION_STRATEGY=uniform_random` and `VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1` so token routing is balanced across EP ranks.
 								- Increasing `VLLM_MOE_DP_CHUNK_SIZE` may increase throughput by increasing the maximum batch size for inter-rank token transfers. This may cause DeepEP  to throw `assert self.nvshmem_qp_depth >= (num_max_dispatch_tokens_per_rank + 1) * 2`, which can be fixed by increasing environment variable `NVSHMEM_QP_DEPTH`.
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+								## Disaggregated Serving (Prefill/Decode Split)
 								For production deployments requiring strict SLA guarantees for time-to-first-token and inter-token latency, disaggregated serving allows independent scaling of prefill and decode operations.
 								### Architecture Overview
 								- **Prefill Instance**: Uses `deepep_high_throughput` backend for optimal prefill performance
 								- **Decode Instance**: Uses `deepep_low_latency` backend for minimal decode latency
 								- **KV Cache Transfer**: Connects instances via NIXL or other KV connectors
 								### Setup Steps
-												[Docs] Reduce custom syntax used in docs (#27009)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-10-17 04:05:34 +01:00
+. **Install gdrcopy/ucx/nixl**: For maximum performance, run the [install_gdrcopy.sh](../../tools/install_gdrcopy.sh) script to install `gdrcopy` (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). If `gdrcopy` is not installed, things will still work with a plain `pip install nixl`, just with lower performance. `nixl` and `ucx` are installed as dependencies via pip. For non-cuda platform to install nixl with non-cuda UCX build, run the [install_nixl_from_source_ubuntu.py](../../tools/install_nixl_from_source_ubuntu.py) script.
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
-												[docs] fix nixl kv_connector_extra_config.backends key (#25565)

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Signed-off-by: Peter Pan <peter.pan@daocloud.io>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
											
										
										
											2025-09-24 19:00:27 +08:00
+. **Configure Both Instances**: Add this flag to both prefill and decode instances `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}`. Noted, you may also specify one or multiple NIXL_Backend. Such as: `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'`
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
 . **Client Orchestration**: Use the client-side script below to coordinate prefill/decode operations. We are actively working on routing solutions.
 								### Client Orchestration Example
 								```python
 								from openai import OpenAI
 								import uuid
 								try:
 								    # 1: Set up clients for prefill and decode instances
 								    openai_api_key = "EMPTY"  # vLLM doesn't require a real API key
 								    # Replace these IP addresses with your actual instance addresses
 								    prefill_client = OpenAI(
 								        api_key=openai_api_key,
 								        base_url="http://192.168.1.100:8000/v1",  # Prefill instance URL
 								    )
 								    decode_client = OpenAI(
 								        api_key=openai_api_key,
 								        base_url="http://192.168.1.101:8001/v1",  # Decode instance URL
 								    )
 								    # Get model name from prefill instance
 								    models = prefill_client.models.list()
 								    model = models.data[0].id
 								    print(f"Using model: {model}")
 								    # 2: Prefill Phase
 								    # Generate unique request ID to link prefill and decode operations
 								    request_id = str(uuid.uuid4())
 								    print(f"Request ID: {request_id}")
 								    prefill_response = prefill_client.completions.create(
 								        model=model,
 								        # Prompt must exceed vLLM's block size (16 tokens) for PD to work
 								        prompt="Write a detailed explanation of Paged Attention for Transformers works including the management of KV cache for multi-turn conversations",
 								        max_tokens=1,  # Force prefill-only operation
 								        extra_body={
 								            "kv_transfer_params": {
 								                "do_remote_decode": True,     # Enable remote decode
 								                "do_remote_prefill": False,   # This is the prefill instance
 								                "remote_engine_id": None,     # Will be populated by vLLM
 								                "remote_block_ids": None,     # Will be populated by vLLM
 								                "remote_host": None,          # Will be populated by vLLM
-												[Doc] ruff format remaining Python examples (#26795)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
											
										
										
											2025-10-15 16:25:49 +08:00
+								                "remote_port": None,          # Will be populated by vLLM
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+								            }
 								        },
-												[Doc] ruff format remaining Python examples (#26795)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
											
										
										
											2025-10-15 16:25:49 +08:00
+								        extra_headers={"X-Request-Id": request_id},
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+								    )
 								    print("-" * 50)
 								    print("✓ Prefill completed successfully")
 								    print(f"Prefill response: {prefill_response.choices[0].text}")
 								    # 3: Decode Phase
 								    # Transfer KV cache parameters from prefill to decode instance
 								    decode_response = decode_client.completions.create(
 								        model=model,
 								        prompt="This prompt is ignored during decode",  # Original prompt not needed
 								        max_tokens=150,  # Generate up to 150 tokens
 								        extra_body={
 								            "kv_transfer_params": prefill_response.kv_transfer_params  # Pass KV cache info
 								        },
-												[Doc] ruff format remaining Python examples (#26795)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
											
										
										
											2025-10-15 16:25:49 +08:00
+								        extra_headers={"X-Request-Id": request_id},  # Same request ID
-												[Docs] Add Expert Parallelism Initial Documentation (#21373)

Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
											
										
										
											2025-07-24 12:36:06 -07:00
+								    )
 								    print("-" * 50)
 								    print("✓ Decode completed successfully")
 								    print(f"Final response: {decode_response.choices[0].text}")
 								except Exception as e:
 								    print(f"❌ Error during disaggregated serving: {e}")
 								    print("Check that both prefill and decode instances are running and accessible")
 								```
-												[docs] Improve wide-EP performance + benchmarking documentation (#27933)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
											
										
										
											2025-12-10 17:15:54 -05:00
 								### Benchmarking
 								- To simulate the decode deployment of disaggregated serving, pass `--kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}'` to the `vllm serve` invocation. The connector populates KV cache with random values so decode can be profiled in isolation.
 								- **CUDAGraph capture**: Use `--compilation_config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'` to enable CUDA graph capture for decode only and save KV cache.