# NCCL Engine

The NCCL weight transfer engine uses [NCCL](https://developer.nvidia.com/nccl) broadcast operations to transfer weights from the trainer to inference workers. It supports **multi-node** and **multi-GPU** setups where the trainer and inference engine run on separate GPUs.

## When to Use NCCL

- Training and inference on **separate GPUs** (possibly across nodes)
- **Tensor-parallel** inference with multiple workers that all need the updated weights
- You need high-bandwidth, low-latency weight transfer over NVLink or InfiniBand

## How It Works

1. The trainer and all inference workers join a shared NCCL process group using `StatelessProcessGroup`, vLLM's group abstraction that works independently of the global `torch.distributed` state.
2. The trainer broadcasts weights to all workers simultaneously. Each worker receives and loads weights incrementally.
3. Optionally, **packed tensor broadcasting** batches multiple small tensors into larger buffers with double/triple buffering and CUDA stream overlap for higher throughput. This implementation is based on [NeMo-RL's packed tensor utilities](https://github.com/NVIDIA-NeMo/RL/blob/main/nemo_rl/utils/packed_tensor.py).
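As a rough illustration of the join-then-broadcast flow above (not vLLM's actual code), the same pattern can be sketched with `torch.distributed`'s CPU `gloo` backend. A single-rank group is used so the snippet runs standalone; in practice every inference worker would join the group and receive the trainer's tensors.

```python
import os
import tempfile

import torch
import torch.distributed as dist

# Single-rank stand-in for the shared process group; with more ranks,
# the trainer would be rank 0 and each worker a higher rank.
init_file = os.path.join(tempfile.mkdtemp(), "pg_init")
dist.init_process_group(
    backend="gloo",                    # CPU-friendly stand-in for NCCL
    init_method=f"file://{init_file}",
    rank=0,                            # the trainer's rank
    world_size=1,                      # trainer + inference workers
)

# The trainer broadcasts each weight tensor; every rank in the group
# ends up holding an identical copy.
weight = torch.arange(6, dtype=torch.float32).reshape(2, 3)
dist.broadcast(weight, src=0)

dist.destroy_process_group()
print(weight.sum().item())  # 15.0
```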

## Initialization

NCCL requires explicit process group setup. The trainer and inference workers must agree on a master address, port, and world size.

### Inference Side
```python
from vllm.distributed.weight_transfer.base import WeightTransferInitRequest

# rank_offset accounts for the trainer occupying rank 0
llm.init_weight_transfer_engine(
    WeightTransferInitRequest(
        init_info=dict(
            master_address=master_address,
            master_port=master_port,
            rank_offset=1,
            world_size=world_size,  # trainer + all inference workers
        )
    )
)
```

### Trainer Side
```python
from vllm.distributed.weight_transfer.nccl_engine import (
    NCCLWeightTransferEngine,
)

group = NCCLWeightTransferEngine.trainer_init(
    dict(
        master_address=master_address,
        master_port=master_port,
        world_size=world_size,
    )
)
```

!!! note
    `trainer_init` always assigns the trainer to rank 0. Inference workers start at `rank_offset` (typically 1).
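Concretely, the rank layout can be computed like this (`nccl_ranks` is a hypothetical helper for illustration, not part of vLLM):

```python
# Hypothetical helper showing the rank layout described in the note:
# the trainer takes rank 0, and workers occupy rank_offset onward.
def nccl_ranks(num_workers: int, rank_offset: int = 1):
    world_size = rank_offset + num_workers
    trainer_rank = 0
    worker_ranks = list(range(rank_offset, world_size))
    return world_size, trainer_rank, worker_ranks

# One trainer + two tensor-parallel inference workers:
world_size, trainer_rank, worker_ranks = nccl_ranks(num_workers=2)
print(world_size, trainer_rank, worker_ranks)  # 3 0 [1, 2]
```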

## Sending Weights
```python
from vllm.distributed.weight_transfer.nccl_engine import (
    NCCLTrainerSendWeightsArgs,
    NCCLWeightTransferEngine,
)

trainer_args = NCCLTrainerSendWeightsArgs(
    group=group,
    packed=True,  # use packed broadcasting for efficiency
)

NCCLWeightTransferEngine.trainer_send_weights(
    iterator=model.named_parameters(),
    trainer_args=trainer_args,
)
```
See [`NCCLTrainerSendWeightsArgs`](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/weight_transfer/nccl_engine.py) for the full list of configurable fields.

### Packed Tensor Broadcasting

When `packed=True`, multiple weight tensors are packed into large contiguous buffers before broadcasting. This reduces the number of NCCL operations and uses double/triple buffering with dedicated CUDA streams for overlap between packing, broadcasting, and unpacking.

Both the trainer (`NCCLTrainerSendWeightsArgs`) and inference side (`NCCLWeightTransferUpdateInfo`) must use matching `packed_buffer_size_bytes` and `packed_num_buffers` values.
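The core idea can be sketched on CPU with plain PyTorch. This is an illustrative round-trip only, assuming all tensors share one dtype; vLLM's implementation additionally handles buffer size limits, multiple in-flight buffers, and CUDA streams.

```python
import math

import torch

def pack(tensors):
    """Flatten tensors into one contiguous buffer plus shape metadata."""
    buffer = torch.cat([t.reshape(-1) for t in tensors])
    shapes = [tuple(t.shape) for t in tensors]
    return buffer, shapes

def unpack(buffer, shapes):
    """Slice the buffer back into tensors of the recorded shapes."""
    out, offset = [], 0
    for shape in shapes:
        numel = math.prod(shape)
        out.append(buffer[offset:offset + numel].reshape(shape))
        offset += numel
    return out

weights = [torch.randn(4, 4), torch.randn(8), torch.randn(2, 3)]
buffer, shapes = pack(weights)   # one broadcast instead of three
restored = unpack(buffer, shapes)
assert all(torch.equal(a, b) for a, b in zip(weights, restored))
```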

## Receiving Weights (Inference Side)

The inference side triggers weight reception by calling `update_weights`:
```python
from vllm.distributed.weight_transfer.base import WeightTransferUpdateRequest

llm.update_weights(
    WeightTransferUpdateRequest(
        update_info=dict(
            names=names,
            dtype_names=dtype_names,
            shapes=shapes,
            packed=True,
        )
    )
)
```
The `names`, `dtype_names`, and `shapes` lists describe each parameter. These must match the order in which the trainer iterates over its parameters.
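For instance, the trainer could derive these lists from the same `named_parameters()` iteration it passes to `trainer_send_weights`. This sketch uses a toy `torch.nn.Linear` as a stand-in for the policy model; the exact dtype-string format the engine expects is an assumption here.

```python
import torch

model = torch.nn.Linear(4, 2)  # toy stand-in for the policy model

names, dtype_names, shapes = [], [], []
for name, param in model.named_parameters():
    names.append(name)
    # e.g. "float32"; the precise string format is an assumption
    dtype_names.append(str(param.dtype).removeprefix("torch."))
    shapes.append(tuple(param.shape))

print(names)   # ['weight', 'bias']
print(shapes)  # [(2, 4), (2,)]
```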

## Examples

- [RLHF with NCCL weight syncing (offline, Ray)](../../examples/rl/rlhf_nccl.md) - Trainer on one GPU, 2x tensor-parallel vLLM engine on two others, with packed NCCL weight broadcast
- [RLHF with async weight syncing (offline, Ray)](../../examples/rl/rlhf_async_new_apis.md) - Async generation with mid-flight pause, weight sync, resume, and validation against a fresh model
- [RLHF with NCCL weight syncing (online serving, HTTP)](../../examples/rl/rlhf_http_nccl.md) - Weight transfer with a running vLLM HTTP server using an HTTP control plane and NCCL data plane