Comment fix for async rl example (#35244)

Signed-off-by: hao-aaron <ahao@anyscale.com>
This commit is contained in:
Aaron Hao
2026-03-19 12:46:07 -07:00
committed by GitHub
parent 040a505ff5
commit 4ee847e400

View File

@@ -2,25 +2,38 @@
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Demonstrates async reinforcement learning using vLLM and Ray,
with native weight syncing APIs at engine instance.
with native weight syncing APIs and batch-invariant generation.
The script separates training and inference workloads onto distinct GPUs
so that Ray can manage process placement and inter-process communication.
A Hugging Face Transformer model occupies one GPU for training, whereas a
2x tensor-parallel vLLM inference engine occupies two GPUs.
A Hugging Face Transformer model occupies one GPU for training, and a
vLLM AsyncLLMEngine occupies another GPU for inference.
Batch invariance is enabled so that generation output is deterministic
regardless of how many requests are batched together. This is required
for the validation phase to succeed. Batch invariance currently requires
NVIDIA GPUs with compute capability 9.0 or higher:
- H-series: H100, H200
- B-series: B100, B200
The example performs the following steps:
* Load the training model on one gpu (scheduled via ray)
* Initialize the inference model with dummy weights across
two gpus using vLLM's tensor parallelism and Ray placement groups.
* Generate gibberish from a list of prompts using the randomly initialized
inference engine.
* Pause generation once generation completes for one sequence
* Update the weights of the training model and broadcast the updated weights
to the inference engine by using a Ray collective RPC group.
* Resume generation and print out the results
* Load the training model (Qwen3-1.7B) on one GPU via a Ray actor.
* Initialize the inference engine with a base model (Qwen3-1.7B-Base)
on a separate GPU using vLLM's AsyncLLMEngine with Ray as the
distributed executor backend.
* Set up an NCCL-based weight transfer channel between the trainer
and the inference engine.
* Submit generation requests for a batch of prompts.
* Pause generation once any request reaches a token threshold.
* Broadcast the training model's weights to the inference engine
via the NCCL weight transfer engine, replacing the base weights.
* Resume generation and collect results, noting which tokens were
generated before vs. after the weight swap.
* Validate correctness by launching a fresh vLLM instance loaded
directly with the training model and comparing its output to the
post-swap tokens from the weight-synced engine.
This example assumes a single-node cluster with three GPUs, but Ray
This example assumes a single-node cluster with two GPUs, but Ray
supports multi-node clusters. vLLM expects the GPUs are only used for vLLM
workloads. Residual GPU activity interferes with vLLM memory profiling and
causes unexpected behavior.