vllm/docs/training/rlhf.md
Aaron Hao 47a1f11bff [docs] Add docs for new RL flows (#36188)
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-18 09:04:26 +00:00


# Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generated preference data to align model outputs with desired behaviors. vLLM can be used to generate the completions for RLHF.

The following open-source RL libraries use vLLM for fast rollouts (listed alphabetically; this list is not exhaustive):

For weight synchronization between training and inference, see the Weight Transfer documentation, which covers the pluggable backend system with NCCL (multi-GPU) and IPC (same-GPU) engines.
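To illustrate the shape of a pluggable backend system, here is a toy registry that dispatches to an NCCL-style or IPC-style engine. All class and function names below are hypothetical stand-ins for illustration, not vLLM's actual weight-transfer API:

```python
# Hypothetical sketch of a pluggable weight-transfer backend registry.
# None of these names are vLLM's; they only illustrate the dispatch pattern.

class WeightTransferBackend:
    """Moves updated weights from the trainer to the inference engine."""
    def transfer(self, state_dict: dict) -> int:
        raise NotImplementedError

class NcclBackend(WeightTransferBackend):
    # Multi-GPU case: would broadcast each tensor over a NCCL process group.
    def transfer(self, state_dict: dict) -> int:
        return len(state_dict)  # stand-in for broadcasting each tensor

class IpcBackend(WeightTransferBackend):
    # Same-GPU case: would share CUDA memory handles between processes.
    def transfer(self, state_dict: dict) -> int:
        return len(state_dict)  # stand-in for exporting IPC handles

BACKENDS = {"nccl": NcclBackend, "ipc": IpcBackend}

def sync_weights(kind: str, state_dict: dict) -> int:
    """Pick a backend by name and transfer the given weights."""
    backend = BACKENDS[kind]()
    return backend.transfer(state_dict)

print(sync_weights("nccl", {"w": [1.0], "b": [0.0]}))  # → 2
```

The point of the pattern is that training code calls one `sync_weights`-style entry point, while the choice of engine (NCCL for multi-GPU, IPC for same-GPU) is a configuration detail.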

For pipelining generation and training to improve GPU utilization and throughput, see the Async Reinforcement Learning guide, which covers the pause/resume API for safely updating weights mid-flight.

See the following notebooks, which show how to use vLLM for GRPO (Group Relative Policy Optimization):
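As background for the notebooks: GRPO scores a group of sampled completions per prompt and normalizes the rewards within each group to get advantages, avoiding a learned value function. A minimal sketch, assuming the common formulation A_i = (r_i − mean(r)) / (std(r) + ε):

```python
# Sketch of GRPO-style group-relative advantages: sample a group of
# completions for one prompt, score each, normalize within the group.
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize per-completion rewards within their sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions for one prompt, scored by a reward function.
advs = group_advantages([1.0, 0.0, 0.5, 0.5])
print([round(a, 3) for a in advs])  # → [1.414, -1.414, 0.0, 0.0]
```

The group here is exactly the `n` completions vLLM samples per prompt, which is why fast multi-sample rollouts matter for GRPO throughput.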