# Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generated preference data to align model outputs with desired behaviors. In an RLHF pipeline, vLLM can serve as the inference engine that generates the completions (rollouts) used for training.
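
As a rough illustration of the completion-generation step, the sketch below samples several completions per prompt and attaches a reward to each. It is a minimal sketch, not how any of the libraries listed on this page implement rollouts: `GenerateFn`, `make_vllm_generate`, `collect_rollouts`, and the `reward_fn` argument are hypothetical names introduced here. Only `LLM`, `SamplingParams`, and `LLM.generate` come from vLLM's offline inference API.

```python
from typing import Callable

# Hypothetical interface: a rollout backend maps (prompts, n) to
# n completions per prompt, returned in prompt order.
GenerateFn = Callable[[list[str], int], list[list[str]]]


def make_vllm_generate(model: str, max_tokens: int = 64) -> GenerateFn:
    """Wrap vLLM's offline `LLM` class as a rollout backend (illustrative helper)."""
    # Imported lazily so the rest of the sketch runs without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)

    def generate(prompts: list[str], n: int) -> list[list[str]]:
        # n > 1 samples several completions per prompt in one batched call.
        params = SamplingParams(n=n, temperature=1.0, max_tokens=max_tokens)
        outputs = llm.generate(prompts, params)
        return [[c.text for c in out.outputs] for out in outputs]

    return generate


def collect_rollouts(
    generate: GenerateFn,
    prompts: list[str],
    reward_fn: Callable[[str, str], float],
    n: int = 4,
) -> list[dict]:
    """Generate n completions per prompt and score each with reward_fn."""
    rollouts = []
    for prompt, completions in zip(prompts, generate(prompts, n)):
        for completion in completions:
            rollouts.append(
                {
                    "prompt": prompt,
                    "completion": completion,
                    "reward": reward_fn(prompt, completion),
                }
            )
    return rollouts
```

With vLLM installed, `collect_rollouts(make_vllm_generate("facebook/opt-125m"), prompts, reward_fn)` would produce a batch for a trainer to consume. The model name is only a placeholder, and a real pipeline would also need to synchronize updated weights back to the inference engine between training steps.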

The following open-source RL libraries use vLLM for fast rollouts (sorted alphabetically and non-exhaustive):

- [Cosmos-RL](https://github.com/nvidia-cosmos/cosmos-rl)
- [ms-swift](https://github.com/modelscope/ms-swift/tree/main)
- [NeMo-RL](https://github.com/NVIDIA-NeMo/RL)
- [Open Instruct](https://github.com/allenai/open-instruct)
- [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)
- [PipelineRL](https://github.com/ServiceNow/PipelineRL)
- [Prime-RL](https://github.com/PrimeIntellect-ai/prime-rl)
- [SkyRL](https://github.com/NovaSky-AI/SkyRL)
- [TRL](https://github.com/huggingface/trl)
- [Unsloth](https://github.com/unslothai/unsloth)
- [verl](https://github.com/volcengine/verl)

For weight synchronization between training and inference, see the [Weight Transfer](weight_transfer/README.md) documentation, which covers the pluggable backend system with [NCCL](weight_transfer/nccl.md) (multi-GPU) and [IPC](weight_transfer/ipc.md) (same-GPU) engines.

For pipelining generation and training to improve GPU utilization and throughput, see the [Async Reinforcement Learning](async_rl.md) guide, which covers the pause/resume API for safely updating weights mid-flight.

The following notebooks show how to use vLLM for GRPO (Group Relative Policy Optimization) training:

- [Efficient Online Training with GRPO and vLLM in TRL](https://huggingface.co/learn/cookbook/grpo_vllm_online_training)
- [Qwen-3 4B GRPO using Unsloth + vLLM](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb)