# Reinforcement Learning from Human Feedback
Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generated preference data to align model outputs with desired behaviors. vLLM can be used to generate the completions (rollouts) for RLHF.
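A minimal sketch of generating rollouts with vLLM's offline inference API. The model name, prompt template, and sampling settings below are illustrative assumptions, not recommendations:

```python
# Sketch: generating RLHF rollouts with vLLM offline inference.
# The model, template, and sampling settings here are illustrative only.

def build_rollout_prompts(questions):
    """Wrap raw questions in a simple chat-style template (illustrative)."""
    return [f"User: {q}\nAssistant:" for q in questions]

def generate_rollouts(questions, model="Qwen/Qwen2.5-0.5B-Instruct", n=4):
    # Deferred import so the prompt helper stays usable without a GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)
    params = SamplingParams(n=n, temperature=1.0, max_tokens=256)
    outputs = llm.generate(build_rollout_prompts(questions), params)
    # Each prompt yields `n` sampled completions, which a reward model
    # would then score before the policy update step.
    return [[c.text for c in out.outputs] for out in outputs]

if __name__ == "__main__":
    rollouts = generate_rollouts(["What is 2 + 2?"])
    print(len(rollouts[0]))  # n completions for the one prompt
```

In a full RLHF loop, the training process then updates the policy weights and syncs them back into the vLLM workers; the examples below show two ways of arranging those processes.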
The following open-source RL libraries use vLLM for fast rollouts (sorted alphabetically and non-exhaustive):
- [Cosmos-RL](https://github.com/nvidia-cosmos/cosmos-rl)
- [ms-swift](https://github.com/modelscope/ms-swift/tree/main)
- [NeMo-RL](https://github.com/NVIDIA-NeMo/RL)
- [Open Instruct](https://github.com/allenai/open-instruct)
- [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)
- [PipelineRL](https://github.com/ServiceNow/PipelineRL)
- [Prime-RL](https://github.com/PrimeIntellect-ai/prime-rl)
- [SkyRL](https://github.com/NovaSky-AI/SkyRL)
- [TRL](https://github.com/huggingface/trl)
- [Unsloth](https://github.com/unslothai/unsloth)
- [verl](https://github.com/volcengine/verl)
See the following basic examples to get started if you don't want to use an existing library:
- [Training and inference processes are located on separate GPUs (inspired by OpenRLHF)](../examples/offline_inference/rlhf.md)
- [Training and inference processes are colocated on the same GPUs using Ray](../examples/offline_inference/rlhf_colocate.md)
- [Utilities for performing RLHF with vLLM](../examples/offline_inference/rlhf_utils.md)
See the following notebooks showing how to use vLLM for GRPO (Group Relative Policy Optimization):
- [Efficient Online Training with GRPO and vLLM in TRL](https://huggingface.co/learn/cookbook/grpo_vllm_online_training)
- [Qwen-3 4B GRPO using Unsloth + vLLM](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_%284B%29-GRPO.ipynb)
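The core of GRPO is scoring each completion relative to the other completions sampled for the same prompt, rather than using a learned value function. A minimal sketch of that group-relative advantage computation (the reward values below are made up for illustration):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps), so
    above-average completions get positive advantages and below-average
    ones get negative advantages.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions sampled for one prompt, scored by a reward model (made-up values).
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # roughly [1, -1, -1, 1]: winners positive, losers negative
```

Because the baseline is just the group mean, generating many completions per prompt quickly is the expensive step, which is why these recipes use vLLM for the rollout phase.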