# Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generated preference data to align model outputs with desired behaviors.

vLLM can be used to generate the completions for RLHF. One way to do this is through an existing library such as [TRL](https://github.com/huggingface/trl), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [verl](https://github.com/volcengine/verl), or [Unsloth](https://github.com/unslothai/unsloth).

If you don't want to use an existing library, see the following basic examples to get started:

- [Training and inference processes are located on separate GPUs (inspired by OpenRLHF)](../examples/offline_inference/rlhf.md)
- [Training and inference processes are colocated on the same GPUs using Ray](../examples/offline_inference/rlhf_colocate.md)
- [Utilities for performing RLHF with vLLM](../examples/offline_inference/rlhf_utils.md)
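
For orientation, the completion-generation step that RLHF relies on can be sketched with vLLM's offline `LLM.generate` API. This is a minimal sketch, not the approach taken by the linked examples: the helper names `group_by_prompt` and `generate_rollouts` are hypothetical, the sampling values are arbitrary, and weight synchronization between the trainer and vLLM is not covered.

```python
from typing import Dict, List


def group_by_prompt(prompts: List[str], completions: List[List[str]]) -> List[Dict[str, str]]:
    """Flatten per-prompt samples into one record per (prompt, completion) pair."""
    records = []
    for prompt, samples in zip(prompts, completions):
        for text in samples:
            records.append({"prompt": prompt, "completion": text})
    return records


def generate_rollouts(model_name: str, prompts: List[str], n: int = 4) -> List[Dict[str, str]]:
    """Sample n completions per prompt with vLLM (hypothetical helper)."""
    # Imported lazily so group_by_prompt can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_name)
    # Assumed values: n samples per prompt, temperature 1.0 for diverse rollouts.
    params = SamplingParams(n=n, temperature=1.0, max_tokens=128)
    outputs = llm.generate(prompts, params)
    # Each RequestOutput holds n CompletionOutput objects in `.outputs`.
    completions = [[c.text for c in out.outputs] for out in outputs]
    return group_by_prompt(prompts, completions)
```

Calling `generate_rollouts("facebook/opt-125m", ["The capital of France is"])` would return four prompt/completion records, assuming vLLM is installed and a supported accelerator is available; the model name is only a placeholder. The records would then be scored by a reward model or reward function in the training loop.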

The following notebooks show how to use vLLM for GRPO:

- [Qwen-3 4B GRPO using Unsloth + vLLM](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb)
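
GRPO (Group Relative Policy Optimization) samples a group of completions per prompt and scores each one relative to the rest of its group. The sketch below illustrates that normalization step in plain Python. It is an illustration under stated assumptions, not code from the notebook: the function name `group_relative_advantages` is hypothetical, and the use of the population standard deviation and the `eps` value are assumptions that libraries may handle differently.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one prompt's completion group.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and eps are assumptions made for this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four completions sampled from the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions that score above their group's mean receive positive advantages and those below receive negative ones. When all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.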