Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning from Human Feedback (RLHF) is a training approach in which a model’s behavior is optimized based on human feedback. Instead of learning directly from explicit rules or labeled datasets, the model is refined using evaluations that represent human preferences.
The process typically consists of multiple steps: after initial pretraining, human annotators rank or evaluate different model outputs, which are then used to train a so-called reward model. The generative model is subsequently optimized through reinforcement learning to produce responses that better align with these learned human preferences.
RLHF plays a key role in aligning the behavior and safety properties of modern large language models. However, it does not guarantee fully invariant behavior, particularly in the presence of further fine-tuning steps or adversarial inputs.
Related concepts: