The RL Setup for LLMs

  • Agent: Language Model
  • State: the prompt (input tokens) plus the tokens generated so far
  • Action: which token from the vocabulary is selected as the next token
  • Reward Model: scores outputs so that the language model is rewarded for generating “good responses” and receives little or no reward for generating “bad responses”
  • Policy: the language model itself, since it models a probability distribution over the action space given the current state of the agent: $\pi_\theta(a_t \mid s_t) = P_\theta(x_t \mid x_{<t})$ (see the sketch below)
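
As a minimal sketch of this mapping (assuming the Hugging Face `transformers` library and GPT-2 purely as an illustrative choice), the policy is simply the model's next-token distribution conditioned on the current state:

```python
# Sketch: the LM as an RL policy pi(action | state).
# GPT-2 and `transformers` are assumptions made for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")

state = "The capital of France is"                 # state: prompt (plus any tokens generated so far)
input_ids = tokenizer(state, return_tensors="pt").input_ids

with torch.no_grad():
    logits = policy(input_ids).logits              # shape: (1, seq_len, vocab_size)

# pi(a | s): distribution over the next token; the action space is the vocabulary
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
action = torch.argmax(next_token_probs)            # greedy action, for illustration
print(tokenizer.decode(action.item()), next_token_probs[action].item())
```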

Reward Model Loss:
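
A common formulation (the pairwise Bradley–Terry preference loss used in InstructGPT-style RLHF) trains the reward model $r_\theta$ to score the preferred response $y_w$ above the rejected response $y_l$ for a prompt $x$:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]$$

A hedged sketch of this loss in PyTorch, assuming the reward model already produces one scalar score per response:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push r(x, y_w) above r(x, y_l)."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Illustrative usage with dummy scalar scores per (chosen, rejected) pair
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, 1.1])
print(reward_model_loss(chosen, rejected))
```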

Algorithm

  1. Initial training with supervised learning (supervised fine-tuning, SFT)
  2. Collecting human feedback: the model generates outputs, and human evaluators review them and indicate which outputs are preferred
  3. Reward model training: a reward model is trained on this human feedback to predict the quality of an output according to human preferences
  4. Fine-tuning with RL: the reward model provides the reward signal used to optimize the main model with RL techniques (usually PPO); see the sketch after this list
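
A hedged sketch of the PPO update that step 4 refers to; the tensor arguments are placeholders, and in practice the advantages are derived from the reward model's scores (usually combined with a KL penalty against the SFT model):

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s) for the generated tokens
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the minimum of the two terms; negate for gradient descent
    return -torch.min(unclipped, clipped).mean()

# Illustrative usage with dummy per-token values
new_lp = torch.tensor([-1.0, -0.8, -1.2])
old_lp = torch.tensor([-1.1, -0.9, -1.0])
adv = torch.tensor([0.5, -0.2, 0.3])   # in practice, derived from the reward model's signal
print(ppo_clipped_loss(new_lp, old_lp, adv))
```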