The RL Setup for LLMs

  • Agent: Language Model
  • State: the prompt (input tokens) plus the tokens generated so far
  • Action: which token from the vocabulary is selected as the next token
  • Reward Model: scores outputs so that the language model is rewarded for generating “good responses” and receives little or no reward for generating “bad responses”
  • Policy: the language model itself, since it models a probability distribution over the action space given the current state of the agent: $\pi_\theta(a_t \mid s_t) = P_\theta(x_t \mid x_{<t})$ (see the sketch below)
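
As a minimal sketch of this mapping (assuming the Hugging Face `transformers` library and GPT-2 purely as an illustrative choice), the policy is simply the model's next-token distribution conditioned on the current state:

```python
# Sketch: the LM as an RL policy pi(action | state).
# GPT-2 and `transformers` are assumptions made for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")

state = "The capital of France is"                 # state: prompt (plus any tokens generated so far)
input_ids = tokenizer(state, return_tensors="pt").input_ids

with torch.no_grad():
    logits = policy(input_ids).logits              # shape: (1, seq_len, vocab_size)

# pi(a | s): distribution over the next token; the action space is the vocabulary
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
action = torch.argmax(next_token_probs)            # greedy action, for illustration
print(tokenizer.decode(action.item()), next_token_probs[action].item())
```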

Reward Model Loss:
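
A common formulation (the pairwise Bradley–Terry preference loss used in InstructGPT-style RLHF) trains the reward model $r_\theta$ to score the preferred response $y_w$ above the rejected response $y_l$ for a prompt $x$:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]$$

A hedged sketch of this loss in PyTorch, assuming the reward model already produces one scalar score per response:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push r(x, y_w) above r(x, y_l)."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Illustrative usage with dummy scalar scores per (chosen, rejected) pair
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, 1.1])
print(reward_model_loss(chosen, rejected))
```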

Algorithm

  1. Initial training with supervised learning (supervised fine-tuning, SFT)
  2. Collecting human feedback: the model generates outputs, and human evaluators review them and indicate which outputs are preferred
  3. Reward model training: a reward model is trained on this human feedback to predict the quality of an output according to human preferences
  4. Fine-tuning with RL: the reward model provides the reward signal used to optimize the main model with RL techniques (usually PPO); see the sketch after this list
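
A hedged sketch of the PPO update that step 4 refers to; the tensor arguments are placeholders, and in practice the advantages are derived from the reward model's scores (usually combined with a KL penalty against the SFT model):

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s) for the generated tokens
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the minimum of the two terms; negate for gradient descent
    return -torch.min(unclipped, clipped).mean()

# Illustrative usage with dummy per-token values
new_lp = torch.tensor([-1.0, -0.8, -1.2])
old_lp = torch.tensor([-1.1, -0.9, -1.0])
adv = torch.tensor([0.5, -0.2, 0.3])   # in practice, derived from the reward model's signal
print(ppo_clipped_loss(new_lp, old_lp, adv))
```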