The RL Setup for LLMs
- Agent: Language Model
- State: the prompt plus the tokens generated so far (the current context)
- Action: the next token chosen from the vocabulary
- Reward Model: a separately trained model that scores outputs, so the language model is rewarded for generating “good” responses and receives little or no reward for “bad” ones
- Policy: the language model itself, since it defines a probability distribution over the action space (the vocabulary) given the current state: $\pi_\theta(a_t \mid s_t) = p_\theta(x_t \mid x_{<t})$
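As a concrete illustration of the policy, a minimal sketch (assuming the Hugging Face transformers library, with GPT-2 as an arbitrary stand-in model) of reading the next-token distribution $\pi_\theta(a_t \mid s_t)$ out of a causal LM and sampling an action:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is just an illustrative choice; any causal LM behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"            # state s_t: prompt + tokens generated so far
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # (batch, seq_len, vocab_size)

# Policy pi_theta(a_t | s_t): a distribution over the vocabulary (the action space)
next_token_probs = torch.softmax(logits[:, -1, :], dim=-1)

# Taking an action = sampling the next token
action = torch.multinomial(next_token_probs, num_samples=1)
print(tokenizer.decode(action[0]))
```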
Reward Model Loss (pairwise comparison / Bradley–Terry): $\mathcal{L}(\phi) = -\mathbb{E}_{(x,\, y_w,\, y_l) \sim D}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]$, where $y_w$ is the preferred response, $y_l$ the rejected one, and $\sigma$ the sigmoid.
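A minimal PyTorch sketch of this pairwise loss; `chosen_rewards` and `rejected_rewards` stand in for the reward model's scalar scores of the preferred and rejected responses (names are illustrative):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss: push r(x, y_w) above r(x, y_l).

    chosen_rewards / rejected_rewards: shape (batch,), scalar scores
    produced by the reward model for the preferred and rejected
    responses of each comparison.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative usage with dummy scores
chosen = torch.tensor([1.2, 0.3, 0.8])
rejected = torch.tensor([0.4, 0.5, -0.1])
loss = reward_model_loss(chosen, rejected)   # backprop this through the reward model
```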
Algorithm
- Initial training with Supervised Learning
- Collecting Human Feedback
- The model generates candidate outputs for a set of prompts; human evaluators compare or rank these outputs, indicating which ones are preferred (see the preference-pair sketch after this list)
- Fine-tuning with RL
- A reward model is trained on this human feedback to predict how well an output matches human preferences
- The reward model then provides the reward signal used to optimize the main model with RL techniques, usually PPO (a simplified sketch follows this list)
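For the feedback-collection step, a sketch of what one preference record could look like (field names are illustrative, not a specific dataset schema):

```python
# One human comparison: two candidate outputs for the same prompt,
# with a label saying which one the annotator preferred.
preference_example = {
    "prompt": "Explain photosynthesis to a 10-year-old.",
    "response_a": "Plants use sunlight to turn air and water into food...",
    "response_b": "Photosynthesis is the conversion of photons into ATP...",
    "preferred": "response_a",   # chosen by the human evaluator
}
```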
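And a simplified sketch of the RL fine-tuning step, assuming per-token log-probs from the current policy and a frozen reference (SFT) model are already computed. It shows the common reward shaping with a KL penalty plus the clipped PPO surrogate objective, not a full training loop:

```python
import torch

def shaped_rewards(reward_model_score, policy_logprobs, ref_logprobs, kl_coef=0.1):
    """Per-token reward: a KL penalty keeping the policy close to the
    reference (SFT) model, with the reward model's sequence-level score
    added on the final token."""
    kl = policy_logprobs - ref_logprobs                 # per-token KL estimate
    rewards = -kl_coef * kl
    rewards[..., -1] += reward_model_score              # reward model score at the end
    return rewards

def ppo_clipped_objective(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate objective (to be maximized)."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()
```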