Policy-based methods
- Directly learn to approximate the optimal policy $\pi^*$ without having to learn a value function
- To do this, we parameterize the policy $\pi_\theta(a \mid s)$ using a neural network.
Goal
Maximize the performance $J(\theta)$ of the parameterized policy $\pi_\theta$ using gradient ascent
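As a concrete sketch of what "parameterizing the policy with a neural network" looks like, here is a minimal example for a discrete action space, assuming PyTorch; the class name `PolicyNetwork`, layer sizes, and dimensions are illustrative, not part of the notes:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Parameterizes pi_theta(a|s) as a softmax distribution over action logits."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)

# Sampling an action from the current (stochastic) policy:
policy = PolicyNetwork(state_dim=4, action_dim=2)
state = torch.rand(4)               # placeholder observation
dist = policy(state)
action = dist.sample()              # stochastic action
log_prob = dist.log_prob(action)    # needed later for the policy gradient
```

Because the policy is itself a probability distribution, sampling from it already produces stochastic, exploratory behaviour.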
Pros and Cons of Policy Gradient Methods
Pros
- Directly estimate the policy without storing additional data
- Can learn a stochastic policy
- No need to hand-design an exploration/exploitation tradeoff: the stochastic policy explores on its own
- Much more effective in high-dimensional (continuous) action spaces
- Better convergence properties
Cons
- They converge to a local max instead of a global max
- Can take longer to train
- Can have higher variance
Policy Gradient Algorithm
- Initialize the policy parameters $\theta$ randomly.
- Define the policy $\pi_\theta(a \mid s)$ as a parameterized probability distribution over actions.
- Generate an episode by following the current policy $\pi_\theta$:
  - Observe states $s_t$, take actions $a_t$, and receive rewards $r_{t+1}$ for $t = 0, 1, \dots, T-1$.
- Calculate the cumulative (discounted) return for each step:
  $G_t = \sum_{k=t+1}^{T} \gamma^{\,k-t-1} r_k$,
  where $\gamma$ is the discount factor.
- Compute the policy gradient:
  $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$
- Estimate this gradient by sampling episodes and using Monte Carlo methods.
- Update the policy parameters using gradient ascent:
  $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$,
  where $\alpha$ is the learning rate.
- Repeat the episode-generation, gradient-estimation, and update steps until convergence or for a predefined number of iterations.
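Putting these steps together, the following is a minimal REINFORCE-style training loop, sketched under a few assumptions: it reuses the `PolicyNetwork` from the earlier sketch, expects an environment `env` with the classic Gym `reset()`/`step()` API, and uses illustrative hyperparameters.

```python
import torch

gamma, alpha = 0.99, 1e-3
optimizer = torch.optim.Adam(policy.parameters(), lr=alpha)

for episode in range(1000):
    # 1. Generate an episode by following the current policy pi_theta
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        dist = policy(torch.as_tensor(state, dtype=torch.float32))
        action = dist.sample()
        state, reward, done, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)

    # 2. Compute the discounted return G_t for every time step
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # 3. Monte Carlo estimate of the policy gradient:
    #    grad J ~= sum_t grad log pi_theta(a_t|s_t) * G_t
    #    (negated so that gradient *descent* on the loss performs ascent on J)
    loss = -(torch.stack(log_probs) * returns).sum()

    # 4. Gradient ascent step on theta
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```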
Why does the policy gradient work?
- The gradient pushes the policy to assign higher probabilities to actions that lead to higher rewards.
- This allows the policy to improve iteratively by maximizing the expected cumulative reward.
Variants of Policy Gradient
- REINFORCE: A simple implementation using Monte Carlo estimates of the return $G_t$ for the gradient.
- Actor-Critic: Combines a parameterized policy (actor) with a value function (critic) to reduce variance in gradient estimation.
- Proximal Policy Optimization (PPO): Regularizes policy updates to ensure stability and prevent large changes in the policy.
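As an illustration of the PPO idea, one common way to regularize the update is a clipped surrogate objective; the sketch below assumes log-probabilities and advantages have already been collected from a rollout, and the tensor names and clipping value `eps` are illustrative:

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate objective: discourages large policy updates."""
    ratio = torch.exp(new_log_probs - old_log_probs)        # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()             # maximize the surrogate
```

Clipping the probability ratio keeps each update close to the previous policy, which is what gives PPO its stability.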
Policy Gradient Theorem
- Objective Function:
- The goal is to maximize the expected cumulative reward under a parameterized policy $\pi_\theta$:
  $J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1}\right]$
- Here, $\gamma$ is the discount factor.
- Policy Gradient Theorem:
- The gradient of $J(\theta)$ is given by:
  $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$
- This connects the expected reward to the gradient of the policy’s log-probability, weighted by the action-value function $Q^{\pi_\theta}(s, a)$.
- Simplified Gradient Estimate:
- Using the cumulative reward $G_t$ as an estimate of $Q^{\pi_\theta}(s_t, a_t)$:
  $\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$
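One way to see where the $\log \pi_\theta$ term comes from is the likelihood-ratio (log-derivative) trick, sketched here over whole trajectories $\tau$ with return $R(\tau)$; the environment's dynamics terms drop out of $\nabla_\theta \log p_\theta(\tau)$ because they do not depend on $\theta$:

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau
  = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau
  = \mathbb{E}_{\tau \sim \pi_\theta}\bigl[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \bigr],
\qquad
\nabla_\theta \log p_\theta(\tau) = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t).
```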
Key Advantages of Policy Gradient Theorem
- Directly optimizes the policy without needing a value function.
- Works well for stochastic or continuous action spaces.
- Facilitates modeling of complex, parameterized policies.
Variants and Improvements
- Baseline Subtraction:
- Reduce variance by subtracting a baseline $b(s_t)$ from $G_t$ (see the sketch after this list):
  $\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(G_t - b(s_t)\bigr)\right]$
- A common choice for $b(s_t)$ is the value function $V^{\pi_\theta}(s_t)$.
- Actor-Critic Methods:
- Use a critic to estimate $Q^{\pi_\theta}(s, a)$ or $V^{\pi_\theta}(s)$ to reduce variance and stabilize training.
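A minimal sketch of baseline subtraction with a learned critic, assuming a separate value network `value_net` that estimates $V^{\pi_\theta}(s)$, plus `states`, `returns`, and `log_probs` collected from a rollout; all names are illustrative:

```python
import torch

# returns:   tensor of Monte Carlo returns G_t for one episode
# states:    tensor of the corresponding states s_t
# log_probs: stacked log pi_theta(a_t|s_t) from the rollout
values = value_net(states).squeeze(-1)           # baseline b(s_t) = V(s_t)
advantages = returns - values.detach()           # baseline-subtracted returns
policy_loss = -(log_probs * advantages).sum()    # REINFORCE-with-baseline objective
value_loss = (returns - values).pow(2).mean()    # fit the critic to the observed returns
```

Subtracting a state-dependent baseline leaves the gradient estimate unbiased (its expected contribution is zero) while often reducing its variance substantially.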