Policy-based methods

  • Directly learn to approximate the optimal policy without having to learn a value function
  • To do this, we parameterize the policy using a neural network.

Goal

Maximize the performance of the parameterized policy using gradient ascent

Pros and Cons of Policy Gradient Methods

Pros

  • Directly estimate the policy without having to store additional data (e.g., a table of action values)
  • Can learn a stochastic policy
    • Don’t need to implement an exploration/exploitation tradeoff by hand
  • Much more effective in high-dimensional (continuous) action spaces
  • Better convergence properties

Cons

  • They converge to a local max instead of a global max
  • Can take longer to train
  • Can have higher variance

Policy Gradient Algorithm

  1. Initialize the policy parameters randomly.
    • Define the policy as a parameterized probability distribution over actions.
  2. Generate an episode by following the current policy $\pi_\theta$:
    • Observe states $s_t$, take actions $a_t$, and receive rewards $r_{t+1}$ for $t = 0, 1, \dots, T-1$.
  3. Calculate the cumulative reward $G_t$ for each step:

    $$G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_{k+1}$$

    where $\gamma$ is the discount factor.

  4. Compute the policy gradient:

    $$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T-1} G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

    • Estimate this gradient by sampling episodes and using Monte Carlo methods (a worked sketch follows this list).
  5. Update the policy parameters using gradient ascent:

    $$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$$

    where $\alpha$ is the learning rate.

  6. Repeat steps 2–5 until convergence or for a predefined number of iterations.
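
A minimal end-to-end sketch of steps 1–6 in Python (NumPy only) is shown below. The linear softmax policy, the toy one-state two-action reward function, and all hyperparameter values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 2
theta = np.zeros(n_actions)      # step 1: policy parameters (softmax logits)
alpha, gamma = 0.1, 0.99         # assumed learning rate and discount factor

def policy_probs(theta):
    """Softmax probability distribution over actions for parameters theta."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def reward(action):
    """Toy stand-in environment: action 1 pays ~1 on average, action 0 pays ~0."""
    return rng.normal(loc=float(action), scale=0.1)

for episode in range(500):
    # Step 2: generate an episode (T = 10 steps) with the current policy.
    probs = policy_probs(theta)
    actions = rng.choice(n_actions, size=10, p=probs)
    rewards = np.array([reward(a) for a in actions])

    # Step 3: cumulative discounted return G_t for each step.
    returns = np.zeros_like(rewards)
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G

    # Steps 4-5: Monte Carlo policy gradient and gradient-ascent update.
    # For a softmax policy, grad_theta log pi(a) = one_hot(a) - probs.
    grad = np.zeros_like(theta)
    for a, G_t in zip(actions, returns):
        grad += G_t * (np.eye(n_actions)[a] - probs)
    theta += alpha * grad / len(actions)

# Step 6 (after repeating): the policy should strongly prefer action 1.
print("learned action probabilities:", policy_probs(theta))
```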

Why does the policy gradient work?

  • The gradient pushes the policy to assign higher probabilities to actions that lead to higher rewards.
  • This allows the policy to improve iteratively by maximizing the expected cumulative reward.

Variants of Policy Gradient

  • REINFORCE: A simple implementation using Monte Carlo estimates of the return $G_t$.
  • Actor-Critic: Combines a parameterized policy (actor) with a value function (critic) to reduce variance in gradient estimation.
  • Proximal Policy Optimization (PPO): Regularizes policy updates to ensure stability and prevent large changes in the policy.
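
As a point of reference, PPO’s regularization is typically expressed as a clipped surrogate objective (Schulman et al., 2017). Here $r_t(\theta)$ is the probability ratio between the new and old policies, $\hat{A}_t$ is an advantage estimate, and $\epsilon$ is a small clipping constant (e.g. 0.2); these symbols are standard PPO notation rather than definitions from these notes.

```latex
% PPO clipped surrogate objective: the ratio r_t(theta) is clipped so that a
% single update cannot move the policy too far from the previous policy.
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta) =
  \mathbb{E}_t\!\left[
    \min\!\Big( r_t(\theta)\, \hat{A}_t,\;
    \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \Big)
  \right]
```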

Policy Gradient Theorem

  1. Objective Function:
    • The goal is to maximize the expected cumulative reward under a parameterized policy $\pi_\theta$:

      $$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r_{t+1}\right]$$

    • Here, $\gamma \in [0, 1)$ is the discount factor.
  2. Policy Gradient Theorem:
    • The gradient of $J(\theta)$ is given by:

      $$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$

    • This connects the expected reward to the gradient of the policy’s log-probability, weighted by the action-value function $Q^{\pi_\theta}(s, a)$ (a short derivation sketch follows this list).
  3. Simplified Gradient Estimate:
    • Using the cumulative reward $G_t$ as an estimate of $Q^{\pi_\theta}(s_t, a_t)$:

      $$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
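
The theorem in item 2 can be sketched with the log-derivative (likelihood-ratio) trick. The derivation below is written in trajectory form, with $R(\tau)$ denoting the total reward of a trajectory $\tau$; this notation is introduced here for illustration and is consistent with, but not taken verbatim from, the notes above.

```latex
% Likelihood-ratio sketch of the policy gradient theorem.
\begin{align*}
\nabla_\theta J(\theta)
  &= \nabla_\theta \int \pi_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau \\
  &= \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau
     && \text{since } \nabla_\theta \pi_\theta = \pi_\theta\, \nabla_\theta \log \pi_\theta \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
     && \text{the dynamics terms in } \log \pi_\theta(\tau) \text{ do not depend on } \theta
\end{align*}
```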

Key Advantages of Policy Gradient Theorem

  • Directly optimizes the policy without needing a value function.
  • Works well for stochastic or continuous action spaces.
  • Facilitates modeling of complex, parameterized policies.

Variants and Improvements

  • Baseline Subtraction:
    • Reduce variance by subtracting a baseline $b(s)$ from $Q^{\pi_\theta}(s, a)$ (see the snippet below):

      $$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \big(Q^{\pi_\theta}(s, a) - b(s)\big)\right]$$

    • A common choice for $b(s)$ is the value function $V^{\pi_\theta}(s)$.
  • Actor-Critic Methods:
    • Use a critic to estimate $Q^{\pi_\theta}(s, a)$ or $V^{\pi_\theta}(s)$ to reduce variance and stabilize training.
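
Continuing the Monte Carlo sketch above, a minimal way to apply baseline subtraction is to subtract the mean return of a batch (a constant baseline); a learned value estimate $V(s_t)$ would serve the same role. The helper below is an illustrative sketch, not a prescribed implementation.

```python
import numpy as np

def advantages_with_baseline(returns: np.ndarray) -> np.ndarray:
    """Subtract a simple constant baseline (the batch mean return) from G_t.

    The expected policy gradient is unchanged, but its variance is typically
    lower; a learned value function V(s_t) could be used as the baseline instead.
    """
    baseline = returns.mean()
    return returns - baseline

# Example: Monte Carlo returns from one batch, centered before the gradient step.
returns = np.array([1.2, 0.4, 2.0, 0.9])
print(advantages_with_baseline(returns))
```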