Policy-based methods

  • Directly learn to approximate the optimal policy without having to learn a value function
  • To do this, we parameterize the policy using a neural network.

Goal

Maximize the performance of the parameterized policy using gradient ascent

Pros and Cons of Policy Gradient Methods

Pros

  • Directly estimate the policy without having to store additional data (e.g., a table of action values)
  • Can learn a stochastic policy
    • Don’t need to implement an exploration/exploitation tradeoff by hand
  • Much more effective in high-dimensional (continuous) action spaces
  • Better convergence properties

Cons

  • They converge to a local max instead of a global max
  • Can take longer to train
  • Can have higher variance

Policy Gradient Algorithm

  1. Initialize the policy parameters randomly.
    • Define the policy as a parameterized probability distribution over actions.
  2. Generate an episode by following the current policy $\pi_\theta$:
    • Observe states $s_t$, take actions $a_t$, and receive rewards $r_{t+1}$ for $t = 0, 1, \dots, T-1$.
  3. Calculate the cumulative reward $G_t$ for each step:

    $$G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_{k+1}$$

    where $\gamma$ is the discount factor.

  4. Compute the policy gradient:

    $$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T-1} G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

    • Estimate this gradient by sampling episodes and using Monte Carlo methods (a worked sketch follows this list).
  5. Update the policy parameters using gradient ascent:

    $$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$$

    where $\alpha$ is the learning rate.

  6. Repeat steps 2–5 until convergence or for a predefined number of iterations.
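
A minimal end-to-end sketch of steps 1–6 in Python (NumPy only) is shown below. The linear softmax policy, the toy one-state two-action reward function, and all hyperparameter values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 2
theta = np.zeros(n_actions)      # step 1: policy parameters (softmax logits)
alpha, gamma = 0.1, 0.99         # assumed learning rate and discount factor

def policy_probs(theta):
    """Softmax probability distribution over actions for parameters theta."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def reward(action):
    """Toy stand-in environment: action 1 pays ~1 on average, action 0 pays ~0."""
    return rng.normal(loc=float(action), scale=0.1)

for episode in range(500):
    # Step 2: generate an episode (T = 10 steps) with the current policy.
    probs = policy_probs(theta)
    actions = rng.choice(n_actions, size=10, p=probs)
    rewards = np.array([reward(a) for a in actions])

    # Step 3: cumulative discounted return G_t for each step.
    returns = np.zeros_like(rewards)
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G

    # Steps 4-5: Monte Carlo policy gradient and gradient-ascent update.
    # For a softmax policy, grad_theta log pi(a) = one_hot(a) - probs.
    grad = np.zeros_like(theta)
    for a, G_t in zip(actions, returns):
        grad += G_t * (np.eye(n_actions)[a] - probs)
    theta += alpha * grad / len(actions)

# Step 6 (after repeating): the policy should strongly prefer action 1.
print("learned action probabilities:", policy_probs(theta))
```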

Why does the policy gradient work?

  • The gradient pushes the policy to assign higher probabilities to actions that lead to higher rewards.
  • This allows the policy to improve iteratively by maximizing the expected cumulative reward.

Variants of Policy Gradient

  • REINFORCE: A simple implementation using Monte Carlo estimates of the return $G_t$.
  • Actor-Critic: Combines a parameterized policy (actor) with a value function (critic) to reduce variance in gradient estimation.
  • Proximal Policy Optimization (PPO): Regularizes policy updates to ensure stability and prevent large changes in the policy.
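
As a point of reference, PPO’s regularization is typically expressed as a clipped surrogate objective (Schulman et al., 2017). Here $r_t(\theta)$ is the probability ratio between the new and old policies, $\hat{A}_t$ is an advantage estimate, and $\epsilon$ is a small clipping constant (e.g. 0.2); these symbols are standard PPO notation rather than definitions from these notes.

```latex
% PPO clipped surrogate objective: the ratio r_t(theta) is clipped so that a
% single update cannot move the policy too far from the previous policy.
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta) =
  \mathbb{E}_t\!\left[
    \min\!\Big( r_t(\theta)\, \hat{A}_t,\;
    \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \Big)
  \right]
```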

Policy Gradient Theorem

  1. Objective Function:
    • The goal is to maximize the expected cumulative reward under a parameterized policy $\pi_\theta$:

      $$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r_{t+1}\right]$$

    • Here, $\gamma \in [0, 1)$ is the discount factor.
  2. Policy Gradient Theorem:
    • The gradient of $J(\theta)$ is given by:

      $$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$

    • This connects the expected reward to the gradient of the policy’s log-probability, weighted by the action-value function $Q^{\pi_\theta}(s, a)$ (a short derivation sketch follows this list).
  3. Simplified Gradient Estimate:
    • Using the cumulative reward $G_t$ as an estimate of $Q^{\pi_\theta}(s_t, a_t)$:

      $$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
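
The theorem in item 2 can be sketched with the log-derivative (likelihood-ratio) trick. The derivation below is written in trajectory form, with $R(\tau)$ denoting the total reward of a trajectory $\tau$; this notation is introduced here for illustration and is consistent with, but not taken verbatim from, the notes above.

```latex
% Likelihood-ratio sketch of the policy gradient theorem.
\begin{align*}
\nabla_\theta J(\theta)
  &= \nabla_\theta \int \pi_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau \\
  &= \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau
     && \text{since } \nabla_\theta \pi_\theta = \pi_\theta\, \nabla_\theta \log \pi_\theta \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
     && \text{the dynamics terms in } \log \pi_\theta(\tau) \text{ do not depend on } \theta
\end{align*}
```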

Key Advantages of Policy Gradient Theorem

  • Directly optimizes the policy without needing a value function.
  • Works well for stochastic or continuous action spaces.
  • Facilitates modeling of complex, parameterized policies.

Variants and Improvements

  • Baseline Subtraction:
    • Reduce variance by subtracting a baseline $b(s)$ from $Q^{\pi_\theta}(s, a)$ (see the snippet below):

      $$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \big(Q^{\pi_\theta}(s, a) - b(s)\big)\right]$$

    • A common choice for $b(s)$ is the value function $V^{\pi_\theta}(s)$.
  • Actor-Critic Methods:
    • Use a critic to estimate $Q^{\pi_\theta}(s, a)$ or $V^{\pi_\theta}(s)$ to reduce variance and stabilize training.
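
Continuing the Monte Carlo sketch above, a minimal way to apply baseline subtraction is to subtract the mean return of a batch (a constant baseline); a learned value estimate $V(s_t)$ would serve the same role. The helper below is an illustrative sketch, not a prescribed implementation.

```python
import numpy as np

def advantages_with_baseline(returns: np.ndarray) -> np.ndarray:
    """Subtract a simple constant baseline (the batch mean return) from G_t.

    The expected policy gradient is unchanged, but its variance is typically
    lower; a learned value function V(s_t) could be used as the baseline instead.
    """
    baseline = returns.mean()
    return returns - baseline

# Example: Monte Carlo returns from one batch, centered before the gradient step.
returns = np.array([1.2, 0.4, 2.0, 0.9])
print(advantages_with_baseline(returns))
```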