Variance Problem in REINFORCE

  • REINFORCE relies on Monte Carlo sampling to estimate the policy gradient (a code sketch follows this list):

    $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\big]$

  • The estimate of the return $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ (the cumulative discounted reward) can have high variance, especially when:

    • Rewards are sparse or delayed.
    • The episode length is long.
    • The policy explores random or suboptimal actions frequently.
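
To make the Monte Carlo nature of this estimate concrete, here is a minimal REINFORCE update sketch in PyTorch. It assumes `policy` is a network mapping states to action logits, `optimizer` covers its parameters, and `states`, `actions`, `rewards` come from a single episode; all names are illustrative rather than taken from a specific library.

```python
import torch
from torch.distributions import Categorical

def reinforce_update(policy, optimizer, states, actions, rewards, gamma=0.99):
    """One Monte Carlo policy-gradient update from a single sampled episode."""
    # Discounted returns G_t, computed backwards over the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # log pi_theta(a_t | s_t) for every step of the trajectory.
    logits = policy(torch.stack(states))
    log_probs = Categorical(logits=logits).log_prob(torch.tensor(actions))

    # The gradient of -sum_t log pi(a_t|s_t) * G_t is the REINFORCE estimate;
    # G_t comes from one sampled trajectory, which is the source of the variance.
    loss = -(log_probs * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```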

Why is high variance a problem?

  • High variance causes instability in training.
  • The policy parameters may oscillate or fail to converge.
  • Learning slows down as the policy struggles to distinguish between good and bad updates.

Intuition for the Variance Problem

  • Consider an environment where reward depends heavily on the final action in a long episode.
  • If actions earlier in the episode are unrelated to the reward, their gradient contributions are still scaled by noisy estimates of $G_t$, leading to inefficient learning.

Actor-Critic Methods

  • Actor: Parameterizes the policy $\pi_\theta(a \mid s)$.
  • Critic: Estimates the state-value function $V_w(s)$ or the action-value function $Q_w(s, a)$ (a minimal module sketch follows this list).
  • Training:
    • The actor updates the policy using policy gradients.
    • The critic updates the value estimates using temporal difference (TD) learning.
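
As a concrete picture of this actor/critic split, below is a minimal PyTorch module sketch with a shared body, an actor head producing policy logits, and a critic head producing $V_w(s)$. The class name, layer sizes, and discrete-action assumption are illustrative choices, not prescribed by the algorithm.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Actor head: policy logits pi_theta(a|s). Critic head: state value V_w(s)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.actor = nn.Linear(hidden, n_actions)   # policy logits
        self.critic = nn.Linear(hidden, 1)          # state-value estimate

    def forward(self, obs):
        h = self.body(obs)
        return self.actor(h), self.critic(h).squeeze(-1)
```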

A2C (Advantage Actor-Critic)

  • A2C is a reinforcement learning algorithm that combines actor-critic methods and uses the advantage function to improve training stability and efficiency.
  • In A2C, the actor learns the policy $\pi_\theta(a \mid s)$, while the critic learns the state-value function $V_w(s)$.
    • The advantage function is defined as:

      $A(s_t, a_t) = Q(s_t, a_t) - V(s_t),$

      which represents how much better the action $a_t$ is compared to the average action at state $s_t$.

    • The actor is updated using the policy gradient (a code sketch follows this list):

      $\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\big]$

    • The critic is updated using the temporal difference (TD) error:

      $\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$
  • Key Benefits of A2C:
    • Reduces variance by using the advantage function.
    • More stable than basic REINFORCE, as it incorporates a value function.
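
The sketch below puts these updates together for one A2C step, reusing the `ActorCritic` module above. It assumes batched tensors `s`, `a`, `r`, `s_next`, `done` and uses the one-step TD error $\delta_t$ as the advantage estimate, which is one common choice (n-step returns or GAE are also used in practice); the 0.5 weight on the critic loss is an arbitrary illustrative coefficient.

```python
import torch
from torch.distributions import Categorical

def a2c_update(model, optimizer, s, a, r, s_next, done, gamma=0.99):
    logits, value = model(s)                          # pi_theta(.|s) and V_w(s)
    with torch.no_grad():
        _, next_value = model(s_next)                 # V_w(s')
        td_target = r + gamma * next_value * (1.0 - done)

    advantage = td_target - value                     # A(s_t, a_t) ~ TD error delta_t

    dist = Categorical(logits=logits)
    actor_loss = -(dist.log_prob(a) * advantage.detach()).mean()  # policy gradient
    critic_loss = advantage.pow(2).mean()                         # squared TD error

    loss = actor_loss + 0.5 * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```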

A3C (Asynchronous Advantage Actor-Critic)

  • A3C extends A2C by using multiple asynchronous agents that update the global model in parallel.
  • These agents each interact with their own environment and compute gradients, updating the global parameters asynchronously.
  • The global network aggregates updates from multiple workers to improve training speed and stability.
  • Key Benefits of A3C:
    • Asynchronous updates from workers visiting different states decorrelate the gradients and reduce the risk of getting stuck in poor local optima, leading to faster convergence.
    • Workers can explore different parts of the environment simultaneously, leading to better generalization.
  • Architecture of A3C:
    • Each worker runs a separate instance of the environment and computes gradients based on its own experiences.
    • Gradients from each worker are asynchronously sent to a global network, which updates the shared parameters.
    • The global network pools the experience of multiple workers and therefore converges faster than a single worker (a structural sketch of this update scheme follows below).
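
Below is a structural sketch of this asynchronous scheme using torch.multiprocessing. The environment interaction and A2C loss are replaced by a dummy loss on random inputs so the example stays short; in a real worker that placeholder would be the advantage-based loss from the A2C sketch, computed on the worker's own environment instance. A plain shared SGD optimizer is used because an optimizer with per-parameter state (e.g., Adam) would also need that state placed in shared memory.

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def worker(worker_id, global_model, optimizer, steps=100):
    local_model = nn.Linear(4, 2)          # stand-in for the worker's local ActorCritic copy
    for _ in range(steps):
        # 1) Synchronize the local copy with the current global parameters.
        local_model.load_state_dict(global_model.state_dict())

        # 2) Placeholder for "collect a rollout in this worker's environment
        #    and compute the A2C loss" (here: a dummy loss on random inputs).
        loss = local_model(torch.randn(8, 4)).pow(2).mean()

        # 3) Compute fresh local gradients, copy them onto the shared global
        #    parameters, and apply them asynchronously.
        local_model.zero_grad()
        loss.backward()
        for local_p, global_p in zip(local_model.parameters(),
                                     global_model.parameters()):
            global_p.grad = local_p.grad
        optimizer.step()

if __name__ == "__main__":
    global_model = nn.Linear(4, 2)
    global_model.share_memory()            # parameters live in shared memory, visible to all workers
    optimizer = torch.optim.SGD(global_model.parameters(), lr=1e-3)
    workers = [mp.Process(target=worker, args=(i, global_model, optimizer))
               for i in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```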

How Does A3C Improve Over A2C?

  • Parallelism: A3C uses multiple agents (workers) running in parallel, allowing for asynchronous updates to the global model, which improves training efficiency and exploration.
  • Reduced Overfitting: Because each worker interacts with its own environment instance and collects different trajectories, A3C is less likely to overfit to a single, correlated stream of experience.