Variance Problem in REINFORCE

  • REINFORCE relies on Monte Carlo sampling to estimate the policy gradient (a code sketch follows this list):

    $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\big]$

  • The estimate of the return $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ (the cumulative discounted reward) can have high variance, especially when:

    • Rewards are sparse or delayed.
    • The episode length is long.
    • The policy explores random or suboptimal actions frequently.
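
To make the Monte Carlo nature of this estimate concrete, here is a minimal REINFORCE update sketch in PyTorch. It assumes `policy` is a network mapping states to action logits, `optimizer` covers its parameters, and `states`, `actions`, `rewards` come from a single episode; all names are illustrative rather than taken from a specific library.

```python
import torch
from torch.distributions import Categorical

def reinforce_update(policy, optimizer, states, actions, rewards, gamma=0.99):
    """One Monte Carlo policy-gradient update from a single sampled episode."""
    # Discounted returns G_t, computed backwards over the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # log pi_theta(a_t | s_t) for every step of the trajectory.
    logits = policy(torch.stack(states))
    log_probs = Categorical(logits=logits).log_prob(torch.tensor(actions))

    # The gradient of -sum_t log pi(a_t|s_t) * G_t is the REINFORCE estimate;
    # G_t comes from one sampled trajectory, which is the source of the variance.
    loss = -(log_probs * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```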

Why is high variance a problem?

  • High variance causes instability in training.
  • The policy parameters may oscillate or fail to converge.
  • Learning slows down as the policy struggles to distinguish between good and bad updates.

Intuition for the Variance Problem

  • Consider an environment where reward depends heavily on the final action in a long episode.
  • If actions earlier in the episode are unrelated to the reward, their gradient contributions are still scaled by noisy estimates of $G_t$, leading to inefficient learning.

Actor-Critic Methods

  • Actor: Parameterizes the policy $\pi_\theta(a \mid s)$.
  • Critic: Estimates the state-value function $V_w(s)$ or the action-value function $Q_w(s, a)$ (a minimal module sketch follows this list).
  • Training:
    • The actor updates the policy using policy gradients.
    • The critic updates the value estimates using temporal difference (TD) learning.
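
As a concrete picture of this actor/critic split, below is a minimal PyTorch module sketch with a shared body, an actor head producing policy logits, and a critic head producing $V_w(s)$. The class name, layer sizes, and discrete-action assumption are illustrative choices, not prescribed by the algorithm.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Actor head: policy logits pi_theta(a|s). Critic head: state value V_w(s)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.actor = nn.Linear(hidden, n_actions)   # policy logits
        self.critic = nn.Linear(hidden, 1)          # state-value estimate

    def forward(self, obs):
        h = self.body(obs)
        return self.actor(h), self.critic(h).squeeze(-1)
```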

A2C (Advantage Actor-Critic)

  • A2C is a reinforcement learning algorithm that combines actor-critic methods and uses the advantage function to improve training stability and efficiency.
  • In A2C, the actor learns the policy $\pi_\theta(a \mid s)$, while the critic learns the state-value function $V_w(s)$.
    • The advantage function is defined as:

      $A(s_t, a_t) = Q(s_t, a_t) - V(s_t),$

      which represents how much better the action $a_t$ is compared to the average action at state $s_t$.

    • The actor is updated using the policy gradient (a code sketch follows this list):

      $\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\big]$

    • The critic is updated using the temporal difference (TD) error:

      $\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$
  • Key Benefits of A2C:
    • Reduces variance by using the advantage function.
    • More stable than basic REINFORCE, as it incorporates a value function.
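
The sketch below puts these updates together for one A2C step, reusing the `ActorCritic` module above. It assumes batched tensors `s`, `a`, `r`, `s_next`, `done` and uses the one-step TD error $\delta_t$ as the advantage estimate, which is one common choice (n-step returns or GAE are also used in practice); the 0.5 weight on the critic loss is an arbitrary illustrative coefficient.

```python
import torch
from torch.distributions import Categorical

def a2c_update(model, optimizer, s, a, r, s_next, done, gamma=0.99):
    logits, value = model(s)                          # pi_theta(.|s) and V_w(s)
    with torch.no_grad():
        _, next_value = model(s_next)                 # V_w(s')
        td_target = r + gamma * next_value * (1.0 - done)

    advantage = td_target - value                     # A(s_t, a_t) ~ TD error delta_t

    dist = Categorical(logits=logits)
    actor_loss = -(dist.log_prob(a) * advantage.detach()).mean()  # policy gradient
    critic_loss = advantage.pow(2).mean()                         # squared TD error

    loss = actor_loss + 0.5 * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```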

A3C (Asynchronous Advantage Actor-Critic)

  • A3C extends A2C by using multiple asynchronous agents that update the global model in parallel.
  • These agents each interact with their own environment and compute gradients, updating the global parameters asynchronously.
  • The global network aggregates updates from multiple workers to improve training speed and stability.
  • Key Benefits of A3C:
    • Asynchronous updates from workers visiting different states decorrelate the gradients and reduce the risk of getting stuck in poor local optima, leading to faster convergence.
    • Workers can explore different parts of the environment simultaneously, leading to better generalization.
  • Architecture of A3C:
    • Each worker runs a separate instance of the environment and computes gradients based on its own experiences.
    • Gradients from each worker are asynchronously sent to a global network, which updates the shared parameters.
    • The global network pools the experience of multiple workers and therefore converges faster than a single worker (a structural sketch of this update scheme follows below).
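
Below is a structural sketch of this asynchronous scheme using torch.multiprocessing. The environment interaction and A2C loss are replaced by a dummy loss on random inputs so the example stays short; in a real worker that placeholder would be the advantage-based loss from the A2C sketch, computed on the worker's own environment instance. A plain shared SGD optimizer is used because an optimizer with per-parameter state (e.g., Adam) would also need that state placed in shared memory.

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def worker(worker_id, global_model, optimizer, steps=100):
    local_model = nn.Linear(4, 2)          # stand-in for the worker's local ActorCritic copy
    for _ in range(steps):
        # 1) Synchronize the local copy with the current global parameters.
        local_model.load_state_dict(global_model.state_dict())

        # 2) Placeholder for "collect a rollout in this worker's environment
        #    and compute the A2C loss" (here: a dummy loss on random inputs).
        loss = local_model(torch.randn(8, 4)).pow(2).mean()

        # 3) Compute fresh local gradients, copy them onto the shared global
        #    parameters, and apply them asynchronously.
        local_model.zero_grad()
        loss.backward()
        for local_p, global_p in zip(local_model.parameters(),
                                     global_model.parameters()):
            global_p.grad = local_p.grad
        optimizer.step()

if __name__ == "__main__":
    global_model = nn.Linear(4, 2)
    global_model.share_memory()            # parameters live in shared memory, visible to all workers
    optimizer = torch.optim.SGD(global_model.parameters(), lr=1e-3)
    workers = [mp.Process(target=worker, args=(i, global_model, optimizer))
               for i in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```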

How Does A3C Improve Over A2C?

  • Parallelism: A3C uses multiple agents (workers) running in parallel, allowing for asynchronous updates to the global model, which improves training efficiency and exploration.
  • Reduced Overfitting: Because each worker interacts with its own environment instance and collects different trajectories, A3C is less likely to overfit to a single, correlated stream of experience.