Actor-Critic Methods

  • A family of methods that combine policy-based and value-based approaches to reduce the variance of the REINFORCE algorithm.
  • Actor: Parameterizes the policy $\pi_\theta(a \mid s)$ with parameters $\theta$.
  • Critic: Estimates the value function $V_w(s)$ or action-value function $Q_w(s, a)$ with parameters $w$.
  • Training (see the code sketch after this list):
    • The actor updates the policy using policy gradients.
    • The critic updates the value estimates using temporal difference (TD) learning.
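
Putting the two roles together, a minimal one-step actor-critic update might look like the following sketch (PyTorch). The network sizes, learning rates, discrete-action setting, and the use of the TD error as the weight on the policy gradient are illustrative assumptions, not details from these notes.

```python
# Minimal one-step actor-critic update (illustrative sketch, PyTorch).
# Sizes, learning rates, and the (s, a, r, s', done) transition format are assumptions.
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions, gamma = 4, 2, 0.99

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))   # pi_theta(a|s)
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))          # V_w(s)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_update(s, a, r, s_next, done):
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # Critic: TD learning -- regress V_w(s) toward the one-step TD target.
    v = critic(s)
    with torch.no_grad():
        td_target = r + gamma * critic(s_next) * (1.0 - float(done))
    td_error = td_target - v
    critic_loss = td_error.pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: policy gradient, weighting log pi_theta(a|s) by the TD error
    # (a simple estimate of the advantage).
    dist = Categorical(logits=actor(s))
    actor_loss = -(dist.log_prob(torch.as_tensor(a)) * td_error.detach()).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```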

A2C (Advantage Actor-Critic)

  • A2C is a reinforcement learning algorithm that combines actor-critic methods and uses the advantage function to improve training stability and efficiency.
  • In A2C, the actor learns the policy $\pi_\theta(a \mid s)$, while the critic learns the value function $V_w(s)$.
    • The advantage function is defined as:

      $$A(s_t, a_t) = Q(s_t, a_t) - V(s_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t)$$

    which represents how much better the action $a_t$ is compared to the average action at state $s_t$.

    • The actor is updated using the policy gradient:

      $$\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\big]$$

    • The critic is updated using the temporal difference (TD) error, by minimizing $\delta_t^2$:

      $$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$$

    (Both updates are sketched in code after this list.)
    • Key Benefits of A2C:
      • Reduces variance by using the advantage function.
      • More stable than basic REINFORCE, as it incorporates a value function.
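
As a concrete illustration, the sketch below performs one A2C update over a short n-step rollout. The bootstrapped n-step return, the 0.5 weight on the value loss, and the 0.01 entropy bonus are common implementation choices assumed here; `actor`, `critic`, and the single shared `optimizer` are placeholders for the networks described above.

```python
# One A2C update over an n-step rollout (illustrative sketch, PyTorch).
# Assumes: discrete actions, an `actor` mapping states to logits, a `critic`
# mapping states to V(s), and one optimizer over both networks' parameters.
import torch

def a2c_update(actor, critic, optimizer, states, actions, rewards,
               last_state, done, gamma=0.99):
    # states: [T, obs_dim] float tensor, actions: [T] long tensor, rewards: list of T floats
    values = critic(states).squeeze(-1)                    # V(s_t) for each step, shape [T]

    # Bootstrapped discounted returns: R_t = r_t + gamma * R_{t+1}, with R_T = V(s_T).
    with torch.no_grad():
        R = 0.0 if done else critic(last_state).item()
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns, dtype=torch.float32)

    # Advantage estimate: A(s_t, a_t) ~ R_t - V(s_t); detached so it only scales the actor term.
    advantages = returns - values.detach()

    dist = torch.distributions.Categorical(logits=actor(states))
    policy_loss = -(dist.log_prob(actions) * advantages).mean()   # actor: policy gradient
    value_loss = (returns - values).pow(2).mean()                 # critic: regression to returns
    entropy_bonus = dist.entropy().mean()                         # common exploration bonus

    loss = policy_loss + 0.5 * value_loss - 0.01 * entropy_bonus
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```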

A3C (Asynchronous Advantage Actor-Critic)

  • A3C extends A2C by using multiple asynchronous agents that update the global model in parallel.
  • These agents each interact with their own environment and compute gradients, updating the global parameters asynchronously.
  • The global network aggregates updates from multiple workers to improve training speed and stability.
  • Key Benefits of A3C:
    • Asynchronous updates from workers with different experiences decorrelate gradients and reduce the risk of getting stuck in poor local minima, leading to faster convergence.
    • It can explore different parts of the environment simultaneously, leading to better generalization.
  • Architecture of A3C:
    • Each worker runs a separate instance of the environment and computes gradients based on its own experiences.
    • Gradients from each worker are asynchronously sent to a global network, which updates the shared parameters.
    • By pooling the experience of many workers, the global network converges faster than a single worker could (a minimal multi-process sketch follows this list).
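
A minimal sketch of this worker/global-network structure is shown below, assuming PyTorch's `torch.multiprocessing` with Hogwild-style shared parameters. The tiny `ActorCritic` module, the worker count, and the dummy `rollout_loss` are illustrative stand-ins for the real per-worker environment interaction and A2C loss; a full implementation would typically also keep the optimizer state in shared memory.

```python
# Illustrative A3C skeleton (not a full implementation): several worker processes
# compute gradients locally and apply them asynchronously to a shared global model.
import torch
import torch.nn as nn
import torch.multiprocessing as mp

class ActorCritic(nn.Module):
    """Tiny shared-trunk actor-critic network; sizes are placeholder assumptions."""
    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.policy_head = nn.Linear(64, n_actions)
        self.value_head = nn.Linear(64, 1)

    def forward(self, x):
        h = self.trunk(x)
        return self.policy_head(h), self.value_head(h)

def rollout_loss(model, worker_id):
    # Stand-in for "run this worker's own environment and compute the A2C loss"
    # (see the A2C update sketch above); here it is just a dummy forward pass.
    logits, value = model(torch.randn(1, 4))
    return value.pow(2).mean() - logits.log_softmax(-1).mean()

def worker(global_model, optimizer, worker_id, n_updates=100):
    local_model = ActorCritic()
    for _ in range(n_updates):
        local_model.load_state_dict(global_model.state_dict())  # sync from global weights
        loss = rollout_loss(local_model, worker_id)
        local_model.zero_grad()          # clear this worker's own gradient buffers
        loss.backward()
        # Copy this worker's gradients onto the shared global parameters, then step.
        for local_p, global_p in zip(local_model.parameters(), global_model.parameters()):
            global_p.grad = local_p.grad
        optimizer.step()

if __name__ == "__main__":
    global_model = ActorCritic()
    global_model.share_memory()          # place the global parameters in shared memory
    optimizer = torch.optim.Adam(global_model.parameters(), lr=1e-4)
    workers = [mp.Process(target=worker, args=(global_model, optimizer, i)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```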

How Does A3C Improve Over A2C?

  • Parallelism: A3C uses multiple agents (workers) running in parallel, allowing for asynchronous updates to the global model, which improves training efficiency and exploration.
  • Reduced Overfitting: Because each worker interacts with its own instance of the environment, A3C is less likely to overfit to the experience of any single environment instance.

IMPALA