Actor-Critic Methods
- Combines policy-based and value-based methods to reduce the variance of the REINFORCE algorithm.
- Actor: Parameterizes the policy $\pi_\theta(a \mid s)$.
- Critic: Estimates the value function $V_\phi(s)$ or action-value function $Q_\phi(s, a)$.
- Training:
- The actor updates the policy using policy gradients.
- The critic updates the value estimates using temporal difference (TD) learning (see the sketch below).
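A minimal PyTorch sketch of this actor/critic split may help make the two updates concrete. The network sizes, the `actor_critic_step` helper, and the single-transition TD update are illustrative assumptions, not something prescribed by these notes:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Parameterizes the policy pi_theta(a|s) as a categorical distribution."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class Critic(nn.Module):
    """Estimates the state-value function V_phi(s)."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def actor_critic_step(actor, critic, opt_actor, opt_critic,
                      s, a, r, s_next, done, gamma=0.99):
    """One update on a single transition (s, a, r, s', done); `done` is a 0/1 float tensor."""
    v = critic(s)
    with torch.no_grad():
        td_target = r + gamma * critic(s_next) * (1.0 - done)
    td_error = td_target - v                              # TD error delta

    critic_loss = td_error.pow(2).mean()                  # critic: TD learning
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    log_prob = actor(s).log_prob(a)
    actor_loss = -(log_prob * td_error.detach()).mean()   # actor: policy gradient
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
```

Note that the TD error is detached before it scales the policy gradient, so the critic is trained only through its squared-error loss and the actor only through the log-probability term.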
A2C (Advantage Actor-Critic)
- A2C is a reinforcement learning algorithm that builds on the actor-critic framework and uses the advantage function to improve training stability and efficiency.
- In A2C, the actor learns the policy $\pi_\theta(a \mid s)$, while the critic learns the value function $V_\phi(s)$.
- The advantage function is defined as $A(s, a) = Q(s, a) - V(s)$, which represents how much better action $a$ is compared to the average action at state $s$.
- The actor is updated using the policy gradient $\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, A(s, a)\big]$.
- The critic is updated using the temporal difference (TD) error $\delta = r + \gamma V_\phi(s') - V_\phi(s)$, typically by minimizing $\delta^2$ (a code sketch follows this list).
- Key Benefits of A2C:
- Reduces variance by using the advantage function.
- More stable than basic REINFORCE, as it incorporates a value function.
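As a rough sketch of how these pieces combine, the A2C losses can be written as below, reusing the hypothetical `Actor`/`Critic` modules from the earlier snippet. The one-step advantage estimate and the coefficient values are simplifying assumptions; practical A2C implementations usually use n-step returns, but the structure of the losses is the same:

```python
import torch
import torch.nn.functional as F

def a2c_losses(actor, critic, obs, actions, rewards, next_obs, dones,
               gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """Combined A2C loss for a batch of transitions.

    The advantage A(s, a) is estimated with the one-step TD error
    r + gamma * V(s') - V(s).
    """
    values = critic(obs)
    with torch.no_grad():
        targets = rewards + gamma * critic(next_obs) * (1.0 - dones)
    advantages = targets - values

    dist = actor(obs)
    log_probs = dist.log_prob(actions)

    policy_loss = -(log_probs * advantages.detach()).mean()  # actor term
    value_loss = F.mse_loss(values, targets)                 # critic term
    entropy_bonus = dist.entropy().mean()                    # encourages exploration

    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
```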
A3C (Asynchronous Advantage Actor-Critic)
- A3C extends A2C by using multiple asynchronous agents that update the global model in parallel.
- Each agent interacts with its own copy of the environment, computes gradients locally, and pushes updates to the global parameters asynchronously.
- The global network aggregates updates from multiple workers to improve training speed and stability.
- Key Benefits of A3C:
- Asynchronous updates from independent workers decorrelate the gradients and reduce the risk of getting stuck in poor local optima, leading to faster convergence.
- Workers can explore different parts of the environment simultaneously, leading to better generalization.
- Architecture of A3C:
- Each worker runs a separate instance of the environment and computes gradients based on its experiences.
- Gradients from each worker are asynchronously sent to a global network, which updates the shared parameters.
- By aggregating the experience of many workers, the global network converges faster than a single worker could (a worker sketch follows this list).
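The loop below sketches this worker architecture with Python threads. Here `make_env`, `make_model`, and `compute_loss` are assumed helper callables (for example, the A2C loss above wrapped around a short rollout), not part of any standard API:

```python
import threading

def a3c_worker(worker_id, global_model, global_opt, make_env, make_model,
               compute_loss, n_updates=1000):
    """One A3C worker: its own environment, its own local copy of the model,
    and asynchronous gradient pushes to the shared global model."""
    env = make_env()                  # each worker owns its environment instance
    local_model = make_model()

    for _ in range(n_updates):
        # 1. Sync local parameters with the current global parameters.
        local_model.load_state_dict(global_model.state_dict())

        # 2. Roll out a few steps and compute a scalar loss on the local copy
        #    (compute_loss is an assumed helper, e.g. the A2C loss above).
        loss = compute_loss(local_model, env)

        # 3. Backprop locally, copy gradients into the global model, and step
        #    the shared optimizer without locking (asynchronous updates).
        global_opt.zero_grad()
        loss.backward()
        for lp, gp in zip(local_model.parameters(), global_model.parameters()):
            gp.grad = lp.grad.clone() if lp.grad is not None else None
        global_opt.step()

def train_a3c(global_model, global_opt, make_env, make_model, compute_loss,
              n_workers=8):
    """Launch several asynchronous workers that all update `global_model`."""
    threads = [
        threading.Thread(target=a3c_worker,
                         args=(i, global_model, global_opt, make_env,
                               make_model, compute_loss))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

In practice A3C is usually run with separate processes sharing the model's memory rather than threads, since Python's GIL limits thread-level parallelism; the structure of the sync-compute-push loop is the same either way.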
How Does A3C Improve Over A2C?
- Parallelism: A3C uses multiple agents (workers) running in parallel, allowing for asynchronous updates to the global model, which improves training efficiency and exploration.
- Reduced Overfitting: Because different workers gather decorrelated experience from their own environment instances, A3C reduces the risk of overfitting to a single stream of experience.
IMPALA