Variance Problem in REINFORCE
REINFORCE relies on Monte Carlo sampling to estimate the policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

The estimate of $G_t$ (the cumulative discounted reward from time step $t$) can have high variance, especially when:
- Rewards are sparse or delayed.
- The episode length is long.
- The policy explores random or suboptimal actions frequently.
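To make the estimator concrete, here is a minimal PyTorch sketch (the function name and the rollout format are illustrative assumptions, not part of the original notes). It turns one episode's logged log-probabilities and rewards into the surrogate loss whose gradient is the Monte Carlo estimate above; because each $G_t$ comes from a single sampled trajectory, the resulting gradient is exactly the noisy estimate discussed next.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Monte Carlo policy-gradient surrogate loss for one episode.

    log_probs: list of log pi_theta(a_t | s_t) tensors collected during the rollout.
    rewards:   list of scalar rewards r_t from the same rollout.
    """
    # Compute the return G_t = r_t + gamma * r_{t+1} + ... for every time step.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # Surrogate loss whose gradient is the REINFORCE estimator:
    # -sum_t log pi_theta(a_t | s_t) * G_t
    return -(torch.stack(log_probs) * returns).sum()
```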
Why is high variance a problem?
- High variance causes instability in training.
- The policy parameters may oscillate or fail to converge.
- Learning slows down as the policy struggles to distinguish between good and bad updates.
Intuition for the Variance Problem
- Consider an environment where reward depends heavily on the final action in a long episode.
- If actions earlier in the episode are unrelated to the reward, their gradients still get updated based on noisy estimates of $G_t$, leading to inefficient learning.
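As a quick numerical illustration of this point (the reward distribution is a made-up stand-in, assuming i.i.d. noisy per-step rewards), the variance of a sampled undiscounted return grows roughly linearly with the episode length, so every action's gradient term gets scaled by an increasingly noisy number:

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_return(episode_length, gamma=1.0):
    # Each step's reward is noisy; the return sums all of that noise.
    rewards = rng.normal(loc=0.1, scale=1.0, size=episode_length)
    return sum(gamma**t * r for t, r in enumerate(rewards))

for T in (10, 100, 1000):
    samples = [sampled_return(T) for _ in range(5000)]
    print(f"T={T:4d}  var(G_0) ~ {np.var(samples):.1f}")
# The variance of the undiscounted return grows roughly linearly with T,
# so the gradient attached to every action becomes noisier as well.
```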
Actor-Critic Methods
- Actor: Parameterizes the policy $\pi_\theta(a \mid s)$.
- Critic: Estimates the value function $V(s)$ or the action-value function $Q(s, a)$.
- Training:
- The actor updates the policy using policy gradients.
- The critic updates the value estimates using temporal difference (TD) learning.
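A minimal PyTorch sketch of this division of labor (network sizes, the discrete-action assumption, and all names are illustrative): the actor and critic are two small networks, and the critic is fitted by TD(0) toward the bootstrapped target $r + \gamma V(s')$. The actor's policy-gradient step is shown in the A2C sketch further below.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Parameterizes the policy pi_theta(a | s) as a categorical distribution."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class Critic(nn.Module):
    """Estimates the state-value function V(s)."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def critic_td_step(critic, optimizer, obs, reward, next_obs, done, gamma=0.99):
    """One TD(0) update: pull V(s) toward the bootstrapped target r + gamma * V(s')."""
    with torch.no_grad():
        target = reward + gamma * critic(next_obs) * (1.0 - done)
    loss = (critic(obs) - target).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```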
A2C (Advantage Actor-Critic)
- A2C is an actor-critic reinforcement learning algorithm that uses the advantage function to improve training stability and efficiency.
- In A2C, the actor learns the policy $\pi_\theta(a \mid s)$, while the critic learns the value function $V(s)$.
- The advantage function is defined as:

$$A(s, a) = Q(s, a) - V(s)$$

which represents how much better the action $a$ is compared to the average action at state $s$.
- The actor is updated using the policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A(s, a)\right]$$

- The critic is updated using the temporal difference (TD) error (a sketch combining both updates follows the list below):

$$\delta = r + \gamma V(s') - V(s)$$
- Key Benefits of A2C:
- Reduces variance by using the advantage function.
- More stable than basic REINFORCE, as it incorporates a value function.
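Putting the two update rules above together, here is a hedged sketch of one A2C update on a batch of transitions, reusing the illustrative `Actor`/`Critic` modules from the earlier sketch (the function name and batch format are assumptions). The one-step advantage estimate coincides with the TD error $\delta = r + \gamma V(s') - V(s)$, which is why the same quantity drives both losses; `detach()` keeps the actor's update from backpropagating into the critic.

```python
import torch

def a2c_update(actor, critic, actor_opt, critic_opt,
               obs, actions, rewards, next_obs, dones, gamma=0.99):
    """One A2C update on a batch of transitions (s, a, r, s', done).

    The one-step advantage estimate coincides with the TD error:
        A(s, a) ~ delta = r + gamma * V(s') - V(s)
    """
    values = critic(obs)
    with torch.no_grad():
        next_values = critic(next_obs)
    td_target = rewards + gamma * next_values * (1.0 - dones)
    advantage = td_target - values          # the TD error, used in both losses

    # Critic: minimize the squared TD error.
    critic_loss = advantage.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: policy gradient weighted by the (detached) advantage.
    log_probs = actor(obs).log_prob(actions)
    actor_loss = -(log_probs * advantage.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return actor_loss.item(), critic_loss.item()
```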
A3C (Asynchronous Advantage Actor-Critic)
- A3C extends A2C by using multiple asynchronous agents that update the global model in parallel.
- These agents each interact with their own environment and compute gradients, updating the global parameters asynchronously.
- The global network aggregates updates from multiple workers to improve training speed and stability.
- Key Benefits of A3C:
- Asynchronous updates decorrelate the gradients and reduce the risk of getting stuck in local minima, leading to faster convergence.
- It can explore different parts of the environment simultaneously, leading to better generalization.
- Architecture of A3C:
- Each worker runs a separate instance of the environment and computes gradients based on its experiences.
- Gradients from each worker are asynchronously sent to a global network, which updates the shared parameters.
- By aggregating the experience of multiple workers, the global network converges faster than any single worker would on its own.
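The worker/global flow just described can be sketched as follows. This is a thread-based, single-process stand-in: the actual A3C setup uses separate asynchronous processes, and every function name here is an illustrative assumption. Each worker pulls the latest global parameters, computes gradients on its own rollout, and writes those gradients onto the global model before stepping the shared optimizer.

```python
import threading

def a3c_worker(global_model, optimizer, make_model, make_env, compute_loss,
               n_updates=1000):
    """One asynchronous worker in a thread-based A3C-style sketch.

    make_model:   builds a network with the same architecture as global_model
    make_env:     builds this worker's own copy of the environment
    compute_loss: rolls out the local model in the env and returns the
                  actor + critic loss (e.g. the A2C-style loss sketched above)
    """
    env = make_env()
    local_model = make_model()
    for _ in range(n_updates):
        # 1. Pull the latest global parameters into the local copy.
        local_model.load_state_dict(global_model.state_dict())

        # 2. Collect experience and compute the loss with the local copy only.
        loss = compute_loss(local_model, env)

        # 3. Backpropagate locally, then copy the gradients onto the global model.
        local_model.zero_grad()
        loss.backward()
        for local_p, global_p in zip(local_model.parameters(),
                                     global_model.parameters()):
            if local_p.grad is not None:
                global_p.grad = local_p.grad.clone()

        # 4. Step the shared optimizer asynchronously (no lock is taken).
        optimizer.step()
        optimizer.zero_grad()

def launch_workers(global_model, optimizer, make_model, make_env, compute_loss,
                   n_workers=4):
    """Start several workers that all update the same global model in parallel."""
    threads = [threading.Thread(target=a3c_worker,
                                args=(global_model, optimizer, make_model,
                                      make_env, compute_loss))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Here `compute_loss` stands in for a rollout plus the A2C-style loss sketched earlier; because workers write their gradients and step the shared optimizer independently, the global parameters receive interleaved, asynchronous updates.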
How Does A3C Improve Over A2C?
- Parallelism: A3C uses multiple agents (workers) running in parallel, allowing for asynchronous updates to the global model, which improves training efficiency and exploration.
- Reduced Overfitting: Because each worker interacts with its own instance of the environment and follows different trajectories, A3C reduces the likelihood of overfitting to any single environment's stream of experience.