What is Q-Learning?
Q-Learning is an off-policy, value-based method that uses a TD approach to train its action-value function.
Remember
- Value-based method: find the optimal policy indirectly by training a value function (or action-value function) that tells us the value of each state (or of each state-action pair).
- TD approach: update the value function after each time step, instead of waiting until the end of the episode.
What is the difference between value and reward?
- Reward: The immediate reward received after taking an action in a state.
- Value: The expected cumulative reward from taking an action in a state and following a policy thereafter.
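To make the distinction concrete, here is a tiny sketch with a made-up trajectory and an illustrative discount factor of 0.9: the reward is a single number received at one step, while the value corresponds to the expected discounted sum of all future rewards (here computed for a single trajectory).

```python
gamma = 0.9                      # discount factor (illustrative choice)
rewards = [1.0, 0.0, 0.0, 2.0]   # hypothetical rewards along one trajectory

# Reward: just the number received at the first step.
immediate_reward = rewards[0]

# Return: r_1 + gamma * r_2 + gamma^2 * r_3 + ...
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))

print(immediate_reward)    # 1.0
print(discounted_return)   # 1.0 + 0.9**3 * 2.0 = 2.458
```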
Representing the Q-Function
- The Q-function is a table (Q-table) that maps state-action pairs to values.
- It tells us the value of taking a specific action in a given state.
- Initially, the Q-table is uninformative (all values are zero or arbitrary), but as the agent explores the environment and updates the table, it becomes a better and better approximation of the optimal Q-function.
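As a concrete picture, here is a minimal sketch of a Q-table for a small, hypothetical environment (16 states and 4 actions are assumed purely for illustration):

```python
import numpy as np

# Hypothetical sizes: 16 discrete states, 4 discrete actions.
n_states, n_actions = 16, 4

# The Q-table maps every (state, action) pair to an estimated value.
# Initially uninformative: all entries are zero.
q_table = np.zeros((n_states, n_actions))

# Q(s=3, a=2): the estimated value of taking action 2 in state 3.
print(q_table[3, 2])  # 0.0 before any learning
```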
Why does Q-Learning allow us to learn the optimal policy?
- By training the Q-function, represented as a Q-table, we end up with the optimal Q-values; the optimal policy then follows directly by choosing, in each state, the action with the highest Q-value.
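A small sketch of that derivation, reusing the hypothetical 16x4 Q-table from the previous sketch: the greedy policy is just an argmax over the Q-values of the current state.

```python
import numpy as np

# Hypothetical Q-table, as in the previous sketch.
q_table = np.zeros((16, 4))

def greedy_policy(q_table, state):
    """Pick the action with the highest estimated value in this state."""
    return int(np.argmax(q_table[state]))

action = greedy_policy(q_table, state=3)
```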
The Q-Learning Algorithm
- Step 1: Initialize the Q-table:
- Set $Q(s, a) = 0$ for all states $s$ and all actions $a$.
- Ensure $Q(\text{terminal state}, \cdot) = 0$.
- Step 2: Choose an action using the epsilon-greedy strategy:
- Initialize $\epsilon$ to a high value (e.g. $\epsilon = 1.0$, pure exploration at the start).
- Exploration: With probability $\epsilon$, pick a random action.
- Exploitation: With probability $1 - \epsilon$, pick the action with the highest Q-value in the current state.
- Step 3: Perform the action $A_t$:
- Observe the reward $R_{t+1}$ and the next state $S_{t+1}$.
- Step 4: Update $Q(S_t, A_t)$ using the Bellman equation:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \big]$$
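Putting the four steps together, here is a minimal tabular Q-Learning sketch. It assumes a Gymnasium-style environment with discrete states and actions; the hyperparameters (learning rate, discount, epsilon schedule, episode count) are illustrative choices, not values prescribed above.

```python
import numpy as np

def q_learning(env, n_episodes=1000, alpha=0.1, gamma=0.99,
               eps_start=1.0, eps_min=0.05, eps_decay=0.995):
    # Step 1: initialize the Q-table to zeros.
    q_table = np.zeros((env.observation_space.n, env.action_space.n))
    epsilon = eps_start

    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Step 2: epsilon-greedy action selection.
            if np.random.rand() < epsilon:
                action = env.action_space.sample()          # explore
            else:
                action = int(np.argmax(q_table[state]))      # exploit

            # Step 3: perform the action, observe reward and next state.
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Step 4: TD update toward the Bellman target
            # r + gamma * max_a' Q(s', a')  (no bootstrap on terminal states).
            td_target = reward + gamma * np.max(q_table[next_state]) * (not terminated)
            q_table[state, action] += alpha * (td_target - q_table[state, action])

            state = next_state

        # Gradually shift from exploration to exploitation.
        epsilon = max(eps_min, epsilon * eps_decay)

    return q_table
```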
DQN
Why do we use Deep Q-Learning?
- The Q-table used in traditional Q-Learning becomes impractical for large state spaces: the table grows too large to store, and most entries would never be visited.
- Deep Q-Learning replaces the Q-table with a neural network that approximates the Q-function, enabling efficient handling of large state spaces.
Deep Q-Learning Architecture & Algorithm
- Input: A stack of 4 frames passed through a convolutional neural network (CNN).
Why 4 frames?
- A single frame carries no temporal information; stacking 4 frames lets the network infer motion (the direction and speed of moving objects).
- Output: A vector of Q-values, one for each possible action.
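A sketch of such a network in PyTorch, loosely following the classic Atari-style DQN architecture; the exact layer sizes and the 84x84 frame resolution are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 frames, shape (batch, 4, 84, 84), to one Q-value per action."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # output: a vector of Q-values
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # Scale raw pixel values to [0, 1] before the convolutions.
        return self.head(self.conv(frames / 255.0))
```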
Loss Function: MSE
- Minimize the mean squared error between predicted and target Q-values.
- Use the Bellman equation to calculate the target Q-values.
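A sketch of this loss in PyTorch, assuming the hypothetical QNetwork class from the previous sketch; `q_net` is the online network and `target_net` supplies the Bellman targets (it can be the same network, or the separate target network introduced under Fixed Q-Targets below).

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones,
             gamma=0.99):
    # Predicted Q(s, a) for the actions that were actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman target: r + gamma * max_a' Q_target(s', a'),
    # with no bootstrapping on terminal transitions.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1.0 - dones)

    # MSE between predicted and target Q-values.
    return F.mse_loss(q_pred, q_target)
```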
Training Deep Q-Learning
- Step 1: Sampling
- Perform actions in the environment.
- Store observed experience tuples in a replay memory.
- Step 2: Training
- Randomly sample a small batch of experience tuples from replay memory.
- Use gradient descent to update the Q-network.
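A sketch of this sampling/training loop, reusing the hypothetical QNetwork and dqn_loss from the sketches above; the buffer capacity, batch size, and optimizer settings are assumed values.

```python
import random
from collections import deque

import numpy as np
import torch

# Assumed setup, reusing QNetwork and dqn_loss from the sketches above.
n_actions = 4                                   # hypothetical action count
q_net = QNetwork(n_actions)
target_net = QNetwork(n_actions)
target_net.load_state_dict(q_net.state_dict())  # start with identical weights

replay_memory = deque(maxlen=100_000)           # capacity is an assumed value
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def store(state, action, reward, next_state, done):
    # Step 1 (sampling): append the observed experience tuple.
    replay_memory.append((state, action, reward, next_state, done))

def train_step(batch_size=32, gamma=0.99):
    if len(replay_memory) < batch_size:
        return  # not enough experience collected yet

    # Step 2 (training): sample a random mini-batch of experience tuples.
    batch = random.sample(replay_memory, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)

    states = torch.as_tensor(np.array(states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Gradient descent on the MSE loss defined above.
    loss = dqn_loss(q_net, target_net, states, actions, rewards,
                    next_states, dones, gamma=gamma)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```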
Improvements to Deep Q-Learning
- Experience Replay:
- Store experience tuples in a replay memory so they can be reused across many updates.
- Sample uniformly at random during training to break correlations between consecutive experiences.
- Fixed Q-Targets:
- Use two networks:
- An online Q-network, updated at every training step.
- A separate target network, used only to compute the TD targets.
- Copy the Q-network's weights into the target network less frequently, which keeps the targets stable between updates (see the sketch after this list).
- Double Deep Q-Learning:
- Mitigate overestimation of Q-values by decoupling action selection from evaluation: the online Q-network selects the best next action, and the target network evaluates that action's value (illustrated in the sketch below).
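A sketch of how these two improvements change the target computation, again assuming QNetwork-style online and target networks as in the earlier sketches: fixed Q-targets are implemented by syncing the target network only occasionally, and the Double DQN target lets the online network select the action while the target network evaluates it.

```python
import torch

# Fixed Q-targets: instead of recomputing targets with the constantly
# changing online network, copy its weights into the target network only
# every C training steps (C is a hyperparameter).
def sync_target(q_net, target_net):
    target_net.load_state_dict(q_net.state_dict())

# Double DQN target: the online network selects the best next action,
# while the target network evaluates that action's value, which reduces
# overestimation of Q-values.
def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        q_next = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * q_next * (1.0 - dones)
```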