What is Q-Learning?

Q-Learning is an off-policy, value-based method that uses a TD approach to train its action-value function.


  • Recall (value-based method): find the optimal policy indirectly by training a value function or an action-value function that gives the value of each state or of each state-action pair.
  • Recall (TD approach): update the value function after each time step, instead of at the end of the episode.

What is the difference between value and reward?

  • Reward: The immediate feedback signal received after taking an action in a state.
  • Value: The expected cumulative reward from taking an action in a state and following a policy thereafter.
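
To make the distinction concrete, here is a minimal sketch that compares the first reward of an episode with the cumulative reward that follows from it. The reward sequence and the discount factor `gamma` are assumptions made for illustration; the notes above do not fix either. A value is the expectation of such returns, so a single return is only one sample of it.

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative reward of one episode, discounting each later step by gamma."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: nothing happens until the goal is reached on step 3.
rewards = [0.0, 0.0, 1.0]

immediate_reward = rewards[0]                           # what the agent sees right now
episode_return = discounted_return(rewards, gamma=0.9)  # approximately 0.81
```

The immediate reward is 0, yet the return from the same starting point is positive, which is why a value-based agent can prefer an action that pays nothing right away.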

Representing the Q-Function

  • In Q-Learning, the Q-function is stored as a table (Q-table) that maps state-action pairs to values.
  • It tells us the value of taking a specific action in a given state.
  • Initially, the Q-table is uninformative (all values are zero or random), but as the agent explores the environment, the Q-table is updated and its entries approach the optimal action values.
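
As a rough illustration, a Q-table for a small discrete problem can be held in a nested list with one row per state and one column per action. The sizes and the helper name `q_value` below are hypothetical.

```python
n_states, n_actions = 4, 2  # hypothetical problem size

# One row per state, one column per action. All zeros is the uninformative start.
q_table = [[0.0 for _ in range(n_actions)] for _ in range(n_states)]


def q_value(state, action):
    """Look up the current estimate for a state-action pair."""
    return q_table[state][action]


# Pretend that learning has raised the estimate for action 1 in state 2.
q_table[2][1] = 0.5
```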

Why does Q-Learning allow us to learn the optimal policy?

  • By training the Q-function, represented as a Q-table, we obtain an estimate of the value of every state-action pair. The optimal policy then follows directly: in each state, pick the action with the highest Q-value.
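
A minimal sketch of reading a policy out of a Q-table: for each state, take the index of the largest entry in that state's row. The table values and the helper name `greedy_action` are made up for the example.

```python
def greedy_action(q_row):
    """Index of the highest-valued action; ties go to the lowest index."""
    return max(range(len(q_row)), key=lambda action: q_row[action])


# Hypothetical Q-table with 3 states and 2 actions.
q_table = [
    [0.1, 0.7],
    [0.4, 0.2],
    [0.0, 0.0],
]

# The greedy policy maps each state to its best action.
policy = [greedy_action(row) for row in q_table]  # [1, 0, 0]
```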

The Q-Learning Algorithm

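  • Step 1: Initialize the Q-table:
    • Set $Q(s, a)$ arbitrarily (for example, to 0) for all $s \in S$ and $a \in A(s)$.
    • Ensure $Q(\text{terminal}, \cdot) = 0$.
  • Step 2: Choose an action using the epsilon-greedy strategy:
    • Initialize $\epsilon$ (for example, $\epsilon = 1.0$).
    • Exploration: With probability $\epsilon$, pick a random action.
    • Exploitation: With probability $1 - \epsilon$, pick the best action from the Q-table.
  • Step 3: Perform the action $A_t$:
    • Observe reward $R_{t+1}$ and the next state $S_{t+1}$.
  • Step 4: Update the Q-value using the Bellman equation:

    $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

    where $\alpha$ is the learning rate and $\gamma$ is the discount factor.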
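
The four steps can be sketched end to end on a toy problem. Everything specific here is an assumption for illustration: the five-state corridor environment in `step`, the hyperparameter values, and the multiplicative epsilon decay schedule are not taken from the notes above.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (hypothetical corridor)
ACTIONS = [0, 1]      # 0 = move left, 1 = move right
ALPHA, GAMMA = 0.1, 0.9


def step(state, action):
    """Hypothetical environment: reward 1 only when the right end is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def choose_action(q_table, state, epsilon, rng):
    """Step 2: epsilon-greedy action selection."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)                 # exploration
    row = q_table[state]
    best = max(row)
    # Exploitation, breaking ties at random so early episodes do not get stuck.
    return rng.choice([a for a in ACTIONS if row[a] == best])


def train(episodes=500, epsilon=1.0, min_epsilon=0.05, decay=0.99, seed=0):
    rng = random.Random(seed)
    # Step 1: all zeros, so the terminal row is zero as well.
    q_table = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(q_table, state, epsilon, rng)
            next_state, reward, done = step(state, action)      # Step 3
            # Step 4: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            td_target = reward + GAMMA * max(q_table[next_state])
            q_table[state][action] += ALPHA * (td_target - q_table[state][action])
            state = next_state
        epsilon = max(min_epsilon, epsilon * decay)  # assumed decay schedule
    return q_table


q_table = train()
```

After training, the greedy action in every non-terminal state is "move right", and the learned values fall off by roughly a factor of `GAMMA` per step of distance from the goal.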


Why do we use Deep Q-Learning?

  • A Q-table needs one entry per state-action pair, so traditional Q-Learning becomes impractical for large state spaces.
  • Deep Q-Learning replaces the Q-table with a neural network that approximates the Q-function, enabling efficient handling of large state spaces.

Deep Q-Learning Architecture & Algorithm

  • Input: A stack of 4 frames passed through a convolutional neural network (CNN).

    Why 4 frames?

    • To capture motion: a single frame shows where objects are, but not their direction or speed.
  • Output: A vector of Q-values, one for each possible action.
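
The input side can be sketched with a small frame buffer that always returns the last four frames as one observation. This is only an illustration of the stacking idea: the class name `FrameStack` is hypothetical, and the frames are plain strings here, where a real agent would use image arrays fed to the CNN.

```python
from collections import deque


class FrameStack:
    """Keeps the most recent `size` frames and exposes them as one observation."""

    def __init__(self, size=4):
        self.size = size
        self.frames = deque(maxlen=size)

    def reset(self, first_frame):
        # Repeat the first frame so the observation has a fixed length from step 0.
        self.frames.clear()
        for _ in range(self.size):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # Appending to a full deque silently drops the oldest frame.
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStack(size=4)
obs = stack.reset("f0")
for name in ["f1", "f2"]:
    obs = stack.push(name)
# obs is now ["f0", "f0", "f1", "f2"]
```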

Loss Function: MSE

  • Minimize the mean squared error between predicted and target Q-values.
  • Use the Bellman equation to calculate the target Q-values.
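
A minimal sketch of the loss computation on a tiny batch, using plain Python lists in place of network outputs. The function names, the `done` flag for terminal transitions, and all numbers are assumptions for illustration.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """Bellman target: r + gamma * max_a' Q(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def mse_loss(predicted, targets):
    """Mean squared error between predicted and target Q-values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(predicted)


# Hypothetical batch of two transitions.
predicted_q = [1.0, 0.5]  # Q(s, a) predicted for the actions actually taken
targets = [
    td_target(reward=1.0, next_q_values=[0.0, 2.0], done=False, gamma=0.5),  # 2.0
    td_target(reward=0.5, next_q_values=[3.0, 1.0], done=True, gamma=0.5),   # 0.5
]
loss = mse_loss(predicted_q, targets)  # ((1.0 - 2.0)**2 + 0.0) / 2 = 0.5
```

In a real implementation the targets are treated as constants during the gradient step, so only the predicted Q-values are differentiated.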

Training Deep Q-Learning

  • Step 1: Sampling
    • Perform actions in the environment.
    • Store observed experience tuples (state, action, reward, next state) in a replay memory.
  • Step 2: Training
    • Randomly sample a small batch of experience tuples from replay memory.
    • Use gradient descent to update the Q-network.
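
The sampling side of this loop can be sketched with a bounded buffer and uniform random sampling. The class name `ReplayMemory`, its method names, and the extra `done` flag in each tuple are assumptions for the example; the gradient step itself is omitted.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer of experience tuples; oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored experiences.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Step 1 (sampling): store transitions as the agent acts.
memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.store(state=t, action=0, reward=0.0, next_state=t + 1, done=False)

# Step 2 (training): draw a small random batch for the network update.
batch = memory.sample(batch_size=2, rng=random.Random(0))
```

With a capacity of 3, only the three most recent transitions remain after five are stored, so the batch is drawn from those.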

Improvements to Deep Q-Learning

  • Experience Replay:
    • Store experiences in a replay memory.
    • Sample uniformly during training to reduce correlations in data.
  • Fixed Q-Targets:
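    • Use two networks:
      • Q-network (online network): trained at every step to predict Q-values.
      • Target network: a periodically synchronized copy of the Q-network, used to compute the target Q-values.
    • Update the target network less frequently, for example by copying the Q-network's weights into it every fixed number of steps, so the targets stay stable between updates.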
  • Double Deep Q-Learning:
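    • Mitigate overestimation of Q-values by decoupling action selection from action evaluation in the target: the Q-network selects the best next action, and the target network estimates that action's value.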
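
The difference between the two target computations, and the periodic target-network update, can be sketched with plain lists standing in for network outputs and weights. All function names and numbers below are hypothetical.

```python
def argmax(values):
    """Index of the largest value; ties go to the lowest index."""
    return max(range(len(values)), key=lambda i: values[i])


def dqn_target(reward, next_q_target, done, gamma):
    """Fixed Q-targets: the target network both picks and scores the next action."""
    return reward if done else reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, done, gamma):
    """Double DQN: the online network picks the action, the target network scores it."""
    if done:
        return reward
    best_action = argmax(next_q_online)
    return reward + gamma * next_q_target[best_action]


def maybe_sync(step, online_weights, target_weights, sync_every):
    """Copy the online weights into the target network every `sync_every` steps."""
    if step % sync_every == 0:
        target_weights[:] = online_weights
    return target_weights


# Hypothetical next-state Q-values from the two networks.
next_q_online = [1.0, 3.0]
next_q_target = [4.0, 2.0]

y_dqn = dqn_target(0.0, next_q_target, False, 0.5)                            # 2.0
y_double = double_dqn_target(0.0, next_q_online, next_q_target, False, 0.5)   # 1.0
```

In this example the target network's largest estimate (4.0) belongs to an action the online network does not prefer, so the Double DQN target is lower than the standard one.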