What is Q-Learning?
Q-Learning is an off-policy, value-based method that uses a TD approach to train its action-value function.
Remember
- Value-based method: find the optimal policy indirectly by training a value function (or action-value function) that tells us the value of each state (or of each state-action pair).
- TD approach: update the value function after each time step, instead of waiting until the end of the episode.
What is the difference between value and reward?
- Reward: The immediate reward received after taking an action in a state.
- Value: The expected cumulative reward from taking an action in a state and following a policy thereafter.
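To make the distinction concrete, here is a tiny sketch with a made-up trajectory and an illustrative discount factor of 0.9: the reward is a single number received at one step, while the value corresponds to the expected discounted sum of all future rewards (here computed for a single trajectory).

```python
gamma = 0.9                      # discount factor (illustrative choice)
rewards = [1.0, 0.0, 0.0, 2.0]   # hypothetical rewards along one trajectory

# Reward: just the number received at the first step.
immediate_reward = rewards[0]

# Return: r_1 + gamma * r_2 + gamma^2 * r_3 + ...
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))

print(immediate_reward)    # 1.0
print(discounted_return)   # 1.0 + 0.9**3 * 2.0 = 2.458
```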
Representing the Q-Function
- The Q-function is a table (Q-table) that maps state-action pairs to values.
- It tells us the value of taking a specific action in a given state.
- Initially, the Q-table is uninformative (all values are zero or arbitrary), but as the agent explores the environment and updates the table, it becomes a better and better approximation of the optimal Q-function.
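As a concrete picture, here is a minimal sketch of a Q-table for a small, hypothetical environment (16 states and 4 actions are assumed purely for illustration):

```python
import numpy as np

# Hypothetical sizes: 16 discrete states, 4 discrete actions.
n_states, n_actions = 16, 4

# The Q-table maps every (state, action) pair to an estimated value.
# Initially uninformative: all entries are zero.
q_table = np.zeros((n_states, n_actions))

# Q(s=3, a=2): the estimated value of taking action 2 in state 3.
print(q_table[3, 2])  # 0.0 before any learning
```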
Why does Q-Learning allow us to learn the optimal policy?
- By training the Q-function, represented as a Q-table, we end up with the optimal Q-values; the optimal policy then follows directly by choosing, in each state, the action with the highest Q-value.
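A small sketch of that derivation, reusing the hypothetical 16x4 Q-table from the previous sketch: the greedy policy is just an argmax over the Q-values of the current state.

```python
import numpy as np

# Hypothetical Q-table, as in the previous sketch.
q_table = np.zeros((16, 4))

def greedy_policy(q_table, state):
    """Pick the action with the highest estimated value in this state."""
    return int(np.argmax(q_table[state]))

action = greedy_policy(q_table, state=3)
```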
The Q-Learning Algorithm
- Step 1: Initialize the Q-table:
- Set $Q(s, a) = 0$ for all states $s$ and all actions $a$.
- Ensure $Q(\text{terminal state}, \cdot) = 0$.
- Step 2: Choose an action using the epsilon-greedy strategy:
- Initialize $\epsilon$ to a high value (e.g. $\epsilon = 1.0$, pure exploration at the start).
- Exploration: With probability $\epsilon$, pick a random action.
- Exploitation: With probability $1 - \epsilon$, pick the action with the highest Q-value in the current state.
- Step 3: Perform the action $A_t$:
- Observe the reward $R_{t+1}$ and the next state $S_{t+1}$.
- Step 4: Update $Q(S_t, A_t)$ using the Bellman equation:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \big]$$
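Putting the four steps together, here is a minimal tabular Q-Learning sketch. It assumes a Gymnasium-style environment with discrete states and actions; the hyperparameters (learning rate, discount, epsilon schedule, episode count) are illustrative choices, not values prescribed above.

```python
import numpy as np

def q_learning(env, n_episodes=1000, alpha=0.1, gamma=0.99,
               eps_start=1.0, eps_min=0.05, eps_decay=0.995):
    # Step 1: initialize the Q-table to zeros.
    q_table = np.zeros((env.observation_space.n, env.action_space.n))
    epsilon = eps_start

    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Step 2: epsilon-greedy action selection.
            if np.random.rand() < epsilon:
                action = env.action_space.sample()          # explore
            else:
                action = int(np.argmax(q_table[state]))      # exploit

            # Step 3: perform the action, observe reward and next state.
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Step 4: TD update toward the Bellman target
            # r + gamma * max_a' Q(s', a')  (no bootstrap on terminal states).
            td_target = reward + gamma * np.max(q_table[next_state]) * (not terminated)
            q_table[state, action] += alpha * (td_target - q_table[state, action])

            state = next_state

        # Gradually shift from exploration to exploitation.
        epsilon = max(eps_min, epsilon * eps_decay)

    return q_table
```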
DQN
Why do we use Deep Q-Learning?
- The Q-table used in traditional Q-Learning becomes impractical for large state spaces: the table grows too large to store, and most entries would never be visited.
- Deep Q-Learning replaces the Q-table with a neural network that approximates the Q-function, enabling efficient handling of large state spaces.
Deep Q-Learning Architecture & Algorithm
- Input: A stack of 4 frames passed through a convolutional neural network (CNN).
Why 4 frames?
- A single frame carries no temporal information; stacking 4 frames lets the network infer motion (the direction and speed of moving objects).
- Output: A vector of Q-values, one for each possible action.
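A sketch of such a network in PyTorch, loosely following the classic Atari-style DQN architecture; the exact layer sizes and the 84x84 frame resolution are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 frames, shape (batch, 4, 84, 84), to one Q-value per action."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # output: a vector of Q-values
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # Scale raw pixel values to [0, 1] before the convolutions.
        return self.head(self.conv(frames / 255.0))
```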
Loss Function: MSE
- Minimize the mean squared error between predicted and target Q-values.
- Use the Bellman equation to calculate the target Q-values.
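A sketch of this loss in PyTorch, assuming the hypothetical QNetwork class from the previous sketch; `q_net` is the online network and `target_net` supplies the Bellman targets (it can be the same network, or the separate target network introduced under Fixed Q-Targets below).

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones,
             gamma=0.99):
    # Predicted Q(s, a) for the actions that were actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman target: r + gamma * max_a' Q_target(s', a'),
    # with no bootstrapping on terminal transitions.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1.0 - dones)

    # MSE between predicted and target Q-values.
    return F.mse_loss(q_pred, q_target)
```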
Training Deep Q-Learning
- Step 1: Sampling
- Perform actions in the environment.
- Store observed experience tuples in a replay memory.
- Step 2: Training
- Randomly sample a small batch of experience tuples from replay memory.
- Use gradient descent to update the Q-network.
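A sketch of this sampling/training loop, reusing the hypothetical QNetwork and dqn_loss from the sketches above; the buffer capacity, batch size, and optimizer settings are assumed values.

```python
import random
from collections import deque

import numpy as np
import torch

# Assumed setup, reusing QNetwork and dqn_loss from the sketches above.
n_actions = 4                                   # hypothetical action count
q_net = QNetwork(n_actions)
target_net = QNetwork(n_actions)
target_net.load_state_dict(q_net.state_dict())  # start with identical weights

replay_memory = deque(maxlen=100_000)           # capacity is an assumed value
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def store(state, action, reward, next_state, done):
    # Step 1 (sampling): append the observed experience tuple.
    replay_memory.append((state, action, reward, next_state, done))

def train_step(batch_size=32, gamma=0.99):
    if len(replay_memory) < batch_size:
        return  # not enough experience collected yet

    # Step 2 (training): sample a random mini-batch of experience tuples.
    batch = random.sample(replay_memory, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)

    states = torch.as_tensor(np.array(states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Gradient descent on the MSE loss defined above.
    loss = dqn_loss(q_net, target_net, states, actions, rewards,
                    next_states, dones, gamma=gamma)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```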
Improvements to Deep Q-Learning
- Experience Replay:
- Store experience tuples in a replay memory so they can be reused across many updates.
- Sample uniformly at random during training to break correlations between consecutive experiences.
- Fixed Q-Targets:
- Use two networks:
- An online Q-network, updated at every training step.
- A separate target network, used only to compute the TD targets.
- Copy the Q-network's weights into the target network less frequently, which keeps the targets stable between updates (see the sketch after this list).
- Double Deep Q-Learning:
- Mitigate overestimation of Q-values by decoupling action selection from evaluation: the online Q-network selects the best next action, and the target network evaluates that action's value (illustrated in the sketch below).
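A sketch of how these two improvements change the target computation, again assuming QNetwork-style online and target networks as in the earlier sketches: fixed Q-targets are implemented by syncing the target network only occasionally, and the Double DQN target lets the online network select the action while the target network evaluates it.

```python
import torch

# Fixed Q-targets: instead of recomputing targets with the constantly
# changing online network, copy its weights into the target network only
# every C training steps (C is a hyperparameter).
def sync_target(q_net, target_net):
    target_net.load_state_dict(q_net.state_dict())

# Double DQN target: the online network selects the best next action,
# while the target network evaluates that action's value, which reduces
# overestimation of Q-values.
def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        q_next = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * q_next * (1.0 - dones)
```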