What is RL?

Important

  • In RL, an agent learns to make decisions by interacting with an environment (through trial and error).
  • The agent receives feedback in the form of rewards.
  • The goal is to maximize the total reward over time (see the interaction-loop sketch below).
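
A minimal sketch of that interaction loop, assuming the third-party Gymnasium library and its CartPole environment (neither is named in these notes) and a random agent standing in for a learned policy:

```python
import gymnasium as gym

# Illustrative environment choice; any Gymnasium environment works the same way.
env = gym.make("CartPole-v1")

state, info = env.reset()
total_reward = 0.0
done = False

while not done:
    # A real agent would query its policy here; we just act randomly.
    action = env.action_space.sample()
    # The environment answers with the next state and a reward (the feedback signal).
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

env.close()
print(f"Total reward collected this episode: {total_reward}")
```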

How does the agent make decisions?

  • The agent uses a policy (π) to decide what action to take in a given state.
  • Given a state, the policy outputs an action (if deterministic) or a probability distribution over actions (if stochastic); see the sketch after this list.
  • The goal of RL is to find an optimal policy (π*) that maximizes the expected cumulative reward.
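
A toy sketch of the two kinds of policy, assuming a made-up problem with 3 states and 2 actions (the tables below are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS = 2

def deterministic_policy(state: int) -> int:
    """pi(s) = a: each state maps to exactly one action."""
    action_for_state = [0, 1, 1]  # fixed lookup table
    return action_for_state[state]

def stochastic_policy(state: int) -> int:
    """pi(a|s): each state maps to a probability distribution over actions."""
    probs_for_state = np.array([[0.9, 0.1],
                                [0.5, 0.5],
                                [0.2, 0.8]])
    return int(rng.choice(N_ACTIONS, p=probs_for_state[state]))

print(deterministic_policy(0))  # always 0
print(stochastic_policy(0))     # 0 with probability 0.9, 1 with probability 0.1
```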

What is the Exploration/Exploitation trade-off?

  • Exploration: trying out new actions to discover the best ones.
  • Exploitation: choosing the best actions based on what we already know (see the epsilon-greedy sketch below).
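
One common way to balance the two is an epsilon-greedy rule. A minimal sketch, assuming we already hold action-value estimates for the current state (the numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    """With probability epsilon explore (random action); otherwise exploit (best known action)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # exploration
    return int(np.argmax(q_values))              # exploitation

q_estimates = np.array([1.0, 2.5, 0.3])            # hypothetical estimates for one state
action = epsilon_greedy(q_estimates, epsilon=0.1)  # picks action 1 about 93% of the time
```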

What are the two main types of RL algorithms?

  • Value-based methods: train a value function that estimates the value of being in a state (or of taking an action in a state), i.e. the expected cumulative reward, and use these estimates to make decisions.
  • Policy-based methods: train a policy function that directly outputs the action to take in a given state.

Value-based methods

What is the value of a state?

  • The value of a state is the expected discounted return the agent can get if it starts at that state and then acts according to our policy (formalized in the equation below).
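
Written out in standard notation (reconstructing the symbols the notes refer to), with γ the discount factor:

```latex
V_{\pi}(s) = \mathbb{E}_{\pi}\!\left[ G_t \mid S_t = s \right]
           = \mathbb{E}_{\pi}\!\left[ R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots \mid S_t = s \right]
```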

But what does it mean to "act according to our policy?"

  • For value-based methods, you don’t train the policy.
  • Instead, the policy is just a simple pre-specified function that uses the values given by the value function to select its actions.
  • The greedy policy is an example of this: it selects the action that maximizes the value function. In practice, an epsilon-greedy policy is commonly used, since it handles the exploration/exploitation trade-off discussed earlier (see the Q-table sketch below).
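
A sketch of how such a pre-specified greedy policy reads its actions off a learned action-value table (the Q-values here are hypothetical):

```python
import numpy as np

# Hypothetical learned action-value table: Q[state, action].
Q = np.array([[0.1, 0.9],
              [0.4, 0.2],
              [0.0, 0.7]])

def greedy_policy(q_table: np.ndarray, state: int) -> int:
    """The policy itself is not trained: it just picks the highest-valued action."""
    return int(np.argmax(q_table[state]))

print(greedy_policy(Q, 0))  # -> 1, because Q[0, 1] is the largest value in that row
```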

Tldr

  • In policy-based methods, the optimal policy (π*) is found by training the policy directly.
  • In value-based methods, the optimal policy is found by finding an optimal value function (Q* or V*) and then using it to derive the optimal policy.

What is the link between Value and Policy?
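
In short: a value function induces a policy. In standard notation, acting greedily with respect to the optimal action-value function gives an optimal policy:

```latex
\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
```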

State-Value Function

  • The state-value function (V_π) estimates the expected cumulative reward of being in a state and following a policy π.

  • For each state, the state-value function outputs the expected return if the agent starts at that state and then follows the policy forever.

The action-value function

  • The action-value function (Q_π) estimates the expected cumulative reward of taking an action in a state and then following a policy π.

Action-Value vs. State-Value Function:

  • State-value function: calculates the value of a state.
  • Action-value function: calculates the value of taking an action in a state (a state-action pair); the sketch below contrasts the two.
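
A minimal sketch of the difference, assuming small tabular value functions (sizes are made up for illustration):

```python
import numpy as np

N_STATES, N_ACTIONS = 4, 2  # hypothetical problem size

# State-value function: one number per state, V[s].
V = np.zeros(N_STATES)

# Action-value function: one number per state-action pair, Q[s, a].
Q = np.zeros((N_STATES, N_ACTIONS))

state = 2
print(V[state])     # value of being in state 2 and then following the policy
print(Q[state, 1])  # value of taking action 1 in state 2 and then following the policy
```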

The Bellman Equation

What is the Bellman equation?

A way to simplify our state-value or state-action value calculation.

Remember: To calculate the value of a state, we need to calculate the return starting at that state and then following the policy forever.

However, when calculating V(S_t), we need to know V(S_{t+1}).

Thus, to avoid repeated computation, we use Dynamic Programming, specifically the Bellman Equation:

The value of our state is the expected reward we get at the next time step plus the (discounted) expected value of the next state.

The value of V(S_t) is equal to the immediate reward R_{t+1} plus the discounted value of the next state (γ * V(S_{t+1})).
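
In equation form (standard notation, reconstructing the symbols referenced above):

```latex
V_{\pi}(s) = \mathbb{E}_{\pi}\!\left[ R_{t+1} + \gamma \, V_{\pi}(S_{t+1}) \mid S_t = s \right]
```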

Monte Carlo and Temporal Difference Learning

Monte Carlo: learning at the end of an episode

  • In Monte Carlo learning, you wait until the end of the episode, then calculate the return G_t, and then update the value function V(S_t), as sketched below.
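
A minimal sketch of that update, assuming a constant learning rate alpha, discount gamma, and episodes stored as hypothetical (state, reward) pairs where the reward is the one received after leaving that state:

```python
def monte_carlo_update(V: dict, episode: list, alpha: float = 0.1, gamma: float = 0.99) -> None:
    """Every-visit Monte Carlo update: V(S_t) <- V(S_t) + alpha * (G_t - V(S_t))."""
    G = 0.0
    # Walk the episode backwards so each return G_t can reuse G_{t+1}.
    for state, reward in reversed(episode):
        G = reward + gamma * G           # G_t = R_{t+1} + gamma * G_{t+1}
        old = V.get(state, 0.0)
        V[state] = old + alpha * (G - old)

# Hypothetical three-step episode.
V = {}
episode = [(0, 1.0), (1, 0.0), (2, 1.0)]
monte_carlo_update(V, episode)
print(V)  # state 2 was nudged from 0.0 toward its return of 1.0 (V[2] == 0.1)
```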