Imitation Learning

MDP (Markov Decision Process)

Definitions:

$S$ : State space, $s_{t} \in S$ : state at time $t$

$A$ : Action space, $a_{t} \in A$ : action at time $t$

$p$ : Transition probability, $s_{t + 1} \sim p (\cdot ∣ s_{t}, a_{t})$

$r$ : Reward function, $r : S \times A \to \mathbb{R}$

Goal: Learn a policy $π (a_{t} ∣ s_{t})$ .
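To make the definitions concrete, here is a minimal sketch of a tabular MDP and a rollout under a fixed policy. The sizes, the random transition and reward tables, and the uniform policy are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 3, 2
# P[s, a] is a distribution over next states: p(. | s, a)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
# R[s, a] is the scalar reward r(s, a)
R = rng.normal(size=(n_states, n_actions))
# pi[s] is a distribution over actions: pi(. | s); uniform in this sketch
pi = np.full((n_states, n_actions), 1.0 / n_actions)

def rollout(s0, horizon):
    """Sample (s_t, a_t, r_t) tuples by alternating policy and transition draws."""
    s, traj = s0, []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[s])      # a_t ~ pi(. | s_t)
        traj.append((s, a, R[s, a]))
        s = rng.choice(n_states, p=P[s, a])     # s_{t+1} ~ p(. | s_t, a_t)
    return traj

trajectory = rollout(s0=0, horizon=5)
```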

POMDP (Partially Observed MDP)

Additional Definitions:

$O$ : Observation space, $o_{t} \in O$ : observation at time $t$

$h$ : Observation model, $o_{t} \sim h (\cdot ∣ s_{t})$

Goal: Learn a policy $π (a_{t} ∣ o_{t})$ .

Imitation Learning

Idea

Collect expert data (observation/state and action pairs)

Train a function to map observations/states to actions
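The two steps above amount to supervised learning (often called behavior cloning). A minimal sketch, assuming a hypothetical linear expert `W_expert` and a linear policy fit by least squares; a real system would typically use a neural network instead.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, n_samples = 4, 2, 500

# Hypothetical expert: a fixed linear map from state to action.
W_expert = rng.normal(size=(state_dim, action_dim))

# Step 1: collect expert (state, action) pairs.
states = rng.normal(size=(n_samples, state_dim))
actions = states @ W_expert + 0.01 * rng.normal(size=(n_samples, action_dim))

# Step 2: fit a function from states to actions (least squares here).
W_policy, *_ = np.linalg.lstsq(states, actions, rcond=None)

def policy(state):
    """Learned policy: predicts the expert's action for a given state."""
    return state @ W_policy

# Largest gap between the learned and the expert weights.
bc_error = np.abs(W_policy - W_expert).max()
```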

Dataset Aggregation (DAgger)

Process:
- Start with expert demonstrations.
- Train policy $π_{1}$ via supervised learning.
- Run $π_{1}$ , query the expert to correct mistakes, and collect new data.
- Aggregate new and old data, retrain to create $π_{2}$ .
- Repeat the process iteratively.
Advantages:
- Reduces cascading errors.
- Provides theoretical regret guarantees.
Limitations:
- Requires frequent expert queries.
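The process above can be sketched as a short loop. Everything concrete here is an assumption for illustration: the 1-D dynamics in `step`, the linear `expert_action`, and the one-parameter policy fit by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(s):
    # Hypothetical expert: push the state toward zero.
    return -0.5 * s

def step(s, a):
    # Hypothetical dynamics: s_{t+1} = s_t + a_t + noise
    return s + a + 0.1 * rng.normal()

def rollout_states(act_fn, horizon=20):
    """Run act_fn in the environment and return the states it visits."""
    s, visited = rng.normal(), []
    for _ in range(horizon):
        visited.append(s)
        s = step(s, act_fn(s))
    return np.array(visited)

def fit_policy(states, actions):
    """Supervised learning: least-squares fit of a = w * s."""
    w = float(states @ actions / (states @ states))
    return (lambda s: w * s), w

# Start with expert demonstrations and train pi_1.
data_s = rollout_states(expert_action)
data_a = expert_action(data_s)
policy, w = fit_policy(data_s, data_a)

for _ in range(5):
    # Run the current policy; the expert relabels the states it visits.
    new_s = rollout_states(policy)
    new_a = expert_action(new_s)
    # Aggregate new and old data, then retrain.
    data_s = np.concatenate([data_s, new_s])
    data_a = np.concatenate([data_a, new_a])
    policy, w = fit_policy(data_s, data_a)
```

The key difference from plain behavior cloning is that `new_s` comes from the learner's own rollouts, so the training data covers the states the learner actually reaches. The expert query inside the loop is also where the "frequent expert queries" cost shows up.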

IL with Privileged Teachers

It can be hard to learn the policy $π_{θ} (a_{t} ∣ o_{t})$ directly, especially if $o_{t}$ is high-dimensional.

Obtain a "privileged" teacher $π_{p} (a_{t} ∣ p_{t})$.

$p_{t}$ contains “ground truth” information that is not available to the student.

Then use $π_{p} (a_{t} ∣ p_{t})$ to generate demonstrations for the student $π_{θ} (a_{t} ∣ o_{t})$.

Example

Stage 1: learn a “privileged agent” from expert demonstrations

It has access to the ground-truth state (traffic light state, other vehicles’ positions and velocities, etc.)

Stage 2: a sensorimotor student learns from this trained privileged agent

This is especially useful in simulation, where the value of every variable is known. The privileged teacher learns from that full state, while the student learns only from what it can directly observe or measure.

The privileged teacher is usually trained with PPO.
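A minimal sketch of the second stage (distilling the teacher into the student), under strong simplifying assumptions: the teacher is a hypothetical linear map `W_teacher` of the privileged state, the observation is a noisy linear sensor reading `o = p @ H + noise`, and the student is fit by least squares. None of these choices come from a specific paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, priv_dim, obs_dim, action_dim = 1000, 3, 6, 2

# Hypothetical privileged teacher: a linear map of the ground-truth state.
W_teacher = rng.normal(size=(priv_dim, action_dim))

# Privileged state p_t (available in simulation) and the noisy sensor
# reading o_t derived from it (all the student ever sees).
H = rng.normal(size=(priv_dim, obs_dim))  # hypothetical sensor model
p = rng.normal(size=(n, priv_dim))
o = p @ H + 0.01 * rng.normal(size=(n, obs_dim))

# The teacher labels each sample with its action; the student regresses
# from observations to those labels.
teacher_actions = p @ W_teacher
W_student, *_ = np.linalg.lstsq(o, teacher_actions, rcond=None)

student_actions = o @ W_student
imitation_mse = float(np.mean((student_actions - teacher_actions) ** 2))
```

The student never reads `p` directly; it only sees `o` and the teacher's action labels.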

Variants

Student learning in the latent space: Adapting Rapid Motor Adaptation for Bipedal Robots

Student learning to predict rays: Agile But Safe: Learning Collision-Free High-Speed Legged Locomotion (arXiv:2401.17583)

Deep Imitation Learning with Generative Modeling

What is the problem posed by generative modeling?

Learn: fit a distribution $p_{θ}$ that matches $p_{\text{data}}$

Sample: generate novel data so that $x_{\text{new}} \sim p_{θ}$
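One common way to make "matches" precise is maximum likelihood. This is only one option (GANs, for instance, use an adversarial objective instead), so take the following as an illustrative formulation rather than the definition:

$$
θ^{*} = \arg\max_{θ} \; \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_{θ}(x) \right] = \arg\min_{θ} \; D_{\mathrm{KL}} \left( p_{\text{data}} \,\|\, p_{θ} \right)
$$

The two forms agree because $D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_{θ}) = \mathbb{E}_{x \sim p_{\text{data}}} [\log p_{\text{data}}(x)] - \mathbb{E}_{x \sim p_{\text{data}}} [\log p_{θ}(x)]$, and the first term does not depend on $θ$.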

For robotics, we want our $p_{\text{data}}$ to be from experts. There are three leading approaches:

GAN + Imitation Learning $\Rightarrow$ Generative Adversarial Imitation Learning (GAIL)

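Sample trajectories from the student policy

Update the discriminator, which is trained to distinguish expert (teacher) trajectories from student trajectories

Train the student policy to minimize the discriminator’s accuracy, i.e. to produce trajectories the discriminator cannot tell apart from the expert’s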
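A minimal sketch of the discriminator update and the reward it produces for the student. The assumptions: a logistic-regression discriminator over (state, action) features, synthetic 1-D data, and the convention that the discriminator outputs the probability that a pair came from the expert. The policy update itself (a policy-gradient step on this reward) is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(states, actions):
    # Discriminator input: (state, action) pairs plus a bias column.
    return np.concatenate([states, actions, np.ones((len(states), 1))], axis=1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-D data: expert actions cluster near +1, student actions near -1.
n = 256
s_exp, a_exp = rng.normal(size=(n, 1)), 1.0 + 0.3 * rng.normal(size=(n, 1))
s_stu, a_stu = rng.normal(size=(n, 1)), -1.0 + 0.3 * rng.normal(size=(n, 1))
x_exp, x_stu = features(s_exp, a_exp), features(s_stu, a_stu)

# Discriminator update: gradient ascent on the log-likelihood of the labels
# (expert = 1, student = 0).
w = np.zeros(x_exp.shape[1])
for _ in range(200):
    d_exp, d_stu = sigmoid(x_exp @ w), sigmoid(x_stu @ w)
    grad = x_exp.T @ (1.0 - d_exp) / n - x_stu.T @ d_stu / n
    w += 0.5 * grad

d_exp, d_stu = sigmoid(x_exp @ w), sigmoid(x_stu @ w)

# Student reward: higher where the discriminator believes the pair is expert-like.
student_reward = np.log(d_stu + 1e-8)
```

Maximizing `student_reward` pushes the student toward pairs the discriminator labels as expert, which is what drives the discriminator's accuracy down.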

VAE + IL $\Rightarrow$ Action Chunking with Transformers (ACT)

Based on CVAE (conditional VAE)

Encoder: expert action sequence + observation → latent

Decoder: latent + more observation → action sequence prediction

Key: action chunking + temporal ensemble
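A minimal sketch of temporal ensembling over action chunks. The policy is assumed to predict a chunk of `k` future actions at every timestep, so several chunks overlap at any given timestep and their predictions are blended. The exponential weights `exp(-m * j)`, with `j = 0` for the oldest prediction, and the value of `m` are assumptions in this sketch.

```python
import numpy as np

def temporal_ensemble(chunks, t, m=0.1):
    """Blend every prediction that was made for timestep t.

    chunks[i] is the (k, action_dim) chunk predicted at timestep i, covering
    timesteps i .. i + k - 1. Weights are exp(-m * j), with j = 0 for the
    oldest prediction that still covers t.
    """
    preds = [chunk[t - i] for i, chunk in enumerate(chunks[: t + 1])
             if t - i < len(chunk)]
    preds = np.stack(preds)                       # oldest prediction first
    weights = np.exp(-m * np.arange(len(preds)))
    weights /= weights.sum()
    return weights @ preds

# Toy example: the chunk predicted at timestep i is filled with the value i.
k, action_dim = 3, 1
chunks = [np.full((k, action_dim), float(i)) for i in range(4)]

# At t = 3 only chunks 1, 2 and 3 still cover the timestep; m = 0 gives a plain mean.
blended = temporal_ensemble(chunks, t=3, m=0.0)
```

With `m > 0` the older predictions receive more weight, which smooths the executed action at the cost of reacting more slowly to new observations.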

Diffusion + IL $\Rightarrow$ Diffusion Policy

Brayden Zhang
