MDP (Markov Decision Process)

Formal Definition: An MDP is defined by a 4-tuple $(\mathcal{S}, \mathcal{A}, P, R)$:

  • $\mathcal{S}$: Set of all states.

  • $\mathcal{A}$: Set of all actions.

  • $P(s' \mid s, a)$: Transition probability, i.e., the probability of reaching state $s'$ after taking action $a$ in state $s$.

  • $R(s, a)$: Reward function assigning a scalar reward to state-action pairs.

Markov Property:

  • The future state depends only on the current state and action: $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)$.

Objective: Learn a policy $\pi(a \mid s)$ that maximizes the expected return $\mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$, where $\gamma \in [0, 1)$ is the discount factor. A small numerical sketch of this return follows.
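
As a concrete example, the sketch below computes the discounted return of a single rollout's reward sequence; the reward values and $\gamma$ are arbitrary illustrative numbers, not from the original notes.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute the discounted return sum_t gamma^t * r_t for one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Illustrative reward sequence from one rollout.
rewards = [1.0, 0.0, 0.5, 1.0]
print(discounted_return(rewards, gamma=0.9))  # 1.0 + 0.0 + 0.405 + 0.729 = 2.134
```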


POMDP (Partially Observable MDP)

Formal Definition: A POMDP extends an MDP by including observations and an observation function:

  • $\Omega$: Observation space.

  • $O(o \mid s)$: Observation model giving the likelihood of observing $o$ from state $s$.

Why it matters: In real-world robotics, true states (e.g., positions, velocities) are often hidden. The agent must act based only on observations (e.g., camera frames).

Policy Objective: Learn a policy $\pi(a \mid o)$ (or $\pi(a \mid o_{1:t})$, conditioned on the observation history) that maps partial observations to actions.
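
A common practical workaround is to feed the policy a short window of recent observations instead of a single frame. The sketch below is a minimal, library-agnostic version of this idea; the class name and window size are illustrative choices, not from the original notes.

```python
from collections import deque

import numpy as np


class ObservationHistory:
    """Stack the last k observations so the policy can condition on a short
    history, a common heuristic when single observations are not Markovian."""

    def __init__(self, k=4):
        self.k = k
        self.buffer = deque(maxlen=k)

    def reset(self, obs):
        # Fill the window with the first observation.
        self.buffer.clear()
        for _ in range(self.k):
            self.buffer.append(np.asarray(obs))
        return np.concatenate(list(self.buffer))

    def step(self, obs):
        # Drop the oldest observation and append the newest one.
        self.buffer.append(np.asarray(obs))
        return np.concatenate(list(self.buffer))
```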


Imitation Learning (IL)

Concept:

  • IL focuses on learning behavior by mimicking expert demonstrations.

  • Avoids the need to hand-design a reward function.

  • Common use cases: autonomous driving, robotic manipulation, navigation.

Problem Setup:

  • The expert provides a dataset $\mathcal{D} = \{(o_i, a_i)\}_{i=1}^{N}$ of observation-action pairs.

  • Learn $\pi_\theta$ to minimize a supervised loss $\mathcal{L}(\theta) = \mathbb{E}_{(o, a) \sim \mathcal{D}}\left[\ell(\pi_\theta(o), a)\right]$ (a minimal training sketch follows).
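
The sketch below is one minimal way to set this up in PyTorch, assuming continuous actions and an MSE loss; the network sizes and dimensions are placeholders.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 7  # placeholder sizes; set to your task's dimensions

policy = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def supervised_step(obs_batch, act_batch):
    """One supervised update: regress expert actions with an MSE loss."""
    loss = nn.functional.mse_loss(policy(obs_batch), act_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Random stand-in batch for the expert dataset D = {(o_i, a_i)}.
print(supervised_step(torch.randn(64, obs_dim), torch.randn(64, act_dim)))
```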


Behavior Cloning (BC)

Is Behavior Cloning just Supervised Learning?

  • Yes: Directly fits a function from observations to actions.

  • No: The i.i.d. assumption breaks down during deployment because action decisions affect future inputs.

Key Pitfall:

  • Training distribution $\neq$ deployment distribution.

  • Small prediction errors cause the agent to drift into unseen states (distribution shift).

Theoretical Insight:

  • If the policy makes an error with probability $\epsilon$ per step, the expected total cost over a horizon of $T$ steps can grow as $O(\epsilon T^2)$ (intuition sketched below).

  • See: Ross et al., 2011
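
The quadratic growth can be sketched as follows (a rough intuition in the spirit of Ross et al., 2011, not their exact derivation): each per-step mistake can push the learner off the expert distribution, where it may keep paying cost for the remaining steps.

```latex
% Each of the T steps errs with probability at most \epsilon under the
% training distribution; a mistake at step t can cost up to T - t + 1
% thereafter, so
\[
J(\pi_\theta) \;\le\; J(\pi^{*}) + \sum_{t=1}^{T} \epsilon \,(T - t + 1)
             \;\le\; J(\pi^{*}) + \epsilon\, T^{2}.
\]
```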


🧠 Dataset Aggregation (DAgger)

DAgger = Dataset Aggregation

  • Proposed to combat error compounding in BC.

Algorithm (a code sketch follows the steps):

  1. Collect initial dataset of expert demonstrations.

  2. Train an initial policy $\pi_1$.

  3. Deploy $\pi_i$ and record the visited states.

  4. Ask the expert for the correct actions at these new states.

  5. Aggregate the data and retrain to form $\pi_{i+1}$.

  6. Repeat.
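
A minimal sketch of that loop is below. The `env`, `expert_policy`, and `train_policy` interfaces are assumptions made for illustration (e.g. `env.step(a)` returning `(obs, done)`), not a specific library API.

```python
def rollout(env, act_policy, episodes, label_with):
    """Act with act_policy, but label every visited observation with label_with."""
    data = []
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            data.append((obs, label_with(obs)))       # expert label for a learner-visited state
            obs, done = env.step(act_policy(obs))
    return data

def dagger(env, expert_policy, train_policy, n_iters=10, episodes_per_iter=5):
    """DAgger: aggregate expert-labeled data gathered under the learner's own policy."""
    dataset = rollout(env, expert_policy, episodes_per_iter, label_with=expert_policy)  # step 1
    policy = train_policy(dataset)                                                      # step 2
    for _ in range(n_iters):
        dataset += rollout(env, policy, episodes_per_iter, label_with=expert_policy)    # steps 3-4
        policy = train_policy(dataset)                                                  # step 5
    return policy
```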

Pros:

  • Provably no-regret with enough iterations.

  • Fixes cascading error issue by training on states visited by the learner.

Cons:

  • Needs an online expert oracle to label learner states.

  • Costly in human-in-the-loop settings.

Regret Bound:

  • With a no-regret online learner and enough iterations, DAgger finds a policy $\hat{\pi}$ whose expected cost satisfies $J(\hat{\pi}) \leq J(\pi^*) + O(\epsilon T)$, i.e. error grows linearly in the horizon $T$ rather than quadratically as in plain BC (Ross et al., 2011).

🧑‍🏫 IL with Privileged Teachers

Challenge:

  • Learning $\pi(a \mid o)$ is hard when the observation $o$ lacks crucial state information.

Approach:

  • Use a simulation-based privileged policy $\pi^{\text{teacher}}(a \mid s)$, which sees the full state $s$, as the teacher.

  • The student policy $\pi_\theta(a \mid o)$ learns via supervised learning on teacher rollouts.

Simulation Workflow:

  • Use full state information in simulation (e.g. positions, object IDs).

  • Train the teacher with RL (e.g. PPO).

  • Render observations $o$ and let the teacher generate action labels.

  • Train the student to predict the teacher's action $a$ from $o$ (a minimal sketch follows).
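
A minimal distillation sketch in PyTorch is below; the dimensions, network shapes, and the frozen teacher network are illustrative assumptions, and in practice the teacher would come from an RL training run.

```python
import torch
import torch.nn as nn

state_dim, obs_dim, act_dim = 48, 64, 12  # placeholder sizes

# Teacher: assumed already trained with RL (e.g. PPO) on the full state; frozen here.
teacher = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))
# Student: sees only rendered observations.
student = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def distill_step(states, observations):
    """Label simulator states with the privileged teacher, then regress the student on observations."""
    with torch.no_grad():
        target_actions = teacher(states)  # "perfect" labels from the privileged policy
    loss = nn.functional.mse_loss(student(observations), target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```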

"Simulation lets you generate perfect labels even if you don’t have a human expert."

Applications:

  • A-RMA: learns locomotion by mapping proprioception to actions via latent adaptation.

  • Agile But Safe: ray prediction for safe locomotion.


🎲 Deep Imitation Learning with Generative Models

Motivation for Generative Modeling:

  • Behavior can be multimodal — multiple valid actions for the same state.

  • Simple supervised learning may average across modes.

  • Generative models capture this uncertainty.

Generative Adversarial Imitation Learning (GAIL):

  • A discriminator $D(s, a)$ is trained to distinguish expert from learner state-action pairs.

  • The learner is trained to fool $D$ using policy gradient.

  • No hand-designed reward function is required; the reward is derived from $D$ (see the sketch below).
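
A minimal sketch of the discriminator side in PyTorch is below. The surrogate reward $-\log(1 - D(s, a))$ is one common choice (sign conventions differ across implementations), and all sizes are placeholders.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 7  # placeholder sizes

# Discriminator D(s, a): logit that a state-action pair came from the expert.
disc = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))
disc_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(expert_sa, learner_sa):
    """Train D to output 1 for expert pairs and 0 for learner pairs."""
    logits_e, logits_l = disc(expert_sa), disc(learner_sa)
    loss = bce(logits_e, torch.ones_like(logits_e)) + bce(logits_l, torch.zeros_like(logits_l))
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()
    return loss.item()

def surrogate_reward(sa):
    """Reward handed to the policy-gradient learner, derived from D."""
    with torch.no_grad():
        d = torch.sigmoid(disc(sa))
    return -torch.log(1.0 - d + 1e-8)  # high when D believes the pair is expert-like
```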

VAE-Based Methods:

  • Learn a latent variable $z$ to represent the expert's strategy or intent.

  • Model $\pi(a \mid o, z)$ using a conditional VAE (sketched below).

  • Use transformer decoders for long-horizon planning.
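
The sketch below is one simple conditional-VAE parameterization of $\pi(a \mid o, z)$; the architecture, KL weight, and Gaussian assumptions are illustrative choices rather than a specific published model.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, z_dim = 32, 7, 8  # placeholder sizes

class ActionCVAE(nn.Module):
    """Conditional VAE: encode (o, a) into a latent z, decode actions from (o, z)."""

    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * z_dim))   # outputs mean and log-variance
        self.dec = nn.Sequential(nn.Linear(obs_dim + z_dim, 128), nn.ReLU(),
                                 nn.Linear(128, act_dim))

    def loss(self, obs, act):
        mu, logvar = self.enc(torch.cat([obs, act], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)       # reparameterization trick
        recon = self.dec(torch.cat([obs, z], dim=-1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return nn.functional.mse_loss(recon, act) + 1e-3 * kl         # ELBO-style objective

    def act(self, obs):
        z = torch.randn(obs.shape[0], z_dim)                          # sample a latent "strategy"
        return self.dec(torch.cat([obs, z], dim=-1))
```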

Diffusion Policy:

  • Iteratively denoises sampled action trajectories conditioned on the current observation (see the sketch after this list).

  • Enables high-resolution multi-step predictions.

  • State-of-the-art in dexterous manipulation.

  • Website: diffusion-policy.cs.columbia.edu
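
A minimal DDPM-style sampling loop is sketched below. `eps_model(noisy_actions, k, obs)` is an assumed noise-prediction network, and the noise schedule and variance choice are simplifications, not the exact recipe used by Diffusion Policy.

```python
import torch

def denoise_actions(eps_model, obs, act_dim, horizon=16, n_steps=50):
    """Start from Gaussian noise and iteratively denoise an action sequence
    conditioned on the current observation (DDPM-style sampling sketch)."""
    betas = torch.linspace(1e-4, 2e-2, n_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    a = torch.randn(1, horizon, act_dim)                       # pure-noise action trajectory
    for k in reversed(range(n_steps)):
        eps = eps_model(a, k, obs)                             # predicted noise at step k
        a = (a - betas[k] / torch.sqrt(1.0 - alpha_bars[k]) * eps) / torch.sqrt(alphas[k])
        if k > 0:
            a = a + torch.sqrt(betas[k]) * torch.randn_like(a)  # stochastic reverse step
    return a                                                   # denoised multi-step action plan
```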


Summary Table

| Method | Needs Expert? | Compounding Error? | Key Strength | Key Weakness |
| --- | --- | --- | --- | --- |
| BC | Yes | Yes | Simple to implement | Poor generalization |
| DAgger | Yes (interactive) | Mitigated | Theoretical guarantees | Needs expert queries |
| Priv. Teacher | Yes | Depends on labeling scheme | Uses full sim state for training | Sim-to-real gap |
| GAIL | Yes | Reduced (on-policy training) | No reward needed | Unstable adversarial training |
| VAE + IL | Yes | Yes | Captures latent intent | Complex training |
| Diffusion IL | Yes | Yes | Flexible, multimodal modeling | Computationally heavy |

📌 Practical Tips

  • For safety-critical systems, use IL to initialize policies before fine-tuning with RL.

  • In simulation, generate diverse scenarios to improve generalization.

  • Use camera-based augmentation (e.g. domain randomization) to bridge the sim-to-real gap.

  • Consider latent or goal-conditioned policies for multi-task setups.


Final Takeaway

  • Imitation Learning is intuitive and data-efficient, but brittle.

  • Robust IL requires careful data coverage, model design, and sometimes, generative modeling.


📚 References

  • Ross, S., Gordon, G., & Bagnell, D. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS.

  • Ho, J., & Ermon, S. (2016). Generative Adversarial Imitation Learning. NeurIPS.

  • Chi, C., et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS.