TL;DR
A generative model that creates new data by learning how data is gradually destroyed by added noise, and then learning to reverse that corruption process.
Forward Process (diffusion)
- The model starts with a clean image and adds noise step by step (this process is fixed, not learned).
- After many steps, the image becomes completely unrecognizable noise.
- The clean image transitions to a noisy one, with each step adding more and more noise.
- Mathematically this is a Markov chain, where at each step Gaussian noise is added:
- $q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$
- $\beta_t$ controls how much noise is added at each step.
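The forward process also has a convenient closed form, $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ with $\bar{\alpha}_t = \prod_{s \le t}(1-\beta_s)$, which lets you jump to any timestep directly. A minimal NumPy sketch (the linear beta schedule and the toy all-zeros "image" are assumptions for illustration):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng=None):
    """Sample x_t from x_0 in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    if rng is None:
        rng = np.random.default_rng()
    alpha_bar = np.cumprod(1.0 - betas)      # cumulative product of alpha_s = 1 - beta_s
    eps = rng.standard_normal(x0.shape)      # the Gaussian noise that gets added
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Linear beta schedule over T = 1000 steps (a common, assumed choice)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
x0 = np.zeros((32, 32))                      # toy "clean image"
xT, _ = forward_diffuse(x0, T - 1, betas)    # after many steps: nearly pure noise
```

Because $\bar{\alpha}_T$ is almost zero at the last step, `xT` is statistically indistinguishable from a standard Gaussian, matching the "completely unrecognizable noise" endpoint above.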
Reverse Process (Denoising)
- The reverse process recovers the original clean image step by step from the noisy image.
- This is modeled as another Markov chain, where at each step noise is removed to move closer to the original image:
- $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$
- Mathematically, a single reverse step can be expressed as (with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$):
- $x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z$
- Here:
- $z \sim \mathcal{N}(0, I)$ (optional noise for stochasticity)
- $\sigma_t$ controls the variance during the reverse step.
- ==The reverse process is learned using a neural network that predicts the amount of noise added at each step. This prediction is used to iteratively denoise.==
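The iterative denoising loop can be sketched as below. Since training a real network is out of scope here, the noise predictor is a stand-in that always returns zero; everything about it (and the short 50-step schedule) is an illustrative assumption, not a working model:

```python
import numpy as np

def reverse_sample(eps_model, shape, betas, rng=None):
    """DDPM-style ancestral sampling: start from pure noise x_T and repeatedly apply
    x_{t-1} = (x_t - (1-alpha_t)/sqrt(1-alpha_bar_t) * eps_theta) / sqrt(alpha_t) + sigma_t * z."""
    if rng is None:
        rng = np.random.default_rng()
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)           # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = eps_model(x, t)            # network's prediction of the added noise
        mean = (x - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        z = rng.standard_normal(shape) if t > 0 else 0.0   # no extra noise at the final step
        sigma = np.sqrt(betas[t])            # a common fixed choice: sigma_t^2 = beta_t
        x = mean + sigma * z
    return x

dummy_model = lambda x, t: np.zeros_like(x)  # placeholder for a trained network
betas = np.linspace(1e-4, 0.02, 50)
sample = reverse_sample(dummy_model, (4, 4), betas)
```

With a trained `eps_model`, the same loop turns pure noise into a sample from the data distribution; with the zero stand-in it just illustrates the mechanics of the chain.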
Training Objective
- The model is trained to minimize the difference between the predicted noise and the actual noise added at each forward step:
- $L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right]$
- Training is performed by maximizing the ELBO, which is a lower bound on the log-likelihood.
- $\epsilon_\theta(x_t, t)$: the neural network's prediction of the noise for input $x_t$ at time $t$.
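A minimal sketch of evaluating this objective for one random timestep, using the closed-form forward sample $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$; the zero-noise stand-in model is a placeholder assumption for a real network:

```python
import numpy as np

def diffusion_loss(eps_model, x0, betas, rng=None):
    """Simplified DDPM objective for one sample: || eps - eps_theta(x_t, t) ||^2,
    averaged over pixels, at a uniformly random timestep t."""
    if rng is None:
        rng = np.random.default_rng()
    alpha_bar = np.cumprod(1.0 - betas)
    t = int(rng.integers(len(betas)))        # sample a random timestep
    eps = rng.standard_normal(x0.shape)      # the actual noise added in the forward step
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_hat = eps_model(xt, t)               # network's prediction of that noise
    return float(np.mean((eps - eps_hat) ** 2))

betas = np.linspace(1e-4, 0.02, 100)
x0 = np.zeros((16, 16))                      # toy "clean image"
dummy = lambda x, t: np.zeros_like(x)        # placeholder network
loss = diffusion_loss(dummy, x0, betas)      # expectation is E[eps^2] = 1 for a zero predictor
```

In real training, this scalar would be backpropagated through `eps_model` and averaged over a minibatch of images and timesteps.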
Key Components
- Forward Process (Noise Addition):
- Progressively adds Gaussian noise to transition $x_0 \to x_T$.
- Defined by: $q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$
- Reverse Process (Noise Removal):
- Approximates the denoising distribution: $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$
- Where:
- $\mu_\theta(x_t, t)$: mean predicted by the neural network.
- $\Sigma_\theta(x_t, t)$: variance (can be fixed or learned).
Connection to Variational Inference
- The forward process defines a fixed distribution $q(x_{1:T} \mid x_0)$.
- The reverse process $p_\theta(x_{0:T})$ tries to approximate the true posterior $q(x_{t-1} \mid x_t, x_0)$ by optimizing the Evidence Lower Bound (ELBO):
- $\log p_\theta(x_0) \ge \mathbb{E}_{q}\left[ \log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)} \right]$
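As a sketch of why optimizing the ELBO trains the denoiser, it can be decomposed (as in the standard DDPM treatment) into a reconstruction term, a prior-matching term, and per-step denoising KL terms:

```latex
\mathbb{E}_q\!\left[
  \underbrace{\log p_\theta(x_0 \mid x_1)}_{\text{reconstruction}}
  \;-\; \underbrace{D_{\mathrm{KL}}\!\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{\text{prior matching}}
  \;-\; \sum_{t=2}^{T} \underbrace{D_{\mathrm{KL}}\!\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{\text{denoising matching}}
\right]
```

The sum of KL terms is what reduces, under Gaussian assumptions, to the simple noise-prediction loss in the Training Objective section.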
Resources
- Diffusion Without Tears
- Step-by-Step Diffusion: An Elementary Tutorial (arxiv.org)