The goal of machine learning is to use data to minimize a chosen objective (loss) function. The most widespread way to do this is gradient descent, which repeatedly takes small steps in the direction of steepest descent (the negative gradient) of the objective until it reaches a minimum — in general, only a local one.
$$\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta)$$
- $\eta$ = learning rate
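As a minimal sketch of this update rule, the snippet below minimizes a one-dimensional quadratic $f(\theta) = (\theta - 3)^2$ by hand. The objective, the names `grad`, `theta`, and `eta`, and the hyperparameter values are all illustrative assumptions, not from the text.

```python
# Minimal gradient descent sketch on f(theta) = (theta - 3)^2.
# The gradient is f'(theta) = 2 * (theta - 3), so each step moves
# theta a small amount (eta) opposite the gradient.

def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial guess
eta = 0.1     # learning rate
for _ in range(100):
    theta = theta - eta * grad(theta)

print(theta)  # converges toward the minimizer, theta = 3
```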
In practice, we rarely use the full dataset for every step. Stochastic gradient descent (SGD) updates the parameters using a single data point at a time, while mini-batch gradient descent uses a small random subset of data points per step. The table below compares the variants, and a sketch of all three follows it.
Method | Data Used per Update | Update Rule | Speed | Stability | Noise | Best Used When
---|---|---|---|---|---|---
Vanilla GD / Batch GD | All training data | $\theta \leftarrow \theta - \eta \nabla L(\theta)$ | Slow | Very stable | No noise | Small datasets
Stochastic GD (SGD) | One sample | $\theta \leftarrow \theta - \eta \nabla L_i(\theta)$ | Fast | High variance | High noise | Large datasets or online learning
Mini-batch GD | A small batch of samples | $\theta \leftarrow \theta - \eta \frac{1}{\lvert B \rvert} \sum_{i \in B} \nabla L_i(\theta)$ | Medium | Good tradeoff | Medium noise | Deep learning (standard default)
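Here is a minimal sketch of all three variants on a synthetic linear-regression problem: the same training loop covers batch GD, SGD, and mini-batch GD just by changing the batch size. The data `X`, `y`, the helpers `grad` and `train`, and the hyperparameter values are illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # toy features (illustrative)
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

def grad(w, Xb, yb):
    # Gradient of mean squared error on the batch (Xb, yb).
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

def train(batch_size, eta=0.05, epochs=20):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)          # shuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            w -= eta * grad(w, X[b], y[b])  # one update per batch
    return w

w_batch = train(batch_size=len(y))  # batch GD: all data per update
w_sgd   = train(batch_size=1)       # SGD: one sample per update
w_mini  = train(batch_size=32)      # mini-batch GD: small batch per update
```

Note how the tradeoffs from the table show up directly: `batch_size=len(y)` takes one smooth step per epoch, `batch_size=1` takes many fast but noisy steps, and `batch_size=32` sits in between.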