The goal of machine learning is to use data to minimize a chosen objective (loss) function. The most widespread way to do this is gradient descent, which repeatedly takes small steps in the direction of steepest descent (the negative gradient) of the objective until it reaches a minimum — in general, only a local one.
$$\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta)$$
- $\eta$ = learning rate
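As a minimal sketch of this update rule, the snippet below minimizes a one-dimensional quadratic $f(\theta) = (\theta - 3)^2$ by hand. The objective, the names `grad`, `theta`, and `eta`, and the hyperparameter values are all illustrative assumptions, not from the text.

```python
# Minimal gradient descent sketch on f(theta) = (theta - 3)^2.
# The gradient is f'(theta) = 2 * (theta - 3), so each step moves
# theta a small amount (eta) opposite the gradient.

def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial guess
eta = 0.1     # learning rate
for _ in range(100):
    theta = theta - eta * grad(theta)

print(theta)  # converges toward the minimizer, theta = 3
```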
In practice, we rarely use the full dataset for every step. Stochastic gradient descent (SGD) updates the parameters using a single data point at a time, while mini-batch gradient descent uses a small random subset of data points per step. The table below compares the variants, and a sketch of all three follows it.
Method | Data Used per Update | Update Rule | Speed | Stability | Noise | Best Used When
---|---|---|---|---|---|---
Vanilla GD / Batch GD | All training data | $\theta \leftarrow \theta - \eta \nabla L(\theta)$ | Slow | Very stable | No noise | Small datasets
Stochastic GD (SGD) | One sample | $\theta \leftarrow \theta - \eta \nabla L_i(\theta)$ | Fast | High variance | High noise | Large datasets or online learning
Mini-batch GD | A small batch of samples | $\theta \leftarrow \theta - \eta \frac{1}{\lvert B \rvert} \sum_{i \in B} \nabla L_i(\theta)$ | Medium | Good tradeoff | Medium noise | Deep learning (standard default)
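Here is a minimal sketch of all three variants on a synthetic linear-regression problem: the same training loop covers batch GD, SGD, and mini-batch GD just by changing the batch size. The data `X`, `y`, the helpers `grad` and `train`, and the hyperparameter values are illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # toy features (illustrative)
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

def grad(w, Xb, yb):
    # Gradient of mean squared error on the batch (Xb, yb).
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

def train(batch_size, eta=0.05, epochs=20):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)          # shuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            w -= eta * grad(w, X[b], y[b])  # one update per batch
    return w

w_batch = train(batch_size=len(y))  # batch GD: all data per update
w_sgd   = train(batch_size=1)       # SGD: one sample per update
w_mini  = train(batch_size=32)      # mini-batch GD: small batch per update
```

Note how the tradeoffs from the table show up directly: `batch_size=len(y)` takes one smooth step per epoch, `batch_size=1` takes many fast but noisy steps, and `batch_size=32` sits in between.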