Gradient Descent: How Machine Learning Models Learn


With a learning rate of 0.1 and momentum of 0.9, gradient descent typically reaches the minimum of a smooth landscape in about 50 steps. Higher learning rates converge faster but risk overshooting.

Formula

θₜ₊₁ = θₜ - α × ∇f(θₜ)
Momentum update: vₜ = β × vₜ₋₁ + ∇f(θₜ); θₜ₊₁ = θₜ - α × vₜ
Adam: mₜ = β₁mₜ₋₁ + (1-β₁)gₜ; vₜ = β₂vₜ₋₁ + (1-β₂)gₜ²
Adam update: θₜ₊₁ = θₜ - α × m̂ₜ/(√v̂ₜ + ε), where m̂ₜ = mₜ/(1-β₁ᵗ) and v̂ₜ = vₜ/(1-β₂ᵗ)

The Engine of Machine Learning

Gradient descent is the algorithm that makes machine learning work. Neural networks, logistic regression, and even support vector machines (in their modern, SGD-trained linear form) all learn by following gradients downhill through a loss landscape. The idea dates back to Cauchy in 1847: to minimize a function, take small steps in the direction of steepest descent. Simple in principle, gradient descent powers everything from GPT to AlphaFold.

The Loss Landscape

Imagine the loss function as a mountainous terrain where altitude represents error. Gradient descent starts at a random position and slides downhill, following the steepest slope at each point. The gradient — a vector of partial derivatives — points uphill, so we move in the opposite direction. The learning rate determines how far we step each time. The simulation above visualizes this journey across different landscape topographies.
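
To make the update rule concrete, here is a minimal vanilla gradient descent loop in Python. The bowl-shaped loss, the starting point, and the hyperparameters are illustrative choices, not the simulator's internals:

import numpy as np

def f(theta):
    # Illustrative loss: an elongated bowl with its minimum at the origin.
    return theta[0]**2 + 10 * theta[1]**2

def grad_f(theta):
    # Gradient of f: the vector of partial derivatives, derived by hand.
    return np.array([2 * theta[0], 20 * theta[1]])

theta = np.array([4.0, 3.0])  # starting position on the landscape
alpha = 0.05                  # learning rate (step size)

for step in range(100):
    theta = theta - alpha * grad_f(theta)  # step opposite the gradient

print(theta)  # approaches the minimum at (0, 0)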

The Learning Rate Dilemma

The learning rate is the single most important hyperparameter in deep learning. Too large, and the optimizer bounces wildly across the landscape, possibly diverging entirely. Too small, and training crawls, and the optimizer may settle into the first sharp local minimum it reaches. Modern optimizers like Adam adapt the learning rate per parameter, but the initial learning rate still matters enormously; most practitioners use learning rate schedules that start large and decay over time.
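
A common schedule is exponential decay: start with a large rate and shrink it by a constant factor each step. A minimal sketch, where the initial rate and decay factor are arbitrary examples, not recommended values:

def exponential_lr(step, lr0=0.1, decay=0.97):
    # Large early steps explore the landscape; small late steps settle in.
    # lr0 and decay are illustrative; in practice they are tuned per problem.
    return lr0 * decay ** step

Plugging this into the loop above just means replacing the fixed alpha with exponential_lr(step).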

Beyond Vanilla Gradient Descent

Modern optimization has moved far beyond simple gradient descent. Momentum accumulates velocity from past gradients, helping the optimizer barrel through shallow local minima. Nesterov acceleration looks ahead before computing the gradient. RMSProp and Adam maintain per-parameter learning rates. Stochastic mini-batching adds noise that helps escape saddle points. The simulation lets you explore how momentum and learning rate interact to navigate complex landscapes.
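
The momentum and Adam updates in the formula section map directly to code. Here is a sketch of one step of each, using the LR=0.1, momentum=0.9 defaults mentioned above and textbook Adam constants; the function names and calling convention are ours, not any library's:

import numpy as np

def momentum_step(theta, v, grad, alpha=0.1, beta=0.9):
    # v is an exponentially weighted accumulation of past gradients;
    # it lets the optimizer coast through shallow dips.
    v = beta * v + grad
    return theta - alpha * v, v

def adam_step(theta, m, v, grad, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # m and v estimate the first and second moments of the gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction offsets the zero initialization of m and v (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v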

FAQ

What is gradient descent?

Gradient descent is an optimization algorithm that finds the minimum of a function by iteratively moving in the direction of steepest decrease (negative gradient). Each step updates the parameters: θ = θ - α∇f(θ), where α is the learning rate and ∇f is the gradient.

Why is the learning rate so important?

The learning rate controls step size. Too small: convergence is painfully slow. Too large: the optimizer overshoots the minimum and may diverge entirely. Finding the right learning rate is one of the most critical hyperparameter choices in deep learning.

What is the difference between gradient descent, SGD, and Adam?

Gradient descent uses all data per step. Stochastic gradient descent (SGD) uses one random sample. Mini-batch SGD uses a small batch. Adam combines momentum and adaptive learning rates per parameter. Adam is the default optimizer for most deep learning today.
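
The difference is just how much data feeds each gradient estimate. A toy least-squares sketch, where the dataset, model, and hyperparameters are placeholders:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # toy inputs
y = X @ rng.normal(size=5)       # toy targets
w = np.zeros(5)

def grad(w, Xb, yb):
    # Gradient of mean squared error over the batch (Xb, yb).
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

batch_size = 32  # 1 gives SGD, len(X) gives full-batch gradient descent
for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # sample a mini-batch
    w -= 0.01 * grad(w, X[idx], y[idx])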

Can gradient descent get stuck in local minima?

Yes. In non-convex landscapes, gradient descent can converge to local minima or saddle points instead of the global minimum. Momentum, learning rate schedules, and stochastic noise help escape poor local optima.
