The Engine of Machine Learning
Gradient descent is the algorithm that makes modern machine learning work. Nearly every neural network, along with classical models such as logistic regression and linear support vector machines, learns by following gradients downhill through a loss landscape. The idea dates back to Cauchy in 1847: to minimize a function, take small steps in the direction of steepest descent. Simple in principle, gradient descent powers everything from GPT to AlphaFold.
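In standard notation (ours; the article states the idea only in words), each step updates the parameters θ against the gradient of the loss L, scaled by the learning rate η:

θ_{t+1} = θ_t - η ∇L(θ_t)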
The Loss Landscape
Imagine the loss function as a mountainous terrain where altitude represents error. Gradient descent starts at a random position and slides downhill, following the steepest slope at each point. The gradient — a vector of partial derivatives — points uphill, so we move in the opposite direction. The learning rate determines how far we step each time. The simulation above visualizes this journey across different landscape topographies.
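To make the mechanics concrete, here is a minimal sketch of that descent loop on a toy two-dimensional bowl. This is our own example, not the simulation's actual code:

```python
import numpy as np

# Toy loss: an elongated bowl, steep in y and shallow in x,
# standing in for the "mountainous terrain" above.
def loss(p):
    x, y = p
    return x**2 + 10 * y**2

def grad(p):
    x, y = p
    return np.array([2 * x, 20 * y])  # vector of partial derivatives

p = np.array([-4.0, 2.0])   # arbitrary starting position
lr = 0.05                   # learning rate: how far each step goes
for _ in range(100):
    p = p - lr * grad(p)    # step opposite the gradient, i.e. downhill

print(p, loss(p))           # converges toward the minimum at (0, 0)
```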
The Learning Rate Dilemma
The learning rate is arguably the single most important hyperparameter in deep learning. Too large, and the optimizer bounces wildly across the landscape, possibly diverging to infinity. Too small, and training takes forever, and the optimizer can get trapped in sharp local minima that a larger step would simply jump over. Modern optimizers like Adam adapt the step size per parameter, but the base learning rate still matters enormously; most practitioners use learning rate schedules that start large and decay over time.
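The dilemma is easy to reproduce on the toy bowl from above (again our sketch, not the simulation's code). The y-direction has curvature 20, so plain gradient descent is stable only when the learning rate stays below 2/20 = 0.1:

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, 20 * y])

for lr in (0.2, 0.09, 0.001):
    p = np.array([-4.0, 2.0])
    for _ in range(100):
        p = p - lr * grad(p)
    print(f"lr={lr}: {p}")
# lr=0.2:   the y overshoot grows 3x per step; p ends astronomically far away
# lr=0.09:  converges comfortably within 100 steps
# lr=0.001: still far from the minimum after 100 steps
```

In practice the stability threshold is unknown, which is one reason the decaying schedules mentioned above are so common: start near the edge of stability for fast progress, then anneal as the optimizer settles in.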
Beyond Vanilla Gradient Descent
Modern optimization has moved far beyond simple gradient descent. Momentum accumulates velocity from past gradients, helping the optimizer barrel through shallow local minima. Nesterov acceleration looks ahead along the current velocity before computing the gradient. RMSProp and Adam maintain per-parameter learning rates. Stochastic mini-batching adds noise that helps escape saddle points. The simulation lets you explore how momentum and learning rate interact to navigate complex landscapes.
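As a rough sketch of two of these updates, in their textbook forms rather than whatever exact variants the simulation implements, here are heavy-ball momentum and Adam on the same toy bowl:

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, 20 * y])

# Heavy-ball momentum: velocity accumulates past gradients.
p, v = np.array([-4.0, 2.0]), np.zeros(2)
lr, mu = 0.01, 0.9
for _ in range(300):
    v = mu * v - lr * grad(p)   # decay old velocity, add the new gradient
    p = p + v                   # move along the accumulated velocity
print("momentum:", p)

# Adam: per-parameter step sizes from running moment estimates.
p = np.array([-4.0, 2.0])
m, s = np.zeros(2), np.zeros(2)
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 301):
    g = grad(p)
    m = b1 * m + (1 - b1) * g        # first moment (running mean of gradients)
    s = b2 * s + (1 - b2) * g**2     # second moment (running mean of squares)
    m_hat = m / (1 - b1**t)          # bias correction for the warm-up steps
    s_hat = s / (1 - b2**t)
    p = p - lr * m_hat / (np.sqrt(s_hat) + eps)
print("adam:    ", p)                # both end near the minimum at (0, 0)
```

Nesterov's variant would evaluate grad(p + mu * v) instead of grad(p): the "look ahead" described above, which corrects the velocity using the gradient at the point the optimizer is about to reach.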