Descending the Loss Landscape
At the heart of machine learning is an optimization problem: find the parameters that minimize a loss function measuring how poorly the model fits the data. Gradient descent solves this by repeatedly computing the gradient — the direction of steepest ascent — and stepping in the opposite direction. The loss landscape of a neural network is a high-dimensional surface with hills, valleys, saddle points, and flat plateaus, and gradient descent must navigate all of these to find good solutions.
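In code, the core update is a single line: subtract the learning rate times the gradient from the current parameters. The sketch below runs that loop on a toy two-dimensional bowl; the surface and the names toy_loss and toy_grad are illustrative stand-ins, not the simulation's actual code.

```python
# Minimal gradient descent on a toy 2D surface (illustrative only).
import numpy as np

def toy_loss(w):
    # An elongated bowl: steep along w[0], shallow along w[1].
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def toy_grad(w):
    # Gradient of the bowl above, worked out by hand.
    return np.array([10.0 * w[0], w[1]])

w = np.array([1.0, 1.0])        # starting point on the surface
learning_rate = 0.05

for step in range(100):
    g = toy_grad(w)             # direction of steepest ascent
    w = w - learning_rate * g   # step the opposite way

print(w, toy_loss(w))           # ends near the minimum at the origin
```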
Learning Rate: The Step Size Dilemma
The learning rate is arguably the most important hyperparameter in deep learning. Too large and the optimizer overshoots, bouncing around the minimum or diverging entirely, with the loss increasing instead of decreasing. Too small and training takes impractically long, and the optimizer can settle into sharp local minima it lacks the step size to escape. The simulation lets you see this tradeoff directly: watch how different learning rates produce smooth convergence, oscillation, or catastrophic divergence on the same loss surface.
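To make the tradeoff concrete, here is a rough sketch that sweeps a few learning rates on the one-dimensional quadratic loss 0.5 * x**2, whose gradient is simply x. The specific values are chosen for illustration and are not the simulation's presets.

```python
# On 0.5 * x**2, each step multiplies x by (1 - lr), so the three
# regimes are easy to see: slow convergence, endless oscillation, divergence.
def run(lr, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - lr * x          # gradient of 0.5 * x**2 is x
    return x

for lr in (0.1, 2.0, 2.5):      # converges slowly, oscillates, diverges
    print(f"lr={lr}: final x = {run(lr):.3g}")
```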
Momentum and Acceleration
Vanilla gradient descent struggles with ravine-shaped loss surfaces: it zigzags across the narrow dimension instead of rolling down the long axis. Momentum solves this by accumulating velocity: if the gradient consistently points in one direction, the optimizer accelerates; if it oscillates, the contributions to the velocity largely cancel out. Nesterov momentum improves on this by evaluating the gradient at the "lookahead" position the velocity is about to carry it to, which gives provably faster convergence on convex problems.
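The two update rules differ only in where the gradient is evaluated. The sketch below writes both out for a generic parameter vector; grad_fn is an assumed stand-in for whatever gradient function you have, not an API from the simulation.

```python
import numpy as np

def momentum_step(w, v, grad_fn, lr=0.01, beta=0.9):
    # Classical momentum: consistent gradients build speed in v,
    # oscillating gradients largely cancel.
    v = beta * v - lr * grad_fn(w)
    return w + v, v

def nesterov_step(w, v, grad_fn, lr=0.01, beta=0.9):
    # Nesterov: evaluate the gradient at the lookahead point w + beta * v
    # instead of at w itself.
    lookahead = w + beta * v
    v = beta * v - lr * grad_fn(lookahead)
    return w + v, v

# Example: a ravine-shaped surface, steep in w[0] and shallow in w[1].
ravine_grad = lambda w: np.array([10.0 * w[0], 0.1 * w[1]])
w, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, v, ravine_grad)
print(w)   # the steep direction settles quickly; the shallow one keeps sliding
```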
Modern Optimizers
Stochastic gradient descent (SGD) with momentum remains the optimizer of choice for many computer vision tasks, but adaptive methods dominate elsewhere. Adam (Adaptive Moment Estimation) tracks exponential moving averages of the gradient and its square (the first and second moments) and uses them to set an effective per-parameter step size. This makes it robust to sparse gradients and to differently scaled loss surface geometries. The simulation's noise parameter mimics the stochasticity of real training: mini-batch gradient estimates are inherently noisy, and that noise can actually help the optimizer escape sharp local minima.
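A bare-bones version of the Adam update, with Gaussian noise added to the gradient to mimic mini-batch estimates, might look like the sketch below. It follows the published update rule, but the names adam_step and noise_scale, the learning rate, and the noisy-bowl setup are illustrative assumptions, not the simulation's internals.

```python
import numpy as np

def adam_step(w, m, v, g, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g           # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter effective step size
    return w, m, v

# Noisy "mini-batch" gradients on the elongated bowl from earlier.
rng = np.random.default_rng(0)
w = np.array([1.0, 1.0])
m, v = np.zeros_like(w), np.zeros_like(w)
noise_scale = 0.1                             # assumed noise level, for illustration
for t in range(1, 501):
    g = np.array([10.0 * w[0], w[1]]) + noise_scale * rng.standard_normal(2)
    w, m, v = adam_step(w, m, v, g, t)
print(w)                                      # ends close to the minimum at the origin
```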