Gradient Descent: How Neural Networks Learn

With learning rate 0.05 and momentum 0.9 on a quadratic bowl, gradient descent reaches near-zero loss (about 0.002 in the simulation) in roughly 50 steps. Momentum accelerates convergence by building velocity along consistent gradient directions.

Formulas

Parameter update: θ_{t+1} = θ_t − η × ∇L(θ_t)
Momentum update: v_t = β × v_{t-1} + ∇L(θ_t); θ_{t+1} = θ_t − η × v_t
Gradient: ∇L = [∂L/∂θ₁, ∂L/∂θ₂, ..., ∂L/∂θₙ]
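
To make the update rules concrete, here is a minimal NumPy sketch that applies the momentum update to a quadratic bowl L(θ) = ½‖θ‖², whose gradient is simply θ. The learning rate (0.05) and momentum coefficient (0.9) match the simulation's settings; the starting point is an illustrative assumption, and the final loss depends on the bowl's curvature, so it won't match the simulation's readout exactly.

```python
import numpy as np

def grad(theta):
    return theta  # gradient of L(theta) = 0.5 * ||theta||^2

eta, beta = 0.05, 0.9          # the simulation's settings
theta = np.array([3.0, -2.0])  # arbitrary starting point (assumption)
v = np.zeros_like(theta)

for step in range(50):
    v = beta * v + grad(theta)  # v_t = beta * v_{t-1} + grad L(theta_t)
    theta = theta - eta * v     # theta_{t+1} = theta_t - eta * v_t

print("loss after 50 steps:", 0.5 * float(theta @ theta))
```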

Descending the Loss Landscape

At the heart of machine learning is an optimization problem: find the parameters that minimize a loss function measuring how poorly the model fits the data. Gradient descent solves this by repeatedly computing the gradient — the direction of steepest ascent — and stepping in the opposite direction. The loss landscape of a neural network is a high-dimensional surface with hills, valleys, saddle points, and flat plateaus, and gradient descent must navigate all of these to find good solutions.

Learning Rate: The Step Size Dilemma

The learning rate is arguably the most important hyperparameter in deep learning. Too large and the optimizer bounces around or diverges entirely — the loss increases instead of decreasing. Too small and training takes impractically long, potentially getting stuck in sharp local minima. The simulation lets you see this tradeoff directly: watch how different learning rates produce smooth convergence, oscillation, or catastrophic divergence on the same loss surface.
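
You can reproduce the three regimes in a few lines. On the 1-D quadratic L(θ) = ½θ² (gradient θ), each vanilla step multiplies θ by 1 − η, so gradient descent converges only for η < 2; the specific rates below are illustrative assumptions chosen to land in each regime.

```python
# Three learning-rate regimes on L(theta) = 0.5 * theta**2, grad = theta:
# eta well below 2 converges smoothly, eta just under 2 oscillates its
# way down, and eta above 2 diverges.
for eta in (0.05, 1.9, 2.1):
    theta = 3.0
    for _ in range(30):
        theta -= eta * theta  # contraction factor is (1 - eta) per step
    print(f"eta = {eta}: |theta| after 30 steps = {abs(theta):.3g}")
```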

Momentum and Acceleration

Vanilla gradient descent struggles with ravine-shaped loss surfaces — it zigzags across the narrow dimension instead of rolling down the long axis. Momentum solves this by accumulating velocity: if the gradient consistently points in one direction, the optimizer accelerates; if it oscillates, the momentum terms cancel out. Nesterov momentum improves further by evaluating the gradient at the 'lookahead' position, giving better convergence on convex problems.
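
A small experiment illustrates the ravine effect. On the ill-conditioned quadratic L(x, y) = ½(x² + 100y²) below, the steep y direction caps vanilla gradient descent's stable learning rate at η < 2/100, so progress along the shallow x axis crawls, while momentum accumulates speed along it. The curvatures, learning rate, and momentum coefficient are illustrative assumptions.

```python
import numpy as np

def grad(p):
    # gradient of the ravine L(x, y) = 0.5 * (x**2 + 100 * y**2)
    return np.array([p[0], 100.0 * p[1]])

eta, beta, steps = 0.015, 0.9, 100
start = np.array([10.0, 1.0])

# Vanilla gradient descent: eta must stay below 2/100 for stability,
# so the shallow x coordinate shrinks by only 0.985 per step.
p = start.copy()
for _ in range(steps):
    p = p - eta * grad(p)

# Momentum: velocity builds along the consistent x direction.
q, v = start.copy(), np.zeros(2)
for _ in range(steps):
    v = beta * v + grad(q)
    q = q - eta * v

print("vanilla  distance from minimum:", np.linalg.norm(p))
print("momentum distance from minimum:", np.linalg.norm(q))
```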

Modern Optimizers

Stochastic gradient descent (SGD) with momentum remains the optimizer of choice for many computer vision tasks, but adaptive methods dominate elsewhere. Adam (Adaptive Moment Estimation) maintains per-parameter learning rates by tracking first and second moments of the gradient. This makes it robust to sparse gradients and different loss surface geometries. The simulation's noise parameter mimics the stochastic aspect — mini-batch gradient estimates are inherently noisy, and this noise can actually help escape sharp local minima.
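
The Adam update itself is short. The sketch below follows the standard formulation: exponential moving averages of the gradient and of its square, bias-corrected, then a per-parameter step. β₁ = 0.9, β₂ = 0.999, and ε = 1e-8 are the commonly cited defaults, while the learning rate, test function, and step count are illustrative assumptions.

```python
import numpy as np

def grad(theta):
    return theta  # gradient of the quadratic bowl 0.5 * ||theta||^2

eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)  # first moment (mean of gradients)
s = np.zeros_like(theta)  # second moment (mean of squared gradients)

for t in range(1, 201):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias correction for zero init
    s_hat = s / (1 - beta2 ** t)
    theta -= eta * m_hat / (np.sqrt(s_hat) + eps)  # per-parameter step

print("final loss:", 0.5 * float(theta @ theta))
```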

FAQ

What is gradient descent?

Gradient descent is an optimization algorithm that iteratively adjusts parameters to minimize a loss function. At each step, it computes the gradient (slope) of the loss with respect to each parameter and moves in the opposite direction — downhill. The learning rate controls the step size. It is the fundamental training algorithm for virtually all neural networks.

What is the learning rate?

The learning rate controls how large each parameter update is. Too small, and training takes forever; too large, and the optimizer overshoots the minimum and may diverge. Finding the right learning rate is one of the most important hyperparameter choices in deep learning. Modern techniques like learning rate warmup and cosine annealing adjust it dynamically.
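
As a sketch of those dynamic schedules, here is one common formulation of linear warmup followed by cosine annealing; the schedule shape is standard, but eta_max, eta_min, warmup_steps, and total_steps are illustrative assumptions.

```python
import math

def lr_at(step, total_steps=1000, warmup_steps=50,
          eta_max=0.1, eta_min=0.001):
    if step < warmup_steps:
        # linear warmup from near zero up to eta_max
        return eta_max * (step + 1) / warmup_steps
    # cosine decay from eta_max down to eta_min
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

for step in (0, 25, 50, 500, 999):
    print(step, round(lr_at(step), 4))
```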

What is momentum in gradient descent?

Momentum adds a fraction of the previous update to the current one, like a ball rolling downhill that accumulates speed. It helps gradient descent move faster through flat regions and dampen oscillations in narrow valleys. The momentum coefficient (typically 0.9) controls how much past gradients influence the current step.

What is the difference between GD, SGD, and Adam?

Batch GD computes gradients over the entire dataset — accurate but slow. SGD (stochastic gradient descent) uses random mini-batches, adding noise that can help escape local minima. Adam combines momentum with per-parameter adaptive learning rates, making it robust to different loss surface geometries and the default choice for many deep learning tasks.
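
The batch-versus-stochastic distinction fits in a few lines. The sketch below runs mini-batch SGD on a linear least-squares problem: each step estimates the full-batch gradient from a random subset, which is cheaper but noisy. The synthetic data, batch size, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # 1000 examples, 5 features
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
eta, batch_size = 0.1, 32

for _ in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # random mini-batch
    xb, yb = X[idx], y[idx]
    g = xb.T @ (xb @ w - yb) / batch_size  # noisy gradient estimate
    w -= eta * g

print("error vs true weights:", np.linalg.norm(w - true_w))
```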
