Overfitting: When Models Memorize Instead of Learn


A degree-3 polynomial on 20 noisy points from a true cubic function achieves training MSE 0.24 and test MSE 0.31. The small gap indicates the model captures the true pattern without significant overfitting.

Formulas

MSE = (1/n) × Σ (y_i − ŷ_i)²
Bias-variance: E[Error] = Bias² + Variance + Noise
L2 regularized loss: L_reg = L + λ × Σ w_i²

The Fundamental Tension

Machine learning walks a tightrope between two failure modes. Too simple a model (high bias) cannot capture the true relationship in the data — a straight line through a cubic trend. Too complex a model (high variance) captures everything, including the noise — a degree-15 polynomial that threads nearly every training point but oscillates wildly between them. The simulation lets you watch this tradeoff unfold: increase the polynomial degree and watch training error drop toward zero while test error explodes.
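The degree sweep described above can be reproduced in a few lines of NumPy. This is a minimal sketch, not the article's simulation: the true cubic (y = x³ − 0.5x), the noise level, and the degrees compared are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

# 20 noisy samples from an assumed cubic y = x^3 - 0.5x; the article does not
# give the exact function or noise level, so these values are illustrative.
x = np.linspace(-1, 1, 20)
y = x**3 - 0.5 * x + rng.normal(0, 0.1, x.shape)
x_test = np.linspace(-1, 1, 200)
y_test = x_test**3 - 0.5 * x_test + rng.normal(0, 0.1, x_test.shape)

results = {}
for degree in (1, 3, 15):
    coeffs = np.polyfit(x, y, degree)  # least-squares polynomial fit
    results[degree] = (mse(np.polyval(coeffs, x), y),
                       mse(np.polyval(coeffs, x_test), y_test))
    print(f"degree {degree:2d}: train={results[degree][0]:.4f}  "
          f"test={results[degree][1]:.4f}")
```

Training MSE shrinks monotonically as degree grows, while the underfit line (degree 1) and the overfit degree-15 polynomial both typically lose to degree 3 on test data.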

Training Error vs Test Error

The hallmark of overfitting is a gap between training and test performance. Training error always decreases with model complexity — given enough parameters, any dataset can be fit perfectly. But test error (on unseen data) follows a U-shaped curve: it first decreases as the model captures the true pattern, then increases as the model starts fitting noise. The bottom of this U is the optimal complexity, and finding it is the art of machine learning.

The Bias-Variance Decomposition

The expected prediction error at any point decomposes into three components: bias squared (how far the average model prediction is from truth), variance (how much predictions vary between different training sets), and irreducible noise. Simple models have high bias and low variance; complex models have low bias and high variance. This decomposition, formalized by Geman et al. in 1992, provides the theoretical foundation for understanding overfitting and guiding model selection.
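The decomposition can be estimated empirically: refit a model on many fresh training sets and measure how its predictions at a fixed point scatter around the truth. A minimal NumPy sketch, with the true function (sin 2πx), sample size, noise level, and compared degrees all assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(2 * np.pi * x)  # assumed true function, not from the article

x0, noise_sd = 0.25, 0.3  # evaluation point and noise level (assumed)

def fit_predict(degree):
    """Draw a fresh 15-point training set, fit a polynomial, predict at x0."""
    x = rng.uniform(0, 1, 15)
    y = f(x) + rng.normal(0, noise_sd, 15)
    return np.polyval(np.polyfit(x, y, degree), x0)

stats = {}
for degree in (1, 5):
    preds = np.array([fit_predict(degree) for _ in range(2000)])
    bias_sq = (preds.mean() - f(x0)) ** 2  # how far the average model is from truth
    variance = preds.var()                 # spread of predictions across training sets
    stats[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

The linear model shows the high-bias/low-variance corner; the degree-5 model the opposite, matching the decomposition above.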

Modern Perspectives: Double Descent

Classical wisdom says test error follows a U-curve with complexity. But recent research by Belkin et al. revealed a surprising 'double descent' phenomenon: as models become massively overparameterized (far more parameters than data points), test error can decrease again after an initial spike. This occurs in neural networks, random forests, and even linear models. The interpolation threshold — where the model just barely fits training data perfectly — is the worst point, and going beyond it into the overparameterized regime can actually improve generalization.

FAQ

What is overfitting in machine learning?

Overfitting occurs when a model learns the training data too well — including its noise and random fluctuations — rather than the underlying pattern. An overfit model performs excellently on training data but poorly on new, unseen data. It has 'memorized' rather than 'generalized.' This is the central challenge in machine learning.

What is the bias-variance tradeoff?

Bias is error from oversimplified models (underfitting) — they miss the true pattern. Variance is error from overcomplicated models (overfitting) — they are too sensitive to training data. Total error = bias² + variance + noise. Simple models have high bias/low variance; complex models have low bias/high variance. The optimal model minimizes total error.

How do you prevent overfitting?

Key techniques include: regularization (L1/L2 penalties on parameter size), early stopping (halt training before the model memorizes noise), dropout (randomly disable neurons during training), cross-validation (evaluate on held-out data), data augmentation (create more training examples), and ensemble methods (average multiple models). More training data is the most reliable cure.
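Of these techniques, cross-validation is the easiest to sketch without a framework. Below is a minimal k-fold selection of polynomial degree in plain NumPy; the data-generating cubic, noise level, and fold count are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy samples from an assumed cubic (values illustrative, not the article's).
x = rng.uniform(-1, 1, 30)
y = x**3 - 0.5 * x + rng.normal(0, 0.1, 30)

def cv_mse(degree, k=5):
    """Mean held-out MSE of a degree-`degree` polynomial fit over k folds."""
    folds = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(x)), fold)
        coeffs = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coeffs, x[fold]) - y[fold]) ** 2))
    return float(np.mean(errs))

scores = {d: cv_mse(d) for d in range(1, 10)}
best = min(scores, key=scores.get)
print("selected degree:", best)
```

Because each model is scored only on data it never saw, the held-out error penalizes the degrees that fit noise, and the underfit line scores worse than a cubic.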

What is regularization?

Regularization adds a penalty for model complexity to the loss function. L2 (ridge) adds the sum of squared weights, shrinking all weights toward zero. L1 (lasso) adds the sum of absolute weights, driving some to exactly zero (feature selection). The regularization strength λ controls the bias-variance tradeoff — higher values mean simpler, more constrained models.
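Ridge regression has a closed-form solution, which makes the shrinkage effect easy to see. A minimal sketch; the synthetic data and the λ values tried are assumptions, not from the article:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: 40 samples, 8 features, only two truly matter (assumed setup).
X = rng.normal(size=(40, 8))
w_true = np.array([2.0, -1.0, 0, 0, 0, 0, 0, 0])
y = X @ w_true + rng.normal(0, 0.5, 40)

def ridge(X, y, lam):
    """Closed-form L2-regularized least squares: w = (X'X + lam*I)^(-1) X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

norms = {}
for lam in (0.0, 1.0, 100.0):
    w = ridge(X, y, lam)
    norms[lam] = float(np.sum(w**2))
    print(f"lambda={lam:6.1f}  sum of squared weights = {norms[lam]:.3f}")
```

Increasing λ monotonically shrinks the weight vector toward zero, exactly the "simpler, more constrained models" behavior described above. (L1's sparsity has no closed form and would need an iterative solver.)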
