Regression Diagnostics: Validating Model Assumptions

Simulator · Intermediate · ~10 min
R² ≈ 0.56 — moderate fit with diagnostics

With 80 observations, slope 0.8, noise σ=1.0, and 5% outliers, the OLS regression achieves R²≈0.56. Diagnostic plots reveal the outliers as high-leverage or high-residual points.
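The setup above can be reproduced in a few lines of numpy. This is a minimal sketch under assumed details (uniform x on [0, 10], vertical outliers, a fixed random seed); the exact R² will vary with the seed and outlier placement.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 80
x = rng.uniform(0, 10, n)
y = 0.8 * x + rng.normal(0, 1.0, n)          # slope 0.8, noise sigma = 1.0

# Contaminate ~5% of observations with large vertical outliers
out_idx = rng.choice(n, size=4, replace=False)
y[out_idx] += rng.normal(0, 8.0, size=4)

# OLS via least squares on the design matrix [1, x]
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
print(f"slope = {beta[1]:.3f}, R^2 = {r2:.3f}")
```

With this contamination level, R² typically lands well below the ~0.84 the clean data would give.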

Formula

R² = 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)²
Cook's D_i = (e_i² / (p · MSE)) · (h_ii / (1 − h_ii)²)
h_ii = x_i'(X'X)⁻¹x_i (leverage; x_i is the i-th predictor row of X, e_i the i-th residual, p the number of coefficients, MSE the mean squared error)
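The formulas above translate directly into numpy. A small sketch on made-up data, where the last point is given an extreme x value and an off-trend y so that it dominates Cook's distance:

```python
import numpy as np

# Toy data: last point has extreme x (high leverage) and an off-trend y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 25.0])

X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept
n, p = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta                             # residuals e_i
mse = e @ e / (n - p)

# Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Cook's distance, term by term as in the formula above
cooks_d = (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
print(np.round(cooks_d, 3))
```

The extreme point's leverage exceeds 0.9, and its Cook's distance is far above the D > 1 rule of thumb.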

Trust, But Verify

Fitting a regression model is easy. Trusting its results requires checking that the model's assumptions are approximately satisfied and that no small set of observations is driving the conclusions. Regression diagnostics provide the tools for this verification. In biostatistics, where regression results may determine treatment guidelines affecting millions of patients, thorough diagnostics are not optional — they are essential to responsible analysis.

The Four Diagnostic Plots

The visualization displays four standard diagnostic panels. Top-left: residuals vs. fitted values (checking linearity and homoscedasticity). Top-right: normal Q-Q plot of standardized residuals (checking normality). Bottom-left: scale-location plot (√|standardized residuals| vs. fitted values, a clearer check for heteroscedasticity). Bottom-right: residuals vs. leverage with Cook's distance contours (identifying influential observations).
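The four panels can be sketched with matplotlib, computing each plotted quantity by hand on simulated data (statsmodels and R generate these plots automatically; the simulated dataset and seed here are illustrative assumptions):

```python
import numpy as np
from statistics import NormalDist
import matplotlib
matplotlib.use("Agg")                        # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 80
x = rng.uniform(0, 10, n)
y = 0.8 * x + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta
resid = y - fitted

# Standardized residuals need the leverage h_ii (hat-matrix diagonal)
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
p = X.shape[1]
mse = resid @ resid / (n - p)
std_resid = resid / np.sqrt(mse * (1 - h))

fig, ax = plt.subplots(2, 2, figsize=(9, 7))
ax[0, 0].scatter(fitted, resid, s=12)
ax[0, 0].axhline(0, ls="--")
ax[0, 0].set_title("Residuals vs Fitted")

# Normal Q-Q: sorted standardized residuals vs theoretical normal quantiles
theo = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
ax[0, 1].scatter(theo, np.sort(std_resid), s=12)
ax[0, 1].set_title("Normal Q-Q")

ax[1, 0].scatter(fitted, np.sqrt(np.abs(std_resid)), s=12)
ax[1, 0].set_title("Scale-Location")

ax[1, 1].scatter(h, std_resid, s=12)
ax[1, 1].set_title("Residuals vs Leverage")
fig.tight_layout()
```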

Outliers, Leverage, and Influence

Three related but distinct concepts determine how individual observations affect regression. Outliers have large residuals — they deviate from the pattern of other data. High-leverage points have unusual predictor values — they sit far from the center of the predictor space. Influential points actually change the regression results when removed — they combine outlier status with high leverage. Cook's distance elegantly combines these into a single measure of influence.
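The distinction can be demonstrated numerically: add one hypothetical point of each kind to clean data on the line y = x and watch which one actually moves the slope. The specific coordinates below are made up for illustration.

```python
import numpy as np

def fit_slope(x, y):
    """OLS slope (with intercept) via least squares."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

rng = np.random.default_rng(2)
x0 = rng.uniform(0, 10, 50)
y0 = x0 + rng.normal(0, 0.5, 50)             # clean data scattered around y = x
base = fit_slope(x0, y0)

# Three hypothetical added points, one per concept
cases = {
    "outlier (low leverage)":   (5.0, 15.0),   # central x, y far off the line
    "high leverage (on-trend)": (30.0, 30.0),  # extreme x, consistent with the line
    "influential":              (30.0, 10.0),  # extreme x AND far off the line
}
slopes = {}
for name, (xi, yi) in cases.items():
    slopes[name] = fit_slope(np.append(x0, xi), np.append(y0, yi))
    print(f"{name:26s}: slope {base:.2f} -> {slopes[name]:.2f}")
```

Only the third point, combining high leverage with a large residual, pulls the slope appreciably away from 1.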

What to Do When Assumptions Fail

When diagnostics reveal problems, the response depends on the violation. Heteroscedasticity can be addressed with weighted least squares or robust standard errors (HC estimators). Non-normality often matters less than feared (the central limit theorem protects inference for large samples). Nonlinearity suggests adding polynomial terms, splines, or using generalized additive models. Influential observations should be investigated substantively — not automatically deleted — as they may represent important subpopulations or data quality issues.

FAQ

Why are regression diagnostics important?

Regression results (coefficients, p-values, confidence intervals) are only valid when model assumptions hold: linearity, independence, homoscedasticity (constant variance), and normality of residuals. Diagnostics check these assumptions and reveal influential observations that may distort conclusions.

What is Cook's distance?

Cook's distance measures how much all fitted values change when a single observation is removed. It combines leverage (how unusual the predictor values are) with residual size. Observations with Cook's D > 0.5 deserve scrutiny; D > 1 is highly influential.
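The definition can be checked directly: delete each observation, refit, and measure the shift in all fitted values. A numpy sketch on simulated data with one planted outlier, where the brute-force computation should match the analytic formula:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 30)
y = 0.8 * x + rng.normal(0, 1.0, 30)
y[0] += 6.0                                  # plant one outlier

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
fitted = X @ beta
e = y - fitted
mse = e @ e / (n - p)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Analytic shortcut: no refitting needed
d_formula = (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)

# Brute force: delete observation i, refit, compare ALL fitted values
d_loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    d_loo[i] = np.sum((fitted - X @ beta_i) ** 2) / (p * mse)

print(np.allclose(d_formula, d_loo))
```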

What do residual plots reveal?

Residuals vs. fitted values should show a random scatter with constant spread. Patterns indicate violations: a funnel shape suggests heteroscedasticity, a curve suggests nonlinearity, and clusters suggest missing variables or subgroups. Q-Q plots check normality of residuals.
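The funnel-shape check has a crude numeric companion: correlate |residuals| with fitted values. On homoscedastic data the correlation hovers near zero; on funnel-shaped data it is clearly positive. A sketch on simulated data (the noise models are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 300)

def abs_resid_corr(y):
    """Correlation of |residuals| with fitted values; near 0 if spread is constant."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta
    resid = y - fitted
    return np.corrcoef(fitted, np.abs(resid))[0, 1]

y_const = 0.8 * x + rng.normal(0, 1.0, 300)              # constant spread
y_funnel = 0.8 * x + rng.normal(0, 0.2 + 0.4 * x, 300)   # spread grows with x

print(f"constant spread: {abs_resid_corr(y_const):+.2f}")
print(f"funnel shape:    {abs_resid_corr(y_funnel):+.2f}")
```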

What is leverage in regression?

Leverage measures how far an observation's predictor values are from the mean. High-leverage points have disproportionate influence on the fitted line. They are not necessarily outliers in the response, but if they are, their combination of high leverage and large residual makes them highly influential.
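For simple regression, "distance from the mean" is literal: the closed form h_ii = 1/n + (x_i − x̄)² / Σ(x_j − x̄)² matches the hat-matrix diagonal exactly. A numpy sketch with one made-up far-out point:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 40)
x[0] = 30.0                                  # one point far from the others

X = np.column_stack([np.ones_like(x), x])
h_hat = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

# Closed form: leverage is driven by squared distance from the mean of x
h_closed = 1 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

print(np.allclose(h_hat, h_closed), round(h_hat[0], 2))
```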

Embed

<iframe src="https://homo-deus.com/lab/biostatistics/regression-diagnostics/embed" width="100%" height="400" frameborder="0"></iframe>