Trust, But Verify
Fitting a regression model is easy. Trusting its results requires checking that the model's assumptions are approximately satisfied and that no small set of observations is driving the conclusions. Regression diagnostics provide the tools for this verification. In biostatistics, where regression results may determine treatment guidelines affecting millions of patients, thorough diagnostics are not optional — they are essential to responsible analysis.
The Four Diagnostic Plots
The visualization displays four standard diagnostic panels. Top-left: residuals vs. fitted values (checking linearity and homoscedasticity). Top-right: normal Q-Q plot of standardized residuals (checking normality). Bottom-left: scale-location plot (√|standardized residuals| vs. fitted values, a clearer check for heteroscedasticity). Bottom-right: residuals vs. leverage with Cook's distance contours (identifying influential observations).
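The quantities behind these four panels can all be computed directly from the fit. Below is a minimal numpy sketch on simulated data (the data and variable names are illustrative assumptions, not part of the original text): fitted values and residuals for the first panel, standardized residuals for the Q-Q plot, their square-root absolute values for the scale-location panel, and leverage plus Cook's distance for the fourth.

```python
import numpy as np

# Simulated data for illustration only.
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)

# Design matrix with intercept; fit OLS by least squares.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Leverage values: diagonal of the hat matrix X (X'X)^{-1} X'.
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

p = X.shape[1]                    # number of estimated coefficients
mse = resid @ resid / (n - p)     # residual variance estimate

# Per-panel quantities:
std_resid = resid / np.sqrt(mse * (1 - h))   # Q-Q plot, residual plots
scale_loc = np.sqrt(np.abs(std_resid))       # scale-location y-axis
cooks_d = std_resid**2 * h / (p * (1 - h))   # residuals-vs-leverage contours
```

Plotting each quantity against `fitted` (or against theoretical normal quantiles, for `std_resid`) reproduces the four standard panels.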
Outliers, Leverage, and Influence
Three related but distinct concepts determine how individual observations affect regression. Outliers have large residuals — they deviate from the pattern of the other data. High-leverage points have unusual predictor values — they sit far from the center of the predictor space. Influential points materially change the fitted coefficients when removed — they combine a large residual with high leverage. Cook's distance combines residual size and leverage into a single measure of influence.
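The distinction is easy to see in a constructed example. The sketch below (simulated data, assumed for illustration) plants one observation with an extreme predictor value and an off-line response; that point ends up with both the highest leverage and the largest Cook's distance, while ordinary points score low on both.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# 49 ordinary points near x = 0, plus one extreme predictor value.
x = np.append(rng.normal(0, 1, n - 1), 10.0)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)
y[-1] += 8.0   # shift the last response off the true line as well

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverage values

p = X.shape[1]
mse = resid @ resid / (n - p)
std_resid = resid / np.sqrt(mse * (1 - h))
cooks_d = std_resid**2 * h / (p * (1 - h))

# The last observation combines high leverage (extreme x) with a large
# residual (shifted y), so its Cook's distance dominates all others.
worst = int(cooks_d.argmax())
```

Note that a high-leverage point whose response falls on the fitted line would have large `h` but a small Cook's distance — leverage alone is not influence.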
What to Do When Assumptions Fail
When diagnostics reveal problems, the response depends on the violation. Heteroscedasticity can be addressed with weighted least squares or robust standard errors (heteroscedasticity-consistent, or HC, estimators). Non-normality often matters less than feared (the central limit theorem protects inference on coefficients in large samples). Nonlinearity suggests adding polynomial terms, splines, or using generalized additive models. Influential observations should be investigated substantively — not automatically deleted — as they may represent important subpopulations or data quality issues.
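As one concrete remedy, the HC sandwich estimator can be computed directly from the OLS fit. The sketch below (simulated heteroscedastic data; the HC3 variant is one common choice among the HC family) contrasts the classical covariance of the coefficients with the robust version, which remains valid when the error variance changes with the predictor.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(1, 5, n)
# Heteroscedastic errors: the spread of y grows with x.
y = 1.0 + 0.8 * x + rng.normal(0, 0.3 * x, n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverage values

# Classical covariance assumes constant error variance.
p = X.shape[1]
mse = resid @ resid / (n - p)
cov_classical = mse * XtX_inv

# HC3 sandwich estimator: per-observation squared residuals,
# inflated by leverage, replace the single pooled variance.
omega = resid**2 / (1 - h) ** 2
cov_hc3 = XtX_inv @ (X.T * omega) @ X @ XtX_inv

se_classical = np.sqrt(np.diag(cov_classical))
se_hc3 = np.sqrt(np.diag(cov_hc3))
```

The point estimates `beta` are unchanged; only the standard errors (and hence confidence intervals and p-values) are corrected.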