Outliers and Robust Regression: Linear Regression
🎯 Core Idea
Outliers are data points that deviate markedly from the pattern of the rest of the data. Ordinary Least Squares (OLS) minimizes squared residuals, so a single extreme point can pull the fitted line (or hyperplane) toward itself and distort parameter estimates. Robust regression replaces or downweights the squared-loss OLS objective with alternatives (e.g., Huber loss, L1, M-estimators) or uses algorithms (RANSAC, Theil–Sen) that are resistant to a small fraction of gross errors. The goal: produce parameter estimates that reflect the bulk of the data, not a few extreme observations.
🌱 Intuition & Real-World Analogy
- Why outliers hurt: OLS squares residuals, so errors are amplified quadratically. Think of OLS as a tug-of-war where each data point pulls the fit with force proportional to the square of its vertical distance — one heavyweight (outlier) dominates the result.
- Analogy 1 — “Group photo”: If everyone lines up but one person stands far forward, the camera’s automatic centering will shift to include them; robust methods “ignore” that person so the group center stays representative.
- Analogy 2 — “Noisy thermometer”: If one thermometer is broken and reads 1000°C, the average of thermometers is useless; using a median (or trimmed mean) or downweighting the broken instrument gives a meaningful estimate.
📐 Mathematical Foundation
1) OLS sensitivity (brief)
OLS solves
$$ \hat{\beta}_{OLS} = \arg\min_\beta \sum_{i=1}^n (y_i - x_i^\top \beta)^2 $$
Normal equations:
$$ X^\top X \hat\beta_{OLS} = X^\top y. $$
A single $y_j$ change affects $\hat\beta$ linearly via $(X^\top X)^{-1} x_j$: large leverage $x_j$ or large residual $y_j - x_j^\top\hat\beta$ ⇒ big influence.
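A quick numerical illustration of this sensitivity, as a minimal sketch on made-up synthetic data (the true line $y = 2 + 1.5x$ and the corrupted point are assumptions of the example): solving the normal equations with and without one corrupted $y$ value shows how far a single point can pull the OLS fit.

```python
# Minimal sketch, assuming made-up synthetic data with true line y = 2 + 1.5x:
# solve the normal equations, then corrupt a single y value and refit.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])           # design matrix with intercept
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)

def ols(X, y):
    # beta_hat solves X^T X beta = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

y_bad = y.copy()
y_bad[0] += 100.0                              # one gross y outlier

print("clean fit:   ", ols(X, y))              # roughly [2, 1.5]
print("with outlier:", ols(X, y_bad))          # intercept/slope pulled away
```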
2) Hat matrix, leverage, and influence
Fitted values: $\hat{y} = H y$ with $H = X (X^\top X)^{-1} X^\top$. Leverage of point $i$: $h_{ii}\in[0,1]$ (diagonal of $H$). A high $h_{ii}$ means $x_i$ is far from the bulk of the $X$-space.
Studentized residual:
$$ t_i = \frac{e_i}{\hat\sigma_{(i)}\sqrt{1-h_{ii}}}, $$
where $e_i = y_i - \hat{y}_i$ and $\hat\sigma_{(i)}$ is the residual standard deviation estimated with point $i$ left out.
Cook’s distance (measures overall influence on fitted values):
$$ D_i = \frac{(\hat{y}-\hat{y}_{(i)})^\top(\hat{y}-\hat{y}_{(i)})}{p\,\hat\sigma^2} = \frac{e_i^2}{p\,\hat\sigma^2}\cdot\frac{h_{ii}}{(1-h_{ii})^2}, $$
where $p$ is the number of parameters (including the intercept). Large $D_i$ ⇒ removing point $i$ substantially changes the fit.
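These diagnostics can be computed directly from their definitions. Below is a minimal NumPy sketch that mirrors the formulas above; the data, the planted outlier, and the variable names are illustrative assumptions.

```python
# Minimal NumPy sketch of leverage, studentized residuals, and Cook's distance.
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = rng.uniform(0, 10, n)
x[-1] = 25.0                                   # one high-leverage x value...
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.8 * x + rng.normal(0, 1, n)
y[-1] += 15.0                                  # ...whose y is also off

p = X.shape[1]                                 # number of parameters (incl. intercept)
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
h = np.diag(H)                                 # leverages h_ii
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta                               # residuals
sse = e @ e
sigma2 = sse / (n - p)                         # usual OLS variance estimate

# externally studentized residuals: sigma estimated without point i
sigma2_loo = (sse - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(sigma2_loo * (1 - h))

# Cook's distance
D = (e**2 / (p * sigma2)) * (h / (1 - h) ** 2)

worst = np.argsort(D)[::-1][:3]                # three most influential points
print("indices:", worst)
print("leverage:", h[worst].round(3), "studentized resid:", t[worst].round(2))
```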
3) Robust regression objectives
General M-estimator:
$$ \hat\beta = \arg\min_\beta \sum_{i=1}^n \rho\big(e_i(\beta)\big),\qquad e_i(\beta)=y_i-x_i^\top\beta, $$
where $\rho$ grows slower than quadratic for large $|e|$.
- Huber loss (quadratic near 0, linear in tails):
$$ \rho_\delta(e) = \begin{cases} \tfrac{1}{2}e^2, & |e| \le \delta \\ \delta\big(|e| - \tfrac{\delta}{2}\big), & |e| > \delta \end{cases} $$
with tuning constant $\delta$. Equivalent to weighted least squares via iteratively reweighted least squares (IRLS); see the IRLS sketch after this list.
- LAD / L1 regression:
$$ \hat\beta_{L1} = \arg\min_\beta \sum_{i=1}^n |y_i - x_i^\top\beta|. $$
Median-based; resistant to vertical ($y$) outliers, but less efficient under Gaussian noise.
- Tukey’s bisquare (redescending ρ): completely downweights very large residuals; more aggressive than Huber but non-convex and can be harder to optimize.
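As a concrete illustration of the IRLS route mentioned above, here is a minimal sketch of a Huber M-estimator fit by iteratively reweighted least squares. The data are synthetic, and the choices of $\delta = 1.345$ and a MAD-based residual scale are conventional assumptions, not requirements; in practice one would typically reach for sklearn.linear_model.HuberRegressor or statsmodels' RLM.

```python
# Minimal IRLS sketch for the Huber M-estimator; not a production implementation.
import numpy as np

def huber_irls(X, y, delta=1.345, n_iter=50, tol=1e-8):
    beta = np.linalg.solve(X.T @ X, X.T @ y)                  # start from OLS
    for _ in range(n_iter):
        e = y - X @ beta
        scale = 1.4826 * np.median(np.abs(e - np.median(e)))  # robust scale (MAD)
        r = e / max(scale, 1e-12)                             # standardized residuals
        w = np.minimum(1.0, delta / np.maximum(np.abs(r), 1e-12))  # Huber weights
        XtW = X.T * w                                         # X^T W (W diagonal)
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)          # weighted LS step
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 60)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 1.5 * x + rng.normal(0, 1, 60)
y[:5] += 40.0                                                 # five gross outliers
print("Huber IRLS:", huber_irls(X, y))                        # close to [2, 1.5]
print("OLS       :", np.linalg.solve(X.T @ X, X.T @ y))       # pulled upward
```

Each iteration is just a weighted least-squares solve, which is why convex M-estimators like Huber are cheap and stable to fit.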
4) RANSAC (Random Sample Consensus)
Algorithmic, not an explicit loss minimization:
- Repeatedly sample minimal subset of points, fit model, count inliers within threshold, keep model with most inliers.
- Good for high-breakdown cases (a large fraction of the data may be outliers), but requires an inlier threshold and is randomized; a minimal sketch follows below.
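A minimal sketch of the loop just described, for a simple line fit. The threshold, iteration count, and helper name ransac_line are illustrative choices on synthetic data; sklearn.linear_model.RANSACRegressor is the usual off-the-shelf option.

```python
# Minimal RANSAC sketch: sample two points, fit a candidate line, count inliers,
# keep the best consensus set, then refit OLS on the inliers.
import numpy as np

def ransac_line(x, y, n_iter=200, threshold=2.0, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(n_iter):
        i, j = rng.choice(len(x), size=2, replace=False)  # minimal subset
        if x[i] == x[j]:
            continue
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        resid = np.abs(y - (intercept + slope * x))
        inliers = resid < threshold
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # refit by OLS on the consensus (inlier) set
    X_in = np.column_stack([np.ones(best_inliers.sum()), x[best_inliers]])
    return np.linalg.solve(X_in.T @ X_in, X_in.T @ y[best_inliers])

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5, 100)
y[:30] = rng.uniform(-20, 20, 30)                         # ~30% gross outliers
print(ransac_line(x, y))                                  # roughly [2, 1.5]
```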
Derivation: Why leverage matters (sketch)
Change $y_i$ by $\delta$. Fitted change:
$$ \Delta\hat{y} = H\,\Delta y \quad\Rightarrow\quad \Delta\hat{y}_j = h_{ji}\,\delta. $$
So the impact on fitted values is scaled by hat-matrix entries; in particular, $\Delta\hat{y}_i = h_{ii}\delta$, and the gap between the fit with and without point $i$ scales as $h_{ii}/(1-h_{ii})$ (the factor appearing in Cook's distance). Hence large leverage amplifies the influence of a change in $y_i$.
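A quick numerical check of this sketch (NumPy only, synthetic data): perturbing a single $y$ value changes the fitted values by exactly $H\,\Delta y$.

```python
# Perturb one y value by delta and confirm the fitted values change by H @ delta_y.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 20)
X = np.column_stack([np.ones(20), x])
y = 1.0 + 2.0 * x + rng.normal(0, 1, 20)
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix

dy = np.zeros(20)
dy[7] = 5.0                                    # change y_7 by delta = 5
delta_yhat = H @ (y + dy) - H @ y              # actual change in fitted values
print(np.allclose(delta_yhat, H @ dy))         # True: change = H @ delta_y
print(delta_yhat[7], H[7, 7] * 5.0)            # change at point 7 = h_77 * delta
```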
⚖️ Strengths, Limitations & Trade-offs
OLS (baseline)
- Strengths: simple, closed-form, efficient under Gauss-Markov assumptions (linear, homoscedastic, uncorrelated errors).
- Limitations: extremely sensitive to outliers (squared loss), not robust to heavy tails or gross errors.
L1 / LAD
- Strengths: high breakdown for vertical outliers; median-like robustness.
- Limitations: less efficient with Gaussian noise; solutions may not be unique; not robust to high-leverage $x$ outliers; no closed form (fit via linear programming).
M-estimators (Huber, Tukey)
- Strengths: balance between efficiency and robustness; Huber is convex (good optimization properties); IRLS available.
- Limitations: require tuning constant (δ); redescending functions (Tukey) are nonconvex and can be sensitive to initialization.
RANSAC
- Strengths: can tolerate many outliers; good when outliers are arbitrary and concentrated.
- Limitations: needs an inlier threshold and sufficient iterations; non-deterministic; designed for gross outliers, so it handles mild, continuous contamination poorly.
Theil–Sen
- Strengths: robust slope estimator (median of slopes), good for simple regression, high-breakdown.
- Limitations: computationally heavier for large n; mainly for simple linear regression.
Trade-off summary: robustness vs statistical efficiency under the assumed noise model, and computational cost vs ease of optimization.
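To make the trade-off concrete, here is a small side-by-side sketch on contaminated synthetic data using scikit-learn's implementations; the 20% contamination level and the seed are arbitrary assumptions, and exact numbers will vary.

```python
# Compare OLS against three robust estimators on data with gross vertical outliers.
import numpy as np
from sklearn.linear_model import (LinearRegression, HuberRegressor,
                                  RANSACRegressor, TheilSenRegressor)

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 200)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 200)
y[:40] += rng.uniform(20, 60, 40)              # 20% gross vertical outliers
X = x.reshape(-1, 1)

for name, model in [("OLS", LinearRegression()),
                    ("Huber", HuberRegressor()),
                    ("RANSAC", RANSACRegressor(random_state=0)),
                    ("Theil-Sen", TheilSenRegressor(random_state=0))]:
    model.fit(X, y)
    fitted = model.estimator_ if name == "RANSAC" else model  # RANSAC wraps an inner model
    print(f"{name:10s} slope={fitted.coef_[0]:6.2f} intercept={fitted.intercept_:6.2f}")
```

Under this kind of heavy vertical contamination the OLS coefficients drift noticeably while the robust fits stay near the true line; under clean Gaussian noise, OLS would be the most efficient of the four.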
🔍 Variants & Extensions
- Weighted Least Squares (WLS): downweight suspected outliers with precomputed weights.
- IRLS: iterative algorithm used to fit many M-estimators — at each iteration, observations are reweighted according to the weight implied by the influence function of their residuals.
- High-breakdown methods: S-estimators, MM-estimators (combine high-breakdown and high-efficiency).
- Redescending M-estimators: Tukey bisquare — more aggressive rejection of large residuals.
- Robust covariance estimators: use robust scale estimates (e.g., MAD) to compute studentized residuals; see the MAD sketch after this list.
- RANSAC variants: PROSAC, MLESAC — variations to improve speed or likelihood-based scoring.
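For the robust-scale idea above, a tiny sketch of the MAD-based scale estimate; the 1.4826 factor makes the MAD consistent with the standard deviation under Gaussian noise, and it can stand in for $\hat\sigma$ when studentizing residuals. The data here are made up.

```python
# MAD-based robust scale vs. the sample standard deviation on contaminated residuals.
import numpy as np

def mad_scale(residuals):
    med = np.median(residuals)
    return 1.4826 * np.median(np.abs(residuals - med))

rng = np.random.default_rng(6)
e = np.concatenate([rng.normal(0, 1, 100), [50.0, -40.0]])   # two gross outliers
print("sample std:", e.std())        # inflated by the outliers
print("MAD scale :", mad_scale(e))   # stays near 1
```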
🚧 Common Challenges & Pitfalls
- Masking and swamping
  - Masking: multiple outliers hide each other, making them look normal to simple diagnostics.
  - Swamping: valid points appear as outliers because of other outliers’ influence. Detect using robust residuals and high-breakdown methods.
- Confusing leverage and residual size
  - High-leverage points (extreme X) may have small residuals but still be influential. Use both leverage and residual-based measures (Cook’s D, DFBETAS).
- Automatic removal of data
  - Don’t delete points mechanically. Investigate: measurement error? Data entry? Legitimate rare event? Model mismatch?
- Improper tuning
  - Huber δ, RANSAC thresholds, or Tukey constants matter. Wrong choices either under-reject or over-reject.
- Nonconvex optimization
  - Some robust losses are nonconvex; optimization can get stuck. Start with convex M-estimators (Huber) before trying redescending ones.
- Overconfidence after robustifying
  - Robust fit reduces bias from outliers but you still need correct inference: standard errors may require robust variance estimation.
- Using robust methods to hide model misspecification
  - Outliers can signal model misspecification (missing covariates, nonlinearity). Robust methods are not a substitute for revisiting model form.
🔁 Probing Question — Answer (short, interview-ready)
Q: If your regression is heavily skewed by a single point, how would you detect and address it?
Detect
- Plot residuals vs fitted values — an extreme vertical deviation is a red flag.
- Compute leverage $h_{ii}$ (hat values): a large $h_{ii}$ indicates an X outlier.
- Compute studentized/externally studentized residuals $t_i$: large magnitude (e.g., $|t_i|>3$) indicates an outlying $y$ given $X$.
- Compute influence measures: Cook’s distance $D_i$ and DFBETAS to see the effect on coefficients.
  - Rule of thumb: $D_i > 4/n$ or $D_i$ substantially larger than the others → examine the point.
- Use robust diagnostics (robust residuals or high-breakdown methods) to detect masking; a short diagnostic sketch follows below.
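A practical sketch of this checklist using statsmodels' influence diagnostics, on synthetic data with one planted bad point; the cutoffs $|t_i|>3$ and $D_i > 4/n$ are the rules of thumb above, not universal constants.

```python
# Flag suspicious points with leverage, studentized residuals, Cook's D, and DFBETAS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 50)
x[0], y[0] = 30.0, 10.0                     # plant one high-leverage, high-residual point

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
infl = res.get_influence()

h = infl.hat_matrix_diag                    # leverage
t = infl.resid_studentized_external         # externally studentized residuals
D = infl.cooks_distance[0]                  # Cook's distance
dfb = infl.dfbetas                          # per-coefficient influence

suspects = np.where((np.abs(t) > 3) | (D > 4 / len(y)))[0]
print("suspect indices:", suspects)
print("their DFBETAS:\n", dfb[suspects])
```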
Address (ordered, interview-friendly strategy)
- Investigate — Is it a data error? If so, correct or remove with justification. Document the rationale.
- Re-fit with robust methods:
  - Start with a Huber M-estimator (convex, stable) or LAD if you suspect symmetric heavy tails.
  - For extreme contamination or when many outliers exist, consider RANSAC or Theil–Sen (for a simple slope).
- Compare fits: OLS vs robust estimate vs fit without the point (leave-one-out). Use DFBETAS to see which coefficients are affected.
- If leverage is the problem:
  - Consider transforming X (standardize, winsorize), or add richer model terms (interaction, nonlinearity) if model misspecification explains the leverage.
- If the outlier is legitimate but rare:
  - Model it explicitly (mixture model; heavy-tailed error like Student-t) or report both robust and OLS results.
- Report: Be transparent in the write-up: show diagnostics, justify actions, and provide a sensitivity analysis (how conclusions change if you remove or downweight the point); a small sketch of such a comparison follows below.
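A small sketch of the compare/sensitivity step on synthetic data, where index 0 plays the role of the flagged point: fit OLS on all points, OLS without the suspect, and a Huber fit, then report how the coefficients move.

```python
# Sensitivity comparison: OLS (all), OLS (leave-one-out), and a robust Huber fit.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 50)
x[0], y[0] = 12.0, 80.0                      # the suspect point
X = x.reshape(-1, 1)

fits = [("OLS (all points)",   LinearRegression().fit(X, y)),
        ("OLS (drop suspect)", LinearRegression().fit(np.delete(X, 0, axis=0),
                                                      np.delete(y, 0))),
        ("Huber (all points)", HuberRegressor().fit(X, y))]

for name, m in fits:
    print(f"{name:20s} slope={m.coef_[0]:5.2f} intercept={m.intercept_:6.2f}")
```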
📚 Reference Pointers
- Huber, P. J. (1964). Robust Estimation of a Location Parameter. Annals of Mathematical Statistics. (classic on M-estimators).
- Tukey, J. W. (1977). Exploratory Data Analysis. (intuition for robust methods).
- Fischler, M. A. & Bolles, R. C. (1981). Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. (RANSAC original).
- Wikipedia — Huber loss: https://en.wikipedia.org/wiki/Huber_loss.
- Wikipedia — RANSAC: https://en.wikipedia.org/wiki/RANSAC.
- Cook, R. D. (1977). Detection of Influential Observations in Linear Regression. (Cook’s distance).
- For robust regression theory & algorithms: Wikipedia — Robust regression: https://en.wikipedia.org/wiki/Robust_regression.
- For general probabilistic assumptions and modeling context: Koller & Friedman, Probabilistic Graphical Models (MIT Press). https://mitpress.mit.edu/9780262013192/probabilistic-graphical-models