Outliers and Robust Regression: Linear Regression

7 min read · 1308 words

🎯 Core Idea

Outliers are data points that deviate markedly from the pattern of the rest of the data. Ordinary Least Squares (OLS) minimizes squared residuals, so a single extreme point can pull the fitted line (or hyperplane) toward itself and distort parameter estimates. Robust regression replaces or downweights the squared-loss OLS objective with alternatives (e.g., Huber loss, L1, M-estimators) or uses algorithms (RANSAC, Theil–Sen) that are resistant to a small fraction of gross errors. The goal: produce parameter estimates that reflect the bulk of the data, not a few extreme observations.


🌱 Intuition & Real-World Analogy

  • Why outliers hurt: OLS squares residuals, so errors are amplified quadratically. Think of OLS as a tug-of-war where each data point pulls the fit with force proportional to the square of its vertical distance — one heavyweight (outlier) dominates the result.
  • Analogy 1 — “Group photo”: If everyone lines up but one person stands far forward, the camera’s automatic centering will shift to include them; robust methods “ignore” that person so the group center stays representative.
  • Analogy 2 — “Noisy thermometer”: If one thermometer is broken and reads 1000°C, the average of thermometers is useless; using a median (or trimmed mean) or downweighting the broken instrument gives a meaningful estimate.

📐 Mathematical Foundation

1) OLS sensitivity (brief)

OLS solves

$$ \hat{\beta}_{OLS} = \arg\min_\beta \sum_{i=1}^n (y_i - x_i^\top \beta)^2 $$

Normal equations:

$$ X^\top X \hat\beta_{OLS} = X^\top y. $$

A change in a single $y_j$ shifts $\hat\beta$ linearly through $(X^\top X)^{-1} x_j$: a point with large leverage $x_j$ or a large residual $y_j - x_j^\top\hat\beta$ exerts a correspondingly large influence on the fit.
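
As a concrete illustration, here is a minimal sketch on synthetic data (all values are assumptions made for the example): corrupting a single response visibly shifts the OLS coefficients.

```python
# Minimal sketch on synthetic data: one corrupted response shifts the OLS fit.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=50)

X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
beta_clean, *_ = np.linalg.lstsq(X, y, rcond=None)

y_out = y.copy()
y_out[-1] += 30.0                                # corrupt a single response value
beta_out, *_ = np.linalg.lstsq(X, y_out, rcond=None)

print("clean fit:   ", np.round(beta_clean, 3))  # roughly [2.0, 0.5]
print("with outlier:", np.round(beta_out, 3))    # intercept and slope pulled toward the outlier
```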

2) Hat matrix, leverage, and influence

Fitted values: $\hat{y} = H y$ with $H = X (X^\top X)^{-1} X^\top$. Leverage of point $i$: $h_{ii}\in[0,1]$ (diagonal of $H$). A high $h_{ii}$ means $x_i$ is far from the bulk of the $X$-space.

Studentized residual:

$$ t_i = \frac{e_i}{\hat\sigma_{(i)}\sqrt{1-h_{ii}}}, $$

where $e_i = y_i - \hat{y}_i$ and $\hat\sigma_{(i)}$ is the residual scale estimated without point $i$.

Cook’s distance (measures overall influence on fitted values):

$$ D_i = \frac{(\hat{y}-\hat{y}_{(i)})^\top(\hat{y}-\hat{y}_{(i)})}{p\,\hat\sigma^2} = \frac{e_i^2}{p\,\hat\sigma^2}\cdot\frac{h_{ii}}{(1-h_{ii})^2}, $$

where $p$ is the number of parameters (including the intercept). Large $D_i$ ⇒ removing point $i$ substantially changes the fit.
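
These diagnostics are available directly in statsmodels; the snippet below is a hedged sketch on synthetic data (the $4/n$ cutoff is only a rule of thumb).

```python
# Sketch: leverage, externally studentized residuals, and Cook's distance via statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=50)
y[-1] += 30.0                                  # plant one outlier

X = sm.add_constant(x)                         # adds the intercept column
fit = sm.OLS(y, X).fit()
infl = fit.get_influence()

leverage = infl.hat_matrix_diag                # h_ii
stud_res = infl.resid_studentized_external     # t_i (scale estimated without point i)
cooks_d, _ = infl.cooks_distance               # D_i

suspects = np.where(cooks_d > 4 / len(y))[0]   # common rule-of-thumb cutoff
print("flagged points:", suspects)
```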

3) Robust regression objectives

General M-estimator:

$$ \hat\beta = \arg\min_\beta \sum_{i=1}^n \rho\big(e_i(\beta)\big),\qquad e_i(\beta)=y_i-x_i^\top\beta, $$

where $\rho$ grows slower than quadratic for large $|e|$.

  • Huber loss (quadratic near 0, linear in tails):
$$ \rho_H(e)= \begin{cases} \frac{1}{2} e^2 &\text{if }|e|\le \delta,\\[4pt] \delta(|e|-\tfrac{1}{2}\delta) &\text{if }|e|>\delta, \end{cases} $$

with tuning constant $\delta$. In practice it is fit by iteratively reweighted least squares (IRLS), which solves a sequence of weighted least-squares problems; a code sketch follows this list.

  • LAD / L1 regression:
$$ \hat\beta_{L1} = \arg\min_\beta \sum_i |e_i|. $$

Median-based; high breakdown with respect to vertical ($y$) outliers, but not robust to high-leverage $x$ points, and less efficient under Gaussian noise.

  • Tukey’s bisquare (redescending ρ): gives zero weight to sufficiently large residuals; more aggressive than Huber but non-convex and can be harder to optimize.
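
The sketch below fits the losses above on the same kind of synthetic outlier example; it assumes statsmodels (RLM fits M-estimators by IRLS; QuantReg with q=0.5 gives the LAD fit).

```python
# Sketch: OLS vs Huber, Tukey bisquare, and LAD fits on data with one planted outlier.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=50)
y[-1] += 30.0
X = sm.add_constant(x)

ols   = sm.OLS(y, X).fit()
huber = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()         # Huber M-estimator (IRLS)
tukey = sm.RLM(y, X, M=sm.robust.norms.TukeyBiweight()).fit()  # redescending bisquare
lad   = sm.QuantReg(y, X).fit(q=0.5)                           # L1 / median regression

for name, res in [("OLS", ols), ("Huber", huber), ("Tukey", tukey), ("LAD", lad)]:
    print(f"{name:6s} coefficients: {np.round(res.params, 3)}")
```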

4) RANSAC (Random Sample Consensus)

Algorithmic, not an explicit loss minimization:

  • Repeatedly sample a minimal subset of points, fit the model, count the inliers within a threshold, and keep the model with the most inliers.
  • Good for high-breakdown settings (a large fraction of the data may be outliers), but it requires an inlier threshold and is randomized.
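
A hedged scikit-learn sketch (the residual threshold and the planted contamination are assumptions made for the example):

```python
# Sketch: RANSAC line fit with 20% gross outliers planted in the data.
import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2.0 + 0.5 * x.ravel() + rng.normal(scale=0.3, size=100)
y[:20] += rng.uniform(10, 20, size=20)          # 20% gross outliers

ransac = RANSACRegressor(residual_threshold=1.0, random_state=0)  # default base model: LinearRegression
ransac.fit(x, y)

print("slope:", np.round(ransac.estimator_.coef_[0], 3))
print("inlier fraction:", ransac.inlier_mask_.mean())
```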

Derivation: Why leverage matters (sketch)

Change $y_i$ by $\delta$. Fitted change:

$$ \Delta\hat{y} = H \Delta y \quad\Rightarrow\quad \Delta\hat{y}_j = h_{j i}\delta. $$

So the impact on fitted values is scaled by hat-matrix entries; in particular, $\Delta\hat{y}_i = h_{ii}\,\delta/(1-h_{ii})$ (when accounting for refitting). Hence large leverage amplifies the influence of a change in $y_i$.
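
A quick numerical check of the first relation, $\Delta\hat{y} = H\,\Delta y$, on illustrative data:

```python
# Sketch: perturb one response and confirm fitted values move by H[:, j] * delta.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=20)

H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix

j, delta = 5, 3.0
y_perturbed = y.copy()
y_perturbed[j] += delta                         # change a single response

print(np.allclose(H @ y_perturbed - H @ y, H[:, j] * delta))  # True
```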


⚖️ Strengths, Limitations & Trade-offs

OLS (baseline)

  • Strengths: simple, closed-form, efficient under Gauss-Markov assumptions (linear, homoscedastic, uncorrelated errors).
  • Limitations: extremely sensitive to outliers (squared loss), not robust to heavy tails or gross errors.

L1 / LAD

  • Strengths: high breakdown for vertical outliers; median-like robustness.
  • Limitations: less efficient under Gaussian noise; solutions may not be unique; no closed form (optimization via linear programming).

M-estimators (Huber, Tukey)

  • Strengths: balance between efficiency and robustness; Huber is convex (good optimization properties); IRLS available.
  • Limitations: require tuning constant (δ); redescending functions (Tukey) are nonconvex and can be sensitive to initialization.

RANSAC

  • Strengths: can tolerate many outliers; good when outliers are arbitrary and concentrated.
  • Limitations: needs an inlier threshold and enough iterations; non-deterministic; poorly suited to mild, continuous contamination (it is designed for gross outliers).

Theil–Sen

  • Strengths: robust slope estimator (median of slopes), good for simple regression, high-breakdown.
  • Limitations: computationally heavier for large n; mainly for simple linear regression.

Trade-off summary: robustness vs statistical efficiency under the assumed noise model, and computational cost vs ease of optimization.


🔍 Variants & Extensions

  • Weighted Least Squares (WLS): downweight suspected outliers with precomputed weights.
  • IRLS: the iterative algorithm used to fit many M-estimators; observations are reweighted at each iteration based on the influence function (a minimal sketch follows this list).
  • High-breakdown methods: S-estimators, MM-estimators (combine high-breakdown and high-efficiency).
  • Redescending M-estimators: Tukey bisquare — more aggressive rejection of large residuals.
  • Robust covariance estimators: use robust scale estimates (e.g., MAD) to compute studentized residuals.
  • RANSAC variants: PROSAC, MLESAC — variations to improve speed or likelihood-based scoring.
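
Below is a minimal IRLS sketch for the Huber M-estimator (the tuning constant and the MAD-based scale are common defaults assumed here; a library routine such as statsmodels RLM is preferable in practice).

```python
# Sketch: iteratively reweighted least squares with Huber weights.
import numpy as np

def huber_irls(X, y, delta=1.345, n_iter=50, tol=1e-8):
    """IRLS for the Huber M-estimator; delta is the usual 95%-efficiency constant."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]                 # OLS starting point
    for _ in range(n_iter):
        r = y - X @ beta                                        # residuals
        scale = np.median(np.abs(r - np.median(r))) / 0.6745    # robust (MAD) scale
        u = np.abs(r) / max(scale, 1e-12)
        w = np.minimum(1.0, delta / np.maximum(u, 1e-12))       # Huber weights psi(u)/u
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta

# usage on the synthetic outlier example from earlier sections
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=50)
y[-1] += 30.0
X = np.column_stack([np.ones_like(x), x])
print(np.round(huber_irls(X, y), 3))            # close to [2.0, 0.5] despite the outlier
```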

🚧 Common Challenges & Pitfalls

  1. Masking and swamping

    • Masking: multiple outliers hide each other, making them look normal to simple diagnostics.
    • Swamping: valid points appear as outliers because of other outliers’ influence. Detect using robust residuals and high-breakdown methods.
  2. Confusing leverage and residual size

    • High-leverage points (extreme X) may have small residuals but still be influential. Use both leverage and residual-based measures (Cook’s D, DFBETAS).
  3. Automatic removal of data

    • Don’t delete points mechanically. Investigate: measurement error? Data entry? Legitimate rare event? Model mismatch?
  4. Improper tuning

    • Huber δ, RANSAC thresholds, or Tukey constants matter. Wrong choices either under-reject or over-reject.
  5. Nonconvex optimization

    • Some robust losses are nonconvex; optimization can get stuck. Start with convex M-estimators (Huber) before trying redescending ones.
  6. Overconfidence after robustifying

    • A robust fit reduces bias from outliers, but you still need valid inference: standard errors may require robust variance estimation.
  7. Using robust methods to hide model misspecification

    • Outliers can signal model misspecification (missing covariates, nonlinearity). Robust methods are not a substitute for revisiting model form.

🔁 Probing Question — Answer (short, interview-ready)

Q: If your regression is heavily skewed by a single point, how would you detect and address it?

Detect

  1. Plot residuals vs fitted values — an extreme vertical deviation is a red flag.

  2. Compute leverage $h_{ii}$ (hat values): a large $h_{ii}$ indicates an $X$-space outlier.

  3. Compute (externally) studentized residuals $t_i$: a large magnitude (e.g., $|t_i|>3$) indicates an outlying $y$ given $X$.

  4. Compute influence measures: Cook’s distance $D_i$ and DFBETAS to see effect on coefficients.

    • Rule of thumb: $D_i > 4/n$ or $D_i$ substantially larger than others → examine point.
  5. Use robust diagnostics (robust residuals or high-breakdown methods) to detect masking.
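
The detection steps consolidate into a short snippet; the sketch below assumes statsmodels, and the data, planted outlier, and cutoffs are illustrative rules of thumb.

```python
# Sketch: flag candidates using leverage, studentized residuals, Cook's D, and DFBETAS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 1.0 + 0.8 * x + rng.normal(scale=0.4, size=40)
y[10] += 15.0                                   # single point skewing the fit

X = sm.add_constant(x)
infl = sm.OLS(y, X).fit().get_influence()

h = infl.hat_matrix_diag                        # step 2: leverage
t = infl.resid_studentized_external             # step 3: studentized residuals
d, _ = infl.cooks_distance                      # step 4: Cook's distance
dfb = infl.dfbetas                              # step 4: per-coefficient influence

n, p = X.shape
flag = (
    (h > 2 * p / n)                             # high leverage
    | (np.abs(t) > 3)                           # large studentized residual
    | (d > 4 / n)                               # large Cook's distance
    | (np.abs(dfb) > 2 / np.sqrt(n)).any(axis=1)  # coefficient-level influence
)
print("candidate outliers:", np.where(flag)[0])
```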

Address (ordered, interview-friendly strategy)

  1. Investigate — Is it a data error? If so, correct or remove with justification. Document the rationale.

  2. Re-fit with robust methods:

    • Start with Huber M-estimator (convex, stable) or LAD if you suspect symmetric heavy tails.
    • For extreme contamination or when many outliers exist, consider RANSAC or Theil–Sen (for simple slope).
  3. Compare fits: OLS vs robust estimate vs fit without the point (leave-one-out). Use DFBETAS to see which coefficients are affected.

  4. If leverage is the problem:

    • Consider transforming $X$ (standardizing, winsorizing) or adding richer model terms (interactions, nonlinearity) if model misspecification explains the leverage.
  5. If the outlier is legitimate but rare:

    • Model it explicitly (mixture model; heavy-tailed error like Student-t) or report both robust and OLS results.
  6. Report: be transparent in the write-up: show the diagnostics, justify your actions, and provide a sensitivity analysis (how conclusions change if you remove or downweight the point).

