Outliers and Robust Regression: Linear Regression
🎯 Core Idea
Outliers are data points that deviate markedly from the pattern of the rest of the data. Ordinary Least Squares (OLS) minimizes squared residuals, so a single extreme point can pull the fitted line (or hyperplane) toward itself and distort parameter estimates. Robust regression replaces or downweights the squared-loss OLS objective with alternatives (e.g., Huber loss, L1, M-estimators) or uses algorithms (RANSAC, Theil–Sen) that are resistant to a small fraction of gross errors. The goal: produce parameter estimates that reflect the bulk of the data, not a few extreme observations.
🌱 Intuition & Real-World Analogy
- Why outliers hurt: OLS squares residuals, so errors are amplified quadratically. Think of OLS as a tug-of-war where each data point pulls the fit with force proportional to the square of its vertical distance — one heavyweight (outlier) dominates the result.
- Analogy 1 — “Group photo”: If everyone lines up but one person stands far forward, the camera’s automatic centering will shift to include them; robust methods “ignore” that person so the group center stays representative.
- Analogy 2 — “Noisy thermometer”: If one thermometer is broken and reads 1000°C, the average of thermometers is useless; using a median (or trimmed mean) or downweighting the broken instrument gives a meaningful estimate.
📐 Mathematical Foundation
1) OLS sensitivity (brief)
OLS solves
$$ \hat{\beta}_{OLS} = \arg\min_\beta \sum_{i=1}^n (y_i - x_i^\top \beta)^2 $$
Normal equations:
$$ X^\top X \hat\beta_{OLS} = X^\top y. $$
A single $y_j$ change affects $\hat\beta$ linearly via $(X^\top X)^{-1} x_j$: large leverage $x_j$ or large residual $y_j - x_j^\top\hat\beta$ ⇒ big influence.
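A quick numerical illustration of this sensitivity, as a minimal sketch on made-up synthetic data (the true line $y = 2 + 1.5x$ and the corrupted point are assumptions of the example): solving the normal equations with and without one corrupted $y$ value shows how far a single point can pull the OLS fit.

```python
# Minimal sketch, assuming made-up synthetic data with true line y = 2 + 1.5x:
# solve the normal equations, then corrupt a single y value and refit.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])           # design matrix with intercept
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)

def ols(X, y):
    # beta_hat solves X^T X beta = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

y_bad = y.copy()
y_bad[0] += 100.0                              # one gross y outlier

print("clean fit:   ", ols(X, y))              # roughly [2, 1.5]
print("with outlier:", ols(X, y_bad))          # intercept/slope pulled away
```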
2) Hat matrix, leverage, and influence
Fitted values: $\hat{y} = H y$ with $H = X (X^\top X)^{-1} X^\top$. Leverage of point $i$: $h_{ii}\in[0,1]$ (diagonal of $H$). A high $h_{ii}$ means $x_i$ is far from the bulk of the $X$-space.
Studentized residual:
$$ t_i = \frac{e_i}{\hat\sigma_{(i)}\sqrt{1-h_{ii}}}, $$
where $e_i = y_i - \hat{y}_i$ and $\hat\sigma_{(i)}$ is the residual standard deviation estimated with point $i$ left out.
Cook’s distance (measures overall influence on fitted values):
$$ D_i = \frac{(\hat{y}-\hat{y}_{(i)})^\top(\hat{y}-\hat{y}_{(i)})}{p\,\hat\sigma^2} = \frac{e_i^2}{p\,\hat\sigma^2}\cdot\frac{h_{ii}}{(1-h_{ii})^2}, $$
where $p$ is the number of parameters (including the intercept). Large $D_i$ ⇒ removing point $i$ substantially changes the fit.
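These diagnostics can be computed directly from their definitions. Below is a minimal NumPy sketch that mirrors the formulas above; the data, the planted outlier, and the variable names are illustrative assumptions.

```python
# Minimal NumPy sketch of leverage, studentized residuals, and Cook's distance.
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = rng.uniform(0, 10, n)
x[-1] = 25.0                                   # one high-leverage x value...
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.8 * x + rng.normal(0, 1, n)
y[-1] += 15.0                                  # ...whose y is also off

p = X.shape[1]                                 # number of parameters (incl. intercept)
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
h = np.diag(H)                                 # leverages h_ii
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta                               # residuals
sse = e @ e
sigma2 = sse / (n - p)                         # usual OLS variance estimate

# externally studentized residuals: sigma estimated without point i
sigma2_loo = (sse - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(sigma2_loo * (1 - h))

# Cook's distance
D = (e**2 / (p * sigma2)) * (h / (1 - h) ** 2)

worst = np.argsort(D)[::-1][:3]                # three most influential points
print("indices:", worst)
print("leverage:", h[worst].round(3), "studentized resid:", t[worst].round(2))
```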
3) Robust regression objectives
General M-estimator:
$$ \hat\beta = \arg\min_\beta \sum_{i=1}^n \rho\big(e_i(\beta)\big),\qquad e_i(\beta)=y_i-x_i^\top\beta, $$
where $\rho$ grows slower than quadratic for large $|e|$.
- Huber loss (quadratic near 0, linear in tails):
$$ \rho_\delta(e) = \begin{cases} \tfrac{1}{2}e^2, & |e| \le \delta \\ \delta\big(|e| - \tfrac{\delta}{2}\big), & |e| > \delta \end{cases} $$
with tuning constant $\delta$. Equivalent to weighted least squares via iteratively reweighted least squares (IRLS); see the IRLS sketch after this list.
- LAD / L1 regression:
$$ \hat\beta_{L1} = \arg\min_\beta \sum_{i=1}^n |y_i - x_i^\top\beta|. $$
Median-based; resistant to vertical ($y$) outliers, but less efficient under Gaussian noise.
- Tukey’s bisquare (redescending ρ): completely downweights very large residuals; more aggressive than Huber but non-convex and can be harder to optimize.
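As a concrete illustration of the IRLS route mentioned above, here is a minimal sketch of a Huber M-estimator fit by iteratively reweighted least squares. The data are synthetic, and the choices of $\delta = 1.345$ and a MAD-based residual scale are conventional assumptions, not requirements; in practice one would typically reach for sklearn.linear_model.HuberRegressor or statsmodels' RLM.

```python
# Minimal IRLS sketch for the Huber M-estimator; not a production implementation.
import numpy as np

def huber_irls(X, y, delta=1.345, n_iter=50, tol=1e-8):
    beta = np.linalg.solve(X.T @ X, X.T @ y)                  # start from OLS
    for _ in range(n_iter):
        e = y - X @ beta
        scale = 1.4826 * np.median(np.abs(e - np.median(e)))  # robust scale (MAD)
        r = e / max(scale, 1e-12)                             # standardized residuals
        w = np.minimum(1.0, delta / np.maximum(np.abs(r), 1e-12))  # Huber weights
        XtW = X.T * w                                         # X^T W (W diagonal)
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)          # weighted LS step
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 60)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 1.5 * x + rng.normal(0, 1, 60)
y[:5] += 40.0                                                 # five gross outliers
print("Huber IRLS:", huber_irls(X, y))                        # close to [2, 1.5]
print("OLS       :", np.linalg.solve(X.T @ X, X.T @ y))       # pulled upward
```

Each iteration is just a weighted least-squares solve, which is why convex M-estimators like Huber are cheap and stable to fit.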
4) RANSAC (Random Sample Consensus)
Algorithmic, not an explicit loss minimization:
- Repeatedly sample minimal subset of points, fit model, count inliers within threshold, keep model with most inliers.
- Good for high-breakdown cases (a large fraction of the data may be outliers), but requires an inlier threshold and is randomized; a minimal sketch follows below.
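A minimal sketch of the loop just described, for a simple line fit. The threshold, iteration count, and helper name ransac_line are illustrative choices on synthetic data; sklearn.linear_model.RANSACRegressor is the usual off-the-shelf option.

```python
# Minimal RANSAC sketch: sample two points, fit a candidate line, count inliers,
# keep the best consensus set, then refit OLS on the inliers.
import numpy as np

def ransac_line(x, y, n_iter=200, threshold=2.0, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(n_iter):
        i, j = rng.choice(len(x), size=2, replace=False)  # minimal subset
        if x[i] == x[j]:
            continue
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        resid = np.abs(y - (intercept + slope * x))
        inliers = resid < threshold
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # refit by OLS on the consensus (inlier) set
    X_in = np.column_stack([np.ones(best_inliers.sum()), x[best_inliers]])
    return np.linalg.solve(X_in.T @ X_in, X_in.T @ y[best_inliers])

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5, 100)
y[:30] = rng.uniform(-20, 20, 30)                         # ~30% gross outliers
print(ransac_line(x, y))                                  # roughly [2, 1.5]
```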
Derivation: Why leverage matters (sketch)
Change $y_i$ by $\delta$. Fitted change:
$$ \Delta\hat{y} = H\,\Delta y \quad\Rightarrow\quad \Delta\hat{y}_j = h_{ji}\,\delta. $$
So the impact on fitted values is scaled by hat-matrix entries; in particular, $\Delta\hat{y}_i = h_{ii}\delta$, and the gap between the fit with and without point $i$ scales as $h_{ii}/(1-h_{ii})$ (the factor appearing in Cook's distance). Hence large leverage amplifies the influence of a change in $y_i$.
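A quick numerical check of this sketch (NumPy only, synthetic data): perturbing a single $y$ value changes the fitted values by exactly $H\,\Delta y$.

```python
# Perturb one y value by delta and confirm the fitted values change by H @ delta_y.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 20)
X = np.column_stack([np.ones(20), x])
y = 1.0 + 2.0 * x + rng.normal(0, 1, 20)
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix

dy = np.zeros(20)
dy[7] = 5.0                                    # change y_7 by delta = 5
delta_yhat = H @ (y + dy) - H @ y              # actual change in fitted values
print(np.allclose(delta_yhat, H @ dy))         # True: change = H @ delta_y
print(delta_yhat[7], H[7, 7] * 5.0)            # change at point 7 = h_77 * delta
```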
⚖️ Strengths, Limitations & Trade-offs
OLS (baseline)
- Strengths: simple, closed-form, efficient under Gauss-Markov assumptions (linear, homoscedastic, uncorrelated errors).
- Limitations: extremely sensitive to outliers (squared loss), not robust to heavy tails or gross errors.
L1 / LAD
- Strengths: high breakdown for vertical outliers; median-like robustness.
- Limitations: less efficient with Gaussian noise; solutions may not be unique; not robust to high-leverage $x$ outliers; no closed form (fit via linear programming).
M-estimators (Huber, Tukey)
- Strengths: balance between efficiency and robustness; Huber is convex (good optimization properties); IRLS available.
- Limitations: require tuning constant (δ); redescending functions (Tukey) are nonconvex and can be sensitive to initialization.
RANSAC
- Strengths: can tolerate many outliers; good when outliers are arbitrary and concentrated.
- Limitations: needs an inlier threshold and sufficient iterations; non-deterministic; designed for gross outliers, so it handles mild, continuous contamination poorly.
Theil–Sen
- Strengths: robust slope estimator (median of slopes), good for simple regression, high-breakdown.
- Limitations: computationally heavier for large n; mainly for simple linear regression.
Trade-off summary: robustness vs statistical efficiency under the assumed noise model, and computational cost vs ease of optimization.
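To make the trade-off concrete, here is a small side-by-side sketch on contaminated synthetic data using scikit-learn's implementations; the 20% contamination level and the seed are arbitrary assumptions, and exact numbers will vary.

```python
# Compare OLS against three robust estimators on data with gross vertical outliers.
import numpy as np
from sklearn.linear_model import (LinearRegression, HuberRegressor,
                                  RANSACRegressor, TheilSenRegressor)

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 200)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 200)
y[:40] += rng.uniform(20, 60, 40)              # 20% gross vertical outliers
X = x.reshape(-1, 1)

for name, model in [("OLS", LinearRegression()),
                    ("Huber", HuberRegressor()),
                    ("RANSAC", RANSACRegressor(random_state=0)),
                    ("Theil-Sen", TheilSenRegressor(random_state=0))]:
    model.fit(X, y)
    fitted = model.estimator_ if name == "RANSAC" else model  # RANSAC wraps an inner model
    print(f"{name:10s} slope={fitted.coef_[0]:6.2f} intercept={fitted.intercept_:6.2f}")
```

Under this kind of heavy vertical contamination the OLS coefficients drift noticeably while the robust fits stay near the true line; under clean Gaussian noise, OLS would be the most efficient of the four.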
🔍 Variants & Extensions
- Weighted Least Squares (WLS): downweight suspected outliers with precomputed weights.
- IRLS: iterative algorithm used to fit many M-estimators — at each iteration, observations are reweighted according to the weight implied by the influence function of their residuals.
- High-breakdown methods: S-estimators, MM-estimators (combine high-breakdown and high-efficiency).
- Redescending M-estimators: Tukey bisquare — more aggressive rejection of large residuals.
- Robust covariance estimators: use robust scale estimates (e.g., MAD) to compute studentized residuals; see the MAD sketch after this list.
- RANSAC variants: PROSAC, MLESAC — variations to improve speed or likelihood-based scoring.
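For the robust-scale idea above, a tiny sketch of the MAD-based scale estimate; the 1.4826 factor makes the MAD consistent with the standard deviation under Gaussian noise, and it can stand in for $\hat\sigma$ when studentizing residuals. The data here are made up.

```python
# MAD-based robust scale vs. the sample standard deviation on contaminated residuals.
import numpy as np

def mad_scale(residuals):
    med = np.median(residuals)
    return 1.4826 * np.median(np.abs(residuals - med))

rng = np.random.default_rng(6)
e = np.concatenate([rng.normal(0, 1, 100), [50.0, -40.0]])   # two gross outliers
print("sample std:", e.std())        # inflated by the outliers
print("MAD scale :", mad_scale(e))   # stays near 1
```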
🚧 Common Challenges & Pitfalls
- Masking and swamping
  - Masking: multiple outliers hide each other, making them look normal to simple diagnostics.
  - Swamping: valid points appear as outliers because of other outliers’ influence. Detect using robust residuals and high-breakdown methods.
- Confusing leverage and residual size
  - High-leverage points (extreme X) may have small residuals but still be influential. Use both leverage and residual-based measures (Cook’s D, DFBETAS).
- Automatic removal of data
  - Don’t delete points mechanically. Investigate: measurement error? Data entry? Legitimate rare event? Model mismatch?
- Improper tuning
  - Huber δ, RANSAC thresholds, or Tukey constants matter. Wrong choices either under-reject or over-reject.
- Nonconvex optimization
  - Some robust losses are nonconvex; optimization can get stuck. Start with convex M-estimators (Huber) before trying redescending ones.
- Overconfidence after robustifying
  - Robust fit reduces bias from outliers but you still need correct inference: standard errors may require robust variance estimation.
- Using robust methods to hide model misspecification
  - Outliers can signal model misspecification (missing covariates, nonlinearity). Robust methods are not a substitute for revisiting model form.
🔁 Probing Question — Answer (short, interview-ready)
Q: If your regression is heavily skewed by a single point, how would you detect and address it?
Detect
- Plot residuals vs fitted values — an extreme vertical deviation is a red flag.
- Compute leverage $h_{ii}$ (hat values): a large $h_{ii}$ indicates an X outlier.
- Compute studentized/externally studentized residuals $t_i$: large magnitude (e.g., $|t_i|>3$) indicates an outlying $y$ given $X$.
- Compute influence measures: Cook’s distance $D_i$ and DFBETAS to see the effect on coefficients.
  - Rule of thumb: $D_i > 4/n$ or $D_i$ substantially larger than the others → examine the point.
- Use robust diagnostics (robust residuals or high-breakdown methods) to detect masking; a short diagnostic sketch follows below.
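A practical sketch of this checklist using statsmodels' influence diagnostics, on synthetic data with one planted bad point; the cutoffs $|t_i|>3$ and $D_i > 4/n$ are the rules of thumb above, not universal constants.

```python
# Flag suspicious points with leverage, studentized residuals, Cook's D, and DFBETAS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 50)
x[0], y[0] = 30.0, 10.0                     # plant one high-leverage, high-residual point

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
infl = res.get_influence()

h = infl.hat_matrix_diag                    # leverage
t = infl.resid_studentized_external         # externally studentized residuals
D = infl.cooks_distance[0]                  # Cook's distance
dfb = infl.dfbetas                          # per-coefficient influence

suspects = np.where((np.abs(t) > 3) | (D > 4 / len(y)))[0]
print("suspect indices:", suspects)
print("their DFBETAS:\n", dfb[suspects])
```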
Address (ordered, interview-friendly strategy)
- Investigate — Is it a data error? If so, correct or remove with justification. Document the rationale.
- Re-fit with robust methods:
  - Start with a Huber M-estimator (convex, stable) or LAD if you suspect symmetric heavy tails.
  - For extreme contamination or when many outliers exist, consider RANSAC or Theil–Sen (for a simple slope).
- Compare fits: OLS vs robust estimate vs fit without the point (leave-one-out). Use DFBETAS to see which coefficients are affected.
- If leverage is the problem:
  - Consider transforming X (standardize, winsorize), or add richer model terms (interaction, nonlinearity) if model misspecification explains the leverage.
- If the outlier is legitimate but rare:
  - Model it explicitly (mixture model; heavy-tailed error like Student-t) or report both robust and OLS results.
- Report: Be transparent in the write-up: show diagnostics, justify actions, and provide a sensitivity analysis (how conclusions change if you remove or downweight the point); a small sketch of such a comparison follows below.
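A small sketch of the compare/sensitivity step on synthetic data, where index 0 plays the role of the flagged point: fit OLS on all points, OLS without the suspect, and a Huber fit, then report how the coefficients move.

```python
# Sensitivity comparison: OLS (all), OLS (leave-one-out), and a robust Huber fit.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 50)
x[0], y[0] = 12.0, 80.0                      # the suspect point
X = x.reshape(-1, 1)

fits = [("OLS (all points)",   LinearRegression().fit(X, y)),
        ("OLS (drop suspect)", LinearRegression().fit(np.delete(X, 0, axis=0),
                                                      np.delete(y, 0))),
        ("Huber (all points)", HuberRegressor().fit(X, y))]

for name, m in fits:
    print(f"{name:20s} slope={m.coef_[0]:5.2f} intercept={m.intercept_:6.2f}")
```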
📚 Reference Pointers
- Huber, P. J. (1964). Robust Estimation of a Location Parameter. Annals of Mathematical Statistics. (classic on M-estimators).
- Tukey, J. W. (1977). Exploratory Data Analysis. (intuition for robust methods).
- Fischler, M. A. & Bolles, R. C. (1981). Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. (RANSAC original).
- Wikipedia — Huber loss: https://en.wikipedia.org/wiki/Huber_loss.
- Wikipedia — RANSAC: https://en.wikipedia.org/wiki/RANSAC.
- Cook, R. D. (1977). Detection of Influential Observations in Linear Regression. (Cook’s distance).
- For robust regression theory & algorithms: Wikipedia — Robust regression: https://en.wikipedia.org/wiki/Robust_regression.
- For general probabilistic assumptions and modeling context: Koller & Friedman, Probabilistic Graphical Models (MIT Press). https://mitpress.mit.edu/9780262013192/probabilistic-graphical-models