OLS Assumptions: Linearity, Homoscedasticity & Independence Guide
🎯 Core Idea
Ordinary Least Squares (OLS) regression estimates the linear relationship between input variables (features) and an output (target).
- The goal: find coefficients that minimize the sum of squared errors (see the sketch below).
- For OLS to yield unbiased, efficient estimates and valid inference, certain assumptions (linearity, independence, homoscedasticity, etc.) must hold.
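A minimal sketch of the first point (synthetic data; scikit-learn is an assumed dependency): the fitted coefficients minimize the sum of squared errors, so any perturbation of them increases it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)

def sse(coef, intercept):
    """Sum of squared errors for a candidate coefficient vector."""
    resid = y - (X @ coef + intercept)
    return float(resid @ resid)

# The OLS solution is the unique minimizer: perturbing it raises the SSE
print(sse(model.coef_, model.intercept_) < sse(model.coef_ + 0.1, model.intercept_))  # True
```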
🌱 Intuition & Real-World Analogy
- Linearity: Imagine stretching a rubber band between data points. If the data roughly align, a straight band fits well. If the relationship is curved, forcing a straight line distorts reality.
- Homoscedasticity: Think of a doctor weighing patients. If the weighing scale works consistently for all weights, it’s fine. But if it becomes inaccurate for heavier people, the “variance” changes → heteroscedasticity.
- Independence: Like interviewing multiple people for a survey. If everyone copies from one another, answers aren’t independent → you effectively have far fewer responses than you think, and conclusions overstate their certainty.
📐 Mathematical Foundation
OLS solves:
$$ \hat{\beta} = \arg\min_\beta \, (y - X\beta)^T(y - X\beta) $$

Where:
- $y \in \mathbb{R}^n$ = target vector.
- $X \in \mathbb{R}^{n \times p}$ = feature matrix.
- $\beta \in \mathbb{R}^p$ = coefficients.
- $\hat{\beta} = (X^TX)^{-1}X^Ty$ (when $X^TX$ is invertible).
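A minimal NumPy sketch of the closed form on synthetic data (in practice, `np.linalg.lstsq` or a QR/SVD-based solver is numerically safer than inverting $X^TX$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Closed-form OLS: solve (X^T X) beta = X^T y instead of explicitly inverting
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically stabler equivalent (SVD-based)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)                           # ≈ [1.0, 2.0, -0.5]
print(np.allclose(beta_hat, beta_lstsq))  # True
```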
Key Assumptions (Gauss–Markov Theorem)
For OLS estimators to be BLUE (Best Linear Unbiased Estimator), the following must hold (see the diagnostic sketch after this list):
- Linearity: $y = X\beta + \epsilon$. Relationship is additive & linear in parameters.
- Independence: Errors $\epsilon_i$ are independent across observations.
- Homoscedasticity: $\text{Var}(\epsilon_i) = \sigma^2$, constant across all observations.
- No perfect multicollinearity: Features are not perfectly correlated.
- Exogeneity: $E[\epsilon|X] = 0$. Errors are uncorrelated with predictors.
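These conditions can be probed empirically. A hedged sketch on synthetic data using standard statsmodels diagnostics (Breusch–Pagan for homoscedasticity, Durbin–Watson for independence, VIF for multicollinearity):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))   # intercept + 2 features
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=200)

fit = sm.OLS(y, X).fit()

# Homoscedasticity: a small Breusch–Pagan p-value suggests heteroscedasticity
_, bp_pvalue, _, _ = het_breuschpagan(fit.resid, X)

# Independence: Durbin–Watson near 2 suggests no first-order autocorrelation
dw = durbin_watson(fit.resid)

# Multicollinearity: VIF well above ~10 is a common red flag
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]

print(f"Breusch–Pagan p={bp_pvalue:.3f}, Durbin–Watson={dw:.2f}, VIFs={vifs}")
```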
⚖️ Strengths, Limitations & Trade-offs
✅ Strengths
- Simple, interpretable, and computationally efficient.
- Works well when assumptions are approximately true.
- Provides a closed-form solution with strong theoretical guarantees.
❌ Limitations
- Linearity Restriction: Cannot capture non-linear relationships without feature engineering.
- Sensitivity to Outliers: Squared errors give extreme weight to large deviations (see the sketch after this list).
- Homoscedasticity Violation: Coefficient estimates remain unbiased but inefficient, and the usual standard errors are biased, invalidating t- and F-tests.
- Independence Violation: Autocorrelation (e.g., in time series) invalidates inference.
- Multicollinearity: Inflates variance of coefficients → unstable estimates.
- Overfitting in High Dimensions: With many predictors relative to observations, the model fits noise and generalizes poorly.
- Exogeneity Violation: Leads to biased and inconsistent estimates.
- Restricted Inference: Exact hypothesis tests and confidence intervals assume Gaussian-like residuals (less of a concern in large samples).
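To make the outlier point concrete, a minimal sketch on synthetic data: corrupting a single high-leverage observation visibly drags the fitted slope.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)

slope_clean = np.polyfit(x, y, deg=1)[0]

y_corrupt = y.copy()
idx = np.argmax(x)           # a high-leverage point (far from the mean of x)
y_corrupt[idx] += 100.0      # one wild outlier
slope_outlier = np.polyfit(x, y_corrupt, deg=1)[0]

print(f"clean slope ≈ {slope_clean:.2f}, with one outlier ≈ {slope_outlier:.2f}")
```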
🔍 Variants & Extensions
- Generalized Least Squares (GLS): Corrects heteroscedasticity or correlated errors.
- Weighted Least Squares (WLS): Assigns weights when variance differs across observations (see the sketch after this list).
- Ridge/Lasso Regression: Adds regularization to address multicollinearity or high-dimensionality.
- Robust Regression (Huber, M-estimators): Reduces sensitivity to outliers.
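As an example of the WLS remedy, a hedged statsmodels sketch assuming the error variance grows with a known variable `x`, so each observation is weighted by its inverse variance:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
X = sm.add_constant(x)

# Deliberately heteroscedastic: error standard deviation proportional to x
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x)

ols_fit = sm.OLS(y, X).fit()
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()  # weight_i ∝ 1 / Var(eps_i)

print("OLS:", ols_fit.params, ols_fit.bse)  # coefficients, standard errors
print("WLS:", wls_fit.params, wls_fit.bse)  # typically tighter standard errors
```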
🚧 Common Challenges & Pitfalls
- Mistaking correlation for causation: OLS only captures associations, not causality.
- Ignoring diagnostics: Failing to test assumptions (e.g., residual plots for homoscedasticity; see the sketch after this list).
- Over-relying on $R^2$: High $R^2$ doesn’t imply assumptions are valid.
- Data leakage in independence: Overlapping or dependent samples understate uncertainty and invalidate inference.
- Assuming normality of predictors: Not required; only the error distribution matters, and mainly for small-sample hypothesis tests.
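A minimal residual-vs-fitted plot, the usual first diagnostic (matplotlib assumed; a funnel shape suggests heteroscedasticity, curvature suggests non-linearity):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3 * x)  # deliberately heteroscedastic

slope, intercept = np.polyfit(x, y, deg=1)     # simple OLS line
fitted = intercept + slope * x
residuals = y - fitted

plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted (funnel shape = heteroscedasticity)")
plt.show()
```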
📚 Reference Pointers
- Gauss–Markov Theorem: Wikipedia.
- Classical OLS assumptions: Wooldridge, Introductory Econometrics: A Modern Approach.
- Heteroscedasticity and remedies: Weighted Least Squares (WLS).
- Robust regression overview: Huber loss and M-estimators.