OLS Assumptions: Linearity, Homoscedasticity & Independence Guide

🎯 Core Idea

Ordinary Least Squares (OLS) regression estimates the linear relationship between input variables (features) and an output (target).

  • The goal: find coefficients that minimize the sum of squared errors.
  • For OLS to yield unbiased, efficient estimates and valid inference, certain assumptions (linearity, independence, homoscedasticity, etc.) must hold.

🌱 Intuition & Real-World Analogy

  • Linearity: Imagine stretching a rubber band between data points. If the data roughly align, a straight band fits well. If the relationship is curved, forcing a straight line distorts reality.
  • Homoscedasticity: Think of a doctor weighing patients. If the scale is equally accurate at every weight, the error variance is constant. If it becomes less accurate for heavier people, the variance changes → heteroscedasticity.
  • Independence: Like interviewing multiple people for a survey. If everyone copies from one another, answers aren’t independent → bias creeps in.

📐 Mathematical Foundation

OLS solves:

$$ \hat{\beta} = \arg\min_\beta \, (y - X\beta)^T(y - X\beta) $$

Where:

  • $y \in \mathbb{R}^n$ = target vector.
  • $X \in \mathbb{R}^{n \times p}$ = feature matrix.
  • $\beta \in \mathbb{R}^p$ = coefficients.
  • $\hat{\beta} = (X^TX)^{-1}X^Ty$ (when $X^TX$ is invertible).
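
As a minimal sketch of the closed-form estimate above (synthetic data and NumPy are assumptions of this example, not part of the original derivation), the normal equations can be solved directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta + rng.normal(scale=0.3, size=n)

# beta_hat = (X^T X)^{-1} X^T y, solved as a linear system rather than via an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should land close to [1.0, 2.0, -0.5]
```

Using `np.linalg.solve` on the normal equations avoids forming an explicit inverse; for ill-conditioned $X$, `np.linalg.lstsq` (an SVD-based solver) is the more robust choice.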

Key Assumptions (Gauss–Markov Theorem)

For OLS estimators to be BLUE (Best Linear Unbiased Estimator):

  1. Linearity: $y = X\beta + \epsilon$. Relationship is additive & linear in parameters.
  2. Independence: Errors $\epsilon_i$ are independent (or at least uncorrelated) across observations.
  3. Homoscedasticity: $\text{Var}(\epsilon_i) = \sigma^2$, constant across all observations (see the diagnostics sketch after this list).
  4. No perfect multicollinearity: Features are not perfectly correlated.
  5. Exogeneity: $E[\epsilon|X] = 0$. Errors are uncorrelated with predictors.
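
As a hedged illustration of how assumptions 2 and 3 are commonly checked (the data here is synthetic and purely illustrative), statsmodels provides a Breusch–Pagan test for heteroscedasticity and a Durbin–Watson statistic for first-order autocorrelation:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)   # well-behaved errors by construction

X = sm.add_constant(x)               # intercept column + feature
fit = sm.OLS(y, X).fit()

# Breusch-Pagan: a small p-value suggests heteroscedasticity (assumption 3)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)

# Durbin-Watson: values near 2 suggest no first-order autocorrelation (assumption 2)
print("Durbin-Watson:", durbin_watson(fit.resid))
```

Both checks are heuristics; they complement, rather than replace, visual residual diagnostics.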

⚖️ Strengths, Limitations & Trade-offs

✅ Strengths

  • Simple, interpretable, and computationally efficient.
  • Works well when assumptions are approximately true.
  • Provides closed-form solution with strong theoretical guarantees.

❌ Limitations

  1. Linearity Restriction: Cannot capture non-linear relationships without feature engineering.
  2. Sensitivity to Outliers: Squared errors give extreme weight to large deviations.
  3. Homoscedasticity Violation: Leads to inefficient estimates and biased standard errors.
  4. Independence Violation: Autocorrelation (e.g., in time series) invalidates inference.
  5. Multicollinearity: Inflates variance of coefficients → unstable estimates (see the VIF sketch after this list).
  6. Overfitting in High Dimensions: With too many predictors, model loses generalization.
  7. Exogeneity Violation: Leads to biased and inconsistent estimates.
  8. Normality for Inference: Standard hypothesis tests and confidence intervals assume approximately Gaussian residuals, especially in small samples.
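
To make limitation 5 concrete, the sketch below (synthetic features, statsmodels) computes variance inflation factors; a VIF well above ~10 is a common, though informal, warning sign:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)   # nearly collinear with x1 on purpose
x3 = rng.normal(size=300)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# Skip column 0 (the intercept); large VIFs flag near-collinear features
for i in range(1, X.shape[1]):
    print(f"VIF for feature {i}: {variance_inflation_factor(X, i):.1f}")
```

Ridge regression (next section) is one standard remedy when VIFs are large.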

🔍 Variants & Extensions

  • Generalized Least Squares (GLS): Corrects for heteroscedastic or correlated errors.
  • Weighted Least Squares (WLS): Assigns observation weights when error variance differs across observations (see the sketch after this list).
  • Ridge/Lasso Regression: Adds regularization to address multicollinearity or high-dimensionality.
  • Robust Regression (Huber, M-estimators): Reduces sensitivity to outliers.
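
As an illustrative WLS sketch (synthetic heteroscedastic data; the inverse-variance weights are an assumption of this toy setup, not a general recipe), statsmodels makes it easy to compare OLS and WLS standard errors:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 200)
y = 3.0 + 0.5 * x + rng.normal(scale=0.2 * x)     # noise grows with x (heteroscedastic)

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()  # weights ~ 1 / error variance

print("OLS coefficient std errors:", ols_fit.bse)
print("WLS coefficient std errors:", wls_fit.bse)
```

When the variance structure is unknown, heteroscedasticity-robust standard errors (e.g., `fit(cov_type="HC3")` in statsmodels) are a common alternative to choosing weights.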

🚧 Common Challenges & Pitfalls

  • Mistaking correlation for causation: OLS only captures associations, not causality.
  • Ignoring diagnostics: Failing to test assumptions (e.g., residual plots for homoscedasticity; see the sketch after this list).
  • Over-relying on $R^2$: High $R^2$ doesn’t imply assumptions are valid.
  • Data leakage in independence: Using overlapping or dependent samples biases results.
  • Assuming normality of predictors: Not required; only the errors matter, and only for inference.
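
A quick visual diagnostic, sketched below on synthetic data with matplotlib and statsmodels, is the residual-versus-fitted plot: curvature hints at non-linearity, while a funnel shape hints at heteroscedasticity.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)

fit = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals vs fitted values: look for curvature (non-linearity)
# or a funnel shape (heteroscedasticity)
plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0.0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted")
plt.show()
```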
