OLS Assumptions: Linearity, Homoscedasticity & Independence Guide
🎯 Core Idea
Ordinary Least Squares (OLS) regression estimates the linear relationship between input variables (features) and an output (target).
- The goal: find coefficients that minimize the sum of squared errors (see the sketch below).
- For OLS to yield unbiased, efficient estimates and valid inference, certain assumptions (linearity, independence, homoscedasticity, etc.) must hold.
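A minimal sketch of the first point (synthetic data; scikit-learn is an assumed dependency): the fitted coefficients minimize the sum of squared errors, so any perturbation of them increases it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)

def sse(coef, intercept):
    """Sum of squared errors for a candidate coefficient vector."""
    resid = y - (X @ coef + intercept)
    return float(resid @ resid)

# The OLS solution is the unique minimizer: perturbing it raises the SSE
print(sse(model.coef_, model.intercept_) < sse(model.coef_ + 0.1, model.intercept_))  # True
```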
🌱 Intuition & Real-World Analogy
- Linearity: Imagine stretching a rubber band between data points. If the data roughly align, a straight band fits well. If the relationship is curved, forcing a straight line distorts reality.
- Homoscedasticity: Think of a doctor weighing patients. If the weighing scale works consistently for all weights, it’s fine. But if it becomes inaccurate for heavier people, the “variance” changes → heteroscedasticity.
- Independence: Like interviewing multiple people for a survey. If everyone copies from one another, answers aren’t independent → you effectively have far fewer responses than you think, and conclusions overstate their certainty.
📐 Mathematical Foundation
OLS solves:
$$ \hat{\beta} = \arg\min_\beta \, (y - X\beta)^T(y - X\beta) $$

Where:
- $y \in \mathbb{R}^n$ = target vector.
- $X \in \mathbb{R}^{n \times p}$ = feature matrix.
- $\beta \in \mathbb{R}^p$ = coefficients.
- $\hat{\beta} = (X^TX)^{-1}X^Ty$ (when $X^TX$ is invertible).
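A minimal NumPy sketch of the closed form on synthetic data (in practice, `np.linalg.lstsq` or a QR/SVD-based solver is numerically safer than inverting $X^TX$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Closed-form OLS: solve (X^T X) beta = X^T y instead of explicitly inverting
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically stabler equivalent (SVD-based)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)                           # ≈ [1.0, 2.0, -0.5]
print(np.allclose(beta_hat, beta_lstsq))  # True
```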
Key Assumptions (Gauss–Markov Theorem)
For OLS estimators to be BLUE (Best Linear Unbiased Estimator), the following must hold (see the diagnostic sketch after this list):
- Linearity: $y = X\beta + \epsilon$. Relationship is additive & linear in parameters.
- Independence: Errors $\epsilon_i$ are independent across observations.
- Homoscedasticity: $\text{Var}(\epsilon_i) = \sigma^2$, constant across all observations.
- No perfect multicollinearity: Features are not perfectly correlated.
- Exogeneity: $E[\epsilon|X] = 0$. Errors are uncorrelated with predictors.
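These conditions can be probed empirically. A hedged sketch on synthetic data using standard statsmodels diagnostics (Breusch–Pagan for homoscedasticity, Durbin–Watson for independence, VIF for multicollinearity):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))   # intercept + 2 features
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=200)

fit = sm.OLS(y, X).fit()

# Homoscedasticity: a small Breusch–Pagan p-value suggests heteroscedasticity
_, bp_pvalue, _, _ = het_breuschpagan(fit.resid, X)

# Independence: Durbin–Watson near 2 suggests no first-order autocorrelation
dw = durbin_watson(fit.resid)

# Multicollinearity: VIF well above ~10 is a common red flag
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]

print(f"Breusch–Pagan p={bp_pvalue:.3f}, Durbin–Watson={dw:.2f}, VIFs={vifs}")
```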
⚖️ Strengths, Limitations & Trade-offs
✅ Strengths
- Simple, interpretable, and computationally efficient.
- Works well when assumptions are approximately true.
- Provides a closed-form solution with strong theoretical guarantees.
❌ Limitations
- Linearity Restriction: Cannot capture non-linear relationships without feature engineering.
- Sensitivity to Outliers: Squared errors give extreme weight to large deviations (see the sketch after this list).
- Homoscedasticity Violation: Coefficient estimates remain unbiased but inefficient, and the usual standard errors are biased, invalidating t- and F-tests.
- Independence Violation: Autocorrelation (e.g., in time series) invalidates inference.
- Multicollinearity: Inflates variance of coefficients → unstable estimates.
- Overfitting in High Dimensions: With many predictors relative to observations, the model fits noise and generalizes poorly.
- Exogeneity Violation: Leads to biased and inconsistent estimates.
- Restricted Inference: Exact hypothesis tests and confidence intervals assume Gaussian-like residuals (less of a concern in large samples).
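To make the outlier point concrete, a minimal sketch on synthetic data: corrupting a single high-leverage observation visibly drags the fitted slope.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)

slope_clean = np.polyfit(x, y, deg=1)[0]

y_corrupt = y.copy()
idx = np.argmax(x)           # a high-leverage point (far from the mean of x)
y_corrupt[idx] += 100.0      # one wild outlier
slope_outlier = np.polyfit(x, y_corrupt, deg=1)[0]

print(f"clean slope ≈ {slope_clean:.2f}, with one outlier ≈ {slope_outlier:.2f}")
```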
🔍 Variants & Extensions
- Generalized Least Squares (GLS): Corrects heteroscedasticity or correlated errors.
- Weighted Least Squares (WLS): Assigns weights when variance differs across observations (see the sketch after this list).
- Ridge/Lasso Regression: Adds regularization to address multicollinearity or high-dimensionality.
- Robust Regression (Huber, M-estimators): Reduces sensitivity to outliers.
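As an example of the WLS remedy, a hedged statsmodels sketch assuming the error variance grows with a known variable `x`, so each observation is weighted by its inverse variance:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
X = sm.add_constant(x)

# Deliberately heteroscedastic: error standard deviation proportional to x
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x)

ols_fit = sm.OLS(y, X).fit()
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()  # weight_i ∝ 1 / Var(eps_i)

print("OLS:", ols_fit.params, ols_fit.bse)  # coefficients, standard errors
print("WLS:", wls_fit.params, wls_fit.bse)  # typically tighter standard errors
```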
🚧 Common Challenges & Pitfalls
- Mistaking correlation for causation: OLS only captures associations, not causality.
- Ignoring diagnostics: Failing to test assumptions (e.g., residual plots for homoscedasticity; see the sketch after this list).
- Over-relying on $R^2$: High $R^2$ doesn’t imply assumptions are valid.
- Data leakage in independence: Overlapping or dependent samples understate uncertainty and invalidate inference.
- Assuming normality of predictors: Not required; only the error distribution matters, and mainly for small-sample hypothesis tests.
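A minimal residual-vs-fitted plot, the usual first diagnostic (matplotlib assumed; a funnel shape suggests heteroscedasticity, curvature suggests non-linearity):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3 * x)  # deliberately heteroscedastic

slope, intercept = np.polyfit(x, y, deg=1)     # simple OLS line
fitted = intercept + slope * x
residuals = y - fitted

plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted (funnel shape = heteroscedasticity)")
plt.show()
```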
📚 Reference Pointers
- Gauss–Markov Theorem: Wikipedia.
- Classical OLS assumptions: Wooldridge, Introductory Econometrics: A Modern Approach.
- Heteroscedasticity and remedies: Weighted Least Squares (WLS).
- Robust regression overview: Huber loss and M-estimators.