Master the Core Theory and Assumptions: Linear Regression
🎯 Core Idea
- Linear regression models the conditional expectation of a continuous target $y$ as a linear function of input features $X$:
$$ y = X\beta + \varepsilon $$
where $X$ is the design matrix, $\beta$ the coefficient vector, and $\varepsilon$ the (irreducible) noise. The goal of Ordinary Least Squares (OLS) is to find the $\beta$ that minimizes the sum of squared residuals $\|y - X\beta\|_2^2$. Linear regression is both a modeling tool and a diagnostic baseline — it’s simple, interpretable, and the theoretical bedrock for many extensions.
🌱 Intuition & Real-World Analogy
- Why use it? Because many relationships are approximately linear locally, and a linear map is the simplest interpretable mapping between inputs and outputs.
- Analogy 1 — Fitting a board to points: Imagine trying to lay a straight plank (the model) across scattered pegs (data points) so the plank is as close as possible to all pegs. OLS finds the plank position minimizing total squared vertical gaps.
- Analogy 2 — Budget allocation: Think of $\beta$ as how much “weight” you give each budget line in predicting final monthly cost. If one line doubles, the linear model assumes proportional effect on final cost.
📐 Mathematical Foundation
Model statement
$$ y = X\beta + \varepsilon $$
- $y \in \mathbb{R}^n$: observed outcomes.
- $X \in \mathbb{R}^{n\times p}$: design matrix, rows are observations, columns are features (may include a column of ones for intercept).
- $\beta \in \mathbb{R}^p$: parameters to estimate.
- $\varepsilon \in \mathbb{R}^n$: noise/errors, usually treated as random.
OLS objective
$$ \hat\beta_{OLS} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 = \arg\min_{\beta} \; (y - X\beta)^\top (y - X\beta) $$
Derivation (closed-form)
Set the gradient to zero:
$$ \frac{\partial}{\partial\beta} \big((y-X\beta)^\top (y-X\beta)\big) = -2X^\top (y - X\beta) = 0 $$
Solve:
$$ X^\top X \hat\beta_{OLS} = X^\top y \quad\Rightarrow\quad \hat\beta_{OLS} = (X^\top X)^{-1} X^\top y $$
Conditions: $X^\top X$ must be invertible (i.e., the columns of $X$ are linearly independent).
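A minimal NumPy sketch of this closed-form fit on simulated data (the simulation and all variable names are illustrative); in practice a least-squares solver such as `np.linalg.lstsq` is preferred to explicitly inverting $X^\top X$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)               # y = X beta + noise

# Normal equations, exactly as derived above (fine when X^T X is well conditioned)
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Preferred in practice: a least-squares solver that never forms (X^T X)^{-1}
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)   # both estimates should be close to beta_true
print(beta_lstsq)
```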
Distributional results (when $\varepsilon \sim \mathcal{N}(0,\sigma^2 I)$)
- $\mathbb{E}[\hat\beta_{OLS}] = \beta$ (unbiased).
- $\operatorname{Var}(\hat\beta_{OLS}) = \sigma^2 (X^\top X)^{-1}$.
- Residual vector: $\hat r = y - X\hat\beta$. Estimate of noise variance:
$$ \hat\sigma^2 = \frac{\hat r^\top \hat r}{n-p} $$
- Predictions: $\hat y = X\hat\beta$. Variance of the fitted mean at a new point $x_*$ (add $\sigma^2$ for the variance of a new observation $y_*$):
$$ \operatorname{Var}(\hat y_*) = \sigma^2 \, x_*^\top (X^\top X)^{-1} x_* $$
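Continuing the NumPy sketch above (same simulated `X`, `y`, `n`, `p`), these variance formulas map directly to code:

```python
# Continues the snippet above: X, y, n, p already defined
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)                  # unbiased noise-variance estimate

XtX_inv = np.linalg.inv(X.T @ X)
cov_beta = sigma2_hat * XtX_inv                       # Var(beta_hat) = sigma^2 (X^T X)^{-1}
se_beta = np.sqrt(np.diag(cov_beta))                  # coefficient standard errors

x_star = np.array([1.0, 0.5, -1.0])                   # hypothetical new point (intercept first)
var_fit = sigma2_hat * x_star @ XtX_inv @ x_star      # variance of the fitted mean at x_star
```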
Key assumptions (classical linear model)
- Linearity in parameters: $y = X\beta + \varepsilon$.
- Full rank / no perfect multicollinearity: $X^\top X$ invertible.
- Exogeneity / zero-mean error: $\mathbb{E}[\varepsilon \mid X] = 0$.
- Homoscedasticity: $\operatorname{Var}(\varepsilon_i \mid X) = \sigma^2$ (constant variance).
- Independence: errors are uncorrelated (often independent).
- (Optional for inference) Normality: $\varepsilon \sim \mathcal{N}(0,\sigma^2 I)$ — gives exact finite-sample inference; otherwise rely on large-sample approximations (CLT).
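Under these assumptions the usual standard errors, t-tests, and confidence intervals apply; a minimal statsmodels sketch (reusing the simulated `X`, `y` from the NumPy snippets above):

```python
import statsmodels.api as sm

# X already includes an intercept column, so no add_constant is needed here
fit = sm.OLS(y, X).fit()
print(fit.summary())    # coefficients, standard errors, t-stats, F-test, R^2
print(fit.conf_int())   # confidence intervals (exact under normal errors, approximate otherwise)
```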
⚖️ Strengths, Limitations & Trade-offs
Strengths
- Analytic closed-form solution (when $p$ small and $X^\top X$ invertible).
- Highly interpretable coefficients: $\beta_j$ = marginal effect of one-unit change in feature $j$ holding others fixed (caveat: correlated features).
- Well-understood statistical properties (Gauss–Markov: OLS is BLUE under assumptions).
- Fast to train and robust baseline.
Limitations
- Linearity constraint: cannot capture nonlinear relationships unless features are transformed/augmented.
- Sensitive to outliers: squared loss magnifies large residuals.
- Violation of assumptions: heteroscedasticity, autocorrelation, endogeneity, or multicollinearity degrade inference or prediction.
- Interpretation pitfalls: coefficients are conditional on included covariates — omitted-variable bias can mislead causal claims.
Trade-offs
- Bias–variance: adding regularization (Ridge/Lasso) increases bias but reduces variance — helpful with multicollinearity or high-dimensional $p$.
- Simplicity vs expressiveness: linear models are simple and interpretable but less expressive than nonlinear models (trees, neural nets). However, feature engineering or basis expansions recover much expressive power while keeping interpretability.
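As a small illustration of the bias–variance point, a scikit-learn sketch: with two nearly collinear features, OLS coefficients can swing to large offsetting values, while Ridge shrinks them toward more stable estimates (the simulated data and `alpha=1.0` are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)          # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)   # only x1 truly matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)      # often large, offsetting values
print("Ridge coefficients:", ridge.coef_)    # shrunk, more stable
```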
🔍 Variants & Extensions
- Weighted Least Squares (WLS): addresses heteroscedasticity by weighting observations inversely to their variances.
- Generalized Least Squares (GLS): handles correlated/heteroscedastic errors ($\operatorname{Var}(\varepsilon)=\Sigma$).
- Ridge (L2) and Lasso (L1): penalized regression to control multicollinearity and overfitting.
- Elastic Net: combination of L1 and L2.
- Generalized Linear Models (GLMs): extend to non-Gaussian targets (e.g., logistic regression for binary).
- Basis expansions / splines / polynomial regression: allow nonlinear relationships while keeping linearity in parameters.
- Robust regression (e.g., Huber, RANSAC): reduce sensitivity to outliers.
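For example, a polynomial basis expansion stays linear in the parameters even though the fitted curve is nonlinear in $x$; a short scikit-learn sketch (degree 3 and the sine ground truth are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-3, 3, size=150)).reshape(-1, 1)
y = np.sin(x).ravel() + 0.2 * rng.normal(size=150)   # nonlinear ground truth

# Still a linear model in the expanded features 1, x, x^2, x^3
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(x, y)
print(poly_model.score(x, y))                        # in-sample R^2
```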
🚧 Common Challenges & Pitfalls
Multicollinearity
- Problem: predictors highly correlated → $X^\top X$ ill-conditioned → coefficients unstable, inflated standard errors.
- Detection: high Variance Inflation Factor (VIF), near-singular $X^\top X$, large coefficient swings when removing features.
- Mitigation: remove/recombine features, PCA, Ridge regression.
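A quick VIF check with statsmodels on deliberately correlated features (a common rule of thumb flags VIFs above roughly 5–10; the simulation is illustrative):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)     # highly correlated with x1
x3 = rng.normal(size=200)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF per column (index 0 is the intercept and is usually ignored)
vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print(vifs)                               # x1 and x2 should show large VIFs
```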
Heteroscedasticity
- Problem: error variance is not constant → OLS is still unbiased, but the usual variance estimates (hence p-values and confidence intervals) are wrong.
- Detection: residuals vs fitted plot showing funnel shape; formal tests (Breusch–Pagan, White test).
- Mitigation: transform response (e.g., log), WLS, heteroscedasticity-robust (sandwich) standard errors.
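A sketch of detection (Breusch–Pagan) and one mitigation (HC3 sandwich standard errors) with statsmodels, using simulated data whose error spread grows with $x$:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=300)
y = 2 + 0.5 * x + rng.normal(scale=0.3 * x)      # error spread grows with x (funnel shape)
X = sm.add_constant(x)

fit = sm.OLS(y, X).fit()
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {lm_pval:.4f}")   # small p-value suggests heteroscedasticity

# Mitigation: heteroscedasticity-robust (sandwich) standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC3")
print(robust_fit.bse)                            # robust standard errors
```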
Autocorrelation (time series)
- Problem: correlated residuals (common in time series) → inefficient estimates, invalid standard errors.
- Detection: Durbin–Watson, ACF of residuals.
- Mitigation: include lagged variables, GLS, ARIMA-style modeling.
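Detection in code, using simulated AR(1) errors (a Durbin–Watson statistic near 2 suggests little first-order autocorrelation; values well below 2 suggest positive autocorrelation):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
n = 300
e = np.zeros(n)
for t in range(1, n):                        # AR(1) errors: e_t = 0.8 e_{t-1} + noise
    e[t] = 0.8 * e[t - 1] + rng.normal(scale=0.5)
x = np.arange(n, dtype=float)
y = 1.0 + 0.02 * x + e
X = sm.add_constant(x)

fit = sm.OLS(y, X).fit()
print(durbin_watson(fit.resid))              # well below 2 here -> positive autocorrelation
```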
Endogeneity / Omitted Variable Bias
- Problem: $\mathbb{E}[\varepsilon \mid X] \neq 0$ due to omitted confounders or reverse causation → biased $\hat\beta$.
- Mitigation: instrumental variables (IV), controlled experiments, including known confounders as covariates.
Overfitting with high-dimensional data ($p \approx n$ or $p>n$)
- Problem: $X^\top X$ singular when $p>n$.
- Mitigation: regularization (Ridge/Lasso), feature selection, dimensionality reduction.
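A small sketch of the $p > n$ case: the normal equations are singular, but Lasso still returns a well-defined sparse fit (dimensions and `alpha=0.1` are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 50, 200                               # more features than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:4] = [3.0, -2.0, 1.5, 4.0]             # only a few features matter
y = X @ beta + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.count_nonzero(lasso.coef_))         # sparse solution despite p > n
```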
Misinterpreting coefficients
- Coefficients are conditional associations, not automatically causal effects. Causality needs separate identification (randomization, IVs, DAG-based reasoning).
🔬 Detection & Diagnostic Checklist (practical, conceptual)
- Residuals vs fitted: check linearity and heteroscedasticity.
- QQ plot of residuals: assess normality (only matters for small-sample inference).
- VIF scores: detect multicollinearity.
- Leverage and influence (Cook’s distance): detect influential points/outliers.
- Plot residual ACF: detect autocorrelation.
- Compare nested models (F-test) / information criteria (AIC/BIC): model selection reasoning.
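Several items on this checklist are one or two lines against a fitted statsmodels model; a self-contained sketch on simulated data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)
x = rng.normal(size=150)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=150)
X = sm.add_constant(x)

fit = sm.OLS(y, X).fit()
influence = fit.get_influence()

resid_vs_fitted = list(zip(fit.fittedvalues, fit.resid))   # plot these: linearity, variance
cooks_d, _ = influence.cooks_distance                      # Cook's distance per observation
leverage = influence.hat_matrix_diag                       # leverage (hat values)
print(durbin_watson(fit.resid))                            # ~2 means little first-order autocorrelation
print(fit.aic, fit.bic)                                    # information criteria for model comparison
```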
📚 Reference Pointers (for deeper theory)
- Gauss–Markov theorem (BLUE): https://en.wikipedia.org/wiki/Gauss–Markov_theorem
- Ordinary Least Squares (Wikipedia): https://en.wikipedia.org/wiki/Ordinary_least_squares
- Heteroscedasticity tests — Breusch–Pagan: https://en.wikipedia.org/wiki/Breusch–Pagan_test
- Variance Inflation Factor (VIF): https://en.wikipedia.org/wiki/Variance_inflation_factor
- Robust standard errors (“sandwich” estimator): https://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors
- Koller & Friedman, Probabilistic Graphical Models — for formal treatment of assumptions and probabilistic modeling: https://mitpress.mit.edu/9780262013192/probabilistic-graphical-models