Linear Regression — OLS Closed-Form Derivation & Math Drill Guide
🎯 Core Idea
Linear Regression (ordinary least squares, OLS) fits a linear model
$$ \hat{y} = X\beta $$
to given inputs $X$ and targets $y$ by choosing coefficients $\beta$ that minimize the sum of squared residuals (prediction errors). The closed-form OLS solution is the coefficient vector that makes the residuals orthogonal to the column space of $X$, leading to the normal equations
$$ X^\top X \beta = X^\top y, $$
and (when invertible)
$$ \hat\beta = (X^\top X)^{-1} X^\top y. $$
🌱 Intuition & Real-World Analogy
- Why OLS? Minimizing squared errors penalizes large deviations strongly and yields a simple, analytically tractable estimator that often works well as a baseline.
- Analogy 1 — Shadow projection: Imagine the columns of $X$ span a plane in high-dimensional space. Observed $y$ is a vector; OLS finds the point in the plane closest to $y$ (orthogonal projection).
- Analogy 2 — Fitting a table to floors: You want a flat table (linear predictor) that’s as close as possible to many uneven floor points (data). OLS picks the table orientation minimizing total squared vertical distances.
📐 Mathematical Foundation
Setup and notation
- Data: $n$ observations, $p$ predictors (including optional intercept). $X\in\mathbb{R}^{n\times p}$ design matrix, rows $x_i^\top$. Target vector $y\in\mathbb{R}^n$. Coefficients $\beta\in\mathbb{R}^p$. Residuals $e = y - X\beta$.
- Objective (OLS): minimize the sum of squared residuals
  $$ L(\beta)=\|y-X\beta\|_2^2 = (y-X\beta)^\top (y-X\beta). $$
Gradient and normal equations (matrix form)
Compute gradient w.r.t. $\beta$:
$$ \begin{aligned} L(\beta) &= y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta,\\ \nabla_\beta L(\beta) &= -2 X^\top y + 2 X^\top X \beta. \end{aligned} $$
Set the gradient to zero for a stationary point:
$$ X^\top X \beta = X^\top y \qquad\text{(normal equations)}. $$
If $X^\top X$ is invertible,
$$ \boxed{\ \hat\beta = (X^\top X)^{-1} X^\top y\ }. $$
Residual & fitted values
- Fitted values: $\hat{y} = X\hat\beta = H y$ where $H = X(X^\top X)^{-1}X^\top$ is the hat (projection) matrix.
- Residuals: $\hat{e} = y - \hat{y} = (I - H)y$. Key property: $X^\top \hat{e} = 0$ (residuals orthogonal to each column of $X$).
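The projection and orthogonality properties above can be checked numerically. A minimal NumPy sketch, assuming a small synthetic dataset (all sizes, coefficients, and noise levels below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n observations, an intercept column plus 3 predictors.
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Normal equations: solve X^T X beta = X^T y (no explicit inverse).
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T, fitted values, residuals.
H = X @ np.linalg.solve(XtX, X.T)
y_hat = H @ y
resid = y - y_hat

# Residuals are orthogonal to every column of X (up to floating-point error).
print(np.allclose(X.T @ resid, 0.0, atol=1e-8))
```

Solving the linear system rather than inverting `XtX` is the usual numerically safer route; more on this under trade-offs below.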
Probabilistic interpretation & variance of estimator
Under the standard linear model
$$ y = X\beta + \varepsilon,\qquad \mathbb{E}[\varepsilon]=0,\ \ \operatorname{Var}(\varepsilon)=\sigma^2 I, $$
the OLS estimator $\hat\beta$ is unbiased and has covariance
$$ \mathbb{E}[\hat\beta] = \beta,\qquad \operatorname{Var}(\hat\beta) = \sigma^2 (X^\top X)^{-1}. $$
Estimated noise variance: $\hat\sigma^2 = \dfrac{\|\hat e\|_2^2}{n-p}$ under the usual degrees-of-freedom correction.
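As a hedged illustration of these formulas, the snippet below estimates $\hat\sigma^2$ and the coefficient standard errors on synthetic data (the true $\beta$, noise level, and sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4                                  # p counts the intercept column
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# hat{sigma}^2 = ||e||^2 / (n - p), the degrees-of-freedom correction.
sigma2_hat = resid @ resid / (n - p)
cov_beta = sigma2_hat * XtX_inv                # estimated Var(beta_hat)
std_err = np.sqrt(np.diag(cov_beta))
print(sigma2_hat, std_err)
```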
Full derivation (scalar to matrix) — paper-style step-through
1. Scalar single-feature case. Model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$. OLS minimizes
   $$ L(\beta_0,\beta_1)=\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2. $$
   Take partial derivatives and set them to zero:
   $$ \frac{\partial L}{\partial \beta_0} = -2\sum (y_i - \beta_0 - \beta_1 x_i)=0, $$
   $$ \frac{\partial L}{\partial \beta_1} = -2\sum x_i(y_i - \beta_0 - \beta_1 x_i)=0. $$
   Solve these two linear equations to obtain explicit $\hat\beta_1$ and $\hat\beta_0$ (standard textbook algebra leading to $\hat\beta_1=\dfrac{\sum (x_i-\bar x)(y_i-\bar y)}{\sum (x_i-\bar x)^2}$); these formulas are checked numerically in the sketch after this derivation.
2. Matrix derivation (compact). Start from $L(\beta) = (y-X\beta)^\top(y-X\beta)$. Expand and differentiate as in the main text to get the normal equations
   $$ X^\top X \beta = X^\top y. $$
   Invert (if possible) to get $\hat\beta$.
3. Projection interpretation. The distance $\|y-X\beta\|_2^2$ is minimized when $y-X\hat\beta$ is orthogonal to the column space of $X$, i.e. $X^\top (y-X\hat\beta)=0$, which is exactly the normal equations.
4. Variance derivation. Under $y=X\beta+\varepsilon$, substitute into $\hat\beta=(X^\top X)^{-1}X^\top y$:
   $$ \hat\beta = \beta + (X^\top X)^{-1}X^\top \varepsilon. $$
   Thus $\operatorname{Var}(\hat\beta) = (X^\top X)^{-1}X^\top \operatorname{Var}(\varepsilon) X (X^\top X)^{-1} = \sigma^2 (X^\top X)^{-1}$.
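A hedged numerical companion to this derivation: the sketch below checks the scalar slope/intercept formulas from step 1 against the matrix solution, then repeatedly redraws the noise to compare the empirical covariance of $\hat\beta$ with $\sigma^2 (X^\top X)^{-1}$ from step 4 (all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 1 check: textbook slope/intercept formulas vs. the matrix normal equations.
x = rng.normal(size=100)
y = 3.0 + 1.5 * x + rng.normal(scale=0.4, size=100)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
X1 = np.column_stack([np.ones_like(x), x])
print(np.allclose([beta0, beta1], np.linalg.solve(X1.T @ X1, X1.T @ y)))

# Step 4 check: empirical covariance of beta_hat over repeated noise draws
# (with X held fixed) should approach sigma^2 (X^T X)^{-1}.
n, sigma = 150, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([0.5, -1.0, 2.0])
XtX_inv = np.linalg.inv(X.T @ X)
draws = np.array([
    XtX_inv @ X.T @ (X @ beta + rng.normal(scale=sigma, size=n))
    for _ in range(20_000)
])
print(np.max(np.abs(np.cov(draws, rowvar=False) - sigma**2 * XtX_inv)))  # near zero
```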
⚖️ Strengths, Limitations & Trade-offs
Strengths
- Closed-form, fast to compute for small-to-moderate $p$.
- Interpretable coefficients; linearity is easy to explain.
- Under Gauss–Markov assumptions, OLS is the Best Linear Unbiased Estimator (BLUE).
- Basis for many extensions and diagnostic techniques.
Limitations
- Linearity: only captures linear relationships unless features are engineered.
- Sensitive to outliers: squared loss amplifies large residuals.
- Multicollinearity: near-linear dependence among columns of $X$ makes $X^\top X$ ill-conditioned and estimates unstable (high variance); see the diagnostic sketch after this list.
- Homoskedasticity & uncorrelated errors assumption: if violated, standard errors and inference are invalid.
- Invertibility requirement: if $X^\top X$ is singular (e.g., $p>n$ or perfectly correlated columns), the closed form fails.
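One hedged way to quantify the ill-conditioning mentioned above is the condition number of $X^\top X$ together with variance inflation factors (VIFs, discussed again under pitfalls below); a sketch with two nearly collinear synthetic predictors:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)       # nearly a copy of x1
X = np.column_stack([np.ones(n), x1, x2])

# Large condition numbers signal an ill-conditioned X^T X.
print(np.linalg.cond(X.T @ X))

# VIFs are the diagonal of the inverse correlation matrix of the predictors;
# values far above ~10 are a common multicollinearity warning sign.
corr = np.corrcoef(np.column_stack([x1, x2]), rowvar=False)
print(np.diag(np.linalg.inv(corr)))
```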
Trade-offs
- Bias–variance: OLS is unbiased but can have high variance; regularization trades bias for lower variance (better generalization).
- Closed-form vs numeric: closed-form is elegant but for large $p$ and $n$, numerical linear algebra (QR, SVD) is preferred for stability.
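To make the second trade-off concrete, a minimal sketch of the numerically preferred routes, assuming NumPy's SVD-based `np.linalg.lstsq` and a QR factorization (data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 5))])
y = X @ rng.normal(size=6) + rng.normal(scale=0.3, size=500)

# SVD-based least squares; never forms (X^T X)^{-1} explicitly.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# QR route: X = QR, then solve the triangular system R beta = Q^T y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

print(np.allclose(beta_lstsq, beta_qr))
```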
🔍 Variants & Extensions
- Ridge Regression (Tikhonov): adds a $\lambda\|\beta\|_2^2$ penalty. Closed form:
  $$ \hat\beta_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y. $$
  Addresses multicollinearity and $p>n$ problems; see the sketch after this list.
- Lasso (L1): $\lambda\|\beta\|_1$ penalty encourages sparsity; no closed form — use convex optimization (coordinate descent).
- Weighted Least Squares (WLS): minimize $(y-X\beta)^\top W (y-X\beta)$ for heteroskedastic noise (diagonal $W$ with weights).
- Generalized Least Squares (GLS): when $\operatorname{Var}(\varepsilon)=\Sigma$ is non-scalar, the solution uses $\Sigma^{-1}$: $\hat\beta_{GLS} = (X^\top\Sigma^{-1}X)^{-1}X^\top\Sigma^{-1}y$.
- Robust Regression: minimize other loss functions (e.g., Huber) to reduce outlier sensitivity.
- Regularized/penalized variants: Elastic Net, Bayesian linear regression (gives a posterior distribution over $\beta$).
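A minimal sketch of the ridge closed form referenced above (penalty strength and data are illustrative; in practice the intercept is often left unpenalized, which this sketch sidesteps by omitting it):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 50, 80, 1.0                        # p > n: the plain OLS closed form fails here
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

# Ridge closed form: (X^T X + lambda I)^{-1} X^T y.
# Adding lambda I makes the matrix positive definite even when X^T X is singular.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge.shape, np.linalg.norm(beta_ridge))
```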
🚧 Common Challenges & Pitfalls
- Dropping the intercept: If the data isn't mean-centered and the intercept is omitted, estimates become biased and interpretation breaks down. Prefer including an intercept unless there is a strong reason not to.
- Multicollinearity: High variance of $\hat\beta$; coefficients are unstable (signs flip) though predictions may still be fine. Diagnose with the condition number or variance inflation factor (VIF). Use ridge, PCA, or remove redundant predictors.
- Overfitting vs underfitting: OLS with many predictors (especially $p$ close to $n$) overfits noise. Regularize or perform feature selection.
- Singular $X^\top X$: Occurs with duplicate/linearly dependent columns or $p>n$. Use SVD, the pseudoinverse $X^+ = (X^\top X)^{+} X^\top$, or regularization; a pseudoinverse sketch follows this list.
- Misinterpreting p-values and CIs: Standard inference requires assumptions (independent errors, homoskedasticity, normality for small samples). Violations require robust standard errors or the bootstrap.
- Extrapolation danger: Predictions far outside the training $X$ domain are unreliable.
- Leverage points: Points with high leverage (rows of $H$ with large diagonal elements) can unduly influence the fit. Examine the hat matrix and Cook’s distance.
- Numerical instability: Computing $(X^\top X)^{-1}$ directly can be numerically unstable. Use QR or SVD in practice.
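For the singular-design pitfall above, a hedged sketch of the pseudoinverse route (the duplicated column below is constructed deliberately; `np.linalg.pinv` and `np.linalg.lstsq` both work via the SVD and return the minimum-norm least-squares solution):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
X = rng.normal(size=(n, 4))
X = np.column_stack([X, X[:, 0] + X[:, 1]])    # last column = col 1 + col 2, so X^T X is singular
y = rng.normal(size=n)

# Moore–Penrose pseudoinverse: X^+ y is the minimum-norm least-squares solution.
beta_pinv = np.linalg.pinv(X) @ y

# SVD-based lstsq handles the rank deficiency the same way.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_pinv, beta_lstsq))
```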
📚 Reference Pointers
- Hastie, Tibshirani, Friedman — The Elements of Statistical Learning (chapter on linear methods). (Classic, rigorous.) https://statweb.stanford.edu/~tibs/ElemStatLearn/
- Draper & Smith — Applied Regression Analysis (practical diagnostics & inference).
- Wikipedia — Linear regression (good short overview of formulas and properties). https://en.wikipedia.org/wiki/Linear_regression
- Greene or Wooldridge — Econometric Analysis (for GLS, heteroskedasticity, inference).
- For Gauss–Markov theorem & unbiasedness: any standard regression textbook; short discussion here: https://en.wikipedia.org/wiki/Gauss–Markov_theorem
Quick Reference — Key Identities (cheat-sheet)
- Objective: $L(\beta)=\|y-X\beta\|_2^2$.
- Gradient: $\nabla_\beta L = 2(X^\top X \beta - X^\top y)$.
- Normal equations: $X^\top X \beta = X^\top y$.
- OLS solution (if invertible): $\hat\beta=(X^\top X)^{-1}X^\top y$.
- Fitted values: $\hat y = H y,\quad H=X(X^\top X)^{-1}X^\top$.
- Residuals orthogonality: $X^\top \hat e = 0$.
- Variance: $\operatorname{Var}(\hat\beta)=\sigma^2 (X^\top X)^{-1}$.
- Ridge closed form: $\hat\beta_{\text{ridge}}=(X^\top X+\lambda I)^{-1}X^\top y$.
Final mentor note (short, practical)
When asked in an interview to derive OLS on paper, do the following clearly and confidently:
- Define the objective $L(\beta)=\|y-X\beta\|_2^2$.
- Expand (or directly differentiate using matrix calculus) and compute $\nabla_\beta L$.
- Set gradient to zero → normal equations $X^\top X\beta=X^\top y$.
- State solution $\hat\beta=(X^\top X)^{-1}X^\top y$ and immediately discuss when inverse exists / when not (mention pseudoinverse or ridge).
- Give geometric interpretation (projection), state Gauss–Markov for BLUE, and give one sentence on variance and inference.
- Mention numerical stability (use QR/SVD) and give a quick remedy for multicollinearity (ridge).