Linear Regression — OLS Closed-Form Derivation & Math Drill Guide
🎯 Core Idea
Linear Regression (ordinary least squares, OLS) fits a linear model
$$ \hat{y} = X\beta $$
to given inputs $X$ and targets $y$ by choosing coefficients $\beta$ that minimize the sum of squared residuals (prediction errors). The closed-form OLS solution is the coefficient vector that makes the residuals orthogonal to the column space of $X$, leading to the normal equations
$$ X^\top X \beta = X^\top y, $$
and (when invertible)
$$ \hat\beta = (X^\top X)^{-1} X^\top y. $$
🌱 Intuition & Real-World Analogy
- Why OLS? Minimizing squared errors penalizes large deviations strongly and yields a simple, analytically tractable estimator that often works well as a baseline.
- Analogy 1 — Shadow projection: Imagine the columns of $X$ span a plane in high-dimensional space. Observed $y$ is a vector; OLS finds the point in the plane closest to $y$ (orthogonal projection).
- Analogy 2 — Fitting a table to floors: You want a flat table (linear predictor) that’s as close as possible to many uneven floor points (data). OLS picks the table orientation minimizing total squared vertical distances.
📐 Mathematical Foundation
Setup and notation
- Data: $n$ observations, $p$ predictors (including optional intercept). $X\in\mathbb{R}^{n\times p}$ design matrix, rows $x_i^\top$. Target vector $y\in\mathbb{R}^n$. Coefficients $\beta\in\mathbb{R}^p$. Residuals $e = y - X\beta$.
- Objective (OLS): minimize the sum of squared residuals
  $$ L(\beta)=\|y-X\beta\|_2^2 = (y-X\beta)^\top (y-X\beta). $$
Gradient and normal equations (matrix form)
Compute gradient w.r.t. $\beta$:
$$ \begin{aligned} L(\beta) &= y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta,\\ \nabla_\beta L(\beta) &= -2 X^\top y + 2 X^\top X \beta. \end{aligned} $$
Set the gradient to zero for a stationary point:
$$ X^\top X \beta = X^\top y \qquad\text{(normal equations)}. $$
If $X^\top X$ is invertible,
$$ \boxed{\ \hat\beta = (X^\top X)^{-1} X^\top y\ }. $$
Residual & fitted values
- Fitted values: $\hat{y} = X\hat\beta = H y$ where $H = X(X^\top X)^{-1}X^\top$ is the hat (projection) matrix.
- Residuals: $\hat{e} = y - \hat{y} = (I - H)y$. Key property: $X^\top \hat{e} = 0$ (residuals orthogonal to each column of $X$).
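The projection and orthogonality properties above can be checked numerically. A minimal NumPy sketch, assuming a small synthetic dataset (all sizes, coefficients, and noise levels below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n observations, an intercept column plus 3 predictors.
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Normal equations: solve X^T X beta = X^T y (no explicit inverse).
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T, fitted values, residuals.
H = X @ np.linalg.solve(XtX, X.T)
y_hat = H @ y
resid = y - y_hat

# Residuals are orthogonal to every column of X (up to floating-point error).
print(np.allclose(X.T @ resid, 0.0, atol=1e-8))
```

Solving the linear system rather than inverting `XtX` is the usual numerically safer route; more on this under trade-offs below.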
Probabilistic interpretation & variance of estimator
Under the standard linear model
$$ y = X\beta + \varepsilon,\qquad \mathbb{E}[\varepsilon]=0,\ \ \operatorname{Var}(\varepsilon)=\sigma^2 I, $$
the OLS estimator $\hat\beta$ is unbiased and has covariance
$$ \mathbb{E}[\hat\beta] = \beta,\qquad \operatorname{Var}(\hat\beta) = \sigma^2 (X^\top X)^{-1}. $$
Estimated noise variance: $\hat\sigma^2 = \dfrac{\|\hat e\|_2^2}{n-p}$ under the usual degrees-of-freedom correction.
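As a hedged illustration of these formulas, the snippet below estimates $\hat\sigma^2$ and the coefficient standard errors on synthetic data (the true $\beta$, noise level, and sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4                                  # p counts the intercept column
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# hat{sigma}^2 = ||e||^2 / (n - p), the degrees-of-freedom correction.
sigma2_hat = resid @ resid / (n - p)
cov_beta = sigma2_hat * XtX_inv                # estimated Var(beta_hat)
std_err = np.sqrt(np.diag(cov_beta))
print(sigma2_hat, std_err)
```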
Full derivation (scalar to matrix) — paper-style step-through
1. Scalar single-feature case. Model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$. OLS minimizes
   $$ L(\beta_0,\beta_1)=\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2. $$
   Take partial derivatives and set them to zero:
   $$ \frac{\partial L}{\partial \beta_0} = -2\sum (y_i - \beta_0 - \beta_1 x_i)=0, $$
   $$ \frac{\partial L}{\partial \beta_1} = -2\sum x_i(y_i - \beta_0 - \beta_1 x_i)=0. $$
   Solve these two linear equations to obtain explicit $\hat\beta_1$ and $\hat\beta_0$ (standard textbook algebra leading to $\hat\beta_1=\dfrac{\sum (x_i-\bar x)(y_i-\bar y)}{\sum (x_i-\bar x)^2}$); these formulas are checked numerically in the sketch after this derivation.
2. Matrix derivation (compact). Start from $L(\beta) = (y-X\beta)^\top(y-X\beta)$. Expand and differentiate as in the main text to get the normal equations
   $$ X^\top X \beta = X^\top y. $$
   Invert (if possible) to get $\hat\beta$.
3. Projection interpretation. The distance $\|y-X\beta\|_2^2$ is minimized when $y-X\hat\beta$ is orthogonal to the column space of $X$, i.e. $X^\top (y-X\hat\beta)=0$, which is exactly the normal equations.
4. Variance derivation. Under $y=X\beta+\varepsilon$, substitute into $\hat\beta=(X^\top X)^{-1}X^\top y$:
   $$ \hat\beta = \beta + (X^\top X)^{-1}X^\top \varepsilon. $$
   Thus $\operatorname{Var}(\hat\beta) = (X^\top X)^{-1}X^\top \operatorname{Var}(\varepsilon) X (X^\top X)^{-1} = \sigma^2 (X^\top X)^{-1}$.
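A hedged numerical companion to this derivation: the sketch below checks the scalar slope/intercept formulas from step 1 against the matrix solution, then repeatedly redraws the noise to compare the empirical covariance of $\hat\beta$ with $\sigma^2 (X^\top X)^{-1}$ from step 4 (all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 1 check: textbook slope/intercept formulas vs. the matrix normal equations.
x = rng.normal(size=100)
y = 3.0 + 1.5 * x + rng.normal(scale=0.4, size=100)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
X1 = np.column_stack([np.ones_like(x), x])
print(np.allclose([beta0, beta1], np.linalg.solve(X1.T @ X1, X1.T @ y)))

# Step 4 check: empirical covariance of beta_hat over repeated noise draws
# (with X held fixed) should approach sigma^2 (X^T X)^{-1}.
n, sigma = 150, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([0.5, -1.0, 2.0])
XtX_inv = np.linalg.inv(X.T @ X)
draws = np.array([
    XtX_inv @ X.T @ (X @ beta + rng.normal(scale=sigma, size=n))
    for _ in range(20_000)
])
print(np.max(np.abs(np.cov(draws, rowvar=False) - sigma**2 * XtX_inv)))  # near zero
```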
⚖️ Strengths, Limitations & Trade-offs
Strengths
- Closed-form, fast to compute for small-to-moderate $p$.
- Interpretable coefficients; linearity is easy to explain.
- Under Gauss–Markov assumptions, OLS is the Best Linear Unbiased Estimator (BLUE).
- Basis for many extensions and diagnostic techniques.
Limitations
- Linearity: only captures linear relationships unless features are engineered.
- Sensitive to outliers: squared loss amplifies large residuals.
- Multicollinearity: near-linear dependence among columns of $X$ makes $X^\top X$ ill-conditioned and estimates unstable (high variance); see the diagnostic sketch after this list.
- Homoskedasticity & uncorrelated errors assumption: if violated, standard errors and inference are invalid.
- Invertibility requirement: if $X^\top X$ is singular (e.g., $p>n$ or perfectly correlated columns), the closed form fails.
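One hedged way to quantify the ill-conditioning mentioned above is the condition number of $X^\top X$ together with variance inflation factors (VIFs, discussed again under pitfalls below); a sketch with two nearly collinear synthetic predictors:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)       # nearly a copy of x1
X = np.column_stack([np.ones(n), x1, x2])

# Large condition numbers signal an ill-conditioned X^T X.
print(np.linalg.cond(X.T @ X))

# VIFs are the diagonal of the inverse correlation matrix of the predictors;
# values far above ~10 are a common multicollinearity warning sign.
corr = np.corrcoef(np.column_stack([x1, x2]), rowvar=False)
print(np.diag(np.linalg.inv(corr)))
```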
Trade-offs
- Bias–variance: OLS is unbiased but can have high variance; regularization trades bias for lower variance (better generalization).
- Closed-form vs numeric: closed-form is elegant but for large $p$ and $n$, numerical linear algebra (QR, SVD) is preferred for stability.
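To make the second trade-off concrete, a minimal sketch of the numerically preferred routes, assuming NumPy's SVD-based `np.linalg.lstsq` and a QR factorization (data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 5))])
y = X @ rng.normal(size=6) + rng.normal(scale=0.3, size=500)

# SVD-based least squares; never forms (X^T X)^{-1} explicitly.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# QR route: X = QR, then solve the triangular system R beta = Q^T y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

print(np.allclose(beta_lstsq, beta_qr))
```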
🔍 Variants & Extensions
- Ridge Regression (Tikhonov): adds a $\lambda\|\beta\|_2^2$ penalty. Closed form:
  $$ \hat\beta_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y. $$
  Addresses multicollinearity and $p>n$ problems; see the sketch after this list.
- Lasso (L1): $\lambda\|\beta\|_1$ penalty encourages sparsity; no closed form — use convex optimization (coordinate descent).
- Weighted Least Squares (WLS): minimize $(y-X\beta)^\top W (y-X\beta)$ for heteroskedastic noise (diagonal $W$ with weights).
- Generalized Least Squares (GLS): when $\operatorname{Var}(\varepsilon)=\Sigma$ is non-scalar, the solution uses $\Sigma^{-1}$: $\hat\beta_{GLS} = (X^\top\Sigma^{-1}X)^{-1}X^\top\Sigma^{-1}y$.
- Robust Regression: minimize other loss functions (e.g., Huber) to reduce outlier sensitivity.
- Regularized/penalized variants: Elastic Net, Bayesian linear regression (gives a posterior distribution over $\beta$).
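A minimal sketch of the ridge closed form referenced above (penalty strength and data are illustrative; in practice the intercept is often left unpenalized, which this sketch sidesteps by omitting it):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 50, 80, 1.0                        # p > n: the plain OLS closed form fails here
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

# Ridge closed form: (X^T X + lambda I)^{-1} X^T y.
# Adding lambda I makes the matrix positive definite even when X^T X is singular.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge.shape, np.linalg.norm(beta_ridge))
```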
🚧 Common Challenges & Pitfalls
- Dropping the intercept: If the data isn't mean-centered and the intercept is omitted, estimates become biased and interpretation breaks down. Prefer including an intercept unless there is a strong reason not to.
- Multicollinearity: High variance of $\hat\beta$; coefficients are unstable (signs flip) though predictions may still be fine. Diagnose with the condition number or variance inflation factor (VIF). Use ridge, PCA, or remove redundant predictors.
- Overfitting vs underfitting: OLS with many predictors (especially $p$ close to $n$) overfits noise. Regularize or perform feature selection.
- Singular $X^\top X$: Occurs with duplicate/linearly dependent columns or $p>n$. Use SVD, the pseudoinverse $X^+ = (X^\top X)^{+} X^\top$, or regularization; a pseudoinverse sketch follows this list.
- Misinterpreting p-values and CIs: Standard inference requires assumptions (independent errors, homoskedasticity, normality for small samples). Violations require robust standard errors or the bootstrap.
- Extrapolation danger: Predictions far outside the training $X$ domain are unreliable.
- Leverage points: Points with high leverage (rows of $H$ with large diagonal elements) can unduly influence the fit. Examine the hat matrix and Cook’s distance.
- Numerical instability: Computing $(X^\top X)^{-1}$ directly can be numerically unstable. Use QR or SVD in practice.
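For the singular-design pitfall above, a hedged sketch of the pseudoinverse route (the duplicated column below is constructed deliberately; `np.linalg.pinv` and `np.linalg.lstsq` both work via the SVD and return the minimum-norm least-squares solution):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
X = rng.normal(size=(n, 4))
X = np.column_stack([X, X[:, 0] + X[:, 1]])    # last column = col 1 + col 2, so X^T X is singular
y = rng.normal(size=n)

# Moore–Penrose pseudoinverse: X^+ y is the minimum-norm least-squares solution.
beta_pinv = np.linalg.pinv(X) @ y

# SVD-based lstsq handles the rank deficiency the same way.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_pinv, beta_lstsq))
```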
📚 Reference Pointers
- Hastie, Tibshirani, Friedman — The Elements of Statistical Learning (chapter on linear methods). (Classic, rigorous.) https://statweb.stanford.edu/~tibs/ElemStatLearn/
- Draper & Smith — Applied Regression Analysis (practical diagnostics & inference).
- Wikipedia — Linear regression (good short overview of formulas and properties). https://en.wikipedia.org/wiki/Linear_regression
- Greene or Wooldridge — Econometric Analysis (for GLS, heteroskedasticity, inference).
- For Gauss–Markov theorem & unbiasedness: any standard regression textbook; short discussion here: https://en.wikipedia.org/wiki/Gauss–Markov_theorem
Quick Reference — Key Identities (cheat-sheet)
- Objective: $L(\beta)=\|y-X\beta\|_2^2$.
- Gradient: $\nabla_\beta L = 2(X^\top X \beta - X^\top y)$.
- Normal equations: $X^\top X \beta = X^\top y$.
- OLS solution (if invertible): $\hat\beta=(X^\top X)^{-1}X^\top y$.
- Fitted values: $\hat y = H y,\quad H=X(X^\top X)^{-1}X^\top$.
- Residuals orthogonality: $X^\top \hat e = 0$.
- Variance: $\operatorname{Var}(\hat\beta)=\sigma^2 (X^\top X)^{-1}$.
- Ridge closed form: $\hat\beta_{\text{ridge}}=(X^\top X+\lambda I)^{-1}X^\top y$.
Final mentor note (short, practical)
When asked in an interview to derive OLS on paper, do the following clearly and confidently:
- Define the objective $L(\beta)=\|y-X\beta\|_2^2$.
- Expand (or directly differentiate using matrix calculus) and compute $\nabla_\beta L$.
- Set gradient to zero → normal equations $X^\top X\beta=X^\top y$.
- State solution $\hat\beta=(X^\top X)^{-1}X^\top y$ and immediately discuss when inverse exists / when not (mention pseudoinverse or ridge).
- Give geometric interpretation (projection), state Gauss–Markov for BLUE, and give one sentence on variance and inference.
- Mention numerical stability (use QR/SVD) and give a quick remedy for multicollinearity (ridge).