Master the Core Theory and Assumptions: Linear Regression
🎯 Core Idea
- Linear regression models the conditional expectation of a continuous target $y$ as a linear function of input features $X$:
$$ y = X\beta + \varepsilon $$
where $X$ is the design matrix, $\beta$ the coefficient vector, and $\varepsilon$ the (irreducible) noise. The goal of Ordinary Least Squares (OLS) is to find the $\beta$ that minimizes the sum of squared residuals $\|y - X\beta\|_2^2$. Linear regression is both a modeling tool and a diagnostic baseline — it’s simple, interpretable, and the theoretical bedrock for many extensions.
🌱 Intuition & Real-World Analogy
- Why use it? Because many relationships are approximately linear locally, and a linear map is the simplest interpretable mapping between inputs and outputs.
- Analogy 1 — Fitting a board to points: Imagine trying to lay a straight plank (the model) across scattered pegs (data points) so the plank is as close as possible to all pegs. OLS finds the plank position minimizing total squared vertical gaps.
- Analogy 2 — Budget allocation: Think of $\beta$ as how much “weight” you give each budget line in predicting final monthly cost. If one line doubles, the linear model assumes proportional effect on final cost.
📐 Mathematical Foundation
Model statement
$$ y = X\beta + \varepsilon $$
- $y \in \mathbb{R}^n$: observed outcomes.
- $X \in \mathbb{R}^{n\times p}$: design matrix, rows are observations, columns are features (may include a column of ones for intercept).
- $\beta \in \mathbb{R}^p$: parameters to estimate.
- $\varepsilon \in \mathbb{R}^n$: noise/errors, usually treated as random.
OLS objective
$$ \hat\beta_{OLS} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 = \arg\min_{\beta} \; (y - X\beta)^\top (y - X\beta) $$
Derivation (closed-form)
Set the gradient to zero:
$$ \frac{\partial}{\partial\beta} \big((y-X\beta)^\top (y-X\beta)\big) = -2X^\top (y - X\beta) = 0 $$
Solve:
$$ X^\top X \hat\beta_{OLS} = X^\top y \quad\Rightarrow\quad \hat\beta_{OLS} = (X^\top X)^{-1} X^\top y $$
Conditions: $X^\top X$ must be invertible (i.e., the columns of $X$ are linearly independent).
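A minimal NumPy sketch of this closed-form fit on simulated data (the simulation and all variable names are illustrative); in practice a least-squares solver such as `np.linalg.lstsq` is preferred to explicitly inverting $X^\top X$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)               # y = X beta + noise

# Normal equations, exactly as derived above (fine when X^T X is well conditioned)
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Preferred in practice: a least-squares solver that never forms (X^T X)^{-1}
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)   # both estimates should be close to beta_true
print(beta_lstsq)
```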
Distributional results (when $\varepsilon \sim \mathcal{N}(0,\sigma^2 I)$)
- $\mathbb{E}[\hat\beta_{OLS}] = \beta$ (unbiased).
- $\operatorname{Var}(\hat\beta_{OLS}) = \sigma^2 (X^\top X)^{-1}$.
- Residual vector: $\hat r = y - X\hat\beta$. Estimate of noise variance:
$$ \hat\sigma^2 = \frac{\hat r^\top \hat r}{n-p} $$
- Predictions: $\hat y = X\hat\beta$. Variance of the fitted mean at a new point $x_*$ (add $\sigma^2$ for the variance of a new observation $y_*$):
$$ \operatorname{Var}(\hat y_*) = \sigma^2 \, x_*^\top (X^\top X)^{-1} x_* $$
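Continuing the NumPy sketch above (same simulated `X`, `y`, `n`, `p`), these variance formulas map directly to code:

```python
# Continues the snippet above: X, y, n, p already defined
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)                  # unbiased noise-variance estimate

XtX_inv = np.linalg.inv(X.T @ X)
cov_beta = sigma2_hat * XtX_inv                       # Var(beta_hat) = sigma^2 (X^T X)^{-1}
se_beta = np.sqrt(np.diag(cov_beta))                  # coefficient standard errors

x_star = np.array([1.0, 0.5, -1.0])                   # hypothetical new point (intercept first)
var_fit = sigma2_hat * x_star @ XtX_inv @ x_star      # variance of the fitted mean at x_star
```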
Key assumptions (classical linear model)
- Linearity in parameters: $y = X\beta + \varepsilon$.
- Full rank / no perfect multicollinearity: $X^\top X$ invertible.
- Exogeneity / zero-mean error: $\mathbb{E}[\varepsilon \mid X] = 0$.
- Homoscedasticity: $\operatorname{Var}(\varepsilon_i \mid X) = \sigma^2$ (constant variance).
- Independence: errors are uncorrelated (often independent).
- (Optional for inference) Normality: $\varepsilon \sim \mathcal{N}(0,\sigma^2 I)$ — gives exact finite-sample inference; otherwise rely on large-sample approximations (CLT).
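Under these assumptions the usual standard errors, t-tests, and confidence intervals apply; a minimal statsmodels sketch (reusing the simulated `X`, `y` from the NumPy snippets above):

```python
import statsmodels.api as sm

# X already includes an intercept column, so no add_constant is needed here
fit = sm.OLS(y, X).fit()
print(fit.summary())    # coefficients, standard errors, t-stats, F-test, R^2
print(fit.conf_int())   # confidence intervals (exact under normal errors, approximate otherwise)
```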
⚖️ Strengths, Limitations & Trade-offs
Strengths
- Analytic closed-form solution (when $p$ small and $X^\top X$ invertible).
- Highly interpretable coefficients: $\beta_j$ = marginal effect of one-unit change in feature $j$ holding others fixed (caveat: correlated features).
- Well-understood statistical properties (Gauss–Markov: OLS is BLUE under assumptions).
- Fast to train and robust baseline.
Limitations
- Linearity constraint: cannot capture nonlinear relationships unless features are transformed/augmented.
- Sensitive to outliers: squared loss magnifies large residuals.
- Violation of assumptions: heteroscedasticity, autocorrelation, endogeneity, or multicollinearity degrade inference or prediction.
- Interpretation pitfalls: coefficients are conditional on included covariates — omitted-variable bias can mislead causal claims.
Trade-offs
- Bias–variance: adding regularization (Ridge/Lasso) increases bias but reduces variance — helpful with multicollinearity or high-dimensional $p$.
- Simplicity vs expressiveness: linear models are simple and interpretable but less expressive than nonlinear models (trees, neural nets). However, feature engineering or basis expansions recover much expressive power while keeping interpretability.
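As a small illustration of the bias–variance point, a scikit-learn sketch: with two nearly collinear features, OLS coefficients can swing to large offsetting values, while Ridge shrinks them toward more stable estimates (the simulated data and `alpha=1.0` are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)          # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)   # only x1 truly matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)      # often large, offsetting values
print("Ridge coefficients:", ridge.coef_)    # shrunk, more stable
```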
🔍 Variants & Extensions
- Weighted Least Squares (WLS): addresses heteroscedasticity by weighting observations inversely to their variances.
- Generalized Least Squares (GLS): handles correlated/heteroscedastic errors ($\operatorname{Var}(\varepsilon)=\Sigma$).
- Ridge (L2) and Lasso (L1): penalized regression to control multicollinearity and overfitting.
- Elastic Net: combination of L1 and L2.
- Generalized Linear Models (GLMs): extend to non-Gaussian targets (e.g., logistic regression for binary).
- Basis expansions / splines / polynomial regression: allow nonlinear relationships while keeping linearity in parameters.
- Robust regression (e.g., Huber, RANSAC): reduce sensitivity to outliers.
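For example, a polynomial basis expansion stays linear in the parameters even though the fitted curve is nonlinear in $x$; a short scikit-learn sketch (degree 3 and the sine ground truth are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-3, 3, size=150)).reshape(-1, 1)
y = np.sin(x).ravel() + 0.2 * rng.normal(size=150)   # nonlinear ground truth

# Still a linear model in the expanded features 1, x, x^2, x^3
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(x, y)
print(poly_model.score(x, y))                        # in-sample R^2
```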
🚧 Common Challenges & Pitfalls
Multicollinearity
- Problem: predictors highly correlated → $X^\top X$ ill-conditioned → coefficients unstable, inflated standard errors.
- Detection: high Variance Inflation Factor (VIF), near-singular $X^\top X$, large coefficient swings when removing features.
- Mitigation: remove/recombine features, PCA, Ridge regression.
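A quick VIF check with statsmodels on deliberately correlated features (a common rule of thumb flags VIFs above roughly 5–10; the simulation is illustrative):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)     # highly correlated with x1
x3 = rng.normal(size=200)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF per column (index 0 is the intercept and is usually ignored)
vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print(vifs)                               # x1 and x2 should show large VIFs
```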
Heteroscedasticity
- Problem: error variance is not constant → OLS is still unbiased, but the usual variance estimates (hence p-values and confidence intervals) are wrong.
- Detection: residuals vs fitted plot showing funnel shape; formal tests (Breusch–Pagan, White test).
- Mitigation: transform response (e.g., log), WLS, heteroscedasticity-robust (sandwich) standard errors.
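A sketch of detection (Breusch–Pagan) and one mitigation (HC3 sandwich standard errors) with statsmodels, using simulated data whose error spread grows with $x$:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=300)
y = 2 + 0.5 * x + rng.normal(scale=0.3 * x)      # error spread grows with x (funnel shape)
X = sm.add_constant(x)

fit = sm.OLS(y, X).fit()
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {lm_pval:.4f}")   # small p-value suggests heteroscedasticity

# Mitigation: heteroscedasticity-robust (sandwich) standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC3")
print(robust_fit.bse)                            # robust standard errors
```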
Autocorrelation (time series)
- Problem: correlated residuals (common in time series) → inefficient estimates, invalid standard errors.
- Detection: Durbin–Watson, ACF of residuals.
- Mitigation: include lagged variables, GLS, ARIMA-style modeling.
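Detection in code, using simulated AR(1) errors (a Durbin–Watson statistic near 2 suggests little first-order autocorrelation; values well below 2 suggest positive autocorrelation):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
n = 300
e = np.zeros(n)
for t in range(1, n):                        # AR(1) errors: e_t = 0.8 e_{t-1} + noise
    e[t] = 0.8 * e[t - 1] + rng.normal(scale=0.5)
x = np.arange(n, dtype=float)
y = 1.0 + 0.02 * x + e
X = sm.add_constant(x)

fit = sm.OLS(y, X).fit()
print(durbin_watson(fit.resid))              # well below 2 here -> positive autocorrelation
```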
Endogeneity / Omitted Variable Bias
- Problem: $\mathbb{E}[\varepsilon \mid X] \neq 0$ due to omitted confounders or reverse causation → biased $\hat\beta$.
- Mitigation: instrumental variables (IV), controlled experiments, including known confounders as covariates.
Overfitting with high-dimensional data ($p \approx n$ or $p>n$)
- Problem: $X^\top X$ singular when $p>n$.
- Mitigation: regularization (Ridge/Lasso), feature selection, dimensionality reduction.
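A small sketch of the $p > n$ case: the normal equations are singular, but Lasso still returns a well-defined sparse fit (dimensions and `alpha=0.1` are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 50, 200                               # more features than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:4] = [3.0, -2.0, 1.5, 4.0]             # only a few features matter
y = X @ beta + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.count_nonzero(lasso.coef_))         # sparse solution despite p > n
```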
Misinterpreting coefficients
- Coefficients are conditional associations, not automatically causal effects. Causality needs separate identification (randomization, IVs, DAG-based reasoning).
🔬 Detection & Diagnostic Checklist (practical, conceptual)
- Residuals vs fitted: check linearity and heteroscedasticity.
- QQ plot of residuals: assess normality (only matters for small-sample inference).
- VIF scores: detect multicollinearity.
- Leverage and influence (Cook’s distance): detect influential points/outliers.
- Plot residual ACF: detect autocorrelation.
- Compare nested models (F-test) / information criteria (AIC/BIC): model selection reasoning.
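Several items on this checklist are one or two lines against a fitted statsmodels model; a self-contained sketch on simulated data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)
x = rng.normal(size=150)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=150)
X = sm.add_constant(x)

fit = sm.OLS(y, X).fit()
influence = fit.get_influence()

resid_vs_fitted = list(zip(fit.fittedvalues, fit.resid))   # plot these: linearity, variance
cooks_d, _ = influence.cooks_distance                      # Cook's distance per observation
leverage = influence.hat_matrix_diag                       # leverage (hat values)
print(durbin_watson(fit.resid))                            # ~2 means little first-order autocorrelation
print(fit.aic, fit.bic)                                    # information criteria for model comparison
```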
📚 Reference Pointers (for deeper theory)
- Gauss–Markov theorem (BLUE): https://en.wikipedia.org/wiki/Gauss–Markov_theorem
- Ordinary Least Squares (Wikipedia): https://en.wikipedia.org/wiki/Ordinary_least_squares
- Heteroscedasticity tests — Breusch–Pagan: https://en.wikipedia.org/wiki/Breusch–Pagan_test
- Variance Inflation Factor (VIF): https://en.wikipedia.org/wiki/Variance_inflation_factor
- Robust standard errors (“sandwich” estimator): https://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors
- Koller & Friedman, Probabilistic Graphical Models — for formal treatment of assumptions and probabilistic modeling: https://mitpress.mit.edu/9780262013192/probabilistic-graphical-models