Math Concepts for Linear Regression Interviews

Mean Squared Error (MSE) Loss Function

🎯 Intuition First:
MSE tells us how far our predictions are from the actual values, on average. Think of it as a “distance meter” that punishes bigger mistakes much more heavily than smaller ones.


📐 The Formal Definition:

$$ J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 $$

Breaking Down the Formula:

  • \( J(\theta) \): The loss function (also called the cost function) we want to minimize.
  • \( m \): The number of training examples.
  • \( y^{(i)} \): The actual (ground truth) target value for the \(i\)-th example.
  • \( \hat{y}^{(i)} \): The predicted value from our model for the \(i\)-th example.
  • The square \((y - \hat{y})^2\): Ensures errors are positive and amplifies large errors.

MSE is popular because it’s mathematically convenient: the square makes the function smooth and differentiable, which makes optimization efficient.
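
A minimal NumPy sketch (the toy data and function names are my own) of how the loss and its gradient can be computed; note that the gradient picks up a factor of 2 from the square:

```python
import numpy as np

def mse_loss(theta, X, y):
    """J(theta) = (1/m) * sum((y - X @ theta)**2)."""
    residuals = y - X @ theta
    return residuals @ residuals / len(y)

def mse_gradient(theta, X, y):
    """Gradient of the MSE w.r.t. theta: (2/m) * X^T (X theta - y)."""
    return (2.0 / len(y)) * X.T @ (X @ theta - y)

# Toy data: y ~ 0.5 + 3*x plus a little noise
rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.uniform(0, 1, 50)]   # intercept column + one feature
y = X @ np.array([0.5, 3.0]) + rng.normal(0, 0.1, 50)

theta = np.zeros(2)
print(mse_loss(theta, X, y), mse_gradient(theta, X, y))
```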
⚠️
  • Usage: You’ll be asked to derive the gradient of MSE during a coding interview or to explain why squared error is used instead of absolute error.
  • Questions: Common ones include: “Why do we square errors instead of taking absolute values?” or “What happens to MSE if there are extreme outliers?”

Normal Equation (Closed-Form Solution)

🎯 Intuition First:
The Normal Equation is like a shortcut. Instead of gradually adjusting parameters with gradient descent, you solve directly for the best-fitting line in one shot (when feasible).


📐 The Formal Definition:

$$ \hat{\theta} = (X^T X)^{-1} X^T y $$

Breaking Down the Formula:

  • \( \hat{\theta} \): The vector of parameters (coefficients) that minimize the MSE.
  • \( X \): The design matrix (rows = samples, columns = features).
  • \( y \): The target vector.
  • \( X^T X \): Captures correlations between features.
  • \( (X^T X)^{-1} \): Inverse of this matrix, if it exists.
  • \( X^T y \): Aligns features with outcomes.

This formula is exact — no tuning of learning rates or iterations. But it becomes computationally expensive (and unstable) when the number of features grows large or features are highly correlated.
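
As an illustrative sketch (the toy data is my own), the Normal Equation takes only a few lines of NumPy. Solving the linear system \(X^T X \theta = X^T y\) with np.linalg.solve is preferable to forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.c_[np.ones(m), rng.normal(size=(m, 3))]     # intercept + 3 features
true_theta = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ true_theta + rng.normal(scale=0.1, size=m)

# Solve (X^T X) theta = X^T y directly rather than computing the inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)    # close to true_theta
```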
⚠️
  • Usage: You may be asked to compare Normal Equation vs Gradient Descent.
  • Questions: Expect: “When would you not use the Normal Equation?” or “Why is matrix inversion problematic in high dimensions?”

Gradient Descent Update Rule

🎯 Intuition First:
Gradient Descent is like walking downhill with your eyes closed: at each step, you feel the slope under your feet and step in the direction opposite to it (downhill) to get closer to the valley (the minimum).


📐 The Formal Definition:

$$ \theta_j := \theta_j - \alpha \cdot \frac{1}{m} \sum_{i=1}^m ( \hat{y}^{(i)} - y^{(i)} ) x_j^{(i)} $$

Breaking Down the Formula:

  • \( \theta_j \): The \(j\)-th parameter (coefficient) of the model.
  • \( \alpha \): The learning rate — how big a step we take.
  • \( m \): Number of training examples.
  • \( \hat{y}^{(i)} \): Prediction for example \(i\).
  • \( y^{(i)} \): Actual target for example \(i\).
  • \( x_j^{(i)} \): The value of feature \(j\) for sample \(i\).
  • The term \( (\hat{y}^{(i)} - y^{(i)}) x_j^{(i)} \): Contribution of feature \(j\) to the overall gradient.

Gradient Descent doesn’t care if you have 10 or 10 million features — it can still optimize, as long as you tune the learning rate and iterations.
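
A minimal batch gradient descent sketch (the learning rate, iteration count, and toy data are arbitrary choices). The factor of 2 from the MSE gradient is absorbed into the learning rate here, matching the update rule above:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Batch gradient descent on MSE; the factor 2 is absorbed into alpha."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / len(y)
        theta -= alpha * gradient
    return theta

rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=200)
print(gradient_descent(X, y))    # approximately [1.0, 2.0, -3.0]
```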
⚠️
  • Usage: You’ll be asked to derive this update rule from the MSE loss.
  • Questions: Expect “Why do we subtract the gradient?” or “What happens if the learning rate is too small/large?”

R-squared (Coefficient of Determination)

🎯 Intuition First:
R² tells you how much of the variance in the target variable is explained by your model. Think of it as the “report card” for regression models: higher is usually better.


📐 The Formal Definition:

$$ R^2 = 1 - \frac{\sum_{i=1}^m (y^{(i)} - \hat{y}^{(i)})^2}{\sum_{i=1}^m (y^{(i)} - \bar{y})^2} $$

Breaking Down the Formula:

  • \( R^2 \): Coefficient of determination.
  • \( y^{(i)} \): Actual target value.
  • \( \hat{y}^{(i)} \): Predicted value.
  • \( \bar{y} \): Mean of all actual target values.
  • Numerator: Residual sum of squares (model error).
  • Denominator: Total sum of squares (how much variation exists in the data).

If your model predicts only the mean of \(y\), R² = 0. If your model is perfect, R² = 1. Negative values mean the model is worse than predicting the mean!
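
A quick sketch (the function name is mine) that computes R² from its definition and confirms the two edge cases just mentioned:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS."""
    rss = np.sum((y_true - y_pred) ** 2)             # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)      # total sum of squares
    return 1.0 - rss / tss

y_true = np.array([3.0, 5.0, 7.0, 9.0])
print(r_squared(y_true, y_true))                       # perfect model -> 1.0
print(r_squared(y_true, np.full(4, y_true.mean())))    # mean-only model -> 0.0
```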
⚠️
  • Usage: You’ll often be asked to interpret R² in plain English.
  • Questions: “Can R² decrease when you add more features?” or “Why is R² not always the best metric for model evaluation?”

Bias-Variance Decomposition

🎯 Intuition First:
This explains why your model might underfit (too simple, high bias) or overfit (too complex, high variance). It’s the balancing act behind every machine learning model.


📐 The Formal Definition:

$$ \mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{(\text{Bias}[\hat{f}(x)])^2}_{\text{error from wrong assumptions}} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{error from sensitivity to data}} + \underbrace{\sigma^2}_{\text{irreducible noise}} $$

Breaking Down the Formula:

  • \( y \): Actual target.
  • \( \hat{f}(x) \): Prediction of the model.
  • \( \text{Bias}^2 \): Error due to simplifying assumptions (e.g., assuming linearity when the relationship is non-linear).
  • \( \text{Variance} \): Error due to sensitivity to training data (e.g., model changes drastically if trained on different samples).
  • \( \sigma^2 \): Irreducible noise in the data.

Bias and variance pull in opposite directions. The sweet spot — where total error is minimized — is the balance every ML engineer must find.
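
A rough Monte-Carlo sketch of the trade-off (the sine target, noise level, polynomial degrees, and evaluation point are all arbitrary assumptions): refit a simple and a flexible model on many resampled training sets and estimate bias² and variance at a single test point:

```python
import numpy as np

def true_f(x):
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
x0, sigma, n_train, n_trials = 0.25, 0.3, 30, 500

for degree in (1, 9):                         # underfitting vs. overfitting
    preds = []
    for _ in range(n_trials):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, sigma, n_train)
        coefs = np.polyfit(x, y, degree)      # least-squares polynomial fit
        preds.append(np.polyval(coefs, x0))
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    print(f"degree={degree}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```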
⚠️
  • Usage: You’ll be asked to explain why adding more features can reduce bias but increase variance.
  • Questions: “How would you reduce variance without dramatically increasing bias?” or “How does regularization fit into this decomposition?”

Condition Number of \(X^T X\)

🎯 Intuition First:
The condition number tells us how stable or unstable our matrix computations will be. In regression, if \(X^T X\) has a high condition number, tiny changes in the data can cause massive swings in the estimated coefficients.


📐 The Formal Definition:
For a square matrix \(A\):

$$ \kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)} $$

Breaking Down the Formula:

  • \( \kappa(A) \): Condition number of matrix \(A\).
  • \( \sigma_{\max}(A) \): Largest singular value of \(A\).
  • \( \sigma_{\min}(A) \): Smallest singular value of \(A\).
  • In regression, \(A = X^T X\).

If the smallest singular value is close to zero, the condition number explodes, meaning the matrix is nearly singular (ill-conditioned). This is why linear regression becomes numerically unstable when features are highly correlated.
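
A quick check with NumPy (synthetic data; the near-duplicate feature is my own construction) showing how the condition number of \(X^T X\) explodes when two features are almost identical:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X_clean = np.c_[np.ones(200), x1, rng.normal(size=200)]                  # independent features
X_collinear = np.c_[np.ones(200), x1, x1 + rng.normal(scale=1e-3, size=200)]

# kappa(A) = sigma_max / sigma_min, computed via SVD by np.linalg.cond.
print(np.linalg.cond(X_clean.T @ X_clean))           # modest
print(np.linalg.cond(X_collinear.T @ X_collinear))   # many orders of magnitude larger
```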
⚠️
  • Usage: Used to assess multicollinearity and numerical stability.
  • Questions: “What happens if \(X^T X\) is nearly singular?” or “Why might the Normal Equation fail in practice even when mathematically valid?”

Multicollinearity

🎯 Intuition First:
Multicollinearity occurs when features are highly correlated with each other. It’s like asking the model to separate the effects of two almost-identical variables — it can’t decide who deserves the credit, so it assigns unstable and misleading coefficients.


📐 The Formal Definition (Variance Inflation Factor):

$$ \text{VIF}_j = \frac{1}{1 - R_j^2} $$

Breaking Down the Formula:

  • \( \text{VIF}_j \): Variance Inflation Factor for feature \(j\).
  • \( R_j^2 \): R-squared value from regressing feature \(j\) on all the other features.
  • A large \(R_j^2\) → high VIF → strong multicollinearity.

When features are correlated, the variance of the estimated coefficients is inflated, making them unreliable. This doesn’t always hurt predictions, but it wrecks interpretability.
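
A sketch of VIF computed directly from its definition (the helper function and toy data are my own; statsmodels also ships a variance_inflation_factor helper if you prefer a library routine):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing feature j on all other features."""
    out = []
    for j in range(X.shape[1]):
        others = np.c_[np.ones(len(X)), np.delete(X, j, axis=1)]   # intercept + remaining features
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=300)   # strongly correlated with x1
x3 = rng.normal(size=300)                          # independent
print(vif(np.c_[x1, x2, x3]))                      # the first two VIFs are clearly elevated
```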
⚠️
  • Usage: Candidates are asked to identify and mitigate multicollinearity.
  • Questions: “How does multicollinearity affect regression coefficients?” or “How would you detect and fix it in practice?”

Eigenvalue Decomposition of \(X^T X\)

🎯 Intuition First:
Eigenvalues and eigenvectors describe the directions of maximum variance in your data. For regression, decomposing \(X^T X\) reveals how information is distributed across features — and whether some directions are “weak” (near-zero eigenvalues → unstable solutions).


📐 The Formal Definition:

$$ X^T X = Q \Lambda Q^T $$

Breaking Down the Formula:

  • \( X^T X \): Symmetric positive semi-definite matrix.
  • \( Q \): Orthogonal matrix whose columns are eigenvectors (directions).
  • \( \Lambda \): Diagonal matrix with eigenvalues (strength of variance in each direction).
  • Small eigenvalues → nearly flat directions → ill-conditioning.

Eigenvalue decomposition connects regression to PCA: both look at variance structure in data. When eigenvalues are tiny, regularization (like Ridge) effectively boosts them, stabilizing the inversion of \(X^T X\).
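
A small sketch (synthetic data) showing the tiny eigenvalue that appears when two columns are nearly identical, and how adding \(\lambda I\) lifts it:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.c_[x1, x1 + rng.normal(scale=1e-4, size=100)]   # two nearly identical features

eigvals = np.linalg.eigvalsh(X.T @ X)   # eigvalsh: eigenvalues of a symmetric matrix
print("eigenvalues of X^T X:", eigvals)           # one eigenvalue is almost zero
print("after adding lambda*I:", eigvals + 1.0)    # ridge with lambda = 1 lifts them all
```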
⚠️
  • Usage: Interviewers use this to check understanding of numerical stability and regularization.
  • Questions: “Why does Ridge regression fix ill-conditioned \(X^T X\)?” or “What role do eigenvalues play in the stability of regression coefficients?”

Singular Value Decomposition (SVD) in Linear Regression

🎯 Intuition First:
SVD is like a Swiss army knife for linear algebra. In regression, it’s the most stable way to solve for coefficients — especially when \(X^T X\) is ill-conditioned or singular.


📐 The Formal Definition:

$$ X = U \Sigma V^T $$

Breaking Down the Formula:

  • \( U \): Orthogonal matrix representing left singular vectors (directions in sample space).
  • \( \Sigma \): Diagonal matrix of singular values (the square roots of the eigenvalues of \(X^T X\)).
  • \( V \): Orthogonal matrix of right singular vectors (directions in feature space).
  • Regression solution using SVD:
    $$ \hat{\theta} = V \Sigma^+ U^T y $$ where \( \Sigma^+ \) is the pseudo-inverse of \( \Sigma \).

SVD avoids directly inverting \(X^T X\). Instead, it decomposes \(X\) into stable parts and uses the pseudo-inverse. This is the method libraries like NumPy and scikit-learn use under the hood.
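
A brief sketch (toy data) of solving least squares through the SVD and checking it against np.linalg.lstsq, which is also SVD-based. Here \(X\) is full rank, so applying \(\Sigma^+\) is a simple elementwise division; for rank-deficient problems, reciprocals of near-zero singular values would be set to zero instead:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# theta = V Sigma^+ U^T y, using the thin SVD of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
theta_svd = Vt.T @ ((U.T @ y) / s)          # dividing by s applies Sigma^+

theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_svd, theta_lstsq))  # True: same solution
```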
⚠️
  • Usage: Explains why practical implementations don’t actually compute \((X^T X)^{-1}\).
  • Questions: “Why do libraries use SVD instead of the Normal Equation?” or “How does SVD help with rank-deficient matrices?”

Regularization in the Eigenvalue Perspective

🎯 Intuition First:
Regularization works by shrinking coefficients along directions where data is weak. From an eigenvalue perspective, it’s like inflating tiny eigenvalues to prevent unstable solutions.


📐 The Formal Definition (Ridge Example):

$$ \hat{\theta}_{ridge} = (X^T X + \lambda I)^{-1} X^T y $$

Breaking Down the Formula:

  • \( \lambda \): Regularization strength.
  • Adding \( \lambda I \): Shifts all eigenvalues of \(X^T X\) by \(\lambda\).
  • Prevents division by near-zero eigenvalues when computing inverse.

Regularization doesn’t just “shrink coefficients” — mathematically, it fixes ill-conditioning by stabilizing eigenvalues. This is why Ridge regression is robust against multicollinearity.
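
A small sketch (synthetic, nearly collinear data) of the Ridge closed form, showing how even a modest \(\lambda\) stabilizes the coefficients:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """(X^T X + lambda I)^{-1} X^T y, computed with a linear solve."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.c_[x1, x1 + rng.normal(scale=1e-3, size=100)]   # nearly collinear columns
y = 3 * x1 + rng.normal(scale=0.1, size=100)

print(ridge_closed_form(X, y, lam=0.0))   # plain OLS: an unstable split between the two near-copies
print(ridge_closed_form(X, y, lam=1.0))   # ridge: the effect is shared stably, roughly 1.5 each
```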
⚠️
  • Usage: Advanced interviewers check if you can link regularization to eigenvalue decomposition.
  • Questions: “How does Ridge regression change the eigenvalues of \(X^T X\)?” or “Why is this important for stability?”

Matrix Rank and Rank-Deficiency in Linear Regression

🎯 Intuition First:
Matrix rank tells us how much “unique information” is in the features. If features are independent, rank is full. If some features are linear combinations of others, rank drops — meaning your data doesn’t provide enough information to uniquely determine all coefficients.


📐 The Formal Definition:

  • Rank: The number of linearly independent rows or columns in a matrix.
  • Full Rank: A matrix with rank equal to the smaller of its dimensions.
  • Rank-Deficiency: When rank < number of features, meaning some features are redundant.

In regression:

  • \(X\) is the design matrix (\(m \times n\), \(m\) samples, \(n\) features).
  • \(X^T X\) is invertible iff \(X\) has full column rank.

Breaking Down with Linear Regression Context:

  • If rank = \(n\): All features are independent, \(X^T X\) is invertible.
  • If rank < \(n\): At least one feature is a linear combination of others → \(X^T X\) is singular (non-invertible).
  • Example: Two features where \(x_2 = 2 \cdot x_1\). The model cannot distinguish the effect of \(x_1\) vs \(x_2\).

Rank-deficiency means the regression problem has infinitely many solutions — the line is not uniquely determined because the data provides duplicate directions. Using pseudo-inverse (via SVD) or adding regularization (Ridge) resolves this by picking a stable solution.
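
A short sketch (toy data with a deliberately duplicated direction) showing the rank drop and how the pseudo-inverse still returns the minimum-norm solution:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.c_[np.ones(50), x1, 2 * x1]          # third column is exactly 2x the second
y = 4 * x1 + rng.normal(scale=0.1, size=50)

print(np.linalg.matrix_rank(X))             # 2, not 3 -> rank-deficient
# X^T X is singular, so (X^T X)^{-1} does not exist, but the pseudo-inverse
# (computed via SVD) picks the minimum-norm solution among the infinitely many.
theta = np.linalg.pinv(X) @ y
print(theta)                                # the effect of x1 is split across both copies
```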
⚠️
  • Usage: Rank concepts appear when interviewers ask why the Normal Equation might fail.
  • Questions:
    • “When does \( (X^T X)^{-1} \) not exist?”
    • “What happens if two features are perfectly correlated?”
    • “How does Ridge regression fix rank-deficiency?”

Geometric Interpretation of Linear Regression

🎯 Intuition First:
Linear regression isn’t just about fitting numbers — it’s about geometry. The model projects your target vector \(y\) onto the subspace spanned by your features. The prediction is the projection, and the residual is the leftover part orthogonal to that subspace.


📐 The Formal Definition (Projection):

$$ \hat{y} = X \hat{\theta} = P_X y $$

where

$$ P_X = X (X^T X)^{-1} X^T $$


is the projection matrix (also called the “hat matrix”).

Breaking Down the Formula:

  • \( \hat{y} \): Vector of predictions (projection of \(y\) onto the column space of \(X\)).
  • \( X \): Design matrix (features).
  • \( \hat{\theta} \): Coefficients obtained by OLS.
  • \( P_X \): Projection matrix — maps \(y\) into the space spanned by columns of \(X\).
  • Residual vector:
    $$ r = y - \hat{y} $$
    lies orthogonal to every column of \(X\).

Orthogonality Condition:

$$ X^T (y - \hat{y}) = 0 $$

This means the residuals are always perpendicular to the feature space. Equivalently, \(\hat{y}\) is the closest point to \(y\), in Euclidean distance, among all points in the column space of \(X\), which is why the fitted line (or hyperplane) is the best possible fit in the least-squares sense.


Linear regression is essentially “orthogonal projection in high-dimensional space”. You’re not just fitting coefficients — you’re finding the closest point in the feature subspace to the target vector.
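
A numerical sketch (random data) that builds the hat matrix and verifies the projection properties and the orthogonality of the residuals:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(30), rng.normal(size=(30, 2))]
y = rng.normal(size=30)

P = X @ np.linalg.solve(X.T @ X, X.T)        # P_X = X (X^T X)^{-1} X^T
y_hat = P @ y                                # projection of y onto the column space of X
residual = y - y_hat

print(np.allclose(P @ P, P))                 # idempotent: projecting twice changes nothing
print(np.allclose(P, P.T))                   # symmetric: an orthogonal projection
print(np.allclose(X.T @ residual, 0))        # residuals orthogonal to every column of X
```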
⚠️
  • Usage: Geometry often comes up in theory-heavy interviews to test deep intuition.
  • Questions:
    • “Why are residuals orthogonal to the features in linear regression?”
    • “What does the projection matrix \(P_X\) represent?”
    • “How can you interpret OLS as projecting onto a subspace?”

Orthogonality & Projections — Proof Walkthrough

🎯 Intuition First:
When we fit a linear model by minimizing squared error, we are finding the point in the column space of \(X\) (the span of feature vectors) that is closest to the target vector \(y\). The difference between \(y\) and that closest point (the residual) must be perpendicular to every direction in the column space — otherwise we could move a bit along that direction and get closer.


📐 The Goal (What we’ll prove):
Show that the OLS solution \(\hat{\theta}\) satisfies the normal equations and therefore the residual vector \(r = y - \hat{y}\) is orthogonal to every column of \(X\):

$$ X^\top (y - \hat{y}) = 0 \quad\text{where}\quad \hat{y} = X\hat{\theta}. $$

Step-by-step proof (calculus / optimization view)

  1. Write the objective (MSE) in vector form
    For \(m\) samples and parameter vector \(\theta\):

    $$ J(\theta) = \frac{1}{2} \|y - X\theta\|_2^2 = \frac{1}{2} (y - X\theta)^\top (y - X\theta). $$

    The factor \(1/2\) is conventional; it cancels in the derivative, and dropping the \(1/m\) from the MSE does not change the minimizer.

  2. Differentiate the objective w.r.t. \(\theta\)
    Use matrix derivatives. The gradient of \(J(\theta)\) is:

    $$ \nabla_\theta J(\theta) = -X^\top (y - X\theta). $$

    (Derivation: expand the quadratic and use linearity; derivative of \(-\theta^\top X^\top y\) is \(-X^\top y\), derivative of \(\tfrac{1}{2}\theta^\top X^\top X\theta\) is \(X^\top X \theta\).)

  3. Set gradient to zero for minimizer (first-order condition)
    At optimum \(\hat{\theta}\):

    $$ -X^\top (y - X\hat{\theta}) = 0 \quad\Longrightarrow\quad X^\top (y - X\hat{\theta}) = 0. $$

    These are the normal equations.

  4. Rewrite in terms of residual and prediction
    Define residual \(r = y - \hat{y} = y - X\hat{\theta}\). The normal equations become:

    $$ X^\top r = 0. $$

    This states that the residual is orthogonal to every column of \(X\) (because each column of \(X\) dotted with \(r\) is zero).

  5. Geometric interpretation

    • Columns of \(X\) span a subspace \( \mathcal{C}(X) \).
    • \( \hat{y} = X\hat{\theta} \) is the orthogonal projection of \(y\) onto \( \mathcal{C}(X) \).
    • The residual \(r\) lies in the orthogonal complement \( \mathcal{C}(X)^\perp \).
    • The projection matrix (hat matrix) \(P_X\) satisfies \( \hat{y} = P_X y \) with \( P_X = X(X^\top X)^{-1}X^\top \) (when \(X\) has full column rank). One can verify \(P_X^2 = P_X\) and \(P_X^\top = P_X\) (idempotent and symmetric), properties of orthogonal projections.

Alternative proof (linear algebra / least squares normal equations via orthogonality):

  • The least-squares problem asks to find \(\hat{y} \in \mathcal{C}(X)\) minimizing \(\|y - \hat{y}\|_2\).
  • Let \(\hat{y} = X\hat{\theta}\). For any vector \(v\) in \(\mathcal{C}(X)\) (i.e., \(v = Xc\) for some \(c\)), the projection property requires:

    $$ (y - \hat{y})^\top v = 0 \quad\forall v\in\mathcal{C}(X). $$

    Substituting \(v = Xc\):

    $$ (y - X\hat{\theta})^\top X c = 0 \quad\forall c \quad\Longrightarrow\quad X^\top (y - X\hat{\theta}) = 0, $$

    recovering the normal equations. This is the orthogonality condition: the error is orthogonal to all vectors in the subspace.

Breaking Down the Key Symbols (explicit):

  • \(X\): \(m \times n\) design matrix (rows = samples, columns = features).
  • \(\theta\): \(n\)-dimensional parameter (coefficient) vector.
  • \(y\): \(m\)-dimensional target vector.
  • \(\hat{\theta}\): OLS estimator that minimizes \(\frac{1}{2}\|y - X\theta\|_2^2\).
  • \(\hat{y} = X\hat{\theta}\): Predicted/fit vector (a point in \(\mathcal{C}(X)\)).
  • \(r = y - \hat{y}\): Residual vector (lies in \(\mathcal{C}(X)^\perp\)).
  • \(P_X = X(X^\top X)^{-1}X^\top\): Projection (hat) matrix when \(X\) has full column rank.

Orthogonality is simply the statement that at the best-fit point you cannot move in any direction inside the feature space to further reduce the squared error — because the residual has zero component in every such direction. This is the geometric essence of least squares.
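
To make the derivation concrete, here is a small numerical sketch (random data) that checks the analytic gradient from step 2 against finite differences and verifies the normal equations at the least-squares solution:

```python
import numpy as np

def loss(theta, X, y):
    """J(theta) = 0.5 * ||y - X theta||^2, as in step 1 of the proof."""
    r = y - X @ theta
    return 0.5 * r @ r

def grad(theta, X, y):
    """Analytic gradient from step 2: -X^T (y - X theta)."""
    return -X.T @ (y - X @ theta)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
theta = rng.normal(size=3)

# Finite-difference check of the analytic gradient.
eps = 1e-6
num_grad = np.array([(loss(theta + eps * e, X, y) - loss(theta - eps * e, X, y)) / (2 * eps)
                     for e in np.eye(3)])
print(np.allclose(num_grad, grad(theta, X, y), atol=1e-5))    # True

# At the minimizer the gradient vanishes: X^T (y - X theta_hat) = 0.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X.T @ (y - X @ theta_hat), 0, atol=1e-8))   # True
```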
⚠️
  • Common prompts: “Derive the normal equations and explain their geometric meaning.” or “Why are residuals orthogonal to the feature space?”
  • What interviewers look for: A candidate who (a) derives \(X^\top (y - X\hat{\theta}) = 0\) cleanly, (b) explains the geometric projection interpretation, and (c) mentions practical caveats (e.g., what if \(X^\top X\) is singular — use pseudo-inverse or regularization).
  • Follow-ups to expect: “How does this change with regularization?” (answer: Ridge adds \(\lambda I\) → normal equations become \((X^\top X + \lambda I)\hat{\theta} = X^\top y\), projection is no longer orthogonal in the same sense), or “How does SVD compute the same solution robustly?” (answer: use pseudo-inverse via SVD: \(\hat{\theta}=V\Sigma^+U^\top y\)).