Comparing Linear Regression Variants & Alternatives
L1 (Lasso) vs. L2 (Ridge) Regularization
- Choose L1 (Lasso) when: You suspect many features are irrelevant and want a sparse model with automatic feature selection.
- Choose L2 (Ridge) when: You believe features are all somewhat informative and you want to shrink coefficients to improve generalization while keeping them dense.
🅰️ L1 (Lasso)
- Encourage sparsity by penalizing absolute coefficient magnitude; goal is both regularization and variable selection.
🅱️ L2 (Ridge)
- Penalize squared coefficient magnitude to shrink weights smoothly and reduce variance without forcing exact zeros.
🅰️ L1 (Lasso)
- Introduces a non-differentiable kink at zero; optimization (coordinate descent, sub-gradient methods) can push some coefficients exactly to zero.
🅱️ L2 (Ridge)
- Adds a smooth convex quadratic penalty; gradients remain linear in coefficients so closed-form (or standard solvers) work cleanly and no exact zeros are produced.
🅰️ L1 (Lasso)
- Sparsity: Yes — encourages exact zeros because the absolute penalty has corners that align with axes.
- Use Case: High-dimensional problems (p ≫ n) or when model interpretability / feature selection matters.
🅱️ L2 (Ridge)
- Sparsity: No — produces dense solutions where weights are shrunk but typically non-zero.
- Use Case: When multicollinearity exists or you want stable, low-variance estimates and all features are expected to contribute.
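A minimal sketch of the contrast, assuming scikit-learn is available, on a synthetic dataset where only a handful of features carry signal; the alpha values are illustrative, not tuned:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 100 features, only 10 of which are informative.
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: penalizes |theta|, can zero out coefficients
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: penalizes theta^2, only shrinks coefficients

# Lasso typically zeros out most irrelevant coefficients; Ridge keeps them all non-zero.
print("Lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))
print("Ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))
```

Inspecting the non-zero counts makes the sparsity difference concrete: the L1 fit doubles as feature selection, while the L2 fit stays dense.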
Normal Equation (Closed-Form) vs Gradient Descent (Iterative)
- Choose Normal Equation when: Feature count is small-to-moderate and you prefer an exact solution with minimal tuning.
- Choose Gradient Descent when: You have many features, lots of data, or need iterative/online updates and memory/performance constraints prevent matrix inversion.
🅰️ Normal Equation
- Compute the analytic minimizer for ordinary least squares; returns exact solution (within numerical precision) in one computation.
🅱️ Gradient Descent
- Iteratively search parameter space with updates following negative gradient to converge to a (local/global) minimum.
🅰️ Normal Equation
- Forms and inverts $X^T X$; no hyperparameters like learning rate but vulnerable to singular/ill-conditioned matrices.
🅱️ Gradient Descent
- Requires learning rate (and possibly momentum, scheduling); scales to large datasets via mini-batching and stochastic variants.
🅰️ Normal Equation
- Sparsity: Not applicable — provides dense solution unless paired with explicit regularization that yields sparsity.
- Use Case: Small-to-medium feature counts (up to a few thousand features, depending on memory); quick one-shot solution.
🅱️ Gradient Descent
- Sparsity: Not inherently sparse (unless combined with L1 penalty).
- Use Case: Very large numbers of features or samples, streaming data, or when you need to incorporate complex objectives and constraints incrementally.
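A minimal NumPy sketch of both routes on the same small problem, with the normal equation $\theta = (X^\top X)^{-1} X^\top y$ solved as a linear system rather than via an explicit inverse; the learning rate and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=500)

# Normal equation: solve (X^T X) theta = X^T y directly.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the same least-squares objective.
theta_gd = np.zeros(3)
alpha, m = 0.1, len(y)
for _ in range(1000):
    grad = X.T @ (X @ theta_gd - y) / m
    theta_gd -= alpha * grad

print(theta_closed)  # exact minimizer (up to numerical precision)
print(theta_gd)      # converges to nearly the same values after enough iterations
```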
Batch Gradient Descent vs Stochastic / Mini-batch Gradient Descent
- Choose Batch GD when: You can afford full-batch computations and prefer stable, smooth convergence for small datasets.
- Choose Stochastic/Mini-batch GD when: You need scalability, faster per-update progress, or better ability to escape shallow local plateaus.
🅰️ Batch Gradient Descent
- Use the entire dataset each update to compute the exact gradient direction for stable descent.
🅱️ Stochastic / Mini-batch GD
- Use one sample (stochastic) or a small random subset (mini-batch) per update to get noisy but frequent gradient signals, often improving wall-clock speed and sometimes generalization.
🅰️ Batch Gradient Descent
- Produces low-variance, deterministic updates but each step is expensive: requires full pass over data.
🅱️ Stochastic / Mini-batch GD
- Noisy updates can help jump out of shallow basins and often reach good solutions faster in practice; requires learning-rate tuning and may need averaging / momentum.
🅰️ Batch Gradient Descent
- Sparsity: Not affected by batch choice.
- Use Case: Small datasets where exact gradient per step is affordable and deterministic convergence is preferred.
🅱️ Stochastic / Mini-batch GD
- Sparsity: Not affected directly.
- Use Case: Large-scale learning, online learning, or when you want faster iterations and to leverage hardware (GPU/TPU) via vectorized mini-batches.
🅰️ Batch Gradient Descent
- Core Update (batch): $\theta \leftarrow \theta - \alpha \cdot \frac{1}{m} X^\top (X\theta - y)$
🅱️ Stochastic / Mini-batch GD
- Core Update (stochastic, for sample i): $\theta \leftarrow \theta - \alpha \cdot x^{(i)} (x^{(i)\top}\theta - y^{(i)})$
- Core Update (mini-batch, size b): average the same gradient over the b samples in the batch.
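A minimal sketch of both update rules, assuming synthetic data; the epoch count, batch size, and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)

def batch_gd(X, y, alpha=0.1, epochs=100):
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(epochs):
        # Exact gradient over the full dataset: one expensive, low-variance step per epoch.
        theta -= alpha * X.T @ (X @ theta - y) / m
    return theta

def minibatch_gd(X, y, alpha=0.1, epochs=100, batch_size=32):
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(epochs):
        idx = rng.permutation(m)
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]
            # Noisy gradient from a small random subset: many cheap steps per epoch.
            theta -= alpha * X[b].T @ (X[b] @ theta - y[b]) / len(b)
    return theta

print(batch_gd(X, y))
print(minibatch_gd(X, y))
```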
Ordinary Least Squares (OLS) vs Robust Regression (Huber / RANSAC)
- Choose OLS when: Errors are roughly Gaussian and outliers are rare — OLS is efficient and unbiased under classic assumptions.
- Choose Robust methods when: The dataset contains outliers or heavy-tailed noise that would unduly skew the OLS estimator.
🅰️ OLS (Ordinary Least Squares)
- Minimize squared residuals to get the best linear unbiased estimator under Gauss–Markov assumptions.
🅱️ Robust Regression (Huber / RANSAC)
- Aim to reduce influence of outliers by using alternative loss functions (Huber) or consensus-based fits (RANSAC).
🅰️ OLS
- Penalizes squared errors; large residuals have quadratic influence, so outliers can dominate parameter estimates.
🅱️ Robust Regression
- Huber: quadratic for small residuals and linear for large ones (smoothly reduces outlier influence).
- RANSAC: fits repeatedly on random subsets and selects the model with the most inliers (resilient to many outliers).
🅰️ OLS
- Sparsity: No. OLS produces dense parameter estimates.
- Use Case: Clean datasets that satisfy linear model assumptions (homoscedasticity, no influential outliers).
🅱️ Robust Regression
- Sparsity: No (unless combined with L1).
- Use Case: Datasets with measurement errors, label noise, or adversarial/outlier points where OLS is unreliable.
🅰️ OLS
- Objective (MSE): $\min_\theta \; \frac{1}{m}\sum_{i=1}^m (y^{(i)} - x^{(i)\top}\theta)^2$
🅱️ Robust Regression
- Huber Loss (quadratic for small residuals $r = y^{(i)} - x^{(i)\top}\theta$, linear for large ones):
$$L_\delta(r) = \begin{cases} \frac{1}{2} r^2 & |r| \le \delta \\ \delta \left(|r| - \frac{1}{2}\delta\right) & |r| > \delta \end{cases}$$
- RANSAC (conceptual): Repeatedly fit model to random subsets and maximize inlier count (no simple closed-form loss).
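A minimal sketch, assuming scikit-learn, that contaminates a clean linear relationship with outliers and compares the three fits; the outlier fraction and Huber epsilon are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor, RANSACRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = 4.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=200)
y[:20] += 30.0  # inject a block of large positive outliers

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(epsilon=1.35).fit(X, y)
ransac = RANSACRegressor(random_state=0).fit(X, y)

# OLS is pulled toward the outliers; Huber and RANSAC stay close to the true slope of 4.
print("OLS slope:   ", ols.coef_[0])
print("Huber slope: ", huber.coef_[0])
print("RANSAC slope:", ransac.estimator_.coef_[0])
```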
Polynomial Feature Expansion vs Kernel Methods
- Choose Polynomial Expansion when: Dimensionality after expansion is manageable and you want explicit feature representations for inspection and downstream use.
- Choose Kernel Methods when: The feature space is implicitly huge (or infinite) and you want non-linear expressiveness without constructing explicit high-dimensional features.
🅰️ Polynomial Feature Expansion
- Build explicit higher-order features (e.g., $x, x^2, x^3, x_i x_j$) to allow a linear model to fit non-linear relationships.
🅱️ Kernel Methods
- Use a kernel function to compute inner products in an implicit feature space, enabling non-linear decision surfaces without explicit expansion.
🅰️ Polynomial Feature Expansion
- Changes the parameterization: you solve linear regression on the enlarged feature set; complexity grows combinatorially with the number of terms (degree and interactions).
🅱️ Kernel Methods
- Works through the kernel matrix $K$ (size m×m): training requires solving systems involving $K$, which gives high expressiveness at a computational cost tied to the number of samples rather than the explicit feature dimension.
🅰️ Polynomial Feature Expansion
- Sparsity: Not inherent; feature counts explode and may cause multicollinearity — use regularization to combat overfitting.
- Use Case: Low-dimensional inputs where degree can be kept small and explicit transformed features are useful for interpretation.
🅱️ Kernel Methods
- Sparsity: Not directly; can combine with sparse kernel approximations (e.g., Nyström, random features).
- Use Case: When implicit high-dimensional mappings are needed (e.g., smooth non-linearities) and sample count m is moderate.
🅰️ Polynomial Feature Expansion
- Example (degree d) features: build $[x, x^2, \dots, x^d]$ and interactions; the objective remains $\min_\theta \; \frac{1}{m}\sum_{i=1}^m (y^{(i)} - \phi(x^{(i)})^\top \theta)^2$ where $\phi(x)$ is the explicit polynomial basis.
🅱️ Kernel Methods
- Kernel ridge (dual) objective (example): $\min_\alpha \; \frac{1}{m}\|y - K\alpha\|^2 + \lambda\, \alpha^\top K \alpha$ with predictions $\hat{y}(x) = k(x)^\top \alpha$, where $k(x)$ is the vector of kernel evaluations against the training points.
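A minimal sketch, assuming scikit-learn, fitting the same non-linear 1-D target both ways; the degree, kernel, and regularization strengths are illustrative:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Explicit expansion: build [x, x^2, x^3] and fit a regularized linear model on it.
poly_model = make_pipeline(PolynomialFeatures(degree=3), Ridge(alpha=1e-3)).fit(X, y)

# Implicit expansion: work with the m x m kernel matrix instead of explicit features.
kernel_model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.5).fit(X, y)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(poly_model.predict(X_test))
print(kernel_model.predict(X_test))
```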
Bias–Variance Trade-off vs Regularization
- Think of Bias–Variance as: the conceptual diagnostic describing underfitting vs overfitting.
- Think of Regularization as: a practical control knob to shift that trade-off toward lower variance (at the cost of added bias).
🅰️ Bias–Variance Trade-off
- Explain the decomposition of expected error into bias² (error from wrong model assumptions), variance (sensitivity to training set noise), and irreducible noise.
🅱️ Regularization
- Use penalty terms to intentionally increase bias slightly while reducing variance, often improving generalization on unseen data.
🅰️ Bias–Variance Trade-off
- High-capacity models → low bias, high variance; low-capacity models → high bias, low variance. Goal: find sweet spot (model complexity + data).
🅱️ Regularization
- Mechanism: shrink coefficients (L2) or select features (L1) to reduce model complexity effectively, thus decreasing variance.
🅰️ Bias–Variance Trade-off
- Sparsity: Conceptual trade-off doesn’t directly impose sparsity.
- Use Case: Diagnostic framework to decide model complexity, data collection, and validation strategies.
🅱️ Regularization
- Sparsity: L1 can induce sparsity; L2 does not.
- Use Case: Regularization is the practical tool applied when variance (overfitting) is observed on validation data.
🅰️ Bias–Variance Trade-off
- Decomposition (expected squared error): $\mathbb{E}[(y - \hat{f}(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2$ where $\sigma^2$ is irreducible noise.
🅱️ Regularization
- Example (Ridge-regularized objective): $\min_\theta \; \frac{1}{m}\sum_{i=1}^m (y^{(i)} - x^{(i)\top}\theta)^2 + \lambda \|\theta\|_2^2$. The regularization strength $\lambda$ trades bias for variance in a controllable way.
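A minimal sketch, assuming scikit-learn, that sweeps the ridge penalty $\lambda$ (called alpha in the API) and watches training vs. validation error shift along the bias–variance trade-off; the alpha grid and dataset sizes are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=100, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    # Small alpha -> low bias, high variance; large alpha -> higher bias, lower variance.
    print(f"alpha={alpha:7.2f}  "
          f"train MSE={mean_squared_error(y_tr, model.predict(X_tr)):8.2f}  "
          f"val MSE={mean_squared_error(y_val, model.predict(X_val)):8.2f}")
```

Picking the alpha with the lowest validation error is exactly the practical "control knob" move this section describes.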