Regularization (Ridge, Lasso, ElasticNet): Linear Regression
🎯 Core Idea
Regularization adds a penalty to the loss to control model complexity, prevent overfitting, and (depending on the penalty) shrink coefficients or force them to zero.
- Ridge (L2) penalizes the squared magnitude of coefficients → shrinks coefficients smoothly.
- Lasso (L1) penalizes the absolute magnitude → can produce exact zeros (sparse models).
- ElasticNet mixes L1 and L2 to combine sparsity with stability for correlated features.
🌱 Intuition & Real-World Analogy
- Think of fitting a model as balancing two forces: fit-to-data vs. model simplicity. Regularization is the "simplicity" spring pulling coefficients toward zero.
- Analogy 1 - Packing light for travel: Without constraints you pack everything (overfit). Ridge says "pack lighter (reduce the size of many items)". Lasso says "only take a few essential items; throw out some completely" (sparsity). ElasticNet: "throw out some items and compress the rest" (sparsity + shrinkage).
- Analogy 2 - Gardening with a fence: The fence (penalty) limits how far plants (coefficients) can grow; L2 lowers all plant heights, L1 removes weaker plants completely.
📐 Mathematical Foundation
Problem setup (ordinary least squares baseline)
Given data $X \in \mathbb{R}^{n\times p}$, response $y\in\mathbb{R}^n$, and coefficients $\beta\in\mathbb{R}^p$, OLS minimizes:
$$ \mathcal{L}_{\text{OLS}}(\beta)=\frac{1}{2n}\|y-X\beta\|_2^2. $$
(The scaling factor $1/(2n)$ is convenient for derivatives; some texts use $1/2$ or no factor.)
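A minimal NumPy sketch of this OLS baseline, assuming small synthetic data (the sizes, coefficients, and noise level are illustrative choices, not from the text); the later sketches reuse the same pattern:

```python
import numpy as np

# Synthetic data for illustration only (assumed, not from the text).
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y = X @ beta_true + 0.5 * rng.normal(size=n)

# OLS: minimize ||y - X beta||^2 (the 1/(2n) factor does not change the minimizer).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ols)
```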
Ridge (L2)
Penalty: $\lambda\|\beta\|_2^2 = \lambda\sum_{j=1}^p \beta_j^2$.
Objective:
$$ \boxed{\;\mathcal{L}_{\text{Ridge}}(\beta) = \frac{1}{2n}\|y-X\beta\|_2^2 + \lambda \|\beta\|_2^2\;} $$
- Closed-form solution:
$$ \hat\beta^{\text{Ridge}} = \left(X^\top X + 2n\lambda I\right)^{-1} X^\top y. $$
(With the $1/(2n)$ loss and penalty $\lambda\|\beta\|_2^2$, the $2n$ appears in the formula; different conventions shift these constants.)
- Bayesian view: the ridge estimate is the posterior mode (MAP) under a Gaussian prior $\beta_j \sim \mathcal{N}(0,\sigma^2_\beta)$.
Terms explained
- $\lambda\ge0$: regularization strength (hyperparameter). Larger $\lambda$ → more shrinkage.
- $I$: identity matrix; adds stability to invert $X^\top X$.
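A small sketch of the ridge closed form above, checked against scikit-learn's `Ridge`. Note that scikit-learn minimizes $\|y-X\beta\|_2^2 + \alpha\|\beta\|_2^2$ (no $1/(2n)$ factor), so its `alpha` corresponds to $2n\lambda$ under the convention used here; the synthetic data is an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + 0.5 * rng.normal(size=n)

lam = 0.1  # lambda in the objective above

# Closed form for (1/(2n))||y - X b||^2 + lam * ||b||_2^2: note the 2n next to lam.
beta_ridge = np.linalg.solve(X.T @ X + 2 * n * lam * np.eye(p), X.T @ y)

# scikit-learn's Ridge minimizes ||y - X b||^2 + alpha * ||b||_2^2, so alpha = 2 * n * lam here.
skl = Ridge(alpha=2 * n * lam, fit_intercept=False).fit(X, y)
print(np.allclose(beta_ridge, skl.coef_, atol=1e-6))  # expected: True
```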
Lasso (L1)
Penalty: $\lambda\|\beta\|_1 = \lambda\sum_{j=1}^p |\beta_j|$.
Objective:
$$ \boxed{\;\mathcal{L}_{\text{Lasso}}(\beta) = \frac{1}{2n}\|y-X\beta\|_2^2 + \lambda \|\beta\|_1\;} $$
- No closed form in general. Solutions are often computed by coordinate descent, LARS, or proximal methods.
- Bayesian view: the Lasso estimate is the posterior mode (MAP) under a Laplace (double-exponential) prior on the coefficients.
Why L1 gives sparsity (intuition)
The L1 penalty has a kink at zero (it is nondifferentiable there), so the optimization "prefers" to set small coefficients exactly to zero, reducing the penalty without much increase in the data loss.
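To see the sparsity mechanism in practice, here is a short scikit-learn sketch on synthetic data where only a few features truly matter (the data and the choice $\lambda = 0.1$ are illustrative assumptions); `Lasso` uses the same $1/(2n)$-scaled objective as above, with `alpha` in the role of $\lambda$:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only 3 with nonzero true coefficients.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.normal(size=n)

# scikit-learn's Lasso minimizes (1/(2n))||y - X b||^2 + alpha * ||b||_1.
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
# Many of the irrelevant coefficients are set exactly to zero.
print("nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))
```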
ElasticNet
Mix of L1 and L2:
$$ \boxed{\;\mathcal{L}_{\text{EN}}(\beta) = \frac{1}{2n}\|y-X\beta\|_2^2 + \lambda\left(\alpha\|\beta\|_1 + \tfrac{1-\alpha}{2}\|\beta\|_2^2\right)\;} $$
- $\lambda\ge0$: overall penalty strength.
- $\alpha\in[0,1]$: mixing parameter. $\alpha=1$ → Lasso; $\alpha=0$ → Ridge.
ElasticNet combines sparsity (L1) with grouping/stability (L2).
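A brief sketch of this grouping behavior on two nearly identical features, assuming scikit-learn's parameterization, where `alpha` plays the role of $\lambda$ and `l1_ratio` the role of $\alpha$ (the correlated synthetic data and penalty values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Two almost perfectly correlated features carrying the same signal.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)
y = 3.0 * X[:, 0] + 0.5 * rng.normal(size=n)

# scikit-learn's ElasticNet objective:
#   (1/(2n))||y - X b||^2 + alpha * l1_ratio * ||b||_1 + 0.5 * alpha * (1 - l1_ratio) * ||b||_2^2
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y)
print(enet.coef_[:2])  # the L2 component tends to make the correlated pair share the weight
```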
Useful alternate formulation (constrained view)
Regularized problem $\Leftrightarrow$ constrained problem:
- Ridge: minimize $\|y-X\beta\|_2^2$ s.t. $\|\beta\|_2^2 \le t$.
- Lasso: minimize $\|y-X\beta\|_2^2$ s.t. $\|\beta\|_1 \le t$.
This helps visualize the feasible sets: the L2 ball vs. the L1 diamond. The intersection of the objective contours with these sets explains sparsity (the diamond's corners) vs. smooth shrinkage.
Derivation (brief): Lasso soft-thresholding in the orthonormal case
If the columns of $X$ are orthogonal with $X^\top X = n I$ (the standardized "orthonormal design" case), the Lasso decouples coordinate-wise:
$$ \hat\beta_j^{\text{Lasso}} = S\!\left(\frac{X_j^\top y}{n}, \lambda\right), $$
where $S(z,\lambda)=\mathrm{sign}(z)\,\max(|z|-\lambda,0)$ is the soft-thresholding operator. This exhibits the sparsity mechanism directly. (For full derivations, see Tibshirani 1996 and Hastie, Tibshirani & Wainwright 2015.)
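A quick numerical check of this soft-thresholding formula against coordinate descent (scikit-learn's `Lasso`), constructing a design with $X^\top X = nI$ from a QR factorization; the data and $\lambda$ are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0), applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Build X with orthogonal columns scaled so that X.T @ X = n * I.
rng = np.random.default_rng(0)
n, p = 200, 5
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = np.sqrt(n) * Q
y = X @ np.array([2.0, -1.0, 0.05, 0.0, 0.5]) + 0.1 * rng.normal(size=n)

lam = 0.1
beta_closed = soft_threshold(X.T @ y / n, lam)                    # closed form in the orthonormal case
beta_cd = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_   # coordinate descent
print(np.allclose(beta_closed, beta_cd, atol=1e-4))               # expected: True
```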
⚖️ Strengths, Limitations & Trade-offs
Ridge (L2)
Strengths
- Stabilizes estimates when $X^\top X$ is ill-conditioned (multicollinearity).
- Keeps all features but shrinks coefficients → good when many small effects exist.
- Closed-form solution (fast linear algebra).
Limitations
- Does not produce sparse models → hard to interpret when $p$ is large.
- Shrinks correlated predictors together but doesn't perform selection.
When preferred
- $p$ moderate, predictors heavily correlated, or we believe many features have small but nonzero effect.
Lasso (L1)
Strengths
- Produces sparse solutions (feature selection) → interpretable and good for high-dimensional data.
- Useful when true model is sparse (only few predictors matter).
Limitations
- If predictors are highly correlated, Lasso tends to arbitrarily select one variable from the group, making selection unstable.
- Bias: L1 shrinks large coefficients too, which may hurt predictive accuracy vs. unbiased methods in some settings.
- Computationally more expensive (no simple closed form for general $X$).
When preferred
- High $p$ with many irrelevant features, and you want automatic feature selection.
ElasticNet
Strengths
- Retains sparsity while encouraging grouped selection when predictors correlate (L2 component stabilizes selection).
- Often better predictive performance than pure Lasso when correlated features exist.
Limitations
- Two hyperparameters ($\lambda,\alpha$) → more tuning.
- Slightly more complex interpretation.
When preferred
- Many features, groups of correlated predictors, and you want sparse + stable solutions.
🔁 Variants & Extensions
- Group Lasso: penalizes the L2 norm of each predefined group of coefficients (acting like L1 across groups), so whole groups are selected or deselected together. Useful for grouped features.
- Sparse Group Lasso: Combines group-wise and within-group sparsity.
- SCAD / MCP (non-convex penalties): Reduce bias for large coefficients while encouraging sparsity.
- Adaptive Lasso: Weigh penalties to reduce bias on large coefficients; enjoys oracle properties under certain conditions.
- Fused Lasso: L1 penalty on the coefficients and on differences between adjacent coefficients; useful for ordered features.
- Bayesian counterparts: Spike-and-slab, horseshoe priors for sparsity with probabilistic uncertainty estimates.
- Regularized Generalized Linear Models: the same penalties apply to logistic, Poisson, and Cox models; replace the squared loss with the appropriate negative log-likelihood.
- Path algorithms: LARS (Least Angle Regression) for Lasso path; coordinate descent for large problems.
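As a sketch of the path idea, scikit-learn's `lasso_path` runs coordinate descent with warm starts over a decreasing grid of penalty values (the synthetic data is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import lasso_path

# Synthetic data: 10 features, 2 truly active.
rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = X @ np.r_[3.0, -2.0, np.zeros(p - 2)] + 0.5 * rng.normal(size=n)

# Coordinate descent with warm starts over a decreasing grid of lambdas ("alphas" in scikit-learn).
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
print(coefs.shape)               # (n_features, n_alphas)
print((coefs != 0).sum(axis=0))  # active-set size at each lambda; it generally grows as lambda shrinks
```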
🧠 Common Challenges & Pitfalls
- Feature scaling: Always standardize predictors (zero mean, unit variance) before applying L1/L2 penalties (except the intercept). Penalties depend on scale; unscaled features lead to misleading shrinkage/selection (see the pipeline sketch after this list).
- Intercept handling: Do not penalize the intercept.
- Correlated features:
  - Lasso may arbitrarily pick one variable from a correlated set, giving unstable selection.
  - ElasticNet or Group Lasso is better when group behavior is desired.
- Choice of $\lambda$: Critical. Use cross-validation (CV) or information criteria, and be aware of the trade-off between the "one-SE rule" and CV-min choices for stability.
- Model interpretability vs. predictive accuracy: Lasso yields simpler models but can induce bias; Ridge reduces variance but keeps complexity; ElasticNet trades off both.
- High-dimensional regime ($p \gg n$): Lasso can select at most $n$ variables (in the classical LARS sense). Use caution and consider specialized methods.
- Hyperparameter grid & computation: Grid over $\lambda$ (log scale) and $\alpha$ (for ElasticNet). Use warm starts for efficiency.
- Non-convex penalties: Can reduce bias but complicate optimization (local minima). Use only with careful justification.
- Interpretation of coefficients under penalty: Penalized estimates are biased; compare to an unpenalized refit on the selected variables if unbiased estimates are required (with caveats about selection bias).
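Tying the scaling and tuning pitfalls together, here is a minimal sketch that standardizes inside a pipeline (so scaling is learned on training folds only) and cross-validates over a log-spaced $\lambda$ grid and a small set of mixing values; the synthetic data and grids are illustrative assumptions:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNetCV

# Synthetic data with deliberately mixed feature scales.
rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.normal(size=(n, p)) * rng.uniform(0.1, 10.0, size=p)
y = 3.0 * (X[:, 0] / X[:, 0].std()) + 0.5 * rng.normal(size=n)

# Standardize inside the pipeline, then let ElasticNetCV search its internally
# log-spaced alpha (lambda) grid and the listed l1_ratio (mixing) values by 5-fold CV.
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 1.0], n_alphas=100, cv=5),
)
model.fit(X, y)
enet = model[-1]
print("chosen l1_ratio:", enet.l1_ratio_, "chosen alpha (lambda):", enet.alpha_)
```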
❓ Answer to the Probing Question
Q: If you have thousands of sparse features, which penalty would you prefer and why?
A (concise): Prefer L1 (Lasso) because it encourages sparsity: it will set many coefficients exactly to zero, automatically performing feature selection and producing a compact model that is computationally and statistically attractive when the true signal is sparse. Caveat: if features are heavily correlated or you want group stability, prefer ElasticNet (L1 + L2) to get sparsity plus better behavior on correlated groups.
📚 Reference Pointers
- Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B. (Foundational Lasso paper; see also the related LARS paper: https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
- Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B. https://statweb.stanford.edu/~tibs/ftp/elasticnet.pdf
- Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. (book) https://web.stanford.edu/~hastie/StatLearnSparsity/
- Koller, D. & Friedman, N. (2009). Probabilistic Graphical Models. (for Bayesian views and priors). https://mitpress.mit.edu/9780262013192/probabilistic-graphical-models
- Wikipedia: Ridge regression / Lasso / Elastic Net (for quick refresh): https://en.wikipedia.org/wiki/Ridge_regression https://en.wikipedia.org/wiki/Lasso_(statistics) https://en.wikipedia.org/wiki/Elastic_net_regularization