Regularization (Ridge, Lasso, ElasticNet): Linear Regression
🎯 Core Idea
Regularization adds a penalty to the loss to control model complexity, prevent overfitting, and (depending on the penalty) shrink coefficients or force them to zero.
- Ridge (L2) penalizes the squared magnitude of coefficients → shrinks coefficients smoothly.
- Lasso (L1) penalizes the absolute magnitude → can produce exact zeros (sparse models).
- ElasticNet mixes L1 and L2 to combine sparsity with stability for correlated features.
🌱 Intuition & Real-World Analogy
- Think of fitting a model as balancing two forces: fit-to-data vs. model simplicity. Regularization is the "simplicity" spring pulling coefficients toward zero.
- Analogy 1 - Packing light for travel: Without constraints you pack everything (overfit). Ridge says "pack lighter (reduce the size of many items)". Lasso says "only take a few essential items; throw out some completely" (sparsity). ElasticNet: "throw out some items and compress the rest" (sparsity + shrinkage).
- Analogy 2 - Gardening with a fence: The fence (penalty) limits how far plants (coefficients) can grow; L2 lowers all plant heights, L1 removes weaker plants completely.
📐 Mathematical Foundation
Problem setup (ordinary least squares baseline)
Given data $X \in \mathbb{R}^{n\times p}$, response $y\in\mathbb{R}^n$, and coefficients $\beta\in\mathbb{R}^p$, OLS minimizes:
$$ \mathcal{L}_{\text{OLS}}(\beta)=\frac{1}{2n}\|y-X\beta\|_2^2. $$
(The scaling factor $1/(2n)$ is convenient for derivatives; some texts use $1/2$ or no factor.)
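A minimal NumPy sketch of this OLS baseline, assuming small synthetic data (the sizes, coefficients, and noise level are illustrative choices, not from the text); the later sketches reuse the same pattern:

```python
import numpy as np

# Synthetic data for illustration only (assumed, not from the text).
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y = X @ beta_true + 0.5 * rng.normal(size=n)

# OLS: minimize ||y - X beta||^2 (the 1/(2n) factor does not change the minimizer).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ols)
```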
Ridge (L2)
Penalty: $\lambda\|\beta\|_2^2 = \lambda\sum_{j=1}^p \beta_j^2$.
Objective:
$$ \boxed{\;\mathcal{L}_{\text{Ridge}}(\beta) = \frac{1}{2n}\|y-X\beta\|_2^2 + \lambda \|\beta\|_2^2\;} $$
- Closed-form solution:
$$ \hat\beta^{\text{Ridge}} = \left(X^\top X + 2n\lambda I\right)^{-1} X^\top y. $$
(With the $1/(2n)$ loss and penalty $\lambda\|\beta\|_2^2$, the $2n$ appears in the formula; different conventions shift these constants.)
- Bayesian view: the ridge estimate is the posterior mode (MAP) under a Gaussian prior $\beta_j \sim \mathcal{N}(0,\sigma^2_\beta)$.
Terms explained
- $\lambda\ge0$: regularization strength (hyperparameter). Larger $\lambda$ → more shrinkage.
- $I$: identity matrix; adds stability to invert $X^\top X$.
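A small sketch of the ridge closed form above, checked against scikit-learn's `Ridge`. Note that scikit-learn minimizes $\|y-X\beta\|_2^2 + \alpha\|\beta\|_2^2$ (no $1/(2n)$ factor), so its `alpha` corresponds to $2n\lambda$ under the convention used here; the synthetic data is an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + 0.5 * rng.normal(size=n)

lam = 0.1  # lambda in the objective above

# Closed form for (1/(2n))||y - X b||^2 + lam * ||b||_2^2: note the 2n next to lam.
beta_ridge = np.linalg.solve(X.T @ X + 2 * n * lam * np.eye(p), X.T @ y)

# scikit-learn's Ridge minimizes ||y - X b||^2 + alpha * ||b||_2^2, so alpha = 2 * n * lam here.
skl = Ridge(alpha=2 * n * lam, fit_intercept=False).fit(X, y)
print(np.allclose(beta_ridge, skl.coef_, atol=1e-6))  # expected: True
```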
Lasso (L1)
Penalty: $\lambda\|\beta\|_1 = \lambda\sum_{j=1}^p |\beta_j|$.
Objective:
$$ \boxed{\;\mathcal{L}_{\text{Lasso}}(\beta) = \frac{1}{2n}\|y-X\beta\|_2^2 + \lambda \|\beta\|_1\;} $$
- No closed form in general. Solutions are often computed by coordinate descent, LARS, or proximal methods.
- Bayesian view: the Lasso estimate is the posterior mode (MAP) under a Laplace (double-exponential) prior on the coefficients.
Why L1 gives sparsity (intuition)
The L1 penalty has a kink at zero (it is nondifferentiable there), so the optimization "prefers" to set small coefficients exactly to zero, reducing the penalty without much increase in the data loss.
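To see the sparsity mechanism in practice, here is a short scikit-learn sketch on synthetic data where only a few features truly matter (the data and the choice $\lambda = 0.1$ are illustrative assumptions); `Lasso` uses the same $1/(2n)$-scaled objective as above, with `alpha` in the role of $\lambda$:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only 3 with nonzero true coefficients.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.normal(size=n)

# scikit-learn's Lasso minimizes (1/(2n))||y - X b||^2 + alpha * ||b||_1.
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
# Many of the irrelevant coefficients are set exactly to zero.
print("nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))
```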
ElasticNet
Mix of L1 and L2:
$$ \boxed{\;\mathcal{L}_{\text{EN}}(\beta) = \frac{1}{2n}\|y-X\beta\|_2^2 + \lambda\left(\alpha\|\beta\|_1 + \tfrac{1-\alpha}{2}\|\beta\|_2^2\right)\;} $$
- $\lambda\ge0$: overall penalty strength.
- $\alpha\in[0,1]$: mixing parameter. $\alpha=1$ → Lasso; $\alpha=0$ → Ridge.
ElasticNet combines sparsity (L1) with grouping/stability (L2).
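A brief sketch of this grouping behavior on two nearly identical features, assuming scikit-learn's parameterization, where `alpha` plays the role of $\lambda$ and `l1_ratio` the role of $\alpha$ (the correlated synthetic data and penalty values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Two almost perfectly correlated features carrying the same signal.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)
y = 3.0 * X[:, 0] + 0.5 * rng.normal(size=n)

# scikit-learn's ElasticNet objective:
#   (1/(2n))||y - X b||^2 + alpha * l1_ratio * ||b||_1 + 0.5 * alpha * (1 - l1_ratio) * ||b||_2^2
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y)
print(enet.coef_[:2])  # the L2 component tends to make the correlated pair share the weight
```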
Useful alternate formulation (constrained view)
Regularized problem $\Leftrightarrow$ constrained problem:
- Ridge: minimize $\|y-X\beta\|_2^2$ s.t. $\|\beta\|_2^2 \le t$.
- Lasso: minimize $\|y-X\beta\|_2^2$ s.t. $\|\beta\|_1 \le t$.
This helps visualize the feasible sets: the L2 ball vs. the L1 diamond. The intersection of the objective contours with these sets explains sparsity (the diamond's corners) vs. smooth shrinkage.
Derivation (brief): Lasso soft-thresholding in the orthonormal case
If the columns of $X$ are orthogonal with $X^\top X = n I$ (the standardized "orthonormal design" case), the Lasso decouples coordinate-wise:
$$ \hat\beta_j^{\text{Lasso}} = S\!\left(\frac{X_j^\top y}{n}, \lambda\right), $$
where $S(z,\lambda)=\mathrm{sign}(z)\,\max(|z|-\lambda,0)$ is the soft-thresholding operator. This exhibits the sparsity mechanism directly. (For full derivations, see Tibshirani 1996 and Hastie, Tibshirani & Wainwright 2015.)
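A quick numerical check of this soft-thresholding formula against coordinate descent (scikit-learn's `Lasso`), constructing a design with $X^\top X = nI$ from a QR factorization; the data and $\lambda$ are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0), applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Build X with orthogonal columns scaled so that X.T @ X = n * I.
rng = np.random.default_rng(0)
n, p = 200, 5
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = np.sqrt(n) * Q
y = X @ np.array([2.0, -1.0, 0.05, 0.0, 0.5]) + 0.1 * rng.normal(size=n)

lam = 0.1
beta_closed = soft_threshold(X.T @ y / n, lam)                    # closed form in the orthonormal case
beta_cd = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_   # coordinate descent
print(np.allclose(beta_closed, beta_cd, atol=1e-4))               # expected: True
```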
⚖️ Strengths, Limitations & Trade-offs
Ridge (L2)
Strengths
- Stabilizes estimates when $X^\top X$ is ill-conditioned (multicollinearity).
- Keeps all features but shrinks coefficients → good when many small effects exist.
- Closed-form solution (fast linear algebra).
Limitations
- Does not produce sparse models → hard to interpret when $p$ is large.
- Shrinks correlated predictors together but doesn't perform selection.
When preferred
- $p$ moderate, predictors heavily correlated, or we believe many features have small but nonzero effect.
Lasso (L1)
Strengths
- Produces sparse solutions (feature selection) → interpretable and good for high-dimensional data.
- Useful when true model is sparse (only few predictors matter).
Limitations
- If predictors are highly correlated, Lasso tends to arbitrarily select one variable from the group, making selection unstable.
- Bias: L1 shrinks large coefficients too, which may hurt predictive accuracy vs. unbiased methods in some settings.
- Computationally more expensive (no simple closed form for general $X$).
When preferred
- High $p$ with many irrelevant features, and you want automatic feature selection.
ElasticNet
Strengths
- Retains sparsity while encouraging grouped selection when predictors correlate (L2 component stabilizes selection).
- Often better predictive performance than pure Lasso when correlated features exist.
Limitations
- Two hyperparameters ($\lambda,\alpha$) → more tuning.
- Slightly more complex interpretation.
When preferred
- Many features, groups of correlated predictors, and you want sparse + stable solutions.
🔁 Variants & Extensions
- Group Lasso: penalizes the L2 norm of each predefined group of coefficients (acting like L1 across groups), so whole groups are selected or deselected together. Useful for grouped features.
- Sparse Group Lasso: Combines group-wise and within-group sparsity.
- SCAD / MCP (non-convex penalties): Reduce bias for large coefficients while encouraging sparsity.
- Adaptive Lasso: Weigh penalties to reduce bias on large coefficients; enjoys oracle properties under certain conditions.
- Fused Lasso: L1 penalty on the coefficients and on differences between adjacent coefficients; useful for ordered features.
- Bayesian counterparts: Spike-and-slab, horseshoe priors for sparsity with probabilistic uncertainty estimates.
- Regularized Generalized Linear Models: the same penalties apply to logistic, Poisson, and Cox models; replace the squared loss with the appropriate negative log-likelihood.
- Path algorithms: LARS (Least Angle Regression) for Lasso path; coordinate descent for large problems.
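As a sketch of the path idea, scikit-learn's `lasso_path` runs coordinate descent with warm starts over a decreasing grid of penalty values (the synthetic data is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import lasso_path

# Synthetic data: 10 features, 2 truly active.
rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = X @ np.r_[3.0, -2.0, np.zeros(p - 2)] + 0.5 * rng.normal(size=n)

# Coordinate descent with warm starts over a decreasing grid of lambdas ("alphas" in scikit-learn).
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
print(coefs.shape)               # (n_features, n_alphas)
print((coefs != 0).sum(axis=0))  # active-set size at each lambda; it generally grows as lambda shrinks
```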
🧠 Common Challenges & Pitfalls
- Feature scaling: Always standardize predictors (zero mean, unit variance) before applying L1/L2 penalties (except the intercept). Penalties depend on scale; unscaled features lead to misleading shrinkage/selection (see the pipeline sketch after this list).
- Intercept handling: Do not penalize the intercept.
- Correlated features:
  - Lasso may arbitrarily pick one variable from a correlated set, giving unstable selection.
  - ElasticNet or Group Lasso is better when group behavior is desired.
- Choice of $\lambda$: Critical. Use cross-validation (CV) or information criteria, and be aware of the trade-off between the "one-SE rule" and CV-min choices for stability.
- Model interpretability vs. predictive accuracy: Lasso yields simpler models but can induce bias; Ridge reduces variance but keeps complexity; ElasticNet trades off both.
- High-dimensional regime ($p \gg n$): Lasso can select at most $n$ variables (in the classical LARS sense). Use caution and consider specialized methods.
- Hyperparameter grid & computation: Grid over $\lambda$ (log scale) and $\alpha$ (for ElasticNet). Use warm starts for efficiency.
- Non-convex penalties: Can reduce bias but complicate optimization (local minima). Use only with careful justification.
- Interpretation of coefficients under penalty: Penalized estimates are biased; compare to an unpenalized refit on the selected variables if unbiased estimates are required (with caveats about selection bias).
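Tying the scaling and tuning pitfalls together, here is a minimal sketch that standardizes inside a pipeline (so scaling is learned on training folds only) and cross-validates over a log-spaced $\lambda$ grid and a small set of mixing values; the synthetic data and grids are illustrative assumptions:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNetCV

# Synthetic data with deliberately mixed feature scales.
rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.normal(size=(n, p)) * rng.uniform(0.1, 10.0, size=p)
y = 3.0 * (X[:, 0] / X[:, 0].std()) + 0.5 * rng.normal(size=n)

# Standardize inside the pipeline, then let ElasticNetCV search its internally
# log-spaced alpha (lambda) grid and the listed l1_ratio (mixing) values by 5-fold CV.
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 1.0], n_alphas=100, cv=5),
)
model.fit(X, y)
enet = model[-1]
print("chosen l1_ratio:", enet.l1_ratio_, "chosen alpha (lambda):", enet.alpha_)
```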
❓ Answer to the Probing Question
Q: If you have thousands of sparse features, which penalty would you prefer and why?
A (concise): Prefer L1 (Lasso) because it encourages sparsity: it will set many coefficients exactly to zero, automatically performing feature selection and producing a compact model that is computationally and statistically attractive when the true signal is sparse. Caveat: if features are heavily correlated or you want group stability, prefer ElasticNet (L1 + L2) to get sparsity plus better behavior on correlated groups.
📚 Reference Pointers
- Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B. (Foundational Lasso paper; see also the related LARS paper: https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
- Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B. https://statweb.stanford.edu/~tibs/ftp/elasticnet.pdf
- Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. (book) https://web.stanford.edu/~hastie/StatLearnSparsity/
- Koller, D. & Friedman, N. (2009). Probabilistic Graphical Models. (for Bayesian views and priors). https://mitpress.mit.edu/9780262013192/probabilistic-graphical-models
- Wikipedia: Ridge regression / Lasso / Elastic Net (for quick refresh): https://en.wikipedia.org/wiki/Ridge_regression https://en.wikipedia.org/wiki/Lasso_(statistics) https://en.wikipedia.org/wiki/Elastic_net_regularization