Regularization (Ridge, Lasso, ElasticNet): Linear Regression


🎯 Core Idea

Regularization adds a penalty to the loss to control model complexity, prevent overfitting, and (depending on the penalty) shrink coefficients or force them to zero.

  • Ridge (L2) penalizes the squared magnitude of coefficients → shrinks coefficients smoothly.
  • Lasso (L1) penalizes the absolute magnitude → can produce exact zeros (sparse models).
  • ElasticNet mixes L1 and L2 to combine sparsity with stability for correlated features.

🌱 Intuition & Real-World Analogy

  • Think of fitting a model as balancing two forces: fit-to-data vs. model simplicity. Regularization is the “simplicity” spring pulling coefficients toward zero.
  • Analogy 1 (packing light for travel): Without constraints you pack everything (overfit). Ridge says “pack lighter, reduce the size of many items”. Lasso says “take only a few essential items; throw out some completely” (sparsity). ElasticNet: “throw out some items and compress the rest” (sparsity + shrinkage).
  • Analogy 2 (gardening with a fence): The fence (penalty) limits how far the plants (coefficients) can grow; L2 lowers the height of all plants, while L1 removes the weaker plants completely.

πŸ“ Mathematical Foundation

Problem setup (ordinary least squares baseline)

Given data $X \in \mathbb{R}^{n\times p}$, response $y\in\mathbb{R}^n$, coefficients $\beta\in\mathbb{R}^p$. OLS minimizes:

$$ \mathcal{L}_{\text{OLS}}(\beta)=\frac{1}{2n}\|y-X\beta\|_2^2. $$

(The scaling factor $1/(2n)$ is convenient for derivatives; some texts use $1/2$ or no factor.)


Ridge (L2)

Penalty: $\lambda\|\beta\|_2^2 = \lambda\sum_{j=1}^p \beta_j^2$.

Objective:

$$ \boxed{\;\mathcal{L}_{\text{Ridge}}(\beta) = \frac{1}{2n}\|y-X\beta\|_2^2 + \lambda \|\beta\|_2^2\;} $$
  • Closed-form solution:
$$ \hat\beta_{\text{ridge}} = (X^\top X + 2n\lambda I)^{-1} X^\top y. $$

(If the objective uses $\lambda \|\beta\|_2^2$ with the $1/(2n)$ loss, the $2n$ appears in the formula. Different conventions shift the constants.)

  • Bayesian view: equivalent to the MAP (posterior-mode) estimate under a Gaussian prior $\beta_j \sim \mathcal{N}(0,\sigma^2_\beta)$.

Terms explained

  • $\lambda\ge0$: regularization strength (hyperparameter). Larger $\lambda$ → more shrinkage.
  • $I$: identity matrix; adds stability to invert $X^\top X$.
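
A minimal NumPy sketch of the closed form above, under the $1/(2n)$ loss convention used here (synthetic data; variable names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

lam = 0.1  # regularization strength lambda

# Closed-form ridge estimate for the 1/(2n) loss convention:
# beta_hat = (X'X + 2*n*lambda*I)^{-1} X'y
beta_ridge = np.linalg.solve(X.T @ X + 2 * n * lam * np.eye(p), X.T @ y)

# OLS for comparison (lambda = 0)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("OLS:  ", np.round(beta_ols, 3))
print("Ridge:", np.round(beta_ridge, 3))  # every coefficient shrunk toward zero
```

Using np.linalg.solve rather than an explicit matrix inverse is the numerically preferred way to evaluate the closed form.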

Lasso (L1)

Penalty: $\lambda\|\beta\|_1 = \lambda\sum_{j=1}^p |\beta_j|$.

Objective:

$$ \boxed{\;\mathcal{L}_{\text{Lasso}}(\beta) = \frac{1}{2n}\|y-X\beta\|_2^2 + \lambda \|\beta\|_1\;} $$
  • No closed-form in general. Solutions often computed by coordinate descent, LARS, or proximal methods.
  • Bayesian view: a Laplace (double-exponential) prior on the coefficients $\Rightarrow$ the Lasso estimate is the posterior mode.

Why L1 gives sparsity (intuition)

The L1 penalty has a kink at zero (it is nondifferentiable there), so the optimization “prefers” to set small coefficients exactly to zero: doing so reduces the penalty without much increase in the data loss.
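
To see the sparsity in practice, here is a small sketch with scikit-learn's Lasso, whose documented objective matches the $1/(2n)$ form above (its alpha plays the role of $\lambda$); the synthetic data and values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]          # only 3 of 20 features matter
y = X @ beta_true + 0.5 * rng.normal(size=n)

# scikit-learn's alpha corresponds to lambda in the objective above
lasso = Lasso(alpha=0.1).fit(X, y)

print("nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))
print(np.round(lasso.coef_, 2))           # most entries are exactly 0
```

Larger values of alpha zero out more coefficients.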


ElasticNet

Mix of L1 and L2:

$$ \boxed{\;\mathcal{L}_{\text{EN}}(\beta) = \frac{1}{2n}\|y-X\beta\|_2^2 + \lambda\left(\alpha\|\beta\|_1 + \tfrac{1-\alpha}{2}\|\beta\|_2^2\right)\;} $$
  • $\lambda\ge0$: overall penalty strength.
  • $\alpha\in[0,1]$: mixing parameter. $\alpha=1$ → Lasso; $\alpha=0$ → Ridge.

ElasticNet combines sparsity (L1) with grouping/stability (L2).
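
scikit-learn's ElasticNet uses this same parameterization: its alpha corresponds to $\lambda$ and its l1_ratio to $\alpha$. A minimal sketch on synthetic data with two strongly correlated predictors (values are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
# Two nearly identical (highly correlated) predictors plus 8 noise features
X = np.column_stack([z + 0.01 * rng.normal(size=n),
                     z + 0.01 * rng.normal(size=n),
                     rng.normal(size=(n, 8))])
y = 2.0 * z + 0.5 * rng.normal(size=n)

# alpha ~ lambda (overall strength), l1_ratio ~ alpha (L1/L2 mix) in the formula above
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(enet.coef_, 2))  # the correlated pair tends to share weight
```

With a pure Lasso penalty (l1_ratio=1.0) the correlated pair often collapses onto a single selected feature; the L2 component encourages the pair to share weight, which is the grouping effect mentioned above.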


Useful alternate formulation (constrained view)

Regularized problem $\Leftrightarrow$ constrained problem:

  • Ridge: minimize $\|y-X\beta\|_2^2$ s.t. $\|\beta\|_2^2 \le t$.
  • Lasso: minimize $\|y-X\beta\|_2^2$ s.t. $\|\beta\|_1 \le t$.

This helps visualize the feasible sets: the L2 ball vs. the L1 diamond. Where the objective contours first touch the constraint set explains sparsity (the diamond's corners) vs. smooth shrinkage.


Derivation (brief) β€” Lasso soft threshold in orthonormal case

If the columns of $X$ are orthogonal and scaled so that $X^\top X = n I$, the Lasso decouples coordinate-wise:

$$ \hat\beta_j^{\text{Lasso}} = S\!\left(\frac{X_j^\top y}{n}, \lambda\right) $$

where $S(z,\lambda)=\mathrm{sign}(z)\max(|z|-\lambda,0)$ is the soft-thresholding operator. This makes the sparsity mechanism explicit: any coordinate whose scaled correlation with $y$ falls below $\lambda$ in magnitude is set exactly to zero. (Full derivations: see Tibshirani 1996 and Hastie, Tibshirani & Wainwright.)
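
A short NumPy sketch of the soft-thresholding operator and the decoupled solution above, on a synthetic design scaled so that $X^\top X = nI$ (values are illustrative):

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lambda) = sign(z) * max(|z| - lambda, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(0)
n, p = 100, 4
# Build X with orthogonal columns scaled so that X'X = n * I
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = np.sqrt(n) * Q
y = X @ np.array([2.0, -0.3, 0.0, 1.0]) + 0.1 * rng.normal(size=n)

lam = 0.5
beta_lasso = soft_threshold(X.T @ y / n, lam)   # closed form in this special case
print(np.round(beta_lasso, 3))  # small coefficients are thresholded exactly to zero
```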


βš–οΈ Strengths, Limitations & Trade-offs

Ridge (L2)

Strengths

  • Stabilizes estimates when $X^\top X$ is ill-conditioned (multicollinearity).
  • Keeps all features but shrinks coefficients, which is good when many small effects exist.
  • Closed-form solution (fast linear algebra).

Limitations

  • Does not produce sparse models; hard to interpret when $p$ is large.
  • Shrinks correlated predictors together but doesn’t perform selection.

When preferred

  • $p$ moderate, predictors heavily correlated, or we believe many features have small but nonzero effect.

Lasso (L1)

Strengths

  • Produces sparse solutions (feature selection), which are interpretable and well suited to high-dimensional data.
  • Useful when true model is sparse (only few predictors matter).

Limitations

  • If predictors are highly correlated, Lasso tends to arbitrarily select one from the group, making the selection unstable.
  • Bias: L1 shrinks large coefficients too, which may hurt predictive accuracy vs. unbiased methods in some settings.
  • Computationally more expensive (no simple closed form for general $X$).

When preferred

  • High $p$ with many irrelevant features, and you want automatic feature selection.

ElasticNet

Strengths

  • Retains sparsity while encouraging grouped selection when predictors correlate (L2 component stabilizes selection).
  • Often better predictive performance than pure Lasso when correlated features exist.

Limitations

  • Two hyperparameters ($\lambda,\alpha$) → more tuning.
  • Slightly more complex interpretation.

When preferred

  • Many features, groups of correlated predictors, and you want sparse + stable solutions.

πŸ” Variants & Extensions

  • Group Lasso: L1 penalty on groups (select/deselect whole groups). Useful for grouped features.
  • Sparse Group Lasso: Combines group-wise and within-group sparsity.
  • SCAD / MCP (non-convex penalties): Reduce bias for large coefficients while encouraging sparsity.
  • Adaptive Lasso: Weigh penalties to reduce bias on large coefficients; enjoys oracle properties under certain conditions.
  • Fused Lasso: L1 penalty on coefficients and on differences between adjacent coefficients; useful for ordered features.
  • Bayesian counterparts: Spike-and-slab, horseshoe priors for sparsity with probabilistic uncertainty estimates.
  • Regularized Generalized Linear Models: the same penalties apply to logistic, Poisson, and Cox models; replace the squared loss with the appropriate likelihood.
  • Path algorithms: LARS (Least Angle Regression) for Lasso path; coordinate descent for large problems.
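
As a sketch of the path idea, scikit-learn's lasso_path computes coefficients over a whole grid of $\lambda$ values by coordinate descent, reusing warm starts along the grid (synthetic data; values are illustrative):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

# One coefficient vector per lambda, from strong to weak regularization
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
print(alphas.shape, coefs.shape)  # (50,), (10, 50)
# Features enter the model one at a time as lambda decreases
```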

🚧 Common Challenges & Pitfalls

  1. Feature scaling: Always standardize predictors (zero mean, unit variance) before applying L1/L2 (except the intercept). The penalties depend on scale, so unscaled features lead to misleading shrinkage/selection (see the pipeline sketch after this list).

  2. Intercept handling: Do not penalize the intercept.

  3. Correlated features:

    • Lasso may arbitrarily pick one variable from a correlated set → unstable selection.
    • ElasticNet or Group Lasso work better when group behavior is desired.
  4. Choice of $\lambda$: Critical. Use cross-validation (CV) or information criteria, and be aware that the “one-SE rule” and the CV-minimum choice trade off stability against fit.

  5. Model interpretability vs predictive accuracy: Lasso yields simpler models but can induce bias. Ridge reduces variance but keeps complexity. ElasticNet trades off both.

  6. High-dimensional regime (p ≫ n): Lasso can select at most $n$ variables (in the classical LARS sense). Use caution and consider specialized methods.

  7. Hyperparameter grid & computation: Grid over $\lambda$ (log scale) and $\alpha$ (for ElasticNet). Use warm starts for efficiency.

  8. Non-convex penalties: Can reduce bias but complicate optimization (local minima). Use only with careful justification.

  9. Interpretation of coefficients under penalty: Penalized estimates are biased; compare to unpenalized refit on selected variables if unbiased estimates are required (with caveats about selection bias).
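
A minimal sketch addressing pitfalls 1, 4, and 7 together, assuming scikit-learn: standardization and a cross-validated search over a log-spaced $\lambda$ grid (handled internally by ElasticNetCV with warm starts) are combined in one pipeline. Data and parameter values are illustrative.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
n, p = 300, 30
X_std = rng.normal(size=(n, p))
y = 3.0 * X_std[:, 0] - 2.0 * X_std[:, 1] + 1.0 * X_std[:, 2] + rng.normal(size=n)
# Put the features on wildly different scales (pitfall 1)
X = X_std * rng.uniform(0.1, 10.0, size=p)

# StandardScaler handles pitfall 1; ElasticNetCV picks lambda (and the L1/L2 mix)
# by cross-validation over a log-spaced grid (pitfalls 4 and 7).
# The intercept is fit separately and not penalized (pitfall 2).
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 1.0], n_alphas=100, cv=5),
)
model.fit(X, y)

enet = model.named_steps["elasticnetcv"]
print("chosen lambda:", enet.alpha_)
print("chosen mixing (alpha in the formula above):", enet.l1_ratio_)
print("nonzero coefficients:", int(np.sum(enet.coef_ != 0)))
```

Pitfall 9 still applies: the reported coefficients are penalized and therefore biased.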


✅ Answer to the Probing Question

Q: If you have thousands of sparse features, which penalty would you prefer and why?

A (concise): Prefer L1 (Lasso) because it encourages sparsity: it will set many coefficients exactly to zero, automatically performing feature selection and producing a compact model that's computationally and statistically attractive when the true signal is sparse. Caveat: if features are heavily correlated or you want group stability, prefer ElasticNet (L1 + L2) to get sparsity and better behavior on correlated groups.


📚 Reference Pointers

  • Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267-288.
  • Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press.