5.2. Regularization
🪄 Step 1: Intuition & Motivation
Core Idea: Regularization is like a gentle leash you put on your model — not to restrict it too much, but just enough to keep it from running wild and memorizing noise.
It adds a penalty to model complexity, preventing overfitting and improving generalization.
Simple Analogy: Think of fitting a line through noisy data points:
- Without regularization, your model might chase every tiny bump (overfit).
- With regularization, you’re saying: “Hey model, calm down — keep it simple and smooth.” It’s like telling a student: “Don’t memorize; understand the pattern.”
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
When we train a model, we minimize a loss function (e.g., Mean Squared Error). Regularization adds a penalty term that discourages large weights — because large weights often mean the model is too sensitive to small input changes.
Mathematically, for model weights $w$ and data $(X, y)$:
$$ \text{Loss} = \text{MSE}(y, Xw) + \lambda \, \Omega(w) $$
where $\Omega(w)$ is the regularization term and $\lambda$ controls how strong the penalty is.
Different choices of $\Omega(w)$ give different regularization types:
- $L_2$ (Ridge): $\Omega(w) = \|w\|_2^2 = \sum_i w_i^2$
- $L_1$ (Lasso): $\Omega(w) = \|w\|_1 = \sum_i |w_i|$
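To make the objective concrete, here is a minimal NumPy sketch; the helper name `regularized_loss` and the `penalty` flag are illustrative, not from any particular library:

```python
import numpy as np

def regularized_loss(w, X, y, lam, penalty="l2"):
    """MSE plus a weighted complexity penalty on the weights."""
    mse = np.mean((y - X @ w) ** 2)      # data-fit term
    if penalty == "l2":
        omega = np.sum(w ** 2)           # Ridge: sum of squared weights
    elif penalty == "l1":
        omega = np.sum(np.abs(w))        # Lasso: sum of absolute weights
    else:
        raise ValueError(penalty)
    return mse + lam * omega             # lam trades data fit vs. simplicity
```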
Why It Works This Way
- Ridge ($L_2$): Penalizes large weights quadratically, so it shrinks them but rarely to zero.
- Lasso ($L_1$): Penalizes linearly, so small weights get pushed exactly to zero — creating sparse models.
Sparsity means: many features get ignored (weight = 0). This acts like automatic feature selection, making the model simpler and more interpretable.
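You can see this in practice with scikit-learn on synthetic data (the data and `alpha` values below are illustrative): Lasso zeroes out the irrelevant features, while Ridge merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features actually matter.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge nonzero coefficients:", np.count_nonzero(ridge.coef_))  # typically all 10
print("Lasso nonzero coefficients:", np.count_nonzero(lasso.coef_))  # typically ~2
```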
How It Fits in ML Thinking
Regularization is everywhere in ML:
- Linear models: prevent overfitting via Ridge or Lasso.
- Neural networks: use weight decay ($L_2$ regularization) to keep weights from growing unchecked.
- Logistic regression: a regularized loss improves classification robustness.
- Bayesian view: regularization is equivalent to placing priors on the parameters (the $L_2$ case is sketched after this list):
  - $L_2$ ↔ Gaussian prior (weights normally distributed).
  - $L_1$ ↔ Laplace prior (promotes sparsity).
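To sketch the $L_2$ case: with Gaussian noise $y \mid X, w \sim \mathcal{N}(Xw, \sigma^2 I)$ and a Gaussian prior $w \sim \mathcal{N}(0, \tau^2 I)$, the maximum a posteriori (MAP) estimate minimizes the negative log posterior:

$$ w_{\text{MAP}} = \arg\min_w \; \frac{1}{2\sigma^2}\|y - Xw\|_2^2 + \frac{1}{2\tau^2}\|w\|_2^2 $$

which is exactly Ridge with $\lambda = \sigma^2 / \tau^2$. Swapping the Gaussian prior for a Laplace prior turns the second term into an $L_1$ penalty.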
📐 Step 3: Mathematical Foundation
Ridge Regression ($L_2$ Regularization)
Analytic Solution:
$$ w_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y $$
- $\lambda I$ makes $X^T X$ invertible, even under multicollinearity.
- As $\lambda \to 0$, Ridge → OLS.
- As $\lambda \to \infty$, coefficients shrink toward 0.
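The analytic solution translates directly into NumPy. A sketch, assuming features are already standardized and ignoring the usual practice of leaving the intercept unpenalized:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form Ridge: w = (X^T X + lam * I)^(-1) X^T y."""
    d = X.shape[1]
    # Solving the linear system is more numerically stable than forming the inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```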
Lasso Regression ($L_1$ Regularization)
No closed-form solution exists; Lasso is fit with coordinate descent or general convex-optimization solvers.
Key Property: Many coefficients become exactly zero → sparse model.
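Coordinate descent has a clean per-coordinate closed form built on soft-thresholding, which is exactly where the zeros come from. A compact illustrative sketch (scikit-learn's `Lasso` implements an optimized variant of the same idea):

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize (1/2n)||y - Xw||^2 + lam * ||w||_1, one coordinate at a time."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            w[j] = soft_threshold(rho, lam) / z
    return w
```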
Geometric Intuition (Why $L_1$ Creates Sparsity)
Visualize the constraint regions:
- Ridge: a disk ($w_1^2 + w_2^2 \le c$)
- Lasso: a diamond ($|w_1| + |w_2| \le c$)
The elliptical contours of the MSE loss expand until they first touch the constraint region. The diamond has sharp corners on the axes, so that first contact often happens at a corner, where one or more weights are exactly 0; the disk has no corners, so Ridge solutions almost never land exactly on an axis.
Connection to Weight Decay (Neural Networks)
In neural networks, we add the $L_2$ penalty directly to the training loss:
$$ L = \text{Loss}_{\text{data}} + \frac{\lambda}{2} \sum_i w_i^2 $$
where the factor $\tfrac{1}{2}$ is a common convention that makes the penalty's gradient exactly $\lambda w$. During gradient descent, this acts as:
$$ w \leftarrow (1 - \eta \lambda)\, w - \eta \nabla_w \text{Loss}_{\text{data}} $$
The $(1 - \eta \lambda)$ factor shrinks the weights toward zero a little on every step, hence the name weight decay.
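In code, the decay factor falls straight out of the gradient step. A minimal sketch; for reference, PyTorch's `weight_decay` argument in `torch.optim.SGD` adds $\lambda w$ to the gradient in the same coupled way:

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_data, eta, lam):
    """One SGD step on Loss_data + (lam/2) * sum(w**2).

    The penalty contributes lam * w to the gradient, so
    w - eta * (grad_data + lam * w) == (1 - eta * lam) * w - eta * grad_data.
    """
    return (1 - eta * lam) * w - eta * grad_data
```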
🧠 Step 4: Key Ideas
- Regularization = controlling model complexity via penalties.
- $L_2$ → smooth shrinkage (no zeros).
- $L_1$ → sparse shrinkage (many zeros).
- Weight decay = continuous $L_2$ regularization during optimization.
- Bayesian interpretation links regularization to priors.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Reduces overfitting by discouraging large weights.
- Improves generalization to unseen data.
- $L_1$ offers feature selection (interpretable sparse models).
- $L_2$ stabilizes models and helps with multicollinearity.
Limitations & Trade-offs:
- $L_1$ can be unstable when features are highly correlated: it may arbitrarily keep one feature of a correlated group and zero out the rest.
- $L_2$ doesn't yield sparsity; all weights stay small but nonzero.
- Choosing $\lambda$ requires tuning, typically via cross-validation (a sketch follows this list).
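scikit-learn ships cross-validated estimators that automate this search; note that scikit-learn calls $\lambda$ `alpha`. The alpha grids and synthetic data below are illustrative:

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Each estimator tries every candidate penalty and keeps the best by CV score.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
lasso = LassoCV(alphas=np.logspace(-3, 1, 9), cv=5).fit(X, y)

print("Best ridge alpha:", ridge.alpha_)
print("Best lasso alpha:", lasso.alpha_)
```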
🚧 Step 6: Common Misunderstandings
- Myth: $L_2$ always improves accuracy. → Truth: It improves generalization, not necessarily training accuracy.
- Myth: $L_1$ and $L_2$ penalties affect optimization the same way. → Truth: $L_1$ introduces non-differentiability at zero — leading to sparse, axis-aligned solutions.
- Myth: Regularization always helps. → Truth: Too large $\lambda$ over-penalizes → underfitting.
🧩 Step 7: Mini Summary
🧠 What You Learned: Regularization controls overfitting by penalizing large weights. $L_1$ induces sparsity; $L_2$ smooths weights.
⚙️ How It Works: Adding a penalty term biases the optimization toward simpler, smaller-weight solutions, balancing bias and variance.
🎯 Why It Matters: Regularization is the foundation of stable, generalizable learning — it turns wild, overfit models into reliable predictors by enforcing simplicity.