3. L1 vs L2 Regularization
🪄 Step 1: Intuition & Motivation
Core Idea (in one line): Regularization is like giving your model a budget limit — it can’t spend too much “importance” on any one feature.
Simple Analogy: Imagine you’re packing for a trip.
- L1 regularization is like paying per item — you’ll drop some things entirely (sparse packing).
- L2 regularization is like paying by total weight — you’ll pack everything, but lighter.

Both keep you from overpacking (overfitting).
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
When we train a model, it tries to find parameters (weights) that minimize the error between predictions and true outputs. But sometimes, it learns too enthusiastically, giving large weights to unimportant features — leading to overfitting.
Regularization adds a penalty to the cost function, discouraging large weights. This gently (or not-so-gently) pushes weights toward zero, making the model simpler and more generalizable.
Why It Works This Way
L1 Regularization (Lasso): Adds the sum of the absolute values of the weights ($\sum_i |\theta_i|$) as a penalty. This tends to drive some weights to exactly zero, effectively performing feature selection.
L2 Regularization (Ridge): Adds the sum of the squared weights ($\sum_i \theta_i^2$) as a penalty. This shrinks all weights gradually but never exactly to zero — it promotes smoothness and stability.
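To make the contrast concrete, here is a minimal sketch, assuming scikit-learn (its `Lasso` and `Ridge` estimators implement the L1 and L2 penalties; the synthetic data and `alpha` value are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 of which actually matter.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print("Lasso:", np.round(lasso.coef_, 2))  # expect several exact zeros
print("Ridge:", np.round(ridge.coef_, 2))  # expect small but nonzero weights
```

Typically the Lasso coefficients for the uninformative features come out exactly zero, while Ridge leaves them small but nonzero.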
How It Fits in ML Thinking
Regularization is the bridge between bias–variance theory and optimization.
- It adds bias (simplifies the model)
- It reduces variance (stabilizes learning)

In essence, it’s a disciplined way to make your model forget some of what it “memorized.”
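To see the bias–variance effect concretely, here is a minimal sketch assuming scikit-learn and synthetic data (exact scores vary with the random seed):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Few samples, many features: a recipe for high variance.
X, y = make_regression(n_samples=60, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)    # no penalty
ridge = Ridge(alpha=10.0).fit(X_tr, y_tr)   # L2 penalty

print("OLS   train/test R^2:", ols.score(X_tr, y_tr), ols.score(X_te, y_te))
print("Ridge train/test R^2:", ridge.score(X_tr, y_tr), ridge.score(X_te, y_te))
```

Typically the unregularized model fits the training set almost perfectly but scores poorly on the test set, while Ridge gives up some training fit in exchange for better generalization.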
📐 Step 3: Mathematical Foundation
Regularized Cost Function

$$ J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(x^{(i)\top}\theta - y^{(i)}\right)^2 + \lambda \, \|\theta\|_p^p $$

where:
- $m$ → number of samples
- $\lambda$ → regularization strength (higher $\lambda$ = stronger penalty)
- $\|\theta\|_p$ → norm of the weights ($p=1$ for L1, $p=2$ for L2)
Think of $\lambda$ as a discipline knob for your model:
- Turn it up, and your model becomes humble (simpler, smoother).
- Turn it down, and your model becomes wild (more flexible, risk of overfitting).
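The cost function above is straightforward to write down directly. A minimal NumPy sketch (the function name `regularized_cost` and the toy data are illustrative):

```python
import numpy as np

def regularized_cost(theta, X, y, lam, p):
    """Data-fit term plus an Lp penalty (p=1 for L1, p=2 for L2)."""
    residual = X @ theta - y
    data_fit = 0.5 * residual @ residual        # (1/2) * sum of squared errors
    penalty = lam * np.sum(np.abs(theta) ** p)  # lambda * ||theta||_p^p
    return data_fit + penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + 0.1 * rng.normal(size=100)

# Turning the "discipline knob" up makes the same weights more expensive.
theta = np.ones(5)
for lam in (0.0, 1.0, 100.0):
    print(f"lambda={lam:>5}: cost={regularized_cost(theta, X, y, lam, p=2):.1f}")
```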
L1 and L2 Gradients
L2 Gradient Update:
L2 Gradient:
$$ \nabla_\theta J = X^T(X\theta - y) + 2\lambda\theta $$
→ Each weight gets nudged proportionally to its size — larger weights shrink more.
L1 Gradient:
$$ \nabla_\theta J = X^T(X\theta - y) + \lambda \, \text{sign}(\theta) $$
→ Adds or subtracts a constant push, causing small weights to collapse to zero.
- L2 is like pressing down evenly on a sponge — everything compresses a bit.
- L1 is like cutting off small branches — some features vanish completely.
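A minimal sketch of both penalized gradient steps in NumPy (the helper name `gd_step` is illustrative; note that plain subgradient steps on the L1 term only hover near zero, while real Lasso solvers use soft-thresholding to land exactly on zero):

```python
import numpy as np

def gd_step(theta, X, y, lam, penalty, lr=1e-3):
    """One gradient-descent step on the regularized cost."""
    grad = X.T @ (X @ theta - y)          # gradient of the data-fit term
    if penalty == "l2":
        grad += 2 * lam * theta           # push proportional to weight size
    else:  # "l1"
        grad += lam * np.sign(theta)      # constant push toward zero
    return theta - lr * grad

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + 0.1 * rng.normal(size=200)

for penalty in ("l1", "l2"):
    theta = np.zeros(4)
    for _ in range(1000):
        theta = gd_step(theta, X, y, lam=5.0, penalty=penalty)
    print(penalty, np.round(theta, 3))  # L1 pins the useless weights near zero
```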
🧠 Step 4: Assumptions or Key Ideas
- Regularization assumes some features are either redundant or noisy.
- The penalty term prevents extreme parameter magnitudes.
- A properly tuned $\lambda$ balances fitting and simplicity.
🔍 Too high $\lambda$ → underfitting (model too simple)
🔍 Too low $\lambda$ → overfitting (model too flexible)
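You can see both failure modes by sweeping $\lambda$ and scoring on held-out data. A minimal sketch, assuming scikit-learn (whose `alpha` parameter plays the role of $\lambda$; the exact numbers are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=20.0, random_state=0)

# Validation score typically rises, peaks, then falls as the model
# moves from overfitting (tiny alpha) to underfitting (huge alpha).
for alpha in (1e-3, 1e-1, 1.0, 10.0, 100.0, 1000.0):
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    print(f"alpha={alpha:>8}: mean CV R^2 = {score:.3f}")
```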
⚖️ Step 5: Strengths, Limitations & Trade-offs
- L1 performs automatic feature selection by driving some weights to zero.
- L2 improves stability and avoids over-sensitivity to training data.
- Both reduce variance and improve generalization.
- L1 may struggle when features are highly correlated — it picks one arbitrarily.
- L2 keeps all features but doesn’t remove irrelevant ones.
- Both require careful $\lambda$ tuning — too strong a penalty can degrade performance.
L1 and L2 represent different personalities of restraint:
- L1: “Cut unnecessary stuff out completely.”
- L2: “Keep everything, but stay modest.”

In practice, Elastic Net combines both to enjoy the best of both worlds — sparse yet stable.
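A minimal sketch of both behaviors on two nearly duplicate features, assuming scikit-learn (the data construction is illustrative, and which twin Lasso keeps can vary from run to run):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(3)
n = 200
base = rng.normal(size=n)
# Two almost identical (highly correlated) features plus one independent one.
X = np.column_stack([base, base + 0.01 * rng.normal(size=n),
                     rng.normal(size=n)])
y = X[:, 0] + X[:, 1] + 2 * X[:, 2] + 0.1 * rng.normal(size=n)

# Lasso typically gives all the shared weight to one twin and zeros the other;
# Elastic Net (l1_ratio blends L1 and L2) tends to split it between them.
print("Lasso:     ", np.round(Lasso(alpha=0.1).fit(X, y).coef_, 2))
print("ElasticNet:", np.round(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_, 2))
```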
🚧 Step 6: Common Misunderstandings
“Regularization improves training accuracy.” False — it usually reduces training accuracy but improves test performance.
“L1 and L2 are only for linear models.” Nope! Regularization exists in almost every model — from linear regression to deep neural networks.
“A larger λ always helps generalization.” Not always — if too large, it can make the model too simple to learn patterns (underfitting).
🧩 Step 7: Mini Summary
🧠 What You Learned: Regularization is a penalty that keeps model weights in check, preventing overfitting.
⚙️ How It Works: L1 drives some weights to zero (sparsity), while L2 smoothly shrinks all weights (stability).
🎯 Why It Matters: It’s one of the most powerful ways to build models that generalize — not memorize.