3. L1 vs L2 Regularization

4 min read · 694 words

🪄 Step 1: Intuition & Motivation

  • Core Idea (in one line): Regularization is like giving your model a budget limit — it can’t spend too much “importance” on any one feature.

  • Simple Analogy: Imagine you’re packing for a trip.

    • L1 regularization is like paying per item — you’ll drop some things entirely (sparse packing).
    • L2 regularization is like paying by total weight — you’ll pack everything, but lighter.

    Both keep you from overpacking (overfitting).

🌱 Step 2: Core Concept

What’s Happening Under the Hood?

When we train a model, it tries to find parameters (weights) that minimize the error between predictions and true outputs. But sometimes, it learns too enthusiastically, giving large weights to unimportant features — leading to overfitting.

Regularization adds a penalty to the cost function, discouraging large weights. This gently (or not-so-gently) pushes weights toward zero, making the model simpler and more generalizable.

Why It Works This Way
  • L1 Regularization (Lasso): Adds the sum of absolute weight values ($\sum_i |\theta_i|$) as a penalty. This tends to make some weights exactly zero, effectively performing feature selection.

  • L2 Regularization (Ridge): Adds the sum of squared weights ($\sum_i \theta_i^2$) as a penalty. This shrinks all weights gradually but never exactly to zero — it promotes smoothness and stability.
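You can watch this difference directly. Below is a minimal sketch using scikit-learn on synthetic data (the dataset shape and penalty strength are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 100 samples, 20 features, but only 5 of them actually informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # scikit-learn's alpha plays the role of lambda
ridge = Ridge(alpha=1.0).fit(X, y)

print("L1 weights at exactly zero:", np.sum(lasso.coef_ == 0))   # typically many
print("L2 weights at exactly zero:", np.sum(ridge.coef_ == 0))   # typically none
```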

How It Fits in ML Thinking

Regularization is the bridge between bias–variance theory and optimization.

  • It adds bias (simplifies the model).
  • It reduces variance (stabilizes learning).

In essence, it’s a disciplined way to make your model forget some of what it “memorized.”

📐 Step 3: Mathematical Foundation

Regularized Cost Function
$$ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \|\theta\|_p^p $$

where:

  • $m$ → number of samples
  • $\lambda$ → regularization strength (higher $\lambda$ = stronger penalty)
  • $\|\theta\|_p^p$ → the penalty term: $p=1$ gives the L1 penalty $\sum_i |\theta_i|$, and $p=2$ gives the L2 penalty $\sum_i \theta_i^2$

Think of $\lambda$ as a discipline knob for your model:

  • Turn it up, and your model becomes humble (simpler, smoother).
  • Turn it down, and your model becomes wild (more flexible, risk of overfitting).
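For concreteness, here is the cost above written as a small NumPy function. This is a sketch that assumes a plain linear model ($h_\theta(x) = X\theta$); the function name and signature are illustrative:

```python
import numpy as np

def regularized_cost(theta, X, y, lam, p=2):
    """Squared-error cost plus an L1 (p=1) or L2 (p=2) penalty."""
    m = len(y)
    residual = X @ theta - y
    mse_term = (residual @ residual) / (2 * m)      # (1/2m) * sum of squared errors
    if p == 1:
        penalty = lam * np.sum(np.abs(theta))       # L1: lambda * sum |theta_i|
    else:
        penalty = lam * np.sum(theta ** 2)          # L2: lambda * sum theta_i^2
    return mse_term + penalty
```
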
L1 and L2 Gradients

L2 Gradient Update:

$$ \nabla_\theta J = \frac{1}{m}X^T(X\theta - y) + 2\lambda\theta $$

→ Each weight gets nudged proportionally to its size — larger weights shrink more.

L1 Gradient Update:

$$ \nabla_\theta J = \frac{1}{m}X^T(X\theta - y) + \lambda\,\mathrm{sign}(\theta) $$

→ Adds or subtracts a constant push, causing small weights to collapse to exactly zero. (Strictly, $\mathrm{sign}(\theta)$ is a subgradient, since $|\theta|$ isn’t differentiable at zero.)

  • L2 is like pressing down evenly on a sponge — everything compresses a bit.
  • L1 is like cutting off small branches — some features vanish completely.
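A minimal gradient-descent step for each penalty might look like the sketch below, again assuming a linear model (`lr` is the learning rate; both names are illustrative):

```python
import numpy as np

def gradient_step(theta, X, y, lam, lr=0.01, penalty="l2"):
    """One gradient-descent update mirroring the formulas above."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m        # gradient of the (1/2m) squared-error term
    if penalty == "l2":
        grad += 2 * lam * theta             # proportional pull: big weights shrink more
    else:
        grad += lam * np.sign(theta)        # constant push: small weights collapse to zero
    return theta - lr * grad
```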

🧠 Step 4: Assumptions or Key Ideas

  • Regularization assumes some features are either redundant or noisy.
  • The penalty term prevents extreme parameter magnitudes.
  • A properly tuned $\lambda$ balances fitting and simplicity.

🔍 Too high $\lambda$ → underfitting (model too simple)
🔍 Too low $\lambda$ → overfitting (model too flexible)
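In practice you rarely guess $\lambda$ by hand. A common approach, sketched here with scikit-learn's cross-validated estimators, is to sweep a grid of strengths and keep whichever validates best (note that scikit-learn calls $\lambda$ `alpha`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV

# Synthetic data purely for illustration
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Sweep lambda from 1e-3 to 1e3 and keep the best-validating value
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
lasso = LassoCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)

print("best lambda (L2):", ridge.alpha_)
print("best lambda (L1):", lasso.alpha_)
```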


⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • L1 performs automatic feature selection by driving some weights to exactly zero.
  • L2 improves stability and avoids over-sensitivity to the training data.
  • Both reduce variance and improve generalization.

Limitations:

  • L1 may struggle when features are highly correlated — it picks one arbitrarily.
  • L2 keeps all features but doesn’t remove irrelevant ones.
  • Both require careful $\lambda$ tuning — too strong a penalty degrades performance.

L1 and L2 represent different personalities of restraint:

  • L1: “Cut unnecessary stuff out completely.”
  • L2: “Keep everything, but stay modest.”

In practice, Elastic Net combines both to enjoy the best of both worlds — sparse yet stable.
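A quick sketch of that combination using scikit-learn's `ElasticNet` (the `alpha` and `l1_ratio` values are arbitrary illustrations):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# l1_ratio blends the two penalties: 1.0 is pure L1, 0.0 is pure L2
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print("nonzero weights kept:", (enet.coef_ != 0).sum())
```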

🚧 Step 6: Common Misunderstandings

  • “Regularization improves training accuracy.” False — it usually reduces training accuracy but improves test performance.

  • “L1 and L2 are only for linear models.” Nope! Regularization exists in almost every model — from linear regression to deep neural networks.

  • “A larger λ always helps generalization.” Not always — if too large, it can make the model too simple to learn patterns (underfitting).
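If you want to convince yourself of the first point, here is a small experiment you could run: fit the same noisy, high-dimensional synthetic data with and without L2 regularization and compare train versus test scores (exact numbers will vary with the data):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Few samples, many features: an easy setting to overfit
X, y = make_regression(n_samples=60, n_features=50, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), Ridge(alpha=10.0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__,
          "| train R^2:", round(model.score(X_tr, y_tr), 3),
          "| test R^2:", round(model.score(X_te, y_te), 3))
```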


🧩 Step 7: Mini Summary

🧠 What You Learned: Regularization is a penalty that keeps model weights in check, preventing overfitting.

⚙️ How It Works: L1 drives some weights to zero (sparsity), while L2 smoothly shrinks all weights (stability).

🎯 Why It Matters: It’s one of the most powerful ways to build models that generalize — not memorize.
