3. L1 vs L2 Regularization

4 min read · 694 words

🪄 Step 1: Intuition & Motivation

  • Core Idea (in one line): Regularization is like giving your model a budget limit — it can’t spend too much “importance” on any one feature.

  • Simple Analogy: Imagine you’re packing for a trip.

    • L1 regularization is like paying per item — you’ll drop some things entirely (sparse packing).
    • L2 regularization is like paying by total weight — you’ll pack everything, but lighter.

    Both keep you from overpacking (overfitting).

🌱 Step 2: Core Concept

What’s Happening Under the Hood?

When we train a model, it tries to find parameters (weights) that minimize the error between predictions and true outputs. But sometimes, it learns too enthusiastically, giving large weights to unimportant features — leading to overfitting.

Regularization adds a penalty to the cost function, discouraging large weights. This gently (or not-so-gently) pushes weights toward zero, making the model simpler and more generalizable.

Why It Works This Way
  • L1 Regularization (Lasso): Adds the sum of absolute weight values ($\sum_i |\theta_i|$) as a penalty. This tends to make some weights exactly zero, effectively performing feature selection.

  • L2 Regularization (Ridge): Adds the sum of squared weights ($\sum_i \theta_i^2$) as a penalty. This shrinks all weights gradually but never exactly to zero — it promotes smoothness and stability.
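You can watch this difference directly. Below is a minimal sketch using scikit-learn on synthetic data (the dataset shape and penalty strength are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 100 samples, 20 features, but only 5 of them actually informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # scikit-learn's alpha plays the role of lambda
ridge = Ridge(alpha=1.0).fit(X, y)

print("L1 weights at exactly zero:", np.sum(lasso.coef_ == 0))   # typically many
print("L2 weights at exactly zero:", np.sum(ridge.coef_ == 0))   # typically none
```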

How It Fits in ML Thinking

Regularization is the bridge between bias–variance theory and optimization.

  • It adds bias (simplifies the model).
  • It reduces variance (stabilizes learning).

In essence, it’s a disciplined way to make your model forget some of what it “memorized.”

📐 Step 3: Mathematical Foundation

Regularized Cost Function
$$ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \|\theta\|_p^p $$

where:

  • $m$ → number of samples
  • $\lambda$ → regularization strength (higher $\lambda$ = stronger penalty)
  • $\|\theta\|_p^p$ → the penalty term: $p=1$ gives the L1 penalty $\sum_i |\theta_i|$, and $p=2$ gives the L2 penalty $\sum_i \theta_i^2$

Think of $\lambda$ as a discipline knob for your model:

  • Turn it up, and your model becomes humble (simpler, smoother).
  • Turn it down, and your model becomes wild (more flexible, risk of overfitting).
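For concreteness, here is the cost above written as a small NumPy function. This is a sketch that assumes a plain linear model ($h_\theta(x) = X\theta$); the function name and signature are illustrative:

```python
import numpy as np

def regularized_cost(theta, X, y, lam, p=2):
    """Squared-error cost plus an L1 (p=1) or L2 (p=2) penalty."""
    m = len(y)
    residual = X @ theta - y
    mse_term = (residual @ residual) / (2 * m)      # (1/2m) * sum of squared errors
    if p == 1:
        penalty = lam * np.sum(np.abs(theta))       # L1: lambda * sum |theta_i|
    else:
        penalty = lam * np.sum(theta ** 2)          # L2: lambda * sum theta_i^2
    return mse_term + penalty
```
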
L1 and L2 Gradients

L2 Gradient Update:

$$ \nabla_\theta J = \frac{1}{m}X^T(X\theta - y) + 2\lambda\theta $$

→ Each weight gets nudged proportionally to its size — larger weights shrink more.

L1 Gradient Update:

$$ \nabla_\theta J = \frac{1}{m}X^T(X\theta - y) + \lambda\,\mathrm{sign}(\theta) $$

→ Adds or subtracts a constant push, causing small weights to collapse to exactly zero. (Strictly, $\mathrm{sign}(\theta)$ is a subgradient, since $|\theta|$ isn’t differentiable at zero.)

  • L2 is like pressing down evenly on a sponge — everything compresses a bit.
  • L1 is like cutting off small branches — some features vanish completely.
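A minimal gradient-descent step for each penalty might look like the sketch below, again assuming a linear model (`lr` is the learning rate; both names are illustrative):

```python
import numpy as np

def gradient_step(theta, X, y, lam, lr=0.01, penalty="l2"):
    """One gradient-descent update mirroring the formulas above."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m        # gradient of the (1/2m) squared-error term
    if penalty == "l2":
        grad += 2 * lam * theta             # proportional pull: big weights shrink more
    else:
        grad += lam * np.sign(theta)        # constant push: small weights collapse to zero
    return theta - lr * grad
```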

🧠 Step 4: Assumptions or Key Ideas

  • Regularization assumes some features are either redundant or noisy.
  • The penalty term prevents extreme parameter magnitudes.
  • A properly tuned $\lambda$ balances fitting and simplicity.

🔍 Too high $\lambda$ → underfitting (model too simple)
🔍 Too low $\lambda$ → overfitting (model too flexible)
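In practice you rarely guess $\lambda$ by hand. A common approach, sketched here with scikit-learn's cross-validated estimators, is to sweep a grid of strengths and keep whichever validates best (note that scikit-learn calls $\lambda$ `alpha`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV

# Synthetic data purely for illustration
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Sweep lambda from 1e-3 to 1e3 and keep the best-validating value
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
lasso = LassoCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)

print("best lambda (L2):", ridge.alpha_)
print("best lambda (L1):", lasso.alpha_)
```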


⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • L1 performs automatic feature selection by driving some weights to exactly zero.
  • L2 improves stability and avoids over-sensitivity to the training data.
  • Both reduce variance and improve generalization.

Limitations:

  • L1 may struggle when features are highly correlated — it picks one arbitrarily.
  • L2 keeps all features but doesn’t remove irrelevant ones.
  • Both require careful $\lambda$ tuning — too strong a penalty degrades performance.

L1 and L2 represent different personalities of restraint:

  • L1: “Cut unnecessary stuff out completely.”
  • L2: “Keep everything, but stay modest.”

In practice, Elastic Net combines both to enjoy the best of both worlds — sparse yet stable.
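A quick sketch of that combination using scikit-learn's `ElasticNet` (the `alpha` and `l1_ratio` values are arbitrary illustrations):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# l1_ratio blends the two penalties: 1.0 is pure L1, 0.0 is pure L2
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print("nonzero weights kept:", (enet.coef_ != 0).sum())
```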

🚧 Step 6: Common Misunderstandings

  • “Regularization improves training accuracy.” False — it usually reduces training accuracy but improves test performance.

  • “L1 and L2 are only for linear models.” Nope! Regularization exists in almost every model — from linear regression to deep neural networks.

  • “A larger λ always helps generalization.” Not always — if too large, it can make the model too simple to learn patterns (underfitting).
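If you want to convince yourself of the first point, here is a small experiment you could run: fit the same noisy, high-dimensional synthetic data with and without L2 regularization and compare train versus test scores (exact numbers will vary with the data):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Few samples, many features: an easy setting to overfit
X, y = make_regression(n_samples=60, n_features=50, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), Ridge(alpha=10.0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__,
          "| train R^2:", round(model.score(X_tr, y_tr), 3),
          "| test R^2:", round(model.score(X_te, y_te), 3))
```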


🧩 Step 7: Mini Summary

🧠 What You Learned: Regularization is a penalty that keeps model weights in check, preventing overfitting.

⚙️ How It Works: L1 drives some weights to zero (sparsity), while L2 smoothly shrinks all weights (stability).

🎯 Why It Matters: It’s one of the most powerful ways to build models that generalize — not memorize.
