5.2. Regularization


🪄 Step 1: Intuition & Motivation

  • Core Idea: Regularization is like a gentle leash you put on your model — not to restrict it too much, but just enough to keep it from running wild and memorizing noise.

    It adds a penalty to model complexity, preventing overfitting and improving generalization.

  • Simple Analogy: Think of fitting a line through noisy data points:

    • Without regularization, your model might chase every tiny bump (overfit).
    • With regularization, you’re saying: “Hey model, calm down — keep it simple and smooth.” It’s like telling a student: “Don’t memorize; understand the pattern.”

🌱 Step 2: Core Concept

What’s Happening Under the Hood?

When we train a model, we minimize a loss function (e.g., Mean Squared Error). Regularization adds a penalty term that discourages large weights — because large weights often mean the model is too sensitive to small input changes.

Mathematically, for model weights $w$ and data $(X, y)$:

$$ \text{Loss} = \text{MSE}(y, Xw) + \lambda \, \Omega(w) $$

where $\Omega(w)$ is the regularization term and $\lambda$ controls how strong the penalty is.

Different choices of $\Omega(w)$ give different regularization types:

  • $L_2$ (Ridge): $\Omega(w) = \|w\|_2^2 = \sum_i w_i^2$
  • $L_1$ (Lasso): $\Omega(w) = \|w\|_1 = \sum_i |w_i|$
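
As a minimal sketch of this objective in code (the weight vector, data, and $\lambda$ below are placeholders):

```python
import numpy as np

def regularized_loss(w, X, y, lam, penalty="l2"):
    """MSE data loss plus an L1 or L2 penalty on the weights."""
    mse = np.mean((y - X @ w) ** 2)        # data-fit term
    if penalty == "l2":
        omega = np.sum(w ** 2)             # Ridge: sum of squared weights
    else:
        omega = np.sum(np.abs(w))          # Lasso: sum of absolute weights
    return mse + lam * omega               # lambda scales the penalty
```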

Why It Works This Way
  • Ridge ($L_2$): Penalizes large weights quadratically, so it shrinks them but rarely to zero.
  • Lasso ($L_1$): Penalizes linearly, so small weights get pushed exactly to zero — creating sparse models.

Sparsity means: many features get ignored (weight = 0). This acts like automatic feature selection, making the model simpler and more interpretable.
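
To make the contrast concrete, here is a small scikit-learn sketch on synthetic data (the dataset and the choice $\alpha = 1.0$ are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only 5 of 20 features actually matter
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks every weight but keeps all of them;
# Lasso zeroes out most of the uninformative ones
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically 0
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # typically many
```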


How It Fits in ML Thinking

Regularization is everywhere in ML:

  • Linear models: Prevent overfitting via Ridge or Lasso.

  • Neural networks: Use weight decay ($L_2$ regularization) to keep weights from growing unchecked.

  • Logistic regression: Regularizing the loss keeps weights from diverging on separable data and improves robustness.

  • Bayesian view: Regularization is equivalent to assuming priors on parameters:

    • $L_2$ ↔ Gaussian prior (weights normally distributed).
    • $L_1$ ↔ Laplace prior (promotes sparsity).
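
One way to see this equivalence: the maximum a posteriori (MAP) estimate maximizes $p(w \mid X, y) \propto p(y \mid X, w)\,p(w)$, which is the same as minimizing the negative log posterior:

$$ w_{\text{MAP}} = \arg\min_w \; \left[ -\log p(y \mid X, w) - \log p(w) \right] $$

With a Gaussian likelihood and a Gaussian prior $w \sim \mathcal{N}(0, \tau^2 I)$, the prior term becomes $\frac{1}{2\tau^2} \|w\|_2^2$, i.e., Ridge with $\lambda = \sigma^2 / \tau^2$; a Laplace prior contributes $\frac{1}{b} \|w\|_1$ instead, recovering Lasso.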

📐 Step 3: Mathematical Foundation

Ridge Regression ($L_2$ Regularization)
$$ \min_w \; \|y - Xw\|_2^2 + \lambda \|w\|_2^2 $$

Analytic Solution:

$$ w_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty $$

  • $\lambda I$ makes $X^TX$ invertible (even when multicollinear).
  • As $\lambda \to 0$, Ridge → OLS.
  • As $\lambda \to \infty$, coefficients shrink toward 0.
Ridge smooths the model by discouraging large coefficients — like applying a “rubber band” around parameter space.
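
A minimal numpy sketch of the closed form (the synthetic data is illustrative; `np.linalg.solve` is used instead of explicitly inverting the matrix, which is more stable):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w."""
    A = X.T @ X + lam * np.eye(X.shape[1])  # lam * I guarantees invertibility
    return np.linalg.solve(A, X.T @ y)

# Tiny usage example with made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge_closed_form(X, y, lam=0.1))     # close to [2, -1, 0.5]
```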

Lasso Regression ($L_1$ Regularization)
$$ \min_w \; \|y - Xw\|_2^2 + \lambda \|w\|_1 $$

There is no closed-form solution; Lasso is solved iteratively, typically via coordinate descent or proximal gradient methods.

Key Property: Many coefficients become exactly zero → sparse model.

Lasso doesn’t just shrink — it cuts off small weights entirely, like pruning weak connections.
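
The workhorse inside coordinate descent is the soft-thresholding operator, which performs exactly this cut-off. A minimal sketch for a vector of coordinates:

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward 0 by lam; anything inside [-lam, lam] becomes exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

print(soft_threshold(np.array([-3.0, -0.5, 0.2, 2.0]), lam=1.0))
# -> [-2. -0.  0.  1.]  (small entries are cut to exactly zero)
```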

Geometric Intuition (Why $L_1$ Creates Sparsity)

Visualize the constraint regions:

  • Ridge: circle ($w_1^2 + w_2^2 = c$)
  • Lasso: diamond ($|w_1| + |w_2| = c$)

The optimum sits where the elliptical MSE contours first touch the constraint boundary. Because the diamond has sharp corners on the axes, that first contact often happens at a corner → some weights become exactly 0.

$L_1$ corners catch the optimal solution at zero. $L_2$ circles slide smoothly past it — no sparsity.

Connection to Weight Decay (Neural Networks)

In neural networks, we add the $L_2$ penalty directly to the training loss:

$$ L = \text{Loss}_{\text{data}} + \frac{\lambda}{2} \sum_i w_i^2 $$

(The $\tfrac{1}{2}$ is conventional: it makes the penalty's gradient exactly $\lambda w$, so the update below comes out clean.)

During gradient descent, this acts as:

$$ w \leftarrow (1 - \eta \lambda)\, w - \eta \nabla_w \text{Loss}_{\text{data}} $$

The $(1 - \eta \lambda)$ term continuously decays weights toward zero — hence weight decay.

Weight decay is like applying a frictional force — weights lose momentum over time unless the data truly demands them.
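
In code, the decay is a single multiplicative factor applied at every step. A minimal numpy sketch (the learning rate and $\lambda$ are illustrative, and `grad` would normally come from backpropagation):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, lam=0.01):
    """One gradient step: w <- (1 - lr*lam) * w - lr * grad."""
    return (1.0 - lr * lam) * w - lr * grad

# With zero gradient, only the decay acts: each step multiplies w by 0.999,
# so after 1000 steps the weights fall to ~37% of their starting values.
w = np.array([1.0, -2.0, 0.5])
for _ in range(1000):
    w = sgd_step_with_weight_decay(w, grad=np.zeros_like(w))
print(w)  # approximately [0.368, -0.735, 0.184]
```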

🧠 Step 4: Key Ideas

  • Regularization = controlling model complexity via penalties.
  • $L_2$ → smooth shrinkage (no zeros).
  • $L_1$ → sparse shrinkage (many zeros).
  • Weight decay = continuous $L_2$ regularization during optimization.
  • Bayesian interpretation links regularization to priors.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Reduces overfitting by discouraging large weights.
  • Improves generalization to unseen data.
  • $L_1$ offers feature selection (interpretable sparse models).
  • $L_2$ stabilizes models and helps with multicollinearity.

Limitations:

  • $L_1$ can be unstable when features are highly correlated (it may arbitrarily keep one and drop the rest).
  • $L_2$ doesn’t yield sparsity; all weights stay small but nonzero.
  • Choosing $\lambda$ requires cross-validation.

Trade-off: use $L_1$ when you expect only a few important features (sparsity), use $L_2$ when you expect all features to contribute a little, and combine both → Elastic Net ($L_1 + L_2$) when unsure, as sketched below.
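
A sketch of these choices using scikit-learn's cross-validated Elastic Net (the data and the candidate `l1_ratio` grid are illustrative; the CV helper searches the penalty strength, called `alpha` here, automatically):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Cross-validate both the penalty strength (alpha) and the L1/L2 mix
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
print("chosen alpha:", enet.alpha_)
print("chosen l1_ratio:", enet.l1_ratio_)
```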

🚧 Step 6: Common Misunderstandings

  • Myth: $L_2$ always improves accuracy. → Truth: It improves generalization, not necessarily training accuracy.
  • Myth: $L_1$ and $L_2$ penalties affect optimization the same way. → Truth: $L_1$ introduces non-differentiability at zero — leading to sparse, axis-aligned solutions.
  • Myth: Regularization always helps. → Truth: Too large $\lambda$ over-penalizes → underfitting.

🧩 Step 7: Mini Summary

🧠 What You Learned: Regularization controls overfitting by penalizing large weights. $L_1$ induces sparsity; $L_2$ smooths weights.

⚙️ How It Works: Adding a penalty term biases the optimization toward simpler, smaller-weight solutions, balancing bias and variance.

🎯 Why It Matters: Regularization is the foundation of stable, generalizable learning — it turns wild, overfit models into reliable predictors by enforcing simplicity.
