Regularization (Ridge, Lasso, ElasticNet): Linear Regression


🪄 Step 1: Intuition & Motivation

  • Core Idea: Linear Regression is simple and powerful — but it has one fatal flaw: it can overfit when there are too many features or when features are correlated. Regularization fixes this by gently “punishing” large coefficient values. It doesn’t let the model go wild trying to fit every tiny wiggle in the data.

  • Simple Analogy: Imagine giving a child (your model) some crayons to color a picture (fit data). Without rules, they’ll scribble all over the place — overfitting! Regularization is like telling them: “You can color inside the lines, but don’t press too hard.” It keeps the picture clean — simple enough to generalize.


🌱 Step 2: Core Concept

What’s Happening Under the Hood?

Regularization adds an extra term to the cost function — a penalty that discourages large coefficients.

The original Linear Regression cost (Mean Squared Error) is:

$$ J(\beta) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2 $$

Regularization modifies it into:

$$ J(\beta) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2 + \lambda \cdot \text{Penalty}(\beta) $$

Here, $\lambda$ controls how strong the penalty is.
If $\lambda = 0$, we get normal Linear Regression.
If $\lambda$ is large, the model becomes simpler (coefficients shrink).
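
As a rough sketch in plain NumPy (the `penalized_cost` helper and its L1/L2 switch are illustrative, not from any library):

```python
import numpy as np

def penalized_cost(y, y_hat, beta, lam, penalty="l2"):
    """Mean squared error plus a regularization penalty on the coefficients.

    `lam` plays the role of lambda above; the intercept is assumed to be
    excluded from `beta`, as is standard practice.
    """
    mse = np.mean((y - y_hat) ** 2)
    if penalty == "l2":              # Ridge-style penalty
        pen = np.sum(beta ** 2)
    elif penalty == "l1":            # Lasso-style penalty
        pen = np.sum(np.abs(beta))
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return mse + lam * pen
```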

Different types of penalties give us different regularizations (see the scikit-learn sketch after this list):

  • L2 penalty → Ridge Regression
  • L1 penalty → Lasso Regression
  • Combination of both → ElasticNet
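
In scikit-learn, each of these penalties has its own estimator; `alpha` plays the role of $\lambda$, and the values below are placeholders rather than tuned settings:

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)                     # L2 penalty
lasso = Lasso(alpha=0.1)                     # L1 penalty
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5)  # mix of L1 and L2
```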
Why It Works This Way

Large coefficients often mean the model is trying to memorize noise or redundant information.
By shrinking coefficients toward zero, regularization makes the model:

  • Less sensitive to outliers,
  • More robust to multicollinearity,
  • Better at generalizing to unseen data.
It’s like telling the model, “Don’t get too excited about every small fluctuation.”

How It Fits in ML Thinking

Regularization introduces the idea of bias–variance trade-off:

  • Without regularization → low bias, high variance (overfit).
  • With too much regularization → high bias, low variance (underfit).
The sweet spot lies in between, balancing flexibility with simplicity (see the sketch below).
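Here is a minimal sketch of that trade-off, sweeping scikit-learn's `alpha` (its name for $\lambda$) on a synthetic dataset; the data sizes and alpha grid are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

# Few samples, many features: an easy setup to overfit without regularization.
X, y = make_regression(n_samples=100, n_features=60, n_informative=10,
                       noise=20.0, random_state=0)

alphas = np.logspace(-3, 3, 7)
train_scores, val_scores = validation_curve(
    Ridge(), X, y, param_name="alpha", param_range=alphas, cv=5)

# Tiny alpha: strong train score, weaker validation score (overfit).
# Huge alpha: both scores drop (underfit). The sweet spot sits in between.
for a, tr, va in zip(alphas, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"alpha={a:9.3f}  train R^2={tr:.3f}  validation R^2={va:.3f}")
```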

📐 Step 3: Mathematical Foundation

Ridge Regression (L2 Regularization)

Cost Function:

$$ J(\beta) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^{p}\beta_j^2 $$

  • Penalizes the square of coefficients.
  • Forces coefficients to be small but not zero.

Effect: Shrinks coefficients evenly → helps when features are correlated.

Ridge is like stretching a rubber band around all coefficients — it pulls them closer to zero, but none completely disappear.
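
A small sketch of that behaviour with two nearly duplicate features (synthetic data, so the exact coefficients will vary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # almost a copy of x1 (high collinearity)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=200)

# Plain OLS: the coefficients on the correlated pair are often large and unstable.
print(LinearRegression().fit(X, y).coef_)
# Ridge: the weight is shrunk and shared roughly evenly across the pair (about 1.5 each).
print(Ridge(alpha=1.0).fit(X, y).coef_)
```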
Lasso Regression (L1 Regularization)

Cost Function:

$$ J(\beta) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^{p}|\beta_j| $$

  • Penalizes the absolute value of coefficients.
  • Some coefficients become exactly zero → feature selection.

Effect: Sparsity — only the most important features survive.

Lasso is like having a strict teacher: “You can’t talk unless you have something important to say.”
Unimportant features (tiny effects) get silenced — $\beta_j = 0$.
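
A minimal sketch of that sparsity on synthetic data with 50 features, only 10 of which actually matter (the `alpha` value is an assumption, not a tuned choice):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

# Scale first, then apply the L1 penalty.
lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)
coefs = lasso[-1].coef_

# Most of the 50 coefficients are driven exactly to zero;
# roughly the 10 informative features survive.
print("non-zero coefficients:", int(np.sum(coefs != 0)))
```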
ElasticNet (Combination of L1 + L2)

Cost Function:

$$ J(\beta) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2 + \lambda_1 \sum_{j}|\beta_j| + \lambda_2 \sum_{j}\beta_j^2 $$

Effect:
Balances Ridge’s smooth shrinkage with Lasso’s sparsity.
Works well when there are many features and some of them are correlated.

Think of ElasticNet as a “best of both worlds” deal — Ridge keeps stability, Lasso keeps focus.
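
One practical note: scikit-learn parameterizes this mix differently, using a single `alpha` for the overall strength and `l1_ratio` for the L1 share, which together play the role of $\lambda_1$ and $\lambda_2$ above. A minimal sketch:

```python
from sklearn.linear_model import ElasticNet

# l1_ratio = 1.0 gives a pure L1 (Lasso) penalty, l1_ratio = 0.0 leaves only
# the L2 part; 0.5 is an even blend. alpha scales the whole penalty.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
```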

🧠 Step 4: Key Ideas and Assumptions

1️⃣ Regularization assumes that smaller weights are better:
We prefer simpler models that rely on fewer or smaller coefficients.

2️⃣ λ (lambda) is crucial:

  • Small λ → behaves like OLS (less penalty).
  • Large λ → oversimplifies the model (too much shrinkage).

3️⃣ Feature scaling matters:
Regularization treats all coefficients equally, so features must be on comparable scales (use StandardScaler, as in the sketch below).
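
A minimal sketch of that setup, using a pipeline so the scaler is fit together with the model (the `alpha` value is again just a placeholder):

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize every feature first so the penalty treats all coefficients
# on an equal footing, regardless of the original units.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
# model.fit(X_train, y_train)  # X_train / y_train: whatever training data you have
```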


⚖️ Step 5: Strengths, Limitations & Trade-offs

  • Prevents overfitting effectively.
  • Ridge stabilizes models with correlated features.
  • Lasso performs automatic feature selection.
  • ElasticNet handles both correlation and sparsity gracefully.
  • λ tuning requires cross-validation.
  • Lasso can behave inconsistently when features are highly correlated.
  • Ridge doesn’t produce sparse models (all features stay).
  • Interpretability drops as regularization strength increases.
  • Ridge → keeps all features but shrinks them (smooth).
  • Lasso → keeps only key features (sparse).
  • ElasticNet → balances both worlds.
    Choose based on data shape:
    • High collinearity → Ridge or ElasticNet.
    • Many irrelevant features → Lasso.
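
Since λ has to be tuned by cross-validation, here is a minimal sketch using scikit-learn's built-in CV estimators (the synthetic data and alpha grids are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Both estimators search their alpha grid with cross-validation during fit.
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13)))
lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=np.logspace(-3, 1, 20), cv=5))

ridge_cv.fit(X, y)
lasso_cv.fit(X, y)

print("best Ridge alpha:", ridge_cv[-1].alpha_)
print("best Lasso alpha:", lasso_cv[-1].alpha_)
```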

🚧 Step 6: Common Misunderstandings

  • “Regularization only helps with overfitting.”
    It also improves numerical stability (especially when $X^TX$ is nearly singular).

  • “Lasso always performs better than Ridge.”
    Not true — Ridge often outperforms when most features are relevant.

  • “You can apply regularization without scaling features.”
    Nope — without scaling, the penalty is applied unevenly: a feature measured on a large scale ends up with a tiny coefficient that is barely penalized, while small-scale features get shrunk much harder.


🧩 Step 7: Mini Summary

🧠 What You Learned: Regularization adds penalties to keep models simple and generalizable. Ridge (L2) smooths coefficients; Lasso (L1) enforces sparsity; ElasticNet blends both.

⚙️ How It Works: Adds a penalty term to the cost function — the higher the $\lambda$, the stronger the shrinkage.

🎯 Why It Matters: It’s the most practical way to fight overfitting, improve interpretability, and stabilize models in the real world.
