Cost Function and Optimization: Linear Regression
🪄 Step 1: Intuition & Motivation
Core Idea (in short): Linear Regression needs a way to learn — to figure out the best line that fits the data. But how does it know what’s “best”? That’s where the Cost Function comes in. It measures how wrong the model is. Then, Optimization (like Gradient Descent) helps adjust the line bit by bit to make it less wrong — until it can’t get any better.
Simple Analogy: Imagine you’re blindfolded and trying to find the lowest point in a valley. You can’t see, but you can feel the ground’s slope under your feet. So, you keep stepping downhill in small steps until you can’t go any lower. That’s Gradient Descent — your brain is minimizing “error,” one cautious step at a time.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
Linear Regression predicts $\hat{y}$ using the equation $\hat{y} = X\beta$.
But these predictions rarely match actual values $y$.
So, we need a way to measure how far off our predictions are.
That measurement is done by a Cost Function, which quantifies error.
The most common one? The Mean Squared Error (MSE):
$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 $$

Once we have a cost, our job is to make it as small as possible by adjusting the parameters ($\beta$).
Optimization algorithms like Gradient Descent do this systematically — they calculate which direction decreases error and move a small step that way.
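To make the cost concrete, here is a minimal sketch of computing MSE for a handful of predictions (the numbers are made up for illustration, not taken from the text):

```python
import numpy as np

# Toy example: actual values vs. model predictions (made-up numbers)
y = np.array([3.0, 5.0, 7.0, 9.0])       # actual observed values
y_hat = np.array([2.5, 5.5, 6.0, 9.5])   # model predictions

# Mean Squared Error: average of squared differences
mse = np.mean((y - y_hat) ** 2)
print(mse)  # 0.4375
```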
Why It Works This Way
MSE punishes large errors quadratically — if one prediction is way off, it hurts the model a lot more.
This ensures smoother, more stable learning, because the model strongly avoids making huge mistakes.
Gradient Descent helps us find the best $\beta$ without solving for it in closed form (the way the OLS normal equation does).
Instead, it learns iteratively: measure slope (gradient), move opposite to it, and repeat until you can’t get better.
How It Fits in ML Thinking
The Cost Function represents the model’s “pain”: a measure of how wrong it currently is.
The Optimizer (like Gradient Descent) represents the “learning” — the act of reducing that pain.
Together, they form the core loop of every learning algorithm in ML, from Linear Regression to Deep Neural Networks.
📐 Step 3: Mathematical Foundation
Mean Squared Error (MSE)
- $y_i$: actual observed value
- $\hat{y_i}$: predicted value
- $n$: number of samples
MSE measures average squared distance between predictions and true values.
Mean Absolute Error (MAE)
MAE measures average absolute distance between predicted and actual values.
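In the same notation as MSE above:
$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y_i}| $$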
- Less sensitive to outliers because it doesn’t square the errors.
- But harder to optimize — the absolute value function has a “kink” at zero (non-differentiable).
Gradient Descent Update Rule
The goal of Gradient Descent is to iteratively minimize the cost function $J(\beta)$:
$$ \beta := \beta - \eta \frac{\partial J}{\partial \beta} $$

Where:
- $\eta$ (eta): learning rate — how big each step is.
- $\frac{\partial J}{\partial \beta}$: slope of the cost function (the gradient).
We repeat this update until convergence — when further changes no longer reduce the cost.
Learning rate $\eta$ controls step size: too small → slow crawl; too big → overshoot or oscillate wildly.
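Here is a minimal, illustrative sketch of batch gradient descent for linear regression (the function name and toy data are my own, not from the text). For MSE with $\hat{y} = X\beta$, the gradient works out to $\frac{\partial J}{\partial \beta} = -\frac{2}{n} X^\top (y - X\beta)$, which is what the loop uses:

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, n_iters=1000):
    """Minimize MSE for linear regression with batch gradient descent."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)                           # start from all-zero parameters
    for _ in range(n_iters):
        y_hat = X @ beta                                  # current predictions
        gradient = -(2 / n_samples) * X.T @ (y - y_hat)   # dJ/d(beta) for MSE
        beta = beta - eta * gradient                      # step opposite to the gradient
    return beta

# Toy data: y = 1 + 2*x plus a little noise; first column of ones is the intercept
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.05, size=50)

print(gradient_descent(X, y))  # close to [1.0, 2.0]
```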
🧠 Step 4: Key Ideas and Assumptions
Differentiability:
The cost function must be smooth and differentiable (MSE is perfect for this).
Convexity:
The MSE curve is convex — shaped like a bowl — meaning there’s only one global minimum.
→ Gradient Descent will always find the best solution (no local traps).
Learning Rate Sensitivity:
$\eta$ determines how “fast” the model learns.
- Too high → jump over the minimum (diverge).
- Too low → take forever to converge.
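A quick way to see this sensitivity is a one-parameter sketch (illustrative, not from the text): minimize $J(\beta) = \beta^2$, whose gradient is $2\beta$, with three different learning rates:

```python
def descend(eta, beta=5.0, n_iters=20):
    """Run gradient descent on J(beta) = beta**2 (gradient = 2*beta)."""
    for _ in range(n_iters):
        beta = beta - eta * 2 * beta
    return beta

print(descend(eta=0.01))  # ~3.3    -> too small: barely moved after 20 steps
print(descend(eta=0.4))   # ~0.0    -> well chosen: converges quickly
print(descend(eta=1.1))   # ~191.7  -> too large: each step overshoots and the error grows
```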
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Gradient Descent scales well to large datasets.
- MSE gives smooth, easy-to-optimize surfaces.
- Works with many types of models, not just linear ones.
Limitations:
- Sensitive to outliers (MSE exaggerates their impact).
- Requires careful tuning of learning rate $\eta$.
- May converge slowly if features aren’t properly scaled.
The trade-off: smooth, scalable optimization, but sometimes slower learning or outlier bias.
In real ML, we often choose the cost function and optimizer together to balance smooth learning vs. robustness.
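One practical note on the scaling point above (a minimal sketch with made-up feature ranges, not from the text): standardizing features makes the cost bowl rounder, which is what lets a single learning rate work well for every parameter. The condition number of $X^\top X$ is a quick proxy for how stretched that bowl is:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up features on very different scales (e.g., rooms vs. square metres)
X = np.column_stack([rng.uniform(1, 5, 100), rng.uniform(50, 500, 100)])

# Standardize each column: zero mean, unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# A large condition number means a stretched cost surface and slow convergence;
# after scaling it drops close to 1.
print(np.linalg.cond(X.T @ X))                 # very large for the raw features
print(np.linalg.cond(X_scaled.T @ X_scaled))   # close to 1 after scaling
```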
🚧 Step 6: Common Misunderstandings
- “Learning rate doesn’t matter much” → It matters a lot. Even a small change can mean the difference between success and total divergence.
- “Gradient Descent always finds the global minimum” → Only true for convex functions like MSE. In deep learning (non-convex), there can be many minima.
- “MSE is always the best choice” → Not necessarily. For data with heavy outliers, MAE or Huber loss might perform better.
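A small sketch of that last point (illustrative numbers and a hand-rolled Huber function, not from the text): with one wild outlier, MSE explodes while MAE and Huber loss grow only modestly.

```python
import numpy as np

def huber(residuals, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    small = np.abs(residuals) <= delta
    return np.mean(np.where(small,
                            0.5 * residuals ** 2,
                            delta * (np.abs(residuals) - 0.5 * delta)))

residuals = np.array([0.5, -0.3, 0.2, 0.4])      # typical errors
residuals_outlier = np.append(residuals, 20.0)   # one wild outlier

for r in (residuals, residuals_outlier):
    print(np.mean(r ** 2),      # MSE: jumps from ~0.14 to ~80 once the outlier appears
          np.mean(np.abs(r)),   # MAE: grows only modestly
          huber(r))             # Huber: also grows modestly
```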
🧩 Step 7: Mini Summary
🧠 What You Learned: MSE quantifies model error, while Gradient Descent learns by reducing that error step by step.
⚙️ How It Works: Compute gradient → move in opposite direction → repeat until minimal error.
🎯 Why It Matters: Understanding cost and optimization is the beating heart of every ML model — it’s literally how machines learn.