1. Revisit the Optimization Objective — Cost Function Foundation

4 min read 811 words

🪄 Step 1: Intuition & Motivation

  • Core Idea: Before we can “teach” a model how to make predictions, we must first define what it means to be wrong. The cost function is our way of telling the model, “Here’s how bad your guesses are.” Then, gradient descent becomes the way the model says, “Okay, I’ll adjust my knobs (weights) to reduce that badness.”

  • Simple Analogy: Imagine you’re standing on a dark mountain with a flashlight, trying to reach the lowest valley (the best solution). The height of the mountain represents your cost (error). You can’t see far ahead, but by feeling the slope (gradient) under your feet, you can slowly step downhill until you reach the lowest point — that’s gradient descent!


🌱 Step 2: Core Concept

Let’s start with the why and what’s happening before diving into equations.

What’s Happening Under the Hood?

When we train a model, it predicts some output $h_\theta(x_i)$ for each input $x_i$. We compare that prediction to the real answer $y_i$. If they differ, we compute how far off it is — that’s our error.

Then, we combine all those errors (squaring or log-scaling them first, depending on the model) into one single number — the cost function, denoted as $J(\theta)$. This number tells us, “Given the current weights ($\theta$), how bad is our model overall?”
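Here is a minimal sketch of that idea in NumPy. The array names and toy values are made up for illustration; the point is simply that many per-example errors collapse into one scalar score.

```python
import numpy as np

# Toy data (made up): actual answers y_i and the model's guesses h_theta(x_i).
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 9.0])

errors = y_pred - y_true              # how far off each individual guess is
cost = np.mean(errors ** 2) / 2       # collapse all errors into one number J(theta)
print(cost)                           # a single scalar summarizing "how bad" the model is
```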

Why It Works This Way

We need a single, smooth surface to measure performance — something we can minimize systematically. That’s why we use Mean Squared Error (MSE) for Linear Regression and Cross-Entropy for Logistic Regression. Both are continuous, differentiable, and (for these models) convex, so we can use calculus to find the bottom.

How It Fits in ML Thinking

Nearly every machine learning model shares one training goal: minimize a cost function. The cost function defines what success means for the model. Gradient descent is simply the method for reaching that success — by gradually adjusting parameters to minimize that cost.

📐 Step 3: Mathematical Foundation

Let’s unpack the math — gently.

Mean Squared Error (Linear Regression)
$$ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2 $$
  • $J(\theta)$: The cost (how “wrong” our model is).
  • $m$: Number of data points.
  • $h_\theta(x_i)$: Predicted value using current parameters $\theta$.
  • $y_i$: Actual value from data.
  • $(h_\theta(x_i) - y_i)^2$: Squared error for each prediction.
We’re simply measuring how far off our guesses are, squaring the distance so that negative and positive errors don’t cancel out. Then we average them — a kind of “fair score” across all points. The extra $\frac{1}{2}$ factor just makes the derivative math cleaner later: it cancels the 2 that appears when we differentiate the square.
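A small sketch of the same formula in code, assuming a linear hypothesis $h_\theta(x) = X\theta$ with a bias column of ones; the helper name and numbers are made up for illustration.

```python
import numpy as np

def mse_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((h_theta(x_i) - y_i)^2) for linear regression."""
    m = len(y)
    residuals = X @ theta - y                 # h_theta(x_i) - y_i for every example
    return (residuals ** 2).sum() / (2 * m)

# Toy data (made up): first column of X is the bias term, second is a single feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(mse_cost(np.array([0.0, 2.0]), X, y))   # perfect fit, cost 0.0
print(mse_cost(np.array([0.0, 1.0]), X, y))   # worse parameters, higher cost (~2.33)
```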
Cross-Entropy (Logistic Regression)
$$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} [y_i \log(h_\theta(x_i)) + (1 - y_i)\log(1 - h_\theta(x_i))] $$
  • $y_i$: Actual label (0 or 1).
  • $h_\theta(x_i)$: Model’s predicted probability (from sigmoid).
  • The two terms ensure that we penalize confident but wrong predictions heavily.
Think of Cross-Entropy as a “truth detector”: it punishes the model more when it confidently predicts the wrong class. If you said, “I’m 99% sure this is a cat,” but it’s actually a dog — the penalty will be steep.
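Here is a hedged sketch of the cross-entropy formula, with the “confident but wrong” penalty shown on made-up probabilities; the small eps clip is just a standard guard against log(0).

```python
import numpy as np

def cross_entropy_cost(y_true, p_pred, eps=1e-12):
    """J = -1/m * sum(y*log(p) + (1-y)*log(1-p)); eps keeps log() away from 0."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Made-up example: the true label is 1 ("cat").
print(cross_entropy_cost(np.array([1.0]), np.array([0.99])))  # confident and right: ~0.01
print(cross_entropy_cost(np.array([1.0]), np.array([0.01])))  # confident and wrong: ~4.61, a steep penalty
```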

🧠 Step 4: Assumptions or Key Ideas

  • We assume the cost surface is convex (bowl-shaped). This means there’s one lowest point — the global minimum — where the model performs best.
  • We assume the cost function is differentiable, so we can compute the gradient (slope) and move downhill smoothly.
ℹ️ If the surface were bumpy with multiple valleys, our model could get stuck in a small valley (local minimum) instead of the global one. Linear and Logistic Regression are nice — they give us a smooth, convex surface.
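You can see the bowl shape directly by sweeping one weight over a range and evaluating the MSE cost at each value. The toy data below is made up; the cost falls to a single minimum and then rises again, with no other valleys to get stuck in.

```python
import numpy as np

# Made-up data generated by y = 2x, so the best single weight is theta = 2.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

for theta in np.linspace(0.0, 4.0, 9):
    cost = np.mean((theta * x - y) ** 2) / 2
    print(f"theta={theta:.1f}  J={cost:.3f}")   # costs decrease to one minimum at theta=2, then increase
```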

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:
  • Clear, mathematically defined objective.
  • Smooth and differentiable — easy to optimize.
  • Directly connects “model error” with “parameter updates.”

Limitations:
  • Sensitive to outliers (especially MSE; see the sketch after this list).
  • Assumes the model structure can represent the true relationship.
  • Doesn’t handle non-linear data directly.

Choosing the right cost function is a balance between interpretability and robustness. MSE is simple and interpretable, while Cross-Entropy gives probabilistic depth — but both assume good data scaling and meaningful features.
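As a quick illustration of the outlier issue (with made-up residuals), squaring makes one large error dominate the whole cost:

```python
import numpy as np

errors = np.array([0.5, -0.3, 0.2, 0.1])           # typical small residuals (made up)
errors_with_outlier = np.append(errors, 10.0)      # one wild mis-prediction added

print(np.mean(errors ** 2))               # ~0.10
print(np.mean(errors_with_outlier ** 2))  # ~20.08: a single outlier dominates the MSE
```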

🚧 Step 6: Common Misunderstandings

  • “Minimizing cost is just about accuracy.” Not quite — cost captures how confident or consistent your predictions are, not just whether they’re right or wrong.

  • “MSE and Cross-Entropy are interchangeable.” No — MSE works for continuous targets; Cross-Entropy for probabilities and classification.

  • “Convex functions are everywhere.” Only for simple models like Linear and Logistic Regression! Once you move to deep learning, the cost surface is non-convex, so multiple local minima (and saddle points) exist.


🧩 Step 7: Mini Summary

🧠 What You Learned: The cost function is how we measure the model’s wrongness — it’s the compass for optimization.

⚙️ How It Works: It sums up all prediction errors into one number, which gradient descent will try to minimize.

🎯 Why It Matters: Without defining a cost function, the model has no direction — it’s like hiking without a map.
