1.2 Dive into the Cost Function — The Log-Likelihood
🪄 Step 1: Intuition & Motivation
Core Idea:
Now that we can make probability predictions with Logistic Regression, the next question is:
🧠 “How do we teach the model to make better predictions?”
In other words — how do we train it?
That’s where the log-likelihood (and its negative twin, the cost function) comes in.
It’s like a teacher giving feedback after every prediction: “How close was your guess to the truth?”
The model learns by tweaking its parameters (βs) to make its predicted probabilities as close as possible to the actual outcomes.
Simple Analogy:
Think of playing a guessing game.
You predict how likely someone is to like a movie.
- If you say “80% chance” and they do like it → 👍 you were confident and right.
- If you say “10% chance” and they like it → 😬 you were confident and wrong.
Logistic Regression uses the log-likelihood to reward confident, correct guesses and punish confident, wrong ones.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
The model predicts a probability for each training point:
$\hat{y_i} = P(y_i = 1 | x_i) = \frac{1}{1 + e^{-z_i}}$, where $z_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_n x_{in}$ is the linear combination of the features.
We want these predicted probabilities to match the actual labels $y_i \in \{0,1\}$.
So, we define a likelihood function, which measures how likely it is that the observed labels came from the model’s predicted probabilities:
$$ L(\beta) = \prod_{i=1}^m \hat{y_i}^{y_i} (1 - \hat{y_i})^{(1 - y_i)} $$
This is the product of all the model’s predicted probabilities for the actual outcomes.
Then we take the log (to make math friendlier):
$$ \log L(\beta) = \sum_{i=1}^m [y_i\log(\hat{y_i}) + (1 - y_i)\log(1 - \hat{y_i})] $$
Finally, because optimization libraries like to minimize things, we take the negative:
$$ J(\beta) = -\frac{1}{m}\sum_{i=1}^m [y_i\log(\hat{y_i}) + (1 - y_i)\log(1 - \hat{y_i})] $$
This is our cost function — the negative log-likelihood (also known as binary cross-entropy loss).
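To make the formulas concrete, here is a minimal NumPy sketch (assuming an already-chosen $\beta$ and a tiny made-up dataset) that computes the predicted probabilities, the likelihood, the log-likelihood, and the cost $J(\beta)$:

```python
import numpy as np

# Tiny made-up dataset: 4 samples, 2 features (values are purely illustrative)
X = np.array([[ 0.5,  1.2],
              [ 1.5, -0.3],
              [-1.0,  0.8],
              [ 2.0,  1.0]])
y = np.array([1, 1, 0, 1])

beta_0 = 0.1                     # intercept (arbitrary starting value)
beta   = np.array([0.8, -0.5])   # weights (arbitrary starting values)

# z_i = beta_0 + beta . x_i, then the sigmoid turns scores into probabilities
z = beta_0 + X @ beta
y_hat = 1.0 / (1.0 + np.exp(-z))

# Likelihood: product over samples of the probability assigned to the true label
likelihood = np.prod(y_hat**y * (1 - y_hat)**(1 - y))

# Log-likelihood: sum of log-probabilities (numerically friendlier than the product)
log_likelihood = np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Cost J(beta): average negative log-likelihood, i.e. binary cross-entropy
cost = -log_likelihood / len(y)

print(f"likelihood={likelihood:.4f}  log-likelihood={log_likelihood:.4f}  cost={cost:.4f}")
```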
Why It Works This Way
If we used Mean Squared Error (MSE) like in Linear Regression, squashing predictions through the sigmoid would make the loss surface non-convex in the parameters (wavy, with plateaus and potential local minima). That means optimization might get stuck in bad spots.
The negative log-likelihood, on the other hand, produces a convex loss surface shaped like a nice, smooth bowl, so any minimum gradient descent finds is the global minimum.
So, minimizing $J(\beta)$ is equivalent to maximizing the probability of observing the true labels — that’s why it’s called Maximum Likelihood Estimation (MLE).
How It Fits in ML Thinking
The log-likelihood is the foundation of many ML algorithms — not just Logistic Regression.
In broader ML thinking:
- It’s part of probabilistic learning — we fit models that make data most probable.
- It’s also the basis for cross-entropy loss, used in modern deep learning for classification tasks.
So yes — your friendly Logistic Regression is secretly the ancestor of neural network training!
📐 Step 3: Mathematical Foundation
Negative Log-Likelihood (Cost Function)
$$ J(\beta) = -\frac{1}{m}\sum_{i=1}^m [y_i\log(\hat{y_i}) + (1 - y_i)\log(1 - \hat{y_i})] $$
where:
- $m$ = number of samples
- $y_i$ = true label (0 or 1)
- $\hat{y_i}$ = predicted probability
- The minus sign turns maximizing the log-likelihood into minimizing the cost, which is what optimization libraries expect
Each term contributes:
- A large penalty if the model is confidently wrong ($\hat{y_i}$ far from $y_i$)
- A small penalty if the model is close
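To see that asymmetry in numbers, here is a small sketch of the per-sample penalty $-[y\log(\hat{y}) + (1 - y)\log(1 - \hat{y})]$ for a few illustrative predictions (the probabilities below are made up):

```python
import numpy as np

def per_sample_loss(y_true, y_hat):
    """One term of the negative log-likelihood (binary cross-entropy)."""
    return -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

# Illustrative cases: (true label, predicted probability)
cases = [(1, 0.95),   # confident and right -> tiny penalty     (~0.05)
         (1, 0.60),   # mildly confident    -> moderate penalty (~0.51)
         (1, 0.05)]   # confident and wrong -> huge penalty     (~3.00)

for y_true, y_hat in cases:
    print(f"y={y_true}, y_hat={y_hat:.2f} -> penalty={per_sample_loss(y_true, y_hat):.3f}")
```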
Gradient of the Cost Function
$$ \frac{\partial J(\beta)}{\partial \beta_j} = \frac{1}{m}\sum_{i=1}^m (\hat{y_i} - y_i)\,x_{ij} $$
where:
- $(\hat{y_i} - y_i)$ = the error term (difference between predicted and actual)
- $x_{ij}$ = the $j$th feature of sample $i$
This tells us how to adjust each weight $\beta_j$ (for a positive feature value $x_{ij}$):
- If prediction > true value → the gradient is positive → decrease $\beta_j$
- If prediction < true value → the gradient is negative → increase $\beta_j$
It’s the same logic as Linear Regression’s gradient — except here, the error is on probabilities, not raw values.
Gradient Descent simply says:
“Move the parameters in the direction that makes your wrong guesses less wrong.”
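Putting the gradient and that update rule together, here is a minimal batch gradient descent sketch for Logistic Regression (the toy data, learning rate, and iteration count are illustrative choices, not tuned values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Minimize the negative log-likelihood with batch gradient descent."""
    m, n = X.shape
    X_b = np.hstack([np.ones((m, 1)), X])    # prepend a column of 1s for the intercept
    beta = np.zeros(n + 1)                   # start all parameters at zero

    for _ in range(n_iters):
        y_hat = sigmoid(X_b @ beta)          # predicted probabilities
        gradient = X_b.T @ (y_hat - y) / m   # (1/m) * sum_i (y_hat_i - y_i) * x_ij
        beta -= lr * gradient                # move against the gradient
    return beta

# Toy data where the label depends on a linear score (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

beta = fit_logistic_regression(X, y)
print("Learned parameters (intercept first):", beta)
```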
🧠 Step 4: Assumptions or Key Ideas
- Convexity: The negative log-likelihood for Logistic Regression is convex in the parameters (equivalently, the log-likelihood is concave), because each term is a convex function of the linear score $z_i$.
- Independence: Each training example is assumed independent — this allows the likelihood to be expressed as a product.
- Correct Model Form: The relationship between log-odds and features is linear.
These keep the math clean and the optimization stable.
⚖️ Step 5: Strengths, Limitations & Trade-offs
- The cost function is convex, guaranteeing a single global minimum.
- It naturally connects to probabilistic reasoning — interpretable and elegant.
- Used as the foundation for deep learning classification losses.
- Assumes independence — may not hold for time-dependent or correlated data.
- Can struggle numerically when predicted probabilities reach exactly 0 or 1 (the log terms blow up); a common fix is clipping, shown in the sketch below.
- Doesn’t inherently handle class imbalance.
While MSE is simpler, log-likelihood gives a smoother, safer learning path — like swapping a rocky hiking trail for a paved one.
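As a small illustration of the near-0/near-1 issue above, here is a hedged sketch of a binary cross-entropy helper that clips predicted probabilities away from exactly 0 and 1 (the epsilon of 1e-15 is a common but arbitrary choice):

```python
import numpy as np

def binary_cross_entropy(y_true, y_hat, eps=1e-15):
    """Average negative log-likelihood with clipping to avoid log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # keep probabilities strictly inside (0, 1)
    return -np.mean(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

y_true = np.array([1.0, 0.0, 1.0])
y_hat  = np.array([1.0, 0.0, 0.7])          # two "perfectly confident" predictions
print(binary_cross_entropy(y_true, y_hat))  # finite, thanks to clipping
```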
🚧 Step 6: Common Misunderstandings
🚨 Common Misunderstandings (Click to Expand)
- ❌ “Cross-entropy is different from log-likelihood.” → Binary cross-entropy is just the negative log-likelihood (averaged over samples), so minimizing one maximizes the other.
- ❌ “MSE could work fine too.” → In practice, it creates non-convex optimization and unstable learning.
- ❌ “The cost is about distance.” → No — it measures how probable your predictions are, not their numeric closeness.
🧩 Step 7: Mini Summary
🧠 What You Learned: The log-likelihood measures how probable your observed data is under the model’s predictions — maximizing it trains the model.
⚙️ How It Works: It penalizes confident wrong predictions heavily, producing a convex optimization surface.
🎯 Why It Matters: This cost function ensures stable learning and is the foundation for modern classification losses.