3.4 Multiclass Logistic Regression
🪄 Step 1: Intuition & Motivation
Core Idea: So far, we’ve been living in a binary world — “spam vs. not spam,” “disease vs. no disease,” “cat vs. not cat.” 🐱🚫
But the real world rarely plays by binary rules. Emails can be spam, promotion, or social. Animals can be cat, dog, bird, or elephant.
To handle such cases, we extend Logistic Regression from binary to multiclass classification. And how do we do that? With two powerful strategies:
- One-vs-Rest (OvR) — many small battles
- Softmax (Multinomial) — one grand competition
Simple Analogy: Imagine an election 🗳️. Each class (candidate) campaigns for your vote (probability).
- In OvR, each candidate runs separately — “Me vs. Everyone Else.”
- In Softmax, all candidates compete simultaneously, and each receives a probability proportional to their “vote share” — the candidate with the largest share wins.
Both approaches pick a winner, but the game mechanics differ.
🌱 Step 2: Core Concept
Let’s unpack both strategies step by step.
1️⃣ One-vs-Rest (OvR) — Divide and Conquer
In One-vs-Rest, we train one binary Logistic Regression model per class:
For $K$ classes → train $K$ models.
Each model $M_k$ predicts:
“Is this sample class k or not?”
Mathematically:
$$ P(y = k | x) = \frac{1}{1 + e^{-(\beta_{0,k} + x^T \beta_k)}} $$

During prediction:
- Compute all $K$ probabilities.
- Choose the class with the highest probability.
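To make the mechanics concrete, here is a minimal OvR sketch built by hand on top of scikit-learn’s binary `LogisticRegression`. The Iris dataset and variable names are purely illustrative choices, not part of the method itself:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)            # 3 classes: 0, 1, 2
classes = np.unique(y)

# Train one binary model per class: "class k" vs. "everything else"
models = []
for k in classes:
    m = LogisticRegression(max_iter=1000)
    m.fit(X, (y == k).astype(int))
    models.append(m)

# Prediction: score every class, pick the one with the highest probability
probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
y_pred = classes[np.argmax(probs, axis=1)]

print(probs[0])            # per-class scores — note they need not sum to 1
print(y_pred[:5], y[:5])
```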
✅ Advantages:
- Simple and scalable.
- Each model is independent (parallel training possible).
⚠️ Drawbacks:
- Models can overlap — probabilities may not sum to 1.
- Can be inconsistent when classes are highly correlated.
2️⃣ Softmax (Multinomial Logistic Regression) — All-in-One Competition
The Softmax function generalizes the sigmoid to multiple classes:
$$ P(y = k | x) = \frac{e^{x^T \beta_k}}{\sum_{j=1}^{K} e^{x^T \beta_j}} $$

This ensures that:
- All probabilities are positive.
- They all sum to 1.
The class with the highest probability wins — simple and elegant.
Why Softmax Works:
- Each class’s score ($x^T \beta_k$) represents how strongly the features support that class.
- The exponentiation amplifies high scores and suppresses low ones — turning “votes” into probabilities.
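Here’s a minimal NumPy sketch of the Softmax function; the three class scores are made up purely to illustrate the amplification and normalization:

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores x^T beta_k into probabilities."""
    # Subtracting the max is a standard numerical-stability trick;
    # it doesn't change the result because Softmax is shift-invariant.
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # made-up scores for 3 classes
p = softmax(scores)
print(p)            # ~[0.659, 0.242, 0.099] — high scores get amplified
print(p.sum())      # 1.0 — a proper probability distribution
```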
✅ Advantages:
- True probability distribution across all classes.
- Joint optimization ensures better calibration and consistency.
⚠️ Drawbacks:
- Computationally heavier (optimizing all $K$ weight vectors together).
- Harder to parallelize.
3️⃣ Computational Trade-offs: OvR vs. Softmax
| Aspect | One-vs-Rest (OvR) | Softmax (Multinomial) |
|---|---|---|
| # Models | K independent binary models | One unified model |
| Training Parallelism | Easy (train each separately) | Coupled (all classes trained together) |
| Probabilities Sum to 1? | No | Yes |
| Interpretability | Straightforward per class | More complex (joint weights) |
| Best For | Large label sets where scalability matters | Smaller, well-defined class sets needing calibrated probabilities |
- OvR: Great for scalable, large K (e.g., text classification with thousands of labels).
- Softmax: Ideal for smaller, well-defined class sets (e.g., MNIST digits 0–9).
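If you want to compare both strategies side by side, a sketch like the one below works with scikit-learn: `OneVsRestClassifier` wraps a binary learner into K independent models, while a plain `LogisticRegression` fits the joint multinomial (Softmax) model for multiclass targets in recent library versions. The dataset, scaling, and hyperparameters are just illustrative assumptions:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_digits(return_X_y=True)      # 10 classes (digits 0-9)
X = X / 16.0                             # simple pixel scaling to help convergence
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Softmax (multinomial): one joint model over all 10 classes
softmax_clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)

# One-vs-Rest: 10 independent binary models, one per digit
ovr_clf = OneVsRestClassifier(LogisticRegression(max_iter=5000)).fit(X_tr, y_tr)

print("Softmax accuracy:", softmax_clf.score(X_te, y_te))
print("OvR accuracy:    ", ovr_clf.score(X_te, y_te))
```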
📐 Step 3: Mathematical Foundation
The Softmax Function Explained
- Numerator: The “score” for class $k$.
- Denominator: The total score across all classes (normalization).
Properties:
- Each probability $P(y=k|x) \in (0, 1)$
- $\sum_{k=1}^{K} P(y=k|x) = 1$
This is why it’s called multinomial logistic regression — we’re modeling the probability distribution over multiple outcomes.
Connection to Binary Logistic Regression
When $K = 2$, Softmax simplifies to the sigmoid function:
$$ P(y=1|x) = \frac{e^{x^T \beta_1}}{e^{x^T \beta_0} + e^{x^T \beta_1}} = \frac{1}{1 + e^{-x^T (\beta_1 - \beta_0)}} $$

So, binary logistic regression is just a special case of Softmax with two classes.
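A quick numerical check of this equivalence — the two weight vectors below are arbitrary, chosen only for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0])
beta_0 = np.array([0.5, -1.0])     # arbitrary weights for class 0
beta_1 = np.array([1.5, 0.3])      # arbitrary weights for class 1

# Two-class Softmax: probability of class 1
scores = np.array([x @ beta_0, x @ beta_1])
p_softmax = np.exp(scores[1]) / np.exp(scores).sum()

# Sigmoid applied to the *difference* of the weight vectors
p_sigmoid = 1.0 / (1.0 + np.exp(-x @ (beta_1 - beta_0)))

print(p_softmax, p_sigmoid)   # identical up to floating-point error
```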
🧠 Step 4: Assumptions or Key Ideas
- Classes are mutually exclusive — each observation belongs to exactly one class.
- For Softmax, all classes are learned jointly — meaning class relationships influence one another.
- OvR assumes independence between models — no coordination across classes.
⚖️ Step 5: Strengths, Limitations & Trade-offs
- Extends Logistic Regression naturally to multi-class tasks.
- Softmax provides a true probability distribution across classes.
- OvR is modular, simple, and parallelizable.
- Softmax is computationally heavier — needs all class probabilities at once.
- OvR can produce inconsistent results (e.g., multiple classes predicted “yes”).
- Assumes classes don’t overlap — can struggle with ambiguous boundaries.
Trade-off:
- OvR = scalable, but fragmented learning.
- Softmax = consistent, but computationally demanding. Pick based on dataset size, class count, and consistency needs.
🚧 Step 6: Common Misunderstandings
- ❌ “Softmax is only for deep learning.” → Nope! Softmax originated in Logistic Regression long before neural networks.
- ❌ “OvR probabilities must sum to 1.” → They don’t — each model works independently.
- ❌ “Softmax is nonlinear like neural nets.” → The transformation is nonlinear, but the relationship between inputs and log-odds remains linear.
🧩 Step 7: Mini Summary
🧠 What You Learned: Multiclass Logistic Regression extends binary classification using either One-vs-Rest or Softmax.
⚙️ How It Works: OvR trains multiple binary models, while Softmax jointly models class probabilities that sum to 1.
🎯 Why It Matters: It bridges traditional ML and deep learning — forming the mathematical foundation of the final layer in neural networks.