3.2 Handling Imbalanced Data


🪄 Step 1: Intuition & Motivation

Core Idea: Sometimes, your model faces an unfair game — one class dominates the dataset. Imagine trying to detect fraud when only 1 out of 1000 transactions is fraudulent.

If your model simply predicts “not fraud” every time, it’ll boast 99.9% accuracy — and yet, it’s useless.

This is the curse of imbalanced data. Logistic Regression (like most ML models) treats every training example as equally important, so the loss is dominated by the majority class; when classes are imbalanced, the model fails silently.


Simple Analogy: Think of a teacher with 100 students, where only one is misbehaving. If the teacher always says, “Everyone’s behaving fine,” they’ll be right 99% of the time — but still a terrible teacher.

We need to help the model pay more attention to the rare class — the “troublemaker.”


🌱 Step 2: Core Concept

There are three main strategies to fix imbalance:

  1. Class Weighting
  2. Resampling (Under/Oversampling)
  3. Synthetic Data Generation (SMOTE)

Let’s unpack each.


1️⃣ Class Weighting — Let the Model Care More

Instead of giving equal importance to all samples, we assign higher weights to the minority class in the loss function:

$$ J(\beta) = -\frac{1}{m}\sum_{i=1}^{m} w_{y_i}[y_i\log(\hat{y_i}) + (1 - y_i)\log(1 - \hat{y_i})] $$

Here, $w_{y_i}$ applies a larger penalty when the model misclassifies an example from the rare class.

In scikit-learn, this is as simple as: `LogisticRegression(class_weight='balanced')`
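Here is a minimal sketch of class weighting on a synthetic dataset (the `make_classification` setup, the 99:1 ratio, and the parameter values are illustrative assumptions, not from the original example):

```python
# Minimal class-weighting sketch on a synthetic ~99:1 dataset (illustrative values).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# ~1% positives, ~99% negatives
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# 'balanced' sets each class weight inversely proportional to its frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test), digits=3))
```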

When to use:

  • When the dataset is large (avoid duplication).
  • When imbalance is moderate (e.g., 1:10 or 1:100).

Why it helps: It tells the optimizer:

“Hey, missing a minority case hurts more than misclassifying a majority one.”


2️⃣ Resampling — Balancing by Quantity

You can change the data itself to balance classes:

a. Undersampling the Majority Class

  • Randomly remove samples from the dominant class.
  • Simple, but you might lose valuable information.
  • Works well when the dataset is huge and the majority class is redundant.

b. Oversampling the Minority Class

  • Duplicate examples of the rare class to match the counts.
  • Prevents the model from ignoring the minority, but risks overfitting — the model memorizes those repeated examples.

When to use:

  • Small datasets (you can’t afford to drop data).
  • Initial experiments or quick prototypes.

Undersampling → lose data but avoid redundancy. Oversampling → keep all data but risk memorization. A minimal sketch of both appears below.
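A rough sketch of both options, assuming the imbalanced-learn package is installed and reusing `X_train`/`y_train` from the class-weighting sketch above (resampling is applied to the training split only, never to the test set):

```python
# Random over- and undersampling with imbalanced-learn (training data only).
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

print("original:    ", Counter(y_train))

# Oversampling: duplicate minority rows until the classes match
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
print("oversampled: ", Counter(y_over))

# Undersampling: drop majority rows until the classes match
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
print("undersampled:", Counter(y_under))
```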

3️⃣ SMOTE — Synthesizing New Minority Samples

SMOTE (Synthetic Minority Over-sampling Technique) generates new, synthetic samples for the minority class — not just copies.

It works like this:

  1. For each minority sample, find its k nearest neighbors in feature space.
  2. Randomly pick one neighbor.
  3. Create a new sample between them (interpolation).

This way, SMOTE adds plausible new data points, not carbon copies.

Why it’s better:

  • Reduces overfitting (since new data isn’t identical).
  • Expands the decision boundary for the minority class.

When to use:

  • Severe imbalance (e.g., 1:500).
  • Continuous features (not categorical).

SMOTE doesn’t “clone” — it blends: it imagines new, plausible examples that teach the model what minority patterns could look like. The sketch below shows a minimal run.
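A minimal SMOTE sketch, again with imbalanced-learn and the training split from the earlier sketches (the default `k_neighbors=5` is shown explicitly; it must be smaller than the number of available minority samples):

```python
# SMOTE: interpolate between minority samples and their nearest neighbors.
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5, random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

print("before SMOTE:", Counter(y_train))
print("after SMOTE: ", Counter(y_smote))
```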

Why Accuracy Fails (and What to Use Instead)

In imbalanced data, accuracy is deceptive. Example:

  • 990 negatives, 10 positives
  • Model predicts all negatives → 99% accuracy but 0% recall.

Instead, use metrics that focus on the minority class:

  • Precision, Recall, and F1 Score
  • ROC-AUC and PR-AUC (especially PR for extreme imbalance)

Forget “accuracy.” Focus on:

  • Recall when missing positives is costly.
  • Precision when false alarms are costly.
  • F1 when you need a balance.
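As a sketch, these metrics can be computed directly from the class-weighted model fitted earlier (the names `clf`, `X_test`, `y_test` come from that example):

```python
# Minority-focused evaluation of the fitted classifier.
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)

y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # probability of the positive (minority) class

print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_score))
print("PR-AUC   :", average_precision_score(y_test, y_score))  # area under the PR curve
```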

📐 Step 3: Mathematical Foundation

Let’s peek at how weighting modifies the learning math.


Weighted Loss Function
$$ J(\beta) = -\frac{1}{m}\sum_{i=1}^{m} w_{y_i}[y_i\log(\hat{y_i}) + (1 - y_i)\log(1 - \hat{y_i})] $$
  • $w_{y_i}$ increases the cost of mistakes on minority samples.
  • The optimizer now “feels” more pain when it misses rare cases.

Think of it like grading a test where wrong answers on “hard” questions (minority samples) lose more points. The model learns to focus there.
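A small NumPy sketch of this weighted loss; the tiny 9:1 toy data and the “balanced”-style weights $w_c = m / (2\,m_c)$ are illustrative assumptions:

```python
# Weighted cross-entropy from the formula above, on a tiny 9:1 toy example.
import numpy as np

def weighted_log_loss(y_true, y_prob, class_weight):
    w = np.where(y_true == 1, class_weight[1], class_weight[0])  # per-sample weight w_{y_i}
    y_prob = np.clip(y_prob, 1e-12, 1 - 1e-12)                   # avoid log(0)
    return -np.mean(w * (y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

y_true = np.array([0] * 9 + [1])        # 9 negatives, 1 positive
y_prob = np.array([0.1] * 9 + [0.2])    # the lone positive is badly under-predicted

m, counts = len(y_true), np.bincount(y_true)
balanced = {c: m / (2 * counts[c]) for c in (0, 1)}  # {0: ~0.56, 1: 5.0}

print(weighted_log_loss(y_true, y_prob, balanced))          # weighted: missing the positive hurts
print(weighted_log_loss(y_true, y_prob, {0: 1.0, 1: 1.0}))  # unweighted, for comparison
```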

🧠 Step 4: Assumptions or Key Ideas

  • Data imbalance is a distribution problem, not a “model weakness.”
  • Fixing it means rebalancing influence, not just “adding data.”
  • SMOTE works best on continuous, not categorical features.
  • Always evaluate using stratified cross-validation (preserve ratios).
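A minimal sketch of stratified evaluation with scikit-learn, reusing the synthetic `X`, `y` from the first sketch (the F1 scoring choice is an assumption that matches the minority-focused metrics above):

```python
# Stratified 5-fold CV: every fold keeps the original class ratio.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    X, y, cv=cv, scoring="f1",
)
print("F1 per fold:", scores.round(3))
```

If resampling such as SMOTE is used, it should be applied inside each training fold (for example via imbalanced-learn’s pipeline) rather than before splitting, to avoid leaking synthetic points into validation data.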

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Increases model sensitivity to rare but important events.
  • Techniques like SMOTE improve decision boundary representation.
  • Weighted training is simple and effective for linear models like Logistic Regression.

Limitations:

  • Oversampling can cause overfitting.
  • Undersampling may discard valuable data.
  • SMOTE struggles with categorical or high-dimensional data.

Balancing is about fair representation, not equality. Too much balancing can distort natural distributions; too little leaves the minority ignored. Find the “realistic middle.”

🚧 Step 6: Common Misunderstandings

  • “Just oversample the minority class infinitely.” → That duplicates data and overfits — no new information is learned.
  • “Balancing guarantees fairness.” → It helps, but bias can still persist if features are uninformative.
  • “Accuracy is fine as a metric.” → Accuracy hides failures — always use recall/F1 or AUC.

🧩 Step 7: Mini Summary

🧠 What You Learned: Imbalanced data misleads models and metrics — fixing it involves rebalancing data or its influence.

⚙️ How It Works: Use class weighting, resampling, or SMOTE to make the model more sensitive to minority classes.

🎯 Why It Matters: These techniques ensure your model doesn’t overlook rare but critical outcomes — from fraud to disease detection.
