3.2 Handling Imbalanced Data
🪄 Step 1: Intuition & Motivation
Core Idea: Sometimes, your model faces an unfair game — one class dominates the dataset. Imagine trying to detect fraud when only 1 out of 1000 transactions is fraudulent.
If your model simply predicts “not fraud” every time, it’ll boast 99.9% accuracy — and yet, it’s useless.
This is the curse of imbalanced data. Logistic Regression (like most ML models) treats every training example as equally important; when the classes aren't balanced, it fails silently by siding with the majority.
Simple Analogy: Think of a teacher with 100 students, where only one is misbehaving. If the teacher always says, “Everyone’s behaving fine,” they’ll be right 99% of the time — but still a terrible teacher.
We need to help the model pay more attention to the rare class — the “troublemaker.”
🌱 Step 2: Core Concept
There are three main strategies to fix imbalance:
- Class Weighting
- Resampling (Under/Oversampling)
- Synthetic Data Generation (SMOTE)
Let’s unpack each.
1️⃣ Class Weighting — Let the Model Care More
Instead of giving equal importance to all samples, we assign higher weights to the minority class in the loss function:
$$ J(\beta) = -\frac{1}{m}\sum_{i=1}^{m} w_{y_i}\left[y_i\log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right] $$

Here, $w_{y_i}$ is the weight assigned to sample $i$'s true class; it imposes a larger penalty when the model misclassifies rare examples.
In scikit-learn, this is as simple as:
`LogisticRegression(class_weight='balanced')`
When to use:
- When the dataset is large (avoid duplication).
- When imbalance is moderate (e.g., 1:10 or 1:100).
Why it helps: It tells the optimizer:
“Hey, missing a minority case hurts more than misclassifying a majority one.”
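A minimal sketch of what `class_weight='balanced'` does under the hood, using an illustrative 90:10 synthetic dataset (the numbers and the random data here are assumptions for demonstration): scikit-learn sets each class weight to `n_samples / (n_classes * n_c)`, so the rarer class gets the larger weight.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Illustrative 90:10 imbalance: 90 negatives, 10 positives
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(90, 2)),
               rng.normal(2, 1, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)

# 'balanced' computes w_c = n_samples / (n_classes * n_c)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)  # the minority class gets the larger weight (here 5.0 vs ~0.56)

# The same weighting applied inside training:
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

With 90 negatives and 10 positives, the positive-class weight is `100 / (2 * 10) = 5.0`, so each missed fraud case "hurts" the optimizer nine times as much as a missed majority case.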
2️⃣ Resampling — Balancing by Quantity
You can change the data itself to balance classes:
a. Undersampling the Majority Class
- Randomly remove samples from the dominant class.
- Simple, but you might lose valuable information.
- Works well when the dataset is huge and the majority class is redundant.
b. Oversampling the Minority Class
- Duplicate examples of the rare class to match the counts.
- Prevents the model from ignoring the minority, but risks overfitting — the model memorizes those repeated examples.
When to use:
- Small datasets (you can’t afford to drop data).
- Initial experiments or quick prototypes.
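Both resampling strategies can be sketched in a few lines of NumPy (the 990:10 dataset below is an illustrative assumption, not a real one): undersampling throws away majority rows, oversampling duplicates minority rows with replacement.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 990 + [1] * 10)   # illustrative 99:1 imbalance

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# a. Undersampling: randomly keep only as many majority samples as minority ones
keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
under_idx = np.concatenate([keep, min_idx])
X_under, y_under = X[under_idx], y[under_idx]

# b. Oversampling: duplicate minority samples (with replacement) up to the majority count
dup = rng.choice(min_idx, size=len(maj_idx), replace=True)
over_idx = np.concatenate([maj_idx, dup])
X_over, y_over = X[over_idx], y[over_idx]

print(np.bincount(y_under))  # [10 10]  — balanced, but 980 rows discarded
print(np.bincount(y_over))   # [990 990] — balanced, but minority rows repeated
```

The comments make the trade-off concrete: undersampling here discards 98% of the data, while oversampling repeats each minority row roughly 99 times on average.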
3️⃣ SMOTE — Synthesizing New Minority Samples
SMOTE (Synthetic Minority Over-sampling Technique) generates new, synthetic samples for the minority class — not just copies.
It works like this:
- For each minority sample, find its k nearest neighbors in feature space.
- Randomly pick one neighbor.
- Create a new sample between them (interpolation).
This way, SMOTE adds plausible new data points, not carbon copies.
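The three steps above can be sketched directly in NumPy (a minimal illustration of the interpolation idea, not the full SMOTE algorithm from the imbalanced-learn library; the function name and all parameters here are hypothetical):

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=20, seed=0):
    """Generate synthetic minority points by interpolating between each
    picked point and one of its k nearest minority neighbors (a sketch)."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude each point itself
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest neighbors per point
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))       # step 1: pick a minority sample
        j = rng.choice(nn[i])              # step 2: pick one of its neighbors
        lam = rng.random()                 # step 3: interpolate between them
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)

X_min = np.random.default_rng(1).normal(loc=3.0, size=(12, 2))
X_syn = smote_sample(X_min)
print(X_syn.shape)  # (20, 2) — new points, each on a segment between two real ones
```

Every synthetic point lies on a line segment between two real minority samples, which is exactly why SMOTE produces plausible neighbors rather than carbon copies.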
Why it’s better:
- Reduces overfitting (since new data isn’t identical).
- Expands the decision boundary for the minority class.
When to use:
- Severe imbalance (e.g., 1:500).
- Continuous features (not categorical).
Why Accuracy Fails (and What to Use Instead)
In imbalanced data, accuracy is deceptive. Example:
- 990 negatives, 10 positives
- Model predicts all negatives → 99% accuracy but 0% recall.
Instead, use metrics that focus on the minority class:
- Precision, Recall, and F1 Score
- ROC-AUC and PR-AUC (especially PR for extreme imbalance)
Forget “accuracy.” Focus on:
- Recall when missing positives is costly.
- Precision when false alarms are costly.
- F1 when you need a balance.
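The 990-negatives/10-positives example above can be verified directly with scikit-learn's metrics: the all-negative predictor scores 99% accuracy yet 0% recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 990 negatives, 10 positives; a model that always predicts "negative"
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))                     # 0.99
print(recall_score(y_true, y_pred))                       # 0.0 — every positive missed
print(precision_score(y_true, y_pred, zero_division=0))   # 0.0 — no positives predicted
print(f1_score(y_true, y_pred, zero_division=0))          # 0.0
```

`zero_division=0` just tells scikit-learn to return 0 (instead of warning) when the model predicts no positives at all, which is precisely the failure mode accuracy hides.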
📐 Step 3: Mathematical Foundation
Let’s peek at how weighting modifies the learning math.
Weighted Loss Function
- $w_{y_i}$ increases the cost of mistakes on minority samples.
- The optimizer now “feels” more pain when it misses rare cases.
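The weighted loss from Step 1 can be implemented in a few lines to make this "pain" concrete (the toy labels, predictions, and weights below are illustrative assumptions):

```python
import numpy as np

def weighted_log_loss(y, y_hat, w):
    """Weighted binary cross-entropy: each sample's loss is scaled by
    the weight of its true class (w[0] for negatives, w[1] for positives)."""
    wy = np.where(y == 1, w[1], w[0])
    return np.mean(-wy * (y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

y = np.array([0, 0, 0, 1])               # toy 3:1 imbalance
y_hat = np.array([0.1, 0.1, 0.1, 0.1])   # model confidently predicts "negative"

print(weighted_log_loss(y, y_hat, w=[1.0, 1.0]))  # unweighted loss
print(weighted_log_loss(y, y_hat, w=[1.0, 3.0]))  # minority miss now costs 3x more
```

With equal weights the confident-negative model looks tolerable; tripling the minority weight roughly triples the contribution of the missed positive, pushing the optimizer toward correcting it.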
🧠 Step 4: Assumptions or Key Ideas
- Data imbalance is a distribution problem, not a “model weakness.”
- Fixing it means rebalancing influence, not just “adding data.”
- SMOTE works best on continuous, not categorical features.
- Always evaluate using stratified cross-validation (preserve ratios).
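Stratified cross-validation is a one-line change in scikit-learn; a small sketch with an assumed 95:5 dataset shows that every test fold preserves the class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 95 + [1] * 5)     # illustrative 95:5 imbalance
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold keeps the original 95:5 ratio — exactly 1 positive per fold
    print(np.bincount(y[test_idx]))  # [19 1]
```

A plain `KFold` could easily produce folds with zero positives here, making recall undefined on those folds; stratification rules that out.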
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Increases model sensitivity to rare but important events.
- Techniques like SMOTE improve decision boundary representation.
- Weighted training is simple and effective for linear models like Logistic Regression.
Limitations:
- Oversampling can cause overfitting.
- Undersampling may discard valuable data.
- SMOTE struggles with categorical or high-dimensional data.
🚧 Step 6: Common Misunderstandings
- ❌ “Just oversample the minority class infinitely.” → That duplicates data and overfits — no new information is learned.
- ❌ “Balancing guarantees fairness.” → It helps, but bias can still persist if features are uninformative.
- ❌ “Accuracy is fine as a metric.” → Accuracy hides failures — always use recall/F1 or AUC.
🧩 Step 7: Mini Summary
🧠 What You Learned: Imbalanced data misleads models and metrics — fixing it involves rebalancing data or its influence.
⚙️ How It Works: Use class weighting, resampling, or SMOTE to make the model more sensitive to minority classes.
🎯 Why It Matters: These techniques ensure your model doesn’t overlook rare but critical outcomes — from fraud to disease detection.