3.2 Handling Imbalanced Data


🪄 Step 1: Intuition & Motivation

Core Idea: Sometimes, your model faces an unfair game — one class dominates the dataset. Imagine trying to detect fraud when only 1 out of 1000 transactions is fraudulent.

If your model simply predicts “not fraud” every time, it’ll boast 99.9% accuracy — and yet, it’s useless.

This is the curse of imbalanced data. Logistic Regression (like most ML models) treats every training example as equally important, so the loss is dominated by the majority class; when classes are imbalanced, the model fails silently.


Simple Analogy: Think of a teacher with 100 students, where only one is misbehaving. If the teacher always says, “Everyone’s behaving fine,” they’ll be right 99% of the time — but still a terrible teacher.

We need to help the model pay more attention to the rare class — the “troublemaker.”


🌱 Step 2: Core Concept

There are three main strategies to fix imbalance:

  1. Class Weighting
  2. Resampling (Under/Oversampling)
  3. Synthetic Data Generation (SMOTE)

Let’s unpack each.


1️⃣ Class Weighting — Let the Model Care More

Instead of giving equal importance to all samples, we assign higher weights to the minority class in the loss function:

$$ J(\beta) = -\frac{1}{m}\sum_{i=1}^{m} w_{y_i}[y_i\log(\hat{y_i}) + (1 - y_i)\log(1 - \hat{y_i})] $$

Here, $w_{y_i}$ applies a larger penalty when the model misclassifies an example from the rare class.

In scikit-learn, this is as simple as: `LogisticRegression(class_weight='balanced')`
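Here is a minimal sketch of class weighting on a synthetic dataset (the `make_classification` setup, the 99:1 ratio, and the parameter values are illustrative assumptions, not from the original example):

```python
# Minimal class-weighting sketch on a synthetic ~99:1 dataset (illustrative values).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# ~1% positives, ~99% negatives
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# 'balanced' sets each class weight inversely proportional to its frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test), digits=3))
```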

When to use:

  • When the dataset is large (avoid duplication).
  • When imbalance is moderate (e.g., 1:10 or 1:100).

Why it helps: It tells the optimizer:

“Hey, missing a minority case hurts more than misclassifying a majority one.”


2️⃣ Resampling — Balancing by Quantity

You can change the data itself to balance classes:

a. Undersampling the Majority Class

  • Randomly remove samples from the dominant class.
  • Simple, but you might lose valuable information.
  • Works well when the dataset is huge and the majority class is redundant.

b. Oversampling the Minority Class

  • Duplicate examples of the rare class to match the counts.
  • Prevents the model from ignoring the minority, but risks overfitting — the model memorizes those repeated examples.

When to use:

  • Small datasets (you can’t afford to drop data).
  • Initial experiments or quick prototypes.

Undersampling → lose data but avoid redundancy. Oversampling → keep all data but risk memorization. A minimal sketch of both appears below.
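A rough sketch of both options, assuming the imbalanced-learn package is installed and reusing `X_train`/`y_train` from the class-weighting sketch above (resampling is applied to the training split only, never to the test set):

```python
# Random over- and undersampling with imbalanced-learn (training data only).
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

print("original:    ", Counter(y_train))

# Oversampling: duplicate minority rows until the classes match
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
print("oversampled: ", Counter(y_over))

# Undersampling: drop majority rows until the classes match
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
print("undersampled:", Counter(y_under))
```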

3️⃣ SMOTE — Synthesizing New Minority Samples

SMOTE (Synthetic Minority Over-sampling Technique) generates new, synthetic samples for the minority class — not just copies.

It works like this:

  1. For each minority sample, find its k nearest neighbors in feature space.
  2. Randomly pick one neighbor.
  3. Create a new sample between them (interpolation).

This way, SMOTE adds plausible new data points, not carbon copies.

Why it’s better:

  • Reduces overfitting (since new data isn’t identical).
  • Expands the decision boundary for the minority class.

When to use:

  • Severe imbalance (e.g., 1:500).
  • Continuous features (not categorical).

SMOTE doesn’t “clone” — it blends: it imagines new, plausible examples that teach the model what minority patterns could look like. The sketch below shows a minimal run.
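A minimal SMOTE sketch, again with imbalanced-learn and the training split from the earlier sketches (the default `k_neighbors=5` is shown explicitly; it must be smaller than the number of available minority samples):

```python
# SMOTE: interpolate between minority samples and their nearest neighbors.
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5, random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

print("before SMOTE:", Counter(y_train))
print("after SMOTE: ", Counter(y_smote))
```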

Why Accuracy Fails (and What to Use Instead)

In imbalanced data, accuracy is deceptive. Example:

  • 990 negatives, 10 positives
  • Model predicts all negatives → 99% accuracy but 0% recall.

Instead, use metrics that focus on the minority class:

  • Precision, Recall, and F1 Score
  • ROC-AUC and PR-AUC (especially PR for extreme imbalance)

Forget “accuracy.” Focus on:

  • Recall when missing positives is costly.
  • Precision when false alarms are costly.
  • F1 when you need a balance.
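As a sketch, these metrics can be computed directly from the class-weighted model fitted earlier (the names `clf`, `X_test`, `y_test` come from that example):

```python
# Minority-focused evaluation of the fitted classifier.
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)

y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # probability of the positive (minority) class

print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_score))
print("PR-AUC   :", average_precision_score(y_test, y_score))  # area under the PR curve
```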

📐 Step 3: Mathematical Foundation

Let’s peek at how weighting modifies the learning math.


Weighted Loss Function
$$ J(\beta) = -\frac{1}{m}\sum_{i=1}^{m} w_{y_i}[y_i\log(\hat{y_i}) + (1 - y_i)\log(1 - \hat{y_i})] $$
  • $w_{y_i}$ increases the cost of mistakes on minority samples.
  • The optimizer now “feels” more pain when it misses rare cases.

Think of it like grading a test where wrong answers on “hard” questions (minority samples) lose more points. The model learns to focus there.
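A small NumPy sketch of this weighted loss; the tiny 9:1 toy data and the “balanced”-style weights $w_c = m / (2\,m_c)$ are illustrative assumptions:

```python
# Weighted cross-entropy from the formula above, on a tiny 9:1 toy example.
import numpy as np

def weighted_log_loss(y_true, y_prob, class_weight):
    w = np.where(y_true == 1, class_weight[1], class_weight[0])  # per-sample weight w_{y_i}
    y_prob = np.clip(y_prob, 1e-12, 1 - 1e-12)                   # avoid log(0)
    return -np.mean(w * (y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

y_true = np.array([0] * 9 + [1])        # 9 negatives, 1 positive
y_prob = np.array([0.1] * 9 + [0.2])    # the lone positive is badly under-predicted

m, counts = len(y_true), np.bincount(y_true)
balanced = {c: m / (2 * counts[c]) for c in (0, 1)}  # {0: ~0.56, 1: 5.0}

print(weighted_log_loss(y_true, y_prob, balanced))          # weighted: missing the positive hurts
print(weighted_log_loss(y_true, y_prob, {0: 1.0, 1: 1.0}))  # unweighted, for comparison
```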

🧠 Step 4: Assumptions or Key Ideas

  • Data imbalance is a distribution problem, not a “model weakness.”
  • Fixing it means rebalancing influence, not just “adding data.”
  • SMOTE works best on continuous, not categorical features.
  • Always evaluate using stratified cross-validation (preserve ratios).
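A minimal sketch of stratified evaluation with scikit-learn, reusing the synthetic `X`, `y` from the first sketch (the F1 scoring choice is an assumption that matches the minority-focused metrics above):

```python
# Stratified 5-fold CV: every fold keeps the original class ratio.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    X, y, cv=cv, scoring="f1",
)
print("F1 per fold:", scores.round(3))
```

If resampling such as SMOTE is used, it should be applied inside each training fold (for example via imbalanced-learn’s pipeline) rather than before splitting, to avoid leaking synthetic points into validation data.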

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Increases model sensitivity to rare but important events.
  • Techniques like SMOTE improve decision boundary representation.
  • Weighted training is simple and effective for linear models like Logistic Regression.

Limitations:

  • Oversampling can cause overfitting.
  • Undersampling may discard valuable data.
  • SMOTE struggles with categorical or high-dimensional data.

Balancing is about fair representation, not equality. Too much balancing can distort natural distributions; too little leaves the minority ignored. Find the “realistic middle.”

🚧 Step 6: Common Misunderstandings

  • “Just oversample the minority class infinitely.” → That duplicates data and overfits — no new information is learned.
  • “Balancing guarantees fairness.” → It helps, but bias can still persist if features are uninformative.
  • “Accuracy is fine as a metric.” → Accuracy hides failures — always use recall/F1 or AUC.

🧩 Step 7: Mini Summary

🧠 What You Learned: Imbalanced data misleads models and metrics — fixing it involves rebalancing data or its influence.

⚙️ How It Works: Use class weighting, resampling, or SMOTE to make the model more sensitive to minority classes.

🎯 Why It Matters: These techniques ensure your model doesn’t overlook rare but critical outcomes — from fraud to disease detection.
