2.1. Handling Missing Values


🪄 Step 1: Intuition & Motivation

  • Core Idea: Missing values are like gaps in a story. If you’re reading a novel and random pages are torn out, you can still try to guess what happened — but your guess might be wrong. Similarly, in machine learning, when parts of your dataset are missing, your model can’t “read the full story.”

    Handling missing values means deciding whether to:

    1. Fill in the blanks (impute),
    2. Ignore them, or
    3. Remove them — all while trying not to distort the truth hidden in your data.
  • Simple Analogy: Think of data like a classroom attendance sheet. If a few students were absent, you might estimate their average score based on classmates (imputation). But if half the class didn’t show up, your test average won’t mean much. So, sometimes you fill in gaps, and sometimes, you simply skip that column — that’s the art of missing value handling.


🌱 Step 2: Core Concept

Let’s break down how missing values are detected, understood, and handled step-by-step.


What’s Happening Under the Hood?

When you see a missing value (NaN, NULL, or blank), it’s not just “empty.” It signals something meaningful:

  • Maybe the data was never collected (sensor failure).
  • Maybe it doesn’t apply (e.g., “Age” missing for a newborn).
  • Or maybe it’s deliberately suppressed (privacy).

So the first step is to understand the reason behind the missingness.

Then, you decide on a strategy:

  • Deletion — remove the rows or columns.
  • Imputation — fill the missing entries with “educated guesses.”
  • Flagging — add an indicator variable (e.g., “was_missing”) to retain information about the gap itself.

Each of these changes the data — and therefore, the story your model learns.
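
For concreteness, here's a minimal pandas sketch of all three strategies on a tiny invented frame (the column names and values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy data with gaps (values invented for illustration)
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan],
    "income": [50_000, 62_000, np.nan, 75_000, 58_000],
})

# Detect: count missing entries per column
print(df.isna().sum())

# 1. Deletion: drop any row containing a missing value
dropped = df.dropna()

# 2. Imputation: fill gaps with an "educated guess" (column median here)
imputed = df.fillna(df.median(numeric_only=True))

# 3. Flagging: keep an indicator so the model knows a gap existed
flagged = df.copy()
flagged["age_was_missing"] = df["age"].isna().astype(int)
flagged["age"] = df["age"].fillna(df["age"].median())
```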


Why It Works This Way

Because models can't learn from blanks. Mathematical operations (like mean and variance) break when they encounter NaN, and distance-based models (like KNN) can't compute similarities without numbers.

So imputation isn’t just about aesthetics — it’s mathematically essential. But good imputation respects the structure of the data:

  • Continuous features → numeric strategies (mean, median, KNN, regression).
  • Categorical features → mode or frequency-based imputation.
  • Time series → forward-fill or interpolation.

The trick is to replace missing values in a way that preserves distribution and relationships as much as possible.
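
Here's a short sketch of those type-aware defaults in pandas (toy series with invented values):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=4)  # daily readings

# Continuous feature -> numeric strategy (median here)
temp = pd.Series([21.5, np.nan, 23.0, 22.1], index=idx)
temp = temp.fillna(temp.median())

# Categorical feature -> mode (most frequent category)
color = pd.Series(["red", np.nan, "blue", "red"], index=idx)
color = color.fillna(color.mode()[0])   # fills with "red"

# Time series -> forward-fill or interpolation
ts = pd.Series([1.0, np.nan, np.nan, 4.0], index=idx)
carried = ts.ffill()         # carry the last observation forward
smoothed = ts.interpolate()  # linear: 1.0, 2.0, 3.0, 4.0
```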


How It Fits in ML Thinking

Handling missing values is one of the earliest — and most consequential — decisions in a pipeline.

Why? Because it affects everything downstream:

  • Feature scaling assumes no NaNs.
  • Model training assumes valid statistics.
  • Even feature selection or PCA breaks if NaNs exist.

So before you build anything fancy, you must ensure your data is consistent, complete, and meaningful. Think of it as fixing the foundation before constructing the ML skyscraper.


📐 Step 3: Mathematical Foundation

Let’s make imputation intuitive by exploring a few core mathematical ideas.


Mean Imputation
$$ x_i' = \begin{cases} x_i, & \text{if } x_i \text{ is not missing} \\ \bar{x}, & \text{if } x_i \text{ is missing} \end{cases} $$
  • $x_i'$: the imputed value for sample $i$.
  • $\bar{x}$: mean of all non-missing values.

This assumes the missing values are missing completely at random (MCAR) — meaning the fact that they’re missing doesn’t depend on the value itself.

You’re basically saying: “Let’s pretend the missing student scored the class average.” This keeps dataset size constant but may underestimate variance, making the data look more uniform than it truly is.
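
A quick sketch with scikit-learn's SimpleImputer, which also shows the variance shrinkage in action (numbers invented):

```python
import numpy as np
from sklearn.impute import SimpleImputer

x = np.array([[4.0], [8.0], [np.nan], [6.0], [np.nan]])

imputer = SimpleImputer(strategy="mean")  # mean of non-missing values is 6.0
x_filled = imputer.fit_transform(x)

print(x_filled.ravel())                   # [4. 8. 6. 6. 6.]
# Every gap collapses onto the mean, so the variance shrinks:
print(np.nanvar(x), np.var(x_filled))     # ~2.67 vs 1.6
```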

Median Imputation

Replace missing values with the median (middle value). This is robust to outliers and works better when your data is skewed.

Median imputation is like saying: “Let’s pick the middle performer instead of the top or bottom.” It preserves data shape better than mean when extreme values exist.
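
A quick numeric check of why the median is the safer fill under skew (toy incomes, one extreme earner):

```python
import numpy as np

incomes = np.array([30_000, 32_000, 35_000, 38_000, 500_000])

print(np.mean(incomes))    # 127000.0, dragged up by the outlier
print(np.median(incomes))  # 35000.0, the "middle performer"
```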

KNN Imputation

Here, each missing value is replaced by the average of its nearest neighbors. If a sample is missing a value for a feature, we find k other samples most similar in other features, and take their average (or majority vote).

$$ x_i' = \frac{1}{k} \sum_{j \in N_k(i)} x_j $$
Instead of a blind guess, you ask the neighbors: “Hey, what’s your typical score?” KNN imputation preserves relationships but can be computationally expensive for large datasets.
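
In scikit-learn this is KNNImputer; a minimal sketch on invented points:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [1.1, np.nan],   # gap to fill
    [0.9, 2.1],
    [8.0, 9.0],      # distant sample, should not influence the fill
])

# Average the missing feature over the k=2 rows nearest in the observed features
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])   # 2.05, the mean of the two close neighbors' values
```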

Model-Based Imputation

Here, you build a predictive model (like regression or Random Forest) to estimate missing values. Each feature with missing data becomes a “target variable” to be predicted from other features.

It’s like hiring a mini-model to “guess intelligently” instead of using a simple average. It preserves complex relationships but risks overfitting if not regularized.
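
One concrete instance is scikit-learn's IterativeImputer, which regresses each gappy feature on the others (synthetic data below, built so x2 depends linearly on x1):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 3 * x1 + rng.normal(scale=0.1, size=200)  # strong linear relationship
X = np.column_stack([x1, x2])
X[::10, 1] = np.nan                            # knock out every 10th x2 value

# Each feature with gaps becomes a regression target (BayesianRidge by default)
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)            # fills x2 from x1, not from a global mean
```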

🧠 Step 4: Assumptions or Key Ideas

  • Missingness must be understood, not just fixed.

  • There are 3 types of missingness:

    • MCAR — Missing Completely at Random
    • MAR — Missing at Random (depends on other variables)
    • MNAR — Missing Not at Random (depends on itself)
  • Your imputation choice depends on which type you suspect.

  • Always impute only using training data statistics to avoid leakage.
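
The leakage-safe pattern, sketched with scikit-learn (toy array; the point is: fit on train, transform on both):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0], [2.0], [np.nan], [4.0], [np.nan], [6.0]])
X_train, X_test = train_test_split(X, test_size=0.33, random_state=0)

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                  # statistics come from training data only

X_train_f = imputer.transform(X_train)
X_test_f = imputer.transform(X_test)  # test gaps filled with the TRAIN median
```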


⚖️ Step 5: Strengths, Limitations & Trade-offs

  • Strengths:

    • Keeps dataset size intact.
    • Preserves information when missingness is low.
    • Simple methods (mean/median) are fast and interpretable.
  • Limitations:

    • Mean imputation reduces variance and biases correlations.
    • Complex imputations (KNN, model-based) are computationally heavy.
    • Wrong assumptions about missingness can lead to misleading results.
  • Trade-offs:

    • Balance simplicity against realism: if few values are missing, use mean/median; if many are missing and relationships are complex, use KNN or model-based imputation.
    • If too much is missing (e.g., >40%), dropping the column or row is sometimes more honest.

🚧 Step 6: Common Misunderstandings

  • “Always fill missing values.” Not necessarily: if a column is mostly empty, or the missingness is rare and completely random, dropping rows or columns can be the safer choice.

  • “Mean and median are the same thing.” They behave very differently when outliers exist.

  • “Model-based imputations are always superior.” Not true — they can overfit, especially on small datasets.


🧩 Step 7: Mini Summary

🧠 What You Learned: Missing value handling is about choosing how to fill data gaps without distorting its truth.

⚙️ How It Works: Techniques range from simple averages to intelligent model-based guesses, depending on data type and distribution.

🎯 Why It Matters: Because missingness can silently bias models — and smart imputation preserves both data integrity and predictive power.
