5.2. IQR (Interquartile Range) Method
🪄 Step 1: Intuition & Motivation
Core Idea: Not all data behaves like a perfect bell curve — many real-world datasets are skewed, clumped, or multi-peaked (multi-modal). The IQR Method is a more robust, distribution-free way to identify outliers — it doesn’t assume normality, and it resists the influence of extreme values.
Instead of using the mean and standard deviation (which get distorted by outliers), it focuses on percentiles — the points that divide your data into quarters.
IQR helps you find values that lie unusually far from the “middle 50%” of your data.
Simple Analogy: Think of your data as a classroom lined up by height.
- The middle half (students between the 25th and 75th percentile) represents the “typical” range.
- Anyone way shorter than the 25th percentile or way taller than the 75th percentile? That’s an outlier.
So instead of asking “how far from the mean,” IQR asks “how far from most of the crowd.”
🌱 Step 2: Core Concept
The Interquartile Range (IQR) captures the spread of the middle 50% of your data — ignoring the extremes. It’s defined as:
$$ IQR = Q3 - Q1 $$
where:
- Q1 (25th percentile) — value below which 25% of data lies
- Q3 (75th percentile) — value below which 75% of data lies
Any value more than 1.5×IQR below Q1 or above Q3 is considered a potential outlier.
What’s Happening Under the Hood?
When we compute the IQR, we’re effectively drawing two invisible “fences”:
- Lower Fence: $Q1 - 1.5 \times IQR$
- Upper Fence: $Q3 + 1.5 \times IQR$
Values outside these fences are too far from the core cluster — they deviate too much from the crowd’s central behavior.
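Let’s see an example: Suppose your data = [5, 6, 7, 8, 9, 15, 22]. Using linear interpolation between ranks (the default in NumPy; other quartile conventions can give slightly different values):
- Q1 = 6.5
- Q3 = 12
- IQR = 12 - 6.5 = 5.5
Outlier fences:
- Lower = 6.5 - (1.5×5.5) = -1.75
- Upper = 12 + (1.5×5.5) = 20.25
→ Any value above 20.25 is an outlier → here, only 22. The value 15 sits inside the upper fence, so it is not flagged.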
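The fence computation for these seven values can be sketched with Python's standard library. This is a minimal illustration, not a reference implementation: the variable names are made up for the example, and `method="inclusive"` is an assumed choice of quartile convention (linear interpolation between ranks, which matches NumPy's default). Quartiles computed by hand under another convention may differ slightly.

```python
from statistics import quantiles

data = [5, 6, 7, 8, 9, 15, 22]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
# method="inclusive" interpolates linearly between ranks (assumed convention).
q1, _median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# The two "fences" from the text above.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as a potential outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```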
Why It Works This Way
Unlike Z-Score, which assumes a normal distribution, IQR adapts to any data shape because it uses ranks (percentiles) instead of raw distances.
It focuses on the middle 50%, so extreme values (even huge ones) don’t distort the spread.
This makes it highly reliable for skewed distributions — income, housing prices, or transaction values — where “normal” isn’t symmetric.
It’s a non-parametric approach, meaning it makes no assumptions about how the data is distributed.
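To make the robustness claim concrete, the sketch below compares the standard deviation and the IQR on a small made-up sample before and after one value is replaced by an extreme one. The data and the helper name `iqr` are hypothetical, chosen only for illustration, and the quartile convention (`method="inclusive"`) is an assumption.

```python
from statistics import quantiles, stdev

def iqr(values):
    """Spread of the middle 50%: Q3 - Q1 (linear interpolation between ranks)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Made-up sample, then the same sample with its largest value replaced by an extreme one.
base = [10, 11, 12, 13, 14, 15, 16, 17, 18]
contaminated = base[:-1] + [1000]

print("stdev:", round(stdev(base), 2), "->", round(stdev(contaminated), 2))
print("IQR:  ", iqr(base), "->", iqr(contaminated))
```

In this sample the standard deviation grows by two orders of magnitude while the IQR does not change, because the extreme value never reaches the 25th or 75th percentile.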
How It Fits in ML Thinking
The IQR method is a useful check before model training because:
- Outliers can skew feature scaling and normalization.
- It keeps a handful of extreme values from dominating what the model treats as “typical” data.
- It provides interpretability — percentile-based thresholds are easy to explain to non-technical stakeholders.
In ML workflows, IQR is often combined with boxplots, a visual tool where the whiskers extend to the most extreme points still inside the fences, and dots beyond them are flagged as potential outliers.
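The relationship between fences and boxplot whiskers can be sketched without any plotting library. In the common Tukey-style boxplot, each whisker ends at the most extreme observation that still lies inside the corresponding fence, so the whisker ends are actual data points while the fences are computed thresholds. The names below are illustrative, and the quartile convention is an assumption.

```python
from statistics import quantiles

data = [5, 6, 7, 8, 9, 15, 22]

q1, _median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points inside the fences determine where the whiskers end.
inside = [x for x in data if lower_fence <= x <= upper_fence]
lower_whisker = min(inside)
upper_whisker = max(inside)

# Points outside the fences are what a boxplot draws as individual dots.
fliers = [x for x in data if x < lower_fence or x > upper_fence]

print("whiskers:", lower_whisker, upper_whisker)
print("dots beyond whiskers:", fliers)
```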
📐 Step 3: Mathematical Foundation
IQR Formula and Outlier Boundaries
Compute the interquartile range:
$$IQR = Q3 - Q1$$
Define outlier limits:
- Lower limit: $Q1 - 1.5 \times IQR$
- Upper limit: $Q3 + 1.5 \times IQR$
Flag any point $x$ as an outlier if:
$$x < Q1 - 1.5 \times IQR \quad \text{or} \quad x > Q3 + 1.5 \times IQR$$
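The rule above translates almost directly into a pair of small helpers. This is a sketch under stated assumptions: `iqr_bounds` and `is_outlier` are hypothetical names, the multiplier is exposed as a parameter `k` defaulting to 1.5, the sample amounts are made up, and quartiles use linear interpolation between ranks.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, bounds):
    """True if x falls strictly below the lower limit or above the upper limit."""
    lower, upper = bounds
    return x < lower or x > upper

# Made-up transaction amounts for illustration.
amounts = [12, 14, 15, 15, 16, 18, 19, 21, 95]
bounds = iqr_bounds(amounts)
flags = [is_outlier(x, bounds) for x in amounts]

print("limits:", bounds)
print("flagged:", [x for x, f in zip(amounts, flags) if f])
```

Computing the limits once and reusing them keeps the threshold fixed, which matters if the same limits are later applied to new data.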
🧠 Step 4: Assumptions or Key Ideas
- Data is continuous or ordinal, so percentiles make sense.
- Doesn’t assume any particular distribution shape.
- Outlier definition depends on global thresholds — one size fits most, but not all.
- Works best for unimodal or mildly skewed data; for multi-modal data, each cluster may need its own IQR boundaries.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Robust to outliers: the quartiles barely move unless roughly a quarter or more of the data is extreme.
- Non-parametric: makes no assumption about the shape of the distribution.
- Simple and interpretable (percentile-based).
- Well suited to skewed, real-world datasets.
Limitations:
- May misclassify points in multi-modal data (multiple peaks).
- The 1.5× multiplier is a heuristic, not universally optimal.
- Unreliable for very small datasets (percentile estimates are unstable).
Trade-offs:
- Use IQR when your data isn’t Gaussian or when robustness is key.
- For multiple clusters, apply IQR within each cluster to avoid misclassifying points.
- Combine with visualization (boxplots, histograms) for context-aware decisions.
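The per-cluster advice can be illustrated with a small sketch. It assumes the cluster labels are already known (here a hypothetical dict with two made-up groups); in practice they might come from a categorical feature or a clustering step. The helper name `iqr_outliers` is also an assumption. In this example the pooled fences are so wide that nothing is flagged, while the per-cluster fences single out the value 40 in cluster "A".

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

# Two made-up clusters with very different typical values.
clusters = {
    "A": [9, 10, 10, 11, 12, 40],
    "B": [98, 99, 100, 101, 102, 103],
}

# Pooled: one set of fences for all points together.
pooled = [x for values in clusters.values() for x in values]
pooled_outliers = iqr_outliers(pooled)

# Per cluster: separate fences for each group.
per_cluster_outliers = {name: iqr_outliers(values) for name, values in clusters.items()}

print("pooled:", pooled_outliers)
print("per cluster:", per_cluster_outliers)
```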
🚧 Step 6: Common Misunderstandings
“IQR assumes normal distribution like Z-Score.” No — IQR is purely percentile-based, so it’s distribution-free.
“The 1.5× rule is fixed.” It’s a convention, not a law. For highly skewed data, you might use 2× or 3× IQR to be less sensitive.
“Outliers must always be removed.” Sometimes, they’re important signals — e.g., fraud detection or anomaly discovery.
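The effect of the multiplier can be seen by running the same rule with different values of `k` on a right-skewed sample. The data and the helper name `flag_with_k` are made up for illustration, and the quartile convention is an assumption; the point is only that a larger multiplier flags fewer points.

```python
from statistics import quantiles

def flag_with_k(values, k):
    """Return values outside Q1 - k*IQR and Q3 + k*IQR for a given multiplier k."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [x for x in values if x < q1 - k * iqr or x > q3 + k * iqr]

# Made-up right-skewed sample.
skewed = [1, 2, 2, 3, 3, 4, 5, 8, 16, 30]

for k in (1.5, 2.0, 3.0):
    print(f"k={k}: flagged {flag_with_k(skewed, k)}")
```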
🧩 Step 7: Mini Summary
🧠 What You Learned: The IQR Method detects outliers based on percentile boundaries instead of standard deviation — ideal for skewed, non-normal data.
⚙️ How It Works: Values more than 1.5×IQR below Q1 or above Q3 are flagged as potential outliers.
🎯 Why It Matters: IQR is robust, intuitive, and largely insensitive to extreme values, which makes it a common choice for real-world data preprocessing.