1.2 Filter and Aggregate Like an Analyst
🪄 Step 1: Intuition & Real-World Analogy
🎯 Core Idea: Filtering and aggregation are the bread and butter of analytics — they allow you to zoom in on specific slices of data and then summarize them meaningfully.
🌍 Real-Life Analogy: Imagine you’re analyzing sales data like a detective.
- First, you filter: “Show me sales only from the West region and above $1,000.”
- Then, you aggregate: “Now, sum those sales by region.”
Filtering is like deciding which clues to look at. Aggregation is like summarizing the evidence into a conclusion.
💬 Curiosity Hook: But what if you apply the filter after aggregation? Will you get the same answer? That’s where the subtle but crucial WHERE vs HAVING distinction begins.
🌱 Step 2: The Core Concept Explained
Let’s break this into two pillars — filtering and aggregation.
1️⃣ Filtering Data — The WHERE Clause
Purpose:
The WHERE clause filters rows before any grouping or aggregation happens.
Think of it as a gatekeeper — only rows that pass this gate are considered later.
Example Thought:
“Show me all transactions above $100 in 2024.”
SELECT *
FROM sales
WHERE amount > 100
AND order_date >= '2024-01-01';This removes irrelevant rows early, which makes grouping and summarization faster later.
Internal Stage: Runs before GROUP BY — works only on individual row values, not aggregates.
2️⃣ Aggregating Data — The GROUP BY Clause
Purpose:
The GROUP BY clause collapses multiple rows into summary buckets.
“Group all sales by region and compute total sales per region.”
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region;Now, instead of thousands of rows, you get one row per region — a compact summary.
Analytical Superpower: Aggregation turns messy transactional data into business insights.
3️⃣ Filtering After Aggregation — The HAVING Clause
Purpose:
HAVING filters groups, not rows.
It’s like saying:
“Now that I’ve summed up each region’s total — only keep the big ones.”
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region
HAVING SUM(amount) > 10000;So,
WHERE= “Filter rows before grouping.”HAVING= “Filter groups after aggregation.”
This distinction is interview gold.
💻 Step 3: Code Demonstration & Explanation
Let’s watch these concepts in motion.
Example 1: Simple Filtering
SELECT *
FROM sales
WHERE region = 'West' AND amount > 1000;Explanation: Filters individual transactions — only rows where region is “West” and amount > 1000. No aggregation yet. This query outputs raw data.
Example 2: Adding Aggregation
SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE amount > 1000
GROUP BY region;Explanation:
- Step 1:
WHEREfilters rows (only large sales). - Step 2:
GROUP BYgroups by region. - Step 3:
SUM()computes total per region.
Result: Each row now represents one region and its total sales (for high-value orders).
Example 3: Filtering After Aggregation
SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE amount > 1000
GROUP BY region
HAVING SUM(amount) > 10000;Explanation:
WHERE: Selects big transactions first.GROUP BY: Collapses by region.HAVING: Keeps only regions with high totals.
This layered filtering keeps computations efficient and logic clean.
⚙️ Step 4: Performance Insights & Optimization
Push Filters Early: Always filter rows as early as possible using
WHERE. It reduces data volume before aggregation.Avoid HAVING for Row-Level Filters: If you can filter with
WHERE, do it —HAVINGruns after heavy aggregation.Indexing Benefit:
WHEREclauses can leverage indexes.HAVINGcannot, since it works on computed aggregates.Query Planning: In execution plans, watch for “Seq Scan” (bad) vs “Index Scan” (good). Example:
EXPLAIN SELECT region, SUM(amount) FROM sales WHERE amount > 1000 GROUP BY region;You’ll often see the planner push down filters for efficiency.
🎯 Step 5: Interview Perspective & Reasoning
Common Interview Questions
- “Difference between WHERE and HAVING?”
- “Can you use aggregates in WHERE?”
- “When would HAVING be necessary?”
- “How do filters affect aggregation results?”
Reasoning Expectation
Interviewers assess whether you understand the sequence and scope of filtering.
Strong candidates can say:
“WHERE filters individual rows before grouping; HAVING filters groups after aggregation. They operate at different stages of the query lifecycle.”
Bonus Challenge
Find the top-performing product categories (sum > $50,000), but only consider orders placed after 2024.
SELECT category, SUM(amount) AS total_sales
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY category
HAVING SUM(amount) > 50000;Can you explain why WHERE must come before HAVING here? That’s the key test of conceptual depth.
🧩 Step 6: Common Mistakes & Debugging Insights
🚨 Common Mistakes (Click to Expand)
- ❌ Using aggregates in WHERE (
WHERE SUM(amount) > 10000) → Wrong order of execution. - ❌ Forgetting GROUP BY when using aggregates (
SELECT region, SUM(amount)) → SQL doesn’t know how to group non-aggregated columns. - ❌ Filtering after aggregation when you could filter before. → Unnecessary computation.
💡 Debug Insight: If you see “invalid use of group function” → You’re using an aggregate too early (in WHERE).
🧠 Step 7: Key Takeaways
🧠 Main Idea:
WHEREfilters rows early;HAVINGfilters groups later.
💻 How It Works: SQL filters → groups → filters again → selects. The order defines whether you’re working with individual rows or summarized groups.
🚀 Why It Matters:
- Enables efficient analysis pipelines
- Prevents logical and performance bugs
- A must-know for any data analyst interview