1.2 Filter and Aggregate Like an Analyst

5 min read 864 words

🪄 Step 1: Intuition & Real-World Analogy

🎯 Core Idea: Filtering and aggregation are the bread and butter of analytics — they allow you to zoom in on specific slices of data and then summarize them meaningfully.

🌍 Real-Life Analogy: Imagine you’re analyzing sales data like a detective.

  • First, you filter: “Show me sales only from the West region and above $1,000.”
  • Then, you aggregate: “Now, sum those sales by region.”

Filtering is like deciding which clues to look at. Aggregation is like summarizing the evidence into a conclusion.

💬 Curiosity Hook: But what if you apply the filter after aggregation? Will you get the same answer? That’s where the subtle but crucial WHERE vs HAVING distinction begins.


🌱 Step 2: The Core Concept Explained

Let’s break this into two pillars — filtering and aggregation.


1️⃣ Filtering Data — The WHERE Clause

Purpose: The WHERE clause filters rows before any grouping or aggregation happens. Think of it as a gatekeeper — only rows that pass this gate are considered later.

Example Thought:

“Show me all transactions above $100 in 2024.”

SELECT *
FROM sales
WHERE amount > 100
  AND order_date >= '2024-01-01';

This removes irrelevant rows early, which makes grouping and summarization faster later.

Internal Stage: Runs before GROUP BY — works only on individual row values, not aggregates.


2️⃣ Aggregating Data — The GROUP BY Clause

Purpose: The GROUP BY clause collapses multiple rows into summary buckets.

“Group all sales by region and compute total sales per region.”

SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region;

Now, instead of thousands of rows, you get one row per region — a compact summary.

Analytical Superpower: Aggregation turns messy transactional data into business insights.


3️⃣ Filtering After Aggregation — The HAVING Clause

Purpose: HAVING filters groups, not rows. It’s like saying:

“Now that I’ve summed up each region’s total — only keep the big ones.”

SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region
HAVING SUM(amount) > 10000;

So,

  • WHERE = “Filter rows before grouping.”
  • HAVING = “Filter groups after aggregation.”

This distinction is interview gold.


💻 Step 3: Code Demonstration & Explanation

Let’s watch these concepts in motion.


Example 1: Simple Filtering
SELECT *
FROM sales
WHERE region = 'West' AND amount > 1000;

Explanation: Filters individual transactions — only rows where region is “West” and amount > 1000. No aggregation yet. This query outputs raw data.


Example 2: Adding Aggregation
SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE amount > 1000
GROUP BY region;

Explanation:

  • Step 1: WHERE filters rows (only large sales).
  • Step 2: GROUP BY groups by region.
  • Step 3: SUM() computes total per region.

Result: Each row now represents one region and its total sales (for high-value orders).


Example 3: Filtering After Aggregation
SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE amount > 1000
GROUP BY region
HAVING SUM(amount) > 10000;

Explanation:

  • WHERE: Selects big transactions first.
  • GROUP BY: Collapses by region.
  • HAVING: Keeps only regions with high totals.

This layered filtering keeps computations efficient and logic clean.


⚙️ Step 4: Performance Insights & Optimization

ℹ️
  • Push Filters Early: Always filter rows as early as possible using WHERE. It reduces data volume before aggregation.

  • Avoid HAVING for Row-Level Filters: If you can filter with WHERE, do it — HAVING runs after heavy aggregation.

  • Indexing Benefit: WHERE clauses can leverage indexes. HAVING cannot, since it works on computed aggregates.

  • Query Planning: In execution plans, watch for “Seq Scan” (bad) vs “Index Scan” (good). Example:

    EXPLAIN
    SELECT region, SUM(amount)
    FROM sales
    WHERE amount > 1000
    GROUP BY region;

    You’ll often see the planner push down filters for efficiency.


🎯 Step 5: Interview Perspective & Reasoning

Common Interview Questions
  • “Difference between WHERE and HAVING?”
  • “Can you use aggregates in WHERE?”
  • “When would HAVING be necessary?”
  • “How do filters affect aggregation results?”
Reasoning Expectation

Interviewers assess whether you understand the sequence and scope of filtering.

Strong candidates can say:

“WHERE filters individual rows before grouping; HAVING filters groups after aggregation. They operate at different stages of the query lifecycle.”

Bonus Challenge

Find the top-performing product categories (sum > $50,000), but only consider orders placed after 2024.

SELECT category, SUM(amount) AS total_sales
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY category
HAVING SUM(amount) > 50000;

Can you explain why WHERE must come before HAVING here? That’s the key test of conceptual depth.


🧩 Step 6: Common Mistakes & Debugging Insights

🚨 Common Mistakes (Click to Expand)
  • ❌ Using aggregates in WHERE (WHERE SUM(amount) > 10000) → Wrong order of execution.
  • ❌ Forgetting GROUP BY when using aggregates (SELECT region, SUM(amount)) → SQL doesn’t know how to group non-aggregated columns.
  • ❌ Filtering after aggregation when you could filter before. → Unnecessary computation.

💡 Debug Insight: If you see “invalid use of group function” → You’re using an aggregate too early (in WHERE).


🧠 Step 7: Key Takeaways

🧠 Main Idea: WHERE filters rows early; HAVING filters groups later.

💻 How It Works: SQL filters → groups → filters again → selects. The order defines whether you’re working with individual rows or summarized groups.

🚀 Why It Matters:

  • Enables efficient analysis pipelines
  • Prevents logical and performance bugs
  • A must-know for any data analyst interview
Any doubt in content? Ask me anything?
Chat
🤖 👋 Hi there! I'm your learning assistant. If you have any questions about this page or need clarification, feel free to ask!