1.2 Filter and Aggregate Like an Analyst

Core Skills Guide for AI Interviews (Math, Code, SQL) 2025

5 min read 864 words

🪄 Step 1: Intuition & Real-World Analogy

🎯 Core Idea: Filtering and aggregation are the bread and butter of analytics — they allow you to zoom in on specific slices of data and then summarize them meaningfully.

🌍 Real-Life Analogy: Imagine you’re analyzing sales data like a detective.

First, you filter: “Show me sales only from the West region and above $1,000.”
Then, you aggregate: “Now, sum those sales by region.”

Filtering is like deciding which clues to look at. Aggregation is like summarizing the evidence into a conclusion.

💬 Curiosity Hook: But what if you apply the filter after aggregation? Will you get the same answer? That’s where the subtle but crucial WHERE vs HAVING distinction begins.

🌱 Step 2: The Core Concept Explained

Let’s break this into two pillars — filtering and aggregation.

1️⃣ Filtering Data — The WHERE Clause

Purpose: The WHERE clause filters rows before any grouping or aggregation happens. Think of it as a gatekeeper — only rows that pass this gate are considered later.

Example Thought:

“Show me all transactions above $100 in 2024.”

SELECT *
FROM sales
WHERE amount > 100
  AND order_date >= '2024-01-01';

This removes irrelevant rows early, which makes grouping and summarization faster later.

Internal Stage: Runs before GROUP BY — works only on individual row values, not aggregates.

2️⃣ Aggregating Data — The GROUP BY Clause

Purpose: The GROUP BY clause collapses multiple rows into summary buckets.

“Group all sales by region and compute total sales per region.”

SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region;

Now, instead of thousands of rows, you get one row per region — a compact summary.

Analytical Superpower: Aggregation turns messy transactional data into business insights.

3️⃣ Filtering After Aggregation — The HAVING Clause

Purpose: HAVING filters groups, not rows. It’s like saying:

“Now that I’ve summed up each region’s total — only keep the big ones.”

SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region
HAVING SUM(amount) > 10000;

So,

WHERE = “Filter rows before grouping.”
HAVING = “Filter groups after aggregation.”

This distinction is interview gold.

💻 Step 3: Code Demonstration & Explanation

Let’s watch these concepts in motion.

Example 1: Simple Filtering

SELECT *
FROM sales
WHERE region = 'West' AND amount > 1000;

Explanation: Filters individual transactions — only rows where region is “West” and amount > 1000. No aggregation yet. This query outputs raw data.

Example 2: Adding Aggregation

SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE amount > 1000
GROUP BY region;

Explanation:

Step 1: WHERE filters rows (only large sales).
Step 2: GROUP BY groups by region.
Step 3: SUM() computes total per region.

Result: Each row now represents one region and its total sales (for high-value orders).

Example 3: Filtering After Aggregation

SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE amount > 1000
GROUP BY region
HAVING SUM(amount) > 10000;

Explanation:

WHERE: Selects big transactions first.
GROUP BY: Collapses by region.
HAVING: Keeps only regions with high totals.

This layered filtering keeps computations efficient and logic clean.

⚙️ Step 4: Performance Insights & Optimization

ℹ️

Push Filters Early: Always filter rows as early as possible using WHERE. It reduces data volume before aggregation.
Avoid HAVING for Row-Level Filters: If you can filter with WHERE, do it — HAVING runs after heavy aggregation.
Indexing Benefit: WHERE clauses can leverage indexes. HAVING cannot, since it works on computed aggregates.
Query Planning: In execution plans, watch for “Seq Scan” (bad) vs “Index Scan” (good). Example:
```
EXPLAIN
SELECT region, SUM(amount)
FROM sales
WHERE amount > 1000
GROUP BY region;
```
You’ll often see the planner push down filters for efficiency.

🎯 Step 5: Interview Perspective & Reasoning

Common Interview Questions

“Difference between WHERE and HAVING?”
“Can you use aggregates in WHERE?”
“When would HAVING be necessary?”
“How do filters affect aggregation results?”

Reasoning Expectation

Interviewers assess whether you understand the sequence and scope of filtering.

Strong candidates can say:

“WHERE filters individual rows before grouping; HAVING filters groups after aggregation. They operate at different stages of the query lifecycle.”

Bonus Challenge

Find the top-performing product categories (sum > $50,000), but only consider orders placed after 2024.

SELECT category, SUM(amount) AS total_sales
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY category
HAVING SUM(amount) > 50000;

Can you explain why WHERE must come before HAVING here? That’s the key test of conceptual depth.

🧩 Step 6: Common Mistakes & Debugging Insights

🚨 Common Mistakes (Click to Expand)

❌ Using aggregates in WHERE (WHERE SUM(amount) > 10000) → Wrong order of execution.
❌ Forgetting GROUP BY when using aggregates (SELECT region, SUM(amount)) → SQL doesn’t know how to group non-aggregated columns.
❌ Filtering after aggregation when you could filter before. → Unnecessary computation.

💡 Debug Insight: If you see “invalid use of group function” → You’re using an aggregate too early (in WHERE).

🧠 Step 7: Key Takeaways

🧠 Main Idea: WHERE filters rows early; HAVING filters groups later.

💻 How It Works: SQL filters → groups → filters again → selects. The order defines whether you’re working with individual rows or summarized groups.

🚀 Why It Matters:
Enables efficient analysis pipelines
Prevents logical and performance bugs
A must-know for any data analyst interview

1.3 Nulls, Case, and Conditional Logic 1.1. Learn the SQL Query Lifecycle