3.2 Cohort and Retention Analysis
🪄 Step 1: Intuition & Real-World Analogy
🎯 Core Idea
Cohort analysis helps us understand how groups of users behave over time — for example, “How many users who signed up in January are still active in March?”
Think of it as time-travel analytics — where you freeze users at the moment they start (their cohort) and then observe how they behave in future time intervals.
This helps answer:
- How sticky is your product?
- When do users typically drop off?
- Which acquisition month brings the most loyal customers?
🌍 Real-Life Analogy
Imagine a gym 🏋️ tracking its members:
- January sign-ups = Cohort A
- February sign-ups = Cohort B
- March sign-ups = Cohort C
You measure how many from each group still come to the gym after 1 month, 2 months, etc. That’s retention analysis.
If you plot this over time — you get the classic retention heatmap, showing engagement decay.
💬 Conversational Style
You might ask:
“Can’t I just group by month?”
Sure — but grouping by month of signup only gives you who joined, not how long they stayed. Cohort analysis tracks journey over time — “month since signup”, not “month of calendar”.
This distinction turns raw event data into powerful user behavior insight.
🌱 Step 2: The Core Concept Explained
What It Really Does
Cohort analysis in SQL involves:
- Identifying each user’s cohort start date (first event, purchase, or signup).
- Measuring user activity across time buckets (like months since signup).
- Aggregating results into a retention matrix — how many users return or convert after X months.
How It Works Internally
We use window functions to find the cohort start (e.g., MIN(order_date) per user).
Then we compute time deltas between each event and that start.
Finally, we group by cohort and time difference to count how many users remained active.
Behind the scenes:
- Each row represents a user-event pair.
- SQL calculates their cohort and “age” in months.
- Aggregations summarize retention across cohorts.
Common Use Cases
- SaaS: customer retention or churn analysis.
- E-commerce: repeat purchase rates.
- Apps: daily/weekly/monthly active user tracking.
- Marketing: cohort-based campaign effectiveness.
💻 Step 3: Code Demonstration & Explanation
Let’s build the logic step-by-step.
🧩 Example 1: Identify Cohorts
SELECT
customer_id,
MIN(order_date) AS cohort_start
FROM orders
GROUP BY customer_id;Explanation: This finds each customer’s first purchase date — their “signup month”.
🧩 Example 2: Calculate Time Since Signup (Cohort Age)
SELECT
customer_id,
MIN(order_date) AS cohort_start,
DATE_PART('month', order_date - MIN(order_date) OVER (PARTITION BY customer_id)) AS months_since_signup
FROM orders;Explanation:
MIN(order_date) OVER (PARTITION BY customer_id)finds the first purchase per customer (no subquery needed).- The date difference (in months) shows how many months since signup an order occurred.
🧩 Example 3: Count Active Users by Cohort Month
WITH cohort_data AS (
SELECT
customer_id,
DATE_TRUNC('month', MIN(order_date)) AS cohort_month,
DATE_TRUNC('month', order_date) AS activity_month,
DATE_PART('month', AGE(order_date, MIN(order_date) OVER (PARTITION BY customer_id))) AS months_since_signup
FROM orders
)
SELECT
cohort_month,
months_since_signup,
COUNT(DISTINCT customer_id) AS active_users
FROM cohort_data
GROUP BY cohort_month, months_since_signup
ORDER BY cohort_month, months_since_signup;Explanation:
cohort_month: identifies when users first joined.months_since_signup: measures retention period.- Grouping by both gives a cohort × retention matrix — the foundation for a retention chart.
Sample Output:
| cohort_month | months_since_signup | active_users |
|---|---|---|
| 2025-01-01 | 0 | 1000 |
| 2025-01-01 | 1 | 750 |
| 2025-01-01 | 2 | 500 |
| 2025-02-01 | 0 | 1200 |
| 2025-02-01 | 1 | 900 |
🧩 Example 4: Retention Percentage per Cohort
WITH cohort_counts AS (
SELECT
cohort_month,
months_since_signup,
COUNT(DISTINCT customer_id) AS active_users
FROM cohort_data
GROUP BY cohort_month, months_since_signup
),
cohort_sizes AS (
SELECT
cohort_month,
COUNT(DISTINCT customer_id) AS cohort_size
FROM cohort_data
WHERE months_since_signup = 0
GROUP BY cohort_month
)
SELECT
c.cohort_month,
c.months_since_signup,
ROUND(100.0 * c.active_users / s.cohort_size, 2) AS retention_rate
FROM cohort_counts c
JOIN cohort_sizes s USING (cohort_month)
ORDER BY c.cohort_month, c.months_since_signup;Explanation:
- Calculates retention percentage for each cohort over time.
months_since_signup = 0defines the initial cohort size.- Joins back to compute relative retention.
⚙️ Step 4: Performance Insights & Optimization
Key Challenges:
Cohort analysis often runs on massive datasets — millions or even billions of events. Each row represents a transaction, and window functions add overhead.
🔍 Optimization Strategies:
Materialized Views
- Cache intermediate results like cohort assignments.
- Refresh daily instead of recomputing
MIN(order_date)for all users.
CREATE MATERIALIZED VIEW user_cohorts AS SELECT customer_id, MIN(order_date) AS cohort_start FROM orders GROUP BY customer_id;Partitioning by Date
- Partition the
orderstable byorder_datemonth for faster scans.
CREATE TABLE orders_2025_01 PARTITION OF orders FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');- Partition the
Pre-aggregation
- Precompute daily or monthly summaries (event counts per user).
- Use these summaries instead of raw events in your cohort query.
Indexing
- Create indexes on
(customer_id, order_date)and(order_date)for efficient sorting and filtering.
- Create indexes on
Incremental Updates
- Append only new data each day rather than recomputing all cohorts.
Result: What once took hours on a 1B-row dataset can run in seconds with precomputation and partitioning.
🎯 Step 5: Interview Perspective & Reasoning
Common Interview Questions
- “Explain what cohort analysis measures.”
- “What’s the difference between calendar month and months-since-signup?”
- “How would you compute retention rates in SQL?”
- “How would you optimize cohort analysis on large data?”
- “Why use a materialized view instead of a CTE for this use case?”
Reasoning Expectation
Strong candidates should demonstrate:
- Understanding of temporal grouping logic.
- Comfort with window functions (
MIN() OVER,AGE(), etc.). - Awareness of scalability bottlenecks (I/O, window sort cost).
- Ability to reason about pre-aggregation pipelines.
Bonus Challenge
Write a query to calculate 7-day retention (users active within 7 days after signup). Then discuss how you’d run it efficiently on a 5-billion-row table — without recomputing everything daily.
🧩 Step 6: Common Mistakes & Debugging Insights
🚨 Common Mistakes (Click to Expand)
- Using
DATEDIFForAGEwithout truncating → non-integer intervals lead to wrong buckets. - Forgetting to
DATE_TRUNC('month', order_date)— causes too granular (daily) grouping. - Ignoring
DISTINCT customer_id— leads to inflated counts. - Running full recomputation on historical data daily — kills performance.
- Not indexing
order_date— forces full table scans for time-based filters.
Debugging Tip:
Use EXPLAIN ANALYZE to confirm:
- Window functions (
MinorRank) aren’t rescanned unnecessarily. - Joins on cohort tables use index joins instead of sequential scans.
🧠 Step 7: Key Takeaways
🧠 Main Idea: Cohort analysis groups users by their starting point and tracks retention over time — revealing product stickiness.
💻 How It Works: Use window functions to find cohort start dates, compute time differences, and aggregate user counts by cohort × age.
🚀 Why It Matters: It’s the backbone of growth analytics — showing not just who joined, but who stayed — and how to keep them longer.