3.2 Cohort and Retention Analysis

6 min read 1070 words

🪄 Step 1: Intuition & Real-World Analogy

🎯 Core Idea

Cohort analysis helps us understand how groups of users behave over time — for example, “How many users who signed up in January are still active in March?”

Think of it as time-travel analytics — where you freeze users at the moment they start (their cohort) and then observe how they behave in future time intervals.

This helps answer:

  • How sticky is your product?
  • When do users typically drop off?
  • Which acquisition month brings the most loyal customers?

🌍 Real-Life Analogy

Imagine a gym 🏋️ tracking its members:

  • January sign-ups = Cohort A
  • February sign-ups = Cohort B
  • March sign-ups = Cohort C

You measure how many from each group still come to the gym after 1 month, 2 months, etc. That’s retention analysis.

If you plot this over time — you get the classic retention heatmap, showing engagement decay.


💬 Conversational Style

You might ask:

“Can’t I just group by month?”

Sure — but grouping by month of signup only gives you who joined, not how long they stayed. Cohort analysis tracks journey over time — “month since signup”, not “month of calendar”.

This distinction turns raw event data into powerful user behavior insight.


🌱 Step 2: The Core Concept Explained

What It Really Does

Cohort analysis in SQL involves:

  1. Identifying each user’s cohort start date (first event, purchase, or signup).
  2. Measuring user activity across time buckets (like months since signup).
  3. Aggregating results into a retention matrix — how many users return or convert after X months.
How It Works Internally

We use window functions to find the cohort start (e.g., MIN(order_date) per user). Then we compute time deltas between each event and that start.

Finally, we group by cohort and time difference to count how many users remained active.

Behind the scenes:

  • Each row represents a user-event pair.
  • SQL calculates their cohort and “age” in months.
  • Aggregations summarize retention across cohorts.
Common Use Cases
  • SaaS: customer retention or churn analysis.
  • E-commerce: repeat purchase rates.
  • Apps: daily/weekly/monthly active user tracking.
  • Marketing: cohort-based campaign effectiveness.

💻 Step 3: Code Demonstration & Explanation

Let’s build the logic step-by-step.


🧩 Example 1: Identify Cohorts

SELECT 
  customer_id,
  MIN(order_date) AS cohort_start
FROM orders
GROUP BY customer_id;

Explanation: This finds each customer’s first purchase date — their “signup month”.


🧩 Example 2: Calculate Time Since Signup (Cohort Age)

SELECT 
  customer_id,
  MIN(order_date) AS cohort_start,
  DATE_PART('month', order_date - MIN(order_date) OVER (PARTITION BY customer_id)) AS months_since_signup
FROM orders;

Explanation:

  • MIN(order_date) OVER (PARTITION BY customer_id) finds the first purchase per customer (no subquery needed).
  • The date difference (in months) shows how many months since signup an order occurred.

🧩 Example 3: Count Active Users by Cohort Month

WITH cohort_data AS (
  SELECT 
    customer_id,
    DATE_TRUNC('month', MIN(order_date)) AS cohort_month,
    DATE_TRUNC('month', order_date) AS activity_month,
    DATE_PART('month', AGE(order_date, MIN(order_date) OVER (PARTITION BY customer_id))) AS months_since_signup
  FROM orders
)
SELECT 
  cohort_month,
  months_since_signup,
  COUNT(DISTINCT customer_id) AS active_users
FROM cohort_data
GROUP BY cohort_month, months_since_signup
ORDER BY cohort_month, months_since_signup;

Explanation:

  • cohort_month: identifies when users first joined.
  • months_since_signup: measures retention period.
  • Grouping by both gives a cohort × retention matrix — the foundation for a retention chart.

Sample Output:

cohort_monthmonths_since_signupactive_users
2025-01-0101000
2025-01-011750
2025-01-012500
2025-02-0101200
2025-02-011900

🧩 Example 4: Retention Percentage per Cohort

WITH cohort_counts AS (
  SELECT 
    cohort_month,
    months_since_signup,
    COUNT(DISTINCT customer_id) AS active_users
  FROM cohort_data
  GROUP BY cohort_month, months_since_signup
),
cohort_sizes AS (
  SELECT 
    cohort_month,
    COUNT(DISTINCT customer_id) AS cohort_size
  FROM cohort_data
  WHERE months_since_signup = 0
  GROUP BY cohort_month
)
SELECT 
  c.cohort_month,
  c.months_since_signup,
  ROUND(100.0 * c.active_users / s.cohort_size, 2) AS retention_rate
FROM cohort_counts c
JOIN cohort_sizes s USING (cohort_month)
ORDER BY c.cohort_month, c.months_since_signup;

Explanation:

  • Calculates retention percentage for each cohort over time.
  • months_since_signup = 0 defines the initial cohort size.
  • Joins back to compute relative retention.

⚙️ Step 4: Performance Insights & Optimization

ℹ️

Key Challenges:

Cohort analysis often runs on massive datasets — millions or even billions of events. Each row represents a transaction, and window functions add overhead.

🔍 Optimization Strategies:

  1. Materialized Views

    • Cache intermediate results like cohort assignments.
    • Refresh daily instead of recomputing MIN(order_date) for all users.
    CREATE MATERIALIZED VIEW user_cohorts AS
    SELECT customer_id, MIN(order_date) AS cohort_start
    FROM orders
    GROUP BY customer_id;
  2. Partitioning by Date

    • Partition the orders table by order_date month for faster scans.
    CREATE TABLE orders_2025_01 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
  3. Pre-aggregation

    • Precompute daily or monthly summaries (event counts per user).
    • Use these summaries instead of raw events in your cohort query.
  4. Indexing

    • Create indexes on (customer_id, order_date) and (order_date) for efficient sorting and filtering.
  5. Incremental Updates

    • Append only new data each day rather than recomputing all cohorts.

Result: What once took hours on a 1B-row dataset can run in seconds with precomputation and partitioning.


🎯 Step 5: Interview Perspective & Reasoning

Common Interview Questions
  • “Explain what cohort analysis measures.”
  • “What’s the difference between calendar month and months-since-signup?”
  • “How would you compute retention rates in SQL?”
  • “How would you optimize cohort analysis on large data?”
  • “Why use a materialized view instead of a CTE for this use case?”
Reasoning Expectation

Strong candidates should demonstrate:

  • Understanding of temporal grouping logic.
  • Comfort with window functions (MIN() OVER, AGE(), etc.).
  • Awareness of scalability bottlenecks (I/O, window sort cost).
  • Ability to reason about pre-aggregation pipelines.
Bonus Challenge

Write a query to calculate 7-day retention (users active within 7 days after signup). Then discuss how you’d run it efficiently on a 5-billion-row table — without recomputing everything daily.


🧩 Step 6: Common Mistakes & Debugging Insights

🚨 Common Mistakes (Click to Expand)
  • Using DATEDIFF or AGE without truncating → non-integer intervals lead to wrong buckets.
  • Forgetting to DATE_TRUNC('month', order_date) — causes too granular (daily) grouping.
  • Ignoring DISTINCT customer_id — leads to inflated counts.
  • Running full recomputation on historical data daily — kills performance.
  • Not indexing order_date — forces full table scans for time-based filters.

Debugging Tip: Use EXPLAIN ANALYZE to confirm:

  • Window functions (Min or Rank) aren’t rescanned unnecessarily.
  • Joins on cohort tables use index joins instead of sequential scans.

🧠 Step 7: Key Takeaways

🧠 Main Idea: Cohort analysis groups users by their starting point and tracks retention over time — revealing product stickiness.

💻 How It Works: Use window functions to find cohort start dates, compute time differences, and aggregate user counts by cohort × age.

🚀 Why It Matters: It’s the backbone of growth analytics — showing not just who joined, but who stayed — and how to keep them longer.

Any doubt in content? Ask me anything?
Chat
🤖 👋 Hi there! I'm your learning assistant. If you have any questions about this page or need clarification, feel free to ask!