6.3 Causal Inference & Debiasing
🪄 Step 1: Intuition & Motivation
Core Idea: Most recommenders learn from historical interactions — what users clicked, watched, or bought. But what if those clicks only represent what users were shown, not what they would have liked if given a fair chance? 🤔 That’s exposure bias — our data doesn’t reflect true preference, only observed behavior. Causal inference helps us fix that by estimating what would have happened if users had been exposed to different items.
Simple Analogy: Imagine a bookstore that only displays bestsellers in front. You’ll think everyone loves bestsellers — but maybe hidden gems never got the chance to shine. Causal methods help uncover the unseen truth behind such biased data. 📚✨
🌱 Step 2: Core Concept
Causal inference in recommendations means untangling cause and effect — figuring out whether a user clicked because they truly liked an item, or simply because the system overexposed it.
Let’s unpack the key ideas step-by-step.
What’s Happening Under the Hood?
The Problem: Biased Feedback Loops
Your recommender suggests items → users click → model retrains on those clicks → suggests similar items again. This causes:
- Exposure bias: Only popular items get shown → others never get exposure.
- Selection bias: Logged data isn’t random — it reflects your model’s past choices.
Over time, your system “locks in” to a narrow worldview — echo chambers form, diversity collapses, and long-tail items vanish.
So, you’re not learning user preferences — you’re learning your own past mistakes more confidently. 😬
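To make the lock-in effect concrete, here is a minimal, self-contained simulation (NumPy only; all numbers are made up for illustration): a system that always re-shows its top-clicked items ends up exposing only a handful of them, regardless of the hidden preferences.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 50
true_pref = rng.uniform(0.05, 0.5, size=n_items)  # hidden "true" click probabilities
click_counts = np.ones(n_items)                   # system's belief, seeded uniformly

for _ in range(200):
    shown = np.argsort(-click_counts)[:5]          # policy: show the 5 most-clicked items so far
    clicks = rng.random(5) < true_pref[shown]      # users click according to true preference
    click_counts[shown] += clicks                  # "retrain" on the logged clicks only

ever_clicked = (click_counts > 1).sum()
print(f"Items that ever received a click: {ever_clicked}/{n_items}")
# Typically only a handful of items keep getting shown -> exposure bias locks in.
```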
Why It Works This Way
Most recommenders optimize on observed data:
$$ \mathcal{L} = \sum_{(u,i) \in \text{logged}} \ell(\hat{r}_{ui}, r_{ui}) $$
But not all $(u,i)$ pairs are equally likely to appear in logs — only those the system chose to show. This means the dataset is biased by the exposure policy.
We need to correct for this by re-weighting each sample by how likely it was to be observed. That’s where Inverse Propensity Scoring (IPS) comes in.
How It Fits in ML Thinking
In machine learning terms, IPS and causal models allow us to debias training data the same way we debias experiments — by adjusting for unequal exposure probability. This transforms your model from “fitting the past” to “inferring the truth.”
Causal thinking turns your recommender into a scientist — asking, “What if I had recommended something else?” rather than just “What did happen?”
📐 Step 3: Mathematical Foundation
Let’s walk through the mathematical backbone intuitively.
Inverse Propensity Scoring (IPS)
Suppose:
- $y_{ui}$ = user’s click (1 or 0)
- $\pi_0(i|u)$ = probability the old system showed item $i$ to user $u$ (logging policy)
- $\pi(i|u)$ = new system’s policy
We estimate unbiased loss:
$$ \mathcal{L}_{IPS} = \frac{1}{N} \sum_{(u,i)} \frac{\pi(i|u)}{\pi_0(i|u)} \, \ell(\hat{y}_{ui}, y_{ui}) $$
Each sample is reweighted by the inverse of its propensity (probability of being exposed). If an item was rarely shown ($\pi_0(i|u)$ small), its few observations get more weight to counter underexposure.
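A minimal sketch of this estimator in NumPy, assuming we already have logged clicks, model predictions, the logging propensities $\pi_0$, and the new policy’s probabilities $\pi$. The function name `ips_loss` and the toy numbers are illustrative, not from any library:

```python
import numpy as np

def ips_loss(y, y_hat, pi_new, pi_log, eps=1e-6):
    """IPS estimate of the loss under the new policy.

    y      : observed clicks (0/1) for logged (user, item) pairs
    y_hat  : model's predicted click probabilities for those pairs
    pi_new : probability the NEW policy would show each pair
    pi_log : probability the LOGGING policy showed each pair (propensity)
    """
    w = pi_new / np.clip(pi_log, eps, None)       # importance weights pi / pi_0
    # Per-sample loss: binary cross-entropy on the logged clicks.
    ell = -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    return np.mean(w * ell)                       # (1/N) * sum of w * ell

# Toy logged data: 4 impressions with their logging propensities.
y      = np.array([1, 0, 1, 0])
y_hat  = np.array([0.8, 0.3, 0.6, 0.2])
pi_log = np.array([0.50, 0.40, 0.05, 0.05])       # rarely-shown items get small pi_0
pi_new = np.full(4, 0.25)                         # e.g. a uniform target policy
print(ips_loss(y, y_hat, pi_new, pi_log))
```

Notice how the two samples with $\pi_0 = 0.05$ dominate the average — exactly the variance problem SNIPS addresses next.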
Self-Normalized IPS (SNIPS)
IPS can produce unstable estimates when propensities are small (huge weights). SNIPS normalizes the weights to control variance:
$$ \mathcal{L}_{SNIPS} = \frac{\sum_{(u,i)} w_{ui} \, \ell(\hat{y}_{ui}, y_{ui})}{\sum_{(u,i)} w_{ui}}, \quad w_{ui} = \frac{\pi(i|u)}{\pi_0(i|u)} $$
SNIPS = “IPS with a seatbelt” — more stable, less noisy, but slightly biased toward frequent samples.
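A matching sketch for SNIPS, reusing the same hypothetical toy numbers as the IPS example (again, `snips_loss` and the data are illustrative, not a library API):

```python
import numpy as np

def snips_loss(y, y_hat, pi_new, pi_log, eps=1e-6):
    """Self-normalized IPS: normalize by the sum of weights instead of N."""
    w = pi_new / np.clip(pi_log, eps, None)        # w_ui = pi / pi_0
    ell = -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    return np.sum(w * ell) / np.sum(w)             # sum(w * ell) / sum(w)

# Same toy impressions as in the IPS sketch above.
y      = np.array([1, 0, 1, 0])
y_hat  = np.array([0.8, 0.3, 0.6, 0.2])
pi_log = np.array([0.50, 0.40, 0.05, 0.05])
pi_new = np.full(4, 0.25)
print(snips_loss(y, y_hat, pi_new, pi_log))        # less dominated by the tiny-propensity samples
```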
Causal Graph Perspective
We can visualize the recommender system as a causal graph:
```mermaid
graph TD
    U[User Preferences] --> E[Exposure]
    E --> Y[Observed Click]
    U --> Y
    I[Item Features] --> E
    I --> Y
```
- Exposure (E) is the treatment we want to reason about, while user preferences (U) and item features (I) act as confounders: they drive both what gets shown and what gets clicked.
- To learn true causation, we must adjust for this unequal, preference-driven exposure (like controlling for age in medical studies), which is exactly what propensity weighting does.
Using causal inference, we aim to estimate:
$$ P(Y | do(E=i)) $$
i.e., what would the click probability be if we forced exposure to item $i$? This “do” operator breaks feedback loops and isolates true preference signals.
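A small simulation makes the gap between the observational estimate $P(Y|E=i)$ and the interventional quantity $P(Y|do(E=i))$ concrete. Assuming the logging propensities are known and unconfoundedness holds, weighting shown users by $1/\pi_0$ recovers the click rate a randomized exposure experiment would measure (all numbers below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_users = 100_000

# Hidden user affinity for one item i: fans click more often than non-fans.
is_fan = rng.random(n_users) < 0.2
p_click = np.where(is_fan, 0.6, 0.1)

# The logging policy over-exposes the item to fans (propensity depends on affinity).
propensity = np.where(is_fan, 0.9, 0.1)
shown = rng.random(n_users) < propensity
clicked = shown & (rng.random(n_users) < p_click)

# Naive observational estimate: average clicks among users who saw the item.
naive = clicked[shown].mean()

# IPS estimate of P(Y=1 | do(E=i)): weight each shown user by 1/propensity.
ips = np.mean(shown * clicked / propensity)

truth = p_click.mean()   # what a randomized exposure experiment would measure
print(f"naive={naive:.3f}  ips={ips:.3f}  truth={truth:.3f}")
# The naive estimate is inflated (~0.45) because fans were over-exposed;
# the IPS estimate lands near the true interventional rate (~0.20).
```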
🧠 Step 4: Assumptions or Key Ideas
- Unconfoundedness: All variables affecting both exposure and outcome are observed (or well-modeled).
- Positivity: Every user–item pair has some non-zero exposure probability (no missing possibilities); in practice this is checked and enforced via propensity clipping (see the sketch after this list).
- Stability: User preferences don’t change drastically during logging.
- Feedback mitigation: The goal isn’t perfect neutrality — it’s controlled curiosity.
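As referenced in the positivity point, here is a hypothetical helper (the name `check_and_clip_propensities` and the threshold are assumptions, not a standard API) that checks no logged pair has a zero estimated propensity and clips tiny values so the $1/\pi_0$ weights stay bounded — trading a little bias for variance reduction, much like SNIPS:

```python
import numpy as np

def check_and_clip_propensities(pi_log, min_prop=0.01):
    """Sanity-check the positivity assumption and clip tiny propensities.

    Clipping keeps 1/pi_0 weights bounded (variance control) at the cost
    of a small bias.
    """
    pi_log = np.asarray(pi_log, dtype=float)
    if np.any(pi_log <= 0):
        raise ValueError("Positivity violated: some logged pairs have zero exposure probability.")
    share_clipped = np.mean(pi_log < min_prop)
    if share_clipped > 0:
        print(f"Clipping {share_clipped:.1%} of propensities below {min_prop}")
    return np.clip(pi_log, min_prop, 1.0)

weights = 1.0 / check_and_clip_propensities([0.5, 0.2, 0.004, 0.03])
print(weights)
```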
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Corrects historical bias from past exposure patterns.
- Enables fairer, more diverse recommendations.
- Bridges the gap between observational and experimental data.
- Encourages long-term user trust and satisfaction.

Limitations & trade-offs:
- Requires accurate logging of exposure probabilities.
- Propensity estimates can be unstable or unavailable.
- SNIPS introduces mild bias for variance reduction.
- True causal structure is often partially unobservable.
🚧 Step 6: Common Misunderstandings
- “Bias only comes from users.” Nope — much of it comes from the recommender’s own historical choices.
- “Debiasing means removing all preferences.” No — it means distinguishing genuine preferences from system-induced ones.
- “Causal inference is too theoretical.” Modern systems (YouTube, TikTok, Pinterest) already use causal estimators to ensure fairness and diversity in real time.
🧩 Step 7: Mini Summary
🧠 What You Learned: Recommenders often reinforce their own biases because they train on their own outputs. Causal inference and counterfactual learning estimate true user preferences by correcting for exposure and selection bias.
⚙️ How It Works: IPS and SNIPS reweight data to account for under- or over-exposure, while causal graphs model what would happen under different recommendations.
🎯 Why It Matters: Causal methods turn your recommender from a self-reinforcing loop into a fair, exploratory system that uncovers real user interests — not just popular trends.