6.2 Reinforcement Learning in Recommenders
🪄 Step 1: Intuition & Motivation
Core Idea: Instead of asking “What’s the best item now?”, reinforcement learning (RL) asks “What sequence of items keeps the user happy over time?” Recommendations become decisions with consequences: today’s pick affects tomorrow’s interest, trust, and return rate. RL helps balance exploration (try new things) and exploitation (serve known favorites) while optimizing long-term value (retention, lifetime engagement).
Simple Analogy: It’s like a barista who occasionally suggests a new drink. If you love it, you come back more often. A pure “serve the usual” policy may get a click today but risks boredom tomorrow.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
We model the interaction loop as a decision process:
- State $s_t$: user context + recent history (session, time, device, embeddings).
- Action $a_t$: show an item (or a slate of items).
- Reward $r_t$: click, dwell time, watch %, purchase, or a weighted mix.
- Transition: the user’s next state $s_{t+1}$ after seeing $a_t$.

The system chooses $a_t$ to maximize cumulative reward over time, not just the next click; a minimal sketch of this loop follows below.
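Here is a minimal sketch of that interaction loop, assuming a hypothetical `policy` object with an `.act(state)` method and a `user_env` simulator with `.reset()` / `.step(action)`; none of these names are a real library API, just placeholders for illustration.

```python
# A minimal sketch of the recommendation MDP loop described above.
# `policy` and `user_env` are assumed, illustrative interfaces.
from dataclasses import dataclass, field
from typing import List

@dataclass
class State:
    """User context + recent history, mirroring s_t above."""
    user_id: int
    recent_items: List[int] = field(default_factory=list)  # last clicked/viewed item ids

def run_episode(policy, user_env, horizon: int = 20, gamma: float = 0.99) -> float:
    """Roll out one session and return the discounted cumulative reward."""
    state = user_env.reset()                        # initial user state s_0
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy.act(state)                  # recommend an item (or slate) a_t
        reward, state, done = user_env.step(action) # observe r_t and next state s_{t+1}
        total += discount * reward                  # accumulate discounted reward
        discount *= gamma
        if done:                                    # user ends the session
            break
    return total
```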
Why It Works This Way
Each recommendation changes the user’s state (interest, trust, fatigue), so a greedy “best item now” policy is myopic; optimizing cumulative reward accounts for those downstream effects.
How It Fits in ML Thinking
- Bandits: 1-step RL (no state transitions) — great for exploration at the top of a ranking.
- Full RL (MDP): multi-step with dynamics — learns policies that shape long-term behavior.
- Policy gradients: learn a stochastic policy end-to-end when you cannot backpropagate through the ranking decision or the logged rewards.
📐 Step 3: Mathematical Foundation
Bandit Formulation (One-Step RL)
Goal: choose item/arm $a$ to maximize expected reward $E[r|a]$ while exploring.
- $\varepsilon$-greedy: with prob. $\varepsilon$ explore (random), else exploit $\arg\max_a \hat{\mu}_a$.
- UCB: pick $a_t = \arg\max_a \left(\hat{\mu}_a + c\sqrt{\frac{\ln t}{n_a}}\right)$ (the optimism bonus shrinks as $n_a$ grows).
- Thompson Sampling (Bernoulli): maintain a posterior $p(\mu_a) = \text{Beta}(\alpha_a,\beta_a)$; sample $\tilde{\mu}_a \sim \text{Beta}(\alpha_a,\beta_a)$, choose $\arg\max_a \tilde{\mu}_a$, and update $(\alpha_a,\beta_a)$ with observed successes/failures.
Thompson = “roll the dice according to current belief.”
$\varepsilon$-greedy = “occasionally try something new.”
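Here is a minimal sketch of the three rules above for Bernoulli (click) rewards, keeping per-arm statistics in plain NumPy arrays; the class and method names are illustrative, not tied to any library.

```python
# Sketch of epsilon-greedy, UCB, and Thompson Sampling over Bernoulli arms.
import numpy as np

rng = np.random.default_rng(0)

class BernoulliBandits:
    def __init__(self, n_arms: int):
        self.successes = np.zeros(n_arms)   # clicks per arm
        self.failures = np.zeros(n_arms)    # non-clicks per arm

    @property
    def pulls(self):
        return self.successes + self.failures

    def update(self, arm: int, reward: int):
        self.successes[arm] += reward
        self.failures[arm] += 1 - reward

    def epsilon_greedy(self, eps: float = 0.1) -> int:
        if rng.random() < eps:                            # explore: random arm
            return int(rng.integers(len(self.successes)))
        means = self.successes / np.maximum(self.pulls, 1)
        return int(np.argmax(means))                      # exploit: best empirical mean

    def ucb(self, t: int, c: float = 2.0) -> int:
        pulls = np.maximum(self.pulls, 1e-9)
        means = self.successes / pulls
        bonus = c * np.sqrt(np.log(max(t, 2)) / pulls)    # optimism bonus shrinks with n_a
        return int(np.argmax(means + bonus))

    def thompson(self) -> int:
        samples = rng.beta(self.successes + 1, self.failures + 1)  # Beta(alpha, beta) draw
        return int(np.argmax(samples))                    # "roll the dice" per current belief
```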
MDP & Return (Multi-Step RL)
A policy $\pi_\theta(a|s)$ maps states to action probabilities. With discounted return $G_t=\sum_{k=0}^{\infty}\gamma^{k} r_{t+k}$ (discount factor $\gamma\in[0,1)$), we learn $\theta$ to maximize $J(\theta)=E_{\pi_\theta}[G_0]$.
Policy Gradient (REINFORCE) & Baseline
The vanilla gradient:
$$ \nabla_\theta J(\theta)=E_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\big]. $$

To reduce variance, subtract a baseline $b(s_t)$ (e.g., a value estimate):

$$ \nabla_\theta J(\theta)\approx E\big[\nabla_\theta \log \pi_\theta(a_t|s_t)\, (G_t - b(s_t))\big]. $$

This encourages actions that did better than expected and discourages worse-than-expected ones.
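Here is a minimal sketch of REINFORCE with a baseline for a softmax policy over a small item catalogue. Assumptions not from the text: linear logits $\theta x$ per item, a running-average scalar baseline standing in for a learned value estimate $b(s_t)$, and episodes logged as `(state_features, action, reward)` tuples.

```python
# Sketch of a REINFORCE update with a simple running-average baseline.
import numpy as np

def softmax(z):
    z = z - z.max()                              # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, baseline, episode, gamma=0.99, lr=0.01, lr_b=0.05):
    """One policy-gradient pass over a single episode of (x, a, r) tuples."""
    # Compute discounted returns G_t backwards through the episode.
    returns, G = [], 0.0
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    for (x, a, _), G_t in zip(episode, returns):
        probs = softmax(theta @ x)               # pi_theta(.|s) from linear logits
        advantage = G_t - baseline               # better-than-expected => positive
        grad_log_pi = -np.outer(probs, x)        # grad of log pi(a|s): softmax term...
        grad_log_pi[a] += x                      # ...plus the term for the taken action
        theta = theta + lr * advantage * grad_log_pi  # ascend the policy-gradient estimate
        baseline += lr_b * (G_t - baseline)      # track average return as the baseline
    return theta, baseline
```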
Slate & Ranking Considerations
Real systems pick a slate (top-$K$ list). Approaches:
- Score-and-sort with bandit exploration in candidate generation.
- Slate bandits or listwise RL optimizing a page-level reward (e.g., total clicks, session time).

Reward shaping and position-bias corrections are common; a minimal slate-selection sketch follows below.
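Here is a minimal sketch of the first approach above (score-and-sort with bandit exploration) plus a position-weighted page-level reward. The Beta posterior draw, the mixing weight, and the DCG-style position curve are illustrative assumptions, not prescriptions from the text.

```python
# Sketch: mix exploit scores with a posterior sample, take top-K, and score the slate.
import numpy as np

rng = np.random.default_rng(0)

def select_slate(relevance, clicks, impressions, k=5, explore_weight=0.3):
    """Blend relevance scores with a Thompson-style CTR sample and return top-K item ids."""
    explore = rng.beta(clicks + 1, impressions - clicks + 1)   # CTR posterior draw per item
    score = (1 - explore_weight) * relevance + explore_weight * explore
    return np.argsort(-score)[:k]                              # score-and-sort slate

def slate_reward(item_rewards, position_bias=None):
    """Page-level reward: position-weighted sum of per-item rewards."""
    item_rewards = np.asarray(item_rewards, dtype=float)
    if position_bias is None:
        position_bias = 1.0 / np.log2(np.arange(2, len(item_rewards) + 2))  # DCG-style decay
    return float((position_bias * item_rewards).sum())
```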
🧠 Step 4: Assumptions or Key Ideas
- Stationarity (often approximate): user dynamics don’t change too fast; otherwise adaptivity is needed.
- Counterfactual logging: off-policy evaluation needs unbiased (or corrected) estimators such as inverse propensity scoring (IPS/SNIPS); see the sketch after this list.
- Exploration is valuable: occasional novelty increases long-term satisfaction and reduces popularity bias.
- Reward design matters: choose signals aligned with retention (not just clicks).
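Here is a minimal sketch of IPS and self-normalized IPS (SNIPS) over logged `(reward, logging propensity, target-policy probability)` tuples; the weight-clipping level is an illustrative assumption. In practice the logging propensities come from the production policy’s recorded sampling probabilities.

```python
# Sketch of IPS / SNIPS off-policy estimates of a target policy's average reward.
import numpy as np

def ips_snips(rewards, logging_props, target_probs, clip=10.0):
    """Return (IPS, SNIPS) estimates from logged bandit feedback."""
    rewards = np.asarray(rewards, dtype=float)
    w = np.asarray(target_probs, dtype=float) / np.asarray(logging_props, dtype=float)
    w = np.minimum(w, clip)                          # clip importance weights to limit variance
    ips = float(np.mean(w * rewards))                # unbiased (before clipping), high variance
    snips = float(np.sum(w * rewards) / np.sum(w))   # self-normalized: slightly biased, lower variance
    return ips, snips
```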
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Optimizes long-term outcomes (retention, LTV).
- Principled handling of exploration vs. exploitation.
- Bandits give simple, fast wins at serving time.

Limitations:
- Rewards are hard to design; credit assignment is delayed.
- Off-policy evaluation is tricky (bias/variance trade-offs).
- Full RL can be data- and compute-hungry and raises safety concerns in production.
🚧 Step 6: Common Misunderstandings (Optional)
🚨 Common Misunderstandings
- “Exploration always hurts metrics.” Small, targeted exploration can improve long-term retention and discovery.
- “Bandits = RL solved.” Bandits ignore state dynamics; use full RL when history/context critically shape future rewards.
- “Any reward works.” Misaligned rewards (pure CTR) can inflate clicks but erode trust and session length.
🧩 Step 7: Mini Summary
🧠 What You Learned: RL treats recommendation as sequential decision-making, optimizing not just the next click but the future journey.
⚙️ How It Works: Bandits provide lightweight exploration at serve time; policy gradients learn stochastic policies that maximize discounted long-term reward.
🎯 Why It Matters: Balancing exploration and exploitation prevents stagnation, combats popularity bias, and nurtures sustained user satisfaction.