6.2 Reinforcement Learning in Recommenders


🪄 Step 1: Intuition & Motivation

Core Idea: Instead of asking “What’s the best item now?”, reinforcement learning (RL) asks “What sequence of items keeps the user happy over time?” Recommendations become decisions with consequences: today’s pick affects tomorrow’s interest, trust, and return rate. RL helps balance exploration (try new things) and exploitation (serve known favorites) while optimizing long-term value (retention, lifetime engagement).

Simple Analogy: It’s like a barista who occasionally suggests a new drink. If you love it, you come back more often. A pure “serve the usual” policy may get a click today but risks boredom tomorrow.


🌱 Step 2: Core Concept

What’s Happening Under the Hood?

We model the interaction loop as a decision process:

  • State $s_t$: user context + recent history (session, time, device, embeddings).
  • Action $a_t$: show an item (or a slate of items).
  • Reward $r_t$: click, dwell time, watch %, purchase, or a weighted mix.
  • Transition: user’s next state $s_{t+1}$ after seeing $a_t$.

The system chooses $a_t$ to maximize cumulative reward over time, not just the next click.
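
As a rough sketch, this loop can be written as a stream of transitions. Here `env` and `policy` are hypothetical placeholders for a user simulator (or logged replay) and a recommendation policy; the names and shapes are illustrative, not a fixed API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    """One step of the recommendation loop: (s_t, a_t, r_t, s_{t+1})."""
    state: list        # user context + recent-history features
    action: int        # index of the recommended item (or slate)
    reward: float      # click / dwell time / watch % / weighted mix
    next_state: list   # user context after seeing the item

def run_session(env, policy, max_steps=50) -> List[Transition]:
    """Roll out one user session with a placeholder simulator and policy."""
    trajectory, state = [], env.reset()
    for _ in range(max_steps):
        action = policy.select(state)              # pick an item for this user state
        next_state, reward, done = env.step(action)
        trajectory.append(Transition(state, action, reward, next_state))
        state = next_state
        if done:                                   # user leaves the session
            break
    return trajectory
```
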
Why It Works This Way

Short-term metrics (CTR) can create feedback loops and fatigue. RL explicitly values future impact: a diverse, slightly exploratory feed today can boost tomorrow’s satisfaction and retention, which a myopic optimizer misses.

How It Fits in ML Thinking

  • Bandits: one-step RL (no state transitions) — great for exploration at the top of a ranking.
  • Full RL (MDP): multi-step with dynamics — learns policies that shape long-term behavior.
  • Policy gradients: learn a stochastic policy end-to-end when the ranking pipeline is not differentiable with respect to the logged rewards.

📐 Step 3: Mathematical Foundation

Bandit Formulation (One-Step RL)

Goal: choose item/arm $a$ to maximize expected reward $E[r|a]$ while exploring.

  • $\varepsilon$-greedy: with prob. $\varepsilon$ explore (random), else exploit $\arg\max_a \hat{\mu}_a$.
  • UCB: pick $a_t = \arg\max_a \big[\hat{\mu}_a + c\sqrt{\frac{\ln t}{n_a}}\big]$
    (optimism bonus shrinks as $n_a$ grows).
  • Thompson Sampling (Bernoulli): maintain $p(\mu_a)$ as $\text{Beta}(\alpha_a,\beta_a)$; sample $\tilde{\mu}_a \sim \text{Beta}(\alpha_a,\beta_a)$ and choose $\arg\max_a \tilde{\mu}_a$; update $(\alpha,\beta)$ with successes/failures.

UCB = “be optimistic about the uncertain.”
Thompson = “roll the dice according to current belief.”
$\varepsilon$-greedy = “occasionally try something new.”
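
A minimal sketch of the three exploration rules above, assuming per-arm reward estimates, pull counts, and Beta parameters are tracked elsewhere (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(mean_rewards, epsilon=0.1):
    """With probability epsilon explore a random arm, else exploit the best estimate."""
    if rng.random() < epsilon:
        return int(rng.integers(len(mean_rewards)))
    return int(np.argmax(mean_rewards))

def ucb(mean_rewards, pulls, t, c=2.0):
    """Add an optimism bonus c * sqrt(ln t / n_a) that shrinks as an arm is pulled more."""
    pulls = np.maximum(np.asarray(pulls, dtype=float), 1e-9)  # avoid divide-by-zero
    bonus = c * np.sqrt(np.log(t) / pulls)
    return int(np.argmax(np.asarray(mean_rewards) + bonus))

def thompson_bernoulli(alpha, beta):
    """Sample a plausible click rate per arm from Beta(alpha, beta); pick the best draw."""
    return int(np.argmax(rng.beta(alpha, beta)))

# Thompson update after observing reward in {0, 1} for the chosen arm:
#   alpha[arm] += reward
#   beta[arm]  += 1 - reward
```
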
MDP & Return (Multi-Step RL)

We maximize discounted return $G_t = \sum_{k=0}^{\infty}\gamma^k r_{t+k+1}$, with $0<\gamma<1$.
A policy $\pi_\theta(a|s)$ maps states to action probabilities; we learn $\theta$ to maximize $J(\theta)=E_{\pi_\theta}[G_0]$.
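
As a quick worked illustration of $G_t$, the return can be accumulated backwards over one logged session (the names and numbers below are illustrative):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.9):
    """G_t = r_{t+1} + gamma * G_{t+1}, accumulated backwards over one session."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: rewards [0, 1, 0, 1] with gamma = 0.9
#   G_3 = 1
#   G_2 = 0 + 0.9 * 1    = 0.9
#   G_1 = 1 + 0.9 * 0.9  = 1.81
#   G_0 = 0 + 0.9 * 1.81 = 1.629
```
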
Policy Gradient (REINFORCE) & Baseline

The vanilla gradient:

$$ \nabla_\theta J(\theta)=E_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\big]. $$

To reduce variance, subtract a baseline $b(s_t)$ (e.g., value estimate):

$$ \nabla_\theta J(\theta)\approx E\big[\nabla_\theta \log \pi_\theta(a_t|s_t)\, (G_t - b(s_t))\big]. $$

This encourages actions that did better than expected and discourages worse-than-expected ones.

“Praise above-average actions, nudge down below-average ones.”
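
A minimal REINFORCE sketch with a mean-return baseline, assuming a linear softmax policy over item logits (the linear form, shapes, and names are illustrative, not a production architecture):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, trajectory, returns, lr=0.01):
    """One REINFORCE step with a mean-return baseline.

    theta:      (n_features, n_items) weights of a linear softmax policy
    trajectory: list of (state_features, action) pairs from one logged session
    returns:    array of G_t per step (see discounted_returns above)
    """
    baseline = returns.mean()                       # b(s) approximated by the average return
    for (state, action), g in zip(trajectory, returns):
        probs = softmax(state @ theta)              # pi_theta(a | s)
        grad_log_pi = -np.outer(state, probs)       # d log pi / d theta for all actions
        grad_log_pi[:, action] += state             # indicator term for the action taken
        theta += lr * grad_log_pi * (g - baseline)  # praise above-average actions
    return theta
```
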
Slate & Ranking Considerations

Real systems pick a slate (top-$K$ list). Approaches:

  • Score-and-sort with bandit exploration in candidate generation.
  • Slate-bandits or listwise RL optimizing page-level reward (e.g., total clicks, session time). Reward shaping and position bias corrections are common.
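
One illustrative way to combine score-and-sort with bandit-style exploration is to blend a Thompson draw into the base relevance score before taking the top-$K$; the blend weight and names here are assumptions, not a standard recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_slate(alpha, beta, base_scores, k=10, explore_weight=0.3):
    """Score-and-sort slate with Thompson-style exploration blended in.

    alpha, beta:  per-candidate Beta posteriors over engagement probability
    base_scores:  relevance scores from the usual ranking model
    """
    sampled_ctr = rng.beta(alpha, beta)            # one posterior draw per candidate
    blended = (1 - explore_weight) * np.asarray(base_scores) + explore_weight * sampled_ctr
    return np.argsort(-blended)[:k]                # indices of the top-K slate
```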

🧠 Step 4: Assumptions or Key Ideas

  • Stationarity (often approximate): user dynamics don’t change too fast; otherwise adaptivity is needed.
  • Counterfactual logging: unbiased (or corrected) estimators for off-policy evaluation (propensity scores, IPS/SNIPS; a small sketch follows this list).
  • Exploration is valuable: occasional novelty increases long-term satisfaction and reduces popularity bias.
  • Reward design matters: choose signals aligned with retention (not just clicks).
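
For the counterfactual-logging point above, the IPS and SNIPS estimators can be sketched as follows, where the propensities are the probabilities that the logging and target policies assign to each logged action:

```python
import numpy as np

def ips_estimate(rewards, logging_propensities, target_propensities):
    """Inverse propensity scoring: reweight logged rewards by pi_target / pi_logging."""
    w = np.asarray(target_propensities) / np.asarray(logging_propensities)
    return float(np.mean(w * np.asarray(rewards)))

def snips_estimate(rewards, logging_propensities, target_propensities):
    """Self-normalized IPS: divide by the total weight to reduce variance."""
    w = np.asarray(target_propensities) / np.asarray(logging_propensities)
    return float(np.sum(w * np.asarray(rewards)) / np.sum(w))
```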

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Optimizes long-term outcomes (retention, LTV).
  • Principled handling of exploration vs. exploitation.
  • Bandits give simple, fast wins at serving time.

Limitations:

  • Rewards are hard to design; credit assignment is delayed.
  • Off-policy evaluation is tricky (bias/variance).
  • Full RL can be data- and compute-hungry, and raises safety concerns in production.

Start with bandits in candidate generation and reranking for safe exploration; graduate to policy gradients where long-term effects are measurable. Trade a bit of short-term CTR for durable engagement.

🚧 Step 6: Common Misunderstandings (Optional)

  • “Exploration always hurts metrics.”
    Small, targeted exploration can improve long-term retention and discovery.
  • “Bandits = RL solved.”
    Bandits ignore state dynamics; use full RL when history/context critically shape future rewards.
  • “Any reward works.”
    Misaligned rewards (pure CTR) can inflate clicks but erode trust and session length.

🧩 Step 7: Mini Summary

🧠 What You Learned: RL treats recommendation as sequential decision-making, optimizing not just the next click but the future journey.

⚙️ How It Works: Bandits provide lightweight exploration at serve time; policy gradients learn stochastic policies that maximize discounted long-term reward.

🎯 Why It Matters: Balancing exploration and exploitation prevents stagnation, combats popularity bias, and nurtures sustained user satisfaction.
