6.2 Reinforcement Learning in Recommenders
🪄 Step 1: Intuition & Motivation
Core Idea: Instead of asking “What’s the best item now?”, reinforcement learning (RL) asks “What sequence of items keeps the user happy over time?” Recommendations become decisions with consequences: today’s pick affects tomorrow’s interest, trust, and return rate. RL helps balance exploration (try new things) and exploitation (serve known favorites) while optimizing long-term value (retention, lifetime engagement).
Simple Analogy: It’s like a barista who occasionally suggests a new drink. If you love it, you come back more often. A pure “serve the usual” policy may get a click today but risks boredom tomorrow.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
We model the interaction loop as a decision process:
- State $s_t$: user context + recent history (session, time, device, embeddings).
- Action $a_t$: show an item (or a slate of items).
- Reward $r_t$: click, dwell time, watch %, purchase, or a weighted mix.
- Transition: the user’s next state $s_{t+1}$ after seeing $a_t$.

The system chooses $a_t$ to maximize cumulative reward over time, not just the next click; a minimal sketch of this loop follows below.
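Here is a minimal sketch of that interaction loop, assuming a hypothetical `policy` object with an `.act(state)` method and a `user_env` simulator with `.reset()` / `.step(action)`; none of these names are a real library API, just placeholders for illustration.

```python
# A minimal sketch of the recommendation MDP loop described above.
# `policy` and `user_env` are assumed, illustrative interfaces.
from dataclasses import dataclass, field
from typing import List

@dataclass
class State:
    """User context + recent history, mirroring s_t above."""
    user_id: int
    recent_items: List[int] = field(default_factory=list)  # last clicked/viewed item ids

def run_episode(policy, user_env, horizon: int = 20, gamma: float = 0.99) -> float:
    """Roll out one session and return the discounted cumulative reward."""
    state = user_env.reset()                        # initial user state s_0
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy.act(state)                  # recommend an item (or slate) a_t
        reward, state, done = user_env.step(action) # observe r_t and next state s_{t+1}
        total += discount * reward                  # accumulate discounted reward
        discount *= gamma
        if done:                                    # user ends the session
            break
    return total
```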
Why It Works This Way
Each recommendation changes the user’s state (interest, trust, fatigue), so a greedy “best item now” policy is myopic; optimizing cumulative reward accounts for those downstream effects.
How It Fits in ML Thinking
- Bandits: 1-step RL (no state transitions) — great for exploration at the top of a ranking.
- Full RL (MDP): multi-step with dynamics — learns policies that shape long-term behavior.
- Policy gradients: learn a stochastic policy end-to-end when you cannot backpropagate through the ranking decision or the logged rewards.
📐 Step 3: Mathematical Foundation
Bandit Formulation (One-Step RL)
Goal: choose item/arm $a$ to maximize expected reward $E[r|a]$ while exploring.
- $\varepsilon$-greedy: with prob. $\varepsilon$ explore (random), else exploit $\arg\max_a \hat{\mu}_a$.
- UCB: pick $a_t = \arg\max_a \left(\hat{\mu}_a + c\sqrt{\frac{\ln t}{n_a}}\right)$ (the optimism bonus shrinks as $n_a$ grows).
- Thompson Sampling (Bernoulli): maintain a posterior $p(\mu_a) = \text{Beta}(\alpha_a,\beta_a)$; sample $\tilde{\mu}_a \sim \text{Beta}(\alpha_a,\beta_a)$, choose $\arg\max_a \tilde{\mu}_a$, and update $(\alpha_a,\beta_a)$ with observed successes/failures.
Thompson = “roll the dice according to current belief.”
$\varepsilon$-greedy = “occasionally try something new.”
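Here is a minimal sketch of the three rules above for Bernoulli (click) rewards, keeping per-arm statistics in plain NumPy arrays; the class and method names are illustrative, not tied to any library.

```python
# Sketch of epsilon-greedy, UCB, and Thompson Sampling over Bernoulli arms.
import numpy as np

rng = np.random.default_rng(0)

class BernoulliBandits:
    def __init__(self, n_arms: int):
        self.successes = np.zeros(n_arms)   # clicks per arm
        self.failures = np.zeros(n_arms)    # non-clicks per arm

    @property
    def pulls(self):
        return self.successes + self.failures

    def update(self, arm: int, reward: int):
        self.successes[arm] += reward
        self.failures[arm] += 1 - reward

    def epsilon_greedy(self, eps: float = 0.1) -> int:
        if rng.random() < eps:                            # explore: random arm
            return int(rng.integers(len(self.successes)))
        means = self.successes / np.maximum(self.pulls, 1)
        return int(np.argmax(means))                      # exploit: best empirical mean

    def ucb(self, t: int, c: float = 2.0) -> int:
        pulls = np.maximum(self.pulls, 1e-9)
        means = self.successes / pulls
        bonus = c * np.sqrt(np.log(max(t, 2)) / pulls)    # optimism bonus shrinks with n_a
        return int(np.argmax(means + bonus))

    def thompson(self) -> int:
        samples = rng.beta(self.successes + 1, self.failures + 1)  # Beta(alpha, beta) draw
        return int(np.argmax(samples))                    # "roll the dice" per current belief
```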
MDP & Return (Multi-Step RL)
A policy $\pi_\theta(a|s)$ maps states to action probabilities. With discounted return $G_t=\sum_{k=0}^{\infty}\gamma^{k} r_{t+k}$ (discount factor $\gamma\in[0,1)$), we learn $\theta$ to maximize $J(\theta)=E_{\pi_\theta}[G_0]$.
Policy Gradient (REINFORCE) & Baseline
The vanilla gradient:
$$ \nabla_\theta J(\theta)=E_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\big]. $$

To reduce variance, subtract a baseline $b(s_t)$ (e.g., a value estimate):

$$ \nabla_\theta J(\theta)\approx E\big[\nabla_\theta \log \pi_\theta(a_t|s_t)\, (G_t - b(s_t))\big]. $$

This encourages actions that did better than expected and discourages worse-than-expected ones.
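Here is a minimal sketch of REINFORCE with a baseline for a softmax policy over a small item catalogue. Assumptions not from the text: linear logits $\theta x$ per item, a running-average scalar baseline standing in for a learned value estimate $b(s_t)$, and episodes logged as `(state_features, action, reward)` tuples.

```python
# Sketch of a REINFORCE update with a simple running-average baseline.
import numpy as np

def softmax(z):
    z = z - z.max()                              # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, baseline, episode, gamma=0.99, lr=0.01, lr_b=0.05):
    """One policy-gradient pass over a single episode of (x, a, r) tuples."""
    # Compute discounted returns G_t backwards through the episode.
    returns, G = [], 0.0
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    for (x, a, _), G_t in zip(episode, returns):
        probs = softmax(theta @ x)               # pi_theta(.|s) from linear logits
        advantage = G_t - baseline               # better-than-expected => positive
        grad_log_pi = -np.outer(probs, x)        # grad of log pi(a|s): softmax term...
        grad_log_pi[a] += x                      # ...plus the term for the taken action
        theta = theta + lr * advantage * grad_log_pi  # ascend the policy-gradient estimate
        baseline += lr_b * (G_t - baseline)      # track average return as the baseline
    return theta, baseline
```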
Slate & Ranking Considerations
Real systems pick a slate (top-$K$ list). Approaches:
- Score-and-sort with bandit exploration in candidate generation.
- Slate bandits or listwise RL optimizing a page-level reward (e.g., total clicks, session time).

Reward shaping and position-bias corrections are common; a minimal slate-selection sketch follows below.
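Here is a minimal sketch of the first approach above (score-and-sort with bandit exploration) plus a position-weighted page-level reward. The Beta posterior draw, the mixing weight, and the DCG-style position curve are illustrative assumptions, not prescriptions from the text.

```python
# Sketch: mix exploit scores with a posterior sample, take top-K, and score the slate.
import numpy as np

rng = np.random.default_rng(0)

def select_slate(relevance, clicks, impressions, k=5, explore_weight=0.3):
    """Blend relevance scores with a Thompson-style CTR sample and return top-K item ids."""
    explore = rng.beta(clicks + 1, impressions - clicks + 1)   # CTR posterior draw per item
    score = (1 - explore_weight) * relevance + explore_weight * explore
    return np.argsort(-score)[:k]                              # score-and-sort slate

def slate_reward(item_rewards, position_bias=None):
    """Page-level reward: position-weighted sum of per-item rewards."""
    item_rewards = np.asarray(item_rewards, dtype=float)
    if position_bias is None:
        position_bias = 1.0 / np.log2(np.arange(2, len(item_rewards) + 2))  # DCG-style decay
    return float((position_bias * item_rewards).sum())
```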
🧠 Step 4: Assumptions or Key Ideas
- Stationarity (often approximate): user dynamics don’t change too fast; otherwise adaptivity is needed.
- Counterfactual logging: off-policy evaluation needs unbiased (or corrected) estimators such as inverse propensity scoring (IPS/SNIPS); see the sketch after this list.
- Exploration is valuable: occasional novelty increases long-term satisfaction and reduces popularity bias.
- Reward design matters: choose signals aligned with retention (not just clicks).
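Here is a minimal sketch of IPS and self-normalized IPS (SNIPS) over logged `(reward, logging propensity, target-policy probability)` tuples; the weight-clipping level is an illustrative assumption. In practice the logging propensities come from the production policy’s recorded sampling probabilities.

```python
# Sketch of IPS / SNIPS off-policy estimates of a target policy's average reward.
import numpy as np

def ips_snips(rewards, logging_props, target_probs, clip=10.0):
    """Return (IPS, SNIPS) estimates from logged bandit feedback."""
    rewards = np.asarray(rewards, dtype=float)
    w = np.asarray(target_probs, dtype=float) / np.asarray(logging_props, dtype=float)
    w = np.minimum(w, clip)                          # clip importance weights to limit variance
    ips = float(np.mean(w * rewards))                # unbiased (before clipping), high variance
    snips = float(np.sum(w * rewards) / np.sum(w))   # self-normalized: slightly biased, lower variance
    return ips, snips
```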
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Optimizes long-term outcomes (retention, LTV).
- Principled handling of exploration vs. exploitation.
- Bandits give simple, fast wins at serving time.

Limitations:
- Rewards are hard to design; credit assignment is delayed.
- Off-policy evaluation is tricky (bias/variance trade-offs).
- Full RL can be data- and compute-hungry and raises safety concerns in production.
🚧 Step 6: Common Misunderstandings (Optional)
🚨 Common Misunderstandings
- “Exploration always hurts metrics.” Small, targeted exploration can improve long-term retention and discovery.
- “Bandits = RL solved.” Bandits ignore state dynamics; use full RL when history/context critically shape future rewards.
- “Any reward works.” Misaligned rewards (pure CTR) can inflate clicks but erode trust and session length.
🧩 Step 7: Mini Summary
🧠 What You Learned: RL treats recommendation as sequential decision-making, optimizing not just the next click but the future journey.
⚙️ How It Works: Bandits provide lightweight exploration at serve time; policy gradients learn stochastic policies that maximize discounted long-term reward.
🎯 Why It Matters: Balancing exploration and exploitation prevents stagnation, combats popularity bias, and nurtures sustained user satisfaction.