3.3 Sequential and Contextual Models
🪄 Step 1: Intuition & Motivation
Core Idea: Matrix Factorization and Neural CF assume that user preferences are static — that your tastes don’t change much over time. But in reality, human preferences evolve:
You might watch romantic comedies on weekends, documentaries on weekdays, and Christmas movies in December.
Sequential and Contextual Models capture this dynamic behavior — they model what you liked yesterday to predict what you’ll like today.
Simple Analogy: Think of a recommender as your best friend who notices patterns over time. If you’ve been binge-watching Marvel movies, they won’t recommend a slow romance tonight — they’ll say,
“You’re clearly in a superhero mood — how about Doctor Strange next?” 🦸♂️
That’s sequential modeling — predicting the next action based on recent history.
🌱 Step 2: Core Concept
The goal is to model the sequence of user-item interactions over time. Instead of treating each interaction as independent, we view it as a temporal trajectory:
$$ S_u = [i_1, i_2, i_3, ..., i_T] $$

where $S_u$ is user $u$’s interaction sequence over time.
The model tries to predict:
What’s the next item ($i_{T+1}$) this user will interact with?
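To make this concrete, here is a minimal sketch (plain Python, hypothetical item IDs) of how one user’s interaction sequence is turned into training examples for next-item prediction:

```python
# A user's interaction history, ordered by time (hypothetical item IDs).
sequence = [12, 7, 99, 34, 5]   # S_u = [i_1, ..., i_T]

# Next-item prediction: each prefix of the sequence is an input,
# and the item that immediately follows it is the target.
training_pairs = [
    (sequence[:t], sequence[t])  # (history up to step t, next item)
    for t in range(1, len(sequence))
]

for history, target in training_pairs:
    print(history, "->", target)
# [12] -> 7
# [12, 7] -> 99
# ...
```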
To do this, modern recommenders use sequence models like:
- RNNs / GRUs → capture sequential dependencies
- Transformers → capture long-range patterns using self-attention
Let’s unwrap these intuitively.
What’s Happening Under the Hood?
🌀 Recurrent Neural Networks (RNNs)
RNNs process interactions one at a time, maintaining a hidden state that summarizes past behavior. For a sequence of items $[i_1, i_2, …, i_T]$:
$$ h_t = f(Wx_t + Uh_{t-1}) $$

- $x_t$: embedding of item $i_t$
- $h_t$: hidden state (memory of what came before)
At each step, $h_t$ captures cumulative context — like “User recently watched action movies.”
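Here is a minimal NumPy sketch of that recurrence, using tanh as the nonlinearity $f$; the item embeddings and weight matrices are randomly initialized purely for illustration:

```python
import numpy as np

d = 8                                    # embedding / hidden dimension
rng = np.random.default_rng(0)

item_embeddings = rng.normal(size=(100, d))   # x_t lookup table (100 items)
W = rng.normal(size=(d, d)) * 0.1             # input-to-hidden weights
U = rng.normal(size=(d, d)) * 0.1             # hidden-to-hidden weights

sequence = [12, 7, 99, 34]               # user's item IDs, ordered by time
h = np.zeros(d)                          # h_0: empty memory

for item_id in sequence:
    x_t = item_embeddings[item_id]       # embedding of item i_t
    h = np.tanh(W @ x_t + U @ h)         # h_t = f(W x_t + U h_{t-1})

print(h.shape)   # (8,) — a single vector summarizing the whole history
```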
But RNNs struggle with long-term dependencies — they forget older preferences over time.
🧠 Gated Recurrent Units (GRUs)
GRUs improve memory handling with gates — they decide what to remember or forget:
$$ h_t = (1 - z_t)h_{t-1} + z_t \tilde{h}_t $$

- $z_t$: update gate → how much new info to take in
- $\tilde{h}_t$: candidate hidden state
This helps the model balance recency (latest items) with stability (long-term taste).
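In practice the gates are rarely written by hand; a GRU-based next-item recommender can be sketched with PyTorch’s built-in GRU layer (layer sizes and catalog size here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class GRURecommender(nn.Module):
    """Minimal GRU-based model: embed items, run a GRU, score the catalog."""

    def __init__(self, num_items: int, d: int = 64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, d)
        self.gru = nn.GRU(d, d, batch_first=True)    # update/reset gates handled internally
        self.output = nn.Linear(d, num_items)        # scores over the item catalog

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        x = self.item_emb(item_ids)             # (batch, seq_len, d)
        _, h_last = self.gru(x)                 # h_last: (1, batch, d), final hidden state
        return self.output(h_last.squeeze(0))   # (batch, num_items) next-item logits

model = GRURecommender(num_items=1000)
batch = torch.tensor([[12, 7, 99, 34]])         # one user's recent items
logits = model(batch)
print(logits.shape)                             # torch.Size([1, 1000])
```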
⚡ Transformers
Transformers skip recurrence altogether and use self-attention to look at all past interactions simultaneously.
The self-attention formula:
$$ Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

This computes how relevant each past item is to the current prediction — so the model can say,
“For predicting the next click, only these 3 past items matter most.”
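The formula maps almost line for line into code; here is a minimal NumPy sketch of scaled dot-product attention over a user’s past item representations (shapes and random inputs are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_k = 5, 8                        # 5 past interactions, dimension 8
Q = rng.normal(size=(T, d_k))        # queries
K = rng.normal(size=(T, d_k))        # keys
V = rng.normal(size=(T, d_k))        # values

weights = softmax(Q @ K.T / np.sqrt(d_k))   # how much each past item matters
output = weights @ V                        # context-aware representations

print(weights[-1].round(2))   # attention of the latest step over all past items
```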
Why It Works This Way
Sequential models recognize that user intent is dynamic — preferences are not fixed, they depend on context (time, season, device, etc.).
For example:
- You might listen to classical music while studying (weekday afternoon), but pop music while driving (evening).
RNNs and GRUs catch short-term dependencies (“what’s next?”), while Transformers capture long-term semantic relationships (“what’s your evolving taste?”).
This leads to context-aware, moment-aware recommendations.
How It Fits in ML Thinking
Sequential models bring recommender systems closer to behavioral modeling — treating recommendations as temporal predictions, not static matches.
In ML evolution terms:
| Generation | Focus | Example |
|---|---|---|
| Static | “Who is similar to whom?” | Matrix Factorization |
| Nonlinear | “What patterns exist in embeddings?” | NCF, DeepFM |
| Sequential | “How does preference change over time?” | SASRec, BERT4Rec |
They shift the paradigm from snapshot preference prediction to trajectory modeling.
📐 Step 3: Mathematical Foundation
Let’s simplify how sequence models mathematically connect user history to predictions.
SASRec (Self-Attentive Sequential Recommendation)
SASRec applies Transformer-style self-attention to interaction sequences:
$$ h_t = Attention(x_1, x_2, ..., x_t) $$

Each interaction embedding attends to previous ones, producing context-aware representations. The model predicts the next item by passing $h_t$ through a softmax layer over all items.
Key advantage: It learns which past actions matter most — maybe your last 3 actions, or one movie from months ago.
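A compact SASRec-style sketch using PyTorch’s Transformer encoder: the causal mask keeps each position from attending to future items. Layer sizes, number of layers, and heads are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class SASRecSketch(nn.Module):
    def __init__(self, num_items: int, max_len: int = 50, d: int = 64, heads: int = 2):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, d)
        self.pos_emb = nn.Embedding(max_len, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.output = nn.Linear(d, num_items)    # scores over all items

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        seq_len = item_ids.size(1)
        positions = torch.arange(seq_len, device=item_ids.device)
        x = self.item_emb(item_ids) + self.pos_emb(positions)
        # Causal mask: position t may only attend to positions <= t.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.encoder(x, mask=mask)           # (batch, seq_len, d)
        return self.output(h[:, -1, :])          # logits for the next item

model = SASRecSketch(num_items=1000)
logits = model(torch.tensor([[12, 7, 99, 34]]))
print(logits.shape)   # torch.Size([1, 1000])
```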
BERT4Rec (Bidirectional Transformer for Recommendation)
BERT4Rec builds on SASRec but adds bidirectional context — it looks at both past and future interactions in a masked sequence prediction setup.
It randomly masks some items and trains the model to predict them, like BERT in NLP.
$$ L = - \sum_{t \in M} \log P(i_t | i_{\neg t}) $$

- $M$: set of masked positions
- $i_{\neg t}$: unmasked items around the target
This lets BERT4Rec understand relationships in both directions — not just forward in time.
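A sketch of the masked-item objective: a reserved [MASK] token replaces some positions, the encoder runs without a causal mask (so it sees both directions), and cross-entropy is computed only at the masked positions. Catalog size, masked positions, and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_items, mask_token, d = 1000, 1000, 64     # reserve ID 1000 as the [MASK] token
item_emb = nn.Embedding(num_items + 1, d)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=2, batch_first=True), num_layers=2
)
output = nn.Linear(d, num_items)

seq = torch.tensor([[12, 7, 99, 34, 5]])                             # original sequence
mask_positions = torch.tensor([[False, False, True, False, True]])   # hide items 99 and 5
masked = seq.clone()
masked[mask_positions] = mask_token

h = encoder(item_emb(masked))     # no causal mask: each position sees both directions
logits = output(h)                # (batch, seq_len, num_items)

# Loss only over the masked positions: predict the hidden items from their context.
loss = F.cross_entropy(logits[mask_positions], seq[mask_positions])
print(loss.item())
```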
Time-Based Context & Session Modeling
In contextual recommenders, time and session metadata are explicitly modeled:
$$ \hat{r}_{ui} = f(P_u, Q_i, t, c) $$

where $t$ = timestamp or session indicator, $c$ = context (device, location, etc.).
These models handle:
- Session-level patterns (short bursts of activity)
- Circadian patterns (daily rhythms)
- Contextual preferences (e.g., “watch on mobile” vs. “desktop”)
They’re often combined with RNNs/Transformers in session-based recommendation frameworks (e.g., GRU4Rec, NARM).
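One simple way to realize $\hat{r}_{ui} = f(P_u, Q_i, t, c)$ is to concatenate user, item, and context embeddings and pass them through a small MLP. The context features (hour of day, device) and all dimensions below are hypothetical choices for illustration:

```python
import torch
import torch.nn as nn

class ContextualScorer(nn.Module):
    """Score a (user, item) pair conditioned on time-of-day and device context."""

    def __init__(self, num_users, num_items, num_hours=24, num_devices=3, d=32):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, d)      # P_u
        self.item_emb = nn.Embedding(num_items, d)      # Q_i
        self.hour_emb = nn.Embedding(num_hours, d)      # t: hour of day
        self.device_emb = nn.Embedding(num_devices, d)  # c: mobile / desktop / TV
        self.mlp = nn.Sequential(nn.Linear(4 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, user, item, hour, device):
        z = torch.cat([self.user_emb(user), self.item_emb(item),
                       self.hour_emb(hour), self.device_emb(device)], dim=-1)
        return self.mlp(z).squeeze(-1)                  # predicted affinity r̂_ui

scorer = ContextualScorer(num_users=500, num_items=1000)
score = scorer(torch.tensor([42]), torch.tensor([7]),
               torch.tensor([21]), torch.tensor([0]))   # user 42, item 7, 9pm, mobile
print(score)
```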
🧠 Step 4: Assumptions or Key Ideas
- Temporal dependency: Recent interactions carry more weight.
- Dynamic preference: User intent changes — recommendations must adapt.
- Attention over sequence: Not all past actions matter equally.
- Context enrichment: Time, device, or session features refine predictions.
When these assumptions hold, sequential models outperform static ones significantly.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Captures evolving user intent and temporal context.
- Handles cold-start sessions (even for unknown users).
- Attention helps deal with noisy or irrelevant history.
- Strong performance on implicit feedback data.

Limitations:
- Computationally expensive (especially Transformers).
- Needs long sequences — struggles if user history is short.
- Difficult to interpret (why a certain item was recommended).
- Sensitive to data sparsity and hyperparameters.
🚧 Step 6: Common Misunderstandings
- “Transformers always beat RNNs.” Not necessarily — RNNs/GRUs often perform better for shorter sequences or lower-latency setups.
- “Sequence models forget user identity.” They can include user embeddings — combining both long-term and short-term profiles.
- “Attention means explainability.” Attention weights show importance, not causal reasoning.
🧩 Step 7: Mini Summary
🧠 What You Learned: Sequential and contextual models use RNNs, GRUs, and Transformers to predict what users will do next, not just what they like.
⚙️ How It Works: They model time-dependent interactions, use attention to focus on relevant history, and adapt to evolving user intent.
🎯 Why It Matters: They solve real-world challenges like dynamic behavior, long-tail recommendations, and contextual personalization — powering modern systems like YouTube, TikTok, and Spotify.