3.3 Sequential and Contextual Models
🪄 Step 1: Intuition & Motivation
Core Idea: Matrix Factorization and Neural CF assume that user preferences are static — that your tastes don’t change much over time. But in reality, human preferences evolve:
You might watch romantic comedies on weekends, documentaries on weekdays, and Christmas movies in December.
Sequential and Contextual Models capture this dynamic behavior — they model what you liked yesterday to predict what you’ll like today.
Simple Analogy: Think of a recommender as your best friend who notices patterns over time. If you’ve been binge-watching Marvel movies, they won’t recommend a slow romance tonight — they’ll say,
“You’re clearly in a superhero mood — how about Doctor Strange next?” 🦸♂️
That’s sequential modeling — predicting the next action based on recent history.
🌱 Step 2: Core Concept
The goal is to model the sequence of user-item interactions over time. Instead of treating each interaction as independent, we view it as a temporal trajectory:
$$ S_u = [i_1, i_2, i_3, ..., i_T] $$

where $S_u$ is user $u$’s interaction sequence over time.
The model tries to predict:
What’s the next item ($i_{T+1}$) this user will interact with?
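To make this concrete, here is a minimal sketch (plain Python, hypothetical item IDs) of how one user’s interaction sequence is turned into training examples for next-item prediction:

```python
# A user's interaction history, ordered by time (hypothetical item IDs).
sequence = [12, 7, 99, 34, 5]   # S_u = [i_1, ..., i_T]

# Next-item prediction: each prefix of the sequence is an input,
# and the item that immediately follows it is the target.
training_pairs = [
    (sequence[:t], sequence[t])  # (history up to step t, next item)
    for t in range(1, len(sequence))
]

for history, target in training_pairs:
    print(history, "->", target)
# [12] -> 7
# [12, 7] -> 99
# ...
```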
To do this, modern recommenders use sequence models like:
- RNNs / GRUs → capture sequential dependencies
- Transformers → capture long-range patterns using self-attention
Let’s unwrap these intuitively.
What’s Happening Under the Hood?
🌀 Recurrent Neural Networks (RNNs)
RNNs process interactions one at a time, maintaining a hidden state that summarizes past behavior. For a sequence of items $[i_1, i_2, …, i_T]$:
$$ h_t = f(Wx_t + Uh_{t-1}) $$

- $x_t$: embedding of item $i_t$
- $h_t$: hidden state (memory of what came before)
At each step, $h_t$ captures cumulative context — like “User recently watched action movies.”
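Here is a minimal NumPy sketch of that recurrence, using tanh as the nonlinearity $f$; the item embeddings and weight matrices are randomly initialized purely for illustration:

```python
import numpy as np

d = 8                                    # embedding / hidden dimension
rng = np.random.default_rng(0)

item_embeddings = rng.normal(size=(100, d))   # x_t lookup table (100 items)
W = rng.normal(size=(d, d)) * 0.1             # input-to-hidden weights
U = rng.normal(size=(d, d)) * 0.1             # hidden-to-hidden weights

sequence = [12, 7, 99, 34]               # user's item IDs, ordered by time
h = np.zeros(d)                          # h_0: empty memory

for item_id in sequence:
    x_t = item_embeddings[item_id]       # embedding of item i_t
    h = np.tanh(W @ x_t + U @ h)         # h_t = f(W x_t + U h_{t-1})

print(h.shape)   # (8,) — a single vector summarizing the whole history
```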
But RNNs struggle with long-term dependencies — they forget older preferences over time.
🧠 Gated Recurrent Units (GRUs)
GRUs improve memory handling with gates — they decide what to remember or forget:
$$ h_t = (1 - z_t)h_{t-1} + z_t \tilde{h}_t $$

- $z_t$: update gate → how much new info to take in
- $\tilde{h}_t$: candidate hidden state
This helps the model balance recency (latest items) with stability (long-term taste).
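In practice the gates are rarely written by hand; a GRU-based next-item recommender can be sketched with PyTorch’s built-in GRU layer (layer sizes and catalog size here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class GRURecommender(nn.Module):
    """Minimal GRU-based model: embed items, run a GRU, score the catalog."""

    def __init__(self, num_items: int, d: int = 64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, d)
        self.gru = nn.GRU(d, d, batch_first=True)    # update/reset gates handled internally
        self.output = nn.Linear(d, num_items)        # scores over the item catalog

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        x = self.item_emb(item_ids)             # (batch, seq_len, d)
        _, h_last = self.gru(x)                 # h_last: (1, batch, d), final hidden state
        return self.output(h_last.squeeze(0))   # (batch, num_items) next-item logits

model = GRURecommender(num_items=1000)
batch = torch.tensor([[12, 7, 99, 34]])         # one user's recent items
logits = model(batch)
print(logits.shape)                             # torch.Size([1, 1000])
```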
⚡ Transformers
Transformers skip recurrence altogether and use self-attention to look at all past interactions simultaneously.
The self-attention formula:
$$ Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

This computes how relevant each past item is to the current prediction — so the model can say,
“For predicting the next click, only these 3 past items matter most.”
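The formula maps almost line for line into code; here is a minimal NumPy sketch of scaled dot-product attention over a user’s past item representations (shapes and random inputs are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_k = 5, 8                        # 5 past interactions, dimension 8
Q = rng.normal(size=(T, d_k))        # queries
K = rng.normal(size=(T, d_k))        # keys
V = rng.normal(size=(T, d_k))        # values

weights = softmax(Q @ K.T / np.sqrt(d_k))   # how much each past item matters
output = weights @ V                        # context-aware representations

print(weights[-1].round(2))   # attention of the latest step over all past items
```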
Why It Works This Way
Sequential models recognize that user intent is dynamic — preferences are not fixed, they depend on context (time, season, device, etc.).
For example:
- You might listen to classical music while studying (weekday afternoon), but pop music while driving (evening).
RNNs and GRUs catch short-term dependencies (“what’s next?”), while Transformers capture long-term semantic relationships (“what’s your evolving taste?”).
This leads to context-aware, moment-aware recommendations.
How It Fits in ML Thinking
Sequential models bring recommender systems closer to behavioral modeling — treating recommendations as temporal predictions, not static matches.
In ML evolution terms:
| Generation | Focus | Example |
|---|---|---|
| Static | “Who is similar to whom?” | Matrix Factorization |
| Nonlinear | “What patterns exist in embeddings?” | NCF, DeepFM |
| Sequential | “How does preference change over time?” | SASRec, BERT4Rec |
They shift the paradigm from snapshot preference prediction to trajectory modeling.
📐 Step 3: Mathematical Foundation
Let’s simplify how sequence models mathematically connect user history to predictions.
SASRec (Self-Attentive Sequential Recommendation)
SASRec applies Transformer-style self-attention to interaction sequences:
$$ h_t = Attention(x_1, x_2, ..., x_t) $$

Each interaction embedding attends to previous ones, producing context-aware representations. The model predicts the next item by passing $h_t$ through a softmax layer over all items.
Key advantage: It learns which past actions matter most — maybe your last 3 actions, or one movie from months ago.
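A compact SASRec-style sketch using PyTorch’s Transformer encoder: the causal mask keeps each position from attending to future items. Layer sizes, number of layers, and heads are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class SASRecSketch(nn.Module):
    def __init__(self, num_items: int, max_len: int = 50, d: int = 64, heads: int = 2):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, d)
        self.pos_emb = nn.Embedding(max_len, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.output = nn.Linear(d, num_items)    # scores over all items

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        seq_len = item_ids.size(1)
        positions = torch.arange(seq_len, device=item_ids.device)
        x = self.item_emb(item_ids) + self.pos_emb(positions)
        # Causal mask: position t may only attend to positions <= t.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.encoder(x, mask=mask)           # (batch, seq_len, d)
        return self.output(h[:, -1, :])          # logits for the next item

model = SASRecSketch(num_items=1000)
logits = model(torch.tensor([[12, 7, 99, 34]]))
print(logits.shape)   # torch.Size([1, 1000])
```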
BERT4Rec (Bidirectional Transformer for Recommendation)
BERT4Rec builds on SASRec but adds bidirectional context — it looks at both past and future interactions in a masked sequence prediction setup.
It randomly masks some items and trains the model to predict them, like BERT in NLP.
$$ L = - \sum_{t \in M} \log P(i_t | i_{\neg t}) $$

- $M$: set of masked positions
- $i_{\neg t}$: unmasked items around the target
This lets BERT4Rec understand relationships in both directions — not just forward in time.
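A sketch of the masked-item objective: a reserved [MASK] token replaces some positions, the encoder runs without a causal mask (so it sees both directions), and cross-entropy is computed only at the masked positions. Catalog size, masked positions, and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_items, mask_token, d = 1000, 1000, 64     # reserve ID 1000 as the [MASK] token
item_emb = nn.Embedding(num_items + 1, d)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=2, batch_first=True), num_layers=2
)
output = nn.Linear(d, num_items)

seq = torch.tensor([[12, 7, 99, 34, 5]])                             # original sequence
mask_positions = torch.tensor([[False, False, True, False, True]])   # hide items 99 and 5
masked = seq.clone()
masked[mask_positions] = mask_token

h = encoder(item_emb(masked))     # no causal mask: each position sees both directions
logits = output(h)                # (batch, seq_len, num_items)

# Loss only over the masked positions: predict the hidden items from their context.
loss = F.cross_entropy(logits[mask_positions], seq[mask_positions])
print(loss.item())
```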
Time-Based Context & Session Modeling
In contextual recommenders, time and session metadata are explicitly modeled:
$$ \hat{r}_{ui} = f(P_u, Q_i, t, c) $$

where $t$ = timestamp or session indicator, $c$ = context (device, location, etc.).
These models handle:
- Session-level patterns (short bursts of activity)
- Circadian patterns (daily rhythms)
- Contextual preferences (e.g., “watch on mobile” vs. “desktop”)
They’re often combined with RNNs/Transformers in session-based recommendation frameworks (e.g., GRU4Rec, NARM).
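One simple way to realize $\hat{r}_{ui} = f(P_u, Q_i, t, c)$ is to concatenate user, item, and context embeddings and pass them through a small MLP. The context features (hour of day, device) and all dimensions below are hypothetical choices for illustration:

```python
import torch
import torch.nn as nn

class ContextualScorer(nn.Module):
    """Score a (user, item) pair conditioned on time-of-day and device context."""

    def __init__(self, num_users, num_items, num_hours=24, num_devices=3, d=32):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, d)      # P_u
        self.item_emb = nn.Embedding(num_items, d)      # Q_i
        self.hour_emb = nn.Embedding(num_hours, d)      # t: hour of day
        self.device_emb = nn.Embedding(num_devices, d)  # c: mobile / desktop / TV
        self.mlp = nn.Sequential(nn.Linear(4 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, user, item, hour, device):
        z = torch.cat([self.user_emb(user), self.item_emb(item),
                       self.hour_emb(hour), self.device_emb(device)], dim=-1)
        return self.mlp(z).squeeze(-1)                  # predicted affinity r̂_ui

scorer = ContextualScorer(num_users=500, num_items=1000)
score = scorer(torch.tensor([42]), torch.tensor([7]),
               torch.tensor([21]), torch.tensor([0]))   # user 42, item 7, 9pm, mobile
print(score)
```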
🧠 Step 4: Assumptions or Key Ideas
- Temporal dependency: Recent interactions carry more weight.
- Dynamic preference: User intent changes — recommendations must adapt.
- Attention over sequence: Not all past actions matter equally.
- Context enrichment: Time, device, or session features refine predictions.
When these assumptions hold, sequential models outperform static ones significantly.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Captures evolving user intent and temporal context.
- Handles cold-start sessions (even for unknown users).
- Attention helps deal with noisy or irrelevant history.
- Strong performance on implicit feedback data.

Limitations:
- Computationally expensive (especially Transformers).
- Needs long sequences — struggles if user history is short.
- Difficult to interpret (why a certain item was recommended).
- Sensitive to data sparsity and hyperparameters.
🚧 Step 6: Common Misunderstandings
- “Transformers always beat RNNs.” Not necessarily — RNNs/GRUs often perform better for shorter sequences or lower-latency setups.
- “Sequence models forget user identity.” They can include user embeddings — combining both long-term and short-term profiles.
- “Attention means explainability.” Attention weights show importance, not causal reasoning.
🧩 Step 7: Mini Summary
🧠 What You Learned: Sequential and contextual models use RNNs, GRUs, and Transformers to predict what users will do next, not just what they like.
⚙️ How It Works: They model time-dependent interactions, use attention to focus on relevant history, and adapt to evolving user intent.
🎯 Why It Matters: They solve real-world challenges like dynamic behavior, long-tail recommendations, and contextual personalization — powering modern systems like YouTube, TikTok, and Spotify.