2.2 Matrix Factorization
🪄 Step 1: Intuition & Motivation
Core Idea: Matrix Factorization (MF) is like the neural network of classical recommenders — compact, elegant, and powerful. Instead of memorizing who liked what (like CF did), MF tries to discover hidden factors — the secret traits that explain why people like certain things.
Simple Analogy: Imagine users and movies living in a secret “taste space.” Each user and movie has a vector — like coordinates in a map. A movie’s position reflects its style (romance, action, drama), and a user’s position reflects their preferences. A good match is when the user and movie vectors point in similar directions — they “vibe” in taste space. 🎬💫
Matrix Factorization is about learning those coordinates.
🌱 Step 2: Core Concept
The user–item rating matrix (R) is usually sparse — most users haven’t rated most items. Matrix Factorization assumes that we can approximate this huge sparse matrix as the product of two smaller matrices:
$$ R \approx P Q^T $$

Where:
- $P \in \mathbb{R}^{m \times k}$ → user matrix (each row = user’s latent preferences)
- $Q \in \mathbb{R}^{n \times k}$ → item matrix (each row = item’s latent features)
- $k$ → number of latent factors (hidden dimensions)
Each predicted rating is:
$$ \hat{r}_{ui} = P_u^T Q_i $$

So instead of directly storing 10 million ratings, we just learn a few hundred factors per user and item — compressing the data into meaningful, low-dimensional representations.
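To make the shapes concrete, here is a minimal NumPy sketch (the sizes `m`, `n`, `k` and the random values are assumptions purely for illustration, not a trained model):

```python
import numpy as np

# Assumed toy sizes: 5 users, 7 items, 3 latent factors.
m, n, k = 5, 7, 3
rng = np.random.default_rng(0)

P = rng.normal(scale=0.1, size=(m, k))  # user matrix: row u = user u's latent preferences
Q = rng.normal(scale=0.1, size=(n, k))  # item matrix: row i = item i's latent features

R_hat = P @ Q.T          # reconstructed rating matrix, shape (m, n)
r_hat_23 = P[2] @ Q[3]   # a single predicted rating: dot product of one user row and one item row

print(R_hat.shape)                        # (5, 7)
print(np.isclose(R_hat[2, 3], r_hat_23))  # True: the matrix product is just all the dot products at once
```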
What’s Happening Under the Hood?
We want to find $P$ and $Q$ such that the predicted ratings $\hat{r}_{ui}$ are close to the actual ratings $r_{ui}$ for all known pairs.
To do this, we minimize the squared error between them — plus a regularization term to prevent overfitting:
$$ \min_{P, Q} \sum_{(u,i) \in \mathcal{D}} (r_{ui} - P_u^T Q_i)^2 + \lambda(||P_u||^2 + ||Q_i||^2) $$

- The first term measures how well we reconstruct known ratings.
- The second term (with λ) keeps the learned vectors small, discouraging overfitting.
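As a rough sketch, this regularized loss over the observed pairs could be computed like so (the function name `mf_loss` and its inputs are hypothetical, chosen for illustration):

```python
import numpy as np

def mf_loss(R, P, Q, lam, observed):
    """Regularized squared error over the known ratings only.

    R: (m, n) rating matrix; observed: list of (u, i) pairs with known ratings (assumed inputs).
    """
    loss = 0.0
    for u, i in observed:
        err = R[u, i] - P[u] @ Q[i]                           # reconstruction error for one known rating
        loss += err ** 2 + lam * (P[u] @ P[u] + Q[i] @ Q[i])  # squared error + L2 penalty on both vectors
    return loss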
Why It Works This Way
Matrix Factorization essentially learns why users like things — without explicit labels. If users who love Inception also love Interstellar, the model will learn a hidden “sci-fi affinity” dimension that explains both.
Each latent factor captures an abstract trait:
- Factor 1 → “Romance preference”
- Factor 2 → “Action thrill level”
- Factor 3 → “Emotional depth”
A user’s preference vector weights how much they value each factor; an item’s vector describes how strongly it exhibits each factor. Their dot product measures alignment in taste.
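A toy example of that alignment, with three hypothetical factors (romance, action, emotional depth) and made-up vectors:

```python
import numpy as np

# Hypothetical 3-factor vectors: [romance, action, emotional depth]
user    = np.array([0.9, 0.1, 0.7])  # loves romance and emotional depth, lukewarm on action
movie_a = np.array([0.8, 0.0, 0.9])  # a romantic drama
movie_b = np.array([0.1, 0.9, 0.2])  # an action thriller

print(user @ movie_a)  # ≈ 1.35 -> vectors point in similar directions, high predicted preference
print(user @ movie_b)  # ≈ 0.32 -> weak alignment, low predicted preference
```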
How It Fits in ML Thinking
Matrix Factorization transforms raw user–item interactions into dense embeddings, similar to how modern deep learning models represent meaning in vectors (e.g., word embeddings in NLP).
It’s the conceptual bridge between traditional CF and Neural Collaborative Filtering (NCF) — same principle, but learned end-to-end via gradient descent.
📐 Step 3: Mathematical Foundation
Let’s gently unpack the core formulas.
Objective Function

$$ \min_{P, Q} \sum_{(u,i) \in \mathcal{D}} (r_{ui} - P_u^T Q_i)^2 + \lambda(||P_u||^2 + ||Q_i||^2) $$

Terms explained:
- $(r_{ui} - P_u^T Q_i)^2$ → squared reconstruction error
- $||P_u||^2 + ||Q_i||^2$ → L2 regularization
- $\lambda$ → regularization coefficient controlling overfitting
Goal: Find $P$ and $Q$ that minimize this loss.
Optimization using Stochastic Gradient Descent (SGD)
We update each parameter (user and item latent vectors) iteratively:
$$ P_u \leftarrow P_u + \alpha \big[(r_{ui} - \hat{r}_{ui}) Q_i - \lambda P_u \big] $$

$$ Q_i \leftarrow Q_i + \alpha \big[(r_{ui} - \hat{r}_{ui}) P_u - \lambda Q_i \big] $$

- $\alpha$: learning rate
- $(r_{ui} - \hat{r}_{ui})$: prediction error
Each update moves $P_u$ and $Q_i$ in the direction that reduces the error, while slightly shrinking them (regularization).
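Putting the update rules together, here is a minimal SGD training loop sketch (the function name, default hyperparameter values, and toy usage are assumptions for illustration, not a reference implementation):

```python
import numpy as np

def train_mf(ratings, m, n, k=20, alpha=0.01, lam=0.05, epochs=20, seed=0):
    """ratings: iterable of (u, i, r) triples for the observed entries only."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(m, k))  # user latent vectors
    Q = rng.normal(scale=0.1, size=(n, k))  # item latent vectors

    for _ in range(epochs):
        for u, i, r in ratings:
            p_u, q_i = P[u].copy(), Q[i].copy()
            err = r - p_u @ q_i                      # prediction error (r_ui - r_hat_ui)
            P[u] += alpha * (err * q_i - lam * p_u)  # update rule for P_u
            Q[i] += alpha * (err * p_u - lam * q_i)  # update rule for Q_i
    return P, Q

# Toy usage: two users, three items, a handful of observed ratings.
P, Q = train_mf([(0, 0, 4.0), (0, 2, 3.5), (1, 0, 5.0), (1, 1, 2.0)], m=2, n=3, k=2)
print(P @ Q.T)  # predicted ratings for every user-item pair, including the unobserved ones
```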
Hyperparameters and Latent Dimensions
Key hyperparameters:
- $k$ (latent dimensions): controls model capacity. Small $k$ → underfitting (too simple); large $k$ → overfitting (too specific).
- $\lambda$ (regularization): prevents overfitting by shrinking vectors. High $\lambda$ → smoother but less expressive model; low $\lambda$ → high variance.
- $\alpha$ (learning rate): controls how fast SGD updates parameters. Too high → unstable training; too low → slow learning.
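In practice these values are usually picked on a held-out split. The sketch below assumes the hypothetical `train_mf` from the SGD sketch above and an assumed grid of candidate values:

```python
import numpy as np

def rmse(pairs, P, Q):
    """Root-mean-square error on (u, i, r) triples held out from training."""
    errors = [r - P[u] @ Q[i] for u, i, r in pairs]
    return float(np.sqrt(np.mean(np.square(errors))))

def pick_hyperparameters(train, valid, m, n):
    best = None
    for k in (8, 16, 32):              # candidate latent dimensions (assumed grid)
        for lam in (0.01, 0.05, 0.1):  # candidate regularization strengths
            P, Q = train_mf(train, m, n, k=k, lam=lam)  # train_mf from the earlier sketch
            score = rmse(valid, P, Q)
            if best is None or score < best[0]:
                best = (score, k, lam)
    return best  # (validation RMSE, k, lambda)
```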
🧠 Step 4: Assumptions or Key Ideas
- Latent linearity: User–item interactions can be represented as linear combinations of hidden traits.
- Stationarity: User preferences and item properties are relatively stable over the observed period.
- Sufficient overlap: Enough ratings exist to uncover shared structure (no extreme sparsity).
Why they matter: If users rarely overlap in items rated, MF struggles to align their vectors — leading to “unanchored” latent spaces.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Elegant and mathematically grounded.
- Scalable to millions of users/items (especially with Alternating Least Squares, ALS).
- Learns latent semantics — captures abstract factors like genre, tone, or mood.
- Foundation for deep recommenders.

Limitations:
- Assumes linear relationships (dot product).
- Struggles with cold-start problems (new users/items).
- Sensitive to hyperparameter tuning.
- Needs sufficient data to uncover reliable patterns.
🚧 Step 6: Common Misunderstandings
- “Matrix Factorization needs full data.” No — it only minimizes loss over known ratings.
- “Regularization just reduces overfitting.” It also helps keep user/item spaces well-behaved (vectors don’t explode in magnitude).
- “More latent factors always help.” Beyond a point, extra dimensions capture noise, not meaning.
🧩 Step 7: Mini Summary
🧠 What You Learned: Matrix Factorization compresses huge sparse rating matrices into compact user and item embeddings that capture hidden preference patterns.
⚙️ How It Works: It minimizes squared reconstruction error using SGD, balancing fit with regularization.
🎯 Why It Matters: MF is the foundation for modern recommender architectures — it transforms collaborative data into structured, learnable representations.