5.3 Evaluation at Scale
🪄 Step 1: Intuition & Motivation
Core Idea: Building a recommendation model is impressive. Deploying it is even cooler. But evaluating it at scale — for millions of users and billions of interactions — that’s where engineering meets science.
And here’s the twist: A model that looks great on metrics like RMSE or Precision@K might still fail in the real world — boosting clicks but killing long-term satisfaction.
Simple Analogy: Think of a teacher grading students.
- “Did they get this question right?” → Offline metric.
- “Did they actually learn and stay interested in the subject?” → Online metric.
Your recommender, too, needs to be judged by both how smart it is (accuracy) and how students feel about learning from it (user satisfaction).
🌱 Step 2: Core Concept
At industrial scale, recommendation evaluation has two complementary arms:
| Evaluation Type | What It Measures | When It’s Used |
|---|---|---|
| Offline Evaluation | Predictive accuracy on historical data | Before deployment |
| Online Evaluation | Real-world impact on live users | After deployment |
But at scale, even computing these metrics is nontrivial — you can’t just use for loops on a billion events.
You need distributed evaluation frameworks like Apache Spark or Ray to parallelize metric computation.
Let’s dive into both layers.
Offline Evaluation: Accuracy Before Action
Offline metrics test how well your model predicts known outcomes in past data. They’re fast, repeatable, and safe — but they only measure prediction, not experience.
Common offline metrics:
| Metric | Formula | Intuition |
|---|---|---|
| Precision@K | $ \frac{\text{# relevant items in top K}}{K} $ | How many top recommendations were correct |
| Recall@K | $ \frac{\text{# relevant items in top K}}{\text{# relevant items overall}} $ | How many of the user’s relevant items were retrieved |
| MAP (Mean Average Precision) | Average of precisions at each relevant rank | Rewards correctly ordered results |
| NDCG (Normalized Discounted Cumulative Gain) | Weighted relevance by rank | Prioritizes items higher in the list |
Example: If you recommend 10 movies and the user liked 3 of them, Precision@10 = 0.3.
Offline metrics are great for filtering bad models quickly — but they can’t predict how users feel or evolve over time.
They tell you “how good the guesses are,” not “how much joy they create.” ✨
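In code, both metrics reduce to a set intersection over a user’s top-K list. Here is a minimal Python sketch; the function name and the toy movie IDs are illustrative assumptions, not a specific library’s API.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@K and Recall@K for a single user."""
    top_k = recommended[:k]                     # keep only the top-K recommendations
    hits = len(set(top_k) & set(relevant))      # relevant items that made it into the top-K
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example matching the one above: 10 recommended movies, 3 of them liked
recommended = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m9", "m10"]
relevant = ["m2", "m7", "m9", "m42"]            # "m42" was relevant but never recommended

print(precision_recall_at_k(recommended, relevant, k=10))  # (0.3, 0.75)
```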
Online Evaluation: Reality Checks with Live Users
Once the model is live, we switch to online metrics, measured via A/B testing or controlled experiments.
Key metrics:
- CTR (Click-Through Rate): how often users click on recommended items.
- Engagement / Dwell Time: how long they interact with content.
- Conversion Rate: percentage of recommendations leading to a desired outcome (purchase, watch, etc.).
- CTR Uplift: relative improvement compared to baseline model.
Example: If CTR rises from 5% to 6%, uplift = +20%.
But beware — a higher CTR might not always be good. Maybe users click more but quickly lose interest — reducing retention and trust.
Online metrics reveal behavioral truth, but not always emotional truth.
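As a rough sketch, here is how CTR and uplift could be computed from raw A/B test counts; the counts and variable names are made up for illustration.

```python
def ctr(clicks, impressions):
    """Click-through rate: fraction of impressions that resulted in a click."""
    return clicks / impressions

def ctr_uplift(ctr_new, ctr_baseline):
    """Relative CTR improvement of the new model over the baseline."""
    return (ctr_new - ctr_baseline) / ctr_baseline

# Toy A/B test counts (illustrative numbers only)
baseline = ctr(clicks=50_000, impressions=1_000_000)   # 5% CTR
new_model = ctr(clicks=60_000, impressions=1_000_000)  # 6% CTR

print(f"Uplift: {ctr_uplift(new_model, baseline):+.0%}")  # +20%
```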
Distributed Evaluation Frameworks (Spark, Ray)
When evaluating large datasets (e.g., Netflix-scale: billions of rows), we can’t compute metrics sequentially.
How Spark and Ray Help:
- Parallelize user-level computations: Each user’s top-K recommendations can be processed independently.
- Distribute data across clusters: Store massive logs (impressions, clicks) in distributed storage (HDFS, S3).
- Aggregate results efficiently: Compute global metrics like Precision@K and Recall@K via reduce operations.
Spark MLlib and Ray Tune even provide:
- Cross-validation at scale
- Hyperparameter tuning across distributed workers
- Batch metric computation for multiple models in parallel
Spark is your math teacher with 1,000 assistants — all grading homework simultaneously. 🧮
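To make the map/reduce pattern concrete, here is a minimal PySpark sketch of a global Precision@K computation. The inline toy rows stand in for what would normally be impression/click logs read from Parquet on S3 or HDFS; the record layout is an assumption for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("offline-eval").getOrCreate()

# Assumed layout: one record per user with the model's top-K list and held-out relevant items.
rows = [
    ("u1", ["m1", "m2", "m3"], ["m2", "m9"]),
    ("u2", ["m4", "m5", "m6"], ["m4", "m5"]),
]
rdd = spark.sparkContext.parallelize(rows)

K = 3

def user_precision(record):
    _, recommended, relevant = record
    hits = len(set(recommended[:K]) & set(relevant))
    return hits / K

# Map: score each user independently.  Reduce: average into one global metric.
global_precision_at_k = rdd.map(user_precision).mean()
print(f"Precision@{K}: {global_precision_at_k:.3f}")
```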
Beyond Accuracy: Diversity, Novelty & Fairness
Offline and online metrics focus on accuracy, but modern recommenders need ethics and variety too.
🌈 1. Diversity
Ensure the recommended items are not all from the same category or style. Formula:
$$ \text{Diversity} = 1 - \frac{1}{|R|(|R|-1)} \sum_{i \neq j} \text{sim}(i, j) $$
- $\text{sim}(i,j)$ = similarity between item embeddings
- Higher value → more variety
Without diversity, users see the same things over and over — algorithmic tunnel vision.
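A small NumPy sketch of this intra-list diversity score, assuming cosine similarity between item embeddings; the toy embeddings are illustrative.

```python
import numpy as np

def intra_list_diversity(embeddings):
    """1 minus the average pairwise cosine similarity of a recommendation list."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T                       # cosine similarity matrix
    n = len(embeddings)
    off_diagonal_sum = sim.sum() - np.trace(sim)  # drop the i == j terms
    return 1 - off_diagonal_sum / (n * (n - 1))

# Toy item embeddings for one recommended list (illustrative values)
items = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(intra_list_diversity(items))                # closer to 1 = more varied list
```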
✨ 2. Novelty
Encourages exposure to new or lesser-known items.
$$ \text{Novelty} = \frac{1}{|R|} \sum_{i \in R} -\log P(i) $$
- $P(i)$ = popularity of item $i$
- Higher = less mainstream, more surprising
Novelty is like recommending an indie film instead of yet another Marvel sequel. 🎬
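Novelty follows directly from the popularity distribution. A minimal sketch, assuming $P(i)$ is the fraction of all interactions involving item $i$ (the popularity values below are made up):

```python
import numpy as np

def novelty(recommended, popularity):
    """Average self-information -log P(i) of the recommended items."""
    return float(np.mean([-np.log(popularity[item]) for item in recommended]))

# Assumed popularity: share of total interactions per item (illustrative values)
popularity = {"blockbuster": 0.20, "indie_film": 0.001}

print(novelty(["blockbuster"], popularity))   # low novelty: everyone has seen it
print(novelty(["indie_film"], popularity))    # high novelty: a genuine surprise
```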
⚖️ 3. Fairness
Prevents bias toward popular or well-represented groups (e.g., popular creators, dominant genres). Can be measured via:
- Exposure Parity: equal exposure probability across groups
- Statistical Parity Difference: $P(\hat{y}=1|A=0)$ vs. $P(\hat{y}=1|A=1)$
Fairness ensures that algorithms don’t amplify existing inequalities — they balance the ecosystem.
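A minimal sketch of the statistical parity difference for a binary group attribute; the exposure flags and group labels are toy data, and in practice they would be aggregated from impression logs.

```python
import numpy as np

def statistical_parity_difference(exposed, group):
    """P(exposed | A=0) - P(exposed | A=1): values near 0 indicate parity."""
    exposed = np.asarray(exposed, dtype=float)
    group = np.asarray(group)
    return exposed[group == 0].mean() - exposed[group == 1].mean()

# Toy data: was each item surfaced to users (1/0), and which creator group it belongs to
exposed = [1, 1, 0, 1, 0, 0, 1, 0]
group   = [0, 0, 0, 0, 1, 1, 1, 1]   # e.g. 0 = popular creators, 1 = niche creators

print(statistical_parity_difference(exposed, group))  # 0.75 - 0.25 = 0.5 (skewed exposure)
```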
📐 Step 3: Mathematical Foundation
Let’s summarize the main evaluation metrics concisely:
Precision and Recall @ K
$$ \text{Precision@K}(u) = \frac{|R_u^K \cap T_u|}{K} \qquad \text{Recall@K}(u) = \frac{|R_u^K \cap T_u|}{|T_u|} $$
Where:
- $R_u^K$: top-K recommendations for user $u$
- $T_u$: true set of relevant items
Precision answers “How correct were my top picks?” Recall answers “Did I find everything the user wanted?”
NDCG (Normalized Discounted Cumulative Gain)
$$ \text{NDCG@K} = \frac{1}{Z} \sum_{i=1}^{K} \frac{rel_i}{\log_2(i + 1)} $$
- $rel_i$ = relevance of item at position $i$
- $Z$ = normalization constant (ideal DCG)
NDCG rewards both correctness and good ranking order — higher scores for putting the best items at the top.
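The same definition in a short NumPy sketch, using the linear-gain convention above (the toy relevance scores are illustrative):

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain: relevance discounted by log2 of the rank."""
    positions = np.arange(2, len(relevances) + 2)      # rank i -> log2(i + 1)
    return float(np.sum(np.asarray(relevances) / np.log2(positions)))

def ndcg(relevances):
    """DCG divided by the ideal DCG (the same items in perfect order)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Toy ranking: the most relevant item (3) was placed second instead of first
print(ndcg([2, 3, 0, 1]))   # < 1.0, penalized for the imperfect ordering
```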
CTR Uplift
$$ \text{CTR Uplift} = \frac{\text{CTR}_{\text{new}} - \text{CTR}_{\text{baseline}}}{\text{CTR}_{\text{baseline}}} \times 100\% $$
This measures relative improvement between models in live experiments. Even a small uplift (+2%) can mean millions more clicks at scale.
🧠 Step 4: Assumptions or Key Ideas
- Offline ≠ Online: A model that predicts well offline may still perform poorly in production.
- User happiness is multidimensional: clicks, satisfaction, novelty, and diversity all matter.
- Distributed evaluation is mandatory for large systems — local testing can’t capture global behavior.
- Long-term success requires balancing short-term engagement with retention.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Offline metrics provide fast, repeatable benchmarking.
- Online metrics capture real-world user behavior.
- Diversity and fairness metrics promote healthier ecosystems.
- Distributed frameworks make billion-scale evaluation feasible.
Limitations:
- Offline tests ignore evolving user preferences.
- Online A/B tests are costly and time-consuming.
- Diversity and fairness metrics are hard to optimize jointly with accuracy.
- Evaluating across systems (mobile/web) adds complexity.
🚧 Step 6: Common Misunderstandings
- “High CTR = success.” Not if users lose interest over time. Engagement ≠ satisfaction.
- “Offline evaluation is enough.” No — it’s only a proxy; real validation happens online.
- “Distributed evaluation is overkill.” For large-scale systems, it’s the only way to compute results efficiently.
🧩 Step 7: Mini Summary
🧠 What You Learned: Scalable recommendation evaluation combines offline precision with online engagement — enriched by metrics for diversity, novelty, and fairness.
⚙️ How It Works: Distributed systems (Spark, Ray) compute large-scale metrics efficiently, while A/B testing measures real-world impact.
🎯 Why It Matters: True recommender success means not just high CTRs — but long-term user trust, retention, and ecosystem health.