1. Introduction


πŸ“ Flashcards

⚑ Short Theories

Problem setup is the most important interview step: ask clarifying questions to constrain scope (requests/sec, latency, data freshness, failure modes).

Use both component-level and end-to-end metrics β€” a better component doesn’t always improve user-facing metrics.

Funnel architecture scales: use cheap recallers/top-k retrieval then expensive rankers; reduces compute and enables complex models only where needed.

Offline metrics help iterate quickly; online experiments (A/B tests) determine real impact and guard against simulation gaps.

Good training data beats clever models: invest in labeling strategies, instrumentation, and human-in-the-loop data augmentation.

Monitor feature distribution drift, label leakage, and production data generation differences to avoid silent performance degradation.

🎀 Interview Q&A

Q1: How do you start when an interviewer asks you to design an ML system (e.g., search ranking or recommendation)?

🎯 TL;DR: Start by clarifying goals, constraints (latency, throughput, freshness), and metrics; turn vague asks into a precise ML problem statement.


🌱 Conceptual Explanation

Begin with questions that reveal scope and constraints. Convert business-level goals into technical requirements (e.g., expected requests/sec, acceptable response time, success metric like NDCG or engagement). This narrows design choices and shows structured thinking.

πŸ“ Technical / Math Details

No heavy math here; focus on mapping requirements to components (a minimal sketch follows this list):

  • Define input/output (query β†’ ranked list).
  • Decide candidate generation & reranking.
  • Define offline vs online evaluation pipelines.
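For concreteness, here is a minimal sketch (with illustrative numbers and field names that are not from the source) of capturing the clarified requirements as an explicit configuration that later design choices can reference:

```python
from dataclasses import dataclass

@dataclass
class RankingRequirements:
    """Clarified constraints for a search-ranking design (illustrative values)."""
    peak_qps: int            # expected requests per second at peak
    p99_latency_ms: int      # end-to-end latency budget
    freshness_seconds: int   # how stale features / the index may be
    primary_metric: str      # e.g. "NDCG@10" offline, "CTR uplift" online

reqs = RankingRequirements(peak_qps=5_000, p99_latency_ms=150,
                           freshness_seconds=300, primary_metric="NDCG@10")

# The budget drives architecture: a tight p99 pushes toward precomputed
# candidates plus a small reranker rather than scoring the full corpus.
if reqs.p99_latency_ms < 200:
    print("Use a retrieval + top-k reranking funnel with precomputed embeddings")
```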

βš–οΈ Trade-offs & Production Notes

  • Asking about latency upfront constrains model complexity.
  • Asking about scale helps decide whether to precompute embeddings or compute them on the fly.
  • Freshness requirements push toward streaming features / online updates.

🚨 Common Pitfalls

  • Assuming unlimited compute; ignoring latency.
  • Skipping metrics that measure user experience.
  • Forgetting failure modes (cold start, missing features).

πŸ—£ Interview-ready Answer

“I’d first clarify the product goal and SLOs (latency/throughput), convert that into a measurable metric (e.g., NDCG or CTR uplift), then propose a candidate generation + ranking funnel with offline and online evaluation plans.”

Q2: Explain the funnel architecture and why it’s useful for large-scale ranking/ad systems.

🎯 TL;DR: Funnel reduces work by using cheap, high-recall stages first and expensive high-precision models later.


🌱 Conceptual Explanation

At scale you cannot score every item with a deep model. The funnel (recall β†’ coarse scoring β†’ fine ranking) prunes the candidate set progressively so only a small set receives the heaviest computation.

πŸ“ Technical / Math Details

Typical stages:

  1. Retrieval via inverted indices / approximate nearest neighbors (ANN), sublinear in the catalog size N.
  2. Lightweight scoring (linear models, shallow trees).
  3. Heavy reranker (deep network, ensemble).

Mathematically: if the recall stage returns k candidates with k β‰ͺ N, per-request compute drops from O(N) to O(k) heavy-model evaluations. A minimal sketch of this funnel follows.
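Here is one such sketch, assuming precomputed item embeddings; the brute-force retrieval stands in for an ANN index, and the scoring functions are placeholders rather than any specific library's API:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100_000, 64
item_embeddings = rng.normal(size=(N, d))        # precomputed offline

def retrieve(query_emb, k=500):
    """Stage 1: high-recall retrieval. Brute-force dot product here;
    in production this would be an ANN index for sublinear lookup."""
    scores = item_embeddings @ query_emb
    return np.argpartition(-scores, k)[:k]

def cheap_score(query_emb, candidate_ids, keep=50):
    """Stage 2: lightweight scoring to prune ~500 candidates down to ~50."""
    scores = item_embeddings[candidate_ids] @ query_emb
    return candidate_ids[np.argsort(-scores)[:keep]]

def heavy_rerank(query_emb, candidate_ids):
    """Stage 3: placeholder for the expensive model, applied only to the
    surviving candidates; we reuse the dot product so the sketch runs."""
    scores = item_embeddings[candidate_ids] @ query_emb
    return candidate_ids[np.argsort(-scores)]

query = rng.normal(size=d)
ranked = heavy_rerank(query, cheap_score(query, retrieve(query)))
print(ranked[:10])   # final top-10 after the funnel
```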

βš–οΈ Trade-offs & Production Notes

  • Improves latency & cost but adds complexity and potential recall loss.
  • Need to tune recall thresholds to avoid dropping relevant items.
  • Use offline simulation to set stage cutoffs; validate with online experiments.

🚨 Common Pitfalls

  • Over-pruning in early stages causing irrecoverable misses.
  • Ignoring alignment between offline recall metric and online impact.

πŸ—£ Interview-ready Answer

“Use a funnel to reduce compute: retrieve high-recall candidates with ANN or simple filters, then apply an expensive neural reranker only to top-k candidates β€” this saves latency and cost while preserving precision.”

Q3: How do you define and choose metrics for offline and online evaluation?

🎯 TL;DR: Pick component metrics (NDCG, log loss, AUC) for development and user-facing end-to-end metrics (CTR, retention, task success) for deployment decisions.


🌱 Conceptual Explanation

Offline metrics let you iterate quickly; online metrics reveal user impact. Both are needed: component metrics isolate model improvements, end-to-end metrics validate system gains.

πŸ“ Technical / Math Details

  • Ranking: NDCG@k for position-weighted relevance; it uses graded relevance rel_i discounted by log2(rank+1) (see the sketch below).
  • Classification: AUC, precision, recall, F1, log-loss.
  • Online: CTR uplift, session length, retention rate.
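A minimal implementation of NDCG@k matching the formula given later in the Key Formulas section (a sketch, not any specific library's API):

```python
import numpy as np

def dcg_at_k(relevances, k):
    """DCG_k = sum_{i=1..k} (2^rel_i - 1) / log2(i + 1)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    ranks = np.arange(1, rel.size + 1)
    return float(np.sum((2 ** rel - 1) / np.log2(ranks + 1)))

def ndcg_at_k(relevances, k):
    """Normalize by the DCG of the ideal (descending-relevance) ordering."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Graded relevance of items in the order the model ranked them:
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=5))   # ~0.876 for this example
```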

βš–οΈ Trade-offs & Production Notes

  • Offline metric improvements may not translate to online due to bias in logged data.
  • Use interleaving or randomized buckets to obtain unbiased estimates when possible.
  • Balance statistical significance vs experiment duration.

🚨 Common Pitfalls

  • Relying only on offline proxies.
  • Not instrumenting the system to gather the right signals for online metrics.

πŸ—£ Interview-ready Answer

“I’d use NDCG/AUC as component metrics for offline iteration and define one or two clear online metrics β€” e.g., CTR and session retention β€” to judge real user-level impact, validating via A/B tests.”

Q4: Describe strategies to gather training data when labeled data is scarce.

🎯 TL;DR: Combine weak supervision, interaction logs, synthetic labels, and transfer learning from pre-trained models.


🌱 Conceptual Explanation

Label scarcity is common. Use proxy signals (clicks, conversions) as weak labels, adopt pre-trained models to extract features, and apply human labeling for high-value samples. Active learning focuses human effort efficiently.

πŸ“ Technical / Math Details

  • Weak label generation: define heuristics h_i(x) and combine them with a label model (e.g., Snorkel style) to estimate the true label.
  • Bootstrapping from logs: treat clicks as positives, but correct for position bias with inverse propensity scoring where possible (see the sketch below).
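A minimal sketch of the inverse propensity correction, assuming a per-position examination propensity has already been estimated; the numbers below are placeholders, and up-weighting only clicked examples is one common simplification:

```python
import numpy as np

# Logged impressions: position shown and whether it was clicked (illustrative).
positions = np.array([1, 2, 3, 1, 5, 2, 4])
clicks    = np.array([1, 0, 1, 0, 1, 1, 0])

# Examination propensity per position, e.g. estimated from a small
# randomization experiment; these values are placeholders.
propensity = {1: 0.9, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.2}

# Clicks at low-attention positions count more, correcting for the fact that
# items shown lower are clicked less often regardless of relevance.
weights = np.where(clicks == 1,
                   1.0 / np.array([propensity[p] for p in positions]),
                   1.0)

# Use `weights` as per-example sample weights when training, e.g.
# model.fit(X, clicks, sample_weight=weights) in scikit-learn-style APIs.
print(weights)
```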

βš–οΈ Trade-offs & Production Notes

  • Weak labels are noisy; require robust models and calibration.
  • Balancing human labeling cost vs model gain: use active learning to choose samples that maximize information.

🚨 Common Pitfalls

  • Treating clicks as ground truth without debiasing.
  • Not validating weak labels on a labeled subset.

πŸ—£ Interview-ready Answer

“Use interaction logs as weak labels corrected for bias, augment with transfer learning from pre-trained models, and apply active learning + targeted human labeling for the highest-value examples.”

Q5: What production monitoring and alerting would you set up for an ML system?

🎯 TL;DR: Monitor model performance, data quality, latency, and business metrics with alerts for drift, SLA breaches, and anomalous behavior.


🌱 Conceptual Explanation

Monitoring prevents silent failures. Observe inputs, feature distributions, model outputs, prediction latencies, and downstream business KPIs.

πŸ“ Technical / Math Details

  • Data drift: track KL divergence / population statistics between training and production features (see the sketch after this list).
  • Performance: track moving-average of AUC or proxy metrics on sampled labeled data.
  • Latency: p95/p99 response times.
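A minimal per-feature drift check, assuming training and production values of a numeric feature are available as samples; the bin count and alert threshold are illustrative assumptions:

```python
import numpy as np

def kl_divergence(p_samples, q_samples, bins=20, eps=1e-9):
    """Approximate D_KL(P || Q) for one feature by histogramming both samples
    on shared bin edges; eps smoothing avoids division by zero."""
    edges = np.histogram_bin_edges(np.concatenate([p_samples, q_samples]), bins=bins)
    p, _ = np.histogram(p_samples, bins=edges)
    q, _ = np.histogram(q_samples, bins=edges)
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

train_feature = np.random.normal(0.0, 1.0, size=10_000)   # training distribution
prod_feature  = np.random.normal(0.3, 1.2, size=10_000)   # shifted production data

drift = kl_divergence(train_feature, prod_feature)
if drift > 0.1:    # threshold would be tuned per feature in practice
    print(f"ALERT: feature drift detected, KL = {drift:.3f}")
```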

βš–οΈ Trade-offs & Production Notes

  • High-frequency metrics detect problems faster but increase monitoring noise and cost.
  • Decide which anomalies trigger an automatic rollback versus an alert to the owner.

🚨 Common Pitfalls

  • Only monitoring model score, not the input distributions.
  • Not instrumenting shadow traffic for safe evaluation.

πŸ—£ Interview-ready Answer

“I’d monitor feature distributions, model outputs, latencies (p95/p99), and business KPIs; set thresholds for drift and SLA breaches and create automated alerts and rollback procedures.”

Q6: How do you debug an ML model that performs well offline but fails online?

🎯 TL;DR: Compare training vs production data distributions, check feature pipelines, and isolate component vs system-level issues.


🌱 Conceptual Explanation

Differences between offline train/val data and online serving data (label leakage, feature skew, stale features) often cause regressions. Reproduce production pipeline locally and run targeted A/B diagnostics.

πŸ“ Technical / Math Details

  • Compare histograms, means, and KL divergence for each feature (a comparison sketch follows this list).
  • Use attribution (SHAP/feature importance) to see which features changed influence.
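A minimal sketch of that comparison, assuming training and serving logs are available as pandas DataFrames with identical columns; the feature names below are made up for illustration:

```python
import numpy as np
import pandas as pd

def compare_features(train_df: pd.DataFrame, prod_df: pd.DataFrame) -> pd.DataFrame:
    """Summarize mean and null rate per feature in both environments,
    surfacing the features whose statistics moved the most."""
    rows = []
    for col in train_df.columns:
        rows.append({
            "feature": col,
            "train_mean": train_df[col].mean(),
            "prod_mean": prod_df[col].mean(),
            "train_null_rate": train_df[col].isna().mean(),
            "prod_null_rate": prod_df[col].isna().mean(),
        })
    report = pd.DataFrame(rows)
    report["mean_shift"] = (report["prod_mean"] - report["train_mean"]).abs()
    return report.sort_values("mean_shift", ascending=False)

# Illustrative data: the 'recency_days' feature went stale in production.
train = pd.DataFrame({"ctr_7d": np.random.beta(2, 50, 1000),
                      "recency_days": np.random.exponential(2, 1000)})
prod  = pd.DataFrame({"ctr_7d": np.random.beta(2, 50, 1000),
                      "recency_days": np.random.exponential(30, 1000)})
print(compare_features(train, prod))
```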

βš–οΈ Trade-offs & Production Notes

  • Time-consuming to reproduce production environment end-to-end.
  • Consider shadow deployments to compare outputs without affecting users.

🚨 Common Pitfalls

  • Ignoring subtle feature transformation mismatches (e.g., missing scaling).
  • Not checking online feature freshness or missing keys.

πŸ—£ Interview-ready Answer

“I’d first compare production vs training feature distributions and pipeline transformations, validate feature freshness and key integrity, and run shadow tests to isolate whether a component or system interaction is causing the failure.”

Q7: When should you use pre-trained SOTA models vs training from scratch?

🎯 TL;DR: Use pre-trained models to save labeling cost and training time when domain alignment exists; train from scratch when domain shift is large or when you need a custom architecture or constraints.


🌱 Conceptual Explanation

Pre-trained models transfer general knowledge and reduce labeled-data needs. But if the target distribution diverges substantially, fine-tuning or full retraining may be necessary.

πŸ“ Technical / Math Details

  • Transfer learning: initialize weights from a pre-trained model, then fine-tune the last layers or the entire model depending on label quantity (see the sketch after this list).
  • Regularization and learning-rate schedules matter to avoid catastrophic forgetting.
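A minimal PyTorch-style sketch of freezing a backbone and fine-tuning only a new head; `load_pretrained_encoder` is a stand-in for whatever pre-trained model is actually used, not a real library call:

```python
import torch
import torch.nn as nn

def load_pretrained_encoder() -> nn.Module:
    """Placeholder: stands in for a real pre-trained backbone
    (e.g. a vision or text encoder) with a 512-dim output."""
    return nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 512))

encoder = load_pretrained_encoder()

# Freeze the backbone so a small labeled set cannot distort general features.
for param in encoder.parameters():
    param.requires_grad = False

# Train only a new task-specific head; unfreeze deeper layers later
# (with a lower learning rate) if enough labels are available.
model = nn.Sequential(encoder, nn.Linear(512, 2))
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

x, y = torch.randn(32, 128), torch.randint(0, 2, (32,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
optimizer.step()
```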

βš–οΈ Trade-offs & Production Notes

  • Pre-trained models may add inference cost; consider distillation or pruning for production.
  • Licensing and privacy considerations when using third-party models.

🚨 Common Pitfalls

  • Blindly fine-tuning without validating on domain-specific holdouts.
  • Overfitting small labeled sets when fine-tuning large models.

πŸ—£ Interview-ready Answer

“Prefer pre-trained models when label data is limited and the domain is similar; if latency or domain mismatch is problematic, consider distillation, targeted fine-tuning, or training a smaller model from scratch.”

Q8: How do you design offline & online experiments for a new ranking model?

🎯 TL;DR: Use offline holdouts and back-testing for fast iteration; run controlled randomized online experiments (A/B) with meaningful business metrics and sufficient power.


🌱 Conceptual Explanation

Offline tests (holdout or temporal splits) vet candidate models; online A/B tests reveal real-world effects and unintended regressions. Use canary releases and metric guardrails.

πŸ“ Technical / Math Details

  • Offline: temporal validation to mimic production drift.
  • Online: randomize users into control/treatment, collect key metrics, compute lift and confidence intervals (see the sketch below); ensure the sample size gives adequate statistical power.
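A minimal sketch of the lift and confidence-interval computation using a two-proportion normal approximation; the counts are illustrative, and a real analysis would also pre-compute the required sample size:

```python
import math

# Illustrative counts from control and treatment buckets.
clicks_c, users_c = 4_800, 100_000
clicks_t, users_t = 5_150, 100_000

p_c, p_t = clicks_c / users_c, clicks_t / users_t
lift = (p_t - p_c) / p_c

# Standard error of the difference in proportions (normal approximation).
se = math.sqrt(p_c * (1 - p_c) / users_c + p_t * (1 - p_t) / users_t)
z = (p_t - p_c) / se
ci_low, ci_high = (p_t - p_c) - 1.96 * se, (p_t - p_c) + 1.96 * se

print(f"relative lift = {lift:.2%}, z = {z:.2f}, "
      f"95% CI for absolute diff = [{ci_low:.4f}, {ci_high:.4f}]")
```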

βš–οΈ Trade-offs & Production Notes

  • Online tests are slow and risk user experience; limit risk with canary and short-duration checks.
  • Multiple metrics require a pre-defined primary metric to avoid p-hacking.

🚨 Common Pitfalls

  • Running underpowered experiments.
  • Not pre-registering primary metric or stopping rules.

πŸ—£ Interview-ready Answer

“I’d validate models offline using temporal holdouts, then run a randomized A/B experiment with a primary business metric and safety guardrails, starting with a small canary before full rollout.”

Q9: Explain feature engineering best practices for recommendation systems.

🎯 TL;DR: Build user, item, and context features; include temporal and interaction features; ensure reproducibility and feature freshness.


🌱 Conceptual Explanation

Inspect actors (user, item, context) and derive features that capture preferences, recency, and temporal effects. Use embeddings for high-cardinality IDs.

πŸ“ Technical / Math Details

  • Aggregates: user_click_rate = clicks / impressions over window T (computed in the sketch after this list).
  • Temporal: time_since_last_interaction, day_of_week, holiday flags.
  • Embeddings: learn item/user embeddings via matrix factorization or neural approaches.
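A minimal pandas sketch of computing windowed aggregates and a recency feature from an interaction log; the column names and window length are assumptions for illustration:

```python
import pandas as pd

# Illustrative interaction log; in production this comes from event streams.
events = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime(["2024-05-01", "2024-05-20", "2024-06-01",
                                 "2024-05-15", "2024-06-02"]),
    "clicked":   [1, 0, 1, 0, 1],
})

as_of = pd.Timestamp("2024-06-03")          # feature snapshot time
window = events[events["timestamp"] >= as_of - pd.Timedelta(days=30)]

features = window.groupby("user_id").agg(
    impressions_30d=("clicked", "size"),
    clicks_30d=("clicked", "sum"),
    last_seen=("timestamp", "max"),
)
features["click_rate_30d"] = features["clicks_30d"] / features["impressions_30d"]
features["days_since_last_interaction"] = (as_of - features["last_seen"]).dt.days
print(features)
```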

βš–οΈ Trade-offs & Production Notes

  • Aggregates require streaming or incremental computation for freshness.
  • Embeddings improve expressivity but complicate deployment (storage, versioning).

🚨 Common Pitfalls

  • Leakage from future features.
  • Not aligning training-time feature computation with serving-time.

πŸ—£ Interview-ready Answer

“Design features across user, item, and context (including time windows and aggregates), use embeddings for high-cardinality fields, and ensure training/serving pipelines compute features identically and with required freshness.”

Q10: What are common failure modes specific to ML system design and how to mitigate them?

🎯 TL;DR: Common failures: data skew/drift, label leakage, pipeline mismatches, and metric misalignment; mitigate via monitoring, shadowing, and robust eval.


🌱 Conceptual Explanation

Failures often arise from the mismatch between assumptions made during development and production realities. Proactive checks and robust CI for data and models reduce surprises.

πŸ“ Technical / Math Details

  • Drift detection via KL divergence or population stats.
  • Label leakage detection: check whether causally downstream signals are used as features (a single-feature screen is sketched below).
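One simple leakage screen (a heuristic sketch, not the only method): score each feature on its own, since a single feature that predicts the label almost perfectly is often a causally downstream, leaked signal. The data below is synthetic and uses scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=5_000)
X = np.column_stack([
    rng.normal(size=5_000),                     # genuine, weak feature
    y + rng.normal(scale=0.05, size=5_000),     # leaked: nearly equals the label
])

# Suspiciously high single-feature AUC (close to 1.0) is a red flag
# worth a causal / pipeline review.
for j in range(X.shape[1]):
    clf = LogisticRegression().fit(X[:, [j]], y)
    auc = roc_auc_score(y, clf.predict_proba(X[:, [j]])[:, 1])
    print(f"feature {j}: single-feature AUC = {auc:.3f}")
```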

βš–οΈ Trade-offs & Production Notes

  • Building comprehensive test infrastructure costs time but prevents major outages.
  • Some mitigations (e.g., conservative rollouts) slow innovation.

🚨 Common Pitfalls

  • Not testing edge-cases (sparse users/items).
  • Ignoring feature null-handling at scale.

πŸ—£ Interview-ready Answer

“Watch for data drift, pipeline mismatches, and label leakage. Mitigate with monitoring, shadow deployments, unit tests for feature pipelines, and gradual canary rollouts with rollback capabilities.”

πŸ“ Key Formulas

NDCG@k (Normalized Discounted Cumulative Gain)
$$\text{DCG}_k = \sum_{i=1}^k \frac{2^{rel_i}-1}{\log_2(i+1)}$$

$\text{NDCG}_k = \frac{\text{DCG}_k}{\text{IDCG}_k}$

  • $rel_i$: graded relevance of item at rank $i$.
  • $k$: cutoff rank.
  • $\text{IDCG}_k$: ideal DCG (max possible DCG@k).

Interpretation: Rewards placing highly relevant items near the top; normalized so perfect ranking = 1.

AUC (Area Under ROC Curve)

AUC measures probability a random positive ranks higher than a random negative.

  • Equivalent to the Mann–Whitney U statistic (a rank-based computation is sketched below).

Interpretation: Useful for imbalanced binary classification; insensitive to threshold.
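A minimal rank-based computation of AUC via the Mann–Whitney equivalence noted above (a sketch; scipy's rankdata handles ties with average ranks):

```python
import numpy as np
from scipy.stats import rankdata

def auc_rank(y_true, scores):
    """AUC via the Mann-Whitney U statistic:
    U = (sum of positive ranks) - n_pos*(n_pos+1)/2, AUC = U / (n_pos*n_neg)."""
    y = np.asarray(y_true)
    ranks = rankdata(scores)          # average ranks handle ties
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    u = ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

y = np.array([0, 0, 1, 1, 0, 1])
s = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
print(auc_rank(y, s))   # same value sklearn.metrics.roc_auc_score(y, s) gives
```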
Cross-Entropy Loss (Binary / Multiclass)

Binary: $L = -\big(y\log(\hat{p}) + (1-y)\log(1-\hat{p})\big)$

Multiclass: $L = -\sum_i y_i \log(\hat{y}_i)$

  • $y, y_i$: true label(s).
  • $\hat{p}, \hat{y}_i$: predicted probabilities.

Interpretation: Penalizes confident wrong predictions; aligns with maximum likelihood.

Precision / Recall / F1

$\text{Precision} = \frac{TP}{TP+FP}, \quad \text{Recall} = \frac{TP}{TP+FN}$

$F1 = 2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$

  • $TP$: true positives, $FP$: false positives, $FN$: false negatives.

Interpretation: Precision measures correctness among positives; recall measures coverage of positives.

KL Divergence (for drift detection)

$D_{KL}(P \,\|\, Q) = \sum_x P(x) \log\frac{P(x)}{Q(x)}$

  • $P$: training distribution; $Q$: production distribution.

Interpretation: Quantifies how different two distributions are; used to detect feature drift.

βœ… Cheatsheet

  • Problem setup: always clarify SLOs (latency, throughput), freshness, and primary metric first.
  • Architecture: use funnel (retrieval β†’ coarse scoring β†’ reranker) for scale.
  • Data: combine user-interaction signals, weak supervision, public datasets, and human labeling.
  • Evaluation: iterate offline (NDCG/AUC) but validate via online A/B tests on primary business metric.
  • Features: ensure training-serving parity and monitor feature distributions in production.
  • Monitoring: track feature drift, latencies (p95/p99), and business KPIs; use automated alerts and rollback.
  • Deployment: canary β†’ phased rollout β†’ full rollout; shadow mode for low-risk verification.
  • Pre-trained models: fine-tune when the domains align; distill/prune to meet production latency constraints.