1. Introduction
📚 Flashcards
⚡ Short Theories
Problem setup is the most important interview step: ask clarifying questions to constrain scope (requests/sec, latency, data freshness, failure modes).
Use both component-level and end-to-end metrics; a better component doesn’t always improve user-facing metrics.
Funnel architecture scales: use cheap recallers/top-k retrieval then expensive rankers; reduces compute and enables complex models only where needed.
Offline metrics help iterate quickly; online experiments (A/B tests) determine real impact and guard against simulation gaps.
Good training data beats clever models: invest in labeling strategies, instrumentation, and human-in-the-loop data augmentation.
Monitor feature distribution drift, label leakage, and production data generation differences to avoid silent performance degradation.
🎤 Interview Q&A
Q1: How do you start when an interviewer asks you to design an ML system (e.g., search ranking or recommendation)?
🎯 TL;DR: Start by clarifying goals, constraints (latency, throughput, freshness), and metrics; turn vague asks into a precise ML problem statement.
🌱 Conceptual Explanation
Begin with questions that reveal scope and constraints. Convert business-level goals into technical requirements (e.g., expected requests/sec, acceptable response time, success metric like NDCG or engagement). This narrows design choices and shows structured thinking.
📐 Technical / Math Details
No heavy math here; focus on mapping requirements to components (a minimal spec sketch follows this list):
- Define input/output (query → ranked list).
- Decide candidate generation & reranking.
- Define offline vs online evaluation pipelines.
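One lightweight way to make this mapping concrete is to write the clarified requirements down as a structured spec before sketching components. The snippet below is a hypothetical illustration; the class name, fields, and numbers are assumptions, not from the source.

```python
from dataclasses import dataclass

@dataclass
class RankingProblemSpec:
    """Illustrative problem statement for a search-ranking design discussion."""
    input_description: str    # e.g., free-text query plus user/session context
    output_description: str   # e.g., top-10 ranked documents
    peak_qps: int             # expected requests per second at peak
    p99_latency_ms: int       # end-to-end latency budget
    freshness_seconds: int    # how stale candidate/feature data may be
    offline_metric: str       # component metric used for fast iteration
    online_metric: str        # user-facing metric used for launch decisions

# Hypothetical numbers for a mid-sized search product.
spec = RankingProblemSpec(
    input_description="query + user context",
    output_description="ranked list of documents",
    peak_qps=5_000,
    p99_latency_ms=300,
    freshness_seconds=3600,
    offline_metric="NDCG@10",
    online_metric="CTR / session success rate",
)
print(spec)
```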
⚖️ Trade-offs & Production Notes
- Asking latency upfront limits model complexity.
- Asking scale helps decide whether to precompute embeddings or compute on-the-fly.
- Freshness needs push toward streaming features / online updates.
🚨 Common Pitfalls
- Assuming unlimited compute; ignoring latency.
- Skipping metrics that measure user experience.
- Forgetting failure modes (cold start, missing features).
🗣️ Interview-ready Answer
“I’d first clarify the product goal and SLOs (latency/throughput), convert that into a measurable metric (e.g., NDCG or CTR uplift), then propose a candidate generation + ranking funnel with offline and online evaluation plans.”
Q2: Explain the funnel architecture and why it’s useful for large-scale ranking/ad systems.
🎯 TL;DR: Funnel reduces work by using cheap, high-recall stages first and expensive high-precision models later.
🌱 Conceptual Explanation
At scale you cannot score every item with a deep model. The funnel (recall → coarse scoring → fine ranking) prunes the candidate set progressively so only a small set receives the heaviest computation.
📐 Technical / Math Details
Typical stages:
- Retrieval via inverted indices / approximate nearest neighbors (ANN): sublinear in the corpus size, often close to O(log N).
- Lightweight scoring (linear models, shallow trees).
- Heavy reranker (deep network, ensemble). Mathematically: if the recall stage returns k candidates with k ≪ N, the heavy-model compute cost drops from O(N) to O(k) (see the sketch after this list).
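A minimal sketch of the funnel, assuming query and item embeddings already exist: brute-force cosine similarity stands in for a real ANN index in the retrieval stage, and a plain dot product stands in for the heavy reranker, so the example stays self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed inputs: precomputed item embeddings and one query embedding.
N, d = 100_000, 64
item_emb = rng.standard_normal((N, d)).astype(np.float32)
query_emb = rng.standard_normal(d).astype(np.float32)

def retrieve_top_k(query, items, k=500):
    """Cheap, high-recall stage. In production this would be an ANN index
    (e.g., HNSW/IVF); brute-force cosine similarity is used here for clarity."""
    items_n = items / np.linalg.norm(items, axis=1, keepdims=True)
    scores = items_n @ (query / np.linalg.norm(query))
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]

def heavy_rerank(query, items, candidate_ids, k=10):
    """Expensive, high-precision stage. A deep cross-encoder or GBDT ensemble
    would go here; a dot product keeps the sketch runnable."""
    scores = items[candidate_ids] @ query
    return candidate_ids[np.argsort(-scores)[:k]]

candidates = retrieve_top_k(query_emb, item_emb, k=500)      # all N items, cheaply
final = heavy_rerank(query_emb, item_emb, candidates, k=10)  # only k << N items
print(final)
```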
⚖️ Trade-offs & Production Notes
- Improves latency & cost but adds complexity and potential recall loss.
- Need to tune recall thresholds to avoid dropping relevant items.
- Use offline simulation to set stage cutoffs; validate with online experiments.
🚨 Common Pitfalls
- Over-pruning in early stages causing irrecoverable misses.
- Ignoring alignment between offline recall metric and online impact.
🗣️ Interview-ready Answer
“Use a funnel to reduce compute: retrieve high-recall candidates with ANN or simple filters, then apply an expensive neural reranker only to top-k candidates; this saves latency and cost while preserving precision.”
Q3: How do you define and choose metrics for offline and online evaluation?
🎯 TL;DR: Pick component metrics (NDCG, log loss, AUC) for development and user-facing end-to-end metrics (CTR, retention, task success) for deployment decisions.
🌱 Conceptual Explanation
Offline metrics let you iterate quickly; online metrics reveal user impact. Both are needed: component metrics isolate model improvements, end-to-end metrics validate system gains.
📐 Technical / Math Details
- Ranking: NDCG@k for position-weighted relevance.
- Classification: AUC, precision, recall, F1, log-loss.
- Online: CTR uplift, session length, retention rate.
- NDCG uses graded relevance $rel_i$ with a $\log_2(\text{rank}+1)$ discount (a small implementation follows this list).
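To make the ranking metric concrete, here is a small NDCG@k helper. It assumes graded relevance labels and the common exponential-gain form $(2^{rel}-1)/\log_2(\text{rank}+1)$; treat it as an illustrative sketch rather than a canonical implementation.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: exponential gain, log2(rank+1) discount."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # rank starts at 1
    return float(np.sum((2.0 ** rel - 1.0) / discounts))

def ndcg_at_k(relevances, k):
    """NDCG@k = DCG@k / IDCG@k; returns 0 when no item is relevant."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Graded relevance of items in the order the model ranked them.
print(round(ndcg_at_k([3, 2, 3, 0, 1], k=5), 3))  # ≈ 0.96 for this toy ranking
```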
⚖️ Trade-offs & Production Notes
- Offline metric improvements may not translate to online due to bias in logged data.
- Use interleaving or randomized buckets to obtain unbiased estimates when possible.
- Balance statistical significance vs experiment duration.
🚨 Common Pitfalls
- Relying only on offline proxies.
- Not instrumenting the system to gather the right signals for online metrics.
🗣️ Interview-ready Answer
“I’d use NDCG/AUC as component metrics for offline iteration and define one or two clear online metrics (e.g., CTR and session retention) to judge real user-level impact, validating via A/B tests.”
Q4: Describe strategies to gather training data when labeled data is scarce.
🎯 TL;DR: Combine weak supervision, interaction logs, synthetic labels, and transfer learning from pre-trained models.
🌱 Conceptual Explanation
Label scarcity is common. Use proxy signals (clicks, conversions) as weak labels, adopt pre-trained models to extract features, and apply human labeling for high-value samples. Active learning focuses human effort efficiently.
📐 Technical / Math Details
- Weak label generation: define heuristics h_i(x) and combine with label model (e.g., Snorkel style) to estimate true label.
- Bootstrapping from logs: treat clicks as positives, but correct for position bias with inverse propensity scoring where possible.
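A toy sketch of the debiasing idea: logged clicks become weak positives, reweighted by the inverse of a per-position examination propensity. The propensity values below are made up; in practice they would be estimated, e.g., via result randomization or a click model.

```python
import numpy as np

# Synthetic logged impressions: position shown and whether it was clicked.
positions = np.array([1, 1, 2, 3, 3, 5, 8, 10])
clicks    = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Assumed examination propensities per position (illustrative numbers only).
propensity = {1: 0.95, 2: 0.80, 3: 0.65, 5: 0.40, 8: 0.20, 10: 0.12}

# Inverse-propensity weights: clicks at rarely-examined positions count more,
# compensating for the exposure bias in the logs.
weights = np.array([clicks[i] / propensity[int(p)] for i, p in enumerate(positions)])

print("raw click rate:       ", clicks.mean())
print("IPS-corrected weights:", weights)  # use as per-example weights when training
```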
⚖️ Trade-offs & Production Notes
- Weak labels are noisy; require robust models and calibration.
- Balancing human labeling cost vs model gain: use active learning to choose samples that maximize information.
🚨 Common Pitfalls
- Treating clicks as ground truth without debiasing.
- Not validating weak labels on a labeled subset.
🗣️ Interview-ready Answer
“Use interaction logs as weak labels corrected for bias, augment with transfer learning from pre-trained models, and apply active learning + targeted human labeling for the highest-value examples.”
Q5: What production monitoring and alerting would you set up for an ML system?
🎯 TL;DR: Monitor model performance, data quality, latency, and business metrics with alerts for drift, SLA breaches, and anomalous behavior.
🌱 Conceptual Explanation
Monitoring prevents silent failures. Observe inputs, feature distributions, model outputs, prediction latencies, and downstream business KPIs.
📐 Technical / Math Details
- Data drift: track KL divergence / population statistics between training and production features.
- Performance: track moving-average of AUC or proxy metrics on sampled labeled data.
- Latency: p95/p99 response times.
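A minimal drift check for one numeric feature, assuming a reference (training) sample and a recent production sample are available: bin both on shared edges, smooth empty bins, and compare KL divergence to an alert threshold. The threshold here is an arbitrary illustration and would normally be tuned per feature from historical variation.

```python
import numpy as np

def kl_drift(reference, production, bins=20, eps=1e-9):
    """KL(P_reference || Q_production) over a shared histogram binning."""
    edges = np.histogram_bin_edges(np.concatenate([reference, production]), bins=bins)
    p, _ = np.histogram(reference, bins=edges)
    q, _ = np.histogram(production, bins=edges)
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(42)
train_feature = rng.normal(0.0, 1.0, 50_000)
prod_feature  = rng.normal(0.4, 1.2, 50_000)   # shifted and wider in production

score = kl_drift(train_feature, prod_feature)
ALERT_THRESHOLD = 0.05   # illustrative value
print(f"KL divergence = {score:.3f}", "-> ALERT" if score > ALERT_THRESHOLD else "-> ok")
```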
⚖️ Trade-offs & Production Notes
- High-frequency metrics detect problems faster but increase monitoring noise and cost.
- Decide which anomalies auto-rollback vs alert to owner.
🚨 Common Pitfalls
- Only monitoring model score, not the input distributions.
- Not instrumenting shadow traffic for safe evaluation.
🗣️ Interview-ready Answer
“I’d monitor feature distributions, model outputs, latencies (p95/p99), and business KPIs; set thresholds for drift and SLA breaches and create automated alerts and rollback procedures.”
Q6: How do you debug an ML model that performs well offline but fails online?
🎯 TL;DR: Compare training vs production data distributions, check feature pipelines, and isolate component vs system-level issues.
🌱 Conceptual Explanation
Differences between offline train/val data and online serving data (label leakage, feature skew, stale features) often cause regressions. Reproduce production pipeline locally and run targeted A/B diagnostics.
📐 Technical / Math Details
- Compare histograms, means, and KL divergence for each feature.
- Use attribution (SHAP/feature importance) to see which features changed influence.
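One concrete way to run the comparison: compute per-feature summary statistics on the offline training frame and on a sample of logged serving requests, then flag the features whose mean or null rate moved the most. The data and column names below are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical feature frames: offline training data vs logged serving requests.
train = pd.DataFrame({
    "user_click_rate": rng.beta(2, 8, 10_000),
    "item_age_days":   rng.exponential(30, 10_000),
})
serving = pd.DataFrame({
    "user_click_rate": rng.beta(2, 8, 10_000),
    "item_age_days":   rng.exponential(45, 10_000),          # skewed in production
})
serving.loc[rng.random(10_000) < 0.05, "item_age_days"] = np.nan  # missing keys

def feature_skew_report(train_df, serving_df):
    rows = []
    for col in train_df.columns:
        rows.append({
            "feature": col,
            "train_mean": train_df[col].mean(),
            "serve_mean": serving_df[col].mean(),
            "train_null_rate": train_df[col].isna().mean(),
            "serve_null_rate": serving_df[col].isna().mean(),
        })
    report = pd.DataFrame(rows)
    report["mean_shift_pct"] = (
        (report["serve_mean"] - report["train_mean"]).abs() / report["train_mean"].abs() * 100
    )
    return report.sort_values("mean_shift_pct", ascending=False)

print(feature_skew_report(train, serving))
```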
⚖️ Trade-offs & Production Notes
- Time-consuming to reproduce production environment end-to-end.
- Consider shadow deployments to compare outputs without affecting users.
🚨 Common Pitfalls
- Ignoring subtle feature transformation mismatches (e.g., missing scaling).
- Not checking online feature freshness or missing keys.
🗣️ Interview-ready Answer
“I’d first compare production vs training feature distributions and pipeline transformations, validate feature freshness and key integrity, and run shadow tests to isolate whether a component or system interaction is causing the failure.”
Q7: When should you use pre-trained SOTA models vs training from scratch?
🎯 TL;DR: Use pre-trained models to save labeling cost and training time when domain alignment exists; train from scratch when domain shift is large or when you need a custom architecture or constraints.
🌱 Conceptual Explanation
Pre-trained models transfer general knowledge and reduce labeled-data needs. But if the target distribution diverges substantially, fine-tuning or full retraining may be necessary.
📐 Technical / Math Details
- Transfer learning: initialize weights from pre-trained model, fine-tune last layers or the entire model depending on label quantity.
- Regularization and learning-rate schedules matter to avoid catastrophic forgetting.
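A minimal PyTorch-style sketch of this pattern: freeze (or slow down) a pre-trained backbone and train a fresh task head, using separate learning rates to limit catastrophic forgetting. The backbone below is a stand-in module, not a real pre-trained checkpoint.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained encoder (in practice, load a checkpoint or hub model).
backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
head = nn.Linear(256, 2)   # new task-specific classification head

# Option A: freeze the backbone entirely (useful with very little labeled data).
for p in backbone.parameters():
    p.requires_grad = False

# Option B: fine-tune everything, but give the backbone a much smaller learning
# rate (skip the freezing loop above if you use this).
optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": head.parameters(),     "lr": 1e-3},
], weight_decay=0.01)

loss_fn = nn.CrossEntropyLoss()
x = torch.randn(32, 128)          # toy batch of features
y = torch.randint(0, 2, (32,))    # toy labels

loss = loss_fn(head(backbone(x)), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```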
⚖️ Trade-offs & Production Notes
- Pre-trained models may add inference cost; consider distillation or pruning for production.
- Licensing and privacy considerations when using third-party models.
🚨 Common Pitfalls
- Blindly fine-tuning without validating on domain-specific holdouts.
- Overfitting small labeled sets when fine-tuning large models.
🗣️ Interview-ready Answer
“Prefer pre-trained models when label data is limited and the domain is similar; if latency or domain mismatch is problematic, consider distillation, targeted fine-tuning, or training a smaller model from scratch.”
Q8: How do you design offline & online experiments for a new ranking model?
🎯 TL;DR: Use offline holdouts and back-testing for fast iteration; run controlled randomized online experiments (A/B) with meaningful business metrics and sufficient power.
🌱 Conceptual Explanation
Offline tests (holdout or temporal splits) vet candidate models; online A/B tests reveal real-world effects and unintended regressions. Use canary releases and metric guardrails.
📐 Technical / Math Details
- Offline: temporal validation to mimic production drift.
- Online: randomize users into control/treatment, collect key metrics, compute lift and confidence intervals; ensure sample size for statistical power.
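A sketch of the online-analysis arithmetic with made-up counts: observed CTR lift, a normal-approximation confidence interval for the difference in proportions, and a rough per-arm sample-size estimate for a target minimum detectable effect.

```python
import math
from scipy.stats import norm

# Hypothetical experiment counts.
clicks_c, n_c = 10_200, 500_000    # control
clicks_t, n_t = 10_650, 500_000    # treatment

p_c, p_t = clicks_c / n_c, clicks_t / n_t
lift = (p_t - p_c) / p_c

# 95% CI for the difference in proportions (normal approximation).
se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
z = norm.ppf(0.975)
ci = ((p_t - p_c) - z * se, (p_t - p_c) + z * se)
print(f"relative lift = {lift:.2%}, diff CI95 = ({ci[0]:.5f}, {ci[1]:.5f})")

def samples_per_arm(base_rate, rel_mde, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-proportion z-test at the given power."""
    p1, p2 = base_rate, base_rate * (1 + rel_mde)
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    pooled = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * pooled * (1 - pooled))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

print("users per arm to detect +2% relative CTR:", samples_per_arm(0.02, 0.02))
```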
⚖️ Trade-offs & Production Notes
- Online tests are slow and risk user experience; limit risk with canary and short-duration checks.
- Multiple metrics require pre-defined primary metric to avoid p-hacking.
🚨 Common Pitfalls
- Running underpowered experiments.
- Not pre-registering primary metric or stopping rules.
🗣️ Interview-ready Answer
“I’d validate models offline using temporal holdouts, then run a randomized A/B experiment with a primary business metric and safety guardrails, starting with a small canary before full rollout.”
Q9: Explain feature engineering best practices for recommendation systems.
🎯 TL;DR: Build user, item, and context features; include temporal and interaction features; ensure reproducibility and feature freshness.
🌱 Conceptual Explanation
Inspect actors (user, item, context) and derive features that capture preferences, recency, and temporal effects. Use embeddings for high-cardinality IDs.
📐 Technical / Math Details
- Aggregates: user_click_rate = clicks / impressions over window T.
- Temporal: time_since_last_interaction, day_of_week, holiday flags.
- Embeddings: learn item/user embeddings via matrix factorization or neural approaches.
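A small pandas sketch of the windowed-aggregate idea: a user's click rate over a trailing 7-day window, computed from a hypothetical interaction log and using only events strictly before the reference time, so the training-time feature matches what serving could have known.

```python
import pandas as pd

# Hypothetical interaction log.
log = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 2],
    "ts": pd.to_datetime([
        "2024-05-01", "2024-05-03", "2024-05-06",
        "2024-05-02", "2024-05-04", "2024-05-05", "2024-05-07",
    ]),
    "clicked": [1, 0, 1, 0, 0, 1, 1],
})

def user_click_rate(log_df, user_id, as_of, window_days=7):
    """Click rate over the trailing window, using only events before `as_of`
    to avoid leaking future interactions into the feature."""
    start = as_of - pd.Timedelta(days=window_days)
    win = log_df[(log_df["user_id"] == user_id)
                 & (log_df["ts"] >= start)
                 & (log_df["ts"] < as_of)]
    return win["clicked"].mean() if len(win) else 0.0

as_of = pd.Timestamp("2024-05-07")
print(user_click_rate(log, user_id=1, as_of=as_of))  # 2 clicks / 3 impressions ≈ 0.67
print(user_click_rate(log, user_id=2, as_of=as_of))  # 1 click / 3 impressions ≈ 0.33
```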
⚖️ Trade-offs & Production Notes
- Aggregates require streaming or incremental computation for freshness.
- Embeddings improve expressivity but complicate deployment (storage, versioning).
🚨 Common Pitfalls
- Leakage from future features.
- Not aligning training-time feature computation with serving-time.
🗣️ Interview-ready Answer
“Design features across user, item, and context (including time windows and aggregates), use embeddings for high-cardinality fields, and ensure training/serving pipelines compute features identically and with required freshness.”
Q10: What are common failure modes specific to ML system design and how to mitigate them?
🎯 TL;DR: Common failures: data skew/drift, label leakage, pipeline mismatches, and metric misalignment; mitigate via monitoring, shadowing, and robust eval.
🌱 Conceptual Explanation
Failures often arise from the mismatch between assumptions made during development and production realities. Proactive checks and robust CI for data and models reduce surprises.
📐 Technical / Math Details
- Drift detection via KL divergence or population stats.
- Label leakage detection by checking causally downstream signals used as features.
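One quick leakage screen, sketched on synthetic data with scikit-learn: fit a tiny model on each feature alone and flag features whose solo AUC is implausibly high, which often indicates a signal generated after (or by) the label. The features and cutoff below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20_000
y = rng.integers(0, 2, n)

X = np.column_stack([
    rng.normal(size=n),                       # unrelated noise
    y * 0.5 + rng.normal(scale=0.6, size=n),  # genuinely predictive signal
    y + rng.normal(scale=0.05, size=n),       # "leaky" feature derived from the label
])
names = ["noise", "legit_signal", "post_label_signal"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
for i, name in enumerate(names):
    clf = LogisticRegression().fit(X_tr[:, [i]], y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te[:, [i]])[:, 1])
    flag = "  <-- leakage suspect" if auc > 0.95 else ""
    print(f"{name:>18}: solo AUC = {auc:.3f}{flag}")
```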
⚖️ Trade-offs & Production Notes
- Building comprehensive test infrastructure costs time but prevents major outages.
- Some mitigations (e.g., conservative rollouts) slow innovation.
🚨 Common Pitfalls
- Not testing edge-cases (sparse users/items).
- Ignoring feature null-handling at scale.
🗣️ Interview-ready Answer
“Watch for data drift, pipeline mismatches, and label leakage. Mitigate with monitoring, shadow deployments, unit tests for feature pipelines, and gradual canary rollouts with rollback capabilities.”
📐 Key Formulas
NDCG@k (Normalized Discounted Cumulative Gain)
$\text{NDCG}_k = \frac{\text{DCG}_k}{\text{IDCG}_k}, \qquad \text{DCG}_k = \sum_{i=1}^{k} \frac{2^{rel_i}-1}{\log_2(i+1)}$ (a common variant uses $rel_i$ directly as the gain).
- $rel_i$: graded relevance of item at rank $i$.
- $k$: cutoff rank.
- $\text{IDCG}_k$: ideal DCG (max possible DCG@k). Interpretation: Rewards placing highly relevant items near the top; normalized so perfect ranking = 1.
AUC (Area Under ROC Curve)
AUC measures probability a random positive ranks higher than a random negative.
- Equivalent to the Mann–Whitney U statistic. Interpretation: Useful for imbalanced binary classification; insensitive to threshold.
Cross-Entropy Loss (Binary / Multiclass)
Binary: $L = -\big(y\log(\hat{p}) + (1-y)\log(1-\hat{p})\big)$
Multiclass: $L = -\sum_i y_i \log(\hat{y}_i)$
- $y, y_i$: true label(s).
- $\hat{p}, \hat{y}_i$: predicted probability(s). Interpretation: Penalizes confident wrong predictions; aligns with maximum likelihood.
Precision / Recall / F1
$\text{Precision} = \frac{TP}{TP+FP}, \quad \text{Recall} = \frac{TP}{TP+FN}, \quad F1 = 2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$
- $TP$: true positives, $FP$: false positives, $FN$: false negatives. Interpretation: Precision measures correctness among positives; recall measures coverage of positives.
KL Divergence (for drift detection)
$D_{KL}(P \,\|\, Q) = \sum_x P(x) \log\frac{P(x)}{Q(x)}$
- $P$: training distribution; $Q$: production distribution. Interpretation: Quantifies how different two distributions are; used to detect feature drift.
✅ Cheatsheet
- Problem setup: always clarify SLOs (latency, throughput), freshness, and primary metric first.
- Architecture: use funnel (retrieval → coarse scoring → reranker) for scale.
- Data: combine user-interaction signals, weak supervision, public datasets, and human labeling.
- Evaluation: iterate offline (NDCG/AUC) but validate via online A/B tests on primary business metric.
- Features: ensure training-serving parity and monitor feature distributions in production.
- Monitoring: track feature drift, latencies (p95/p99), and business KPIs; use automated alerts and rollback.
- Deployment: canary → phased rollout → full rollout; shadow mode for low-risk verification.
- Pre-trained models: fine-tune when the domain aligns; distill/prune for production latency constraints.