2.6. Model Debugging and Testing
Flashcards
Short Theories
Ship v1 fast: get online validation from real traffic; iterative improvements beat endless offline tuning.
Distribution shift: models assume training and serving distributions match; when they don’t, expect performance drops.
Feature parity: ensure identical logic and windows for any feature computed offline and online.
Hidden test set: reserve a realistic, untouched dataset for final model quality checks to detect overfitting.
Failure-driven improvement: inspect error examples first; they reveal missing features, data sparsity, or modeling blind spots.
Layered debugging: large systems require per-component metrics (e.g., candidate recall vs. ranking quality) to prioritize fixes.
Interview Q&A
Q1: How would you debug a model that looks great offline but fails in production?
TL;DR: Compare training vs. serving; start with feature parity and distribution checks, then inspect failure cases and component-level metrics.
Conceptual Explanation
A model that performs well offline but poorly in production usually faces mismatches between the training assumptions and the live environment. Root causes are often (1) feature distribution shift, (2) feature computation mismatch, (3) data sparsity for certain cases, or (4) upstream component faults in a multi-stage system.
Technical / Math Details
- Compare empirical distributions: for each feature $x$, compute $D_{KL}(P_{\text{train}}(x) \parallel P_{\text{prod}}(x))$ or simple summary stats (mean, std, quantiles).
- Use confusion matrices and calibration plots to check class-specific behavior.
- Instrument per-component metrics (e.g., candidate recall, ranking NDCG).
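Below is a minimal sketch of the cheap distribution checks above, assuming NumPy/SciPy and raw value arrays for a single feature sampled from training data and from production logs; the bin count and smoothing constant are illustrative.

```python
import numpy as np
from scipy.stats import entropy

def compare_feature(train_vals, prod_vals, bins=20):
    """Cheap checks for one feature: summary stats plus KL divergence over shared bins."""
    stats = lambda v: (np.mean(v), np.std(v), *np.percentile(v, [25, 50, 75]))
    print("train mean/std/quartiles:", stats(train_vals))
    print("prod  mean/std/quartiles:", stats(prod_vals))

    # Bin both samples on the same edges, smooth to avoid division by zero.
    edges = np.histogram_bin_edges(np.concatenate([train_vals, prod_vals]), bins=bins)
    p_train = np.histogram(train_vals, bins=edges)[0].astype(float) + 1e-6
    p_prod = np.histogram(prod_vals, bins=edges)[0].astype(float) + 1e-6
    return entropy(p_train / p_train.sum(), p_prod / p_prod.sum())  # D_KL(P_train || P_prod)

# Example: a shifted production distribution yields a clearly non-zero KL.
rng = np.random.default_rng(0)
print("KL:", compare_feature(rng.normal(0, 1, 10_000), rng.normal(0.5, 1.2, 10_000)))
```

A large KL or diverging quantiles on a single feature is a cue to inspect that feature's pipeline before touching the model.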
Trade-offs & Production Notes
- Start with cheap checks (feature histograms, missingness) before costly retraining.
- Adding instrumentation has latency and storage costs.
- Sampling more production data for debugging may require privacy and throughput considerations.
Common Pitfalls
- Assuming identical feature windows (7-day vs 30-day) without checking logs.
- Overfitting debugging on a small set of failure examples rather than representative samples.
- Fixing symptoms (e.g., raising thresholds) without addressing root causes.
Interview-ready Answer
“I’d first verify feature parity and distribution between training and production; if they differ, fix feature computation or retrain. If features match, analyze failure cases by component to decide between data augmentation, new features, or model changes.”
Q2: How do you detect and handle feature distribution shift?
TL;DR: Detect with distribution comparisons and drift metrics; handle by retraining, domain adaptation, or engineering features that are robust to shift.
Conceptual Explanation
Distribution shift occurs when $P_{\text{train}}(X)\neq P_{\text{prod}}(X)$. Detect early via monitoring and apply remedies like incremental retraining or robust features.
Technical / Math Details
- Statistical checks: KS-test for continuous features, chi-squared for categoricals.
- Drift metrics: population fraction difference, KL-divergence, PSI (Population Stability Index).
- Example PSI: $\text{PSI} = \sum_{i} (P_{\text{train},i}-P_{\text{prod},i}) \log\frac{P_{\text{train},i}}{P_{\text{prod},i}}$
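A minimal implementation of the PSI formula above, assuming NumPy; the bin count, epsilon, and the thresholds quoted in the comment are common rules of thumb rather than universal standards.

```python
import numpy as np

def psi(train_vals, prod_vals, n_bins=10, eps=1e-6):
    """Population Stability Index over quantile bins defined on the training sample."""
    edges = np.quantile(train_vals, np.linspace(0, 1, n_bins + 1))[1:-1]  # interior edges only

    def bin_fractions(vals):
        idx = np.digitize(vals, edges)                  # out-of-range values land in the edge bins
        return np.bincount(idx, minlength=n_bins) / len(vals) + eps

    p_train, p_prod = bin_fractions(train_vals), bin_fractions(prod_vals)
    return float(np.sum((p_train - p_prod) * np.log(p_train / p_prod)))

# Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant shift.
rng = np.random.default_rng(0)
print(psi(rng.normal(0, 1, 50_000), rng.normal(0.3, 1, 50_000)))
```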
Trade-offs & Production Notes
- Frequent retraining reduces staleness but increases pipeline complexity.
- Domain adaptation or importance weighting helps when labels are scarce.
- Conservative feature engineering (less fragile aggregations) reduces sensitivity to shift.
Common Pitfalls
- Relying on a single drift metric; combine multiple signals.
- Responding to noise as if it were real shift; set alert thresholds and apply smoothing.
Interview-ready Answer
“Monitor production feature distributions (KS, PSI), alert on meaningful drift, then either retrain with fresh data or add robust features/importance weighting depending on label availability and latency constraints.”
Q3: Explain how feature-logging mismatches cause online performance drops and how to prevent them.
TL;DR: If offline features are computed differently from online features (different time windows, joins, or missing signals), the model sees different inputs; ensure parity with logging, tests, and identical code paths.
Conceptual Explanation
Offline training often uses pre-computed or enriched features; online serving must replicate those computations in real-time. Any discrepancy leads to inconsistent model inputs.
Technical / Math Details
- For an aggregate feature $f_t = \sum_{i=t-W}^{t-1} g(i)$, ensure window $W$ matches offline and online.
- Add unit tests comparing feature outputs for same inputs: $f_{\text{offline}}(x) \overset{?}{=} f_{\text{online}}(x)$.
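A minimal parity test in the spirit of the check above; `offline_feature` and `online_feature` are hypothetical stand-ins for the two code paths, here both computing a 7-day event count with identical window and boundary logic.

```python
from datetime import datetime, timedelta

# Hypothetical implementations: in practice these live in the offline pipeline and the
# serving service; the test exists to catch any divergence in window or boundary logic.
def offline_feature(events, as_of, window_days=7):
    """Offline path: count of events in the `window_days` before `as_of`."""
    cutoff = as_of - timedelta(days=window_days)
    return sum(1 for t in events if cutoff <= t < as_of)

def online_feature(events, as_of, window_days=7):
    """Online path: must implement exactly the same window and boundary logic."""
    cutoff = as_of - timedelta(days=window_days)
    return sum(1 for t in events if cutoff <= t < as_of)

def test_feature_parity():
    as_of = datetime(2024, 1, 31)
    events = [as_of - timedelta(days=d, hours=3) for d in range(40)]
    assert offline_feature(events, as_of) == online_feature(events, as_of)

test_feature_parity()
print("parity check passed")
```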
Trade-offs & Production Notes
- Real-time aggregation vs. precomputed caches is a latency-vs-freshness trade-off.
- Maintaining identical code paths is ideal; if not possible, log raw events and derive features offline to compare.
Common Pitfalls
- Using different time horizons (7 vs 30 days).
- Relying on enrichments appended offline that are not available during serving.
Interview-ready Answer
“Guarantee feature parity by sharing feature code between offline and online, validating with unit/integration tests, and logging raw events so you can recompute and compare feature outputs.”
Q4: How do you decide whether poor live performance is due to overfitting or distribution shift?
TL;DR: If the model generalizes on a held-out hidden test set but fails in production, suspect distribution shift; if it also fails on the hidden test set, suspect overfitting or insufficient model capacity.
Conceptual Explanation
Overfitting shows as high train accuracy and low unseen-test accuracy. Distribution shift shows as good offline test performance but degraded production metrics.
Technical / Math Details
- Compute performance on: training set, validation set (used for tuning), held-out hidden test set, and production sample with labels (if available).
- Compare: if $\text{perf}_{\text{test}} \approx \text{perf}_{\text{train}}$ but $\text{perf}_{\text{prod}} \ll \text{perf}_{\text{test}}$, suspect distribution shift.
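A rough triage helper along the lines of the comparison above; the metric values and the `gap` threshold are purely illustrative.

```python
def diagnose(perf_train, perf_test, perf_prod, gap=0.05):
    """Rough triage: compare the same metric on train, hidden test, and labeled prod data.

    The `gap` threshold is illustrative; pick one that is meaningful for your metric."""
    if perf_train - perf_test > gap:
        return "overfitting: the model does not generalize even to the hidden test set"
    if perf_test - perf_prod > gap:
        return "distribution shift: offline generalization is fine, production differs"
    return "no large offline/online gap: check labels, metric definitions, or upstream systems"

print(diagnose(perf_train=0.92, perf_test=0.91, perf_prod=0.78))
```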
Trade-offs & Production Notes
- Obtaining labeled production data may be costly/time-consuming.
- Hidden test sets must reflect expected production distribution; otherwise conclusions may be wrong.
Common Pitfalls
- Using a validation set that leaks information, which underestimates overfitting.
- Treating short-term production dips (seasonality) as model failure.
Interview-ready Answer
“Compare a hidden test set to production: if hidden test replicates training performance but production drops, suspect distribution shift; if hidden test also drops, it’s overfitting or insufficient model capacity.”
Q5: What debugging process would you use for a multi-stage system (e.g., candidate selection + ranking)?
TL;DR: Instrument and measure per-stage metrics, identify the stage contributing the most failures, then iterate targeted fixes (features, models, data) for that stage.
Conceptual Explanation
Large pipelines hide errors; breaking them into layers and measuring each layer's recall/precision isolates the weak component (e.g., selection fails to recall relevant docs vs. the ranker misorders them).
Technical / Math Details
- Define per-stage metrics: candidate selection recall@K, ranking NDCG@K, end-to-end accuracy.
- Analyze the failure set: for each failure, check whether the ideal item was present in the candidate set; if not, blame selection; if present but ranked low, blame the ranker.
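A sketch of that attribution loop, assuming a hypothetical logging schema where each failed request records the ideal item, the logged candidate set, and the final ranked list.

```python
def attribute_failures(failures, k=10):
    """Attribute each failure to the candidate-selection or ranking stage."""
    counts = {"selection_miss": 0, "ranking_miss": 0, "other": 0}
    for f in failures:
        if f["ideal_item"] not in f["candidate_set"]:
            counts["selection_miss"] += 1       # selection never surfaced the ideal item
        elif f["ideal_item"] not in f["ranked_list"][:k]:
            counts["ranking_miss"] += 1         # candidate was there but ranked below top-k
        else:
            counts["other"] += 1                # item was shown; failure lies elsewhere (UI, labels, ...)
    return counts

failures = [
    {"ideal_item": "a", "candidate_set": {"b", "c"}, "ranked_list": ["b", "c"]},
    {"ideal_item": "a", "candidate_set": {"a", "b"}, "ranked_list": ["b", "a"]},
]
print(attribute_failures(failures, k=1))   # {'selection_miss': 1, 'ranking_miss': 1, 'other': 0}
```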
Trade-offs & Production Notes
- Instrumentation and logging cost storage/latency.
- Fixing at the responsible layer minimizes blast radius (easier to test and roll back).
Common Pitfalls
- Optimizing ranker when selection is the culprit.
- Ignoring upstream data losses (e.g., index staleness) masquerading as model issues.
Interview-ready Answer
“I’d first log candidate sets and compute recall@K to see whether selection misses ideal items; if selection is fine, focus on ranking metrics like NDCG and error analysis to pinpoint features or modeling issues.”
Q6: How do you use failure examples to guide feature engineering and data collection?
TL;DR: Perform root-cause analysis on representative failure examples, extract missing signals or rare cases, then add targeted features or collect more labeled examples for those scenarios.
Conceptual Explanation
Failure examples reveal specific blind spots: missing actor/author features, rare class examples, or entity contexts. Use targeted augmentation rather than blind global changes.
Technical / Math Details
- Tally failure types: cluster errors by cause.
- For each cluster, measure frequency $f$ and impact on metric $\Delta M$; prioritize fixes with high $f \times \Delta M$.
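A tiny sketch of the $f \times \Delta M$ prioritization; the cluster names and the frequency/impact estimates are made up for illustration.

```python
def prioritize(clusters):
    """Rank failure clusters by expected payoff: frequency x estimated metric impact."""
    return sorted(clusters, key=lambda c: c["frequency"] * c["metric_impact"], reverse=True)

clusters = [
    {"name": "missing author feature", "frequency": 0.20, "metric_impact": 0.03},
    {"name": "rare language queries",  "frequency": 0.05, "metric_impact": 0.10},
    {"name": "stale index entries",    "frequency": 0.40, "metric_impact": 0.01},
]
for c in prioritize(clusters):
    print(f'{c["name"]}: priority={c["frequency"] * c["metric_impact"]:.4f}')
```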
Trade-offs & Production Notes
- Targeted data collection (labeling) is more effective than unguided upsampling.
- Adding many ad-hoc features increases maintenance cost; prefer generalizable features where possible.
Common Pitfalls
- Overfitting to the failure set by engineering too-specific features.
- Not measuring whether added data/features actually improve held-out metrics.
Interview-ready Answer
“Cluster failure cases to find recurring patterns, then add features or labeled examples targeted to the highest-impact clusters, validating improvements on a holdout set that mirrors those failures.”
Q7: Which offline metrics should you monitor to anticipate online problems?
TL;DR: Monitor calibration, class-wise precision/recall, AUC/NDCG, and distributional checks for features and prediction scores.
Conceptual Explanation
Simple aggregate metrics (overall accuracy) hide per-segment issues. Per-class, per-feature-bin, and calibration checks reveal weaknesses likely to appear online.
Technical / Math Details
- Per-bin metrics: partition by feature quantiles and compute precision/recall per bin.
- Calibration: reliability diagram, expected calibration error (ECE).
- Ranking: use NDCG@K, MRR.
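A minimal expected-calibration-error sketch for a binary classifier (one common flavor of ECE: the per-bin gap between mean predicted probability and empirical positive rate), assuming NumPy arrays of labels and scores.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: population-weighted gap between mean predicted prob and positive rate per bin."""
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)  # equal-width bins; 1.0 -> last bin
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap                                  # weight by bin population
    return ece

rng = np.random.default_rng(0)
probs = rng.uniform(0, 1, 5_000)
labels = (rng.uniform(0, 1, 5_000) < probs ** 1.5).astype(float)      # deliberately miscalibrated
print("ECE:", expected_calibration_error(labels, probs))
```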
Trade-offs & Production Notes
- More metrics mean more monitoring complexity, but enable earlier detection.
- Choose metrics aligned with business impact (e.g., precision for fraud).
Common Pitfalls
- Over-optimizing a single metric (e.g., AUC) without considering business-relevant thresholds.
- Ignoring class imbalance in per-class metrics.
Interview-ready Answer
“Track per-class precision/recall, calibration (ECE), and distributional stats for features and scores; these help detect issues like calibration drift or blind spots before they hit production.”
Q8: What are practical strategies to reduce overfitting during iteration?
TL;DR: Use regularization, simpler models, proper cross-validation, data augmentation, and robust validation sets (including hidden test sets).
Conceptual Explanation
Overfitting arises from excessively complex models relative to data. Regularization and better validation practices constrain the model and provide honest estimates of generalization.
Technical / Math Details
- L2 regularization: add $\lambda \|w\|_2^2$ to the loss.
- Early stopping using validation loss.
- Cross-validation or stratified folds for limited data.
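A minimal early-stopping sketch around L2-regularized logistic regression trained with plain gradient descent in NumPy; the hyperparameters and patience value are illustrative, and in practice you would rely on your framework's built-in regularization and callbacks.

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_val, y_val, lam=1e-2, lr=0.1,
                              max_epochs=500, patience=10):
    """L2-regularized logistic regression, stopped when validation loss stops improving."""
    w = np.zeros(X_tr.shape[1])
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    val_loss = lambda: -np.mean(y_val * np.log(sigmoid(X_val @ w) + 1e-12)
                                + (1 - y_val) * np.log(1.0 - sigmoid(X_val @ w) + 1e-12))
    best_w, best_val, bad_epochs = w.copy(), np.inf, 0
    for _ in range(max_epochs):
        grad = X_tr.T @ (sigmoid(X_tr @ w) - y_tr) / len(y_tr) + lam * w  # gradient incl. L2 term
        w -= lr * grad
        cur = val_loss()
        if cur < best_val - 1e-5:             # meaningful improvement: keep this checkpoint
            best_w, best_val, bad_epochs = w.copy(), cur, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:        # validation loss plateaued: stop early
                break
    return best_w, best_val

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=2_000) > 0).astype(float)
w, best_val = train_with_early_stopping(X[:1_500], y[:1_500], X[1_500:], y[1_500:])
print("best validation log loss:", round(best_val, 3))
```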
Trade-offs & Production Notes
- Strong regularization may underfit; tune with validation.
- Data augmentation increases robustness but can introduce bias if unrealistic.
Common Pitfalls
- Tuning hyperparameters on the test set (data leakage).
- Ignoring representativeness of validation folds.
Interview-ready Answer
“I’d apply regularization, use early stopping with a realistic validation set, and, if possible, increase diverse training data or augmentations to improve generalization without overfitting.”
Key Formulas
Precision, Recall, F1
$\text{Precision} = \frac{TP}{TP + FP} \quad,\quad \text{Recall} = \frac{TP}{TP + FN}$
$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
- $TP$: true positives
- $FP$: false positives
- $FN$: false negatives
Interpretation: Precision measures correctness among positive predictions; recall measures coverage of actual positives; F1 balances both.
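A small helper that evaluates these definitions directly from confusion-matrix counts (guarding against empty denominators); the example counts are made up.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 directly from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 80 true positives, 20 false positives, 40 false negatives.
print(precision_recall_f1(80, 20, 40))   # (0.8, 0.666..., 0.727...)
```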
Area Under ROC (AUC)
AUC is the area under the ROC curve (TPR vs FPR) and can be interpreted as the probability that a randomly chosen positive ranks above a randomly chosen negative.
- TPR = $\frac{TP}{TP+FN}$, FPR = $\frac{FP}{FP+TN}$.
Interpretation: AUC measures separability regardless of threshold; useful for imbalanced classes but not sensitive to calibration or business thresholds.
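A small sketch that estimates AUC straight from this probabilistic interpretation by comparing every positive-negative score pair (quadratic in sample size, so for illustration only); the toy scores are synthetic.

```python
import numpy as np

def auc_by_rank(scores_pos, scores_neg):
    """P(random positive scores above random negative), counting ties as 0.5."""
    wins = (scores_pos[:, None] > scores_neg[None, :]).sum()
    ties = (scores_pos[:, None] == scores_neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(scores_pos) * len(scores_neg))

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, 500)   # positives score higher on average
neg = rng.normal(0.0, 1.0, 500)
print("AUC ~", round(auc_by_rank(pos, neg), 3))
```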
Cross-Entropy / Log Loss
$L = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k} y_{n,k} \log \hat{p}_{n,k}$
- $y_{n,k}$: one-hot true label for sample $n$ and class $k$
- $\hat{p}_{n,k}$: predicted probability for sample $n$, class $k$
Interpretation: Penalizes confident wrong predictions heavily; lower is better.
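A direct NumPy translation of the formula, with clipping to keep the logarithm finite; the two toy prediction matrices illustrate how confident mistakes dominate the loss.

```python
import numpy as np

def log_loss(y_onehot, p_pred, eps=1e-12):
    """Multi-class cross-entropy averaged over N samples; rows of p_pred sum to 1."""
    return -np.mean(np.sum(y_onehot * np.log(np.clip(p_pred, eps, 1.0)), axis=1))

y = np.array([[1, 0, 0], [0, 1, 0]])
p_confident_right = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
p_confident_wrong = np.array([[0.05, 0.9, 0.05], [0.8, 0.1, 0.1]])
print(log_loss(y, p_confident_right))   # small loss
print(log_loss(y, p_confident_wrong))   # much larger: confident mistakes are penalized heavily
```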
Bias-Variance Decomposition (expected squared error)
$\mathbb{E}[(\hat{f}(x)-y)^2] = (\text{Bias}[\hat{f}(x)])^2 + \text{Var}[\hat{f}(x)] + \sigma^2$
- $\hat{f}(x)$: model prediction
- $\sigma^2$: irreducible noise
Interpretation: Total error splits into bias (systematic error), variance (model sensitivity to data), and irreducible noise.
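A small simulation sketch of this decomposition: repeatedly refit polynomials of low and high degree on fresh noisy samples of a known function, then estimate bias squared and variance of the prediction at a single test point. The function, noise level, and degrees are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)      # known ground-truth function
x_test, sigma, n_repeats = 0.3, 0.2, 2_000    # evaluation point, noise std, number of refits

def fit_and_predict(degree):
    """Fit a polynomial of the given degree to a fresh noisy sample, predict at x_test."""
    x = rng.uniform(0, 1, 30)
    y = true_f(x) + rng.normal(0, sigma, 30)
    return np.polyval(np.polyfit(x, y, degree), x_test)

for degree in (1, 7):                         # under-parameterized vs. over-parameterized fit
    preds = np.array([fit_and_predict(degree) for _ in range(n_repeats)])
    bias2 = (preds.mean() - true_f(x_test)) ** 2
    print(f"degree={degree}: bias^2={bias2:.4f}, variance={preds.var():.4f}, noise={sigma**2:.4f}")
```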
Cheatsheet
- Ship v1 fast: prefer early real-traffic validation over excessive offline tuning.
- Monitor feature parity: ensure identical code & time windows for offline and online feature computation.
- Instrument at layers: candidate recall vs ranking quality to isolate problems quickly.
- Use hidden test sets: final unbiased check against overfitting.
- Prioritize fixes: focus on high-frequency, high-impact failure clusters first.
- Drift detection: PSI / KS tests plus smoothing to avoid false alarms.
- Regularization & augmentation: go-to tools to reduce overfitting when data is limited.