Monitoring and Drift Detection: Linear Regression
🎯 Core Idea
Continuously monitor a deployed linear-regression model for silent degradation by tracking input (feature) drift, prediction-distribution changes, and residual behaviour (errors). Use statistical tests, change detectors, and business-aware triggers to decide when to investigate, retrain, or roll back. Monitoring must work with delayed or sparse labels: build both labelled and unlabelled monitors, and combine technical signals with business KPIs.
🌱 Intuition & Real-World Analogy
Why this matters: a regression model can look fine at deployment time but slowly drift as the data-generating process changes — predictions wander and residuals grow without an obvious single failure. Detecting that early avoids revenue loss, bad decisions, and compliance risks.
Analogies:
- Thermostat analogy — you don’t only check the thermostat reading once; you watch temperature trends, sensor noise, and whether the heating cycles start lasting longer. Residuals ~ thermostat error; feature drift ~ changed insulation.
- Car alignment — the car may still drive straight, but tiny misalignments slowly increase tire wear. Residual drift is the increased wear; periodic alignment (retraining) is needed before a blowout.
📋 What to Track Post-Deployment (practical checklist)
- Model performance (labelled):
  - Rolling RMSE / MAE / MAPE on recently labelled data (if labels are available).
  - Mean residual (bias) and residual standard deviation.
  - Prediction interval coverage (if the model provides uncertainty): the fraction of true labels falling inside the predicted interval.
- Prediction behaviour (possible without labels):
  - Prediction distribution moments (mean, variance, skewness) over time.
  - Fraction of extreme predictions (outliers relative to the historical baseline).
  - Prediction histogram comparisons (recent vs baseline).
- Input (feature) distributions:
  - Per-feature statistics: mean, std, quantiles, missing-value rate, and category frequencies for categorical features.
  - Multivariate checks: joint-distribution shifts, covariance / correlation matrix drift.
- Residual diagnostics (labelled; a small sketch follows this checklist):
  - Autocorrelation of residuals (Durbin–Watson style): detects unmodelled or broken time-series structure.
  - Heteroscedasticity indicators: residual variance vs predicted value or specific features.
  - Systematic bias by subgroup (slice analysis): residuals across feature slices.
- Operational health signals:
  - Input throughput, latency, sample-rate changes, feature-pipeline errors.
  - Percentage of feature values falling outside training ranges (new categories, unseen bins).
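As referenced above, here is a minimal NumPy sketch of the residual diagnostics (bias, spread, a Durbin–Watson-style autocorrelation statistic, and a crude heteroscedasticity check); `y_true` and `y_pred` are assumed to come from a recent window of labelled serving logs:

```python
import numpy as np

def residual_diagnostics(y_true, y_pred):
    """Basic residual checks on a recent window of labelled predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    r = y_true - y_pred

    # Bias (mean residual) and spread.
    bias = r.mean()
    spread = r.std(ddof=1)

    # Durbin-Watson statistic: values near 2 suggest no lag-1 autocorrelation;
    # values far from 2 point to unmodelled time-series structure.
    dw = np.sum(np.diff(r) ** 2) / np.sum(r ** 2)

    # Crude heteroscedasticity check: residual variance in the lowest vs
    # highest prediction quartile (a ratio far from 1 is a warning sign).
    q1, q3 = np.quantile(y_pred, [0.25, 0.75])
    var_low = r[y_pred <= q1].var(ddof=1)
    var_high = r[y_pred >= q3].var(ddof=1)

    return {
        "bias": bias,
        "residual_std": spread,
        "durbin_watson": dw,
        "var_ratio_high_vs_low": var_high / var_low,
    }
```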
📐 Mathematical Foundation (key formulas)
1) Residuals and performance
Let $y$ be true target, $\hat{y}$ predicted.
- Residual:
  $$ r = y - \hat{y} $$
- Rolling RMSE over window $W$:
  $$ \mathrm{RMSE}_W = \sqrt{\frac{1}{|W|}\sum_{i\in W} (y_i-\hat{y}_i)^2} $$
- Mean residual (bias):
  $$ \bar r_W = \frac{1}{|W|}\sum_{i\in W} r_i $$
Assumptions: residuals are independent and identically distributed (i.i.d.) under stationarity; violations indicate drift or model misspecification.
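These rolling metrics are easy to compute from logged $(y, \hat{y})$ pairs; a minimal sketch, assuming a pandas DataFrame `df` with a datetime index and columns `y` and `y_hat` (the names are illustrative):

```python
import pandas as pd

def rolling_regression_metrics(df: pd.DataFrame, window: str = "7D") -> pd.DataFrame:
    """Rolling bias and RMSE over a time-based window of logged predictions."""
    resid = df["y"] - df["y_hat"]
    out = pd.DataFrame(index=df.index)
    out["bias"] = resid.rolling(window).mean()                  # mean residual over the window
    out["rmse"] = (resid ** 2).rolling(window).mean().pow(0.5)  # rolling RMSE
    return out
```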
2) Distributional drift measures (feature or prediction distributions)
- Population Stability Index (PSI) for binned numeric or categorical features: for bins $b$,
  $$ \mathrm{PSI} = \sum_b (p_b^{\text{ref}} - p_b^{\text{new}})\cdot \ln\frac{p_b^{\text{ref}}}{p_b^{\text{new}}} $$
  where $p_b^{\text{ref}}$, $p_b^{\text{new}}$ are the bin proportions in the reference and new samples. (Rule-of-thumb: PSI > 0.2 is often flagged.)
- Kullback–Leibler (KL) divergence (for probability densities $p,q$):
  $$ D_{KL}(p\Vert q) = \int p(x)\ln\frac{p(x)}{q(x)}\,dx $$
  (Asymmetric; sensitive to support mismatch.)
- Wasserstein (Earth Mover’s) distance between distributions $P,Q$: intuitively the minimal mass-transport cost; robust to small support shifts.
- Maximum Mean Discrepancy (MMD): kernel two-sample test statistic. For kernel $k$,
  $$ \mathrm{MMD}^2 = \mathbb{E}_{x,x'\sim P}[k(x,x')] + \mathbb{E}_{y,y'\sim Q}[k(y,y')] - 2\,\mathbb{E}_{x\sim P,\,y\sim Q}[k(x,y)] $$
Assumptions: these capture different kinds of shift (mean shift, shape, support). Use appropriate binning/kernels and sample sizes.
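A minimal PSI implementation, assuming numeric features and bin edges fixed on the reference sample (SciPy's `wasserstein_distance` is noted alongside for comparison):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def psi(ref, new, n_bins: int = 10, eps: float = 1e-4) -> float:
    """Population Stability Index with bin edges fixed on the reference sample."""
    ref, new = np.asarray(ref, float), np.asarray(new, float)
    edges = np.quantile(ref, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf              # catch values outside the reference range
    p_ref = np.histogram(ref, bins=edges)[0] / len(ref)
    p_new = np.histogram(new, bins=edges)[0] / len(new)
    p_ref = np.clip(p_ref, eps, None)                  # smooth empty bins to avoid log(0)
    p_new = np.clip(p_new, eps, None)
    return float(np.sum((p_ref - p_new) * np.log(p_ref / p_new)))

# Wasserstein distance is a single SciPy call, reported in the feature's own units:
# wasserstein_distance(ref, new)
```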
3) Sequential change detection (e.g., CUSUM, Page–Hinkley)
CUSUM monitors cumulative deviations of an observed statistic $z_t$ from a reference mean $\mu_0$. One-sided CUSUM for positive shifts:
$$ S_t = \max\bigl(0,\ S_{t-1} + (z_t - \mu_0 - \kappa)\bigr) $$
Signal drift when $S_t > h$, where $\kappa$ is the allowed slack and $h$ is the decision threshold.
Page–Hinkley tracks the cumulative deviation of the observations from their running mean:
$$ \bar z_t = \frac{1}{t}\sum_{i=1}^{t} z_i,\qquad m_T = \sum_{t=1}^{T} \left(z_t - \bar z_t - \delta\right),\qquad PH_T = m_T - \min_{1\le t\le T} m_t $$
Signal drift when $PH_T > \lambda$, where $\delta$ is a tolerance parameter and $\lambda$ the detection threshold.
Assumptions: observations are independent, which is needed to quantify the false-alarm rate; in practice these detectors still work reasonably well under weak dependence.
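A one-sided CUSUM detector as defined above is only a few lines; `mu0`, `kappa`, and `h` would be calibrated from a stable baseline window (the calibration shown in the comment is illustrative):

```python
class OneSidedCusum:
    """One-sided CUSUM over a monitored statistic (e.g. absolute residuals)."""

    def __init__(self, mu0: float, kappa: float, h: float):
        self.mu0 = mu0      # reference mean of the statistic under "no drift"
        self.kappa = kappa  # slack: smallest shift considered worth detecting
        self.h = h          # decision threshold
        self.s = 0.0

    def update(self, z: float) -> bool:
        """Feed one observation; returns True when S_t exceeds the threshold h."""
        self.s = max(0.0, self.s + (z - self.mu0 - self.kappa))
        return self.s > self.h

# Illustrative calibration from a stable baseline window of absolute errors:
# detector = OneSidedCusum(mu0=baseline_mean, kappa=0.5 * baseline_std, h=5.0 * baseline_std)
```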
🔬 Deep-Dive: Interpreting PSI, KL, Wasserstein, MMD
- PSI is practical and interpretable for production teams (binned differences). Sensitive to bin choice.
- KL strongly penalizes missing support (zero-probability in denominator). Use smoothed estimates.
- Wasserstein gives a distance with units of the feature (intuitive) and handles continuous shifts well.
- MMD is a kernel test with good theoretical guarantees for two-sample testing; computational cost can be high.
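For multivariate drift, a biased $\mathrm{MMD}^2$ estimate with an RBF kernel can be sketched in a few lines of NumPy; the median heuristic for the bandwidth is a common default, and the $O(n^2)$ pairwise-distance matrix illustrates the computational cost mentioned above:

```python
import numpy as np

def mmd_rbf(X: np.ndarray, Y: np.ndarray, bandwidth=None) -> float:
    """Biased MMD^2 estimate between samples X (n, d) and Y (m, d) with an RBF kernel."""
    Z = np.vstack([X, Y])
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)  # O(n^2) pairwise squared distances
    if bandwidth is None:
        bandwidth = np.sqrt(np.median(sq[sq > 0]) / 2)          # median heuristic
    K = np.exp(-sq / (2 * bandwidth ** 2))
    n = len(X)
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    return float(Kxx.mean() + Kyy.mean() - 2 * Kxy.mean())
```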
⚖️ Strengths, Limitations & Trade-offs
Strengths
- Statistical drift metrics provide early warning signals before labels are available.
- Combining multiple monitors (features, predictions, residuals) reduces blind spots.
- Sequential detectors (CUSUM/Page–Hinkley/ADWIN) detect gradual and abrupt changes in streaming settings.
Limitations
- Label scarcity: performance metrics can’t be computed in real time if labels are delayed.
- False alarms vs missed drift: sensitive thresholds cause noisy alerts; lax thresholds cause silent degradation.
- Univariate tests miss multivariate shifts: marginal stability doesn’t guarantee joint stability.
- Correlated features/time dependence: many tests assume independence—violations make p-values unreliable.
- Business relevance: small statistical shifts may not impact business decisions — monitoring must tie to KPIs.
Trade-offs
- Sensitivity vs stability: tune thresholds and window sizes; short windows detect fast drift but are noisy.
- Complex tests vs interpretability: MMD/Wasserstein are powerful but harder to explain than PSI or mean-shift alerts.
- Compute & storage: multivariate detectors and bootstrapped thresholds cost more resources.
🔍 Variants & Extensions (what interviewers like to hear)
- Label-aware vs label-agnostic monitors: combine both; use labelled RMSE when available and unlabelled detectors otherwise.
- Ensemble and model-based drift detectors: monitor disagreement between current model and a shadow model or ensemble members.
- Importance-weighting / covariate-shift correction: reweight training loss by $w(x)=p_{\text{test}}(x)/p_{\text{train}}(x)$ to adjust for covariate shift [Sugiyama et al.]; see the sketch after this list. ([Journal of Machine Learning Research][1])
- Online learning / incremental updates: use streaming learners that adapt weights gradually (risk of catastrophic forgetting).
- Adaptive windows (ADWIN): automatically adjust window size to detect shifts in streaming errors. ([ResearchGate][2])
- Two-sample kernel tests (MMD) for multivariate drift with theoretical guarantees. ([Journal of Machine Learning Research][3])
- Change localization and explainability: use feature attribution (SHAP, permutation importance) to find which features cause drift.
- Meta-monitors: combine multiple low-level signals into a single risk score (weighted sum, learned aggregator).
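One practical way to estimate the covariate-shift weights $w(x)=p_{\text{test}}(x)/p_{\text{train}}(x)$ mentioned above is a domain classifier (density-ratio estimation via probabilistic classification, which is a different estimator from the methods in the cited work). A minimal scikit-learn sketch, assuming feature matrices `X_train` and `X_recent` exist:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(X_train, X_recent):
    """Estimate w(x) ~ p_recent(x) / p_train(x) with a domain classifier."""
    X = np.vstack([X_train, X_recent])
    d = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_recent))])  # 0 = train, 1 = recent
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p = clf.predict_proba(X_train)[:, 1]          # P(recent | x) evaluated on the training rows
    prior = len(X_recent) / len(X_train)          # corrects for the train/recent sample-size imbalance
    w = (p / (1 - p)) / prior
    return w / w.mean()                           # normalise so weights average to 1
```

These weights can then be passed as `sample_weight` when refitting the regression, downweighting training rows that look unlike recent traffic.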
🚧 Common Challenges & Pitfalls
- Confusing data pipeline errors with true drift — missing columns, normalization bugs, new categories should trigger pipeline checks before model retraining.
- Using p-values naively under dependence — many drift tests assume i.i.d.; autocorrelation inflates false positives.
- Monitoring only marginal distributions — joint shifts (e.g., correlated changes) can break model yet remain invisible to marginal tests.
- Ignoring label delay — waiting for labels to retrain can cause long windows of poor performance; use surrogate or active labeling strategies.
- Over-reacting to noise — retraining at every small signal wastes compute and risks overfitting to transient regimes.
- Not aligning to business impact — statistical drift does not always equal material business harm — calibrate action thresholds to KPIs.
- No backoff / rollback plan — retraining without A/B testing and rollback increases risk; always validate newly trained models.
✅ Practical Retraining Triggers (operational recipe)
Use a layered trigger system with technical and business gates:
- Immediate alerts (investigate, not retrain):
  - Pipeline errors, missing features, sudden spike in missing values, model-serving latency.
  - Feature cardinality explosion (new categories).
- Warning signals (investigate + collect labels / shadow predictions):
  - PSI > 0.1–0.2 on critical features.
  - Persistent change in prediction mean or variance beyond historical ±nσ for T consecutive windows.
  - Page–Hinkley / CUSUM / ADWIN signals on residuals or prediction errors.
- Retrain candidate (apply human + business checks):
  - Statistically significant RMSE increase on recent labelled data AND business KPI degradation (e.g., revenue, conversion), OR
  - Strong multivariate drift (MMD/Wasserstein) on inputs used by the model AND a business-impact estimate.
  - The retrain window should be chosen to reflect the new distribution (recent N days) and validated offline.
- Safeguards before rollout:
  - Offline evaluation (holdout and time-split), A/B test or shadow deployment, and a rollback plan.
  - Use importance weighting if retraining on older labelled data would bias learning (covariate-shift correction). ([Journal of Machine Learning Research][1])
- Operational heuristics:
  - Use cool-down windows to avoid repeated retraining (e.g., minimum 7–14 days between retrains unless catastrophic failure).
  - Maintain rolling baselines updated at a slow cadence (weekly) so thresholds remain meaningful.
  - Track retraining frequency and model staleness as metrics themselves.
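A minimal sketch of the layered gate above, combining at least two technical signals with a business check before nominating a retrain; the monitor names and thresholds are illustrative:

```python
from dataclasses import dataclass

@dataclass
class MonitorSnapshot:
    max_feature_psi: float       # worst per-feature PSI vs the rolling baseline
    residual_cusum_alarm: bool   # CUSUM / Page-Hinkley alarm on residuals
    rmse_increase_pct: float     # rolling RMSE vs offline baseline, in percent
    kpi_degraded: bool           # business KPI gate, set by the owning team

def triage(snap: MonitorSnapshot) -> str:
    """Map monitor outputs to the tiered actions described above."""
    drift_signals = sum([
        snap.max_feature_psi > 0.2,
        snap.residual_cusum_alarm,
        snap.rmse_increase_pct > 10.0,
    ])
    if drift_signals >= 2 and snap.kpi_degraded:
        return "retrain-candidate"   # still goes through offline eval + shadow / A-B test
    if drift_signals >= 1:
        return "investigate"
    return "ok"
```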
🔁 Deep-Dive: Label-delay strategies & surrogate approaches
- Shadow models: train a model on the most recent labeled data in the background for comparison.
- Proxy labels: use weak / noisy labels (business events correlated with target) while waiting for ground truth.
- Active learning: sample instances to label that are most informative (high residuals or near decision boundaries).
- Backtesting with time slices: simulate retraining on historical sliding windows to estimate expected improvement.
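Backtesting with time slices can be simulated with scikit-learn's `TimeSeriesSplit`; a minimal sketch, assuming time-ordered arrays `X` and `y`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

def backtest_retraining(X, y, n_splits: int = 5):
    """Simulate periodic retraining on historical slices and report per-slice RMSE."""
    rmses = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        rmses.append(mean_squared_error(y[test_idx], pred) ** 0.5)
    return np.array(rmses)  # rising RMSE across slices suggests drift and quantifies retraining value
```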
🔔 Monitoring and Alerting Best Practices (ops checklist)
- Define SLOs linking model prediction quality to business KPIs — alerts should be tiered (info, warning, critical).
- Combine signals — require at least two orthogonal alerts (e.g., PSI + residual RMSE rise) before automated retrain.
- Explainability on alerts — when an alert fires, provide top contributing features (feature-level PSI, SHAP drift) for fast triage.
- Dashboard design — show short and long windows (7-day, 30-day) and leaderboards for slices (regions, cohorts).
- Ownership & runbooks — assign team owners, runbook steps (investigate feature pipeline → validate data → label sample → decide retrain).
- Audit logs — store data, signals, retraining inputs, model versions for post-mortem and compliance.