4.1. Drift Detection


📘 ML System Design: Design Patterns — Beginner-Friendly Theory Series 10

(Series 10 of Many)


🎯 Covered Sections

This series covers:

  • 4.1: Drift Detection (data drift, concept drift, KL divergence, PSI, real-time vs. training distribution, monitoring loops, automated retraining triggers)

🪄 Step 1: Intuition & Motivation

  • Core Idea (in 1 short paragraph): Models don’t break loudly — they fade. Users change, markets shift, sensors age, and suddenly yesterday’s patterns don’t match today’s reality. Drift detection is the model’s smoke alarm: it watches incoming data (and outcomes when available) to warn you when the world has changed enough that your model’s judgments are no longer trustworthy.

  • Simple Analogy (one only): Think of a map app using last year’s traffic patterns. The city built a new flyover: routes change, but your app doesn’t know. Drift detection is the alert that says, “Road network looks different — recompute routes.”


🌱 Step 2: Core Concept

We’ll separate what can drift, how to measure it, and how to react.

What’s Happening Under the Hood?

Two kinds of change:

  1. Data drift (covariate shift): Input feature distributions change, e.g., users are older now, or device types shifted from desktop to mobile.
  2. Concept drift: The relationship between inputs and labels changes, e.g., the same symptoms now imply a different diagnosis due to a new variant.

A minimal drift service does the following (see the sketch after this list):

  • Maintains reference distributions from training/validation (per-feature histograms, quantiles, correlations).
  • Samples live traffic in windows (e.g., last 1h/6h/24h).
  • Applies divergence tests per feature (and sometimes for predictions).
  • Aggregates results → drift score with thresholds → alerts and optional retraining triggers.
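
To make that loop concrete, here is a minimal sketch in Python/NumPy. It assumes per-feature histogram references and uses PSI (defined in Step 3) as the divergence test; the function names and the 0.25 alert threshold are illustrative assumptions, not a standard API.

```python
import numpy as np

def reference_profile(train_features: dict, n_bins: int = 10, eps: float = 1e-4) -> dict:
    """Per-feature bin edges and proportions computed once from training data (the reference)."""
    profile = {}
    for name, values in train_features.items():
        edges = np.histogram_bin_edges(values, bins=n_bins)
        counts, _ = np.histogram(values, bins=edges)
        profile[name] = (edges, np.clip(counts / len(values), eps, None))  # smooth empty bins
    return profile

def psi(expected: np.ndarray, actual: np.ndarray) -> float:
    """Population Stability Index between two aligned proportion vectors (see Step 3)."""
    return float(np.sum((actual - expected) * np.log(actual / expected)))

def score_window(live_features: dict, profile: dict,
                 eps: float = 1e-4, threshold: float = 0.25) -> tuple:
    """Per-feature divergence for one live window, plus an overall alert flag."""
    scores = {}
    for name, (edges, expected) in profile.items():
        live = np.clip(live_features[name], edges[0], edges[-1])  # keep live values inside reference bins
        counts, _ = np.histogram(live, bins=edges)
        scores[name] = psi(expected, np.clip(counts / max(len(live), 1), eps, None))
    return scores, max(scores.values()) > threshold               # worst feature drives the alert
```

A real service would run `score_window` on each rolling window (1h/6h/24h) and score the model's prediction distribution the same way, not only the raw inputs.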

Why It Works This Way

  • Models assume “tomorrow looks like yesterday.” When that stops being true, errors rise.
  • Simple, fast statistical tests on inputs and outputs are early warning signs, even before ground-truth labels arrive.
  • Windowing (short and long) balances sensitivity (catch issues quickly) and stability (avoid noisy false alarms).

How It Fits in ML Thinking

Drift is a production concern connecting data engineering (streams, windows), statistics (divergence tests), and MLOps (alerts, retraining). It closes the loop: Serve → Observe → Adapt.

📐 Step 3: Mathematical Foundation

KL Divergence (for categorical or binned features)
$$ D_{KL}(P \,\|\, Q) = \sum_i P(i)\,\log\frac{P(i)}{Q(i)} $$
  • $P$ = reference (training) distribution; $Q$ = live distribution.
  • Interpretation: the extra “surprise” incurred when using $Q$ to describe data that actually follow $P$. Larger means more drift.
If training expected 30% “mobile” but live is 60%, KL grows — your user mix changed.
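
A hedged sketch of that calculation on a binned/categorical feature; the device-mix proportions below echo the example above and are purely illustrative:

```python
import numpy as np

def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """D_KL(P || Q) for two discrete distributions given as aligned probability vectors."""
    p = np.asarray(p, dtype=float) + eps   # smoothing avoids division by zero / log(0)
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Device mix over (desktop, mobile, tablet): training reference vs. a live window.
reference = [0.60, 0.30, 0.10]
live      = [0.30, 0.60, 0.10]
print(kl_divergence(reference, live))   # ≈ 0.21 nats: the user mix has clearly shifted
```
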
Population Stability Index (PSI) — common in risk/fintech

Partition feature into $k$ bins with training proportions $e_i$ and live proportions $a_i$:

$$ \text{PSI} = \sum_{i=1}^{k} (a_i - e_i)\,\ln\frac{a_i}{e_i} $$
  • Rule of thumb (context-dependent): $<0.1$ small shift, $0.1$–$0.25$ moderate, $>0.25$ significant.
PSI asks: are people moving between bins (e.g., income brackets) more than expected?
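
A hedged sketch for a continuous feature, assuming training quantiles as the shared bin edges; the synthetic “income” numbers are illustrative only:

```python
import numpy as np

def psi_continuous(train_values, live_values, n_bins: int = 10, eps: float = 1e-4) -> float:
    """PSI for a continuous feature, binned by training quantiles (same edges on both sides)."""
    edges = np.quantile(train_values, np.linspace(0, 1, n_bins + 1))

    def props(values):
        clipped = np.clip(values, edges[0], edges[-1])   # out-of-range live values land in edge bins
        counts, _ = np.histogram(clipped, bins=edges)
        return np.clip(counts / len(values), eps, None)  # smooth empty bins so ln() stays finite

    e, a = props(train_values), props(live_values)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
train = rng.normal(50_000, 15_000, 10_000)   # "income" feature at training time (illustrative)
live  = rng.normal(60_000, 15_000, 10_000)   # live population has shifted upward
print(psi_continuous(train, live))           # roughly 0.4: "significant" by the rule of thumb
```
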
Kolmogorov–Smirnov (KS) — for continuous features

Let $F_P(x)$ and $F_Q(x)$ be CDFs of reference and live:

$$ \text{KS} = \sup_x |F_P(x) - F_Q(x)| $$
  • Measures maximum gap between distributions. Larger gap → more drift.
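
A hedged sketch using SciPy's two-sample KS test on synthetic data (the distributions below are made up for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5_000)   # e.g., a standardized score feature at training time
live      = rng.normal(0.4, 1.2, 5_000)   # live window: shifted mean, wider spread

res = ks_2samp(reference, live)           # maximum gap between the two empirical CDFs
print(f"KS statistic = {res.statistic:.3f}, p-value = {res.pvalue:.1e}")
# With large windows almost any gap is "statistically significant", so in practice
# teams often alert on the statistic itself (effect size) rather than the p-value alone.
```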

🧠 Step 4: Assumptions or Key Ideas

  • Use point-in-time correct reference distributions (same feature transformations as production).
  • Compare apples-to-apples: same bins, same encoding, same missing-value handling.
  • Use multi-window monitoring (e.g., 1h/6h/24h) to catch acute spikes vs. gradual shifts.
  • Track prediction drift and (when labels arrive) performance drift (AUC, RMSE, calibration); a small sketch follows this list.
  • Not every drift is bad — some reflect healthy product growth. Tie alerts to business KPIs.
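
As one illustration of the prediction/performance-drift bullet, here is a hedged sketch of a check that runs once delayed labels arrive; the function name, the synthetic scores, and the 0.03 allowed AUC drop are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def performance_drift(y_true, y_score, reference_auc: float, max_drop: float = 0.03):
    """Compare live AUC on a freshly labeled window against the validation-time AUC."""
    live_auc = roc_auc_score(y_true, y_score)
    return live_auc, (reference_auc - live_auc) > max_drop   # True → alert / investigate

# Synthetic labeled window where the model's scores separate the classes poorly.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 2_000)
y_score = np.where(y_true == 1, rng.normal(0.55, 0.2, 2_000), rng.normal(0.45, 0.2, 2_000))
live_auc, alert = performance_drift(y_true, y_score, reference_auc=0.84)
print(live_auc, alert)   # AUC well below 0.84 → the performance-drift alert fires
```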

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Early warning before labels arrive (input/prediction drift).
  • Lightweight computations suitable for streaming pipelines.
  • Actionable thresholds enable automated maintenance.

Limitations:

  • Pure input drift may not hurt accuracy (the model may be robust to it); concept drift can hurt without obvious input changes.
  • Binning choices affect sensitivity (PSI/KL on coarse vs. fine bins).
  • Too many alerts → alarm fatigue; too few → missed incidents.

Trade-offs:

  • Sensitivity vs. Noise: Lower thresholds catch more issues but increase false positives.
  • Speed vs. Rigor: Simple tests run fast; richer tests (distributional shift in high dimensions) are costlier.
  • Automation vs. Oversight: Auto-retraining reduces the latency to a fix, but risks training on contaminated data.

🚧 Step 6: Common Misunderstandings

  • “No drift detected ⇒ model is fine.”
    → Concept drift can degrade accuracy even with stable inputs; monitor performance when labels arrive.
  • “Any drift ⇒ retrain immediately.”
    → Verify impact; some shifts are benign or seasonal. Use guardrails and review.
  • “One metric fits all.”
    → Use a toolbox: KL/PSI/KS for inputs, calibration for predictions, business KPIs for impact.

🧩 Step 7: Mini Summary

🧠 What You Learned: Drift detection watches how inputs, predictions, and (when available) outcomes shift away from training-time reality.

⚙️ How It Works: Keep reference distributions, compute divergences in rolling windows (KL/PSI/KS), aggregate into drift scores, alert on thresholds, and validate impact.

🎯 Why It Matters: It’s the early-warning system that prevents silent model decay and triggers timely adaptation.


🔁 Transition Note

Next up: Series 11, where we’ll cover 4.2: Model Monitoring and Alerting — choosing the right metrics (beyond accuracy), setting thresholds, and wiring alerts and dashboards that engineers trust.
