1.9. Alerting, Logging, and Anomaly Detection
🎯 Covered Sections
This series covers:
- 1.9: Alerting, Logging, and Anomaly Detection
🪄 Step 1: Intuition & Motivation
Core Idea (in 1 short paragraph): Alerts are the heartbeat monitors of your ML system — they ensure that when something abnormal happens, someone knows before users do. But good alerting isn’t just about “ringing bells”; it’s about knowing what to alert on, how loudly, and when to stay silent. A great monitoring engineer isn’t the one who sets up 100 alerts — it’s the one who sets up 10 that truly matter.
Simple Analogy: Imagine a smoke detector that goes off every time you cook toast. You’d eventually ignore it — even when there’s a real fire. The same happens in ML: noisy, overly sensitive alerts cause alert fatigue, and true problems slip through. Smart alerting keeps the signal strong and the noise low.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
ML alerting pipelines continuously watch metrics — data drift, model performance, latency, bias — and detect unusual deviations from baseline behavior.
Here’s how it works:
1. Metric Collection: Each monitored metric (like accuracy, PSI, latency) is logged in real time or at intervals.
2. Thresholding: Thresholds define when a metric’s change is considered abnormal. Example: “Trigger alert if AUC drops by >5% or PSI > 0.2” (see the sketch after this list).
3. Time-Series Modeling: Instead of fixed thresholds, anomaly detectors like EWMA, Prophet, or Z-score models predict expected ranges dynamically, adapting to seasonality and trends.
4. Alert Routing: Alerts are categorized by severity (info, warning, critical) and routed to the right team or channel (Slack, email, PagerDuty).
5. Escalation & Feedback: If issues persist or recur, alerts escalate, and the system learns from false alarms to improve precision over time.
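A minimal sketch of this flow, assuming illustrative thresholds (AUC drop > 5%, PSI > 0.2) and placeholder routing targets; a production version would call real Slack/PagerDuty/email integrations instead of printing:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    metric: str
    value: float
    severity: str   # "info" | "warning" | "critical"
    message: str

def check_metrics(baseline_auc: float, current_auc: float, psi: float) -> list[Alert]:
    """Rule-based checks: flag threshold breaches and tag each with a severity."""
    alerts = []
    auc_drop = (baseline_auc - current_auc) / baseline_auc
    if auc_drop > 0.05:
        alerts.append(Alert("auc", current_auc, "critical",
                            f"AUC dropped {auc_drop:.1%} from baseline"))
    if psi > 0.2:
        severity = "critical" if psi > 0.3 else "warning"
        alerts.append(Alert("psi", psi, severity, f"PSI {psi:.2f} exceeds 0.2"))
    return alerts

def route(alert: Alert) -> None:
    """Route by severity; channel names here are placeholders, not real integrations."""
    channel = {"info": "log only",
               "warning": "Slack #ml-alerts",
               "critical": "PagerDuty on-call"}[alert.severity]
    print(f"[{alert.severity.upper()}] -> {channel}: {alert.message}")

for alert in check_metrics(baseline_auc=0.91, current_auc=0.84, psi=0.27):
    route(alert)
```

The severity-to-channel mapping is where a team encodes “how loudly” to alert: info stays in logs, warnings go to a shared channel, and only critical issues page a human.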
Why It Works This Way
The system must learn which signals are meaningful and which are just noise from natural variance.
By combining rule-based (PSI > 0.2) and statistical (EWMA, Prophet) anomaly detection, you build a system that adapts — catching both sharp shocks and gradual drifts.
How It Fits in ML Thinking
While monitoring detects, alerting communicates.
Done right, it transforms raw metrics into prioritized, actionable signals — helping data and ops teams focus only on what truly requires attention.
📐 Step 3: Mathematical Foundation
Exponential Weighted Moving Average (EWMA)
$$S_t = \alpha X_t + (1 - \alpha)\, S_{t-1}$$
- $S_t$: smoothed signal at time $t$
- $X_t$: actual metric value at time $t$
- $\alpha$: smoothing factor ($0 < \alpha \le 1$)
If $|X_t - S_t|$ exceeds a threshold (e.g., $3\sigma$), it’s flagged as an anomaly.
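A rough sketch of EWMA-based flagging on a synthetic daily metric; the smoothing factor, the sensitivity $k$, the warm-up window, and the residual-based $\sigma$ estimate are all tuning assumptions rather than a standard recipe:

```python
import numpy as np

def ewma_anomalies(values, alpha=0.3, k=3.0, warmup=5):
    """Flag points whose deviation from the EWMA exceeds k times the
    std of past deviations (only after a short warm-up history)."""
    s = values[0]                      # initialize the smoothed signal
    residuals, flags = [], [False]
    for x in values[1:]:
        resid = abs(x - s)
        if len(residuals) >= warmup:
            sigma = np.std(residuals)
            flags.append(bool(resid > k * sigma))
        else:
            flags.append(False)        # not enough history to judge yet
        residuals.append(resid)
        s = alpha * x + (1 - alpha) * s   # EWMA update: S_t = αX_t + (1 − α)S_{t−1}
    return flags

daily_auc = [0.91, 0.90, 0.92, 0.91, 0.90, 0.91, 0.78, 0.90]   # synthetic values
print(ewma_anomalies(daily_auc))   # only the sudden 0.78 drop is flagged
```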
Z-Score Anomaly Detection
$$Z_t = \frac{X_t - \mu}{\sigma}$$
If $|Z_t| > k$ (e.g., $k = 3$), flag as anomaly.
- $\mu$, $\sigma$: mean and standard deviation from baseline data.
- $k$: sensitivity parameter.
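A small sketch of the same idea, assuming a synthetic latency baseline and $k = 3$ as the sensitivity:

```python
import numpy as np

def zscore_anomalies(baseline, new_values, k=3.0):
    """Flag observations more than k standard deviations from the baseline mean."""
    mu, sigma = np.mean(baseline), np.std(baseline)
    z = (np.asarray(new_values) - mu) / sigma
    return np.abs(z) > k

baseline_latency_ms = [102, 98, 101, 99, 100, 103, 97, 100]    # synthetic baseline window
today_latency_ms = [101, 99, 160, 102]                         # one 160 ms spike
print(zscore_anomalies(baseline_latency_ms, today_latency_ms))  # [False False  True False]
```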
Prophet Forecast-based Detection (Conceptual)
Flag an anomaly when the observed value leaves the forecast band: $|y_t - \hat{y}_t| > \epsilon$.
- $\hat{y}_t$: forecasted metric value by Prophet (captures seasonality/trends).
- $\epsilon$: tolerance band (confidence interval).
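A conceptual sketch with the prophet package (its API expects columns named `ds` and `y`, and exposes `yhat_lower`/`yhat_upper` bounds); the history/recent split and the 95% interval width are assumptions:

```python
import pandas as pd
from prophet import Prophet

def prophet_anomalies(history: pd.DataFrame, recent: pd.DataFrame) -> pd.DataFrame:
    """Fit on historical metric values, forecast over the recent window, and
    flag points that fall outside the model's uncertainty interval (the ε band)."""
    model = Prophet(interval_width=0.95)          # width of the tolerance band
    model.fit(history)                            # history: columns 'ds' (date), 'y' (metric)
    forecast = model.predict(recent[["ds"]])      # yields yhat, yhat_lower, yhat_upper
    merged = recent.merge(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]], on="ds")
    merged["anomaly"] = (merged["y"] < merged["yhat_lower"]) | (merged["y"] > merged["yhat_upper"])
    return merged
```

Because Prophet models trend and seasonality explicitly, the band widens and narrows with the metric’s normal rhythm, which is what lets this approach handle weekly cycles that break fixed thresholds.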
🧠 Step 4: Assumptions or Key Ideas
- Baseline metrics (mean, variance, or trend) exist for comparison.
- Metrics are time-indexed and consistently logged.
- Alerts are actionable — tied to owners, escalation paths, and clear playbooks.
- False positives are reviewed periodically to recalibrate thresholds.
- Sensitive or PII-related data is handled responsibly in logging and alert messages.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Detects problems in real time before business impact.
- Reduces manual monitoring workload.
- Enables automated self-healing via retraining triggers.
Limitations:
- Poor threshold calibration leads to noise or missed events.
- High alert volume causes fatigue and desensitization.
- Anomaly models (Prophet/EWMA) require tuning for non-stationary metrics.
Trade-offs:
- Sensitivity vs. Stability: More sensitivity catches smaller drifts but increases false alarms.
- Automation vs. Human Oversight: Full automation speeds detection but risks overreaction.
- Global vs. Segmented Alerts: Aggregated alerts are simpler; segment-level alerts catch hidden issues.
🚧 Step 6: Common Misunderstandings
- “Every metric should have an alert.” Wrong: alerts must be meaningful, limited, and tied to clear actions.
- “Fixed thresholds are enough.” Static thresholds can’t handle seasonality or long-term drifts; use adaptive baselines.
- “More alerts = better coverage.” Over-alerting causes blindness; alert fatigue kills reliability faster than no alerts at all.
🧩 Step 7: Mini Summary
🧠 What You Learned: Alerting and anomaly detection transform monitoring data into actionable signals — helping engineers react quickly and confidently.
⚙️ How It Works: Track key metrics → apply thresholds or anomaly models (EWMA, Prophet) → categorize alerts by severity → route and escalate smartly.
🎯 Why It Matters: Great alerting is quiet until it must be loud — it prevents chaos by ensuring only true, urgent issues get attention.