5. Evaluation Metrics
🪄 Step 1: Intuition & Motivation
Core Idea (in one line): Evaluation metrics are the report card of your model — they tell you not just how well it did, but how fairly and in what way it succeeded or failed.
Simple Analogy: Imagine grading a student.
- Accuracy says, “How many answers did they get right?”
- Precision says, “Of all their confident answers, how many were correct?”
- Recall says, “Of all the questions they should’ve answered correctly, how many did they actually get right?”
- F1 balances both — a fair overall score.
- ROC-AUC? It says, “How good are they at ranking correct answers higher than wrong ones?”
Each metric tells a different story — and in ML, choosing the wrong one can make a bad model look great.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
When you compare your model's predictions against the true labels, every prediction falls into one of four categories:
- True Positives (TP): Correctly predicted positive cases.
- True Negatives (TN): Correctly predicted negative cases.
- False Positives (FP): Predicted positive when it’s actually negative.
- False Negatives (FN): Missed positives — predicted negative when it’s actually positive.
These form the foundation of most evaluation metrics.
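To make the four counts concrete, here is a minimal sketch using scikit-learn's `confusion_matrix` on a small, purely illustrative set of labels (the arrays below are made up for the example):

```python
from sklearn.metrics import confusion_matrix

# Illustrative ground-truth labels and model predictions (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=4, FP=1, FN=2
```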
Why It Works This Way
No single metric captures everything.
- Accuracy is fine when data is balanced (roughly equal positives and negatives).
- Precision focuses on how trustworthy your positive predictions are.
- Recall focuses on how completely you found all positive cases.
- F1-score is the harmonic mean of precision and recall, balancing the trade-off between the two.
- ROC-AUC measures ranking ability — can your model tell positive and negative apart, regardless of thresholds?
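That ranking interpretation can be checked directly: ROC-AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. A small sketch with illustrative scores (the arrays are assumptions for the example):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative predicted scores (higher = more likely positive) and true labels
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# ROC-AUC: probability that a random positive outranks a random negative
auc = roc_auc_score(y_true, scores)

# Check against direct pairwise counting (ties count as half)
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print(auc, np.mean(pairs))  # both ~0.889
```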
How It Fits in ML Thinking
Metrics are how you translate model performance into business relevance.
In ML, a “good” model is one that aligns with the problem’s stakes:
- For cancer detection → prioritize recall (don’t miss positives).
- For email spam filtering → prioritize precision (don’t block real emails).
- For fraud detection → F1-score is a common choice, since both error types matter.
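One practical lever behind these choices is the decision threshold: lowering it trades precision for recall, raising it does the opposite. A minimal sketch on synthetic, imbalanced data (the dataset and the logistic-regression model are assumptions made purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data purely for illustration
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # probability of the positive class

# Lower thresholds favor recall (catch more positives); higher thresholds favor precision
for threshold in (0.2, 0.5, 0.8):
    pred = (proba >= threshold).astype(int)
    p = precision_score(y_te, pred, zero_division=0)
    r = recall_score(y_te, pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```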
📐 Step 3: Mathematical Foundation
Classification Metrics
| Metric | Formula | Measures |
|---|---|---|
| Accuracy | $ \frac{TP + TN}{TP + FP + TN + FN} $ | Overall correctness |
| Precision | $ \frac{TP}{TP + FP} $ | Reliability of positive predictions |
| Recall (Sensitivity) | $ \frac{TP}{TP + FN} $ | Coverage of actual positives |
| F1-Score | $ 2 \times \frac{Precision \times Recall}{Precision + Recall} $ | Balance between precision and recall |
| ROC-AUC | Area under ROC curve | Ability to rank positives higher than negatives |
- Accuracy answers: “How often am I right?”
- Precision answers: “When I say ‘yes’, how often am I right?”
- Recall answers: “Of all true ‘yes’ cases, how many did I find?”
- F1 answers: “Can I be precise and complete at the same time?”
- ROC-AUC answers: “How well can I separate the classes overall?”
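The formulas in the table translate directly into code. A small sketch that computes them from the confusion-matrix counts and prints scikit-learn's values alongside for comparison (the labels are the same illustrative arrays as in the earlier sketch):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Same illustrative labels as in the earlier confusion-matrix sketch
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Metrics computed straight from the table's formulas
accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Cross-check against scikit-learn's implementations
print(f"accuracy  manual={accuracy:.3f}  sklearn={accuracy_score(y_true, y_pred):.3f}")
print(f"precision manual={precision:.3f}  sklearn={precision_score(y_true, y_pred):.3f}")
print(f"recall    manual={recall:.3f}  sklearn={recall_score(y_true, y_pred):.3f}")
print(f"f1        manual={f1:.3f}  sklearn={f1_score(y_true, y_pred):.3f}")
```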
Regression Metrics
| Metric | Formula | Description |
|---|---|---|
| MSE (Mean Squared Error) | $ \frac{1}{n}\sum (y_i - \hat{y}_i)^2 $ | Penalizes large errors heavily |
| RMSE (Root Mean Squared Error) | $ \sqrt{\frac{1}{n}\sum (y_i - \hat{y}_i)^2} $ | Easier to interpret (same units as $y$) |
| MAE (Mean Absolute Error) | $ \frac{1}{n}\sum \lvert y_i - \hat{y}_i \rvert $ | Penalizes all errors equally |
| MAPE (Mean Absolute Percentage Error) | $ \frac{100}{n}\sum \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert $ | Measures relative error (%) |
| R² (Coefficient of Determination) | $ 1 - \frac{SS_{res}}{SS_{tot}} $ | How much variance in $y$ is explained |
- MSE: Squared errors exaggerate big mistakes → sensitive to outliers.
- MAE: More robust, treats all errors equally.
- MAPE: Great for interpretability (percent errors), but fails when $y_i = 0$.
- R²: Shows how much better your model is than always predicting the mean of $y$.
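Each regression formula is one line of NumPy. A quick sketch on illustrative values (the `y_true`/`y_pred` arrays below are made up for the example):

```python
import numpy as np

# Illustrative targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse  = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
mae  = np.mean(np.abs(y_true - y_pred))
mape = 100 * np.mean(np.abs((y_true - y_pred) / y_true))  # undefined if any y_true == 0
r2   = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  MAPE={mape:.1f}%  R2={r2:.3f}")
```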
🧠 Step 4: Choosing the Right Metric
| Scenario | Best Metric | Why |
|---|---|---|
| Balanced classification | Accuracy | Simple, interpretable |
| Imbalanced classification | F1, Precision, Recall | Handles uneven classes better |
| Fraud or medical detection | Recall | Missing positives is costly |
| Spam filtering | Precision | False alarms are costly |
| Ranking tasks | ROC-AUC | Focus on ordering, not thresholds |
| Regression with outliers | MAE | Robust to extreme errors |
| Regression where large errors are costly | RMSE | Penalizes large deviations heavily |
🎯 Remember: Metrics are not interchangeable — they should mirror the real-world cost of being wrong.
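In practice, this choice usually shows up as the `scoring` argument during model selection. A minimal sketch (synthetic data and a logistic-regression model, both assumptions made for illustration) that evaluates the same model under several metrics with cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced data for illustration only
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
model = LogisticRegression(max_iter=1000)

# The same model can look very different depending on the metric you evaluate
for scoring in ("accuracy", "precision", "recall", "f1", "roc_auc"):
    scores = cross_val_score(model, X, y, cv=5, scoring=scoring)
    print(f"{scoring:>9}: {scores.mean():.3f}")
```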
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Metrics provide quantitative ways to compare models.
- They connect model performance to business needs.
- They allow optimization and benchmarking across iterations.
Limitations:
- A single metric can be misleading in isolation (e.g., high accuracy on imbalanced data).
- Some metrics (like ROC-AUC) are hard to interpret for non-technical stakeholders.
- Metrics must be paired with domain understanding; context matters.
Metrics are like camera lenses:
- Wide-angle (Accuracy) shows the big picture but misses small details.
- Zoom (Precision/Recall) shows fine detail but can distort context.
Choose the right lens for the right photo, or risk misunderstanding your model.
🚧 Step 6: Common Misunderstandings
- “Accuracy is always a good metric.” Not in imbalanced datasets: it can be high even if the model ignores minority classes.
- “High ROC-AUC means a perfect classifier.” It just means good ranking, not necessarily good thresholded classification.
- “R² can never be negative.” It can be, if the model performs worse than just predicting the mean of the target.
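The first and last points are easy to verify numerically; a short sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import accuracy_score, r2_score

# 1) Accuracy on imbalanced data: predicting "negative" for everything still scores 95%
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)          # ignores the minority class entirely
print(accuracy_score(y_true, y_pred))   # 0.95, yet recall on the positive class is 0

# 2) Negative R²: predictions worse than always predicting the mean of y
y = np.array([1.0, 2.0, 3.0, 4.0])
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])  # anti-correlated with the target
print(r2_score(y, bad_pred))               # -3.0, i.e. worse than the mean baseline
```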
🧩 Step 7: Mini Summary
🧠 What You Learned: Evaluation metrics translate raw model predictions into meaningful, interpretable performance indicators.
⚙️ How It Works: Classification metrics use confusion matrix components; regression metrics measure error magnitudes or explained variance.
🎯 Why It Matters: The right metric keeps your model aligned with the real-world goal — because accuracy alone never tells the full story.