3.1 Decision Thresholds and Metrics


🪄 Step 1: Intuition & Motivation

Core Idea: Your Logistic Regression model doesn’t directly say “Yes” or “No.” It gives a probability — the likelihood of belonging to class 1 (e.g., “spam” or “has disease”).

Then you must decide:

At what probability should we call it a “Yes”?

By default, we pick 0.5, meaning anything above 50% is “positive.” But in real life — business, healthcare, security — 0.5 is rarely optimal.


Simple Analogy: Imagine a security scanner that flags passengers as “suspicious.”

  • If it’s too sensitive, many innocent people get flagged (false positives).
  • If it’s too relaxed, actual threats slip through (false negatives).

Adjusting the decision threshold is like fine-tuning that scanner — deciding how strict or lenient your model should be.


🌱 Step 2: Core Concept


What’s Happening Under the Hood?

Logistic Regression outputs a predicted probability $\hat{y}$ between 0 and 1. To make a decision, we apply a threshold $t$:

$$\text{Predicted class} = \begin{cases} 1, & \text{if } \hat{y} \geq t \\ 0, & \text{if } \hat{y} < t \end{cases}$$

  • Default: $t = 0.5$

  • But depending on your goal, you can lower or raise the threshold:

    • Lower $t$ → catch more positives (↑ recall, ↓ precision).
    • Raise $t$ → be more certain before predicting positive (↑ precision, ↓ recall).

Example: In medical diagnosis, missing a sick patient (false negative) is worse than wrongly flagging a healthy one. So we lower the threshold — to err on the side of caution.
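
As a concrete illustration, here is a minimal sketch of thresholding with scikit-learn. The synthetic dataset, the train/test split, and the threshold value of 0.3 are illustrative assumptions, not part of the text above.

```python
# Minimal sketch: turning predicted probabilities into class labels with a
# custom threshold. Dataset, split, and threshold are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]   # P(class = 1) for each test sample

t = 0.3                                      # lowered threshold: favors recall
y_pred = (probs >= t).astype(int)            # 1 if probability >= t, else 0
```

With $t = 0.3$ the model flags more borderline cases as positive than it would at the default 0.5, trading some precision for recall.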


Why It Works This Way

Changing the threshold moves the balance between the two kinds of errors:

  • False Positives (FP): predict “yes” when it’s actually “no.”
  • False Negatives (FN): predict “no” when it’s actually “yes.”

These aren’t just numbers — they have real-world meaning:

  • In fraud detection → FNs = missed frauds (bad!).
  • In email spam → FPs = losing an important email (bad!).

So your business context decides where the threshold should be.

Threshold tuning is less about math and more about strategy — you adjust it based on which type of error your use case can better tolerate.
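
You can see this trade-off directly by counting both error types at different thresholds. The sketch below reuses the hypothetical `probs` and `y_test` from the earlier example.

```python
# Count FP and FN at two thresholds (reusing probs and y_test from the
# earlier sketch) to see how the error balance shifts.
from sklearn.metrics import confusion_matrix

for t in (0.3, 0.7):
    y_pred = (probs >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print(f"t={t}: false positives={fp}, false negatives={fn}")
```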

How It Fits in ML Thinking

Machine learning isn’t just about predictions — it’s about making the right decisions. Thresholds connect math to mission: they let you translate probability into action.

They’re crucial in production systems, where false alarms or missed detections have real costs. So, the best ML engineers think not just about “accuracy,” but about the cost of being wrong.


📐 Step 3: Mathematical Foundation

Let’s unpack the evaluation metrics that help you measure and choose the right threshold.


Precision, Recall, and F1 Score
| Metric | Formula | Meaning |
| --- | --- | --- |
| Precision | $\frac{TP}{TP + FP}$ | "Of all the positive predictions, how many were correct?" |
| Recall | $\frac{TP}{TP + FN}$ | "Of all actual positives, how many did we catch?" |
| F1 Score | $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ | Harmonic mean — balances both. |
  • Precision = “How trustworthy are our alerts?”
  • Recall = “How many real cases did we find?”
  • F1 = “How balanced are we?”

Changing the threshold will shift all three — that’s why you can’t optimize them all simultaneously.
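
A quick way to watch that shift in practice (a sketch, again reusing the hypothetical `probs` and `y_test` from the earlier example):

```python
# Precision, recall, and F1 at several thresholds (probs and y_test as in
# the earlier sketch). Lower thresholds tend to raise recall; higher ones,
# precision.
from sklearn.metrics import precision_score, recall_score, f1_score

for t in (0.3, 0.5, 0.7):
    y_pred = (probs >= t).astype(int)
    print(f"t={t}: "
          f"precision={precision_score(y_test, y_pred):.2f}, "
          f"recall={recall_score(y_test, y_pred):.2f}, "
          f"F1={f1_score(y_test, y_pred):.2f}")
```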


ROC Curve and AUC

The ROC (Receiver Operating Characteristic) curve plots:

  • x-axis: False Positive Rate (FPR) = $\frac{FP}{FP + TN}$
  • y-axis: True Positive Rate (TPR) = Recall = $\frac{TP}{TP + FN}$

Each point = one threshold. The AUC (Area Under the Curve) measures the model's discrimination ability — how well it separates the two classes.

  • AUC = 1 → perfect model
  • AUC = 0.5 → random guessing
AUC asks: “If I randomly pick one positive and one negative, how often will the model rank the positive higher?”
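
In scikit-learn, the full curve and its area come from the same probability scores (a sketch, reusing `probs` and `y_test` from the earlier example):

```python
# ROC curve points and AUC from the predicted probabilities
# (probs and y_test as in the earlier sketch).
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, probs)   # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_test, probs)                # 1.0 = perfect, 0.5 = random guessing
print(f"ROC-AUC = {auc:.3f}")
```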

Precision–Recall (PR) Curve

PR curves are better when data is imbalanced. They show how precision and recall trade off as the threshold changes.

  • The area under the PR curve (AP) tells how well the model identifies positives in a sea of negatives.
  • A steep drop in precision means the model is easily fooled by false alarms.
ROC = “Can it tell classes apart?” PR = “How reliable are positive predictions?”
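
The PR curve and its summary score come from the same scores as well (a sketch under the same assumptions as before):

```python
# Precision-recall curve and average precision (AP), reusing probs and y_test
# from the earlier sketch. AP summarizes the PR curve in a single number.
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_test, probs)
ap = average_precision_score(y_test, probs)
print(f"Average precision = {ap:.3f}")
```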

Calibration Plots

Calibration plots check whether predicted probabilities are honest.

Example: if your model says “70% chance of rain” across 100 cases, then about 70 of them should actually rain.

If not, your model is miscalibrated — too optimistic or too cautious.

Well-calibrated models are vital when probabilities are used for risk or resource decisions (e.g., insurance, medicine).

Calibration answers: “Can I trust this probability, or is my model bluffing?”
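
A minimal calibration check, again reusing the hypothetical `probs` and `y_test`: scikit-learn's `calibration_curve` bins the predictions and compares the mean predicted probability in each bin with the observed positive rate.

```python
# Simple calibration check: in each probability bin, compare the mean
# predicted probability with the fraction of cases that were actually positive.
from sklearn.calibration import calibration_curve

frac_positive, mean_predicted = calibration_curve(y_test, probs, n_bins=10)
for pred, obs in zip(mean_predicted, frac_positive):
    print(f"predicted ~{pred:.2f} -> observed {obs:.2f}")
```

If the observed rates track the predicted probabilities closely, the model is well calibrated; large gaps mean it is over- or under-confident.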

🧠 Step 4: Assumptions or Key Ideas

  • The default threshold (0.5) assumes classes are balanced and costs of FP/FN are equal — rarely true in real life.
  • Business context determines whether to prioritize recall or precision.
  • Calibration assumes probabilities reflect reality, not just ranking.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Threshold tuning aligns ML outputs with business goals.
  • Metrics like ROC-AUC and PR-AUC give deep insight beyond accuracy.
  • Calibration ensures trustworthy probabilities for decision systems.

Limitations:

  • Over-optimization on one metric can harm others (e.g., high recall kills precision).
  • Metrics can be misleading in highly imbalanced data.
  • Calibration requires lots of validation data for reliable curves.
The trade-off is philosophical: Do you want a careful model (fewer false alarms) or a paranoid model (misses almost nothing)? Your threshold decides.

🚧 Step 6: Common Misunderstandings

  • “0.5 threshold always works.” → Only if both classes are balanced and equally important.
  • “A high AUC means I don’t need to adjust threshold.” → AUC ignores threshold; threshold still controls practical outcomes.
  • “Precision and recall can both be maximized.” → No, improving one often sacrifices the other.

🧩 Step 7: Mini Summary

🧠 What You Learned: Logistic Regression predicts probabilities, not classes — thresholds convert them into real-world decisions.

⚙️ How It Works: Adjusting thresholds changes the balance between false positives and false negatives; metrics like ROC, PR, and calibration help you find the right point.

🎯 Why It Matters: This step turns your model from a “math tool” into a strategic decision-maker aligned with business goals.
