3.3. Hypothesis Testing


🪄 Step 1: Intuition & Motivation

  • Core Idea: Hypothesis testing is about decision-making under uncertainty. When we see data, we ask:

    “Is this effect real, or could it just be random noise?”

    It provides a structured way to test claims about populations using sample data.

  • Simple Analogy: Imagine you’re a judge.

    • The null hypothesis ($H_0$) says the accused is innocent (no effect).
    • The alternative hypothesis ($H_1$) says the accused is guilty (effect exists).

    You never “prove” guilt beyond all doubt — you just decide whether there’s enough evidence to reject $H_0$.

    That’s hypothesis testing — logical, cautious, and data-driven judgment.


🌱 Step 2: Core Concept

What’s Happening Under the Hood?

We start with two competing claims:

  • Null Hypothesis ($H_0$): the default assumption (no difference or no effect).
  • Alternative Hypothesis ($H_1$): what we’re trying to prove (difference exists).

We then calculate a test statistic from our sample (such as a $z$-score or $t$-value) that measures how extreme our data would be if $H_0$ were true.

The smaller the probability of seeing such extreme data under $H_0$, the stronger the evidence against it.

That probability is called the p-value.

Why It Works This Way

Under the null hypothesis, we know the expected behavior of the test statistic. So if we observe something unusually extreme (e.g., far from the mean), it’s unlikely to have occurred by random chance.

Rejecting $H_0$ doesn’t mean $H_1$ is proven — it just means the data is inconsistent with $H_0$.

That’s the humility built into statistics: we infer, not declare.
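To ground the idea, here’s a minimal simulation sketch (all numbers are hypothetical, chosen for illustration): under a null hypothesis that the population mean is 100, we check how often random sampling alone produces a sample mean as extreme as the one we observed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: H0 says the population mean is 100 (sd = 15).
# We observed a sample of n = 36 with mean 105.
mu0, sigma, n = 100, 15, 36
observed_mean = 105

# Simulate many sample means under H0 to see how "extreme" 105 really is.
null_means = rng.normal(mu0, sigma / np.sqrt(n), size=100_000)

# One-sided p-value: fraction of null sample means at least as extreme.
p_value = np.mean(null_means >= observed_mean)
print(f"Simulated one-sided p-value: {p_value:.4f}")  # ~0.023
```

If random chance under $H_0$ produces a mean of 105 only about 2% of the time, the observed data counts as strong evidence against $H_0$.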

How It Fits in ML Thinking

Hypothesis testing is the logic behind A/B testing, feature selection, and model comparison:

  • In A/B testing, $H_0$: “Variant A = Variant B” vs $H_1$: “Variant B performs better.”
  • In model evaluation, we test whether performance improvements are statistically significant or just lucky noise.

So, every time you say “this change improved accuracy,” hypothesis testing is quietly judging you in the background.
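As an illustration, here’s a sketch of a two-proportion z-test for a hypothetical A/B test; the click counts, sample sizes, and one-sided alternative are all assumptions made up for the example.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical A/B test: clicks out of visitors for each variant.
clicks_a, n_a = 200, 4000   # variant A: 5.00% CTR
clicks_b, n_b = 250, 4000   # variant B: 6.25% CTR

p_a, p_b = clicks_a / n_a, clicks_b / n_b
p_pool = (clicks_a + clicks_b) / (n_a + n_b)           # pooled CTR under H0
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = norm.sf(z)   # one-sided: H1 says B performs better

print(f"z = {z:.3f}, one-sided p-value = {p_value:.4f}")
```

Here the small p-value would let us reject “Variant A = Variant B” at $\alpha = 0.05$, subject to the usual caveats about sample independence.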


📐 Step 3: Mathematical Foundation


⚖️ 1. The Hypothesis Framework

Null vs. Alternative Hypothesis
  • $H_0$ (Null): The default assumption; no effect, no difference.

    Example: “The new ad doesn’t change click-through rate.”

  • $H_1$ (Alternative): What we test for; an effect exists.

    Example: “The new ad increases click-through rate.”

We collect data and compute a test statistic based on a chosen model.

Then we ask:

“If $H_0$ were true, how likely is this test statistic?”

If that likelihood (the p-value) is very small — below a pre-decided threshold ($\alpha$) — we reject $H_0$.


📊 2. The p-value and Significance Levels

Definition & Intuition

The p-value is the probability of obtaining results as extreme (or more extreme) than the observed sample, assuming $H_0$ is true.

Mathematically:

$$ p = P(\text{test statistic} \geq \text{observed value} \mid H_0) $$
  • Low p-value ($p < \alpha$): unlikely under $H_0$ → reject $H_0$.
  • High p-value ($p \ge \alpha$): plausible under $H_0$ → fail to reject $H_0$.

Common significance levels:

  • 0.05 → “5% risk of wrongly rejecting $H_0$.”
  • 0.01 → “Stricter, 1% chance of false alarm.”

Example:

“If the p-value = 0.03 and α = 0.05, reject $H_0$ — evidence is significant.”

The p-value isn’t “the probability that $H_0$ is true” — it’s the probability of seeing your data if $H_0$ were true.
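A small sketch of how a p-value falls out of a test statistic in practice; note how the same hypothetical z-score can clear $\alpha = 0.05$ one-sided but not two-sided.

```python
from scipy.stats import norm

z = 1.88  # hypothetical observed test statistic

p_one_sided = norm.sf(z)             # P(Z >= z | H0)
p_two_sided = 2 * norm.sf(abs(z))    # extreme in either direction

print(f"one-sided: {p_one_sided:.4f}, two-sided: {p_two_sided:.4f}")
# one-sided ~0.030 (reject at alpha = 0.05); two-sided ~0.060 (fail to reject)
```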

⚠️ 3. Type I & Type II Errors

Error Types & Trade-offs
| Error Type | Definition | Probability | Analogy |
|---|---|---|---|
| Type I | Rejecting $H_0$ when it’s true | $\alpha$ | Convicting an innocent person |
| Type II | Failing to reject $H_0$ when it’s false | $\beta$ | Letting a guilty person go free |

Power of the test:

$$ \text{Power} = 1 - \beta $$

It’s the ability to detect a real effect when it exists.

Increasing the sample size decreases $\beta$ (raising power) and shrinks overall uncertainty.

Significance ($\alpha$) controls false alarms; power controls missed detections. Balancing them is the art of good experimental design.
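One way to see that trade-off is a quick Monte Carlo sketch: estimate power by simulating many experiments and counting how often the test rejects $H_0$. The effect size and trial counts below are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha, true_effect, n_trials = 0.05, 0.5, 2000  # effect in sd units (assumed)

def estimated_power(n_per_group):
    # Fraction of simulated experiments where the t-test detects the effect.
    rejections = 0
    for _ in range(n_trials):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(true_effect, 1.0, n_per_group)
        if ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / n_trials

for n in (20, 50, 100):
    print(f"n = {n:3d} per group -> power ~ {estimated_power(n):.2f}")
# power climbs toward 1 as the sample size grows
```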

📈 4. Common Statistical Tests

z-test

Used when population variance is known or sample size is large ($n>30$).

$$ z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} $$

Example: Testing whether an ad campaign’s average click rate differs from 5%.
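A worked sketch of that example with hypothetical numbers, assuming the population standard deviation is known (which is what licenses the z-test) and that we have one CTR measurement per day.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical: daily CTRs over n = 400 days; historical mean is 5%
# with a known population standard deviation of 0.02.
mu0, sigma, n = 0.05, 0.02, 400
sample_mean = 0.053

z = (sample_mean - mu0) / (sigma / np.sqrt(n))
p_value = 2 * norm.sf(abs(z))  # two-sided: "differs from 5%"

print(f"z = {z:.2f}, p-value = {p_value:.4f}")  # z = 3.00, p ~0.003
```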


t-test

Used when population variance is unknown and $n$ is small.

$$ t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} $$

Types:

  • One-sample t-test: compare sample mean to population mean.
  • Two-sample t-test: compare means of two groups.
  • Paired t-test: compare before-and-after measurements.
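All three variants are available in `scipy.stats`; here’s a minimal sketch on synthetic data (the distributions and sample sizes are made up for the example).

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel

rng = np.random.default_rng(7)
before = rng.normal(5.0, 1.0, 15)           # hypothetical "before" measurements
after = before + rng.normal(0.4, 0.5, 15)   # same subjects, shifted upward
group_b = rng.normal(5.5, 1.0, 15)          # an independent second group

t1 = ttest_1samp(before, popmean=5.0)   # one-sample: mean vs 5.0
t2 = ttest_ind(before, group_b)         # two-sample: independent groups
t3 = ttest_rel(before, after)           # paired: before vs after

for name, res in [("one-sample", t1), ("two-sample", t2), ("paired", t3)]:
    print(f"{name}: t = {res.statistic:.2f}, p = {res.pvalue:.4f}")
```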

Chi-Square Test

Tests whether observed categorical frequencies match expected frequencies.

$$ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} $$

Used in contingency tables, goodness-of-fit, and independence testing.
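A small sketch using scipy’s `chi2_contingency` on a hypothetical 2×2 table of ad variant versus click outcome (counts invented for illustration).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows = variant, cols = clicked / not.
table = np.array([[200, 3800],    # variant A
                  [250, 3750]])   # variant B

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}")
```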


ANOVA (Analysis of Variance)

Compares means of 3 or more groups to see if at least one differs.

It partitions total variability into “between-group” and “within-group” variance:

$$ F = \frac{\text{Between-group variance}}{\text{Within-group variance}} $$

If $F$ is large → reject $H_0$ (means not all equal).
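A minimal one-way ANOVA sketch with `scipy.stats.f_oneway` on three synthetic groups (the group means and sizes are assumptions for the example).

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
# Hypothetical: task-completion times under three page layouts.
g1 = rng.normal(10.0, 2.0, 30)
g2 = rng.normal(10.2, 2.0, 30)
g3 = rng.normal(12.0, 2.0, 30)

f_stat, p_value = f_oneway(g1, g2, g3)
print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")  # large F -> reject H0
```

A significant $F$ only says the means are not all equal; a post-hoc test is needed to find which group differs.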


💭 Deeper Insight: A/B Testing Example

“If the p-value is 0.06, what’s your decision?”

  • If $\alpha = 0.05$, we fail to reject $H_0$ — the evidence isn’t strong enough.
  • But remember, 0.06 isn’t a brick wall — it’s a guideline. In practical ML experiments, you’d consider effect size, sample size, and practical significance too.

In short: Statistical significance ≠ business significance.


🧠 Step 4: Assumptions or Key Ideas

  • Samples are random and independent.
  • Data follows an assumed distribution (e.g., normality for t-tests).
  • Variances are approximately equal across groups (for ANOVA).
  • Chosen $\alpha$ determines your tolerance for false positives.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Provides structured decision-making under uncertainty.
  • Quantifies evidence using p-values and significance levels.
  • Foundation for A/B testing, experiments, and scientific inference.

Limitations:

  • Easily misinterpreted (p-values are often overtrusted).
  • Arbitrary $\alpha$ thresholds can mislead conclusions.
  • Assumes ideal sampling and distributions — often violated in real data.

Hypothesis testing gives statistical rigor, but not business context — small effects can be significant, and large ones insignificant, depending on sample size.

🚧 Step 6: Common Misunderstandings

  • “p = 0.04 proves the alternative hypothesis.” → No. It just means data is unlikely under $H_0$.
  • “Failing to reject $H_0$ means $H_0$ is true.” → Not necessarily — maybe your test lacked power.
  • “Smaller p-values mean bigger effects.” → Not always. p-values depend on both effect size and sample size.

🧩 Step 7: Mini Summary

🧠 What You Learned: Hypothesis testing formalizes how we use data to test claims — balancing evidence and uncertainty.

⚙️ How It Works: By quantifying how likely observed data would be under a null hypothesis and comparing it to a predefined threshold.

🎯 Why It Matters: Every model improvement, business experiment, or research conclusion rests on this framework — it’s the backbone of statistical reasoning.
