5.3. Experimental Design
🪄 Step 1: Intuition & Motivation
Core Idea: Experimental design is about discovering cause, not just correlation. It’s the scientific backbone of how we prove that one variable (like a product feature) causes a change in another (like user engagement).
Simple Analogy: Imagine you’re a chef testing two new recipes. You want to know which one people prefer — but you can’t just ask your friends (they might be biased). So you randomly assign each taster one dish or the other, and then measure satisfaction. If the random assignment is fair, any difference in ratings must come from the recipe — not the tasters. That’s experimental design: structured curiosity that guards against self-deception.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
Experimental design builds the foundation for causal inference — isolating the effect of one factor while holding others constant.
Its three pillars are:
- Randomization – Removes bias by ensuring treatment groups are comparable.
- Control – Provides a baseline for comparison.
- Blocking – Reduces variability by accounting for known confounders.
Together, these principles make sure observed differences are causal, not coincidental.
Why It Works This Way
Randomization mimics nature’s fairness — it balances out hidden variables (like age, gender, income). Control isolates the treatment effect. Blocking explains part of the variability before it becomes noise.
In short: Randomization = fairness, Control = contrast, Blocking = clarity.
How It Fits in ML Thinking
In data science, experimental design underlies:
- A/B testing (which version performs better?)
- Online experiments (does the new ranking algorithm increase click-through rate?)
- Causal inference (what would have happened if not for this change?)
Without proper design, you risk chasing statistical mirages — impressive results that vanish under scrutiny.
📐 Step 3: Mathematical Foundation
🧩 1. Randomization, Control, and Blocking
Core Ideas & Mathematical Framing
Let’s formalize the intuition.
🌀 Randomization
Every unit (user, ad, patient, etc.) has equal probability of receiving any treatment. This ensures that, on average, the treatment and control groups are identical except for the intervention.
Mathematically:
$$ E[Y|Treatment] - E[Y|Control] \approx \text{Causal Effect} $$
because confounders cancel out in expectation.
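To make this tangible, here is a minimal Python sketch of checking covariate balance after random assignment; the population size, the age distribution, and the 50/50 split are all made-up choices for illustration:
```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: a confounder (age) that could influence the outcome.
n = 10_000
age = rng.normal(40, 12, size=n)

# Random assignment: each unit gets the treatment with probability 0.5,
# independently of age or anything else about the unit.
treated = rng.random(n) < 0.5

# In expectation the confounder is balanced across arms, so any outcome
# difference can be attributed to the treatment rather than to age.
print("mean age (treatment):", age[treated].mean().round(2))
print("mean age (control):  ", age[~treated].mean().round(2))
```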
🧱 Control
A control group provides the “no-change” baseline to compare against. It helps separate natural variation from treatment effects.
$$ \text{Effect} = \bar{Y}_{treatment} - \bar{Y}_{control} $$
🎯 Blocking
When certain factors (e.g., age, region) are known to influence outcomes, we block them — creating subgroups where those factors are constant. Within each block, we randomize again.
This reduces the noise in the comparison by decomposing the total variability:
$$ Var(Y) = Var_{between\ blocks} + Var_{within\ blocks} $$
Blocking accounts for the between-block term, so only the within-block variance remains as noise in the treatment-effect estimate.
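A small simulation sketch of blocked randomization, assuming a known factor (region) that shifts the outcome; the block names, shifts, noise level, and true effect of 1.0 are all invented for illustration:
```python
import numpy as np

rng = np.random.default_rng(1)

n_per_block = 500
blocks = ["north", "south", "east", "west"]
block_shift = {"north": 0.0, "south": 2.0, "east": 4.0, "west": 6.0}  # known heterogeneity
true_effect = 1.0  # assumed true treatment effect

effects = []
for block in blocks:
    # Randomize *within* each block: half treated, half control.
    treated = rng.permutation(np.array([True, False] * (n_per_block // 2)))
    noise = rng.normal(0, 1, size=n_per_block)
    outcome = block_shift[block] + true_effect * treated + noise
    # Per-block estimate of the treatment effect.
    effects.append(outcome[treated].mean() - outcome[~treated].mean())

# Combine per-block estimates (equal block sizes, so a simple average works).
print("blocked estimate of the effect:", np.mean(effects).round(3))
```
Because each per-block comparison never mixes regions, the large region-to-region differences never enter the noise of the estimate.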
🔁 2. A/B/n Testing and Sequential Testing
From Simple Tests to Smarter Experiments
🧪 A/B Testing
Compare two versions (A = control, B = treatment). Let $p_A$ and $p_B$ be conversion rates.
Hypothesis test:
$$ H_0: p_A = p_B \quad \text{vs} \quad H_1: p_B > p_A $$
Z-statistic:
$$ z = \frac{p_B - p_A}{\sqrt{p(1-p)\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}} $$
where $p$ is the pooled proportion.
Reject $H_0$ if $z > z_{critical}$ (1.645 for a one-sided test at the 5% level; 1.96 if you test two-sided).
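As a quick sketch, the pooled two-proportion z-test above might look like this in Python; the conversion counts (520/10,000 vs 580/10,000) are hypothetical:
```python
import numpy as np
from scipy.stats import norm

def ab_ztest(conv_a, n_a, conv_b, n_b):
    """One-sided pooled z-test for H1: p_B > p_A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = norm.sf(z)  # one-sided upper-tail p-value
    return z, p_value

# Hypothetical counts: 520/10,000 conversions for A, 580/10,000 for B.
z, p = ab_ztest(520, 10_000, 580, 10_000)
print(f"z = {z:.3f}, one-sided p = {p:.4f}")
```
With these made-up counts the lift clears the one-sided 5% threshold (z ≈ 1.86 > 1.645) but would not clear a two-sided 1.96 cutoff, which is exactly why stating the hypothesis direction up front matters.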
🧬 A/B/n Testing
Extends A/B testing to multiple variants (A, B, C…). Use an omnibus test such as ANOVA, then pairwise comparisons with corrections (like Bonferroni) so the family-wise error rate across comparisons stays controlled.
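A possible workflow sketch for an A/B/n test on a continuous metric (e.g., revenue per user): a one-way ANOVA as the omnibus test, then Bonferroni-corrected pairwise comparisons against the control. The sample sizes, means, and spreads below are invented:
```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(2)

# Hypothetical per-user revenue for three variants (all numbers are made up).
a = rng.normal(10.0, 3.0, size=2_000)
b = rng.normal(10.2, 3.0, size=2_000)
c = rng.normal(10.5, 3.0, size=2_000)

# Omnibus test: is there *any* difference among the variants?
f_stat, p_omnibus = f_oneway(a, b, c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_omnibus:.4f}")

# Pairwise follow-ups against the control, with a Bonferroni-adjusted alpha.
alpha, m = 0.05, 2  # two comparisons: B vs A, C vs A
for name, variant in [("B", b), ("C", c)]:
    _, p = ttest_ind(variant, a)
    print(f"{name} vs A: p = {p:.4f}, significant at alpha/m = {alpha/m}: {p < alpha / m}")
```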
⏳ Sequential Testing
Instead of fixing sample size upfront, sequential tests analyze data as it arrives, allowing early stopping when evidence is strong. Popular methods include SPRT (Sequential Probability Ratio Test) and Bayesian sequential analysis.
Benefit: Faster decisions without sacrificing statistical validity.
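A minimal sketch of Wald's SPRT for a conversion rate, assuming we are deciding between a baseline rate $p_0 = 5\%$ and an improved rate $p_1 = 6\%$; the error targets and the simulated data stream are illustrative choices:
```python
import numpy as np

def sprt_bernoulli(stream, p0=0.05, p1=0.06, alpha=0.05, beta=0.20):
    """Wald's SPRT: stop early once the accumulated evidence crosses a boundary."""
    upper = np.log((1 - beta) / alpha)   # decide for H1 (rate is p1)
    lower = np.log(beta / (1 - alpha))   # decide for H0 (rate is p0)
    llr = 0.0
    for i, x in enumerate(stream, start=1):
        # Log-likelihood ratio increment for one Bernoulli observation.
        llr += x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", i
        if llr <= lower:
            return "accept H0", i
    return "no decision yet", len(stream)

# Simulated conversion stream whose true rate really is 6%.
rng = np.random.default_rng(3)
decision, n_used = sprt_bernoulli(rng.binomial(1, 0.06, size=50_000))
print(decision, "after", n_used, "observations")
```
Because the stopping boundaries depend only on $\alpha$ and $\beta$, the test can end as soon as the evidence is strong, which is where the speed advantage comes from.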
⚖️ 3. False Discovery Correction
Controlling False Positives
When running many experiments, some “wins” occur purely by chance. False Discovery Rate (FDR) controls help prevent over-celebrating noise.
Common Methods:
- Bonferroni Correction: Adjusts the significance level $\alpha$ to $\alpha' = \frac{\alpha}{m}$; controls the family-wise error rate but is very strict.
- Benjamini–Hochberg (BH): Ranks $p$-values and controls the expected proportion of false positives among significant results.
If $p_{(i)}$ is the $i$th smallest p-value among $m$ tests, find largest $i$ where
$$ p_{(i)} \leq \frac{i}{m} \alpha $$
All $p_{(1)}, \dots, p_{(i)}$ are considered significant.
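A short sketch of the Benjamini–Hochberg step-up rule; the list of p-values is made up, and in practice they would come from your batch of experiments:
```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of which hypotheses are declared significant."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                          # rank p-values from smallest to largest
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.nonzero(below)[0].max()        # largest i with p_(i) <= (i/m) * alpha
        significant[order[: cutoff + 1]] = True    # everything up to that rank is significant
    return significant

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.74]
print(benjamini_hochberg(p_values))
```
With these numbers, a naive p < 0.05 cutoff would also declare the three p-values near 0.04 significant; BH keeps only the two smallest.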
🧭 4. Connection to Causal Inference
From Randomization to Independence
In causal inference, we want:
$$ (Y(1), Y(0)) \perp\!\!\!\perp \text{Treatment} \quad \text{(under randomization)} $$
This means treatment assignment is independent of potential outcomes — ensuring unbiased estimation of causal effects.
When randomization is done right:
$$ E[Y(1) - Y(0)] = E[Y|Treatment] - E[Y|Control] $$
That’s the average treatment effect (ATE) — the holy grail of experimentation.
If randomization fails, selection bias creeps in: differences may come from pre-existing conditions, not the treatment itself.
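The contrast is easy to simulate. In the sketch below, every unit has both potential outcomes and a true effect of 2 (all numbers invented); randomized assignment recovers the ATE, while letting highly engaged users self-select into the treatment does not:
```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Potential outcomes for each unit: a baseline driven by "engagement",
# plus a constant true treatment effect of 2 (illustrative numbers).
engagement = rng.normal(0, 1, size=n)
y0 = 5 + 3 * engagement + rng.normal(0, 1, size=n)   # outcome without treatment
y1 = y0 + 2                                          # outcome with treatment
true_ate = (y1 - y0).mean()

# (a) Proper randomization: assignment ignores engagement.
random_t = rng.random(n) < 0.5
est_random = y1[random_t].mean() - y0[~random_t].mean()

# (b) Broken randomization: highly engaged users opt into the treatment.
selected_t = engagement > 0
est_selected = y1[selected_t].mean() - y0[~selected_t].mean()

print(f"true ATE:                 {true_ate:.2f}")
print(f"estimate (randomized):    {est_random:.2f}")    # close to the true ATE
print(f"estimate (self-selected): {est_selected:.2f}")  # inflated by selection bias
```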
💭 Probing Question:
“What if your A/B test shows a lift, but you suspect Simpson’s paradox?”
Answer: Simpson’s paradox occurs when aggregated data hides opposite trends within subgroups.
If your A/B test shows improvement overall, but:
- Certain segments (e.g., mobile users, specific regions) perform worse,
- Or traffic was unevenly distributed,
Then your observed “lift” may not represent the true causal effect.
Fix:
- Stratify or block by confounding variables.
- Reanalyze within subgroups to ensure consistency.
- Use regression adjustment or propensity score matching if randomization was imperfect.
In essence: Always check segment-level behavior before declaring victory.
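To see why the segment check matters, here is a toy example with made-up counts: variant B looks like a huge overall win only because it happened to receive most of the high-converting desktop traffic, while it actually converts worse within every segment:
```python
# Hypothetical A/B counts (conversions, users), deliberately unbalanced by device.
data = {
    "desktop": {"A": (50, 100), "B": (432, 900)},
    "mobile":  {"A": (90, 900), "B": (9, 100)},
}

def rate(conv, n):
    return conv / n

# Aggregate rates: B looks like a big win...
for variant in ("A", "B"):
    conv = sum(data[seg][variant][0] for seg in data)
    n = sum(data[seg][variant][1] for seg in data)
    print(f"overall {variant}: {rate(conv, n):.1%}")

# ...but within every segment B actually converts worse (Simpson's paradox).
for seg in data:
    ra, rb = rate(*data[seg]["A"]), rate(*data[seg]["B"])
    print(f"{seg:8s} A: {ra:.1%}  B: {rb:.1%}")
```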
🧠 Step 4: Assumptions or Key Ideas
- Randomization ensures independence between treatment and confounders.
- Control isolates causal effects.
- Blocking reduces variance due to known heterogeneity.
- Each participant must be independent (no interference between units).
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Gold standard for establishing causality.
- Provides interpretable and actionable results.
- Scalable to online experiments and A/B testing pipelines.
Limitations:
- Expensive or slow for large-scale systems.
- Ethical or logistical limits (e.g., can’t randomize harmful treatments).
- Susceptible to hidden confounders if randomization breaks.
🚧 Step 6: Common Misunderstandings
- “A/B testing always gives the truth.” → Only if randomization and independence hold.
- “p < 0.05 means the result is practically significant.” → Not necessarily — statistical ≠ business significance.
- “Simpson’s paradox can’t happen in randomized trials.” → It still can if subgroups behave differently or randomization was imperfect.
🧩 Step 7: Mini Summary
🧠 What You Learned: Experimental design provides structure to testing causal hypotheses — using randomization, control, and blocking to ensure fairness and clarity.
⚙️ How It Works: Randomized controlled experiments isolate treatment effects and reduce confounding through careful setup.
🎯 Why It Matters: Every trustworthy A/B test, clinical trial, or policy evaluation rests on these design principles — the guardrails between correlation and causation.