5.3. Experimental Design
🪄 Step 1: Intuition & Motivation
Core Idea: Experimental design is about discovering cause, not just correlation. It’s the scientific backbone of how we prove that one variable (like a product feature) causes a change in another (like user engagement).
Simple Analogy: Imagine you’re a chef testing two new recipes. You want to know which one people prefer — but you can’t just ask your friends (they might be biased). So you randomly assign each taster one dish or the other, and then measure satisfaction. If the random assignment is fair, any difference in ratings must come from the recipe — not the tasters. That’s experimental design: structured curiosity that guards against self-deception.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
Experimental design builds the foundation for causal inference — isolating the effect of one factor while holding others constant.
Its three pillars are:
- Randomization – Removes bias by ensuring treatment groups are comparable.
- Control – Provides a baseline for comparison.
- Blocking – Reduces variability by accounting for known confounders.
Together, these principles make sure observed differences are causal, not coincidental.
Why It Works This Way
Randomization mimics nature’s fairness — it balances out hidden variables (like age, gender, income). Control isolates the treatment effect. Blocking explains part of the variability before it becomes noise.
In short: Randomization = fairness, Control = contrast, Blocking = clarity.
How It Fits in ML Thinking
In data science, experimental design underlies:
- A/B testing (which version performs better?)
- Online experiments (does the new ranking algorithm increase click-through rate?)
- Causal inference (what would have happened if not for this change?)
Without proper design, you risk chasing statistical mirages — impressive results that vanish under scrutiny.
📐 Step 3: Mathematical Foundation
🧩 1. Randomization, Control, and Blocking
Core Ideas & Mathematical Framing
Let’s formalize the intuition.
🌀 Randomization
Every unit (user, ad, patient, etc.) has equal probability of receiving any treatment. This ensures that, on average, the treatment and control groups are identical except for the intervention.
Mathematically:
$$ E[Y|Treatment] - E[Y|Control] \approx \text{Causal Effect} $$
because confounders cancel out in expectation.
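To make this tangible, here is a minimal Python sketch of checking covariate balance after random assignment; the population size, the age distribution, and the 50/50 split are all made-up choices for illustration:
```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: a confounder (age) that could influence the outcome.
n = 10_000
age = rng.normal(40, 12, size=n)

# Random assignment: each unit gets the treatment with probability 0.5,
# independently of age or anything else about the unit.
treated = rng.random(n) < 0.5

# In expectation the confounder is balanced across arms, so any outcome
# difference can be attributed to the treatment rather than to age.
print("mean age (treatment):", age[treated].mean().round(2))
print("mean age (control):  ", age[~treated].mean().round(2))
```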
🧱 Control
A control group provides the “no-change” baseline to compare against. It helps separate natural variation from treatment effects.
$$ \text{Effect} = \bar{Y}_{treatment} - \bar{Y}_{control} $$
🎯 Blocking
When certain factors (e.g., age, region) are known to influence outcomes, we block them — creating subgroups where those factors are constant. Within each block, we randomize again.
This reduces the noise in the comparison by decomposing the total variability:
$$ Var(Y) = Var_{between\ blocks} + Var_{within\ blocks} $$
Blocking accounts for the between-block term, so only the within-block variance remains as noise in the treatment-effect estimate.
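A small simulation sketch of blocked randomization, assuming a known factor (region) that shifts the outcome; the block names, shifts, noise level, and true effect of 1.0 are all invented for illustration:
```python
import numpy as np

rng = np.random.default_rng(1)

n_per_block = 500
blocks = ["north", "south", "east", "west"]
block_shift = {"north": 0.0, "south": 2.0, "east": 4.0, "west": 6.0}  # known heterogeneity
true_effect = 1.0  # assumed true treatment effect

effects = []
for block in blocks:
    # Randomize *within* each block: half treated, half control.
    treated = rng.permutation(np.array([True, False] * (n_per_block // 2)))
    noise = rng.normal(0, 1, size=n_per_block)
    outcome = block_shift[block] + true_effect * treated + noise
    # Per-block estimate of the treatment effect.
    effects.append(outcome[treated].mean() - outcome[~treated].mean())

# Combine per-block estimates (equal block sizes, so a simple average works).
print("blocked estimate of the effect:", np.mean(effects).round(3))
```
Because each per-block comparison never mixes regions, the large region-to-region differences never enter the noise of the estimate.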
🔁 2. A/B/n Testing and Sequential Testing
From Simple Tests to Smarter Experiments
🧪 A/B Testing
Compare two versions (A = control, B = treatment). Let $p_A$ and $p_B$ be conversion rates.
Hypothesis test:
$$ H_0: p_A = p_B \quad \text{vs} \quad H_1: p_B > p_A $$
Z-statistic:
$$ z = \frac{p_B - p_A}{\sqrt{p(1-p)\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}} $$
where $p$ is the pooled proportion.
Reject $H_0$ if $z > z_{critical}$ (1.645 for a one-sided test at the 5% level; 1.96 if you test two-sided).
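As a quick sketch, the pooled two-proportion z-test above might look like this in Python; the conversion counts (520/10,000 vs 580/10,000) are hypothetical:
```python
import numpy as np
from scipy.stats import norm

def ab_ztest(conv_a, n_a, conv_b, n_b):
    """One-sided pooled z-test for H1: p_B > p_A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = norm.sf(z)  # one-sided upper-tail p-value
    return z, p_value

# Hypothetical counts: 520/10,000 conversions for A, 580/10,000 for B.
z, p = ab_ztest(520, 10_000, 580, 10_000)
print(f"z = {z:.3f}, one-sided p = {p:.4f}")
```
With these made-up counts the lift clears the one-sided 5% threshold (z ≈ 1.86 > 1.645) but would not clear a two-sided 1.96 cutoff, which is exactly why stating the hypothesis direction up front matters.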
🧬 A/B/n Testing
Extends A/B testing to multiple variants (A, B, C…). Use an omnibus test such as ANOVA, then pairwise comparisons with corrections (like Bonferroni) so the family-wise error rate across comparisons stays controlled.
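A possible workflow sketch for an A/B/n test on a continuous metric (e.g., revenue per user): a one-way ANOVA as the omnibus test, then Bonferroni-corrected pairwise comparisons against the control. The sample sizes, means, and spreads below are invented:
```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(2)

# Hypothetical per-user revenue for three variants (all numbers are made up).
a = rng.normal(10.0, 3.0, size=2_000)
b = rng.normal(10.2, 3.0, size=2_000)
c = rng.normal(10.5, 3.0, size=2_000)

# Omnibus test: is there *any* difference among the variants?
f_stat, p_omnibus = f_oneway(a, b, c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_omnibus:.4f}")

# Pairwise follow-ups against the control, with a Bonferroni-adjusted alpha.
alpha, m = 0.05, 2  # two comparisons: B vs A, C vs A
for name, variant in [("B", b), ("C", c)]:
    _, p = ttest_ind(variant, a)
    print(f"{name} vs A: p = {p:.4f}, significant at alpha/m = {alpha/m}: {p < alpha / m}")
```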
⏳ Sequential Testing
Instead of fixing sample size upfront, sequential tests analyze data as it arrives, allowing early stopping when evidence is strong. Popular methods include SPRT (Sequential Probability Ratio Test) and Bayesian sequential analysis.
Benefit: Faster decisions without sacrificing statistical validity.
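A minimal sketch of Wald's SPRT for a conversion rate, assuming we are deciding between a baseline rate $p_0 = 5\%$ and an improved rate $p_1 = 6\%$; the error targets and the simulated data stream are illustrative choices:
```python
import numpy as np

def sprt_bernoulli(stream, p0=0.05, p1=0.06, alpha=0.05, beta=0.20):
    """Wald's SPRT: stop early once the accumulated evidence crosses a boundary."""
    upper = np.log((1 - beta) / alpha)   # decide for H1 (rate is p1)
    lower = np.log(beta / (1 - alpha))   # decide for H0 (rate is p0)
    llr = 0.0
    for i, x in enumerate(stream, start=1):
        # Log-likelihood ratio increment for one Bernoulli observation.
        llr += x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", i
        if llr <= lower:
            return "accept H0", i
    return "no decision yet", len(stream)

# Simulated conversion stream whose true rate really is 6%.
rng = np.random.default_rng(3)
decision, n_used = sprt_bernoulli(rng.binomial(1, 0.06, size=50_000))
print(decision, "after", n_used, "observations")
```
Because the stopping boundaries depend only on $\alpha$ and $\beta$, the test can end as soon as the evidence is strong, which is where the speed advantage comes from.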
⚖️ 3. False Discovery Correction
Controlling False Positives
When running many experiments, some “wins” occur purely by chance. False Discovery Rate (FDR) controls help prevent over-celebrating noise.
Common Methods:
- Bonferroni Correction: Adjusts the significance level $\alpha$ to $\alpha' = \frac{\alpha}{m}$; controls the family-wise error rate but is very strict.
- Benjamini–Hochberg (BH): Ranks $p$-values and controls the expected proportion of false positives among significant results.
If $p_{(i)}$ is the $i$th smallest p-value among $m$ tests, find largest $i$ where
$$ p_{(i)} \leq \frac{i}{m} \alpha $$
All $p_{(1)}, \dots, p_{(i)}$ are considered significant.
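A short sketch of the Benjamini–Hochberg step-up rule; the list of p-values is made up, and in practice they would come from your batch of experiments:
```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of which hypotheses are declared significant."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                          # rank p-values from smallest to largest
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.nonzero(below)[0].max()        # largest i with p_(i) <= (i/m) * alpha
        significant[order[: cutoff + 1]] = True    # everything up to that rank is significant
    return significant

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.74]
print(benjamini_hochberg(p_values))
```
With these numbers, a naive p < 0.05 cutoff would also declare the three p-values near 0.04 significant; BH keeps only the two smallest.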
🧭 4. Connection to Causal Inference
From Randomization to Independence
In causal inference, we want:
$$ (Y(1), Y(0)) \perp\!\!\!\perp \text{Treatment} \quad \text{(under randomization)} $$
This means treatment assignment is independent of potential outcomes — ensuring unbiased estimation of causal effects.
When randomization is done right:
$$ E[Y(1) - Y(0)] = E[Y|Treatment] - E[Y|Control] $$
That’s the average treatment effect (ATE) — the holy grail of experimentation.
If randomization fails, selection bias creeps in: differences may come from pre-existing conditions, not the treatment itself.
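The contrast is easy to simulate. In the sketch below, every unit has both potential outcomes and a true effect of 2 (all numbers invented); randomized assignment recovers the ATE, while letting highly engaged users self-select into the treatment does not:
```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Potential outcomes for each unit: a baseline driven by "engagement",
# plus a constant true treatment effect of 2 (illustrative numbers).
engagement = rng.normal(0, 1, size=n)
y0 = 5 + 3 * engagement + rng.normal(0, 1, size=n)   # outcome without treatment
y1 = y0 + 2                                          # outcome with treatment
true_ate = (y1 - y0).mean()

# (a) Proper randomization: assignment ignores engagement.
random_t = rng.random(n) < 0.5
est_random = y1[random_t].mean() - y0[~random_t].mean()

# (b) Broken randomization: highly engaged users opt into the treatment.
selected_t = engagement > 0
est_selected = y1[selected_t].mean() - y0[~selected_t].mean()

print(f"true ATE:                 {true_ate:.2f}")
print(f"estimate (randomized):    {est_random:.2f}")    # close to the true ATE
print(f"estimate (self-selected): {est_selected:.2f}")  # inflated by selection bias
```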
💭 Probing Question:
“What if your A/B test shows a lift, but you suspect Simpson’s paradox?”
Answer: Simpson’s paradox occurs when aggregated data hides opposite trends within subgroups.
If your A/B test shows improvement overall, but:
- Certain segments (e.g., mobile users, specific regions) perform worse,
- Or traffic was unevenly distributed,
Then your observed “lift” may not represent the true causal effect.
Fix:
- Stratify or block by confounding variables.
- Reanalyze within subgroups to ensure consistency.
- Use regression adjustment or propensity score matching if randomization was imperfect.
In essence: Always check segment-level behavior before declaring victory.
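To see why the segment check matters, here is a toy example with made-up counts: variant B looks like a huge overall win only because it happened to receive most of the high-converting desktop traffic, while it actually converts worse within every segment:
```python
# Hypothetical A/B counts (conversions, users), deliberately unbalanced by device.
data = {
    "desktop": {"A": (50, 100), "B": (432, 900)},
    "mobile":  {"A": (90, 900), "B": (9, 100)},
}

def rate(conv, n):
    return conv / n

# Aggregate rates: B looks like a big win...
for variant in ("A", "B"):
    conv = sum(data[seg][variant][0] for seg in data)
    n = sum(data[seg][variant][1] for seg in data)
    print(f"overall {variant}: {rate(conv, n):.1%}")

# ...but within every segment B actually converts worse (Simpson's paradox).
for seg in data:
    ra, rb = rate(*data[seg]["A"]), rate(*data[seg]["B"])
    print(f"{seg:8s} A: {ra:.1%}  B: {rb:.1%}")
```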
🧠 Step 4: Assumptions or Key Ideas
- Randomization ensures independence between treatment and confounders.
- Control isolates causal effects.
- Blocking reduces variance due to known heterogeneity.
- Each participant must be independent (no interference between units).
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Gold standard for establishing causality.
- Provides interpretable and actionable results.
- Scalable to online experiments and A/B testing pipelines.
Limitations:
- Expensive or slow for large-scale systems.
- Ethical or logistical limits (e.g., can’t randomize harmful treatments).
- Susceptible to hidden confounders if randomization breaks.
🚧 Step 6: Common Misunderstandings
- “A/B testing always gives the truth.” → Only if randomization and independence hold.
- “p < 0.05 means the result is practically significant.” → Not necessarily — statistical ≠ business significance.
- “Simpson’s paradox can’t happen in randomized trials.” → It still can if subgroups behave differently or randomization was imperfect.
🧩 Step 7: Mini Summary
🧠 What You Learned: Experimental design provides structure to testing causal hypotheses — using randomization, control, and blocking to ensure fairness and clarity.
⚙️ How It Works: Randomized controlled experiments isolate treatment effects and reduce confounding through careful setup.
🎯 Why It Matters: Every trustworthy A/B test, clinical trial, or policy evaluation rests on these design principles — the guardrails between correlation and causation.