2.3. Online Experimentation
Flashcards
Short Theories
Online experimentation provides a safe, controlled way to validate hypotheses before costly production rollouts.
A/B testing splits traffic into control and variation groups to compare performance.
Statistical significance ensures that observed differences are unlikely due to random chance.
Backtesting increases confidence in results by flipping control and variation roles.
Long-running A/B tests detect delayed or hidden negative impacts of system changes.
Interview Q&A
Q1: Explain the purpose of online experimentation in ML systems.
TL;DR: Online experimentation validates hypotheses safely by measuring real user impact before full rollouts.
Conceptual Explanation
Instead of blindly deploying model changes, online experiments allow controlled testing. For example, if a deeper network is hypothesized to improve engagement, we can test it incrementally.
Technical / Math Details
- Setup: Split traffic into control (A) and variation (B).
- Measure key metrics (CTR, revenue, etc.).
- Use statistical tests to check significance (a toy end-to-end sketch follows).
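A toy end-to-end sketch of this flow; the 50/50 split, the click-through rates, and the metric below are all illustrative, not a production setup:

```python
import random

random.seed(0)

def run_toy_experiment(n_users: int = 10_000) -> dict:
    """Simulate a 50/50 traffic split and collect a binary outcome per user."""
    outcomes = {"control": [], "variation": []}
    for _ in range(n_users):
        group = "variation" if random.random() < 0.5 else "control"
        # Hypothetical true click-through rates, used only to drive the simulation.
        true_ctr = 0.052 if group == "variation" else 0.050
        outcomes[group].append(1 if random.random() < true_ctr else 0)
    # Observed per-group CTR; a statistical test (see Q3) decides significance.
    return {g: sum(clicks) / len(clicks) for g, clicks in outcomes.items()}

print(run_toy_experiment())
```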
Trade-offs & Production Notes
- Pros: Risk mitigation, real-world evidence.
- Cons: Requires infrastructure, user exposure, and careful metric selection.
Common Pitfalls
- Picking the wrong success metric.
- Insufficient sample size → false conclusions.
Interview-ready Answer
“Online experimentation lets us test hypotheses safely by running controlled experiments on real users, reducing risk before full rollout.”
Q2: How does A/B testing work in practice?
TL;DR: A/B testing compares control (A) vs. variation (B) by randomly splitting users and analyzing outcome metrics.
Conceptual Explanation
It's like a clinical trial: users are randomly assigned to different versions, and their behavior is compared.
Technical / Math Details
- Control (A): baseline version.
- Variation (B): modified version.
- Key step: ensure a random, even split and measure engagement or conversions (see the assignment sketch below).
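A minimal assignment sketch, assuming users are bucketed by hashing a user ID together with an experiment name (both identifiers below are hypothetical); hashing keeps the split random-looking but stable, so the same user always sees the same version:

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, variation_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'variation'."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    position = int(digest, 16) % 10_000 / 10_000  # pseudo-uniform value in [0, 1)
    return "variation" if position < variation_share else "control"

# Sanity check: the allocation should be roughly even across many users.
groups = [assign_bucket(f"user_{i}", "ranker_v2_test") for i in range(10_000)]
print(groups.count("control"), groups.count("variation"))
```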
Trade-offs & Production Notes
- Simple and effective.
- Requires sufficient traffic to detect differences.
Common Pitfalls
- Traffic skew (biased allocation).
- Multiple testing without correction.
Interview-ready Answer
“A/B testing splits traffic between baseline and modified versions, measures user response, and uses stats to decide which wins.”
Q3: What is the role of null and alternative hypotheses in A/B testing?
TL;DR: H0 assumes no change; H1 assumes the variation has an effect; significance tests decide between them.
Conceptual Explanation
We formally test changes by defining H0 (no effect) and H1 (effect exists). Results are interpreted based on statistical tests.
Technical / Math Details
- $H_0$: No difference between control and variation.
- $H_1$: Variation significantly improves metric.
Decision rule:
- If $p \leq \alpha$: reject $H_0$ → adopt the change.
- If $p > \alpha$: fail to reject $H_0$ → keep the baseline (a worked test is sketched below).
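A sketch of this decision rule using a pooled two-proportion z-test (one standard choice for click-style metrics); the click and impression counts are made up for illustration:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_pvalue(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> float:
    """Two-sided pooled z-test for a difference in conversion rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

alpha = 0.05
p = two_proportion_pvalue(clicks_a=5_000, n_a=100_000, clicks_b=5_300, n_b=100_000)
decision = "reject H0, ship variation" if p <= alpha else "fail to reject H0, keep baseline"
print(f"p = {p:.4f}: {decision}")
```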
Trade-offs & Production Notes
- Avoids launching bad features.
- Risk of false positives/negatives.
Common Pitfalls
- Misinterpreting p-values as probability of being correct.
- Using arbitrary thresholds without context.
Interview-ready Answer
“We set H0 as no difference, H1 as positive difference, and use significance testing to decide whether to launch changes.”
Q4: Why is statistical significance important in experimentation?
TL;DR: It ensures observed improvements are unlikely due to chance.
Conceptual Explanation
Statistical significance is like a “confidence filter.” It tells us if the difference in outcomes is strong enough to believe in.
Technical / Math Details
- Significance level: $\alpha = 0.05$ (5%).
- If $p \leq 0.05$, we reject $H_0$ at the 5% significance level.
Trade-offs & Production Notes
- Higher $\alpha$ = more false positives.
- Lower $\alpha$ = more false negatives.
Common Pitfalls
- Treating significance as proof (it's probabilistic).
- P-hacking by stopping tests early.
Interview-ready Answer
“Statistical significance tells us whether observed differences are unlikely to be due to chance; typically we require p ≤ 0.05 to act.”
Q5: What is backtesting in online experimentation?
TL;DR: Backtesting flips control and variation roles to confirm if observed gains hold true.
Conceptual Explanation
If results seem too good to be true, we re-run the test with swapped roles to verify stability.
Technical / Math Details
- Original: A vs. B → B wins with +5%.
- Backtest: B vs. A → expect roughly −5%.
- If symmetry holds → results are robust (see the check below).
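A toy consistency check with made-up metric values; note that relative lifts do not reverse exactly (a +5% forward lift corresponds to roughly −4.8% with roles swapped), so the check uses a tolerance:

```python
def relative_lift(variation_metric: float, control_metric: float) -> float:
    """Relative lift of the variation's metric over the control's."""
    return (variation_metric - control_metric) / control_metric

# Forward test: B (candidate) as variation vs. A (baseline) as control.
forward = relative_lift(variation_metric=0.0525, control_metric=0.0500)   # ~ +5.0%

# Backtest: roles swapped, A as variation vs. B as control.
backtest = relative_lift(variation_metric=0.0500, control_metric=0.0525)  # ~ -4.8%

# Robust result: the backtest lift should roughly mirror the forward lift.
print(f"forward {forward:+.3f}, backtest {backtest:+.3f}, "
      f"consistent={abs(forward + backtest) <= 0.01}")
```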
Trade-offs & Production Notes
- Builds confidence in results.
- More computation and time.
Common Pitfalls
- Not accounting for seasonality or external shifts.
Interview-ready Answer
“Backtesting swaps control and variation to validate results; if gains reverse symmetrically, findings are robust.”
Q6: Why run long-term A/B tests?
TL;DR: To detect delayed or hidden negative effects not visible in short-term experiments.
Conceptual Explanation
Some changes may improve short-term metrics but harm retention or satisfaction long term. Long experiments uncover these effects.
Technical / Math Details
- Example: more ads → short-term revenue ↑, long-term retention ↓.
- Approach: continue the experiment for weeks or months (see the tracking sketch below).
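A small tracking sketch with invented weekly retention rates, chosen to show the short-term-win / long-term-loss pattern the long-running test is meant to catch:

```python
import pandas as pd

# Hypothetical weekly retention rates per group (illustrative numbers only).
weekly = pd.DataFrame({
    "week":      [1, 1, 2, 2, 3, 3, 4, 4],
    "group":     ["control", "variation"] * 4,
    "retention": [0.62, 0.64, 0.61, 0.62, 0.61, 0.59, 0.60, 0.56],
})

# A gap that starts positive and drifts negative is a delayed harm that a
# short-term readout would have missed.
pivot = weekly.pivot(index="week", columns="group", values="retention")
pivot["gap"] = pivot["variation"] - pivot["control"]
print(pivot)
```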
Trade-offs & Production Notes
- Detects sustainability of effects.
- Costly in time and resources.
Common Pitfalls
- Attrition bias if users leave mid-test.
Interview-ready Answer
“Long-term A/B tests help uncover delayed negative effects, ensuring improvements are sustainable over time.”
Key Formulas
P-value significance test
Decision rule: reject $H_0$ if $p \leq \alpha$, otherwise fail to reject.
- $p$: probability of observing a result at least as extreme as the one seen, assuming $H_0$ is true.
- $\alpha$: significance threshold (e.g., 0.05).
Interpretation: If $p \leq \alpha$, the effect is statistically significant.
Sample Size Estimation (simplified)
$n \approx \dfrac{2\,(Z_{\alpha/2} + Z_{\beta})^2\,\sigma^2}{\Delta^2}$ per group, where:
- $Z_{\alpha/2}$: critical value for the significance level.
- $Z_{\beta}$: critical value for the desired power.
- $\sigma^2$: variance of the metric.
- $\Delta$: minimum detectable effect size.
Interpretation: Larger variance or a smaller detectable effect requires a larger sample size (computed in the sketch below).
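A small helper that evaluates this formula, assuming a two-sided test and a normal approximation; the inputs in the example call are illustrative:

```python
import math
from scipy.stats import norm

def sample_size_per_group(sigma: float, delta: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """n = 2 * (Z_{alpha/2} + Z_beta)^2 * sigma^2 / delta^2, rounded up."""
    z_alpha = norm.ppf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # ~0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)

# Example: detect a 0.5-point lift on a metric with standard deviation 10.
print(sample_size_per_group(sigma=10.0, delta=0.5))  # about 6,280 users per group
```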
Cheatsheet
- Online Experimentation: Controlled validation of ML hypotheses.
- A/B Testing: Compare control vs. variation on real users.
- H0 vs H1: H0 = no effect, H1 = effect exists.
- p-value & α: Decide statistical significance (commonly α = 0.05).
- Backtesting: Swap roles to confirm robustness.
- Long-term Testing: Detect delayed impacts on retention and engagement.