2.3. Online Experimentation
Flashcards
Short Theories
Online experimentation provides a safe, controlled way to validate hypotheses before costly production rollouts.
A/B testing splits traffic into control and variation groups to compare performance.
Statistical significance ensures that observed differences are unlikely due to random chance.
Backtesting increases confidence in results by flipping control and variation roles.
Long-running A/B tests detect delayed or hidden negative impacts of system changes.
Interview Q&A
Q1: Explain the purpose of online experimentation in ML systems.
TL;DR: Online experimentation validates hypotheses safely by measuring real user impact before full rollouts.
Conceptual Explanation
Instead of blindly deploying model changes, online experiments allow controlled testing. For example, if a deeper network is hypothesized to improve engagement, we can test it incrementally.
Technical / Math Details
- Setup: Split traffic into control (A) and variation (B).
- Measure key metrics (CTR, revenue, etc.).
- Use statistical tests to check significance (a toy end-to-end sketch follows).
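A toy end-to-end sketch of this flow; the 50/50 split, the click-through rates, and the metric below are all illustrative, not a production setup:

```python
import random

random.seed(0)

def run_toy_experiment(n_users: int = 10_000) -> dict:
    """Simulate a 50/50 traffic split and collect a binary outcome per user."""
    outcomes = {"control": [], "variation": []}
    for _ in range(n_users):
        group = "variation" if random.random() < 0.5 else "control"
        # Hypothetical true click-through rates, used only to drive the simulation.
        true_ctr = 0.052 if group == "variation" else 0.050
        outcomes[group].append(1 if random.random() < true_ctr else 0)
    # Observed per-group CTR; a statistical test (see Q3) decides significance.
    return {g: sum(clicks) / len(clicks) for g, clicks in outcomes.items()}

print(run_toy_experiment())
```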
Trade-offs & Production Notes
- Pros: Risk mitigation, real-world evidence.
- Cons: Requires infrastructure, user exposure, and careful metric selection.
Common Pitfalls
- Picking the wrong success metric.
- Insufficient sample size → false conclusions.
Interview-ready Answer
“Online experimentation lets us test hypotheses safely by running controlled experiments on real users, reducing risk before full rollout.”
Q2: How does A/B testing work in practice?
TL;DR: A/B testing compares control (A) vs. variation (B) by randomly splitting users and analyzing outcome metrics.
Conceptual Explanation
It's like a clinical trial: users are randomly assigned to different versions, and their behavior is compared.
Technical / Math Details
- Control (A): baseline version.
- Variation (B): modified version.
- Key step: ensure a random, even split and measure engagement or conversions (see the assignment sketch below).
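A minimal assignment sketch, assuming users are bucketed by hashing a user ID together with an experiment name (both identifiers below are hypothetical); hashing keeps the split random-looking but stable, so the same user always sees the same version:

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, variation_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'variation'."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    position = int(digest, 16) % 10_000 / 10_000  # pseudo-uniform value in [0, 1)
    return "variation" if position < variation_share else "control"

# Sanity check: the allocation should be roughly even across many users.
groups = [assign_bucket(f"user_{i}", "ranker_v2_test") for i in range(10_000)]
print(groups.count("control"), groups.count("variation"))
```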
Trade-offs & Production Notes
- Simple and effective.
- Requires sufficient traffic to detect differences.
Common Pitfalls
- Traffic skew (biased allocation).
- Multiple testing without correction.
Interview-ready Answer
“A/B testing splits traffic between baseline and modified versions, measures user response, and uses stats to decide which wins.”
Q3: What is the role of null and alternative hypotheses in A/B testing?
TL;DR: H0 assumes no change; H1 assumes the variation has an effect; significance tests decide between them.
Conceptual Explanation
We formally test changes by defining H0 (no effect) and H1 (effect exists). Results are interpreted based on statistical tests.
Technical / Math Details
- $H_0$: No difference between control and variation.
- $H_1$: Variation significantly improves metric.
Decision rule:
- If $p \leq \alpha$: reject $H_0$ → adopt the change.
- If $p > \alpha$: fail to reject $H_0$ → keep the baseline (a worked test is sketched below).
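A sketch of this decision rule using a pooled two-proportion z-test (one standard choice for click-style metrics); the click and impression counts are made up for illustration:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_pvalue(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> float:
    """Two-sided pooled z-test for a difference in conversion rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

alpha = 0.05
p = two_proportion_pvalue(clicks_a=5_000, n_a=100_000, clicks_b=5_300, n_b=100_000)
decision = "reject H0, ship variation" if p <= alpha else "fail to reject H0, keep baseline"
print(f"p = {p:.4f}: {decision}")
```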
Trade-offs & Production Notes
- Avoids launching bad features.
- Risk of false positives/negatives.
Common Pitfalls
- Misinterpreting p-values as probability of being correct.
- Using arbitrary thresholds without context.
Interview-ready Answer
“We set H0 as no difference, H1 as positive difference, and use significance testing to decide whether to launch changes.”
Q4: Why is statistical significance important in experimentation?
TL;DR: It ensures observed improvements are unlikely due to chance.
Conceptual Explanation
Statistical significance is like a “confidence filter.” It tells us if the difference in outcomes is strong enough to believe in.
Technical / Math Details
- Significance level: $\alpha = 0.05$ (5%).
- If $p \leq 0.05$, we reject $H_0$ at the 5% significance level.
Trade-offs & Production Notes
- Higher $\alpha$ = more false positives.
- Lower $\alpha$ = more false negatives.
Common Pitfalls
- Treating significance as proof (it's probabilistic).
- P-hacking by stopping tests early.
Interview-ready Answer
“Statistical significance tells us whether observed differences are unlikely to be due to chance; typically we require p ≤ 0.05 to act.”
Q5: What is backtesting in online experimentation?
TL;DR: Backtesting flips control and variation roles to confirm if observed gains hold true.
Conceptual Explanation
If results seem too good to be true, we re-run the test with swapped roles to verify stability.
Technical / Math Details
- Original: A vs. B → B wins with +5%.
- Backtest: B vs. A → expect roughly −5%.
- If symmetry holds → results are robust (see the check below).
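A toy consistency check with made-up metric values; note that relative lifts do not reverse exactly (a +5% forward lift corresponds to roughly −4.8% with roles swapped), so the check uses a tolerance:

```python
def relative_lift(variation_metric: float, control_metric: float) -> float:
    """Relative lift of the variation's metric over the control's."""
    return (variation_metric - control_metric) / control_metric

# Forward test: B (candidate) as variation vs. A (baseline) as control.
forward = relative_lift(variation_metric=0.0525, control_metric=0.0500)   # ~ +5.0%

# Backtest: roles swapped, A as variation vs. B as control.
backtest = relative_lift(variation_metric=0.0500, control_metric=0.0525)  # ~ -4.8%

# Robust result: the backtest lift should roughly mirror the forward lift.
print(f"forward {forward:+.3f}, backtest {backtest:+.3f}, "
      f"consistent={abs(forward + backtest) <= 0.01}")
```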
Trade-offs & Production Notes
- Builds confidence in results.
- More computation and time.
Common Pitfalls
- Not accounting for seasonality or external shifts.
Interview-ready Answer
“Backtesting swaps control and variation to validate results; if gains reverse symmetrically, findings are robust.”
Q6: Why run long-term A/B tests?
TL;DR: To detect delayed or hidden negative effects not visible in short-term experiments.
Conceptual Explanation
Some changes may improve short-term metrics but harm retention or satisfaction long term. Long experiments uncover these effects.
Technical / Math Details
- Example: more ads → short-term revenue ↑, long-term retention ↓.
- Approach: continue the experiment for weeks or months (see the tracking sketch below).
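A small tracking sketch with invented weekly retention rates, chosen to show the short-term-win / long-term-loss pattern the long-running test is meant to catch:

```python
import pandas as pd

# Hypothetical weekly retention rates per group (illustrative numbers only).
weekly = pd.DataFrame({
    "week":      [1, 1, 2, 2, 3, 3, 4, 4],
    "group":     ["control", "variation"] * 4,
    "retention": [0.62, 0.64, 0.61, 0.62, 0.61, 0.59, 0.60, 0.56],
})

# A gap that starts positive and drifts negative is a delayed harm that a
# short-term readout would have missed.
pivot = weekly.pivot(index="week", columns="group", values="retention")
pivot["gap"] = pivot["variation"] - pivot["control"]
print(pivot)
```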
Trade-offs & Production Notes
- Detects sustainability of effects.
- Costly in time and resources.
Common Pitfalls
- Attrition bias if users leave mid-test.
Interview-ready Answer
“Long-term A/B tests help uncover delayed negative effects, ensuring improvements are sustainable over time.”
Key Formulas
P-value significance test
Decision rule: reject $H_0$ if $p \leq \alpha$, otherwise fail to reject.
- $p$: probability of observing a result at least as extreme as the one seen, assuming $H_0$ is true.
- $\alpha$: significance threshold (e.g., 0.05).
Interpretation: If $p \leq \alpha$, the effect is statistically significant.
Sample Size Estimation (simplified)
$n \approx \dfrac{2\,(Z_{\alpha/2} + Z_{\beta})^2\,\sigma^2}{\Delta^2}$ per group, where:
- $Z_{\alpha/2}$: critical value for the significance level.
- $Z_{\beta}$: critical value for the desired power.
- $\sigma^2$: variance of the metric.
- $\Delta$: minimum detectable effect size.
Interpretation: Larger variance or a smaller detectable effect requires a larger sample size (computed in the sketch below).
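A small helper that evaluates this formula, assuming a two-sided test and a normal approximation; the inputs in the example call are illustrative:

```python
import math
from scipy.stats import norm

def sample_size_per_group(sigma: float, delta: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """n = 2 * (Z_{alpha/2} + Z_beta)^2 * sigma^2 / delta^2, rounded up."""
    z_alpha = norm.ppf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # ~0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)

# Example: detect a 0.5-point lift on a metric with standard deviation 10.
print(sample_size_per_group(sigma=10.0, delta=0.5))  # about 6,280 users per group
```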
Cheatsheet
- Online Experimentation: Controlled validation of ML hypotheses.
- A/B Testing: Compare control vs. variation on real users.
- H0 vs H1: H0 = no effect, H1 = effect exists.
- p-value & α: Decide statistical significance (commonly α = 0.05).
- Backtesting: Swap roles to confirm robustness.
- Long-term Testing: Detect delayed impacts on retention and engagement.