2.2. Training Data Collection Strategies
Flashcards
Short Theories
Training data quality is more important than algorithm sophistication; poor data cannot be fixed by fancy models.
Online collection leverages user interactions, while offline relies on human annotation and specialized effort.
Open-source datasets supplement costly manual labeling but may not fully cover domain-specific gaps.
Validation data reveals generalization ability, preventing overfitting hidden by training error.
Data bias, if unchecked, leads to systemic issues like "rich-get-richer" feedback loops.
Bootstrapping helps overcome the cold start problem for new items in recommendation/ads systems.
Interview Q&A
Q1: Why is training data often said to be more important than the model itself?
TL;DR: High-quality data beats a fancy algorithm with poor data.
Conceptual Explanation
The model's learning is only as good as its input: garbage in, garbage out. Even state-of-the-art neural networks fail if the training data is biased, noisy, or insufficient.
Technical / Math Details
- Given a model $f_\theta(x)$, parameters $\theta$ are learned to minimize loss $L(y, f_\theta(x))$ over dataset $D$.
- If $D$ is flawed (wrong $y$, skewed $x$), no optimization will yield meaningful $\theta$.
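A minimal sketch, assuming scikit-learn, that makes "garbage in, garbage out" concrete: the same model is trained once on clean labels and once on labels with 30% random flips (an illustrative corruption rate), then scored on the same held-out set.

```python
# Same model, clean vs. corrupted training labels (flip rate is illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Corrupt 30% of the training labels to simulate a low-quality dataset D.
rng = np.random.default_rng(0)
flip = rng.random(len(y_tr)) < 0.3
y_noisy = np.where(flip, 1 - y_tr, y_tr)

clean_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
noisy_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy).score(X_te, y_te)
print(f"clean labels: {clean_acc:.3f}  noisy labels: {noisy_acc:.3f}")
```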
Trade-offs & Production Notes
- Better data often cheaper than more compute.
- Domain-specific labeling critical for edge cases.
Common Pitfalls
- Blindly scaling data quantity without cleaning.
- Ignoring dataset shift in production.
Interview-ready Answer
"Training data quality directly determines a model's ceiling. A simple model with great data usually outperforms a complex model trained on bad data."
Q2: What are online and offline data collection strategies, and when do you use each?
TL;DR: Online = user interactions; Offline = human labelers.
Conceptual Explanation
- Online: Collects data from user behavior with an existing system.
- Offline: Requires explicit labeling by humans when interactions don't generate labels.
Technical / Math Details
- Online: click-through logs, engagement metrics → positive/negative signals.
- Offline: manual annotation (e.g., bounding boxes for objects).
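A small sketch of how online collection turns interaction logs into weak labels with no annotator in the loop (pandas assumed; the column names and 30-second dwell threshold are illustrative).

```python
# Derive implicit labels from interaction logs (schema is illustrative).
import pandas as pd

logs = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "item_id": [10, 11, 10, 12, 11],
    "clicked": [1, 0, 0, 1, 0],
    "dwell_s": [45, 2, 1, 120, 3],   # seconds spent after the impression
})

# Online labeling: a click or long dwell becomes a positive, a skipped
# impression becomes a (noisy) negative. No human annotation involved.
logs["label"] = ((logs["clicked"] == 1) | (logs["dwell_s"] > 30)).astype(int)
train = logs[["user_id", "item_id", "label"]]
print(train)
```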
Trade-offs & Production Notes
- Online: cheap, scalable, but noisy/bias-prone.
- Offline: expensive, slow, but high-quality and task-specific.
Common Pitfalls
- Misinterpreting implicit signals (e.g., ignoring an item ≠ disliking it).
- Over-reliance on crowdsourcing for specialized tasks.
Interview-ready Answer
"Online leverages implicit user behavior, offline uses explicit human labeling. Online is scalable but biased; offline is costly but precise."
Q3: How do you handle bias in training data collected from user engagement?
TL;DR: Use exploration strategies to collect unbiased signals.
Conceptual Explanation
Popular-first recommendation introduces feedback loops. To avoid this, sample from the full pool, including low-ranked items, and track user engagement.
Technical / Math Details
- Bias = non-representative sampling from item space $I$.
- Exploration ensures $P(\text{item } i \text{ is shown}) > 0$ for all $i \in I$.
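A sketch of epsilon-greedy exploration under these definitions (the $\epsilon$ value and the scoring function are placeholders): most slots exploit the ranker, while a small share is sampled uniformly so every item in $I$ keeps a non-zero probability of being shown.

```python
# Epsilon-greedy slate construction (epsilon and score_fn are illustrative).
import random

def choose_slate(candidates, score_fn, k=10, epsilon=0.05, seed=0):
    """Return k items: mostly top-ranked, plus a small uniform-random exploratory share."""
    rng = random.Random(seed)
    ranked = sorted(candidates, key=score_fn, reverse=True)
    slate = []
    for _ in range(k):
        remaining = [i for i in candidates if i not in slate]
        if rng.random() < epsilon:
            # Exploration slot: any remaining item can be shown.
            slate.append(rng.choice(remaining))
        else:
            # Exploitation slot: highest-scoring item not already in the slate.
            slate.append(next(i for i in ranked if i not in slate))
    return slate

items = list(range(100))
print(choose_slate(items, score_fn=lambda i: -i, epsilon=0.1))  # lower ids rank higher here
```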
Trade-offs & Production Notes
- Exploration reduces short-term click-through but improves long-term model fairness.
- Needs careful traffic allocation (e.g., 5% randomized).
Common Pitfalls
- Ignoring bias leads to "winner-takes-all" dynamics.
- Exploration without control harms business KPIs.
Interview-ready Answer
"We randomize a fraction of recommendations to gather unbiased signals, preventing feedback loops and improving long-term generalization."
Q4: What are training, validation, and test sets, and why do we need all three?
TL;DR: Train to fit, validate to tune, test to evaluate final performance.
Conceptual Explanation
Splitting ensures that model performance is assessed on unseen data, preventing overfitting to the training set or to hyperparameters.
Technical / Math Details
- Training set: optimize parameters $\theta$.
- Validation set: choose hyperparameters $\lambda$.
- Test set: evaluate unbiased generalization.
Trade-offs & Production Notes
- Typical splits: 70/15/15 or 60/20/20.
- In temporal data, splits must follow time order.
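A sketch of both split styles mentioned above, assuming pandas/scikit-learn and the 70/15/15 proportions: a random split for i.i.d. data and a chronological split so validation and test lie strictly after the training period.

```python
# Random 70/15/15 split vs. time-ordered split (toy frame; columns are illustrative).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(1000),
                   "y": [i % 2 for i in range(1000)],
                   "ts": pd.date_range("2024-01-01", periods=1000, freq="h")})

# i.i.d. data: random split.
train, rest = train_test_split(df, test_size=0.30, random_state=0)
val, test = train_test_split(rest, test_size=0.50, random_state=0)

# Temporal data: split by timestamp so validation/test are strictly in the future.
df_sorted = df.sort_values("ts")
n = len(df_sorted)
train_t = df_sorted[: int(0.70 * n)]
val_t = df_sorted[int(0.70 * n): int(0.85 * n)]
test_t = df_sorted[int(0.85 * n):]
print(len(train), len(val), len(test), len(train_t), len(val_t), len(test_t))
```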
Common Pitfalls
- Using test data for tuning → data leakage.
- Ignoring time-series structure.
Interview-ready Answer
"Train fits parameters, validation tunes hyperparameters, and test provides unbiased generalization. Using all three avoids overfitting and leakage."
Q5: How can GANs or augmentation help expand training data?
TL;DR: GANs/augmentation create synthetic but realistic variants to cover data gaps.
Conceptual Explanation
When data is scarce in certain conditions (e.g., rainy driving images), augmentation artificially generates examples to balance distribution.
Technical / Math Details
- GAN objective:
$$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}} [\log D(x)] + \mathbb{E}_{z \sim p_z} [\log (1 - D(G(z)))] $$
- $G$: generator, $D$: discriminator.
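A minimal PyTorch sketch of one adversarial update that maps directly onto this objective (the tiny MLP shapes, learning rates, and random stand-in data are illustrative, not a recommended setup).

```python
# One GAN update: discriminator step, then generator step (toy shapes).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))            # z -> fake x
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCELoss()

real_x = torch.randn(128, 32)   # stand-in for a real data batch x ~ p_data
z = torch.randn(128, 16)        # noise z ~ p_z

# Discriminator step: maximize log D(x) + log(1 - D(G(z))), written as a BCE minimization.
d_loss = bce(D(real_x), torch.ones(128, 1)) + bce(D(G(z).detach()), torch.zeros(128, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: the common non-saturating variant "maximize log D(G(z))",
# i.e. label the fakes as real when computing the generator's loss.
g_loss = bce(D(G(z)), torch.ones(128, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
print(float(d_loss), float(g_loss))
```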
Trade-offs & Production Notes
- Reduces cost of manual labeling.
- Risk: synthetic data may not fully capture real-world edge cases.
Common Pitfalls
- Over-reliance on GANs leading to unrealistic data artifacts.
Interview-ready Answer
"GANs and augmentation generate synthetic variants to fill gaps (e.g., rain/night images), improving robustness without costly labeling."
Q6: What is the cold start problem, and how do we mitigate it?
TL;DR: New items lack engagement signals; boost them artificially using similarity or adjusted relevance.
Conceptual Explanation
Cold start arises when new items or users have no prior data. Models trained on engagement signals can't rank unseen items well.
Technical / Math Details
- Boosting: artificially increase $score(i)$ for new item $i$ to ensure visibility.
- Similarity-based initialization:
$$ score(i_{new}) = \alpha \cdot sim(i_{new}, i_{known}) + \beta $$
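A sketch of this similarity-based initialization, assuming items have content embeddings available; the $\alpha$ and $\beta$ values here are illustrative defaults, not tuned.

```python
# Score a brand-new item from its nearest known item (embeddings assumed available).
import numpy as np

def cold_start_score(new_emb, known_embs, alpha=0.8, beta=0.1):
    """score(i_new) = alpha * cosine_sim(i_new, nearest known item) + beta."""
    sims = known_embs @ new_emb / (
        np.linalg.norm(known_embs, axis=1) * np.linalg.norm(new_emb) + 1e-9)
    return alpha * float(sims.max()) + beta

rng = np.random.default_rng(0)
known_embs = rng.normal(size=(100, 32))   # embeddings of items with history
new_item = rng.normal(size=32)            # new item with no engagement yet
print(cold_start_score(new_item, known_embs))
```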
Trade-offs & Production Notes
- Boosting ensures discovery but may degrade short-term relevance.
- Needs balance between exploration and user satisfaction.
Common Pitfalls
- Over-boosting irrelevant items.
- Ignoring user cold start alongside item cold start.
Interview-ready Answer
"The cold start problem means new items lack signals. We mitigate it by boosting their scores or using similarity to known items until data accumulates."
Key Formulas
Cross-Entropy Loss
$$ L = -\sum_{i} y_i \log \hat{y}_i $$
- $y_i$: true label (one-hot encoded)
- $\hat{y}_i$: predicted probability for class $i$
Interpretation: Penalizes confident wrong predictions; standard for classification tasks.
Softmax Function
$$ \hat{y}_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} $$
- $z_i$: logit for class $i$
Interpretation: Converts raw logits into a normalized probability distribution.
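A short NumPy sketch of both formulas on toy logits (subtracting the max logit is only a numerical-stability trick, not part of the definition).

```python
# Softmax + cross-entropy on toy logits.
import numpy as np

def softmax(z):
    z = z - z.max()              # shifting logits does not change the result
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_onehot, y_hat):
    return -np.sum(y_onehot * np.log(y_hat + 1e-12))

logits = np.array([2.0, 0.5, -1.0])      # z_i
probs = softmax(logits)                  # \hat{y}_i
y = np.array([1.0, 0.0, 0.0])            # one-hot true label
print(probs, cross_entropy(y, probs))    # confident and correct -> low loss
```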
Cheatsheet
- Online Data: scalable, cheap, but noisy/bias-prone.
- Offline Data: costly, slower, but precise.
- Augmentation/GANs: expand underrepresented scenarios.
- Splits: Train = fit, Validation = tune, Test = final check.
- Bias Mitigation: random exploration to avoid feedback loops.
- Cold Start: boost or initialize by similarity.