2.2. Training Data Collection Strategies
Flashcards
Short Theories
Training data quality is more important than algorithm sophistication; poor data cannot be fixed by fancy models.
Online collection leverages user interactions, while offline relies on human annotation and specialized effort.
Open-source datasets supplement costly manual labeling but may not fully cover domain-specific gaps.
Validation data reveals generalization ability, preventing overfitting hidden by training error.
Data bias, if unchecked, leads to systemic issues like "rich-get-richer" feedback loops.
Bootstrapping helps overcome the cold start problem for new items in recommendation/ads systems.
Interview Q&A
Q1: Why is training data often said to be more important than the model itself?
TL;DR: High-quality data beats a fancy algorithm with poor data.
Conceptual Explanation
The model's learning is only as good as its input: garbage in, garbage out. Even state-of-the-art neural networks fail if the training data is biased, noisy, or insufficient.
Technical / Math Details
- Given a model $f_\theta(x)$, parameters $\theta$ are learned to minimize loss $L(y, f_\theta(x))$ over dataset $D$.
- If $D$ is flawed (wrong $y$, skewed $x$), no optimization will yield meaningful $\theta$.
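A minimal sketch, assuming scikit-learn, that makes "garbage in, garbage out" concrete: the same model is trained once on clean labels and once on labels with 30% random flips (an illustrative corruption rate), then scored on the same held-out set.

```python
# Same model, clean vs. corrupted training labels (flip rate is illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Corrupt 30% of the training labels to simulate a low-quality dataset D.
rng = np.random.default_rng(0)
flip = rng.random(len(y_tr)) < 0.3
y_noisy = np.where(flip, 1 - y_tr, y_tr)

clean_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
noisy_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy).score(X_te, y_te)
print(f"clean labels: {clean_acc:.3f}  noisy labels: {noisy_acc:.3f}")
```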
Trade-offs & Production Notes
- Better data often cheaper than more compute.
- Domain-specific labeling critical for edge cases.
Common Pitfalls
- Blindly scaling data quantity without cleaning.
- Ignoring dataset shift in production.
Interview-ready Answer
"Training data quality directly determines a model's ceiling. A simple model with great data usually outperforms a complex model trained on bad data."
Q2: What are online and offline data collection strategies, and when do you use each?
TL;DR: Online = user interactions; Offline = human labelers.
Conceptual Explanation
- Online: Collects data from user behavior with an existing system.
- Offline: Requires explicit labeling by humans when interactions don't generate labels.
Technical / Math Details
- Online: click-through logs, engagement metrics → positive/negative signals.
- Offline: manual annotation (e.g., bounding boxes for objects).
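A small sketch of how online collection turns interaction logs into weak labels with no annotator in the loop (pandas assumed; the column names and 30-second dwell threshold are illustrative).

```python
# Derive implicit labels from interaction logs (schema is illustrative).
import pandas as pd

logs = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "item_id": [10, 11, 10, 12, 11],
    "clicked": [1, 0, 0, 1, 0],
    "dwell_s": [45, 2, 1, 120, 3],   # seconds spent after the impression
})

# Online labeling: a click or long dwell becomes a positive, a skipped
# impression becomes a (noisy) negative. No human annotation involved.
logs["label"] = ((logs["clicked"] == 1) | (logs["dwell_s"] > 30)).astype(int)
train = logs[["user_id", "item_id", "label"]]
print(train)
```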
Trade-offs & Production Notes
- Online: cheap, scalable, but noisy/bias-prone.
- Offline: expensive, slow, but high-quality and task-specific.
Common Pitfalls
- Misinterpreting implicit signals (e.g., ignoring an item ≠ disliking it).
- Over-reliance on crowdsourcing for specialized tasks.
Interview-ready Answer
"Online leverages implicit user behavior, offline uses explicit human labeling. Online is scalable but biased; offline is costly but precise."
Q3: How do you handle bias in training data collected from user engagement?
TL;DR: Use exploration strategies to collect unbiased signals.
Conceptual Explanation
Popular-first recommendation introduces feedback loops. To avoid this, sample from the full pool, including low-ranked items, and track user engagement.
Technical / Math Details
- Bias = non-representative sampling from item space $I$.
- Exploration ensures $P(\text{item } i \text{ is shown}) > 0$ for all $i \in I$.
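A sketch of epsilon-greedy exploration under these definitions (the $\epsilon$ value and the scoring function are placeholders): most slots exploit the ranker, while a small share is sampled uniformly so every item in $I$ keeps a non-zero probability of being shown.

```python
# Epsilon-greedy slate construction (epsilon and score_fn are illustrative).
import random

def choose_slate(candidates, score_fn, k=10, epsilon=0.05, seed=0):
    """Return k items: mostly top-ranked, plus a small uniform-random exploratory share."""
    rng = random.Random(seed)
    ranked = sorted(candidates, key=score_fn, reverse=True)
    slate = []
    for _ in range(k):
        remaining = [i for i in candidates if i not in slate]
        if rng.random() < epsilon:
            # Exploration slot: any remaining item can be shown.
            slate.append(rng.choice(remaining))
        else:
            # Exploitation slot: highest-scoring item not already in the slate.
            slate.append(next(i for i in ranked if i not in slate))
    return slate

items = list(range(100))
print(choose_slate(items, score_fn=lambda i: -i, epsilon=0.1))  # lower ids rank higher here
```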
Trade-offs & Production Notes
- Exploration reduces short-term click-through but improves long-term model fairness.
- Needs careful traffic allocation (e.g., 5% randomized).
Common Pitfalls
- Ignoring bias leads to "winner-takes-all" dynamics.
- Exploration without control harms business KPIs.
Interview-ready Answer
"We randomize a fraction of recommendations to gather unbiased signals, preventing feedback loops and improving long-term generalization."
Q4: What are training, validation, and test sets, and why do we need all three?
TL;DR: Train to fit, validate to tune, test to evaluate final performance.
Conceptual Explanation
Splitting ensures that model performance is assessed on unseen data, preventing overfitting to the training set or to hyperparameters.
Technical / Math Details
- Training set: optimize parameters $\theta$.
- Validation set: choose hyperparameters $\lambda$.
- Test set: evaluate unbiased generalization.
Trade-offs & Production Notes
- Typical splits: 70/15/15 or 60/20/20.
- In temporal data, splits must follow time order.
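A sketch of both split styles mentioned above, assuming pandas/scikit-learn and the 70/15/15 proportions: a random split for i.i.d. data and a chronological split so validation and test lie strictly after the training period.

```python
# Random 70/15/15 split vs. time-ordered split (toy frame; columns are illustrative).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(1000),
                   "y": [i % 2 for i in range(1000)],
                   "ts": pd.date_range("2024-01-01", periods=1000, freq="h")})

# i.i.d. data: random split.
train, rest = train_test_split(df, test_size=0.30, random_state=0)
val, test = train_test_split(rest, test_size=0.50, random_state=0)

# Temporal data: split by timestamp so validation/test are strictly in the future.
df_sorted = df.sort_values("ts")
n = len(df_sorted)
train_t = df_sorted[: int(0.70 * n)]
val_t = df_sorted[int(0.70 * n): int(0.85 * n)]
test_t = df_sorted[int(0.85 * n):]
print(len(train), len(val), len(test), len(train_t), len(val_t), len(test_t))
```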
Common Pitfalls
- Using test data for tuning → data leakage.
- Ignoring time-series structure.
Interview-ready Answer
"Train fits parameters, validation tunes hyperparameters, and test provides unbiased generalization. Using all three avoids overfitting and leakage."
Q5: How can GANs or augmentation help expand training data?
TL;DR: GANs/augmentation create synthetic but realistic variants to cover data gaps.
Conceptual Explanation
When data is scarce in certain conditions (e.g., rainy driving images), augmentation artificially generates examples to balance distribution.
Technical / Math Details
- GAN objective:
$$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}} [\log D(x)] + \mathbb{E}_{z \sim p_z} [\log (1 - D(G(z)))] $$
- $G$: generator, $D$: discriminator.
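A minimal PyTorch sketch of one adversarial update that maps directly onto this objective (the tiny MLP shapes, learning rates, and random stand-in data are illustrative, not a recommended setup).

```python
# One GAN update: discriminator step, then generator step (toy shapes).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))            # z -> fake x
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCELoss()

real_x = torch.randn(128, 32)   # stand-in for a real data batch x ~ p_data
z = torch.randn(128, 16)        # noise z ~ p_z

# Discriminator step: maximize log D(x) + log(1 - D(G(z))), written as a BCE minimization.
d_loss = bce(D(real_x), torch.ones(128, 1)) + bce(D(G(z).detach()), torch.zeros(128, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: the common non-saturating variant "maximize log D(G(z))",
# i.e. label the fakes as real when computing the generator's loss.
g_loss = bce(D(G(z)), torch.ones(128, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
print(float(d_loss), float(g_loss))
```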
Trade-offs & Production Notes
- Reduces cost of manual labeling.
- Risk: synthetic data may not fully capture real-world edge cases.
Common Pitfalls
- Over-reliance on GANs leading to unrealistic data artifacts.
Interview-ready Answer
"GANs and augmentation generate synthetic variants to fill gaps (e.g., rain/night images), improving robustness without costly labeling."
Q6: What is the cold start problem, and how do we mitigate it?
TL;DR: New items lack engagement signals; boost them artificially using similarity or adjusted relevance.
Conceptual Explanation
Cold start arises when new items or users have no prior data. Models trained on engagement signals can't rank unseen items well.
Technical / Math Details
- Boosting: artificially increase $score(i)$ for new item $i$ to ensure visibility.
- Similarity-based initialization:
$$ score(i_{new}) = \alpha \cdot sim(i_{new}, i_{known}) + \beta $$
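A sketch of this similarity-based initialization, assuming items have content embeddings available; the $\alpha$ and $\beta$ values here are illustrative defaults, not tuned.

```python
# Score a brand-new item from its nearest known item (embeddings assumed available).
import numpy as np

def cold_start_score(new_emb, known_embs, alpha=0.8, beta=0.1):
    """score(i_new) = alpha * cosine_sim(i_new, nearest known item) + beta."""
    sims = known_embs @ new_emb / (
        np.linalg.norm(known_embs, axis=1) * np.linalg.norm(new_emb) + 1e-9)
    return alpha * float(sims.max()) + beta

rng = np.random.default_rng(0)
known_embs = rng.normal(size=(100, 32))   # embeddings of items with history
new_item = rng.normal(size=32)            # new item with no engagement yet
print(cold_start_score(new_item, known_embs))
```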
Trade-offs & Production Notes
- Boosting ensures discovery but may degrade short-term relevance.
- Needs balance between exploration and user satisfaction.
Common Pitfalls
- Over-boosting irrelevant items.
- Ignoring user cold start alongside item cold start.
Interview-ready Answer
"The cold start problem means new items lack signals. We mitigate it by boosting their scores or using similarity to known items until data accumulates."
Key Formulas
Cross-Entropy Loss
$$ L = -\sum_{i} y_i \log \hat{y}_i $$
- $y_i$: true label (one-hot encoded)
- $\hat{y}_i$: predicted probability for class $i$
Interpretation: Penalizes confident wrong predictions; standard for classification tasks.
Softmax Function
$$ \hat{y}_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} $$
- $z_i$: logit for class $i$
Interpretation: Converts raw logits into a normalized probability distribution.
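A short NumPy sketch of both formulas on toy logits (subtracting the max logit is only a numerical-stability trick, not part of the definition).

```python
# Softmax + cross-entropy on toy logits.
import numpy as np

def softmax(z):
    z = z - z.max()              # shifting logits does not change the result
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_onehot, y_hat):
    return -np.sum(y_onehot * np.log(y_hat + 1e-12))

logits = np.array([2.0, 0.5, -1.0])      # z_i
probs = softmax(logits)                  # \hat{y}_i
y = np.array([1.0, 0.0, 0.0])            # one-hot true label
print(probs, cross_entropy(y, probs))    # confident and correct -> low loss
```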
Cheatsheet
- Online Data: scalable, cheap, but noisy/bias-prone.
- Offline Data: costly, slower, but precise.
- Augmentation/GANs: expand underrepresented scenarios.
- Splits: Train = fit, Validation = tune, Test = final check.
- Bias Mitigation: random exploration to avoid feedback loops.
- Cold Start: boost or initialize by similarity.