2.2. Training Data Collection Strategies


⚡ Short Theories

Training data quality is more important than algorithm sophistication; poor data cannot be fixed by fancy models.

Online collection leverages user interactions with a live system, while offline collection relies on human annotation and specialized labeling effort.

Open-source datasets supplement costly manual labeling but may not fully cover domain-specific gaps.

Validation data reveals a model's generalization ability, exposing overfitting that training error alone would hide.

Data bias, if unchecked, leads to systemic issues like "rich-get-richer" feedback loops.

Bootstrapping helps overcome the cold start problem for new items in recommendation/ads systems.


🎤 Interview Q&A

Q1: Why is training data often said to be more important than the model itself?

🎯 TL;DR: High-quality data with a simple model beats a sophisticated algorithm trained on poor data.


🌱 Conceptual Explanation

A model's learning is only as good as its input: garbage in → garbage out. Even state-of-the-art neural networks fail if the training data is biased, noisy, or insufficient.

📐 Technical / Math Details

  • Given a model $f_\theta(x)$, parameters $\theta$ are learned to minimize the loss $L(y, f_\theta(x))$ over dataset $D$.
  • If $D$ is flawed (wrong $y$, skewed $x$), no amount of optimization will yield a meaningful $\theta$ (see the sketch below).
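
A minimal sketch of this point, assuming scikit-learn and a synthetic dataset (the noise rates are illustrative, not from the source): flipping a fraction of training labels caps test accuracy no matter how well the optimizer does its job.

```python
# Illustrative only: label noise in the training set degrades test accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for noise_rate in [0.0, 0.2, 0.4]:
    y_noisy = y_tr.copy()
    flip = np.random.RandomState(0).rand(len(y_noisy)) < noise_rate  # corrupt a fraction of labels
    y_noisy[flip] = 1 - y_noisy[flip]
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"label noise {noise_rate:.0%}: test accuracy {acc:.3f}")
```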

⚖️ Trade-offs & Production Notes

  • Better data is often cheaper to obtain than more compute.
  • Domain-specific labeling is critical for covering edge cases.

🚨 Common Pitfalls

  • Blindly scaling data quantity without cleaning.
  • Ignoring dataset shift in production.

🗣 Interview-ready Answer

"Training data quality directly determines a model's ceiling. A simple model with great data usually outperforms a complex model trained on bad data."


Q2: What are online and offline data collection strategies, and when do you use each?

🎯 TL;DR: Online = implicit user interactions; offline = explicit human labeling.


🌱 Conceptual Explanation

  • Online: collects data from user behavior with an existing system.
  • Offline: requires explicit labeling by humans when interactions don't generate labels.

📐 Technical / Math Details

  • Online: click-through logs and engagement metrics → positive/negative signals (see the sketch below).
  • Offline: manual annotation (e.g., bounding boxes for object detection).
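
A minimal sketch of turning online interaction logs into weak labels (the field names and the 30-second dwell threshold are illustrative assumptions, not the source's pipeline):

```python
# Illustrative only: mapping logged impressions to (features, label) pairs
# for a click-prediction model.
from dataclasses import dataclass

@dataclass
class Impression:
    user_id: str
    item_id: str
    clicked: bool         # implicit positive signal
    dwell_seconds: float  # optional engagement signal

def to_training_example(imp: Impression) -> tuple[dict, int]:
    """Map one logged impression to (features, label)."""
    features = {"user_id": imp.user_id, "item_id": imp.item_id}
    # Clicks (or long dwell) count as positives; everything else as weak negatives.
    label = 1 if (imp.clicked or imp.dwell_seconds > 30.0) else 0
    return features, label

logs = [
    Impression("u1", "i9", clicked=True, dwell_seconds=42.0),
    Impression("u2", "i3", clicked=False, dwell_seconds=2.5),
]
dataset = [to_training_example(imp) for imp in logs]
print(dataset)
```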

⚖️ Trade-offs & Production Notes

  • Online: cheap and scalable, but noisy and bias-prone.
  • Offline: expensive and slow, but high-quality and task-specific.

🚨 Common Pitfalls

  • Misinterpreting implicit signals (e.g., a user ignoring an item ≠ disliking it).
  • Over-reliance on crowdsourcing for specialized tasks.

🗣 Interview-ready Answer

"Online collection leverages implicit user behavior; offline uses explicit human labeling. Online is scalable but biased; offline is costly but precise."


Q3: How do you handle bias in training data collected from user engagement?

🎯 TL;DR: Use exploration strategies to collect unbiased signals.


🌱 Conceptual Explanation

Always recommending what is already popular introduces feedback loops. To avoid this, sample from the full item pool, including low-ranked items, and track how users engage with them.

📐 Technical / Math Details

  • Bias = non-representative sampling from the item space $I$.
  • Exploration ensures $P(\text{item } i \text{ shown}) \neq 0$ for all $i \in I$.

⚖️ Trade-offs & Production Notes

  • Exploration reduces short-term click-through rate but improves long-term fairness and model quality.
  • Needs careful traffic allocation (e.g., ~5% of requests randomized, as sketched below).
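
A minimal epsilon-greedy serving sketch under those assumptions (the 5% epsilon and the item lists are illustrative): a small slice of traffic gets a uniformly random item so every item keeps a nonzero probability of being shown.

```python
# Illustrative only: epsilon-greedy exploration for a recommendation slot.
import random

def serve_item(ranked_items: list[str], all_items: list[str], epsilon: float = 0.05) -> str:
    """Return the item to show for one request."""
    if random.random() < epsilon:
        return random.choice(all_items)  # exploration: unbiased signal collection
    return ranked_items[0]               # exploitation: the ranking model's top pick

all_items = ["a", "b", "c", "d", "e"]
ranked = ["c", "a", "b", "d", "e"]       # output of the ranking model
shown = [serve_item(ranked, all_items) for _ in range(10)]
print(shown)
```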

🚨 Common Pitfalls

  • Ignoring bias leads to "winner-takes-all" dynamics.
  • Exploration without control harms business KPIs.

🗣 Interview-ready Answer

"We randomize a fraction of recommendations to gather unbiased signals, preventing feedback loops and improving long-term generalization."


Q4: What are training, validation, and test sets, and why do we need all three?

🎯 TL;DR: Train to fit, validate to tune, test to evaluate final performance.


🌱 Conceptual Explanation

Splitting ensures that model performance is assessed on unseen data, preventing overfitting to the training set or to hyperparameter choices.

📐 Technical / Math Details

  • Training set: optimize parameters $\theta$.
  • Validation set: choose hyperparameters $\lambda$.
  • Test set: estimate generalization without bias.

⚖️ Trade-offs & Production Notes

  • Typical splits: 70/15/15 or 60/20/20.
  • For temporal data, splits must follow time order (see the sketch below).
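
A minimal sketch of a time-ordered split, assuming each record carries a timestamp field (the 70/15/15 fractions follow the note above):

```python
# Illustrative only: chronological train/validation/test split to avoid leaking
# future information into training.
def temporal_split(records, train_frac=0.7, val_frac=0.15):
    """Oldest 70% -> train, next 15% -> validation, remainder -> test."""
    records = sorted(records, key=lambda r: r["timestamp"])
    n = len(records)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (records[:n_train],
            records[n_train:n_train + n_val],
            records[n_train + n_val:])

data = [{"timestamp": t, "x": t % 7, "y": t % 2} for t in range(100)]
train, val, test = temporal_split(data)
print(len(train), len(val), len(test))  # 70 15 15
```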

🚨 Common Pitfalls

  • Using test data for tuning → data leakage.
  • Ignoring time-series structure when splitting.

🗣 Interview-ready Answer

"Training fits parameters, validation tunes hyperparameters, and the test set gives an unbiased estimate of generalization. Keeping all three separate avoids overfitting and leakage."


Q5: How can GANs or augmentation help expand training data?

🎯 TL;DR: GANs and augmentation create synthetic but realistic variants to cover data gaps.


🌱 Conceptual Explanation

When data is scarce for certain conditions (e.g., rainy driving images), augmentation artificially generates examples to balance the distribution (see the sketch after the math details).

📐 Technical / Math Details

  • GAN objective:
    $$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}} [\log D(x)] + \mathbb{E}_{z \sim p_z} [\log (1 - D(G(z)))] $$
  • $G$: generator, $D$: discriminator.
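
A minimal augmentation sketch in NumPy only (an illustration, not the source's pipeline; flips and brightness jitter stand in for richer transforms or GAN-generated samples):

```python
# Illustrative only: simple augmentation to simulate underrepresented conditions.
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> np.ndarray:
    """Return a randomly flipped and brightness-jittered copy of an HxWxC image."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]               # horizontal flip
    brightness = rng.uniform(0.6, 1.4)      # darker/brighter variant
    out = np.clip(out * brightness, 0.0, 1.0)
    return out

image = rng.random((64, 64, 3))             # stand-in for a real photo
augmented_batch = np.stack([augment(image) for _ in range(8)])
print(augmented_batch.shape)  # (8, 64, 64, 3)
```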

⚖️ Trade-offs & Production Notes

  • Reduces the cost of manual labeling.
  • Risk: synthetic data may not fully capture real-world edge cases.

🚨 Common Pitfalls

  • Over-reliance on GANs can introduce unrealistic artifacts into the data.

🗣 Interview-ready Answer

"GANs and augmentation generate synthetic variants to fill gaps (e.g., rain or night images), improving robustness without costly labeling."


Q6: What is the cold start problem, and how do we mitigate it?

🎯 TL;DR: New items lack engagement signals; boost them artificially using similarity or adjusted relevance.


🌱 Conceptual Explanation

Cold start arises when new items or users have no prior data. Models trained on engagement signals can't rank unseen items well.

📐 Technical / Math Details

  • Boosting: artificially increase $\text{score}(i)$ for a new item $i$ to ensure visibility.
  • Similarity-based initialization (see the sketch below):
    $$ \text{score}(i_{new}) = \alpha \cdot \text{sim}(i_{new}, i_{known}) + \beta $$
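
A minimal sketch of that initialization (the cosine embeddings, $\alpha = 0.8$, and $\beta = 0.1$ are illustrative assumptions; here the new item borrows from its most similar known item):

```python
# Illustrative only: similarity-based score initialization for a cold-start item.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def init_score(new_emb: np.ndarray, known_embs: dict[str, np.ndarray],
               alpha: float = 0.8, beta: float = 0.1) -> float:
    """score(i_new) = alpha * sim(i_new, most similar known item) + beta."""
    best_sim = max(cosine(new_emb, emb) for emb in known_embs.values())
    return alpha * best_sim + beta

known_items = {
    "item_a": np.array([0.9, 0.1]),  # content embedding of an existing item
    "item_b": np.array([0.2, 0.8]),
}
new_item = np.array([0.85, 0.15])
print(round(init_score(new_item, known_items), 3))
```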

⚖️ Trade-offs & Production Notes

  • Boosting ensures discovery but may degrade short-term relevance.
  • Needs a balance between exploration and user satisfaction.

🚨 Common Pitfalls

  • Over-boosting irrelevant items.
  • Ignoring user cold start alongside item cold start.

🗣 Interview-ready Answer

"The cold start problem means new items lack signals. We mitigate it by boosting their scores or using similarity to known items until data accumulates."


๐Ÿ“ Key Formulas

Cross-Entropy Loss
$$ L = -\sum_i y_i \log(\hat{y}_i) $$
  • $y_i$: true label (one-hot encoded)
  • $\hat{y}_i$: predicted probability for class $i$

Interpretation: Penalizes wrong confident predictions; standard for classification tasks.

Softmax Function
$$ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} $$
  • $z_i$: logit for class $i$

Interpretation: Converts raw logits into a normalized probability distribution.
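
A small worked example of both formulas (NumPy; the logits are illustrative):

```python
# Illustrative only: softmax over logits, then cross-entropy against a one-hot label.
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(-np.sum(y_true * np.log(y_pred)))

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)        # approx. [0.659, 0.242, 0.099]
y_onehot = np.array([1.0, 0.0, 0.0])
print(probs, cross_entropy(y_onehot, probs))  # low loss: confident and correct
```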


✅ Cheatsheet

  • Online Data: scalable, cheap, but noisy/bias-prone.
  • Offline Data: costly, slower, but precise.
  • Augmentation/GANs: expand underrepresented scenarios.
  • Splits: Train = fit, Validation = tune, Test = final check.
  • Bias Mitigation: random exploration to avoid feedback loops.
  • Cold Start: boost or initialize by similarity.