2.5. Transfer Learning

πŸ“ Flashcards

⚑ Short Theories

Transfer learning leverages knowledge from one task/domain to improve performance on another, saving data and compute.

Feature extraction freezes layers of a pre-trained model and only retrains the final layers.

Fine-tuning adjusts some or all pre-trained parameters for the new task.

Strategy depends on dataset size and task similarity: small/similar = freeze more; large/different = fine-tune more.

Pre-trained models like VGG16, BERT, and ELMo are common starting points in vision and NLP tasks.

🎀 Interview Q&A

Q1: Explain transfer learning and why it’s important in machine learning.

🎯 TL;DR: Transfer learning reuses pre-trained models to solve new tasks efficiently with less data and compute.


🌱 Conceptual Explanation

Instead of starting from scratch, we “borrow” knowledge from a model trained on a large dataset. Like learning Spanish after knowing Italian, you reuse patterns instead of relearning everything.

πŸ“ Technical / Math Details

Formally, given a source task $T_s$ with data $D_s$ and a target task $T_t$ with data $D_t$, transfer learning seeks to improve $P(Y_t|X_t)$ using knowledge from $D_s, T_s$.
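
One common way to make this precise (the domain/task notation below follows the standard survey convention and is not spelled out in the original text): a domain bundles a feature space with its marginal distribution, and a task bundles a label space with a predictive function,

$$ \mathcal{D} = \{\mathcal{X}, P(X)\}, \qquad \mathcal{T} = \{\mathcal{Y}, f(\cdot)\}, \qquad f(x) \approx P(Y \mid X = x). $$

Transfer learning assumes $\mathcal{D}_s \neq \mathcal{D}_t$ or $\mathcal{T}_s \neq \mathcal{T}_t$ and aims to learn a better target function $f_t$ by exploiting knowledge from $\mathcal{D}_s$ and $\mathcal{T}_s$.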

βš–οΈ Trade-offs & Production Notes

  • Saves compute and training time.
  • Strong performance with limited labels.
  • Less effective if source and target domains are too different.

🚨 Common Pitfalls

  • Negative transfer: pre-trained knowledge may hurt if tasks are dissimilar.
  • Overfitting when fine-tuning with small data.

πŸ—£ Interview-ready Answer

“Transfer learning leverages pre-trained models to quickly adapt to new tasks, reducing data and compute needs while improving performance.”


Q2: What are the main techniques in transfer learning?

🎯 TL;DR: Two main techniques are feature extraction and fine-tuning.


🌱 Conceptual Explanation

Feature extraction uses frozen layers of a pre-trained model as universal feature generators. Fine-tuning modifies selected layers (or all) to adapt to the new task.

πŸ“ Technical / Math Details

  • Feature extraction: Freeze the parameters $\theta$ of the early layers and retrain only the final layer's weights.
  • Fine-tuning: Optimize $\theta$ across chosen layers (or all of them) with new labeled data, as sketched below.
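
A minimal PyTorch sketch of both techniques, assuming torchvision is available; ResNet-18, the 10-class head, and the learning rates are illustrative choices, not something prescribed by the text:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Load a backbone pre-trained on ImageNet (torchvision >= 0.13 "weights" API).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# --- Feature extraction: freeze the backbone, retrain only a new head ---
for param in model.parameters():
    param.requires_grad = False                     # frozen layers act as fixed feature generators
model.fc = nn.Linear(model.fc.in_features, 10)      # new head for 10 target classes (trainable by default)
feature_extraction_optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)

# --- Fine-tuning: also update selected pre-trained layers, with a smaller learning rate ---
for param in model.layer4.parameters():
    param.requires_grad = True
fine_tuning_optimizer = optim.Adam(
    [
        {"params": model.fc.parameters(), "lr": 1e-3},      # freshly initialized head learns faster
        {"params": model.layer4.parameters(), "lr": 1e-5},  # pre-trained block moves slowly
    ]
)
```

In practice the fine-tuning variant often adds a few warm-up epochs with the backbone frozen before unfreezing, so the randomly initialized head does not destroy pre-trained features early in training.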

βš–οΈ Trade-offs & Production Notes

  • Feature extraction: quick, avoids overfitting on small data.
  • Fine-tuning: more powerful if you have enough data.

🚨 Common Pitfalls

  • Freezing too few layers with a small dataset → overfitting.
  • Fine-tuning the entire model on a tiny dataset → overfitting and wasted compute.

πŸ—£ Interview-ready Answer

“The two main approaches are feature extractionβ€”using frozen layers for general featuresβ€”and fine-tuning, where we adapt layers to the new task.”


Q3: How do dataset size and task similarity influence transfer learning strategy?

🎯 TL;DR: Small/similar tasks β†’ freeze more. Large/different tasks β†’ fine-tune more.


🌱 Conceptual Explanation

If tasks share patterns (like car vs. truck classification), we can freeze most layers. If tasks differ (natural vs. medical images), fine-tuning more of the network is required. The larger the dataset, the more layers we can safely fine-tune.

πŸ“ Technical / Math Details

  • Let $n$ = size of the labeled target dataset.
    • Small $n$: minimize the number of trainable parameters.
    • Large $n$: increase the number of trainable parameters (a toy decision helper is sketched below).
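
As a rough illustration, the rule of thumb can be written as a tiny decision helper; the 10,000-example cutoff and the similarity labels are arbitrary placeholders, not published guidelines:

```python
def choose_strategy(n_labeled: int, task_similarity: str) -> str:
    """Illustrative heuristic mapping dataset size and task similarity to a transfer strategy.

    task_similarity: "high" (e.g. cars vs. trucks) or "low" (e.g. natural vs. medical images).
    The 10k cutoff is a placeholder, not a published rule.
    """
    small = n_labeled < 10_000
    if small and task_similarity == "high":
        return "feature extraction: freeze the backbone, retrain only the head"
    if small and task_similarity == "low":
        return "fine-tune a few top layers with strong regularization"
    if not small and task_similarity == "high":
        return "fine-tune the top layers (or all, with a small learning rate)"
    return "fine-tune most or all layers; pre-training still helps initialization"


print(choose_strategy(2_000, "high"))   # small & similar  -> freeze more
print(choose_strategy(500_000, "low"))  # large & different -> fine-tune more
```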

βš–οΈ Trade-offs & Production Notes

  • Freezing saves time and compute.
  • Fine-tuning entire models is expensive but can maximize accuracy.

🚨 Common Pitfalls

  • Ignoring domain similarity β†’ poor adaptation.
  • Over-tuning on small datasets, which leads to overfitting.

πŸ—£ Interview-ready Answer

“With small, similar datasets, freeze most layers; with large, different datasets, fine-tune more or all layers.”


Q4: Why is transfer learning especially valuable in NLP?

🎯 TL;DR: Pre-trained embeddings capture language semantics, enabling better downstream tasks with less data.


🌱 Conceptual Explanation

Models like BERT or ELMo learn dense text representations from large corpora. These embeddings encode meaning, which can be reused for tasks like NER, spam detection, or search ranking.

πŸ“ Technical / Math Details

  • Word2Vec: the skip-gram model trains embeddings by predicting context words.
  • BERT: Transformer-based pre-training via masked language modeling (see the sketch below).
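
A short feature-extraction sketch in NLP, assuming the Hugging Face transformers package and the public bert-base-uncased checkpoint; the example sentences and the mean pooling are illustrative choices:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()  # frozen encoder used purely as a feature extractor

texts = ["urgent: claim your free prize now", "meeting moved to 3pm tomorrow"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings into one vector per sentence; these vectors can feed a
# small downstream classifier (spam detection, NER features, ranking, etc.).
sentence_embeddings = outputs.last_hidden_state.mean(dim=1)
print(sentence_embeddings.shape)  # (2, 768) for bert-base-uncased
```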

βš–οΈ Trade-offs & Production Notes

  • Strong transfer on diverse NLP tasks.
  • Pre-trained models are large and resource-heavy.

🚨 Common Pitfalls

  • Using embeddings blindly without considering domain-specific vocabulary.
  • Ignoring fine-tuning opportunities for specialized tasks.

πŸ—£ Interview-ready Answer

“In NLP, transfer learning reuses embeddings from models like BERT, capturing semantic meaning for many downstream tasks with less data.”


Q5: What are risks of transfer learning in production systems?

🎯 TL;DR: Risks include negative transfer, resource overhead, and domain mismatch.


🌱 Conceptual Explanation

Not all pre-trained knowledge generalizes well. If the source and target differ too much, performance may degrade. Large pre-trained models may also be expensive to deploy.

πŸ“ Technical / Math Details

Negative transfer occurs when the learned estimate of $P(Y_t|X_t)$ degrades because irrelevant knowledge from $P(Y_s|X_s)$ is carried over.

βš–οΈ Trade-offs & Production Notes

  • Cost vs. benefit of fine-tuning.
  • Monitoring for drift between source/target domains.

🚨 Common Pitfalls

  • Using a mismatched pre-trained model.
  • Underestimating serving costs of large models.

πŸ—£ Interview-ready Answer

“Key risks are negative transfer when tasks differ, and high resource cost for large models; careful model selection mitigates this.”


πŸ“ Key Formulas

Softmax Function
$$ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} $$
  • $z_i$: logit for class $i$
    Interpretation: Converts raw logits into a probability distribution over classes.

Cross-Entropy Loss
$$ L = -\sum_{i} y_i \log(\hat{y}_i) $$
  • $y_i$: true label (one-hot)
  • $\hat{y}_i$: predicted probability
    Interpretation: Penalizes incorrect predictions more heavily when they are overconfident.
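
A small NumPy sketch of both formulas as they would be used when retraining a new classification head; the example logits and one-hot label are made up for illustration:

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Convert raw logits into a probability distribution (numerically stabilized)."""
    z = z - z.max(axis=-1, keepdims=True)  # subtracting the max does not change the result
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)


def cross_entropy(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-12) -> float:
    """L = -sum_i y_i * log(y_hat_i) for a one-hot y_true."""
    return float(-np.sum(y_true * np.log(y_pred + eps)))


logits = np.array([2.0, 0.5, -1.0])        # raw scores from the new task head
probs = softmax(logits)                    # roughly [0.79, 0.18, 0.04]
y_true = np.array([1.0, 0.0, 0.0])         # one-hot ground truth
print(probs, cross_entropy(y_true, probs)) # confident and correct -> small loss
```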

βœ… Cheatsheet

  • Transfer Learning = Pre-trained model β†’ new task.
  • Feature Extraction = Freeze layers + retrain head.
  • Fine-Tuning = Adjust weights of some/all layers.
  • Small & Similar dataset β†’ Freeze more.
  • Large & Different dataset β†’ Fine-tune more.
  • Vision: CNNs trained on ImageNet reused for medical imaging.
  • NLP: BERT/ELMo embeddings reused for NER, spam, ranking.