2.5. Transfer Learning

πŸ“ Flashcards

⚑ Short Theories

Transfer learning leverages knowledge from one task/domain to improve performance on another, saving data and compute.

Feature extraction freezes layers of a pre-trained model and only retrains the final layers.

Fine-tuning adjusts some or all pre-trained parameters for the new task.

Strategy depends on dataset size and task similarity: small/similar = freeze more; large/different = fine-tune more.

Pre-trained models like VGG16, BERT, and ELMo are common starting points in vision and NLP tasks.

🎀 Interview Q&A

Q1: Explain transfer learning and why it’s important in machine learning.

🎯 TL;DR: Transfer learning reuses pre-trained models to solve new tasks efficiently with less data and compute.


🌱 Conceptual Explanation

Instead of starting from scratch, we “borrow” knowledge from a model trained on a large dataset. Like learning Spanish after knowing Italian, you reuse patterns instead of relearning everything.

πŸ“ Technical / Math Details

Formally, given a source task $T_s$ with data $D_s$ and a target task $T_t$ with data $D_t$, transfer learning seeks to improve $P(Y_t|X_t)$ using knowledge from $D_s, T_s$.
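
One common way to make this precise (the domain/task notation below follows the standard survey convention and is not spelled out in the original text): a domain bundles a feature space with its marginal distribution, and a task bundles a label space with a predictive function,

$$ \mathcal{D} = \{\mathcal{X}, P(X)\}, \qquad \mathcal{T} = \{\mathcal{Y}, f(\cdot)\}, \qquad f(x) \approx P(Y \mid X = x). $$

Transfer learning assumes $\mathcal{D}_s \neq \mathcal{D}_t$ or $\mathcal{T}_s \neq \mathcal{T}_t$ and aims to learn a better target function $f_t$ by exploiting knowledge from $\mathcal{D}_s$ and $\mathcal{T}_s$.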

βš–οΈ Trade-offs & Production Notes

  • Saves compute and training time.
  • Strong performance with limited labels.
  • Less effective if source and target domains are too different.

🚨 Common Pitfalls

  • Negative transfer: pre-trained knowledge may hurt if tasks are dissimilar.
  • Overfitting when fine-tuning with small data.

πŸ—£ Interview-ready Answer

“Transfer learning leverages pre-trained models to quickly adapt to new tasks, reducing data and compute needs while improving performance.”


Q2: What are the main techniques in transfer learning?

🎯 TL;DR: Two main techniques are feature extraction and fine-tuning.


🌱 Conceptual Explanation

Feature extraction uses frozen layers of a pre-trained model as universal feature generators. Fine-tuning modifies selected layers (or all) to adapt to the new task.

πŸ“ Technical / Math Details

  • Feature extraction: Freeze the parameters $\theta$ of the early layers and retrain only the final layer's weights.
  • Fine-tuning: Optimize $\theta$ across chosen layers (or all of them) with new labeled data, as sketched below.
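
A minimal PyTorch sketch of both techniques, assuming torchvision is available; ResNet-18, the 10-class head, and the learning rates are illustrative choices, not something prescribed by the text:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Load a backbone pre-trained on ImageNet (torchvision >= 0.13 "weights" API).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# --- Feature extraction: freeze the backbone, retrain only a new head ---
for param in model.parameters():
    param.requires_grad = False                     # frozen layers act as fixed feature generators
model.fc = nn.Linear(model.fc.in_features, 10)      # new head for 10 target classes (trainable by default)
feature_extraction_optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)

# --- Fine-tuning: also update selected pre-trained layers, with a smaller learning rate ---
for param in model.layer4.parameters():
    param.requires_grad = True
fine_tuning_optimizer = optim.Adam(
    [
        {"params": model.fc.parameters(), "lr": 1e-3},      # freshly initialized head learns faster
        {"params": model.layer4.parameters(), "lr": 1e-5},  # pre-trained block moves slowly
    ]
)
```

In practice the fine-tuning variant often adds a few warm-up epochs with the backbone frozen before unfreezing, so the randomly initialized head does not destroy pre-trained features early in training.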

βš–οΈ Trade-offs & Production Notes

  • Feature extraction: quick, avoids overfitting on small data.
  • Fine-tuning: more powerful if you have enough data.

🚨 Common Pitfalls

  • Freezing too few layers with a small dataset → overfitting.
  • Fine-tuning the entire model on a tiny dataset → overfitting and wasted compute.

πŸ—£ Interview-ready Answer

“The two main approaches are feature extractionβ€”using frozen layers for general featuresβ€”and fine-tuning, where we adapt layers to the new task.”


Q3: How do dataset size and task similarity influence transfer learning strategy?

🎯 TL;DR: Small/similar tasks β†’ freeze more. Large/different tasks β†’ fine-tune more.


🌱 Conceptual Explanation

If tasks share patterns (like car vs. truck classification), we can freeze most layers. If tasks differ (natural vs. medical images), fine-tuning more of the network is required. The larger the dataset, the more layers we can safely fine-tune.

πŸ“ Technical / Math Details

  • Let $n$ = size of the labeled target dataset.
    • Small $n$: minimize the number of trainable parameters.
    • Large $n$: increase the number of trainable parameters (a toy decision helper is sketched below).
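
As a rough illustration, the rule of thumb can be written as a tiny decision helper; the 10,000-example cutoff and the similarity labels are arbitrary placeholders, not published guidelines:

```python
def choose_strategy(n_labeled: int, task_similarity: str) -> str:
    """Illustrative heuristic mapping dataset size and task similarity to a transfer strategy.

    task_similarity: "high" (e.g. cars vs. trucks) or "low" (e.g. natural vs. medical images).
    The 10k cutoff is a placeholder, not a published rule.
    """
    small = n_labeled < 10_000
    if small and task_similarity == "high":
        return "feature extraction: freeze the backbone, retrain only the head"
    if small and task_similarity == "low":
        return "fine-tune a few top layers with strong regularization"
    if not small and task_similarity == "high":
        return "fine-tune the top layers (or all, with a small learning rate)"
    return "fine-tune most or all layers; pre-training still helps initialization"


print(choose_strategy(2_000, "high"))   # small & similar  -> freeze more
print(choose_strategy(500_000, "low"))  # large & different -> fine-tune more
```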

βš–οΈ Trade-offs & Production Notes

  • Freezing saves time and compute.
  • Fine-tuning entire models is expensive but can maximize accuracy.

🚨 Common Pitfalls

  • Ignoring domain similarity β†’ poor adaptation.
  • Over-tuning on small datasets, which leads to overfitting.

πŸ—£ Interview-ready Answer

“With small, similar datasets, freeze most layers; with large, different datasets, fine-tune more or all layers.”


Q4: Why is transfer learning especially valuable in NLP?

🎯 TL;DR: Pre-trained embeddings capture language semantics, enabling better downstream tasks with less data.


🌱 Conceptual Explanation

Models like BERT or ELMo learn dense text representations from large corpora. These embeddings encode meaning, which can be reused for tasks like NER, spam detection, or search ranking.

πŸ“ Technical / Math Details

  • Word2Vec: the skip-gram model trains embeddings by predicting context words.
  • BERT: Transformer-based pre-training via masked language modeling (see the sketch below).
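
A short feature-extraction sketch in NLP, assuming the Hugging Face transformers package and the public bert-base-uncased checkpoint; the example sentences and the mean pooling are illustrative choices:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()  # frozen encoder used purely as a feature extractor

texts = ["urgent: claim your free prize now", "meeting moved to 3pm tomorrow"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings into one vector per sentence; these vectors can feed a
# small downstream classifier (spam detection, NER features, ranking, etc.).
sentence_embeddings = outputs.last_hidden_state.mean(dim=1)
print(sentence_embeddings.shape)  # (2, 768) for bert-base-uncased
```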

βš–οΈ Trade-offs & Production Notes

  • Strong transfer on diverse NLP tasks.
  • Pre-trained models are large and resource-heavy.

🚨 Common Pitfalls

  • Using embeddings blindly without considering domain-specific vocabulary.
  • Ignoring fine-tuning opportunities for specialized tasks.

πŸ—£ Interview-ready Answer

“In NLP, transfer learning reuses embeddings from models like BERT, capturing semantic meaning for many downstream tasks with less data.”


Q5: What are risks of transfer learning in production systems?

🎯 TL;DR: Risks include negative transfer, resource overhead, and domain mismatch.


🌱 Conceptual Explanation

Not all pre-trained knowledge generalizes well. If the source and target differ too much, performance may degrade. Large pre-trained models may also be expensive to deploy.

πŸ“ Technical / Math Details

Negative transfer occurs when the learned estimate of $P(Y_t|X_t)$ degrades because irrelevant knowledge from $P(Y_s|X_s)$ is carried over.

βš–οΈ Trade-offs & Production Notes

  • Cost vs. benefit of fine-tuning.
  • Monitoring for drift between source/target domains.

🚨 Common Pitfalls

  • Using a mismatched pre-trained model.
  • Underestimating serving costs of large models.

πŸ—£ Interview-ready Answer

“Key risks are negative transfer when tasks differ, and high resource cost for large models; careful model selection mitigates this.”


πŸ“ Key Formulas

Softmax Function
$$ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} $$
  • $z_i$: logit for class $i$
    Interpretation: Converts raw logits into a probability distribution over classes.

Cross-Entropy Loss
$$ L = -\sum_{i} y_i \log(\hat{y}_i) $$
  • $y_i$: true label (one-hot)
  • $\hat{y}_i$: predicted probability
    Interpretation: Penalizes incorrect predictions more heavily when they are overconfident.
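
A small NumPy sketch of both formulas as they would be used when retraining a new classification head; the example logits and one-hot label are made up for illustration:

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Convert raw logits into a probability distribution (numerically stabilized)."""
    z = z - z.max(axis=-1, keepdims=True)  # subtracting the max does not change the result
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)


def cross_entropy(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-12) -> float:
    """L = -sum_i y_i * log(y_hat_i) for a one-hot y_true."""
    return float(-np.sum(y_true * np.log(y_pred + eps)))


logits = np.array([2.0, 0.5, -1.0])        # raw scores from the new task head
probs = softmax(logits)                    # roughly [0.79, 0.18, 0.04]
y_true = np.array([1.0, 0.0, 0.0])         # one-hot ground truth
print(probs, cross_entropy(y_true, probs)) # confident and correct -> small loss
```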

βœ… Cheatsheet

  • Transfer Learning = Pre-trained model β†’ new task.
  • Feature Extraction = Freeze layers + retrain head.
  • Fine-Tuning = Adjust weights of some/all layers.
  • Small & Similar dataset β†’ Freeze more.
  • Large & Different dataset β†’ Fine-tune more.
  • Vision: CNNs trained on ImageNet reused for medical imaging.
  • NLP: BERT/ELMo embeddings reused for NER, spam, ranking.