2.1. BERT and Its Variants
🪄 Step 1: Intuition & Motivation
Core Idea: While GPT generates language word-by-word, BERT is designed to understand it. It reads the entire sentence — both left and right context — and builds a deep understanding of meaning, relationships, and nuance.
Simple Analogy: Imagine GPT as a storyteller who speaks word by word, guessing what comes next. BERT, on the other hand, is a careful reader — it looks at the whole sentence, both before and after each word, to fully grasp the meaning.
🌱 Step 2: Core Concept
Let’s walk through how BERT learns to “understand” text through clever training tricks.
Masked Language Modeling (MLM) — The Fill-in-the-Blank Game
Instead of predicting the next word (like GPT), BERT learns by filling in blanks.
During training:
- 15% of tokens are randomly selected and replaced by a special [MASK] token (in the original recipe, a small fraction of the selected tokens are instead swapped for random tokens or left unchanged).
- The model must predict the original word based on its left and right neighbors.
Example:
Sentence: “The [MASK] chased the ball.” Model predicts: “dog.”
This teaches BERT bidirectional understanding — it uses both past and future context to infer missing information.
- Why it matters: Real language understanding often depends on what comes after a word, not just before it. For instance, “bank” means different things in “river bank” vs. “money bank.”
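To make the fill-in-the-blank game concrete, here is a minimal sketch using the Hugging Face `transformers` fill-mask pipeline (an assumption: the library and the `bert-base-uncased` checkpoint are available; any BERT-style model with an MLM head would do). The pretrained MLM head ranks candidate words for the blank using context from both sides.

```python
# Minimal sketch: BERT's pretrained MLM head filling a blank.
# Assumes `pip install transformers torch` and access to the model checkpoint.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The model sees the full sentence and uses context on both sides of [MASK].
for prediction in unmasker("The [MASK] chased the ball."):
    print(f"{prediction['token_str']:>8}  p={prediction['score']:.3f}")
```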
Next Sentence Prediction (NSP) — Learning Sentence Relationships
To help BERT understand relationships between sentences, another task was added: Next Sentence Prediction.
During training:
- The model receives two sentences.
- 50% of the time, the second sentence follows the first.
- The other 50% of the time, it’s a random sentence from elsewhere.
The model learns to classify whether sentence B logically follows sentence A.
This helps BERT handle tasks like question answering or entailment, where understanding the relationship between ideas is key.
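For illustration, the sketch below (assuming the `transformers` library is installed) runs BERT's pretrained NSP head, which is simply a binary classifier on top of the `[CLS]` representation of the sentence pair.

```python
# Sketch: scoring whether sentence B plausibly follows sentence A with BERT's NSP head.
# Assumes transformers + PyTorch are installed; sentences are illustrative.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "She opened her umbrella."
sentence_b = "It had started to rain."                # a plausible continuation
encoding = tokenizer(sentence_a, sentence_b, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits                 # shape (1, 2)

probs = torch.softmax(logits, dim=-1)
# Index 0 = "B follows A" (IsNext), index 1 = "B is a random sentence" (NotNext).
print(f"P(IsNext) = {probs[0, 0].item():.3f}")
```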
However, later research (notably RoBERTa) found that NSP added little value: dropping it did not hurt downstream performance and sometimes improved it.
Bidirectional Encoder Stack — Seeing Both Directions at Once
BERT’s architecture is purely encoder-based, unlike GPT’s decoder-only design.
Every token can attend to all other tokens in the sentence — both before and after it — using self-attention.
This is what makes it bidirectional: Each word’s meaning is derived from the entire context, not just preceding words.
Example:
“I went to the bank to withdraw money.” Here, BERT knows “bank” means a financial institution because of the right-side context “withdraw money.”
This deep, contextual understanding is why BERT shines at comprehension-based tasks.
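A rough sketch of this effect, assuming the `transformers` library: the same surface word "bank" gets a different contextual vector in each sentence, which we can check with cosine similarity.

```python
# Sketch: the same word receives different embeddings depending on context.
# Assumes transformers + PyTorch are installed.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Final-layer hidden state of the token 'bank' in the given sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_money = bank_vector("I went to the bank to withdraw money.")
v_river = bank_vector("We sat on the bank of the river.")

# Identical word, different contexts -> noticeably different vectors.
sim = torch.cosine_similarity(v_money, v_river, dim=0).item()
print(f"cosine similarity between the two 'bank' vectors: {sim:.3f}")
```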
RoBERTa — BERT, but Stronger and Faster
RoBERTa (a Robustly Optimized BERT Pretraining Approach) revisited BERT’s design and asked: “What if we just train longer, on more data, and drop the NSP task?”
Key improvements:
- Removed NSP: Simplified training, faster convergence.
- Dynamic Masking: Different tokens are masked each epoch — more robust learning.
- Bigger Batches + Longer Training: Better generalization.
Result: RoBERTa outperformed the original BERT on almost every NLP benchmark — proving that optimization and scale mattered more than new architecture tweaks.
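To make dynamic masking concrete, here is a rough PyTorch sketch (not the actual RoBERTa code): because the mask is re-sampled every time a batch is built, the model sees a different corruption pattern for the same sentence on every epoch. `MASK_ID` and `VOCAB_SIZE` are the `bert-base-uncased` values, used here as placeholders.

```python
# Sketch of BERT-style masking applied dynamically (fresh mask on every call).
import torch

MASK_ID, VOCAB_SIZE = 103, 30522   # bert-base-uncased values, used as placeholders

def dynamic_mask(token_ids, mask_prob=0.15):
    """Return (corrupted_ids, labels); unselected positions get label -100 (ignored by the loss)."""
    ids = token_ids.clone()
    labels = torch.full_like(ids, -100)

    selected = torch.rand(ids.shape) < mask_prob          # pick ~15% of positions
    labels[selected] = ids[selected]                      # the model must recover these

    roll = torch.rand(ids.shape)
    ids[selected & (roll < 0.8)] = MASK_ID                # 80% of selected -> [MASK]
    rand_pos = selected & (roll >= 0.8) & (roll < 0.9)    # 10% -> random token
    ids[rand_pos] = torch.randint(0, VOCAB_SIZE, ids.shape)[rand_pos]
    return ids, labels                                    # remaining 10% stay unchanged

# Called inside the training loop, every epoch produces a new mask pattern,
# unlike BERT's original static masking done once during preprocessing.
```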
DistilBERT & ALBERT — Smaller Yet Smart
As BERT grew popular, people wanted lighter versions for real-time applications.
DistilBERT:
- Trained using knowledge distillation — a smaller model learns to mimic a large teacher (BERT).
- Achieves ~97% of BERT’s performance with 40% fewer parameters.
- Great for inference speed and deployment on limited hardware.
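As a rough illustration of the distillation objective (simplified; DistilBERT's actual loss also includes a cosine-embedding term between teacher and student hidden states), the student is trained to match the teacher's temperature-softened output distribution alongside the ordinary hard-label loss:

```python
# Simplified knowledge-distillation loss (PyTorch sketch).
# T (temperature) and alpha (mixing weight) are illustrative hyperparameters.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's softened distribution (KL divergence).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale for the temperature
    # Hard targets: standard cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```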
ALBERT:
- Uses cross-layer parameter sharing: the same weights are reused in every layer.
- Factorizes the embedding matrix into two smaller matrices, decoupling the vocabulary embedding size from the hidden size.
- Keeps accuracy close to BERT while drastically shrinking size.
Both proved you don’t always need more parameters — you need smarter parameter usage.
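A toy PyTorch sketch of the parameter-sharing idea (not ALBERT's real implementation): one encoder layer is applied repeatedly, so the parameter count does not grow with depth.

```python
# Toy sketch of ALBERT-style cross-layer parameter sharing.
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, d_model=128, n_heads=4, depth=12):
        super().__init__()
        self.depth = depth
        # One layer whose weights are reused at every depth.
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, x):
        for _ in range(self.depth):               # 12 passes, one set of weights
            x = self.layer(x)
        return x

# Parameter count is the same whether depth is 1 or 12.
print(sum(p.numel() for p in SharedEncoder(depth=12).parameters()))
```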
Why It Works This Way
BERT’s brilliance lies in its bidirectional context capture. By allowing attention across the full sentence, it captures dependencies that left-to-right models (like GPT) might miss.
However, this same bidirectionality prevents BERT from generating fluent text — because each token prediction depends on future tokens that won’t exist during generation.
How It Fits in ML Thinking
BERT represents the “understanding” half of natural language models — it builds rich, contextual embeddings that other systems (like T5 or GPT) can later use for reasoning or generation.
It taught the ML world that self-supervision on raw text can build universal language representations.
📐 Step 3: Mathematical Foundation
The MLM Objective
$$
\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log P\big(x_i \mid x_{\setminus M}\big)
$$
where:
- $M$: the set of masked positions.
- $x_i$: the true token at position $i$.
- $x_{\setminus M}$: all unmasked tokens used as context.
The NSP Objective
$$
\mathcal{L}_{\text{NSP}} = -\big[\, y \log P(\text{IsNext}) + (1 - y) \log\big(1 - P(\text{IsNext})\big) \,\big]
$$
where:
- $y$: label (1 if the second sentence follows, 0 otherwise).
- $P(\text{IsNext})$: the model’s predicted probability that sentence B follows sentence A.
🧠 Step 4: Key Ideas & Assumptions
- Context is two-way: Understanding requires both left and right context.
- Masked prediction teaches semantics: Removing words forces the model to understand grammar and meaning.
- Pretraining is universal: Once trained, BERT embeddings can be reused across tasks.
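For instance, reusing the pretrained encoder for a downstream classifier takes only a few lines with the `transformers` library (a sketch, under the assumption that you then fine-tune on your own labelled data):

```python
# Sketch: attach a fresh classification head to pretrained BERT for fine-tuning.
# Assumes transformers + PyTorch are installed.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2            # e.g. positive vs. negative sentiment
)

batch = tokenizer(["great movie!", "terrible plot"], padding=True, return_tensors="pt")
logits = model(**batch).logits                   # shape (2, 2): one score per class
# From here, fine-tune the whole model on labelled (text, label) pairs with a
# standard training loop or the transformers Trainer.
print(logits.shape)
```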
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Deep, bidirectional context understanding.
- Excels in comprehension and classification tasks.
- Transferable embeddings reduce labeled data needs.
- Strong foundation for fine-tuning on many NLP tasks.
Limitations:
- Cannot generate coherent text (non-causal).
- Computationally heavy during pretraining.
- Limited to a fixed maximum input length (512 tokens for the original BERT).
- Prone to overfitting when fine-tuned on small datasets.
🚧 Step 6: Common Misunderstandings
- “BERT can generate sentences.” No — BERT fills blanks, not sequences. It’s not autoregressive.
- “NSP is crucial.” Actually, it’s not — later models (like RoBERTa) dropped it with no performance loss.
- “Distillation means cutting layers.” It’s about teaching smaller models to mimic larger ones, not simply shrinking architecture.
🧩 Step 7: Mini Summary
🧠 What You Learned: BERT understands language by predicting masked words using both left and right context — capturing deep meaning and relationships.
⚙️ How It Works: It uses bidirectional self-attention and two main tasks (MLM, NSP) during pretraining.
🎯 Why It Matters: BERT pioneered contextual embeddings — forming the basis for many modern models that “understand” before they “generate.”