1.4. Modeling Objectives — Teaching Language Understanding
🪄 Step 1: Intuition & Motivation
- Core Idea: Once a model knows how to read text (through tokenization and embeddings), it still doesn’t know what to do with it. The modeling objective defines the game the model plays during training — the exact “goal” it tries to achieve. It’s like setting the rules for learning:
- “Guess the next word!” (Causal)
- “Fill in the blanks!” (Masked)
- “Repair the noise I added!” (Denoising)
Different objectives lead to different kinds of intelligence — some models become great storytellers, others great readers, and some both.
- Simple Analogy: Imagine two students learning language:
- One reads a sentence and tries to predict the next word — “I love ___” → “you”.
- Another reads sentences with missing words — “I ___ ice cream” → “love”.
Both students learn language, just in different ways. That’s exactly the difference between CLM and MLM.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
Let’s look at the three key objectives that power modern LLMs:
1️⃣ Causal Language Modeling (CLM) — The Storyteller
Used by GPT-style models. The model reads tokens one by one and learns to predict the next word based on all previous words.
For a sentence like:
“I love transformers”
the model sees:
- “I” → predicts “love”
- “I love” → predicts “transformers”
This process trains it to generate fluent text — perfect for writing, summarizing, and dialogue generation.
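The prefix-to-next-token setup above can be sketched in a few lines of Python (word-level tokens are an assumption for readability; real models operate on subword tokens):

```python
# Toy illustration of how a causal LM turns one sentence into training
# pairs: each prefix predicts the next token, so a 3-token sentence
# yields 2 (context, target) pairs.
tokens = ["I", "love", "transformers"]

pairs = [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

for context, target in pairs:
    print(" ".join(context), "->", target)
# I -> love
# I love -> transformers
```

During training, all of these pairs are scored in parallel with a causal attention mask, not one at a time.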
2️⃣ Masked Language Modeling (MLM) — The Detective
Used by BERT-style models. Instead of predicting the next word, it learns to fill in missing ones.
Example:
“I [MASK] transformers.”
The model must predict “love” by looking at the entire sentence — both left and right context. This makes MLM great at understanding language (classification, QA, etc.), but not at generating text, since it sees all words at once.
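A minimal sketch of the masking step, again assuming word-level tokens (BERT’s actual recipe masks ~15% of subword tokens and further replaces 80% of those with `[MASK]`, 10% with random tokens, and leaves 10% unchanged; that detail is simplified away here):

```python
import random

def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; return the
    corrupted sequence and the original tokens the model must recover."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_ratio * len(tokens)))  # mask at least one token
    mask_ids = set(rng.sample(range(len(tokens)), n_mask))
    masked = ["[MASK]" if i in mask_ids else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(mask_ids)}
    return masked, targets

masked, targets = mask_tokens(["I", "love", "transformers", "."])
print(masked, targets)
```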
3️⃣ Denoising Autoencoding — The Repairer
Used by T5, BART, and FLAN-T5 models. Here, the model learns to reconstruct corrupted input, like restoring missing or shuffled words.
Example:
Input: “I ___ transformers, do I?”
Output: “I love transformers, do I?”
This makes the model flexible — it learns to both understand and generate coherent text.
Why It Works This Way
Each objective defines what kind of context the model uses:
- CLM: sees only the past → unidirectional → good for generation.
- MLM: sees past + future → bidirectional → good for understanding.
- Denoising: learns both directions and sequence repair → balanced for general tasks.
That’s why GPT excels at dialogue, while BERT shines at comprehension: they simply learned by playing different games.
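The “past only” vs. “past + future” distinction can be visualized as an attention-visibility matrix (a toy sketch for a 4-token sequence; real implementations build these masks inside the attention layers):

```python
# Which positions each token may attend to (1 = visible, 0 = hidden).
n = 4

# CLM: position i sees only positions <= i (lower-triangular mask).
causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# MLM: every position sees the whole sequence (full visibility).
bidirectional = [[1] * n for _ in range(n)]

for row in causal:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```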
How It Fits in ML Thinking
All three objectives are forms of self-supervised learning: the raw text supplies its own training signal, so no human-labeled data is needed. This is what makes pretraining on web-scale corpora feasible.
📐 Step 3: Mathematical Foundation
Causal Language Modeling (CLM)
The model maximizes the probability of the next token given all previous ones:
$$ P(w_t | w_{<t}) $$
The training goal is to minimize the negative log-likelihood loss:
$$ \mathcal{L}_{CLM} = -\sum_t \log P(w_t | w_{<t}) $$
Masked Language Modeling (MLM)
Instead of next-word prediction, MLM hides a random subset of tokens (typically 15%) and trains the model to recover them:
$$ \mathcal{L}_{MLM} = -\sum_{m \in M} \log P(w_m | w_{\setminus M}) $$
- $M$: set of masked tokens
- $w_m$: actual masked token
- $w_{\setminus M}$: unmasked context tokens
The loss focuses only on the masked positions.
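As a toy numeric check of the MLM loss, assume the model assigns these (made-up) probabilities to the true token at each position; only the masked positions contribute to the sum:

```python
import math

# Assumed model probabilities P(true token | context) at each position.
probs = {0: 0.9, 1: 0.6, 2: 0.8, 3: 0.95}
masked_positions = {1, 2}   # the set M in the formula above

# Negative log-likelihood summed over masked positions only.
loss = -sum(math.log(probs[m]) for m in masked_positions)
print(round(loss, 4))
# 0.734
```

Unmasked positions (0 and 3) are ignored entirely, no matter how confident the model is there.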
Denoising Objective (T5/BART)
A more general objective: corrupt input text and train the model to reconstruct it.
$$ \mathcal{L}_{denoise} = -\sum_i \log P(x_i | \tilde{x}) $$
- $x_i$: original token
- $\tilde{x}$: corrupted input
Corruptions include deletion, permutation, or span-masking.
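The span-masking variant can be sketched as follows, with hand-picked spans and T5-style sentinel names (`<extra_id_0>`, …); in real training, span positions and lengths are sampled randomly:

```python
# Simplified T5-style span corruption on word-level tokens: each
# corrupted span is replaced by one sentinel in the input, and the
# target lists each sentinel followed by the tokens it replaced.
tokens = ["I", "love", "transformers", "do", "I"]
spans = [(1, 2), (3, 4)]   # (start, end) spans to corrupt, chosen by hand

corrupted, target, sid, i = [], [], 0, 0
while i < len(tokens):
    span = next((s for s in spans if s[0] == i), None)
    if span:
        sentinel = f"<extra_id_{sid}>"
        corrupted.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[span[0]:span[1]])
        i, sid = span[1], sid + 1
    else:
        corrupted.append(tokens[i])
        i += 1

print(" ".join(corrupted))  # I <extra_id_0> transformers <extra_id_1> I
print(" ".join(target))     # <extra_id_0> love <extra_id_1> do
```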
🧠 Step 4: Assumptions or Key Ideas
- Language follows statistical regularities — the next word depends on context.
- Masking or corruption provides natural supervision — no labels required.
- Predictive objectives encourage models to learn grammar, semantics, and world knowledge simultaneously.
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths
- Enables unsupervised learning from raw text.
- Builds internal representations transferable across tasks.
- Flexible across architectures (encoder, decoder, seq2seq).
⚠️ Limitations
- CLM sees no future context, so it is weaker on tasks that benefit from bidirectional understanding.
- MLM can’t generate coherent long text.
- Objectives are sensitive to data bias — predicting next words amplifies frequent patterns.
🚧 Step 6: Common Misunderstandings
- “CLM and MLM learn the same thing.” ❌ No — CLM predicts future, MLM predicts missing.
- “BERT can generate text like GPT.” ❌ It can’t — it sees the entire sentence at once, not sequentially.
- “More masking = better learning.” ❌ Too much masking starves the model of context; too little gives no challenge.
🧩 Step 7: Mini Summary
🧠 What You Learned: Modeling objectives define how LLMs learn language — by predicting, filling, or repairing.
⚙️ How It Works: Each method teaches a model different language behaviors — generation (CLM), comprehension (MLM), or reconstruction (denoising).
🎯 Why It Matters: Choosing the right objective is what differentiates a storyteller model (GPT) from a reader model (BERT).