1.4. Modeling Objectives — Teaching Language Understanding


🪄 Step 1: Intuition & Motivation

  • Core Idea: Once a model knows how to read text (through tokenization and embeddings), it still doesn’t know what to do with it. The modeling objective defines the game the model plays during training — the exact “goal” it tries to achieve. It’s like setting the rules for learning:
  • “Guess the next word!” (Causal)
  • “Fill in the blanks!” (Masked)
  • “Repair the noise I added!” (Denoising)

Different objectives lead to different kinds of intelligence — some models become great storytellers, others great readers, and some both.

  • Simple Analogy: Imagine two students learning language:
  1. One reads a sentence and tries to predict the next word — “I love ___” → “you”.
  2. Another reads sentences with missing words — “I ___ ice cream” → “love”.

They both learn language, but in different ways. That’s exactly what CLM and MLM do.

🌱 Step 2: Core Concept

What’s Happening Under the Hood?

Let’s look at the three key objectives that power modern LLMs:

1️⃣ Causal Language Modeling (CLM) — The Storyteller

Used by GPT-style models. The model reads tokens one by one and learns to predict the next word based on all previous words.

For a sentence like:

“I love transformers”

the model sees:

  • “I” → predicts “love”
  • “I love” → predicts “transformers”

This process trains it to generate fluent text — perfect for writing, summarizing, and dialogue generation.
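
To make this concrete, here is a minimal Python sketch of how those training pairs are formed. It uses toy whitespace tokenization; real models operate on subword tokens:

```python
# Toy whitespace tokenization; real models use subword tokenizers.
tokens = "I love transformers".split()

# Each prefix predicts the next token; the model never sees the future.
for t in range(1, len(tokens)):
    context, target = tokens[:t], tokens[t]
    print(f"{' '.join(context)!r} -> {target!r}")

# Output:
# 'I' -> 'love'
# 'I love' -> 'transformers'
```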


2️⃣ Masked Language Modeling (MLM) — The Detective

Used by BERT-style models. Instead of predicting the next word, it learns to fill in missing ones.

Example:

“I [MASK] transformers.”

The model must predict “love” by looking at the entire sentence — both left and right context. This makes MLM great at understanding language (classification, QA, etc.), but weak at generating text, since it is trained to fill in isolated blanks rather than to produce words one after another.
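
Below is a minimal sketch of the masking step, assuming whitespace tokens and a plain 15% masking rate. (Real BERT preprocessing is slightly richer: some selected tokens are replaced with random words or left unchanged.)

```python
import random

random.seed(1)  # reproducibility for this toy example
tokens = "language models learn from raw text".split()

n_mask = max(1, round(0.15 * len(tokens)))           # mask ~15% of tokens
mask_idx = set(random.sample(range(len(tokens)), n_mask))

masked = ["[MASK]" if i in mask_idx else t for i, t in enumerate(tokens)]
targets = {i: tokens[i] for i in mask_idx}           # loss is computed only here

print(" ".join(masked))
print(targets)
```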


3️⃣ Denoising Autoencoding — The Repairer

Used by T5, BART, and FLAN-T5 models. Here, the model learns to reconstruct corrupted input, like restoring missing or shuffled words.

Example:

Input: “I ___ transformers, don’t I?”
Output: “I love transformers, don’t I?”

This makes the model flexible — it learns to both understand and generate coherent text.
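
Here is a small sketch of span corruption in the style of T5, where `<extra_id_0>` is a sentinel token standing in for the removed span. The sentence and indices are illustrative, not T5’s actual preprocessing:

```python
tokens = "I love transformers don't I".split()

# Replace the span ["love"] with a sentinel; the target spells out
# what the sentinel hid.
start, length = 1, 1
corrupted = tokens[:start] + ["<extra_id_0>"] + tokens[start + length:]
target = ["<extra_id_0>"] + tokens[start:start + length]

print(" ".join(corrupted))  # I <extra_id_0> transformers don't I
print(" ".join(target))     # <extra_id_0> love
```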


Why It Works This Way

Each objective defines what kind of context the model uses:

  • CLM: sees only the past → unidirectional → good for generation.
  • MLM: sees past + future → bidirectional → good for understanding.
  • Denoising: learns both directions and sequence repair → balanced for general tasks.

That’s why GPT excels at dialogue, while BERT shines at comprehension. They simply learned by playing different games.
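
One concrete way to see the difference is through attention masks. The NumPy sketch below builds the lower-triangular mask a causal model uses next to the all-ones mask a bidirectional model uses, where 1 means a position is visible and 0 means it is hidden:

```python
import numpy as np

T = 4  # sequence length

causal = np.tril(np.ones((T, T), dtype=int))  # CLM: row t sees columns <= t
bidirectional = np.ones((T, T), dtype=int)    # MLM: every position sees all

print(causal)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```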

How It Fits in ML Thinking

Modeling objectives are the heart of self-supervised learning — where models teach themselves by predicting parts of data from other parts. It’s what allows LLMs to learn from raw, unlabeled text — billions of tokens without a single human-annotated label. This makes them both scalable and surprisingly general.

📐 Step 3: Mathematical Foundation

Causal Language Modeling (CLM)

The model maximizes the probability of the next token given all previous ones:

$$ P(w_t | w_{<t}) = \text{softmax}(W h_t) $$
  • $w_t$: token to predict
  • $w_{<t}$: all previous tokens
  • $h_t$: model’s hidden representation
  • $W$: output projection weights
The training goal is to minimize the negative log-likelihood loss:

$$ \mathcal{L}_{CLM} = -\sum_t \log P(w_t | w_{<t}) $$

CLM teaches the model to speak forward in time. It learns the rhythm of language — how each word probabilistically follows another.
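
A toy NumPy sketch of these two equations, with made-up dimensions and random values standing in for a trained model’s hidden state and projection weights:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 5, 3

h_t = rng.normal(size=hidden_dim)              # hidden state after "I love"
W = rng.normal(size=(vocab_size, hidden_dim))  # output projection weights

logits = W @ h_t
probs = np.exp(logits - logits.max())          # numerically stable softmax
probs /= probs.sum()

true_token = 2                                 # pretend id of "transformers"
loss = -np.log(probs[true_token])              # one term of L_CLM
print(probs, loss)
```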

Masked Language Modeling (MLM)

Instead of next-word prediction, MLM hides some tokens (usually about 15%) and trains the model to guess them:

$$ \mathcal{L}_{MLM} = -\sum_{m \in M} \log P(w_m | w_{\setminus M}) $$

  • $M$: set of masked tokens
  • $w_m$: actual masked token
  • $w_{\setminus M}$: unmasked context tokens

The loss focuses only on the masked positions.

Think of it like “fill in the blanks” puzzles — the model becomes a master at reading comprehension and context reconstruction.
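
The same idea as a toy NumPy sketch: the model produces a distribution at every position, but only the masked positions in $M$ contribute to the loss. The probabilities and token ids below are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, vocab_size = 4, 5

# Toy per-position distributions P(w | unmasked context).
probs = rng.dirichlet(np.ones(vocab_size), size=seq_len)
targets = np.array([1, 3, 0, 2])   # true token ids at each position
masked = [1]                       # M: only position 1 was masked

loss = -sum(np.log(probs[m, targets[m]]) for m in masked)
print(loss)
```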

Denoising Objective (T5/BART)

A more general objective: corrupt input text and train the model to reconstruct it.

$$ \mathcal{L}_{denoise} = -\sum_i \log P(x_i | \tilde{x}) $$

  • $x_i$: original token
  • $\tilde{x}$: corrupted input

Corruptions include deletion, permutation, or span-masking.

This teaches models to understand structure and recover meaning — a skill closer to human inference (“What’s missing here?”).
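
A quick sketch of those three corruption types applied to a toy token list (illustrative only, not any specific library’s preprocessing):

```python
import random

random.seed(0)
x = "I love transformers".split()

deletion = x[:1] + x[2:]                      # drop the token "love"
permutation = random.sample(x, len(x))        # shuffle token order
span_masked = [x[0], "<extra_id_0>"] + x[2:]  # hide a span behind a sentinel

print(deletion)      # ['I', 'transformers']
print(permutation)   # some shuffled order
print(span_masked)   # ['I', '<extra_id_0>', 'transformers']
```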

🧠 Step 4: Assumptions or Key Ideas

  • Language follows statistical regularities — the next word depends on context.
  • Masking or corruption provides natural supervision — no labels required.
  • Predictive objectives encourage models to learn grammar, semantics, and world knowledge simultaneously.
⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths

  • Enables unsupervised learning from raw text.
  • Builds internal representations transferable across tasks.
  • Flexible across architectures (encoder, decoder, seq2seq).

⚠️ Limitations

  • CLM lacks access to future context, so it is weaker at tasks that need bidirectional understanding.
  • MLM can’t generate coherent long text.
  • Objectives are sensitive to data bias — predicting next words amplifies frequent patterns.

⚖️ Trade-offs

  • Bidirectional models (MLM) = better comprehension.
  • Autoregressive models (CLM) = better generation.
  • Seq2Seq (Denoising) = a balance of both worlds, ideal for instruction-tuned systems.

🚧 Step 6: Common Misunderstandings

  • “CLM and MLM learn the same thing.” ❌ No — CLM predicts the future; MLM predicts what’s missing.
  • “BERT can generate text like GPT.” ❌ It can’t — it sees the entire sentence at once, not sequentially.
  • “More masking = better learning.” ❌ Too much masking starves the model of context; too little gives no challenge.

🧩 Step 7: Mini Summary

🧠 What You Learned: Modeling objectives define how LLMs learn language — by predicting, filling, or repairing.

⚙️ How It Works: Each method teaches a model different language behaviors — generation (CLM), comprehension (MLM), or reconstruction (denoising).

🎯 Why It Matters: Choosing the right objective is what differentiates a storyteller model (GPT) from a reader model (BERT).