4.4. Evaluation & Interpretability
🪄 Step 1: Intuition & Motivation
- Core Idea: Once your Transformer model is trained, the next big question is —
“How well does it actually understand and generalize?”
In deep learning, training performance ≠ real intelligence. A model might achieve near-zero loss but still just be memorizing patterns instead of understanding them.
That’s why we evaluate models not just with numbers (like loss or perplexity) but also by peeking inside — visualizing how attention behaves and probing what the model knows.
In other words, evaluation tells you “how well it performs,” while interpretability tells you “why it performs that way.”
- Simple Analogy: Training a Transformer is like teaching a student.
- Evaluation checks their test score.
- Interpretability checks how they think — do they really understand, or just memorize answers?
🌱 Step 2: Core Concept
There are three key pillars to understanding evaluation and interpretability in Transformers:
- Quantitative Evaluation (Loss, Perplexity)
- Attention Visualization (Seeing Focus Patterns)
- Probing Tasks (Testing Linguistic and Semantic Knowledge)
1️⃣ Quantitative Evaluation — Perplexity and Loss
🔹 Cross-Entropy Loss
Language models predict probabilities for each token given the context. Their objective is to minimize the cross-entropy loss:
$$ L = -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_{<i}) $$
Here:
- $P(w_i | w_{<i})$ = model’s predicted probability for the correct token.
- $N$ = total tokens.
Lower loss = better prediction accuracy (model assigns higher probability to the right next word).
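To make this concrete, here is a minimal PyTorch sketch of the same computation. The tensor shapes and token ids are illustrative assumptions; in practice the logits come from the Transformer's output head.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: 4 target tokens, vocabulary of 6 tokens.
logits = torch.randn(4, 6)            # (N target tokens, vocab_size)
targets = torch.tensor([2, 5, 0, 3])  # correct next-token ids w_i

# Cross-entropy averages -log P(w_i | w_<i) over the N target tokens,
# where P is softmax(logits).
loss = F.cross_entropy(logits, targets)
print(f"cross-entropy loss L = {loss.item():.4f}")
```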
🔹 Perplexity (PPL)
Perplexity is the exponentiated version of cross-entropy loss:
$$ \text{PPL} = e^{L} $$
It represents how “confused” the model is — the average number of possible choices it considers plausible per prediction.
- Low PPL (close to 1): model is confident and correct.
- High PPL: model is uncertain or guessing widely.
Example: If a model predicts “The cat sat on the ___”:
- High PPL → it’s unsure between “table,” “floor,” “roof.”
- Low PPL → it strongly favors “mat.”
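Because PPL is just the exponentiated loss, it can be computed directly from a model's average cross-entropy. A hedged sketch using a Hugging Face causal LM (the checkpoint name and text are placeholder assumptions; any causal LM works the same way):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The cat sat on the mat."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the average
    # next-token cross-entropy loss over the sequence.
    outputs = model(**inputs, labels=inputs["input_ids"])

ppl = torch.exp(outputs.loss)
print(f"loss = {outputs.loss.item():.3f}, perplexity = {ppl.item():.2f}")
```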
2️⃣ Attention Visualization — Peeking Inside the Mind
Transformers compute attention scores for every token pair — revealing which words each token “pays attention to.”
Visualizing these scores helps us understand:
- Where focus lies (e.g., subject-object links, syntactic dependencies).
- Which tokens influence others most.
- How attention evolves across layers.
Example:
Sentence:
“The dog that chased the cat was fast.”
You may find:
- Early layers attend to nearby words (“dog ↔ chased”).
- Middle layers form syntactic structure (“that ↔ was”).
- Deeper layers encode semantics (“dog ↔ fast”).
These patterns show hierarchical understanding — how the model builds meaning step-by-step.
Visualization Tools:
- bertviz
- transformers’ built-in attention outputs
- Heatmaps (e.g., token-by-token matrix of attention weights)
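A minimal sketch of the heatmap approach, assuming a BERT-style checkpoint; the layer and head indices are arbitrary illustrative choices, and in practice you would inspect several:

```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "The dog that chased the cat was fast."
inputs = tokenizer(sentence, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq_len, seq_len).
layer, head = 5, 3  # arbitrary choice; compare several layers and heads
attn = outputs.attentions[layer][0, head].numpy()

plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title(f"Attention weights: layer {layer}, head {head}")
plt.colorbar()
plt.tight_layout()
plt.show()
```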
3️⃣ Probing Tasks — Testing What the Model Knows
Sometimes, we need to go beyond “attention maps” and directly test what knowledge is encoded. That’s where probing tasks come in — small, focused tests designed to reveal specific competencies.
Types of Probes:
| Probe Type | Tests For | Example Task |
|---|---|---|
| Syntactic | Grammar understanding | Predict part-of-speech tags or dependency arcs |
| Semantic | Meaning comprehension | Classify semantic roles or sentence similarity |
| World Knowledge | Factual memory | Answer “Who wrote Hamlet?” |
| Reasoning | Logical relationships | Identify cause-effect or entailment |
How It Works: You freeze the pretrained model and train a lightweight classifier on its embeddings. If the classifier performs well, the embeddings already contain the needed information — showing that the model implicitly learned it during pretraining.
Insight: A model may not explicitly “know” grammar but still encode it geometrically in its embeddings.
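A hedged sketch of this recipe, using frozen BERT embeddings and a lightweight scikit-learn classifier. The tiny tense-detection dataset is an illustrative stand-in for a real probing corpus, and a real probe would evaluate on a held-out split:

```python
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel

# Frozen encoder: we never update its weights, only read its embeddings.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

# Toy probing data (illustrative): does the sentence describe a past event?
sentences = ["The dog chased the cat.", "She will travel tomorrow.",
             "They finished the project.", "He is cooking dinner."]
labels = np.array([1, 0, 1, 0])  # 1 = past, 0 = not past

def embed(sentence):
    """Mean-pooled hidden states from the frozen model."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

X = np.stack([embed(s) for s in sentences])

# Lightweight probe: if a simple linear classifier does well on held-out data,
# the frozen embeddings already encode the tense distinction.
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("train accuracy:", probe.score(X, labels))
```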
📐 Step 3: Mathematical Foundation
Cross-Entropy and Perplexity Relationship
Cross-Entropy:
$$ L = -\frac{1}{N}\sum_{i=1}^{N} \log P(w_i | w_{<i}) $$
Perplexity:
$$ \text{PPL} = e^{L} $$
If the model predicts perfectly (probability = 1), then $L = 0$ and $\text{PPL} = 1$. If predictions are random, $L$ grows, and $\text{PPL}$ increases exponentially.
Interpretation: PPL measures how many equally likely guesses the model effectively makes. A PPL of 10 means “on average, the model acts like there are 10 plausible next words.”
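A quick numeric check of this relationship, with values chosen purely for illustration:

```python
import math

# If the model assigns probability 1/10 to each correct token on average,
# the loss is ln(10) and perplexity recovers the "10 plausible choices".
avg_prob = 1 / 10
loss = -math.log(avg_prob)   # L ≈ 2.303
ppl = math.exp(loss)         # PPL = 10
print(f"L = {loss:.3f}, PPL = {ppl:.1f}")
```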
🧠 Step 4: Key Ideas
- Loss and Perplexity quantify prediction confidence.
- Attention Visualization reveals the structure of model reasoning.
- Probing Tasks check if the model’s hidden states encode linguistic and semantic understanding.
- Interpretability Audits help detect overfitting or memorization, ensuring genuine generalization.
⚖️ Step 5: Strengths, Limitations & Trade-offs
- Perplexity provides a clear quantitative metric for comparison.
- Attention maps offer intuitive interpretability.
- Probing tasks connect internal representations to linguistic knowledge.
- Perplexity doesn’t reflect meaningful generation quality (a low PPL model can still be dull).
- Attention ≠ explanation — weights show focus, not causality.
- Probes may measure correlation, not true understanding.
🚧 Step 6: Common Misunderstandings
- “Low perplexity = good model.” Not always — the model might be overfitting. Use validation and interpretability checks.
- “Attention visualizations explain decisions.” Attention shows correlations, not reasoning; some heads attend in patterns that have little effect on the output.
- “Probing proves understanding.” Probes can show the presence of knowledge, not whether it’s used during inference.
🧩 Step 7: Mini Summary
🧠 What You Learned: Evaluation tells you how well a Transformer performs; interpretability tells you what it’s actually doing under the hood.
⚙️ How It Works: Perplexity quantifies predictive confidence, attention maps visualize focus, and probing tasks test implicit knowledge.
🎯 Why It Matters: In top-tier ML practice, you’re judged not just by accuracy — but by whether your model generalizes, avoids bias, and can be understood and trusted.