4.4. Evaluation & Interpretability


🪄 Step 1: Intuition & Motivation

  • Core Idea: Once your Transformer model is trained, the next big question is —

“How well does it actually understand and generalize?”

In deep learning, training performance ≠ real intelligence. A model might achieve near-zero loss but still just be memorizing patterns instead of understanding them.

That’s why we evaluate models not just with numbers (like loss or perplexity) but also by peeking inside — visualizing how attention behaves and probing what the model knows.

In other words, evaluation tells you “how well it performs,” while interpretability tells you “why it performs that way.”


  • Simple Analogy: Training a Transformer is like teaching a student.
  • Evaluation checks their test score.
  • Interpretability checks how they think — do they really understand, or just memorize answers?

🌱 Step 2: Core Concept

There are three key pillars to understanding evaluation and interpretability in Transformers:

  1. Quantitative Evaluation (Loss, Perplexity)
  2. Attention Visualization (Seeing Focus Patterns)
  3. Probing Tasks (Testing Linguistic and Semantic Knowledge)

1️⃣ Quantitative Evaluation — Perplexity and Loss

🔹 Cross-Entropy Loss

Language models predict probabilities for each token given the context. Their objective is to minimize the cross-entropy loss:

$$ L = -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_{<i}) $$

Here:

  • $P(w_i | w_{<i})$ = model’s predicted probability for the correct token.
  • $N$ = total tokens.

Lower loss = better prediction accuracy (model assigns higher probability to the right next word).
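
As a minimal sketch in plain Python (the probabilities below are made up for illustration), the loss is just the average negative log-probability the model assigned to each correct token:

```python
import math

# Hypothetical probabilities the model assigned to each correct next token
# (in a real model these come from a softmax over the whole vocabulary).
correct_token_probs = [0.70, 0.45, 0.90, 0.20]

# Cross-entropy loss: average negative log-probability of the correct tokens.
loss = -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)

print(f"cross-entropy loss L = {loss:.3f}")  # lower is better
```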

🔹 Perplexity (PPL)

Perplexity is the exponentiated version of cross-entropy loss:

$$ \text{PPL} = e^{L} $$

It represents how “confused” the model is — the average number of possible choices it considers plausible per prediction.

  • Low PPL (close to 1): model is confident and correct.
  • High PPL: model is uncertain or guessing widely.

Example: If a model predicts “The cat sat on the ___”:

  • High PPL → it’s unsure between “table,” “floor,” “roof.”
  • Low PPL → it strongly favors “mat.”

Perplexity measures how “surprised” your model feels by real data. A good model feels less surprised — it “expects” the right tokens.
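
Continuing the sketch above (still with made-up probabilities), perplexity is simply the exponential of that average loss, so confident predictions push it toward 1 while wide guessing inflates it:

```python
import math

def perplexity(correct_token_probs):
    # Average negative log-probability, then exponentiate: PPL = e^L.
    loss = -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)
    return math.exp(loss)

confident = [0.90, 0.85, 0.95, 0.88]  # model strongly favors the right tokens
guessing  = [0.10, 0.05, 0.12, 0.08]  # model spreads probability widely

print(perplexity(confident))  # ~1.1  -> low "surprise"
print(perplexity(guessing))   # ~12   -> acts like ~12 plausible choices per step
```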

2️⃣ Attention Visualization — Peeking Inside the Mind

Transformers compute attention scores for every token pair — revealing which words each token “pays attention to.”

Visualizing these scores helps us understand:

  • Where focus lies (e.g., subject-object links, syntactic dependencies).
  • Which tokens influence others most.
  • How attention evolves across layers.

Example:

Sentence:

“The dog that chased the cat was fast.”

You may find:

  • Early layers attend to nearby words (“dog ↔ chased”).
  • Middle layers form syntactic structure (“that ↔ was”).
  • Deeper layers encode semantics (“dog ↔ fast”).

These patterns show hierarchical understanding — how the model builds meaning step-by-step.

Visualization Tools:

  • bertviz
  • transformers’ built-in attention outputs (e.g., passing output_attentions=True)
  • Heatmaps (e.g., token-by-token matrix of attention weights)

Attention visualization is like using an MRI scanner for your model’s brain — it shows which parts light up when processing each word.
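
As a sketch of how you might extract and plot these weights (assuming the Hugging Face transformers library, PyTorch, and matplotlib are installed; bert-base-uncased is just an example checkpoint):

```python
import matplotlib.pyplot as plt
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

sentence = "The dog that chased the cat was fast."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
layer, head = 0, 0
attn = outputs.attentions[layer][0, head].numpy()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title(f"Attention heatmap: layer {layer}, head {head}")
plt.colorbar()
plt.tight_layout()
plt.show()
```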

3️⃣ Probing Tasks — Testing What the Model Knows

Sometimes, we need to go beyond “attention maps” and directly test what knowledge is encoded. That’s where probing tasks come in — small, focused tests designed to reveal specific competencies.

Types of Probes:

| Probe Type | Tests For | Example Task |
|---|---|---|
| Syntactic | Grammar understanding | Predict part-of-speech tags or dependency arcs |
| Semantic | Meaning comprehension | Classify semantic roles or sentence similarity |
| World Knowledge | Factual memory | Answer “Who wrote Hamlet?” |
| Reasoning | Logical relationships | Identify cause-effect or entailment |

How It Works: You freeze the pretrained model and train a lightweight classifier on its embeddings. If the classifier performs well, the embeddings already contain the needed information — showing that the model implicitly learned it during pretraining.
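
A minimal sketch of such a probe (the embeddings and labels here are random stand-ins for hidden states you would actually extract from the frozen model; scikit-learn’s logistic regression plays the role of the lightweight classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: frozen sentence embeddings from the pretrained model
# and the linguistic labels we want to probe for (e.g. a syntactic property).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))  # 500 examples, 768-dim hidden states
labels = rng.integers(0, 2, size=500)     # binary probe labels

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0
)

# The pretrained model stays frozen; only this lightweight probe is trained.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# High accuracy would suggest the property is linearly readable from the embeddings.
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```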

Insight: A model may not explicitly “know” grammar but still encode it geometrically in its embeddings.

Probing tasks are like mini pop quizzes — you’re not testing memorization but checking if the student has conceptual understanding hidden beneath the surface.

📐 Step 3: Mathematical Foundation

Cross-Entropy and Perplexity Relationship

Cross-Entropy:

$$ L = -\frac{1}{N}\sum_{i=1}^{N} \log P(w_i | w_{<i}) $$

Perplexity:

$$ \text{PPL} = e^{L} $$

If the model predicts perfectly (probability = 1), then $L = 0$ and $\text{PPL} = 1$. If predictions are random, $L$ grows, and $\text{PPL}$ increases exponentially.

Interpretation: PPL measures how many equally likely guesses the model effectively makes. A PPL of 10 means “on average, the model acts like there are 10 plausible next words.”

Perplexity is the “effective vocabulary size” the model thinks it must choose from each step.
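
A quick numerical check of this interpretation: if the model spread its probability uniformly over V candidate tokens at every step, its perplexity would be exactly V.

```python
import math

# Uniform guessing over V candidates gives per-token loss log(V), so PPL = V.
for V in (2, 10, 100):
    loss = -math.log(1.0 / V)
    print(f"V = {V:4d}  ->  PPL = {math.exp(loss):.1f}")
```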

🧠 Step 4: Key Ideas

  • Loss and Perplexity quantify prediction confidence.
  • Attention Visualization reveals the structure of model reasoning.
  • Probing Tasks check if the model’s hidden states encode linguistic and semantic understanding.
  • Interpretability Audits help detect overfitting or memorization, ensuring genuine generalization.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Perplexity provides a clear quantitative metric for comparison.
  • Attention maps offer intuitive interpretability.
  • Probing tasks connect internal representations to linguistic knowledge.

Limitations:

  • Perplexity doesn’t reflect meaningful generation quality (a low-PPL model can still be dull).
  • Attention ≠ explanation — weights show focus, not causality.
  • Probes may measure correlation, not true understanding.

Metrics give numbers, interpretability gives insight. The two must go hand-in-hand: evaluation tells you what, interpretability tells you why. A great model isn’t just accurate — it’s understandably accurate.

🚧 Step 6: Common Misunderstandings

  • “Low perplexity = good model.” Not always — the model might be overfitting. Use validation and interpretability checks.
  • “Attention visualizations explain decisions.” Attention shows correlations, not reasoning. Some heads produce patterns with no clear functional role.
  • “Probing proves understanding.” Probes can show the presence of knowledge, not whether it’s used during inference.

🧩 Step 7: Mini Summary

🧠 What You Learned: Evaluation tells you how well a Transformer performs; interpretability tells you what it’s actually doing under the hood.

⚙️ How It Works: Perplexity quantifies predictive confidence, attention maps visualize focus, and probing tasks test implicit knowledge.

🎯 Why It Matters: In top-tier ML practice, you’re judged not just by accuracy — but by whether your model generalizes, avoids bias, and can be understood and trusted.
