4.3. BLEU, ROUGE & Semantic Metrics — Evaluating Generations
🪄 Step 1: Intuition & Motivation
- Core Idea: When an LLM generates text — say, translating a sentence or summarizing an article — we need a way to quantify how good its output is. But evaluating text is tricky: there’s rarely just one right answer.
That’s where metrics like BLEU, ROUGE, and BERTScore come in — they help us approximate “goodness” by comparing the model’s output with reference human responses.
- Simple Analogy: Imagine you ask several students to summarize a story. Even if their words differ, some summaries capture the same meaning — others don’t. BLEU and ROUGE are like teachers counting how many overlapping phrases they used, while BERTScore is a teacher who understands meaning instead of just word matching.
🌱 Step 2: Core Concept
There are three generations of text evaluation metrics — each improving on the last:
- BLEU → counts exact word overlaps (precision).
- ROUGE → counts how much of the reference is covered (recall).
- BERTScore & friends → compare meanings using embeddings, not words.
Let’s dive deeper into each.
1️⃣ BLEU — Counting Word Precision
Purpose: Originally designed for machine translation.
BLEU (Bilingual Evaluation Understudy) measures n-gram precision — how many words or word sequences (n-grams) from the reference appear in the model’s output.
Formula:
$$ BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $$
Where:
- ( p_n ): n-gram precision (e.g., unigram, bigram).
- ( w_n ): weight for each n-gram order (commonly equal).
- ( BP ): brevity penalty — discourages overly short outputs.
Example:
- Reference: “The cat sat on the mat.”
- Prediction: “The cat is on the mat.”
- Overlapping n-grams: unigrams “the”, “cat”, “on”, “the”, “mat”, plus bigrams such as “the cat”, “on the”, “the mat” → relatively high BLEU score.
Interpretation: BLEU ≈ what fraction of the model’s n-grams also appear in the reference (a precision-style score).
If model output is too short, it gets penalized:
$$ BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases} $$
where ( c ) = candidate length, ( r ) = reference length.
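A quick way to sanity-check this example is NLTK’s `sentence_bleu` (a minimal sketch; the exact number depends on tokenization and the smoothing method chosen):

```python
# Minimal BLEU sketch with NLTK (assumes `pip install nltk`).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()   # tokenized reference
candidate = "the cat is on the mat".split()    # tokenized model output

# sentence_bleu takes a list of tokenized references and one tokenized candidate.
# Smoothing avoids a zero score when a higher-order n-gram has no overlap.
score = sentence_bleu(
    [reference],
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),          # equal weights for 1- to 4-grams
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")
```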
2️⃣ ROUGE — Measuring Recall and Coverage
Purpose: Designed for summarization tasks.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures how much of the reference content is covered by the generated text.
Main Variants:
- ROUGE-N: n-gram recall.
- ROUGE-L: longest common subsequence (LCS) — captures fluency.
- ROUGE-W: weighted LCS — emphasizes longer matches.
Example:
- Reference summary: “Climate change affects rainfall patterns globally.”
- Generated summary: “Global rainfall patterns are changing due to climate.”
- Decent ROUGE-1, since content words like “climate”, “rainfall”, and “patterns” are covered, and a moderate ROUGE-L, since the in-order subsequence “rainfall patterns” is credited even though the phrasing is reordered.
Interpretation: ROUGE = how completely the model covers the content of the reference.
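Here is a minimal sketch with the `rouge-score` package (one common implementation; other toolkits differ slightly in tokenization and stemming):

```python
# Minimal ROUGE sketch (assumes `pip install rouge-score`).
from rouge_score import rouge_scorer

reference = "Climate change affects rainfall patterns globally."
generated = "Global rainfall patterns are changing due to climate."

# Stemming lets "changing" match "change" and "global" match "globally".
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)    # score(target, prediction)

for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f}  R={s.recall:.2f}  F1={s.fmeasure:.2f}")
```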
3️⃣ BERTScore — Understanding Meaning Beyond Words
Purpose: Fixes BLEU/ROUGE’s weakness — surface-level word matching.
Instead of counting words, BERTScore uses embeddings from pretrained language models (like BERT) to measure semantic similarity between generated and reference tokens.
How It Works:
- Represent each token as an embedding vector.
- For each token in the prediction, compute cosine similarity with its closest-matching token in the reference.
- Average similarities to produce a final score.
Benefits:
- Captures synonyms and paraphrases (“happy” ≈ “joyful”).
- Correlates much better with human judgment.
- Language-agnostic (depends on embedding model).
Example:
- Prediction: “The boy leaped over the fence.”
- Reference: “The kid jumped the barrier.”
- BLEU: low (few exact word overlaps).
- BERTScore: high (the sentences are semantically equivalent).
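The same example can be scored with the `bert-score` package (a sketch; the underlying model is downloaded on first use, and scores vary with the model chosen):

```python
# Minimal BERTScore sketch (assumes `pip install bert-score`).
from bert_score import score

candidates = ["The boy leaped over the fence."]
references = ["The kid jumped the barrier."]

# The library returns precision, recall, and F1 tensors (one value per pair).
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```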
📐 Step 3: Mathematical & Conceptual Foundation
BLEU in Detail
- ( p_n ) = modified n-gram precision (counts are clipped against the reference, so repeating the same word cannot inflate the score).
- ( BP ) = brevity penalty.
- ( N ) = maximum n-gram order, typically 4.
Interpretation: BLEU ≈ geometric mean of the n-gram precisions × a length normalization (the brevity penalty).
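To see the two ingredients in isolation, here is a from-scratch sketch (whitespace tokenization assumed; real implementations such as sacreBLEU add careful tokenization and smoothing):

```python
# From-scratch sketch of BLEU's two ingredients: clipped n-gram precision and the brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each candidate n-gram count by its count in the reference,
    # so repeating a word cannot inflate precision.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

def brevity_penalty(candidate, reference):
    c, r = len(candidate), len(reference)
    return 1.0 if c > r else math.exp(1 - r / max(c, 1))

cand = "the cat is on the mat".split()
ref = "the cat sat on the mat".split()
print(modified_precision(cand, ref, 1))   # unigram precision: 5/6
print(modified_precision(cand, ref, 2))   # bigram precision: 3/5
print(brevity_penalty(cand, ref))         # 1.0, since lengths match
```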
ROUGE-L (Longest Common Subsequence)
ROUGE-L computes:
$$ \text{ROUGE-L} = F_{\beta} = \frac{(1+\beta^2) \times \text{R}_{LCS} \times \text{P}_{LCS}}{\text{R}_{LCS} + \beta^2 \times \text{P}_{LCS}} $$
Where:
- ( \text{R}_{LCS} ): LCS recall.
- ( \text{P}_{LCS} ): LCS precision.
It balances both coverage and fluency — making it robust for summarization.
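A from-scratch sketch of ROUGE-L using the classic LCS dynamic program is shown below (the β value is only illustrative; implementations differ in how heavily they weight recall):

```python
# From-scratch sketch of ROUGE-L: LCS length, then recall, precision, and F-measure.
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, candidate, beta=1.2):
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    recall, precision = lcs / len(ref), lcs / len(cand)
    if recall == 0 or precision == 0:
        return 0.0
    return ((1 + beta**2) * recall * precision) / (recall + beta**2 * precision)

print(rouge_l("climate change affects rainfall patterns globally",
              "global rainfall patterns are changing due to climate"))
```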
BERTScore Mathematics
Given embeddings ( E_{pred} ) and ( E_{ref} ):
$$ \text{BERTScore} = \frac{1}{|E_{pred}|}\sum_{e_i \in E_{pred}}\max_{e_j \in E_{ref}}\cos(e_i, e_j) $$
Each word in the generated text finds the most semantically similar word in the reference.
Range: raw scores cluster in a narrow, high band (often above 0.8 even for mediocre text), so implementations commonly offer baseline rescaling to make differences easier to interpret.
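To make the greedy matching concrete, here is a NumPy sketch of the formula above; the token embeddings are random stand-ins for what a contextual encoder like BERT would produce, and the real metric also computes recall and F1:

```python
# From-scratch sketch of the greedy matching in the BERTScore formula.
import numpy as np

rng = np.random.default_rng(0)
E_pred = rng.normal(size=(6, 768))   # one vector per predicted token (hypothetical)
E_ref = rng.normal(size=(5, 768))    # one vector per reference token (hypothetical)

# Normalize rows so dot products become cosine similarities.
E_pred /= np.linalg.norm(E_pred, axis=1, keepdims=True)
E_ref /= np.linalg.norm(E_ref, axis=1, keepdims=True)

sim = E_pred @ E_ref.T                     # pairwise cosine similarity matrix
score = sim.max(axis=1).mean()             # each predicted token greedily matches its best reference token
print(f"Greedy-match score: {score:.3f}")
```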
⚖️ Step 4: Strengths, Limitations & Trade-offs
✅ Strengths
- BLEU/ROUGE are simple, fast, and cheap to compute (no model needed, just string matching).
- BERTScore aligns closely with human intuition.
- Together, they provide a full view: surface accuracy + semantic depth.
⚠️ Limitations
- BLEU/ROUGE fail on paraphrases and creative outputs.
- BERTScore depends on the quality of the embedding model.
- Automatic metrics still miss nuance — tone, humor, reasoning, safety.
⚖️ Trade-offs
- BLEU = reproducible and simple but shallow.
- ROUGE = better recall but domain-dependent.
- BERTScore = semantically rich but computationally heavy.
🚧 Step 5: Common Misunderstandings
- “High BLEU means the model understands meaning.” ❌ It just matches words.
- “ROUGE is for translation too.” ❌ It’s mainly for summarization recall.
- “BERTScore replaces all other metrics.” ❌ It complements them — semantic ≠ stylistic.
🧩 Step 6: Mini Summary
🧠 What You Learned: BLEU and ROUGE measure word-level overlap, while BERTScore measures meaning-level alignment.
⚙️ How It Works: BLEU checks precision, ROUGE checks recall, and BERTScore checks semantic similarity via embeddings.
🎯 Why It Matters: Together, these metrics form the foundation for evaluating text generation — from literal accuracy to nuanced fluency and meaning.