4.3. BLEU, ROUGE & Semantic Metrics — Evaluating Generations
🪄 Step 1: Intuition & Motivation
- Core Idea: When an LLM generates text — say, translating a sentence or summarizing an article — we need a way to quantify how good its output is. But evaluating text is tricky: there’s rarely just one right answer.
That’s where metrics like BLEU, ROUGE, and BERTScore come in — they help us approximate “goodness” by comparing the model’s output with reference human responses.
- Simple Analogy: Imagine you ask several students to summarize a story. Even if their words differ, some summaries capture the same meaning — others don’t. BLEU and ROUGE are like teachers counting how many overlapping phrases they used, while BERTScore is a teacher who understands meaning instead of just word matching.
🌱 Step 2: Core Concept
There are three generations of text evaluation metrics — each improving on the last:
- BLEU → counts exact word overlaps (precision).
- ROUGE → counts how much of the reference is covered (recall).
- BERTScore & friends → compare meanings using embeddings, not words.
Let’s dive deeper into each.
1️⃣ BLEU — Counting Word Precision
Purpose: Originally designed for machine translation.
BLEU (Bilingual Evaluation Understudy) measures n-gram precision — how many words or word sequences (n-grams) from the reference appear in the model’s output.
Formula:
$$ BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $$
Where:
- ( p_n ): n-gram precision (e.g., unigram, bigram).
- ( w_n ): weight for each n-gram order (commonly equal).
- ( BP ): brevity penalty — discourages overly short outputs.
Example:
- Reference: “The cat sat on the mat.”
- Prediction: “The cat is on the mat.”
- Overlapping n-grams: unigrams “the”, “cat”, “on”, “the”, “mat”, plus bigrams such as “the cat”, “on the”, “the mat” → relatively high BLEU score.
Interpretation: BLEU ≈ what fraction of the model’s n-grams also appear in the reference (a precision-style score).
If model output is too short, it gets penalized:
$$ BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases} $$
where ( c ) = candidate length, ( r ) = reference length.
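A quick way to sanity-check this example is NLTK’s `sentence_bleu` (a minimal sketch; the exact number depends on tokenization and the smoothing method chosen):

```python
# Minimal BLEU sketch with NLTK (assumes `pip install nltk`).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()   # tokenized reference
candidate = "the cat is on the mat".split()    # tokenized model output

# sentence_bleu takes a list of tokenized references and one tokenized candidate.
# Smoothing avoids a zero score when a higher-order n-gram has no overlap.
score = sentence_bleu(
    [reference],
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),          # equal weights for 1- to 4-grams
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")
```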
2️⃣ ROUGE — Measuring Recall and Coverage
Purpose: Designed for summarization tasks.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures how much of the reference content is covered by the generated text.
Main Variants:
- ROUGE-N: n-gram recall.
- ROUGE-L: longest common subsequence (LCS) — captures fluency.
- ROUGE-W: weighted LCS — emphasizes longer matches.
Example:
- Reference summary: “Climate change affects rainfall patterns globally.”
- Generated summary: “Global rainfall patterns are changing due to climate.”
- Decent ROUGE-1, since content words like “climate”, “rainfall”, and “patterns” are covered, and a moderate ROUGE-L, since the in-order subsequence “rainfall patterns” is credited even though the phrasing is reordered.
Interpretation: ROUGE = how completely the model covers the content of the reference.
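Here is a minimal sketch with the `rouge-score` package (one common implementation; other toolkits differ slightly in tokenization and stemming):

```python
# Minimal ROUGE sketch (assumes `pip install rouge-score`).
from rouge_score import rouge_scorer

reference = "Climate change affects rainfall patterns globally."
generated = "Global rainfall patterns are changing due to climate."

# Stemming lets "changing" match "change" and "global" match "globally".
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)    # score(target, prediction)

for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f}  R={s.recall:.2f}  F1={s.fmeasure:.2f}")
```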
3️⃣ BERTScore — Understanding Meaning Beyond Words
Purpose: Fixes BLEU/ROUGE’s weakness — surface-level word matching.
Instead of counting words, BERTScore uses embeddings from pretrained language models (like BERT) to measure semantic similarity between generated and reference tokens.
How It Works:
- Represent each token as an embedding vector.
- For each token in the prediction, compute cosine similarity with its closest-matching token in the reference.
- Average similarities to produce a final score.
Benefits:
- Captures synonyms and paraphrases (“happy” ≈ “joyful”).
- Correlates much better with human judgment.
- Language-agnostic (depends on embedding model).
Example:
- Prediction: “The boy leaped over the fence.”
- Reference: “The kid jumped the barrier.”
- BLEU: low (few exact word overlaps).
- BERTScore: high (the sentences are semantically equivalent).
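The same example can be scored with the `bert-score` package (a sketch; the underlying model is downloaded on first use, and scores vary with the model chosen):

```python
# Minimal BERTScore sketch (assumes `pip install bert-score`).
from bert_score import score

candidates = ["The boy leaped over the fence."]
references = ["The kid jumped the barrier."]

# The library returns precision, recall, and F1 tensors (one value per pair).
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```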
📐 Step 3: Mathematical & Conceptual Foundation
BLEU in Detail
- ( p_n ) = modified n-gram precision (counts are clipped against the reference, so repeating the same word cannot inflate the score).
- ( BP ) = brevity penalty.
- ( N ) = maximum n-gram order, typically 4.
Interpretation: BLEU ≈ geometric mean of the n-gram precisions × a length normalization (the brevity penalty).
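To see the two ingredients in isolation, here is a from-scratch sketch (whitespace tokenization assumed; real implementations such as sacreBLEU add careful tokenization and smoothing):

```python
# From-scratch sketch of BLEU's two ingredients: clipped n-gram precision and the brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each candidate n-gram count by its count in the reference,
    # so repeating a word cannot inflate precision.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

def brevity_penalty(candidate, reference):
    c, r = len(candidate), len(reference)
    return 1.0 if c > r else math.exp(1 - r / max(c, 1))

cand = "the cat is on the mat".split()
ref = "the cat sat on the mat".split()
print(modified_precision(cand, ref, 1))   # unigram precision: 5/6
print(modified_precision(cand, ref, 2))   # bigram precision: 3/5
print(brevity_penalty(cand, ref))         # 1.0, since lengths match
```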
ROUGE-L (Longest Common Subsequence)
ROUGE-L computes:
$$ \text{ROUGE-L} = F_{\beta} = \frac{(1+\beta^2) \times \text{R}_{LCS} \times \text{P}_{LCS}}{\text{R}_{LCS} + \beta^2 \times \text{P}_{LCS}} $$
Where:
- ( \text{R}_{LCS} ): LCS recall.
- ( \text{P}_{LCS} ): LCS precision.
It balances both coverage and fluency — making it robust for summarization.
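A from-scratch sketch of ROUGE-L using the classic LCS dynamic program is shown below (the β value is only illustrative; implementations differ in how heavily they weight recall):

```python
# From-scratch sketch of ROUGE-L: LCS length, then recall, precision, and F-measure.
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, candidate, beta=1.2):
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    recall, precision = lcs / len(ref), lcs / len(cand)
    if recall == 0 or precision == 0:
        return 0.0
    return ((1 + beta**2) * recall * precision) / (recall + beta**2 * precision)

print(rouge_l("climate change affects rainfall patterns globally",
              "global rainfall patterns are changing due to climate"))
```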
BERTScore Mathematics
Given embeddings ( E_{pred} ) and ( E_{ref} ):
$$ \text{BERTScore} = \frac{1}{|E_{pred}|}\sum_{e_i \in E_{pred}}\max_{e_j \in E_{ref}}\cos(e_i, e_j) $$
Each word in the generated text finds the most semantically similar word in the reference.
Range: raw scores cluster in a narrow, high band (often above 0.8 even for mediocre text), so implementations commonly offer baseline rescaling to make differences easier to interpret.
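To make the greedy matching concrete, here is a NumPy sketch of the formula above; the token embeddings are random stand-ins for what a contextual encoder like BERT would produce, and the real metric also computes recall and F1:

```python
# From-scratch sketch of the greedy matching in the BERTScore formula.
import numpy as np

rng = np.random.default_rng(0)
E_pred = rng.normal(size=(6, 768))   # one vector per predicted token (hypothetical)
E_ref = rng.normal(size=(5, 768))    # one vector per reference token (hypothetical)

# Normalize rows so dot products become cosine similarities.
E_pred /= np.linalg.norm(E_pred, axis=1, keepdims=True)
E_ref /= np.linalg.norm(E_ref, axis=1, keepdims=True)

sim = E_pred @ E_ref.T                     # pairwise cosine similarity matrix
score = sim.max(axis=1).mean()             # each predicted token greedily matches its best reference token
print(f"Greedy-match score: {score:.3f}")
```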
⚖️ Step 4: Strengths, Limitations & Trade-offs
✅ Strengths
- BLEU/ROUGE are simple, fast, and cheap to compute (no model needed, just string matching).
- BERTScore aligns closely with human intuition.
- Together, they provide a full view: surface accuracy + semantic depth.
⚠️ Limitations
- BLEU/ROUGE fail on paraphrases and creative outputs.
- BERTScore depends on the quality of the embedding model.
- Automatic metrics still miss nuance — tone, humor, reasoning, safety.
⚖️ Trade-offs
- BLEU = reproducible and simple but shallow.
- ROUGE = better recall but domain-dependent.
- BERTScore = semantically rich but computationally heavy.
🚧 Step 5: Common Misunderstandings
- “High BLEU means the model understands meaning.” ❌ It just matches words.
- “ROUGE is for translation too.” ❌ It’s mainly for summarization recall.
- “BERTScore replaces all other metrics.” ❌ It complements them — semantic ≠ stylistic.
🧩 Step 6: Mini Summary
🧠 What You Learned: BLEU and ROUGE measure word-level overlap, while BERTScore measures meaning-level alignment.
⚙️ How It Works: BLEU checks precision, ROUGE checks recall, and BERTScore checks semantic similarity via embeddings.
🎯 Why It Matters: Together, these metrics form the foundation for evaluating text generation — from literal accuracy to nuanced fluency and meaning.