3.5. Query Transformation & Re-ranking


🪄 Step 1: Intuition & Motivation

Core Idea: When humans search for information, we rarely get it right on the first try. We instinctively rephrase, clarify, and narrow down our questions — “Why is my code slow?” becomes “How to optimize Python loops?”

RAG systems face the same challenge. The user’s query might be ambiguous, incomplete, or phrased differently from how documents were written.

Query transformation helps the system understand what the user means, not just what they say. Then, re-ranking ensures the system picks the best among retrieved results — not just the first few that popped up.

Together, these two steps transform raw retrieval into smart understanding.


Simple Analogy: Imagine you’re Googling “apple performance.” Without context, the search engine doesn’t know whether you mean 🍎 fruit nutrition or 💻 MacBook speed.

Query transformation disambiguates that by rephrasing your search —

“Apple company device performance benchmarks.”

Then, re-ranking ensures you get the most relevant article on benchmarks — not a random blog about orchards. 🌳


🌱 Step 2: Core Concept

Let’s unpack how RAG systems improve retrieval through query rewriting, expansion, and re-ranking.


1️⃣ Query Rewriting — Clarifying the Question

Sometimes, a user’s query isn’t phrased in a way that matches how documents are written.

For example:

Query: “How do I speed up pandas?”
Documents: “Optimizing Pandas DataFrames for performance.”

Even though they mean the same thing, the retriever might not connect them perfectly.

That’s where query rewriting comes in — it reformulates the query into a clearer, semantically richer version.

Methods include:

  • LLM-based reformulation: Using an LLM to rephrase the query before embedding.
  • Template-based rewriting: Adding clarifying context (e.g., “Explain the performance optimization of…”).
  • Automatic paraphrasing: Generating multiple semantically similar queries.

Embedding models are sensitive to phrasing. Rewriting ensures the query matches the semantic style of your indexed documents.
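To make this concrete, here is a minimal sketch of LLM-based reformulation. The `llm` callable and the prompt wording are placeholders, standing in for whatever model client and instructions you actually use:

```python
def rewrite_query(query: str, llm) -> str:
    """Ask an LLM to restate the query in the vocabulary of the indexed documents."""
    prompt = (
        "Rewrite the following search query so it matches the phrasing of "
        "technical documentation. Preserve the original intent.\n"
        f"Query: {query}\n"
        "Rewritten query:"
    )
    # `llm` is any text-completion callable, e.g. a thin wrapper around your provider's API.
    return llm(prompt).strip()

# Hypothetical usage:
# rewrite_query("How do I speed up pandas?", llm)
# -> "Optimizing pandas DataFrame operations for performance"
```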

2️⃣ Query Expansion — Widening the Search Net

Sometimes, users provide too few clues.

Query: “COVID vaccine risks”
Relevant documents might contain terms like “side effects,” “adverse reactions,” or “safety concerns.”

If you only use the original query embedding, you’ll miss many of these.

Query expansion fixes that by adding related terms or paraphrases to the embedding.

Techniques:

  • Synonym Expansion: Using WordNet or embedding similarity.
  • Pseudo-Relevance Feedback: Retrieve top results, extract key terms, and re-query.
  • LLM Expansion: Generate alternative phrasings dynamically (e.g., “side effects of COVID vaccination”).

Think of query expansion as casting a wider net in the vector ocean — it increases recall (catching more relevant docs), even if some irrelevant ones sneak in.
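As a rough sketch of LLM-driven expansion, the snippet below generates a few paraphrases, retrieves with each, and merges the results. `llm_paraphrase`, `embed`, and `vector_search` are assumed helper functions, not a specific library API:

```python
def expand_and_retrieve(query, llm_paraphrase, embed, vector_search, k=20):
    """Retrieve with the original query plus LLM-generated paraphrases, then merge results."""
    # e.g. "COVID vaccine risks" -> ["side effects of COVID vaccination", ...]
    variants = [query] + llm_paraphrase(query, n=3)

    seen, merged = set(), []
    for variant in variants:
        for doc_id, score in vector_search(embed(variant), top_k=k):
            if doc_id not in seen:      # keep the first (highest-scoring) hit per document
                seen.add(doc_id)
                merged.append((doc_id, score))
    # Wider net: higher recall, at the cost of letting some irrelevant candidates through.
    return merged
```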

3️⃣ Re-ranking — Choosing the Best Catch

After initial retrieval (e.g., top 100 vectors), not all results are equally relevant.

Enter re-ranking — a second-stage process that re-scores retrieved documents using a more precise but slower model.

Two popular approaches:

| Type | Example | Description |
| --- | --- | --- |
| Cross-Encoder Re-ranker | MiniLM, ColBERT | Takes both query + document together and computes a relevance score using deep attention. |
| LLM-as-Ranker | GPT-based scoring | An LLM evaluates which passages best answer the query (costly but interpretable). |

Example (MiniLM cross-encoder): For each candidate document $d_i$, compute

$$ s_i = f_{\text{cross}}([q; d_i]) $$

and then re-rank documents by descending $s_i$.

This ensures that semantically rich, directly relevant chunks rise to the top — improving precision.

Cross-encoders are slower than vector similarity, so they’re used after fast dense retrieval. That’s why RAG pipelines often follow a two-stage design: 1️⃣ Retrieve broadly, 2️⃣ Re-rank deeply.
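A minimal re-ranking sketch using the sentence-transformers CrossEncoder class is shown below; cross-encoder/ms-marco-MiniLM-L-6-v2 is one commonly used MiniLM re-ranker checkpoint, and `candidates` is assumed to hold the text of the documents returned by the first-stage retriever:

```python
from sentence_transformers import CrossEncoder

# A widely used MiniLM re-ranker checkpoint (swap in whichever model fits your stack).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Score (query, document) pairs jointly and return the top_n documents."""
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)          # s_i = f_cross([q; d_i])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```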

4️⃣ Balancing Precision and Recall

You can think of recall and precision as two sides of a see-saw:

| Metric | Meaning | Extreme Behavior |
| --- | --- | --- |
| Recall | How many relevant docs you found out of all relevant ones. | High recall = more coverage, slower speed. |
| Precision | How many of your retrieved docs are actually relevant. | High precision = fewer false positives, might miss some useful docs. |

A good RAG system:

  • Retrieves with high recall (dense retrieval).
  • Refines with high precision (re-ranking).

“Recall gets you the fish; precision cleans them for dinner.” 🐟🔪
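If you want to measure where your pipeline sits on this see-saw, both metrics take only a few lines of Python (the document IDs below are made up for illustration):

```python
def recall_precision(retrieved: set, relevant: set) -> tuple:
    """Recall = relevant docs found / all relevant; precision = relevant found / all retrieved."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical example: 4 docs retrieved, 3 truly relevant, 2 of them found.
print(recall_precision({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"}))
# -> (0.666..., 0.5)
```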

5️⃣ Latency Mitigation — Making It Fast Again

Re-ranking adds computational overhead, especially at scale. To keep latency low, teams use:

  • Async pipelines: Perform re-ranking in parallel threads.
  • Caching: Store re-ranking results for common queries.
  • Top-k filtering: Only re-rank the top 20–50 documents, not all.
  • Distilled re-rankers: Smaller models trained from cross-encoders for faster inference.

Example:

Fast retriever → top 100 docs → re-rank top 20 → feed 5 best to the generator.

This hybrid setup ensures speed, precision, and practicality.
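A sketch of that pipeline, with top-k filtering and a simple query cache, might look like this; `dense_retrieve` is an assumed stand-in for your vector-store client, and `rerank` is the cross-encoder helper sketched earlier:

```python
from functools import lru_cache

TOP_RETRIEVE, TOP_RERANK, TOP_GENERATE = 100, 20, 5

@lru_cache(maxsize=1024)                   # caching: reuse results for repeated queries
def retrieve_for_generation(query: str) -> tuple:
    candidates = dense_retrieve(query, top_k=TOP_RETRIEVE)   # fast, broad (recall)
    shortlist = candidates[:TOP_RERANK]                      # top-k filtering before the slow model
    best = rerank(query, shortlist, top_n=TOP_GENERATE)      # precise, narrow (precision)
    return tuple(best)                                       # tuple so the result is cache-friendly
```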


📐 Step 3: Mathematical Foundation

Two-Stage Retrieval Objective

Let $q$ be the query and $\mathcal{D}$ the document set.

1️⃣ Stage 1 — Dense Retrieval: Find top-k docs by cosine similarity:

$$ \mathcal{D}_k = \arg\max_{d_i \in \mathcal{D}} \text{sim}(E(q), E(d_i)) $$

2️⃣ Stage 2 — Re-ranking: Re-score each $d_i \in \mathcal{D}_k$ using a cross-encoder $f$:

$$ s_i = f(q, d_i) $$

Final ranking:

$$ \text{Rank}(d_i) = \text{sort}_{i}(s_i) $$

Dense retrieval = “find the crowd.” Re-ranking = “pick the stars.” ⭐
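Stage 1 in code is just a batched cosine similarity followed by a top-k selection. Here is a small NumPy sketch, assuming you already have a query embedding and a matrix of document embeddings:

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 100) -> np.ndarray:
    """Return indices of the k documents whose embeddings are most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q                        # cosine similarity against every document
    return np.argsort(-sims)[:k]        # indices of the k highest-scoring documents
```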

🧠 Step 4: Key Ideas & Assumptions

  • Queries may not match document phrasing; transformation bridges this gap.
  • Re-ranking corrects retrieval noise using deeper semantic comparison.
  • Recall–precision balance is critical for both speed and accuracy.
  • The system can be optimized for throughput with parallel and selective re-ranking.
  • Query transformations are often LLM-driven — the model that answers can also rephrase.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Greatly improves retrieval accuracy and relevance.
  • Adapts user queries to the domain vocabulary.
  • Provides interpretable intermediate outputs for debugging.

⚠️ Limitations:

  • Increases latency due to re-ranking cost.
  • Relies on quality of reformulations (bad rewrites can worsen retrieval).
  • Requires extra infrastructure for cross-encoders or LLM-based scoring.

⚖️ Trade-offs:

  • Recall vs. Precision: Broader recall → more data to re-rank.
  • Latency vs. Accuracy: More stages → better results but slower.
  • Automation vs. Control: LLM-driven rewrites improve flexibility but reduce predictability.

🚧 Step 6: Common Misunderstandings

  • “Re-ranking replaces retrieval.” → No, it refines retrieval; it needs candidates first.
  • “Query expansion always helps.” → Over-expansion can pollute results with irrelevant hits.
  • “LLMs don’t need re-ranking.” → Even LLMs benefit from structured retrieval ordering for factual grounding.

🧩 Step 7: Mini Summary

🧠 What You Learned: Query transformation refines user intent, and re-ranking ensures the system selects the most relevant retrieved chunks. Together, they form the “smart filter” of RAG pipelines.

⚙️ How It Works: The retriever finds a broad set of candidates (high recall), then the re-ranker uses deeper attention models to reorder them for precision — often asynchronously for performance.

🎯 Why It Matters: These steps transform RAG from “search by chance” into “search with understanding,” dramatically improving factual grounding and response quality.
