3.4. Chunking and Context Windows
🪄 Step 1: Intuition & Motivation
Core Idea: Imagine feeding an LLM an entire 300-page PDF — your poor model will choke. 🥴
LLMs can only process text within their context window: the maximum number of tokens (words, subwords, punctuation) they can “see” at once. Depending on the variant, GPT-4 handles roughly 8K to 128K tokens; smaller models are often limited to 2K–4K.
To make large documents digestible, we split them into smaller, overlapping sections called chunks — like dividing a big cake into bite-sized slices. 🍰
But here’s the catch: if the slices are too thin, you lose flavor (context); if they’re too thick, the model gets indigestion (token overflow). Hence, chunking is both an art and a science.
Simple Analogy: Think of document chunking like reading a long novel out loud to a friend who forgets things after a few minutes. You read short, connected paragraphs (chunks) — not single words (too little) or whole chapters (too much). That way, your friend (the LLM) remembers just enough to make sense of each part.
🌱 Step 2: Core Concept
Let’s unpack how chunking works, what the “context window” means, and how to tune chunk size wisely.
1️⃣ What Is Document Chunking?
Chunking means dividing long documents into smaller text segments before creating embeddings.
Each chunk becomes a retrievable unit in your vector database.
For example, a 20-page article might be split into:
- Chunk 1: Introduction and definitions (≈500 tokens)
- Chunk 2: Key arguments and examples (≈600 tokens)
- Chunk 3: Summary and conclusion (≈400 tokens)
When a query arrives, RAG retrieves only the most relevant chunks, not the whole document — saving time and token costs.
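As a rough illustration, here is a minimal fixed-size chunking sketch in Python. The `chunk_document` helper and `article.txt` path are hypothetical, and whitespace splitting only approximates real tokenizer counts:

```python
# Minimal fixed-size chunking sketch (hypothetical helper, not a specific library).
# Whitespace "tokens" only approximate the counts a real tokenizer would give.
def chunk_document(text: str, chunk_size: int = 500) -> list[str]:
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]

with open("article.txt", encoding="utf-8") as f:   # e.g., the 20-page article above
    chunks = chunk_document(f.read(), chunk_size=500)

print(f"{len(chunks)} chunks ready to embed and index")
```

Each returned chunk would then be embedded and stored as its own row in the vector database.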
2️⃣ What Are Context Windows?
The context window is the maximum number of tokens the model can process in one go.
Examples:
- GPT-3.5 → ~4K tokens
- GPT-4 → up to 128K tokens
- Mistral → ~8K tokens
Every retrieved chunk, plus your question and any system prompts, must fit inside this limit.
If the total exceeds it, part of the input has to be truncated (typically the earliest text), meaning the model may effectively “forget” critical information.
Hence, chunk size is chosen relative to the model’s context size. If your model has a 4K window, chunk sizes of 512–1024 tokens are a common starting point.
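A quick sanity check of this budget, sketched with the tiktoken tokenizer (the `cl100k_base` encoding and the `reserve_for_answer` margin are assumptions for illustration):

```python
# Sketch: verify that retrieved chunks + the query fit the model's context window.
# Uses tiktoken's cl100k_base encoding; the answer reserve is an illustrative choice.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(chunks: list[str], query: str,
                    context_window: int = 4096,
                    reserve_for_answer: int = 512) -> bool:
    used = len(enc.encode(query)) + sum(len(enc.encode(c)) for c in chunks)
    return used <= context_window - reserve_for_answer
```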
3️⃣ Why Chunk Size Matters
| Chunk Size | Problem | Example Outcome |
|---|---|---|
| Too Small | Breaks context flow | The model misses relationships between paragraphs. |
| Too Large | Retrieval dilution | Chunks contain mixed topics, lowering precision. |
Example: If you split a legal contract sentence by sentence, a query like
“What are the termination conditions?” will likely fail, because no single sentence-sized chunk contains the full termination clause together with its conditions.
But if you chunk the whole contract into 3000-token blobs, your retriever will pull in large amounts of irrelevant content alongside the clause you need.
Goal: Keep chunks large enough for semantic completeness but small enough for selective retrieval.
4️⃣ Chunking Strategies
Different use cases require different chunking styles:
| Strategy | Description | Best For |
|---|---|---|
| Fixed-Size Chunking | Divide text every N tokens (e.g., every 500). | Fast, simple preprocessing. |
| Sliding Window Chunking | Overlapping windows (e.g., 500 tokens with 100-token overlap). | Smooth context continuity. |
| Semantic Segmentation | Breaks text at natural boundaries (e.g., paragraphs, topic shifts). | Complex, meaning-preserving chunking. |
Sliding Window Example:
Chunk 1 → Tokens 0–500
Chunk 2 → Tokens 400–900 (100-token overlap)
That overlap ensures continuity, like reading a paragraph that slightly repeats where you left off.
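A sliding-window chunker along these lines might look like the sketch below. Using tiktoken to count and slice tokens is an assumption; any tokenizer would work:

```python
# Sliding-window chunking sketch: fixed-size windows with a configurable overlap.
# tiktoken is used here only to count and slice tokens; any tokenizer would do.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def sliding_window_chunks(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    tokens = enc.encode(text)
    step = size - overlap                      # window advances by 400 tokens
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]    # 0–500, 400–900, 800–1300, ...
        chunks.append(enc.decode(window))
        if start + size >= len(tokens):        # last window reached the end
            break
    return chunks
```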
5️⃣ Dynamic Chunk Sizing
In production RAG systems, you don’t always use one chunk size for all data.
Chunk size can be dynamic:
- Based on document type (short blogs vs. long reports).
- Based on model context (e.g., 512 for 8K context, 2048 for 32K).
- Based on retrieval performance metrics (Recall@k, MRR).
Some advanced pipelines even auto-tune chunk sizes by testing retrieval accuracy over a validation set and picking the configuration with the highest semantic precision.
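One possible heuristic for dynamic sizing is sketched below; the thresholds are purely illustrative assumptions, not established defaults, and in practice they should be tuned on a validation set:

```python
# Hypothetical heuristic for dynamic chunk sizing. The thresholds below are
# illustrative assumptions; in practice, tune them against retrieval metrics.
def choose_chunk_size(context_window: int, doc_type: str = "report") -> int:
    base = 512 if context_window <= 8_192 else 2_048
    if doc_type == "blog":          # short posts: smaller, self-contained chunks
        return base // 2
    return base

print(choose_chunk_size(8_192))               # 512
print(choose_chunk_size(32_768, "report"))    # 2048
```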
📐 Step 3: Mathematical Foundation
Token Counting and Context Budget
Let:
- $L$ = model context window (e.g., 4096)
- $C$ = chunk size (in tokens)
- $k$ = number of chunks retrieved
- $Q$ = query length (in tokens)
Constraint:
$$ kC + Q \leq L $$
This defines how many chunks you can safely include.
Example: For $L = 4096$, $Q = 128$, and $C = 512$:
$$ k \leq \frac{4096 - 128}{512} = 7.75 $$
So you can include at most 7 full chunks per query.
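The same budget check in code, using the values from the example above:

```python
# Worked context-budget example: kC + Q <= L  =>  k <= (L - Q) / C
L, Q, C = 4096, 128, 512
k_max = (L - Q) // C      # floor(3968 / 512) = floor(7.75) = 7
print(k_max)              # 7 retrievable chunks fit alongside the query
```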
🧠 Step 4: Key Ideas & Assumptions
- The model’s context window defines your chunking limits.
- Chunks must be semantically coherent (self-contained in meaning).
- Overlap (sliding window) preserves continuity.
- Chunking is a tunable design choice, not a fixed rule.
- Retrieval performance must be validated empirically — not guessed.
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths:
- Enables scalable retrieval from long documents.
- Balances comprehension and performance.
- Reduces hallucination by maintaining local coherence.
⚠️ Limitations:
- Poor chunking can fragment context or include irrelevant data.
- Overlapping chunks increase storage and embedding costs.
- No one-size-fits-all configuration — requires experimentation.
⚖️ Trade-offs:
- Size vs. Coherence: Larger chunks capture context but slow retrieval.
- Overlap vs. Efficiency: More overlap improves recall but duplicates data.
- Dynamic vs. Static: Adaptive chunking increases accuracy but adds complexity.
🚧 Step 6: Common Misunderstandings
- “Smaller chunks are always better.” → Not true. Tiny chunks destroy semantic coherence.
- “You can ignore overlap.” → Overlap maintains reasoning flow. Without it, the model “forgets” sentence transitions.
- “Chunking doesn’t affect accuracy.” → It directly determines what context the model sees — it’s one of the most critical hyperparameters in RAG.
🧩 Step 7: Mini Summary
🧠 What You Learned: Chunking breaks large documents into retrievable, coherent segments that fit within the LLM’s context window.
⚙️ How It Works: Chunks are sized and overlapped based on the model’s context limit, retrieval goals, and domain semantics — balancing coherence, precision, and cost.
🎯 Why It Matters: Optimal chunking ensures your RAG system retrieves meaningful, focused context for each query — the foundation for factual and efficient reasoning.