3.4. Chunking and Context Windows


🪄 Step 1: Intuition & Motivation

Core Idea: Imagine feeding an LLM an entire 300-page PDF — your poor model will choke. 🥴

LLMs can only process text within their context window — the maximum number of tokens (words, subwords, punctuation) they can “see” at once. For GPT-4, that might be 8K or 128K tokens; for smaller models, often 2K–4K.

To make large documents digestible, we split them into smaller, often overlapping sections called chunks — like dividing a big cake into bite-sized slices. 🍰

But here’s the catch: if the slices are too thin, you lose flavor (context); if they’re too thick, the model gets indigestion (token overflow). Hence, chunking is both an art and a science.


Simple Analogy: Think of document chunking like reading a long novel out loud to a friend who forgets things after a few minutes. You read short, connected paragraphs (chunks) — not single words (too little) or whole chapters (too much). That way, your friend (the LLM) remembers just enough to make sense of each part.


🌱 Step 2: Core Concept

Let’s unpack how chunking works, what the “context window” means, and how to tune chunk size wisely.


1️⃣ What Is Document Chunking?

Chunking means dividing long documents into smaller text segments before creating embeddings.

Each chunk becomes a retrievable unit in your vector database.

For example, a 20-page article might be split into chunks like:

  • Chunk 1: Introduction and definitions (≈500 tokens)
  • Chunk 2: Key arguments and examples (≈600 tokens)
  • Chunk 3: Summary and conclusion (≈400 tokens)

When a query arrives, RAG retrieves only the most relevant chunks, not the whole document — saving time and token costs.

Chunking defines the granularity of understanding. Too granular → fragmented meaning. Too coarse → irrelevant context. Balance = coherence + retrievability.
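
To make this concrete, here is a minimal sketch (not a production pipeline) of turning a document into chunk records that could later be embedded and stored. The paragraph-based split, character budget, and record format are illustrative assumptions.

```python
# A minimal sketch of turning a long document into chunk records ready
# for embedding. Splitting on blank lines (paragraphs) and capping by
# character count are illustrative choices; real pipelines usually count
# tokens instead.

def split_into_chunks(document: str, max_chars: int = 2000) -> list[dict]:
    """Group paragraphs into chunks of roughly max_chars characters."""
    chunks, current = [], ""
    for paragraph in document.split("\n\n"):
        # Start a new chunk when adding this paragraph would overflow it.
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    # Each chunk becomes one retrievable unit (id + text) for the vector DB.
    return [{"chunk_id": i, "text": c} for i, c in enumerate(chunks)]
```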

2️⃣ What Are Context Windows?

The context window is the maximum number of tokens the model can process in one go.

Examples:

  • GPT-3.5 → ~4K tokens
  • GPT-4 → up to 128K tokens
  • Mistral → ~8K tokens

Every chunk retrieved during RAG + your question + system prompts must all fit inside this limit.

If they exceed it, part of the prompt must be truncated (often the earlier chunks), meaning the model may “forget” critical information.

Hence, chunking is designed relative to the model’s context size. If your model has a 4K window, chunk sizes of 512–1024 tokens are often optimal.

Keep chunk size ≈ 1/4 to 1/8 of your total context window. This leaves space for multiple retrieved chunks and the user query.
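
For OpenAI-style models, the tiktoken library is one way to count tokens and sanity-check that a chunk lands in that range; the encoding name, the 4K window, and the sample chunk below are assumptions to adapt to your own model.

```python
# Quick sanity check: does a chunk land in roughly 1/4 to 1/8 of the
# context window? tiktoken covers OpenAI-style tokenizers; other model
# families ship their own tokenizers.
import tiktoken

CONTEXT_WINDOW = 4096  # example: a 4K-token model
enc = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    return len(enc.encode(text))

chunk = "Termination. Either party may terminate this agreement with 30 days' written notice..."
n = token_count(chunk)
print(f"{n} tokens; suggested range: {CONTEXT_WINDOW // 8}-{CONTEXT_WINDOW // 4} tokens per chunk")
```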

3️⃣ Why Chunk Size Matters

| Chunk Size | Problem | Example Outcome |
| --- | --- | --- |
| Too small | Breaks context flow | The model misses relationships between paragraphs. |
| Too large | Retrieval dilution | Chunks contain mixed topics, lowering precision. |

Example: If you split a legal contract sentence by sentence, a query like “What are the termination conditions?” will fail, because the termination clause gets scattered across many tiny chunks and no single chunk contains the full set of conditions.

But if you chunk the whole contract into 3000-token blobs, your retriever will pull too much irrelevant content.

Goal: Keep chunks large enough for semantic completeness but small enough for selective retrieval.

Start with 512–1024 tokens per chunk for GPT-like models. Experiment and visualize embedding density (e.g., cosine similarity between adjacent chunks) to tune precisely.
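
As a hedged sketch of that check, you can compute cosine similarity between embeddings of adjacent chunks; `embed` below is a placeholder for whatever embedding model your pipeline uses (e.g., a sentence-transformers model).

```python
# Inspect cosine similarity between adjacent chunks to gauge how coherent
# your chunk boundaries are. `embed` is a caller-supplied function that
# maps text to a vector.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def adjacent_similarities(chunks: list[str], embed) -> list[float]:
    vectors = [np.asarray(embed(c)) for c in chunks]
    return [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]

# Very low similarity between neighbors can signal a topic cut mid-thought;
# uniformly high similarity can signal chunks so large they blur topics.
```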

4️⃣ Chunking Strategies

Different use cases require different chunking styles:

| Strategy | Description | Best For |
| --- | --- | --- |
| Fixed-Size Chunking | Divide text every N tokens (e.g., every 500). | Fast, simple preprocessing. |
| Sliding Window Chunking | Overlapping windows (e.g., 500 tokens with a 100-token overlap). | Smooth context continuity. |
| Semantic Segmentation | Breaks text at natural boundaries (e.g., paragraphs, topic shifts). | Complex, meaning-preserving chunking. |

Sliding Window Example:

Chunk 1 → Tokens 0–500  
Chunk 2 → Tokens 400–900 (100-token overlap)

That overlap ensures continuity — like reading a paragraph that slightly repeats where you left off.
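
Below is a minimal sketch of that pattern over a list of tokens, matching the 500-token window with 100-token overlap shown above; the whitespace split at the end is only a stand-in for a real tokenizer.

```python
# Sliding-window chunking over a token list: fixed-size windows that
# advance by (size - overlap) tokens, e.g., 0-500, 400-900, 800-1300, ...

def sliding_window_chunks(tokens: list[str], size: int = 500,
                          overlap: int = 100) -> list[list[str]]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # each window starts 400 tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this window already reaches the end
            break
    return chunks

tokens = "the quick brown fox jumps over the lazy dog".split()  # stand-in
windows = sliding_window_chunks(tokens, size=500, overlap=100)
```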

For high-stakes tasks (legal, medical, research), semantic segmentation gives better recall — because it respects topic boundaries rather than blind token counts.

5️⃣ Dynamic Chunk Sizing

In production RAG systems, you don’t always use one chunk size for all data.

Chunk size can be dynamic:

  • Based on document type (short blogs vs. long reports).
  • Based on model context (e.g., 512 for 8K context, 2048 for 32K).
  • Based on retrieval performance metrics (Recall@k, MRR).

Some advanced pipelines even auto-tune chunk sizes by testing retrieval accuracy over a validation set and picking the configuration with the highest semantic precision.
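
As a sketch of that idea, the helper below sweeps candidate chunk sizes and keeps the best-scoring configuration; `build_index` and `recall_at_k` are caller-supplied placeholders standing in for your own indexing and evaluation code.

```python
# Treat chunk size as a hyperparameter: re-chunk, re-index, and score
# retrieval on a held-out query set for each candidate size.

def tune_chunk_size(documents, eval_queries, build_index, recall_at_k,
                    candidate_sizes=(256, 512, 1024, 2048), k=5):
    """build_index(documents, chunk_size) -> index
       recall_at_k(index, eval_queries, k) -> float (higher is better)"""
    best_size, best_score = None, float("-inf")
    for size in candidate_sizes:
        index = build_index(documents, chunk_size=size)  # re-chunk + embed
        score = recall_at_k(index, eval_queries, k=k)    # e.g., Recall@5
        if score > best_score:
            best_size, best_score = size, score
    return best_size, best_score
```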

Dynamic chunk sizing is part of retrieval optimization. Good teams treat chunking as a hyperparameter, not a fixed rule.

📐 Step 3: Mathematical Foundation

Token Counting and Context Budget

Let:

  • $L$ = model context window (e.g., 4096)
  • $C$ = chunk size (in tokens)
  • $k$ = number of chunks retrieved
  • $Q$ = query length (in tokens)

Constraint:

$$ kC + Q \leq L $$

This defines how many chunks you can safely include.

Example: For $L = 4096$, $Q = 128$, and $C = 512$:

$$ k \leq \frac{4096 - 128}{512} = 7.75 $$

So you can include about 7 chunks per query.
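
The same budget as a tiny helper; the `reserved` argument is an addition for system-prompt tokens (mentioned earlier but not part of the formula above), and this is a sketch rather than any library's API.

```python
import math

def max_chunks(context_window: int, chunk_size: int, query_tokens: int,
               reserved: int = 0) -> int:
    """Chunks of `chunk_size` that fit beside the query (plus any tokens
    reserved for system prompts) inside the context window."""
    return math.floor((context_window - query_tokens - reserved) / chunk_size)

print(max_chunks(4096, 512, 128))  # 7.75 floored -> 7 chunks
```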

Chunk size is a budgeting problem: how to fit enough context for reasoning without exceeding the model’s attention span.

🧠 Step 4: Key Ideas & Assumptions

  • The model’s context window defines your chunking limits.
  • Chunks must be semantically coherent (self-contained in meaning).
  • Overlap (sliding window) preserves continuity.
  • Chunking is a tunable design choice, not a fixed rule.
  • Retrieval performance must be validated empirically — not guessed.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Enables scalable retrieval from long documents.
  • Balances comprehension and performance.
  • Reduces hallucination by maintaining local coherence.

⚠️ Limitations:

  • Poor chunking can fragment context or include irrelevant data.
  • Overlapping chunks increase storage and embedding costs.
  • No one-size-fits-all configuration — requires experimentation.

⚖️ Trade-offs:

  • Size vs. Coherence: Larger chunks capture context but slow retrieval.
  • Overlap vs. Efficiency: More overlap improves recall but duplicates data.
  • Dynamic vs. Static: Adaptive chunking increases accuracy but adds complexity.

🚧 Step 6: Common Misunderstandings

  • “Smaller chunks are always better.” → Not true. Tiny chunks destroy semantic coherence.
  • “You can ignore overlap.” → Overlap maintains reasoning flow. Without it, the model “forgets” sentence transitions.
  • “Chunking doesn’t affect accuracy.” → It directly determines what context the model sees — it’s one of the most critical hyperparameters in RAG.

🧩 Step 7: Mini Summary

🧠 What You Learned: Chunking breaks large documents into retrievable, coherent segments that fit within the LLM’s context window.

⚙️ How It Works: Chunks are sized and overlapped based on the model’s context limit, retrieval goals, and domain semantics — balancing coherence, precision, and cost.

🎯 Why It Matters: Optimal chunking ensures your RAG system retrieves meaningful, focused context for each query — the foundation for factual and efficient reasoning.
