4.1. Context Extension Strategies
🪄 Step 1: Intuition & Motivation
Core Idea: Traditional Transformers have a short memory. The cost of their attention mechanism grows quadratically with input length, meaning doubling your input quadruples the compute. Long-context models solve this by teaching Transformers how to remember more efficiently without exhausting memory or compute.
Simple Analogy: Imagine you’re reading a 500-page novel. You don’t reread every page to recall the plot — you remember summaries, key events, and references. Long-context models do the same — they learn to store, compress, or retrieve important bits of history while ignoring irrelevant details.
🌱 Step 2: Core Concept
Let’s explore the clever engineering that lets modern LLMs handle hundred-thousand-token conversations (and even million-token memory).
Sliding Window Attention — Moving Through Context Like a Camera
In standard attention, every token attends to every other token — $O(n^2)$ complexity.
Sliding Window Attention limits this by giving each token visibility into only a local window of recent tokens.
Imagine a spotlight moving across a line of text — each position can “see” only what’s nearby.
For example, with a window size of 512 tokens:
- Token 1000 can attend only to tokens 489–1000.
- Token 2000 attends to tokens 1489–2000, and so on.
This drastically reduces computation while preserving short-term coherence.
Trade-off: It’s great for tasks like dialogue or streaming text, but can’t capture global dependencies (like connecting a word in chapter 1 to a reference in chapter 20).
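Below is a minimal NumPy sketch of the masking idea (the function name, shapes, and window size are illustrative; the full score matrix is built only for clarity, whereas real implementations compute just the $w$ scores each token needs):

```python
import numpy as np

def sliding_window_attention(Q, K, V, window=512):
    """Causal attention where each token sees only the last `window` tokens."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n) logits, built for clarity only

    i = np.arange(n)[:, None]                    # query positions
    j = np.arange(n)[None, :]                    # key positions
    outside = (j > i) | (j < i - window + 1)     # future tokens or beyond the window
    scores = np.where(outside, -np.inf, scores)

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# With window=512, the 1000th token (index 999) attends to keys 488..999,
# i.e. tokens 489–1000 in 1-indexed terms, matching the example above.
Q = K = V = np.random.randn(1024, 64)
print(sliding_window_attention(Q, K, V).shape)   # (1024, 64)
```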
ALiBi — Attention with Linear Biases
ALiBi (Attention with Linear Biases) adds a simple yet powerful trick: instead of fixed positional embeddings, it adds a distance-based penalty directly into the attention score.
Mathematically:
$$ A_{ij} = \frac{Q_i K_j^\top}{\sqrt{d_k}} - m \cdot |i - j| $$
Here, $m > 0$ is a bias slope (one per attention head), so the penalty grows linearly with distance and discourages attention to far-away tokens.
Intuitively, tokens prefer nearby neighbors — but can still attend to distant ones if necessary.
Why it works:
- No fixed positional embeddings → better generalization to longer contexts.
- Linear bias naturally decays attention over distance → efficient and simple.
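A hedged, single-head sketch of the bias in code (the slope value is illustrative; in real ALiBi models each head gets a different slope, typically drawn from a geometric sequence such as 1/2, 1/4, 1/8, …):

```python
import numpy as np

def alibi_attention(Q, K, V, slope=0.0625):
    """Causal attention with an ALiBi-style linear distance penalty on the logits."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)

    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    scores = scores - slope * np.abs(i - j)        # penalty grows with distance
    scores = np.where(j > i, -np.inf, scores)      # causal mask

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Note that no positional embeddings are added anywhere; distance enters only through the bias, which is why the same slopes extrapolate to sequence lengths never seen in training.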
Linear Attention — Compressing the Attention Explosion
Regular attention computes a huge $n \times n$ matrix — all token pairs. Linear attention replaces this with a trick: approximate the softmax operation so that computation scales linearly with sequence length ($O(n)$).
The key idea is to replace the softmax kernel with a feature map applied separately to Q and K, so that $\text{softmax}(QK^\top)V$ is approximated by $\phi(Q)\big(\phi(K)^\top V\big)$. The $\phi(K)^\top V$ term can be maintained as a cumulative sum, avoiding explicit pairwise comparisons.
This saves massive memory and compute — now, the model can process much longer sequences in one pass.
The Catch: Linear attention approximates the full attention map, so it often loses fine-grained relationships — meaning weaker reasoning over distant dependencies.
Hence, while efficient, it can’t fully replace exact attention for tasks needing deep cross-sentence logic.
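As a toy illustration (not an optimized kernel), here is the causal linear-attention recurrence with the $\mathrm{elu}(x)+1$ feature map from the “Transformers are RNNs” formulation; each step updates a constant-size state instead of touching all previous tokens:

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: keeps features positive so the normalizer stays well-behaved
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention(Q, K, V):
    Qf, Kf = feature_map(Q), feature_map(K)
    n, d_v = V.shape
    d_k = Qf.shape[1]

    S = np.zeros((d_k, d_v))      # running sum of outer(k_t, v_t)
    z = np.zeros(d_k)             # running sum of k_t (normalizer)
    out = np.zeros_like(V)
    for t in range(n):            # single O(n) pass with constant-size state
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z + 1e-6)
    return out

Q = K = V = np.random.randn(2048, 64)
print(causal_linear_attention(Q, K, V).shape)   # (2048, 64)
```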
Retrieval-Augmented Memory — Storing What Doesn’t Fit
Even with efficient attention, models can’t hold everything in memory. Enter Retrieval-Augmented Memory (RAM) — an external brain for the model.
Here’s how it works:
- The model encodes chunks of previous text into vector embeddings.
- These embeddings are stored in a vector database (like FAISS or Milvus).
- When new input arrives, the model retrieves relevant past chunks using similarity search.
- Retrieved context is appended to the prompt — giving the illusion of “infinite memory.”
This is part of how long-context systems such as Claude 3.5 or Gemini 1.5 can appear to have near-unlimited “context”: they’re not attending to everything at once; they’re remembering selectively.
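A toy sketch of that retrieve-then-prepend loop; here `embed` is a random placeholder standing in for a real embedding model, and the in-memory store stands in for a vector database such as FAISS or Milvus:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real system would call a sentence-embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

class MemoryStore:
    def __init__(self):
        self.chunks, self.vectors = [], []

    def add(self, chunk: str):
        self.chunks.append(chunk)
        self.vectors.append(embed(chunk))

    def retrieve(self, query: str, k: int = 3):
        sims = np.stack(self.vectors) @ embed(query)   # cosine similarity (unit vectors)
        top = np.argsort(-sims)[:k]
        return [self.chunks[i] for i in top]

store = MemoryStore()
store.add("Chapter 1: the detective receives an unsigned letter.")
store.add("Chapter 20: the handwriting matches the butler's.")
relevant = store.retrieve("Who wrote the letter?", k=2)
prompt = "\n".join(relevant) + "\nQuestion: Who wrote the letter?"
```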
Memory Persistence — Beyond One Conversation
Long-context models are now combining short-term attention with long-term storage — enabling continuous learning across sessions.
They maintain memory slots or embeddings that persist between interactions, updating them incrementally.
This persistent memory design is inspired by the human hippocampus:
- Short-term memory (attention window).
- Long-term memory (retrieval storage).
- Consolidation (compressing past info into summaries).
Assistants built on models like Claude 3.5 and Gemini 1.5 are believed to combine variations of these ideas to support episodic recall: remembering context over hours or even days of chat.
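A hedged sketch of the consolidate-and-recall pattern (the class and the `summarize` helper are hypothetical stand-ins; production assistants implement this very differently):

```python
def summarize(text: str, max_chars: int = 200) -> str:
    # Placeholder: a real system would ask the LLM to compress the transcript.
    return text[:max_chars]

class PersistentMemory:
    """Long-term store that survives across sessions (the 'consolidation' role)."""
    def __init__(self):
        self.episodes = []

    def consolidate(self, session_transcript: str):
        self.episodes.append(summarize(session_transcript))   # compress, then store

    def recall(self, limit: int = 3):
        return self.episodes[-limit:]                          # naive recency-based recall

memory = PersistentMemory()
memory.consolidate("User asked about ALiBi slopes and sliding-window sizes ...")
# Next session: prepend recalled summaries to the fresh prompt.
context = "\n".join(memory.recall())
```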
Why It Works This Way
The challenge of scaling context isn’t just about more tokens — it’s about information management.
Without constraints, long sequences overwhelm the model’s ability to focus or reason. Techniques like ALiBi, sliding windows, and retrieval let the model:
- Keep local focus (attention window).
- Retain global awareness (retrieval memory).
- Avoid the quadratic blow-up ($O(n^2)$ → $O(n \cdot w)$ or $O(n)$).
How It Fits in ML Thinking
Long-context modeling bridges sequence efficiency with memory representation. It’s not just an architectural trick — it’s a philosophical step toward continual reasoning and contextual grounding.
These techniques move models from static responders toward contextually aware agents that can think across sessions, documents, and time.
📐 Step 3: Mathematical Foundation
Sliding Window Complexity
$$ \text{Cost} = O(n \cdot w) $$
- $n$: sequence length
- $w$: window size ($w \ll n$)
Compared to full attention ($O(n^2)$), this reduces computation drastically while maintaining local coherence.
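As a quick worked example (the numbers are purely illustrative): with $n = 100{,}000$ tokens and $w = 512$,
$$ n^2 = 10^{10} \quad \text{vs.} \quad n \cdot w \approx 5.1 \times 10^{7}, $$
roughly a $195\times$ reduction in attention-score computations.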
ALiBi Bias Term
$$ \text{bias}_{ij} = -m \cdot |i - j| $$
- $m$: linear slope controlling how quickly attention decays.
- $|i - j|$: distance between tokens.
This keeps the attention mechanism aware of distance without fixed positional embeddings.
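For a concrete feel (the slope is chosen for illustration): with $m = 1/16$ and a token distance of 100,
$$ \text{bias} = -\tfrac{100}{16} = -6.25, \qquad e^{-6.25} \approx 0.0019, $$
so the unnormalized attention weight for that distant token shrinks by a factor of roughly 500: heavily, but not completely, suppressed.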
🧠 Step 4: Key Ideas & Assumptions
- Attention needs focus, not infinity: Models don’t need to attend to all tokens — only the relevant subset.
- Positional decay mimics human forgetting: ALiBi approximates how we gradually “fade out” distant context.
- External memory extends reasoning horizon: Retrieval stores long-term knowledge without bloating model parameters.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Enables huge context lengths efficiently.
- Retains short-term coherence and long-term recall.
- Works well with streaming or conversational inputs.
- Retrieval-augmented memory provides scalable expansion.
Limitations:
- Linear and windowed attention lose some global dependencies.
- Retrieval quality depends on embedding accuracy.
- Persistent memory can drift or accumulate noise over time.
🚧 Step 6: Common Misunderstandings
- “Long-context = infinite memory.” No — models retrieve or summarize, not memorize everything.
- “Linear attention is always better.” Not for reasoning-heavy tasks; it loses token-to-token nuance.
- “Retrieval memory means permanent knowledge.” It’s dynamic and query-dependent: the model recalls stored chunks when a query matches them, rather than permanently absorbing them into its weights.
🧩 Step 7: Mini Summary
🧠 What You Learned: Long-context models overcome the quadratic limits of attention through sliding windows, positional biasing (ALiBi), linear approximations, and retrieval memory.
⚙️ How It Works: They blend local attention with external retrieval to simulate efficient, human-like recall.
🎯 Why It Matters: This enables reasoning over long documents, sustained conversations, and cross-session understanding — essential for the next generation of “always-on” AI agents.