3.1. Linear Algebra Refresher
🪄 Step 1: Intuition & Motivation
- Core Idea: Transformers are built on linear algebra — specifically, on the idea that words, meanings, and relationships can all be represented as vectors in a geometric space.
Every core Transformer operation — from attention weights to feed-forward layers — is really just a story of matrix multiplication, projection, and similarity.
If you understand the geometry of these transformations, you understand the soul of the Transformer.
- Simple Analogy: Imagine a spotlight shining on different actors on stage. Each actor (word) has a unique position and role, but by rotating or shifting the spotlight (via matrix multiplication), you can change who the audience focuses on — that’s what projections in Transformers do.
🌱 Step 2: Core Concept
Let’s unpack the three major linear algebra ideas that silently power Transformers:
- Matrix Multiplication and Projection Geometry
- Orthogonality, Covariance, and Subspace Decomposition
- Eigenspectrum and Training Stability
1️⃣ Matrix Multiplication and Projection Geometry
At its heart, attention relies on vector projections.
When we compute attention scores as $QK^T$, we’re literally measuring how much each query vector aligns with each key vector — i.e., how much one vector “projects” onto another.
If $q$ and $k$ are row vectors:
$$ \text{similarity}(q, k) = q \cdot k = |q||k| \cos(\theta) $$

This dot product is large when the vectors point in similar directions — meaning the query “finds” that key relevant.
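A quick sanity check in NumPy (toy vectors invented for illustration, not taken from the text above):

```python
import numpy as np

# Toy vectors: the dot product is large when the vectors are aligned
# and zero when they are orthogonal.
q = np.array([1.0, 0.0])
k_aligned = np.array([2.0, 0.0])     # points the same way as q
k_orthogonal = np.array([0.0, 2.0])  # perpendicular to q

print(q @ k_aligned)      # 2.0 -> high relevance
print(q @ k_orthogonal)   # 0.0 -> no relevance
```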
Now, when we multiply matrices:
$$ QK^T $$

each entry $(i, j)$ is a raw score for how strongly token $i$'s query aligns with token $j$'s key; the softmax then turns these scores into attention weights.
Then, multiplying by $V$:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

means we collect weighted combinations of the values, where the weights come from how well queries align with keys; dividing by $\sqrt{d_k}$ keeps the scores from growing with the vector dimension, so the softmax stays well-behaved.
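Here is a minimal NumPy sketch of this formula, assuming a single head with no masking or batching; the shapes and random data are only for illustration, not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # pairwise query-key alignment
    weights = softmax(scores, axis=-1)        # each row is a distribution over keys
    return weights @ V                        # weighted blend of the value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))  # 3 tokens, d_k = d_v = 4
print(attention(Q, K, V).shape)               # (3, 4): one contextualized vector per token
```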
2️⃣ Orthogonality, Covariance, and Subspace Decomposition
Orthogonality means two vectors are perpendicular (their dot product is 0), so they carry information along non-overlapping directions.
In Transformers, this concept helps us build diverse and non-redundant representations — we don’t want all words to point in the same direction in space.
- When $Q$, $K$, or $V$ projections are learned, they often become approximately orthogonal, meaning each captures different relational directions (syntax, semantics, etc.).
Covariance measures how features move together. If model activations are highly correlated, it means they’re redundant — not great for efficient learning.
That’s why Layer Normalization (and sometimes orthogonal initialization) helps: it keeps activation statistics controlled so representations don’t collapse onto a few dominant, highly correlated directions.
Subspace decomposition means breaking a high-dimensional space into meaningful smaller “directions” or subspaces. Each attention head can be thought of as operating in a different subspace, focusing on one type of relation (say, grammar, or meaning).
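A rough sketch of the subspace idea (all dimensions and matrices here are invented for illustration): each head applies its own small projection of the shared embedding, and the rows of a random high-dimensional projection are close to orthogonal.

```python
import numpy as np

d_model, n_heads = 64, 8
d_head = d_model // n_heads

rng = np.random.default_rng(0)
x = rng.normal(size=(d_model,))                       # one token embedding

# Each head projects the same embedding into its own d_head-dimensional subspace.
W_heads = rng.normal(size=(n_heads, d_head, d_model)) / np.sqrt(d_model)
head_outputs = [W @ x for W in W_heads]               # n_heads vectors of size d_head

# Orthogonality check within one projection: the Gram matrix of its rows is
# close to the identity, so the rows point in nearly independent directions.
W = W_heads[0]
gram = W @ W.T
max_off_diag = np.abs(gram - np.diag(np.diag(gram))).max()
print(len(head_outputs), head_outputs[0].shape)       # 8 subspace views of one token
print(round(float(max_off_diag), 3))                  # small relative to the ~1.0 diagonal
```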
3️⃣ Eigenspectrum and Training Stability
When we train deep networks, the weight matrices evolve. Their eigenspectrum — the set of eigenvalues — tells us how the network stretches or squashes signals passing through.
If eigenvalues are too large → activations explode. If too small → gradients vanish.
For a linear transformation $Wx$,
- Eigenvectors are directions that $W$ does not rotate; it only scales them.
- Eigenvalues are how much those directions are stretched or compressed.
So when initializing or regularizing weight matrices, we want eigenvalues roughly around 1 to preserve signal scale.
That’s why Xavier and Kaiming initializations exist — they balance variance to keep signals steady layer to layer.
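A small experiment makes this concrete (sizes are arbitrary and chosen only for illustration). The largest singular value tells us how much $W$ can stretch a signal; Xavier-style $1/d_{in}$ scaling keeps it near a constant instead of growing with width:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512

W_naive = rng.normal(scale=1.0, size=(d, d))                 # unit-variance entries
W_xavier = rng.normal(scale=np.sqrt(1.0 / d), size=(d, d))   # Var(W) = 1/d_in

# Largest singular value = maximum stretch factor applied to any direction.
print(np.linalg.svd(W_naive, compute_uv=False).max())    # ~2*sqrt(d) ≈ 45: signals blow up
print(np.linalg.svd(W_xavier, compute_uv=False).max())   # ~2: signal scale roughly preserved
```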
📐 Step 3: Mathematical Foundation
Projection and Similarity in Matrix Form
For attention:
$$ \text{Score}(Q,K) = QK^T $$

Each element is a projection (dot product) of a query onto a key; it encodes pairwise relevance.
Then, normalized attention:
$$ \alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d_k})}{\sum_{j'} \exp(q_i \cdot k_{j'} / \sqrt{d_k})} $$

weights each key by how similar it is to the query.
Finally:
$$ z_i = \sum_j \alpha_{ij} v_j $$

produces a contextualized embedding — a blend of the values weighted by relevance.
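A tiny worked example (numbers invented for illustration), with two keys and the $\sqrt{d_k}$ scaling already applied: suppose query $q_i$ has scaled scores $2$ against $k_1$ and $0$ against $k_2$. Then

$$ \alpha_{i1} = \frac{e^{2}}{e^{2} + e^{0}} \approx 0.88, \qquad \alpha_{i2} = \frac{e^{0}}{e^{2} + e^{0}} \approx 0.12, \qquad z_i \approx 0.88\,v_1 + 0.12\,v_2 $$

so the output $z_i$ is dominated by the value whose key best matched the query, with a small contribution from the other.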
Eigenspectrum & Weight Initialization
To maintain stability:
$$ \mathrm{Var}\big((Wx)_i\big) = d_{in} \, \mathrm{Var}(W_{ij}) \, \mathrm{Var}(x_j) $$

(assuming independent, zero-mean weights and inputs). We want:

$$ \mathrm{Var}(W_{ij}) = \frac{1}{d_{in}} $$

so that the variance of activations stays roughly constant across layers.
If $W$ has eigenvalues $\lambda_i$, a large spread ($|\lambda_i| \gg 1$ or $\ll 1$) means instability. Balanced eigenvalues ($|\lambda_i| \approx 1$) → consistent signal scale and smooth gradient flow.
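A short sketch of why this matters in depth (depth and width are arbitrary, and there is no nonlinearity, so the effect of $\mathrm{Var}(W)$ is isolated): push a random signal through a stack of linear layers and watch its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 256, 20
x_naive = x_scaled = rng.normal(size=(d,))   # same starting signal for both stacks

for _ in range(depth):
    x_naive = rng.normal(scale=1.0, size=(d, d)) @ x_naive                 # unscaled weights
    x_scaled = rng.normal(scale=np.sqrt(1.0 / d), size=(d, d)) @ x_scaled  # Var(W) = 1/d_in

print(f"unscaled: variance ≈ {x_naive.var():.2e}")   # grows by a factor of ~d per layer
print(f"1/d_in:   variance ≈ {x_scaled.var():.2e}")  # stays around 1
```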
🧠 Step 4: Key Ideas
- Matrix multiplication = geometric projection — the engine of attention.
- Orthogonality ensures diversity in learned representations.
- Eigenspectrum balance keeps training numerically stable.
- Separate Q, K, V projections allow asymmetric relational reasoning — queries ask, keys describe, values answer.
⚖️ Step 5: Strengths, Limitations & Trade-offs
- Provides rich geometric interpretability.
- Enables multiple relational spaces (via heads).
- Orthogonalization promotes stability and diversity.
- Intuitive geometry can get messy in very high dimensions.
- Orthogonality is not strictly enforced during training — can drift.
- Large eigenspectrum spread → unstable optimization if not controlled.
🚧 Step 6: Common Misunderstandings
- “Q, K, and V projections are just copies.” No — they’re separate subspace projections with different relational roles.
- “Orthogonality means independence across layers.” It applies within representations, not across layers.
- “Eigenspectrum only matters for math.” In practice, it directly affects gradient flow and convergence stability.
🧩 Step 7: Mini Summary
🧠 What You Learned: Linear algebra underlies every Transformer operation — projections, similarities, and stability all depend on it.
⚙️ How It Works: $Q, K, V$ are learned linear projections creating asymmetric relationships; their interactions rely on orthogonality and controlled eigenspectra.
🎯 Why It Matters: Understanding these geometric foundations reveals why Transformers are stable, expressive, and context-aware — and what can go wrong if those properties are violated.