3.1. Linear Algebra Refresher
🪄 Step 1: Intuition & Motivation
- Core Idea: Transformers are built on linear algebra — specifically, on the idea that words, meanings, and relationships can all be represented as vectors in a geometric space.
Every core Transformer operation — from attention weights to feed-forward layers — is really just a story of matrix multiplication, projection, and similarity.
If you understand the geometry of these transformations, you understand the soul of the Transformer.
- Simple Analogy: Imagine a spotlight shining on different actors on stage. Each actor (word) has a unique position and role, but by rotating or shifting the spotlight (via matrix multiplication), you can change who the audience focuses on — that’s what projections in Transformers do.
🌱 Step 2: Core Concept
Let’s unpack the three major linear algebra ideas that silently power Transformers:
- Matrix Multiplication and Projection Geometry
- Orthogonality, Covariance, and Subspace Decomposition
- Eigenspectrum and Training Stability
1️⃣ Matrix Multiplication and Projection Geometry
At its heart, attention relies on vector projections.
When we compute attention scores as $QK^T$, we’re literally measuring how much each query vector aligns with each key vector — i.e., how much one vector “projects” onto another.
If $q$ and $k$ are row vectors:
$$ \text{similarity}(q, k) = q \cdot k = |q||k| \cos(\theta) $$

This dot product is large when the vectors point in similar directions — meaning the query “finds” that key relevant.
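A quick sanity check in NumPy (toy vectors invented for illustration, not taken from the text above):

```python
import numpy as np

# Toy vectors: the dot product is large when the vectors are aligned
# and zero when they are orthogonal.
q = np.array([1.0, 0.0])
k_aligned = np.array([2.0, 0.0])     # points the same way as q
k_orthogonal = np.array([0.0, 2.0])  # perpendicular to q

print(q @ k_aligned)      # 2.0 -> high relevance
print(q @ k_orthogonal)   # 0.0 -> no relevance
```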
Now, when we multiply matrices:
$$ QK^T $$

each entry $(i, j)$ is a raw score for how strongly token $i$'s query aligns with token $j$'s key; the softmax then turns these scores into attention weights.
Then, multiplying by $V$:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

means we collect weighted combinations of the values, where the weights come from how well queries align with keys; dividing by $\sqrt{d_k}$ keeps the scores from growing with the vector dimension, so the softmax stays well-behaved.
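Here is a minimal NumPy sketch of this formula, assuming a single head with no masking or batching; the shapes and random data are only for illustration, not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # pairwise query-key alignment
    weights = softmax(scores, axis=-1)        # each row is a distribution over keys
    return weights @ V                        # weighted blend of the value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))  # 3 tokens, d_k = d_v = 4
print(attention(Q, K, V).shape)               # (3, 4): one contextualized vector per token
```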
2️⃣ Orthogonality, Covariance, and Subspace Decomposition
Orthogonality means two vectors are perpendicular (their dot product is 0), so they carry information along non-overlapping directions.
In Transformers, this concept helps us build diverse and non-redundant representations — we don’t want all words to point in the same direction in space.
- When $Q$, $K$, or $V$ projections are learned, they often become approximately orthogonal, meaning each captures different relational directions (syntax, semantics, etc.).
Covariance measures how features move together. If model activations are highly correlated, it means they’re redundant — not great for efficient learning.
That’s why Layer Normalization (and sometimes orthogonal initialization) helps: it keeps activation statistics controlled so representations don’t collapse onto a few dominant, highly correlated directions.
Subspace decomposition means breaking a high-dimensional space into meaningful smaller “directions” or subspaces. Each attention head can be thought of as operating in a different subspace, focusing on one type of relation (say, grammar, or meaning).
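A rough sketch of the subspace idea (all dimensions and matrices here are invented for illustration): each head applies its own small projection of the shared embedding, and the rows of a random high-dimensional projection are close to orthogonal.

```python
import numpy as np

d_model, n_heads = 64, 8
d_head = d_model // n_heads

rng = np.random.default_rng(0)
x = rng.normal(size=(d_model,))                       # one token embedding

# Each head projects the same embedding into its own d_head-dimensional subspace.
W_heads = rng.normal(size=(n_heads, d_head, d_model)) / np.sqrt(d_model)
head_outputs = [W @ x for W in W_heads]               # n_heads vectors of size d_head

# Orthogonality check within one projection: the Gram matrix of its rows is
# close to the identity, so the rows point in nearly independent directions.
W = W_heads[0]
gram = W @ W.T
max_off_diag = np.abs(gram - np.diag(np.diag(gram))).max()
print(len(head_outputs), head_outputs[0].shape)       # 8 subspace views of one token
print(round(float(max_off_diag), 3))                  # small relative to the ~1.0 diagonal
```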
3️⃣ Eigenspectrum and Training Stability
When we train deep networks, the weight matrices evolve. Their eigenspectrum — the set of eigenvalues — tells us how the network stretches or squashes signals passing through.
If eigenvalues are too large → activations explode. If too small → gradients vanish.
For a linear transformation $Wx$,
- Eigenvectors are directions that $W$ does not rotate; it only scales them.
- Eigenvalues are how much those directions are stretched or compressed.
So when initializing or regularizing weight matrices, we want eigenvalues roughly around 1 to preserve signal scale.
That’s why Xavier and Kaiming initializations exist — they balance variance to keep signals steady layer to layer.
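A small experiment makes this concrete (sizes are arbitrary and chosen only for illustration). The largest singular value tells us how much $W$ can stretch a signal; Xavier-style $1/d_{in}$ scaling keeps it near a constant instead of growing with width:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512

W_naive = rng.normal(scale=1.0, size=(d, d))                 # unit-variance entries
W_xavier = rng.normal(scale=np.sqrt(1.0 / d), size=(d, d))   # Var(W) = 1/d_in

# Largest singular value = maximum stretch factor applied to any direction.
print(np.linalg.svd(W_naive, compute_uv=False).max())    # ~2*sqrt(d) ≈ 45: signals blow up
print(np.linalg.svd(W_xavier, compute_uv=False).max())   # ~2: signal scale roughly preserved
```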
📐 Step 3: Mathematical Foundation
Projection and Similarity in Matrix Form
For attention:
$$ \text{Score}(Q,K) = QK^T $$

Each element is a projection (dot product) of a query onto a key; it encodes pairwise relevance.
Then, normalized attention:
$$ \alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d_k})}{\sum_{j'} \exp(q_i \cdot k_{j'} / \sqrt{d_k})} $$

weights each key by how similar it is to the query.
Finally:
$$ z_i = \sum_j \alpha_{ij} v_j $$

produces a contextualized embedding — a blend of the values weighted by relevance.
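A tiny worked example (numbers invented for illustration), with two keys and the $\sqrt{d_k}$ scaling already applied: suppose query $q_i$ has scaled scores $2$ against $k_1$ and $0$ against $k_2$. Then

$$ \alpha_{i1} = \frac{e^{2}}{e^{2} + e^{0}} \approx 0.88, \qquad \alpha_{i2} = \frac{e^{0}}{e^{2} + e^{0}} \approx 0.12, \qquad z_i \approx 0.88\,v_1 + 0.12\,v_2 $$

so the output $z_i$ is dominated by the value whose key best matched the query, with a small contribution from the other.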
Eigenspectrum & Weight Initialization
To maintain stability:
$$ \mathrm{Var}\big((Wx)_i\big) = d_{in} \, \mathrm{Var}(W_{ij}) \, \mathrm{Var}(x_j) $$

(assuming independent, zero-mean weights and inputs). We want:

$$ \mathrm{Var}(W_{ij}) = \frac{1}{d_{in}} $$

so that the variance of activations stays roughly constant across layers.
If $W$ has eigenvalues $\lambda_i$, a large spread ($|\lambda_i| \gg 1$ or $\ll 1$) means instability. Balanced eigenvalues ($|\lambda_i| \approx 1$) → consistent signal scale and smooth gradient flow.
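A short sketch of why this matters in depth (depth and width are arbitrary, and there is no nonlinearity, so the effect of $\mathrm{Var}(W)$ is isolated): push a random signal through a stack of linear layers and watch its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 256, 20
x_naive = x_scaled = rng.normal(size=(d,))   # same starting signal for both stacks

for _ in range(depth):
    x_naive = rng.normal(scale=1.0, size=(d, d)) @ x_naive                 # unscaled weights
    x_scaled = rng.normal(scale=np.sqrt(1.0 / d), size=(d, d)) @ x_scaled  # Var(W) = 1/d_in

print(f"unscaled: variance ≈ {x_naive.var():.2e}")   # grows by a factor of ~d per layer
print(f"1/d_in:   variance ≈ {x_scaled.var():.2e}")  # stays around 1
```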
🧠 Step 4: Key Ideas
- Matrix multiplication = geometric projection — the engine of attention.
- Orthogonality ensures diversity in learned representations.
- Eigenspectrum balance keeps training numerically stable.
- Separate Q, K, V projections allow asymmetric relational reasoning — queries ask, keys describe, values answer.
⚖️ Step 5: Strengths, Limitations & Trade-offs
- Provides rich geometric interpretability.
- Enables multiple relational spaces (via heads).
- Orthogonalization promotes stability and diversity.
- Intuitive geometry can get messy in very high dimensions.
- Orthogonality is not strictly enforced during training — can drift.
- Large eigenspectrum spread → unstable optimization if not controlled.
🚧 Step 6: Common Misunderstandings
- “Q, K, and V projections are just copies.” No — they’re separate subspace projections with different relational roles.
- “Orthogonality means independence across layers.” It applies within representations, not across layers.
- “Eigenspectrum only matters for math.” In practice, it directly affects gradient flow and convergence stability.
🧩 Step 7: Mini Summary
🧠 What You Learned: Linear algebra underlies every Transformer operation — projections, similarities, and stability all depend on it.
⚙️ How It Works: $Q, K, V$ are learned linear projections creating asymmetric relationships; their interactions rely on orthogonality and controlled eigenspectra.
🎯 Why It Matters: Understanding these geometric foundations reveals why Transformers are stable, expressive, and context-aware — and what can go wrong if those properties are violated.