1.2. Derive PCA Mathematically (Step-by-Step)
🪄 Step 1: Intuition & Motivation
Core Idea:
So far, we’ve learned what PCA does — it rotates the data to capture maximum variance.
Now we’ll learn how PCA figures out those directions mathematically.
This part is about understanding how PCA turns “find the most spread-out direction” into actual math using covariance and eigen decomposition.
Simple Analogy:
Think of a ball of clay (your dataset) that you flatten against a wall (lower dimension).
PCA mathematically figures out which wall angle keeps the clay’s original shape as much as possible.
To do that, it looks at how the clay points move together (covariance) and chooses the orientation where that spread (variance) is the largest.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
Here’s the backstage story of PCA’s math — simplified but precise:
Step 1 — Center the Data:
First, PCA removes the “bias” by centering the dataset.
Every feature’s mean is subtracted so the data now sits around the origin (zero mean).
This ensures the directions we find aren’t influenced by position, only by shape and spread.
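A minimal NumPy sketch of this centering step (the toy matrix `X` below is made up purely for illustration):

```python
import numpy as np

# Toy dataset: 5 samples, 3 features (made-up numbers for illustration)
X = np.array([[2.0,  4.0, 1.0],
              [3.0,  6.0, 0.0],
              [4.0,  8.0, 2.0],
              [5.0, 10.0, 1.0],
              [6.0, 12.0, 3.0]])

# Subtract each feature's mean so the data sits around the origin
X_centered = X - X.mean(axis=0)

print(X_centered.mean(axis=0))  # ~[0, 0, 0] up to floating-point error
```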
Step 2 — Compute Covariance Matrix ($\Sigma$):
Covariance tells us how features move together.
If two features increase or decrease together, they have a positive covariance.
Mathematically:
$$ \Sigma = \frac{1}{n-1} X^T X $$
where $X$ is your centered data. This $\Sigma$ is like a map of relationships — a snapshot of how every feature depends on every other.
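Here is how that covariance computation might look in NumPy, with synthetic data standing in for your centered matrix; `np.cov` serves as a cross-check:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features (synthetic)
X_centered = X - X.mean(axis=0)

n = X_centered.shape[0]
Sigma = X_centered.T @ X_centered / (n - 1)   # d x d covariance matrix

# Cross-check against NumPy's built-in estimator (rowvar=False: columns are features)
print(np.allclose(Sigma, np.cov(X_centered, rowvar=False)))  # True
```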
Step 3 — Find Principal Directions (Eigenvectors):
We now want the direction where data varies the most — that’s like finding a line that “fits through” the data cloud.
PCA finds this by solving:
$$ \Sigma v_i = \lambda_i v_i $$
Here:
- $v_i$ → direction (eigenvector).
- $\lambda_i$ → variance captured along that direction (eigenvalue).
This equation literally says: “Find the directions that don’t change direction when multiplied by $\Sigma$ — they only get scaled in size.”
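In code, this eigen decomposition is a single call to `np.linalg.eigh` (used here because $\Sigma$ is symmetric); the synthetic data is again just a placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_centered = X - X.mean(axis=0)
Sigma = np.cov(X_centered, rowvar=False)

# eigh is the right tool for a symmetric matrix; it returns eigenvalues
# in ascending order and the eigenvectors as columns
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

print(eigenvalues)           # variance captured along each direction
print(eigenvectors[:, -1])   # eigenvector with the largest eigenvalue
```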
Step 4 — Sort and Select Top Directions:
Eigenvectors are ranked by their eigenvalues.
The first few have the largest $\lambda$, meaning they capture the most variance.
If you keep only the top $k$, you’ve reduced dimensions while retaining maximum information.
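A short sketch of the sorting-and-selection step (recall that `np.linalg.eigh` returns eigenvalues in ascending order, so we reverse them; the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))

# Sort eigenpairs by eigenvalue, largest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

k = 2
V_k = eigenvectors[:, :k]                          # top-k principal directions
explained = eigenvalues[:k].sum() / eigenvalues.sum()
print(f"Variance retained by top {k} components: {explained:.2%}")
```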
Step 5 — Project Data:
Finally, we project the original data $X$ onto those top-$k$ directions:
$$ X_{proj} = X V_k $$
where $V_k$ contains the top $k$ eigenvectors as columns.
Voilà — your data is now expressed in the most informative directions!
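Putting the five steps together, a minimal end-to-end sketch might look like the following; the comparison against scikit-learn’s `PCA` is only a sanity check and should agree up to per-component sign flips:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))  # synthetic, correlated features
X_centered = X - X.mean(axis=0)

Sigma = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
V_k = eigenvectors[:, np.argsort(eigenvalues)[::-1]][:, :2]   # top-2 directions

X_proj = X_centered @ V_k                  # data expressed in the top-2 directions

# scikit-learn should agree with the manual result up to sign flips per component
X_sklearn = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(X_proj), np.abs(X_sklearn)))  # expected: True
```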
Why It Works This Way
Imagine you have points scattered across a plane. PCA looks for the axis that cuts through the data’s longest stretch — because that’s where the most variation lies.
Mathematically, that’s equivalent to maximizing the variance of the projection:
$$ \max_{\|v\| = 1} \; v^T \Sigma v $$
This ensures we find the direction that “stretches” the data the most while keeping the vector normalized (unit length).
Each next axis (second, third, etc.) is found under the rule that it must be orthogonal (perpendicular) to the previous ones — so each captures new information.
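To make that link explicit: maximizing the projected variance $v^T \Sigma v$ under the unit-norm constraint $v^T v = 1$ (via a Lagrange multiplier $\lambda$) leads straight back to the eigenvector equation from Step 3:
$$ \mathcal{L}(v, \lambda) = v^T \Sigma v - \lambda \,(v^T v - 1), \qquad \frac{\partial \mathcal{L}}{\partial v} = 2\Sigma v - 2\lambda v = 0 \;\Rightarrow\; \Sigma v = \lambda v $$
So the optimal direction is an eigenvector of $\Sigma$, and the variance it achieves is exactly its eigenvalue.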
How It Fits in ML Thinking
PCA’s math turns raw, high-dimensional data into compact, clean representations — reducing noise and redundancy.
Think of it as the “data cleaning before the model,” but done with math instead of intuition.
It prepares the data for better, faster, and often more accurate learning.
📐 Step 3: Mathematical Foundation
Covariance Matrix
- $X$: Centered data matrix ($n$ samples × $d$ features).
- $\Sigma$: Covariance matrix ($d \times d$).
- Each element $\Sigma_{ij}$ tells how features $i$ and $j$ move together.
If $\Sigma_{ij}$ is large and positive → they grow together.
If negative → one grows as the other shrinks.
If near zero → there’s no clear linear relationship between them.
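A tiny numerical illustration of those three cases (the features `a`, `b`, `c`, `d` are made up for this example):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 2 * a + rng.normal(scale=0.1, size=500)    # moves with a    -> positive covariance
c = -a + rng.normal(scale=0.1, size=500)       # moves against a -> negative covariance
d = rng.normal(size=500)                       # independent of a -> near-zero covariance

Sigma = np.cov(np.column_stack([a, b, c, d]), rowvar=False)
print(np.round(Sigma, 2))   # check the signs of the off-diagonal entries
```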
Eigen Decomposition
- $v_i$: the direction (principal component).
- $\lambda_i$: the amount of variance along that direction.
The math says: “Multiply $\Sigma$ by $v_i$, and you get the same direction scaled by $\lambda_i$.”
That means $v_i$ is special — it’s like the purest direction of data spread.
The eigenvalue tells you how far the data stretches along that direction.
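You can verify this property numerically: multiplying $\Sigma$ by each eigenvector only rescales it, never rotates it. The data below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Sigma = np.cov(X - X.mean(axis=0), rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# Each eigenvector comes back as the same direction, scaled by its eigenvalue
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(Sigma @ v, lam * v))   # True for every pair
```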
Projection Formula
- $V_k$: top $k$ eigenvectors stacked column-wise.
- $X_{proj}$: your data re-expressed in fewer dimensions.
This is literally like taking a photograph of your data from the “best angle.”
Like folding a large map neatly without losing key landmarks.
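One way to see the “photograph” idea in code is to project and then unfold the data back into the original space: the reconstruction stays close to the original when the top components carry most of the variance (synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # correlated synthetic data
mean = X.mean(axis=0)
X_centered = X - mean

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
V_k = eigenvectors[:, np.argsort(eigenvalues)[::-1]][:, :2]   # top-2 directions

X_proj = X_centered @ V_k              # the low-dimensional "photograph"
X_approx = X_proj @ V_k.T + mean       # unfold it back into the original space

err = np.mean((X - X_approx) ** 2)
print(f"Mean squared reconstruction error: {err:.4f}")
```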
🧠 Step 4: Assumptions or Key Ideas
- Zero Mean: PCA assumes all features are centered (no offset).
- Linearity: It only finds straight-line directions — curved structures are invisible to it.
- Orthogonality: Each new principal component must be perpendicular to previous ones — ensuring no redundancy.
- Variance as Information: The assumption that high variance = meaningful signal.
These principles define the kind of structure PCA can and cannot capture.
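As a quick numerical check on the orthogonality idea above: the eigenvectors of a (symmetric) covariance matrix form a set of mutually perpendicular unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
Sigma = np.cov(X - X.mean(axis=0), rowvar=False)

_, V = np.linalg.eigh(Sigma)

# Principal directions are orthonormal: V^T V = I
print(np.allclose(V.T @ V, np.eye(V.shape[1])))   # True
```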
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths:
- Provides a clear mathematical framework for dimensionality reduction.
- Links geometry, algebra, and statistics seamlessly.
- Works well as a preprocessing step before any ML model.
⚠️ Limitations:
- Only detects linear relationships.
- Sensitive to scaling — features with larger variance dominate.
- Hard to interpret — principal components are mixtures of features.
PCA sacrifices interpretability for simplicity and efficiency.
You lose direct feature meaning but gain a compressed, cleaner view of your data’s underlying structure.
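To see the scaling sensitivity in action, compare how much variance the first component grabs before and after standardizing two made-up features that live on very different scales:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two made-up features on very different scales, e.g. metres vs. millimetres
height_m = rng.normal(1.7, 0.1, size=300)
width_mm = rng.normal(400, 50, size=300)
X = np.column_stack([height_m, width_mm])

def top_variance_ratio(X):
    """Fraction of total variance captured by the first principal component."""
    Xc = X - X.mean(axis=0)
    eigenvalues = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))
    return eigenvalues.max() / eigenvalues.sum()

print(top_variance_ratio(X))                  # ~1.0: the millimetre-scale feature dominates
print(top_variance_ratio(X / X.std(axis=0)))  # ~0.5: standardized features contribute roughly equally
```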
🚧 Step 6: Common Misunderstandings
- “Eigenvectors are random directions.” → No, they’re carefully chosen to maximize variance and remain orthogonal.
- “Covariance and correlation are the same.” → Covariance depends on scale; correlation standardizes it.
- “PCA reduces data arbitrarily.” → It’s not arbitrary — it minimizes reconstruction error while keeping maximal variance.
🧩 Step 7: Mini Summary
🧠 What You Learned: PCA’s core math revolves around covariance, eigen decomposition, and projection.
⚙️ How It Works: It finds orthogonal directions of maximum variance and projects data onto them.
🎯 Why It Matters: This mathematical process lets PCA compress data intelligently — simplifying without losing important structure.