1.4. Implement PCA from Scratch
🪄 Step 1: Intuition & Motivation
Core Idea:
Now that you know the why and how of PCA, let's make it come alive with a hands-on walk-through of how PCA is built from scratch.
The magic of PCA becomes real when you realize that, at its core, it's just a few lines of linear algebra: no black box, just elegant math.
Simple Analogy:
Imagine you’re an artist reducing a full-color image (hundreds of color shades) to a few dominant tones.
You’d first identify the most common color patterns, then recreate the picture using only those — that’s PCA in action, except with numbers instead of colors.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
Here’s what a simple PCA “from scratch” does under the hood — without any libraries hiding the process.
Step 1 — Center the Data:
- Subtract the mean of each feature so the data is centered around zero.
- Why? Because PCA assumes data is zero-mean; otherwise, variance will be biased by the mean shift.
- Mathematically:
$$ X_{centered} = X - \text{mean}(X) $$
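A minimal NumPy sketch of this step (the toy matrix `X` below is invented purely for illustration):

```python
import numpy as np

# Toy data: 5 samples, 3 features (made-up values for illustration)
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.2],
    [2.2, 2.9, 0.3],
    [1.9, 2.2, 0.8],
    [3.1, 3.0, 0.1],
])

# Step 1: subtract each feature's mean so every column is zero-mean
X_centered = X - X.mean(axis=0)
print(X_centered.mean(axis=0))  # ~[0, 0, 0] up to floating-point error
```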
Step 2 — Compute the Covariance Matrix:
- This matrix shows how features move together.
- Formula:
$$ \Sigma = \frac{1}{n-1} X_{centered}^T X_{centered} $$
- Each cell in $\Sigma$ tells how two features co-vary, the heartbeat of PCA.
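Continuing the toy example from the snippet above, the covariance matrix is one line of NumPy; the `assert` cross-checks it against NumPy's built-in estimator:

```python
# Step 2: covariance matrix of the centered data (d x d, here 3 x 3)
n = X_centered.shape[0]
cov = X_centered.T @ X_centered / (n - 1)

# Sanity check against NumPy's built-in estimator
assert np.allclose(cov, np.cov(X_centered, rowvar=False))
```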
Step 3 — Perform Eigen Decomposition:
- Solve the equation: $$ \Sigma v_i = \lambda_i v_i $$
- Each eigenvector $v_i$ points in a “direction of variation,” and the corresponding eigenvalue $\lambda_i$ tells how strong that variation is.
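Still continuing the running example, `np.linalg.eigh` is a natural choice here because the covariance matrix is symmetric:

```python
# Step 3: eigen decomposition of the symmetric covariance matrix.
# eigh (rather than eig) guarantees real eigenvalues, returned in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# eigenvectors[:, i] is a unit-length direction; eigenvalues[i] is the variance along it
```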
Step 4 — Sort Eigenvectors by Importance:
- Rank them by eigenvalues (descending order).
- The first few capture the biggest patterns — the rest often represent noise.
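In code (still the running example), sorting amounts to reordering columns:

```python
# Step 4: sort directions by explained variance, largest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Fraction of the total variance captured by each component
explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio)
```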
Step 5 — Select Top-$k$ Components and Project the Data:
- Keep the top $k$ eigenvectors (say $V_k$) and project your data onto them: $$ X_{proj} = X_{centered} V_k $$
- You’ve now reduced dimensions while keeping the structure that matters most.
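The final projection step closes out the running example (here `k = 2` is an arbitrary choice for the toy data):

```python
# Step 5: keep the top-k directions and project the centered data onto them
k = 2
V_k = eigenvectors[:, :k]      # shape (d, k)
X_proj = X_centered @ V_k      # shape (n, k): the reduced representation
print(X_proj.shape)            # (5, 2)
```

Stacked end to end, these five snippets are the whole from-scratch implementation: roughly a dozen lines of NumPy.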
Why It Works This Way
By centering and decomposing, PCA effectively finds new “axes” that best describe the data’s spread.
Each new axis (principal component) is built from combinations of original features — the result is a compressed, rotated version of your data that’s easier to interpret or visualize.
If the original data were a cloud, PCA finds the direction the cloud is longest and flattens the rest — keeping its essence while discarding redundancy.
How It Fits in ML Thinking
Implementing PCA manually teaches you data intuition — not just what libraries do, but why they do it.
This understanding helps in:
- Debugging model preprocessing pipelines.
- Explaining model transformations in interviews.
- Avoiding misuse (like forgetting to center data).
📐 Step 3: Mathematical Foundation
Centering and Covariance Computation
The core matrix for PCA is the covariance matrix:
$$ \Sigma = \frac{1}{n-1} X_{centered}^T X_{centered} $$
- $X_{centered}$: Data with mean removed.
- $\Sigma$: $d \times d$ matrix showing relationships between features.
Each diagonal element is a feature's variance, and each off-diagonal element is the covariance between a pair of features (their correlation, up to scaling).
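A quick, self-contained check of this interpretation on invented random data (variable names here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))               # toy data: 50 samples, 4 features
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(X) - 1)

# Diagonal entries equal the per-feature (unbiased) variances;
# off-diagonal entries are the pairwise covariances.
print(np.allclose(np.diag(cov), X.var(axis=0, ddof=1)))  # True
```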
Eigenvalue Decomposition and Projection
To find the main directions of data spread:
$$ \Sigma v_i = \lambda_i v_i $$
Here:
- $v_i$: direction (principal component).
- $\lambda_i$: variance along that direction.
Once sorted, keep top $k$ eigenvectors and compute:
$$ X_{proj} = X_{centered} V_k $$
🧠 Step 4: Assumptions or Key Ideas
- Data Centering: Always center before computing covariance; otherwise, PCA will misinterpret the mean shift as variance.
- Scaling Matters: Features with large scales dominate the covariance matrix; always standardize when units differ (see the sketch after this list).
- Dimensionality Choice: The number of components ($k$) controls the trade-off between compression and information retention.
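A minimal standardization sketch for the scaling point above (the toy features and their scales are invented; this is manual z-scoring, the same idea as sklearn's StandardScaler):

```python
import numpy as np

rng = np.random.default_rng(2)
# Two toy features on wildly different scales (e.g. kilometres vs. millimetres)
X = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1000, 100)])

# Z-score each feature: zero mean, unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Without this, the second feature would dominate the covariance matrix
print(X.std(axis=0))      # roughly [1, 1000]
print(X_std.std(axis=0))  # [1, 1]
```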
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths:
- Simple and elegant implementation with only a few mathematical steps.
- Transparent process that can be visualized at each stage.
- Provides direct insight into variance-based structure in data.
⚠️ Limitations:
- Sensitive to scaling — standardization is mandatory.
- Only captures linear relationships.
- Interpretation of new features (components) can be abstract.
The more components you keep, the more accurate your reconstruction — but the less you compress.
Choosing $k$ is all about finding the sweet spot between simplicity and fidelity.
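One common heuristic, sketched below on invented random data, is to pick the smallest $k$ whose components explain some target share of the variance (the 95% threshold here is an arbitrary assumption, not a rule):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))   # correlated toy data
Xc = X - X.mean(axis=0)

# Eigenvalues of the covariance matrix, sorted in descending order
eigenvalues = np.linalg.eigvalsh(Xc.T @ Xc / (len(X) - 1))[::-1]

# Smallest k whose cumulative explained variance reaches 95%
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(k, cumulative[:k])
```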
🚧 Step 6: Common Misunderstandings
- “PCA reduces features by deleting them.” → PCA builds new features as combinations of the originals; information is only discarded when you drop components during projection.
- “You can skip centering if data looks okay.” → Never skip it — PCA assumes zero mean.
- “More components always mean better accuracy.” → Not necessarily — after a point, extra components mostly capture noise.
🧩 Step 7: Mini Summary
🧠 What You Learned: How PCA can be built step by step using mean-centering, covariance, eigen decomposition, and projection.
⚙️ How It Works: PCA transforms data by aligning it along its main directions of variance.
🎯 Why It Matters: Implementing PCA from scratch transforms abstract math into intuition; now you understand what sklearn.PCA is actually doing.
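As a closing sanity check, here is a small self-contained comparison of the from-scratch pipeline against `sklearn.decomposition.PCA` (toy data again; the per-axis sign of the components is arbitrary, so the comparison uses absolute values):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data

# From-scratch PCA (the five steps above, condensed)
Xc = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
order = np.argsort(eigenvalues)[::-1]
X_scratch = Xc @ eigenvectors[:, order[:2]]

# Library PCA
X_sklearn = PCA(n_components=2).fit_transform(X)

# Identical up to an arbitrary sign flip per component
print(np.allclose(np.abs(X_scratch), np.abs(X_sklearn)))  # True
```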