3.1. Choosing the Number of Components
🪄 Step 1: Intuition & Motivation
Core Idea: Choosing how many principal components ($k$) to keep is like choosing how much of your story to tell. If you tell too little, people miss the plot (you lose information). If you tell too much, they get bored (you include noise).
PCA gives you compressed data — but how much compression is enough? That’s what this step is about: balancing information retention with simplicity and efficiency.
Simple Analogy: Imagine reducing a high-resolution image to fewer pixels. You want to keep enough pixels that the image is still recognizable, but not so many that you lose the benefit of compression. The same idea applies to choosing how many components to keep in PCA.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
Each principal component explains a certain amount of the total variance (information) in your dataset. When you order components by descending eigenvalues ($\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \dots$), the first captures the most variance, the second captures the most of what remains, and so on.
The challenge is deciding how many components ($k$) to keep without losing critical information.
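To make this ordering concrete, here is a minimal sketch (assuming NumPy and a made-up toy matrix, neither of which comes from this guide) that computes the covariance eigenvalues and sorts them so the largest-variance component comes first:

```python
# Minimal sketch: ranking principal components by eigenvalue.
# The toy data below is made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)      # covariance matrix of the features
eigenvalues = np.linalg.eigvalsh(cov)       # returned in ascending order
eigenvalues = eigenvalues[::-1]             # reorder: largest variance first

print("Share of total variance per component:",
      np.round(eigenvalues / eigenvalues.sum(), 3))
```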
There are three main approaches (the first and third are sketched in code after this list):
Cumulative Explained Variance (95–99% Rule):
- Compute the ratio of variance explained by each component: $$ \text{Explained Variance Ratio} = \frac{\lambda_i}{\sum_j \lambda_j} $$
- Sum these ratios until you reach 95% (or 99%) — that’s your $k$.
- It ensures you retain most of the “story” in your data.
Scree Plot (Elbow Rule):
- Plot eigenvalues (variance explained) against component numbers.
- Look for the “elbow” — the point where the curve starts flattening.
- Beyond this point, extra components contribute little new information.
Cross-Validation for Model Performance:
- If PCA is part of a pipeline, test model performance for different $k$ values using cross-validation.
- Pick the smallest $k$ that gives near-optimal accuracy.
- This is more computationally expensive but more practical for predictive tasks.
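Below is a minimal sketch, assuming scikit-learn and its bundled digits dataset (the dataset, classifier, and candidate $k$ values are illustrative choices, not from this guide), of how the variance-threshold rule and the cross-validation approach might look in practice:

```python
# Illustrative sketch: two ways to pick k. Dataset and model are placeholders.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# --- Approach 1: cumulative explained variance (95% rule) ---
pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)
k_95 = np.argmax(cumulative >= 0.95) + 1  # smallest k with R_k >= 0.95
print(f"k for 95% variance: {k_95}")

# --- Approach 3: cross-validation over candidate k values ---
for k in (5, 10, 20, 30, 40):
    model = make_pipeline(StandardScaler(), PCA(n_components=k),
                          LogisticRegression(max_iter=2000))
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"k={k:>2}  mean CV accuracy={score:.3f}")
```

In practice you would swap in your own data and model, then pick the smallest $k$ whose cross-validated score is within tolerance of the best.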
Why It Works This Way
The intuition is simple:
- Variance = Information.
- Components with small eigenvalues contribute little to explaining the data’s structure — often they’re just noise.
So by choosing components that cover most variance, you compress data efficiently while preserving signal. The “elbow” point marks the diminishing returns zone — the spot where adding more dimensions stops being worth the cost.
How It Fits in ML Thinking
Selecting $k$ is part of the bias–variance trade-off:
- Fewer components → higher bias, lower variance (simpler, faster, but might lose detail).
- More components → lower bias, higher variance (risks capturing noise; slower).
It’s also a feature engineering decision: choosing $k$ defines how abstract your transformed features will be. In production ML, this decision affects both model performance and pipeline scalability.
📐 Step 3: Mathematical Foundation
Cumulative Explained Variance
For each component $i$, the explained variance ratio is:
$$ r_i = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j} $$
The cumulative variance up to component $k$ is:
$$ R_k = \sum_{i=1}^{k} r_i $$
Choose the smallest $k$ such that $R_k \geq 0.95$ (or your chosen threshold).
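As a quick worked example with made-up eigenvalues $\lambda = (4,\ 3,\ 2,\ 0.5,\ 0.5)$, whose total is $10$:
$$ r = (0.40,\ 0.30,\ 0.20,\ 0.05,\ 0.05), \qquad R_1 = 0.40,\ R_2 = 0.70,\ R_3 = 0.90,\ R_4 = 0.95 $$
so the smallest $k$ meeting a 95% threshold is $k = 4$.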
Scree Plot (Elbow Method)
Plot eigenvalues $\lambda_i$ vs. component index $i$. The curve usually drops sharply and then levels off — the point where it “bends” (the elbow) marks where adding components stops improving information capture significantly.
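The plot itself is straightforward to produce; the sketch below (assuming matplotlib, with made-up eigenvalues) shows the typical pattern:

```python
# Minimal sketch of a scree plot; the eigenvalues are made up for illustration.
import matplotlib.pyplot as plt
import numpy as np

eigenvalues = np.array([4.0, 3.0, 2.0, 0.5, 0.3, 0.1, 0.05, 0.05])

plt.plot(np.arange(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.xlabel("Component index $i$")
plt.ylabel(r"Eigenvalue $\lambda_i$ (variance explained)")
plt.title("Scree plot: the elbow marks diminishing returns")
plt.xticks(np.arange(1, len(eigenvalues) + 1))
plt.show()
```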
🧠 Step 4: Assumptions or Key Ideas
- High Variance = Important: PCA assumes directions with higher variance are more meaningful.
- Linear Relationships: PCA captures only linear structure; it cannot represent nonlinear relationships in the data.
- Noise Filtering: Small components often represent random fluctuations or measurement noise.
- Data-Centric Decision: The optimal number of components depends on your data and use case — not a fixed percentage.
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths:
- Reduces dimensionality efficiently.
- Offers visual tools (variance plots) for decision-making.
- Enhances model generalization by discarding low-variance directions that often correspond to noise.
⚠️ Limitations:
- No fixed rule — 95% is heuristic, not optimal for all cases.
- “Elbow” point can be ambiguous or subjective.
- Overemphasis on variance may ignore meaningful low-variance patterns.
🚧 Step 6: Common Misunderstandings
- “Always keep 95% variance.” → It’s a guideline, not a rule — use domain knowledge and model performance to decide.
- “All components are equally important.” → The first few typically dominate; later ones often describe noise.
- “PCA reduces dimensionality automatically.” → You must explicitly choose $k$ or a variance threshold.
🧩 Step 7: Mini Summary
🧠 What You Learned: PCA components are ranked by how much variance they explain; choosing $k$ determines your balance between simplicity and information.
⚙️ How It Works: Use explained variance plots, scree plots, or cross-validation to find the “sweet spot.”
🎯 Why It Matters: Picking the right number of components ensures efficient, interpretable, and high-performing ML pipelines.