1.2. Learn the Mathematical Backbone — Topology and Graph Construction
🪄 Step 1: Intuition & Motivation
Core Idea: So far, we’ve seen why dimensionality reduction matters. Now, let’s peek under the hood of UMAP’s mathematical skeleton — the part that gives it structure and intelligence.
UMAP’s secret weapon lies in how it connects points together to capture both local and global geometry. It does this using ideas borrowed from topology (the math of shapes and connections) and graph theory (the math of relationships).
In simple words:
“UMAP doesn’t see data as coordinates — it sees it as a web of friendships.”
🌱 Step 2: Core Concept
What’s a Manifold, Really?
Imagine a crumpled sheet of paper. Now, that paper is sitting in your 3D room — so technically, it’s part of a 3D space. But all the interesting stuff (the ink on the paper) lives on its 2D surface.
That’s a manifold — a lower-dimensional surface living inside a higher-dimensional space. Each small region (local area) can be treated as flat, even if the overall structure is curved or twisted.
So, when UMAP says it assumes “data lies on a manifold,” it’s like saying:
“Although our data may have 100 features, the real patterns can be described in fewer — say, 2 or 3 — meaningful directions.”
Local vs Global Structure — The Balancing Act
- Local structure: Who are your nearest neighbors? What’s your tiny neighborhood like?
- Global structure: How do these neighborhoods connect to form the bigger picture?
UMAP’s goal is to preserve both:
- If it only cared about local structure, you’d get perfect small clusters but lose overall organization (like t-SNE).
- If it only cared about global structure, you’d get a rough overview but miss fine details (like PCA).
UMAP gives you the best of both worlds by tuning how “wide” each neighborhood is and how much overlap there is between them.
Building a k-Nearest Neighbor Graph (kNN)
Here’s how UMAP builds the foundation for its magic:
- For each data point, find its k-nearest neighbors (using distance metrics like Euclidean or cosine).
- Create a graph where each point is a node, and edges connect to its neighbors.
- Assign weights to those edges — representing how strongly two points are connected (closer points → stronger connection).
So what you get is not just dots in space — it’s a network of relationships that represents the shape of your data manifold.
Enter Fuzzy Simplicial Sets — The Soft Friendship Network
Now comes the topological twist: Instead of saying “A and B are either connected or not,” UMAP says,
“A and B are connected to some degree.”
This is where fuzzy set theory steps in — it allows partial membership.
In UMAP’s world, every connection between two points has a membership strength between 0 and 1 — representing how confidently UMAP believes they’re neighbors.
- A value near 1 → “Strongly connected; they belong to the same local region.”
- A value near 0 → “Barely related; might be distant in the manifold.”
UMAP merges all these “fuzzy friendships” to create a fuzzy simplicial set — a complex mathematical way of saying:
“Here’s a flexible map of local connections that gently overlap to represent the whole data shape.”
The Role of Parameters: n_neighbors and min_dist
n_neighbors — How Big Is Your Neighborhood?
- Controls the size of the local region considered for each point.
- Small value → Focus on local details (tight clusters).
- Large value → Focus on global structure (broad relationships).
Think of it like choosing how zoomed in your map is — small neighborhoods give street-level detail; large ones show the city’s layout.
min_dist — How Close Can Points Get in the Embedding?
- Determines how tightly UMAP packs similar points in the low-dimensional space.
- Smaller value → Points can sit very close, emphasizing clear cluster separation.
- Larger value → Points are more spread out, improving continuity between clusters.
Together, these two parameters control UMAP’s zoom and compression — the balance between granularity and overview.
📐 Step 3: Mathematical Foundation
The kNN Graph Weight Formula
For each point $i$ and neighbor $j$, UMAP computes a connection strength using an exponential decay:
$$ w_{ij} = \exp\left(-\frac{d_{ij} - \rho_i}{\sigma_i}\right) $$- $d_{ij}$: Distance between points $i$ and $j$
- $\rho_i$: Local connectivity threshold (distance to the closest neighbor)
- $\sigma_i$: Normalizing factor controlling smoothness of the neighborhood
UMAP adjusts $\sigma_i$ so that the total weight sums up to a consistent neighborhood size, ensuring uniform density handling.
Combining Fuzzy Relationships Symmetrically
After building all local fuzzy sets, UMAP merges them using this formula:
$$ \mu_{ij} = 1 - (1 - w_{ij})(1 - w_{ji}) $$This ensures that if either point thinks the other is a friend, their connection remains strong.
🧠 Step 4: Key Ideas & Assumptions
- Manifold Hypothesis: The data lies on a smooth, lower-dimensional manifold.
- Locality Principle: Relationships between nearby points carry more meaning than distant ones.
- Fuzziness Reflects Uncertainty: Relationships aren’t binary — they exist in degrees.
- Balance Matters: Hyperparameters control the trade-off between local precision and global continuity.
⚖️ Step 5: Strengths, Limitations & Trade-offs
- Beautifully captures nonlinear relationships.
- Fuzzy graphs make UMAP robust to noise and local variations.
- Highly flexible — tunable for zoomed-in or zoomed-out embeddings.
- Too small
n_neighborsmay overfit local noise. - Too large
n_neighborsmay blur fine structure. - Misunderstanding parameter roles can distort embeddings dramatically.
n_neighbors = zoom level, min_dist = how tightly things cluster.
Finding the sweet spot is key to meaningful visualizations.🚧 Step 6: Common Misunderstandings
🚨 Common Misunderstandings (Click to Expand)
- “UMAP connects every point to every other.” → No, it builds a sparse graph — each point connects to only its k nearest neighbors.
- “
n_neighborscontrols how many clusters you’ll see.” → Not exactly; it influences how connected the clusters appear, not their number. - “Fuzzy sets are just probabilities.” → They’re membership strengths, not strict probabilities — they measure closeness, not frequency.
🧩 Step 7: Mini Summary
🧠 What You Learned: UMAP models data as a fuzzy graph that captures local and global relationships using topological and geometric intuition.
⚙️ How It Works: It builds kNN graphs, assigns fuzzy membership strengths, and merges them into a manifold representation.
🎯 Why It Matters: This structure allows UMAP to compress data meaningfully while preserving the shape of its intrinsic geometry.