2.1. Delve Into the Mathematical Framework

6 min read 1098 words

🪄 Step 1: Intuition & Motivation

Core Idea: We’ve seen UMAP as a practical, powerful visualization tool. But beneath that user-friendly face lies a deep and elegant mathematical brain — built from geometry, topology, and graph theory.

This part is about understanding what makes UMAP fundamentally different from other algorithms like PCA or t-SNE. UMAP doesn’t just reduce dimensions — it respects the shape of your data in its native high-dimensional world.

In other words, while PCA asks “How can I project this data flatly?”, UMAP asks,

“What if I walked along the true curved surface of this data and tried to preserve that experience?”

To answer that, we’ll need to step into the world of Riemannian geometry and manifolds — don’t worry, we’ll keep our math shoes comfy.

🌱 Step 2: Core Concept

1️⃣ Riemannian Geometry — The Language of Curved Spaces

Riemannian geometry is the mathematics of curved surfaces. It’s how we describe spaces where “straight lines” don’t exist — like the surface of a sphere or the folds of a crumpled sheet.

UMAP assumes your high-dimensional data doesn’t fill the entire space — it lies along a curved manifold. Each small neighborhood of points is approximately flat (locally Euclidean), but when you move farther, the curvature reveals itself.

So UMAP’s task is:

Respect local geometry (short distances).
Understand global structure (how local patches connect).
Flatten it carefully without tearing (manifold unfolding).

Think of it like unfolding a paper globe — you can flatten small parts without distortion, but large parts need careful cutting and joining to preserve relationships.

2️⃣ Local Connectivity Approximation — Who’s Really Connected?

In a high-dimensional world, not every pair of points is meaningfully connected. UMAP approximates connectivity by building small local neighborhoods — the fuzzy simplicial sets we learned earlier.

Within each neighborhood, distances are small enough that Euclidean geometry works fine. Across neighborhoods, UMAP relies on topology — how these neighborhoods connect — rather than the exact distances.

That’s what makes UMAP resilient to noise and scaling: it doesn’t care about absolute positions, only how points relate locally.

So, local neighborhoods act as “flat tiles” on a curved surface, and UMAP learns how to stitch those tiles together consistently.

3️⃣ Geodesic Distance — Measuring the True Shortest Path

In curved spaces, the shortest distance between two points isn’t a straight line — it’s a geodesic (the “as-straight-as-possible” curve on the surface).

UMAP approximates these geodesic distances through its k-nearest-neighbor graph.

Each edge in the graph represents a short, trustworthy local connection.
Longer paths are built by chaining local edges together — tracing how the manifold folds through space.

This graph-based approximation allows UMAP to estimate true manifold distances without ever computing them directly — an elegant shortcut that makes it both powerful and fast.

If a straight line cuts through the mountain, UMAP prefers to take the hiking trail that actually winds along its surface — more steps, but more faithful to the terrain.

4️⃣ Spectral Embedding Initialization — Starting Smart

Before optimization begins, UMAP gives itself a head start by using spectral embedding, a mathematically guided initialization method.

Here’s how it works:

Compute the graph Laplacian — a matrix that captures how connected each point is to its neighbors.
Perform eigen-decomposition on that matrix to extract its key structure.
Use the first few eigenvectors (think of them like “vibration modes” of the data graph) to position points in low dimensions.

This gives UMAP a near-optimal starting layout that already reflects the manifold’s structure — making optimization smoother and faster.

Think of it like sketching the rough outline of your painting before filling in the details — it saves time, prevents mistakes, and ensures a balanced composition.

📐 Step 3: Mathematical Foundation

The Riemannian Manifold Metric Tensor

In Riemannian geometry, every manifold has a metric tensor, a function that defines how distances are measured locally.

For two nearby points, the infinitesimal distance $ds$ is defined as:

$$ ds^2 = g_{ij} , dx^i , dx^j $$

Where:

$g_{ij}$ → elements of the metric tensor (a local scaling matrix)
$dx^i, dx^j$ → small coordinate differences

In simple terms, $g_{ij}$ encodes how “stretched” or “compressed” space is in different directions.

The metric tensor is like a local ruler that adapts to your data’s shape. UMAP assumes these rulers vary smoothly, letting it respect curved distances without ever explicitly computing $g_{ij}$.

The Graph Laplacian and Eigen-Decomposition

The graph Laplacian is defined as:

$$ L = D - W $$

Where:

$W$ → adjacency matrix (connection strengths between points)
$D$ → diagonal matrix of node degrees (sum of connections per node)

UMAP uses eigenvectors of $L$ to get a spectral embedding — a low-dimensional map that respects the smooth connectivity of the graph.

The smallest eigenvectors of $L$ describe the “softest modes” of variation — the broadest patterns of connection. This initialization ensures the final embedding already carries the manifold’s essential shape.

🧠 Step 4: Key Ideas & Assumptions

Manifold Hypothesis: The data lies on a smooth, continuous manifold.
Local Euclidean Approximation: Small patches can be treated as flat.
Connectivity Reflects Geometry: Relationships between points define the shape better than raw coordinates.
Spectral Initialization = Smart Head Start: Graph eigenvectors provide a topologically aware beginning.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Mathematically principled foundation (geometry + topology).
Robust to noise and scaling distortions.
Spectral initialization ensures stability and faster convergence.

Relies on manifold assumption — may fail for highly discontinuous data.
Eigen-decomposition can be computationally heavy for very large graphs.
Local approximations might blur fine details if neighborhoods are too large.

UMAP trades exact distance accuracy for topological faithfulness — it doesn’t measure precisely, it measures meaningfully. This balance is what makes it both practical and theoretically elegant.

🚧 Step 6: Common Misunderstandings

🚨 Common Misunderstandings (Click to Expand)

“Riemannian geometry means UMAP computes actual curvature.” → No, it approximates curvature implicitly through local neighborhoods.
“Spectral embedding is just PCA.” → Spectral methods use graph connectivity, not raw variance — they preserve topology, not linear directions.
“Geodesic distances are measured directly.” → They’re approximated through shortest paths on the kNN graph.

🧩 Step 7: Mini Summary

🧠 What You Learned: UMAP’s mathematical roots come from Riemannian geometry and spectral graph theory, letting it model data as a curved manifold.

⚙️ How It Works: It approximates geodesic distances through local neighborhoods, uses graph Laplacians for spectral initialization, and preserves manifold continuity.

🎯 Why It Matters: This mathematical foundation explains why UMAP succeeds where simpler projection methods fail — it sees not just points, but the invisible shape connecting them.

2.2. Dive into Graph-Based Learning 1.4. Connect to Practical Applications and Visualizations