2.3. Analyze the Optimization Process
🪄 Step 1: Intuition & Motivation
Core Idea: After UMAP builds its fuzzy graph — representing which points are close and how strongly — it must bring that abstract graph to life in a 2D or 3D space.
This final stage is like sculpting: starting with a network of invisible relationships and shaping it into something humans can see. But how does UMAP decide where to place every point?
It uses a physics-inspired approach:
- Similar points attract each other.
- Dissimilar points repel each other.
- The result? A beautifully balanced low-dimensional map that captures both local and global structure.
But UMAP doesn’t push every pair of points — that would be too slow! Instead, it uses negative sampling to speed things up while keeping the balance intact.
In short: UMAP turns abstract math into motion — a dance of points finding their natural place in space.
🌱 Step 2: Core Concept
1️⃣ Attractive and Repulsive Forces — The Physics Analogy
UMAP’s optimization feels a lot like simulating gravity and magnetism between points.
Attractive forces: Pull close neighbors together.
- These come from the high $p_{ij}$ values in the fuzzy graph — “we’re supposed to be close.”
Repulsive forces: Push non-neighbors apart.
- These ensure unrelated points don’t collapse into the same cluster.
The algorithm finds equilibrium between these two forces — a stable configuration that visually represents both local neighborhoods and global structure.
You can imagine this as a crowd settling into a room: friends stand near each other, strangers drift apart, and eventually everyone finds a comfortable distance.
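To make the analogy concrete, here is a deliberately naive toy layout in Python (a sketch only, not UMAP's actual algorithm; the point set, neighbor list, and step sizes are all made-up assumptions): neighbor pairs attract, every other pair repels slightly, and the positions settle into an equilibrium.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 30
pos = rng.normal(size=(n_points, 2))               # random initial 2D layout

# Pretend the fuzzy graph said consecutive points are neighbors (a ring).
neighbor_pairs = [(i, (i + 1) % n_points) for i in range(n_points)]

for step in range(200):
    lr = 0.05
    # Attraction: pull each neighbor pair a little closer.
    for i, j in neighbor_pairs:
        delta = pos[j] - pos[i]
        pos[i] += lr * delta
        pos[j] -= lr * delta
    # Repulsion: push *every* pair apart a little. Fine for 30 points,
    # but this all-pairs loop is exactly what negative sampling avoids.
    for i in range(n_points):
        for j in range(i + 1, n_points):
            delta = pos[i] - pos[j]
            push = lr * 0.02 * delta / (np.dot(delta, delta) + 1e-3)
            pos[i] += push
            pos[j] -= push
```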
2️⃣ Negative Sampling — Faster Repulsion
Computing repulsive forces for every non-neighbor pair would be prohibitively expensive: the number of such pairs grows quadratically, reaching billions even for moderately large datasets.
UMAP solves this elegantly using negative sampling, borrowed from word embedding algorithms like Word2Vec.
Instead of computing every repulsion, UMAP:
- Randomly samples a few “negative pairs” (unrelated points).
- Applies repulsive updates only to those.
- Over many iterations, this stochastic process approximates the global repulsion efficiently.
This dramatically cuts computation from $O(N^2)$ to nearly linear time — without significantly harming quality.
Think of it like checking just a few random “non-friends” to remind yourself who not to stand near, rather than inspecting every stranger in the crowd.
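Here is a sketch of that loop, assuming the optimizer walks over the fuzzy-graph edges one at a time. The `attract` and `repel` helpers are placeholders (the real gradient forms appear in Step 3), and `negative_sample_rate` mirrors the umap-learn parameter of the same name, whose default is 5.

```python
import numpy as np

rng = np.random.default_rng(42)

def attract(pos, i, j, lr):
    # Placeholder attractive update: nudge i toward its true neighbor j.
    pos[i] += lr * 0.1 * (pos[j] - pos[i])

def repel(pos, i, k, lr):
    # Placeholder repulsive update: nudge i away from the sampled point k.
    delta = pos[i] - pos[k]
    pos[i] += lr * 0.01 * delta / (np.dot(delta, delta) + 1e-3)

def one_epoch(pos, edges, negative_sample_rate=5, lr=1.0):
    """One pass over the fuzzy-graph edges with negative sampling."""
    n_points = pos.shape[0]
    for i, j in edges:                          # (i, j) are true neighbors
        attract(pos, i, j, lr)                  # one attractive update per edge
        for _ in range(negative_sample_rate):   # a handful of random "negatives"
            k = int(rng.integers(n_points))
            if k != i and k != j:
                repel(pos, i, k, lr)            # repulsion only for sampled pairs

pos = rng.normal(size=(100, 2))
edges = [(i, (i + 1) % 100) for i in range(100)]
one_epoch(pos, edges)
```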
3️⃣ Learning Rate Scheduling — Balancing Movement and Stability
UMAP uses stochastic gradient descent (SGD) — adjusting point positions step-by-step to reduce loss. But if steps are too big, points jitter chaotically; too small, and the algorithm gets stuck.
The learning rate controls how fast or slow the optimization moves.
- High learning rate → faster but riskier (might overshoot).
- Low learning rate → slower but smoother convergence.
UMAP typically uses a decaying learning rate — starting large (to spread things out quickly) and reducing over time (to fine-tune structure).
This ensures that early iterations define the broad layout, while later ones refine the fine details.
It’s like sculpting: rough chisel first, gentle polishing later.
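The reference implementation (umap-learn) decays the step size roughly linearly over the epochs; a minimal sketch of that schedule, assuming an initial learning rate of 1.0 (the library's default):

```python
def learning_rate(epoch, n_epochs, initial_alpha=1.0):
    """Linear decay: big steps for the broad layout, tiny steps for polishing."""
    return initial_alpha * (1.0 - epoch / n_epochs)

# With 200 epochs the step size shrinks from 1.0 toward 0.
print(learning_rate(0, 200))    # 1.0   -> rough chisel
print(learning_rate(100, 200))  # 0.5
print(learning_rate(199, 200))  # 0.005 -> gentle polishing
```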
4️⃣ Convergence Criteria — When to Stop
UMAP runs for a fixed number of epochs (passes over the edges of the fuzzy graph). Each epoch nudges points closer to their optimal positions, until the embedding stabilizes.
Stopping too early → coarse, underdeveloped clusters. Running too long → overfitting noise, or wasted computation.
A typical heuristic:
- Small datasets (≤10k points): 200–400 epochs.
- Large datasets: 100–200 epochs often suffice.
The ideal stopping point is when the cross-entropy loss stops decreasing meaningfully, or when the embedding visually stabilizes.
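In practice this is controlled through the `n_epochs` parameter of umap-learn's `UMAP` class (leaving it as `None` lets the library pick a default based on dataset size). A hedged usage sketch:

```python
import numpy as np
import umap

X = np.random.rand(5000, 50)    # stand-in dataset: 5k points, 50 features

reducer = umap.UMAP(
    n_neighbors=15,     # neighborhood size used to build the fuzzy graph
    n_epochs=300,       # within the 200-400 range suggested above for small data
    learning_rate=1.0,  # initial SGD step size; it decays over the epochs
    random_state=42,    # pins down the stochastic optimization for reproducibility
)
embedding = reducer.fit_transform(X)   # (5000, 2) array of final positions
```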
📐 Step 3: Mathematical Foundation
The Optimization Objective Revisited
UMAP minimizes the following cross-entropy loss:
$$ C = \sum_{i,j} \Big[ -p_{ij} \log(q_{ij}) - (1 - p_{ij}) \log(1 - q_{ij}) \Big] $$

Where:
- $p_{ij}$ = high-dimensional fuzzy relationship strength
- $q_{ij}$ = low-dimensional similarity, modeled from the embedding distance $d_{ij}$ as: $$ q_{ij} = \frac{1}{1 + a \cdot d_{ij}^{2b}} $$
- $a, b$ = curve-shaping constants controlling how attraction decays with distance
UMAP minimizes this function by moving points ($x_i$) to reduce mismatches between $p_{ij}$ and $q_{ij}$.
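As a quick numerical illustration of this objective (the values $a \approx 1.58$ and $b \approx 0.90$ below are roughly what umap-learn fits for its default `min_dist=0.1`, so treat them as assumed constants):

```python
import numpy as np

def cross_entropy_loss(p, d, a=1.58, b=0.90, eps=1e-12):
    """Cross-entropy C for arrays of memberships p_ij and embedding distances d_ij."""
    q = 1.0 / (1.0 + a * d ** (2 * b))              # low-dimensional similarity
    attract = -p * np.log(q + eps)                  # punishes neighbors placed far apart
    repel = -(1.0 - p) * np.log(1.0 - q + eps)      # punishes non-neighbors placed close
    return float(np.sum(attract + repel))

p = np.array([0.9, 0.1])   # a strong neighbor and a near non-neighbor
d = np.array([0.2, 0.2])   # both currently sit at the same small distance
print(cross_entropy_loss(p, d))  # most of the loss comes from the crowded non-neighbor
```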
Negative Sampling Gradient Update
For each point $i$:
- Attractive updates: Sample true neighbors and apply gradient descent.
- Repulsive updates: Sample random non-neighbors (“negative samples”) and apply the repulsive part of the gradient (push away).
Gradient update rule:
$$ x_i = x_i - \eta \cdot \frac{\partial C}{\partial x_i} $$

Where $\eta$ is the learning rate. This selective update scheme drastically reduces computation while maintaining good gradient estimates.
Attractive vs. Repulsive Force Equations
In simplified terms, each pair $(i,j)$ exerts forces:
- Attractive: $$ F_{\text{attr}} = -2ab \, d_{ij}^{2(b-1)} \cdot (1 + a d_{ij}^{2b})^{-2} \, p_{ij} $$
- Repulsive: $$ F_{\text{rep}} = 2b \, d_{ij}^{2(b-1)} \cdot (1 + a d_{ij}^{2b})^{-2} \, (1 - p_{ij}) $$
These forces are applied stochastically during training, producing smooth, globally coherent embeddings.
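Putting the update rule and these simplified force expressions together, here is a sketch of a single pair update (illustrative only; the sign conventions follow the formulas above, with the force acting along $x_i - x_j$, and $a$, $b$ are the same assumed constants as before):

```python
import numpy as np

def pair_update(x_i, x_j, p_ij, eta, a=1.58, b=0.90, attractive=True):
    """Move x_i by one stochastic step from its interaction with x_j."""
    delta = x_i - x_j
    d = np.sqrt(np.dot(delta, delta)) + 1e-12           # embedding distance d_ij
    common = d ** (2 * (b - 1)) / (1.0 + a * d ** (2 * b)) ** 2
    if attractive:
        force = -2.0 * a * b * common * p_ij             # F_attr (negative: pull together)
    else:
        force = 2.0 * b * common * (1.0 - p_ij)          # F_rep (positive: push apart)
    return x_i + eta * force * delta                     # step along (x_i - x_j)

x_i, x_j = np.array([0.0, 0.0]), np.array([1.0, 0.0])
print(pair_update(x_i, x_j, p_ij=0.9, eta=1.0, attractive=True))   # moves toward x_j
print(pair_update(x_i, x_j, p_ij=0.1, eta=1.0, attractive=False))  # moves away from x_j
```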
🧠 Step 4: Key Ideas & Assumptions
- Selective Optimization: Only a subset of pairs is updated in each iteration, which keeps the process efficient without sacrificing quality.
- Force Balance: Embeddings form when attraction and repulsion reach equilibrium.
- Learning Rate Decay: Fast exploration first, fine refinement later.
- Stochastic Convergence: Randomness helps avoid local minima but may cause small variations across runs.
⚖️ Step 5: Strengths, Limitations & Trade-offs
- Negative sampling makes UMAP scalable to millions of points.
- Attractive/repulsive balance captures both cluster detail and global layout.
- Learning rate decay ensures smooth convergence.
- Too few epochs → weak structure or missing small clusters.
- Too few negative samples per edge → noisy repulsion estimates, uneven embeddings.
- Too large learning rates → oscillations or unstable clusters.
🚧 Step 6: Common Misunderstandings
- “Negative sampling means ignoring repulsion.” → No, it’s sampling repulsion smartly — fewer computations, same overall effect.
- “Lower learning rate is always better.” → Not always; it might slow convergence or get stuck early.
- “More epochs always improve results.” → Beyond a point, extra epochs add little value and may amplify noise.
🧩 Step 7: Mini Summary
🧠 What You Learned: UMAP’s optimization process uses attraction, repulsion, and negative sampling to form meaningful, stable embeddings.
⚙️ How It Works: Through stochastic gradient descent, points move under physical-like forces — balancing locality and separation while learning rate schedules ensure smooth convergence.
🎯 Why It Matters: Understanding this process helps you tune epochs, learning rates, and negative sampling wisely: the difference between a clean UMAP map and a chaotic blob.