1.3. Understand the Optimization Objective


🪄 Step 1: Intuition & Motivation

Core Idea: Up to now, we’ve learned that UMAP builds a fuzzy graph — a soft network of relationships that describes how data points connect in high-dimensional space.

Now comes the crucial part: turning that fuzzy graph into a meaningful 2D or 3D layout. This is where the optimization objective comes in.

In plain terms, UMAP wants to draw a map in low dimensions that looks and feels as much like the original high-dimensional graph as possible.

Imagine you’ve built a friendship network between people based on shared interests. Now you’re trying to plot everyone on a 2D map — close friends should appear near each other, while unrelated people should stay far apart.

UMAP achieves this by minimizing a special kind of cross-entropy — a measure of how much the low-dimensional friendships differ from the original high-dimensional ones.


🌱 Step 2: Core Concept

The Goal — Match Two Worlds

UMAP creates two fuzzy sets:

  1. One in high-dimensional space (the original relationships).
  2. Another in low-dimensional space (the new embedding).

Each expresses how strongly two points “belong together.” The optimization process tries to make these two sets agree as much as possible.

If two points were close before, UMAP pulls them together. If they were far apart before, UMAP pushes them away.

It’s like saying:

“Let’s keep the same friendships in our 2D world as we had in our 100D world.”


Why Cross-Entropy?

Cross-entropy is a measure of difference between two probability-like distributions.

In UMAP’s case, it measures how different the pairwise connection strengths are between the original fuzzy graph ($p_{ij}$) and the embedding fuzzy graph ($q_{ij}$).

If they match perfectly, the cross-entropy is small — that means our low-dimensional layout faithfully preserves the original structure.

If they don’t match, the loss grows — UMAP “feels” the distortion and adjusts the positions until the mismatch shrinks.


How the Optimization Works

UMAP uses stochastic gradient descent (SGD) — a method where the algorithm moves step-by-step, adjusting point positions to reduce loss gradually.

Here’s the basic intuition:

  • For every pair of points $(i, j)$:

    • If they were close before (high $p_{ij}$), pull them together.
    • If they were far apart (low $p_{ij}$), push them apart — but not too aggressively (to avoid noise).

This is done iteratively until the system settles into a stable structure where the total cross-entropy is minimized.

Each “epoch” is like gently nudging a point cloud until it finds a natural, tension-balanced shape — like marbles rolling into place on a curved surface.
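
To make this concrete, here is a minimal NumPy sketch of the sampling-based pull/push loop, assuming a precomputed edge list with membership strengths. The force formulas and step sizes are illustrative stand-ins, not umap-learn's exact gradients.

```python
import numpy as np

def toy_umap_epoch(embedding, edges, weights, lr=1.0, n_neg=5, rng=None):
    """One simplified optimization epoch over the fuzzy graph.

    embedding : (n_points, 2) float array of current low-dimensional positions
    edges     : list of (i, j) pairs that have nonzero membership p_ij
    weights   : the p_ij values for those edges (used as sampling probabilities)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(embedding)
    for (i, j), p_ij in zip(edges, weights):
        if rng.random() > p_ij:          # sample edges in proportion to p_ij
            continue
        # Attraction: nudge connected points toward each other.
        delta = embedding[j] - embedding[i]
        embedding[i] += lr * 0.1 * delta
        embedding[j] -= lr * 0.1 * delta
        # Repulsion: push i away from a few random "negative" samples,
        # with a force that fades quickly with distance.
        for k in rng.integers(0, n, size=n_neg):
            if k == i or k == j:
                continue
            away = embedding[i] - embedding[k]
            embedding[i] += lr * 0.01 * away / (np.dot(away, away) + 1e-3)
    return embedding
```

Running this repeatedly on a random initial layout mimics the epochs described above: attraction acts only along graph edges, while repulsion is applied to a handful of negative samples per update rather than to every pair.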


How It Differs from t-SNE

t-SNE also tries to preserve relationships, but it minimizes Kullback–Leibler (KL) divergence, which is asymmetric: it heavily penalizes placing genuinely close points far apart, yet barely penalizes unrelated points that end up close together.

That’s why t-SNE often creates tightly separated clusters, sometimes losing global context.

UMAP’s cross-entropy, on the other hand, contains both an attractive term and an explicit repulsive term, so mismatches in both directions are penalized and global structure is preserved alongside local detail. UMAP is also computationally faster due to:

  • Spectral initialization: A smart starting point that reduces random wandering.
  • Approximate nearest neighbors: Efficiently finding relationships.
  • Parallelizable SGD updates: Using modern hardware effectively.

In short, UMAP doesn’t just produce nicer-looking plots — its objective and implementation make it faster and more stable in practice.
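
If you use the umap-learn package, these choices show up directly as constructor arguments. A usage sketch on synthetic data (the parameter values are just common defaults, not recommendations):

```python
# pip install umap-learn
import numpy as np
import umap

X = np.random.rand(1000, 50)          # synthetic high-dimensional data

reducer = umap.UMAP(
    n_neighbors=15,     # size of the approximate nearest-neighbor graph
    min_dist=0.1,       # how tightly points may pack in the embedding
    n_components=2,     # target dimensionality
    init="spectral",    # spectral initialization (the default starting layout)
    learning_rate=1.0,  # SGD step size
    n_epochs=200,       # number of optimization epochs
    random_state=42,    # note: fixing the seed can limit parallelism
)
embedding = reducer.fit_transform(X)  # shape (1000, 2)
```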


📐 Step 3: Mathematical Foundation

The Cross-Entropy Objective

UMAP minimizes the difference between high- and low-dimensional fuzzy graphs using:

$$ C = \sum_{i,j} \Big[ -p_{ij}\log(q_{ij}) - (1 - p_{ij})\log(1 - q_{ij}) \Big] $$

Where:

  • $p_{ij}$ → membership strength in high-dimensional space (how close $i$ and $j$ were originally)
  • $q_{ij}$ → membership strength in low-dimensional space (how close $i$ and $j$ are now)

UMAP is basically saying:

“If two points were close before, make them close again. If they weren’t, don’t force them together.”

The first term $-p_{ij}\log(q_{ij})$ pulls similar points closer, while the second term $-(1 - p_{ij})\log(1 - q_{ij})$ pushes dissimilar points apart.
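
A quick numeric check of these two terms (plain Python, with values picked for illustration) shows how the loss reacts to each kind of mismatch:

```python
import math

def pair_cross_entropy(p, q, eps=1e-9):
    """Binary cross-entropy for a single pair: attraction term + repulsion term."""
    attract = -p * math.log(q + eps)          # large when p is high but q is low
    repel = -(1 - p) * math.log(1 - q + eps)  # large when p is low but q is high
    return attract + repel

print(pair_cross_entropy(0.9, 0.85))  # ~0.34: layouts agree, small loss
print(pair_cross_entropy(0.9, 0.05))  # ~2.70: close pair placed far apart, strong pull
print(pair_cross_entropy(0.1, 0.95))  # ~2.70: far pair placed too close, strong push
```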


Defining $q_{ij}$ — The Low-Dimensional Similarity

In the low-dimensional embedding, the similarity is modeled as:

$$ q_{ij} = \frac{1}{1 + a \cdot d_{ij}^{2b}} $$

Where:

  • $d_{ij}$ → distance between points $i$ and $j$ in low-dimensional space
  • $a$ and $b$ → constants that control how quickly the curve drops (in practice, fitted from the min_dist and spread parameters rather than set by hand)

This function ensures that:

  • Nearby points have high $q_{ij}$ (strong attraction)
  • Distant points have small $q_{ij}$ (weak repulsion)

The curve decays smoothly — like how friendship weakens with distance. Far-apart pairs exert only a gentle pull or push, so it is the genuinely close pairs that dominate the layout.
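
A short sketch of this curve, using a and b values close to what umap-learn fits for its defaults (min_dist = 0.1, spread = 1.0); treat the exact numbers as illustrative:

```python
import numpy as np

def low_dim_similarity(d, a=1.58, b=0.9):
    """q_ij as a function of embedding distance d (illustrative a, b values)."""
    return 1.0 / (1.0 + a * d ** (2 * b))

for d in [0.0, 0.1, 0.5, 1.0, 2.0, 5.0]:
    print(f"d = {d:>3}  ->  q = {low_dim_similarity(d):.3f}")
```

Nearby distances keep $q_{ij}$ close to 1, while the value fades gradually rather than collapsing to zero as distance grows, which is what keeps the repulsion gentle.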

Gradient Updates via SGD

At every optimization step, UMAP updates each point’s coordinates slightly:

$$ \text{position}_{\text{new}} = \text{position}_{\text{old}} - \eta \cdot \nabla C $$

Where:

  • $\eta$ is the learning rate.
  • $\nabla C$ is the gradient (the direction to move in to reduce the loss).

This happens thousands of times, gradually refining the layout.
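
As a concrete example of one such step, the gradient of the single-pair cross-entropy under the $q_{ij}$ curve above can be written out with the chain rule. The a and b values and the unclipped step below are illustrative simplifications; the real implementation samples edges, clips gradients, and decays the learning rate over epochs.

```python
import numpy as np

def pair_grad(y_i, y_j, p, a=1.58, b=0.9, eps=1e-9):
    """Gradient of -p*log(q) - (1-p)*log(1-q) with respect to y_i, via the chain rule."""
    diff = y_i - y_j
    d = np.linalg.norm(diff) + eps
    q = 1.0 / (1.0 + a * d ** (2 * b))
    dC_dq = -p / (q + eps) + (1 - p) / (1 - q + eps)
    dq_dd = -2 * a * b * d ** (2 * b - 1) * q ** 2
    return dC_dq * dq_dd * diff / d

# One SGD step for a pair that should be close (p = 0.9) but starts far apart:
eta = 1.0
y_i, y_j = np.array([0.0, 0.0]), np.array([3.0, 0.0])
y_i = y_i - eta * pair_grad(y_i, y_j, p=0.9)
print(y_i)  # y_i has moved toward y_j along the x-axis
```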

Unlike t-SNE, UMAP’s updates can be done asynchronously and in parallel, which greatly speeds up convergence — especially on large datasets.


🧠 Step 4: Key Ideas & Assumptions

  • Local and global consistency: Both neighborhood details and broad structure matter.
  • Symmetric optimization: Unlike t-SNE’s one-sided loss, UMAP’s cross-entropy balances both directions.
  • Smooth decay: The low-dimensional attraction function ensures stable embeddings.
  • Stochastic learning: Randomized updates avoid local minima and speed up convergence.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Optimizes both local and global structures effectively.
  • Faster and more scalable than t-SNE.
  • Produces embeddings that are more reproducible and stable.

Limitations:

  • Requires parameter tuning for best balance (e.g., learning rate, epochs).
  • Cross-entropy is harder to interpret intuitively than KL divergence for some users.
  • May produce slightly noisier boundaries if under-trained.

Trade-off: UMAP sacrifices some fine-grained local precision to achieve faster, more global coherence. This makes it ideal for large datasets where t-SNE would be prohibitively slow.

🚧 Step 6: Common Misunderstandings

  • “UMAP uses KL divergence like t-SNE.” → No, it minimizes a cross-entropy objective that penalizes mismatches in both directions (attraction and repulsion).
  • “UMAP updates all pairs at once.” → No, it samples graph edges stochastically and uses negative sampling for repulsion, which is what makes it efficient.
  • “UMAP just visually clusters data.” → It’s an optimization algorithm, not a visual trick — its embeddings represent genuine geometric structure.

🧩 Step 7: Mini Summary

🧠 What You Learned: UMAP optimizes a cross-entropy objective that aligns high-dimensional and low-dimensional structures.

⚙️ How It Works: It pulls close points together and pushes far points apart via stochastic gradient descent, balancing local and global relationships.

🎯 Why It Matters: Understanding this optimization explains why UMAP runs faster, preserves structure better, and produces more interpretable embeddings than other methods.
