2.1. K-Means as an Expectation-Maximization (EM) Special Case


🪄 Step 1: Intuition & Motivation

  • Core Idea: At first glance, K-Means and Gaussian Mixture Models (GMMs) seem totally different — one is geometric, the other probabilistic. But here’s the twist: K-Means is actually a simplified cousin of GMM, running under the same logic as the Expectation-Maximization (EM) algorithm — just without the “softness.”

    In essence, K-Means says “Each point belongs to exactly one cluster”, while GMM gently replies, “Maybe, but let’s give it a probability.”

  • Simple Analogy:

    Imagine sorting songs into playlists. K-Means says, “This song goes only in the Rock playlist.” GMM says, “This song sounds 70% Rock, 30% Pop — let’s record that uncertainty.”


🌱 Step 2: Core Concept

What’s Happening Under the Hood?

Both K-Means and GMM try to explain data by finding cluster structures, but they think differently:

  • K-Means: Uses hard assignments. Each data point strictly belongs to one cluster, determined by the nearest centroid.
  • GMM: Uses soft assignments. Each data point belongs to every cluster with a probability.

Underneath, both follow an iterative two-step process that alternates between “guessing” and “updating” (see the code sketch after this list).

  1. E-step (Expectation):

    • Estimate which cluster each point belongs to.
    • In K-Means → assign to the nearest centroid (hard).
    • In GMM → compute probabilities of belonging to each cluster (soft).
  2. M-step (Maximization):

    • Update cluster parameters (centroids or Gaussian parameters) to best fit the new assignments.
    • In K-Means → recompute the mean of assigned points.
    • In GMM → recompute mean, covariance, and mixing weights using probabilities.
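
To see the shared structure concretely, here is a minimal NumPy sketch of K-Means written explicitly as an E-step / M-step loop (the function name `kmeans_em` and the random initialization strategy are illustrative choices, not taken from any particular library):

```python
import numpy as np

def kmeans_em(X, K, n_iters=100, seed=0):
    """A minimal K-Means loop written in E-step / M-step form."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # random initial centroids

    for _ in range(n_iters):
        # E-step (hard): assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # M-step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])

        if np.allclose(new_centroids, centroids):  # stop once centroids no longer move
            break
        centroids = new_centroids

    return centroids, labels
```

Replacing the hard `argmin` assignment with posterior probabilities, and the plain mean with probability-weighted means, covariances, and mixing weights, turns this same loop into EM for a GMM.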
Why K-Means is a Special Case of EM

The EM algorithm for GMMs assumes the data is generated by a mixture of several Gaussian distributions:

$$ p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x | \mu_k, \Sigma_k) $$

If we make these simplifying assumptions:

  1. Each cluster has equal covariance (spherical and identical $\Sigma = \sigma^2 I$).
  2. All clusters have equal prior probability ($\pi_k = 1/K$).
  3. We assign each point entirely to its most likely cluster (a hard assignment instead of a probability).

Then, GMM’s EM steps collapse into K-Means’ simple steps of:

  • Assign → closest centroid.
  • Update → mean of assigned points.

That’s why K-Means is viewed as the “hard-assignment limit” of EM when the variance $\sigma^2 \to 0$.
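
A tiny numerical sketch of that limit (the data point and cluster means below are made-up values): with equal priors and a shared spherical covariance $\sigma^2 I$, the responsibilities reduce to a softmax over negative squared distances, and shrinking $\sigma^2$ pushes them toward a one-hot (hard) assignment.

```python
import numpy as np
from scipy.special import softmax

x = np.array([0.9, 0.2])                      # one data point (illustrative values)
mus = np.array([[0.0, 0.0], [1.0, 0.0]])      # two cluster means

sq_dists = ((x - mus) ** 2).sum(axis=1)       # ||x - mu_k||^2 for each cluster

for sigma2 in [1.0, 0.1, 0.01, 0.001]:
    # With equal priors and covariance sigma^2 * I:
    # r_k ∝ exp(-||x - mu_k||^2 / (2 * sigma^2))
    r = softmax(-sq_dists / (2 * sigma2))
    print(f"sigma^2 = {sigma2:<6} responsibilities = {np.round(r, 3)}")
```

As $\sigma^2$ shrinks, the printed responsibilities approach $[0, 1]$: exactly K-Means’ hard assignment to the nearest mean.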

How GMM Handles Overlapping Clusters

Unlike K-Means, which forces each point into one cluster, GMM acknowledges uncertainty — it gives each data point a membership probability for each cluster.

For example, if a point sits almost exactly between two centroids, K-Means will still commit fully to whichever one is marginally closer, while GMM will say:

“This point is 55% likely to belong to Cluster A and 45% to Cluster B.”

This makes GMM better for real-world data where boundaries blur or clusters overlap — think of customer groups with shared behaviors or songs that mix genres.
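
A quick scikit-learn comparison makes this visible (the two overlapping blobs and the in-between test point are synthetic, chosen just for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping blobs of points (synthetic data).
X = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[2, 0], scale=1.0, size=(200, 2)),
])

point = np.array([[1.0, 0.0]])   # sits roughly between the two blobs

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print("K-Means label:       ", km.predict(point))                  # hard: one cluster wins
print("GMM membership probs:", gmm.predict_proba(point).round(2))  # soft: e.g. roughly [0.5, 0.5]
```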


📐 Step 3: Mathematical Foundation

K-Means Objective (Hard Assignments)
$$ J = \sum_{i=1}^{K} \sum_{x \in C_i} ||x - \mu_i||^2 $$
  • Hard assignment: each point $x$ belongs to exactly one cluster $C_i$.

  • Goal: minimize total distance between points and their assigned centroids.

    K-Means simply measures closeness and says “belong to the nearest group.”
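
As a sketch, this objective can be computed in a few lines of NumPy (the helper name `kmeans_objective` is hypothetical; scikit-learn exposes the same quantity as `inertia_` after fitting `KMeans`):

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid (the objective J)."""
    return sum(
        np.sum((X[labels == k] - mu) ** 2)
        for k, mu in enumerate(centroids)
    )
```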
GMM Log-Likelihood (Soft Assignments)
$$ \mathcal{L} = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k) \right) $$

Here, $\pi_k$ are cluster weights and $\mathcal{N}(x_n | \mu_k, \Sigma_k)$ is the Gaussian probability density. Each data point contributes to all clusters, weighted by its membership probability.

This formula doesn’t just measure distance — it measures how likely each cluster is to have produced a data point.
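
Written as code, the log-likelihood is a log of a weighted sum of Gaussian densities per point (a minimal sketch; the helper name and argument layout are my own):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """Log-likelihood of data X under a Gaussian mixture with weights pis, means mus, covariances Sigmas."""
    # densities[n, k] = N(x_n | mu_k, Sigma_k)
    densities = np.column_stack([
        multivariate_normal.pdf(X, mean=mu, cov=Sigma)
        for mu, Sigma in zip(mus, Sigmas)
    ])
    # Weight each component by pi_k, sum over components, then take logs and sum over points.
    return np.sum(np.log(densities @ np.asarray(pis)))
```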
Soft vs. Hard Membership
  • In K-Means, the membership matrix $r_{nk}$ (point $n$ in cluster $k$) is binary:

    $$ r_{nk} = \begin{cases} 1, & \text{if } k = \arg\min_j ||x_n - \mu_j||^2 \\ 0, & \text{otherwise} \end{cases} $$
  • In GMM, it’s probabilistic:

    $$ r_{nk} = \frac{\pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)} $$
    K-Means assigns points like “yes or no.”

    GMM assigns points like “how much.”
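
The two membership rules side by side, as a sketch (the helper names are hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal

def hard_memberships(X, mus):
    """K-Means-style r_nk: 1 for the nearest centroid, 0 everywhere else."""
    dists = np.linalg.norm(X[:, None, :] - mus[None, :, :], axis=2)
    r = np.zeros_like(dists)
    r[np.arange(len(X)), dists.argmin(axis=1)] = 1.0
    return r

def soft_memberships(X, pis, mus, Sigmas):
    """GMM-style r_nk: posterior probability that component k produced point x_n."""
    weighted = np.column_stack([
        pi * multivariate_normal.pdf(X, mean=mu, cov=Sigma)
        for pi, mu, Sigma in zip(pis, mus, Sigmas)
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)
```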


🧠 Step 4: Assumptions or Key Ideas

The key difference is that K-Means simplifies away probability — it commits early, while GMM keeps its options open.


⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths

K-Means is fast and simple. GMM is more expressive: it models overlapping clusters, soft memberships, and per-cluster covariances.

⚠️ Limitations

K-Means is rigid: hard assignments plus an implicit assumption of spherical, equal-variance clusters with equal weights. GMM is slower, has more parameters to estimate, and is sensitive to initialization.

⚖️ Trade-offs

K-Means is fast and simple but rigid. GMM is flexible but slower and sensitive to initialization. Choosing between them is like picking between a quick sketch (K-Means) and a detailed painting (GMM).

🚧 Step 6: Common Misunderstandings

  • “K-Means and GMM are unrelated algorithms.” In fact, K-Means is the hard-assignment limit of GMM’s EM procedure under equal spherical covariances and uniform priors.
  • “K-Means makes no assumptions.” Its nearest-centroid rule implicitly assumes spherical, equal-variance clusters with equal weights, exactly the simplifications listed above.

🧩 Step 7: Mini Summary

🧠 What You Learned: K-Means is actually the hard-assignment version of the EM algorithm used in GMMs. The two share the same alternating structure — estimate memberships, then update parameters.

⚙️ How It Works: K-Means minimizes squared distances (hard, geometric), while GMM maximizes likelihood (soft, probabilistic). The latter introduces soft membership and covariance flexibility.

🎯 Why It Matters: This connection explains why K-Means converges — and sets the stage for understanding advanced clustering techniques rooted in probabilistic reasoning.
