2. Learn the Mathematics Behind Splitting Criteria
🪄 Step 1: Intuition & Motivation
Core Idea (in 1 short paragraph): When a Decision Tree decides how to split data, it needs to measure how good that split is. Does it make the groups cleaner or more mixed? Splitting criteria like Entropy, Gini Impurity, and Variance Reduction act as purity meters. The tree uses these to decide which question (feature split) best organizes the data.
Simple Analogy: Think of sorting marbles into jars by color.
- A pure jar has marbles of only one color.
- A mixed jar has a jumble of colors.
Splitting criteria help the tree choose the question (“Sort by color or by size?”) that results in the purest jars after sorting.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
When training, the Decision Tree looks at every possible feature and every possible threshold to ask, “If I split here, how pure will my new groups be?”
For each possible split:
- It measures how impure (mixed) the current dataset is.
- It simulates a split on a candidate feature and threshold.
- It measures how much that impurity has decreased.
- The split that gives the biggest decrease in impurity wins — that’s Information Gain.
In short: Entropy or Gini → measure disorder, Information Gain → how much disorder was reduced by a split.
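To make this loop concrete, here is a minimal NumPy sketch of the idea. It is illustrative only (real libraries such as scikit-learn are far more optimized), and `gini` and `best_split` are hypothetical helper names, not a library API:

```python
import numpy as np

def gini(labels: np.ndarray) -> float:
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def best_split(X: np.ndarray, y: np.ndarray):
    """Try every feature/threshold pair; return the one with the largest impurity decrease."""
    parent_impurity = gini(y)
    n = len(y)
    best_feature, best_threshold, best_gain = None, None, 0.0
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            mask = X[:, feature] <= threshold
            left, right = y[mask], y[~mask]
            if len(left) == 0 or len(right) == 0:
                continue  # this split leaves one side empty, so it tells us nothing
            # Impurity after the split = size-weighted impurity of the two children
            child_impurity = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
            gain = parent_impurity - child_impurity  # how much disorder was removed
            if gain > best_gain:
                best_feature, best_threshold, best_gain = feature, threshold, gain
    return best_feature, best_threshold, best_gain

# Toy data: the threshold x <= 2.0 separates the two classes perfectly
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # feature 0, threshold 2.0, impurity decrease 0.5
```

Real implementations sort each feature once and sweep thresholds incrementally, but the bookkeeping is the same: compare the parent's impurity with the size-weighted impurity of its children.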
Why It Works This Way
Decision Trees are built to make decisions that reduce confusion — not just random cuts.
By mathematically quantifying how mixed or pure groups are, they ensure each question moves us toward order and clarity. The math acts as a compass, guiding the tree toward meaningful splits instead of arbitrary ones.
How It Fits in ML Thinking
Splitting criteria are the “brains” of the Decision Tree — without them, it would have no way to judge what a good question is.
These same principles — measuring impurity, reducing uncertainty, maximizing information — also appear in other ML models and even deep learning. They represent the core philosophy of learning: turning confusion into clarity.
📐 Step 3: Mathematical Foundation
Entropy — Measuring Uncertainty
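For a dataset $S$ with $k$ classes, entropy is defined as:
$$ H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i $$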
- $H(S)$: Entropy — how uncertain or mixed the dataset $S$ is.
- $p_i$: The probability (fraction) of class $i$ in the dataset.
- The log base 2 makes the unit “bits” — a measure of information.
If all data points belong to one class → $H(S) = 0$ (no confusion). If all classes are equally likely → $H(S)$ is maximum (total confusion).
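For example, with two equally likely classes ($p_1 = p_2 = 0.5$):
$$ H(S) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \text{ bit} $$
while a pure set gives $H(S) = -(1 \cdot \log_2 1) = 0$.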
Information Gain — Measuring Improvement
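Splitting on an attribute $A$ with possible values $v$ gives:
$$ IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v) $$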
- $IG(S, A)$: How much uncertainty (entropy) was reduced by splitting on attribute $A$.
- $H(S)$: Entropy before the split (the “confusion” before asking the question).
- $H(S_v)$: Entropy after the split, for each subset $S_v$.
- $\frac{|S_v|}{|S|}$: Weight of each subset (how big that branch is).
If a split perfectly separates classes → Information Gain is high. If a split doesn’t help much → Information Gain is low.
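For example, suppose $S$ has 10 samples, 5 per class, so $H(S) = 1$ bit. If splitting on $A$ produces two pure subsets of 5 samples each ($H(S_v) = 0$ for both), then:
$$ IG(S, A) = 1 - \left( \tfrac{5}{10} \cdot 0 + \tfrac{5}{10} \cdot 0 \right) = 1 \text{ bit} $$
which is the largest gain possible for this dataset.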
Gini Impurity — A Simpler Alternative
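For $k$ classes it is defined as:
$$ Gini(S) = 1 - \sum_{i=1}^{k} p_i^2 $$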
- $p_i$: Proportion of samples belonging to class $i$.
- Gini measures how often a randomly chosen sample would be misclassified if it were labeled at random according to the class distribution.
If all samples belong to one class → Gini = 0 (pure). If classes are evenly mixed → Gini = high (impure).
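For example, with two evenly mixed classes: $Gini(S) = 1 - (0.5^2 + 0.5^2) = 0.5$, the maximum for a binary problem; for a pure set, $Gini(S) = 1 - 1^2 = 0$.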
Variance Reduction — For Regression Trees
For regression trees (predicting numbers instead of classes), impurity is measured using variance — how spread out the target values are.
$$ Variance(S) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 $$
A good split is one that reduces variance — meaning the data in each branch becomes more consistent.
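A rough NumPy sketch of this criterion (the `variance_reduction` helper below is illustrative, not a library function):

```python
import numpy as np

def variance_reduction(y_parent: np.ndarray, y_left: np.ndarray, y_right: np.ndarray) -> float:
    """Drop in variance achieved by splitting a parent node into two children."""
    n = len(y_parent)
    # np.var divides by n, matching the formula above
    weighted_child_variance = (len(y_left) / n) * np.var(y_left) + (len(y_right) / n) * np.var(y_right)
    return float(np.var(y_parent) - weighted_child_variance)

# Toy example: splitting after the third value separates low targets from high ones
y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.8])
print(variance_reduction(y, y[:3], y[3:]))  # a large reduction, so this is a good split
```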
⚖️ Step 4: Strengths, Limitations & Trade-offs
Strengths:
- Quantifies “purity” mathematically — no guesswork.
- Adapts to classification (Entropy, Gini) and regression (Variance).
- Makes tree building data-driven instead of manual.
Limitations:
- Entropy is computationally heavier due to logarithms.
- Both Gini and Entropy can be biased toward features with many distinct values.
- Variance-based splitting can overfit if not regularized.
🚧 Step 5: Common Misunderstandings
- “Entropy and Gini always give different trees.” → Usually false; they often produce nearly identical splits.
- “Entropy is better because it’s from information theory.” → Not necessarily — Gini is faster and just as effective for most datasets.
- “Variance is unrelated to impurity.” → For regression trees, variance is the impurity measure — it plays the same role as entropy does in classification.
🧩 Step 6: Mini Summary
🧠 What You Learned: Decision Trees measure impurity using Entropy, Gini, or Variance to decide the best split.
⚙️ How It Works: Each split aims to maximize Information Gain — the reduction in uncertainty or spread.
🎯 Why It Matters: Understanding these metrics helps you grasp why trees split where they do — and how to tune them for performance or interpretability.