4.3. Target, Frequency, and Binary Encoding
🪄 Step 1: Intuition & Motivation
Core Idea: When your categorical feature has many unique values (like “city,” “product ID,” or “user ID”), One-Hot Encoding explodes your dataset into hundreds or thousands of columns, making the data sparse and training slow.
That’s where smart encodings come in — ways to represent categories numerically and meaningfully, without blowing up dimensionality.
- Target Encoding: captures how each category relates to the target (e.g., average sales per region).
- Frequency Encoding: represents how common each category is.
- Binary Encoding: compresses category IDs into compact binary form, blending efficiency and interpretability.
These techniques are like data compaction algorithms — squeezing meaning into fewer dimensions without losing too much signal.
Simple Analogy: Imagine you’re summarizing a huge city dataset. Instead of one column per city (like One-Hot),
- Target Encoding says: “On average, cities like this have this outcome.”
- Frequency Encoding says: “This city shows up this often.”
- Binary Encoding says: “Let’s assign each city a binary code, like a postal address, to compress space.”
🌱 Step 2: Core Concept
Let’s decode these encoders one by one and understand how they turn messy categorical chaos into structured numerical insight.
Target Encoding — Learning from the Target Itself
Goal: Replace each category with a summary of its relationship to the target variable.
Example: Suppose you’re predicting whether a customer will buy (1) or not (0), and you have a feature “City”:
| City | Target |
|---|---|
| Delhi | 1 |
| Delhi | 0 |
| Mumbai | 1 |
| Pune | 0 |
| Pune | 0 |
The mean target value per city is:
- Delhi → (1+0)/2 = 0.5
- Mumbai → 1.0
- Pune → 0.0
Now, replace each city with its mean:
| City | Target Encoded |
|---|---|
| Delhi | 0.5 |
| Mumbai | 1.0 |
| Pune | 0.0 |
Why It Works: It compresses categorical information into a meaningful numeric signal — how strongly each category correlates with the target.
But Beware: If you compute means using all data, your model will “peek” at the target — causing data leakage and inflated performance.
To prevent this:
- Use cross-validation or out-of-fold encoding (compute mean using only training folds).
- Add smoothing for rare categories.
- Add noise to prevent overfitting.
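Below is a minimal sketch of out-of-fold target encoding with smoothing, using pandas and scikit-learn's `KFold`. The column names (`City`, `Target`) follow the toy example above, and `alpha` is an illustrative smoothing weight rather than a recommended default:

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "City":   ["Delhi", "Delhi", "Mumbai", "Pune", "Pune"],
    "Target": [1, 0, 1, 0, 0],
})

alpha = 2.0                                  # smoothing strength (illustrative)
df["City_te"] = df["Target"].mean()          # placeholder; every row is overwritten below

kf = KFold(n_splits=2, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    train = df.iloc[train_idx]
    fold_global = train["Target"].mean()     # global mean from the training fold only
    stats = train.groupby("City")["Target"].agg(["mean", "count"])
    # Smoothed mean: (n_c * mean_c + alpha * global_mean) / (n_c + alpha)
    smoothed = (stats["count"] * stats["mean"] + alpha * fold_global) / (stats["count"] + alpha)
    # Encode the held-out fold using statistics it never contributed to
    df.loc[df.index[val_idx], "City_te"] = (
        df["City"].iloc[val_idx].map(smoothed).fillna(fold_global).values
    )

print(df)
```

At inference time, fit the smoothed category means once on the full training set and apply them to new data; the out-of-fold scheme is only needed to keep the training rows leak-free.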
Frequency Encoding — Capturing Category Popularity
Goal: Replace each category with how often it appears in the dataset.
Example:
| City | Count | Frequency |
|---|---|---|
| Delhi | 200 | 0.4 |
| Mumbai | 150 | 0.3 |
| Pune | 150 | 0.3 |
The “City” feature becomes [0.4, 0.3, 0.3].
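A minimal pandas sketch, assuming a DataFrame with the same `City` column as above:

```python
import pandas as pd

df = pd.DataFrame({"City": ["Delhi"] * 200 + ["Mumbai"] * 150 + ["Pune"] * 150})

# Relative frequency of each category, learned from the training data
freq = df["City"].value_counts(normalize=True)   # Delhi: 0.4, Mumbai: 0.3, Pune: 0.3

# Map every row's city to its frequency
df["City_freq"] = df["City"].map(freq)
```

At prediction time, reuse the frequencies computed on the training set; categories unseen during training can be mapped to 0 or a small constant.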
Why It Works: Frequency often correlates with importance — common categories usually have more stable behavior. It’s simple, scalable, and doesn’t risk leakage.
Perfect For:
- High-cardinality categorical features (like user IDs or product SKUs).
- Tree-based models (like XGBoost, CatBoost, LightGBM).
- Large datasets where interpretability is secondary.
Binary Encoding — Compact Yet Informative
Goal: Represent categories efficiently using binary digits.
Here’s how it works:
- Assign each category an integer ID.
- Convert that integer into binary.
- Split each binary digit into its own column.
Example:
Let’s say we have 4 categories:
A → 1, B → 2, C → 3, D → 4.
| Category | Integer | Binary | Encoded Columns |
|---|---|---|---|
| A | 1 | 001 | [0,0,1] |
| B | 2 | 010 | [0,1,0] |
| C | 3 | 011 | [0,1,1] |
| D | 4 | 100 | [1,0,0] |
If you had 1000 categories, you’d only need log₂(1000) ≈ 10 columns — a huge reduction from 1000 One-Hot columns!
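Here is a hand-rolled sketch that reproduces the table above. In practice a library such as category_encoders offers a ready-made BinaryEncoder, but spelling out the three steps makes the mechanics clear:

```python
import pandas as pd

categories = pd.Series(["A", "B", "C", "D"], name="Category")

# Step 1: assign each category an integer ID (1-based, as in the table)
ids = pd.Series(pd.factorize(categories)[0] + 1, index=categories.index, name="Integer")

# Step 2: convert each ID to a fixed-width binary string
width = int(ids.max()).bit_length()                          # 4 -> '100' needs 3 bits
binary = ids.apply(lambda i: format(int(i), f"0{width}b")).rename("Binary")

# Step 3: split each bit into its own column
bits = binary.apply(lambda b: pd.Series([int(ch) for ch in b]))
bits.columns = [f"bit_{i}" for i in range(width)]

print(pd.concat([categories, ids, binary, bits], axis=1))
```

With 1000 categories, the same code would produce 10 bit columns instead of 1000 One-Hot columns.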
Why It Works: It reduces dimensionality drastically while preserving uniqueness of categories. Tree-based models handle it well since they can still find patterns across bits.
Perfect For:
- Large categorical features.
- When you need balance between compression and distinctiveness.
How It Fits in ML Thinking
These encoders are examples of feature engineering guided by structure and efficiency.
They bring together two goals:
- Preserve signal — the category’s relationship to prediction.
- Reduce clutter — fewer features, faster training.
They are essential when scaling ML to real-world, high-cardinality problems like e-commerce, user profiling, and recommendation systems.
📐 Step 3: Mathematical Foundation
Target Encoding Formula
$$ \text{Encoded}(c) = \bar{y}_c = \frac{1}{n_c} \sum_{i \,:\, x_i = c} y_i $$
Where:
- $y_i$ = target value for each sample in category $c$
- $n_c$ = number of samples in category $c$
Smoothed Variant:
$$ \text{Encoded}(c) = \frac{n_c \cdot \bar{y}_c + \alpha \cdot \bar{y}_{\text{global}}}{n_c + \alpha} $$
where $\alpha$ controls smoothing strength (a large $\alpha$ pulls the estimate toward the global mean).
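As a quick sanity check using the toy table from Step 2 (Pune has $n_c = 2$ and $\bar{y}_c = 0$, and the global mean is $\bar{y}_{\text{global}} = 0.4$), with an assumed $\alpha = 2$:
$$ \text{Encoded}(\text{Pune}) = \frac{2 \cdot 0 + 2 \cdot 0.4}{2 + 2} = 0.2 $$
so the rare, all-zero category is pulled from 0.0 toward the global mean instead of being trusted at face value.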
Frequency Encoding Formula
$$ \text{Encoded}(c) = \frac{n_c}{N} $$
where $n_c$ is the number of occurrences of category $c$, and $N$ is the total number of samples.
Binary Encoding Concept
Given $K$ unique categories, assign integer IDs $1$ to $K$. Each ID is represented as a binary vector of roughly $\lceil \log_2 K \rceil$ bits (exactly $\lfloor \log_2 K \rfloor + 1$ when IDs start at 1, which is why the four-category example above uses 3 bit columns).
🧠 Step 4: Assumptions or Key Ideas
- Target Encoding: assumes a meaningful relationship between category and target; use out-of-fold encoding to avoid leakage.
- Frequency Encoding: assumes category frequency carries useful information.
- Binary Encoding: assumes you want compact representation without implying order.
- Always analyze category cardinality before choosing the encoding strategy.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Efficient for high-cardinality categorical variables.
- Captures useful signal (Target Encoding).
- Preserves feature identity with fewer dimensions (Binary Encoding).
- Easy to implement and combine.
Limitations:
- Target Encoding risks data leakage if computed on the full dataset.
- Frequency Encoding may lose predictive meaning.
- Binary Encoding may obscure interpretability.
Trade-offs:
- Use Target Encoding when categories are predictive and stable.
- Use Frequency Encoding when scale and distribution matter more than target correlation.
- Use Binary Encoding when balancing memory and uniqueness.
Always validate encodings on unseen folds to ensure they generalize.
🚧 Step 6: Common Misunderstandings
- “Target Encoding improves accuracy automatically.” Only if done carefully; otherwise it leaks target information and overfits.
- “Frequency Encoding reflects correlation with the target.” It doesn’t; frequencies are computed independently of the target.
- “Binary Encoding is just compressed One-Hot.” Not exactly; the bit patterns follow arbitrary integer IDs, not category meaning.
🧩 Step 7: Mini Summary
🧠 What You Learned: Target, Frequency, and Binary Encodings are powerful alternatives for high-cardinality categorical features.
⚙️ How It Works: They convert categories into meaningful numeric patterns — either by learning from the target, capturing frequency, or encoding efficiently in binary.
🎯 Why It Matters: Because scaling categorical representation intelligently can dramatically improve both model performance and computation efficiency.