2.2. Sigmoid
🪄 Step 1: Intuition & Motivation
Core Idea: The Sigmoid function is one of the earliest activation functions used in neural networks. It takes any real number and squashes it smoothly into a range between 0 and 1, making it perfect for representing probabilities.
It’s like a “soft yes-no” decision maker — not everything is fully 0 or 1; the sigmoid allows shades of confidence, like saying, “I’m 80% sure this is a cat.”
Simple Analogy: Think of the sigmoid like a volume knob — as you turn it, the output smoothly transitions from completely off (0) to fully on (1), without abrupt jumps.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
The sigmoid function is defined as:
$$f(x) = \frac{1}{1 + e^{-x}}$$
Here’s what it does:
- When $x$ is a large positive number, $e^{-x}$ becomes nearly 0, so $f(x) \approx 1$.
- When $x$ is a large negative number, $e^{-x}$ becomes huge, so $f(x) \approx 0$.
- Around $x = 0$, the function is steepest and most sensitive to changes.
So, sigmoid smoothly transitions between 0 and 1, compressing all possible inputs into a manageable range — perfect for outputs that represent “probabilities.”
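A quick way to see this squashing in action is to evaluate the function at a few points. Below is a minimal NumPy sketch; the `sigmoid` helper and the sample inputs are illustrative choices, not something prescribed above.

```python
import numpy as np

def sigmoid(x):
    # Logistic function: f(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
for x, y in zip(xs, sigmoid(xs)):
    print(f"sigmoid({x:+5.1f}) = {y:.5f}")
# Large negative inputs land near 0, large positive inputs near 1,
# and sigmoid(0) = 0.5 sits exactly at the midpoint of the "soft yes-no".
```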
Why It Works This Way
The sigmoid’s “S”-shaped curve (also called a logistic curve) ensures outputs are always bounded — no neuron will produce absurdly large values.
This boundedness originally appealed to biological intuition (real neurons also saturate), but it introduces drawbacks:
- When the input is very positive or very negative, the slope (gradient) of the function becomes nearly zero.
- This means neurons in these regions stop learning — their weights barely change, leading to the vanishing gradient problem.
So while sigmoid feels “smooth” and “safe,” it slows learning dramatically in deep networks.
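To see the vanishing gradient numerically, here is a small sketch using the derivative formula derived in Step 3 below, $f'(x) = f(x)(1 - f(x))$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)      # f'(x) = f(x) * (1 - f(x)), see Step 3

# The slope collapses toward zero as |x| grows: a neuron pushed far into
# either tail receives almost no gradient and effectively stops learning.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:4.1f}  gradient = {sigmoid_grad(x):.6f}")
```

At $x = 10$ the gradient is already around $4.5 \times 10^{-5}$, which is why saturated sigmoid units learn so slowly.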
How It Fits in ML Thinking
The sigmoid was historically used everywhere, from early multilayer perceptrons to logistic regression.
Today, it’s mostly reserved for:
- Output layers in binary classification, where we want outputs as probabilities between 0 and 1.
- Gate functions in architectures like LSTMs, where a smooth value between 0 and 1 acts as a soft on/off switch.
In hidden layers, however, it has been largely replaced by ReLU and its variants due to the efficiency issues sigmoid introduces.
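As a hedged illustration of the output-layer use case, here is a minimal PyTorch sketch; the layer sizes, batch size, and model structure are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Toy binary classifier: the hidden layer uses ReLU, and only the output
# layer passes through a sigmoid so the result can be read as P(class = 1).
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid(),
)

x = torch.randn(3, 4)        # 3 example inputs with 4 features each
probs = model(x)             # values strictly between 0 and 1
print(probs.squeeze(1))
```

In practice, many training loops keep the raw logits and use `nn.BCEWithLogitsLoss`, which applies the sigmoid internally in a more numerically stable way; the explicit `nn.Sigmoid()` above simply makes the output-layer role visible.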
📐 Step 3: Mathematical Foundation
Formula & Derivative
The sigmoid function:
$$f(x) = \frac{1}{1 + e^{-x}}$$
Its derivative:
$$f'(x) = f(x)(1 - f(x))$$
- The derivative depends only on the output itself, so it can be reused cheaply during backpropagation.
- The maximum slope (at $x = 0$) is 0.25, and it decreases as $x$ moves away from zero (see the numerical check below).
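A quick numerical check of both claims (the finite-difference step size $h$ is an arbitrary choice):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-6.0, 6.0, 13)
analytic = sigmoid(xs) * (1.0 - sigmoid(xs))          # f'(x) = f(x)(1 - f(x))

h = 1e-5                                               # central finite difference
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2.0 * h)

print(np.max(np.abs(analytic - numeric)))              # tiny: the formula checks out
print(analytic.max())                                  # 0.25, attained at x = 0
```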
🧠 Step 4: Key Ideas
- Bounded Output: Keeps neuron activations between 0 and 1.
- Smooth Differentiability: Perfect for gradient-based optimization.
- Probabilistic Interpretation: Maps values directly to probabilities — ideal for binary outcomes.
- Vanishing Gradient Issue: Large or small $x$ values flatten the slope → small gradients → slow learning.
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths
- Outputs naturally interpreted as probabilities.
- Smooth, continuous, and differentiable everywhere.
- Historically important and still useful in output layers.
⚠️ Limitations
- Vanishing Gradient Problem: Gradients shrink to near zero for large $|x|$.
- Non-Zero-Centered Output: Always positive → the weight gradients of a downstream neuron all share one sign on a given step → zig-zag optimization trajectories (see the sketch after this list).
- Slow Convergence: Especially in deep networks.
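The zig-zag point is easiest to see in a tiny sketch. Assume a downstream neuron whose inputs are all sigmoid outputs (hence all positive); then every component of its weight gradient shares the sign of the upstream gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the inputs to a downstream neuron are sigmoid
# activations from the previous layer, so they are all strictly positive.
a_prev = 1.0 / (1.0 + np.exp(-rng.normal(size=5)))   # values in (0, 1)

# For a scalar upstream gradient dL/dz, the weight gradient is
# dL/dw = (dL/dz) * a_prev, so all its components share one sign.
for dL_dz in (+0.7, -0.3):
    dL_dw = dL_dz * a_prev
    print(dL_dz, np.sign(dL_dw))   # all +1.0 or all -1.0, never mixed
```

Because all of a neuron's weights can only move in the same direction on any single step, the optimizer must zig-zag toward solutions where some weights should increase while others decrease.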
🚧 Step 6: Common Misunderstandings
🚨 Common Misunderstandings (Click to Expand)
- “Sigmoid is obsolete.” Not true — it’s still essential in output layers and gating mechanisms.
- “Sigmoid always vanishes gradients.” Only when inputs are far from zero; around zero, it still learns effectively.
- “Sigmoid and Softmax are the same.” They’re related but not identical — Softmax generalizes sigmoid to multiple classes.
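The relationship in the last point can be checked directly: a two-class softmax over logits $[x, 0]$ reproduces the sigmoid (the logit value below is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))     # shift logits for numerical stability
    return e / e.sum()

x = 1.7                           # arbitrary logit
print(softmax(np.array([x, 0.0])))     # [sigmoid(x), 1 - sigmoid(x)]
print(sigmoid(x), 1.0 - sigmoid(x))    # matches the softmax output
```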
🧩 Step 7: Mini Summary
🧠 What You Learned: Sigmoid squashes any input into a range between 0 and 1, making it perfect for probabilistic interpretations.
⚙️ How It Works: It uses the exponential function to create a smooth transition from “off” to “on,” with gradients strongest near the midpoint.
🎯 Why It Matters: Although limited in deep layers, sigmoid remains vital in binary classification and gating mechanisms where smooth probability transitions matter.