2.2. Sigmoid
🪄 Step 1: Intuition & Motivation
Core Idea: The Sigmoid function is one of the earliest activation functions used in neural networks. It takes any real number and squashes it smoothly into a range between 0 and 1, making it perfect for representing probabilities.
It’s like a “soft yes-no” decision maker — not everything is fully 0 or 1; the sigmoid allows shades of confidence, like saying, “I’m 80% sure this is a cat.”
Simple Analogy: Think of the sigmoid like a volume knob — as you turn it, the output smoothly transitions from completely off (0) to fully on (1), without abrupt jumps.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
The sigmoid function is defined as:
$$f(x) = \frac{1}{1 + e^{-x}}$$
Here’s what it does:
- When $x$ is a large positive number, $e^{-x}$ becomes nearly 0, so $f(x) \approx 1$.
- When $x$ is a large negative number, $e^{-x}$ becomes huge, so $f(x) \approx 0$.
- Around $x = 0$, the function is steepest and most sensitive to changes.
So, sigmoid smoothly transitions between 0 and 1, compressing all possible inputs into a manageable range — perfect for outputs that represent “probabilities.”
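A quick way to see this squashing in action is to evaluate the function at a few points. Below is a minimal NumPy sketch; the `sigmoid` helper and the sample inputs are illustrative choices, not something prescribed above.

```python
import numpy as np

def sigmoid(x):
    # Logistic function: f(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
for x, y in zip(xs, sigmoid(xs)):
    print(f"sigmoid({x:+5.1f}) = {y:.5f}")
# Large negative inputs land near 0, large positive inputs near 1,
# and sigmoid(0) = 0.5 sits exactly at the midpoint of the "soft yes-no".
```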
Why It Works This Way
The sigmoid’s “S”-shaped curve (also called a logistic curve) ensures outputs are always bounded — no neuron will produce absurdly large values.
This boundedness originally appealed to biological intuition (real neurons also saturate), but it introduces drawbacks:
- When the input is very positive or very negative, the slope (gradient) of the function becomes nearly zero.
- This means neurons in these regions stop learning — their weights barely change, leading to the vanishing gradient problem.
So while sigmoid feels “smooth” and “safe,” it slows learning dramatically in deep networks.
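To see the vanishing gradient numerically, here is a small sketch using the derivative formula derived in Step 3 below, $f'(x) = f(x)(1 - f(x))$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)      # f'(x) = f(x) * (1 - f(x)), see Step 3

# The slope collapses toward zero as |x| grows: a neuron pushed far into
# either tail receives almost no gradient and effectively stops learning.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:4.1f}  gradient = {sigmoid_grad(x):.6f}")
```

At $x = 10$ the gradient is already around $4.5 \times 10^{-5}$, which is why saturated sigmoid units learn so slowly.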
How It Fits in ML Thinking
The sigmoid was historically used everywhere, from early multilayer perceptrons to logistic regression.
Today, it’s mostly reserved for:
- Output layers in binary classification, where we want outputs as probabilities between 0 and 1.
- Gate functions in architectures like LSTMs, where a smooth value between 0 and 1 acts as a soft on/off switch.
In hidden layers, however, it has been largely replaced by ReLU and its variants due to the efficiency issues sigmoid introduces.
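As a hedged illustration of the output-layer use case, here is a minimal PyTorch sketch; the layer sizes, batch size, and model structure are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Toy binary classifier: the hidden layer uses ReLU, and only the output
# layer passes through a sigmoid so the result can be read as P(class = 1).
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid(),
)

x = torch.randn(3, 4)        # 3 example inputs with 4 features each
probs = model(x)             # values strictly between 0 and 1
print(probs.squeeze(1))
```

In practice, many training loops keep the raw logits and use `nn.BCEWithLogitsLoss`, which applies the sigmoid internally in a more numerically stable way; the explicit `nn.Sigmoid()` above simply makes the output-layer role visible.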
📐 Step 3: Mathematical Foundation
Formula & Derivative
The sigmoid function:
$$f(x) = \frac{1}{1 + e^{-x}}$$
Its derivative:
$$f'(x) = f(x)(1 - f(x))$$
- The derivative depends only on the output itself, so it can be reused cheaply during backpropagation.
- The maximum slope (at $x = 0$) is 0.25, and it decreases as $x$ moves away from zero (see the numerical check below).
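A quick numerical check of both claims (the finite-difference step size $h$ is an arbitrary choice):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-6.0, 6.0, 13)
analytic = sigmoid(xs) * (1.0 - sigmoid(xs))          # f'(x) = f(x)(1 - f(x))

h = 1e-5                                               # central finite difference
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2.0 * h)

print(np.max(np.abs(analytic - numeric)))              # tiny: the formula checks out
print(analytic.max())                                  # 0.25, attained at x = 0
```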
🧠 Step 4: Key Ideas
- Bounded Output: Keeps neuron activations between 0 and 1.
- Smooth Differentiability: Perfect for gradient-based optimization.
- Probabilistic Interpretation: Maps values directly to probabilities — ideal for binary outcomes.
- Vanishing Gradient Issue: Large or small $x$ values flatten the slope → small gradients → slow learning.
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths
- Outputs naturally interpreted as probabilities.
- Smooth, continuous, and differentiable everywhere.
- Historically important and still useful in output layers.
⚠️ Limitations
- Vanishing Gradient Problem: Gradients shrink to near zero for large $|x|$.
- Non-Zero-Centered Output: Always positive → the weight gradients of a downstream neuron all share one sign on a given step → zig-zag optimization trajectories (see the sketch after this list).
- Slow Convergence: Especially in deep networks.
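The zig-zag point is easiest to see in a tiny sketch. Assume a downstream neuron whose inputs are all sigmoid outputs (hence all positive); then every component of its weight gradient shares the sign of the upstream gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the inputs to a downstream neuron are sigmoid
# activations from the previous layer, so they are all strictly positive.
a_prev = 1.0 / (1.0 + np.exp(-rng.normal(size=5)))   # values in (0, 1)

# For a scalar upstream gradient dL/dz, the weight gradient is
# dL/dw = (dL/dz) * a_prev, so all its components share one sign.
for dL_dz in (+0.7, -0.3):
    dL_dw = dL_dz * a_prev
    print(dL_dz, np.sign(dL_dw))   # all +1.0 or all -1.0, never mixed
```

Because all of a neuron's weights can only move in the same direction on any single step, the optimizer must zig-zag toward solutions where some weights should increase while others decrease.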
🚧 Step 6: Common Misunderstandings
🚨 Common Misunderstandings (Click to Expand)
- “Sigmoid is obsolete.” Not true — it’s still essential in output layers and gating mechanisms.
- “Sigmoid always vanishes gradients.” Only when inputs are far from zero; around zero, it still learns effectively.
- “Sigmoid and Softmax are the same.” They’re related but not identical — Softmax generalizes sigmoid to multiple classes.
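The relationship in the last point can be checked directly: a two-class softmax over logits $[x, 0]$ reproduces the sigmoid (the logit value below is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))     # shift logits for numerical stability
    return e / e.sum()

x = 1.7                           # arbitrary logit
print(softmax(np.array([x, 0.0])))     # [sigmoid(x), 1 - sigmoid(x)]
print(sigmoid(x), 1.0 - sigmoid(x))    # matches the softmax output
```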
🧩 Step 7: Mini Summary
🧠 What You Learned: Sigmoid squashes any input into a range between 0 and 1, making it perfect for probabilistic interpretations.
⚙️ How It Works: It uses the exponential function to create a smooth transition from “off” to “on,” with gradients strongest near the midpoint.
🎯 Why It Matters: Although limited in deep layers, sigmoid remains vital in binary classification and gating mechanisms where smooth probability transitions matter.