2.3. Tanh (Hyperbolic Tangent)
🪄 Step 1: Intuition & Motivation
Core Idea: The tanh (hyperbolic tangent) activation function is like the cool, balanced cousin of the sigmoid. It behaves similarly — smoothly squashing inputs — but instead of pushing everything into 0 to 1, it maps values into the range (–1, 1).
This simple shift makes a huge difference: tanh is zero-centered, meaning it can represent both positive and negative activations — helping the network learn faster and more stably.
Simple Analogy: Imagine you’re training two students: one can only give positive feedback (sigmoid), while the other gives both positive and negative feedback (tanh). The second student (tanh) helps the model adjust in both directions, speeding up learning.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
The tanh function is defined as:
$$f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

- When $x$ is large and positive, $f(x) \approx 1$.
- When $x$ is large and negative, $f(x) \approx -1$.
- Around $x = 0$, it’s most sensitive and steep — that’s where learning is fastest.
So, tanh transforms inputs into a symmetric range around zero — negative inputs map to negative outputs, and positive inputs to positive outputs.
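A minimal NumPy sketch of this definition (the helper name `tanh` is just for illustration; NumPy already ships `np.tanh`):

```python
import numpy as np

def tanh(x):
    """Hyperbolic tangent written out via its exponential definition."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.round(tanh(xs), 4))               # [-0.9999 -0.7616  0.      0.7616  0.9999]
print(np.allclose(tanh(xs), np.tanh(xs)))  # matches NumPy's built-in: True
```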
Why It Works This Way
Tanh behaves like a scaled and shifted sigmoid function:
$$\tanh(x) = 2\sigma(2x) - 1$$

where $\sigma(x)$ is the sigmoid function.
That means tanh inherits the sigmoid’s nice smoothness but fixes one major flaw — zero-centered outputs.
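A quick numerical check of this identity (an illustrative NumPy sketch, not part of the original derivation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-4.0, 4.0, 9)
# tanh(x) equals the sigmoid of 2x, rescaled from (0, 1) to (-1, 1)
print(np.allclose(np.tanh(xs), 2.0 * sigmoid(2.0 * xs) - 1.0))  # True
```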
When activations are zero-centered:
- The signals feeding the next layer can be negative as well as positive, so that layer's weight gradients aren't all forced to share the same sign (as they are with sigmoid's all-positive outputs).
- This helps the optimizer take more balanced steps, speeding up convergence.
However, like sigmoid, tanh still saturates at its extremes — once outputs reach ±1, gradients become tiny, and learning slows down (the vanishing gradient problem).
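Both effects are easy to see numerically. The sketch below assumes standard-normal pre-activations purely for illustration: tanh outputs average near zero while sigmoid outputs hover around 0.5, and the gradient $1 - \tanh^2(x)$ collapses once $|x|$ is large.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(10_000)              # hypothetical pre-activations
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

print(round(float(np.tanh(z).mean()), 3))    # ~0.0 -> zero-centered outputs
print(round(float(sigmoid(z).mean()), 3))    # ~0.5 -> always-positive outputs

# Saturation: the gradient 1 - tanh(x)^2 shrinks rapidly as |x| grows
for x in (0.0, 2.0, 5.0):
    print(x, round(float(1.0 - np.tanh(x) ** 2), 5))  # 1.0, ~0.07, ~0.00018
```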
How It Fits in ML Thinking
Tanh was the default activation for hidden layers before ReLU became popular.
It’s still widely used in Recurrent Neural Networks (RNNs), where keeping activations bounded between –1 and 1 helps control the scale of hidden states and prevents them from exploding.
In modern architectures like LSTMs or GRUs, tanh continues to appear inside gated mechanisms — providing smooth, bounded activations that stabilize the flow of information across time steps.
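As a rough sketch of that role, here is a bare-bones Elman-style recurrent update (made-up shapes and parameter names, not the full LSTM/GRU equations): the tanh keeps the hidden state bounded no matter how many steps you unroll.

```python
import numpy as np

rng = np.random.default_rng(1)
input_dim, hidden_dim = 8, 16

# Hypothetical parameters for a single recurrent cell (small scale for stability)
W_x = 0.1 * rng.standard_normal((hidden_dim, input_dim))
W_h = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """Elman-style update: tanh keeps every hidden unit inside (-1, 1)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_dim)
for _ in range(5):                            # unroll a few time steps
    h = rnn_step(rng.standard_normal(input_dim), h)
print(h.min(), h.max())                       # always bounded within (-1, 1)
```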
📐 Step 3: Mathematical Foundation
Formula & Derivative
Function:
$$f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

Derivative:

$$f'(x) = 1 - \tanh^2(x)$$

- The derivative shrinks as $|x|$ increases (since $\tanh(x)$ approaches ±1).
- Around $x = 0$, $f'(x)$ is close to 1, meaning the neuron learns quickly in that region.
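A short check that the analytic derivative matches a central-difference estimate (illustrative NumPy sketch):

```python
import numpy as np

xs = np.linspace(-4.0, 4.0, 9)
eps = 1e-6

analytic = 1.0 - np.tanh(xs) ** 2                       # f'(x) = 1 - tanh^2(x)
numeric = (np.tanh(xs + eps) - np.tanh(xs - eps)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))        # True
print(1.0 - np.tanh(0.0) ** 2)                          # 1.0 -> steepest at x = 0
```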
🧠 Step 4: Key Ideas
- Outputs are zero-centered — helps gradients flow in both directions.
- Faster convergence than sigmoid due to balanced activations.
- Still suffers from vanishing gradients for large input magnitudes.
- Common in RNNs, where bounded activations keep state updates stable.
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths
- Zero-centered output accelerates learning.
- Smooth and differentiable everywhere.
- Natural fit for models that process both positive and negative signals (e.g., RNNs).
⚠️ Limitations
- Still prone to vanishing gradients for large inputs.
- Slightly more expensive to compute than ReLU.
- Can cause slow learning if not initialized properly.
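One common remedy for the last point, sketched below under assumed layer sizes: Glorot/Xavier-style weight scaling (a standard pairing with tanh, though not something this section prescribes) keeps initial pre-activations small enough that most units start out unsaturated.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128
x = rng.standard_normal(fan_in)

# Naive unit-variance weights: pre-activations are huge, most tanh units saturate
W_naive = rng.standard_normal((fan_out, fan_in))
print((np.abs(np.tanh(W_naive @ x)) > 0.99).mean())        # large fraction saturated

# Glorot/Xavier-style uniform scaling keeps pre-activations small
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_xavier = rng.uniform(-limit, limit, size=(fan_out, fan_in))
print((np.abs(np.tanh(W_xavier @ x)) > 0.99).mean())       # only a small fraction
```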
🚧 Step 6: Common Misunderstandings
- “Tanh completely solves vanishing gradients.” It reduces but doesn’t eliminate them — gradients still vanish at high magnitudes.
- “Tanh is always better than sigmoid.” It’s better in hidden layers due to zero-centering, but sigmoid is still preferable for output probabilities.
- “Tanh isn’t used anymore.” Still heavily used in recurrent architectures and some generative models.
🧩 Step 7: Mini Summary
🧠 What You Learned: Tanh is a smoother, zero-centered version of sigmoid that outputs values in the range (–1, 1).
⚙️ How It Works: It passes small inputs through almost linearly while squashing large ones toward ±1, keeping activations bounded and gradient flow balanced during training.
🎯 Why It Matters: Tanh bridges the gap between sigmoid’s probabilistic smoothness and ReLU’s speed — crucial for understanding RNNs and gating mechanisms in modern architectures.