3.1. LSTM (Long Short-Term Memory)
🪄 Step 1: Intuition & Motivation
Core Idea: Vanilla RNNs struggle to remember information across many time steps — their memory fades as gradients vanish. The Long Short-Term Memory (LSTM) architecture fixes this by giving the network a way to decide what to remember, what to forget, and what to output.
In short: LSTM is an RNN with a built-in memory management system.
Simple Analogy: Think of an LSTM as a smart note-taker in a lecture.
- It listens (input gate),
- Erases irrelevant points (forget gate),
- Writes key ideas to a notebook (cell state),
- And decides what to share in the summary (output gate).
Unlike a normal student (vanilla RNN) who tries to memorize everything — and forgets most of it by the end — the LSTM learns what’s worth remembering.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
An LSTM cell has two key internal components:
- Hidden state ($h_t$): carries short-term information (like RNN memory).
- Cell state ($C_t$): acts as long-term memory — a running “context highway” that information flows through with minimal interference.
At every time step $t$, the LSTM uses three gates, together with a candidate update, to control information flow:
Forget Gate ($f_t$): Decides what to discard from the previous cell state.
$$ f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) $$
If $f_t$ is close to 0 → forget it. If $f_t$ is close to 1 → keep it.
Input Gate ($i_t$) and Candidate Cell ($\tilde{C}_t$): Determines what new information to add.
$$ i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) $$
$$ \tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) $$
Here, $\tilde{C}_t$ is the proposed update (like “new knowledge”), and $i_t$ decides how much of it should be remembered.
Cell State Update ($C_t$): Combines old and new information:
$$ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t $$
This “highway” connection allows gradients to flow smoothly through time — solving the vanishing gradient problem.
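To see the update in action, here is a tiny worked example with made-up scalar values: suppose the old memory is $C_{t-1} = 2.0$, the forget gate outputs $f_t = 0.9$, the input gate outputs $i_t = 0.2$, and the candidate is $\tilde{C}_t = 0.5$. Then
$$ C_t = 0.9 \times 2.0 + 0.2 \times 0.5 = 1.8 + 0.1 = 1.9 $$
Most of the old memory survives, and only a small slice of the new candidate is written in. With $f_t$ near 0 instead, the old memory would be almost entirely erased.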
Output Gate ($o_t$): Controls how much of the cell state contributes to the next hidden state (the part that gets “spoken out loud”).
$$ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) $$
Then the new hidden state becomes:
$$ h_t = o_t * \tanh(C_t) $$
Together, these gates ensure that important information persists and irrelevant noise fades away.
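To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM step. The function name `lstm_step`, the weight shapes, and the toy dimensions at the bottom are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step, following the equations above.

    x_t: input vector; h_prev / c_prev: previous hidden and cell states.
    Each W_* has shape (hidden_size, hidden_size + input_size).
    """
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ z + b_f)           # forget gate: what to discard
    i_t = sigmoid(W_i @ z + b_i)           # input gate: how much new info to write
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate cell content
    c_t = f_t * c_prev + i_t * c_tilde     # cell state update (the "highway")
    o_t = sigmoid(W_o @ z + b_o)           # output gate: what to expose
    h_t = o_t * np.tanh(c_t)               # new hidden state

    return h_t, c_t

# Tiny usage example with random weights (hidden_size=3, input_size=2).
rng = np.random.default_rng(0)
H, X = 3, 2
Ws = [rng.normal(size=(H, H + X)) for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=X), h, c, *Ws, *bs)
print(h.shape, c.shape)  # (3,) (3,)
```

Looping this step over $t$ gives the full recurrence; most frameworks fuse the four matrix multiplications into one larger multiply for speed.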
Why It Works This Way
The secret weapon of LSTMs is the cell state ($C_t$).
Think of it as a long conveyor belt that moves information down the timeline. Each gate acts like a worker deciding:
- Should we erase this part? (forget gate)
- Should we add something new? (input gate)
- Should we show this to the outside world? (output gate)
This selective control ensures that the gradient doesn’t vanish — it simply flows forward unless explicitly stopped by a gate.
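A quick way to see this, looking only at the direct cell-to-cell path and treating the gate values as constants for one backward step: differentiating $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ with respect to $C_{t-1}$ gives
$$ \frac{\partial C_t}{\partial C_{t-1}} = f_t \quad\Rightarrow\quad \frac{\partial C_t}{\partial C_k} \approx \prod_{j=k+1}^{t} f_j $$
As long as the forget gates stay close to 1, this product does not collapse toward zero the way repeated $\tanh$ derivatives do in a vanilla RNN.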
How It Fits in ML Thinking
LSTMs were among the first architectures to model memory explicitly. They allow deep learning systems to:
- Capture dependencies spanning hundreds of time steps (e.g., full sentences or video frames).
- Control how information decays over time.
- Learn temporal hierarchies — what matters now vs. what matters later.
This idea of gated control over information flow inspired later models like GRUs and even the attention mechanisms in Transformers.
📐 Step 3: Mathematical Foundation
LSTM Equations (Complete)
The full set of LSTM equations is:
$$
\begin{aligned}
f_t &= \sigma(W_f[h_{t-1}, x_t] + b_f) \quad &\text{(Forget gate)} \\
i_t &= \sigma(W_i[h_{t-1}, x_t] + b_i) \quad &\text{(Input gate)} \\
\tilde{C}_t &= \tanh(W_c[h_{t-1}, x_t] + b_c) \quad &\text{(Candidate cell)} \\
C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t \quad &\text{(Cell update)} \\
o_t &= \sigma(W_o[h_{t-1}, x_t] + b_o) \quad &\text{(Output gate)} \\
h_t &= o_t * \tanh(C_t) \quad &\text{(Hidden state update)}
\end{aligned}
$$
Why Use tanh Instead of ReLU?
If ReLU were used in the cell state, its unbounded nature could cause uncontrolled growth in $C_t$, leading to exploding gradients or overflow.
$\tanh$ is bounded between -1 and 1, which keeps cell values stable while still allowing smooth gradient flow.
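In practice you rarely implement the gates by hand; deep learning frameworks bundle them into a single layer. Below is a small usage sketch with PyTorch's `nn.LSTM`; the batch size, sequence length, and feature sizes are arbitrary toy values chosen for illustration:

```python
import torch
import torch.nn as nn

# Toy dimensions (illustrative assumptions only).
batch, seq_len, input_size, hidden_size = 4, 10, 8, 16

lstm = nn.LSTM(input_size, hidden_size, num_layers=1, batch_first=True)

x = torch.randn(batch, seq_len, input_size)   # a batch of input sequences
output, (h_n, c_n) = lstm(x)                  # gates and cell state handled internally

print(output.shape)  # torch.Size([4, 10, 16]) -> h_t at every time step
print(h_n.shape)     # torch.Size([1, 4, 16])  -> final hidden state
print(c_n.shape)     # torch.Size([1, 4, 16])  -> final cell state (long-term memory)
```

Note that `output` collects $h_t$ for every time step, while `(h_n, c_n)` are the final hidden and cell states, mirroring the two kinds of memory described above.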
🧠 Step 4: Assumptions or Key Ideas
- The cell state ($C_t$) serves as a gradient highway, letting information flow through many time steps.
- Gates use sigmoid activations to control how much information flows — mimicking decisions like “remember a little,” “forget a lot,” or “output half.”
- The design allows both short-term and long-term dependencies to be learned effectively.
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths
- Retains memory over long time horizons.
- Solves the vanishing gradient problem.
- Adaptively decides what information to keep or forget.
- Performs well on text, audio, and sequential sensor data.
⚠️ Limitations
- Computationally expensive (four sets of weight matrices: three gates plus the candidate cell).
- Harder to interpret due to multiple internal signals.
- Still sequential — can’t be easily parallelized across time steps.
🚧 Step 6: Common Misunderstandings
- “The forget gate always deletes information.” → Not true — it scales old memory, allowing partial forgetting.
- “All gates are independent.” → They’re learned jointly — their cooperation is what makes LSTMs powerful.
- “LSTMs never forget.” → They choose what to forget. That’s why they’re called Long Short-Term Memory — they balance both.
🧩 Step 7: Mini Summary
🧠 What You Learned: You explored how LSTMs use gating mechanisms to control memory — deciding what to keep, forget, and output.
⚙️ How It Works: A stable cell state carries long-term gradients, while gates dynamically regulate information flow.
🎯 Why It Matters: This innovation transformed RNNs from forgetful learners into context-aware models capable of long-term reasoning — the foundation for today’s sequential deep learning.