3.1. LSTM (Long Short-Term Memory)

🪄 Step 1: Intuition & Motivation

  • Core Idea: Vanilla RNNs struggle to remember information across many time steps — their memory fades as gradients vanish. The Long Short-Term Memory (LSTM) architecture fixes this by giving the network a way to decide what to remember, what to forget, and what to output.

    In short: LSTM is an RNN with a built-in memory management system.

  • Simple Analogy: Think of an LSTM as a smart note-taker in a lecture.

    • It listens (input gate),
    • Erases irrelevant points (forget gate),
    • Writes key ideas to a notebook (cell state),
    • And decides what to share in the summary (output gate).

    Unlike a normal student (vanilla RNN) who tries to memorize everything — and forgets most of it by the end — the LSTM learns what’s worth remembering.

🌱 Step 2: Core Concept

What’s Happening Under the Hood?

An LSTM cell has two key internal components:

  • Hidden state ($h_t$): carries short-term information (like RNN memory).
  • Cell state ($C_t$): acts as long-term memory — a running “context highway” that information flows through with minimal interference.

At every time step $t$, the LSTM uses three gates, plus a candidate memory update, to control information flow:

  1. Forget Gate ($f_t$): Decides what to discard from the previous cell state.

    $$ f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) $$

    If $f_t$ is close to 0 → forget it. If $f_t$ is close to 1 → keep it.

  2. Input Gate ($i_t$) and Candidate Cell ($\tilde{C}_t$): Determines what new information to add.

    $$ i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) $$

    $$ \tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) $$

    Here, $\tilde{C}_t$ is the proposed update (like “new knowledge”), and $i_t$ decides how much of it should be remembered.

  3. Cell State Update ($C_t$): Combines old and new information:

    $$ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t $$

    This “highway” connection allows gradients to flow smoothly through time, greatly mitigating the vanishing gradient problem.

  4. Output Gate ($o_t$): Controls how much of the cell state is exposed as the hidden state $h_t$ (the part that gets “spoken out loud”).

    $$ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) $$

    Then the new hidden state becomes:

    $$ h_t = o_t * \tanh(C_t) $$

Together, these gates ensure that important information persists and irrelevant noise fades away.
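
These four updates translate almost line for line into code. Below is a minimal NumPy sketch of a single LSTM step, assuming concatenated weights of shape (hidden, hidden + input); the `lstm_step` helper and the tiny random setup are illustrative, not a specific library API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step. Each W_* has shape (hidden, hidden + input)."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ z + b_f)           # forget gate: how much of C_{t-1} to keep
    i_t = sigmoid(W_i @ z + b_i)           # input gate: how much new content to write
    C_tilde = np.tanh(W_c @ z + b_c)       # candidate cell content
    C_t = f_t * C_prev + i_t * C_tilde     # cell update: the "conveyor belt"
    o_t = sigmoid(W_o @ z + b_o)           # output gate: how much of the cell to expose
    h_t = o_t * np.tanh(C_t)               # new hidden state

    return h_t, C_t

# Tiny usage example with random weights (hidden size 4, input size 3)
rng = np.random.default_rng(0)
H, D = 4, 3
Ws = [rng.normal(scale=0.1, size=(H, H + D)) for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]
h, C = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):        # run over a 5-step sequence
    h, C = lstm_step(x_t, h, C, *Ws, *bs)
print(h.shape, C.shape)                    # (4,) (4,)
```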


Why It Works This Way

The secret weapon of LSTMs is the cell state ($C_t$).

Think of it as a long conveyor belt that moves information down the timeline. Each gate acts like a worker deciding:

  • Should we erase this part? (forget gate)
  • Should we add something new? (input gate)
  • Should we show this to the outside world? (output gate)

This selective control keeps the gradient from vanishing: information (and the gradient flowing back along it) passes down the cell state largely untouched unless a gate explicitly scales it down.
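
To make that claim concrete, differentiate the cell update along the direct path through the cell state (ignoring the indirect dependence via $h_{t-1}$):

$$ \frac{\partial C_t}{\partial C_{t-1}} = f_t $$

Backpropagating across many steps therefore multiplies the intervening forget-gate activations, rather than multiplying by the same recurrent weight matrix over and over as a vanilla RNN does. When the network learns to hold $f_t \approx 1$ for information it needs, that product stays close to 1 and the gradient survives.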


How It Fits in ML Thinking

LSTMs were among the first architectures to explicitly model memory. They allow deep learning systems to:

  • Capture dependencies spanning hundreds of time steps (e.g., full sentences or video frames).
  • Control how information decays over time.
  • Learn temporal hierarchies — what matters now vs. what matters later.

This idea of gated control over information flow inspired later models like GRUs and even the attention mechanisms in Transformers.


📐 Step 3: Mathematical Foundation

LSTM Equations (Complete)

The full set of LSTM equations is:

$$ \begin{aligned} f_t &= \sigma(W_f[h_{t-1}, x_t] + b_f) \quad &\text{(Forget gate)} \\ i_t &= \sigma(W_i[h_{t-1}, x_t] + b_i) \quad &\text{(Input gate)} \\ \tilde{C}_t &= \tanh(W_c[h_{t-1}, x_t] + b_c) \quad &\text{(Candidate cell)} \\ C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t \quad &\text{(Cell update)} \\ o_t &= \sigma(W_o[h_{t-1}, x_t] + b_o) \quad &\text{(Output gate)} \\ h_t &= o_t * \tanh(C_t) \quad &\text{(Hidden state update)} \end{aligned} $$
Each gate uses a sigmoid ($\sigma$) to “decide” how much information (0 to 1) passes through. The $\tanh$ squashes values between -1 and 1, keeping the memory controlled. Together, these ensure gradients flow in a stable, bounded way — avoiding explosion or collapse.
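
In practice you rarely implement these equations by hand; frameworks bundle them into a single module. As a quick sketch (assuming PyTorch is available), the snippet below runs an LSTM over a batch of sequences and inspects the per-step hidden states and the final $(h_t, C_t)$ pair:

```python
import torch
import torch.nn as nn

# Toy setup: batch of 2 sequences, 5 time steps, 8 input features
x = torch.randn(2, 5, 8)

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# output: h_t at every time step, shape (batch, seq_len, hidden)
# (h_n, c_n): final hidden state and final cell state, shape (1, batch, hidden)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([2, 5, 16])
print(h_n.shape)     # torch.Size([1, 2, 16])
print(c_n.shape)     # torch.Size([1, 2, 16])
```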

Why Use tanh Instead of ReLU?

If ReLU were used in the cell state, its unbounded nature could cause uncontrolled growth in $C_t$, leading to exploding gradients or overflow.

$\tanh$ is bounded between -1 and 1, which keeps cell values stable while still allowing smooth gradient flow.

Think of $\tanh$ as a “soft cushion” that prevents the cell state from jumping too high or dropping too low — keeping learning balanced and controlled.
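
A small numerical caricature of this point (an illustrative sketch, not a real LSTM update): feed a running value through a bounded $\tanh$ term versus an unbounded ReLU term and watch what happens over 50 steps.

```python
import numpy as np

c_tanh, c_relu = 1.0, 1.0
for _ in range(50):
    # Toy "cell update": decay the old value a bit, then add new content.
    c_tanh += -0.1 * c_tanh + np.tanh(c_tanh + 1.0)   # added term is capped at 1
    c_relu += -0.1 * c_relu + max(c_relu + 1.0, 0.0)  # added term grows with the state

print(round(c_tanh, 2))   # settles near 10: the tanh contribution saturates
print(f"{c_relu:.2e}")    # on the order of 1e+14: unbounded, explosive growth
```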

🧠 Step 4: Assumptions or Key Ideas

  • The cell state ($C_t$) serves as a gradient highway, letting information flow through many time steps.
  • Gates use sigmoid activations to control how much information flows — mimicking decisions like “remember a little,” “forget a lot,” or “output half.”
  • The design allows both short-term and long-term dependencies to be learned effectively.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths

  • Retains memory over long time horizons.
  • Largely mitigates the vanishing gradient problem.
  • Adaptively decides what information to keep or forget.
  • Performs well on text, audio, and sequential sensor data.

⚠️ Limitations

  • Computationally expensive: four weight matrices (one per gate plus the candidate) versus a single recurrent matrix in a vanilla RNN.
  • Harder to interpret due to multiple internal signals.
  • Still sequential — can’t be easily parallelized across time steps.

⚖️ Trade-offs

You gain robust long-term memory but lose training efficiency. LSTMs are powerful but heavy — later models like GRUs simplify the structure for faster computation while keeping most of the benefits.
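
To put a number on that cost (a quick check assuming PyTorch), compare the parameter counts of an LSTM and a vanilla RNN with identical sizes; the four gate matrices make the LSTM roughly four times larger:

```python
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

rnn = nn.RNN(input_size=128, hidden_size=256)
lstm = nn.LSTM(input_size=128, hidden_size=256)

print(n_params(rnn))   # 98816  (one input-to-hidden and one hidden-to-hidden matrix, plus biases)
print(n_params(lstm))  # 395264 (roughly 4x: a matrix pair per gate/candidate)
```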

🚧 Step 6: Common Misunderstandings

  • “The forget gate always deletes information.” → Not true — it scales old memory, allowing partial forgetting.
  • “All gates are independent.” → They’re learned jointly — their cooperation is what makes LSTMs powerful.
  • “LSTMs never forget.” → They choose what to forget. That’s why they’re called Long Short-Term Memory — they balance both.

🧩 Step 7: Mini Summary

🧠 What You Learned: You explored how LSTMs use gating mechanisms to control memory — deciding what to keep, forget, and output.

⚙️ How It Works: A stable cell state carries long-term gradients, while gates dynamically regulate information flow.

🎯 Why It Matters: This innovation transformed RNNs from forgetful learners into context-aware models capable of long-term reasoning — the foundation for today’s sequential deep learning.
