3.1. LSTM (Long Short-Term Memory)
🪄 Step 1: Intuition & Motivation
Core Idea: Vanilla RNNs struggle to remember information across many time steps — their memory fades as gradients vanish. The Long Short-Term Memory (LSTM) architecture fixes this by giving the network a way to decide what to remember, what to forget, and what to output.
In short: LSTM is an RNN with a built-in memory management system.
Simple Analogy: Think of an LSTM as a smart note-taker in a lecture.
- It listens (input gate),
- Erases irrelevant points (forget gate),
- Writes key ideas to a notebook (cell state),
- And decides what to share in the summary (output gate).
Unlike a normal student (vanilla RNN) who tries to memorize everything — and forgets most of it by the end — the LSTM learns what’s worth remembering.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
An LSTM cell has two key internal components:
- Hidden state ($h_t$): carries short-term information (like RNN memory).
- Cell state ($C_t$): acts as long-term memory — a running “context highway” that information flows through with minimal interference.
At every time step $t$, the LSTM uses three gates, together with a candidate update, to control information flow:
Forget Gate ($f_t$): Decides what to discard from the previous cell state.
$$ f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) $$
If $f_t$ is close to 0 → forget it. If $f_t$ is close to 1 → keep it.
Input Gate ($i_t$) and Candidate Cell ($\tilde{C}_t$): Determines what new information to add.
$$ i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) $$
$$ \tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) $$
Here, $\tilde{C}_t$ is the proposed update (like “new knowledge”), and $i_t$ decides how much of it should be remembered.
Cell State Update ($C_t$): Combines old and new information:
$$ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t $$
This “highway” connection allows gradients to flow smoothly through time — solving the vanishing gradient problem.
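To see the update in action, here is a tiny worked example with made-up scalar values: suppose the old memory is $C_{t-1} = 2.0$, the forget gate outputs $f_t = 0.9$, the input gate outputs $i_t = 0.2$, and the candidate is $\tilde{C}_t = 0.5$. Then
$$ C_t = 0.9 \times 2.0 + 0.2 \times 0.5 = 1.8 + 0.1 = 1.9 $$
Most of the old memory survives, and only a small slice of the new candidate is written in. With $f_t$ near 0 instead, the old memory would be almost entirely erased.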
Output Gate ($o_t$): Controls how much of the cell state contributes to the next hidden state (the part that gets “spoken out loud”).
$$ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) $$
Then the new hidden state becomes:
$$ h_t = o_t * \tanh(C_t) $$
Together, these gates ensure that important information persists and irrelevant noise fades away.
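To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM step. The function name `lstm_step`, the weight shapes, and the toy dimensions at the bottom are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step, following the equations above.

    x_t: input vector; h_prev / c_prev: previous hidden and cell states.
    Each W_* has shape (hidden_size, hidden_size + input_size).
    """
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ z + b_f)           # forget gate: what to discard
    i_t = sigmoid(W_i @ z + b_i)           # input gate: how much new info to write
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate cell content
    c_t = f_t * c_prev + i_t * c_tilde     # cell state update (the "highway")
    o_t = sigmoid(W_o @ z + b_o)           # output gate: what to expose
    h_t = o_t * np.tanh(c_t)               # new hidden state

    return h_t, c_t

# Tiny usage example with random weights (hidden_size=3, input_size=2).
rng = np.random.default_rng(0)
H, X = 3, 2
Ws = [rng.normal(size=(H, H + X)) for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=X), h, c, *Ws, *bs)
print(h.shape, c.shape)  # (3,) (3,)
```

Looping this step over $t$ gives the full recurrence; most frameworks fuse the four matrix multiplications into one larger multiply for speed.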
Why It Works This Way
The secret weapon of LSTMs is the cell state ($C_t$).
Think of it as a long conveyor belt that moves information down the timeline. Each gate acts like a worker deciding:
- Should we erase this part? (forget gate)
- Should we add something new? (input gate)
- Should we show this to the outside world? (output gate)
This selective control ensures that the gradient doesn’t vanish — it simply flows forward unless explicitly stopped by a gate.
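A quick way to see this, looking only at the direct cell-to-cell path and treating the gate values as constants for one backward step: differentiating $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ with respect to $C_{t-1}$ gives
$$ \frac{\partial C_t}{\partial C_{t-1}} = f_t \quad\Rightarrow\quad \frac{\partial C_t}{\partial C_k} \approx \prod_{j=k+1}^{t} f_j $$
As long as the forget gates stay close to 1, this product does not collapse toward zero the way repeated $\tanh$ derivatives do in a vanilla RNN.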
How It Fits in ML Thinking
LSTMs were among the first architectures to model memory explicitly. They allow deep learning systems to:
- Capture dependencies spanning hundreds of time steps (e.g., full sentences or video frames).
- Control how information decays over time.
- Learn temporal hierarchies — what matters now vs. what matters later.
This idea of gated control over information flow inspired later models like GRUs and even the attention mechanisms in Transformers.
📐 Step 3: Mathematical Foundation
LSTM Equations (Complete)
The full set of LSTM equations is:
$$
\begin{aligned}
f_t &= \sigma(W_f[h_{t-1}, x_t] + b_f) \quad &\text{(Forget gate)} \\
i_t &= \sigma(W_i[h_{t-1}, x_t] + b_i) \quad &\text{(Input gate)} \\
\tilde{C}_t &= \tanh(W_c[h_{t-1}, x_t] + b_c) \quad &\text{(Candidate cell)} \\
C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t \quad &\text{(Cell update)} \\
o_t &= \sigma(W_o[h_{t-1}, x_t] + b_o) \quad &\text{(Output gate)} \\
h_t &= o_t * \tanh(C_t) \quad &\text{(Hidden state update)}
\end{aligned}
$$
Why Use tanh Instead of ReLU?
If ReLU were used in the cell state, its unbounded nature could cause uncontrolled growth in $C_t$, leading to exploding gradients or overflow.
$\tanh$ is bounded between -1 and 1, which keeps cell values stable while still allowing smooth gradient flow.
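In practice you rarely implement the gates by hand; deep learning frameworks bundle them into a single layer. Below is a small usage sketch with PyTorch's `nn.LSTM`; the batch size, sequence length, and feature sizes are arbitrary toy values chosen for illustration:

```python
import torch
import torch.nn as nn

# Toy dimensions (illustrative assumptions only).
batch, seq_len, input_size, hidden_size = 4, 10, 8, 16

lstm = nn.LSTM(input_size, hidden_size, num_layers=1, batch_first=True)

x = torch.randn(batch, seq_len, input_size)   # a batch of input sequences
output, (h_n, c_n) = lstm(x)                  # gates and cell state handled internally

print(output.shape)  # torch.Size([4, 10, 16]) -> h_t at every time step
print(h_n.shape)     # torch.Size([1, 4, 16])  -> final hidden state
print(c_n.shape)     # torch.Size([1, 4, 16])  -> final cell state (long-term memory)
```

Note that `output` collects $h_t$ for every time step, while `(h_n, c_n)` are the final hidden and cell states, mirroring the two kinds of memory described above.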
🧠 Step 4: Assumptions or Key Ideas
- The cell state ($C_t$) serves as a gradient highway, letting information flow through many time steps.
- Gates use sigmoid activations to control how much information flows — mimicking decisions like “remember a little,” “forget a lot,” or “output half.”
- The design allows both short-term and long-term dependencies to be learned effectively.
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths
- Retains memory over long time horizons.
- Solves the vanishing gradient problem.
- Adaptively decides what information to keep or forget.
- Performs well on text, audio, and sequential sensor data.
⚠️ Limitations
- Computationally expensive (four sets of weight matrices: three gates plus the candidate cell).
- Harder to interpret due to multiple internal signals.
- Still sequential — can’t be easily parallelized across time steps.
🚧 Step 6: Common Misunderstandings
- “The forget gate always deletes information.” → Not true — it scales old memory, allowing partial forgetting.
- “All gates are independent.” → They’re learned jointly — their cooperation is what makes LSTMs powerful.
- “LSTMs never forget.” → They choose what to forget. That’s why they’re called Long Short-Term Memory — they balance both.
🧩 Step 7: Mini Summary
🧠 What You Learned: You explored how LSTMs use gating mechanisms to control memory — deciding what to keep, forget, and output.
⚙️ How It Works: A stable cell state carries long-term gradients, while gates dynamically regulate information flow.
🎯 Why It Matters: This innovation transformed RNNs from forgetful learners into context-aware models capable of long-term reasoning — the foundation for today’s sequential deep learning.