1.2. Architecture of a Vanilla RNN


🪄 Step 1: Intuition & Motivation

  • Core Idea: The Vanilla RNN is the simplest form of a recurrent neural network — a neural network that loops back on itself to process data in sequence. Think of it as a neuron with memory: each time it receives an input, it not only produces an output but also updates its internal state, which influences future computations.

  • Simple Analogy: Imagine taking lecture notes during a class. Each new sentence (input) you hear depends on what was said before. You don’t start fresh every minute — your current understanding (hidden state) builds on what you already know. Similarly, RNNs don’t reset at each time step — they accumulate understanding over time.


🌱 Step 2: Core Concept

What’s Happening Under the Hood?

Let’s unpack the core of an RNN:

At every time step $t$, the RNN receives two things:

  1. The current input $x_t$ (e.g., the current word in a sentence)
  2. The previous hidden state $h_{t-1}$ (the memory of everything seen before)

It combines these using two weight matrices — $W_{xh}$ for the input and $W_{hh}$ for the previous hidden state — adds a bias $b_h$, and applies a nonlinear function $f$ (like $\tanh$ or $\text{ReLU}$) to produce a new hidden state:

$$ h_t = f(W_{xh}x_t + W_{hh}h_{t-1} + b_h) $$

This $h_t$ acts as both:

  • A summary of the past, and
  • A source of information for the next time step.

Finally, the network computes the output prediction:

$$ y_t = W_{hy}h_t + b_y $$

Here, $W_{hy}$ transforms the hidden state into the desired output space (e.g., predicted word, next value, etc.).
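
To make the two equations concrete, here is a minimal NumPy sketch of a single forward step. The function name `rnn_step` and the choice of $\tanh$ for $f$ are illustrative assumptions, not a fixed API:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One forward step of a vanilla RNN, mirroring the two equations above."""
    # New hidden state: blend the current input with the previous memory.
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    # Output: project the hidden state into the output space.
    y_t = W_hy @ h_t + b_y
    return h_t, y_t
```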


Why It Works This Way

The brilliance of this design is parameter sharing — the same weights ($W_{xh}, W_{hh}, W_{hy}$) are reused across all time steps.

Why is that powerful?

  • It means the model learns a single rule for how information flows through time, rather than learning a separate rule for each position.
  • This keeps the number of parameters manageable and allows the network to generalize across sequences of different lengths.

If we unroll an RNN across time, it looks like a chain of identical layers, each passing its hidden state to the next — but all of them share the same weights.

That’s what makes RNNs recurrent — they literally recur through time with the same logic at each step.
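
Here is a hedged sketch of that unrolling (the sizes below are arbitrary examples): the loop body applies the very same parameter set at every time step.

```python
import numpy as np

def rnn_forward(xs, h_0, W_xh, W_hh, W_hy, b_h, b_y):
    """Unroll a vanilla RNN over a sequence; every step reuses the same weights."""
    h_t, outputs = h_0, []
    for x_t in xs:  # one loop iteration per time step
        # The SAME W_xh, W_hh, W_hy, b_h, b_y are applied at every time step.
        h_t = np.tanh(W_xh @ x_t + W_hh @ h_t + b_h)
        outputs.append(W_hy @ h_t + b_y)
    return h_t, outputs

# Example: a sequence of 5 inputs of dimension 3, hidden size 4, output size 2.
rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]
params = dict(
    W_xh=rng.normal(size=(4, 3)), W_hh=rng.normal(size=(4, 4)),
    W_hy=rng.normal(size=(2, 4)), b_h=np.zeros(4), b_y=np.zeros(2),
)
h_final, ys = rnn_forward(xs, h_0=np.zeros(4), **params)
```

Whether the sequence has 5 elements or 500, the parameter count stays the same; only the loop runs longer.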


How It Fits in ML Thinking

In traditional machine learning, we often map input → output in one go (like in regression or CNNs). But in sequence modeling, the current output depends on both current input and past context.

The RNN architecture allows this dependency to emerge naturally through its hidden-state recurrence. It’s the foundation that later evolves into LSTMs and GRUs, which manage memory and context more effectively; Transformers tackle the same sequence-modeling problem but replace recurrence with attention.


📐 Step 3: Mathematical Foundation

Recurrence Relation

$$ h_t = f(W_{xh}x_t + W_{hh}h_{t-1} + b_h) $$

  • $x_t$ → input at time $t$ (e.g., current word or sensor value)
  • $h_{t-1}$ → hidden state from the previous time step (memory)
  • $W_{xh}$ → weights connecting input to hidden state
  • $W_{hh}$ → weights connecting previous hidden state to current hidden state
  • $b_h$ → bias term for the hidden layer
  • $f$ → activation function (usually $\tanh$ or $\text{ReLU}$)

The output:

$$ y_t = W_{hy}h_t + b_y $$

The RNN acts like a “rolling calculator.” At each step, it takes the new input, combines it with what it remembers, and updates its internal note (hidden state): a continuous blend of new knowledge and past experience.
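
For concreteness (the dimension names $d$, $n$, $k$ are ours, not from the text), if inputs, hidden states, and outputs are column vectors of sizes $d$, $n$, and $k$, the parameter shapes implied by the equations are:

$$ W_{xh} \in \mathbb{R}^{n \times d}, \quad W_{hh} \in \mathbb{R}^{n \times n}, \quad W_{hy} \in \mathbb{R}^{k \times n}, \quad b_h \in \mathbb{R}^{n}, \quad b_y \in \mathbb{R}^{k} $$

None of these shapes depend on the sequence length, which is the parameter-sharing point again.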

🧠 Step 4: Assumptions or Key Ideas

  • Shared Parameters: The same weight matrices ($W_{xh}, W_{hh}, W_{hy}$) are used for all time steps — this reduces complexity and enforces consistency across time.
  • Markov-like Dependency: The next state $h_t$ depends only on the current input $x_t$ and the previous state $h_{t-1}$ (not the entire past directly).
  • Nonlinear Activation: The use of $f$ introduces nonlinearity, enabling the RNN to capture complex temporal dynamics instead of just linear correlations.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths

  • Learns patterns across time — captures both short- and mid-range dependencies.
  • Handles sequences of variable length naturally.
  • Reuses parameters, making the model memory-efficient.
  • Conceptually simple — forms the backbone for advanced architectures like LSTMs and GRUs.

⚠️ Limitations

  • Struggles with long-term dependencies (information from distant time steps fades).
  • Sequential computation prevents full parallelization during training.
  • Susceptible to vanishing or exploding gradients during backpropagation through time (see the sketch after this list).
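
A minimal sketch of why that last point happens, under our own illustrative setup (the scaling factors 0.5 and 1.5 for $W_{hh}$ are arbitrary): backpropagation through time repeatedly multiplies the gradient by terms involving $W_{hh}$, so its norm tends to shrink toward zero or blow up depending on the scale of $W_{hh}$. The $\tanh$-derivative factor, which only shrinks gradients further, is omitted for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 16, 50
grad = rng.normal(size=n)            # stand-in for dL/dh at the final time step

for scale in (0.5, 1.5):             # illustrative scales for W_hh
    W_hh = scale * rng.normal(size=(n, n)) / np.sqrt(n)
    g = grad.copy()
    for _ in range(steps):
        # Backpropagation through time multiplies by W_hh^T once per step.
        g = W_hh.T @ g
    print(f"scale={scale}: gradient norm after {steps} steps = {np.linalg.norm(g):.2e}")
```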

⚖️ Trade-offs

While RNNs are intuitive and compact, they trade long-term accuracy for short-term simplicity. They’re great for tasks like speech or simple text sequences but struggle with very long dependencies, which motivated the invention of LSTMs and GRUs.

🚧 Step 6: Common Misunderstandings

  • “Each time step has different weights.” → False. The same weights are reused for every time step — this consistency enables generalization.
  • “RNNs can remember everything perfectly.” → Not quite. The hidden state summarizes past inputs, but this summary fades over time due to gradient issues.
  • “Unrolling means copying layers.” → It’s just a visualization trick — there’s only one RNN cell whose logic repeats over time.

🧩 Step 7: Mini Summary

🧠 What You Learned: You explored the architecture of a Vanilla RNN — a model that uses recurrence to remember past information.

⚙️ How It Works: The hidden state acts as a running summary, passed through time using shared weights.

🎯 Why It Matters: This idea of “memory through recurrence” is the cornerstone of all sequence models — from basic RNNs to modern Transformers.
