4.1. Build a Simple RNN from Scratch


🪄 Step 1: Intuition & Motivation

  • Core Idea: Understanding RNNs theoretically is great — but building one from scratch makes it click. By coding it step by step, you’ll see how data flows forward through time, how memory is carried between steps, and how gradients travel backward.

  • Simple Analogy: Think of your RNN as a memory-based storyteller. Each new word it hears updates its internal “mental state,” which it uses to predict what comes next. By constructing this network yourself, you’ll learn how it keeps this evolving memory alive — mathematically and computationally.


🌱 Step 2: Core Concept

What’s Happening Under the Hood?

Let’s recall how an RNN processes a sequence. At each time step $t$, it:

  1. Receives an input $x_t$.
  2. Combines it with the previous memory $h_{t-1}$ through linear transformations.
  3. Passes the result through a nonlinear activation to produce a new hidden state $h_t$.
  4. Generates an output $y_t$ from $h_t$.

We’ll simulate this logic using NumPy — no deep learning frameworks yet.

Here’s the structure we’ll implement conceptually (a minimal NumPy sketch follows this outline):

  1. Weight Matrices

    • $W_{xh}$: maps input → hidden layer
    • $W_{hh}$: maps previous hidden state → new hidden state
    • $W_{hy}$: maps hidden state → output
    • $b_h$, $b_y$: bias vectors added at the hidden and output layers
  2. Forward Pass (per time step)

    • Compute hidden state: $$ h_t = \tanh(W_{xh}x_t + W_{hh}h_{t-1} + b_h) $$
    • Compute output: $$ y_t = W_{hy}h_t + b_y $$
  3. Backward Pass (Backpropagation Through Time)

    • Gradients are propagated backward for each time step, adjusting weights.
    • We’ll do this manually for 2–3 steps to visualize the flow of learning.
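
Here’s a minimal NumPy sketch of that structure. The sizes (8 inputs, 16 hidden units, 4 outputs), the variable names, and the small random initialization are illustrative assumptions, not prescribed values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 8-dimensional inputs, 16 hidden units, 4 outputs.
input_size, hidden_size, output_size = 8, 16, 4

# Shared weight matrices (small random values keep tanh away from saturation).
W_xh = rng.normal(0, 0.01, (hidden_size, input_size))    # input  -> hidden
W_hh = rng.normal(0, 0.01, (hidden_size, hidden_size))   # hidden -> hidden
W_hy = rng.normal(0, 0.01, (output_size, hidden_size))   # hidden -> output
b_h = np.zeros(hidden_size)                              # hidden bias
b_y = np.zeros(output_size)                              # output bias

def rnn_step(x_t, h_prev):
    """One forward time step: blend the new input with the previous memory."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = W_hy @ h_t + b_y
    return h_t, y_t
```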

Why It Works This Way

The RNN is recursive — it reuses the same function at every time step. By maintaining a hidden state, it “remembers” what it saw before.

During training, the model compares each predicted output $y_t$ with the true value $\hat{y}_t$ and uses the difference to adjust weights. The challenge is that errors at later time steps depend on earlier states — so the backward pass must trace through time to assign credit correctly.

That’s what makes this model recurrent in both computation and learning.


How It Fits in ML Thinking

Building an RNN from scratch connects theory with intuition. You’ll see how:

  • The same weights get reused across time.
  • The hidden state acts as a bridge between past and present.
  • Gradients can either fade or explode as they travel backward.

This understanding is what transforms an “RNN user” into an “RNN engineer” — someone who truly understands why their model behaves the way it does.


📐 Step 3: Mathematical Foundation

Forward Pass Equations

At time step $t$:

$$ \begin{aligned} h_t &= \tanh(W_{xh}x_t + W_{hh}h_{t-1} + b_h) \\ y_t &= W_{hy}h_t + b_y \end{aligned} $$

Each hidden state $h_t$ depends on both the current input and the previous memory. During the forward loop, we store all $h_t$ values because they’re needed later for backpropagation.

Imagine passing a secret note along a chain of friends — each adds their input (the new word) but also passes on what they received earlier (the old hidden state). The RNN’s forward pass is this chain in action.
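
Continuing the sketch above (it reuses `rnn_step`, `rng`, and the sizes defined earlier), the forward loop below caches every hidden state because the backward pass will need them:

```python
def forward(inputs, h0):
    """Run the RNN over a whole sequence, caching hidden states for BPTT."""
    hs = {-1: h0}   # hs[t] holds h_t; hs[-1] is the initial state
    ys = {}
    for t, x_t in enumerate(inputs):
        hs[t], ys[t] = rnn_step(x_t, hs[t - 1])
    return hs, ys

# Toy usage: a random sequence of T input vectors and a zero initial state.
T = 5
inputs = [rng.normal(size=input_size) for _ in range(T)]
hs, ys = forward(inputs, np.zeros(hidden_size))
```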

Backward Pass — Simplified BPTT

To visualize how gradients move backward:

For simplicity, consider 3 time steps: $t=1, 2, 3$. We compute $\frac{\partial L}{\partial W_{xh}}$, $\frac{\partial L}{\partial W_{hh}}$, $\frac{\partial L}{\partial W_{hy}}$ using the chain rule.

Because $h_t$ depends on $h_{t-1}$ (and, through it, on every earlier hidden state), the loss at step $t$ sends its gradient back through all of the preceding steps, even though the parameters themselves are shared:

$$ \frac{\partial L}{\partial W_{hh}} = \sum_t \frac{\partial L_t}{\partial h_t} \frac{\partial h_t}{\partial W_{hh}} $$

This means that updates are accumulated across time.
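
Here’s one way the accumulation could look in code, continuing the sketch above. It assumes a per-step squared-error loss $L_t = \tfrac{1}{2}\lVert y_t - \text{target}_t \rVert^2$; that loss choice (and the helper name `backward`) is an assumption made for illustration:

```python
def backward(inputs, targets, hs, ys):
    """Accumulate gradients over every time step (full BPTT)."""
    dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
    db_h, db_y = np.zeros_like(b_h), np.zeros_like(b_y)
    dh_next = np.zeros(hidden_size)          # gradient flowing back from the future

    for t in reversed(range(len(inputs))):
        dy = ys[t] - targets[t]              # dL_t/dy_t for squared error
        dW_hy += np.outer(dy, hs[t])
        db_y += dy
        dh = W_hy.T @ dy + dh_next           # local gradient + gradient from step t+1
        dpre = (1 - hs[t] ** 2) * dh         # backprop through tanh
        dW_xh += np.outer(dpre, inputs[t])
        dW_hh += np.outer(dpre, hs[t - 1])   # ties the update to the previous state
        db_h += dpre
        dh_next = W_hh.T @ dpre              # hand the gradient back to h_{t-1}

    return dW_xh, dW_hh, dW_hy, db_h, db_y

# Usage with random targets, followed by one plain gradient-descent update.
targets = [rng.normal(size=output_size) for _ in range(T)]
grads = backward(inputs, targets, hs, ys)
lr = 0.01
for param, grad in zip((W_xh, W_hh, W_hy, b_h, b_y), grads):
    param -= lr * grad                       # in-place update keeps the shared weights
```

Notice that every time step adds into the same gradient buffers; that is the weight sharing from Step 4 showing up directly in the learning rule.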

The computational cost per sequence is $O(T \times H^2)$, where:

  • $T$ = number of time steps,
  • $H$ = number of hidden units.

🧠 Step 4: Assumptions or Key Ideas

  • Weight Sharing: All time steps use the same parameters — this reduces complexity and improves generalization.
  • Sequential Dependency: $h_t$ depends only on the current input $x_t$ and the previous state $h_{t-1}$ (a Markov-like assumption).
  • Gradient Memory: Each hidden state contributes to future learning signals — so truncating backprop too early might miss long-term dependencies (see the truncation sketch after this list).
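
To make the truncation trade-off concrete, here is a sketch continuing the same code (the chunked scheme and the name `backward_truncated` are assumptions for illustration), in which the backward signal is simply zeroed at every $k$-step boundary:

```python
def backward_truncated(inputs, targets, hs, ys, k=3):
    """Like full BPTT, but the backward signal is cut at every k-step boundary."""
    dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
    db_h, db_y = np.zeros_like(b_h), np.zeros_like(b_y)
    dh_next = np.zeros(hidden_size)

    for t in reversed(range(len(inputs))):
        if (t + 1) % k == 0:
            dh_next = np.zeros(hidden_size)  # truncate: no gradient crosses this boundary
        dy = ys[t] - targets[t]
        dW_hy += np.outer(dy, hs[t])
        db_y += dy
        dpre = (1 - hs[t] ** 2) * (W_hy.T @ dy + dh_next)
        dW_xh += np.outer(dpre, inputs[t])
        dW_hh += np.outer(dpre, hs[t - 1])
        db_h += dpre
        dh_next = W_hh.T @ dpre

    return dW_xh, dW_hh, dW_hy, db_h, db_y

grads_trunc = backward_truncated(inputs, targets, hs, ys, k=2)
```

Anything the model would need to remember across a boundary can no longer shape the weights through those distant steps, which is exactly the long-term-dependency cost noted above.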

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths

  • Offers hands-on understanding of RNN dynamics.
  • Demonstrates temporal information flow clearly.
  • Simple to extend into LSTM or GRU once the base RNN logic is clear.

⚠️ Limitations

  • Hard to scale — manual BPTT is computationally costly and unstable.
  • Requires careful weight initialization to avoid gradient problems.
  • Doesn’t handle long sequences effectively.
⚖️ Trade-offs

Building from scratch provides clarity, not performance. You learn why frameworks like PyTorch and TensorFlow exist — to automate complex gradient flows efficiently. But the conceptual mastery gained here forms the foundation for debugging and innovating in deep sequence modeling.

🚧 Step 6: Common Misunderstandings

  • “The RNN has different weights for each time step.” → False. All steps share the same $W_{xh}$, $W_{hh}$, and $W_{hy}$.
  • “We can backpropagate indefinitely.” → Not practical. In reality, we truncate backprop after a fixed number of steps to keep computation and memory manageable; gradients from distant steps have mostly vanished anyway.
  • “More time steps = better memory.” → Not necessarily — longer unrolls often worsen vanishing-gradient issues unless you move to LSTMs or GRUs (a quick numeric check follows this list).
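
To see the last point numerically (a quick check built on the earlier sketch, not a claim from the original text), push a unit gradient backward through all $T$ steps and watch its norm:

```python
# The gradient reaching h_0 from the final step is scaled by a product of T Jacobians,
# W_hh^T diag(1 - h_t^2) at every step, so its norm shrinks (or grows) geometrically.
signal = np.ones(hidden_size)
for t in reversed(range(T)):
    signal = W_hh.T @ ((1 - hs[t] ** 2) * signal)   # one backward step through tanh
print(np.linalg.norm(signal))   # tiny when W_hh is small, enormous when it is large
```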

🧩 Step 7: Mini Summary

🧠 What You Learned: You learned how to construct a simple RNN from scratch — defining weight matrices, computing hidden states, and tracing gradient flow manually.

⚙️ How It Works: The network processes the sequence one time step at a time, using shared weights to blend new input with the memory carried in the hidden state.

🎯 Why It Matters: This hands-on foundation demystifies how RNNs actually function under the hood — crucial for understanding why modern architectures (like LSTMs, GRUs, and Transformers) were created in the first place.
