3.2. Modern Architectures and Trends (ResNet, DenseNet, MobileNet)


🪄 Step 1: Intuition & Motivation

  • Core Idea: When networks got deeper (beyond 20–30 layers), training started to break. Why? Because as we kept stacking layers, gradients vanished or exploded, and optimization became unstable — even though deeper networks should be better at representation learning.

Modern architectures like ResNet, DenseNet, and MobileNet fixed this — not by magic, but by rethinking how information and gradients flow through the network.

  • Simple Analogy: Imagine whispering a message through 100 people in a line. By the time it reaches the end, the message is garbled (vanishing gradient). ResNet says, “Forget whispering through everyone — let’s pass a copy of the original message straight through!” That’s the skip connection — a direct line that helps information (and gradients) travel cleanly.

🌱 Step 2: Core Concept — Modern CNN Designs

Let’s explore each key innovation, from ResNet’s identity mapping to DenseNet’s feature reuse and MobileNet’s efficient convolutions.


🧩 ResNet — The Residual Learning Revolution

What’s Happening Under the Hood? (ResNet)

🧠 The Problem:

As networks deepened (>20–30 layers), accuracy stopped improving and sometimes got worse. This wasn’t due to overfitting, but because the deeper models became harder to optimize — gradients either vanished or exploded as they propagated back.

💡 The Solution: Residual Connections

ResNet added identity shortcuts: connections that skip one or more layers and add the block’s input directly to the output of those layers:

$$ y = F(x, \{W_i\}) + x $$
  • $x$ = input to the block
  • $F(x, \{W_i\})$ = residual mapping (small transformation using a few conv layers)
  • $y$ = output after adding the input (the “shortcut”)

Instead of forcing the network to learn a full mapping $H(x)$, ResNet reframes it as:

$$ H(x) = F(x) + x $$

So the network only learns the difference (residual) between input and output — an easier optimization problem.

🧱 Typical Residual Block (Simplified):

Input → Conv → BN → ReLU → Conv → BN
   ↘────────────── skip connection ─────────↗
                  Add → ReLU → Output
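Here is a minimal sketch of such a block, assuming PyTorch; the channel count is illustrative, and the 1×1 projection that real ResNets use when shapes change is omitted.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv -> BN -> ReLU -> Conv -> BN, plus an identity skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))  # F(x): first conv stage
        residual = self.bn2(self.conv2(residual))      # F(x): second conv stage
        return self.relu(residual + x)                 # y = F(x) + x, then ReLU

# Example: a 64-channel feature map keeps its shape through the block.
block = ResidualBlock(64)
out = block(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```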

Why It Works This Way (Gradient Flow)

Residual connections make gradients flow directly through the skip path. During backpropagation, gradients can “bypass” some layers, reducing the chance of vanishing.

If $y = F(x) + x$, then:

$$ \frac{\partial y}{\partial x} = \frac{\partial F(x)}{\partial x} + I $$

The term $I$ (identity matrix) ensures a stable gradient path, even if $\frac{\partial F(x)}{\partial x}$ gets small.

So the network can grow to hundreds of layers (e.g., ResNet-152) and still train efficiently.
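A quick way to see this, as a minimal sketch assuming PyTorch and a toy fully connected stack rather than a real CNN: compare the gradient norm that reaches the input of a 50-layer plain stack with that of the same stack wrapped in skip connections.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width = 50, 32
layers = [nn.Linear(width, width) for _ in range(depth)]
x = torch.randn(1, width, requires_grad=True)

# Plain stack: h_{l+1} = tanh(W_l h_l + b_l)
h = x
for layer in layers:
    h = torch.tanh(layer(h))
h.sum().backward()
plain_grad = x.grad.norm().item()

# Residual stack (same weights): h_{l+1} = h_l + tanh(W_l h_l + b_l)
x.grad = None
h = x
for layer in layers:
    h = h + torch.tanh(layer(h))
h.sum().backward()
residual_grad = x.grad.norm().item()

print(f"plain:    {plain_grad:.2e}")     # typically a vanishingly small number
print(f"residual: {residual_grad:.2e}")  # typically many orders of magnitude larger
```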


How It Fits in ML Thinking

ResNet fundamentally changed how we design networks: instead of forcing each layer to learn from scratch, it allows feature refinement — each block tweaks what the previous block already learned.

Think of it like improving a draft: you don’t start over each time, you just add residual edits.


🧩 DenseNet — The Feature Recycling Factory

What’s Happening Under the Hood? (DenseNet)

ResNet’s skip connections combine features by addition. DenseNet takes the idea further by connecting, within a block, each layer to every subsequent layer via concatenation:

$$ x_l = H_l([x_0, x_1, \dots, x_{l-1}]) $$
  • Each layer receives all previous outputs as input.
  • The feature maps are concatenated (not summed).

This leads to:

  • Feature reuse: Later layers don’t have to relearn patterns already discovered earlier.
  • Improved gradient flow: Every layer directly connects to the loss, minimizing vanishing gradient risk.
  • Parameter efficiency: Fewer filters per layer needed — because they can “borrow” features from others.

🔧 Dense Block Structure (Conceptually):

Input
 ├─ Layer 1 → output_1
 ├─ Layer 2(input=[input, output_1]) → output_2
 ├─ Layer 3(input=[input, output_1, output_2]) → output_3
 └─ ...
Concatenate all outputs → Transition Layer → Next Block

DenseNet thus encourages information sharing and compactness — smaller networks, strong performance.
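Here is a minimal dense block sketch, assuming PyTorch; the growth rate and layer count are illustrative, and the bottleneck (1×1) layers and transition layers of the full DenseNet are omitted.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer sees the concatenation of the block input and all earlier outputs."""
    def __init__(self, in_channels: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # x_l = H_l([x_0, ..., x_{l-1}])
            features.append(out)
        return torch.cat(features, dim=1)            # all feature maps, concatenated

# Example: 3 layers with growth rate 12 turn 16 channels into 16 + 3*12 = 52.
block = DenseBlock(in_channels=16, growth_rate=12, num_layers=3)
print(block(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 52, 32, 32])
```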


🧩 MobileNet — The Lightweight Engineer

What’s Happening Under the Hood? (MobileNet)

MobileNet was designed for mobile and embedded devices — keeping accuracy high while minimizing computation.

Core Trick: Depthwise Separable Convolution

Standard convolution mixes spatial + channel information in one heavy operation. MobileNet factorizes it into two lighter ones:

  1. Depthwise convolution: one filter per input channel (captures spatial patterns only).
  2. Pointwise (1×1) convolution: combines channels linearly (mixes features).

Mathematical savings:

For a $D_k \times D_k$ kernel, $M$ input channels, and $N$ output channels, the cost per output location is:

  • Standard convolution cost: $D_k^2 \times M \times N$
  • Depthwise separable cost: $D_k^2 \times M + M \times N$

That is a large reduction in computation: the ratio of the two costs is $\frac{1}{N} + \frac{1}{D_k^2}$, i.e., roughly 8–9× fewer multiply-adds for a 3×3 kernel.
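A short sketch of the two factorized operations, assuming PyTorch; the channel counts are illustrative. Because these convolutions have no bias, their weight counts follow exactly the $D_k^2 \times M \times N$ versus $D_k^2 \times M + M \times N$ formulas above.

```python
import torch.nn as nn

M, N, K = 64, 128, 3  # input channels, output channels, kernel size

# Standard convolution: mixes spatial and channel information in one step.
standard = nn.Conv2d(M, N, kernel_size=K, padding=1, bias=False)

# Depthwise separable: per-channel spatial filtering, then 1x1 channel mixing.
separable = nn.Sequential(
    nn.Conv2d(M, M, kernel_size=K, padding=1, groups=M, bias=False),  # depthwise
    nn.Conv2d(M, N, kernel_size=1, bias=False),                       # pointwise
)

def num_weights(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

print(num_weights(standard))                           # 73728 = K^2 * M * N
print(num_weights(separable))                          #  8768 = K^2 * M + M * N
print(num_weights(standard) / num_weights(separable))  # ~8.4x fewer weights
```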


Why It Works This Way (Efficiency & Deployment)

By breaking convolution into two cheap operations, MobileNet keeps performance decent on small devices. It trades a small bit of accuracy for massive efficiency — a pragmatic design for deployment, not just research.

You can think of it like building a car that isn’t the fastest, but is remarkably fuel-efficient: practical, lean, and effective.


📐 Step 3: Mathematical Foundation

Residual Connection Formula
$$ y = F(x, \{W_i\}) + x $$

Where:

  • $F(x, \{W_i\})$ — the residual function (e.g., two conv layers).
  • $x$ — input (shortcut connection).
  • $y$ — output (sum of both).

Gradient Path:

$$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \left( \frac{\partial F(x)}{\partial x} + I \right) $$

The $+ I$ term acts like a “gradient highway,” ensuring signal can flow backward directly through the identity path.

Residuals act like “express lanes” in a traffic network — even if side streets (deep layers) are congested, gradients have a fast route backward.
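A tiny autograd check, again assuming PyTorch, makes the “+ I” term visible: the residual branch is made deliberately near-zero, yet the gradient reaching $x$ stays close to 1 thanks to the identity path.

```python
import torch

# Toy residual branch F(x) = W x with deliberately tiny weights,
# so dF/dx is nearly zero and the identity term has to do the work.
x = torch.randn(4, requires_grad=True)
W = torch.full((4, 4), 1e-6)

grad_plain = torch.autograd.grad((W @ x).sum(), x)[0]      # no shortcut
grad_resid = torch.autograd.grad((W @ x + x).sum(), x)[0]  # y = F(x) + x

print(grad_plain)  # ~4e-06 per entry: the signal nearly vanishes
print(grad_resid)  # ~1.000004 per entry: the +I term keeps it alive
```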

🧠 Step 4: Why Skip Connections Help Deep Networks

  • Better Gradient Flow: The identity shortcut ensures gradients don’t vanish; they have a direct backward path.

  • Identity Mapping Simplifies Learning: Instead of learning $H(x)$ from scratch, the network learns $F(x) = H(x) - x$, often easier to optimize.

  • Information Preservation: Original input information is retained and refined through layers — preventing information degradation.

  • Deeper Yet Stable: Enables training of networks 10× deeper (e.g., 152+ layers in ResNet) without exploding loss or vanishing gradients.


⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths

  • ResNets: Solve vanishing gradients elegantly → stable, deep training.
  • DenseNets: Maximum feature reuse → smaller networks, strong generalization.
  • MobileNets: Massive efficiency gain → practical for edge/mobile devices.

⚠️ Limitations

  • Residual paths can increase memory usage (need to store inputs for addition).
  • Dense concatenations grow feature dimension fast → memory-heavy for large inputs.
  • MobileNets trade accuracy for speed; not ideal for very fine-grained tasks.

⚖️ Trade-offs

  • ResNet → Ideal for accuracy on large GPUs (server-scale).
  • DenseNet → Compact yet expressive, but more memory-hungry.
  • MobileNet → Efficient deployment; less representational capacity.

🚧 Step 6: Common Misunderstandings

  • “Residual blocks just copy input forward.” Not exactly — they add learned refinements to the input.
  • “DenseNet is just ResNet with concatenation.” DenseNet uses concatenation (not addition), which changes gradient and feature dynamics entirely.
  • “MobileNet is just a smaller ResNet.” It’s architecturally different — uses separable convolutions for efficiency, not residuals for depth.

🧩 Step 7: Mini Summary

🧠 What You Learned: ResNet introduced skip connections to fix vanishing gradients and enable very deep learning; DenseNet extended this idea with dense connectivity for feature reuse; MobileNet optimized convolution efficiency for on-device inference.

⚙️ How It Works: Skip and dense connections preserve information and gradient flow; separable convolutions reduce redundant computation.

🎯 Why It Matters: These designs form the backbone of nearly every modern CNN — balancing accuracy, depth, and efficiency.
