3.1. Classic Architectures (LeNet, AlexNet, VGG, GoogLeNet)


🪄 Step 1: Intuition & Motivation

  • Core Idea (short): These classic models are the landmark steps that turned neural nets from curiosities into practical computer-vision engines. Each introduced one or more “tricks” — deeper layers, nonlinearity choices, careful regularization, multi-scale modules — that solved real training or efficiency problems and unlocked better performance on larger, harder image datasets.

  • Simple Analogy: Think of the evolution like early cars → race cars → tuned sports cars: each generation kept what worked and innovated where the bottleneck was (speed, control, fuel efficiency). Likewise, CNN architectures evolved to be deeper, faster to train, and more computation-aware.


🌱 Step 2: Core Concept — Walkthrough of Each Model

LeNet-5 (the humble grandparent)

What’s Happening Under the Hood? (LeNet-5)
  • When/why: Late 1990s (LeNet-5 appeared in 1998, building on CNN work from 1989), designed for handwritten digit recognition (MNIST).
  • Core shape: A few convolution + pooling layers → small fully connected head → softmax.
  • Architecture sketch (conceptual):
INPUT (32×32 grayscale)
  ↓ conv 5×5 → feature maps (6)
  ↓ avg-pool 2×2
  ↓ conv 5×5 → feature maps (16)
  ↓ avg-pool 2×2
  ↓ FC (120) → FC (84) → Output (10)
  • Innovation: Showed convolution + pooling + local connectivity + weight sharing worked for images. It was the first practical end-to-end trainable vision CNN.

Why it matters: Proved that learned, hierarchical features beat hand-crafted filters for simple vision tasks.
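
To make the sketch concrete, here is a minimal PyTorch version of the LeNet-5 layout above (PyTorch is just an assumed choice of framework here; the original used tanh-style activations and average pooling, which this sketch follows):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Minimal sketch of the LeNet-5 layout sketched above."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),   # 32x32 -> 28x28, 6 feature maps
            nn.AvgPool2d(2),                             # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),  # 14x14 -> 10x10, 16 feature maps
            nn.AvgPool2d(2),                             # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),                  # logits for 10 digit classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Sanity check on a dummy batch of 32x32 grayscale images
print(LeNet5()(torch.randn(4, 1, 32, 32)).shape)  # torch.Size([4, 10])
```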


AlexNet (the deep wake-up call)

What’s Happening Under the Hood? (AlexNet)
  • When/why: 2012 — won the ImageNet (ILSVRC) challenge by a large margin, dramatically reducing error on that large-scale dataset.
  • Core shape: Much deeper than LeNet, ReLU activations, dropout, data augmentation, GPU training.
  • Architecture sketch (simplified):
INPUT (224×224 RGB)
  ↓ conv 11×11, stride 4 → ReLU
  ↓ max-pool 3×3, stride 2
  ↓ conv 5×5 → ReLU
  ↓ max-pool
  ↓ conv 3×3 → conv 3×3 → conv 3×3
  ↓ max-pool
  ↓ FC layers (4096 → 4096) with Dropout
  ↓ Output (1000)
  • Innovation:

    • ReLU nonlinearity to speed up training.
    • Aggressive data augmentation and dropout to reduce overfitting.
    • Trained on GPUs on a massive dataset — showed that scale matters.

Why it matters: Sparked the modern deep learning renaissance — deeper nets + more data = far better performance.
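
If you want to poke at the architecture rather than rebuild it, torchvision ships a re-implementation. The snippet below assumes PyTorch and a recent torchvision (the `weights` keyword exists from torchvision 0.13 onward; older releases used `pretrained=True` instead):

```python
import torch
from torchvision import models

alexnet = models.alexnet(weights=None)  # pass weights="IMAGENET1K_V1" for pretrained weights
print(alexnet)                          # inspect the conv stack and the 4096 -> 4096 -> 1000 head

x = torch.randn(1, 3, 224, 224)         # dummy 224x224 RGB batch
print(alexnet(x).shape)                 # torch.Size([1, 1000])
```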


VGGNet (the simplicity champion)

What’s Happening Under the Hood? (VGGNet)
  • When/why: 2014 — showed that depth combined with simplicity can be powerful.
  • Core shape: Stacks of small ($3\times3$) filters, with max-pooling between stages and channel counts roughly doubling after each pool.
  • Architecture sketch (typical VGG block):
INPUT
  ↓ [conv 3×3 → conv 3×3] × N → max-pool
  ↓ [conv 3×3 × more] → max-pool
  ↓ ... (repeat) ...
  ↓ FC layers β†’ Output

Example: VGG-16 is composed of 13 conv layers + 3 FC layers.

  • Innovation:

    • Replaced large kernels (e.g., 7×7) with stacks of 3×3 convolutions.
    • Demonstrated that depth (16–19 weight layers) with small filters yields better features and efficiency: for $C$ input/output channels, three stacked 3×3 convs cover the same 7×7 receptive field with $3 \cdot 9C^2 = 27C^2$ weights instead of $49C^2$, while adding two extra nonlinearities.

Why it matters: Showed a clear, repeatable design pattern: use small kernels, stack them, and increase channels — simplicity + depth wins.
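
A hedged PyTorch sketch of that repeatable pattern — `vgg_block` below is an illustrative helper name, not an official API:

```python
import torch
import torch.nn as nn

def vgg_block(in_ch: int, out_ch: int, n_convs: int) -> nn.Sequential:
    """A VGG-style block: n_convs stacked 3x3 convs (padding 1 keeps H and W), then a 2x2 max-pool."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Three stacked 3x3 convs cover the same 7x7 receptive field as one 7x7 conv,
# but with 3 * 9 * C^2 = 27C^2 weights instead of 49C^2 (for C channels in and out).
block = vgg_block(64, 128, n_convs=2)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 128, 16, 16])
```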


GoogLeNet / Inception (the efficiency artist)

What’s Happening Under the Hood? (GoogLeNet / Inception)
  • When/why: 2014–2015 — wanted high accuracy but with much lower computation and parameter count.
  • Core shape: Inception modules: parallel branches with different kernel sizes (1×1, 3×3, 5×5) + pooling — concatenate outputs. Network is deep but efficient.
  • Architecture sketch (one Inception block idea):
INPUT
  ├─ 1×1 conv ─────────────┐
  ├─ 1×1 → 3×3 conv ───────┤
  ├─ 1×1 → 5×5 conv ───────┼─→ concatenate → output
  └─ 3×3 pool → 1×1 conv ──┘
  • Innovation:

    • Multi-scale feature capture inside a single block (small and larger receptive fields together).
    • 1×1 convolutions for dimensionality reduction (bottlenecks) to keep compute low.
    • Reduced parameter explosion while keeping expressivity.

Why it matters: Achieved state-of-the-art accuracy with far fewer parameters — a pragmatic architecture for real-world compute budgets.
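
A minimal PyTorch sketch of one Inception block with the four parallel branches and 1×1 bottlenecks from the diagram above; the class name and branch sizes are illustrative (the example channel counts are borrowed from GoogLeNet’s “inception (3a)” stage):

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Four parallel branches; outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # every branch preserves H x W, so the results can be concatenated channel-wise
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# 64 + 128 + 32 + 32 = 256 output channels
block = InceptionBlock(192, 64, 96, 128, 16, 32, 32)
print(block(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```

Note how the 1×1 “reduce” convs shrink the channel count before the expensive 3×3 and 5×5 convs — that is where most of the parameter savings come from.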


πŸ“ Step 3: How to Reproduce a Small VGG-Like Network on CIFAR-10 (Conceptual plan β€” no code)

Goal: Train a compact, VGG-style conv net that works well on CIFAR-10 (32×32 RGB images).

1) Architecture design (compact VGG-ish)

  • Use blocks of: [Conv 3×3 → ReLU → Conv 3×3 → ReLU] → MaxPool 2×2
  • Start channels small: e.g., 64 → 128 → 256 (increase after each pooling).
  • After the convolution blocks, use a small FC head: FC 512 → ReLU → Dropout → FC 10 (softmax).
  • Keep total depth moderate (e.g., 8–12 conv layers) to match CIFAR scale. A minimal code sketch of this layout follows the list.
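
A minimal PyTorch sketch of that layout, assuming the block and channel choices listed above (`conv_block` is an illustrative helper, and BatchNorm from the recipe in step 3 can be slotted in after each conv):

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # [Conv 3x3 -> ReLU -> Conv 3x3 -> ReLU] -> MaxPool 2x2, as in the block list above
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

model = nn.Sequential(
    conv_block(3, 64),     # 32x32 -> 16x16
    conv_block(64, 128),   # 16x16 -> 8x8
    conv_block(128, 256),  # 8x8  -> 4x4  (6 conv layers; add a block or a third conv per block to reach 8-12)
    nn.Flatten(),
    nn.Linear(256 * 4 * 4, 512), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(512, 10),    # logits for the 10 CIFAR classes (softmax is folded into the loss)
)

print(model(torch.randn(8, 3, 32, 32)).shape)  # torch.Size([8, 10])
```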

2) Data preprocessing & augmentation

  • Normalize images per-channel (subtract mean, divide by std).
  • Use augmentation: random horizontal flips, random crops with padding, and optional color jitter; this strongly combats overfitting on CIFAR (see the transform sketch after this list).
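
With torchvision, that pipeline might look like the sketch below. The mean/std values are commonly cited CIFAR-10 training-set statistics — treat them as an assumption or recompute them yourself:

```python
from torchvision import transforms

CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)   # commonly cited per-channel statistics
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # random crops with padding
    transforms.RandomHorizontalFlip(),      # random horizontal flips
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])

test_tf = transforms.Compose([              # no augmentation at evaluation time
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
```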

3) Regularization & training recipe

  • Optimizer: SGD with momentum or Adam (SGD often generalizes better).
  • Learning rate schedule: start at, e.g., 0.1 for SGD and decay on a schedule (step or cosine); a minimal optimizer/scheduler sketch follows this list.
  • Weight decay (L2 regularization): small value like 1e-4.
  • Use dropout in FC layers (e.g., p=0.5) if you have a large FC head; avoid heavy dropout in conv layers.
  • Batch normalization after conv layers can stabilize and speed training.
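
A minimal sketch of the optimizer/scheduler part of this recipe, assuming the `model` from the architecture sketch in step 1:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)  # SGD + momentum + L2 weight decay
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)  # cosine decay over ~100 epochs
# call scheduler.step() once per epoch, after that epoch's training loop finishes
```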

4) Loss and metrics

  • Cross-entropy loss for 10 classes.
  • Track training & validation loss, plus validation accuracy. Use early stopping or best-model checkpointing based on validation accuracy (a short evaluation sketch follows this list).
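
A short sketch of the loss and validation-accuracy tracking, assuming a `val_loader` DataLoader exists (the checkpointing line at the end is illustrative):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # cross-entropy over the 10 classes

@torch.no_grad()
def evaluate(model, loader, device="cpu"):
    """Return accuracy on a data loader."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        correct += (model(images).argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    return correct / total

# if evaluate(model, val_loader) > best_acc: torch.save(model.state_dict(), "best.pt")
```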

5) Diagnostics & expected behavior

  • If training accuracy skyrockets but validation lags → increase augmentation, increase weight decay, or reduce model capacity.
  • If both are low → increase the learning rate, check the data pipeline, or deepen the network slightly.
  • Visual checks: visualize the learned filters of early conv layers — they should look like edges and color blobs (a plotting sketch follows this list).
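
One way to do that visual check, assuming matplotlib and the sequential `model` from the step 1 sketch (so `model[0][0]` is its first conv layer; with 3×3 kernels the patterns are coarse, but edge- and color-like structure is usually still visible):

```python
import matplotlib.pyplot as plt

filters = model[0][0].weight.detach().cpu()                            # shape: (64, 3, 3, 3)
filters = (filters - filters.min()) / (filters.max() - filters.min())  # rescale to [0, 1] for display

fig, axes = plt.subplots(8, 8, figsize=(6, 6))
for ax, f in zip(axes.flat, filters):
    ax.imshow(f.permute(1, 2, 0))   # (C, H, W) -> (H, W, C)
    ax.axis("off")
plt.show()
```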

6) Resource & runtime tips

  • Use small batch sizes (e.g., 64–128) depending on GPU memory.
  • Mixed precision helps for speed but is optional (a sketch of an AMP training step follows this list).
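
If you do try mixed precision, a hedged sketch of a single AMP training step using PyTorch’s `torch.cuda.amp` (CUDA only) looks like this:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, images, labels, criterion, optimizer):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in float16 where it is safe
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()     # scale the loss to avoid float16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```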

🧠 Step 4: Assumptions or Key Ideas

  • Capacity vs. data: Deeper, higher-capacity nets need more data or stronger regularization. CIFAR-10 is small-ish → keep model capacity modest.
  • Local features compose: Stacked small kernels (3×3) can emulate larger receptive fields with fewer parameters and more nonlinearity.
  • Augmentation is often the single most effective regularizer on small vision datasets.

βš–οΈ Step 5: Strengths, Limitations & Trade-offs

✅ Strengths (by model):

  • LeNet: low complexity; great for tiny datasets.
  • AlexNet: showed ReLU + GPUs + data mattered.
  • VGG: simple repeatable pattern; strong baseline.
  • GoogLeNet: multi-scale + parameter efficiency.

⚠️ Limitations:

  • AlexNet/VGG: large parameter counts (FC layers especially) β€” heavy on memory.
  • GoogLeNet: architecture is more complex to design/tweak.
  • All classic nets: don’t include modern tricks like residual connections — deeper versions can be hard to train without them.

βš–οΈ Trade-offs:

  • Simplicity (VGG) vs computational efficiency (Inception).
  • Depth improves representation but increases training difficulty (solved later by ResNets).
  • Design choice depends on dataset size and compute budget.

🚧 Step 6: Common Misunderstandings

🚨 Common Misunderstandings
  • “Bigger is always better.” Depth helps, but without the right optimization tricks (BN, skip connections), very deep nets can be worse.
  • “Inception is just more convolutions.” No — it’s carefully organized multi-scale parallel branches with bottlenecks to control compute.
  • “VGG uses big filters.” VGG intentionally uses many small (3×3) filters stacked, not large kernels.

🧩 Step 7: Mini Summary

🧠 What You Learned: Classic CNNs progressed from simple, shallow LeNet to deep, compute-smart designs like VGG and Inception. Each solved a bottleneck: expressivity, trainability, or efficiency.

βš™οΈ How They Work: LeNet introduced basic conv+pool stacking; AlexNet scaled depth and training; VGG popularized deep stacks of 3Γ—3 convs; GoogLeNet introduced multi-scale Inception blocks and 1Γ—1 bottlenecks.

🎯 Why It Matters: These architectures provide blueprints and design intuitions still used today — whether you want a simple baseline (VGG-like) or a compute-efficient net (Inception-style).
