3.1. Classic Architectures (LeNet, AlexNet, VGG, GoogLeNet)
💪 Step 1: Intuition & Motivation
Core Idea (short): These classic models are the landmark steps that turned neural nets from curiosities into practical computer-vision engines. Each introduced one or more "tricks" (deeper layers, nonlinearity choices, careful regularization, multi-scale modules) that solved real training or efficiency problems and unlocked better performance on larger, harder image datasets.
Simple Analogy: Think of the evolution like early cars → race cars → tuned sports cars: each generation kept what worked and innovated where the bottleneck was (speed, control, fuel efficiency). Likewise, CNN architectures evolved to be deeper, faster to train, and more computation-aware.
🌱 Step 2: Core Concept - Walkthrough of Each Model
LeNet-5 (the humble grandparent)
What's Happening Under the Hood? (LeNet-5)
- When/why: Early 1990s, designed for handwritten digit recognition (MNIST).
- Core shape: A few convolution + pooling layers → small fully connected head → softmax.
- Architecture sketch (conceptual):
INPUT (32×32 grayscale)
→ conv 5×5 → feature maps (6)
→ avg-pool 2×2
→ conv 5×5 → feature maps (16)
→ avg-pool 2×2
→ FC (120) → FC (84) → Output (10)
- Innovation: Showed convolution + pooling + local connectivity + weight sharing worked for images. It was the first practical end-to-end trainable vision CNN.
Why it matters: Proved that learned, hierarchical features beat hand-crafted filters for simple vision tasks.
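To make the sketch concrete, here is a minimal LeNet-5-style model in PyTorch. It mirrors the diagram above; the tanh activations and exact layer sizes follow the classic description, but treat it as an illustrative sketch rather than a faithful reproduction of the original paper.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Minimal LeNet-5-style network for 32x32 grayscale inputs (sketch)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28, 6 feature maps
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10, 16 feature maps
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```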
AlexNet (the deep wake-up call)
What's Happening Under the Hood? (AlexNet)
- When/why: 2012: dramatically reduced error on ImageNet (a large-scale dataset).
- Core shape: Much deeper than LeNet, ReLU activations, dropout, data augmentation, GPU training.
- Architecture sketch (simplified):
INPUT (224×224 RGB)
→ conv 11×11, stride 4 → ReLU
→ max-pool 3×3, stride 2
→ conv 5×5 → ReLU
→ max-pool
→ conv 3×3 → conv 3×3 → conv 3×3
→ max-pool
→ FC layers (4096 → 4096) with Dropout
→ Output (1000)
Innovation:
- ReLU nonlinearity to speed up training.
- Aggressive data augmentation and dropout to reduce overfitting.
- Trained on GPUs on a massive dataset; showed that scale matters.
Why it matters: Sparked the modern deep learning renaissance; deeper nets + more data = far better performance.
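A simplified AlexNet-style sketch in PyTorch follows. The layer order matches the diagram above; the specific channel counts (64, 192, 384, 256, 256) are an assumption borrowed from the common torchvision-style variant, and details such as local response normalization are omitted for brevity.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Simplified AlexNet-style network for 224x224 RGB inputs (sketch)."""
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),            # -> 256 x 6 x 6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(AlexNetSketch()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```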
VGGNet (the simplicity champion)
What's Happening Under the Hood? (VGGNet)
- When/why: 2014: showed that depth with simplicity can be powerful.
- Core shape: Use many small (3×3) filters stacked, doubling channels occasionally, then pooling.
- Architecture sketch (typical VGG block):
INPUT
→ [conv 3×3 → conv 3×3] × N → max-pool
→ [conv 3×3] × more → max-pool
→ ... (repeat) ...
→ FC layers → Output
Example: VGG-16 is composed of 13 conv layers + 3 FC layers.
Innovation:
- Replaced large kernels (e.g., 7×7) with stacks of 3×3 convolutions.
- Demonstrated that depth (16–19 layers) with small filters yields better features and efficiency (fewer parameters than naive large-kernel nets).
Why it matters: Showed a clear, repeatable design pattern: use small kernels, stack them, and increase channels; simplicity + depth wins.
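A quick way to see the efficiency argument: three stacked 3×3 convolutions cover the same 7×7 receptive field as one 7×7 convolution while allowing extra nonlinearities in between. The sketch below (a rough check, with C = 256 chosen arbitrarily) compares the parameter counts.

```python
import torch.nn as nn

C = 256  # illustrative channel count

one_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3)
three_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

print(count_params(one_7x7))    # 7*7*C*C + C      ~ 3.2M parameters
print(count_params(three_3x3))  # 3*(3*3*C*C + C)  ~ 1.8M parameters
```

Same receptive field, roughly 45% fewer parameters, plus room for two extra ReLUs between the stacked convolutions.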
GoogLeNet / Inception (the efficiency artist)
What's Happening Under the Hood? (GoogLeNet / Inception)
- When/why: 2014–2015: aimed for high accuracy with much lower computation and parameter count.
- Core shape: Inception modules: parallel branches with different kernel sizes (1×1, 3×3, 5×5) + pooling → concatenate outputs. The network is deep but efficient.
- Architecture sketch (one Inception block idea):
INPUT
├─ 1×1 conv ────────────┐
├─ 1×1 → 3×3 conv ──────┤ concatenate → output
├─ 1×1 → 5×5 conv ──────┤
└─ 3×3 pool → 1×1 conv ─┘
Innovation:
- Multi-scale feature capture inside a single block (small and larger receptive fields together).
- 1×1 convolutions for dimensionality reduction (bottlenecks) to keep compute low.
- Reduced parameter explosion while keeping expressivity.
Why it matters: Achieved state-of-the-art accuracy with far fewer parameters; a pragmatic architecture for real-world compute budgets.
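Here is a minimal Inception-style block sketched in PyTorch, mirroring the four parallel branches in the diagram above. The per-branch channel counts passed in the example are illustrative assumptions (they resemble an early GoogLeNet block but should not be read as the exact published configuration).

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Four parallel branches whose outputs are concatenated along channels (sketch)."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),   # 1x1 bottleneck
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),   # 1x1 bottleneck
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True),
        )
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Concatenate the four branches along the channel dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

block = InceptionBlock(192, c1=64, c3_red=96, c3=128, c5_red=16, c5=32, pool_proj=32)
print(block(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```

Note how every expensive 3×3 or 5×5 branch is preceded by a 1×1 bottleneck; that is what keeps the compute budget in check.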
🚀 Step 3: How to Reproduce a Small VGG-Like Network on CIFAR-10 (Conceptual Plan, with Illustrative Sketches)
Goal: Train a compact, VGG-style conv net that works well on CIFAR-10 (32×32 RGB images).
1) Architecture design (compact VGG-ish)
- Use blocks of:
[Conv 3×3 → ReLU → Conv 3×3 → ReLU] → MaxPool 2×2
- Start channels small: e.g., 64 → 128 → 256 (increase after each pooling).
- After convolution blocks, use a small FC head:
FC 512 → ReLU → Dropout → FC 10 (softmax).
- Keep total depth moderate (e.g., 8–12 conv layers) to match CIFAR's scale; a minimal sketch of this plan follows below.
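Here is one way to realize this plan in PyTorch: a sketch under the stated assumptions (three [conv-conv-pool] blocks with 64 → 128 → 256 channels, batch norm after each conv as suggested in point 3, and a 512-unit FC head with dropout). Names like MiniVGG are placeholders, not a standard architecture.

```python
import torch
import torch.nn as nn

def vgg_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """[Conv 3x3 -> BN -> ReLU] x 2 -> MaxPool 2x2 (halves the spatial size)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class MiniVGG(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            vgg_block(3, 64),     # 32x32 -> 16x16
            vgg_block(64, 128),   # 16x16 -> 8x8
            vgg_block(128, 256),  # 8x8   -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, 512), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(MiniVGG()(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```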
2) Data preprocessing & augmentation
- Normalize images per-channel (subtract mean, divide by std).
- Use augmentation: random horizontal flips, random crops with padding, and optional color jitter. This combats overfitting strongly on CIFAR; a typical pipeline is sketched below.
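A CIFAR-10 pipeline implementing these steps with torchvision transforms might look like the sketch below; the per-channel mean/std values are the commonly quoted CIFAR-10 statistics and the 4-pixel crop padding is a conventional choice, both assumptions here.

```python
from torchvision import transforms

CIFAR_MEAN = (0.4914, 0.4822, 0.4465)   # commonly used CIFAR-10 statistics (assumption)
CIFAR_STD = (0.2470, 0.2435, 0.2616)

train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # random crops with padding
    transforms.RandomHorizontalFlip(),      # random horizontal flips
    transforms.ToTensor(),
    transforms.Normalize(CIFAR_MEAN, CIFAR_STD),
])

test_tf = transforms.Compose([              # no augmentation at evaluation time
    transforms.ToTensor(),
    transforms.Normalize(CIFAR_MEAN, CIFAR_STD),
])
```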
3) Regularization & training recipe
- Optimizer: SGD with momentum or Adam (SGD often generalizes better).
- Learning rate schedule: start at e.g. 0.1 and decay it during training (step or cosine schedule).
- Weight decay (L2 regularization): small value like 1e-4.
- Use dropout in FC layers (e.g., p=0.5) if you have a large FC head; avoid heavy dropout in conv layers.
- Batch normalization after conv layers can stabilize and speed up training. A sketch of this recipe follows below.
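A sketch of this recipe in PyTorch; the stand-in model, the 100-epoch budget, and the initial learning rate of 0.1 are all illustrative assumptions (in practice you would plug in the MiniVGG sketch from point 1).

```python
import torch
import torch.nn as nn

# Stand-in model so the snippet runs on its own; use the MiniVGG sketch in practice.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
epochs = 100  # illustrative choice

# SGD with momentum plus weight decay (L2 regularization), as suggested above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Cosine decay of the learning rate over the whole run; call scheduler.step()
# once per epoch, after training on that epoch's batches.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
criterion = nn.CrossEntropyLoss()
```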
4) Loss and metrics
- Cross-entropy loss for 10 classes.
- Track training & validation loss, and validation accuracy. Use early stopping or best-model checkpointing based on validation accuracy; a minimal sketch follows below.
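A minimal validation-accuracy and best-checkpoint sketch is shown below; model, val_loader, and device are assumed to come from the surrounding training script, and the commented lines show where the check would sit inside the epoch loop.

```python
import torch

@torch.no_grad()
def evaluate(model, loader, device):
    """Return top-1 accuracy of `model` over `loader`."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total

best_acc = 0.0
# Inside the epoch loop (assuming model, val_loader, device exist):
# val_acc = evaluate(model, val_loader, device)
# if val_acc > best_acc:
#     best_acc = val_acc
#     torch.save(model.state_dict(), "best_model.pt")
```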
5) Diagnostics & expected behavior
- If training accuracy skyrockets but validation lags: increase augmentation or weight decay, or reduce model capacity.
- If both are low: increase the learning rate, check the data pipeline, or deepen the network slightly.
- Visual checks: visualize learned filters of early conv layers; they should look like edges and color blobs.
6) Resource & runtime tips
- Use small batch sizes (e.g., 64–128) depending on GPU memory.
- Mixed precision helps for speed but is optional; a sketch of the standard pattern follows below.
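If you do enable mixed precision on a CUDA GPU, the usual PyTorch autocast/GradScaler pattern is short; the sketch below assumes model, optimizer, criterion, train_loader, and device are defined as in the recipe above.

```python
import torch

scaler = torch.cuda.amp.GradScaler()        # scales the loss to avoid fp16 underflow

def train_one_epoch(model, train_loader, optimizer, criterion, device):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():     # forward pass runs in mixed precision
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()       # backward on the scaled loss
        scaler.step(optimizer)              # unscales gradients, then optimizer step
        scaler.update()
```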
🧠 Step 4: Assumptions or Key Ideas
- Capacity vs. dataset size: Deeper, larger nets typically need more data or stronger regularization. CIFAR-10 is small-ish, so keep model capacity modest.
- Local features compose: Stacked small kernels (3×3) can emulate larger receptive fields with fewer parameters and more nonlinearity.
- Augmentation is often the single most effective regularizer on small vision datasets.
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths (by model):
- LeNet: low complexity; great for tiny datasets.
- AlexNet: showed ReLU + GPUs + data mattered.
- VGG: simple repeatable pattern; strong baseline.
- GoogLeNet: multi-scale + parameter efficiency.
⚠️ Limitations:
- AlexNet/VGG: large parameter counts (FC layers especially); heavy on memory.
- GoogLeNet: architecture is more complex to design/tweak.
- All classic nets: don't include modern tricks like residual connections; deeper versions can be hard to train without them.
⚖️ Trade-offs:
- Simplicity (VGG) vs computational efficiency (Inception).
- Depth improves representation but increases training difficulty (solved later by ResNets).
- Design choice depends on dataset size and compute budget.
🧠 Step 6: Common Misunderstandings
- "Bigger is always better." Depth helps, but without the right optimization tricks (BN, skip connections), very deep nets can be worse.
- "Inception is just more convolutions." No; it's carefully organized multi-scale parallel branches with bottlenecks to control compute.
- "VGG uses big filters." VGG intentionally uses many small (3×3) filters stacked, not large kernels.
🧩 Step 7: Mini Summary
🧠 What You Learned: Classic CNNs progressed from simple, shallow LeNet to deep, compute-smart designs like VGG and Inception. Each solved a bottleneck: expressivity, trainability, or efficiency.
⚙️ How They Work: LeNet introduced basic conv+pool stacking; AlexNet scaled depth and training; VGG popularized deep stacks of 3×3 convs; GoogLeNet introduced multi-scale Inception blocks and 1×1 bottlenecks.
🎯 Why It Matters: These architectures provide blueprints and design intuitions still used today, whether you want a simple baseline (VGG-like) or a compute-efficient net (Inception-style).