CNNs - Roadmap
Goal: Build deep, intuitive, and mathematical understanding of CNNs — from their core building blocks to full-fledged architectures used in production systems.
This roadmap moves from first principles → architecture design → practical scaling → interview-level mastery.
⚙️ 1. Core CNN Mechanics
Note
The Top Tech Interview Angle (Convolutional Layers): This is the heart of computer vision models. Expect to be grilled on the why and how behind convolutions: kernel operations, receptive fields, and why CNNs outperform dense nets on image data. Demonstrating intuition about parameter efficiency and translation equivariance (which pooling turns into approximate translation invariance) is key.
1.1: Understand the Convolution Operation
- Start with 2D convolutions — visualize a kernel sliding across an image.
- Derive the formula:
$$Y_{i,j} = \sum_m \sum_n X_{i+m, j+n} \cdot K_{m,n}$$
Understand what each term represents.
- Learn stride, padding, and receptive field mathematically and visually.
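To make the sliding-kernel formula concrete, here is a minimal NumPy sketch of it, assuming stride 1 and no padding. The function name `conv2d_valid` and the toy arrays are illustrative only. Note that the formula as written does not flip the kernel, so strictly speaking it is cross-correlation, which is what deep learning frameworks call "convolution".

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive version of Y[i, j] = sum_m sum_n X[i+m, j+n] * K[m, n] (stride 1, no padding)."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=float)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the kernel with the patch under it, then sum.
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
k = np.array([[1.0, 0.0], [0.0, -1.0]])        # toy 2x2 kernel
y = conv2d_valid(x, k)
print(y.shape)  # (3, 3)
print(y)        # every entry is x[i, j] - x[i+1, j+1] = -5
```

The double loop is deliberately slow and explicit; it mirrors the two sums in the formula rather than how a framework would implement it.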
Deeper Insight: Be ready to discuss how stride and padding affect the output size and the trade-offs between spatial resolution and computational cost.
Probing Question: “If you double the stride, how do the receptive field and output size change?”
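One way to reason about the stride question is to compute both quantities directly. The sketch below uses the standard output-size relation, floor((n + 2p - k) / s) + 1, and the usual layer-by-layer receptive field recurrence; the helper names are hypothetical and dilation is assumed to be 1.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

def receptive_field(layers):
    """Receptive field of the last layer, given (kernel, stride) pairs from input to output."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

print(conv_output_size(32, kernel=3, stride=1, padding=1))  # 32
print(conv_output_size(32, kernel=3, stride=2, padding=1))  # 16
print(receptive_field([(3, 1), (3, 1), (3, 1)]))            # 7
print(receptive_field([(3, 2), (3, 1), (3, 1)]))            # 11
```

In this example, doubling the first layer's stride roughly halves the output size. That layer's own receptive field stays at 3, but every later layer's receptive field grows faster because each step now covers two input pixels.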
1.2: Convolution vs. Fully Connected Layers
- Compare parameter counts between dense layers and conv layers.
- Explain weight sharing and local connectivity intuitively.
- Implement both in PyTorch/TensorFlow and measure the difference in parameter counts.
Note: A common interview question — “Why not just use fully connected layers on images?”
Hint: Discuss spatial structure, translation invariance, and parameter explosion.
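The parameter explosion can be checked with simple arithmetic before touching a framework. The sketch below assumes a 32×32 RGB input (CIFAR-10 sized) and compares a dense layer with 64 units against a 3×3 convolution with 64 output channels; the helper names are illustrative. The two layers do not produce the same output shape, so this is a comparison of weight counts only.

```python
def dense_params(in_features, out_features, bias=True):
    """One weight per (input, output) pair, plus one bias per output."""
    return in_features * out_features + (out_features if bias else 0)

def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """One k x k x in_channels filter per output channel, shared across all positions."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)

dense = dense_params(32 * 32 * 3, 64)   # flatten the image, then 64 units
conv = conv2d_params(3, 64, 3)          # 64 filters of size 3x3x3
print(dense)          # 196672
print(conv)           # 1792
print(dense / conv)   # roughly 110x more parameters in the dense layer
```

The convolution's count does not depend on the image size at all, which is the weight-sharing argument in numerical form.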
🧩 2. Feature Extraction & Pooling
Note
The Top Tech Interview Angle (Pooling Layers): Pooling tests your understanding of feature hierarchy and translation robustness. Many candidates forget to link pooling to invariance and receptive field scaling — a subtle but critical insight.
2.1: Max Pooling and Average Pooling
- Explain the mathematical operation behind each.
- Demonstrate how pooling reduces dimensionality and increases receptive field.
- Implement a 2×2 max pool manually using NumPy.
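A minimal sketch of the manual 2×2 pooling exercise, assuming a single-channel 2D array, stride 2, and no padding (any odd trailing row or column is dropped). The function names are illustrative.

```python
import numpy as np

def _blocks_2x2(x):
    """Reshape an (H, W) array into (H//2, 2, W//2, 2) non-overlapping 2x2 blocks."""
    h2, w2 = x.shape[0] // 2, x.shape[1] // 2
    x = x[:h2 * 2, :w2 * 2]              # drop an odd trailing row/column if present
    return x.reshape(h2, 2, w2, 2)

def max_pool_2x2(x):
    return _blocks_2x2(x).max(axis=(1, 3))    # largest value in each block

def avg_pool_2x2(x):
    return _blocks_2x2(x).mean(axis=(1, 3))   # mean of each block

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))  # [[ 5.  7.] [13. 15.]]
print(avg_pool_2x2(x))  # [[ 2.5  4.5] [10.5 12.5]]
```

Each output value summarizes a 2×2 input region, so the spatial size halves while the receptive field of every later layer doubles in input-pixel terms.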
Deeper Insight: Be ready to reason about why pooling helps generalization and when it can harm performance.
Probing Question: “Why might you replace pooling with strided convolutions in modern architectures?”
2.2: Dropout and Regularization in CNNs
- Understand how dropout combats overfitting in fully connected layers.
- Study why dropout is less commonly used in convolutional layers.
- Implement dropout during training and observe its effect on validation loss.
Note: In interviews, you might be asked: “If dropout hurts your CNN’s convergence, what alternatives could you use?”
Talk about BatchNorm, data augmentation, and L2 regularization.
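For reference, dropout itself is only a few lines. The sketch below assumes the common "inverted dropout" formulation, where surviving activations are rescaled during training so that nothing changes at inference time; the function name and arguments are illustrative.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Zero each element with probability p and rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x                          # inference: identity
    keep_mask = rng.random(x.shape) >= p  # True where the unit is kept
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(10_000)
train_out = dropout(x, p=0.5, training=True, rng=rng)
eval_out = dropout(x, p=0.5, training=False, rng=rng)
print(np.unique(train_out))  # only 0.0 and 2.0 appear
print(train_out.mean())      # close to 1.0, the expected value is preserved
```

Because the mask is sampled independently per element, it ignores the strong correlation between neighboring pixels in a feature map, which is one commonly cited reason it is less effective in convolutional layers.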
🏗️ 3. CNN Architectures & Design Patterns
Note
The Top Tech Interview Angle (Architectures): Architecture design tests your system-level understanding — depth, receptive fields, skip connections, and parameter efficiency. You must articulate design trade-offs.
3.1: Classic Architectures
- Study LeNet-5, AlexNet, VGGNet, and GoogLeNet.
- For each, sketch architecture diagrams and identify design innovations.
- Reproduce a small VGG-like network on CIFAR-10.
Deeper Insight: You should be able to say why each model was a breakthrough.
Example: “GoogLeNet used Inception modules to capture multi-scale features while keeping computation affordable.”
3.2: Modern Architectures and Trends
- Explore ResNet and understand residual connections mathematically:
$$y = F(x, \{W_i\}) + x$$
- Grasp how skip connections help mitigate vanishing gradients.
- Dive into DenseNet (dense connectivity and feature reuse) and MobileNet (depthwise separable convolutions for lightweight inference).
Probing Question: “Why do skip connections enable deeper networks to train effectively?”
Mention gradient flow, identity mapping, and information preservation.
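The gradient-flow argument can be illustrated numerically. The sketch below is a deliberate simplification: each layer is a small random linear map, so F(x) = Wx, and the Jacobian of a plain layer is W while the Jacobian of a residual layer y = F(x) + x is W + I. The depth, width, and weight scale are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 8, 20
weights = [0.05 * rng.standard_normal((dim, dim)) for _ in range(depth)]

plain_jac = np.eye(dim)   # Jacobian of the full plain stack
resid_jac = np.eye(dim)   # Jacobian of the full residual stack
for w in weights:
    plain_jac = w @ plain_jac                    # y = W x      ->  dy/dx = W
    resid_jac = (np.eye(dim) + w) @ resid_jac    # y = W x + x  ->  dy/dx = W + I

plain_norm = np.linalg.norm(plain_jac)
resid_norm = np.linalg.norm(resid_jac)
print(plain_norm)  # vanishingly small: the gradient signal is almost gone
print(resid_norm)  # order of magnitude 1: the identity path preserves the signal
```

Real residual blocks include nonlinearities and normalization, so this is only a toy model of why the identity term keeps gradients from collapsing with depth.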
📸 4. CNNs in Practice
Note
The Top Tech Interview Angle (Applied CNNs): You’ll often be asked to build or optimize CNNs for tasks like classification or detection. Focus on model debugging, data pipelines, and performance tuning.
4.1: CNN for Image Classification
- Build a CNN from scratch for MNIST or CIFAR-10 using PyTorch or TensorFlow.
- Visualize filters and feature maps using Matplotlib.
- Tune hyperparameters (learning rate, batch size, kernel size).
Deeper Insight: Discuss early stopping, data augmentation, and overfitting diagnosis.
Probing Question: “If your CNN overfits even after dropout and data augmentation, what could be the next step?”
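Early stopping, mentioned above, reduces to tracking the best validation loss and a patience counter. A minimal framework-free sketch follows; the function name, the `patience` and `min_delta` parameters, and the loss values are all made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss          # improvement: remember it and reset the counter
            bad_epochs = 0
        else:
            bad_epochs += 1      # no improvement this epoch
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1   # never triggered: ran to the end

# Validation loss improves until epoch 2, then rises (a typical overfitting curve).
losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 1.1]
stop = early_stopping_epoch(losses, patience=3)
print(stop)  # 5
```

In a real training loop, the checkpoint saved at the best epoch (epoch 2 here) would be the one restored, not the weights at the stopping epoch.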
4.2: Transfer Learning and Fine-Tuning
- Load pretrained models (ResNet, EfficientNet) using torchvision or Keras.
- Freeze layers and fine-tune on a small dataset.
- Discuss feature reuse and representation learning.
Note: Expect a question like: “Why not train from scratch?”
Be prepared to mention data scarcity, compute efficiency, and representation generalization.
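The freeze-then-fine-tune pattern can be sketched in PyTorch as follows. To keep the example self-contained, a tiny randomly initialized `backbone` stands in for a pretrained model; in practice it would be a network loaded with pretrained weights from torchvision or Keras. The 5-class head and all sizes are assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained feature extractor.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(8, 5)  # new classifier head for an assumed 5-class task

# Freeze the backbone so only the new head is updated.
for p in backbone.parameters():
    p.requires_grad = False

model = nn.Sequential(backbone, head)
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.01
)

# One training step on a random batch (placeholder data).
x = torch.randn(4, 3, 16, 16)
y = torch.randint(0, 5, (4,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(trainable, frozen)  # 45 trainable (head), 224 frozen (backbone)
```

A common next step is to unfreeze some of the later backbone layers and continue with a smaller learning rate once the head has converged.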
🔬 5. Scaling, Optimization, and Deployment
Note
The Top Tech Interview Angle (Scaling CNNs): This assesses your ability to think beyond toy problems — optimizing large models, managing inference latency, and scaling training across GPUs.
5.1: Training at Scale
- Learn about batch normalization, gradient clipping, and mixed precision.
- Implement multi-GPU training using DataParallel or DistributedDataParallel.
- Understand how gradient accumulation helps with limited GPU memory.
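Gradient accumulation works because the gradient of a mean loss over a large batch equals the average of the gradients over equally sized micro-batches. The NumPy sketch below checks this for a linear model with mean squared error; the model, data, and names are illustrative assumptions rather than a training recipe.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y) ** 2) with respect to w."""
    return x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4))
y = rng.standard_normal(32)
w = np.zeros(4)

# Gradient computed on the full batch of 32 at once.
full_grad = mse_grad(w, x, y)

# Same gradient built from 4 micro-batches of 8, as if only 8 fit in memory.
accum_steps = 4
accum_grad = np.zeros_like(w)
for xb, yb in zip(np.split(x, accum_steps), np.split(y, accum_steps)):
    accum_grad += mse_grad(w, xb, yb) / accum_steps  # scale so the sum is a mean

print(np.allclose(full_grad, accum_grad))  # True
```

The equivalence is exact only for layers that treat samples independently. Batch normalization computes statistics per micro-batch, so accumulated training with BatchNorm is not identical to true large-batch training.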
Deeper Insight: Discuss the trade-off between batch size and generalization.
Probing Question: “Why can larger batch sizes hurt test performance?”
5.2: Inference Optimization
- Study model pruning, quantization, and knowledge distillation.
- Optimize CNN inference with TensorRT or ONNX Runtime (ONNX itself is the interchange format used for export).
- Measure FPS (frames per second) and latency during inference.
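To build intuition for quantization before using a toolkit, the sketch below applies symmetric per-tensor int8 quantization to a random weight tensor in NumPy. This is one simple scheme among several (per-channel and asymmetric variants also exist); the function names and tensor shape are assumptions.

```python
import numpy as np

def quantize_int8(w):
    """Map float weights to int8 using a single scale for the whole tensor."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 3, 3, 3)).astype(np.float32)  # toy conv weight tensor

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = float(np.max(np.abs(w - w_hat)))

print(w.nbytes, q.nbytes)  # int8 storage is 4x smaller than float32
print(max_err, scale / 2)  # rounding error is bounded by half a quantization step
```

The storage saving is fixed at 4x, while the accuracy cost depends on how the rounding error interacts with the rest of the network, which is why quantized models are re-evaluated after conversion.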
Note: Interviewers value practical awareness: “How would you deploy a CNN on a mobile device?”
Mention quantization, depthwise convolutions, and model compression.
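The saving from depthwise convolutions is easy to quantify. The sketch below compares the weight count of a standard convolution with a depthwise separable one (a depthwise k×k convolution followed by a 1×1 pointwise convolution), ignoring biases; the layer sizes are arbitrary and the helper names are hypothetical.

```python
def standard_conv_weights(c_in, c_out, k):
    """Every output channel has its own k x k filter over all input channels."""
    return k * k * c_in * c_out

def depthwise_separable_weights(c_in, c_out, k):
    """One k x k filter per input channel, then a 1x1 conv to mix channels."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

std = standard_conv_weights(64, 128, 3)
sep = depthwise_separable_weights(64, 128, 3)
print(std)        # 73728
print(sep)        # 8768
print(std / sep)  # roughly 8.4x fewer weights
```

The ratio works out to 1 / (1/c_out + 1/k²), so for 3×3 kernels and wide layers the reduction approaches 9x.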
🧠 6. Advanced Vision Topics (Stretch Goals)
Note
The Top Tech Interview Angle (Beyond CNNs): These topics separate strong engineers from visionary ML thinkers. They connect CNN fundamentals to advanced architectures used in real systems.
6.1: CNNs vs. Vision Transformers (ViT)
- Compare convolutional inductive bias vs. self-attention.
- Discuss why CNNs are more data-efficient for smaller datasets.
- Analyze Transformer-inspired ConvNets (e.g., ConvNeXt), which adopt ViT-era design choices while remaining purely convolutional.
Probing Question: “Would you replace CNNs with Transformers for all vision tasks?”
Answer with nuance — mention compute cost, data regime, and inductive priors.
6.2: Explainability in CNNs
- Use Grad-CAM and saliency maps to interpret CNN decisions.
- Understand feature attribution and class activation maps.
- Discuss how interpretability aids model trust in real-world systems.
Note: A strong candidate links interpretability to debugging, bias detection, and regulatory compliance.
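The core Grad-CAM computation is short once the inputs are available. The sketch below assumes the feature-map activations of a chosen convolutional layer and the gradients of the class score with respect to them have already been extracted from a framework (for example via hooks); here they are small hand-written arrays. The final normalization to [0, 1] is a visualization convention, not part of the method's definition.

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: arrays of shape (channels, height, width)."""
    weights = gradients.mean(axis=(1, 2))             # global-average-pool the gradients
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum of feature maps
    cam = np.maximum(cam, 0.0)                        # ReLU: keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                         # scale to [0, 1] for display
    return cam

# Toy example: channel 0 supports the class, channel 1 opposes it.
activations = np.array([[[1.0, 0.0], [0.0, 0.0]],
                        [[0.0, 0.0], [0.0, 1.0]]])
gradients = np.array([[[1.0, 1.0], [1.0, 1.0]],
                      [[-1.0, -1.0], [-1.0, -1.0]]])
cam = grad_cam(activations, gradients)
print(cam)  # [[1. 0.] [0. 0.]]: only the region backed by channel 0 lights up
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image.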