The Complete Convolutional Neural Networks Cheat Sheet: Master Deep Learning for Computer Vision

Introduction: What are Convolutional Neural Networks?

Convolutional Neural Networks (CNNs) are deep learning architectures designed for grid-like data such as images. Loosely inspired by the animal visual cortex, they stack convolution, pooling, and fully connected layers, and learn spatial hierarchies of features through backpropagation rather than relying on hand-designed filters.

Why CNNs Matter:

  • State-of-the-art performance in image classification, object detection, and segmentation
  • Feature extraction without manual feature engineering
  • Parameter sharing reduces the number of weights, and convolution combined with pooling gives a degree of translation invariance
  • Applications span computer vision, medical imaging, autonomous vehicles, facial recognition, and more
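To make the parameter-sharing point concrete, here is a rough parameter count for a convolution layer versus a fully connected layer producing an output of the same size. The 32×32 RGB input and the 32 output feature maps are illustrative assumptions, not figures from a specific model.

```python
def conv_params(kernel_size, in_channels, out_channels, bias=True):
    """Learnable parameters in a 2D convolution layer with square kernels."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def dense_params(in_features, out_features, bias=True):
    """Learnable parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# Hypothetical case: 32x32 RGB input mapped to 32 feature maps of size 32x32.
conv = conv_params(kernel_size=3, in_channels=3, out_channels=32)
dense = dense_params(in_features=32 * 32 * 3, out_features=32 * 32 * 32)

print(conv)   # 896
print(dense)  # 100696064
```

The convolution reuses the same 3×3 kernels at every position, so its parameter count does not depend on the image size, while the dense layer's count grows with both input and output size.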

Core Concepts and Principles

Fundamental Building Blocks

  1. Convolution Layer

    • Applies sliding filters/kernels to input data
    • Extracts features through parameter sharing
    • Preserves spatial relationships in data
  2. Activation Function

    • Introduces non-linearity (typically ReLU)
    • Enables learning of complex patterns
    • Helps with the vanishing gradient problem
  3. Pooling Layer

    • Reduces spatial dimensions (downsampling)
    • Provides a degree of local translation invariance
    • Common types: Max pooling, average pooling
  4. Fully Connected Layer

    • Traditional neural network layer
    • Often used at the end of the network for classification
    • Operates on spatial data that has been flattened into a 1D feature vector
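As an illustration of the pooling block described above, the following is a minimal pure-Python sketch of 2×2 max pooling on a single feature map. The function name `max_pool2d` and the sample values are made up for this example; real frameworks operate on batched tensors.

```python
def max_pool2d(feature_map, pool=2, stride=2):
    """Max pooling over a 2D list of numbers, without padding."""
    pooled = []
    for i in range(0, len(feature_map) - pool + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - pool + 1, stride):
            # Collect the values inside the current pooling window.
            window = [
                feature_map[i + m][j + n]
                for m in range(pool)
                for n in range(pool)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 8],
    [1, 1, 7, 5],
]
print(max_pool2d(feature_map))  # [[6, 2], [2, 9]]
```

Each 2×2 window is reduced to its largest value, which halves both spatial dimensions while keeping the strongest activation in each region.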

CNN Operations

| Operation | Purpose | Parameters | Output Shape |
| --- | --- | --- | --- |
| Convolution | Feature extraction | Kernel size, stride, padding, filters | Height × Width × Channels |
| Pooling | Dimensionality reduction | Pool size, stride | Reduced Height × Width × Channels |
| Flattening | Prepare for FC layers | None | 1D vector |
| Fully Connected | Classification | Number of neurons | Number of classes |
| Dropout | Regularization | Dropout rate | Same as input |
| Batch Normalization | Training stability | Momentum, epsilon | Same as input |

Mathematical Foundations

Convolution Operation

For a 2D input image I and a kernel K, the convolution operation is:

$$(I * K)(i,j) = \sum_m \sum_n I(i+m, j+n) \cdot K(m,n)$$
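The sum above can be written out directly as nested loops. The sketch below assumes stride 1 and no padding, and uses plain Python lists; the name `conv2d_valid` and the sample values are illustrative. Note that the formula indexes `I(i+m, j+n)` without flipping the kernel, which is strictly cross-correlation and is what deep learning frameworks usually implement under the name convolution.

```python
def conv2d_valid(image, kernel):
    """Apply `kernel` to `image` (2D lists) with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            # Sum over m and n, matching the double sum in the formula.
            for m in range(kh):
                for n in range(kw):
                    total += image[i + m][j + n] * kernel[m][n]
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [0, 1, 2, 3],
]
kernel = [
    [1, 0],
    [0, -1],
]
print(conv2d_valid(image, kernel))
# [[-4, -4, 2], [-4, -4, 4], [6, 6, 6]]
```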

Feature Map Size Calculation

For an input of size (H × W) with kernel size K, stride S, and padding P:

$$\text{Output Height} = \left\lfloor \frac{H - K + 2P}{S} \right\rfloor + 1$$

$$\text{Output Width} = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$
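A small helper makes the size formula easy to check for common settings. This is a sketch: the function name is made up, and the floor division is an assumption that covers strides that do not divide evenly, matching the default behavior of common frameworks.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution or pooling layer."""
    return (size - kernel + 2 * padding) // stride + 1


# 3x3 kernel, stride 1, padding 1 keeps the spatial size unchanged.
print(conv_output_size(32, kernel=3, stride=1, padding=1))   # 32
# 2x2 pooling with stride 2 halves it.
print(conv_output_size(32, kernel=2, stride=2, padding=0))   # 16
# 7x7 kernel, stride 2, padding 3 on a 224-pixel input.
print(conv_output_size(224, kernel=7, stride=2, padding=3))  # 112
```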

Common Activation Functions

| Function | Equation | Properties |
| --- | --- | --- |
| ReLU | $f(x) = \max(0, x)$ | Fast computation, helps with vanishing gradient |
| Leaky ReLU | $f(x) = \max(\alpha x, x)$ where $\alpha$ is small | Prevents “dying ReLU” problem |
| Sigmoid | $f(x) = \frac{1}{1 + e^{-x}}$ | Outputs between 0 and 1, useful for binary classification |
| Tanh | $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | Outputs between -1 and 1, zero-centered |
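The four activation functions above translate almost one-to-one into scalar Python functions. This is a sketch for intuition only; the default `alpha=0.01` for Leaky ReLU is a commonly used value and an assumption here, not something the table fixes.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    # For negative x, alpha * x is larger than x, so a small slope is kept.
    return max(alpha * x, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


for value in (-2.0, 0.0, 2.0):
    print(value, relu(value), leaky_relu(value), sigmoid(value), tanh(value))
```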

Step-by-Step CNN Architecture Design

1. Input Layer Configuration

  • Define input dimensions (height, width, channels)
  • Normalize pixel values (typically to [0,1] or [-1,1])
  • Consider data augmentation strategies
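The two normalization ranges mentioned above can be sketched as follows, assuming 8-bit pixel intensities in [0, 255]. The function names are illustrative.

```python
def scale_to_unit(pixels):
    """Map 8-bit pixel values from [0, 255] to [0, 1]."""
    return [p / 255.0 for p in pixels]


def scale_to_signed(pixels):
    """Map 8-bit pixel values from [0, 255] to [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]


row = [0, 255]
print(scale_to_unit(row))    # [0.0, 1.0]
print(scale_to_signed(row))  # [-1.0, 1.0]
```

Whichever range is chosen, the same transform should be applied at training and inference time.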

2. Feature Extraction Block Design

  • Select kernel sizes (typical: 3×3, 5×5, 7×7)
  • Decide number of filters (powers of 2: 32, 64, 128, etc.)
  • Choose appropriate stride and padding
  • Add activation function (typically ReLU)
  • Apply batch normalization (optional)
  • Include pooling layer (typical size: 2×2)

3. Layer Stacking Strategy

  • Start with fewer filters in early layers
  • Increase filter count as network deepens
  • Decrease spatial dimensions progressively
  • Consider residual connections for deeper networks
  • Add regularization (dropout) to prevent overfitting
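The stacking pattern above (more filters, smaller feature maps) can be traced with simple arithmetic. The sketch below assumes each block is a 3×3 convolution with padding 1 followed by 2×2 pooling, and a 32×32 RGB input; `trace_shapes` is a made-up helper name.

```python
def trace_shapes(height, width, channels, filters_per_block,
                 kernel=3, padding=1, pool=2):
    """Return the (height, width, channels) shape after each conv + pool block."""
    shapes = [(height, width, channels)]
    for filters in filters_per_block:
        # Stride-1 convolution: size - kernel + 2 * padding + 1.
        height = height - kernel + 2 * padding + 1
        width = width - kernel + 2 * padding + 1
        # Non-overlapping pooling divides each spatial dimension by `pool`.
        height //= pool
        width //= pool
        shapes.append((height, width, filters))
    return shapes


shapes = trace_shapes(32, 32, 3, filters_per_block=[32, 64, 128])
print(shapes)
# [(32, 32, 3), (16, 16, 32), (8, 8, 64), (4, 4, 128)]

h, w, c = shapes[-1]
print(h * w * c)  # 2048
```

Under these assumptions the final feature map flattens to 4 × 4 × 128 = 2048 values, which is the input size of the fully connected layer in the PyTorch example in this cheat sheet.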

4. Classification Block Design

  • Flatten the output of convolutional layers
  • Add fully connected layers with appropriate dimensions
  • Include dropout between FC layers (typical rate: 0.5)
  • Use appropriate activation for output layer:
    • Softmax for multi-class classification
    • Sigmoid for binary classification
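To show how the two output activations differ, here is a minimal sketch of softmax for a vector of class scores next to sigmoid for a single score. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick; the sample logits are arbitrary.

```python
import math


def softmax(logits):
    """Convert a list of scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logit):
    """Convert a single score into a probability for the positive class."""
    return 1.0 / (1.0 + math.exp(-logit))


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(sigmoid(0.0))  # 0.5
```

Softmax outputs compete with each other (raising one lowers the rest), whereas a sigmoid output is independent of any other unit.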

5. Training Configuration

  • Select appropriate loss function
  • Choose optimizer (Adam, SGD with momentum)
  • Set learning rate and schedule
  • Define batch size and number of epochs
  • Implement early stopping criteria
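One possible way to implement the early stopping criterion from the list above is a small patience counter on validation loss. The class name, the `patience` and `min_delta` parameters, and the loss values are all illustrative assumptions; frameworks ship their own callbacks for this.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses for seven epochs.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print(stopped_at)         # 5
print(stopper.best_loss)  # 0.65
```

In this run training halts at epoch 5, before the later improvement to 0.60 is ever seen, which is the usual trade-off when choosing `patience`.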

Popular CNN Architectures

| Architecture | Year | Key Innovation | Depth | Parameters | Accuracy (ImageNet) |
| --- | --- | --- | --- | --- | --- |
| LeNet-5 | 1998 | Pioneer CNN for digits | 5 layers | 60K | N/A (MNIST: 99.2%) |
| AlexNet | 2012 | ReLU, dropout, GPU training | 8 layers | 60M | 63.3% (Top-1) |
| VGG | 2014 | Small filters (3×3), deeper networks | 16-19 layers | 138M | 74.4% (Top-1) |
| GoogLeNet/Inception | 2014 | Inception modules, 1×1 convolutions | 22 layers | 6.8M | 74.8% (Top-1) |
| ResNet | 2015 | Residual connections | 18-152 layers | 11.7M-60M | 82.9% (Top-1, ResNet-152) |
| MobileNet | 2017 | Depthwise separable convolutions | 28 layers | 4.2M | 70.6% (Top-1) |
| EfficientNet | 2019 | Compound scaling method | Varies | 5.3M-66M | 84.3% (Top-1, B7) |

Implementation Considerations

Code Example: Basic CNN in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Conv Layer Block 1
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2)
        
        # Conv Layer Block 2
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2)
        
        # Conv Layer Block 3
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.relu3 = nn.ReLU()
        self.pool3 = nn.MaxPool2d(kernel_size=2)
        
        # Fully Connected Layer
        # 128 * 4 * 4 assumes 32x32 inputs: three 2x2 poolings give 32 -> 16 -> 8 -> 4
        self.fc = nn.Linear(128 * 4 * 4, num_classes)
        
    def forward(self, x):
        # Conv Layer Block 1
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu1(x)
        x = self.pool1(x)
        
        # Conv Layer Block 2
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu2(x)
        x = self.pool2(x)
        
        # Conv Layer Block 3
        x = self.conv3(x)
        x = self.bn3(x)
        x = self.relu3(x)
        x = self.pool3(x)
        
        # Flatten and Pass to Fully Connected Layer
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        
        return x

Code Example: Basic CNN in TensorFlow/Keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization
from tensorflow.keras.layers import Flatten, Dense, Dropout

def create_simple_cnn(input_shape=(32, 32, 3), num_classes=10):
    model = Sequential([
        # Conv Layer Block 1
        Conv2D(32, kernel_size=3, padding='same', activation='relu', input_shape=input_shape),
        BatchNormalization(),
        MaxPooling2D(pool_size=2),
        
        # Conv Layer Block 2
        Conv2D(64, kernel_size=3, padding='same', activation='relu'),
        BatchNormalization(),
        MaxPooling2D(pool_size=2),
        
        # Conv Layer Block 3
        Conv2D(128, kernel_size=3, padding='same', activation='relu'),
        BatchNormalization(),
        MaxPooling2D(pool_size=2),
        
        # Fully Connected Layers
        Flatten(),
        Dropout(0.5),
        Dense(num_classes, activation='softmax')
    ])
    
    return model

Common Challenges and Solutions

Challenge: Overfitting

Solutions:

  • Data augmentation (rotations, flips, scales, crops)
  • Dropout regularization (typically 0.5 for dense layers, 0.1-0.3 for conv layers)
  • L1/L2 regularization on weights
  • Early stopping based on validation loss
  • Use transfer learning with pre-trained models
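Of the remedies listed above, dropout is simple enough to sketch in plain Python. This is the "inverted" variant, which rescales surviving activations during training so nothing changes at inference time; the function signature and the seeded `random.Random` are assumptions made for the example.

```python
import random


def dropout(values, rate, rng, training=True):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)  # seeded only to make the example repeatable
activations = [1.0] * 8

print(dropout(activations, rate=0.5, rng=rng))                  # mix of 0.0 and 2.0
print(dropout(activations, rate=0.5, rng=rng, training=False))  # unchanged
```

Because survivors are scaled up, the expected value of each activation stays the same between training and inference.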

Challenge: Vanishing/Exploding Gradients

Solutions:

  • Use ReLU or variants (Leaky ReLU, ELU)
  • Apply batch normalization
  • Implement residual connections
  • Use proper weight initialization (He for ReLU, Xavier/Glorot for tanh)
  • Gradient clipping during training
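Two of the items above reduce to short formulas: the standard deviation used by He and Xavier/Glorot initialization (normal-distribution variants), and gradient clipping by global norm. The sketch below treats gradients as a flat list of floats, which is a simplification; function names are illustrative.

```python
import math


def he_std(fan_in):
    """Standard deviation for He (Kaiming) normal initialization."""
    return math.sqrt(2.0 / fan_in)


def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier/Glorot normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def clip_by_global_norm(grads, max_norm):
    """Rescale `grads` so their L2 norm does not exceed `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


# For a conv layer, fan_in is kernel_height * kernel_width * in_channels.
print(he_std(3 * 3 * 64))
print(xavier_std(3 * 3 * 64, 3 * 3 * 128))

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # approximately [0.6, 0.8]
```

Clipping preserves the direction of the gradient and only shrinks its length, so it limits exploding updates without changing where the step points.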

Challenge: Limited Training Data

Solutions:

  • Transfer learning from pre-trained models
  • Extensive data augmentation
  • Synthetic data generation
  • Few-shot learning techniques
  • Self-supervised learning approaches

Challenge: Computational Efficiency

Solutions:

  • Depthwise separable convolutions
  • Network pruning and quantization
  • Knowledge distillation
  • Low-rank factorization of convolutions
  • Efficient architecture design (MobileNet, ShuffleNet)
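The saving from depthwise separable convolutions, the first item above, can be estimated by counting weights. The sketch ignores biases and uses an arbitrary layer (3×3 kernel, 64 input channels, 128 output channels) chosen for illustration.

```python
def standard_conv_weights(kernel, in_channels, out_channels):
    """Weights in a standard convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


def separable_conv_weights(kernel, in_channels, out_channels):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv (biases ignored)."""
    depthwise = kernel * kernel * in_channels   # one k x k filter per input channel
    pointwise = in_channels * out_channels      # 1x1 conv that mixes channels
    return depthwise + pointwise


standard = standard_conv_weights(3, 64, 128)
separable = separable_conv_weights(3, 64, 128)
print(standard)   # 73728
print(separable)  # 8768
print(round(standard / separable, 1))  # 8.4
```

For this layer the separable version needs roughly 8 times fewer weights; the ratio approaches the kernel area (9 for 3×3) as the number of output channels grows.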

Challenge: Class Imbalance

Solutions:

  • Weighted loss functions
  • Oversampling minority classes
  • Undersampling majority classes
  • Generate synthetic samples (SMOTE)
  • Focal loss to focus on hard examples
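Two of the remedies above can be sketched numerically: inverse-frequency class weights for a weighted loss, and the focal loss term for a single example. The weighting scheme `total / (num_classes * count)` is one common convention and an assumption here; `p_true` denotes the predicted probability of the correct class, and `gamma=2.0` is a commonly used default.

```python
import math


def inverse_frequency_weights(class_counts):
    """Weight each class by total / (num_classes * count), so rare classes weigh more."""
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [total / (num_classes * count) for count in class_counts]


def cross_entropy(p_true):
    """Cross-entropy for one example, given the probability of the correct class."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one example: down-weights well-classified examples."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


# Hypothetical dataset with 900 majority and 100 minority samples.
print(inverse_frequency_weights([900, 100]))

# An easy example (p = 0.9) versus a hard one (p = 0.1).
for p in (0.9, 0.1):
    print(p, cross_entropy(p), focal_loss(p))
```

With `gamma=2`, the easy example's loss is scaled by about 0.01 while the hard example keeps about 0.81 of its cross-entropy, which is how focal loss shifts attention to hard examples.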

Advanced Techniques and Extensions

1. Transfer Learning Approaches

  • Feature extraction (freeze pre-trained base)
  • Fine-tuning (update all or part of the pre-trained weights)
  • Progressive fine-tuning (gradually unfreeze layers, starting from those closest to the output)
  • Domain adaptation techniques

2. Object Detection Frameworks

  • Region-based: R-CNN, Fast R-CNN, Faster R-CNN
  • Single Shot: SSD, YOLO, RetinaNet
  • Anchor-free: CenterNet, FCOS
  • Transformer-based: DETR

3. Semantic Segmentation Architectures

  • FCN (Fully Convolutional Networks)
  • U-Net (Encoder-Decoder with skip connections)
  • DeepLab (Atrous convolutions, ASPP)
  • Mask R-CNN (Instance segmentation)

4. Attention Mechanisms

  • Channel attention (Squeeze-and-Excitation)
  • Spatial attention
  • Self-attention and transformers
  • Non-local neural networks

5. Recent Innovations

  • Vision Transformers (ViT)
  • MLP-Mixer architectures
  • Neural Architecture Search (NAS)
  • Once-for-all networks
  • Contrastive learning approaches

Best Practices and Practical Tips

Architecture Design

  • Start with established architectures before customizing
  • Use 3×3 kernels for most convolutions (following VGG principle)
  • Double channels when spatial dimensions are halved
  • Add batch normalization before activation
  • Use global average pooling instead of flattening when possible
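The last tip can be illustrated with a short sketch: global average pooling collapses each channel to a single mean, which shrinks the classifier that follows. The 4×4×128 feature map and 10 classes mirror the example networks in this cheat sheet; the helper names are made up.

```python
def global_average_pool(feature_maps):
    """Reduce a list of channels (each a 2D list) to one mean value per channel."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def classifier_params(in_features, num_classes):
    """Weights plus biases of a single dense classification layer."""
    return in_features * num_classes + num_classes


two_channels = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
print(global_average_pool(two_channels))  # [4.0, 1.0]

# Dense layer after flattening a 4x4x128 map vs. after global average pooling.
print(classifier_params(4 * 4 * 128, 10))  # 20490
print(classifier_params(128, 10))          # 1290
```

Besides the smaller classifier, pooling to one value per channel makes the head independent of the input resolution.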

Training Procedures

  • Learning rate: Start with 1e-3 for Adam, 0.1 for SGD
  • Implement learning rate schedules (step, cosine, reduce on plateau)
  • Batch size: Start with 32-128, adjust based on GPU memory
  • Use mixed-precision training for larger models
  • Monitor gradient norms to detect training instabilities
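Two of the learning rate schedules named above, step decay and cosine annealing, can be written as plain functions of the epoch. The defaults (`step_size=30`, `gamma=0.1`, `min_lr=0.0`) are illustrative assumptions; framework schedulers offer the same idea with more options.

```python
import math


def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    """Multiply the learning rate by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


def cosine_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    """Anneal from `base_lr` to `min_lr` along half a cosine wave."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 29, 30, 60, 90):
    print(epoch, step_lr(0.1, epoch), cosine_lr(0.1, epoch, total_epochs=100))
```

Step decay drops the rate abruptly at fixed epochs, while the cosine schedule lowers it smoothly and reaches `min_lr` at the final epoch.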

Hyperparameter Tuning

  • Prioritize learning rate and regularization strength
  • Use learning rate finder to identify optimal range
  • Consider automated hyperparameter optimization (Bayesian)
  • Implement cross-validation for smaller datasets
  • Track multiple metrics, not just accuracy

Model Deployment

  • Export models using ONNX for cross-platform compatibility
  • Consider TensorRT, TensorFlow Lite, or Core ML for optimization
  • Quantize models to reduce inference time and memory
  • Implement model versioning and A/B testing
  • Monitor inference time and resource usage
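To give a feel for what quantization does to weights, here is a minimal sketch of symmetric per-tensor int8 quantization. It is a simplified illustration under stated assumptions (one scale for the whole tensor, range [-127, 127], made-up weights), not the scheme used by any particular deployment toolkit.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.05, 0.0, 0.4]  # made-up weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)
print(max(abs(w - r) for w, r in zip(weights, restored)))  # at most about scale / 2
```

Each weight is stored in one byte instead of four, at the cost of a rounding error bounded by roughly half the scale.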

Resources for Further Learning

Books

  • “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
  • “Computer Vision: Algorithms and Applications” by Richard Szeliski
  • “Deep Learning for Computer Vision” by Rajalingappaa Shanmugamani
  • “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow” by Aurélien Géron

Courses

  • CS231n: Convolutional Neural Networks for Visual Recognition (Stanford)
  • Deep Learning Specialization, Course 4: Convolutional Neural Networks (Coursera/deeplearning.ai)
  • Practical Deep Learning for Coders (fast.ai)
  • Computer Vision Nanodegree (Udacity)

Research Papers

  • “ImageNet Classification with Deep Convolutional Neural Networks” (AlexNet, 2012)
  • “Very Deep Convolutional Networks for Large-Scale Image Recognition” (VGG, 2014)
  • “Deep Residual Learning for Image Recognition” (ResNet, 2015)
  • “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks” (2019)
  • “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale” (ViT, 2020)

Online Resources

  • PyTorch Vision Documentation and Tutorials
  • TensorFlow Computer Vision Tutorials
  • Papers with Code (Computer Vision section)
  • ModelZoo.co pre-trained model repository
  • Distill.pub visual explanations of deep learning concepts

Remember: CNNs are powerful but require thoughtful implementation. Start with simple architectures and gradually increase complexity as needed. Always validate your models thoroughly and consider computational constraints for real-world deployment.
