The Evolution of Convolutional Neural Networks in Image Classification
Before the advent of Convolutional Neural Networks (CNNs), the conventional method for training neural networks to categorize images involved flattening the images into pixel lists and processing them through feed-forward neural networks. This approach, however, resulted in the loss of crucial spatial information inherent to the images.
In 1989, Yann LeCun and his team unveiled CNNs, which have since been the cornerstone of Computer Vision research. Unlike traditional feed-forward networks, CNNs maintain the two-dimensional structure of images, allowing for spatial processing of information.
This article provides a historical overview of CNNs specifically in the realm of Image Classification, tracing their evolution from early research in the 1990s to the prolific advancements of the mid-2010s, when some of the most innovative Deep Learning architectures emerged. We will also look at the latest developments in CNN research as they vie with attention mechanisms and vision transformers.
For a visual explanation of the concepts discussed in this article, refer to the accompanying YouTube video. Unless otherwise indicated, all images and illustrations are created by the author for the video version.
Basics of Convolutional Neural Networks
The core of a CNN lies in the convolution operation. This involves sliding a filter across the image and computing the dot product at each overlapping section. The resulting output is termed a feature map, which illustrates the presence and location of the filter pattern within the image.
In a convolutional layer, multiple filters are trained to extract various feature maps from the input image. By stacking several convolutional layers in sequence with non-linear activations, we construct a convolutional neural network (CNN).
Each convolutional layer performs two essential tasks: 1. Spatial filtering through convolution operations between images and kernels. 2. Combining multiple input channels to produce a new set of channels.
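To make these two tasks concrete, here is a minimal PyTorch sketch (the channel counts and image size are arbitrary choices for illustration): a single convolutional layer slides 3x3 filters over a 3-channel input and mixes the channels into 16 output feature maps.

```python
import torch
import torch.nn as nn

# One convolutional layer does both jobs at once: 3x3 spatial filtering and
# mixing the 3 input channels into 16 output channels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 32, 32)       # a random stand-in for an RGB image batch
feature_maps = torch.relu(conv(x))  # non-linear activation follows the convolution
print(feature_maps.shape)           # torch.Size([1, 16, 32, 32])
```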
A significant portion of CNN research has focused on enhancing these two aspects.
The 1989 Paper
The foundational 1989 paper demonstrated how to train non-linear CNNs from scratch using backpropagation. The researchers used 16x16 grayscale images of handwritten digits, passing them through two convolutional layers with 12 filters of size 5x5. The filters operated with a stride of 2, downsampling the input images. After the convolutional layers, the output feature maps were flattened and processed through two fully connected layers to produce scores for the ten digit classes. The network was optimized to predict the handwritten digit labels, and the tanh nonlinearity after each layer allowed for more complex and expressive feature maps. With only 9,760 parameters, this network was tiny compared to contemporary models, which often contain hundreds of millions of parameters.
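For illustration, here is a rough PyTorch sketch that approximates the network described above (16x16 input, two strided 5x5 convolutions with tanh, a 30-unit hidden layer, ten outputs). The padding choices are my own assumption, and the original paper used a hand-designed sparse connectivity that this sketch ignores.

```python
import torch
import torch.nn as nn

class DigitNet1989(nn.Module):
    """Approximation of the 1989 digit classifier described above."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, kernel_size=5, stride=2, padding=2)   # 16x16 -> 8x8
        self.conv2 = nn.Conv2d(12, 12, kernel_size=5, stride=2, padding=2)  # 8x8 -> 4x4
        self.fc1 = nn.Linear(12 * 4 * 4, 30)
        self.fc2 = nn.Linear(30, 10)

    def forward(self, x):
        x = torch.tanh(self.conv1(x))
        x = torch.tanh(self.conv2(x))
        x = torch.flatten(x, start_dim=1)
        x = torch.tanh(self.fc1(x))
        return self.fc2(x)            # one score per digit class

logits = DigitNet1989()(torch.randn(1, 1, 16, 16))
print(logits.shape)  # torch.Size([1, 10])
```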
Inductive Bias
Inductive bias refers to the deliberate incorporation of rules and constraints into the learning process, steering models away from purely general-purpose solutions and toward ones that reflect human-like assumptions about the problem.
When humans classify images, they employ spatial filtering to identify common patterns and create multiple representations, which are then combined to form predictions. CNN architecture mimics this behavior. In feed-forward networks, each pixel is treated as an independent feature, as each neuron connects with all pixels; conversely, CNNs utilize parameter sharing, allowing the same filter to scan the entire image. Inductive biases make CNNs less reliant on extensive datasets, as they inherently recognize local patterns due to their structural design, whereas feed-forward networks must learn these patterns from scratch.
LeNet-5 (1998)
In 1998, Yann LeCun and his team introduced LeNet-5, a deeper and larger 7-layer CNN model. This architecture used pooling (subsampling) layers, which downsample feature maps by aggregating the values within a sliding 2x2 window; the original LeNet-5 averaged the window values, while later CNNs popularized max pooling, which keeps only the maximum value in each window.
Local Receptive Field
When training a 3x3 convolutional layer, each neuron connects to a 3x3 region in the original image; this is referred to as the neuron's local receptive field. When this feature map is processed through another 3x3 layer, it indirectly creates a receptive field of a larger 5x5 region from the original image. Additionally, downsampling through max-pooling or strided convolution increases the receptive field, enabling deeper layers to access the input image more comprehensively.
Consequently, early layers in a CNN focus on low-level features such as edges, while later layers capture broader, more global patterns.
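To make the receptive-field arithmetic concrete, here is a small helper (a sketch that assumes a plain stack of convolution/pooling layers with no dilation): each layer enlarges the receptive field by (kernel size - 1) times the product of all earlier strides.

```python
# layers: list of (kernel_size, stride) tuples, applied in order.
def receptive_field(layers):
    rf, jump = 1, 1  # current receptive field and cumulative stride ("jump")
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# Two stacked 3x3 convolutions see a 5x5 region of the original image.
print(receptive_field([(3, 1), (3, 1)]))          # 5
# Inserting a stride-2 downsampling step makes the next layer see much more.
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```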
The Drought (1998–2012)
Despite the impressive capabilities of LeNet-5, researchers in the early 2000s regarded neural networks as computationally expensive and data-hungry. Overfitting posed another challenge: complex networks memorized their training datasets but struggled with unseen data. As a result, researchers turned their attention to traditional machine learning algorithms, such as support vector machines, which offered better performance on smaller datasets at lower computational cost.
ImageNet Dataset (2009)
The ImageNet dataset was made publicly accessible in 2009, containing 3.2 million annotated images across more than 1,000 categories. Currently, it encompasses over 14 million images with more than 20,000 annotated classes. From 2010 to 2017, the ILSVRC competition took place annually, where research groups aimed to surpass benchmarks on a subset of the ImageNet dataset. Traditional ML methods, like Support Vector Machines, dominated the competition in 2010 and 2011, but starting in 2012, CNNs took the lead. The primary metric for ranking networks was the top-5 error rate, which measures the percentage of instances where the true class label was absent from the top five predictions made by the network.
AlexNet (2012)
AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won ILSVRC 2012 with a top-5 test error rate of 15.3%, far ahead of the runner-up. The following are three key contributions of AlexNet:
- Multi-scale Kernels: AlexNet operated on 224x224 RGB images and employed multiple kernel sizes, including 11x11, 5x5, and 3x3. In contrast, models like LeNet-5 exclusively used 5x5 kernels. Larger kernels, while computationally more demanding due to additional weights, are capable of capturing broader patterns within images. Due to these large kernels, AlexNet contained over 60 million trainable parameters, which could lead to overfitting.
- Dropout: To combat overfitting, AlexNet incorporated a regularization technique known as Dropout. During training, a subset of neurons in each layer is randomly set to zero, preventing the network from overly relying on specific neurons or groups. This encourages all neurons to learn generalizable features suitable for classification.
- ReLU: AlexNet replaced the tanh nonlinearity with ReLU, an activation function that sets negative values to zero while preserving positive values. The tanh function can saturate in deeper networks, leading to diminished gradients and slower optimization. ReLU provides a consistent gradient signal, enabling the network to train approximately six times faster than with tanh.
AlexNet also introduced the idea of Local Response Normalization and methods for distributed CNN training.
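As an illustration of how ReLU and Dropout fit together in practice, here is a sketch of an AlexNet-style classifier head in PyTorch. The layer sizes roughly follow the published architecture, but the exact placement of Dropout relative to ReLU is a simplification.

```python
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),   # randomly zeroes half the activations during training
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 1000),  # one logit per ImageNet class
)
```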
GoogLeNet / Inception (2014)
In 2014, GoogLeNet achieved a top-5 error rate of 6.67% on ImageNet. The primary feature of GoogLeNet was the inception module, which runs parallel convolutional layers of varying filter sizes (1x1, 3x3, 5x5) alongside a max-pooling branch. Each inception module applies these kernels to the same input and concatenates the outputs, integrating both low-level and mid-level features.
1x1 Convolution
GoogLeNet also made heavy use of 1x1 convolutional layers. These kernels perform pointwise convolutions: at every spatial location, each output channel is computed as a learned weighted sum of the input channels, with no spatial filtering involved.
While larger kernels (3x3 and 5x5) handle both spatial filtering and channel combination, 1x1 kernels specialize in channel mixing and require far fewer weights. For instance, mapping 3 input channels to 4 output channels takes only 12 weights with 1x1 kernels, whereas 3x3 kernels would need 108, as verified in the sketch below.
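The weight counts are easy to check in PyTorch (bias terms are disabled so only kernel weights are counted):

```python
import torch.nn as nn

# Mapping 3 input channels to 4 output channels:
pointwise = nn.Conv2d(3, 4, kernel_size=1, bias=False)
spatial   = nn.Conv2d(3, 4, kernel_size=3, bias=False)

print(sum(p.numel() for p in pointwise.parameters()))  # 12  (3 * 4 * 1 * 1)
print(sum(p.numel() for p in spatial.parameters()))    # 108 (3 * 4 * 3 * 3)
```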
Dimensionality Reduction
GoogLeNet employs 1x1 convolution layers for dimensionality reduction, shrinking the number of channels before the more expensive spatial filtering with 3x3 and 5x5 convolutions. This approach reduces the total number of trainable weights compared to AlexNet.
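Here is a sketch of that reduction pattern (the channel counts are illustrative, not taken from the paper): a cheap 1x1 convolution shrinks 256 channels to 64 before the expensive 3x3 spatial filtering, cutting the 3x3 layer's weights by a factor of four.

```python
import torch
import torch.nn as nn

reduce_then_filter = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),             # channel reduction
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, padding=1),  # spatial filtering on fewer channels
    nn.ReLU(inplace=True),
)
print(reduce_then_filter(torch.randn(1, 256, 28, 28)).shape)  # torch.Size([1, 128, 28, 28])
```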
VGGNet (2014)
The VGG network argues that large kernels, such as 5x5 or 7x7, are unnecessary: two stacked 3x3 convolution layers cover the same receptive field as a single 5x5 layer, and three stacked 3x3 layers match the receptive field of a single 7x7 layer.
Deep 3x3 Convolution Layers achieve the same receptive field as larger kernels with fewer parameters.
For example, a single 5x5 filter requires training 25 weights, whereas two 3x3 filters only require 18 weights. Likewise, a 7x7 filter trains 49 weights, while three 3x3 filters only train 27. This approach of employing deep 3x3 convolution layers became a standard practice in CNN architecture.
Batch Normalization (2015)
Deep neural networks can suffer from a problem known as "Internal Covariate Shift" during training: as the earlier layers of the network continuously update, the later layers must keep adapting to the shifting input distributions they receive.
Batch Normalization addresses this challenge by normalizing the inputs of each layer to zero mean and unit standard deviation during training. A batch normalization layer can be inserted after any convolution layer: it subtracts the mean of each feature map across the mini-batch, divides by the standard deviation, and then applies a learnable scale and shift. This gives each layer inputs with a much more stable distribution throughout training.
Advantages of Batch Normalization: 1. Accelerates convergence by approximately 14 times. 2. Enables the use of higher learning rates. 3. Enhances robustness to the initial weights of the network.
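Here is a simplified sketch of what a batch normalization layer computes during training (real implementations also track running statistics for use at inference time):

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, channels, height, width); statistics are computed per channel
    # over the batch and spatial dimensions.
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta  # learnable scale and shift restore expressivity

x = torch.randn(8, 16, 32, 32)
gamma = torch.ones(1, 16, 1, 1)
beta = torch.zeros(1, 16, 1, 1)
print(batch_norm(x, gamma, beta).mean(dim=(0, 2, 3)))  # per-channel means are ~0
```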
ResNets (2016)
Deep Networks and Identity Mapping
Consider a shallow neural network that performs well on a classification task. If 100 new convolutional layers are added, the training accuracy may decrease!
This result is counterintuitive: in principle, the additional layers could simply learn the identity function and reproduce the shallow network's output, keeping accuracy at least as high. In practice, however, very deep networks are hard to train, because gradients can vanish or destabilize as they are backpropagated through many layers. ReLU and batch normalization made training 22-layer CNNs feasible, and in 2015 Microsoft Research introduced ResNets, enabling stable training of CNNs with over 150 layers.
Residual Learning
In this architecture, the input passes through one or more CNN layers, and the original input is added back to the final output. These are termed residual blocks, as they need not learn the final output feature maps in the conventional sense; rather, they learn the residual features that, when added to the input, yield the final feature maps. If the weights of the intermediate layers become zero, the residual block simply returns the identity function, allowing it to replicate the input X.
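A minimal residual block looks like the following sketch (simplified to the case where input and output shapes match, so no projection shortcut is needed):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(torch.relu(self.bn1(self.conv1(x)))))
        return torch.relu(x + residual)  # learn F(x), output F(x) + x

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```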
Easy Gradient Flow
During backpropagation, gradients can flow directly through these shortcut paths to reach earlier layers more quickly, mitigating gradient vanishing issues. ResNets integrate many of these blocks to construct exceptionally deep networks without sacrificing accuracy.
With these significant advancements, ResNets successfully trained models up to 152 layers deep and won ILSVRC 2015 with a record-setting top-5 error rate of 3.57%.
DenseNet (2017)
DenseNets employ shortcut paths connecting earlier layers to subsequent layers within the network. Each DenseNet block consists of a series of convolution layers, where the output of every layer is concatenated with the feature maps from all preceding layers in the block before proceeding to the next layer. This design allows each layer to contribute a minimal number of new feature maps to the network's "collective knowledge" as the image progresses through it. DenseNets facilitate improved information and gradient flow throughout the network, as every layer has direct access to the gradients from the loss function.
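The concatenation pattern can be sketched as follows (a simplified dense block: the growth rate and layer count are illustrative, and the batch-norm and bottleneck layers of the real architecture are omitted):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate=12, num_layers=3):
        super().__init__()
        # Layer i receives all previously produced feature maps as input.
        self.layers = nn.ModuleList([
            nn.Conv2d(in_channels + i * growth_rate, growth_rate, 3, padding=1)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = torch.relu(layer(torch.cat(features, dim=1)))
            features.append(out)  # contribute a few new maps to the collection
        return torch.cat(features, dim=1)

print(DenseBlock(16)(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 52, 32, 32])
```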
Squeeze and Excitation Network (2017)
SENet (the Squeeze-and-Excitation Network) won the final ILSVRC competition in 2017 by introducing the Squeeze-and-Excitation (SE) block into CNNs. The SE block explicitly models dependencies among all the channels of a feature map. In standard CNNs, each channel is computed independently of the others; SENet instead uses an attention-like mechanism to make every channel aware of global properties of the input image. The 154-layer SENet-154 model achieved an impressive top-5 error rate of 4.47%.
Squeeze Operation
The squeeze operation compresses the spatial dimensions of the input feature map into a channel descriptor through global average pooling. As each channel contains neurons that capture local image properties, the squeeze operation gathers global insights about each channel.
Excitation Operation
The excitation operation passes the channel descriptor through a small gating network, two fully connected layers followed by a sigmoid, to produce a weight for every channel, and then rescales the input feature maps by channel-wise multiplication with these weights. This effectively spreads global-level information back to each channel, contextualizing every channel with respect to the others in the feature map.
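Put together, a squeeze-and-excitation block can be sketched like this (the reduction ratio of 16 is the paper's default; the rest is a simplified rendering):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        squeezed = x.mean(dim=(2, 3))                   # squeeze: one descriptor per channel
        weights = self.gate(squeezed).view(b, c, 1, 1)  # excitation: gating network
        return x * weights                              # channel-wise rescaling

print(SEBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```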
MobileNet (2017)
Convolutional layers perform two primary functions: 1) filtering spatial information and 2) combining them channel-wise. The MobileNet paper introduced Depthwise Separable Convolution, which divides these two processes into separate layers—Depthwise Convolution for filtering and pointwise convolution for channel combination.
Depthwise Convolution
Given an input with M channels, depthwise convolution layers train M 3x3 convolutional kernels. Unlike standard convolution layers that apply convolution across all feature maps, depthwise convolution layers use filters that convolve only one feature map at a time. Subsequently, 1x1 pointwise convolution filters are utilized to integrate all these feature maps. This separation of filtering and combining significantly reduces the number of weights, making the model lightweight while preserving performance.
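In PyTorch, this split can be sketched with the groups argument (the channel counts here are illustrative); the parameter counts show why the result is so much lighter than a standard convolution:

```python
import torch.nn as nn

def depthwise_separable(in_channels, out_channels):
    return nn.Sequential(
        # Depthwise: groups=in_channels makes each 3x3 filter see only one channel.
        nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1,
                  groups=in_channels, bias=False),
        # Pointwise: 1x1 convolution mixes the channels.
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
    )

separable = depthwise_separable(32, 64)
standard = nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False)
print(sum(p.numel() for p in separable.parameters()))  # 2,336  (32*9 + 32*64)
print(sum(p.numel() for p in standard.parameters()))   # 18,432 (32*64*9)
```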
MobileNetV2 (2018)
MobileNetV2 enhanced the MobileNet architecture by introducing two innovations: Linear Bottlenecks and Inverted Residuals.
Linear Bottlenecks
MobileNetV2 employs 1x1 pointwise convolution for dimensionality reduction, followed by depthwise convolution for spatial filtering, and another 1x1 pointwise convolution layer to restore channel dimensions. These bottlenecks do not pass through ReLU, maintaining linearity. ReLU tends to zero out negative values from the dimensionality reduction step, which can lead to the loss of vital information if many lower-dimensional values are negative. Linear layers prevent excessive information loss during this bottleneck.
Inverted Residuals
The second innovation involves Inverted Residuals. Typically, residual connections are made between layers with the highest channel counts, but this design introduces shortcuts between the bottleneck layers. The bottleneck captures relevant information within a low-dimensional latent space, allowing for seamless information and gradient flow between these layers.
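Combining the two ideas, an inverted residual block can be sketched as follows (stride-1 case only; the expansion factor of 6 follows the paper, while the channel count is illustrative):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            # Expand: 1x1 pointwise convolution widens the representation.
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # Filter: depthwise 3x3 convolution does the spatial work.
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # Project: linear 1x1 bottleneck, deliberately without ReLU.
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # shortcut connects the narrow bottleneck ends

print(InvertedResidual(24)(torch.randn(1, 24, 56, 56)).shape)  # torch.Size([1, 24, 56, 56])
```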
Vision Transformers (2020)
Vision Transformers (ViTs) have demonstrated that transformers can outperform state-of-the-art CNNs in Image Classification tasks. Transformers and attention mechanisms offer a highly parallelizable, scalable, and versatile framework for modeling sequences. While neural attention is a distinct area of Deep Learning not covered in this article, further exploration can be found in a related YouTube video.
ViTs Use Patch Embeddings and Self-Attention
The input image is divided into fixed-size patches, each of which is embedded into a fixed-size vector, either through a CNN or a linear layer. These patch embeddings, along with their positional encodings, are then fed into a self-attention-based transformer encoder as a sequence of tokens. Self-attention models the relationships between patches, generating updated patch embeddings that are contextually aware of the entire image.
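The patch-embedding step can be sketched with a strided convolution (the 16x16 patch size and 768-dimensional embedding follow the common ViT-Base configuration; the class token and positional encodings are omitted here):

```python
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768
# A conv with kernel_size == stride == patch size cuts the image into
# non-overlapping patches and linearly projects each one.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)
patches = patch_embed(image)                 # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): a sequence of patch tokens
print(tokens.shape)
```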
Inductive Bias vs Generality
While CNNs impose various inductive biases regarding images, transformers operate without such biases—lacking localization and sliding kernels—relying instead on generality and raw computational power to model relationships among all patches. Self-attention layers facilitate global connectivity among all patches, regardless of their spatial distance. Inductive biases are advantageous for smaller datasets; however, the potential of transformers lies in their performance on extensive training datasets, where a general framework may ultimately outpace the inductive biases of CNNs.
ConvNext — A ConvNet for the 2020s (2022)
Though Swin Transformers are an interesting addition to this discussion, we will focus on one last CNN paper in this article.
Patchifying Images Like ViTs
The ConvNext model adopts a patching strategy inspired by Vision Transformers. A 4x4 convolution kernel with a stride of 4 generates a downsampled image that serves as input for the subsequent network layers.
Depthwise Separable Convolution
Similar to MobileNet, ConvNext utilizes depthwise separable convolution layers. The authors propose that depthwise convolution parallels the weighted sum operation in self-attention, which operates on a per-channel basis, mixing information solely in the spatial dimension. Additionally, the 1x1 pointwise convolutions resemble the channel mixing steps in self-attention.
Larger Kernel Sizes
While CNNs have predominantly employed 3x3 kernels since the introduction of VGG, ConvNext advocates for larger 7x7 filters to capture broader spatial contexts, aiming to approximate the fully global contexts captured by ViTs while retaining the localization characteristic of CNNs.
Additional adjustments, such as MobileNetV2-inspired inverted bottlenecks, GELU activations, layer normalization in place of batch normalization, and other modifications contribute to the overall ConvNext architecture.
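Putting these pieces together, a ConvNext-style block can be sketched as below. This is an approximation: GroupNorm with a single group stands in for ConvNext's channels-last LayerNorm, and details such as layer scale and stochastic depth are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNextBlock(nn.Module):
    def __init__(self, channels, expansion=4):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=7,
                                padding=3, groups=channels)   # 7x7 depthwise: spatial mixing
        self.norm = nn.GroupNorm(1, channels)                 # stand-in for LayerNorm
        self.pwconv1 = nn.Conv2d(channels, channels * expansion, kernel_size=1)
        self.pwconv2 = nn.Conv2d(channels * expansion, channels, kernel_size=1)

    def forward(self, x):
        out = self.norm(self.dwconv(x))
        out = self.pwconv2(F.gelu(self.pwconv1(out)))  # inverted bottleneck: channel mixing
        return x + out                                 # residual shortcut

print(ConvNextBlock(96)(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```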
Scalability
With depthwise separable convolutions, ConvNext is computationally efficient, and it scales to high-resolution images better than transformers: the cost of self-attention grows quadratically with sequence length, whereas convolution scales linearly with the number of pixels.
Final Thoughts!
The historical progression of CNNs provides profound insights into Deep Learning, inductive bias, and computational paradigms. It will be intriguing to observe whether the inductive biases of ConvNets or the generality of transformers ultimately prevail. Be sure to watch the accompanying YouTube video for a visual exploration of this article, along with the individual research papers referenced below.
References
CNN with Backprop (1989): http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf
LeNet-5: http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf
AlexNet: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
GoogLeNet: https://arxiv.org/abs/1409.4842
VGG: https://arxiv.org/abs/1409.1556
Batch Norm: https://arxiv.org/pdf/1502.03167
ResNet: https://arxiv.org/abs/1512.03385
DenseNet: https://arxiv.org/abs/1608.06993
Squeeze-and-Excitation Network: https://arxiv.org/abs/1709.01507
MobileNet: https://arxiv.org/abs/1704.04861
MobileNetV2: https://arxiv.org/abs/1801.04381
Vision Transformers: https://arxiv.org/abs/2010.11929
Swin Transformers: https://arxiv.org/abs/2103.14030
ConvNext: https://arxiv.org/abs/2201.03545