Convolutional Neural Networks Architecture and Advantages

Introduction

Convolutional neural networks (CNNs) are a powerful class of models that have had a significant influence on computer vision. They have produced state-of-the-art results on a variety of benchmarks and have driven major advances in automatically learning hierarchical representations from images.

What are Convolutional Neural Networks (CNNs)?

Convolutional Neural Networks (CNNs) are a specialized family of Artificial Neural Networks (ANNs) designed to process grid-like data such as images and videos. They excel at identifying patterns in visual data and extracting features from it. To do this, CNNs use convolutional layers to filter inputs for relevant information. They are used extensively in computer vision and in AI systems such as robots, virtual assistants, and self-driving vehicles. Before CNNs, object identification in images relied on laborious, manual feature extraction; CNNs provide a more scalable solution by learning the features directly from data.

How Convolutional Neural Networks Work: Components and Process

CNNs stand out for their strong performance on image, speech, and audio inputs. Their structure consists of an input layer, one or more hidden layers, and an output layer. The neurons in a CNN's layers are arranged in three dimensions (width, height, and depth), so each layer transforms a three-dimensional input volume into a three-dimensional output volume, whereas a conventional fully connected network treats its input as a flat vector. The hidden layers usually combine several layer types:

Input Layer: This is where the model receives the raw input data, which is usually an image. An image can be represented as a cuboid with a width, a height, and a depth (the channels, such as red, green, and blue). A 32x32x3 image, for example, would serve as the input volume.

Convolutional Layer: This fundamental component is where the majority of the computation takes place, allowing features to be extracted from the incoming data.

  • A set of learnable filters, also called kernels or feature detectors, is applied to the input. Each filter is a small matrix (e.g., 2×2, 3×3, 5×5) whose depth equals the depth of the input volume.
  • In the convolution operation, a kernel (filter) is combined with the input data to produce a transformed feature map.
  • During the forward pass, each filter slides across the input volume step by step. The stride sets how many pixels the kernel moves at each step.
  • A larger stride produces a smaller output.
  • Every step involves calculating a dot product between the input volume patch and the kernel weights.
  • The sequence of dot products produces an output called a convolved feature, feature map, or activation map. Each filter yields one such map, so using 12 filters, for instance, produces an output volume of depth 12.
  • The feature detector's weights stay constant as it scans the image; this is known as parameter sharing.
  • When the filter does not fit the input image, zero-padding adds zeros around the border to control the output size. The usual options are valid padding (no padding), same padding (output the same size as the input), and full padding.
  • CNNs stack several convolutional layers to reach higher levels of abstraction. Earlier layers concentrate on simple features such as colors and edges, while later layers build a feature hierarchy by recognizing larger parts and shapes. In bicycle detection, for instance, lower-level patterns correspond to individual parts (frame, handlebars) and higher-level patterns combine those parts.
  • Because the filter weights are learned, the filters automatically extract the information most relevant to the task, such as colour in bird recognition or shape in general object recognition.
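
The sliding dot product, stride, and zero-padding described above can be sketched in plain Python. This is a minimal illustration rather than how any framework implements it: it assumes a single-channel image stored as a list of rows and a square kernel, and the names conv2d, pad, and edge_kernel are chosen here for illustration. As in most deep learning libraries, the kernel is applied without flipping.

```python
def conv2d(image, kernel, stride=1, pad=0):
    """Slide a square kernel over a 2-D image (list of rows) and return the feature map."""
    k = len(kernel)  # assumption: the kernel is k x k
    width = len(image[0]) + 2 * pad
    # Zero-pad the border so the kernel can also be centred on edge pixels.
    padded = [[0] * width for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [[0] * width for _ in range(pad)]
    # Output size along each axis: (padded_size - k) // stride + 1
    out_h = (len(padded) - k) // stride + 1
    out_w = (width - k) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Dot product between the kernel weights and the current image patch.
            row.append(sum(
                kernel[a][b] * padded[top + a][left + b]
                for a in range(k)
                for b in range(k)
            ))
        feature_map.append(row)
    return feature_map


# The same weights are reused at every position (parameter sharing).
edge_kernel = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 9, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

valid = conv2d(image, edge_kernel)              # no padding: 5x5 -> 3x3
same = conv2d(image, edge_kernel, pad=1)        # one ring of zeros: 5x5 -> 5x5
strided = conv2d(image, edge_kernel, stride=2)  # larger stride: 5x5 -> 2x2

print(valid)    # [[-9, -9, -9], [-9, 72, -9], [-9, -9, -9]]
print(strided)  # [[-9, -9], [-9, -9]]
```

With a stride of 2 the kernel never lands on the bright centre pixel, which shows how a larger stride both shrinks the output and skips positions.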

Activation Layer: An activation function is applied to the feature map immediately after each convolution. This introduces non-linearity, which the network needs in order to learn intricate patterns. The Rectified Linear Unit (ReLU), which zeroes out negative inputs and speeds up training without noticeably sacrificing accuracy, is a popular choice for CNNs. Other activation functions, such as Tanh and Leaky ReLU, can also be used. This layer leaves the dimensions of the volume unchanged.
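
As a small sketch of the three activation functions named above, the following plain-Python snippet applies each one element by element to a toy feature map. The helper name apply_activation and the slope alpha=0.01 for Leaky ReLU are assumptions made for this example, not values taken from the text.

```python
import math


def relu(x):
    # Negative inputs become zero; positive inputs pass through unchanged.
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    # alpha is an assumed small slope that lets negative inputs leak through.
    return x if x > 0 else alpha * x


def apply_activation(fn, feature_map):
    # Element-wise application: the output has the same shape as the input.
    return [[fn(value) for value in row] for row in feature_map]


feature_map = [[-9.0, 72.0], [0.5, -2.0]]

relu_map = apply_activation(relu, feature_map)
leaky_map = apply_activation(leaky_relu, feature_map)
tanh_map = apply_activation(math.tanh, feature_map)

print(relu_map)  # [[0.0, 72.0], [0.5, 0.0]]
```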

Pooling Layer (or Subsampling): These layers are inserted periodically to shrink the volume, which speeds up processing, lowers memory use, and helps avoid overfitting.

  • Pooling reduces the input over a certain area to a single value. A window sweeps over the input much as in convolution, but it applies an aggregation function instead of learned weights.
  • The two main varieties are max pooling, which takes the maximum value in the receptive field, and average pooling, which computes the average value. Max pooling is generally the more common of the two.
  • Pooling improves object recognition by providing a basic invariance to small translations and some tolerance to slight rotations. A face that is a little off-center, for instance, can still be identified because a small shift often leaves the pooled values unchanged. Although some information is lost, pooling lowers complexity and improves efficiency.
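
Max and average pooling can be sketched with one helper that takes the aggregation function as an argument. This is an illustrative plain-Python version that assumes a single-channel feature map and no padding; pool2d, aggregate, and mean are names introduced for the example.

```python
def pool2d(feature_map, size=2, stride=2, aggregate=max):
    """Reduce each size x size window to one value using `aggregate`."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, stride):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[top + a][left + b]
                for a in range(size)
                for b in range(size)
            ]
            # No learned weights here: the window is simply aggregated.
            row.append(aggregate(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 0],
    [5, 6, 1, 2],
    [0, 2, 9, 4],
    [1, 1, 3, 7],
]

max_pooled = pool2d(feature_map, aggregate=max)
avg_pooled = pool2d(feature_map, aggregate=mean)

print(max_pooled)  # [[6, 2], [2, 9]]
print(avg_pooled)  # [[3.75, 1.25], [1.0, 5.75]]
```

A 2×2 window with a stride of 2 halves each spatial dimension, so the 4×4 map becomes 2×2.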

Flattening: The feature maps are flattened into a one-dimensional vector following convolution and pooling, allowing them to be input into the next fully connected layer for regression or classification.
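
Flattening is only a change of layout, as this short sketch shows. It assumes the volume is stored as nested lists in height × width × channels order; the function name flatten is chosen for the example.

```python
def flatten(volume):
    """Unroll a height x width x channels volume into a one-dimensional list."""
    return [value for row in volume for position in row for value in position]


# A 2 x 2 volume with 3 channels at each position.
volume = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]

vector = flatten(volume)
print(len(vector))  # 12, i.e. 2 * 2 * 3
```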

Fully Connected Layers (FC Layers): Following flattening, these layers use the input from the previous layers to carry out the last classification or regression job.

  • Every node in a fully connected layer connects directly to every node in the preceding layer, in contrast to convolutional layers, which have only local connections.
  • The final FC layer frequently uses a softmax activation function to classify inputs by generating a probability score (between 0 and 1) for each class, while ReLU is commonly used in the earlier layers.
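
A fully connected layer followed by softmax can be sketched as a weighted sum per output node and a normalization step. The weights, biases, and input features below are made-up illustrative numbers, not trained values, and dense and softmax are names introduced for this example.

```python
import math


def dense(inputs, weights, biases):
    # Each output node takes a weighted sum over ALL inputs, plus a bias.
    return [
        sum(w * x for w, x in zip(node_weights, inputs)) + b
        for node_weights, b in zip(weights, biases)
    ]


def softmax(logits):
    # Subtracting the maximum keeps exp() numerically stable.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


features = [1.0, 2.0, 0.5]      # flattened feature vector (illustrative)
weights = [                     # one row of weights per output class
    [0.2, -0.1, 0.4],
    [0.5, 0.3, -0.2],
    [-0.3, 0.1, 0.1],
]
biases = [0.0, 0.1, -0.1]

logits = dense(features, weights, biases)
probabilities = softmax(logits)            # one score per class, summing to 1
predicted_class = probabilities.index(max(probabilities))
print(predicted_class)  # 1
```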

Output Layer: To translate outputs into class probabilities, this final layer typically applies a sigmoid (logistic) function for binary classification or a softmax function for multi-class classification.
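
The relationship between the two output functions can be checked numerically. The sketch below, with illustrative scores z_a and z_b, shows that a two-class softmax gives the same probability as a sigmoid applied to the difference of the two scores.

```python
import math


def sigmoid(z):
    # Squashes any real score into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


neutral = sigmoid(0.0)  # a score of zero maps to a probability of 0.5

z_a, z_b = 2.0, 0.5     # illustrative scores for two classes
two_class = softmax([z_a, z_b])
via_sigmoid = sigmoid(z_a - z_b)

# Both routes give the same probability for the first class.
print(abs(two_class[0] - via_sigmoid) < 1e-12)  # True
```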

Learning and Training of Convolutional Neural Networks

Like other neural networks, including Recurrent Neural Networks and Multilayer Perceptrons, CNNs are trained with gradient-based optimization. Stochastic, batch, and mini-batch gradient descent algorithms are used to optimize the network's parameters (its weights and biases). After training, the CNN can be used for inference, making predictions for new inputs. CNNs can be computationally demanding to train, so GPUs are frequently used for acceleration. NVIDIA's Deep Learning SDK uses GPUs to speed up training and inference for deep learning frameworks such as Caffe, CNTK, TensorFlow, Theano, and Torch.
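
The difference between stochastic, mini-batch, and batch gradient descent comes down to how many samples feed each update. The sketch below illustrates this on a deliberately tiny stand-in model, a line y = w*x + b fitted with a squared-error loss, rather than on a CNN, where the gradients would come from backpropagation. The function name train, the learning rate, the batch size, and the data are all assumptions made for the example.

```python
import random


def train(xs, ys, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Fit y = w*x + b by mini-batch gradient descent on the mean squared error."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                # Gradient of the squared error, averaged over the batch.
                grad_w += 2 * error * xs[i] / len(batch)
                grad_b += 2 * error / len(batch)
            # Step against the gradient.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]  # noise-free data generated from w=2, b=1

# batch_size=1 would be stochastic gradient descent;
# batch_size=len(xs) would be full-batch gradient descent.
w, b = train(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```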

Advantages and Disadvantages of CNNs

Advantages of Convolutional Neural Networks:

  • Very good at finding features and patterns in image, video, and audio signals.
  • Reasonably robust to translation and, to a lesser degree, to scaling and rotation.
  • Takes the place of manual feature extraction by supporting end-to-end training.
  • Capable of handling big data sets with excellent accuracy.

Disadvantages of Convolutional Neural Networks:

  • Computationally costly to train and memory-intensive.
  • May overfit when insufficient data is available or appropriate regularisation methods are not applied.
  • Requires a large amount of labelled data.
  • Hard to interpret, which makes it difficult to understand what the network has learned.

Code Example: Applying CNN Operations (Python/TensorFlow)

The following Python example uses TensorFlow to illustrate the convolution, activation (ReLU), and pooling (max) operations on an image.

The condensed Python code looks like this:

# import the necessary libraries
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
# set the plotting defaults
plt.rc('figure', autolayout=True)
plt.rc('image', cmap='magma')
# define the kernel (a 3x3 filter for edge detection)
kernel = tf.constant([
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
])
# load the image (assumes 'Ganesh.jpg' is in the working directory)
image = tf.io.read_file('Ganesh.jpg')
image = tf.io.decode_jpeg(image, channels=1) # Decode as grayscale (uint8)
# Convert to float32 before resizing so pixel values are scaled to [0, 1]
image = tf.image.convert_image_dtype(image, dtype=tf.float32)
image = tf.image.resize(image, size=(300, 300)) # Resize to 300x300
# plot the original image
# tf.squeeze drops the channel dimension; .numpy() converts the tensor for imshow
img = tf.squeeze(image).numpy()
plt.figure(figsize=(5, 5))
plt.imshow(img, cmap='gray')
plt.axis('off')
plt.title('Original Gray Scale image')
plt.show()
# Reformat image and kernel for TensorFlow operations
image = tf.expand_dims(image, axis=0) # Add batch dimension (shape: [1, height, width, channels])
# Reshape kernel for conv2d: [filter_height, filter_width, in_channels, out_channels]
# Since input image is grayscale (1 channel) and you want 1 output feature map
kernel = tf.reshape(kernel, [*kernel.shape, 1, 1])
kernel = tf.cast(kernel, dtype=tf.float32)
# --- Apply Convolutional Layer Operation ---
conv_fn = tf.nn.conv2d
image_filter = conv_fn(
    input=image,
    filters=kernel,
    strides=1, # or (1, 1)
    padding='SAME', # Ensures output size is same as input
)
plt.figure(figsize=(15, 5))
# Plot the convolved image
plt.subplot(1, 3, 1)
plt.imshow(tf.squeeze(image_filter)) # tf.squeeze removes batch and channel dimensions
plt.axis('off')
plt.title('Convolution')
# --- Apply Activation Layer Operation (ReLU) ---
relu_fn = tf.nn.relu
image_detect = relu_fn(image_filter)
plt.subplot(1, 3, 2)
plt.imshow(tf.squeeze(image_detect))
plt.axis('off')
plt.title('Activation')
# --- Apply Pooling Layer Operation (Max Pooling) ---
pool = tf.nn.pool
image_condense = pool(
    input=image_detect,
    window_shape=(2, 2), # 2x2 pooling window
    pooling_type='MAX', # Max pooling
    strides=(2, 2), # Stride of 2
    padding='SAME',
)
plt.subplot(1, 3, 3)
plt.imshow(tf.squeeze(image_condense))
plt.axis('off')
plt.title('Pooling')
plt.show()

Explanation of the Code Example:

Libraries: TensorFlow is imported for the tensor operations, matplotlib.pyplot for plotting, and numpy for numerical computations.

Kernel Definition: A 3×3 kernel (filter) is defined with fixed weights. Because this kernel emphasizes differences between neighbouring pixels, it is frequently used for edge detection.

Image Loading and Preprocessing: Ganesh.jpg is loaded, decoded as greyscale (channels=1), and resized to 300×300 pixels. The pixel values are held as float32, which tf.nn.conv2d requires, and a batch dimension is added. The kernel is also reshaped to make it compatible with tf.nn.conv2d.

Convolutional Layer: Convolution is done using the tf.nn.conv2d function.

  • input=image: The image after preprocessing.
  • filters=kernel: The 3×3 edge detection kernel defined earlier.
  • strides=1: The filter moves one pixel at a time.
  • padding='SAME': Ensures the output feature map is the same size as the input image by adding zero-padding where required.
  • The result, image_filter, is a convolved image with the edges highlighted.

Activation Layer (ReLU): The tf.nn.relu function applies the ReLU activation to image_filter. By replacing every negative value with zero, this step adds non-linearity. The activated features are stored in image_detect.

Pooling Layer (Max Pooling): The function tf.nn.pool carries out max pooling.

  • input=image_detect: The output of the activation layer.
  • window_shape=(2, 2): Defines a 2×2 pooling window.
  • pooling_type='MAX': Selects max pooling, in which the highest value in each 2×2 window is kept.
  • strides=(2, 2): The pooling window moves two pixels at a time, which halves each spatial dimension (for example, from 300×300 to 150×150 for each channel).
  • padding='SAME': Pads the border with zeros where needed so that every input position is covered; with a stride of 2 the output is the input size divided by 2, rounded up.

The final output, image_condense, is the activated feature map downsampled while keeping its most prominent features. The code thus illustrates how the main layers of a CNN are applied in sequence to extract and condense information from an image.
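
The spatial sizes in this walkthrough can be checked without running TensorFlow. The helper below is a sketch that uses the output-size rules TensorFlow documents for 'SAME' and 'VALID' padding; output_size is a name introduced for this example.

```python
import math


def output_size(n, kernel, stride, padding):
    """Spatial output size along one axis for an input of size n."""
    if padding == 'SAME':
        # Zeros are added as needed, so only the stride shrinks the output.
        return math.ceil(n / stride)
    if padding == 'VALID':
        # No padding: the window must fit entirely inside the input.
        return math.ceil((n - kernel + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


size = 300
after_conv = output_size(size, kernel=3, stride=1, padding='SAME')
after_pool = output_size(after_conv, kernel=2, stride=2, padding='SAME')
print(after_conv, after_pool)  # 300 150

# For comparison, the same 3x3 convolution without padding would shrink the image.
print(output_size(size, kernel=3, stride=1, padding='VALID'))  # 298
```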
