What Hierarchical Composition Is, How It Works, and Its Features

Neural networks are built around the core design idea of hierarchical composition, in which network layers are arranged to learn increasingly abstract and sophisticated features. Because of this layered structure, the network can develop a deep, multi-level understanding of the input data. It is comparable to building with LEGO: you start with basic bricks, such as edges and colors, combine them into larger structures, such as shapes, and then assemble those structures into a final model that depicts an entire object or scene. This approach loosely mirrors the way the human brain moves from basic sensory inputs to abstract concepts.

What is Hierarchical composition?

In the context of neural networks, hierarchical composition describes a structure in which sub-networks are organized in layers, each capturing distinct aspects of the input data. By combining simple features, the network can learn and represent increasingly complex ones. Because each sub-network concentrates on a specific level of abstraction, the overall network has a modular structure that is easier to analyze.

This structure lets models build abstract patterns out of more fundamental features. In image recognition, for instance, edges form shapes, which in turn form objects; in language, letters combine into words, which in turn form sentences. By capturing relationships and dependencies at several levels of abstraction, hierarchical composition supports effective learning, generalization, and interpretability. It underlies deep learning architectures such as convolutional neural networks (CNNs) and recursive neural networks.

History

Hierarchical feature learning is well established. In the 1980s, Kunihiko Fukushima proposed the neocognitron, a precursor to CNNs that used a layered structure to recognize patterns hierarchically. However, hierarchical composition did not become dominant until deep learning and more powerful computing hardware arrived. The broad success of CNNs in image recognition, particularly AlexNet in 2012, laid the foundation of modern deep learning.

How It Works

In a hierarchical composition, each network layer learns a new, more complex representation of the input based on the output of the layer before it.

Low-Level Layers: The earliest layers, such as the first few convolutional layers of a CNN, extract simple, local features: edges, corners, and color blobs. Generally speaking, these features are not specific to any one object.

Mid-Level Layers: Later layers combine these basic elements into more intricate mid-level features. For instance, a layer may combine particular edges and corners to learn to recognize shapes like squares or circles.

High-Level Layers: The final layers combine these mid-level features into high-level, abstract representations. By assembling learned shapes and textures, these layers can recognize entire objects in an image recognition task, such as a face, a car, or a cat.

This approach enables the network to learn complicated correlations and patterns in the data that simpler, shallower networks would struggle to capture.
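
The following is a minimal sketch (PyTorch) of how a small CNN can mirror the low-, mid-, and high-level stages described above. The class name, channel counts, and toy input are assumptions chosen purely for illustration, not a specific published architecture.

```python
import torch
import torch.nn as nn

class HierarchicalCNN(nn.Module):
    """Illustrative three-stage CNN: low-, mid-, and high-level feature stages."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Low-level stage: local features such as edges, corners, and color blobs.
        self.low = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # Mid-level stage: combines edges and corners into simple shapes.
        self.mid = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # High-level stage: assembles shapes and textures into object-level features.
        self.high = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.low(x)                      # (N, 16, H/2, W/2)
        x = self.mid(x)                      # (N, 32, H/4, W/4)
        x = self.high(x)                     # (N, 64, 1, 1)
        return self.classifier(x.flatten(1))

# Example: a batch of 8 RGB images of size 32x32.
logits = HierarchicalCNN()(torch.randn(8, 3, 32, 32))
print(logits.shape)                          # torch.Size([8, 10])
```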

Goal and ideas of hierarchical composition

Learning Hierarchical Features: Hierarchical composition groups information from previous layers to build more complex features at a broader scale. In image processing, a lower layer may detect simple edges, a middle layer may join edges into corners or curves, and a higher layer may recognize object parts or whole objects.

Addressing Optimization Challenges: Deep multi-layer neural networks are hard to train because gradient-based optimization, when initialized randomly, often gets stuck in unsatisfactory solutions. Hierarchical composition, especially greedy layer-wise unsupervised pre-training, initializes the network weights near a good local minimum, simplifying optimization and improving generalization.
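
The sketch below (PyTorch) shows one way greedy layer-wise unsupervised pre-training can look in practice: each auto-encoder layer is trained to reconstruct the output of the previous layer, and the stacked encoders then initialize a deep network for supervised fine-tuning. The layer sizes, toy data, and training settings are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def pretrain_layer(data, in_dim, hidden_dim, epochs=50, lr=1e-3):
    """Train a one-layer auto-encoder on `data` and return its encoder."""
    encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
    decoder = nn.Linear(hidden_dim, in_dim)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        recon = decoder(encoder(data))
        loss = nn.functional.mse_loss(recon, data)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder

x = torch.rand(256, 784)        # toy unlabeled data (e.g. flattened images)
dims = [784, 256, 64]           # sizes of the feature hierarchy
encoders, h = [], x
for d_in, d_out in zip(dims[:-1], dims[1:]):
    enc = pretrain_layer(h, d_in, d_out)
    encoders.append(enc)
    h = enc(h).detach()         # the next layer learns from this layer's output

# The stacked encoders initialize a deep network for supervised fine-tuning.
deep_net = nn.Sequential(*encoders, nn.Linear(dims[-1], 10))
```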

Efficiency and Generalization: Deep architectures can represent complex functions more compactly than shallow ones, requiring fewer computational elements and fewer training examples. This structured learning also improves generalization to unseen data.

Disentanglement: A fundamental goal for learned representations is "disentanglement", in which separate factors of variation in the data (such as pose or lighting) are captured by separate components of the representation. Hierarchical composition helps separate these factors, making representations more interpretable and easier to transform.

Advantages

Improved Accuracy: Hierarchical systems can outperform flat, single-layer networks, particularly on complex tasks.

Feature Reusability: Features learned at lower levels (like edges) can be reused across many different tasks and objects, making the network more efficient.

Data Efficiency: By first learning general features, the network can often generalize better with less data.

Interpretability: The hierarchical structure makes it easier to understand what the network is "looking for" at different depths by inspecting the features each layer has learned.

Robustness: Focusing on abstract, higher-level features rather than exact pixel values makes the network more tolerant of noise and small variations in the input.

Reduced Training Time: When adding new categories or making modifications, retraining can sometimes be limited to specific sub-networks of the hierarchy, making the process more efficient. Learning is accelerated for SHANNs in particular when they have access to shallow connections.

Computational Efficiency: Selective execution of sub-networks can save energy and computational resources, which is especially valuable for mobile applications.

Disadvantages

Computational Cost: Because very deep hierarchical networks include many layers and parameters, training them can be computationally costly.

Vanishing/Exploding Gradients: One of the most frequent issues with very deep networks is that the training gradients can get so small (vanishing) or so big (exploding) that it becomes hard to train the network efficiently.

Architecture Complexity: Designing and fine-tuning the best hierarchical architecture for a given problem can be difficult and often requires extensive experimentation.

Types

Hierarchical composition is not a particular kind of network but a design principle. The main "types" are grouped according to the kind of data they process:

  • Spatial Hierarchies: Used for image and spatial data (e.g., CNNs).
  • Temporal Hierarchies: Used for sequential data like text or time series (e.g., Hierarchical RNNs).
  • Combined Hierarchies: Networks that handle both spatial and temporal hierarchies, such as those used for video analysis.

Challenges of Hierarchical composition

  • Optimal Depth and Width: Deciding the ideal number of layers (depth) and neurons per layer (width) is a major challenge.
  • Learning Long-Range Dependencies: For networks with very deep hierarchies, maintaining information flow from the first to the last layers can be difficult.
  • Data Imbalances: If certain classes are underrepresented in the training data, the network may struggle to learn a balanced hierarchy of features.
  • Required Training Examples: Quantitatively determining how many training examples are required to learn abstract, low-dimensional hierarchical representations of data is an open question. The Random Hierarchy Model is a synthetic task introduced to study this; it finds that the amount of data needed corresponds to the point at which correlations between low-level features and classes become detectable.

Applications

Hierarchical composition is widely applied in many fields:

  • Computer Vision: Includes tasks like object detection, image segmentation, and facial recognition.
  • Natural Language Processing (NLP): Applied in areas such as text classification, sentiment analysis, and machine translation.
  • Speech Recognition: Used for converting audio signals into text.
  • Medical Imaging: Employed for detecting tumors or diseases from X-rays and MRIs.

Implementations in Various Models

Different approaches enable hierarchical composition in deep learning architectures:

Sparse Auto-Encoders (SAEs) and Sparse Decomposition: SAEs enable unsupervised learning of hierarchical image representations. The encoder maps the input into a latent (hidden) feature space, while the decoder reconstructs the input from these features.

A sparsity constraint is applied to the latent feature maps, promoting a parsimonious representation at each level of the hierarchy. This is essential for preventing trivial solutions and for letting features naturally compose into more complicated structures. The models are trained in an unsupervised, greedy, layer-wise fashion, with each layer learning from the previous layer's output.
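
Below is a minimal sketch (PyTorch) of one such layer: an auto-encoder whose latent code is penalized with an L1 term to encourage sparsity. The layer sizes, penalty weight, and toy data are assumptions for illustration, not the settings of any particular published model.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())   # input -> latent features
decoder = nn.Linear(128, 784)                             # latent features -> reconstruction
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
sparsity_weight = 1e-3                                     # strength of the L1 penalty

x = torch.rand(64, 784)                                    # toy batch of flattened images
for step in range(100):
    z = encoder(x)                                         # latent feature map
    recon = decoder(z)
    # Reconstruction error plus an L1 sparsity penalty on the latent code.
    loss = nn.functional.mse_loss(recon, x) + sparsity_weight * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```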

The Deconvolutional Networks (DNs) framework implements unsupervised sparse decomposition for hierarchical image representations. Unlike sparse auto-encoders, DNs have no encoder: feature map activations are computed exactly by optimization. DNs learn hierarchical filters for mid-level visual concepts, including edge junctions, parallel lines, and curves. Performing the sparse decomposition over the entire image, rather than over patches, is essential for learning rich features.

Deep Belief Networks (DBNs): Generative models with hidden causal factors. Each layer is trained greedily and without supervision using Restricted Boltzmann Machines (RBMs).

Lower layers in a DBN extract "low-level features", while higher layers represent "abstract" concepts that explain the input, moving from elementary notions toward abstract ones. After these representations have been learned, the network can be fine-tuned on a supervised training criterion, resulting in improved generalization.
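
As a rough sketch of the RBM building block, the following PyTorch code performs a single contrastive-divergence (CD-1) update for a binary RBM. The dimensions, learning rate, and toy batch are assumptions for illustration only.

```python
import torch

n_visible, n_hidden, lr = 784, 256, 0.01
W = torch.zeros(n_visible, n_hidden)    # weights between visible and hidden units
b = torch.zeros(n_visible)              # visible biases
c = torch.zeros(n_hidden)               # hidden biases

def cd1_step(v0):
    """One CD-1 parameter update from a batch of binary visible vectors v0."""
    global W, b, c
    h0_prob = torch.sigmoid(v0 @ W + c)              # hidden probabilities given the data
    h0 = torch.bernoulli(h0_prob)                    # sampled hidden states
    v1_prob = torch.sigmoid(h0 @ W.t() + b)          # "reconstructed" visible probabilities
    h1_prob = torch.sigmoid(v1_prob @ W + c)         # hidden probabilities given reconstruction
    # Positive phase minus negative phase, averaged over the batch.
    W += lr * (v0.t() @ h0_prob - v1_prob.t() @ h1_prob) / v0.shape[0]
    b += lr * (v0 - v1_prob).mean(0)
    c += lr * (h0_prob - h1_prob).mean(0)

cd1_step(torch.bernoulli(torch.rand(32, n_visible)))  # toy batch of binary data
```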

Convolutional Neural Networks (CNNs): Hierarchical composition is achieved through multiple layers of convolutions, non-linearities, and sub-sampling (pooling).

Pooling layers, such as max-pooling, provide spatial invariance to small shifts in feature positions; however, this invariance is usually built up over a deep hierarchy, and intermediate feature maps may not be invariant to large transformations.

Very deep networks, such as ResNets, push network depth further by reformulating layers to learn "residual functions", which simplifies training and enables the learning of complicated features.
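
A minimal sketch (PyTorch) of a residual block in the spirit of ResNets: the stacked layers learn a residual function F(x), and a skip connection adds it back to the input. The channel count and exact layer arrangement here are illustrative assumptions, not the original architecture.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv layers learn a residual F(x); the skip connection adds x back."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.body(x))            # output = ReLU(x + F(x))

y = ResidualBlock()(torch.randn(4, 64, 16, 16))       # shape is preserved: (4, 64, 16, 16)
```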

Deep Convolutional Inverse Graphics Network (DC-IGN): The DC-IGN model learns an interpretable image representation that is disentangled from transformations such as rotation and lighting variation. Several layers of convolution and de-convolution operators are trained with the Stochastic Gradient Variational Bayes (SGVB) technique.

The encoder extracts scene latent variables (the "graphics code"), while the decoder reconstructs the image. By changing specific groups of neurons in the graphics code layer, the model can re-render an image with a different pose or lighting, demonstrating that it has learned disentangled, interpretable representations.

Recurrent Neural Networks (RNNs): RNNs process variable-length sequences over several time steps, implicitly representing time and compositional structure.

By carrying information forward in their internal states, they develop context-dependent representations that reflect task demands and generalize across classes of items.

In language processing, RNNs can discover hierarchical category structures for words, where proximity in the representational space indicates similarity of properties and higher-level categories correspond to larger regions of that space.
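
The sketch below (PyTorch) illustrates one way a temporal hierarchy can be composed with RNNs: a lower GRU summarizes characters into word vectors, and a higher GRU summarizes those into a sentence vector. The choice of GRUs, the sizes, and the toy input are assumptions for illustration only.

```python
import torch
import torch.nn as nn

char_rnn = nn.GRU(input_size=16, hidden_size=32, batch_first=True)   # characters -> word vectors
word_rnn = nn.GRU(input_size=32, hidden_size=64, batch_first=True)   # word vectors -> sentence vector

# Toy input: one sentence of 5 words, each word a sequence of 8 character embeddings.
chars = torch.randn(5, 8, 16)
_, word_vecs = char_rnn(chars)          # final hidden state per word: shape (1, 5, 32)
_, sentence_vec = word_rnn(word_vecs)   # treat the 5 word vectors as one sequence
print(sentence_vec.shape)               # torch.Size([1, 1, 64])
```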

Neural Turing Machines (NTMs): NTMs couple a neural network controller with an external memory bank in order to carry out simple algorithms.
They can learn content-based and location-based addressing mechanisms for interfacing with the memory, enabling algorithms such as copying and sorting to generalize to longer sequences. In the process, they implicitly compose subroutines or operations.
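
As a rough sketch of the content-based addressing idea (PyTorch): the controller emits a key, which is compared with every memory row by cosine similarity, sharpened by a key-strength parameter, and normalized into read weights. The memory size, key, and strength value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

memory = torch.randn(128, 20)      # N = 128 memory rows, each of width M = 20
key = torch.randn(20)              # key vector emitted by the controller
beta = 5.0                         # key strength: larger values sharpen the focus

similarity = F.cosine_similarity(memory, key.unsqueeze(0), dim=1)   # (128,) similarities
weights = F.softmax(beta * similarity, dim=0)                        # attention over memory rows
read_vector = weights @ memory                                       # (20,) weighted read
```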

Radial Basis Function (RBF) Networks: Layered feed-forward models for multi-variable functional interpolation. A single hidden layer of radial basis function centres lets the network learn non-linear relationships. Although only single-hidden-layer topologies are used in this context, fitting complicated surfaces to data with these functions already implies a hierarchical mapping from input to output.
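
A minimal sketch (PyTorch) of this kind of model: Gaussian basis functions at fixed centres form the hidden layer, and the linear output weights are fitted by least squares to interpolate a toy one-dimensional function. The centres, width, and target function are assumptions for illustration.

```python
import torch

centres = torch.linspace(-3, 3, 10).unsqueeze(1)    # 10 RBF centres in one dimension
width = 0.5                                          # shared Gaussian width

def rbf_features(x):
    """Gaussian activation of each centre for inputs x of shape (N, 1)."""
    dist2 = (x - centres.t()) ** 2                   # (N, 10) squared distances to centres
    return torch.exp(-dist2 / (2 * width ** 2))

# Fit the linear output weights by least squares to interpolate a toy function.
x = torch.linspace(-3, 3, 50).unsqueeze(1)
y = torch.sin(x)
phi = rbf_features(x)                                # (50, 10) design matrix
w = torch.linalg.lstsq(phi, y).solution              # (10, 1) output weights
pred = rbf_features(torch.tensor([[1.0]])) @ w       # prediction near sin(1.0)
```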

The Harmony Theory framework for information processing in dynamical systems views knowledge as a hierarchy of “knowledge atoms” and “representational features”.

The system tries to bring together lower-level perceptual processing and higher-level cognitive processes by integrating information across abstraction levels.

The proposed network structure connects lower-level representation nodes to higher-level knowledge atoms, creating a multi-layered hierarchy for complicated cognitive processes.
