Denoising Autoencoders and How They Work

Denoising Autoencoders (DAEs) are a variant of the basic autoencoder architecture that reconstructs a clean input from a corrupted version, forcing the model to learn robust representations of the data.

What follows is a detailed description of denoising autoencoders:

Core Concept and Motivation

The central idea behind Denoising Autoencoders is to make the learned representations robust to partial destruction of the input. This robustness is expected to encourage the model to capture the regularities, dependencies, and stable structures of the underlying (unknown) data distribution. Just as people can recognize partially occluded or distorted images, the important patterns in high-dimensional, redundant inputs such as images should be recoverable even from partial observations.

How Denoising Autoencoders Work

Corrupting the Input: An original input x is first randomly corrupted to produce a partially destroyed version x̃. In the classic experiments, this means choosing a fixed number νd of the d input components at random and setting their values to 0 while leaving the others unaltered. For images, this procedure resembles "salt noise", or the removal of components from pictures on a white background. A minimal corruption routine is sketched below.
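
As an illustrative sketch (not from the source), the following PyTorch routine applies masking noise. It uses a per-component Bernoulli mask, so it zeroes a fraction ν of the components in expectation rather than exactly νd of them; the function name and the default ν = 0.3 are assumptions.

```python
import torch

def corrupt(x, nu=0.3):
    # Masking noise: independently zero out each component of x with
    # probability nu (a fraction nu of components in expectation),
    # leaving the remaining components unaltered.
    mask = (torch.rand_like(x) >= nu).float()
    return x * mask
```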
Encoding and Decoding: The corrupted input x̃ is then mapped by the encoder function y = fθ(x̃) to a hidden representation y. A decoder function z = gθ'(y) then uses this hidden representation to reconstruct a "repaired" vector z in the original input space, as in the sketch below.
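
A minimal model sketch, assuming the classic affine-plus-sigmoid form of the encoder fθ and decoder gθ'; the class name and the choice of layer sizes are illustrative, not prescribed by the source.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    # Encoder f_theta and decoder g_theta', each an affine map followed
    # by a sigmoid, mapping d-dimensional inputs to a d_hidden-dimensional
    # code y and back to a reconstruction z in the input space.
    def __init__(self, d, d_hidden):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, d_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(d_hidden, d), nn.Sigmoid())

    def forward(self, x_tilde):
        y = self.encoder(x_tilde)  # y = f_theta(x_tilde)
        z = self.decoder(y)        # z = g_theta'(y)
        return z
```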

Objective: The Denoising Autoencoder's parameters are trained to minimize the average reconstruction error between the reconstructed output z and the uncorrupted input x. A popular loss function is the reconstruction cross-entropy LH(x, z) = −Σk [xk log zk + (1 − xk) log(1 − zk)], which acts as a negative log-likelihood for binary vectors. The key difference from a conventional autoencoder is that z is a deterministic function of x̃ (the corrupted input) rather than of x (the original input), which makes z a stochastic mapping of x. Because the DAE cannot simply learn the identity function, this setup removes the need for a hidden dimension d' smaller than the input dimension d, or for other specific regularization strategies, to prevent trivial solutions. A sketch of one training step follows.
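
A hedged sketch of a single training step under the cross-entropy objective, reusing the corrupt function and DenoisingAutoencoder class sketched above; the dimensions and learning rate are illustrative assumptions, not values from the source.

```python
import torch
import torch.nn.functional as F

model = DenoisingAutoencoder(d=784, d_hidden=256)  # e.g. flattened 28x28 images
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x):
    # Corrupt the input, reconstruct it, and compare the reconstruction z
    # against the *clean* input x (assumed to lie in [0, 1]), not against
    # the corrupted x_tilde.
    x_tilde = corrupt(x, nu=0.3)
    z = model(x_tilde)
    loss = F.binary_cross_entropy(z, x)  # reconstruction cross-entropy L_H(x, z)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```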

Layer-wise Initialization for Deep Architectures

A technique known as greedy layer-wise pre-training uses Denoising Autoencoders as building blocks to train deep neural networks one layer at a time.

Each layer is trained to produce a robust higher-level representation of the representation it receives from the layer below by minimizing the denoising criterion.
It has been shown empirically that this initialization procedure avoids the poor local minima frequently encountered with random initialization of deep networks, leading to noticeably better generalization performance. The corruption process qD is used only during this unsupervised training phase; it is not applied when propagating representations through the network from the raw input to higher layers during fine-tuning or inference. Experiments indicate that unsupervised greedy layer-wise pre-training with DAEs can outperform, by a large margin, both purely supervised training and pre-training with ordinary (non-denoising) autoencoders. A sketch of the procedure follows this paragraph.
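
A compact sketch of the greedy procedure, reusing the corrupt function and DenoisingAutoencoder class from above; the function name, layer sizes, and training-loop details are illustrative assumptions.

```python
def pretrain_stack(x, layer_dims, nu=0.3, epochs=10):
    # Train one DAE per layer on the clean representation produced by the
    # layers below it, keeping only each trained encoder. Corruption is
    # applied only while training a layer, never when propagating upward.
    encoders = []
    h = x
    for d_hidden in layer_dims:
        dae = DenoisingAutoencoder(d=h.shape[1], d_hidden=d_hidden)
        opt = torch.optim.Adam(dae.parameters(), lr=1e-3)
        for _ in range(epochs):
            z = dae(corrupt(h, nu))
            loss = F.binary_cross_entropy(z, h)
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            h = dae.encoder(h)  # uncorrupted propagation to the next layer
        encoders.append(dae.encoder)
    return encoders
```

After pre-training, the stacked encoders initialize a deep network that is then fine-tuned on the supervised task, with no corruption applied.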


Conceptual Views and Relationships

The Denoising Autoencoder can be understood from several theoretical perspectives:

Manifold Learning Perspective: If the training data lie on or near a low-dimensional manifold, Denoising Autoencoders learn a stochastic operator that maps corrupted examples, which likely lie off the manifold, back to high-probability points on or near it. The hidden representation Y can be viewed either as a representation capturing the main variations in the data or as a coordinate system for points on the manifold.

Stochastic Operator Perspective: The DAE learns a semi-parametric model p(X) that approximates the empirical data distribution q⁰(X) by minimizing the Kullback-Leibler divergence DKL(q⁰(X, X̃) ‖ p(X, X̃)) between the joint empirical distribution and the model distribution. Sampling from the resulting model is straightforward.

Information-Theoretic Viewpoint: Minimizing the expected reconstruction error in a DAE effectively maximizes a lower bound on the mutual information between the original input X and its hidden representation Y, even though Y is computed from a corrupted input.

Generative Model Viewpoint: Mathematically, training a DAE is equivalent to maximizing a variational bound on a particular generative model p(X, X̃, Y).

Connection to Other Models

  • The denoising task itself is a well-studied problem in image processing.
  • DAEs are related to methods that augment training data with stochastically transformed patterns, although DAEs require no prior domain knowledge to design those transformations.
  • Similarities exist with robust coding over noisy channels.
  • Unlike proper latent-variable generative models, which require expensive marginalization over missing inputs, DAEs learn a fast and reliable deterministic mapping during training.
  • Stochastic models such as Restricted Boltzmann Machines (RBMs) and Deep Belief Networks (DBNs) can also learn robust representations; DAEs also differ from models such as Deconvolutional Networks, which obtain features by solving an inference problem directly rather than through a separate encoder, while DAEs aim for better features via a cheap deterministic encoder.