What is a Deep Belief Network?
A Deep Belief Network (DBN) is a generative deep learning model composed of several layers of stochastic latent variables. These layers are typically built from Restricted Boltzmann Machines (RBMs), a type of neural network that learns to represent probability distributions over its inputs.

Structure of Deep belief networks
- A Deep Belief Network (DBN) is a generative model with several layers of hidden causal variables.
- Lower layers extract “low-level features” from the input, while upper layers represent more “abstract” concepts that explain the input observation.
- The fundamental idea is that DBNs learn simple concepts first and then build on them to learn more abstract ones.
- Because every conditional distribution between layers, P(g_i | g_{i+1}), is factorised, sampling and probability computation are straightforward (see the sketch after this list).
- The model assumes each hidden layer g_i is a binary random vector with n_i elements.
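To make the factorised conditionals concrete, here is a minimal NumPy sketch of top-down ancestral sampling through P(g_i | g_{i+1}). The layer sizes, the weight initialisation, and the i.i.d. Bernoulli(0.5) draw for the top layer (a full DBN would instead sample it from the undirected RBM formed by its top two layers) are illustrative assumptions, not details from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_layer(g_above, W, b):
    # Factorised conditional P(g_i | g_{i+1}): each unit of g_i is an independent
    # Bernoulli with probability sigmoid(W @ g_above + b).
    p = sigmoid(W @ g_above + b)
    return (rng.random(p.shape) < p).astype(float)

# Hypothetical layer sizes n_0 (visible) ... n_3 (top) for a DBN with three hidden layers.
sizes = [784, 500, 250, 100]
Ws = [rng.normal(0.0, 0.01, (sizes[i], sizes[i + 1])) for i in range(3)]  # W_i maps g_{i+1} to g_i
bs = [np.zeros(sizes[i]) for i in range(3)]

# Simplification: draw the top layer i.i.d. Bernoulli(0.5) instead of sampling the top-level RBM.
g = (rng.random(sizes[-1]) < 0.5).astype(float)
for W, b in zip(reversed(Ws), reversed(bs)):   # ancestral sampling: g_3 -> g_2 -> g_1 -> g_0
    g = sample_layer(g, W, b)
print(g.shape)   # (784,): one sampled visible vector
```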

Training Process of Deep belief networks
Greedy Layer-Wise Unsupervised Learning: The key advance behind DBNs is a greedy layer-wise unsupervised learning algorithm. Three components make it significant: layers are pre-trained one at a time in a greedy manner, unsupervised learning is used at each layer to preserve information from the input, and the whole network is then fine-tuned with respect to the final criterion of interest.
Restricted Boltzmann Machines (RBMs) as Building Blocks: RBMs are the essential building blocks from which DBNs are constructed. An RBM is defined by the energy function energy(v,h) = −h′Wv − b′v − c′h, where v and h are the visible and hidden units, respectively, and W, b, and c are the parameters.
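Building on the sketch above (and reusing its sigmoid helper), the energy function and the factorised conditionals it implies can be written directly; taking W to have shape (hidden, visible) is an assumption of this sketch.

```python
def rbm_energy(v, h, W, b, c):
    # energy(v, h) = -h' W v - b' v - c' h, with v the visible units, h the hidden
    # units, W the (hidden x visible) weight matrix, and b, c the visible/hidden biases.
    return -h @ W @ v - b @ v - c @ h

def p_h_given_v(v, W, c):
    # Factorised conditional: P(h_j = 1 | v) = sigmoid(c_j + (W v)_j).
    return sigmoid(c + W @ v)

def p_v_given_h(h, W, b):
    # Factorised conditional: P(v_k = 1 | h) = sigmoid(b_k + (W' h)_k).
    return sigmoid(b + W.T @ h)
```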
Contrastive Divergence (CD): This algorithm is used to train RBMs. It estimates the gradient of the log-likelihood by running a Markov chain for a small number of steps (usually k = 1), starting from a sample v0 drawn from the training distribution of the RBM.
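A minimal CD update with k = 1, reusing the conditional helpers above; the learning rate and the use of hidden probabilities (rather than samples) in the gradient statistics are assumptions of the sketch.

```python
def cd1_update(v0, W, b, c, rng, lr=0.1):
    # One contrastive-divergence step (k = 1) for a binary RBM, with the Markov
    # chain started at a training example v0.
    ph0 = p_h_given_v(v0, W, c)                       # positive-phase statistics
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample h0 ~ P(h | v0)
    pv1 = p_v_given_h(h0, W, b)                       # one step of the chain
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = p_h_given_v(v1, W, c)                       # negative-phase statistics
    W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c
```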
The process of layer-by-layer training:
An RBM is first trained on the empirical data.
The posterior distribution Q(g_1 | g_0) of this trained RBM, where g_0 is the observed input x, induces an empirical distribution p̂_1 over the first layer g_1.
This procedure is repeated: using Q(g_i | g_{i-1}), samples of g_{i-1} are stochastically transformed into samples of g_i, on which the next RBM is trained.
Given an input x, this approach also yields an approximation of the posterior over the hidden variables at every level.
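Putting these pieces together, the layer-by-layer procedure could look like the following sketch; the epoch count, learning rate, and the choice to propagate stochastic samples (rather than mean activations) upward are assumptions.

```python
def pretrain_dbn(X, hidden_sizes, rng, epochs=5, lr=0.1):
    # Greedy layer-wise pre-training: train an RBM on the current data, then map
    # the data upward through Q(g_i | g_{i-1}) and train the next RBM on the result.
    params = []
    data = X                                    # g_0 = the observed input x
    for n_hidden in hidden_sizes:
        n_visible = data.shape[1]
        W = rng.normal(0.0, 0.01, (n_hidden, n_visible))
        b, c = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):
            for v0 in data:                     # CD-1 on each training case
                W, b, c = cd1_update(v0, W, b, c, rng, lr)
        params.append((W, b, c))
        # Stochastically transform samples of g_{i-1} into samples of g_i.
        probs = sigmoid(data @ W.T + c)
        data = (rng.random(probs.shape) < probs).astype(float)
    return params
```

Here X would be an (examples × features) binary matrix; the returned list of (W, b, c) triples, one per layer, is what the supervised fine-tuning stage described below starts from.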
Greedy Training: A variational bound justifies the greedy procedure: provided the added layer is appropriately initialised and has enough units, initial gains obtained while training it result in gains in the likelihood of the layer below it.
Continuous-Valued Inputs: DBNs and RBMs can be adapted to handle continuous-valued inputs naturally by altering the energy function and the permitted range of values, for example by using Gaussian units for continuous inputs.
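One common way to realise this (an assumption here: the standard Gaussian-Bernoulli formulation with unit-variance visible units) is to keep the hidden units binary and make the visible units Gaussian with a mean given by the top-down input:

```python
def sample_v_gaussian(h, W, b, rng):
    # Gaussian visible units with unit variance: v | h ~ N(b + W' h, I).
    # The hidden units stay binary, so p(h | v) keeps its sigmoid form.
    mean = b + W.T @ h
    return mean + rng.normal(size=mean.shape)
```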
Continuous Training: The greedy notion that each layer is pre-trained to model its input can be maintained under continuous training, which performs at least as well as the original method and permits a single stopping criterion.
Denoising Autoencoders as Alternatives: Instead of RBMs, autoencoders (more specifically, denoising autoencoders) can be used as building blocks to apply the same layer-wise greedy unsupervised pre-training idea, with comparable results. The explicit denoising criterion helps capture interesting structure and produces intermediate representations that are better suited to subsequent learning tasks.
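For comparison, a minimal tied-weight denoising autoencoder layer in the same NumPy style; the masking-noise level, the cross-entropy reconstruction loss, and the tied weights are assumptions of this sketch rather than the exact setup of any particular paper.

```python
def dae_layer(X, n_hidden, rng, noise=0.3, lr=0.1, epochs=5):
    # Denoising autoencoder: corrupt the input by zeroing a random fraction of its
    # entries, then train to reconstruct the clean input from the corrupted one.
    n_visible = X.shape[1]
    W = rng.normal(0.0, 0.01, (n_hidden, n_visible))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for x in X:
            x_tilde = x * (rng.random(x.shape) > noise)  # masking corruption
            h = sigmoid(c + W @ x_tilde)                  # encoder
            x_hat = sigmoid(b + W.T @ h)                  # tied-weight decoder
            # Gradients of the cross-entropy reconstruction loss (tied weights).
            delta_out = x_hat - x
            delta_hid = (W @ delta_out) * h * (1 - h)
            W -= lr * (np.outer(delta_hid, x_tilde) + np.outer(h, delta_out))
            b -= lr * delta_out
            c -= lr * delta_hid
    return W, b, c
```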
Supervised Fine-tuning
After the unsupervised pre-training of the DBN layers, the whole network can be further optimised by gradient descent with respect to any training criterion that can be computed deterministically.
For supervised classification, a logistic regression layer can be added on top and the entire network trained with stochastic gradient descent. Supervised updates have been shown to work better with a much higher learning rate (e.g., 20-fold) than the contrastive divergence updates.
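A sketch of this fine-tuning stage, reusing the pre-trained (W, b, c) triples from the earlier sketch: a softmax output layer (V, d) is added (its name, the mean-field forward pass, and the single shared learning rate are assumptions), and one stochastic-gradient step backpropagates the cross-entropy loss through every layer.

```python
def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def finetune_step(x, y_onehot, params, V, d, lr=0.1):
    # One stochastic-gradient step of supervised fine-tuning. params holds the
    # pre-trained (W, b, c) triples (the visible biases b are unused here);
    # (V, d) is the added logistic-regression (softmax) output layer.
    activations = [x]
    for W, _, c in params:                               # mean-field forward pass
        activations.append(sigmoid(c + W @ activations[-1]))
    p = softmax(d + V @ activations[-1])
    delta = p - y_onehot                                 # cross-entropy gradient
    V_grad, d_grad = np.outer(delta, activations[-1]), delta
    delta = (V.T @ delta) * activations[-1] * (1 - activations[-1])
    for i in reversed(range(len(params))):               # backpropagate into the DBN
        W, b, c = params[i]
        W_new = W - lr * np.outer(delta, activations[i])
        c_new = c - lr * delta
        if i > 0:
            delta = (W.T @ delta) * activations[i] * (1 - activations[i])
        params[i] = (W_new, b, c_new)
    V -= lr * V_grad
    d -= lr * d_grad
    return params, V, d
```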
Partially Supervised Training: When the input distribution does not adequately reveal information about the target variable (for example, in some regression problems), a mixed training criterion can be used that combines the supervised and unsupervised objectives, forcing predictive information into the representation. Even when this partial supervision is applied only to the first layer, it can yield notable improvements.
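One possible way to realise such a mixed criterion for the first layer (the relative learning rates and the temporary softmax output layer (V, d) are assumptions of this sketch, not the formulation used in the source) is to interleave a CD-1 update with a supervised gradient step on the same weights:

```python
def mixed_update(v0, y_onehot, W, b, c, V, d, rng, lr_unsup=0.005, lr_sup=0.1):
    # Partially supervised training of the first layer: an unsupervised CD-1 update
    # plus a supervised gradient through a temporary softmax layer (V, d), which
    # pushes predictive information about y into the same hidden representation.
    W, b, c = cd1_update(v0, W, b, c, rng, lr_unsup)     # unsupervised objective
    h = sigmoid(c + W @ v0)                              # supervised objective
    p = softmax(d + V @ h)
    delta_out = p - y_onehot
    delta_hid = (V.T @ delta_out) * h * (1 - h)
    V -= lr_sup * np.outer(delta_out, h)
    d -= lr_sup * delta_out
    W -= lr_sup * np.outer(delta_hid, v0)
    c -= lr_sup * delta_hid
    return W, b, c, V, d
```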
Benefits and Features of Deep belief networks
Overcoming Optimisation Challenges: DBNs address the long-standing difficulty of training deep neural networks, where gradient-based optimisation from random initialisation often ends up in poor solutions. Unsupervised pre-training effectively initialises the weights in a region near a good local minimum.
Representational Efficiency: Deep architectures such as DBNs can be exponentially more efficient than shallow ones (such as SVMs and one-hidden-layer neural networks) in the number of computational elements needed to represent certain complex functions, and can require fewer examples as a result.
Better Generalisation: The greedy layer-wise unsupervised training strategy leads to better generalisation performance.
Meaningful High-Level Abstractions: This strategy initialises the upper layers with better representations of relevant high-level abstractions of the input.
Scalability: Hierarchical representations can be learnt in an unsupervised manner with Convolutional Deep Belief Networks.
Robustness: The stochastic nature of RBMs, like the corruption used in denoising autoencoders, can build a form of robustness into the learnt representations.
Challenges and Limitations of Deep belief networks
Intractable Probabilistic Computations: Intractable probabilistic computations in maximum likelihood estimation and related techniques pose a computational challenge for deep generative models such as DBNs and Deep Boltzmann Machines (DBMs). For such models it is often not possible to obtain a tractable unnormalised probability density.
Uncooperative Input Distributions: The purely unsupervised greedy layer-wise pre-training may not be sufficient for supervised tasks in which the input distribution p(x) is largely unrelated to the target variable y. Partial supervision during training helps mitigate this.
Weight Sharing in RBMs: As the fundamental DBN building block, RBMs additionally require that their encoder and decoder share weights, which may restrict the latent representations that can be learnt.