Introduction
Variational Autoencoders (VAEs) enable efficient approximate inference and learning in directed probabilistic models with continuous latent variables whose posterior distributions are intractable, even on large datasets.
What follows is a detailed breakdown of the Variational Autoencoder:
Core Concept and Problem Addressed
Purpose: Variational Autoencoders learn a generative model that captures the data distribution and can generate new samples resembling the training data. They simultaneously learn an approximate inference model (called a recognition model) that efficiently infers the latent variables for a given data point.
Challenge (Intractability): Standard variational Bayesian (VB) learning in directed probabilistic models breaks down when the posterior over the continuous latent variables (or parameters) is intractable. With complex likelihood functions such as neural networks with nonlinear hidden layers, the EM algorithm and mean-field VB become impractical, because the integrals required for the marginal likelihood and the true posterior density cannot be evaluated or differentiated analytically.
The Auto-Encoding Variational Bayes (AEVB) Algorithm
The Auto-Encoding Variational Bayes (AEVB) algorithm enables efficient inference and learning for i.i.d. datasets with continuous latent variables per data point. It applies a reparameterization of the variational lower bound to obtain the Stochastic Gradient Variational Bayes (SGVB) estimator: a simple, differentiable, unbiased estimator of the bound.

Key Components of Variational Autoencoders
A Variational Autoencoder is a directed graphical model in which unobserved continuous latent variables z are assumed to generate the observed data x.
Generative Model (Probabilistic Decoder): This component generates data x from latent variables z. It is written as a conditional distribution pθ(x|z), where θ denotes its parameters; given a latent code z, it defines a distribution over plausible values of x. It is typically implemented as a neural network, such as a Multi-Layer Perceptron (MLP), that takes z as input and outputs the parameters of the output distribution: mean and variance for Gaussian outputs, or probabilities for Bernoulli outputs.
Recognition Model / Probabilistic Encoder (Approximate Inference Model): The model qφ(z|x) approximates the intractable posterior pθ(z|x). Given a data point x, it produces a distribution (e.g., a Gaussian) over the code values z from which x could have been generated. Its parameters φ are learned jointly with the generative model parameters θ. When both models are neural networks, the pair forms a variational auto-encoder: an MLP encoder, mirroring the decoder, takes x as input and outputs the parameters of qφ(z|x) (e.g., mean and standard deviation).
Latent Variables (Code): The unobserved continuous random variable z represents, or codes, the observed data x. The prior over z is typically a simple distribution such as the centered isotropic multivariate Gaussian pθ(z) = N(z; 0, I).
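To make these components concrete, here is a minimal sketch of the encoder and decoder just described as small MLPs. This is illustrative only, not code from the source; the use of PyTorch and the layer sizes (e.g., x_dim=784, z_dim=20) are assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Probabilistic encoder qφ(z|x): maps x to the mean and log-variance
    of a diagonal Gaussian over the latent code z."""
    def __init__(self, x_dim=784, hidden_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Linear(x_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, z_dim)
        self.logvar = nn.Linear(hidden_dim, z_dim)

    def forward(self, x):
        h = torch.tanh(self.hidden(x))
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """Probabilistic decoder pθ(x|z): maps z to Bernoulli parameters
    (pixel probabilities) for binary data."""
    def __init__(self, z_dim=20, hidden_dim=400, x_dim=784):
        super().__init__()
        self.hidden = nn.Linear(z_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, x_dim)

    def forward(self, z):
        h = torch.tanh(self.hidden(z))
        return torch.sigmoid(self.out(h))  # Bernoulli means in (0, 1)
```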
The Variational Lower Bound
The VAE training objective is to maximize a variational lower bound on the marginal likelihood of the data. The marginal likelihood of a data point decomposes into a non-negative KL divergence term and the lower bound L(θ,φ;x(i)):
log pθ(x(i)) = DKL(qφ(z|x(i)) || pθ(z|x(i))) + L(θ,φ;x(i)).
The lower bound can be written as
L(θ,φ;x(i)) = Eqφ(z|x)[− log qφ(z|x) + log pθ(x, z)],
or equivalently as
L(θ,φ;x(i)) = −DKL(qφ(z|x(i)) || pθ(z)) + Eqφ(z|x(i))[log pθ(x(i)|z)].
The second formulation (Eq. 3 in the source) is the one commonly used, because the KL-divergence term DKL(qφ(z|x(i)) || pθ(z)) can often be integrated analytically, in particular when both the approximate posterior qφ(z|x) and the prior pθ(z) are Gaussian. This KL term acts as a regularizer that keeps the approximate posterior close to the prior, while the second term, the expected negative reconstruction error Eqφ(z|x(i))[log pθ(x(i)|z)], gives the objective its "auto-encoding" character.
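For the common case where qφ(z|x) is a diagonal Gaussian and the prior is N(0, I), the KL term has the closed form DKL = −0.5 Σ(1 + log σ² − µ² − σ²). A minimal sketch of this computation, assuming PyTorch tensors mu and logvar as produced by the hypothetical encoder sketched above:

```python
import torch

def kl_q_p(mu, logvar):
    """Analytic D_KL(qφ(z|x) || pθ(z)) between a diagonal Gaussian
    N(mu, exp(logvar)) and the standard normal prior N(0, I),
    summed over latent dimensions."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
```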
The SGVB Estimator and Reparameterization Trick
Directly optimizing the lower bound with respect to the variational parameters φ is challenging, because the naive Monte Carlo gradient estimator exhibits very high variance.
Reparameterization Trick: This is addressed by reparameterization: the random variable z̃ ∼ qφ(z|x) is rewritten as a deterministic, differentiable transformation gφ(ε,x) of an auxiliary noise variable ε ∼ p(ε). For a Gaussian N(µ, σ²), for example, z = µ + σε with ε ∼ N(0,1). This makes Monte Carlo estimates of expectations differentiable with respect to φ.
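In code, the trick is a single line of arithmetic on the encoder outputs; a sketch under the same assumptions as above (PyTorch tensors mu and logvar):

```python
import torch

def reparameterize(mu, logvar):
    """Draw z ~ qφ(z|x) as a differentiable transformation of external noise:
    z = mu + sigma * eps with eps ~ N(0, I), so gradients flow to mu and logvar."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps
```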
SGVB Estimators: Applying this reparameterization to the variational lower bound yields practical Stochastic Gradient Variational Bayes (SGVB) estimators.
L̃A(θ,φ;x(i)): a generic estimator based on Eq. (2).
L̃B(θ,φ;x(i)): a second estimator, based on Eq. (3), which computes the KL-divergence term analytically and typically has lower variance. Both estimators are straightforward to optimize with stochastic gradient methods. For efficient training on large i.i.d. datasets, a minibatch version L̃M(θ,φ;XM) of the estimator is used.
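As an illustration of an L̃B-style estimator for binary data (a sketch, not the authors' code), reusing the hypothetical kl_q_p and reparameterize helpers from above; the reconstruction term log pθ(x(i)|z) under a Bernoulli decoder is the negative binary cross-entropy:

```python
import torch
import torch.nn.functional as F

def sgvb_estimator_b(x, encoder, decoder, L=1):
    """Minibatch estimator in the style of L~B: analytic KL term plus a
    Monte Carlo estimate (L samples) of the expected reconstruction
    log-likelihood. Returns the negative bound so it can be minimized."""
    mu, logvar = encoder(x)
    kl = kl_q_p(mu, logvar)                       # shape [batch]
    recon = 0.0
    for _ in range(L):
        z = reparameterize(mu, logvar)
        x_prob = decoder(z)                       # Bernoulli means
        recon = recon - F.binary_cross_entropy(x_prob, x, reduction="none").sum(-1)
    recon = recon / L                             # estimate of E_q[log pθ(x|z)]
    lower_bound = recon - kl                      # per-datapoint lower bound
    return -lower_bound.mean()                    # loss for a gradient step
```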
Training Methods
AEVB Algorithm: Algorithm 1 in the source describes minibatch training with AEVB. After initializing θ and φ, it iterates the following steps (a minimal code sketch of this loop follows the list):
- Draw a random minibatch of M data points XM.
- Sample random noise ε from the distribution p(ε).
- Compute the gradients ∇θ,φ L̃M(θ,φ;XM, ε) of the minibatch estimator.
- Update the parameters θ and φ using these gradients (e.g., with SGD or Adagrad). The loop repeats until convergence; in the experiments, M = 100 and L = 1 (samples per data point) worked well.
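A minimal sketch of this loop, reusing the hypothetical Encoder, Decoder, and sgvb_estimator_b pieces sketched earlier; data_loader is an assumed iterator over minibatches of flattened binary inputs:

```python
import torch

num_epochs = 50                                   # assumption, not from the source
encoder, decoder = Encoder(), Decoder()
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adagrad(params, lr=0.01)  # SGD or Adagrad, per the paper

for epoch in range(num_epochs):
    for x_batch in data_loader:                   # random minibatch X^M (e.g., M = 100)
        loss = sgvb_estimator_b(x_batch, encoder, decoder, L=1)  # noise ε drawn inside
        optimizer.zero_grad()
        loss.backward()                           # gradients ∇θ,φ of the minibatch estimator
        optimizer.step()                          # update θ and φ
```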
Connection to Auto-Encoders
The AEVB algorithm makes explicit the connection between auto-encoders and directed probabilistic models trained with a variational objective. The objective function in Eq. (7) has two terms:
The first term, −DKL(qφ(z|x(i))||pθ(z)), functions as a regularizer, encouraging the approximate posterior over the latent code z to stay close to the prior pθ(z).
The second term, Eqφ(z|x(i))[log pθ(x(i)|z)], is the expected negative reconstruction error, the hallmark of auto-encoders: the decoder pθ(x|z) is scored on how well it reconstructs the original input x(i) from a sampled latent code z(i,l). Whereas ordinary auto-encoders usually need a nuisance regularization hyperparameter to learn useful representations, here the regularization term falls out of the variational bound itself.
Example: Variational Auto-Encoder with Gaussian Components
A common instantiation uses Multi-Layer Perceptrons (MLPs) for both the probabilistic encoder qφ(z|x) and the probabilistic decoder pθ(x|z).
The prior over latent variables pθ(z) is usually a centered isotropic multivariate Gaussian N(z; 0, I).
The approximate posterior qφ(z|x(i)) is a multivariate Gaussian with diagonal covariance, N(z; µ(i), σ²(i)I). The encoder MLP outputs the mean µ(i) and standard deviation σ(i) as functions of x(i) and φ.
Reparameterization: samples from qφ(z|x(i)) are drawn as z(i,l) = µ(i) + σ(i) ⊙ ε(l), where ε(l) ∼ N(0, I) and ⊙ denotes the element-wise product.
Decoder: a decoding MLP maps z to the parameters of a multivariate Gaussian (for real-valued data) or Bernoulli (for binary data) distribution over x. In this Gaussian-posterior, Gaussian-prior setting, the KL divergence term can be computed and differentiated analytically, which simplifies the SGVB estimator.
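For real-valued data, the decoding MLP outputs a mean and log-variance instead of Bernoulli probabilities, and the reconstruction term is the Gaussian log-density. A sketch mirroring the Bernoulli decoder above (layer sizes are again assumptions):

```python
import math
import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    """Decoder pθ(x|z) for real-valued data: an MLP that outputs the mean and
    log-variance of a diagonal Gaussian over x."""
    def __init__(self, z_dim=20, hidden_dim=400, x_dim=784):
        super().__init__()
        self.hidden = nn.Linear(z_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, x_dim)
        self.logvar = nn.Linear(hidden_dim, x_dim)

    def forward(self, z):
        h = torch.tanh(self.hidden(z))
        return self.mu(h), self.logvar(h)

def gaussian_log_likelihood(x, mu, logvar):
    """log pθ(x|z) under the diagonal Gaussian output, summed over dimensions."""
    return -0.5 * torch.sum(
        logvar + (x - mu).pow(2) / logvar.exp() + math.log(2 * math.pi), dim=-1
    )
```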
Uses and Benefits of Variational Autoencoders
- Efficient approximate inference of the posterior over the latent code z, useful for coding and data-representation tasks.
- Efficient approximate marginal inference of x, useful for tasks that require a prior over x, such as image denoising, inpainting, and super-resolution.
- Learning useful representations: the variational bound regularizes the latent representation without an extra regularization hyperparameter, yielding more meaningful codes.
- Visualization: a learned encoder can project high-dimensional data onto a low-dimensional latent space (e.g., 2D).
- Generation: once trained, the generative model can produce new data samples by drawing z from the prior pθ(z) and then sampling x from pθ(x|z), as sketched below.
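A minimal sketch of this ancestral sampling step, using the hypothetical Bernoulli decoder from earlier:

```python
import torch

def sample_new_data(decoder, n_samples=16, z_dim=20):
    """Generate new data: draw z ~ pθ(z) = N(0, I), then decode to the
    parameters of pθ(x|z) (here, Bernoulli pixel probabilities)."""
    z = torch.randn(n_samples, z_dim)
    with torch.no_grad():
        x_prob = decoder(z)
    return x_prob  # sample or threshold for binary images
```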
With particular training procedures, VAEs can also learn disentangled representations, in which individual latent variables correspond to interpretable properties of the data such as pose, illumination, and shape. This makes it possible to manipulate generated data much like a 3D renderer.