Back-Propagation Through Time (BPTT) is a gradient-based learning algorithm for recurrent neural networks (RNNs). It works by propagating error signals “backwards in time” to adjust the weights of the network.
However, as these error signals travel backward through time, they tend either to blow up (explode) or to vanish (decay exponentially), which is a fundamental weakness of Back-Propagation Through Time. Because the backpropagated error depends exponentially on the size of the weights, it is difficult to train RNNs to capture long-term dependencies.
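To make this concrete, here is a minimal NumPy sketch of BPTT for a single-layer vanilla RNN with a tanh hidden state and a squared-error loss at the final time step. The function and variable names (bptt_single_loss, W_xh, W_hh, W_hy) are illustrative choices, not taken from the sources.

```python
import numpy as np

def bptt_single_loss(xs, target, W_xh, W_hh, W_hy):
    """Forward pass over a sequence, then back-propagation through time.

    xs:     list of input column vectors, one per time step
    target: desired output at the final time step
    Returns gradients w.r.t. the three weight matrices.
    """
    T = len(xs)
    h = np.zeros((W_hh.shape[0], 1))
    hs, pre_acts = [h], []            # stored activations grow with sequence length

    # Forward pass: store every hidden state (BPTT is not "local in time").
    for x in xs:
        a = W_xh @ x + W_hh @ h
        h = np.tanh(a)
        pre_acts.append(a)
        hs.append(h)

    y = W_hy @ hs[-1]
    err = y - target                  # gradient of the squared-error loss at the final step

    dW_xh = np.zeros_like(W_xh)
    dW_hh = np.zeros_like(W_hh)
    dW_hy = err @ hs[-1].T

    # Backward pass: send the error signal "backwards in time".
    delta = (W_hy.T @ err) * (1 - np.tanh(pre_acts[-1]) ** 2)
    for t in reversed(range(T)):
        dW_xh += delta @ xs[t].T
        dW_hh += delta @ hs[t].T
        if t > 0:
            # Each step multiplies by W_hh^T and an activation derivative --
            # the source of the exponential decay/growth discussed below.
            delta = (W_hh.T @ delta) * (1 - np.tanh(pre_acts[t - 1]) ** 2)
    return dW_xh, dW_hh, dW_hy
```

Note that the list of stored hidden states grows with the sequence length, which is the memory property discussed under Properties of Computation below.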
The Vanishing/Exploding Gradient Problem
Mechanism: As an error signal travels backward in time, it is scaled at every step by a product of first-order derivatives and weights along the path; a numeric sketch of this scaling is given below, after this list.
Vanishing Gradients: If the absolute value of each per-step scaling factor is below 1.0 (for example, with logistic sigmoid activations and weights whose absolute values are below 4.0), the error decays exponentially with the length of the time lag (q). Learning to bridge long time gaps then becomes unacceptably slow or fails completely. This is the more prevalent problem.
Exploding Gradients: If the scaling factors are greater than 1.0, the error signals can exponentially “blow up”, leading to oscillating weights and unstable learning. This is less common but can have serious consequences.
Adjustment ineffectiveness: Raising the learning rate does not address the underlying problem, because it has no influence on the ratio of long-range to short-range error flow.
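The following numeric sketch (a hypothetical scalar case with illustrative weight values) shows this mechanism: each backward step multiplies the error by a factor of the form w * f'(net), so after q steps the error behaves like that factor raised to the power q.

```python
import numpy as np

def sigmoid_prime(net):
    """Derivative of the logistic sigmoid; its maximum value is 0.25 at net = 0."""
    s = 1.0 / (1.0 + np.exp(-net))
    return s * (1.0 - s)

def scaled_error(initial_error, w, net, q):
    """Error signal after flowing back q steps through identical units.

    Each step multiplies the error by w * f'(net); the result therefore
    behaves like (w * f'(net)) ** q -- exponential in the time lag q.
    """
    factor = w * sigmoid_prime(net)
    return initial_error * factor ** q

for q in (1, 10, 100):
    vanish = scaled_error(1.0, w=3.0, net=0.0, q=q)   # |3.0 * 0.25| = 0.75 < 1
    explode = scaled_error(1.0, w=6.0, net=0.0, q=q)  # |6.0 * 0.25| = 1.5  > 1
    print(f"q={q:3d}  vanishing: {vanish:.3e}  exploding: {explode:.3e}")
```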
Limitations on Learning Long-Term Dependencies
Back-Propagation Through Time copes with tasks only when there are short time lags between inputs and corresponding teacher signals; it is ineffective at learning long-term dependencies and “fails miserably” on tasks that require them.
For instance, in studies involving noise-free sequences that required memory across time, Back-Propagation Through Time achieved 0% success at both a 10-time-step delay and a 100-time-step delay, whereas Long Short-Term Memory (LSTM) achieved 100% success.
According to the sources, Back-Propagation Through Time and RTRL “have no chance of solving non-trivial tasks with minimal time lags of 1000 steps” and seem “quite useless when the time lags exceed as few as 10 steps.”
Even on problems with shorter time lags where Back-Propagation Through Time may not fail entirely, such as the embedded Reber grammar, LSTM clearly outperforms it and learns significantly faster.
Properties of Computation
The computational complexity of Back-Propagation Through Time is O(1) for each time step and weight, which is regarded as efficient and on par with the complexity of LSTM.
Unlike some other algorithms, full Back-Propagation Through Time is not “local in time”: activation values must be stored on a stack whose size can grow without bound with the length of the input sequence.
Comparison to Other Algorithms
Real-Time Recurrent Learning (RTRL): The gradient computed by untruncated BPTT is identical to that computed by offline RTRL. On long-time-lag problems, both techniques give very similar, and often poor, results, because they suffer from the same exponential error decay/explosion.
Long Short-Term Memory (LSTM): LSTM was created specifically to address the error back-flow problems that afflict RTRL and BPTT. It does so by introducing “constant error carrousels” (CECs) within special memory cells, which enforce constant error flow, together with multiplicative “gate units” that learn to open and close access to that flow. On tasks such as copying, repeat copying, associative recall, and priority sorting, experimental results show that Neural Turing Machines (NTMs) with LSTM controllers learn “much faster” and generalise noticeably better to longer sequences than standard LSTMs trained with BPTT-like procedures. Compared with earlier approaches, the sources describe a “qualitative, rather than quantitative, difference” in the way that NTM and LSTM address these issues.
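For orientation, below is a minimal sketch of a gated memory cell in the spirit of this description: the cell state is updated additively and carried forward (the CEC), while multiplicative gates control what is written, kept, and read. This uses the common modern formulation with a forget gate, not the exact architecture from the sources, and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, params):
    """One step of a simplified LSTM cell.

    The cell state c is updated additively (c = f*c_prev + i*g), so error can
    flow back through it largely unscaled -- the "constant error carrousel".
    The multiplicative gates i, f, o learn when to write, keep, and read.
    """
    concat = np.concatenate([h_prev, x])                  # previous output stacked with input
    i = sigmoid(params["W_i"] @ concat + params["b_i"])   # input gate
    f = sigmoid(params["W_f"] @ concat + params["b_f"])   # forget gate
    o = sigmoid(params["W_o"] @ concat + params["b_o"])   # output gate
    g = np.tanh(params["W_g"] @ concat + params["b_g"])   # candidate cell input
    c = f * c_prev + i * g                                # CEC: additive, gated update
    h = o * np.tanh(c)                                    # gated read-out of the cell
    return h, c
```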
Truncated BPTT: Computing the full gradient with untruncated BPTT offers no appreciable benefit over truncated versions, because error flow outside the stable CECs of LSTM architectures tends to vanish quickly anyway.
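To make the truncation concrete, here is a sketch of a backward pass that simply stops after k steps. It assumes the stored quantities (xs, hs, pre_acts, err) produced by the vanilla-RNN forward pass sketched earlier; the cut-off k is an illustrative parameter.

```python
import numpy as np

def truncated_backward(err, xs, hs, pre_acts, W_hh, W_hy, k):
    """Backward pass of BPTT truncated to at most k steps.

    Identical to the full backward pass of the earlier vanilla-RNN sketch,
    except that the error stops flowing once it has travelled k steps back;
    contributions from older history are simply dropped.
    """
    T = len(xs)
    dW_xh = np.zeros((W_hh.shape[0], xs[0].shape[0]))
    dW_hh = np.zeros_like(W_hh)
    delta = (W_hy.T @ err) * (1 - np.tanh(pre_acts[-1]) ** 2)
    for steps_back, t in enumerate(reversed(range(T))):
        if steps_back >= k:
            break                                   # truncate: ignore older history
        dW_xh += delta @ xs[t].T
        dW_hh += delta @ hs[t].T
        if t > 0:
            delta = (W_hh.T @ delta) * (1 - np.tanh(pre_acts[t - 1]) ** 2)
    return dW_xh, dW_hh
```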
Simple Weight Guessing: Random weight guessing has been found to outperform BPTT and other algorithms on some simple tasks, which suggests that several previously used benchmark problems were not complex enough to properly test the algorithms.
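A sketch of what “weight guessing” means in practice, assuming a task-specific evaluate function (hypothetical here) that reports whether a given weight setting solves the task; weights are drawn at random until one works, with no gradient information at all.

```python
import numpy as np

def guess_weights(evaluate, shapes, scale=1.0, max_trials=100_000, seed=0):
    """Random weight guessing: repeatedly draw all weights from a uniform
    distribution and keep the first setting that solves the task.

    `evaluate` is a task-specific, user-supplied function (assumed here)
    that returns True when the network with the given weights succeeds.
    """
    rng = np.random.default_rng(seed)
    for trial in range(max_trials):
        weights = [rng.uniform(-scale, scale, size=s) for s in shapes]
        if evaluate(weights):
            return trial + 1, weights            # number of guesses needed
    return None, None
```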
Mitigation Measures
One way to lessen the detrimental effects of vanishing or exploding gradients in RNNs is gradient clipping, in which the gradient vector is rescaled whenever its norm exceeds a specified threshold.
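A minimal sketch of norm-based clipping, assuming the gradients have already been computed (for example by a BPTT routine like the one sketched earlier); the threshold value is illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, threshold=5.0):
    """Rescale a list of gradient arrays if their joint L2 norm exceeds `threshold`.

    The direction of the gradient is preserved; only its length is capped,
    which limits the damage done by an exploding gradient step.
    """
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]
    return grads
```

Because clipping only caps the size of a gradient step, it counters exploding gradients but does nothing to help with vanishing gradients.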