LSTM and Its Comparison with Gated Recurrent Units (GRUs)

The Long Short-Term Memory (LSTM) unit overcomes key obstacles in training recurrent neural networks (RNNs), most notably the difficulty of learning long-term dependencies.

Details about the LSTM unit:

Purpose and Problem Addressed

Due to vanishing or exploding gradients during backpropagation, traditional RNNs struggle to capture long-term dependencies. Error signals travelling backward in time tend to either shrink toward zero or blow up rapidly, making it hard for the network to learn correlations between distant events. LSTM was created to solve these problems, allowing RNNs to store and retrieve information over long time spans.

Core Mechanism: Constant Error Carousel (CEC)

The Constant Error Carousel (CEC) is LSTM’s central innovation: a linear unit at the core of each memory cell with a fixed self-connection of weight 1.0. Because the self-connection weight is exactly 1.0, error signals held in the memory cell can flow back through time indefinitely without being exponentially shrunk or amplified, which avoids both the vanishing and the exploding gradient problem.
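The effect of the fixed self-connection can be seen with a tiny calculation. The sketch below is a minimal, hypothetical illustration (a single linear unit with no gates, not the full LSTM cell): the factor by which a backpropagated error is scaled after many steps is simply the self-weight raised to the number of steps.

```python
# Minimal sketch: a single linear unit s_t = w * s_{t-1} + x_t (no gates).
# The local derivative d s_t / d s_{t-1} is w, so an error flowing back
# over `steps` time steps is scaled by w ** steps.

def error_scale(self_weight: float, steps: int) -> float:
    return self_weight ** steps

for w in (0.9, 1.0, 1.1):
    print(f"w={w}: error scale after 100 steps = {error_scale(w, 100):.3e}")
# w=0.9 -> ~2.7e-05 (vanishes), w=1.0 -> 1.0 (constant), w=1.1 -> ~1.4e+04 (explodes)
```

Only a weight of exactly 1.0 keeps the error signal constant, which is precisely what the CEC enforces.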

Key Parts: Gate Units and Memory Cells


The CEC forms the core of an LSTM memory cell, which is surrounded by multiplicative “gate units” that regulate information flow. These gates learn when to store, ignore, or output information, making the memory operations context-sensitive.

The Memory Cell (c_j^t): the central component that preserves memory at time t. Its internal state s_{c_j}(t) accumulates information over time, and the output gate modulates this internal state to produce the cell’s output y^{c_j}(t).

The Input Gate (i_j^t): a multiplicative unit that protects the memory cell’s contents from irrelevant inputs. It controls how much new content is written into the cell, so the network learns when to keep and when to overwrite the stored information.

The Forget Gate (f_j^t): controls how much of the existing memory content is forgotten. This lets the LSTM selectively retain or discard information over time, which is essential for handling long-term dependencies.

The Output Gate (o_j^t): a multiplicative unit that shields other units from currently irrelevant memory contents. It controls how much of the cell’s content is exposed to the rest of the network, so the network learns when to read the memory cell and when to leave other units undisturbed.
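Putting the pieces together, the sketch below shows one LSTM time step in NumPy. It is a minimal illustration of the common forget-gate formulation; the weight layout, names, and shapes are assumptions made for this example, not the exact equations of the original paper.

```python
# Minimal NumPy sketch of one LSTM step (assumed parameter layout and shapes).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W, U, b stack the parameters for the input gate (i), forget gate (f),
    output gate (o), and candidate memory content (g), in that order.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b            # stacked pre-activations, shape (4H,)
    i = sigmoid(z[0*H:1*H])               # input gate: how much new content enters
    f = sigmoid(z[1*H:2*H])               # forget gate: how much old content is kept
    o = sigmoid(z[2*H:3*H])               # output gate: how much of the cell is exposed
    g = np.tanh(z[3*H:4*H])               # candidate memory content
    c = f * c_prev + i * g                # additive cell-state update (the CEC path)
    h = o * np.tanh(c)                    # gated output of the memory cell
    return h, c

# Usage with random parameters (input size 3, hidden size 4):
rng = np.random.default_rng(0)
x, h0, c0 = rng.normal(size=3), np.zeros(4), np.zeros(4)
W, U, b = rng.normal(size=(16, 3)), rng.normal(size=(16, 4)), np.zeros(16)
h1, c1 = lstm_step(x, h0, c0, W, U, b)
```

The additive update `c = f * c_prev + i * g` is the code-level counterpart of the CEC: as long as the forget gate stays near 1, old content passes through unchanged.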

Comparison with Gated Recurrent Units (GRUs)

Another advanced recurrent unit, the Gated Recurrent Unit (GRU), uses gating mechanisms but has a simpler structure than LSTM.

Similarities: Both LSTM and GRU use an additive component in their update, retaining existing content and adding new content on top of it. This creates “shortcut paths” for back-propagated error, makes important features easier to preserve over long sequences, and mitigates vanishing gradients.

Differences: Unlike LSTM, GRU has no separate memory cell. Its activation h_j^t is a linear interpolation between the previous activation and a candidate activation.

LSTM also has an output gate that controls how much of the memory content is exposed, whereas GRU exposes its full state. The two architectures also regulate the flow of new information differently: LSTM through its input gate, GRU through its reset and update gates.
With matched parameter counts, GRU can surpass LSTM in convergence speed and generalization on some datasets, but the two usually perform comparably; which works better depends on the dataset and task.
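For contrast with the LSTM step above, the sketch below shows one GRU time step in NumPy, again with assumed weight names and shapes. Note that there is no separate cell state and no output gate: the update gate interpolates directly between the previous and candidate activations, and the result is fully exposed.

```python
# Minimal NumPy sketch of one GRU step (assumed weight names and shapes).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate activation
    return (1.0 - z) * h_prev + z * h_cand         # linear interpolation, fully exposed
```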

Training/Optimization

LSTM networks, like other neural networks, are differentiable end to end and can be trained effectively with gradient-descent methods such as SGD combined with backpropagation.

The original LSTM uses a truncated form of backpropagation that confines error flow to the memory cells: once errors leave a cell or its gate units, they are not propagated further back in time. This keeps training efficient without compromising the CEC’s long-term memory.

LSTM is computationally efficient, with an update complexity per time step and weight of O(1), comparable to BPTT for fully recurrent nets. It is also “local in space and time,” so its storage requirements are independent of input sequence length.
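As a practical illustration of gradient-descent training, here is a hedged PyTorch sketch that trains an LSTM with SGD on dummy data. The chunk-wise detach() used here is the modern framework-level “truncated BPTT”, which is related to but not the same as the within-cell error truncation described above; all names, shapes, and hyperparameters are illustrative assumptions.

```python
# Hedged sketch: training an LSTM with SGD and backpropagation in PyTorch.
import torch
import torch.nn as nn

seq_len, chunk, batch, in_dim, hidden = 1000, 50, 8, 10, 32
model = nn.LSTM(in_dim, hidden, batch_first=True)
head = nn.Linear(hidden, 1)
opt = torch.optim.SGD(list(model.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(batch, seq_len, in_dim)          # dummy inputs
y = torch.randn(batch, seq_len, 1)               # dummy targets
state = None                                     # (h, c), created by nn.LSTM on first call

for start in range(0, seq_len, chunk):
    xb, yb = x[:, start:start+chunk], y[:, start:start+chunk]
    out, state = model(xb, state)                # forward pass over one chunk
    loss = nn.functional.mse_loss(head(out), yb)
    opt.zero_grad()
    loss.backward()                              # backpropagate within the chunk
    opt.step()
    state = tuple(s.detach() for s in state)     # stop gradients at the chunk boundary
```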

Addressing Training Issues:

  • If the error can be reduced without long-term storage, the network may “abuse” memory cells early in training, using them as ordinary bias cells. Remedies include sequential network construction (adding memory cells only when needed) and initializing the output gates with a negative bias so that memory cells are allocated later.
  • When a memory cell’s inputs are mostly positive or mostly negative, its internal state can drift, which can make the gradient vanish. Biasing the input gate toward zero at the start of training counteracts this.
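The sketch below shows how these bias remedies might be applied to PyTorch’s nn.LSTM, whose gates are ordered input, forget, cell, output within each bias vector. The specific bias values are illustrative assumptions, not values prescribed by the source.

```python
# Hedged sketch: gate-bias initialization for PyTorch's nn.LSTM
# (gate order within each bias vector: input, forget, cell, output).
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=32)
H = lstm.hidden_size
with torch.no_grad():
    for name, param in lstm.named_parameters():
        if "bias" in name:
            param[0*H:1*H].fill_(0.0)   # input gate bias at zero (counters state drift)
            param[3*H:4*H].fill_(-1.0)  # negative output gate bias (discourages early
                                        # "abuse" of memory cells as bias units)
```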

Experimental Results and Generalization

In experiments on long-time-lag tasks, LSTM outperforms RNNs with tanh units as well as networks trained with Real-Time Recurrent Learning (RTRL) or BPTT.

  • LSTM solves tasks whose minimal time lags exceed 1000 steps, while older approaches struggle even with much shorter lags (e.g., 10 steps).
  • It handles noise, distributed input representations, and precise continuous values without significant degradation over long lags.
  • LSTM can extract temporal order information from widely spaced inputs.
  • LSTM can serve as the controller network in Neural Turing Machines (NTMs), which couple neural networks to external memory resources. NTMs with LSTM controllers learn faster and generalize better than NTMs with feedforward controllers or standalone LSTM networks on copy, repeat-copy, and associative-recall tasks, especially on sequences longer than those seen during training.

Advantages and Limitations

Advantages:

  • Bridges long time lags by enforcing constant error flow within its memory cells.
  • Handles noisy inputs, distributed representations, and continuous values in long-time-lag problems.
  • Generalizes well even when the positions of relevant inputs vary, and can learn the temporal order of widely separated inputs.
  • Robustness: works well across a wide range of learning rates and gate biases without extensive parameter fine-tuning.
  • Highly efficient: O(1) update complexity per weight and time step, local in space and time.
  • Unlike finite-state automata or hidden Markov models, LSTM does not require an a priori choice of a finite number of states; in principle it can deal with unlimited state counts.

Limitations:

  • The efficient truncated-backprop variant of LSTM can struggle with “strongly delayed XOR” problems, where solving subgoals does not steadily reduce the error.
  • Increased parameters: each memory cell block requires additional input and output gate units, which can increase the number of weights by a factor of up to 9 compared with a fully connected standard recurrent net.
  • Parity problems: like other gradient-based approaches, LSTM can behave much like a feedforward network that sees the whole input at once, so problems such as 500-step parity may be solved faster by random weight guessing.
  • Like other gradient-based techniques, LSTM may struggle to count discrete time steps precisely when very small differences in lag matter.