Traditional Recurrent Neural Networks (RNNs) struggle to learn long-term dependencies. The Gated Recurrent Unit (GRU), recently proposed by Cho et al. and first employed in machine translation, addresses this problem.
Purpose and Problem Addressed
Vanishing or exploding gradients make it hard for traditional RNNs to capture relationships between distant events in a sequence. The GRU, like the LSTM unit, uses a gating mechanism to modulate information flow and adaptively capture dependencies at different time scales. RNNs built from GRUs perform well on tasks with long-term dependencies, such as speech recognition and machine translation.
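To see why this happens, consider the usual back-of-the-envelope argument (a sketch of the standard analysis, not an equation taken from the source). For a plain recurrent unit $h_t = \tanh(W x_t + U h_{t-1})$, the error signal reaching a distant time step is a product of Jacobians:

$$\frac{\partial \mathcal{L}_T}{\partial h_t} = \frac{\partial \mathcal{L}_T}{\partial h_T} \prod_{k=t}^{T-1} \frac{\partial h_{k+1}}{\partial h_k}, \qquad \left\| \frac{\partial h_{k+1}}{\partial h_k} \right\| \le \|U\|.$$

If the norm of $U$ is roughly below one, the product shrinks exponentially with the gap $T - t$ (vanishing gradients); if it is well above one, the product can grow exponentially (exploding gradients).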
Core Mechanism and Components
Unlike the LSTM, the GRU has no separate memory cell. Instead, its activation is a linear interpolation between the previous activation and a candidate activation. The GRU controls information flow with two gating units:

Update Gate ($z_{jt}$): controls how much the unit updates its activation, or content. It balances retaining the previous hidden state against adding new information from the current input and the previous hidden state. The update gate is computed with a logistic sigmoid: $z_{jt} = \sigma(W_z x_t + U_z h_{t-1})_j$. Through this gate the unit can decide to carry its previous state forward largely unchanged or to overwrite it with new content.
Reset Gate ($r_{jt}$): controls how strongly the previous hidden state influences the candidate activation. When the reset gate is close to zero, the unit effectively forgets its previously computed state and acts as if it were reading the first symbol of an input sequence. The reset gate is also computed with a logistic sigmoid: $r_{jt} = \sigma(W_r x_t + U_r h_{t-1})_j$.
Candidate Activation ($\tilde{h}_{jt}$): the proposed new hidden-state content, computed with a hyperbolic tangent: $\tilde{h}_{jt} = \tanh(W x_t + U(r_t \circ h_{t-1}))_j$, where $\circ$ denotes element-wise multiplication. The reset gate modulates the previous hidden state ($r_t \circ h_{t-1}$) so that past information can be selectively forgotten when computing the candidate.
Final Activation ($h_{jt}$): the GRU's current activation is a linear interpolation between the previous activation and the candidate, guided by the update gate: $h_{jt} = (1 - z_{jt})\, h_{j,t-1} + z_{jt}\, \tilde{h}_{jt}$.
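To make the gating equations concrete, here is a minimal NumPy sketch of a single GRU step (my own illustration, not code from the source; biases are omitted and the weight names W_z, U_z, W_r, U_r, W, U simply mirror the formulas above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU time step following the equations above (biases omitted)."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)          # update gate z_t
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)          # reset gate r_t
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev))  # candidate activation
    return (1.0 - z) * h_prev + z * h_tilde                  # linear interpolation

# Toy usage with random weights: one step from a zero state.
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
p = {name: 0.1 * rng.standard_normal((n_h, n_in if name.startswith("W") else n_h))
     for name in ["W_z", "U_z", "W_r", "U_r", "W", "U"]}
h = gru_step(rng.standard_normal(n_in), np.zeros(n_h), p)
print(h.shape)  # (3,)
```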
Comparison with LSTM
Similarities:
Additive Component: a key element shared by the GRU and the LSTM is their additive update. Instead of completely replacing the old activation at each time step, as a conventional recurrent unit does, both preserve existing content and add new content on top of it. This additive nature lets a unit remember a specific feature for many steps without it being overwritten.
Shortcut Paths: the additive structure also creates shortcut paths that bypass multiple time steps. These shortcuts mitigate the vanishing gradient problem, because error signals can be back-propagated over long spans without being squashed too quickly by repeated non-linearities.
Gating Mechanisms: both architectures regulate information flow with gating units, which is what allows them to capture long-term dependencies effectively.
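As a toy illustration of this shortcut effect (an illustrative sketch, not from the source): when the update gate sits near zero, a GRU-style unit carries its state forward almost unchanged across many steps, whereas a plain tanh unit repeatedly rewrites and squashes its state.

```python
import numpy as np

h_gru, h_tanh = 1.0, 1.0
z = 0.01   # update gate stuck near zero: keep ~99% of the old state each step
w = 0.5    # recurrent weight for the plain tanh unit
for _ in range(50):
    h_gru = (1 - z) * h_gru + z * np.tanh(0.0)  # candidate ~ 0, state decays slowly
    h_tanh = np.tanh(w * h_tanh)                # state rewritten and squashed each step
print(round(h_gru, 3), round(float(h_tanh), 3))  # roughly 0.605 vs 0.0 after 50 steps
```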
Differences:
Memory Cell: LSTM units maintain a memory cell ($c_{jt}$) that is separate from the hidden state they expose ($h_{jt}$), whereas the GRU uses its activation directly as its state.
Output Gate: the LSTM unit has an output gate that controls how much of the memory cell content is exposed to, and used by, other units in the network. The GRU, in contrast, exposes its full content without any such control.
Input and Forget Gates: in the LSTM, the new memory content ($\tilde{c}_{jt}$) is computed separately and added to the memory cell ($c_{jt}$) under the control of an input gate ($i_{jt}$) that is independent of the forget gate ($f_{jt}$).
Coupled Update: in the GRU, the reset gate controls how much of the previous activation flows into the candidate activation ($\tilde{h}_{jt}$), while a single update gate ($z_{jt}$) determines both how much of the candidate is added to the current state and how much of the previous state is retained.
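For contrast, here is a matching NumPy sketch of one LSTM step (again my own illustration, with biases omitted and weight names chosen by analogy with the GRU sketch above); note the separate memory cell, the independent input and forget gates, and the output gate that the GRU does not have.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step (biases omitted)."""
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev)         # input gate, independent of f
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev)         # forget gate
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev)         # output gate (no GRU analogue)
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev)   # new memory content
    c_t = f * c_prev + i * c_tilde                          # separate memory cell
    h_t = o * np.tanh(c_t)                                  # only part of c_t is exposed
    return h_t, c_t
```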
Because of these structural differences, it is hard to say which unit performs best across all tasks.
Performance and Training
Both LSTMs and GRUs are trained with gradient-based optimization. To keep training stable and avoid exploding gradients, techniques such as RMSProp, weight noise, and gradient clipping are employed.
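A hedged sketch of what one such training step can look like in PyTorch, combining RMSProp with gradient-norm clipping (the model, data, and hyperparameters below are purely illustrative, not those of the original experiments; weight noise is omitted):

```python
import torch
import torch.nn as nn

model = nn.GRU(input_size=8, hidden_size=32, batch_first=True)  # GRU layer
readout = nn.Linear(32, 8)                                      # map hidden state to output
params = list(model.parameters()) + list(readout.parameters())
optimizer = torch.optim.RMSprop(params, lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(16, 20, 8)             # toy batch: (batch, time, features)
inputs, targets = x[:, :-1], x[:, 1:]  # next-step prediction targets

optimizer.zero_grad()
outputs, _ = model(inputs)             # hidden states for every time step
loss = loss_fn(readout(outputs), targets)
loss.backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # guard against exploding gradients
optimizer.step()
```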
GRU, LSTM, and conventional tanh units have been compared empirically on sequence modeling tasks such as polyphonic music modeling and raw speech signal modeling.
Superiority over Tanh Units: RNNs with GRU or LSTM units outperform tanh-RNNs, especially on difficult tasks such as raw speech signal modeling. Compared to the simpler recurrent unit, the gated units also converge faster and reach better final solutions.
GRU vs. LSTM Performance
- GRU-RNN outperformed LSTM-RNN on the polyphonic music datasets, making faster progress both per parameter update and per unit of CPU time.
- In speech signal modeling, the LSTM-RNN did better on the Ubisoft A dataset, while the GRU-RNN did better on Ubisoft B.
- The choice between LSTM and GRU therefore depends on the dataset and task.
Parameter Efficiency: with the number of parameters held fixed, the GRU can surpass the LSTM on some datasets in convergence speed, in progress per parameter update, and in generalization. Unlike Neural Turing Machines (NTMs) with external memory, LSTM networks have a parameter count that grows quadratically with the number of hidden units.
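As a rough back-of-the-envelope check of that quadratic growth (standard parameter-count formulas, not figures from the source), a single recurrent layer with input size n_x and hidden size n_h has three gated blocks in a GRU and four in an LSTM:

```python
def gru_params(n_x, n_h):
    # update gate, reset gate, candidate: each has W (n_h x n_x), U (n_h x n_h), and a bias
    return 3 * (n_h * n_x + n_h * n_h + n_h)

def lstm_params(n_x, n_h):
    # input, forget, and output gates plus the candidate memory content
    return 4 * (n_h * n_x + n_h * n_h + n_h)

for n_h in (128, 256, 512):
    print(n_h, gru_params(100, n_h), lstm_params(100, n_h))  # grows roughly as n_h**2
```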
Conclusion
The GRU is a powerful recurrent unit that uses adaptive gating to address the vanishing gradient problem of RNNs and to learn and retain long-term dependencies. It is simpler than the LSTM yet performs comparably, making it a strong alternative for sequence modeling.