The Neural Turing Machine (NTM) is a neural network architecture that couples a standard neural network with external memory resources to greatly extend its capabilities. It draws on biological models of working memory and on digital computer designs such as the Turing Machine and the Von Neumann architecture.

Recurrent neural networks (RNNs), if properly constructed, can in principle imitate arbitrary procedures, but learning to do so is not always easy. The NTM eases the learning of algorithmic processes by augmenting a standard recurrent network with a large, addressable memory. Unlike a Turing Machine, an NTM is a differentiable computer and can therefore learn its programs by gradient descent.
Core Mechanisms and Architecture of Neural Turing Machines
An NTM architecture consists of two basic components: a neural network controller and a memory bank.
The controller network, like most neural networks, interacts with the external world through input and output vectors.
Critically, it also interacts with a memory matrix through selective read and write operations. The network outputs that parameterize these operations are called “heads”.
Whether the controller is a recurrent network (such as an LSTM) or a feedforward network is an important architectural choice.
A recurrent controller such as an LSTM has its own internal memory that supplements the NTM’s larger external memory matrix. Its hidden activations let it combine information across time steps, much like the registers of a CPU.
A feedforward controller can mimic a recurrent network by reading from and writing to the same memory location at every step, and it offers greater transparency, since its memory interaction patterns are easier to interpret than the internal state of an RNN. However, the number of concurrent read and write heads limits the kind of computation a feedforward-controlled NTM can perform at each step: with one read head it can apply only a unary transform to a memory vector, with two a binary transform, and so on. A recurrent controller avoids this limitation because it can store read vectors from previous time steps internally.
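To make the controller–memory coupling concrete, here is a minimal sketch of a single NTM step with a feedforward controller, one read head, and one write head. The class and parameter names (NTMCell, mem_rows, mem_width) are illustrative rather than from the original paper, and the head weightings are taken as given; the addressing mechanism that produces them is described in the next section.

```python
import numpy as np

class NTMCell:
    """Minimal sketch of one NTM time step: a controller emits an output
    vector plus parameters for one read head and one write head that
    address an external memory matrix (names are illustrative)."""

    def __init__(self, input_size, output_size, mem_rows=128, mem_width=20, seed=0):
        rng = np.random.default_rng(seed)
        self.memory = np.zeros((mem_rows, mem_width))        # external memory M
        # One linear map standing in for a feedforward controller network.
        ctrl_out = output_size + 2 * mem_width               # output + add/erase params
        self.W = rng.normal(0, 0.1, (ctrl_out, input_size + mem_width))
        self.prev_read = np.zeros(mem_width)                 # read vector fed back as input
        self.output_size = output_size
        self.mem_width = mem_width

    def step(self, x, read_weights, write_weights):
        """x: external input; *_weights: normalized weightings over memory rows."""
        h = self.W @ np.concatenate([x, self.prev_read])
        y = h[: self.output_size]                            # external output
        add_vec = np.tanh(h[self.output_size : self.output_size + self.mem_width])
        erase_vec = 1 / (1 + np.exp(-h[self.output_size + self.mem_width :]))
        # Blurry write: erase then add (see the write equations below).
        self.memory *= 1 - np.outer(write_weights, erase_vec)
        self.memory += np.outer(write_weights, add_vec)
        # Blurry read: convex combination of memory rows.
        self.prev_read = read_weights @ self.memory
        return y
```

In the full model the controller would also emit the addressing parameters (keys, gates, shift weightings) from which read_weights and write_weights are computed.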

Memory Bank and Operations of Neural Turing Machines
The NTM’s memory matrix is accessed by “blurry” read and write operations. Instead of addressing a single memory element, as a digital computer does, each operation interacts to some degree with all of the memory elements.
The degree of blurriness is governed by an attentional “focus” mechanism that constrains each read or write to a small portion of memory while ignoring the rest. This sparse interaction biases the NTM toward storing data without interference.
Memory locations are addressed by weightings: specialized outputs from the heads define a normalized weighting over the rows of the memory matrix (the memory “locations”), which determines where the attention is focused.
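One way the NTM produces such a weighting is content-based addressing: a key vector emitted by the head is compared against every memory row with cosine similarity, and the scaled similarities are passed through a softmax. The sketch below assumes a key `key` and a positive sharpness scalar `beta`; location-based addressing (the relative shifts mentioned later for the copy task) uses a separate gating and shift mechanism not shown here.

```python
import numpy as np

def content_weighting(memory, key, beta):
    """Normalized attention over memory rows from cosine similarity.

    memory: (rows, width) matrix M_t; key: (width,) vector emitted by a head;
    beta: positive scalar that sharpens or flattens the focus.
    """
    eps = 1e-8                                   # avoid division by zero
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps
    )
    scores = beta * sims
    scores -= scores.max()                       # numerically stable softmax
    w = np.exp(scores)
    return w / w.sum()                           # sums to 1 across locations
```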
The read vector $r_t$ returned by a head is a convex combination of the row vectors $M_t(i)$ in memory, weighted by $w_t(i)$. This operation is differentiable with respect to both the memory and the weighting.
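Spelled out, with the weighting normalized so that $\sum_i w_t(i) = 1$ and $0 \le w_t(i) \le 1$, the read is

$$ r_t \leftarrow \sum_i w_t(i)\, M_t(i). $$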
The write operation is divided into two parts: an erase and an add, inspired by the input and forget gates of LSTM. A write weighting $w_t$ and an erase vector $e_t$ modify the preceding memory vectors $M_{t-1}(i)$.
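In equation form (with $a_t$ denoting the add vector emitted by the head, and the product with $e_t$ taken element-wise), the erase and add steps are

$$ \tilde{M}_t(i) = M_{t-1}(i)\big[\mathbf{1} - w_t(i)\, e_t\big], \qquad M_t(i) = \tilde{M}_t(i) + w_t(i)\, a_t, $$

so a location is fully reset only where both the weighting and the erase vector are close to one, and both steps remain differentiable.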
Training and Experimental Performance of Neural Turing Machines
The NTM is end-to-end differentiable, so it can be trained with gradient descent. All experiments used the RMSProp algorithm with momentum 0.9 and clipped gradient components. Because the tasks are episodic, the network’s dynamic state was reset at the start of each input sequence. The experiments tested the NTM’s ability to solve tasks by learning compact internal programs that generalize beyond the training data.
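As a rough illustration of that training setup (not the authors’ original code), the PyTorch snippet below pairs RMSProp with momentum 0.9 and element-wise gradient clipping; the stand-in model, learning rate, and the ±10 clipping range are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be the NTM controller plus heads.
model = nn.LSTM(input_size=8, hidden_size=100, batch_first=True)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4, momentum=0.9)

def train_step(inputs, targets, loss_fn):
    optimizer.zero_grad()
    outputs, _ = model(inputs)            # hidden state starts fresh each episode
    loss = loss_fn(outputs, targets)
    loss.backward()
    # Clip each gradient component to a fixed range before the update.
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=10.0)
    optimizer.step()
    return loss.item()
```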
The experiments compared NTMs to LSTM networks on several algorithmic tasks:
Copy Task: Evaluates the NTM’s capacity to store and recall a long sequence of random data.
- NTM (with either a feedforward or an LSTM controller) outperformed LSTM alone in learning speed and cost, indicating a qualitative difference in how the problem is solved.
- NTM generalized to sequences far longer than those seen in training (trained on lengths up to 20, tested at lengths of 100 and 120), while LSTM degraded rapidly beyond the training lengths.
- Analysis of the NTM’s memory usage shows that it writes input vectors to distinct locations during the input phase and reads them back during the output phase, using content-based addressing to jump to the start of the sequence and location-based addressing (relative shifts) to move along it.
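For concreteness, a toy generator for copy-task episodes might look like the following. The 8-bit random vectors and the 1–20 length range roughly match the reported setup; the exact channel layout (data channels plus one delimiter channel) is an assumption for illustration.

```python
import numpy as np

def copy_episode(rng, max_len=20, bits=8):
    """Build one copy-task episode: present a random binary sequence followed
    by a delimiter flag, then require the network to reproduce the sequence
    while it receives no further input (channel layout is an assumption)."""
    length = rng.integers(1, max_len + 1)
    seq = rng.integers(0, 2, size=(length, bits)).astype(np.float32)

    # Inputs: `bits` data channels plus one delimiter channel.
    inputs = np.zeros((2 * length + 1, bits + 1), dtype=np.float32)
    inputs[:length, :bits] = seq
    inputs[length, bits] = 1.0                     # delimiter marks end of input

    # Targets: blank during the input phase, then the sequence to reproduce.
    targets = np.zeros((2 * length + 1, bits), dtype=np.float32)
    targets[length + 1 :, :] = seq
    return inputs, targets

rng = np.random.default_rng(0)
x, y = copy_episode(rng)        # x: (2*len+1, 9), y: (2*len+1, 8)
```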
Repeat Copy Task: Extends the copy task by requiring the network to output the copied sequence a specified number of times.
- NTM learned much more quickly than LSTM, and both solved the task correctly within the training data range.
- When generalizing beyond the training data, NTM coped better with longer sequences and larger repeat counts, while LSTM’s performance suffered. NTM sometimes mispredicted the end marker for very large repeat counts, possibly because of how the repeat number is represented numerically.
- NTM used its memory sensibly, extending the copy method with repeated sequential reads and, when needed, something like a “goto statement” to jump back to the start of the stored sequence.
Associative Recall Task: Tests the NTM’s capacity for indirection by presenting a list of items and then querying with one item; the network must return the next item in the list.
- NTM reached near-zero cost within roughly 30,000 episodes, whereas LSTM did not approach zero cost even after a million episodes.
- The NTM with a feedforward controller learned faster than the NTM with an LSTM controller on this task.
- NTM also generalized much better than LSTM to longer item sequences. These findings suggest that NTM’s external memory maintains data structures more effectively than LSTM’s internal state.
Dynamic N-Grams Task: Assesses whether the NTM can adapt rapidly to new predictive distributions by using its memory to keep transition statistics (counts).
- NTM showed a small performance edge over LSTM, but it never reached the optimal cost.
- Analysis suggests the NTM controller used its memory to count the ones and zeros observed after each context, mimicking the optimal estimator (sketched below).
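Under the reported setup (binary sequences generated from 6-gram distributions with Beta(1/2, 1/2) priors), the optimal estimator mentioned above reduces to a count-based rule. The helper below is a hypothetical illustration of that rule; a 6-gram model corresponds to context_len=5, while the usage example uses shorter contexts for brevity.

```python
from collections import defaultdict

def ngram_optimal_predictions(bits, context_len=5):
    """Online Bayesian estimate of P(next bit = 1) under a Beta(1/2, 1/2)
    prior on each context's emission probability: (N1 + 0.5) / (N0 + N1 + 1),
    where N0/N1 count the zeros/ones previously seen after the same context."""
    counts = defaultdict(lambda: [0, 0])            # context -> [N0, N1]
    preds = []
    for t in range(context_len, len(bits)):
        ctx = tuple(bits[t - context_len : t])
        n0, n1 = counts[ctx]
        preds.append((n1 + 0.5) / (n0 + n1 + 1.0))  # probability next bit is 1
        counts[ctx][bits[t]] += 1                   # then update with the true bit
    return preds

print(ngram_optimal_predictions([0, 1, 0, 1, 1, 0, 1, 0, 1, 1], context_len=2))
```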
Priority Sort Task: Assesses the NTM’s ability to sort data.
- NTM with both feedforward and LSTM controllers considerably outperformed LSTM on this task.
- NTM used the input priorities to determine where each vector was written, then read from the memory locations in increasing order to traverse and emit the sorted sequence.
Parameter Efficiency: A key difference is parameter count. An LSTM’s recurrent connections make its parameter count grow quadratically with the number of hidden units, whereas an NTM’s parameter count does not grow with the number of memory locations. On the Copy Task the LSTM network contained about 1.3 million parameters, while the NTM variants had roughly 17k–67k; despite this, NTMs beat the LSTM networks in numerous experimental settings.
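A rough, illustrative check of that scaling (the layer sizes below are chosen arbitrarily, not taken from the paper):

```python
def lstm_params(input_size, hidden_size):
    """Parameters of a single LSTM layer: four gates, each with input,
    recurrent, and bias weights. The hidden_size**2 term is the quadratic part."""
    return 4 * (hidden_size * input_size + hidden_size**2 + hidden_size)

# Tripling the hidden size multiplies the recurrent (h^2) parameters by 9x ...
print(lstm_params(8, 256))    #   271,360
print(lstm_params(8, 768))    # 2,386,944
# ... whereas enlarging an NTM's memory matrix (e.g. 128x20 -> 1024x20 cells)
# adds no trainable parameters, since the memory is not a weight matrix.
```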
Conclusion
Neural Turing Machines explicitly incorporate a differentiable external memory, which makes them powerful for learning algorithms and handling long-term dependencies. Their ability to generalize to unseen sequence lengths and task complexities, together with their faster learning and greater parameter efficiency compared with LSTMs, makes them a promising architecture for complex algorithmic problems.