Paper review: Neural Turing machines

Paper: “Neural Turing Machines”, A. Graves et al., 2014

The paper introduces an architecture of a Neural network that can write and read information to external memory selectively. The architecture bears resemblance with Von Neumann architecture of computers, proposed by J. von Neumann et al. 1945. In short, von Neumann model has CPU, memory and I/O units and is an instance of Universal Turing Machine proposed by A. Turing in 1936. I assume the reader is familiar with these essential concepts of Computer Science and will not go into further details.

Before diving into Neural Turing Machines (NTM) we can gain better understudying of motivation behind them by looking at the current Artificial Neural Network (ANN) models: Feedforward and Recurrent Neural Networks(RNN). After we train a model of ANN the data is distributed across the layers of the network in an abstract form. The distributed nature of the data means we use the whole model for getting information rather then isolated parts of the model. Let’s say we have a ANN that processes a text and outputs its category, in other words performs a classification. If we take a part of neurons from the network we will not see clear mapping of words into neuron weights and will not get much use of these subset of neurons isolated from the network. Similarly, removing a part of neurons from a network will affect its performance. This type of data representation is also called Distributed Representation (DR) and is different from how the data stored in “localist representation” (term by G. Hinton from CSC321 course, University of Toronto) where one neuron is responsible for one thing. DR representation has advantages over “localist representation” although this topic deserves a separate post. In short, the reason why DR models are mainstream is because they are more efficient compared to “localist representation” models.

Even though DR models have advantages it is difficult to add (distribute) new data to the trained model. The importance of it issue depends on a use case of the model. For example, if we trained an ANN to recognize images of traffic signs we can use the trained model for a long period of time without modification because new signs are not frequently introduced. On the other hand, if we use ANN for NLP tasks e.g. news articles analysis we will need to add large sets of data continuously and populate our model with new knowledge. In this case there is a need to increase the amount of data in the model arbitrarily and with the least number of changes to the model. Rather than increasing the size of network we can use external memory that will serve as a knowledge base for our model. This architecture will have memory and control unit that supports I/O operations on memory which makes it an instance of Turing Machine which is capable of Turing-complete computing.

NTM is not the only ANN model that is Turing-complete. Its competitor is a family of Recurrent Neural Networks (RNN) that are also capable of Turing-complete computing in theory however it is challenging to train them for large data sequences (time steps). Even though improved RNN architectures such as Long short-term memory (LSTM) [S. Hochreiter & J. Schmidhuber, 1997] increase capabilities of RNN’s in processing longer sequences, its memory cells still reflect global changes over each step in a data sequence. There has been a constant search for new models that can solve the problem in a simpler and less computationally intensive way by performing local updates of memory. NTM aims to solve this particular problem by introducing an model with addressing mechanisms that can selectively modify data in a memory structure.

Core element of NTM architecture is a ANN called controller unit that interacts with external input and output and memory matrix (Figure 1). Interaction with the memory matrix is performed with read and write heads via weight vectors. Weights in each vector determine their effect on respective locations in the weight matrix by changing degree of attention to a particular location. During read or write operation a weight vector goes through a set of four operations (content addressing, interpolation, convolutional shift, sharpening) that are differentiable and allow us to train the network with gradient descent. In addition to content addressing, NTM has a second addressing mechanism “location-based addressing” that can be used instead of content-based addressing depending on the value of interpolation operation. Location-based addressing iterates through memory locations with rotational shift of weights and can move either forward or backward.

Figure 1: NTM architecture

Controller unit is implemented by ANN and can be either feedforward network or RNN. In a set of experiments that include tasks such as copy, recall and sort of binary vectors, both LSTM based controller and feedforward based controller demonstrated advantage over regular LSTM networks in training time and generalization capability. The results suggest that NTM is capable of generalizing an operation (e.g. copy) over data sequences longer than it was trained with, exceeding the capabilities of its LSTM competitor. In contrast, LSTM performance will degrade significantly as the sequence size growth beyond the one it was originally trained with. The authors assume that NTM may have learned the operation as an algorithm and base their assumption on visualization of NTM memory state during copy task. This is an interesting theory that will need further research. Another interesting observation is much lower number of parameters in NTM compared to LSTM on the same tasks, even when using NTM with LSTM based controller. Number of parameters in LSTM network grew quadratically with increase of hidden neurons and was an order of magnitude higher on a few particular tasks: copy, repeat copy and associative recall.

Additionally, the experiments indicate differences in performance of feedforward and RNN based controllers of NTM. The paper does not explore their differences in details but it seems that performance of both architectures in controller depends on a particular task. We may need further research of the impact of feedforward and recurrent networks in controller unit on the overall performance to better understand reasons behind it. Unfortunately, there was also a significant number of tuning of their model including a particular number of concurrent heads and controller network hyper-parameters that were not properly addressed in the paper.

In conclusion, NTM model looks promising and contributes to an interesting direction in Deep Learning that aims to augment ANN with external memory. This direction has been gaining attention recently, both from academia and the industry. For example, the paper [J. Weston et al., 2014] from Facebook AI group was also published in October and proposes another interesting approach to external memory in ANN.


  • Graves, Alex, Greg Wayne, and Ivo Danihelka. “Neural turing machines.” arXiv preprint arXiv:1410.5401 (2014).
  • Weston, Jason, Sumit Chopra, and Antoine Bordes. “Memory networks.” arXiv preprint arXiv:1410.3916 (2014).
  • Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.” Neural computation 9, no. 8 (1997): 1735-1780.