I always strive to make my posts accessible and easy to understand. However, you may need a basic understanding of Neural Networks and NLP to read this article. I highly recommend the Stanford course CS 224d: Deep Learning for Natural Language Processing. The course materials are available online, including lecture videos on YouTube and lecture slides on the course website. It is worth watching the whole course if you have time. Otherwise, please check lectures 2 and 3 for information about word embeddings and the Word2Vec model, and lecture 7 for Recurrent Neural Networks.
The goal of this series of articles is to demonstrate how to create a Neural Network application for sentiment analysis. The Neural Network will be trained to determine whether the sentiment of user reviews is positive or negative. The concepts used in this example can be applied to more complex sentiment analysis. In Part 1 I will go through the main components of the application. The application will use the Word2Vec word embedding model and an LSTM Neural Network implemented in Keras. Part 2 will focus on the implementation of the app.
If you are familiar with Artificial Neural Networks (ANN), you may have noticed that an ANN model is essentially a set of functions applied to vectors and matrices. Some kinds of data, like sensor readings or financial data, are already numeric and therefore easy to feed to an ANN. Text, on the other hand, is more challenging to present in numeric format. Sure, we have ASCII or Unicode codes for characters, but not for words. Even if we assign a unique code to each word, we still have to consider the forms of words, their relations in the text, etc. In other words, we need to consider morphology, syntax and semantics when analyzing text. Describing such logic in software requires enormous resources even for one language.
Word embedding methods were created to solve this problem by mapping words to vectors. Each vector has multiple dimensions that together store information about the word. The information is encoded in the location of the word in a vector space created by the model. As a result, it is possible to use existing mathematical methods (e.g. the dot product) to measure similarity between word vectors. The number of dimensions may vary even between models based on the same word embedding method. Popular models (e.g. Word2Vec, GloVe) typically have a few hundred dimensions.
In this article I am using the Word2Vec model (Mikolov et al., 2013), although you will get similar results with other word embedding models, e.g. GloVe (Pennington et al., 2014). If you are confused by the Word2Vec model, try to imagine it as an autoencoder that reduces the high-dimensional space of words to more compact vectors. Of course, this is a very rough description, but it can give you a basic idea.
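As a rough illustration of comparing word vectors, here is a minimal sketch that computes cosine similarity (a dot product normalized by the vector lengths). The 4-dimensional vectors are invented for this example and do not come from a real Word2Vec model, and the code uses plain Python rather than a numeric library to stay self-contained.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, invented for illustration only.
toy_vectors = {
    "good":  [0.9, 0.1, 0.3, 0.0],
    "great": [0.8, 0.2, 0.4, 0.1],
    "awful": [-0.7, 0.1, 0.2, 0.9],
}

sim_good_great = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_good_awful = cosine_similarity(toy_vectors["good"], toy_vectors["awful"])

# With these made-up values, "good" lies closer to "great" than to "awful".
print(sim_good_great > sim_good_awful)
```

In a trained model the same comparison would be made on vectors with a few hundred dimensions, but the arithmetic stays the same.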
Recurrent Neural Networks
Now that we have decided how to convert text into vectors, we can think about the type of Artificial Neural Network that will process them. The network needs to receive a sequence of vectors as input and take the order of the vectors into account. With a Feedforward Neural Network we would need to input the whole sequence at once, because the network does not keep any state from previous data samples. Even a relatively small text of 100 words with 300-dimensional vectors would result in 30,000 neurons in the input layer alone. Alternatively, we can use a Recurrent Neural Network (RNN), which processes a sequence one data sample at a time while keeping track of the state for that particular sequence. Figure 1 shows feedforward (1) and recurrent (2) neural networks.
Even though an RNN may seem complicated at first glance, you can think of it as a feedforward network that has multiple layers or combines a few smaller networks in one. In this view, the number of additional layers or smaller networks equals the number of data samples in the sequence (also known as time steps). The result of one time step is fed into the RNN layer at the next time step, effectively carrying the state forward. This is illustrated in Figure 2.
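The input sizes mentioned above can be checked with a few lines of Python. This is only a sketch with a placeholder sequence of zeros; the constant names are chosen for this example.

```python
SEQ_LEN = 100     # words in the text
EMBED_DIM = 300   # dimensions per word vector

# Placeholder sequence: 100 word vectors filled with zeros.
sequence = [[0.0] * EMBED_DIM for _ in range(SEQ_LEN)]

# A feedforward network needs the whole text at once, flattened into one vector.
flat_input = [value for vector in sequence for value in vector]
feedforward_input_size = len(flat_input)

# A recurrent network reads one word vector per time step.
recurrent_input_size = len(sequence[0])

print(feedforward_input_size, recurrent_input_size)  # 30000 300
```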
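To make the idea of carrying state between time steps concrete, here is a minimal sketch of the forward pass of a vanilla RNN in plain Python. It assumes the common formulation h_t = tanh(W_x x_t + W_h h_{t-1} + b). The sizes, the random weights, and the names (`rnn_step`, `rnn_forward`) are made up for this example, and no training is shown.

```python
import math
import random

def rnn_step(x, h_prev, W_x, W_h, b):
    """One time step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    h_new = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

def rnn_forward(sequence, W_x, W_h, b):
    """Run the same step over every vector, passing the state along."""
    h = [0.0] * len(b)  # initial state
    for x in sequence:
        h = rnn_step(x, h, W_x, W_h, b)
    return h

random.seed(0)
INPUT_DIM, HIDDEN_DIM, STEPS = 3, 2, 5  # toy sizes, assumed for the example

# The same weights are reused at every time step.
W_x = [[random.uniform(-0.5, 0.5) for _ in range(INPUT_DIM)] for _ in range(HIDDEN_DIM)]
W_h = [[random.uniform(-0.5, 0.5) for _ in range(HIDDEN_DIM)] for _ in range(HIDDEN_DIM)]
b = [0.0] * HIDDEN_DIM

sequence = [[random.uniform(-1.0, 1.0) for _ in range(INPUT_DIM)] for _ in range(STEPS)]
final_state = rnn_forward(sequence, W_x, W_h, b)
print(final_state)  # one value per hidden unit
```

The loop in `rnn_forward` is the "unrolled" network: five time steps behave like five stacked layers that share one set of weights.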
Unfortunately, with a high number of time steps, training an RNN becomes challenging. The main problems are exploding and vanishing gradients, which arise because the gradient is repeatedly multiplied by the weights of the network, once per time step. In short, if the weights are small, the gradient shrinks with each iteration (vanishes) and hinders the training of the network. Similarly, large weights increase the gradient, and after multiple iterations cause it to “explode”, which also negatively affects the training process. A real-world metaphor is compound interest, where a duration (sequence length) of hundreds of years results in very high growth of the capital. While exploding gradients can be handled by simple solutions like gradient clipping, vanishing gradients require changes to the architecture, such as Long Short-Term Memory (LSTM), introduced by Hochreiter and Schmidhuber (1997). The LSTM architecture deserves its own post and is out of the scope of this article. If you don’t have time to read articles on LSTM, it is enough to know that its input and output are similar to those of a vanilla RNN. LSTM has more logic under the hood, such as gates controlling the memory of neurons, which supplements the underlying RNN architecture. These modifications make it possible to use longer sequences in a network without running into the vanishing gradient problem.
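The compound-interest effect is easy to reproduce numerically. The sketch below is a deliberate simplification: it treats the gradient as a single number multiplied by the same weight at every time step, whereas in a real RNN the factors are matrices and also include the derivative of the activation function. The helper names and the values 0.9, 1.1 and 100 are chosen for this example. The second function shows one common form of gradient clipping, rescaling by the L2 norm.

```python
import math

def repeated_factor(weight, steps):
    """Scale of a signal multiplied by the same weight at every time step."""
    return weight ** steps

def clip_by_norm(gradient, max_norm):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]

STEPS = 100
vanishing = repeated_factor(0.9, STEPS)  # shrinks to roughly 2.7e-05
exploding = repeated_factor(1.1, STEPS)  # grows to roughly 1.4e+04
print(vanishing, exploding)

# Clipping keeps the direction of the gradient but caps its length.
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
print(clipped)  # approximately [0.6, 0.8]
```

Clipping only helps with the exploding case; a gradient that has already shrunk towards zero cannot be recovered by rescaling, which is why the vanishing case calls for an architectural change.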
We have already defined two essential parts of the system, and now we need to see how to use LSTM and Word2Vec together. Let’s say we have a set of texts with positive and negative reviews. In this case, we need to convert the words of each text into a sequence of vectors with the Word2Vec model. Each resulting sequence of vectors will be a separate data sample passed to the input of the LSTM. After the LSTM goes through all the vectors in a sequence, it will output a vector with the probabilities of the text belonging to the positive or negative category. In general Machine Learning terms this is an example of a classification task, and in the context of text analysis it is also known as Sentiment Analysis. The major components of the application and their interactions are shown in Figure 3.
The neural network (LSTM) will be implemented with the Keras framework, which is based on Theano. In an older post I wrote about Theano and provided a short example. Keras is a high-level framework that provides a set of implemented layers and the infrastructure for creating deep learning models. The framework is flexible and allows a wide range of adjustments to models. However, it is also possible to use the default layers with minimal configuration and still get decent results. This simplifies the development of new models and helps with prototyping new ideas during the research phase. Please visit the official Keras repository for more information about the framework.
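The data flow described above can be sketched without any framework. In the example below, the embedding table is a hypothetical stand-in for a trained Word2Vec model, unknown words are simply skipped (an assumption made here, not necessarily what the final application does), and the two raw scores passed to `softmax` are invented to stand in for the output of a trained LSTM.

```python
import math

# Hypothetical 3-dimensional embeddings; a real Word2Vec model would supply these.
toy_embeddings = {
    "great":  [0.8, 0.2, 0.4],
    "movie":  [0.1, 0.7, 0.3],
    "boring": [-0.6, 0.1, 0.5],
}

def text_to_vectors(text, embeddings):
    """Convert a text into a sequence of word vectors, one per known word."""
    tokens = text.lower().split()
    # Assumption: words missing from the vocabulary are skipped.
    return [embeddings[token] for token in tokens if token in embeddings]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One review becomes one data sample: a sequence of vectors.
sample = text_to_vectors("Great movie indeed", toy_embeddings)
print(len(sample))  # 2, because "indeed" is not in the toy vocabulary

# Invented raw scores (negative, positive) standing in for the LSTM output.
probabilities = softmax([0.3, 1.9])
print(probabilities)
```

Each review yields one such sequence, and the network's final output for that sequence is a pair of probabilities for the negative and positive classes.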
This is the end of Part 1. In Part 2 I will focus on the implementation of the application.