Sentiment analysis with RNN in Keras, Part 2

[Update from 17.04.2016]: The code examples were updated to Keras 1.0 and should work with future 1.0.x versions of Keras.

In the previous part I covered the basic concepts used in the application. Today I will show how to implement it with Keras. I will try to show only the parts of the code related to Keras and not overburden the reader with infrastructure code.

In my example I use a dataset of labeled movie reviews from IMDB, used in “Learning Word Vectors for Sentiment Analysis” (Maas et al., 2011). The dataset has 25,000 positive and 25,000 negative reviews. In case the dataset is removed from the Stanford AI website (http://ai.stanford.edu/~amaas/data/sentiment/), you can build a reviews dataset from IMDB itself or use a completely different one. I split the dataset into three parts: the training and validation parts together make up 75% of the data, divided 80/20 between them (i.e. 60% and 15% of the whole dataset), and the remaining 25% is the test set. You are free to use different split ratios.
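
For reference, here is a minimal sketch of such a split. The helper below, its name, and the shuffling step are my own illustration (assuming the reviews are available as a list of file names), not part of the dataset tooling:

import random

def split_dataset(files, train_ratio=0.6, val_ratio=0.15):
    #shuffle so positive and negative reviews are mixed before splitting
    random.shuffle(files)
    n_train = int(len(files) * train_ratio)
    n_val = int(len(files) * val_ratio)
    #60% training, 15% validation, 25% test of the whole dataset
    return files[:n_train], files[n_train:n_train + n_val], files[n_train + n_val:]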

Word2Vec

First, let’s review the high-level components of our application that we defined in Part 1. We have a Word2Vec model for transforming words into vectors that are used as input for an LSTM network. Their interaction is shown in Figure 1. We will start with the Word2Vec model and its implementation for Python. There is a popular open-source module, gensim, that provides this functionality. Please check its repository for more information about the implementation and documentation.

After finding a module capable of working with Word2Vec models, we need a model itself. There are two options: train a new model yourself or use a pre-trained one. While gensim provides an option to train your own model, this requires significant time and resources and is outside the scope of this article. Therefore, we will look for a pre-trained model, preferably one trained on a large dataset. There are a few pre-trained Word2Vec models available online that fit this requirement. For example, we can use the pre-trained model from Google Word2Vec on Google Code, trained on about 100 bln words from the Google News dataset. The model is in the archive GoogleNews-vectors-negative300.bin.gz. Here is how to load it with gensim:

from gensim.models import Word2Vec

def load_w2v():
    _fname = "/home/walter/Dev/Data/GoogleNews-vectors-negative300.bin"
    #load the pre-trained Google News vectors in binary word2vec format
    w2vModel = Word2Vec.load_word2vec_format(_fname, binary=True)
    return w2vModel

Please note that the model may take a few minutes to load on average hardware and requires at least 4 GB of RAM. It seems that in its current state gensim does not utilize multiple cores while loading a model. On my PC with a quad-core i5-4690K and an SSD it maxes out one core and typically takes about 40 seconds to load, which is rather slow. Computing word vectors also takes some time, even for a single text sample. Considering that over 30,000 text samples will be passed to the neural network over multiple iterations, this may slow down the training process. Therefore, I decided to save the intermediate result of vectorizing the words to files on disk that can be quickly loaded. To simplify the loading logic, I saved the vectors of positive and negative reviews to separate directories. Saving vectors to files is an optional step for better performance, and you can skip it.
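
A minimal sketch of this caching step might look as follows. The function name, file-naming scheme, and directory arguments are my own assumptions; the idea is simply to dump each review's vector matrix with np.save so it can be read back quickly with np.load later:

import numpy as np
from os.path import join

def save_review_vectors(reviews, labels, pos_dir, neg_dir):
    #reviews: a list of (words, 300) numpy arrays produced with the Word2Vec model
    #labels: 1 for a positive review, 0 for a negative one
    for i, (vectors, label) in enumerate(zip(reviews, labels)):
        target_dir = pos_dir if label == 1 else neg_dir
        #np.save writes the array in NumPy's binary .npy format, which
        #np.load reads back much faster than re-vectorizing the text
        np.save(join(target_dir, "review_%d.npy" % i), vectors)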

Please note that because we use a pre-trained W2V model from Google, there can be words in the text that are not present in the model. Looking them up in the gensim Word2Vec model raises an error, and even if we handle these exceptions we still can’t transform the missing words into vectors. Thus, you may want to filter them out. Here is a code example that shows how to do it:

    #tokenize with your preferred tokenizer
    tokens = tokenize(text)
    #keep only the tokens that the model has a vector for
    filteredTokens = filter(lambda x: x in w2vModel.vocab, tokens)

In our dataset we have two kinds of reviews: positive and negative. The most naive approach would be to pass all positive samples first and then do the same with the negative samples. However, studies show that this may harm training and the model’s ability to generalize. A better approach is to alternate between positive and negative samples during training. We can achieve this with a custom iterator. Here is an example of a simple iterator that builds a set of mixed samples from the files with vectors. It also allows you to get the data in smaller portions, which may be useful if you don’t have powerful hardware and don’t want to use the whole training set while experimenting with the model. If you don’t want to save the vectors to files, you can either keep all text vectors in RAM (if you have enough of it) or convert text to vectors on the fly.

import numpy as np
from os.path import join

class DataIterator:
    def __init__(self, data_path, batch_size=1000):
        #get_data_file_list returns the lists of positive and negative vector files
        pos_files, neg_files = get_data_file_list(data_path)
        self.data_path = data_path
        self.pos_iter = iter(pos_files)
        self.neg_iter = iter(neg_files)
        self.batch_size = batch_size

    def get_next(self):
        vectors = []
        values = []
        #alternate between positive and negative samples until the batch is full
        while len(vectors) < self.batch_size:
            file = next(self.pos_iter, None)
            if file is None:
                break
            vectors.append(np.load(join(self.data_path, file)))
            values.append([1])

            file = next(self.neg_iter, None)
            if file is None:
                break
            vectors.append(np.load(join(self.data_path, file)))
            values.append([0])
        return np.array(vectors), np.array(values)

Alternatively, you can pass a data generator to the Keras model. Please note that you will need to use model.fit_generator() in this case and specify the number of samples generated per epoch for training and validation. In this example we have files with numpy vectors stored in a single directory, with the prefixes “neg” and “pos” for negative and positive reviews respectively.

import numpy as np
from os import listdir
from os.path import isfile, join

def get_generator(data_path):
    files = [f for f in listdir(data_path) if isfile(join(data_path, f))]
    #loop forever: Keras generators must yield batches indefinitely
    while 1:
        for file in files:
            x = np.load(join(data_path, file))
            x = np.array([x])  #add the batch dimension
            label = get_label(file)
            y = np.array([[label]])
            yield (x, y)

def get_label(filename):
    #the file prefix encodes the class: "pos..." -> 1, "neg..." -> 0
    if filename.startswith("pos"):
        return 1
    return 0
#Fit the model with a generator
model.fit_generator(get_generator("training data path"),
                    samples_per_epoch=25000, nb_epoch=40,
                    validation_data=get_generator("test data path"), nb_val_samples=25000,
                    callbacks=cbks)

LSTM Network

Now that we have a module with a Word2Vec implementation and a pre-trained model, we can start working on the LSTM network. Our pre-trained model has vectors with 300 dimensions, hence our input layer will have size 300. However, we don’t know how long our sequences will be: how many words we will feed to the network per text. The texts have different lengths and we need to account for this. We have a few options:

  1. Use a variable sequence length. Con: no implementation in Keras or other popular frameworks, as far as I know; a custom implementation in a low-level library, e.g. Theano, is challenging; and no studies show a significant benefit from this approach.
  2. Use a single training sample instead of a minibatch. Con: negative impact on training time.
  3. Set a fixed sequence length and pad missing words with zeros. Con: too much padding will affect training and network precision.

Among these options, number 3 has the fewest problems and does not require changes to standard neural network models. The only thing we need to do is find the distribution of document lengths in our dataset. I created a small script and plotted a histogram (Figure 2) of the word count per document in the whole dataset. Based on the histogram, 350 words appears to be a reasonable choice for the sequence length in our application. Feel free to experiment with larger sequence sizes and more padding.

Figure 2: Word count distribution
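
The script itself is trivial, but for completeness here is a minimal sketch of how such a histogram could be produced (tokenize and w2vModel come from the earlier snippets; the bin count and labels are arbitrary choices of mine):

import matplotlib.pyplot as plt

def plot_word_counts(texts):
    #count, for each document, the tokens that have a vector in the model
    counts = [len([t for t in tokenize(text) if t in w2vModel.vocab])
              for text in texts]
    plt.hist(counts, bins=50)
    plt.xlabel("Words per document")
    plt.ylabel("Number of documents")
    plt.show()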

Keras provides a function for padding data sequences, keras.preprocessing.sequence.pad_sequences. Here is an example of how to use it for padding our data:

    from keras.preprocessing import sequence

    #tokenize with your preferred tokenizer and truncate to the sequence length
    tokens = tokenize(text)
    filteredTokens = filter(lambda x: x in w2vModel.vocab, tokens)[:timesteps]
    #populate the vectors array
    doc_vectors = []
    for token in filteredTokens:
        vector = w2vModel[token]
        doc_vectors.append(vector)
    #pad the vector sequence: each [0] placeholder below is expanded by
    #pad_sequences into a full zero vector of length `dimensions`
    while len(doc_vectors) < timesteps:
        doc_vectors.append([0])
    doc_vectors = sequence.pad_sequences(doc_vectors, padding='post', dtype='float32', maxlen=dimensions, value=0.)

Now that we know our sequence size, we can define the LSTM network model. The model has an input layer that takes sequences of 350 (time steps) 300-dimensional word vectors, a hidden LSTM layer with 200 neurons, and dropout for regularization. Binary classification between the positive and negative categories is done in the output layer with a sigmoid function. We use the binary cross-entropy loss function for training because the network outputs the probability of a data sample belonging to each category. If we had more classes we would use softmax and categorical cross-entropy. The code for creating and training the model is shown below.

import sys
from os.path import isfile
from keras import callbacks
from keras.layers import Dense, LSTM, Dropout
from keras.models import Sequential


def train():
    timesteps = 350
    dimensions = 300
    batch_size = 64
    epochs_number = 40
    model = Sequential()
    model.add(LSTM(200, input_shape=(timesteps, dimensions), return_sequences=False))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss="binary_crossentropy", optimizer='rmsprop', metrics=['accuracy'])
    fname = 'weights/keras-lstm.h5'
    #resume from previously saved weights, if any
    if isfile(fname):
        model.load_weights(fname)
    cbks = [callbacks.ModelCheckpoint(filepath=fname, monitor='val_loss', save_best_only=True),
            callbacks.EarlyStopping(monitor='val_loss', patience=3)]
    #get all available data samples from the data iterators
    #(train_data_path and test_data_path point to the directories with saved vectors)
    train_iterator = DataIterator(train_data_path, sys.maxint)
    test_iterator = DataIterator(test_data_path, sys.maxint)
    train_X, train_Y = train_iterator.get_next()
    test_X, test_Y = test_iterator.get_next()
    model.fit(train_X, train_Y, batch_size=batch_size, callbacks=cbks, nb_epoch=epochs_number,
              validation_split=0.2, shuffle=True)
    loss, acc = model.evaluate(test_X, test_Y, batch_size=batch_size)
    print('Test loss / test accuracy = {:.4f} / {:.4f}'.format(loss, acc))

Please note that the example takes a lot of computational resources. Running it on an NVIDIA GPU with CUDA support will greatly accelerate training of the network. For example, it runs over 7x faster on my GTX 770 compared to the i5-4690K. Therefore, I added a callback for the Keras model that is executed after each epoch and saves the weights of the neural net in a local file 'weights/keras-lstm.h5'. The weights are updated only if there is an improvement compared to previous iterations; this is enabled with the save_best_only=True option. A second callback performs early stopping to prevent overfitting: in this code example the network will stop training if there has been no improvement in the loss value on the validation set for 3 epochs. Even though I set 40 training epochs, the network will start to overfit much sooner due to the small dataset and weak regularization.
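
Once the best weights are saved, the network can be reused to classify new reviews. Here is a minimal sketch of such an inference step; prepare_vectors is a hypothetical helper standing in for the tokenize/filter/pad pipeline shown earlier:

def predict_sentiment(model, text):
    #prepare_vectors (hypothetical) applies the tokenize/filter/pad steps above
    #and returns an array of shape (timesteps, dimensions)
    x = prepare_vectors(text)
    x = np.array([x])  #add the batch dimension: (1, timesteps, dimensions)
    probability = model.predict(x)[0][0]  #sigmoid output in [0, 1]
    return "positive" if probability > 0.5 else "negative"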

Example of an output:

When using 30k, 7.5k, 12.5k for training, validation, test

Epoch 25

loss: 0.1610 - acc: 0.9413 - val_loss: 0.2340 - val_acc: 0.8993

Test loss / test accuracy = 0.2353 / 0.8991

…….

When using 25k, 7.5k, 17.5k for training, validation, test

Epoch 13

loss: 0.2513 - acc: 0.8965 - val_loss: 0.2698 - val_acc: 0.8915

Test loss / test accuracy = 0.2696 / 0.8913

As you can see from the output above, my network achieved 89.91% accuracy on the test set with 30k training samples and 89.13% with 25k training samples. With less data the model started to overfit sooner, and early stopping was triggered earlier (epoch 13 vs 25). It is worth noting that the model created by Maas et al., 2011 achieved 88.89% while using 50,000 unlabeled reviews in addition to the training data, whereas our model did not use unlabeled data at all. Compared with the model from Maas et al., 2011, ours has about 1% higher accuracy with 30k training samples and 0.24% higher accuracy with 25k training samples. In both cases it outperforms a model that was state of the art a few years ago. This is a good result for a rather simple model without any tuning or advanced techniques, and it demonstrates the potential of neural networks. However, our goal wasn’t to create a state-of-the-art network but to demonstrate the basics of building this kind of neural network in practice.

To summarize, in this article I showed how to create an LSTM neural network and train it to perform sentiment analysis on movie reviews. The article covered the major steps of the process: word embedding, basic data preparation, neural network design, implementation, and training.

References:

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (HLT '11), Vol. 1. Association for Computational Linguistics, Stroudsburg, PA, USA, 142-150.