Sentiment analysis on IMDB dataset

Introduction

This is an example from the excellent book on deep learning by François Chollet. My idea here is to further detail the explanation with the code output, which the book does not include. And since TensorFlow 2.0 has been released, I will be using tf.keras instead.

The goal of this example is to classify IMDB movie reviews as positive or negative using a small neural network built on top of a word embedding.

The steps are: load the data, preprocess it, build the model, configure it for training, train it, and analyze the results.

Load data

Load the IMDB data from the Keras datasets. The training set contains 25,000 movie reviews, each labeled by sentiment. The format is a list of length 25,000 where each entry is another list of integers representing a review. Each integer is the rank of a word by frequency, and we keep only the 10,000 most frequent words. We can see that the sentiment labels are 0 or 1.

from tensorflow.keras import datasets, preprocessing

max_features = 10000           # number of words

(train_data, train_targ), (test_data, test_targ) = datasets.imdb.load_data(
    num_words=max_features)    # limits to the most common ones

print(len(train_data), train_data[0])
print(set(train_targ))
print(train_targ)
25000 [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
{0, 1}
[1 0 0 ... 0 1 0]

With this bunch of numbers it is hard to grasp the real meaning of the data. Fortunately, we can convert it back to text using the word index also provided with the dataset.

index = datasets.imdb.get_word_index()
print(list(index.items())[:5])
[('fawn', 34701), ('tsukino', 52006), ('nunnery', 52007), ('sonja', 16816), ('vani', 63951)]

We can see that the word index is a dictionary where the key is the word and the value is its integer representation. Because we have the values and want the keys, it is useful to invert this dictionary so we can enter an integer and get out a word.

index_word = {value: key for (key, value) in index.items()}
print(index_word[11], index_word[19])
# offset of 3 because indices 0, 1 and 2 are reserved for
# padding, start-of-sequence and unknown words
comment = ' '.join([index_word.get(i - 3, '#') for i in train_data[0]])
print(comment)
this film
# this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert # is an amazing actor and now the same being director # father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for # and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also # to the two little boy's that played the # of norman and paul they were just brilliant children are often left out of the # list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all

Preprocessing

Transform the lists into 2D tensors with rows of length maxlen. Here we keep only 20 words per review; since pad_sequences pads and truncates at the beginning by default, it is the last 20 words of each review that are kept. This will affect the accuracy of the final trained neural network model: we would expect higher accuracy if more words were considered during training.
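
As a quick illustration (not part of the original example), here is how pad_sequences behaves on two toy sequences: the short one is padded with zeros on the left and the long one is truncated from the beginning, so only the last maxlen entries survive.

from tensorflow.keras import preprocessing

toy = [[1, 2, 3], [1, 2, 3, 4, 5, 6, 7]]
print(preprocessing.sequence.pad_sequences(toy, maxlen=5))
# [[0 0 1 2 3]
#  [3 4 5 6 7]]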

maxlen = 20
train_data = preprocessing.sequence.pad_sequences(train_data, maxlen=maxlen)
test_data = preprocessing.sequence.pad_sequences(test_data, maxlen=maxlen)
print(train_data)
[[  65   16   38 ...   19  178   32]
 [  23    4 1690 ...   16  145   95]
 [1352   13  191 ...    7  129  113]
 ...
 [  11 1818 7561 ...    4 3586    2]
 [  92  401  728 ...   12    9   23]
 [ 764   40    4 ...  204  131    9]]

Build the model

The goal now is to assemble a model that encapsulates the analysis. A Sequential model is used: a plain stack of layers in which each layer has exactly one input tensor and one output tensor.

Here we are adding multiple layers to our model. One of them is the embedding layer which turns positive integers into dense vectors of fixed sizes.1 The first argument is the size o the vocabulary (distinct words), in this case we have the first 10000 most common words. The second argument is the length of the output dense vector, in this case 8. Finally the input length is the length of each input that will be passed to the model. Therefore, this layer will convert each comment review data (limited to the first 20) and transform it in a dense vector of length 8.
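
To make the shapes concrete, here is a small standalone sketch (illustration only, with a fake batch of random indices) of what the embedding layer produces:

import numpy as np
from tensorflow.keras import layers

# each integer in a (batch, 20) input is mapped to an 8-dimensional vector
emb = layers.Embedding(10000, 8, input_length=20)
fake_batch = np.random.randint(0, 10000, size=(2, 20))   # two fake reviews of 20 word indices
print(emb(fake_batch).shape)   # (2, 20, 8)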

Then comes the Flatten layer, which simply flattens the embedding output of each review (20 vectors of length 8) into a single vector of length 160.

Finally we add the dense layer.

The dense layer is the one we are going to train and use for classification. It implements output = activation(dot(input, kernel) + bias), where the kernel is the weight matrix created by the layer. The activation function used here is the sigmoid, a nonlinear function that squashes the output into the interval (0, 1) and enables a clear prediction.2
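
For illustration only, with random numbers standing in for the trained weights, this is the computation the dense layer performs on one flattened review:

import numpy as np

# sketch of Dense(1, activation='sigmoid'): output = sigmoid(dot(input, kernel) + bias)
x = np.random.rand(160)           # flattened embedding of one review
kernel = np.random.rand(160, 1)   # weight matrix created by the layer
bias = np.zeros(1)
output = 1 / (1 + np.exp(-(x @ kernel + bias)))
print(output)                     # a single value between 0 and 1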

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()                       # model as a stack of layers
model.add(layers.Embedding(10000, 8, input_length=maxlen))
model.add(layers.Flatten())                      # flattens 3D tensor into 2D
model.add(layers.Dense(1, activation='sigmoid')) # output with 1 dimension only
model.summary()
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_4 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_4 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 161       
=================================================================
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
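
The parameter counts in the summary can be checked by hand: the embedding layer has 10,000 × 8 = 80,000 weights (one 8-dimensional vector per word), and the dense layer has 160 weights plus 1 bias, giving 161 parameters.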

Configure the model for training

The compile method configures the model for training. The optimizer argument specifies the algorithm used for this task. In this example we use the RMSprop algorithm, which stands for root mean square propagation.3 Optimization in this sense means minimizing a function; in this case we want to minimize the loss function.
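
Roughly speaking, RMSprop keeps a running average of the squared gradients and divides each gradient by the root of that average. The sketch below shows the core idea only; it is not Keras's exact implementation, and the variable names are my own.

import numpy as np

def rmsprop_step(theta, grad, s, lr=0.001, rho=0.9, eps=1e-7):
    # keep a moving average of the squared gradients
    s = rho * s + (1 - rho) * grad ** 2
    # scale the step by the root of that average
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s

theta, s = np.array([1.0]), np.zeros(1)
theta, s = rmsprop_step(theta, grad=np.array([0.5]), s=s)
print(theta, s)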

The loss argument is the objective function, or cost function, which indicates how well the model is predicting for given parameters. The one chosen was binary_crossentropy, a cross-entropy loss for two (binary) classes (positive and negative reviews). Cross-entropy, or log loss, measures performance when the output is a value between 0 and 1 and puts a heavy penalty on confident predictions that turn out to be wrong.
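
As a small illustration of that penalty (a sketch, not the Keras implementation), the loss for a single example with true label y and predicted probability p is -(y·log(p) + (1-y)·log(1-p)):

import numpy as np

def binary_crossentropy(y, p):
    # loss for one example with true label y (0 or 1) and predicted probability p
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(binary_crossentropy(1, 0.9))   # ~0.11, confident and correct: small loss
print(binary_crossentropy(1, 0.1))   # ~2.30, confident and wrong: large loss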

The metrics argument is a list of quantities to be monitored during training and testing. Here we use accuracy, a metric for classification models that is simply the number of correct predictions divided by the total number of predictions (the success rate).

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_4 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_4 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 161       
=================================================================
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________

From the summary we can see the layers: the embedding output is flattened and then a single dense layer is trained on top of it.

Training the model

The next step is training the model. The fit method iterates over the data the number of times given by epochs, and its output is a history of the training loss and metric values at each epoch. It also shows the validation loss and validation metric values.

history = model.fit(train_data, train_targ,   # input data and target data
                    epochs=10,                # number of passes over the data
                    batch_size=32,            # samples per gradient update
                    validation_split=0.2)     # fraction of data held out for validation
Epoch 1/10
625/625 [==============================] - 2s 3ms/step - loss: 0.6738 - acc: 0.6104 - val_loss: 0.6297 - val_acc: 0.6874
Epoch 2/10
625/625 [==============================] - 1s 2ms/step - loss: 0.5541 - acc: 0.7462 - val_loss: 0.5315 - val_acc: 0.7288
Epoch 3/10
625/625 [==============================] - 1s 2ms/step - loss: 0.4671 - acc: 0.7844 - val_loss: 0.5031 - val_acc: 0.7440
Epoch 4/10
625/625 [==============================] - 2s 2ms/step - loss: 0.4246 - acc: 0.8069 - val_loss: 0.4959 - val_acc: 0.7486
Epoch 5/10
625/625 [==============================] - 1s 2ms/step - loss: 0.3966 - acc: 0.8224 - val_loss: 0.4951 - val_acc: 0.7490
Epoch 6/10
625/625 [==============================] - 1s 2ms/step - loss: 0.3733 - acc: 0.8356 - val_loss: 0.4979 - val_acc: 0.7536
Epoch 7/10
625/625 [==============================] - 1s 2ms/step - loss: 0.3524 - acc: 0.8496 - val_loss: 0.5045 - val_acc: 0.7564
Epoch 8/10
625/625 [==============================] - 2s 2ms/step - loss: 0.3324 - acc: 0.8604 - val_loss: 0.5112 - val_acc: 0.7554
Epoch 9/10
625/625 [==============================] - 1s 2ms/step - loss: 0.3135 - acc: 0.8716 - val_loss: 0.5176 - val_acc: 0.7546
Epoch 10/10
625/625 [==============================] - 1s 2ms/step - loss: 0.2957 - acc: 0.8809 - val_loss: 0.5264 - val_acc: 0.7516

Analysis

The training output is a History object whose history attribute is a dictionary with the results of each metric as a list.

print(history.history)
{'loss': [0.6738093495368958, 0.5541190505027771, 0.4671066403388977, 0.4246475100517273, 0.3965531885623932, 0.37326815724372864, 0.3524011969566345, 0.3323952257633209, 0.3134663701057434, 0.29573047161102295], 'acc': [0.6103500127792358, 0.746150016784668, 0.7843999862670898, 0.8069499731063843, 0.822350025177002, 0.8355500102043152, 0.8496000170707703, 0.8603500127792358, 0.8715999722480774, 0.8808500170707703], 'val_loss': [0.6296502351760864, 0.5315172076225281, 0.5030556917190552, 0.49588504433631897, 0.4951230585575104, 0.49791309237480164, 0.504469096660614, 0.5111817121505737, 0.5175582766532898, 0.526441216468811], 'val_acc': [0.6873999834060669, 0.7287999987602234, 0.7440000176429749, 0.7486000061035156, 0.7490000128746033, 0.753600001335144, 0.7563999891281128, 0.7554000020027161, 0.7545999884605408, 0.7516000270843506]}
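
From this dictionary we can, for example, pick the epoch with the best validation accuracy (a quick sketch, reusing the history object from the fit call above):

import numpy as np

best_epoch = int(np.argmax(history.history['val_acc'])) + 1   # epochs are numbered from 1
print(best_epoch, history.history['val_acc'][best_epoch - 1]) # epoch 7, ~0.756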

From the training history we can do some analysis with graphs. The evolution of the metrics on the validation data (the 20% held out from the training set) suggests that the prediction power of the model increases very quickly at the beginning and then levels off at a validation accuracy around 75%. That is reasonable given that the model only looks at 20 words of each review.

import matplotlib.pyplot as plt
import numpy as np

plt.plot(np.arange(1, len(history.history['val_acc']) + 1),
         history.history['val_acc'])
plt.xlabel('Epochs')
plt.ylabel('Validation Accuracy')
plt.show()
Figure: validation accuracy per epoch (/images/acc.png)

Footnotes

1. The dense vector is a better alternative to the sparse vectors obtained with one-hot encoding.

2. Because it saturates towards 0 or 1 very quickly. The activation function just maps the inner product between the input and the weights into the fixed interval [0, 1].

3. Here is a good reference for this algorithm.
