Sentiment analysis on IMDB dataset
Introduction
This is an example from the excellent book on deep learning by François Chollet.
My idea here is to further detail the explanation with the code output, which the book does not contain.
And since TensorFlow 2.0 has been released, I will be using tf.keras instead of the standalone Keras package used in the book.
The goal of this example is to train a simple neural network that classifies IMDB movie reviews as positive or negative.
The steps are: load the data, preprocess it, build the model, configure it for training, train it, and analyze the results.
Load data
Load the IMDB data from the Keras datasets module. This dataset contains 25,000 movie reviews for training (and another 25,000 for testing), each labeled by sentiment. The format is a list of length 25,000, where each entry is another list of integers representing a review. Each integer is the rank of a word by frequency, and here we keep only the 10,000 most frequent words. We can see that the sentiment values are 0 or 1.
from tensorflow.keras import datasets, preprocessing

max_features = 10000  # number of distinct words to keep

(train_data, train_targ), (test_data, test_targ) = datasets.imdb.load_data(
    num_words=max_features)  # limits the vocabulary to the most common words

print(len(train_data), train_data[0])
print(set(train_targ))
print(train_targ)
25000 [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32] {0, 1} [1 0 0 ... 0 1 0]
With this bunch of numbers it is hard to grasp the real meaning of the data. Fortunately, we can convert it back to text using the word index that is also provided with the dataset.
index = datasets.imdb.get_word_index()
print(list(index.items())[:5])
[('fawn', 34701), ('tsukino', 52006), ('nunnery', 52007), ('sonja', 16816), ('vani', 63951)]
We can see that the word index is a dictionary where the key is the word and the value is its integer representation. Because we have the values and want the keys, it is useful to invert this dictionary so we can pass in an integer and get back a word.
index_word = {value: key for (key, value) in index.items()}
print(index_word[11], index_word[19])
# offset of 3 because indices 0, 1 and 2 are reserved for padding, start of sequence and unknown words
comment = ' '.join([index_word.get(i - 3, '#') for i in train_data[0]])
print(comment)
this film # this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert # is an amazing actor and now the same being director # father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for # and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also # to the two little boy's that played the # of norman and paul they were just brilliant children are often left out of the # list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
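As an optional aside, a small sketch makes the offset of 3 explicit. It assumes the conventional Keras IMDB reserved slots (0 for padding, 1 for start of sequence, 2 for unknown words); the labels <PAD>, <START> and <UNK> are just illustrative names.

reserved = {0: '<PAD>', 1: '<START>', 2: '<UNK>'}  # illustrative names for the reserved indices
decode = {i + 3: w for (w, i) in index.items()}    # shift every word rank by the 3 reserved slots
decode.update(reserved)
print(' '.join(decode.get(i, '<UNK>') for i in train_data[0][:10]))
# <START> this film was just brilliant casting location scenery story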
Preprocessing
Transform the lists into a 2D tensor of shape (samples, maxlen).
So we are limiting each review to just its last 20 words, since pad_sequences pads and truncates at the beginning of the sequence by default.
This will affect the accuracy of the final trained neural network model: we expect a higher accuracy when more words are considered during training.
maxlen = 20
train_data = preprocessing.sequence.pad_sequences(train_data, maxlen=maxlen)
test_data = preprocessing.sequence.pad_sequences(test_data, maxlen=maxlen)
print(train_data)
[[  65   16   38 ...   19  178   32]
 [  23    4 1690 ...   16  145   95]
 [1352   13  191 ...    7  129  113]
 ...
 [  11 1818 7561 ...    4 3586    2]
 [  92  401  728 ...   12    9   23]
 [ 764   40    4 ...  204  131    9]]
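To make the truncation behaviour concrete, here is a tiny check assuming the default padding='pre' and truncating='pre' arguments of pad_sequences: shorter sequences are padded with zeros at the front, and longer ones keep only their last maxlen entries.

demo = preprocessing.sequence.pad_sequences([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(demo)
# [[0 1 2 3]
#  [3 4 5 6]]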
Build the model
The goal now is to assemble a model that encapsulates the analysis.
A Sequential model is used for a plain stack of layers, where each layer has exactly one input tensor and one output tensor.
Here we are adding multiple layers to our model. One of them is the embedding layer, which turns positive integers into dense vectors of fixed size.1 The first argument is the size of the vocabulary (distinct words), in this case the 10,000 most common words. The second argument is the length of the output dense vector, in this case 8. Finally, the input length is the length of each input sequence passed to the model. Therefore, this layer takes each review (limited to 20 words) and maps every word to a dense vector of length 8, producing a 20 × 8 tensor per review.
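As a quick shape check (a small illustrative sketch with random word indices, separate from the model built below), a batch of two 20-word reviews passed through such an embedding layer comes out with shape (2, 20, 8):

import numpy as np
from tensorflow.keras import layers

demo_embedding = layers.Embedding(10000, 8)              # same vocabulary and vector size as the model below
fake_batch = np.random.randint(0, 10000, size=(2, 20))   # two fake reviews of 20 word indices each
print(demo_embedding(fake_batch).shape)                  # (2, 20, 8)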
Then we add the flatten layer, which simply flattens the input per sample: the 20 × 8 embedding output becomes a single vector of length 160.
Finally we add the dense layer, which acts as the classifier.
It implements a matrix multiplication between the input and the kernel, followed by a transformation with an activation function, i.e. output = activation(dot(input, kernel) + bias). The kernel is the weight matrix created by the layer. The activation function used here is the sigmoid, a nonlinear function that squashes any real number into the range (0, 1), so the output can be read as the probability that a review is positive.2
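A minimal NumPy sketch of that computation (made-up input and weights, purely illustrative, not anything learned by the model):

import numpy as np

def dense_sigmoid(x, kernel, bias):
    z = np.dot(x, kernel) + bias        # multiply the input by the weight matrix and add the bias
    return 1.0 / (1.0 + np.exp(-z))     # sigmoid squashes the result into (0, 1)

x = np.random.rand(1, 160)              # one flattened review: a vector of length 160
kernel = np.random.rand(160, 1) * 0.01  # made-up weights with the same shape as the layer's kernel
bias = np.zeros(1)
print(dense_sigmoid(x, kernel, bias))   # a single value between 0 and 1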
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()  # a plain stack of layers
model.add(layers.Embedding(10000, 8, input_length=maxlen))
model.add(layers.Flatten())  # flattens the 3D embedding output into 2D
model.add(layers.Dense(1, activation='sigmoid'))  # output with 1 dimension only
model.summary()
Model: "sequential_4" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_4 (Embedding) (None, 20, 8) 80000 _________________________________________________________________ flatten_4 (Flatten) (None, 160) 0 _________________________________________________________________ dense_1 (Dense) (None, 1) 161 ================================================================= Total params: 80,161 Trainable params: 80,161 Non-trainable params: 0 _________________________________________________________________
Configure the model for training
The compile method configures the model for training.
The optimizer argument specifies the algorithm used for this task.
In this example we use RMSprop, which stands for root mean square propagation.3
Optimization in this sense means minimizing a function; here, we want to minimize the loss function.
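As a rough sketch of the idea behind RMSprop (the textbook form of the update rule with illustrative hyperparameters, not the exact tf.keras implementation):

import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-7):
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2   # running average of squared gradients
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)     # scale the step by the root mean square
    return w, avg_sq

w, avg_sq = rmsprop_step(w=1.0, grad=0.5, avg_sq=0.0)
print(w, avg_sq)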
The loss argument is the objective function, or cost function, which indicates how well the model predicts for the given parameters.
The one chosen here is binary crossentropy, a cross-entropy loss function for 2 (binary) classes: positive and negative reviews.
Cross-entropy, or log loss, measures performance when the output is a value between 0 and 1, and it puts a heavy penalty on confident predictions that turn out to be wrong.
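A minimal NumPy sketch of the binary cross-entropy formula, averaged over a few made-up predictions:

import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1])
y_pred = np.array([0.9, 0.2, 0.4])          # the 0.4 prediction for a true label of 1 contributes the largest term
print(binary_crossentropy(y_true, y_pred))  # ~0.415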
The metrics argument is a list of quantities to be evaluated during training and testing.
Here we use accuracy.
The accuracy metric is used for classification models and is simply the number of correct predictions divided by the total number of predictions (the success rate).
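For a sigmoid output this amounts to thresholding the predictions at 0.5 and counting matches; a small illustration with made-up values:

import numpy as np

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.4, 0.7])            # sigmoid outputs
correct = (y_pred > 0.5).astype(int) == y_true     # threshold at 0.5, compare with the labels
print(correct.mean())                              # 0.75, i.e. 3 of 4 predictions correct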
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
Model: "sequential_4" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_4 (Embedding) (None, 20, 8) 80000 _________________________________________________________________ flatten_4 (Flatten) (None, 160) 0 _________________________________________________________________ dense_1 (Dense) (None, 1) 161 ================================================================= Total params: 80,161 Trainable params: 80,161 Non-trainable params: 0 _________________________________________________________________
From the summary we can see the layers: the embedding (10,000 words × 8 dimensions = 80,000 parameters) is flattened into a vector of length 160, and a single dense layer with 161 parameters (160 weights plus one bias) produces the prediction.
Training the model
The next step is training the model.
The fit method iterates over the data the number of times given by epochs, and returns a history of the training loss and metric values at each epoch.
It also reports the validation loss and validation metric values.
history = model.fit(train_data, train_targ,  # input data and target data
                    epochs=10,               # number of iterations over the data
                    batch_size=32,           # samples per gradient update
                    validation_split=0.2)    # fraction of the data held out for validation
Epoch 1/10
625/625 [==============================] - 2s 3ms/step - loss: 0.6738 - acc: 0.6104 - val_loss: 0.6297 - val_acc: 0.6874
Epoch 2/10
625/625 [==============================] - 1s 2ms/step - loss: 0.5541 - acc: 0.7462 - val_loss: 0.5315 - val_acc: 0.7288
Epoch 3/10
625/625 [==============================] - 1s 2ms/step - loss: 0.4671 - acc: 0.7844 - val_loss: 0.5031 - val_acc: 0.7440
Epoch 4/10
625/625 [==============================] - 2s 2ms/step - loss: 0.4246 - acc: 0.8069 - val_loss: 0.4959 - val_acc: 0.7486
Epoch 5/10
625/625 [==============================] - 1s 2ms/step - loss: 0.3966 - acc: 0.8224 - val_loss: 0.4951 - val_acc: 0.7490
Epoch 6/10
625/625 [==============================] - 1s 2ms/step - loss: 0.3733 - acc: 0.8356 - val_loss: 0.4979 - val_acc: 0.7536
Epoch 7/10
625/625 [==============================] - 1s 2ms/step - loss: 0.3524 - acc: 0.8496 - val_loss: 0.5045 - val_acc: 0.7564
Epoch 8/10
625/625 [==============================] - 2s 2ms/step - loss: 0.3324 - acc: 0.8604 - val_loss: 0.5112 - val_acc: 0.7554
Epoch 9/10
625/625 [==============================] - 1s 2ms/step - loss: 0.3135 - acc: 0.8716 - val_loss: 0.5176 - val_acc: 0.7546
Epoch 10/10
625/625 [==============================] - 1s 2ms/step - loss: 0.2957 - acc: 0.8809 - val_loss: 0.5264 - val_acc: 0.7516
Analysis
The training output is stored in history.history, a dictionary where each metric maps to a list of per-epoch values.
print(history.history)
{'loss': [0.6738093495368958, 0.5541190505027771, 0.4671066403388977, 0.4246475100517273, 0.3965531885623932, 0.37326815724372864, 0.3524011969566345, 0.3323952257633209, 0.3134663701057434, 0.29573047161102295], 'acc': [0.6103500127792358, 0.746150016784668, 0.7843999862670898, 0.8069499731063843, 0.822350025177002, 0.8355500102043152, 0.8496000170707703, 0.8603500127792358, 0.8715999722480774, 0.8808500170707703], 'val_loss': [0.6296502351760864, 0.5315172076225281, 0.5030556917190552, 0.49588504433631897, 0.4951230585575104, 0.49791309237480164, 0.504469096660614, 0.5111817121505737, 0.5175582766532898, 0.526441216468811], 'val_acc': [0.6873999834060669, 0.7287999987602234, 0.7440000176429749, 0.7486000061035156, 0.7490000128746033, 0.753600001335144, 0.7563999891281128, 0.7554000020027161, 0.7545999884605408, 0.7516000270843506]}
From the training history we can start to do some analysis with graphs. The evolution of the metrics on the validation data (the 20% held out from the training set) suggests that the prediction power of the model increases very quickly at the beginning and then settles at a validation accuracy of around 76%. We only used the last 20 words of each review, so this is a reasonable result for such a simple model.
import matplotlib.pyplot as plt
import numpy as np

plt.plot(np.arange(1, len(history.history['val_acc']) + 1),
         history.history['val_acc'])
plt.xlabel('Epochs')
plt.ylabel('Validation Accuracy')
plt.show()
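To go a bit further with the same history object, we can also plot the training accuracy next to the validation accuracy; the growing gap between the two curves shows the model fitting the training data much better than the held-out reviews. This is just an additional sketch reusing the imports above:

acc = history.history['acc']          # training accuracy per epoch
val_acc = history.history['val_acc']  # validation accuracy per epoch
epochs = np.arange(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo-', label='Training accuracy')
plt.plot(epochs, val_acc, 'rs-', label='Validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()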