Question classifier preprocessing

Dec 7 2020

natural-language-processing

Introduction

This post works through an example from the book by Aman Kedia: preprocessing a question-classification dataset, with a focus on label encoding.

Loading data

import pandas as pd

# Each raw line looks like "DESC:manner How did ...\n"; keep one line per row.
with open("train_1000-label.txt", "r") as data:
    train_raw = pd.DataFrame(data.readlines(), columns=["data"])
print(train_raw.head())

                                                data
0  DESC:manner How did serfdom develop in and the...
1  ENTY:cremat What films featured the character ...
2  DESC:manner How can I find a list of celebriti...
3  ENTY:animal What fowl grabs the spotlight afte...
4         ABBR:exp What is the full form of .com ?\n

Preprocessing

Split string

Each line holds a class, a finer class, and the question itself, so we need to split them apart.

# Split once on the first ':' to separate the class from the rest of the line
train = train_raw.data.str.split(':', n=1, expand=True)
print(train.head())
train[1] = train[1].str.split(n=1).str[1]  # remove the finer classification
train.columns = ['QType', 'Question']
print(train.head())

      0                                                  1
0  DESC  manner How did serfdom develop in and then lea...
1  ENTY  cremat What films featured the character Popey...
2  DESC  manner How can I find a list of celebrities ' ...
3  ENTY  animal What fowl grabs the spotlight after the...
4  ABBR              exp What is the full form of .com ?\n
  QType                                           Question
0  DESC  How did serfdom develop in and then leave Russ...
1  ENTY  What films featured the character Popeye Doyle...
2  DESC  How can I find a list of celebrities ' real na...
3  ENTY  What fowl grabs the spotlight after the Chines...
4  ABBR                  What is the full form of .com ?\n
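To see what the two pandas splits do to a single row, here is a plain-Python sketch applied to the last sample line shown above. The variable names are only for illustration.

```python
line = "ABBR:exp What is the full form of .com ?\n"

# Split once on the first colon: class vs. the rest of the line.
qtype, rest = line.split(":", 1)

# Split once on whitespace: finer class vs. the question text.
fine, question = rest.split(None, 1)

print(qtype)           # ABBR
print(fine)            # exp
print(repr(question))  # 'What is the full form of .com ?\n'
```

Note that the trailing newline is still attached to the question, just as in the DataFrame output. Calling `question.strip()` here, or `.str.strip()` on the pandas column, would remove it.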

Identify the possible classifications

We have 6 different question types.

classes = train.QType.drop_duplicates()
print(classes)

0     DESC
1     ENTY
4     ABBR
5      HUM
10     NUM
15     LOC
Name: QType, dtype: object

Label encoding

Most machine learning algorithms operate only on numerical data, so we need to convert our text labels into numbers. Mapping each category to an integer is called label encoding.

We can use the LabelEncoder class from Scikit-Learn's preprocessing module.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(classes)
encoded_classes = le.transform(classes)
print(encoded_classes)
print(le.inverse_transform(encoded_classes))

[1 2 0 3 5 4]
['DESC' 'ENTY' 'ABBR' 'HUM' 'NUM' 'LOC']
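The codes above are not assigned in order of appearance: LabelEncoder sorts the distinct labels and uses each label's position in that sorted order. A minimal plain-Python sketch of the same idea (an illustration, not Scikit-Learn's implementation; `ordered` and `to_code` are hypothetical names):

```python
classes = ["DESC", "ENTY", "ABBR", "HUM", "NUM", "LOC"]

# Sort the distinct labels; each label's code is its position in that order.
ordered = sorted(set(classes))
to_code = {label: code for code, label in enumerate(ordered)}

encoded = [to_code[label] for label in classes]
print(encoded)  # [1, 2, 0, 3, 5, 4]

# Decoding is a lookup in the other direction.
decoded = [ordered[code] for code in encoded]
print(decoded)  # ['DESC', 'ENTY', 'ABBR', 'HUM', 'NUM', 'LOC']
```

This is why ABBR gets 0 even though it is the third class to appear in the data.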

One hot encoding

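As an alternative to label encoding we can use one-hot encoding. Why? Because naive label encoding can add an unwanted pattern to the data: the integers imply an ordering (for example, that LOC = 4 is "greater than" DESC = 1) that the categories do not have. One-hot encoding instead creates one column per category and marks the category with a 1 in its own column and a 0 in all the others.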

We can do that with Scikit-Learn or TensorFlow (with Keras).
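Before reaching for either library, the idea itself fits in a few lines of plain Python. This is only a sketch of the concept, assuming the six classes found above; `columns` and `one_hot_row` are hypothetical names, not part of any library.

```python
classes = ["DESC", "ENTY", "ABBR", "HUM", "NUM", "LOC"]
columns = sorted(classes)  # one column per category

def one_hot_row(label):
    # 1 in the column that matches the label, 0 everywhere else
    return [1 if label == column else 0 for column in columns]

for label in classes:
    print(label, one_hot_row(label))
```

Every row contains exactly one 1, and no two categories share a row, so no ordering between categories is implied.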

TODO TensorFlow

To use the Keras text preprocessing API, the input must be a single string, so we join the class names with pandas string functions. Then we use the Keras one_hot method for preprocessing text, which, despite its name, is not a one-hot encoding but simply an integer (label) encoding. Then we perform the actual one-hot encoding with to_categorical.

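from tensorflow.keras import utils, preprocessing

# Join the class names into one space-separated string for the text API
text = classes.str.cat(sep=' ')
print(text, classes.size)
# Despite its name, one_hot returns one integer per word, not a 0/1 vector
enc_classes = preprocessing.text.one_hot(text, n=classes.size, split=' ')
print(enc_classes)
# to_categorical turns each integer into a 0/1 row with classes.size columns
one_hot_classes = utils.to_categorical(enc_classes, classes.size)
print(one_hot_classes)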

DESC ENTY ABBR HUM NUM LOC 6
[4, 4, 3, 2, 4, 4]
[[0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0.]]

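It seems it's giving the same integer to multiple categories. Keras documents one_hot as a wrapper around a hashing trick that does not guarantee a unique index per word, so collisions like these are probably the cause rather than a bug in the code above.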
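The repeated integers can be reproduced without Keras. The sketch below assumes the scheme used by the Keras hashing trick, index = hash(word) % (n - 1) + 1 with index 0 reserved, and swaps Python's built-in hash() (salted per process, so it changes between runs) for an MD5-based stand-in to keep the run repeatable. `stable_hash` and `hashed_indices` are hypothetical helper names; this illustrates the mechanism and is not the library's code.

```python
import hashlib

def stable_hash(word):
    # Stand-in for Python's built-in hash(), which differs on every run.
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)

def hashed_indices(words, n):
    # Assumed scheme: index 0 is reserved, so words land in 1 .. n-1.
    return [stable_hash(w) % (n - 1) + 1 for w in words]

classes = ["DESC", "ENTY", "ABBR", "HUM", "NUM", "LOC"]

# n = 6 leaves only 5 usable buckets for 6 words, so at least two must collide.
hashed = hashed_indices([c.lower() for c in classes], n=len(classes))
print(hashed)
print("distinct indices:", len(set(hashed)), "of", len(classes))
```

With n equal to the number of classes a collision is unavoidable, and a larger n only makes collisions less likely. For class labels it seems safer to pass the integers from LabelEncoder (encoded_classes above) to to_categorical, since those are unique by construction.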