Question classifier preprocessing
Introduction
An example from the book by Aman Kedia. About label encoding.
Loading data
1import pandas as pd
2data = open("train_1000-label.txt", 'r')
3train_raw = pd.DataFrame(data.readlines(), columns=['data'])
4print(train_raw.head())
data 0 DESC:manner How did serfdom develop in and the... 1 ENTY:cremat What films featured the character ... 2 DESC:manner How can I find a list of celebriti... 3 ENTY:animal What fowl grabs the spotlight afte... 4 ABBR:exp What is the full form of .com ?\n
Preprocessing
Split string
We need to split the class, finner class and the question itself.
1train = train_raw.data.str.split(':', n=1, expand=True)
2print(train.head())
3train[1] = train[1].str.split(n=1).str[1] # remove te finner classification
4train.columns = ['QType', 'Question']
5print(train.head())
0 1 0 DESC manner How did serfdom develop in and then lea... 1 ENTY cremat What films featured the character Popey... 2 DESC manner How can I find a list of celebrities ' ... 3 ENTY animal What fowl grabs the spotlight after the... 4 ABBR exp What is the full form of .com ?\n QType Question 0 DESC How did serfdom develop in and then leave Russ... 1 ENTY What films featured the character Popeye Doyle... 2 DESC How can I find a list of celebrities ' real na... 3 ENTY What fowl grabs the spotlight after the Chines... 4 ABBR What is the full form of .com ?\n
Identify the possible classifications
We have 6 different questions types.
1classes = train.QType.drop_duplicates()
2print(classes)
0 DESC 1 ENTY 4 ABBR 5 HUM 10 NUM 15 LOC Name: QType, dtype: object
Label encoding
In Machine Learning the algorithms are performed only on numerical data, therefore we need to convert our text data into numerical. This process is called Label Encoding.
We can use Scikit-Learn preprocessing Label Encoder class.
1from sklearn.preprocessing import LabelEncoder
2le = LabelEncoder()
3le.fit(classes)
4encoded_classes = le.transform(classes)
5print(encoded_classes)
6print(le.inverse_transform(encoded_classes))
[1 2 0 3 5 4] ['DESC' 'ENTY' 'ABBR' 'HUM' 'NUM' 'LOC']
One hot encoding
As an alternative to label encoding we can use one-hot encoding. Why? because the naive label encoding might add a unwanted pattern to the data, the progressions. One-hot encoding creates a column for each category. And it indicates the category with 0 or 1.
We can do that with Scikit-Learn or TensorFlow (with Keras).
TODO Tensor flow
In order to use Keras preprocessing API the text input should be a single input. We do that with pandas string functions. The we use the one hot method for preprocessing text in keras, which is not a one-hot encoding but simply a label encoding. The we perform the
1from tensorflow.keras import utils, preprocessing
2print(classes.str.cat(sep=' '), classes.size)
3enc_classes = preprocessing.text.one_hot(classes.str.cat(sep=' '),
4 n=classes.size,
5split=' ')
6print(enc_classes)
7one_hot_classes = utils.to_categorical(enc_classes,
8 classes.size) # number of classes
9print(one_hot_classes)
DESC ENTY ABBR HUM NUM LOC 6 [4, 4, 3, 2, 4, 4] [[0. 0. 0. 0. 1. 0.] [0. 0. 0. 0. 1. 0.] [0. 0. 0. 1. 0. 0.] [0. 0. 1. 0. 0. 0.] [0. 0. 0. 0. 1. 0.] [0. 0. 0. 0. 1. 0.]]
It seems its giving the same integer to multiple categories. Not sure what happened here.