Topic modeling
Introduction
Topic modeling is a form of semantic analysis, a step toward finding meaning from word counts. It allows us to discover the topics of documents without any training data: we count words and group similar word patterns to describe the data.
Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is one topic modeling technique. It infers a probability distribution over words for each topic, which characterizes that topic, and it also infers a distribution over topics for each document.
More information about the LDA model:
- https://papers.nips.cc/paper/2010/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf
- https://bookdown.org/Maxine/tidy-text-mining/latent-dirichlet-allocation.html
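As a rough intuition for these two distributions, the sketch below uses made-up topic names, words and probabilities (nothing learned from data) to sample a tiny document from a topic-word distribution and a document-topic distribution. LDA inference works in the opposite direction: it estimates both distributions from nothing but the observed documents.

import random

# Toy topic -> word distributions and a toy document -> topic distribution.
# All numbers are invented for illustration only.
topic_word = {
    'nails':  {'esmalte': 0.5, 'cor': 0.3, 'coleção': 0.2},
    'sports': {'jogo': 0.6, 'time': 0.3, 'gol': 0.1},
}
doc_topic = {'nails': 0.8, 'sports': 0.2}

def sample_toy_document(num_words=8):
    """Generate a toy document: pick a topic for each word, then a word from that topic."""
    words = []
    for _ in range(num_words):
        topic = random.choices(list(doc_topic), weights=list(doc_topic.values()))[0]
        dist = topic_word[topic]
        words.append(random.choices(list(dist), weights=list(dist.values()))[0])
    return words

print(sample_toy_document())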
Gensim
Gensim is a library for topic modeling that ships with an implementation of the LDA algorithm.
pyLDAvis
pyLDAvis is a library that helps visualize LDA outputs. It can generate an interactive visualization of the topics and their most relevant words.
LDA procedure example
Load the data
I will use Portuguese blog posts from BLOGSET-BR. I'm limiting the analysis to just 1000 blog posts to keep the code in this notebook running fast.
import pandas as pd

# the CSV has no header row; read only the first 1000 posts
posts_data = pd.read_csv('C:/Users/nasse/Desktop/blogset-br.csv/blogset-br.csv', header=None, nrows=1000)
print(posts_data.shape)
print(posts_data.head())
(1000, 9)
                     0                    1                          2  \
0  4513095612773714447  1000016902259367892  2012-01-27T13:30:00-02:00
1  4402618022359447709  1000016902259367892  2011-07-25T21:43:00-03:00
2  4903861431859076038  1000016902259367892  2011-10-06T23:39:00-03:00
3  5936720117277385447  1000016902259367892  2011-07-25T21:46:00-03:00
4  9035220881064614540  1000016902259367892  2011-05-04T00:09:00-03:00

                                 3  \
0           Ombré nails como fazer
1                   RISQUÉ - COLOR
2                      Risque Dogs
3  Risqué coleção nova da Penelope
4       Coleção 2011 MOHDA INVERNO

                                                   4                     5  \
0                                                 \n  01861978533047910632
1    SÃAAAO MUITOO LINDOS AMEEI OOO MIRAGEM AZUL *-*  01861978533047910632
2  A Nova coleção da Risqué veeeeio ANIMAL hehe c...  01861978533047910632
3                                           esmaltes  01861978533047910632
4  A MOHDA nesse inverno enta trazendo para nos v...  01861978533047910632

         6  7    8
0  Daniela  1  NaN
1  Daniela  0  NaN
2  Daniela  3  NaN
3  Daniela  0  NaN
4  Daniela  0  NaN
We are only interested in the blog content, so we drop all columns except the one holding the post text, which is column 4.
posts = posts_data[4].to_frame()
posts.columns = ['content']
print(posts.head())
print(posts.columns)
                                             content
0                                                 \n
1    SÃAAAO MUITOO LINDOS AMEEI OOO MIRAGEM AZUL *-*
2  A Nova coleção da Risqué veeeeio ANIMAL hehe c...
3                                           esmaltes
4  A MOHDA nesse inverno enta trazendo para nos v...
Index(['content'], dtype='object')
Preprocessing
Lower case
The first step is to lowercase everything using pandas' vectorized string methods.
posts['lower'] = posts['content'].str.lower()
print(posts['lower'].head())
0                                                   \n
1      sãaaao muitoo lindos ameei ooo miragem azul *-*
2    a nova coleção da risqué veeeeio animal hehe c...
3                                             esmaltes
4    a mohda nesse inverno enta trazendo para nos v...
Name: lower, dtype: object
Tokenize each post
We then tokenize each post using spaCy, loading the Portuguese model, which can be downloaded with python -m spacy download pt_core_news_sm.
import spacy

nlp = spacy.load('pt_core_news_sm')
posts['tokens'] = posts['lower'].apply(nlp)
print(posts['tokens'].head())
0                                                 (\n)
1    (sãaaao, muitoo, lindos, ameei, ooo, miragem, ...
2    (a, nova, coleção, da, risqué, veeeeio, animal...
3                                           (esmaltes)
4    (a, mohda, nesse, inverno, enta, trazendo, par...
Name: tokens, dtype: object
Let's check the number of tokens in each post and look at the average. A great part of those tokens are stopwords, which do not carry much information, so we will remove them.
posts['num_tokens'] = posts['tokens'].apply(len)
print(posts['num_tokens'].mean())
287.099
Removing punctuation
Removing punctuation reduces the average number of tokens per post.
posts['tokens_nopunc'] = posts['tokens'].apply(lambda x: [tkn for tkn in x if not tkn.is_punct])
print(posts['tokens_nopunc'].head())

posts['num_tokens_nopunc'] = posts['tokens_nopunc'].apply(len)
print(posts['num_tokens_nopunc'].mean())
0                                                 [\n]
1    [sãaaao, muitoo, lindos, ameei, ooo, miragem, ...
2    [a, nova, coleção, da, risqué, veeeeio, animal...
3                                           [esmaltes]
4    [a, mohda, nesse, inverno, enta, trazendo, par...
Name: tokens_nopunc, dtype: object
241.036
Removing stopwords
print(list(nlp.Defaults.stop_words)[:20])
['momento', 'foram', 'geral', 'onde', 'tiveram', 'seus', 'oito', 'além', 'aqui', 'podem', 'aquelas', 'por', 'sétima', 'acerca', 'primeiro', 'nuns', 'estás', 'nossas', 'está', 'temos']
Then we remove the stopwords using spaCy's stopword collection; this reduces the average to about 139 tokens per post.
def remove_stopwords(tokens):
    tokens_nosw = []
    for tkn in tokens:
        if tkn.text.isspace():
            continue  # skip whitespace-only tokens
        if tkn.text not in nlp.Defaults.stop_words:
            tokens_nosw.append(tkn)
    return tokens_nosw

posts['tokens_nosw'] = posts['tokens_nopunc'].apply(remove_stopwords)
print(posts['tokens_nosw'].head())

posts['num_tokens_nosw'] = posts['tokens_nosw'].apply(len)
print(posts['num_tokens_nosw'].mean())
0                                                   []
1    [sãaaao, muitoo, lindos, ameei, ooo, miragem, ...
2    [a, coleção, risqué, veeeeio, animal, hehe, co...
3                                           [esmaltes]
4    [a, mohda, inverno, enta, trazendo, viadas, es...
Name: tokens_nosw, dtype: object
139.142
Removing tokens shorter than 5 characters
Short words usually don't carry much information, so we remove them.
posts['tokens_clean'] = posts.tokens_nosw.apply(lambda x: [tkn for tkn in x if len(tkn) > 4])
print(posts.tokens_clean)

posts['num_tokens_clean'] = posts['tokens_clean'].apply(len)
print(posts['num_tokens_clean'].mean())
0                                                     []
1               [sãaaao, muitoo, lindos, ameei, miragem]
2      [coleção, risqué, veeeeio, animal, cores, clar...
3                                             [esmaltes]
4      [mohda, inverno, trazendo, viadas, esmaltes, c...
                             ...
995    [apaixonada, esportes, adolescência, curtia, v...
996    [enfim, dormir, acordar, trocar, noite, preocu...
997         [http://www.youtube.com/watch?v=-_wr1hy-bka]
998    [bullying, alguém, agride, humilha, xinga,&nbs...
999    [mohandas, karamchand, gandhi, outubro, janeir...
Name: tokens_clean, Length: 1000, dtype: object
96.65
Removing empty posts
Now we create a new pandas Series: we keep only the posts whose token list has length greater than 0, then access each token's text attribute to convert it into a string. At the end we have a list of lists of strings.
posts_clean = posts.tokens_clean[posts.tokens_clean.str.len() > 0].apply(lambda x: [tkn.text for tkn in x])
print(posts_clean)

print(type(posts_clean.tolist()[0]))
print(type(posts_clean.tolist()[0][0]))
1               [sãaaao, muitoo, lindos, ameei, miragem]
2      [coleção, risqué, veeeeio, animal, cores, clar...
3                                             [esmaltes]
4      [mohda, inverno, trazendo, viadas, esmaltes, c...
5      [susseso, filme, riqué, cores, vibrante, fizer...
                             ...
995    [apaixonada, esportes, adolescência, curtia, v...
996    [enfim, dormir, acordar, trocar, noite, preocu...
997         [http://www.youtube.com/watch?v=-_wr1hy-bka]
998    [bullying, alguém, agride, humilha, xinga,&nbs...
999    [mohandas, karamchand, gandhi, outubro, janeir...
Name: tokens_clean, Length: 927, dtype: object
<class 'list'>
<class 'str'>
Summary preprocessing DataFrame
Every preprocessing step is stored as a column in the posts DataFrame, so we can inspect each stage side by side.
pd.set_option('display.max_columns', None)
print(posts.head())
                                             content  \
0                                                 \n
1    SÃAAAO MUITOO LINDOS AMEEI OOO MIRAGEM AZUL *-*
2  A Nova coleção da Risqué veeeeio ANIMAL hehe c...
3                                           esmaltes
4  A MOHDA nesse inverno enta trazendo para nos v...

                                               lower  \
0                                                 \n
1    sãaaao muitoo lindos ameei ooo miragem azul *-*
2  a nova coleção da risqué veeeeio animal hehe c...
3                                           esmaltes
4  a mohda nesse inverno enta trazendo para nos v...

                                              tokens  num_tokens  \
0                                               (\n)           1
1  (sãaaao, muitoo, lindos, ameei, ooo, miragem, ...          10
2  (a, nova, coleção, da, risqué, veeeeio, animal...          27
3                                         (esmaltes)           1
4  (a, mohda, nesse, inverno, enta, trazendo, par...          49

                                       tokens_nopunc  num_tokens_nopunc  \
0                                               [\n]                  1
1  [sãaaao, muitoo, lindos, ameei, ooo, miragem, ...                  7
2  [a, nova, coleção, da, risqué, veeeeio, animal...                 24
3                                         [esmaltes]                  1
4  [a, mohda, nesse, inverno, enta, trazendo, par...                 42

                                         tokens_nosw  num_tokens_nosw  \
0                                                 []                0
1  [sãaaao, muitoo, lindos, ameei, ooo, miragem, ...                7
2  [a, coleção, risqué, veeeeio, animal, hehe, co...               19
3                                         [esmaltes]                1
4  [a, mohda, inverno, enta, trazendo, viadas, es...               26

                                        tokens_clean  num_tokens_clean
0                                                 []                 0
1           [sãaaao, muitoo, lindos, ameei, miragem]                 5
2  [coleção, risqué, veeeeio, animal, cores, clar...                12
3                                         [esmaltes]                 1
4  [mohda, inverno, trazendo, viadas, esmaltes, c...                17
LDA with Gensim
Prepare the corpus
The LdaModel object from Gensim takes a corpus as a parameter. In Gensim, a corpus is a collection of documents, and a document is simply a piece of text; here each document is the list of tokens of one post, so the corpus is built from a list of lists of tokens. The Dictionary class associates each word with a unique integer ID; this dictionary defines the vocabulary.
from gensim.corpora import Dictionary

docs = posts_clean.tolist()
dictionary = Dictionary(docs)
print(dictionary)
Dictionary(34661 unique tokens: ['ameei', 'lindos', 'miragem', 'muitoo', 'sãaaao']...)
Transform documents to vectorized form
In this step we convert a document into a numeric form so the algorithm can manipulate it.
This step is also known as bag-of-words representation of documents.
This function doc2bow()
creates a tuple for each word with (word_id, word_frquency)
.
corpus = [dictionary.doc2bow(doc) for doc in docs]
print(corpus[0], docs[0])
print('Number of unique tokens:', len(dictionary))
print('Number of posts:', len(corpus))
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)] ['sãaaao', 'muitoo', 'lindos', 'ameei', 'miragem']
Number of unique tokens: 34661
Number of posts: 927
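To make the (word_id, word_frequency) pairs more concrete, here is a small toy example (hypothetical tokens, separate from the corpus above) in which one token repeats:

toy_docs = [['esmalte', 'cor', 'esmalte'], ['cor', 'coleção']]
toy_dictionary = Dictionary(toy_docs)
print(toy_dictionary.token2id)              # each token gets a unique integer ID
print(toy_dictionary.doc2bow(toy_docs[0]))  # 'esmalte' appears twice, so its pair is (id, 2)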
Training
In this step a few parameters need to be set. The number of topics must be defined up front; I chose 10. chunksize controls how many documents are processed at a time. passes defines how many times the model is trained on the whole corpus, analogous to epochs. iterations controls how often we repeat the inference loop over each document.
from gensim.models import LdaModel

num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None

id2word = dictionary

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',      # learn an asymmetric document-topic prior from the data
    eta='auto',        # learn an asymmetric topic-word prior from the data
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)
model.save(f'model_{len(corpus)}.gensim')
Training output
We can print the topics learned by the model; top_topics() returns them ordered by coherence over the corpus. Below we show one of them, together with one of the documents for comparison.
from pprint import pprint

topics = model.top_topics(corpus)
pprint(topics[4])
print(docs[4])
([(0.0030817974, 'pessoas'),
  (0.0029705719, 'trabalho'),
  (0.0019567062, 'empresa'),
  (0.0019483563, 'social'),
  (0.0018672714, 'ficar'),
  (0.0018014798, 'melhor'),
  (0.0017869751, 'paciente'),
  (0.0017729151, 'download'),
  (0.0015917295, 'estamos'),
  (0.0014970169, 'mesma'),
  (0.0014943925, 'muitas'),
  (0.0014523969, 'coisas'),
  (0.0014118261, 'alguém'),
  (0.0013802068, 'sendo'),
  (0.001359764, 'mundo'),
  (0.0013132734, 'brasil'),
  (0.0013105941, 'contas'),
  (0.0013103066, 'médicos'),
  (0.0012967874, 'carminha'),
  (0.0012708729, 'marido')],
 -5.887733908971602)
['susseso', 'filme', 'riqué', 'cores', 'vibrante', 'fizerma', 'nbsp;as', 'cores', 'caxinha', 'linda', 'hortencia', 'obsessão', 'fluor', 'laranja', 'amarelo', 'cores', 'vivas', 'otimas', 'serem', 'usada', 'primaver/', 'verão', 'cores', 'coleções', 'coleções', 'passadas ', 'usaram', 'otima', 'opção', 'colocarem', 'merdaco', 'chamar', 'atenção', 'consumidoras', 'esmaltes. ']
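top_topics() ranks topics by their coherence over the whole corpus. To inspect the topic mixture of a single post, which is the per-document distribution LDA estimates, we can query the model directly. A minimal sketch using the model and corpus defined above:

# Topic mixture of the 5th preprocessed post (probabilities below 5% are hidden)
doc_topics = model.get_document_topics(corpus[4], minimum_probability=0.05)
print(doc_topics)

# Most probable words of the dominant topic in that mixture
if doc_topics:
    dominant_topic = max(doc_topics, key=lambda pair: pair[1])[0]
    print(model.show_topic(dominant_topic, topn=10))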
Visualizing with pyLDAvis
This tool lets us visualize the discovered topics, their importance, the similarity between topics, and the most relevant words of each topic. Relevance is computed as a weighted average of the probability of the word given the topic and that same probability normalized by the overall probability of the term. If we move the slider at the top to 0, we see the words that occur almost exclusively in a specific topic.
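A small sketch of that relevance formula (as defined in the Sievert & Shirley paper behind pyLDAvis), using made-up probabilities, shows how the slider value shifts the ranking:

import math

p_w_given_t = 0.02   # hypothetical probability of a word inside one topic
p_w = 0.001          # hypothetical overall probability of the same word in the corpus

# relevance(w | t, lam) = lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))
for lam in (0.0, 0.6, 1.0):
    relevance = lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)
    print(f'lambda = {lam}: relevance = {relevance:.2f}')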
import pyLDAvis.gensim

lda = LdaModel.load('model_927.gensim')
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.save_html(lda_display, 'ldavis.html')
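To explore the result directly instead of opening the saved HTML file, pyLDAvis can also render inline in a Jupyter notebook. Note that in recent pyLDAvis releases the Gensim helper module was renamed from pyLDAvis.gensim to pyLDAvis.gensim_models, so the import above may need adjusting depending on the installed version.

import pyLDAvis

pyLDAvis.enable_notebook()     # render visualizations inline in the notebook
pyLDAvis.display(lda_display)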
Analysis
From the model results we can make some observations:
- The 4th topic is the most important across the posts analyzed. It most likely refers to literature or culture, given the relevant terms: Machado de Assis, Shakespeare, José Alencar, poemas.
- Topic 1 most likely refers to religion.
- Topics 3, 5, 6, 7, 8 and 9 are clustered in the same region, which indicates that they cover similar themes. My guess is that they are posts about current news.
Points of improvement
- In the 4th topic we can see proper names split into separate tokens; I could have used a POS tagging tool to join them into a single token (see the sketch after this list).
- In the 2nd and 10th topics we can see a couple of words that do not help the topic analysis and could be filtered out.
- Increase the amount of data used for the analysis.
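As a sketch of the first improvement, consecutive proper nouns could be merged into a single token using the POS tags spaCy already assigns (nlp is the pipeline loaded earlier; whether a given name is actually tagged PROPN depends on the tagger and on casing, so this is only a rough illustration):

def merge_proper_names(doc):
    """Join runs of consecutive proper nouns (POS tag 'PROPN') into single tokens."""
    merged, run = [], []
    for tkn in doc:
        if tkn.pos_ == 'PROPN':
            run.append(tkn.text)
            continue
        if run:
            merged.append('_'.join(run))
            run = []
        merged.append(tkn.text)
    if run:
        merged.append('_'.join(run))
    return merged

# e.g. 'Dom Casmurro' becomes 'Dom_Casmurro' when both tokens are tagged PROPN
print(merge_proper_names(nlp('Machado de Assis escreveu Dom Casmurro')))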