Text processing with spaCy
Introduction
spaCy is a library for Natural Language Processing (NLP) in Python. It offers solutions for common text-processing tasks such as tokenization, named entity recognition, word vectors, and part-of-speech tagging. A well-known alternative is NLTK, which seems to be used mostly in academia, whereas spaCy is often recommended for production use.
Load the language model
First, we need to download the model for the language with python -m spacy download model-name (here, pt_core_news_sm for Portuguese).
Then we can load the model in our code with spacy.load(), which returns a Language object.
This object has all the data and methods required to process text.
Calling this object on a string of text returns a processed Doc object, which is a sequence of Token objects.
The Doc object has a __getitem__() method, which lets the container be indexed and sliced like a list.
import spacy
nlp = spacy.load('pt_core_news_sm')
text = """Joaquim Maria Machado de Assis (Rio de Janeiro, 21 de junho de
1839 — Rio de Janeiro, 29 de setembro de 1908) foi um escritor
brasileiro, considerado por muitos críticos, estudiosos, escritores e
leitores um dos maiores senão o maior nome da literatura do Brasil."""
doc = nlp(text)
print(doc, type(doc))
Joaquim Maria Machado de Assis (Rio de Janeiro, 21 de junho de 1839 — Rio de Janeiro, 29 de setembro de 1908) foi um escritor brasileiro, considerado por muitos críticos, estudiosos, escritores e leitores um dos maiores senão o maior nome da literatura do Brasil. <class 'spacy.tokens.doc.Doc'>
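The list-like access described above can be sketched as follows. This is a minimal illustration, not part of the original walkthrough: it assumes spaCy v3 and uses spacy.blank("pt"), a tokenizer-only pipeline, so that it runs without the downloaded model. The sample sentence is made up for the example.

```python
import spacy
from spacy.tokens import Span, Token

# Assumption: a blank Portuguese pipeline (tokenizer only, no trained components)
nlp = spacy.blank("pt")
doc = nlp("Machado de Assis foi um escritor brasileiro.")

first = doc[0]    # an integer index returns a Token
name = doc[0:3]   # a slice returns a Span, a view into the Doc
last = doc[-1]    # negative indices work as they do for lists

print(type(first).__name__, first.text)
print(type(name).__name__, name.text)
print(len(doc), last.text)
```

Note that slicing returns a Span and not a new Doc, so the slice still points back at the original document.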
Tokenization
When the string is processed into a Doc, spaCy automatically tokenizes the text, that is, it divides the text into individual units (tokens) such as words and punctuation marks.
for token in doc[:10]:
    print(token.text)
Joaquim
Maria
Machado
de
Assis
(
Rio
de
Janeiro
,
We can see that the tokenizer already takes punctuation into account: the parenthesis and the comma come out as separate tokens.
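Because punctuation marks are tokens of their own, they are easy to filter out. The sketch below assumes spaCy v3 and again uses a blank Portuguese pipeline with a made-up sentence; it relies on the Token.is_punct flag, which is a lexical attribute and does not need a trained model.

```python
import spacy

# Assumption: tokenizer-only pipeline, so no model download is required
nlp = spacy.blank("pt")
doc = nlp("Machado de Assis (Rio de Janeiro, 1839) foi um escritor.")

# Split the tokens into words and punctuation using the is_punct flag
words = [token.text for token in doc if not token.is_punct]
punctuation = [token.text for token in doc if token.is_punct]

print(words)
print(punctuation)
```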
Part of speech (POS)
Part-of-speech tagging assigns each token its word type (noun, verb, adposition, and so on).
The loaded model was trained to associate each word with its type.
We need to use token.pos_ (with the trailing underscore) to get a readable string; token.pos holds the corresponding integer ID.
for token in doc[:10]:
    print(token.text, token.pos_)
Joaquim PROPN
Maria PROPN
Machado PROPN
de ADP
Assis PROPN
( PUNCT
Rio PROPN
de ADP
Janeiro PROPN
, PUNCT
We can see the types PROPN (proper noun), ADP (adposition), and PUNCT (punctuation).
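Once tokens carry tags, they can be counted or filtered by type. The following is a sketch under stated assumptions: to stay independent of the downloaded model, the Doc is built by hand from the ten tokens and tags printed above, which relies on the spaCy v3 Doc constructor accepting a pos argument. With the real pipeline, the same two expressions would work directly on nlp(text).

```python
from collections import Counter

import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Tokens and tags copied from the output above (hand-built, not predicted here)
words = ["Joaquim", "Maria", "Machado", "de", "Assis", "(", "Rio", "de", "Janeiro", ","]
tags = ["PROPN", "PROPN", "PROPN", "ADP", "PROPN", "PUNCT", "PROPN", "ADP", "PROPN", "PUNCT"]
doc = Doc(Vocab(), words=words, pos=tags)

# Count how often each part of speech occurs
pos_counts = Counter(token.pos_ for token in doc)

# Keep only the proper nouns
proper_nouns = [token.text for token in doc if token.pos_ == "PROPN"]

for tag, count in pos_counts.items():
    # spacy.explain returns a short human-readable description of a tag
    print(tag, count, spacy.explain(tag))
print(proper_nouns)
```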
spaCy also offers a really neat way of displaying the text with its built-in visualizer, displacy.
from spacy import displacy
displacy.render(doc[:5])
Named entity recognition (NER)
The NER process aims to identify entities that represent something in reality (e.g. person, city, date).
The doc.ents attribute holds the named entities as a tuple of Span objects.
for ent in doc.ents:
    print(ent.text, ent.label_)
Joaquim Maria Machado de Assis PER
Rio de Janeiro LOC
Rio de Janeiro LOC
Brasil LOC
Very nice and easy. This result can also be displayed nicely.
from spacy import displacy
displacy.render(doc, style='ent')
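Outside a notebook, displacy.render returns the markup as a string, which can then be saved or embedded in a page. The sketch below assumes spaCy v3 and uses displacy's manual mode, where the text and entity offsets are passed as a plain dictionary, so no model is needed. The sentence, the entity labels, and the char_offsets helper are hypothetical and only serve the illustration.

```python
from spacy import displacy

text = "Machado de Assis nasceu no Rio de Janeiro."


def char_offsets(text, phrase, label):
    """Hypothetical helper: build a displacy entity dict from a substring."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


# Manual mode expects character offsets, not token indices
ents = [
    char_offsets(text, "Machado de Assis", "PER"),
    char_offsets(text, "Rio de Janeiro", "LOC"),
]

# jupyter=False forces a returned string; page=True wraps it in a full HTML page
html = displacy.render(
    {"text": text, "ents": ents, "title": None},
    style="ent",
    manual=True,
    jupyter=False,
    page=True,
)
print(html[:80])
```

The returned string could be written to an .html file and opened in a browser.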
As we can see, it didn't recognize the dates. We can assign those manually by creating a Span over the relevant tokens and adding it to doc.ents.
from spacy import displacy
from spacy.tokens import Span
print(doc[10:16])  # check that these tokens cover the date
date_ent = Span(doc, 10, 16, label='DATE')
doc.ents = list(doc.ents) + [date_ent]  # add a new ent to the list
displacy.render(doc, style='ent')
21 de junho de 1839
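Counting token indices by hand is error-prone, so one alternative is to locate the span by character offsets with Doc.char_span. This is a sketch, assuming spaCy v3, a blank Portuguese pipeline, and a made-up sentence; it is not the approach used above, only a variation on it.

```python
import spacy

# Assumption: tokenizer-only pipeline, so doc.ents starts out empty
nlp = spacy.blank("pt")
text = "Machado de Assis nasceu em 21 de junho de 1839 no Rio de Janeiro."
doc = nlp(text)

# Find the date by its character position instead of counting tokens
date_text = "21 de junho de 1839"
start = text.index(date_text)
date_ent = doc.char_span(start, start + len(date_text), label="DATE")

# char_span returns None when the offsets do not align with token boundaries
if date_ent is not None:
    doc.ents = list(doc.ents) + [date_ent]

print([(ent.text, ent.label_) for ent in doc.ents])
```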
Conclusion
I am very impressed by this tool. It is very high level and easy to use, and I'm looking forward to exploring it more.