NLP with Python

Dec 9 2020

machine-learning , natural-language-processing

Introduction

This is an attempt to compile information from different sources about natural language processing (NLP) with Python.

Techniques used in NLP

Tokenization

Splits text into segments (tokens), usually words; consecutive tokens can be grouped into n-grams of one, two, or more words. This step usually includes some form of vocabulary reduction, such as normalization, stemming, lemmatization, or stop-word removal.
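As a rough illustration of these steps, here is a minimal sketch that uses only the standard library. The function names and the tiny stop-word list are assumptions made up for this example; NLTK and spaCy provide far more complete tokenizers.

```python
import re

# A tiny, hand-picked stop-word list (an assumption for this sketch);
# real libraries ship much larger ones.
STOP_WORDS = {"the", "a", "an", "is", "of", "on", "and"}

def tokenize(text):
    """Lowercase the text (a simple normalization) and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop very common words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The cat sat on the mat.")
print(tokens)                     # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(remove_stop_words(tokens))  # ['cat', 'sat', 'mat']
print(ngrams(tokens, 2)[:2])      # [('the', 'cat'), ('cat', 'sat')]
```

Stemming and lemmatization are left out here because they need language-specific rules or dictionaries.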

Tools

NLTK: string processing library, built by academics.
spaCy: better choice for app developers.

One-hot encoding

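One-hot encoding is a rudimentary way to represent a word numerically: each word in the vocabulary is assigned an integer index, and the word becomes a vector that is all zeros except for a 1 at that index.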
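A small sketch of the idea, assuming a vocabulary built from a handful of tokens and sorted alphabetically (both `build_vocabulary` and `one_hot` are hypothetical helper names, not part of any library):

```python
def build_vocabulary(tokens):
    """Map each distinct token to an index, in sorted order."""
    return {word: i for i, word in enumerate(sorted(set(tokens)))}

def one_hot(word, vocabulary):
    """Return a list of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[word]] = 1
    return vector

vocabulary = build_vocabulary(["cat", "sat", "mat", "cat"])
print(vocabulary)                  # {'cat': 0, 'mat': 1, 'sat': 2}
print(one_hot("mat", vocabulary))  # [0, 1, 0]
```

The vector length equals the vocabulary size, which is why this representation gets impractical for large vocabularies.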

TF-IDF

TF-IDF stands for term frequency times inverse document frequency. It represents a word by a score that reflects how important the word is to a document. The inverse document frequency normalizes a word's frequency across multiple documents, so words that appear in nearly every document are weighted down.
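TF-IDF has several variants. The sketch below assumes one common form, with term frequency as the relative count within a document and inverse document frequency as log(N / df), and it assumes the term occurs in at least one document. The toy documents are invented for illustration.

```python
import math

def tf(term, document):
    """Relative frequency of the term within one tokenized document."""
    return document.count(term) / len(document)

def idf(term, documents):
    """log(N / df), where df is the number of documents containing the term.

    Assumes the term occurs in at least one document (df > 0).
    """
    df = sum(1 for d in documents if term in d)
    return math.log(len(documents) / df)

def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)

# Toy corpus, already tokenized.
documents = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

# "the" appears in every document, so its score is zero.
print(tf_idf("the", documents[0], documents))
# "dog" appears in one document only, so it scores higher than "cat".
print(tf_idf("dog", documents[1], documents) > tf_idf("cat", documents[0], documents))
```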

Word vectors

A numerical representation of word semantics using vectors of real numbers, also known as word embeddings. A word vector can capture the abstraction behind a word, that is, what it means, such as its category (a person, a place, an animal, and so on).
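To make the idea concrete, the sketch below compares words by the cosine similarity of their vectors. The three-dimensional vectors are made up for illustration; real embeddings are learned from a corpus and usually have hundreds of dimensions.

```python
import math

# Hypothetical embeddings, invented for this example (not learned from data).
embeddings = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.2],
    "paris": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; closer to 1 means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_paris = cosine_similarity(embeddings["cat"], embeddings["paris"])
print(cat_dog > cat_paris)  # True: the two animals are closer to each other
```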

Tools

Word2vec learns word meanings by processing a corpus of unlabeled text.
TensorFlow has a word2vec family of methods to generate word embeddings.

Applications of NLP

Information extraction

Converts unstructured text into a structured knowledge base. It can be used to extract a variety of information, such as:

named entities such as places, dates, prices, and so on.
relations (part of speech)
keywords
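As a very rough sketch of extracting structured fields from free text, the example below pulls prices and ISO-formatted dates out of a sentence with regular expressions. The patterns and the `extract_entities` helper are assumptions chosen for this example; real named-entity recognition handles far more formats than these two patterns do.

```python
import re

# Illustrative patterns only: dollar amounts and YYYY-MM-DD dates.
PRICE_PATTERN = re.compile(r"\$\d+(?:\.\d{2})?")
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def extract_entities(text):
    """Return the prices and dates found in the text as a structured record."""
    return {
        "prices": PRICE_PATTERN.findall(text),
        "dates": DATE_PATTERN.findall(text),
    }

record = extract_entities("The ticket cost $42.50 on 2020-12-09.")
print(record)  # {'prices': ['$42.50'], 'dates': ['2020-12-09']}
```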

Tools

Date parsing: dateutil.parser.parse (from the python-dateutil package)
POS tagging: NLTK or spaCy (faster and more accurate)

Text classification

Assigns a text to one of a set of categories. This is generally done by training a neural network on labeled data.
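The sketch below shows the idea at its smallest scale. Instead of a full neural network, it trains a single perceptron (one artificial neuron) on bag-of-words counts in pure Python. The labeled sentences, the helper names, and the epoch count are all assumptions made up for this example.

```python
def featurize(text, vocabulary):
    """Bag-of-words vector: how often each vocabulary word occurs in the text."""
    words = text.lower().split()
    return [words.count(w) for w in vocabulary]

def score(x, weights, bias):
    return bias + sum(w * xi for w, xi in zip(weights, x))

def train_perceptron(samples, vocabulary, epochs=10):
    """Classic perceptron rule: adjust the weights only on misclassified samples."""
    weights = [0.0] * len(vocabulary)
    bias = 0.0
    for _ in range(epochs):
        for text, label in samples:
            x = featurize(text, vocabulary)
            predicted = 1 if score(x, weights, bias) > 0 else 0
            error = label - predicted  # 0 when correct, +1 or -1 when wrong
            bias += error
            weights = [w + error * xi for w, xi in zip(weights, x)]
    return weights, bias

def predict(text, vocabulary, weights, bias):
    return 1 if score(featurize(text, vocabulary), weights, bias) > 0 else 0

# Toy labeled data: 1 = positive review, 0 = negative review.
samples = [
    ("great movie loved it", 1),
    ("wonderful acting great plot", 1),
    ("terrible movie hated it", 0),
    ("boring plot terrible acting", 0),
]
vocabulary = sorted({w for text, _ in samples for w in text.split()})
weights, bias = train_perceptron(samples, vocabulary)

print(predict("great plot", vocabulary, weights, bias))     # 1
print(predict("terrible plot", vocabulary, weights, bias))  # 0
```

Words outside the training vocabulary are simply ignored. A Keras model would replace the hand-written training loop with stacked layers and gradient-based optimization.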

Tools

Keras for building the neural network
