NLP with Python
Introduction
This is an attempt to compile different sources of information about Natural Language Processing (NLP) with Python.
Techniques used in NLP
Tokenization
Converts text into segments (n-grams) that can represent one word, two words, or more. During this step, some form of vocabulary reduction is usually applied, such as normalization, stemming, lemmatization, and stop-word removal.
Tools
- NLTK: a string-processing library built by academics.
- spaCy: a better choice for application developers.
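The steps above can be sketched in plain Python, without NLTK or spaCy. The stop-word list here is a tiny illustrative sample (real lists, such as NLTK's, are much larger):

```python
import re

# Illustrative stop words only; real stop-word lists are much larger.
STOP_WORDS = {"the", "a", "an", "is", "of", "and"}

def tokenize(text, n=1):
    """Normalize text, drop stop words, and emit n-grams."""
    # Normalization: lowercase and keep only alphanumeric runs.
    words = re.findall(r"[a-z0-9]+", text.lower())
    words = [w for w in words if w not in STOP_WORDS]
    # Slide a window of size n over the remaining tokens.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

tokenize("The quick brown fox is fast")       # ['quick', 'brown', 'fox', 'fast']
tokenize("The quick brown fox is fast", n=2)  # ['quick brown', 'brown fox', 'fox fast']
```

With n=1 this yields unigrams; larger n yields multi-word segments from the same normalized token stream.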
One-hot encoding
One-hot encoding is a rudimentary way to convert a word into a numerical representation: each word in the vocabulary is assigned a vector of zeros with a single 1 at that word's index.
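A minimal sketch of the idea, using a made-up three-word vocabulary:

```python
def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["cat", "dog", "fish"]
one_hot("dog", vocab)  # [0, 1, 0]
```

Note that the vector length grows with the vocabulary, and the encoding carries no information about word meaning; that limitation motivates TF-IDF and word vectors below.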
TF-IDF
TF-IDF stands for term frequency times inverse document frequency. It represents a word using information about that word's importance: the inverse document frequency normalizes a word's frequency across multiple documents, down-weighting words that appear in many of them.
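One common variant of the formula (there are several smoothed variations) can be sketched directly, assuming the term occurs in at least one document:

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF: term frequency in `doc` times inverse document frequency in `corpus`.

    Assumes `term` appears in at least one document of `corpus`.
    """
    tf = doc.count(term) / len(doc)
    # df = number of documents containing the term; common terms get a low idf.
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "meowed"]]
tf_idf("the", corpus[0], corpus)  # 0.0 -- "the" appears in every document
tf_idf("cat", corpus[0], corpus)  # positive -- "cat" is more distinctive
```

A word that occurs in every document gets an idf of log(1) = 0, so its TF-IDF score vanishes no matter how frequent it is.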
Word vectors
Numerical representation of word semantics using vectors of real numbers, also known as word embeddings. Word vectors can capture the abstraction behind a word, what it means, such as its category (is it a person, a place, an animal, etc.).
Tools
- Word2vec learns word meanings by processing a corpus of unlabeled text.
- TensorFlow has a word2vec family of methods to generate word embeddings.
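Similarity between word vectors is usually measured with cosine similarity. The embeddings below are made-up three-dimensional toy values (real embeddings from Word2vec typically have hundreds of dimensions), just to illustrate the comparison:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; near 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy embeddings with made-up values, purely for illustration.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
cosine_similarity(embeddings["cat"], embeddings["dog"])  # high: related words
cosine_similarity(embeddings["cat"], embeddings["car"])  # low: unrelated words
```

In a trained embedding space, semantically related words end up with high cosine similarity, which is what makes word vectors useful for downstream tasks.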
Applications of NLP
Information extraction
Converts unstructured text into a structured knowledge base. It can be used to extract a variety of information, such as:
- named entities such as places, dates, prices, and so on.
- relations and part-of-speech tags
- keywords
Tools
- Date parsing: dateutil.parser.parse (from the third-party python-dateutil library)
- POS tagging: NLTK or spaCy (spaCy being faster and more accurate)
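A quick sketch of date extraction with dateutil (this assumes python-dateutil is installed; the input string is an arbitrary example):

```python
from dateutil import parser  # third-party: pip install python-dateutil

# parse() accepts many free-form date formats, including ordinals like "3rd".
dt = parser.parse("June 3rd, 2021")
(dt.year, dt.month, dt.day)  # (2021, 6, 3)
```

The result is a standard datetime object, so the extracted date can go straight into a structured record.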
Text classification
Assigns a text to a category. This is generally done by training a classifier, such as a neural network, on labeled data.
Tools
- Keras for building the neural network
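Keras is the tool named above, but the core idea of learning categories from labeled text can be shown dependency-free with a naive Bayes sketch instead (the training sentences are made-up examples):

```python
import math
from collections import Counter, defaultdict

# A tiny made-up labeled dataset.
train = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("awful plot boring acting", "neg"),
]

# "Training": count word occurrences per label.
word_counts = defaultdict(Counter)
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

def classify(text):
    """Naive Bayes with add-one smoothing over the training vocabulary."""
    vocab = {w for counts in word_counts.values() for w in counts}
    scores = {}
    for label in label_counts:
        # Log prior plus summed log likelihoods of each input word.
        score = math.log(label_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

classify("loved the acting")     # "pos"
classify("boring and terrible")  # "neg"
```

A Keras neural network replaces the hand-counted statistics with learned weights, but the workflow is the same: labeled examples in, a category prediction out.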