Chapter 21 Natural Language Processing: Text As Data

Text is a challenging type of data for several reasons. Most importantly, natural language is incredibly complex. But text also does not resemble a numeric matrix, the primary data form that we feed into statistical models.

Python offers wide-ranging functionality for natural language processing (NLP), some of it in general-purpose libraries like sklearn or tensorflow, and some of it in dedicated NLP libraries, such as nltk or spacy.

Below, we assume you have imported the following libraries:

import pandas as pd
import numpy as np

21.1 Preparing text

Text data typically needs much more preparation than numeric data. First we demonstrate a number of preparatory steps. We rely heavily on the nltk library, so we import it here:

import nltk

21.1.1 Cleaning and homogenizing text

Raw text is typically not well suited for analysis. The first steps usually include good old-fashioned text cleaning, including

  • converting everything to lower case
  • removing line breaks
  • removing punctuation
  • removing numbers

Obviously, what exactly should be done depends on the task. If we are trying to identify persons in the text, we may want to preserve capital letters, as those carry information about whether a word is a name. Similarly, in historical texts we may want to keep the numbers, as those may represent dates and years.
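As a minimal sketch of these cleaning steps, we can chain a few string operations with a regular expression from Python's re module (the example sentence here is made up for illustration):

import re

text = "The Brown Corpus was compiled in 1961\nat Brown University."
clean = text.lower()                    # convert everything to lower case
clean = clean.replace("\n", " ")        # remove line breaks
clean = re.sub(r"[^a-z\s]", "", clean)  # remove punctuation and numbers
clean = " ".join(clean.split())         # collapse repeated whitespace
print(clean)
## the brown corpus was compiled in at brown university

Note that a single regular expression handles both punctuation and numbers here; in a real application one may want to perform these steps separately, or keep some of the characters, depending on the task.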

21.1.2 Tokenization

The idea of tokenization is to split text into words. However, not every language has a clear notion of what a word is. Also, text often contains objects that are not obviously words (e.g. numbers and acronyms), so we call the resulting objects tokens. English and other European languages are rather straightforward to tokenize: one basically has to split the text on whitespace. The nltk library offers a few handy options.
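To see why splitting on whitespace alone is not quite enough, consider a plain-Python sketch (the example sentence is made up):

text = "Good muffins cost $3.88 in New York."
# naive approach: split on whitespace; punctuation stays glued to the words
print(text.split())
## ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.']

The price and the final period remain attached to their neighboring tokens; dedicated tokenizers handle such cases more sensibly.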

word_tokenize mostly extracts what we normally consider words, but it also treats punctuation symbols, such as commas and periods, as separate tokens. This may be useful in contexts where we want to use information embedded in punctuation.
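A minimal sketch (the example sentence is again made up; word_tokenize relies on nltk's punkt models, which may need a one-time nltk.download("punkt")):

text = "Good muffins cost $3.88 in New York."
# punctuation and the currency symbol become separate tokens
print(nltk.word_tokenize(text))
## ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.']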