Chapter 21 Natural Language Processing: Text As Data

Text is a challenging type of data for various reasons. Most importantly, natural language is incredibly complex. But text also does not resemble a numeric matrix, the primary data form that we feed into statistical models.

Python offers wide-ranging functionality for natural language processing (NLP), some of it in general-purpose libraries like sklearn or tensorflow, and some of it in dedicated NLP libraries such as nltk or spacy.

Below, we assume you have imported the following libraries:

import pandas as pd
import numpy as np

21.1 Preparing text

Text data typically needs much more preparation than numeric data. Below we demonstrate a number of preparatory steps. We rely heavily on the nltk library, so we import it here:

import nltk

21.1.1 Cleaning and homogenizing text

Raw text is typically not well suited for analysis. The first steps usually involve good old-fashioned text cleaning, including

  • converting everything to lower case
  • removing line breaks
  • removing punctuation
  • removing numbers

Obviously, what exactly should be done depends on the task. If we are trying to identify persons in the text, we may want to preserve capital letters, as they carry information about whether a word is a name. Similarly, in historical texts we may keep the numbers, as they may represent dates and years.
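For instance, a minimal sketch of these cleaning steps using string methods and the re module (the example string is made up for illustration):

import re

text = 'He said: "I have 500 men!"\nAnd then he left.'
text = text.lower()                        # convert to lower case
text = text.replace("\n", " ")             # remove line breaks
text = re.sub(r"[^\w\s]", "", text)        # remove punctuation
text = re.sub(r"\d+", "", text)            # remove numbers
text = re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace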

21.1.2 Tokenization

The idea of tokenization is to split text into words. However, not every language has a clear notion of what a word is, and text often contains objects that are not obviously words (e.g. numbers and acronyms); hence we call the resulting objects tokens. English and other European languages are rather straightforward to tokenize: one essentially splits the text on whitespace. The nltk library offers a few handy options.

word_tokenize mostly extracts what we normally consider words, but it also treats punctuation symbols, such as commas and periods, as separate tokens. This may be useful in contexts where we want to use information embedded in punctuation:

doc = """I am Liao Hua, and I've had to become a highwayman.
       My five hundred men and I survive on robbery."""
nltk.word_tokenize(doc)[:11]  # returns a list
## ['I', 'am', 'Liao', 'Hua', ',', 'and', 'I', "'ve", 'had', 'to', 'become']

Note also that I’ve has been split into two tokens, I and ’ve, preserving the apostrophe.

A close cousin of the word tokenizer is the sentence tokenizer sent_tokenize, which splits the document into sentences:

nltk.sent_tokenize(doc)  # returns a list
## ["I am Liao Hua, and I've had to become a highwayman.", 'My five hundred men and I survive on robbery.']

Finally, we can also create our own tokenizer based on regular expressions. In this example we consider all “word characters” to be part of a token:

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokenizer.tokenize(doc)[:11]  # returns a list
## ['I', 'am', 'Liao', 'Hua', 'and', 'I', 've', 'had', 'to', 'become', 'a']

Note that unlike word_tokenize, this regular expression removes all non-word characters, including punctuation. I’ve is still split into two tokens, I and ve, but the apostrophe is gone.

21.1.3 Stemming

Stemming is a simple method to make different grammatical forms of the same word more similar. In many contexts it is not useful to preserve different forms of the same word, such as “look” and “looked”. If we are interested in just the topic of the text, both of these words carry pretty much the same meaning, and different grammatical forms just burden the model with extra parameters and require more training. Stemming is a simple way to remove common prefixes and suffixes, and transform the word into its “root form”. However, stemming uses simple heuristics, and as a result, it may get various words wrong.

Here is an example using PorterStemmer from the nltk library:

from nltk.stem import PorterStemmer
ps = PorterStemmer()

for token in tokenizer.tokenize(doc):
    print(ps.stem(token), end=" ")
## i am liao hua and i ve had to becom a highwayman my five hundr men and i surviv on robberi

The example shows several words being “standardized”. In particular, become has turned into becom and survive into surviv, as the stemmer removed the final e (the e is also missing in forms like becoming and surviving).

21.1.4 Lemmatization

Lemmatization is a more careful alternative to stemming: it converts words into their dictionary form (lemma) using dictionary lookup rather than heuristic rules. As it is based on dictionaries, we have to download the corresponding nltk resources first.
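The download is a one-time step (the resource names follow the nltk documentation; omw-1.4 is needed by some nltk versions in addition to the WordNet data itself):

nltk.download('wordnet')   # the WordNet dictionary used by WordNetLemmatizer
nltk.download('omw-1.4')   # Open Multilingual Wordnet, also required by some nltk versions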

from nltk.stem import WordNetLemmatizer 

lemmatizer = WordNetLemmatizer()
  
for token in tokenizer.tokenize(doc):
    print(lemmatizer.lemmatize(token))

This example converts the words into their dictionary form. Note that the case is preserved. By default, the lemmatizer treats all words as nouns (that is why “had” is kept in its past-tense form). You can specify the word type with the pos argument:

lemmatizer.lemmatize("had", pos="v")  # 'v' for verb
## 'have'

Now the lemmatizer looks up the correct dictionary form “have”.

21.1.5 Ngrams

In human language, the meaning of a word depends very much on context, the other words nearby. A popular option to preserve some context is to look at ngrams, ordered sequences of \(n\) words. For instance, from the sample sentence “Video from Chang’e lander” we can create the following three bigrams (2-grams): “video from”, “from chang’e”, “chang’e lander”. Below is an example of how to create ngrams using the nltk library:

## Create 3-grams, use the text example from above
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(doc)
trigrams = nltk.ngrams(tokens, 3)  # 3-grams, returns a generator object
[b for b in trigrams]
## [('I', 'am', 'Liao'), ('am', 'Liao', 'Hua'), ('Liao', 'Hua', 'and'), ('Hua', 'and', 'I'), ('and', 'I', 've'), ('I', 've', 'had'), ('ve', 'had', 'to'), ('had', 'to', 'become'), ('to', 'become', 'a'), ('become', 'a', 'highwayman'), ('a', 'highwayman', 'My'), ('highwayman', 'My', 'five'), ('My', 'five', 'hundred'), ('five', 'hundred', 'men'), ('hundred', 'men', 'and'), ('men', 'and', 'I'), ('and', 'I', 'survive'), ('I', 'survive', 'on'), ('survive', 'on', 'robbery')]

Ngrams can be used in language models instead of tokens. However, as there are many more possible ngrams than tokens, we easily run into the curse of dimensionality.

21.2 Converting text to numbers

After text cleaning and preparation, we are still not ready to apply ML techniques to the data. As statistical models work with numeric matrices, we have to convert the text into such a form.

21.2.1 Bag-of-words and Document-term-matrix

Bag of words (BOW) is essentially just a word frequency table. The simplest way to create such a table is to use CountVectorizer or TfidfVectorizer from sklearn.feature_extraction.text. These vectorizers take a sequence of documents as input and output a BOW for each text. All BOWs are stacked on top of each other; the resulting matrix is called the Document-Term Matrix (DTM). The following example illustrates the steps needed. First, we create the documents and the vectorizer.

## Two documents
doc1 = """
Then he went home and told his mother: "My teacher, Ji Gong, 
wants me to act as Wei Tuo tonight"
"""
doc2 = """
His mother asked, "What is acting as a Wei Tuo"?
"""

## Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vrizer = CountVectorizer()

This sets up the vectorizer (the example uses CountVectorizer, but TfidfVectorizer usage is similar). CountVectorizer has a number of options: e.g. one can ask for a binary (contains/does not contain) BOW instead of word counts, specify stopwords, and supply a custom tokenizer. See the documentation for more information.
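For instance, a sketch of two of these options (binary indicators and the built-in English stopword list):

vrizer_bin = CountVectorizer(binary=True,          # 0/1 indicators instead of counts
                             stop_words="english")  # drop common English stopwords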

The next step is to “fit” the vectorizer and transform the text. This works much like ML models in sklearn, except that here “fitting” primarily means building the vocabulary. Afterwards one can transform different texts using the same vocabulary:

_ = vrizer.fit([doc1, doc2])
X = vrizer.transform([doc1, doc2])
X  # a sparse matrix
## <2x24 sparse matrix of type '<class 'numpy.int64'>'
##  with 29 stored elements in Compressed Sparse Row format>

Both fit and transform take a list of text documents. Instead of a list, other iterable collections, such as pd.Series, also work. There is also a .fit_transform method that performs both fitting and transforming on the same set of documents. This is a slightly easier option if we are working with a single set of documents only.
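For the two documents above, the one-step version is equivalent to the separate fit and transform calls:

X = vrizer.fit_transform([doc1, doc2])  # build the vocabulary and transform in one step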

As DTMs are usually sparse, the result is a sparse matrix, not an ordinary dense matrix. It is a good idea to keep the DTM in sparse form if possible, as this may save a lot of memory and work much faster. Sparse matrices work for most of the functionality we need, but there are tasks where they fail and one has to convert them to dense arrays using the .toarray method. For instance, one cannot create data frames from sparse matrices. In other contexts, sparse matrices behave differently from numpy arrays, more like numpy matrices: this happens, e.g., when you multiply sparse matrices with the * operator. This may create quite a bit of confusion, and one may want to convert them to numpy arrays using .toarray.
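A small illustration of this behavior, assuming X is the 2×24 sparse matrix from above (with the scipy sparse matrix class that CountVectorizer returns here, * is the matrix product):

(X * X.T).shape       # (2, 2): '*' acts as matrix multiplication
X.multiply(X).shape   # (2, 24): elementwise multiplication needs .multiply
X.toarray()           # ordinary dense numpy array, with array semantics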

In order to understand the DTM, one may want to convert it into a data frame with column names equal to the corresponding tokens. One can retrieve the vocabulary using the .get_feature_names method (.get_feature_names_out in newer sklearn versions), and build the data frame from the DTM:

pd.DataFrame(X.toarray(), columns=vrizer.get_feature_names())
##    act  acting  and  as  asked  gong  he  his  home  ...  then  to  told  tonight  tuo  wants  wei  went  what
## 0    1       0    1   1      0     1   1    1     1  ...     1   1     1        1    1      1    1     1     0
## 1    0       1    0   1      1     0   0    1     0  ...     0   0     0        0    1      0    1     0     1
## 
## [2 rows x 24 columns]

This is an example where we need .toarray: one cannot directly convert a sparse matrix to a data frame; it must be converted to a dense array first. We also supply the vocabulary as column names.

Note that CountVectorizer contains a built-in preprocessor and tokenizer: in particular, the documents are split into words, converted to lower case, and stripped of punctuation. All these steps can be adjusted, and instead of creating a DTM of token unigrams, CountVectorizer can also handle token n-grams, as shown below.
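For example, one might request unigrams and bigrams together with the ngram_range argument (a sketch; the vocabulary then contains both single tokens and token pairs):

vrizer12 = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X12 = vrizer12.fit_transform([doc1, doc2])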

The resulting DTM X can be used for modeling like any other design matrix.

21.2.2 Example: categorizing text

Let’s demonstrate the DTM on a simple example: we take two poems by each of two authors, and attribute a further text to one of the authors based on text similarity (we use \(k\)-NN with cosine distance).

The first two poems are by Rudaki:

# Rudaki
rudaki1 = """
   When you find me dead, my lips apart,
   A shell empty of life, worn out by want,
   Sit by my bedside and say, with charm:
   “It is I who killed you, I regret it now.”
"""
# Rudaki
rudaki2 = """
   You killed many, broke the enemy’s courage.
   You gave so much, there isn’t one beggar left.
   Many have lamb and sweets on their table,
   Others, not enough bread to ease their hunger.
   Take action. Don’t sit idle for too long,
   Even though your sacks of gold reach the moon.
"""

The second set of poems is by Sunthorn Phu. We reserve the second stanza of his second poem as the “unknown” document to be analyzed.

phu1 = """
   I salute the Pagoda of the Holy Relics
   May the true religion live forever.
   I make merit, so the Buddha helps me
   Increase my power to attain enlightenment.
   And Id like my words, my book,
   To preserve, till the end of time and heavens,
   Sunthorn the scribe who belongs
   To the King of the White Elephant
"""
# Nirat Phu Khao Thong, 1st stanza - Sunthorn Phu 
phu2 = """
   Near to, I could smell the King's scent,
   Sweetly rend'ring the air at hand:
   The King died, tasteless became the land
   He died, and scentless my own fate.
"""
## 'unknown' document
# Nirat Phu Khao Thong, 2nd stanza - Sunthorn Phu
newdoc = """
   In the Palace His ashes in an urn,
   I in turn my merit dedicate
   To Him, and the Majesty in state
   For a great and glorious reign.
"""

We create the DTM as in the examples above, only now we have four documents instead of two. We also use fit_transform, as we fit and transform on the same data:

vrizer = CountVectorizer()
X = vrizer.fit_transform([rudaki1, rudaki2, phu1, phu2])

It is instructive to see how many distinct tokens the training documents contain:

X.shape[1]
## 118

Next, we transform the unknown document newdoc into a BOW:

newX = vrizer.transform([newdoc])

Here it is important to use transform, not fit_transform. The latter would build a new vocabulary, and hence the DTM columns for the known and the unknown data would not match; we cannot easily compare documents that are transformed in different ways. Note also that we pass a list containing a single document: CountVectorizer expects a list (or another iterable), even if it holds only one document.

Now we are ready to compute the similarity between the documents.

… TBD

We can compute the cosine similarities manually. The easiest way is to first normalize the relevant vectors to unit length, and thereafter compute the similarities with a matrix product. We can normalize newX as

newX = newX.toarray()
newXn = newX/np.sqrt(newX @ newX.T)  # divide by the vector norm
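The remaining step could look as follows (a sketch, re-using X and newXn from above): normalize each row of the training DTM in the same way, and compute all four similarities with a single matrix product.

Xd = X.toarray()                                     # dense 4 x 118 DTM of the training poems
Xn = Xd/np.sqrt((Xd**2).sum(axis=1, keepdims=True))  # normalize each row to unit length
Xn @ newXn.T                                         # cosine similarity of each poem with newdoc

The cosine distance reported by sklearn below is simply one minus these similarities, so the most similar poem is the one with the smallest distance.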

… TBD

We can also use sklearn’s pre-packaged nearest-neighbor model with the cosine metric. NearestNeighbors is set up to find a single nearest neighbor (n_neighbors=1), and .kneighbors returns it together with the distance measure. Here we ask .kneighbors to return the distances to all 4 documents:

from sklearn.neighbors import NearestNeighbors
m = NearestNeighbors(n_neighbors=1, metric="cosine")
_ = m.fit(X)
d = m.kneighbors(newX, n_neighbors=4)  # return distance for all 4 documents
d
## (array([[0.31640093, 0.47776703, 0.69411235, 0.81742581]]), array([[2, 3, 1, 0]]))

As we can see, the closest document based on cosine distance is document 2 (the first Phu poem) with distance 0.316. We got the poet right, although not the poem; still, both of Phu’s poems are closer than either of Rudaki’s. Such short examples contain too little text for realistic precision.

Exercise 21.1

  • Repeat the previous example using TfidfVectorizer instead of CountVectorizer.

See the solution