Chapter 21 Natural Language Processing: Text As Data

Text is a challenging type of data for various reasons. Most importantly, natural language is incredibly complex. But text also does not resemble a numeric matrix, the primary data form that we feed into statistical models.

Python offers wide-ranging functionality for natural language processing (NLP), some of it in general-purpose libraries like sklearn or tensorflow, and some of it in dedicated NLP libraries such as nltk or spacy.

Below, we assume you have imported the following libraries:

import pandas as pd
import numpy as np

21.1 Preparing text

Text data typically needs much more preparation than numeric data. Below we demonstrate a number of preparatory steps. We rely heavily on the nltk library, so we import it here:

import nltk

21.1.1 Cleaning and homogenizing text

Raw text is typically not well suited for analysis. The first steps usually involve good old-fashioned text cleaning, including

  • converting everything to lower case
  • removing line breaks
  • removing punctuation
  • removing numbers

Obviously, what exactly should be done depends on the task. If we are trying to identify persons in the text, we may want to preserve capital letters, as they carry information about whether a word is a name. Similarly, in historical texts we may keep the numbers, as they may represent dates and years.
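For instance, a minimal sketch of these cleaning steps using string methods and the re module (the example string is made up for illustration):

import re

text = 'He said: "I have 500 men!"\nAnd then he left.'
text = text.lower()                        # convert to lower case
text = text.replace("\n", " ")             # remove line breaks
text = re.sub(r"[^\w\s]", "", text)        # remove punctuation
text = re.sub(r"\d+", "", text)            # remove numbers
text = re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace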

21.1.2 Tokenization

The idea of tokenization is to split text into words. However, not every language has a clear notion of what a word is, and text often contains objects that are not obviously words (e.g. numbers and acronyms); hence we call the resulting objects tokens. English and other European languages are rather straightforward to tokenize: one essentially splits the text on whitespace. The nltk library offers a few handy options.

word_tokenize mostly extracts what we normally consider words, but it also treats punctuation symbols, such as commas and periods, as separate tokens. This may be useful in contexts where we want to use information embedded in punctuation:

doc = """I am Liao Hua, and I've had to become a highwayman.
       My five hundred men and I survive on robbery."""
nltk.word_tokenize(doc)[:11]  # returns a list
## ['I', 'am', 'Liao', 'Hua', ',', 'and', 'I', "'ve", 'had', 'to', 'become']

Note also that I’ve has been split into two tokens, I and ’ve, preserving the apostrophe.

A close cousin of the word tokenizer is the sentence tokenizer sent_tokenize, which splits the document into sentences:

nltk.sent_tokenize(doc)  # returns a list
## ["I am Liao Hua, and I've had to become a highwayman.", 'My five hundred men and I survive on robbery.']

Finally, we can also create our own tokenizer based on regular expressions. In this example we consider all “word characters” to be part of a token:

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokenizer.tokenize(doc)[:11]  # returns a list
## ['I', 'am', 'Liao', 'Hua', 'and', 'I', 've', 'had', 'to', 'become', 'a']

Note that unlike word_tokenize, this regular expression removes all non-word characters, including punctuation. I’ve is still split into two tokens, I and ve, but the apostrophe is gone.

21.1.3 Stemming

Stemming is a simple method to make different grammatical forms of the same word more similar. In many contexts it is not useful to preserve different forms of the same word, such as “look” and “looked”. If we are interested in just the topic of the text, both of these words carry pretty much the same meaning, and different grammatical forms just burden the model with extra parameters and require more training. Stemming is a simple way to remove common prefixes and suffixes, and transform the word into its “root form”. However, stemming uses simple heuristics, and as a result, it may get various words wrong.

Here is an example using PorterStemmer from the nltk library:

from nltk.stem import PorterStemmer
ps = PorterStemmer()

for token in tokenizer.tokenize(doc):
    print(ps.stem(token), end=" ")
## i am liao hua and i ve had to becom a highwayman my five hundr men and i surviv on robberi

The example shows several words being “standardized”. In particular, become has turned into becom and survive into surviv, as the stemmer removed the final e (the e is also missing in forms like becoming and surviving).

21.1.4 Lemmatization

Lemmatization is a more careful alternative to stemming: it converts words into their dictionary form (lemma) using dictionary lookup rather than heuristic rules. As it is based on dictionaries, we have to download the corresponding nltk resources first.
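The download is a one-time step (the resource names follow the nltk documentation; omw-1.4 is needed by some nltk versions in addition to the WordNet data itself):

nltk.download('wordnet')   # the WordNet dictionary used by WordNetLemmatizer
nltk.download('omw-1.4')   # Open Multilingual Wordnet, also required by some nltk versions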

from nltk.stem import WordNetLemmatizer 

lemmatizer = WordNetLemmatizer()
  
for token in tokenizer.tokenize(doc):
    print(lemmatizer.lemmatize(token))

This example converts the words into their dictionary form. Note that the case is preserved. By default, the lemmatizer treats all words as nouns (that is why “had” is kept in its past-tense form). You can specify the word type with the pos argument:

lemmatizer.lemmatize("had", pos="v")  # 'v' for verb
## 'have'

Now the lemmatizer looks up the correct dictionary form “have”.

21.1.5 Ngrams

In human language, the meaning of a word depends very much on context, the other words nearby. A popular option to preserve some context is to look at ngrams, ordered sequences of \(n\) words. For instance, from the sample sentence “Video from Chang’e lander” we can create the following three bigrams (2-grams): “video from”, “from chang’e”, “chang’e lander”. Below is an example of how to create ngrams using the nltk library:

## Create 3-grams, use the text example from above
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(doc)
trigrams = nltk.ngrams(tokens, 3)  # 3-grams, returns a generator object
[b for b in trigrams]
## [('I', 'am', 'Liao'), ('am', 'Liao', 'Hua'), ('Liao', 'Hua', 'and'), ('Hua', 'and', 'I'), ('and', 'I', 've'), ('I', 've', 'had'), ('ve', 'had', 'to'), ('had', 'to', 'become'), ('to', 'become', 'a'), ('become', 'a', 'highwayman'), ('a', 'highwayman', 'My'), ('highwayman', 'My', 'five'), ('My', 'five', 'hundred'), ('five', 'hundred', 'men'), ('hundred', 'men', 'and'), ('men', 'and', 'I'), ('and', 'I', 'survive'), ('I', 'survive', 'on'), ('survive', 'on', 'robbery')]

Ngrams can be used in language models instead of tokens. However, as there are many more possible ngrams than tokens, we easily run into the curse of dimensionality.

21.2 Converting text to numbers

After text cleaning and preparation, we are still not ready to apply ML techniques to the data. As statistical models work with numeric matrices, we have to convert the text into such a form.

21.2.1 Bag-of-words and Document-term-matrix

Bag of words (BOW) is essentially just a word frequency table. The simplest way to create such a table is to use CountVectorizer or TfidfVectorizer from sklearn.feature_extraction.text. These vectorizers take a sequence of documents as input and output a BOW for each text. All BOWs are stacked on top of each other; the resulting matrix is called the Document-Term Matrix (DTM). The following example illustrates the steps needed. First, we create the documents and the vectorizer.

## Two documents
doc1 = """
Then he went home and told his mother: "My teacher, Ji Gong, 
wants me to act as Wei Tuo tonight"
"""
doc2 = """
His mother asked, "What is acting as a Wei Tuo"?
"""

## Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vrizer = CountVectorizer()

This sets up the vectorizer (the example uses CountVectorizer, but TfidfVectorizer usage is similar). CountVectorizer has a number of options: e.g. one can ask for a binary (contains/does not contain) BOW instead of word counts, specify stopwords, and supply a custom tokenizer. See the documentation for more information.
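For instance, a sketch of two of these options (binary indicators and the built-in English stopword list):

vrizer_bin = CountVectorizer(binary=True,          # 0/1 indicators instead of counts
                             stop_words="english")  # drop common English stopwords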

The next step is to “fit” the vectorizer and transform the text. This works much like ML models in sklearn, except that here “fitting” primarily means building the vocabulary. Afterwards one can transform different texts using the same vocabulary:

_ = vrizer.fit([doc1, doc2])
X = vrizer.transform([doc1, doc2])
X  # a sparse matrix
## <2x24 sparse matrix of type '<class 'numpy.int64'>'
##  with 29 stored elements in Compressed Sparse Row format>

Both fit and transform take a list of text documents. Instead of a list, other iterable collections, such as pd.Series, also work. There is also a .fit_transform method that performs both fitting and transforming on the same set of documents. This is a slightly easier option if we are working with a single set of documents only.
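For the two documents above, the one-step version is equivalent to the separate fit and transform calls:

X = vrizer.fit_transform([doc1, doc2])  # build the vocabulary and transform in one step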

As DTMs are usually sparse, the result is a sparse matrix, not an ordinary dense matrix. It is a good idea to keep the DTM in sparse form if possible, as this may save a lot of memory and work much faster. Sparse matrices work for most of the functionality we need, but there are tasks where they fail and one has to convert them to dense arrays using the .toarray method. For instance, one cannot create data frames from sparse matrices. In other contexts, sparse matrices behave differently from numpy arrays, more like numpy matrices: this happens, e.g., when you multiply sparse matrices with the * operator. This may create quite a bit of confusion, and one may want to convert them to numpy arrays using .toarray.
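A small illustration of this behavior, assuming X is the 2×24 sparse matrix from above (with the scipy sparse matrix class that CountVectorizer returns here, * is the matrix product):

(X * X.T).shape       # (2, 2): '*' acts as matrix multiplication
X.multiply(X).shape   # (2, 24): elementwise multiplication needs .multiply
X.toarray()           # ordinary dense numpy array, with array semantics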

In order to understand the DTM, one may want to convert it into a data frame with column names equal to the corresponding tokens. One can retrieve the vocabulary using the .get_feature_names method (.get_feature_names_out in newer sklearn versions), and build the data frame from the DTM:

pd.DataFrame(X.toarray(), columns=vrizer.get_feature_names())
##    act  acting  and  as  asked  gong  he  his  home  ...  then  to  told  tonight  tuo  wants  wei  went  what
## 0    1       0    1   1      0     1   1    1     1  ...     1   1     1        1    1      1    1     1     0
## 1    0       1    0   1      1     0   0    1     0  ...     0   0     0        0    1      0    1     0     1
## 
## [2 rows x 24 columns]

This is an example where we need .toarray: one cannot directly convert a sparse matrix to a data frame; it must be converted to a dense array first. We also supply the vocabulary as column names.

Note that CountVectorizer contains a built-in preprocessor and tokenizer: in particular, the documents are split into words, converted to lower case, and stripped of punctuation. All these steps can be adjusted, and instead of creating a DTM of token unigrams, CountVectorizer can also handle token n-grams, as shown below.
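For example, one might request unigrams and bigrams together with the ngram_range argument (a sketch; the vocabulary then contains both single tokens and token pairs):

vrizer12 = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X12 = vrizer12.fit_transform([doc1, doc2])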

The resulting DTM X can be used for modeling like any other design matrix.

21.2.2 Example: categorizing text

Let’s demonstrate the DTM on a simple example: we take two poems by each of two authors, and attribute a further text to one of the authors based on text similarity (we use \(k\)-NN with cosine distance).

The first two poems are by Rudaki:

# Rudaki
rudaki1 = """
   When you find me dead, my lips apart,
   A shell empty of life, worn out by want,
   Sit by my bedside and say, with charm:
   “It is I who killed you, I regret it now.”
"""
# Rudaki
rudaki2 = """
   You killed many, broke the enemy’s courage.
   You gave so much, there isn’t one beggar left.
   Many have lamb and sweets on their table,
   Others, not enough bread to ease their hunger.
   Take action. Don’t sit idle for too long,
   Even though your sacks of gold reach the moon.
"""

The second set of poems is by Sunthorn Phu. We reserve the second stanza of his second poem as the “unknown” document to be analyzed.

phu1 = """
   I salute the Pagoda of the Holy Relics
   May the true religion live forever.
   I make merit, so the Buddha helps me
   Increase my power to attain enlightenment.
   And Id like my words, my book,
   To preserve, till the end of time and heavens,
   Sunthorn the scribe who belongs
   To the King of the White Elephant
"""
# Nirat Phu Khao Thong, 1st stanza - Sunthorn Phu 
phu2 = """
   Near to, I could smell the King's scent,
   Sweetly rend'ring the air at hand:
   The King died, tasteless became the land
   He died, and scentless my own fate.
"""
## 'unknown' document
# Nirat Phu Khao Thong, 2nd stanza - Sunthorn Phu
newdoc = """
   In the Palace His ashes in an urn,
   I in turn my merit dedicate
   To Him, and the Majesty in state
   For a great and glorious reign.
"""

We create the DTM as in the examples above, only now we have four documents instead of two. We also use fit_transform, as we fit and transform on the same data:

vrizer = CountVectorizer()
X = vrizer.fit_transform([rudaki1, rudaki2, phu1, phu2])

It is instructive to see how many distinct tokens the training documents contain:

X.shape[1]
## 118

Next, we transform the unknown document newdoc into a BOW:

newX = vrizer.transform([newdoc])

Here it is important to use transform, not fit_transform. The latter would build a new vocabulary, and hence the DTM columns for the known and the unknown data would not match; we cannot easily compare documents that are transformed in different ways. Note also that we pass a list containing a single document: CountVectorizer expects a list (or another iterable), even if it holds only one document.

Now we are ready to compute the similarity between the documents.

… TBD

We can compute the cosine similarities manually. The easiest way is to first normalize the relevant vectors to unit length, and thereafter compute the similarities with a matrix product. We can normalize newX as

newX = newX.toarray()
newXn = newX/np.sqrt(newX @ newX.T)  # divide by the vector norm
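The remaining step could look as follows (a sketch, re-using X and newXn from above): normalize each row of the training DTM in the same way, and compute all four similarities with a single matrix product.

Xd = X.toarray()                                     # dense 4 x 118 DTM of the training poems
Xn = Xd/np.sqrt((Xd**2).sum(axis=1, keepdims=True))  # normalize each row to unit length
Xn @ newXn.T                                         # cosine similarity of each poem with newdoc

The cosine distance reported by sklearn below is simply one minus these similarities, so the most similar poem is the one with the smallest distance.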

… TBD

We can also use sklearn’s pre-packaged nearest-neighbor model with the cosine metric. NearestNeighbors is set up to find a single nearest neighbor (n_neighbors=1), and .kneighbors returns it together with the distance measure. Here we ask .kneighbors to return the distances to all 4 documents:

from sklearn.neighbors import NearestNeighbors
m = NearestNeighbors(n_neighbors=1, metric="cosine")
_ = m.fit(X)
d = m.kneighbors(newX, n_neighbors=4)  # return distance for all 4 documents
d
## (array([[0.31640093, 0.47776703, 0.69411235, 0.81742581]]), array([[2, 3, 1, 0]]))

As we can see, the closest document based on cosine distance is document 2 (the first Phu poem) with distance 0.316. We got the poet right, although not the poem; still, both of Phu’s poems are closer than either of Rudaki’s. Such short examples contain too little text for realistic precision.

Exercise 21.1

  • Repeat the previous example using TfidfVectorizer instead of CountVectorizer.

See the solution