Chapter 22 Natural Language Processing: Text As Data
Text is a challenging type of data for various reasons. Most importantly, natural language is incredibly complex. But text also does not resemble a numeric matrix, the primary data form that we feed into statistical models.
Python offers a wide range of functionality for natural language processing (NLP), some of it in general-purpose libraries like sklearn or tensorflow, and some of it in dedicated NLP libraries, such as nltk or spacy.
Below, we assume you have imported the following libraries:
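## assumed setup: numpy and pandas are used in the examples below;
## other libraries are imported where they are needed
import numpy as np
import pandas as pd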
22.1 Preparing text
Text data typically needs much more preparation than numeric data. First we demonstrate a number of preparatory steps. We rely heavily on the nltk library, so we import it here:
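import nltk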
22.1.1 Cleaning and homogenizing text
Raw text is typically not well suited for analysis. The first steps usually involve good old-fashioned text cleaning, such as
- converting everything to lower case
- removing line breaks
- removing punctuation
- removing numbers
Obviously, what exactly should be done depends on the task. If we are trying to identify persons in the text, we may want to preserve capital letters, as those carry information about whether a word is a name. Similarly, in historical texts we may keep the numbers, as those may represent dates and years.
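To make this concrete, below is a minimal cleaning sketch using Python's re module; the clean_text helper is a hypothetical name, and the exact steps should be adapted to the task at hand.
import re

def clean_text(text):
    """Lower-case text, drop line breaks, punctuation, and numbers."""
    text = text.lower()                       # convert everything to lower case
    text = text.replace("\n", " ")            # remove line breaks
    text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation
    text = re.sub(r"\d+", " ", text)          # remove numbers
    return re.sub(r"\s+", " ", text).strip()  # collapse the extra whitespace

clean_text("I am Liao Hua,\nand I've had to become a highwayman!")
## 'i am liao hua and i ve had to become a highwayman'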
22.1.2 Tokenization
The idea of tokenization is to split text into words. However, not every language has a clear idea of what a word is, and text often contains objects that are not obviously words (e.g. numbers and acronyms), so we call the resulting objects tokens. English and other European languages are rather straightforward to tokenize: one basically has to split the text on whitespace. The nltk library offers a few handy options.
word_tokenize mostly extracts what we normally consider words, but it also considers punctuation symbols, such as commas and periods, to be separate tokens. This may be useful in contexts where we want to use the information embedded in punctuation:
doc = """I am Liao Hua, and I've had to become a highwayman.
My five hundred men and I survive on robbery."""
nltk.word_tokenize(doc)[:11] # returns a list
## ['I', 'am', 'Liao', 'Hua', ',', 'and', 'I', "'ve", 'had', 'to', 'become']
Note also that I’ve has been split into two tokens, I and ’ve, preserving the apostrophe.
A close cousin of the word tokenizer is the sentence tokenizer sent_tokenize, which splits the document into sentences:
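nltk.sent_tokenize(doc)  # returns a list of sentences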
## ["I am Liao Hua, and I've had to become a highwayman.", 'My five hundred men and I survive on robbery.']
Finally, we can also create our own tokenizer based on regular expressions. In this example we consider all “word characters” to be part of a token:
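## a sketch: keep runs of "word characters" as tokens, drop everything else
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokenizer.tokenize(doc)[:11]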
## ['I', 'am', 'Liao', 'Hua', 'and', 'I', 've', 'had', 'to', 'become', 'a']
Note that unlike in the case of word_tokenize, this regular expression removes all non-word characters, including punctuation. I’ve is split into two tokens, I and ve, and the apostrophe is gone.
22.1.3 Stemming
Stemming is a simple method to make different grammatical forms of the same word more similar. In many contexts it is not useful to preserve different forms of the same word, such as “look” and “looked”. If we are interested in just the topic of the text, both of these words carry pretty much the same meaning, and different grammatical forms just burden the model with extra parameters and require more training. Stemming is a simple way to remove common prefixes and suffixes, and transform the word into its “root form”. However, stemming uses simple heuristics, and as a result, it may get various words wrong.
Here is an example using PorterStemmer from the nltk library, applied to the regexp-tokenized text from above:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
for token in tokenizer.tokenize(doc):
    print(ps.stem(token), end=" ")  # print each stem, separated by spaces
## i am liao hua and i ve had to becom a highwayman my five hundr men and i surviv on robberi
The example shows several words being “standardized”. In particular, become has turned into becom and survive has turned into surviv, as the stemmer removed the final e (that e is also missing in forms like becoming and surviving).
22.1.4 Lemmatization
Lemmatization is a more precise alternative to stemming: it converts each word into its dictionary form. As lemmatization is based on dictionaries, we have to install the corresponding nltk dictionary.
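The lemmatizer below relies on the WordNet data, which can be fetched with nltk.download (a one-time step):
nltk.download("wordnet")  # download the WordNet dictionary if not already present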
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
for token in tokenizer.tokenize(doc):
    print(lemmatizer.lemmatize(token))  # print one lemma per line
## I
## am
## Liao
## Hua
## and
## I
## ve
## had
## to
## become
## a
## highwayman
## My
## five
## hundred
## men
## and
## I
## survive
## on
## robbery
This example converts the words into their dictionary form. Note that the case is preserved. By default, the lemmatizer treats all words as nouns (that is why “had” is preserved in its past tense). You can specify the word type with the pos argument:
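lemmatizer.lemmatize("had", pos="v")  # treat "had" as a verb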
## 'have'
Now the lemmatizer looked up the correct dictionary form, “have”.
22.1.5 Ngrams
In human language, the meaning of a word depends very much on context, that is, on the other words nearby. A popular option to preserve some context is to look at ngrams, \(n\)-word ordered sequences. For instance, in the case of the sample sentence “Video from Chang’e lander” we can create the following three bigrams (2-grams): “video from”, “from chang’e”, “chang’e lander”. Below is an example of how to create ngrams using the nltk library:
## Create 3-grams, use the text example from above
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(doc)
trigrams = nltk.ngrams(tokens, 3) # 3-grams, returns a generator object
[b for b in trigrams]
## [('I', 'am', 'Liao'), ('am', 'Liao', 'Hua'), ('Liao', 'Hua', 'and'), ('Hua', 'and', 'I'), ('and', 'I', 've'), ('I', 've', 'had'), ('ve', 'had', 'to'), ('had', 'to', 'become'), ('to', 'become', 'a'), ('become', 'a', 'highwayman'), ('a', 'highwayman', 'My'), ('highwayman', 'My', 'five'), ('My', 'five', 'hundred'), ('five', 'hundred', 'men'), ('hundred', 'men', 'and'), ('men', 'and', 'I'), ('and', 'I', 'survive'), ('I', 'survive', 'on'), ('survive', 'on', 'robbery')]
Ngrams can be used in language models instead of tokens. However, as there are many more possible ngrams than tokens, we easily run into the curse of dimensionality.
22.2 Converting text to numbers
After text cleaning and preparation, we are still not ready to apply ML techniques to the data. As all statistical models want to work with numeric matrices, we have to convert the text into such a form.
22.2.1 Bag-of-words and Document-term-matrix
Bag of words (BOW) is essentially just a word frequency table. The simplest way to create such a table is to use CountVectorizer or TfidfVectorizer from sklearn.feature_extraction.text. These functions take a sequence of documents as input and output a BOW for each of the texts. All BOWs are stacked on top of each other, so the vectorizer outputs not a vector (a single table) but a matrix, called a document-term matrix (DTM). Each row represents a document, and each column a token.
The following example illustrates the steps to create a DTM. Before we start, we need the documents:
## Two documents
doc1 = """
Then he went home and told his mother: "My teacher, Ji Gong,
wants me to act as Wei Tuo tonight"
"""
doc2 = """
His mother asked, "What is acting as a Wei Tuo"?
"""
The first step is to create the vectorizer:
## Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vrizer = CountVectorizer()
This sets up the vectorizer (the example uses CountVectorizer, but TfidfVectorizer usage is similar). CountVectorizer has a number of options: e.g. one can ask for a binary (contains/does not contain) BOW instead of word counts, specify stopwords, or supply a custom tokenizer. See the documentation for more information.
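For instance, a small sketch with some of these options turned on (the name vrizer_bin is just for illustration):
## a binary BOW that also drops common English stopwords
vrizer_bin = CountVectorizer(binary=True, stop_words="english")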
The next step is to “fit” the vectorizer and transform the text. It is somewhat similar to ML models in sklearn, but here “fitting” means vocabulary building:
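## build the vocabulary from the two documents
## (a sketch; the _t assignment just hides the printed return value)
_t = vrizer.fit([doc1, doc2])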
.fit() requires a sequence of documents. This example has a list of two, but a series from a data frame works too.
Afterwards one can transform different texts using the same vocabulary:
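## a sketch: transform the same documents into a DTM, using the fitted vrizer
X = vrizer.transform([doc1, doc2])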
This is somewhat similar to ML models’ predict() method, where you use a fitted model to make predictions based on new data. Here you need a fitted vectorizer that allows you to create a DTM of new documents. transform(), exactly like fit(), takes a sequence of documents.
There is also a .fit_transform() method that performs both fitting and transforming on the same set of documents. This is a slightly easier option if we are working with a single set of documents only.
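For instance, a one-line sketch that is equivalent to the separate fit and transform calls above:
X = vrizer.fit_transform([doc1, doc2])  # fit and transform in one step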
The resulting DTM X can be used in ML models like any other design matrix. As DTMs are usually sparse, the result is a sparse matrix, not an ordinary dense matrix (see Section 9.1.3):
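X  # the DTM is stored as a scipy sparse matrix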
## <2x24 sparse matrix of type '<class 'numpy.int64'>'
## with 29 stored elements in Compressed Sparse Row format>
It is a good idea to keep the DTM in sparse form if possible, as this may save a lot of memory and work much faster. Sparse matrices have most of the functionality we need, but there are tasks where they fail and one has to convert them to dense arrays using the .toarray() method (see the example below). For instance, one cannot create data frames from sparse matrices. In other contexts, sparse matrices behave differently, more like numpy matrices than numpy arrays; this happens, for instance, when you multiply sparse matrices. This may create quite a bit of confusion, and you may want to convert them to a numpy array instead.
In order to better understand the DTM, you may want to convert it into a data frame with column names equal to the corresponding tokens. You can retrieve the vocabulary using the .get_feature_names_out() method and use it for the column names:
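## a sketch: dense array plus the vocabulary as column names (pd refers to pandas)
pd.DataFrame(X.toarray(), columns=vrizer.get_feature_names_out())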
## act acting and as asked gong he ... told tonight tuo wants wei went what
## 0 1 0 1 1 0 1 1 ... 1 1 1 1 1 1 0
## 1 0 1 0 1 1 0 0 ... 0 0 1 0 1 0 1
##
## [2 rows x 24 columns]
In this example we need to convert the sparse matrix to an array, as we cannot directly use it to make a data frame. Besides that, we supply the vocabulary as the column names.
Note that CountVectorizer contains a built-in preprocessor and tokenizer; in particular, the documents are split into words, converted to lower case, and punctuation is removed. All these steps can be adjusted, and instead of creating a DTM of token unigrams, CountVectorizer can also handle token n-grams.
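For instance, one might include both unigrams and bigrams through the ngram_range argument (a sketch; vrizer2 is just an illustrative name):
vrizer2 = CountVectorizer(ngram_range=(1, 2))  # count unigrams and bigrams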
22.2.2 Example: categorize text
Let’s demonstrate how to use a DTM for modeling with a simple example: we take two poems each from two different authors, and attribute another text to one of the authors based on text similarity (we use \(k\)-NN with cosine distance).
22.2.2.1 The data
First, the data. The first two poems are by Rudaki:
# Rudaki
rudaki1 = """
When you find me dead, my lips apart,
A shell empty of life, worn out by want,
Sit by my bedside and say, with charm:
“It is I who killed you, I regret it now.”
"""
# Rudaki
rudaki2 = """
You killed many, broke the enemy’s courage.
You gave so much, there isn’t one beggar left.
Many have lamb and sweets on their table,
Others, not enough bread to ease their hunger.
Take action. Don’t sit idle for too long,
Even though your sacks of gold reach the moon.
"""
The second set of poems is by Sunthorn Phu. We reserve the second stanza of his second poem to be the “unknown” document to be analyzed.
phu1 = """
I salute the Pagoda of the Holy Relics
May the true religion live forever.
I make merit, so the Buddha helps me
Increase my power to attain enlightenment.
And Id like my words, my book,
To preserve, till the end of time and heavens,
Sunthorn the scribe who belongs
To the King of the White Elephant
"""
# Nirat Phu Khao Thong, 1st stanza - Sunthorn Phu
phu2 = """
Near to, I could smell the King's scent,
Sweetly rend'ring the air at hand:
The King died, tasteless became the land
He died, and scentless my own fate.
"""
## 'unknown' document
# Nirat Phu Khao Thong, 2nd stanza - Sunthorn Phu
newdoc = """
In the Palace His ashes in an urn,
I in turn my merit dedicate
To Him, and the Majesty in state
For a great and glorious reign.
"""
This amounts to five documents, four of which are training data, and the last one is validation data.
22.2.2.2 Vectorize
We create the DTM as in the examples above, just this time we have four documents instead of two. We also use fit_transform, as we do the fitting and transforming on the same data:
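## a sketch: build the vocabulary and the DTM from the four training poems,
## in the order they were listed above
vrizer = CountVectorizer()
X = vrizer.fit_transform([rudaki1, rudaki2, phu1, phu2])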
Exercise 22.1 The dimension of X is
## (4, 118)
Why does it have 4 rows? Why does it have 118 columns?
Next, we transform the unknown document newdoc into a BOW:
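## a sketch: transform the unknown document with the already fitted vectorizer
newX = vrizer.transform([newdoc])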
This time it is important to use transform, not fit_transform. The latter would create a new vocabulary, and hence the DTM columns for the known and unknown data would not match. We cannot really compare documents that are transformed in different ways. A detail to pay attention to is that we have to create a list of a single document, [newdoc]. This is because CountVectorizer expects a sequence of documents, even if fed with just a single one.
Now we are ready to compute the similarity between the documents.
TBD: demonstrate the matrices by converting to df and showing certain words.
22.2.2.3 Classify using k-NN
You may want to use just the baseline sklearn modeling tools. For instance, we can use k-NN to predict the author. For this task we need to create the outcome vector; it might look like this:
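## a sketch: the author of each training poem, in the same order as the DTM rows
y = ["Rudaki", "Rudaki", "Phu", "Phu"]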
Thereafter it is just about using the standard k-NN tools (see Section 14.2):
from sklearn.neighbors import KNeighborsClassifier
m = KNeighborsClassifier(1) # use just a single neighbor
# as we have so few data
_t = m.fit(X, y)
m.predict(newX)
## array(['Phu'], dtype='<U6')
The result is correctly predicted as authored by Sunthorn Phu.
22.2.2.4 Manually find the closest examples
The downside of just predicting the category is that we do not know which documents are the most similar ones, and how similar the individual documents actually are. So instead, we may just compute the cosine similarities between the unknown document and all the other documents in the training data.
The easiest way to do this is to first normalize the relevant vectors, and thereafter compute the similarity using a matrix product. First, normalize the training DTM:
Xa = X.toarray() # convert to array
norms = np.sqrt(np.sum(Xa*Xa, axis = 1)).reshape((-1,1))
# need an elementwise product here!
Xn = Xa/norms
And thereafter normalize newX:
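## a sketch: the same normalization for the single unknown document
newXa = newX.toarray()  # convert to array
newXn = newXa/np.sqrt(np.sum(newXa*newXa))  # a single vector, hence no axis argument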
Exercise 22.2 Above, we used np.sqrt(np.sum(Xa*Xa, axis = 1)).reshape((-1,1)) to compute the norms of X, but we use np.sqrt(np.sum(newXa*newXa)) to compute the norm of newX. Why the difference?
Finally, the cosine similarity:
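## a sketch: similarity of the unknown document to each training document
newXn @ Xn.T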
## array([[0.18257419, 0.30588765, 0.68359907, 0.52223297]])
As you see, the most similar document is #3, Sunthorn Phu’s first poem, with a similarity of 0.684.
22.2.2.5 Use sklearn’s NearestNeighbors
… TBD
We can also use sklearn’s pre-packaged nearest neighbors model with the cosine metric. NearestNeighbors just finds the single nearest neighbor (as n_neighbors=1), and .kneighbors returns it together with the distance measure. We ask kneighbors to return the distances to all 4 documents:
from sklearn.neighbors import NearestNeighbors
m = NearestNeighbors(n_neighbors=1, metric="cosine")
_t = m.fit(X)
d = m.kneighbors(newX, n_neighbors=4) # return distance for all 4 documents
d
## (array([[0.31640093, 0.47776703, 0.69411235, 0.81742581]]), array([[2, 3, 1, 0]]))
As we can see, the closest document based on cosine distance is the one with index 2 (Phu’s first poem), at distance 0.316. We got the poet right, but not the poem; still, both of Phu’s poems are closer than either of Rudaki’s. These short examples contain too little text for realistic precision.
Exercise 22.3
- Repeat the previous example using TfidfVectorizer instead of CountVectorizer.
See the solution