Chapter 21 Natural Language Processing: Text As Data
Text is a challenging type of data for various reasons. Most importantly, natural language is incredibly complex. But also, text does not resemble a numeric matrix, the primary data form that we feed into statistical models.
Python offers a wide range of functionality for natural language processing (NLP), some of it in general-purpose libraries like sklearn or tensorflow, and some in dedicated NLP libraries such as nltk or spacy.
Below, we assume you have imported the following libraries:
import pandas as pd
import numpy as np
21.1 Preparing text
Text data typically needs much more preparation than numeric data. First we demonstrate a number of preparatory steps. We rely heavily on the nltk library, so we import it here:
import nltk
21.1.1 Cleaning and homogenizing text
Raw text is typically not well suited for analysis. The first steps usually include good old-fashioned text cleaning, including
- converting everything to lower case
- removing line breaks
- removing punctuation
- removing numbers
Obviously, what exactly should be done depends on the task. If we are trying to identify persons in the text, we may want to preserve capital letters, as those carry information about whether a word is a name. Similarly, in historical texts we may keep the numbers, as those may represent dates and years.
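As a minimal sketch of such cleaning (the example string and variable names below are made up for illustration), one can use standard string methods and regular expressions:
import re

raw = "He bought 3 horses\nand a DRAGON robe."  # made-up example text
text = raw.lower()                    # convert everything to lower case
text = text.replace("\n", " ")        # remove line breaks
text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation
text = re.sub(r"\d+", " ", text)      # remove numbers
text = re.sub(r"\s+", " ", text).strip()  # squeeze repeated whitespace
text
## 'he bought horses and a dragon robe'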
21.1.2 Tokenization
The idea of tokenization is to split text into words. However, not every language has an equally clear notion of what a word is. Also, text often contains objects that are not obviously words (e.g. numbers and acronyms), so we call the resulting objects tokens. English and other European languages are rather straightforward to tokenize: one basically has to split the text on whitespace. The nltk library offers a few handy options.
word_tokenize mostly extracts what we normally consider words, but it also treats punctuation symbols, such as commas and periods, as separate tokens. This may be useful in contexts where we want to use the information embedded in punctuation:
= """I am Liao Hua, and I've had to become a highwayman.
doc My five hundred men and I survive on robbery."""
11] # returns a list nltk.word_tokenize(doc)[:
## ['I', 'am', 'Liao', 'Hua', ',', 'and', 'I', "'ve", 'had', 'to', 'become']
Note also that I’ve has been split into two tokens, I and ’ve, preserving the apostrophe.
A close cousin of the word tokenizer is the sentence tokenizer sent_tokenize, which splits the document into sentences:
nltk.sent_tokenize(doc)  # returns a list
## ["I am Liao Hua, and I've had to become a highwayman.", 'My five hundred men and I survive on robbery.']
Finally, we can also create our own tokenizer based on regular expressions. In this example we consider all “word characters” to be part of a token:
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokenizer.tokenize(doc)[:11]  # returns a list
## ['I', 'am', 'Liao', 'Hua', 'and', 'I', 've', 'had', 'to', 'become', 'a']
Note that unlike in the case of word_tokenize, this regular expression removes all non-word characters, including punctuation. I've is split into two tokens, I and ve, and the apostrophe is gone.
21.1.3 Stemming
Stemming is a simple method to make different grammatical forms of the same word more similar. In many contexts it is not useful to preserve different forms of the same word, such as “look” and “looked”. If we are interested in just the topic of the text, both of these words carry pretty much the same meaning, and different grammatical forms just burden the model with extra parameters and require more training. Stemming removes common prefixes and suffixes, transforming the word into its “root form”. However, stemming uses simple heuristics, and as a result it may get various words wrong.
Here is an example using PorterStemmer from nltk library:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
for token in tokenizer.tokenize(doc):
    print(ps.stem(token), end=" ")
## i am liao hua and i ve had to becom a highwayman my five hundr men and i surviv on robberi
The example shows several words being “standardized”. In particular, become has turned into becom and survive into surviv, as the stemmer removed the final e (the e that is also missing in forms like becoming and surviving).
21.1.4 Lemmatization
Lemmatization converts words into their dictionary form (the lemma). As lemmatization is based on dictionaries, we have to download the corresponding nltk resources first.
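For instance, the WordNet data used below can be downloaded as follows (a one-time step; the exact set of resources may depend on your nltk version):
## Download the WordNet data needed by the lemmatizer (needed only once)
nltk.download('wordnet')
nltk.download('omw-1.4')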
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
for token in tokenizer.tokenize(doc):
    print(lemmatizer.lemmatize(token))
This example converts the words into their dictionary form. Note that the case is preserved. By default, the lemmatizer treats all words as nouns (that is why “had” is preserved in its past tense). You can specify the word type with the pos argument:
"had", pos="v") # 'v' for verb lemmatizer.lemmatize(
Now the lemmatizer looks up the correct dictionary form “have”.
21.1.5 Ngrams
In human language, the meaning of a word depends very much on the context, the other words nearby. A popular option to preserve context is to look at ngrams, \(n\)-word ordered sequences. For instance, in the case of the sample sentence “Video from Chang’e lander” we can create the following three bigrams (2-grams): “video from”, “from chang’e”, “chang’e lander”. Below is an example of how to create ngrams using the nltk library:
## Create 3-grams, use the text example from above
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(doc)
trigrams = nltk.ngrams(tokens, 3)  # 3-grams, returns a generator object
[b for b in trigrams]
## [('I', 'am', 'Liao'), ('am', 'Liao', 'Hua'), ('Liao', 'Hua', 'and'), ('Hua', 'and', 'I'), ('and', 'I', 've'), ('I', 've', 'had'), ('ve', 'had', 'to'), ('had', 'to', 'become'), ('to', 'become', 'a'), ('become', 'a', 'highwayman'), ('a', 'highwayman', 'My'), ('highwayman', 'My', 'five'), ('My', 'five', 'hundred'), ('five', 'hundred', 'men'), ('hundred', 'men', 'and'), ('men', 'and', 'I'), ('and', 'I', 'survive'), ('I', 'survive', 'on'), ('survive', 'on', 'robbery')]
Ngrams can be used in language models instead of tokens. However, as there are many more possible ngrams than tokens, we easily run into the curse of dimensionality.
21.2 Converting text to numbers
After text cleaning and preparation, we are still not ready to apply ML techniques to the data. As statistical models want to work with numeric matrices, we have to convert the text into such a form.
21.2.1 Bag-of-words and Document-term-matrix
Bag of words (BOW) is essentially just a word frequency table. The simplest way to create such a table is to use CountVectorizer or TfidfVectorizer from sklearn.feature_extraction.text. These functions take a sequence of documents as input and output a BOW for each of the texts. All BOW-s are stacked on top of each other; the resulting matrix is called the Document-Term Matrix (DTM). The following example illustrates the steps needed. First, we create the documents and the vectorizer.
## Two documents
= """
doc1 Then he went home and told his mother: "My teacher, Ji Gong,
wants me to act as Wei Tuo tonight"
"""
= """
doc2 His mother asked, "What is acting as a Wei Tuo"?
"""
## Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
= CountVectorizer() vrizer
This sets up the vectorizer (the example uses CountVectorizer, but TfidfVectorizer usage is similar). CountVectorizer has a number of options: e.g., one can ask for a binary (contains/does not contain) BOW instead of word counts, specify the stopwords, and supply a custom tokenizer. See the documentation for more information.
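As a small sketch of such options (the stop-word list and the tokenizer below are illustrative choices, and vrizer_b is just a name used here):
## Binary BOW, English stop words, and the nltk RegexpTokenizer from above
vrizer_b = CountVectorizer(binary=True,
                           stop_words="english",
                           tokenizer=nltk.tokenize.RegexpTokenizer(r'\w+').tokenize)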
The next step is to “fit” the vectorizer and transform the text. This is somewhat similar to ML models in sklearn, except that here “fitting” primarily means building the vocabulary. Afterwards one can transform different texts using the same vocabulary:
_ = vrizer.fit([doc1, doc2])
X = vrizer.transform([doc1, doc2])
X  # a sparse matrix
Both fit and transform take a list of text documents. Instead of a list, other iterable collections, such as pd.Series, also work. There is also a .fit_transform method that performs both fitting and transforming on the same set of documents. This is a slightly easier option if we are working with a single set of documents only.
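For instance (a sketch; X2 is just an illustrative name), the fit and transform calls above can be collapsed into one:
## Fit the vocabulary and transform the same documents in one step
X2 = vrizer.fit_transform([doc1, doc2])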
As DTM-s are usually sparse, the result is a sparse matrix, not an ordinary dense matrix. It is a good idea to keep the DTM in sparse form if possible, as this may save a lot of memory and work much faster. Sparse matrices work for most of the functionality we need, but there are tasks where they fail and one has to convert them to dense arrays using the .toarray method. For instance, one cannot create data frames from sparse matrices. In other contexts sparse matrices behave differently, more like numpy matrices than numpy arrays; this happens, e.g., when multiplying sparse matrices. Such differences can create quite a bit of confusion, and one may want to convert sparse matrices to numpy arrays using .toarray.
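As a small illustration of the memory aspect (a sketch using the X created above and standard scipy/numpy attributes):
## Number of stored (non-zero) entries in the sparse DTM ...
X.nnz
## ... versus the number of entries in the equivalent dense array
X.toarray().size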
In order to understand the DTM, one may want to convert it into a data frame with column names equal to the corresponding tokens. One can retrieve the vocabulary using the .get_feature_names method and build the data frame from the DTM:
pd.DataFrame(X.toarray(), columns=vrizer.get_feature_names())
Here is an example where we need .toarray: one cannot directly convert a sparse matrix into a data frame, it must be converted to a dense array first. We also supply the vocabulary as column names.
Note that CountVectorizer contains a built-in preprocessor and tokenizer; in particular, the documents are split into words, converted to lower case, and punctuation is removed. All these steps can be adjusted, and instead of creating a DTM of token unigrams, CountVectorizer can also handle token n-grams, as sketched below.
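For instance (a sketch; vrizer12 and X12 are just illustrative names), the ngram_range argument makes the vectorizer count unigrams and bigrams together:
## Count both single tokens and token bigrams
vrizer12 = CountVectorizer(ngram_range=(1, 2))
X12 = vrizer12.fit_transform([doc1, doc2])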
The resulting DTM X can be used for modeling as any other design matrix.
21.2.2 Example: categorize text.
Let’s demonstrate the DTM on a simple example: we take poems by two different authors, and attribute an additional poem to one of the authors based on text similarity (we use \(k\)-NN with cosine distance).
The first poem (by Rudaki):
# Rudaki
= """
rudaki1 When you find me dead, my lips apart,
A shell empty of life, worn out by want,
Sit by my bedside and say, with charm:
“It is I who killed you, I regret it now.”
"""
# Rudaki
= """
rudaki2 You killed many, broke the enemy’s courage.
You gave so much, there isn’t one beggar left.
Many have lamb and sweets on their table,
Others, not enough bread to ease their hunger.
Take action. Don’t sit idle for too long,
Even though your sacks of gold reach the moon.
"""
The second set of poems is by Sunthorn Phu. We reserve the second stanza of his second poem to be the “unknown” document to be analyzed.
= """
phu1 I salute the Pagoda of the Holy Relics
May the true religion live forever.
I make merit, so the Buddha helps me
Increase my power to attain enlightenment.
And Id like my words, my book,
To preserve, till the end of time and heavens,
Sunthorn the scribe who belongs
To the King of the White Elephant
"""
# Nirat Phu Khao Thong, 1st stanza - Sunthorn Phu
= """
phu2 Near to, I could smell the King's scent,
Sweetly rend'ring the air at hand:
The King died, tasteless became the land
He died, and scentless my own fate.
"""
## 'unknown' document
# Nirat Phu Khao Thong, 2nd stanza - Sunthorn Phu
= """
newdoc In the Palace His ashes in an urn,
I in turn my merit dedicate
To Him, and the Majesty in state
For a great and glorious reign.
"""
We create the DTM as in the examples above, only now we have four documents instead of two. We also use fit_transform, as we do the fitting and transforming on the same data:
vrizer = CountVectorizer()
X = vrizer.fit_transform([rudaki1, rudaki2, phu1, phu2])
It is instructive to see how many distinct tokens the training documents contain:
X.shape[1]
Next, we transform the unknown document newdoc into a BOW:
newX = vrizer.transform([newdoc])
Now it is important to use transform, not fit_transform. The latter would build a new vocabulary, and hence the DTM columns for the known and unknown data would not match. We cannot easily compare documents that are transformed in different ways. The second note is that we create a list of a single document: CountVectorizer expects a list, even a list of one.
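To see why this matters (a sketch with a hypothetical second vectorizer vrizer_new): fitting a new vectorizer on the unknown document alone would build a much smaller vocabulary, and the resulting DTM would not line up with X.
## A separate vectorizer fitted on newdoc alone builds its own vocabulary
vrizer_new = CountVectorizer()
badX = vrizer_new.fit_transform([newdoc])
badX.shape[1] == newX.shape[1]  # False: the columns do not match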
Now we are ready to compute the similarity between the documents.
… TBD
We can compute the cosine similarities manually. The easiest way to compute cosine similarity is to first normalize the relevant vectors, and thereafter compute the similarity with a matrix product. We can normalize newX as
newX = newX.toarray()
newXn = newX/np.sqrt(newX @ newX.T)  # divide the row vector by its length
… TBD
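As a sketch of how the manual computation might be completed (Xn is a hypothetical name for the row-normalized training DTM; X and newXn are as above):
## Normalize each row of the training DTM to unit length
Xn = X.toarray()
Xn = Xn/np.sqrt((Xn**2).sum(axis=1, keepdims=True))
## Cosine similarity between each training document and the new document
Xn @ newXn.T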
We can also use sklearn’s pre-packaged nearest-neighbor model with the cosine metric. NearestNeighbors computes just the single nearest neighbor (as n_neighbors=1), and .kneighbors returns it together with the distance measure. We ask kneighbors to return the distance to all 4 documents:
from sklearn.neighbors import NearestNeighbors
m = NearestNeighbors(n_neighbors=1, metric="cosine")
_ = m.fit(X)
d = m.kneighbors(newX, n_neighbors=4)  # return distance for all 4 documents
d
As we can see, the closest document based on cosine distance is document 2, with distance 0.316. We got the poet right, though not the poem; still, both of Phu’s poems are closer than either of Rudaki’s. These short examples contain too little text for realistic precision.
Exercise 21.1
- Repeat the previous example using TfidfVectorizer instead of CountVectorizer.
See the solution