Chapter 8 String Operations
Much information is embedded in strings, either in the form of written human text or as a special form of data. Here we discuss some basic functionality for manipulating strings and accessing the information inside them.
We work with base-R functionality, in particular with the functions
grep, sub and strsplit, and give an overview of regular expressions.
8.1 The basic functions
8.1.1 grep
grep and its derivatives look for a pattern inside a vector of
strings. It takes at least two arguments: the pattern and the string
vector. For instance, let’s find which of the names “Gong”, “Zhao” and
“Lane” contain the letter “a”:
names <- c("Gong", "Zhao", "Lane")
i <- grep("a", names)
i
## [1] 2 3
The result, 2 3, tells us that the matches are vector components 2
and 3, i.e. “Zhao” and “Lane”. grep has different forms and more options.
E.g. let us find the pattern “zh” while disregarding upper and lower case:
grep("zh", names, ignore.case=TRUE)
## [1] 2
Indeed, the result 2 indicates that “Zhao”, and only “Zhao”, contains
the pattern “zh”. We can also ask grep to return the matching strings
themselves, not just their indices, by setting the argument value to TRUE:
grep("n", names, value=TRUE)
## [1] "Gong" "Lane"
This returns “Gong” and “Lane”, not 1 and 3.
The pattern in grep is normally a regular expression, see below.
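One of the derivatives mentioned above is grepl, which returns a logical vector instead of indices. A minimal sketch, reusing the names from the examples above (the variable has_a is introduced here only for illustration):

```r
names <- c("Gong", "Zhao", "Lane")
# grepl() returns TRUE/FALSE for every element instead of the matching indices
has_a <- grepl("a", names)
has_a
# a logical vector can be used directly for subsetting
names[has_a]
```

A logical result of this kind tends to be convenient when the matches are used to filter a vector or the rows of a data frame.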
8.1.2 sub and gsub
sub and gsub replace patterns in strings. Their syntax is
sub(pattern, replacement, string). sub only replaces the first
occurrence of the pattern, while gsub (global sub) replaces all
occurrences. Here “first” refers to the first occurrence of the pattern
inside each component of the string vector, not to the first component
of the vector. For instance:
text <- c("Kalakala", "Martinique")
sub("a", "_", text)
## [1] "K_lakala" "M_rtinique"
gsub("a", "_", text)
## [1] "K_l_k_l_" "M_rtinique"
sub only replaces the first “a” in “Kalakala”, and the first (and
only) “a” in “Martinique”. gsub, however, replaces all “a”-s.
gsub is frequently used to remove certain letters or patterns from
a string. This can be achieved by replacing them with the empty string, "":
gsub("a", "", text)
## [1] "Klkl" "Mrtinique"
The pattern in sub and gsub is normally a regular expression, see below.
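Because the pattern is read as a regular expression, some symbols carry a special meaning; the period, for instance, matches any single character. When the pattern should be taken literally, the fixed argument of base R's gsub may help. A small sketch, where the string ver is made up for illustration:

```r
ver <- "1.2.3"
# as a regular expression, "." matches every character, so all are replaced
as_regexp <- gsub(".", "_", ver)
as_regexp
# with fixed = TRUE the pattern is a literal period
as_literal <- gsub(".", "_", ver, fixed = TRUE)
as_literal
```

The first call replaces all five characters, the second only the two periods.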
8.1.3 strsplit: tokenizing text
strsplit can be used to split a string into parts, e.g. to split a
sentence into words. It takes two (or more) arguments: the string
vector and the splitting pattern:
sentences <- c("The watchman nodded.", "The monk led five men to a big gate.")
strsplit(sentences, " ")
## [[1]]
## [1] "The" "watchman" "nodded."
##
## [[2]]
## [1] "The" "monk" "led" "five" "men" "to" "a" "big" "gate."
This splits both sentences into words (tokens) by using a space as the separator. The result is a list where each component is a character vector, the vector of words for this particular sentence.
Often we just want to split a single sentence. In that case we can
extract the first list component with [[1]]:
sentence <- "If we catch the jewel thief now, we will find him in this house"
strsplit(sentence, " ")[[1]]
## [1] "If" "we" "catch" "the" "jewel" "thief" "now," "we" "will"
## [10] "find" "him" "in" "this" "house"
As in the case of sub and grep, the split pattern is normally a regular
expression, see below.
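Since strsplit returns a list with one character vector per input string, the usual list tools apply to its result. A brief sketch using base R's lengths and vapply (the variable names tokens, n_words and first_words are chosen here for illustration):

```r
sentences <- c("The watchman nodded.", "The monk led five men to a big gate.")
tokens <- strsplit(sentences, " ")
# number of words in each sentence
n_words <- lengths(tokens)
n_words
# first word of each sentence; vapply() walks over the list components
first_words <- vapply(tokens, function(w) w[1], character(1))
first_words
```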
8.1.4 Using the string functions with pipes
The base-R string functions are not designed to work with magrittr
pipes, so they work in a somewhat less elegant fashion. Remember: the
pipe %>% normally feeds the output of the previous expression into the
first argument of the following function. However, if it must be fed
into another position, one can use the period . as a placeholder. Let us
rewrite the previous examples using pipes:
c("Gong", "Zhao", "Lane") %>%
grep("a", .)
## [1] 2 3
As grep expects the string to be the second argument, we put the
period placeholder for the pipe’s content in that place. We can work
with gsub in a similar fashion:
c("Kalakala", "Martinique") %>%
gsub("a", "_", .)
## [1] "K_l_k_l_" "M_rtinique"
Check out the stringr package, which is designed for string operations with pipes in mind.
strsplit is an easy case in the sense that it already takes the string
as its first argument. Extracting an element from the resulting list can
be done using the function "[["():
"If we catch the jewel thief now, we will find him in this house" %>%
   strsplit(" ") %>%
   "[["(1)
## [1] "If" "we" "catch" "the" "jewel" "thief" "now," "we" "will"
## [10] "find" "him" "in" "this" "house"
"[[" is the extractor function. As its name consists of special
characters, it must be written in quoted form. 1 is just its argument,
telling it to extract the first element.
Check out a more elegant way to extract list components using
extract2 in the magrittr package.
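As a sketch of that alternative, the same pipeline can be written with magrittr's extract2, which is an alias for "[[". This assumes the magrittr package is installed; the variable words is only for illustration:

```r
library(magrittr)

words <- "If we catch the jewel thief now, we will find him in this house" %>%
   strsplit(" ") %>%
   extract2(1)   # does the same as "[["(1)
words
```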
8.2 Regular expressions
Regular expressions (regexps) are a set of patterns that can be used to find and
compare text. Regular expressions are somewhat similar to basic
string matching, searching for certain patterns in strings, but with
tremendously more possibilities.
All the functions mentioned in the introduction, grep, sub and
strsplit, can use regular expressions.
Here we only touch on the basics of regexps; read the help page for
“regexp” for more information.
8.2.1 A few basic classes of regexps
Regular expressions are like strings where certain symbols are interpreted in a special way. Here is a list of a few common special symbols:
8.2.1.1 List of characters [...]
One can match any single character from a list by enclosing the list
in square brackets []. For instance, let’s remove all “a”-s and “l”-s
from “Kalakala”:
gsub("[al]", "", "Kalakala")
## [1] "Kk"
[al] means either of the letters “a” or “l”. When removing those,
we are left with only the k-s.
As another example, we may want to tokenize the sentence not just at a space but also at commas:
"If we catch the jewel thief now, we will find him in this house" %>%
strsplit("[ ,]") %>%
"[["(1)
## [1] "If" "we" "catch" "the" "jewel" "thief" "now" "" "we"
## [10] "will" "find" "him" "in" "this" "house"
This removes the comma after “now” (it is now considered a splitting
symbol). However, as a downside we get an empty string after “now”
instead. This is because strsplit sees two split symbols, a comma and
a space, next to each other, and hence finds just an empty string
in between them. See repeated patterns below.
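Independently of the regular expression, empty tokens can also be dropped after splitting. A minimal sketch using base R's nzchar, which is TRUE for non-empty strings (the variable names are chosen for illustration):

```r
sentence <- "If we catch the jewel thief now, we will find him in this house"
tokens <- strsplit(sentence, "[ ,]")[[1]]
# keep only the non-empty tokens
tokens <- tokens[nzchar(tokens)]
tokens
```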
8.2.1.2 Repeated patterns
One can repeat the previous pattern one or more times with +. So
a+ will match “a”, “aa”, “aaa” and so on. This also applies to the
list of characters, e.g. [ab]+ means any character, “a” or “b”, one
or more times. So [ab]+ will match “a”, “b”, “ab”, “ba”, “bab”,
“aaabaa” etc.
Such pattern matching opens up a way to split a sentence not just on a single space or comma, but on a sequence of spaces and commas of any length:
"If we catch the jewel thief now, we will find him in this house" %>%
strsplit("[ ,]+") %>%
"[["(1)
## [1] "If" "we" "catch" "the" "jewel" "thief" "now" "we" "will"
## [10] "find" "him" "in" "this" "house"
Now the matching patterns are not just single spaces or commas, but any
combination of these characters. As a result ", " is considered a
single separator, and the empty string after “now” is gone.
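The same repetition symbol is handy with gsub, for instance to collapse runs of spaces into a single space. A small sketch, where the string messy is made up for illustration:

```r
messy <- "too   many    spaces"
# " +" matches a run of one or more spaces; each run becomes a single space
tidy <- gsub(" +", " ", messy)
tidy
```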
8.3 Examples
This basic functionality can be combined in various ways. Here we give an example of how to find the context of a word.
8.3.1 Find word context
Consider a sentence “Tonight, between third and fifth watch, I intend to catch the robber”. Let’s find the context for the word “watch”. We look at a single-word context, one word preceding and one word following the word “watch”:
## tokenize the sentence
words <- "Tonight, between third and fifth watch, I intend to catch the robber" %>%
   strsplit("[, ]+") %>%
   "[["(1)
## find the location of word 'watch'
pos <- which(words == "watch")
## print the context
words[pos-1]; words[pos+1]
## [1] "fifth"
## [1] "I"
Indeed, “watch” is located between the words “fifth” and “I”.
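The steps above could be wrapped into a small helper function. The sketch below is only one possible way to do it: the name word_context and its return format are assumptions made for illustration, and the token vector is padded with NA so that a word at the very beginning or end of the sentence gets a missing neighbour instead of an error.

```r
# hypothetical helper, not part of base R
word_context <- function(sentence, word) {
   # tokenize on runs of commas and spaces
   words <- strsplit(sentence, "[, ]+")[[1]]
   # positions of all occurrences of the word
   pos <- which(words == word)
   # pad with NA on both sides: token i now sits at position i + 1
   padded <- c(NA, words, NA)
   data.frame(before = padded[pos], after = padded[pos + 2],
              stringsAsFactors = FALSE)
}

sentence <- "Tonight, between third and fifth watch, I intend to catch the robber"
ctx <- word_context(sentence, "watch")
ctx
```

The result has one row per occurrence of the word, so a word that appears several times yields several contexts.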