Chapter 8 String Operations

Much information is embedded in strings, either in the form of written human text or as a special form of data. Here we discuss some basic functionality to manipulate strings and access the information inside them.

We work with base-R functionality, in particular with functions grep, sub and strsplit, and give an overview of regular expressions.

8.1 The basic functions

8.1.1 grep

grep and its derivatives look for a pattern inside a vector of strings. grep takes at least two arguments: the pattern and the vector of strings to search. For instance, let’s find which of the names “Gong”, “Zhao” and “Lane” contain the letter “a”:

names <- c("Gong", "Zhao", "Lane")
i <- grep("a", names)
i
## [1] 2 3

The result, 2 3, tells us that the pattern was found in vector components 2 and 3, i.e. “Zhao” and “Lane”. grep has several variants and further options. For example, let us find the pattern “zh” while ignoring upper/lower case:

grep("zh", names, ignore.case=TRUE)
## [1] 2

Indeed, the result 2 indicates that “Zhao”, and only “Zhao”, contains the pattern “zh”. We can also ask grep to return the matching strings themselves, not just their indices, by specifying the argument value=TRUE:

grep("n", names, value=TRUE)
## [1] "Gong" "Lane"

returns “Gong” and “Lane”, not 1 and 3.

The pattern in grep is normally a regular expression, see below.
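
One of the derivatives mentioned above is grepl, which returns a logical vector instead of indices. The following is a minimal sketch using the same names; the variable has_a is introduced here only for illustration:

```r
names <- c("Gong", "Zhao", "Lane")
## grepl returns TRUE/FALSE for every component instead of the matching indices
has_a <- grepl("a", names)
has_a
## [1] FALSE  TRUE  TRUE
## the logical vector can be used directly for subsetting
names[has_a]
## [1] "Zhao" "Lane"
```

A logical result is often handier than indices when the match is used as a filter condition, e.g. for selecting rows of a data frame.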

8.1.2 sub and gsub

sub and gsub replace patterns in strings. Their syntax is sub(pattern, replacement, string). sub replaces only the first occurrence of the pattern, while gsub (global sub) replaces all occurrences. Here “first” refers to the first occurrence of the pattern inside each component of the string vector, not to the first component of the vector. For instance:

text <- c("Kalakala", "Martinique")
sub("a", "_", text)
## [1] "K_lakala"   "M_rtinique"
gsub("a", "_", text)
## [1] "K_l_k_l_"   "M_rtinique"

sub only replaces the first “a” in “Kalakala”, and the first (and only) “a” in “Martinique”. However, gsub replaces all “a”-s.

gsub is frequently used to remove certain letters or patterns from a string. This can be achieved by replacing them with the empty string, "":

gsub("a", "", text)
## [1] "Klkl"      "Mrtinique"

The pattern in sub is normally a regular expression, see below.
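
Because the pattern is treated as a regular expression, some characters have a special meaning. A small sketch of what this implies in practice: in a regular expression the period matches any single character, so removing literal periods requires either the argument fixed=TRUE or a character list. The example strings and variable names below are made up for illustration:

```r
x <- c("3.14", "2.71")
## as a regular expression, "." matches every character, so everything is removed
as_regex <- gsub(".", "", x)
as_regex
## [1] "" ""
## fixed = TRUE treats the pattern as a plain string
as_fixed <- gsub(".", "", x, fixed = TRUE)
as_fixed
## [1] "314" "271"
```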

8.1.3 strsplit: tokenizing text

strsplit can be used to split a string into parts, e.g. to split a sentence into words. It takes two (or more) arguments, the string vector, and the splitting pattern:

sentences <- c("The watchman nodded.", "The monk led five men to a big gate.")
strsplit(sentences, " ")
## [[1]]
## [1] "The"      "watchman" "nodded." 
## 
## [[2]]
## [1] "The"   "monk"  "led"   "five"  "men"   "to"    "a"     "big"   "gate."

This splits both sentences into words (tokens) by using a space as the separator. The result is a list where each component is a character vector, the vector of words for this particular sentence.

Often we just want to split a single sentence. In that case we can extract the first list component with [[1]]:

sentence <- "If we catch the jewel thief now, we will find him in this house"
strsplit(sentence, " ")[[1]]
##  [1] "If"    "we"    "catch" "the"   "jewel" "thief" "now,"  "we"    "will" 
## [10] "find"  "him"   "in"    "this"  "house"

As in the case of sub and grep, the split pattern is normally a regular expression, see below.
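
Since strsplit returns a list with one character vector per input string, list-oriented functions can be applied to the result. As a brief sketch, the base-R function lengths counts the tokens in each sentence; the variable names tokens and n_words are chosen here for illustration:

```r
sentences <- c("The watchman nodded.", "The monk led five men to a big gate.")
## one character vector of tokens per sentence
tokens <- strsplit(sentences, " ")
## number of tokens in each list component
n_words <- lengths(tokens)
n_words
## [1] 3 9
```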

8.1.4 Using the string functions with pipes

The base-R string functions are not designed to work with magrittr pipes, so they work in a somewhat less elegant fashion. Remember: the pipe %>% normally feeds the output of the previous expression into the first argument of the following function. However, if it must be fed into another place, one can use the period placeholder (.). Let us rewrite the previous examples using pipes:

library(magrittr)  # provides the %>% pipe
c("Gong", "Zhao", "Lane") %>%
   grep("a", .)
## [1] 2 3

As grep expects the string to be the second argument, we put the period placeholder for the pipe’s content in that place. We can work with gsub in a similar fashion:

c("Kalakala", "Martinique") %>%
   gsub("a", "_", .)
## [1] "K_l_k_l_"   "M_rtinique"

Check out the stringr package, which is designed for string operations with pipes in mind.

strsplit is easier in the sense that it already takes the string as its first argument. Extracting an element from the resulting list can be done using the function "[["():

"If we catch the jewel thief now, we will find him in this house" %>%
   strsplit(" ") %>%
   "[["(1)
##  [1] "If"    "we"    "catch" "the"   "jewel" "thief" "now,"  "we"    "will" 
## [10] "find"  "him"   "in"    "this"  "house"

"[[" is the extractor function. As its name contains special characters, it must be written in quoted form. 1 is just its argument, telling it to extract the first element.

Check out a more elegant way to extract list components using extract2 in the magrittr package.

8.2 Regular expressions

Regular expressions (regexps) are a language of patterns that can be used to find and compare text. They are somewhat similar to basic string matching, i.e. searching for certain patterns in strings, but offer tremendously more possibilities. All the functions mentioned in the introduction, grep, sub and strsplit, can use regular expressions. Here we only touch on the basics of regexps; read the help page for “regexp” for more information.

8.2.1 A few basic classes of regexps

Regular expressions are like strings where certain symbols are interpreted in a special way. Here is a list of a few common special symbols:

8.2.1.1 List of characters [...]

One can match any single character from a list by enclosing the list in square brackets []. For instance, let’s remove all “a”-s and “l”-s from “Kalakala”:

gsub("[al]", "", "Kalakala")
## [1] "Kk"

[al] means any letter “a” or “l”. When removing those, we are left with only k-s.
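
Character lists work the same way in grep. The short sketch below reuses the names from the grep section to find the names containing a capital “Z” or “L”, and to strip a few vowels; the variable names are introduced for illustration only:

```r
names <- c("Gong", "Zhao", "Lane")
## names that contain either "Z" or "L"
zl_names <- grep("[ZL]", names, value = TRUE)
zl_names
## [1] "Zhao" "Lane"
## remove every "a", "e" and "o"
no_vowels <- gsub("[aeo]", "", names)
no_vowels
## [1] "Gng" "Zh"  "Ln"
```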

As another example, we may want to tokenize the sentence not just at a space but also at commas:

"If we catch the jewel thief now, we will find him in this house" %>%
   strsplit("[ ,]") %>%
   "[["(1)
##  [1] "If"    "we"    "catch" "the"   "jewel" "thief" "now"   ""      "we"   
## [10] "will"  "find"  "him"   "in"    "this"  "house"

This removes the comma after “now” (it is now considered a splitting symbol). However, as a downside we got an empty string after “now” instead. This is because strsplit sees two split symbols, comma and space, next to each other, and hence there is just an empty string in between. See repeated patterns below.

8.2.1.2 Repeated patterns

One can repeat the previous pattern one or more times with +. So a+ will match “a”, “aa”, “aaa” and so on. This also applies to the list of characters, e.g. [ab]+ means any character, “a” or “b”, one or more times. So [ab]+ will match “a”, “b”, “ab”, “ba”, “bab”, “aaabaa” etc.

Such pattern matching opens up a way to split a sentence not just on a space or comma, but on a sequence of spaces and commas of any length:

"If we catch the jewel thief now, we will find him in this house" %>%
   strsplit("[ ,]+") %>%
   "[["(1)
##  [1] "If"    "we"    "catch" "the"   "jewel" "thief" "now"   "we"    "will" 
## [10] "find"  "him"   "in"    "this"  "house"

Now the matching patterns are not just spaces or commas, but any kind of combinations of these characters. As a result ", " is considered a single separator, and the empty string after “now” is gone.
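
The + operator is equally useful with gsub. As a small sketch, a run of spaces of any length can be collapsed into a single space; the example string is made up for illustration:

```r
messy <- "too   many    spaces"
## " +" matches one or more consecutive spaces; each run is replaced by one space
clean <- gsub(" +", " ", messy)
clean
## [1] "too many spaces"
```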

8.3 Examples

This basic functionality can be combined in various ways. Here we give an example of how to find the context of a word.

8.3.1 Find word context

Consider the sentence “Tonight, between third and fifth watch, I intend to catch the robber”. Let’s find the context for the word “watch”. We look at a single-word context: one word preceding and one word following the word “watch”:

## tokenize the sentence
words <- "Tonight, between third and fifth watch, I intend to catch the robber" %>%
   strsplit("[, ]+") %>%
   "[["(1)
## find the location of word 'watch'
pos <- which(words == "watch")
## print the context
words[pos-1]; words[pos+1]
## [1] "fifth"
## [1] "I"

Indeed, “watch” is located between the words “fifth” and “I”.
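
The steps above can be wrapped into a small function. This is only a sketch: the function name word_context and its arguments are hypothetical, and returning NA when the word sits at the very beginning or end of the sentence is an assumption made here, not something prescribed above.

```r
## hypothetical helper: returns the preceding and following word
## for every occurrence of 'target' in 'sentence'
word_context <- function(sentence, target) {
   ## tokenize at any run of commas and spaces
   words <- strsplit(sentence, "[, ]+")[[1]]
   pos <- which(words == target)
   lapply(pos, function(p) {
      c(before = if (p > 1) words[p - 1] else NA_character_,
        after = if (p < length(words)) words[p + 1] else NA_character_)
   })
}

sentence <- "Tonight, between third and fifth watch, I intend to catch the robber"
ctx <- word_context(sentence, "watch")
ctx[[1]]
```

The result is a list with one component per occurrence of the word, so a word that appears several times yields several contexts, and a word that does not appear yields an empty list.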