Chapter 7 Working with strings

This section discusses the dedicated functionality for working with strings. It focuses on the base-R functions like paste() and grep(), and afterwards discusses the dedicated stringr library.

7.1 What is the basic string functionality

Character strings are one of the basic data types in most programming languages. They open up completely different options to work with data–you can work with words and text–but you cannot easily use strings for mathematics.

The operations we typically want to do with character values are

compare strings: are two strings equal
somewhat relatedly, we may want to order strings. Typically, we want to put them in an alphabetic order, but sometimes we may prefer a different order.
Often we need to combine strings to make new (and longer) strings, or extract parts of a string (substrings).
Another common task is to find patterns in strings. For instance, out of the list of all courses, you may want to find all informatics courses that have the prefix “INFO”. Or alternatively, you may need to detect a more complex pattern, e.g. all 200-level courses.
Sometimes we need not just to find but to replace patterns. For instance, when working with the U.S. county data, you may notice that in Louisiana, counties are called “parish”. You may want to replace “parish” with “county”, just to make your data more homogeneous. Or maybe you want to remove that word completely by replacing it with an empty string–after all we know that our data is about counties.

Strings are often used for other tasks too, such as storing long numbers. As character strings can be of arbitrary length, you can store arbitrarily large or arbitrarily precise numbers in strings.
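As a minimal sketch of this point, the snippet below stores a 30-digit number as a string and compares it with the result of converting it to an ordinary number. The value and the variable names big and asNumber are only illustrative. The string keeps every digit, while an ordinary numeric value holds roughly 15 to 17 significant digits:

```r
# a 30-digit number stored as a character string (illustrative value)
big <- "123456789012345678901234567890"
nchar(big)   # number of characters, i.e. digits, in the string

# converting to an ordinary number loses the trailing digits
asNumber <- as.numeric(big)
format(asNumber, scientific = FALSE) == big   # the digits no longer agree
```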

Strings also open a way for doing natural language processing (NLP) with text. Those tasks typically involve loading a whole text, such as an email, Amazon review, or a resume into a character string, and thereafter analyzing it to determine whether it is spam, whether the user likes the product and whether the applicant has the relevant skills. We do not discuss NLP in this book.

Below, we’ll give an overview of the main functionality in base-R (Section 7.2) and thereafter in the stringr package (Section 7.3).

7.2 Base-R string functions

Base-R includes the basic string operators, including the comparison operators, that form the basis for most other string functionality; functions for pattern detection and replacement, such as grep() and gsub(); and powerful regular expressions (see Section 7.4).

7.2.1 Comparing strings

Strings can be compared with the same operators as numbers: == to test for equality and != to test for inequality. For instance

"a" == "a"

## [1] TRUE

"aa" == "a"

## [1] FALSE

"b" != "a"

## [1] TRUE

Obviously, R does not do translation:

"蘭花" == "orchid"

## [1] FALSE

You can include computations on both sides:

"aa" == paste("a", "a", sep = "")

## [1] TRUE
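Two more properties are worth keeping in mind, sketched below with made-up values: the comparison is case-sensitive, and it is vectorized, so a whole vector can be compared against a single string at once. tolower() is a base-R function that converts strings to lower case.

```r
"a" == "A"            # FALSE: upper and lower case differ
tolower("A") == "a"   # TRUE: compare after converting to lower case

# comparing a vector with a single string works element-by-element
animals <- c("cat", "dog", "cat")   # illustrative data
isCat <- animals == "cat"
isCat                               # TRUE FALSE TRUE
```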

7.2.2 Ordering strings

String ordering is based on alphabetical order. For instance, “b” is “larger” than “a”, because it follows “a” in the alphabet:

"b" > "a"

## [1] TRUE

But computers represent all symbols using internal numeric codes, not just the letters of English alphabet. So you can ask which one is “larger” about all sorts of symbols:

"{" > "&"

## [1] FALSE

"💀" > "😍"

## [1] FALSE

"蘭" > "ቄ"

## [1] TRUE

The fact that strings are alphabetically ordered is sometimes funny:

"mouse" > "elephant"

## [1] TRUE

But other times it may be problematic. Imagine you are rendering a video, frame-by-frame, and you label your frames “frame1.png”, “frame2.png”, … “frame10.png”, “frame11.png” and so on. What is the alphabetic order, the natural order of the frame names? Maybe somewhat unexpectedly, it is

frame1.png
frame10.png
frame11.png
frame2.png

This is because “2” follows “1” in alphabetic order (or more precisely, in ASCII order), and hence all names that start with “frame1”, including “frame10” and “frame11”, precede “frame2”. Never mind that this does not make sense if you think of these as numbers. The easiest solution in such a case is to call your first frames not “frame1.png” and “frame2.png” but “frame01.png” and “frame02.png”. This is typically much simpler than teaching the computer to use a custom ordering mechanism for your files…
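The following sketch reproduces the problem and the zero-padding fix. It makes two assumptions that are not part of the text above: it uses sort() with method = "radix", which orders strings by their character codes independently of the computer's locale, and it builds the padded names with sprintf(). The variable names are illustrative.

```r
frames <- c("frame1.png", "frame2.png", "frame10.png", "frame11.png")
# method = "radix" sorts by character codes, regardless of the locale
sortedFrames <- sort(frames, method = "radix")
sortedFrames   # "frame2.png" ends up last

# zero-padded names: %02d pads the number to two digits with zeros
padded <- sprintf("frame%02d.png", c(1L, 2L, 10L, 11L))
padded
sortedPadded <- sort(padded, method = "radix")
sortedPadded   # now the string order agrees with the numeric order
```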

7.2.3 Combining strings

Combining strings means attaching strings together into longer strings. The most important base-R function here is paste(). In its simplest form it just attaches a few strings together:

paste("Shah", "Soleiman")

## [1] "Shah Soleiman"

This results in a space between “Shah” and “Soleiman”, which may or may not be what you want. You can adjust it with an extra argument sep = as

paste("Shah", "Soleiman", sep = " 👑 ")

## [1] "Shah 👑 Soleiman"

There is also a handy shortcut, paste0() for paste(..., sep = ""), joining strings with no space in-between.

Paste is vectorized–it joins two string vectors, component-by-component. So you can do

paste(c("Shah", "Shahanshah"),
      c("Soleiman", "Ardashir"))

## [1] "Shah Soleiman"       "Shahanshah Ardashir"

This results in a vector of length two, “Shah Soleiman” and “Shahanshah Ardashir”. But sometimes we want the result to be not a vector of length two, but a single string. This can be achieved with the collapse = argument. The latter concatenates the two (or more) components of the joined vector (here “Shah Soleiman” and “Shahanshah Ardashir”) into a single string:

paste(c("Shah", "Shahanshah"),
      c("Soleiman", "Ardashir"),
      collapse = " and ")

## [1] "Shah Soleiman and Shahanshah Ardashir"
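The arguments sep and collapse can also be used together, and collapse works on a single vector too. A brief sketch with made-up values (the variable names pairs and courseList are illustrative):

```r
# sep goes between the paired elements, collapse between the resulting pairs
pairs <- paste(c("1", "2", "3"), c("x", "y", "z"),
               sep = "-", collapse = " + ")
pairs        # "1-x + 2-y + 3-z"

# with a single vector, collapse alone turns it into one string
courseList <- paste(c("math126", "soc102", "info200"), collapse = ", ")
courseList   # "math126, soc102, info200"
```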

Exercise 7.1 Take a vector of titles (king, shahanshah, shah) and a vector of names (Darius, Ardashir, Soleiman). Use paste() to combine these together into a single string

king Darius, shahanshah Ardashir and shah Soleiman

Hint: you need to use paste() twice. You may also check out str_flatten_comma() in the stringr package.

The solution

7.2.4 Finding patterns in strings

A common task is to find patterns in strings. Consider a list of courses–math126, soc102, info200, info201, bio220, info180. You want to find all informatics courses, those that start with “info”. This can be achieved with grep(). In its simplest form, it takes two arguments: pattern, the pattern to look for, and the string vector where to look for the pattern:

courses <- c("math126", "soc102", "info200",
             "info201", "bio220", "info180")
grep("info", courses)

## [1] 3 4 6

This results in a vector of indices–which elements in the original string vector contain the pattern “info”. If you’d like to see the actual elements, instead of their location, you can supply value = TRUE as an additional argument:

grep("info", courses, value = TRUE)

## [1] "info200" "info201" "info180"

And should you prefer logical values, maybe just to check whether a single component contains the pattern in an if()-statement (see Section 10), then you can use grepl() (grep logical):

grepl("info", courses)

## [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

This results in a vector of trues and falses, depending on whether each component contains the pattern.
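As a small sketch of the if()-statement use mentioned above, the snippet below checks a single course code. The variables course and department are made up for this illustration, and if/else itself is only explained later in the book:

```r
course <- "info201"   # a single, illustrative course code
if (grepl("info", course)) {
  # grepl() returned TRUE: the code contains "info"
  department <- "informatics"
} else {
  department <- "other"
}
department
```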

Be aware though that by default, the pattern is a regular expression (see Section 7.4). This may lead to surprising results if you are unfamiliar with regexps. For instance,

grep(".", c("a.b", "ab", ".", "x"))

## [1] 1 2 3 4

tells that all four elements of the vector contain the period! This is because the period, ., means any character in regexps. See Section 7.4 for more details. There are two solutions to this problem: the better (but much more complicated) one is to learn regexps! The other is to supply the argument fixed = TRUE, which tells grep() that the pattern is just a normal string, not a regexp:

grep(".", c("a.b", "ab", ".", "x"),
     fixed = TRUE)

## [1] 1 3

Only the first and the third element contain the period.

Exercise 7.2 Consider a list of web addresses: www.urban.org, file:///home/otoomet/, https://faculty.washington.edu/, http://www.example.com/, https://www.index.ie, http://tartu.edu.

Use a version of grep() that finds all websites that use the secure http protocol, i.e. they start with “https:”. Print not the indices of these sites, but actual web addresses.

The solution

7.2.5 Replacing patterns in strings

Replacing patterns in strings is fairly similar to finding them, except that you need to supply two strings: the pattern that will be replaced, and the replacement that it is replaced with. Base-R has two related functions: sub() and gsub(). sub() replaces only the first occurrence, gsub() replaces all occurrences. Both of these have at least three main arguments: the pattern, its replacement, and the string where to replace the patterns. For instance, let’s replace the first “e” with “o” in These bygone years:

phrase <- "These bygone years"
sub("e", "o", phrase)

## [1] "Those bygone years"

But if you want to replace all e-s, you need to use gsub():

gsub("e", "o", phrase)

## [1] "Thoso bygono yoars"

As was the case with grep(), the pattern is treated as a regular expression (see Section 7.4), and the replacement may contain a few special sequences as well. If you want to use ordinary strings, use an extra argument fixed = TRUE. This ensures that the strings are not treated as regexps.

sub() is vectorized in a similar fashion as grep(), so you can replace the same pattern in a whole string vector.
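As an illustration of the vectorized behavior, consider the parish example from Section 7.1. The county names below are just sample values, and the variable names are made up; fixed = TRUE tells sub() to treat “Parish” as an ordinary string:

```r
counties <- c("Acadia Parish", "Allen Parish", "King County")

# replace the word in every element at once
renamed <- sub("Parish", "County", counties, fixed = TRUE)
renamed    # "Acadia County" "Allen County" "King County"

# or remove it (and the preceding space) by replacing with an empty string
stripped <- sub(" Parish", "", counties, fixed = TRUE)
stripped   # "Acadia" "Allen" "King County"
```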

Exercise 7.3 Consider a vector of vessels: steamboat, sailboat, motorboat, river boat. Replace “boat” with “ship” in all these words using sub(). Do it as a single operation on the whole vector, not for each word separately!

The solution

7.3 String functions in stringr

Base R contains many string-related functions; the most popular ones include paste, match, grep and sub. The stringr package provides wider functionality with more consistent usage. Here we describe a few useful functions for working with strings in the package. As a reminder, you load the package with

library(stringr)

assuming you have already installed it (see Section 5.6).

7.3.1 Searching patterns in strings

One of the common tasks is to find strings that match a pattern. For instance, imagine email subjects “new password”, “new colleague”, “URGENT! Your paycheck!”, “from HR”, “please change your passwords!”, “Passwords are expired”. How can we find messages that are related to passwords?

In base-R this can be achieved with grep() but here we focus on stringr functionality. First, let’s create a vector of email subjects:

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

str_detect is a function to find which strings contain a regular expression (regexp, see Section 7.4 below) pattern. It returns a logical vector, telling which vector elements contain the pattern:

str_detect(subjects, "password")

## [1]  TRUE FALSE FALSE FALSE  TRUE FALSE

As you can see, the first and the fifth element contain the phrase, the other elements do not. You may be surprised that the last one, “Passwords are expired” is marked as not containing the pattern. This is because the pattern matching is case-sensitive, and hence Password and password are different things.

Below are a few examples of how to adapt the results for different needs using the basic tools. If you are not interested in whether any particular element contains the pattern, but want the matching elements themselves, you can use the result of str_detect to extract them:

subjects[str_detect(subjects, "password")]

## [1] "new password"                  "please change your passwords!"

(But rather check out str_subset().)

If you want to match in a case-insensitive way, then one option is to force the strings first into lower case, and then search the patterns:

lSubjects <- tolower(subjects)  # force to lower case
i <- str_detect(lSubjects, "password")  # search in lower case strings
subjects[i]  # but 'subjects' are still in the original case

## [1] "new password"                  "please change your passwords!"
## [3] "Passwords are expired"

For clarity, we do it here in three steps: first, convert strings to lower case (and store to a temporary variable lSubjects). Second, find the pattern in the lower-case version of the subjects. And third, print the corresponding email subjects in the original case.

However, it may be easier to use dedicated functions and modifiers in stringr, here str_subset() for finding matching elements and the modifiers fixed()/regex() to ask for case-insensitive results:

str_subset(subjects, fixed("password", ignore_case=TRUE))

## [1] "new password"                  "please change your passwords!"
## [3] "Passwords are expired"

Here str_subset() provides not just the logical values telling whether the elements contain the pattern, but the matching elements themselves. fixed(..., ignore_case=TRUE) means that the pattern should not be treated as a regexp but as an ordinary string, and that we ignore case here. The code is clearer and simpler, but there is an upfront cost of learning even more library functions.
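If the pattern should remain a regular expression but still ignore case, stringr also offers the regex() modifier, which accepts the same ignore_case argument. A brief sketch, repeating the subjects vector so that the example is self-contained (the variable name matched is illustrative):

```r
library(stringr)

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

# regex() keeps the pattern a regexp while ignoring case
matched <- str_subset(subjects, regex("password", ignore_case = TRUE))
matched
```

For a plain word like “password” the two modifiers give the same result; regex() matters once the pattern contains special characters.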

7.3.2 Replacing patterns in strings

Another common task is to replace certain patterns in strings. For instance, we may want to change the word “password” to “access token” in the subjects above. This can be done with str_replace() (base-R equivalent is sub):

str_replace(subjects, "password", "access token")

## [1] "new access token"                  "new colleague"                    
## [3] "URGENT! Your paycheck!"            "from HR"                          
## [5] "please change your access tokens!" "Passwords are expired"

It takes three arguments: the vector of strings, the pattern to replace, and finally the new string to replace the pattern with. By default, the pattern is treated as a regexp; the replacement is an ordinary string, although it may contain a few special sequences.

As in the case of str_detect(), the default options are case-sensitive, so “Password” in the last element is not replaced. We can ask for non-case-sensitive fixed patterns in a similar fashion as for that function:

str_replace(subjects, fixed("password", ignore_case=TRUE), "access token")

## [1] "new access token"                  "new colleague"                    
## [3] "URGENT! Your paycheck!"            "from HR"                          
## [5] "please change your access tokens!" "access tokens are expired"

str_replace only replaces the first match in each string. So if we attempt to replace “s”-s with “z”-s in the subjects, we get

str_replace(subjects, fixed("s", ignore_case=TRUE), "z")

## [1] "new pazsword"                  "new colleague"                
## [3] "URGENT! Your paycheck!"        "from HR"                      
## [5] "pleaze change your passwords!" "Pazswords are expired"

As you can see, only the first “s” was replaced. The solution is to use str_replace_all which replaces all those patterns:

str_replace_all(subjects, fixed("s", ignore_case=TRUE), "z")

## [1] "new pazzword"                  "new colleague"                
## [3] "URGENT! Your paycheck!"        "from HR"                      
## [5] "pleaze change your pazzwordz!" "Pazzwordz are expired"

One common application of pattern replacement is to remove parts of a string by replacing those with empty strings "". For instance, we can remove “password” as

str_replace(subjects, "password", "")

## [1] "new "                   "new colleague"          "URGENT! Your paycheck!"
## [4] "from HR"                "please change your s!"  "Passwords are expired"

Regexps offer better functionality here, allowing you to replace “password”, “passwords”, and the relevant spaces in one go.
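One possible regexp for this task is sketched below. It relies on the ? quantifier (zero or one of the preceding character), which is explained in Section 7.4; the pattern and the variable name cleaned are just one way to do it, not the only one:

```r
library(stringr)

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

# " ?" is an optional leading space, "s?" an optional trailing "s"
cleaned <- str_replace(subjects, " ?passwords?", "")
cleaned
```

The match is still case-sensitive, so “Passwords are expired” is left untouched.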

7.3.3 Combining strings

str_c combines multiple strings together into a single one (similar to base-R paste()). For instance, if we want to add “Subject:” to each of the email subjects, then we can achieve it with

str_c("Subject: ", subjects)

## [1] "Subject: new password"                  "Subject: new colleague"                
## [3] "Subject: URGENT! Your paycheck!"        "Subject: from HR"                      
## [5] "Subject: please change your passwords!" "Subject: Passwords are expired"

As you see, it combines two string vectors. One is “Subject:” (length 1), and the other is subjects (length 6). It is done element-by-element, i.e. “Subject:” will be added to each element of the second vector.

But sometimes we want to merge all individual elements into a single one. This can be achieved with the collapse argument:

str_c(subjects, collapse=" / ")

## [1] "new password / new colleague / URGENT! Your paycheck! / from HR / please change your passwords! / Passwords are expired"

This will convert the original string vector, subjects, into a single string by combining all these elements and placing " / " between each of them.

7.4 Regular expressions

Regular expressions (aka regexps) are a way to describe patterns in text. They are very powerful tools for finding and replacing simple patterns in strings; in many ways they resemble a separate minimalistic programming language.

7.4.1 Basics of regexps

Regular expressions look a bit like ordinary strings. But if your function treats an argument as a regexp, not as ordinary string, then some of the symbols have different meaning. First we describe a few of the most common special characters.

. (dot) means any single character. For instance:
```
str_subset("abc", ".")
```
```
## [1] "abc"
```
Here the pattern "." is treated as a regular expression that matches any character. In particular, it also matches "a", and hence the string "abc" is detected by the function. Contrast this with the case where "." is treated not as a regexp but as a normal character:
```
str_subset("abc", fixed("."))
```
```
## character(0)
```
? specifies a quantity: the preceding character must be there zero or one times. So one can detect
```
str_subset(c("hand", "hands"), "hands?")
```
```
## [1] "hand"  "hands"
```
The regular expression matches “hand” plus “s” zero or one time. So it can pick up both “hand” and “hands”.
* is a somewhat similar quantity specifier: the preceding character must be there zero or more times. So we have
```
str_subset(c("hand", "hands", "handss", "handssss", "handx"), "hands*")
```
```
## [1] "hand"     "hands"    "handss"   "handssss" "handx"
```
This pattern matches “hand” and any form of “handsss..”.

But it may be somewhat surprising to see that it also matches “handx”. After all, "s*" is supposed to match any number of “s”-s, not “x”-s. But this is simply how regular expressions work: “handx” contains “hand” and zero “s”-s. So it matches "hands*" after all–there is no word in the regexp about what will or will not appear after the end of the regexp. The end of the regexp does not mean the end of the string! Compare with ordinary pattern matching:

str_subset("abc", "b")

## [1] "abc"

“b” will match “abc” because “abc” contains “b”… But if you want to ensure that nothing follows “hand”, then you need to use string edge markers:

$ matches end of the string:
```
str_subset(c("abc", "ab"), "b$")
```
```
## [1] "ab"
```
This will only detect “ab” because now we request “b” to be the last character in the string.
^, in a similar fashion, matches the beginning of the string:
```
str_subset(c("abc", "xab"), "^a")
```
```
## [1] "abc"
```
Only “abc” is detected because “a” must be in the first position.
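Combining the edge markers with the dot already gives fairly specific patterns. As a sketch, the 200-level courses mentioned in Section 7.1 can be picked out of the course list used earlier. The patterns and variable names below are one possible choice; note that "2..$" would also accept non-digit characters after the “2”.

```r
library(stringr)

courses <- c("math126", "soc102", "info200",
             "info201", "bio220", "info180")

# "2..$": a "2" followed by exactly two more characters at the end
level200 <- str_subset(courses, "2..$")
level200       # "info200" "info201" "bio220"

# "^info2..$": additionally require "info" at the very beginning
infoLevel200 <- str_subset(courses, "^info2..$")
infoLevel200   # "info200" "info201"
```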

But what if one wants to match one of the special characters, e.g. a dot or dollar sign? For instance

str_subset(c(".", "$"), ".")  # matches both

## [1] "." "$"

str_subset(c(".", "$"), "$")  # matches everything
                              # because each string has an end!

## [1] "." "$"

will not work as intended. For this simple example, one can use fixed strings instead of regexp. But in general, the special symbols must be escaped, using backslash. And as backslash is a special symbol, we need to escape it with another backslash, resulting in a somewhat awkward double-backslash notation:

str_subset(c(".", "$"), "\\.")  # matches only the dot

## [1] "."

str_subset(c(".", "$"), "\\$")  # matches only the dollar sign

## [1] "$"
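To see why two backslashes are needed, it helps to look at what the string actually contains. The sketch below uses the base-R functions nchar() and writeLines() for this, and also shows the fixed() alternative mentioned above; the variable name dollar is illustrative.

```r
library(stringr)

# the R string "\\." consists of two characters: a backslash and a dot
nchar("\\.")
writeLines("\\.")   # prints the string as the regexp engine sees it: \.

# fixed() is an alternative when no regexp features are needed
dollar <- str_subset(c(".", "$"), fixed("$"))
dollar              # "$"
```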

Example 7.1

Find all valid web addresses in the form http://www.example.com, https://www.example.com, http://www.example.com/, and reject all other strings. We need to write a regexp that:

starts with http
followed by zero or one s
followed by ://www.example.com
followed by zero or one /
and that must be the end of the string.

This can be achieved with "^https?://www\\.example\\.com/?$":

str_subset(c("http://www.example.com", "www.example.com",
             "https://www.example.com/", "http://www.example.com/index.html",
             "ftp://www.example.com"),
           "^https?://www\\.example\\.com/?$")

## [1] "http://www.example.com"   "https://www.example.com/"

Explanation:

the string must start with http: ^http
it is followed by zero or one s: s?
followed by ://www.example.com: ://www\\.example\\.com (note how we have escaped the dots)
followed by zero or one /: /?
and that must be the end of the string: $

7.4.2 Limitations

Regular expressions only work with “simple” patterns. They do not include any understanding of human languages or human grammar, and hence if you are looking for a “man”, regexps do not help you to find patterns like “guy” or “chap”.
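A small sketch with made-up words shows both sides of this limitation: the regexp does not find synonyms, and it happily matches words that merely contain the same letters.

```r
library(stringr)

words <- c("man", "guy", "chap", "woman", "manual")   # illustrative data
hasMan <- str_detect(words, "man")
hasMan   # TRUE FALSE FALSE TRUE TRUE
```

“guy” and “chap” are missed, while “woman” and “manual” are matched because they contain the letters “man”.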