Chapter 5 Working with strings

This chapter discusses the dedicated functionality for working with strings. It first covers base-R functions like paste() and grep(), and afterwards discusses the dedicated stringr library.

5.1 What is the basic string functionality

Character strings are one of the basic data types in most programming languages. They open up completely different options to work with data–you can work with words and text–but you cannot easily use strings for mathematics.

The operations we typically want to do with character values are
  • compare strings: are two strings equal
  • somewhat relatedly, we may want to order strings. Typically, we want to put them in an alphabetic order, but sometimes we may prefer a different order.
  • Often we need to combine strings to make new (and longer) strings, or extract parts of a string (substrings).
  • Another common task is to find patterns in strings. For instance, out of the list of all courses, you may want to find all informatics courses that have the prefix “INFO”. Or alternatively, you may need to detect a more complex pattern, e.g. all 200-level courses.
  • Sometimes we need not just to find but to replace patterns. For instance, when working with the U.S. county data, you may notice that in Louisiana, counties are called “parish”. You may want to replace “parish” with “county”, just to make your data more homogeneous. Or maybe you want to remove that word completely by replacing it with an empty string; after all, we know that our data is about counties.
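
As a preview, the parish example can be sketched with the base-R function sub() (the base-R counterpart of the replacement functions in Section 5.3.2). The two parish names below are just example input for this sketch, not data used elsewhere in this chapter.

```r
# Example input for this sketch
counties <- c("Orleans Parish", "Caddo Parish")

# Replace the word "Parish" with "County"
sub("Parish", "County", counties)
## [1] "Orleans County" "Caddo County"

# Or remove the word (and the space in front of it) by replacing it with ""
sub(" Parish", "", counties)
## [1] "Orleans" "Caddo"
```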

Strings are often used for other tasks too, such as storing long numbers. As character strings can be of arbitrary length, you can store arbitrarily large or arbitrarily precise numbers in strings.
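
A minimal sketch of why strings can be useful for long numbers: the 20-digit value below is an arbitrary example, and the comparison only shows that a round trip through an ordinary number does not give the same digits back.

```r
# A 20-digit number stored as a string keeps every digit
big <- "12345678901234567890"
nchar(big)
## [1] 20

# An ordinary number (double) holds only about 15-16 significant digits,
# so converting to a number and back does not return the original digits
sprintf("%.0f", as.numeric(big)) == big
## [1] FALSE
```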

Strings also open a way for doing natural language processing (NLP) with text. Those tasks typically involve loading a whole text, such as an email, Amazon review, or a resume into a character string, and thereafter analyzing it to determine whether it is spam, whether the user likes the product and whether the applicant has the relevant skills. We do not discuss NLP in this book.

Below, we’ll give an overview of the main functionality in base-R (Section 5.2) and thereafter in the stringr package (Section 5.3).

5.2 Base-R string functions

Base-R includes the basic string operators, including the comparison operators, that form the basis for most other string functionality; functions for pattern detection and replacement, such as grep() and gsub(); and powerful regular expressions (see Section 5.4).

5.2.1 Comparing strings

Strings can be compared with the same operators as numbers: == to test for equality and != to test for inequality. For instance

"a" == "a"
## [1] TRUE
"aa" == "a"
## [1] FALSE
"b" != "a"
## [1] TRUE

Obviously, R does not do translation:

"蘭花" == "orchid"
## [1] FALSE

You can include computations on both sides:

"aa" == paste("a", "a", sep = "")
## [1] TRUE

5.2.2 Ordering strings

String ordering is based on alphabetical order. For instance, “b” is “larger” than “a”, because it follows “a” in the alphabet:

"b" > "a"
## [1] TRUE

But computers represent all symbols using internal numeric codes, not just the letters of English alphabet. So you can ask which one is “larger” about all sorts of symbols:

"{" > "&"
## [1] FALSE
"💀" > "😍"
## [1] FALSE
"蘭" > "ቄ"
## [1] TRUE

The fact that strings are alphabetically ordered is sometimes funny:

"mouse" > "elephant"
## [1] TRUE

But other times it may be problematic. Imagine, you are rendering a video, frame-by-frame, and you label your frames “frame1.png”, “frame2.png”, … “frame10.png”, “frame11.png” and so on. What is the alphabetic order, the natural order of the frame names? Maybe somewhat unexpectedly, it is

  1. frame1.png
  2. frame10.png
  3. frame11.png
  4. frame2.png

This is because “2” follows “1” in the alphabetic order (or more precisely, in ASCII order), and hence all names that start with “frame1”, including “frame10.png” and “frame11.png”, precede “frame2.png”. Never mind that it does not make sense if you think of these as numbers. The easiest solution in such a case is to call your first frames not “frame1.png” and “frame2.png” but “frame01.png” and “frame02.png”. This is typically much simpler than teaching the computer to use a custom ordering mechanism for your files…
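
The frame example can be sketched as follows. The argument method = "radix" is an addition made for this sketch: it makes sort() use the plain C-locale order, so the result should not depend on the language settings of your computer. Building the names with sprintf() is one possible way to zero-pad them, not the only one.

```r
frames <- c("frame1.png", "frame2.png", "frame10.png", "frame11.png")

# Sorted as strings, "frame10.png" and "frame11.png" come before "frame2.png"
sort(frames, method = "radix")
## [1] "frame1.png"  "frame10.png" "frame11.png" "frame2.png"

# Zero-padded names: "%02d" pads the number to two digits
padded <- sprintf("frame%02d.png", c(1L, 2L, 10L, 11L))
sort(padded, method = "radix")
## [1] "frame01.png" "frame02.png" "frame10.png" "frame11.png"
```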

5.2.3 Combining strings

Combining strings means attaching strings together into longer strings. The most important base-R function here is paste(). In its simplest form it just attaches a few strings together:

paste("Shah", "Soleiman")
## [1] "Shah Soleiman"

This results in a space between “Shah” and “Soleiman”, which may or may not be what you want. You can adjust it with the extra argument sep = as

paste("Shah", "Soleiman", sep = " 👑 ")
## [1] "Shah 👑 Soleiman"

There is also a handy shortcut, paste0(), for paste(..., sep = ""), joining strings with no space in-between.
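
A short sketch of paste0() in action; the file-name example is only an illustration.

```r
paste0("Shah", "Soleiman")
## [1] "ShahSoleiman"

# Numbers are converted to strings before joining
paste0("frame", 1:3, ".png")
## [1] "frame1.png" "frame2.png" "frame3.png"
```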

paste() is vectorized: it joins two string vectors component-by-component. So you can do

paste(c("Shah", "Shahanshah"),
      c("Soleiman", "Ardashir"))
## [1] "Shah Soleiman"       "Shahanshah Ardashir"

This results in a vector of length two, “Shah Soleiman” and “Shahanshah Ardashir”. But sometimes we want the result to be not a vector of length two, but a single string. This can be achieved with the collapse = argument. The latter concatenates the two (or more) components of the joined vector (here “Shah Soleiman” and “Shahanshah Ardashir”) into a single string:

paste(c("Shah", "Shahanshah"),
      c("Soleiman", "Ardashir"),
      collapse = " and ")
## [1] "Shah Soleiman and Shahanshah Ardashir"
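
The two arguments can also be used together. The sketch below uses arbitrary letters and numbers to show which separator ends up where: sep goes between the paired elements, collapse between the resulting pairs.

```r
# sep joins each pair, collapse joins the pairs into one string
paste(c("a", "b", "c"), 1:3, sep = "-", collapse = " + ")
## [1] "a-1 + b-2 + c-3"

# With a single vector, collapse alone glues all elements together
paste(c("a", "b", "c"), collapse = "")
## [1] "abc"
```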

Exercise 5.1 Take a vector of titles (king, shahanshah, shah) and a vector of names (Darius, Ardashir, Soleiman). Use paste() to combine these together into a single string

king Darius, shahanshah Ardashir and shah Soleiman

Hint: you need to use paste() twice. You may also check out str_flatten_comma() in the stringr package.

The solution

TBD: more base-R string functions

5.3 String functions in stringr

Base R contains many string-related functions; the most popular ones include paste, match, grep and sub. The stringr package provides wider functionality with more consistent usage. Here we describe a few useful functions for working with strings in the package. As a reminder, you load the package with

library(stringr)

assuming you have already installed it (see Section 3.6).

5.3.1 Searching patterns in strings

One of the common tasks is to find strings that match a pattern. For instance, imagine email subjects “new password”, “new colleague”, “URGENT! Your paycheck!”, “from HR”, “please change your passwords!”, “Passwords are expired”. How can we find messages that are related to passwords?

In base-R this can be achieved with grep() but here we focus on stringr functionality. First, let’s create a vector of email subjects:

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

str_detect is a function to find which strings contain a regular expression (regexp, see Section 5.4 below) pattern. It returns a logical vector, telling which vector elements contain the pattern:

str_detect(subjects, "password")
## [1]  TRUE FALSE FALSE FALSE  TRUE FALSE

As you can see, the first and the fifth element contain the phrase, the other elements do not. You may be surprised that the last one, “Passwords are expired” is marked as not containing the pattern. This is because the pattern matching is case-sensitive, and hence Password and password are different things.
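
For comparison, here is a sketch of the base-R route mentioned above. It repeats the subjects vector so that it can be run on its own. Note that in grepl() and grep() the pattern comes first and the strings second, the opposite of str_detect().

```r
subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

# grepl() returns a logical vector, like str_detect()
grepl("password", subjects)
## [1]  TRUE FALSE FALSE FALSE  TRUE FALSE

# grep() returns the positions of the matching elements instead
grep("password", subjects)
## [1] 1 5
```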

Below are a few examples of how to adapt the results for different needs using the basic tools. If you are not interested in the logical values themselves but in the matching elements, you can use the result of str_detect to extract them:

subjects[str_detect(subjects, "password")]
## [1] "new password"                  "please change your passwords!"

(But rather check out str_subset().)

If you want to match in a case-insensitive way, then one option is to force the strings first into lower case, and then search the patterns:

lSubjects <- tolower(subjects)  # force to lower case
i <- str_detect(lSubjects, "password")  # search in lower case strings
subjects[i]  # but 'subjects' are still in the original case
## [1] "new password"                  "please change your passwords!"
## [3] "Passwords are expired"

For clarity, we do it here in three steps: first, convert strings to lower case (and store to a temporary variable lSubjects). Second, find the pattern in the lower-case version of the subjects. And third, print the corresponding email subjects in the original case.

However, it may be easier to use dedicated functions and modifiers in stringr, here str_subset() for finding matching elements and the modifiers fixed()/regex() to ask for case-insensitive results:

str_subset(subjects, fixed("password", ignore_case=TRUE))
## [1] "new password"                  "please change your passwords!"
## [3] "Passwords are expired"

Here str_subset() returns not the logical values that tell whether the elements contain the pattern, but the matching elements themselves. fixed(..., ignore_case=TRUE) means that the pattern should not be treated as a regexp but as an ordinary literal string, and that we ignore case here. The code is clearer and simpler, but there is an upfront cost of learning even more library functions.
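
The regex() modifier works in the same way when the pattern should remain a regular expression. A minimal sketch, assuming stringr is installed and repeating the subjects vector so that it runs on its own:

```r
library(stringr)

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

# regex() keeps the pattern a regexp but ignores case
str_detect(subjects, regex("password", ignore_case = TRUE))
## [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE
```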

5.3.2 Replacing patterns in strings

Another common task is to replace certain patterns in strings. For instance, we may want to change the word “password” to “access token” in the subjects above. This can be done with str_replace() (base-R equivalent is sub):

str_replace(subjects, "password", "access token")
## [1] "new access token"                  "new colleague"                    
## [3] "URGENT! Your paycheck!"            "from HR"                          
## [5] "please change your access tokens!" "Passwords are expired"

It takes three arguments: the vector of strings, the pattern to replace, and finally the new string to replace the pattern with. By default, the pattern is treated as a regexp.

As in the case of str_detect(), the default options are case-sensitive, so “Password” in the last subject is not replaced. We can ask for case-insensitive fixed patterns in a similar fashion as for that function:

str_replace(subjects, fixed("password", ignore_case=TRUE), "access token")
## [1] "new access token"                  "new colleague"                    
## [3] "URGENT! Your paycheck!"            "from HR"                          
## [5] "please change your access tokens!" "access tokens are expired"

str_replace only replaces the first match in each string. So if we attempt to replace “s”-s with “z”-s in the subjects, we get

str_replace(subjects, fixed("s", ignore_case=TRUE), "z")
## [1] "new pazsword"                  "new colleague"                
## [3] "URGENT! Your paycheck!"        "from HR"                      
## [5] "pleaze change your passwords!" "Pazswords are expired"

As you can see, only the first “s” was replaced. The solution is to use str_replace_all which replaces all those patterns:

str_replace_all(subjects, fixed("s", ignore_case=TRUE), "z")
## [1] "new pazzword"                  "new colleague"                
## [3] "URGENT! Your paycheck!"        "from HR"                      
## [5] "pleaze change your pazzwordz!" "Pazzwordz are expired"

One common application of pattern replacement is to remove parts of a string by replacing those with empty strings "". For instance, we can remove password as

str_replace(subjects, "password", "")
## [1] "new "                   "new colleague"          "URGENT! Your paycheck!"
## [4] "from HR"                "please change your s!"  "Passwords are expired"

Regexps offer better functionality here, allowing you to remove “password”, “passwords”, and the adjacent spaces in one go.
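
A possible sketch of such a regexp, using the ? quantifier explained in Section 5.4. The pattern " ?passwords?" is just one way to write it: an optional space, the word “password”, and an optional “s”. The match is still case-sensitive, so “Passwords are expired” is left alone.

```r
library(stringr)

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

# " ?" matches an optional space, "s?" an optional plural "s"
cleaned <- str_replace(subjects, " ?passwords?", "")
cleaned[c(1, 5)]
## [1] "new"                 "please change your!"
```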

5.3.3 Combining strings

str_c combines multiple strings together into a single one (similar to base-R paste()). For instance, if we want to add “Subject:” to each of the email subjects, then we can achieve it with

str_c("Subject: ", subjects)
## [1] "Subject: new password"                  "Subject: new colleague"                
## [3] "Subject: URGENT! Your paycheck!"        "Subject: from HR"                      
## [5] "Subject: please change your passwords!" "Subject: Passwords are expired"

As you see, it combines two string vectors. One is “Subject:” (length 1), and the other is subjects (length 6). It is done element-by-element, i.e. “Subject:” will be added to each element of the second vector.

But sometimes we want to merge all individual elements into a single one. This can be achieved with the collapse argument:

str_c(subjects, collapse=" / ")
## [1] "new password / new colleague / URGENT! Your paycheck! / from HR / please change your passwords! / Passwords are expired"

This will convert the original string vector, subjects, into a single string by combining all these elements and placing “ / ” between each of them.

5.4 Regular expressions

Regular expressions (aka regexp-s) are a way to describe patterns in text. These are very powerful tools, not unlike a separate programming language, to find and replace simple patterns in strings.

5.4.1 Basics of regexps

Regular expressions look a bit like ordinary strings. But if your function treats an argument as a regexp, not as ordinary string, then some of the symbols have different meaning. First we describe a few of the most common special characters.

  • . (dot) means any single character. For instance:

    str_subset("abc", ".")
    ## [1] "abc"

    Here the pattern "." is treated as a regular expression that matches any character. In particular, it matches "a", and hence the string "abc" is detected by the function. Contrast this with the case where "." is treated not as a regexp but as a normal character:

    str_subset("abc", fixed("."))
    ## character(0)
  • ? specifies a quantity: the preceding character must be there zero or one times. So one can detect

    str_subset(c("hand", "hands"), "hands?")
    ## [1] "hand"  "hands"

    The regular expression matches “hand” plus “s” zero or one time. So it can pick up both “hand” and “hands”.

  • * is a somewhat similar quantity specifier: the preceding character must be there zero or more times. So we have

    str_subset(c("hand", "hands", "handss", "handssss", "handx"), "hands*")
    ## [1] "hand"     "hands"    "handss"   "handssss" "handx"

    This pattern matches “hand” and any form of “handsss..”.

But it may be somewhat surprising to see that it also matches “handx”. After all, "s*" is supposed to match any number of “s”-s, not “x”-s. But this is simply how regular expressions work: “handx” contains “hand” followed by zero “s”-s, so it matches "hands*". The regexp says nothing about what may or may not appear after the part it describes. The end of the regexp does not mean the end of the string! Compare with ordinary pattern matching:

str_subset("abc", "b")
## [1] "abc"

“b” will match “abc” because “abc” contains “b”… But if you want to ensure that nothing precedes or follows the pattern, then you need to use string edge markers:

  • $ matches end of the string:

    str_subset(c("abc", "ab"), "b$")
    ## [1] "ab"

    will only detect “ab” because now we request “b” to be the last character in the string.

  • ^, in a similar fashion, matches the beginning of the string:

    str_subset(c("abc", "xab"), "^a")
    ## [1] "abc"

    Only “abc” is detected because “a” must be in the first position.
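
Combining both markers fixes the “handx” surprise above. A minimal sketch, assuming stringr is installed: the pattern "^hands*$" asks for the whole string to be “hand” followed by zero or more “s”-s and nothing else.

```r
library(stringr)

words <- c("hand", "hands", "handss", "handssss", "handx")

# ^ and $ tie the pattern to both ends of the string, so "handx" drops out
str_subset(words, "^hands*$")
## [1] "hand"     "hands"    "handss"   "handssss"
```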

But what if one wants to match one of the special characters, e.g. a dot or dollar sign? For instance

str_subset(c(".", "$"), ".")  # matches both
## [1] "." "$"
str_subset(c(".", "$"), "$")  # matches everything
                              # because each string has an end!
## [1] "." "$"

will not work as intended. For this simple example, one can use fixed strings instead of regexps. But in general, the special symbols must be escaped with a backslash. And as the backslash is itself a special symbol in R strings, we need to escape it with another backslash, resulting in a somewhat awkward double-backslash notation:

str_subset(c(".", "$"), "\\.")  # matches only the dot
## [1] "."
str_subset(c(".", "$"), "\\$")  # matches only the dollar sign
## [1] "$"
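
To see what the double backslash does, the sketch below uses only base-R functions. It shows that "\\." is in fact a two-character string, a backslash and a dot, and that base-R grepl() reads escaped patterns in the same way. The fixed = TRUE argument is the base-R counterpart of the fixed() modifier.

```r
# "\\." is an R string of two characters: a backslash and a dot
nchar("\\.")
## [1] 2
writeLines("\\.")
## \.

# The escaped dot matches only a literal dot
grepl("\\.", c(".", "$"))
## [1]  TRUE FALSE

# With fixed = TRUE the pattern is not a regexp, so no escaping is needed
grepl("$", c(".", "$"), fixed = TRUE)
## [1] FALSE  TRUE
```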

Example 5.1 Find all valid web addresses in the form _http://www.example.com_, _https://www.example.com_, _http://www.example.com/_, reject all other strings. We need to write a regexp that:

  • starts with http
  • followed by zero or one s
  • followed by ://www.example.com
  • followed by zero or one /
  • and that must be the end of the string.

This can be achieved with "^https?://www\\.example\\.com/?$":

str_subset(c("http://www.example.com", "www.example.com",
             "https://www.example.com/", "http://www.example.com/index.html",
             "ftp://www.example.com"),
           "^https?://www\\.example\\.com/?$")
## [1] "http://www.example.com"   "https://www.example.com/"

Explanation:

  • the string must start with http: ^http
  • it is followed by zero or one s: s?
  • followed by ://www.example.com: ://www\\.example\\.com, note how we have escaped the dots
  • followed by zero or one /: /?
  • and that must be the end of the string: $

5.4.2 Limitations

Regular expressions only work with “simple” patterns. They do not include any understanding of human languages or human grammar, and hence if you are looking for a “man”, regexps do not help you to find patterns like “guy” or “chap”.

See also

  • R help for regular expressions: ?regexp