Chapter 5 Working with strings

TBD: base-R string functions

5.1 String functions in stringr

(Note: this should be a separate section but I do not want to mess up the section numbering in the middle of the quarter.)

Base R contains many string-related functions; the most popular ones include paste, match, grep and sub. The stringr package provides broader functionality with a more consistent interface. Here we describe a few useful functions for working with strings in that package. As a reminder, you load the package with

library(stringr)

assuming you have already installed it (see Section 3.6).

5.1.1 Searching patterns in strings

One of the common tasks is to find strings that match a pattern. For instance, imagine email subjects “new password”, “new colleague”, “URGENT! Your paycheck!”, “from HR”, “please change your passwords!”, “Passwords are expired”. How can we find the messages that are related to passwords?

In base-R this can be achieved with grep() but here we focus on stringr functionality. First, let’s create a vector of email subjects:

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

str_detect() finds which strings contain a regular expression (regexp, see Section 5.2 below) pattern. It returns a logical vector that tells which elements of the input vector contain the pattern:

str_detect(subjects, "password")
## [1]  TRUE FALSE FALSE FALSE  TRUE FALSE

As you can see, the first and the fifth element contain the phrase, the other elements do not. You may be surprised that the last one, “Passwords are expired” is marked as not containing the pattern. This is because the pattern matching is case-sensitive, and hence Password and password are different things.

Below are a few examples of how to adapt the results for different needs using the basic tools. If you do not care about the logical values themselves but want the matching elements, you can use the result of str_detect() as an index to extract them:

subjects[str_detect(subjects, "password")]
## [1] "new password"                  "please change your passwords!"

(But rather check out str_subset().)

If you want to match in a case-insensitive way, then one option is to force the strings first into lower case, and then search the patterns:

lSubjects <- tolower(subjects)  # force to lower case
i <- str_detect(lSubjects, "password")  # search in lower case strings
subjects[i]  # but 'subjects' are still in the original case
## [1] "new password"                  "please change your passwords!"
## [3] "Passwords are expired"

For clarity, we do it here in three steps: first, convert the strings to lower case (and store them in a temporary variable lSubjects). Second, find the pattern in the lower-case version of the subjects. And third, print the corresponding email subjects in the original case.

However, it may be easier to use dedicated functions and modifiers in stringr, here str_subset() for finding matching elements and the modifiers fixed()/regex() to ask for case-insensitive results:

str_subset(subjects, fixed("password", ignore_case=TRUE))
## [1] "new password"                  "please change your passwords!"
## [3] "Passwords are expired"

Here str_subset() returns not the logical values that tell whether the elements contain the pattern but the matching elements themselves. fixed(..., ignore_case=TRUE) means that the pattern should not be treated as a regexp but as an ordinary string, and that we ignore case here. The code is clearer and simpler, but there is an upfront cost of learning even more library functions.
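If the pattern should remain a regexp, the regex() modifier accepts the same ignore_case argument. A minimal sketch, with the subjects vector repeated so that the snippet runs on its own:

```r
library(stringr)

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

# regex() keeps the pattern a regular expression but ignores case
matched <- str_subset(subjects, regex("password", ignore_case = TRUE))
matched
## [1] "new password"                  "please change your passwords!"
## [3] "Passwords are expired"
```

For a plain word like “password” both modifiers give the same result; the difference only shows up once the pattern contains special regexp characters.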

5.1.2 Replacing patterns in strings

Another common task is to replace certain patterns in strings. For instance, we may want to change the word “password” to “access token” in the subjects above. This can be done with str_replace() (base-R equivalent is sub):

str_replace(subjects, "password", "access token")
## [1] "new access token"                  "new colleague"                    
## [3] "URGENT! Your paycheck!"            "from HR"                          
## [5] "please change your access tokens!" "Passwords are expired"

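It takes three arguments: the vector of strings, the pattern to replace, and finally the new string to replace the pattern with. By default, the pattern is treated as a regexp, while the replacement is inserted as an ordinary string (apart from back-references such as \\1).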

As in the case of str_detect(), the default options are case-sensitive, so “Passwords” in the last element is not replaced. We can ask for case-insensitive fixed patterns in the same fashion as for that function:

str_replace(subjects, fixed("password", ignore_case=TRUE), "access token")
## [1] "new access token"                  "new colleague"                    
## [3] "URGENT! Your paycheck!"            "from HR"                          
## [5] "please change your access tokens!" "access tokens are expired"

str_replace() only replaces the first match in each string. So if we attempt to replace each “s” with “z” in the subjects, we get

str_replace(subjects, fixed("s", ignore_case=TRUE), "z")
## [1] "new pazsword"                  "new colleague"                
## [3] "URGENT! Your paycheck!"        "from HR"                      
## [5] "pleaze change your passwords!" "Pazswords are expired"

As you can see, only the first “s” in each string was replaced. The solution is to use str_replace_all(), which replaces all matches:

str_replace_all(subjects, fixed("s", ignore_case=TRUE), "z")
## [1] "new pazzword"                  "new colleague"                
## [3] "URGENT! Your paycheck!"        "from HR"                      
## [5] "pleaze change your pazzwordz!" "Pazzwordz are expired"

One common application of pattern replacement is to remove parts of a string by replacing them with the empty string "". For instance, we can remove “password” as

str_replace(subjects, "password", "")
## [1] "new "                   "new colleague"          "URGENT! Your paycheck!"
## [4] "from HR"                "please change your s!"  "Passwords are expired"

Regexps offer better functionality here, allowing us to remove both “password” and “passwords”, together with the relevant spaces.
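As a preview of what regexps (Section 5.2) can do here, the sketch below removes “password” or “passwords” together with a preceding space. The pattern " ?passwords?" and the final str_trim() call are illustrative choices, not the only way to do this:

```r
library(stringr)

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

# " ?" is an optional space, "s?" an optional plural "s";
# regex(..., ignore_case = TRUE) also catches "Passwords"
cleaned <- str_replace(subjects,
                       regex(" ?passwords?", ignore_case = TRUE), "")
# remove the leading space left behind in the last subject
cleaned <- str_trim(cleaned)
cleaned
## [1] "new"                    "new colleague"          "URGENT! Your paycheck!"
## [4] "from HR"                "please change your!"    "are expired"
```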

5.1.3 Combining strings

str_c combines multiple strings together into a single one (similar to base-R paste()). For instance, if we want to add “Subject:” to each of the email subjects, then we can achieve it with

str_c("Subject: ", subjects)
## [1] "Subject: new password"                  "Subject: new colleague"                
## [3] "Subject: URGENT! Your paycheck!"        "Subject: from HR"                      
## [5] "Subject: please change your passwords!" "Subject: Passwords are expired"

As you see, it combines two string vectors. One is “Subject:” (length 1), and the other is subjects (length 6). It is done element-by-element, i.e. “Subject:” will be added to each element of the second vector.

But sometimes we want to merge all individual elements into a single one. This can be achieved with the collapse argument:

str_c(subjects, collapse=" / ")
## [1] "new password / new colleague / URGENT! Your paycheck! / from HR / please change your passwords! / Passwords are expired"

This will convert the original string vector, subjects, into a single string by combining all its elements and placing “ / ” between them.
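The sep argument of str_c() can be combined with collapse. A short sketch with made-up data (the vectors senders and topics are hypothetical, not part of the example above): sep is placed between the pieces of each element, and collapse then joins the resulting elements into one string:

```r
library(stringr)

senders <- c("HR", "IT")                      # hypothetical example data
topics <- c("new colleague", "new password")  # hypothetical example data

# element-by-element combination, with ": " between the pieces
combined <- str_c(senders, topics, sep = ": ")
combined
## [1] "HR: new colleague" "IT: new password"

# the same, but afterwards merged into a single string
single <- str_c(senders, topics, sep = ": ", collapse = " / ")
single
## [1] "HR: new colleague / IT: new password"
```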

5.2 Regular expressions

Regular expressions (aka regexps) are a way to describe patterns in text. They are very powerful tools, not unlike a separate programming language, to find and replace simple patterns in strings.

5.2.1 Basics of regexps

Regular expressions look a bit like ordinary strings. But if a function treats an argument as a regexp, not as an ordinary string, then some of the symbols have a different meaning. First we describe a few of the most common special characters.

  • . (dot) means any single character. For instance:

    str_subset("abc", ".")
    ## [1] "abc"

    Here the pattern "." is treated as a regular expression that matches any character. In particular, it also matches "a", and hence the string "abc" is detected by the function. Contrast this with the case where "." is treated not as a regexp but as a normal character:

    str_subset("abc", fixed("."))
    ## character(0)

  • ? specifies a quantity: the preceding character must appear zero or one times. For instance:

    str_subset(c("hand", "hands"), "hands?")
    ## [1] "hand"  "hands"

    The regular expression matches “hand” plus “s” zero or one time. So it can pick up both “hand” and “hands”.

  • * is a somewhat similar quantity specifier: the preceding character must be there zero or more times. So we have

    str_subset(c("hand", "hands", "handss", "handssss", "handx"), "hands*")
    ## [1] "hand"     "hands"    "handss"   "handssss" "handx"

    This pattern matches “hand” and any form of “handsss..”.

But it may be somewhat surprising to see that it also matches “handx”. After all, "s*" is supposed to match any number of “s”-s, not “x”-s. But this is simply how regular expressions work: “handx” contains “hand” followed by zero “s”-s, so it matches "hands*". The regexp says nothing about what may or may not follow the matched part. The end of the regexp does not mean the end of the string! Compare with ordinary pattern matching:

str_subset("abc", "b")
## [1] "abc"

“b” will match “abc” because “abc” contains “b”. But if you want to ensure that nothing precedes or follows the pattern, then you need to use string edge markers:

  • $ matches end of the string:

    str_subset(c("abc", "ab"), "b$")
    ## [1] "ab"

    will only detect “ab” because now we request “b” to be the last character in the string.

  • ^, in a similar fashion, matches the beginning of the string:

    str_subset(c("abc", "xab"), "^a")
    ## [1] "abc"

    Only “abc” is detected because “a” must be in the first position.
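Both edge markers can be combined to require that the whole string matches the pattern. A minimal sketch that revisits the "hands*" example (the extra word “shands” is added here only for illustration):

```r
library(stringr)

words <- c("hand", "hands", "handss", "handx", "shands")

# ^ and $ together: nothing may precede "hand" and
# nothing but "s"-s may follow it
whole <- str_subset(words, "^hands*$")
whole
## [1] "hand"   "hands"  "handss"
```

Now “handx” is rejected because “x” sits between the optional “s”-s and the end of the string, and “shands” is rejected because it does not start with “h”.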

But what if one wants to match one of the special characters, e.g. a dot or dollar sign? For instance

str_subset(c(".", "$"), ".")  # matches both: the dot matches any character
## [1] "." "$"
str_subset(c(".", "$"), "$")  # matches everything because each string has an end!
## [1] "." "$"

will not work as intended. For this simple example, one can use fixed strings instead of regexps. But in general, the special symbols must be escaped with a backslash. And as the backslash is itself a special symbol in R strings, we need to escape it with another backslash, resulting in a somewhat awkward double-backslash notation:

str_subset(c(".", "$"), "\\.")  # matches only the literal dot
## [1] "."
str_subset(c(".", "$"), "\\$")  # matches only the literal dollar sign
## [1] "$"
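For comparison, here is the fixed-string alternative for the same simple example. This is a small sketch (the variable names are chosen here for illustration) showing that fixed() gives the same result as the escaped regexps:

```r
library(stringr)

symbols <- c(".", "$")

# fixed() treats the pattern as an ordinary string, so no escaping is needed
dots <- str_subset(symbols, fixed("."))
dots
## [1] "."
dollars <- str_subset(symbols, fixed("$"))
dollars
## [1] "$"
```

The fixed() version is easier to read, but it cannot be mixed with other regexp features such as ? or ^ in the same pattern.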

Example 5.1 Find all valid web addresses in the form _http://www.example.com_, _https://www.example.com_, _http://www.example.com/_, reject all other strings. We need to write a regexp that:

  • starts with http
  • followed by zero or one s
  • followed by ://www.example.com
  • followed by zero or one /
  • and that must be the end of the string.

This can be achieved with "^https?://www\\.example\\.com/?$":

str_subset(c("http://www.example.com", "www.example.com",
             "https://www.example.com/", "http://www.example.com/index.html",
             "ftp://www.example.com"),
           "^https?://www\\.example\\.com/?$")
## [1] "http://www.example.com"   "https://www.example.com/"

Explanation:

  • the string must start with http: ^http
  • it is followed by zero or one s: s?
  • followed by ://www.example.com: ://www\\.example\\.com, note how we have escaped the dots
  • followed by zero or one /: /?
  • and that must be the end of the string: $

5.2.2 Limitations

Regular expressions only work with “simple” patterns. They do not include any understanding of human languages or human grammar, and hence if you are looking for a “man”, regexps do not help you to find patterns like “guy” or “chap”.
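A small sketch of this limitation, using made-up phrases (the vector phrases is hypothetical): the pattern "man" is matched purely as a sequence of characters.

```r
library(stringr)

phrases <- c("a man", "a guy", "a chap", "human")  # hypothetical example data

# the regexp knows nothing about meaning, only about characters
found <- str_detect(phrases, "man")
found
## [1]  TRUE FALSE FALSE  TRUE
```

The synonyms “guy” and “chap” are missed, while “human” is picked up only because it happens to contain the letters “man”.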

See also

  • R help for regular expressions: ?regexp