Chapter 5 Working with strings
This section discusses the dedicated functionality for working with
strings. It focuses on the base-R functions like paste() and grep(),
and afterwards discusses the dedicated stringr library.
5.1 What is the basic string functionality
Character strings are one of the basic data types in most programming languages. They open up completely different options to work with data–you can work with words and text–but you cannot easily use strings for mathematics.
The operations we typically want to do with character values are:
- Compare strings: are two strings equal?
- Somewhat relatedly, we may want to order strings. Typically, we want to put them in alphabetic order, but sometimes we may prefer a different order.
- Often we need to combine strings to make new (and longer) strings, or extract parts of a string (substrings).
- Another common task is to find patterns in strings. For instance, out of the list of all courses, you may want to find all informatics courses that have the prefix “INFO”. Or alternatively, you may need to detect a more complex pattern, e.g. all 200-level courses.
- Sometimes we need not just to find but to replace patterns. For instance, when working with the U.S. county data, you may notice that in Louisiana, counties are called “parish”. You may want to replace “parish” with “county”, just to make your data more homogeneous. Or maybe you want to remove that word completely by replacing it with an empty string; after all, we know that our data is about counties.
Strings are often used for other tasks too, such as storing long numbers. As character strings can be of arbitrary length, you can store arbitrarily large or arbitrarily precise numbers in strings.
Strings also open a way for doing natural language processing (NLP) with text. Those tasks typically involve loading a whole text, such as an email, Amazon review, or a resume into a character string, and thereafter analyzing it to determine whether it is spam, whether the user likes the product and whether the applicant has the relevant skills. We do not discuss NLP in this book.
Below, we’ll give an overview of the main functionality in base-R (Section 5.2) and thereafter in the stringr package (Section 5.3).
5.2 Base-R string functions
Base-R includes the basic string operators, including the comparison
operators, that form the basis for most other string functionality,
functions for pattern detection and replacement, such
as grep() and gsub(), and powerful regular expressions (see
Section 5.4).
5.2.1 Comparing strings
Strings can be compared with the same operators as numbers: == to test
for equality and != to test for inequality. For instance
## [1] TRUE
## [1] FALSE
## [1] TRUE
Obviously, R does not do translation:
## [1] FALSE
You can include computations on both sides:
## [1] TRUE
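A minimal sketch of such comparisons follows. The particular strings are illustrative assumptions, chosen only to show the operators at work:

```r
# Equality and inequality are case-sensitive and compare the characters themselves
same      <- "apple" == "apple"     # TRUE: identical strings
case_diff <- "apple" == "Apple"     # FALSE: upper and lower case differ
different <- "apple" != "orange"    # TRUE: the strings are not equal

# R compares characters, not meanings: no translation is done
translated <- "cat" == "gato"       # FALSE

# Both sides may be expressions that are computed before the comparison
computed <- toupper("abc") == paste0("AB", "C")   # TRUE

print(c(same, case_diff, different, translated, computed))
```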
5.2.2 Ordering strings
String ordering is based on alphabetical order. For instance, “b” is “larger” than “a”, because it follows “a” in the alphabet:
## [1] TRUE
But computers represent all symbols using internal numeric codes, not just the letters of the English alphabet. So you can ask which one is “larger” about all sorts of symbols:
## [1] FALSE
## [1] FALSE
## [1] TRUE
The fact that strings are alphabetically ordered is sometimes funny:
## [1] TRUE
But other times it may be problematic. Imagine you are rendering a video, frame-by-frame, and you label your frames “frame1.png”, “frame2.png”, … “frame10.png”, “frame11.png” and so on. What is the alphabetic order of the frame names? Maybe somewhat unexpectedly, it is
- frame1.png
- frame10.png
- frame11.png
- frame2.png
This is because “2” follows “1” in the alphabetic order (or more precisely, in the ASCII order), and hence all names that begin with “frame1” precede “frame2.png”. Never mind that it does not make sense if you think of these as numbers. The easiest solution in such a case is to call your first frames not “frame1.png” and “frame2.png” but “frame01.png” and “frame02.png”. This is typically much simpler than teaching the computer to use a custom ordering mechanism for your files…
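The frame-name problem and the zero-padding fix can be sketched as below. Two choices here are assumptions for illustration: `method = "radix"` is used so that sorting follows plain character codes regardless of the computer's language settings, and `sprintf()` is one convenient way to generate zero-padded names.

```r
frames <- c("frame1.png", "frame2.png", "frame10.png", "frame11.png")

# Radix sorting orders strings by character codes (C-locale order)
sorted <- sort(frames, method = "radix")
print(sorted)
# "frame1.png" "frame10.png" "frame11.png" "frame2.png"

# Zero-padded names: %02d pads the number with zeros to width two
padded <- sprintf("frame%02d.png", c(1L, 2L, 10L, 11L))
sorted_padded <- sort(padded, method = "radix")
print(sorted_padded)
# "frame01.png" "frame02.png" "frame10.png" "frame11.png"
```

With padded names the alphabetic and the numeric order coincide, as long as the padding is wide enough for the largest frame number.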
5.2.3 Combining strings
Combining strings means attaching strings together into longer
strings. The most important base-R function here is paste(). In
its simplest form it just attaches a few strings together:
## [1] "Shah Soleiman"
This results in a space between “Shah” and “Soleiman”, which may or may
not be what you want. You can adjust it with the extra argument sep =,
as in
## [1] "Shah 👑 Soleiman"
There is also a handy shortcut, paste0(), for paste(..., sep = ""),
joining strings with no space in-between.
Paste is vectorized–it joins two string vectors, component-by-component. So you can do
## [1] "Shah Soleiman" "Shahanshah Ardashir"
This results in a vector of length two, “Shah Soleiman” and
“Shahanshah Ardashir”. But sometimes we want the result to be not a
vector of length two, but a single string. This can be achieved with
the collapse = argument. The latter concatenates the two (or more)
components of the joined vector (here “Shah Soleiman” and “Shahanshah
Ardashir”) into a single string:
## [1] "Shah Soleiman and Shahanshah Ardashir"
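The paste() variants discussed above can be sketched as follows. The variable names `titles` and `rulers` and the separator "-" are assumptions made for this illustration:

```r
# Two strings joined with the default separator, a single space
simple <- paste("Shah", "Soleiman")               # "Shah Soleiman"

# A custom separator, and no separator at all
dashed <- paste("Shah", "Soleiman", sep = "-")    # "Shah-Soleiman"
glued  <- paste0("Shah", "Soleiman")              # "ShahSoleiman"

# Vectorized: elements are joined pairwise, giving a vector of length two
titles <- c("Shah", "Shahanshah")
rulers <- c("Soleiman", "Ardashir")
pairs  <- paste(titles, rulers)

# collapse = merges the resulting vector into a single string
single <- paste(titles, rulers, collapse = " and ")

print(pairs)
print(single)   # "Shah Soleiman and Shahanshah Ardashir"
```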
Exercise 5.1 Take a vector of titles (king, shahanshah, shah) and a vector of
names (Darius, Ardashir, Soleiman). Use paste() to combine these
together into a single string
king Darius, shahanshah Ardashir and shah Soleiman
Hint: you need to use paste() twice.
You may also check out str_flatten_comma() in the stringr package.
TBD: more base-R string functions
5.3 String functions in stringr
Base R contains many string-related functions; the most popular ones
include paste, match, grep and sub. The stringr package
provides wider functionality with more consistent usage.
Here we describe a few useful functions for working with strings in
the package. As a reminder, you load the package with library(stringr),
assuming you have already installed it (see Section 3.6).
5.3.1 Searching patterns in strings
One of the common tasks is to find strings that match a pattern. For instance, imagine email subjects “new password”, “new colleague”, “urgent!”, “from HR”, “please change your passwords!”, “Passwords are not needed any more”. How can we find messages that are related to passwords?
In base-R this can be achieved with grep() but here we focus on
stringr functionality.
First, let’s create a vector of email subjects:
subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")
str_detect() is a function to find which strings contain a regular expression (regexp, see Section 5.4 below) pattern. It returns a logical vector, telling which vector elements contain the pattern. For instance, searching the subjects for the pattern "password" gives:
## [1] TRUE FALSE FALSE FALSE TRUE FALSE
As you can see, the first and the fifth element contain the pattern,
the other elements do not. You may be surprised that the last one,
“Passwords are expired”, is marked as not containing the pattern. This
is because pattern matching is case-sensitive, and hence
Password and password are different things.
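A minimal sketch of this search, assuming the stringr package is installed; the name `has_password` is made up for this illustration:

```r
library(stringr)

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

# One logical value per subject: does it contain the (case-sensitive) pattern?
has_password <- str_detect(subjects, "password")
print(has_password)
# TRUE FALSE FALSE FALSE TRUE FALSE
```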
Below are a few examples of how to adapt the results for different needs
using the basic tools.
If you are not interested in whether any particular element contains
the pattern, but want the matching elements themselves, you can use the
result of str_detect() as a logical index to extract the
relevant elements:
## [1] "new password" "please change your passwords!"
(But rather check out str_subset().)
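Both routes can be sketched side by side. The names `by_index` and `by_subset` are assumptions for this illustration:

```r
library(stringr)

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

# Logical indexing with the result of str_detect()
by_index <- subjects[str_detect(subjects, "password")]

# str_subset() does the detection and the extraction in one step
by_subset <- str_subset(subjects, "password")

print(by_index)
print(identical(by_index, by_subset))   # TRUE
```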
If you want to match in a case-insensitive way, then one option is to force the strings first into lower case, and then search the patterns:
lSubjects <- tolower(subjects) # force to lower case
i <- str_detect(lSubjects, "password") # search in lower case strings
subjects[i] # but 'subjects' are still in the original case
## [1] "new password" "please change your passwords!"
## [3] "Passwords are expired"
For clarity, we do it here in three steps: first, convert the strings to
lower case (and store them in a temporary variable lSubjects). Second,
find the pattern in the lower-case version of the subjects. And
third, print the corresponding email subjects in the original case.
However, it may be easier to use dedicated functions and modifiers in
stringr, here str_subset() for finding matching elements and the
modifiers fixed()/regex() to ask for case-insensitive results:
## [1] "new password" "please change your passwords!"
## [3] "Passwords are expired"
Here str_subset() provides not just the logical values telling whether
the elements contain the pattern, but the matching elements themselves.
fixed(..., ignore_case=TRUE) means that the pattern should not be
treated as a regexp but as an ordinary string, and that we ignore
case here.
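A sketch of the case-insensitive search with both modifiers; the variable names are assumptions for this illustration. For a pattern without special characters, such as "password", the two modifiers give the same result:

```r
library(stringr)

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

# Literal (non-regexp) pattern, ignoring case
literal_match <- str_subset(subjects, fixed("password", ignore_case = TRUE))

# The same pattern treated as a regexp, also ignoring case
regex_match <- str_subset(subjects, regex("password", ignore_case = TRUE))

print(literal_match)
print(identical(literal_match, regex_match))   # TRUE
```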
The code is clearer and simpler, but there is an upfront cost of
learning even more library functions.
5.3.2 Replacing patterns in strings
Another common task is to replace certain patterns in strings. For
instance, we may want to change the word “password” to “access token”
in the subjects above. This can be done with str_replace() (the base-R
equivalent is sub()):
## [1] "new access token" "new colleague"
## [3] "URGENT! Your paycheck!" "from HR"
## [5] "please change your access tokens!" "Passwords are expired"
It takes three arguments: the vector of strings, the pattern to replace, and finally the new string to replace the pattern with. By default, the pattern is treated as a regexp, and the replacement may refer to the matched groups with \\1, \\2 and so on.
As in the case of str_detect(), the default
options are case-sensitive, so “Passwords” in the last subject is not
replaced. We can ask for case-insensitive fixed patterns in a
similar fashion as for that function:
## [1] "new access token" "new colleague"
## [3] "URGENT! Your paycheck!" "from HR"
## [5] "please change your access tokens!" "access tokens are expired"
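The two replacements can be sketched as follows, with variable names that are assumptions for this illustration:

```r
library(stringr)

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

# Case-sensitive replacement: "Passwords" in the last subject is left alone
replaced <- str_replace(subjects, "password", "access token")

# Case-insensitive replacement with a literal (non-regexp) pattern
replaced_any_case <- str_replace(subjects,
                                 fixed("password", ignore_case = TRUE),
                                 "access token")

print(replaced[6])            # "Passwords are expired"
print(replaced_any_case[6])   # "access tokens are expired"
```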
str_replace() only replaces the first occurrence of the pattern in each
string. So if we attempt to replace every “s” with “z” in the subjects, we get
## [1] "new pazsword" "new colleague"
## [3] "URGENT! Your paycheck!" "from HR"
## [5] "pleaze change your passwords!" "Pazswords are expired"
As you can see, only the first “s” was replaced.
The solution is to use str_replace_all(), which replaces all occurrences
of the pattern:
## [1] "new pazzword" "new colleague"
## [3] "URGENT! Your paycheck!" "from HR"
## [5] "pleaze change your pazzwordz!" "Pazzwordz are expired"
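The difference between the two functions, sketched on the same data (the names `first_only` and `every_one` are assumptions for this illustration):

```r
library(stringr)

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

# Only the first "s" in each string is replaced
first_only <- str_replace(subjects, "s", "z")

# Every "s" in each string is replaced
every_one <- str_replace_all(subjects, "s", "z")

print(first_only[5])   # "pleaze change your passwords!"
print(every_one[5])    # "pleaze change your pazzwordz!"
```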
One common application of pattern replacement is to remove parts of
a string by replacing those with the empty string "". For instance, we
can remove “password” as
## [1] "new " "new colleague" "URGENT! Your paycheck!"
## [4] "from HR" "please change your s!" "Passwords are expired"
Regexps offer better functionality here, allowing us to remove “password”, “passwords”, and the relevant spaces in one go.
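A sketch of both approaches follows. The regexp " ?passwords?" is one possible choice made for this illustration (an optional leading space, then “password” with an optional “s”); it uses the ? quantifier explained in Section 5.4.

```r
library(stringr)

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

# Plain removal leaves a trailing space and a stray "s" behind
removed <- str_replace(subjects, "password", "")

# A regexp that also consumes the preceding space and the plural "s"
removed_clean <- str_replace(subjects, " ?passwords?", "")

print(removed[c(1, 5)])         # "new "  "please change your s!"
print(removed_clean[c(1, 5)])   # "new"   "please change your!"
```

Both versions are case-sensitive, so “Passwords are expired” stays unchanged.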
5.3.3 Combining strings
str_c() combines multiple strings together into a single one (similar
to base-R paste()). For instance, if we want to add “Subject:” to
each of the email subjects, then we can achieve it with
## [1] "Subject: new password" "Subject: new colleague"
## [3] "Subject: URGENT! Your paycheck!" "Subject: from HR"
## [5] "Subject: please change your passwords!" "Subject: Passwords are expired"
As you see, it combines two string vectors. One is “Subject:”
(length 1), and the other is subjects (length 6). It is done element-by-element,
i.e. “Subject:” will be added to each element of the second vector.
But sometimes we want to merge all individual elements into a single
one. This can be achieved with the collapse argument:
## [1] "new password / new colleague / URGENT! Your paycheck! / from HR / please change your passwords! / Passwords are expired"
This will convert the original string vector, subjects, into a
single string by combining all these elements and placing “ / ”
between each of them.
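Both uses of str_c() can be sketched as below. Passing the space through `sep = " "` is an assumption; writing "Subject: " with a trailing space would give the same result.

```r
library(stringr)

subjects <- c("new password", "new colleague",
              "URGENT! Your paycheck!",
              "from HR", "please change your passwords!",
              "Passwords are expired")

# Element-by-element: the length-1 string is recycled for all six subjects
with_prefix <- str_c("Subject:", subjects, sep = " ")

# collapse = merges the whole vector into one string
one_line <- str_c(subjects, collapse = " / ")

print(with_prefix[1])      # "Subject: new password"
print(length(one_line))    # 1
```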
5.4 Regular expressions
Regular expressions (aka regexps) are a way to describe patterns in text. These are very powerful tools, not unlike a separate programming language, to find and replace simple patterns in strings.
5.4.1 Basics of regexps
Regular expressions look a bit like ordinary strings. But if your function treats an argument as a regexp, not as an ordinary string, then some of the symbols have a different meaning. First we describe a few of the most common special characters.
- . (dot) means any single character. For instance, when we search the string "abc" for the pattern ".", we get
## [1] "abc"
Here the pattern "." is treated as a regular expression that matches any character. In particular, it also matches "a", and hence the string "abc" is detected by the function. Contrast this with the case where "." is treated not as a regexp but as a normal character:
## character(0)
- ? specifies a quantity: the preceding character must be there zero or one times. So one can detect
## [1] "hand" "hands"
The regular expression matches “hand” plus “s” zero or one time. So it can pick up both “hand” and “hands”.
- * is a somewhat similar quantity specifier: the preceding character must be there zero or more times. So we have
## [1] "hand" "hands" "handss" "handssss" "handx"
This pattern matches “hand” and any form of “handsss..”.
But it may be somewhat surprising to see that it also matches
“handx”. After all, "s*" is supposed to match any number of “s”-s,
not “x”-s. But it is simply how regular expressions work:
“handx” contains “hand” and zero “s”-s. So it matches "hands*"
after all, as the regexp says nothing about what will or will not
appear after the end of the regexp. End of regexp does not mean it
is end of the string! Compare with ordinary pattern matching:
## [1] "abc"
“b” will match “abc” because “abc” contains “b”… But if you want to ensure that nothing follows “hand”, then you need to use string edge markers:
- $ matches end of the string:
## [1] "ab"
will only detect “ab” because now we request “b” to be the last character in the string.
- ^, in a similar fashion, matches the beginning of the string:
## [1] "abc"
Only “abc” is detected because “a” must be in the first position.
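The special characters above can be tried out with str_subset(). The input vectors below are assumptions chosen for this sketch and are not necessarily the ones behind the outputs shown above:

```r
library(stringr)

# "." as a regexp matches any character; as a fixed string only a literal dot
dot_regex <- str_subset("abc", ".")            # "abc"
dot_fixed <- str_subset("abc", fixed("."))     # character(0)

# "?" means zero or one of the preceding character
optional_s <- str_subset(c("hand", "hands", "foot"), "hands?")

# "*" means zero or more; without anchors "handx" matches too
hands <- c("hand", "hands", "handss", "handssss", "handx")
unanchored <- str_subset(hands, "hands*")

# "^" and "$" tie the pattern to the beginning and the end of the string
anchored <- str_subset(hands, "^hands*$")      # "handx" is now excluded
ends_b   <- str_subset(c("ab", "abc"), "b$")   # "ab"
starts_a <- str_subset(c("abc", "bca"), "^a")  # "abc"

print(unanchored)
print(anchored)
```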
But what if one wants to match one of the special characters, e.g. a dot or dollar sign? For instance
## [1] "." "$"
## [1] "." "$"
will not work as intended. For this simple example, one can use fixed strings instead of regexp. But in general, the special symbols must be escaped, using backslash. And as backslash is a special symbol, we need to escape it with another backslash, resulting in a somewhat awkward double-backslash notation:
## [1] "."
## [1] "$"
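A sketch of the escaping problem and its two remedies; the vector name `symbols` is an assumption for this illustration:

```r
library(stringr)

symbols <- c(".", "$")

# Unescaped, both patterns match both strings:
# "." matches any character, "$" matches the end of any string
any_char   <- str_subset(symbols, ".")
end_anchor <- str_subset(symbols, "$")

# Escaped with a double backslash, the symbols are matched literally
only_dot    <- str_subset(symbols, "\\.")
only_dollar <- str_subset(symbols, "\\$")

# For simple cases a fixed (non-regexp) pattern avoids escaping altogether
fixed_dot <- str_subset(symbols, fixed("."))

print(only_dot)      # "."
print(only_dollar)   # "$"
```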
Example 5.1 Find all valid web addresses in the form http://www.example.com, https://www.example.com, http://www.example.com/, reject all other strings. We need to write a regexp that:
- starts with http
- followed by zero or one s
- followed by ://www.example.com
- followed by zero or one /
- and that must be the end of the string.
This can be achieved with "^https?://www\\.example\\.com/?$":
str_subset(c("http://www.example.com", "www.example.com",
             "https://www.example.com/", "http://www.example.com/index.html",
             "ftp://www.example.com"),
           "^https?://www\\.example\\.com/?$")
## [1] "http://www.example.com" "https://www.example.com/"
Explanation:
- the string must start with http: ^http
- it is followed by zero or one s: s?
- followed by ://www.example.com: ://www\\.example\\.com (note how we have escaped the dots)
- followed by zero or one /: /?
- and that must be the end of the string: $
5.4.2 Limitations
Regular expressions only work with “simple” patterns. They do not include any understanding of human languages or human grammar, and hence if you are looking for a “man”, regexps do not help you to find patterns like “guy” or “chap”.
See also
- R help for regular expressions: ?regexp