Chapter 4 Pipes and dplyr: the easy way of data manipulation
Base-R offers powerful and consistent syntax for indexing and manipulating data frames and other data structures. But the code easily turns messy and cumbersome for slightly more complex tasks. This is the main reason that dplyr functions (sometimes called verbs) have gained tremendous popularity. These verbs are connected with pipes, resulting in code that can be read and understood in the order it is written.
This section will introduce pipes, and thereafter the dplyr verbs.
4.1 R pipes
4.1.1 A motivational story: how to make pancakes
Let us start with an example from a totally different field. How to make pancakes? A recipe might look like this:
- Take 2 eggs
- Add sugar, whisk together
- Add milk, whisk
- Add flour, whisk
- Spread 3 tablespoons of batter on skillet
- Cook
Let us try to write computer code that bakes a few pancakes. We use the standard programming approach with functions and variables.
add(2*egg, sugar)
batter <- whisk(batter)
batter <- add(batter, milk)
batter <- whisk(batter)
batter <- ...
This code works but it is not very easy to read. The assignment to
batter
takes quite a bit of space and attention away from the main
message, the action verbs add
, whisk
, and so on, and also the
argument batter
makes it harder to see the ingredients. From
technical perspective,
we create (and
repeatedly overwrite) an unnecessary variable batter. After all,
we don’t care about batter but about cakes!
We can try to
replace it with nested functions:
whisk(add(whisk(add(2*egg, sugar)), milk))
batter <- whisk(add(batter, flour))
batter <- cook(spread(batter, 3*tablespoon)) cake <-
This approach is more compact but even harder to read. In particular, the order of tasks is not the same as order of the functions. While standard in math and in coding, one has to read nested functions from inside out (i.e. from right to left), while keeping eyes on the additional arguments like “flour” or “milk” and parenthesis.
Fortunately there is another way. Note that we have a central variable here, batter, and everything we do is to add and manipulate batter. So we can imagine writing the code along these lines:
add(2*egg, sugar) %>%
whisk() %>%
add(milk) %>%
whisk() %>%
add(flour) %>%
whisk() %>%
...
Here the %>%
operator must be understood as “take the value of
previous function and feed it into the next function”.
This approach has two major advantages:
- We can read the code in a logical order. It is almost the same as reading the recipe!
- We do not introduce temporary variables (batter) that we overwrite
This is the approach of pipes. %>%
is the pipe operator.
4.1.2 Magrittr pipes
The pipe %>%
is living in the magrittr package. However, if you
use the popular dplyr or tidyverse, it will be automatically
imported so there is no need to load magrittr separately.
Pipe channels the output of the previous function (or expression) to the first input of the next function. If we want to compute and print \(\sqrt{2}\), we can do it in the traditional way with nested functions:
print(sqrt(2))
## [1] 1.414214
But we can also use the pipes instead, feeding “2” first to sqrt
and
the result thereafter to print
:
2 %>%
sqrt() %>%
print()
## [1] 1.414214
In this example we had little use of piping, one may even argue that the nested function approach is more compact and easier to read. Indeed, this may be the case here. But this may not be true in case of more complex code as we saw with the pancake example above.
The piped data will normally be the first argument for the function after pipe. If that function call has more arguments, those will be treated as second, third, etc arguments of the function.
In RStudio, you can get the pipe with Ctrl-Shift-M
(on linux)
Exercise 4.1 Create a vector of numbers 1, 2, 3, 4. Compute its mean, and print
(this time use cat
) it followed by word “approximately” like
2.5 (approximately)
See the solution
By default, pipe forwards the previous result into the first argument of
the following function. If one prefers to use another argument, this
can be specified with the placeholder .
(dot) for the result. For
instance, we can modify the first example and print \(\sqrt{2}\) after
the text “the answer is”:
2 %>%
sqrt() %>%
cat("The answer is", ., "\n")
## The answer is 1.414214
Now the result of square root will be fed to cat
as its second
argument.
The data-forwarding pipe %>%
is the most popular one, but there are
more pipes, for instance
%T>%
: pivot data to another function%$%
: extract list components%<>%
: pipe-and-assign
Finally, the right-hand assignment ->
works very well together with
pipes.
4.1.3 Advantages of pipe-based approach
Pipes have become very popular when doing data analysis in R. The main reason for this is that it allows to wrote code in a way that closely resembles how humans think about solving the problems.
But in order to be able to use pipes efficiently, you need certain familiarity with how to solve such problems, and how to tranform solutions in “human language” into computer code. Such approach should be backed up by the appropriate coding tools as well.
4.2 Basics of dplyr package
dplyr is one of the most widely used data manipulation packages for R. It is based on five distinct “verbs” and a large number of auxiliary functions. It may simplify data manipulations tremendously, and it is explicitly designed with pipes in mind.
The five basic “verbs” (functions) are:
- select: select/unselect variables
- filter: filter observations
- mutate: compute on individual observations
- arrange: order observations
- summarize: aggregate observations
dplyr can also do grouped operations, for instance to compute summary statistics for each group of interest separately without any explicit loops.
In itself, the dplyr functions are similar to other functions. They are designed in a way that their first argument is always the dataset, and they always return a modified dataset. This is what makes these well-suited to be used with pipes. The following examples use titanic data:
read.csv("../data/titanic.csv.bz2") titanic <-
For instance, the select
function takes the data as the first argument, and a list of desired
variables
to keep as the following arguments. We can use the traditional
approach with nested functions:
head(select(titanic, pclass, survived))
## pclass survived
## 1 1 1
## 2 1 1
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 1
Or the pipe-approach:
%>%
titanic select(pclass, survived) %>%
head()
## pclass survived
## 1 1 1
## 2 1 1
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 1
dplyr is designed in a way that the second approach is almost always convenient and easier to read.
Another distinct trait of dplyr is that it accepts the variable names
without quotes. So select(titanic, pclass, survivied)
must be
understood as selecting variables pclass and survived from the
titanic
dataset. This is called non-standard evaluation and it is
an extremely convenient convention. However, non-standard evaluation
may be a hassle with indirect variable names, i.e. when the variable
names are stored in another variable.
4.2.1 Basic functions (verbs)
4.2.1.1 select
: select variables
select
selects variables in data. We stress here: it picks
variables, not observations (filter
does the latter). In its
simplest form, it takes a list of desired variables to keep, like in
the example above. It also has a number of helpers and other
goodies. Let us print all the variables in titanic data:
names(titanic)
## [1] "pclass" "survived" "name" "sex" "age" "sibsp"
## [7] "parch" "ticket" "fare" "cabin" "embarked" "boat"
## [13] "body" "home.dest"
One can create sequences with :
. For instance, let’s select all
variables from pclass
to sex
:
%>%
titanic select(pclass:sex) %>%
head(3)
## pclass survived name sex
## 1 1 1 Allen, Miss. Elisabeth Walton female
## 2 1 1 Allison, Master. Hudson Trevor male
## 3 1 0 Allison, Miss. Helen Loraine female
One can reverse selection by preceding it with !
. Different
selections can be combined using &
and |
for AND and OR. For
instance, lets exclude
name
from the previous example:
%>%
titanic select(pclass:sex & !name) %>%
head(3)
## pclass survived sex
## 1 1 1 female
## 2 1 1 male
## 3 1 0 female
Here we combine two selections: everything from pclass to sex
(i.e. pclass, survived, name, sex) and everything except
name. The AND operator tells that only those columns that are
present in both sides of it are to be preserved.
The result includes all variables from pclass
to sex
but removes
name
from this list.
There are many more helpers and ways to use select
, see more in
tidyverse
documentation.
Exercise 4.2 Select variables pclass, survived, sex and age in at least two different ways.
See the solution
4.2.1.2 filter
: filter observations
While select
selects certain variables, filter
filters out
selected observation based on desired criteria. It takes one (or
more) logical arguments that describe the filtering criteria. For
instance, let’s keep only data about male survivors:
%>%
titanic filter(sex == "male", survived == 1) %>%
head(3)
## pclass survived name sex age sibsp
## 1 1 1 Allison, Master. Hudson Trevor male 0.9167 1
## 2 1 1 Anderson, Mr. Harry male 48.0000 0
## 3 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0000 0
## parch ticket fare cabin embarked boat body
## 1 2 113781 151.55 C22 C26 S 11 NA
## 2 0 19952 26.55 E12 S 3 NA
## 3 0 27042 30.00 A23 S B NA
## home.dest
## 1 Montreal, PQ / Chesterville, ON
## 2 New York, NY
## 3 Hessle, Yorks
Note the syntax: the data variable names (here sex and survived)
are written without quotes. We have used two criteria, separated by
comma. This is equivalent to logical AND and we could have written
instead sex == "male" & survived == 1
to achieve the exact same
result.
The beauty of using pipe is that we can combine different functions in a very intuitive and easy way. Let’s repeat the same example, but before filtering the observations we now select the same few variables we did above:
%>%
titanic select(pclass:sex & !name) %>%
filter(sex == "male", survived == 1) %>%
head(3)
## pclass survived sex
## 1 1 1 male
## 2 1 1 male
## 3 1 1 male
As you see, this is as simple as copying the line from the select
example into this example. The code is also very easy to read.
4.2.1.3 mutate
: compute observation-wise
mutate
is the one of main functions to perform computations on data
(the other one is summarize
). This performs
computations observation-wise, unlike summarize
that reduces many
observations into a single number. It takes one (or more) arguments
that are normally expressions, involving the data variables. For
instance, let’s add a variable died to the titanic data that equals
true if survived = 0:
%>%
titanic select(pclass:sex & !name) %>%
mutate(died = survived == 0) %>%
head(3)
## pclass survived sex died
## 1 1 1 female FALSE
## 2 1 1 male FALSE
## 3 1 0 female TRUE
This example demonstrates several traits of mutate
: first, we create
a new logical variable using operation survived == 0
. Second, we
assign this result to a new data variable died. Finally, note that
as is the case with other dplyr functions, the data variable names are
used without quotes. mutate
accepts more than one argument in case
you want to compute more than one variable in a single mutate
call.
It also has a plethora of helper functions, e.g. row_number
for
sequential row number, dense_rank
for rank order of a variable, and
n
for total number of rows. Note that you do not have to create a
new variable with mutate, you can also overwrite an existing one.
Here is another example where we perform three computations: first we replace sex male/female coding with M/F, second we report age in months for those who were younger than one year, and finally we add another variable that tell the order (rank) of the person in terms of age (with the youngest one being number one). Note that in the previous example we did not select age, so we have to change our selection a little bit now:
%>%
titanic select(survived:age & !name) %>%
mutate(sex = ifelse(sex == "male", "M", "F"),
ageRank = dense_rank(age)) %>%
mutate(age = ifelse(age < 1, paste(round(age*12, 0), "months"), age)) %>%
head(3)
## survived sex age ageRank
## 1 1 F 29 45
## 2 1 M 11 months 7
## 3 0 F 2 9
Now sex is coded as M and F. We can also see that the second,
11 months old passenger, is 7th-youngest in the passenger list.
Note that we have included two mutate
functions next to each other
in this example. This is because we want to ensure that ageRank is
computed using age in years. While mutate
computes the arguments in
the given order so we could merge these two mutate
-s,
it is a good idea to add an additional safety layer
in our code.
4.2.1.4 arrange
: order observations
Ordering observations is typically used either just before the final output, or when we are interested in a few largest or smallest values. Let’s arrange the observations by age (we also select a few variables only for not to clutter the output):
%>%
titanic select(survived:age & !name) %>%
arrange(age) %>%
head(3)
## survived sex age
## 1 1 female 0.1667
## 2 0 male 0.3333
## 3 1 male 0.4167
Here we see the three youngest passengers on the liner.
As the other dplyr functons, arrange
accepts multiple arguments, the
ties in the first argument will be broken according to the second, and
so on. Another extremely useful helper function is desc
, it
transforms the arrangement into the descending order. Here is an
example where we use two variables, and arrange the cases in descending
order by age and increasing by pclass:
%>%
titanic select(pclass:age & !name) %>%
arrange(desc(age), pclass) %>%
head(8)
## pclass survived sex age
## 1 1 1 male 80.0
## 2 1 1 female 76.0
## 3 3 0 male 74.0
## 4 1 0 male 71.0
## 5 1 0 male 71.0
## 6 3 0 male 70.5
## 7 1 0 male 70.0
## 8 2 0 male 70.0
Now the passengers are arranged according to age from the oldest to the youngest, but in case of equal age, those with smaller pclass value is listed first (see the two 70-year old passengers).
4.2.1.5 summarize
: aggregate over observations
The final dplyr base function, summarize
, is used to aggregate
(reduce) the data to a limited number of figures only. Its
functionality may seem a little trivial, and indeed it is—until we
start working with grouped data.
In its simplest form, summarize
takes an argument that describes how
to aggregate a variable into a single value. It is typically and
expression that computes something using data variables. For instance, let’s
compute the mean age of the passengers:
%>%
titanic summarize(mean(age, na.rm=TRUE))
## mean(age, na.rm = TRUE)
## 1 29.88113
The average age is almost 30 years (but remember, there are many missings). Note the following:
summarize
removed all other variables, except age.- It also removed all observations. In fact it collapsed all the data into a single value, the mean age. As we did not tell what to do with other variables, it just left those out.
- in case of grouped data, it also adds group indicators, see below.
Let us do another example where we preserve somewhat more complex information besides the mean age—percentage of upper class passengers (1st and 2nd), and percentage of women:
%>%
titanic summarize(age = mean(age, na.rm=TRUE),
pctUpper = mean(pclass %in% c(1,2)),
pctWomen = mean(sex == "female"))
## age pctUpper pctWomen
## 1 29.88113 0.4583652 0.3559969
We learn over 45% of passengers were traveling in 1st or 2nd class, and slightly over 1/3 of all passengers were female. Note that here we have given the aggregated variables dedicated names, otherwise dplyr picks something it considers suitable.
Note summarize does not need to compute the aggregated value in the mathematical sense of the word. For instance, we can also use it to pull out the first value. This trick is quite useful for certain grouped operations. As an example, we take the first observation and create a (somewhat useless) summary using no data at all, position = 1:
%>%
titanic summarize(name = name[1], position=1)
## name position
## 1 Allen, Miss. Elisabeth Walton 1
Note that there are also dedicated functions to extract variables by
position: first
, last
and nth
.
4.2.2 group_by
: grouped operations
Grouped operations make dplyr data processing pipelines incredibly powerful (see more below). These can address questions like What is the average age for first/second class passengers? Were females more likely to travel in upper classes than males? and many similar ones.
The idea with grouped operations is that first the data will be grouped–each observation will be labeled according to which group it belongs. Thereafter the computations are done one these groups independently, as if the rest of the data is not there. Finally, the results from different groups are assembled together again. Grouping is relevant only for certain kind of computations, for those that gather information from several cases. This includes summary operations, such as mean, group-specific operations like n, and all ranking-related operations, such as first and lag. If the computation only involves information from the same observation (like what we did with age and survived above), grouping has no effect. Indeed, groups only matters if your “group members” affect your results somehow!
Let us start with a very simple example: compute mean age for male and female passengers:
%>%
titanic group_by(sex) %>%
summarize(age = mean(age, na.rm=TRUE))
## # A tibble: 2 × 2
## sex age
## <chr> <dbl>
## 1 female 28.7
## 2 male 30.6
This example demonstrates the very basic usage of grouping using
group_by
. The dataset is effectively split into male and female
subset, summarize
is used separately on both of these subsets
exactly as above, and thereafter the two summaries
are combined together into a data frame. As we want to keep track
about which value corresponds to which group, the group label is also
added to the final data frame. So we learn that in average, women
were approximately 2 years younger than men.
We can group on multiple variables simultaneously. Let’s find survival rate by sex and passenger class:
%>%
titanic group_by(sex, pclass) %>%
summarize(survival = mean(survived)*100)
## # A tibble: 6 × 3
## # Groups: sex [2]
## sex pclass survival
## <chr> <int> <dbl>
## 1 female 1 96.5
## 2 female 2 88.7
## 3 female 3 49.1
## 4 male 1 34.1
## 5 male 2 14.6
## 6 male 3 15.2
Now we compute the mean survival percentage for both men and women in each class. The data shows what is already well-known–women and upper class passengers had much higher chances to survive.
Grouped operations do not have to involve reducing by summarize
, one
can also use groups when arranging, filtering and otherwise
computing. See examples below.
4.2.3 Combining dplyr functions in a single pipeline
This section discusses more complex and some of the more advanced aspects of dplyr pipelines. It has two-fold goal:
- To introduce some of the helper functions
- The teach you to think in terms of pipelines and data operations.
Instead of titanic data, we use babynames dataset from package babynames. We also limit the analysis to a subset to babies form from 1995-2003 and remove variable prop (percentage in a given year) as we do not use it:
babynames::babynames %>%
babynames <- select(!prop) %>%
filter(between(year, 1995, 2003))
Note the use of between
, a helper for filter
for multi-way
comparison.
Let us first take a quick look at the data:
%>%
babynames head()
## # A tibble: 6 × 4
## year sex name n
## <dbl> <chr> <chr> <int>
## 1 1995 F Jessica 27935
## 2 1995 F Ashley 26602
## 3 1995 F Emily 24380
## 4 1995 F Samantha 21645
## 5 1995 F Sarah 21376
## 6 1995 F Taylor 20425
The dataset includes 5 variables. n is the number of times the name was given in that year, e.g. Jessica was given 27935 times to girls born in 1995. In all, we have 257707 observations.
Let us start with a very simple task: what was the most popular boy
and girl name during this period? This necessarily involves ordering,
but we can use the dedicated helper functions rank
and desc
instead of
explicit arrange
:
%>%
babynames group_by(sex) %>%
filter(rank(desc(n)) == 1)
## # A tibble: 2 × 4
## # Groups: sex [2]
## year sex name n
## <dbl> <chr> <chr> <int>
## 1 1995 F Jessica 27935
## 2 1995 M Michael 41402
The filter
command does the following: first, it implicitly arranges
the observations by n in a descending order. Thereafter it filters
only observations with rank 1, i.e. the largest n value. This
leaves us with a single girl name and a single boy name. These are
the most popular ones.
Exercise 4.3 Find the longest boy and girl names in the data (longest in the sense of number of characters).
Hint: you can use nchar(name)
to find the length of the name (in
characters).
See the solution
It is very easy to generalize this and print the most popular boy and girl names for each year in our subset–we just have to add grouping by year:
%>%
babynames group_by(year, sex) %>%
filter(rank(desc(n)) == 1)
## # A tibble: 18 × 4
## # Groups: year, sex [18]
## year sex name n
## <dbl> <chr> <chr> <int>
## 1 1995 F Jessica 27935
## 2 1995 M Michael 41402
## 3 1996 F Emily 25151
## 4 1996 M Michael 38365
## 5 1997 F Emily 25732
## 6 1997 M Michael 37548
## 7 1998 F Emily 26181
## 8 1998 M Michael 36614
## 9 1999 F Emily 26539
## 10 1999 M Jacob 35361
## 11 2000 F Emily 25953
## 12 2000 M Jacob 34471
## 13 2001 F Emily 25055
## 14 2001 M Jacob 32541
## 15 2002 F Emily 24463
## 16 2002 M Jacob 30568
## 17 2003 F Emily 25688
## 18 2003 M Jacob 29630
We see that the most popular names remain largely the same over such a short period of time. Note that as we group first by year, the results are also first arranged by year and thereafter by sex.
Now let’s ask a somewhat different question: how popular (at which
rank) was name “Linda” in each year through this time period? In
order to answer this question we need
to create a new variable popularity, and we use the helper function
row_number
:
%>%
babynames group_by(year) %>%
arrange(desc(n)) %>%
mutate(popularity = row_number()) %>%
# at which place is every name at each year
filter(name == "Linda")
## # A tibble: 9 × 5
## # Groups: year [9]
## year sex name n popularity
## <dbl> <chr> <chr> <int> <int>
## 1 1995 F Linda 1234 467
## 2 1997 F Linda 1098 520
## 3 1996 F Linda 987 575
## 4 1998 F Linda 971 604
## 5 1999 F Linda 900 651
## 6 2000 F Linda 850 687
## 7 2001 F Linda 837 685
## 8 2002 F Linda 769 736
## 9 2003 F Linda 741 771
We see that popularity of “Linda” has slowly decreased, from mid 400th down to almost 800th place. Note that we have not selected by sex, so the rank also takes into account the boy names that are more popular than “Linda” in that year.
But here is even more complex question: in each year, what are the
most popular names that were given to both boys and girls? How can
we approach this? It actually turns out to be a fairly simple task:
namely, note that for
each year, each name is listed one or two times only, depending on
whether it was given to only boys/only girls, or for both boys and girls.
Hence what we need to do is to group by year and name, and find the
group size. This can be done using helper n
. We also need the
total popularity (which is confusingly also called n in these data).
When done this, we can re-group total popularity by year, and for each
year find the largest value of it:
%>%
babynames group_by(year, name) %>%
filter(n() == 2) %>%
# only preserve group size 2:
# both boy and girl
summarize(n = sum(n)) %>%
# aggregate boy/girl counts
group_by(year) %>%
# now we need year groups for ranking
filter(rank(desc(n)) == 1)
## # A tibble: 9 × 3
## # Groups: year [9]
## year name n
## <dbl> <chr> <int>
## 1 1995 Michael 41569
## 2 1996 Michael 38522
## 3 1997 Michael 37690
## 4 1998 Michael 36742
## 5 1999 Jacob 35424
## 6 2000 Jacob 34530
## 7 2001 Jacob 32579
## 8 2002 Jacob 30609
## 9 2003 Jacob 29677
We can see that these are the most popular boy names Michael and Jacob. For a quick check, let’s see what records are there for Michael in 1995:
%>%
babynames filter(year == 1995, name == "Michael")
## # A tibble: 2 × 4
## year sex name n
## <dbl> <chr> <chr> <int>
## 1 1995 F Michael 167
## 2 1995 M Michael 41402
Indeed, we can see that there are 167 baby girls named “Michael”.
Now let’s take an even more complex task. Which names have been among the 10 most popular names for a longest consecutive time period for both boys and girls? And when was that time period?
Let’s approach this question step-by-step. As the first thing, we need to find the top-10 most popular names for boys and girls and for each years. So we just group by sex and year, and filter only names that have rank no more than 10:
%>%
babynames group_by(sex, year) %>%
# do it separately for sex, year
filter(rank(desc(n)) <= 10) %>%
# only preserve cases that are in top-10
head()
## # A tibble: 6 × 4
## # Groups: sex, year [1]
## year sex name n
## <dbl> <chr> <chr> <int>
## 1 1995 F Jessica 27935
## 2 1995 F Ashley 26602
## 3 1995 F Emily 24380
## 4 1995 F Samantha 21645
## 5 1995 F Sarah 21376
## 6 1995 F Taylor 20425
As the next step we take each of these names, and order these by year. In this way we can see if any of the years are missing:
%>%
babynames group_by(sex, year) %>%
filter(rank(desc(n)) <= 10) %>%
group_by(sex, name) %>%
# we allow the same name be given to both
# boys and girls
arrange(year, .by_group=TRUE) %>%
# order each name by year _within_ groups
head()
## # A tibble: 6 × 4
## # Groups: sex, name [2]
## year sex name n
## <dbl> <chr> <chr> <int>
## 1 2001 F Abigail 14807
## 2 2002 F Abigail 15297
## 3 2003 F Abigail 15925
## 4 1996 F Alexis 16589
## 5 1997 F Alexis 17170
## 6 1998 F Alexis 19778
Note that arrange
tends to ignore groups, unless told not to by
.by_group=TRUE
.
It would be trivial to answer a slightly different question–how many
years has the name been in top-10. We could do just be grouping by
name and using the helper n
. But that would ignore potential gap
years when the name was less popular.
Instead, we can test if the years are consecutive by simply subtracting
the previous year from the current year. If the difference is 1, then
the years are consecutive.
This can be done using the
lag
helper function:
%>%
babynames group_by(sex, year) %>%
filter(rank(desc(n)) <= 10) %>%
group_by(sex, name) %>%
arrange(year, .by_group=TRUE) %>%
mutate(difference = year - lag(year)) %>%
# compute difference b/w current year
# and the previous year the name was in
# top-10
head()
## # A tibble: 6 × 5
## # Groups: sex, name [2]
## year sex name n difference
## <dbl> <chr> <chr> <int> <dbl>
## 1 2001 F Abigail 14807 NA
## 2 2002 F Abigail 15297 1
## 3 2003 F Abigail 15925 1
## 4 1996 F Alexis 16589 NA
## 5 1997 F Alexis 17170 1
## 6 1998 F Alexis 19778 1
We can see that e.g. Abigail is present in 2001, 2002, and 2003. The difference between two consecutive appearances is 1 year, except for the first time when it appears, because there is not previous year and the result is undefined.
Now we want to mark the consecutive spells of presence of the name
somehow. In particular, if the name is missing for some years in the
middle, we want to use that missing time to split the presence into
two successive spells. We can detect all such differences by just
looking for difference > 1. We also want to count how many such
breaks we have seen, this can be done with cumsum
. But first we
need to handle the missing values for difference. We’ll replace
those with 0:
%>%
babynames group_by(sex, year) %>%
filter(rank(desc(n)) <= 10) %>%
group_by(sex, name) %>%
arrange(year, .by_group=TRUE) %>%
mutate(difference = year - lag(year)) %>%
mutate(difference = ifelse(is.na(difference), 0, difference)) %>%
# replace missing difference values
# cumsum cannot handle missings
mutate(gap = difference > 1) %>%
# do we have a break in the sequence
# of name appearance?
mutate(spell = cumsum(gap)) %>%
# how many of such breaks have we had?
head()
## # A tibble: 6 × 7
## # Groups: sex, name [2]
## year sex name n difference gap spell
## <dbl> <chr> <chr> <int> <dbl> <lgl> <int>
## 1 2001 F Abigail 14807 0 FALSE 0
## 2 2002 F Abigail 15297 1 FALSE 0
## 3 2003 F Abigail 15925 1 FALSE 0
## 4 1996 F Alexis 16589 0 FALSE 0
## 5 1997 F Alexis 17170 1 FALSE 0
## 6 1998 F Alexis 19778 1 FALSE 0
We use the fact that cumsum
simply increments the spell number every
time where difference > 1. So we essentially count spells, and
because we have grouped by name, we do it separately for each name.
Now we have done all the preparations: for each name, and for each year we know which spell—consecutive appearance in top 10—it falls into. We just have to find the longest spell for each name. Again, we group by sex, name, but now also by the spell, because in this way we can easily compute the length of the spell—it is just the number of cases there. Because the data is arrange by year it is easy to find the first and last year for each spell. And finally, for each name we pick the longest such spell:
%>%
babynames group_by(sex, year) %>%
filter(rank(desc(n)) <= 10) %>%
group_by(sex, name) %>%
arrange(year, .by_group=TRUE) %>%
mutate(difference = year - lag(year)) %>%
mutate(difference = ifelse(is.na(difference), 0, difference)) %>%
mutate(gap = difference > 1) %>%
mutate(spell = cumsum(gap)) %>%
group_by(sex, name, spell) %>%
# work on individual spells
summarize(length = n(), start = first(year), end=last(year)) %>%
# compute length, start/end year of the spell
# note: we collapse data now to spells
group_by(sex, name) %>%
# now look at each sex/name only...
filter(rank(desc(length)) == 1) %>%
# and pick only the longest spell
arrange(desc(length)) %>%
# and output it by length
head(10)
## # A tibble: 10 × 6
## # Groups: sex, name [10]
## sex name spell length start end
## <chr> <chr> <int> <int> <dbl> <dbl>
## 1 F Ashley 0 9 1995 2003
## 2 F Emily 0 9 1995 2003
## 3 F Hannah 0 9 1995 2003
## 4 F Samantha 0 9 1995 2003
## 5 M Christopher 0 9 1995 2003
## 6 M Jacob 0 9 1995 2003
## 7 M Joshua 0 9 1995 2003
## 8 M Matthew 0 9 1995 2003
## 9 M Michael 0 9 1995 2003
## 10 F Alexis 0 8 1996 2003
We can see that 9 names have been in top-10 through every year of our 9-year observation period.
For illustration, let’s repeat the same code on the full babynames dataset, from 1880 till 2017, and write the code in a slightly more efficient manner:
::babynames %>%
babynames group_by(sex, year) %>%
filter(rank(desc(n)) <= 10) %>%
group_by(sex, name) %>%
arrange(year, .by_group=TRUE) %>%
mutate(difference = ifelse(is.na(lag(year)), 0,
- lag(year))) %>%
year # remove missings in the mutate step
mutate(spell = cumsum(difference > 1)) %>%
# we do not need the auxiliary variable gap
# we can just cumsum(difference > 1)
group_by(sex, name, spell) %>%
summarize(length = n(), start = first(year), end=last(year)) %>%
group_by(sex, name) %>%
filter(rank(desc(length)) == 1) %>%
arrange(desc(length)) %>%
head(20)
## # A tibble: 20 × 6
## # Groups: sex, name [20]
## sex name spell length start end
## <chr> <chr> <int> <int> <dbl> <dbl>
## 1 M James 0 113 1880 1992
## 2 M John 0 107 1880 1986
## 3 M Robert 2 102 1888 1989
## 4 M William 0 96 1880 1975
## 5 F Mary 0 92 1880 1971
## 6 M Charles 0 75 1880 1954
## 7 M Michael 0 74 1943 2016
## 8 F Margaret 0 60 1880 1939
## 9 M George 0 58 1880 1937
## 10 M David 0 57 1936 1992
## 11 M Joseph 0 56 1880 1935
## 12 M Richard 0 51 1920 1970
## 13 F Helen 0 45 1891 1935
## 14 F Elizabeth 0 44 1880 1923
## 15 M Christopher 0 43 1967 2009
## 16 M Frank 0 43 1880 1922
## 17 F Anna 0 41 1880 1920
## 18 F Ruth 0 39 1892 1930
## 19 F Patricia 0 38 1929 1966
## 20 M Matthew 0 37 1972 2008
As we can see, James is leading the pack with a 113-years long spell from 1880 till 1992.
We hope these examples illustrate how the basic dplyr-style thinking, combined with certain helper functions, allows us to address fairly complex problems in a reasonably straightforward fashion. Many problems can be disassembled into grouping, filtering and ordering steps, and when it pays off to be able to be able to build workflows from these building blocks.