Chapter 4 Pipes and dplyr: the easy way of data manipulation

Base-R offers powerful and consistent syntax for indexing and manipulating data frames and other data structures. But the code easily turns messy and cumbersome for slightly more complex tasks. This is the main reason that dplyr functions (sometimes called verbs) have gained tremendous popularity. These verbs are connected with pipes, resulting in code that can be read and understood in the order it is written.

This section will introduce pipes, and thereafter the dplyr verbs.

4.1 R pipes

4.1.1 A motivational story: how to make pancakes

Let us start with an example from a totally different field. How to make pancakes? A recipe might look like this:

  1. Take 2 eggs
  2. Add sugar, whisk together
  3. Add milk, whisk
  4. Add flour, whisk
  5. Spread 3 tablespoons of batter on skillet
  6. Cook

Let us try to write computer code that bakes a few pancakes. We use the standard programming approach with functions and variables.

batter <- add(2*egg, sugar)
batter <- whisk(batter)
batter <- add(batter, milk)
batter <- whisk(batter)
...

This code works but it is not very easy to read. The assignment to batter takes quite a bit of space and attention away from the main message, the action verbs add, whisk, and so on, and also the argument batter makes it harder to see the ingredients. From technical perspective, we create (and repeatedly overwrite) an unnecessary variable batter. After all, we don’t care about batter but about cakes! We can try to replace it with nested functions:

batter <- whisk(add(whisk(add(2*egg, sugar)), milk))
batter <- whisk(add(batter, flour))
cake <- cook(spread(batter, 3*tablespoon))

This approach is more compact but even harder to read. In particular, the order of tasks is not the same as order of the functions. While standard in math and in coding, one has to read nested functions from inside out (i.e. from right to left), while keeping eyes on the additional arguments like “flour” or “milk” and parenthesis.

Fortunately there is another way. Note that we have a central variable here, batter, and everything we do is to add and manipulate batter. So we can imagine writing the code along these lines:

add(2*egg, sugar) %>%
   whisk() %>%
   add(milk) %>%
   whisk() %>%
   add(flour) %>%
   whisk() %>%
   ...

Here the %>% operator must be understood as “take the value of previous function and feed it into the next function”.

This approach has two major advantages:

  • We can read the code in a logical order. It is almost the same as reading the recipe!
  • We do not introduce temporary variables (batter) that we overwrite

This is the approach of pipes. %>% is the pipe operator.

4.1.2 Magrittr pipes

The pipe %>% is living in the magrittr package. However, if you use the popular dplyr or tidyverse, it will be automatically imported so there is no need to load magrittr separately.

Pipe channels the output of the previous function (or expression) to the first input of the next function. If we want to compute and print \(\sqrt{2}\), we can do it in the traditional way with nested functions:

print(sqrt(2))
## [1] 1.414214

But we can also use the pipes instead, feeding “2” first to sqrt and the result thereafter to print:

2 %>%
   sqrt() %>%
   print()
## [1] 1.414214

In this example we had little use of piping, one may even argue that the nested function approach is more compact and easier to read. Indeed, this may be the case here. But this may not be true in case of more complex code as we saw with the pancake example above.

The piped data will normally be the first argument for the function after pipe. If that function call has more arguments, those will be treated as second, third, etc arguments of the function.

In RStudio, you can get the pipe with Ctrl-Shift-M (on linux)

Exercise 4.1 Create a vector of numbers 1, 2, 3, 4. Compute its mean, and print (this time use cat) it followed by word “approximately” like

2.5 (approximately)

See the solution

By default, pipe forwards the previous result into the first argument of the following function. If one prefers to use another argument, this can be specified with the placeholder . (dot) for the result. For instance, we can modify the first example and print \(\sqrt{2}\) after the text “the answer is”:

2 %>%
   sqrt() %>%
   cat("The answer is", ., "\n")
## The answer is 1.414214

Now the result of square root will be fed to cat as its second argument.

The data-forwarding pipe %>% is the most popular one, but there are more pipes, for instance

  • %T>%: pivot data to another function
  • %$%: extract list components
  • %<>%: pipe-and-assign

Finally, the right-hand assignment -> works very well together with pipes.

4.1.3 Advantages of pipe-based approach

Pipes have become very popular when doing data analysis in R. The main reason for this is that it allows to wrote code in a way that closely resembles how humans think about solving the problems.

But in order to be able to use pipes efficiently, you need certain familiarity with how to solve such problems, and how to tranform solutions in “human language” into computer code. Such approach should be backed up by the appropriate coding tools as well.

4.2 Basics of dplyr package

dplyr is one of the most widely used data manipulation packages for R. It is based on five distinct “verbs” and a large number of auxiliary functions. It may simplify data manipulations tremendously, and it is explicitly designed with pipes in mind.

The five basic “verbs” (functions) are:

  • select: select/unselect variables
  • filter: filter observations
  • mutate: compute on individual observations
  • arrange: order observations
  • summarize: aggregate observations

dplyr can also do grouped operations, for instance to compute summary statistics for each group of interest separately without any explicit loops.

In itself, the dplyr functions are similar to other functions. They are designed in a way that their first argument is always the dataset, and they always return a modified dataset. This is what makes these well-suited to be used with pipes. The following examples use titanic data:

titanic <- read.csv("../data/titanic.csv.bz2")

For instance, the select function takes the data as the first argument, and a list of desired variables to keep as the following arguments. We can use the traditional approach with nested functions:

head(select(titanic, pclass, survived))
##   pclass survived
## 1      1        1
## 2      1        1
## 3      1        0
## 4      1        0
## 5      1        0
## 6      1        1

Or the pipe-approach:

titanic %>%
   select(pclass, survived) %>%
   head()
##   pclass survived
## 1      1        1
## 2      1        1
## 3      1        0
## 4      1        0
## 5      1        0
## 6      1        1

dplyr is designed in a way that the second approach is almost always convenient and easier to read.

Another distinct trait of dplyr is that it accepts the variable names without quotes. So select(titanic, pclass, survivied) must be understood as selecting variables pclass and survived from the titanic dataset. This is called non-standard evaluation and it is an extremely convenient convention. However, non-standard evaluation may be a hassle with indirect variable names, i.e. when the variable names are stored in another variable.

4.2.1 Basic functions (verbs)

4.2.1.1 select: select variables

select selects variables in data. We stress here: it picks variables, not observations (filter does the latter). In its simplest form, it takes a list of desired variables to keep, like in the example above. It also has a number of helpers and other goodies. Let us print all the variables in titanic data:

names(titanic)
##  [1] "pclass"    "survived"  "name"      "sex"       "age"       "sibsp"    
##  [7] "parch"     "ticket"    "fare"      "cabin"     "embarked"  "boat"     
## [13] "body"      "home.dest"

One can create sequences with :. For instance, let’s select all variables from pclass to sex:

titanic %>%
   select(pclass:sex) %>%
   head(3)
##   pclass survived                           name    sex
## 1      1        1  Allen, Miss. Elisabeth Walton female
## 2      1        1 Allison, Master. Hudson Trevor   male
## 3      1        0   Allison, Miss. Helen Loraine female

One can reverse selection by preceding it with !. Different selections can be combined using & and | for AND and OR. For instance, lets exclude name from the previous example:

titanic %>%
   select(pclass:sex & !name) %>%
   head(3)
##   pclass survived    sex
## 1      1        1 female
## 2      1        1   male
## 3      1        0 female

Here we combine two selections: everything from pclass to sex (i.e. pclass, survived, name, sex) and everything except name. The AND operator tells that only those columns that are present in both sides of it are to be preserved. The result includes all variables from pclass to sex but removes name from this list.

There are many more helpers and ways to use select, see more in tidyverse documentation.

Exercise 4.2 Select variables pclass, survived, sex and age in at least two different ways.

See the solution

4.2.1.2 filter: filter observations

While select selects certain variables, filter filters out selected observation based on desired criteria. It takes one (or more) logical arguments that describe the filtering criteria. For instance, let’s keep only data about male survivors:

titanic %>%
   filter(sex == "male", survived == 1) %>%
   head(3)
##   pclass survived                                 name  sex     age sibsp
## 1      1        1       Allison, Master. Hudson Trevor male  0.9167     1
## 2      1        1                  Anderson, Mr. Harry male 48.0000     0
## 3      1        1 Barkworth, Mr. Algernon Henry Wilson male 80.0000     0
##   parch ticket   fare   cabin embarked boat body
## 1     2 113781 151.55 C22 C26        S   11   NA
## 2     0  19952  26.55     E12        S    3   NA
## 3     0  27042  30.00     A23        S    B   NA
##                         home.dest
## 1 Montreal, PQ / Chesterville, ON
## 2                    New York, NY
## 3                   Hessle, Yorks

Note the syntax: the data variable names (here sex and survived) are written without quotes. We have used two criteria, separated by comma. This is equivalent to logical AND and we could have written instead sex == "male" & survived == 1 to achieve the exact same result.

The beauty of using pipe is that we can combine different functions in a very intuitive and easy way. Let’s repeat the same example, but before filtering the observations we now select the same few variables we did above:

titanic %>%
   select(pclass:sex & !name) %>%
   filter(sex == "male", survived == 1) %>%
   head(3)
##   pclass survived  sex
## 1      1        1 male
## 2      1        1 male
## 3      1        1 male

As you see, this is as simple as copying the line from the select example into this example. The code is also very easy to read.

4.2.1.3 mutate: compute observation-wise

mutate is the one of main functions to perform computations on data (the other one is summarize). This performs computations observation-wise, unlike summarize that reduces many observations into a single number. It takes one (or more) arguments that are normally expressions, involving the data variables. For instance, let’s add a variable died to the titanic data that equals true if survived = 0:

titanic %>%
   select(pclass:sex & !name) %>%
   mutate(died = survived == 0) %>%
   head(3)
##   pclass survived    sex  died
## 1      1        1 female FALSE
## 2      1        1   male FALSE
## 3      1        0 female  TRUE

This example demonstrates several traits of mutate: first, we create a new logical variable using operation survived == 0. Second, we assign this result to a new data variable died. Finally, note that as is the case with other dplyr functions, the data variable names are used without quotes. mutate accepts more than one argument in case you want to compute more than one variable in a single mutate call. It also has a plethora of helper functions, e.g. row_number for sequential row number, dense_rank for rank order of a variable, and n for total number of rows. Note that you do not have to create a new variable with mutate, you can also overwrite an existing one.

Here is another example where we perform three computations: first we replace sex male/female coding with M/F, second we report age in months for those who were younger than one year, and finally we add another variable that tell the order (rank) of the person in terms of age (with the youngest one being number one). Note that in the previous example we did not select age, so we have to change our selection a little bit now:

titanic %>%
   select(survived:age & !name) %>%
   mutate(sex = ifelse(sex == "male", "M", "F"),
          ageRank = dense_rank(age)) %>%
   mutate(age = ifelse(age < 1, paste(round(age*12, 0), "months"), age)) %>%
   head(3)
##   survived sex       age ageRank
## 1        1   F        29      45
## 2        1   M 11 months       7
## 3        0   F         2       9

Now sex is coded as M and F. We can also see that the second, 11 months old passenger, is 7th-youngest in the passenger list. Note that we have included two mutate functions next to each other in this example. This is because we want to ensure that ageRank is computed using age in years. While mutate computes the arguments in the given order so we could merge these two mutate-s, it is a good idea to add an additional safety layer in our code.

4.2.1.4 arrange: order observations

Ordering observations is typically used either just before the final output, or when we are interested in a few largest or smallest values. Let’s arrange the observations by age (we also select a few variables only for not to clutter the output):

titanic %>%
   select(survived:age & !name) %>%
   arrange(age) %>%
   head(3)
##   survived    sex    age
## 1        1 female 0.1667
## 2        0   male 0.3333
## 3        1   male 0.4167

Here we see the three youngest passengers on the liner.

As the other dplyr functons, arrange accepts multiple arguments, the ties in the first argument will be broken according to the second, and so on. Another extremely useful helper function is desc, it transforms the arrangement into the descending order. Here is an example where we use two variables, and arrange the cases in descending order by age and increasing by pclass:

titanic %>%
   select(pclass:age & !name) %>%
   arrange(desc(age), pclass) %>%
   head(8)
##   pclass survived    sex  age
## 1      1        1   male 80.0
## 2      1        1 female 76.0
## 3      3        0   male 74.0
## 4      1        0   male 71.0
## 5      1        0   male 71.0
## 6      3        0   male 70.5
## 7      1        0   male 70.0
## 8      2        0   male 70.0

Now the passengers are arranged according to age from the oldest to the youngest, but in case of equal age, those with smaller pclass value is listed first (see the two 70-year old passengers).

4.2.1.5 summarize: aggregate over observations

The final dplyr base function, summarize, is used to aggregate (reduce) the data to a limited number of figures only. Its functionality may seem a little trivial, and indeed it is—until we start working with grouped data.

In its simplest form, summarize takes an argument that describes how to aggregate a variable into a single value. It is typically and expression that computes something using data variables. For instance, let’s compute the mean age of the passengers:

titanic %>%
   summarize(mean(age, na.rm=TRUE))
##   mean(age, na.rm = TRUE)
## 1                29.88113

The average age is almost 30 years (but remember, there are many missings). Note the following:

  • summarize removed all other variables, except age.
  • It also removed all observations. In fact it collapsed all the data into a single value, the mean age. As we did not tell what to do with other variables, it just left those out.
  • in case of grouped data, it also adds group indicators, see below.

Let us do another example where we preserve somewhat more complex information besides the mean age—percentage of upper class passengers (1st and 2nd), and percentage of women:

titanic %>%
   summarize(age = mean(age, na.rm=TRUE),
             pctUpper = mean(pclass %in% c(1,2)),
             pctWomen = mean(sex == "female"))
##        age  pctUpper  pctWomen
## 1 29.88113 0.4583652 0.3559969

We learn over 45% of passengers were traveling in 1st or 2nd class, and slightly over 1/3 of all passengers were female. Note that here we have given the aggregated variables dedicated names, otherwise dplyr picks something it considers suitable.

Note summarize does not need to compute the aggregated value in the mathematical sense of the word. For instance, we can also use it to pull out the first value. This trick is quite useful for certain grouped operations. As an example, we take the first observation and create a (somewhat useless) summary using no data at all, position = 1:

titanic %>%
   summarize(name = name[1], position=1)
##                            name position
## 1 Allen, Miss. Elisabeth Walton        1

Note that there are also dedicated functions to extract variables by position: first, last and nth.

4.2.2 group_by: grouped operations

Grouped operations make dplyr data processing pipelines incredibly powerful (see more below). These can address questions like What is the average age for first/second class passengers? Were females more likely to travel in upper classes than males? and many similar ones.

The idea with grouped operations is that first the data will be grouped–each observation will be labeled according to which group it belongs. Thereafter the computations are done one these groups independently, as if the rest of the data is not there. Finally, the results from different groups are assembled together again. Grouping is relevant only for certain kind of computations, for those that gather information from several cases. This includes summary operations, such as mean, group-specific operations like n, and all ranking-related operations, such as first and lag. If the computation only involves information from the same observation (like what we did with age and survived above), grouping has no effect. Indeed, groups only matters if your “group members” affect your results somehow!

Let us start with a very simple example: compute mean age for male and female passengers:

titanic %>%
   group_by(sex) %>%
   summarize(age = mean(age, na.rm=TRUE))
## # A tibble: 2 × 2
##   sex      age
##   <chr>  <dbl>
## 1 female  28.7
## 2 male    30.6

This example demonstrates the very basic usage of grouping using group_by. The dataset is effectively split into male and female subset, summarize is used separately on both of these subsets exactly as above, and thereafter the two summaries are combined together into a data frame. As we want to keep track about which value corresponds to which group, the group label is also added to the final data frame. So we learn that in average, women were approximately 2 years younger than men.

We can group on multiple variables simultaneously. Let’s find survival rate by sex and passenger class:

titanic %>%
   group_by(sex, pclass) %>%
   summarize(survival = mean(survived)*100)
## # A tibble: 6 × 3
## # Groups:   sex [2]
##   sex    pclass survival
##   <chr>   <int>    <dbl>
## 1 female      1     96.5
## 2 female      2     88.7
## 3 female      3     49.1
## 4 male        1     34.1
## 5 male        2     14.6
## 6 male        3     15.2

Now we compute the mean survival percentage for both men and women in each class. The data shows what is already well-known–women and upper class passengers had much higher chances to survive.

Grouped operations do not have to involve reducing by summarize, one can also use groups when arranging, filtering and otherwise computing. See examples below.

4.2.3 Combining dplyr functions in a single pipeline

This section discusses more complex and some of the more advanced aspects of dplyr pipelines. It has two-fold goal:

  • To introduce some of the helper functions
  • The teach you to think in terms of pipelines and data operations.

Instead of titanic data, we use babynames dataset from package babynames. We also limit the analysis to a subset to babies form from 1995-2003 and remove variable prop (percentage in a given year) as we do not use it:

babynames <- babynames::babynames %>%
   select(!prop) %>%
   filter(between(year, 1995, 2003))

Note the use of between, a helper for filter for multi-way comparison.

Let us first take a quick look at the data:

babynames %>%
   head()
## # A tibble: 6 × 4
##    year sex   name         n
##   <dbl> <chr> <chr>    <int>
## 1  1995 F     Jessica  27935
## 2  1995 F     Ashley   26602
## 3  1995 F     Emily    24380
## 4  1995 F     Samantha 21645
## 5  1995 F     Sarah    21376
## 6  1995 F     Taylor   20425

The dataset includes 5 variables. n is the number of times the name was given in that year, e.g. Jessica was given 27935 times to girls born in 1995. In all, we have 257707 observations.

Let us start with a very simple task: what was the most popular boy and girl name during this period? This necessarily involves ordering, but we can use the dedicated helper functions rank and desc instead of explicit arrange:

babynames %>%
   group_by(sex) %>%
   filter(rank(desc(n)) == 1)
## # A tibble: 2 × 4
## # Groups:   sex [2]
##    year sex   name        n
##   <dbl> <chr> <chr>   <int>
## 1  1995 F     Jessica 27935
## 2  1995 M     Michael 41402

The filter command does the following: first, it implicitly arranges the observations by n in a descending order. Thereafter it filters only observations with rank 1, i.e. the largest n value. This leaves us with a single girl name and a single boy name. These are the most popular ones.

Exercise 4.3 Find the longest boy and girl names in the data (longest in the sense of number of characters).

Hint: you can use nchar(name) to find the length of the name (in characters).

See the solution

It is very easy to generalize this and print the most popular boy and girl names for each year in our subset–we just have to add grouping by year:

babynames %>%
   group_by(year, sex) %>%
   filter(rank(desc(n)) == 1)
## # A tibble: 18 × 4
## # Groups:   year, sex [18]
##     year sex   name        n
##    <dbl> <chr> <chr>   <int>
##  1  1995 F     Jessica 27935
##  2  1995 M     Michael 41402
##  3  1996 F     Emily   25151
##  4  1996 M     Michael 38365
##  5  1997 F     Emily   25732
##  6  1997 M     Michael 37548
##  7  1998 F     Emily   26181
##  8  1998 M     Michael 36614
##  9  1999 F     Emily   26539
## 10  1999 M     Jacob   35361
## 11  2000 F     Emily   25953
## 12  2000 M     Jacob   34471
## 13  2001 F     Emily   25055
## 14  2001 M     Jacob   32541
## 15  2002 F     Emily   24463
## 16  2002 M     Jacob   30568
## 17  2003 F     Emily   25688
## 18  2003 M     Jacob   29630

We see that the most popular names remain largely the same over such a short period of time. Note that as we group first by year, the results are also first arranged by year and thereafter by sex.

Now let’s ask a somewhat different question: how popular (at which rank) was name “Linda” in each year through this time period? In order to answer this question we need to create a new variable popularity, and we use the helper function row_number:

babynames %>%
   group_by(year) %>%
   arrange(desc(n)) %>%
   mutate(popularity = row_number()) %>%
                           # at which place is every name at each year
   filter(name == "Linda")
## # A tibble: 9 × 5
## # Groups:   year [9]
##    year sex   name      n popularity
##   <dbl> <chr> <chr> <int>      <int>
## 1  1995 F     Linda  1234        467
## 2  1997 F     Linda  1098        520
## 3  1996 F     Linda   987        575
## 4  1998 F     Linda   971        604
## 5  1999 F     Linda   900        651
## 6  2000 F     Linda   850        687
## 7  2001 F     Linda   837        685
## 8  2002 F     Linda   769        736
## 9  2003 F     Linda   741        771

We see that popularity of “Linda” has slowly decreased, from mid 400th down to almost 800th place. Note that we have not selected by sex, so the rank also takes into account the boy names that are more popular than “Linda” in that year.

But here is even more complex question: in each year, what are the most popular names that were given to both boys and girls? How can we approach this? It actually turns out to be a fairly simple task: namely, note that for each year, each name is listed one or two times only, depending on whether it was given to only boys/only girls, or for both boys and girls. Hence what we need to do is to group by year and name, and find the group size. This can be done using helper n. We also need the total popularity (which is confusingly also called n in these data). When done this, we can re-group total popularity by year, and for each year find the largest value of it:

babynames %>%
   group_by(year, name) %>%
   filter(n() == 2) %>%
                           # only preserve group size 2:
                           # both boy and girl
   summarize(n = sum(n)) %>%
                           # aggregate boy/girl counts
   group_by(year) %>%
                           # now we need year groups for ranking
   filter(rank(desc(n)) == 1)
## # A tibble: 9 × 3
## # Groups:   year [9]
##    year name        n
##   <dbl> <chr>   <int>
## 1  1995 Michael 41569
## 2  1996 Michael 38522
## 3  1997 Michael 37690
## 4  1998 Michael 36742
## 5  1999 Jacob   35424
## 6  2000 Jacob   34530
## 7  2001 Jacob   32579
## 8  2002 Jacob   30609
## 9  2003 Jacob   29677

We can see that these are the most popular boy names Michael and Jacob. For a quick check, let’s see what records are there for Michael in 1995:

babynames %>%
   filter(year == 1995, name == "Michael")
## # A tibble: 2 × 4
##    year sex   name        n
##   <dbl> <chr> <chr>   <int>
## 1  1995 F     Michael   167
## 2  1995 M     Michael 41402

Indeed, we can see that there are 167 baby girls named “Michael”.

Now let’s take an even more complex task. Which names have been among the 10 most popular names for a longest consecutive time period for both boys and girls? And when was that time period?

Let’s approach this question step-by-step. As the first thing, we need to find the top-10 most popular names for boys and girls and for each years. So we just group by sex and year, and filter only names that have rank no more than 10:

babynames %>%
   group_by(sex, year) %>%
                           # do it separately for sex, year
   filter(rank(desc(n)) <= 10) %>%
                           # only preserve cases that are in top-10
   head()
## # A tibble: 6 × 4
## # Groups:   sex, year [1]
##    year sex   name         n
##   <dbl> <chr> <chr>    <int>
## 1  1995 F     Jessica  27935
## 2  1995 F     Ashley   26602
## 3  1995 F     Emily    24380
## 4  1995 F     Samantha 21645
## 5  1995 F     Sarah    21376
## 6  1995 F     Taylor   20425

As the next step we take each of these names, and order these by year. In this way we can see if any of the years are missing:

babynames %>%
   group_by(sex, year) %>%
   filter(rank(desc(n)) <= 10) %>%
   group_by(sex, name) %>%
                           # we allow the same name be given to both
                           # boys and girls
   arrange(year, .by_group=TRUE) %>%
                           # order each name by year _within_ groups
   head()
## # A tibble: 6 × 4
## # Groups:   sex, name [2]
##    year sex   name        n
##   <dbl> <chr> <chr>   <int>
## 1  2001 F     Abigail 14807
## 2  2002 F     Abigail 15297
## 3  2003 F     Abigail 15925
## 4  1996 F     Alexis  16589
## 5  1997 F     Alexis  17170
## 6  1998 F     Alexis  19778

Note that arrange tends to ignore groups, unless told not to by .by_group=TRUE.

It would be trivial to answer a slightly different question–how many years has the name been in top-10. We could do just be grouping by name and using the helper n. But that would ignore potential gap years when the name was less popular. Instead, we can test if the years are consecutive by simply subtracting the previous year from the current year. If the difference is 1, then the years are consecutive. This can be done using the lag helper function:

babynames %>%
   group_by(sex, year) %>%
   filter(rank(desc(n)) <= 10) %>%
   group_by(sex, name) %>%
   arrange(year, .by_group=TRUE) %>%
   mutate(difference = year - lag(year)) %>%
                           # compute difference b/w current year
                           # and the previous year the name was in
                           # top-10
   head()
## # A tibble: 6 × 5
## # Groups:   sex, name [2]
##    year sex   name        n difference
##   <dbl> <chr> <chr>   <int>      <dbl>
## 1  2001 F     Abigail 14807         NA
## 2  2002 F     Abigail 15297          1
## 3  2003 F     Abigail 15925          1
## 4  1996 F     Alexis  16589         NA
## 5  1997 F     Alexis  17170          1
## 6  1998 F     Alexis  19778          1

We can see that e.g. Abigail is present in 2001, 2002, and 2003. The difference between two consecutive appearances is 1 year, except for the first time when it appears, because there is not previous year and the result is undefined.

Now we want to mark the consecutive spells of presence of the name somehow. In particular, if the name is missing for some years in the middle, we want to use that missing time to split the presence into two successive spells. We can detect all such differences by just looking for difference > 1. We also want to count how many such breaks we have seen, this can be done with cumsum. But first we need to handle the missing values for difference. We’ll replace those with 0:

babynames %>%
   group_by(sex, year) %>%
   filter(rank(desc(n)) <= 10) %>%
   group_by(sex, name) %>%
   arrange(year, .by_group=TRUE) %>%
   mutate(difference = year - lag(year)) %>%
   mutate(difference = ifelse(is.na(difference), 0, difference)) %>%
                           # replace missing difference values
                           # cumsum cannot handle missings
   mutate(gap = difference > 1) %>%
                           # do we have a break in the sequence
                           # of name appearance?
   mutate(spell = cumsum(gap)) %>%
                           # how many of such breaks have we had?
   head()
## # A tibble: 6 × 7
## # Groups:   sex, name [2]
##    year sex   name        n difference gap   spell
##   <dbl> <chr> <chr>   <int>      <dbl> <lgl> <int>
## 1  2001 F     Abigail 14807          0 FALSE     0
## 2  2002 F     Abigail 15297          1 FALSE     0
## 3  2003 F     Abigail 15925          1 FALSE     0
## 4  1996 F     Alexis  16589          0 FALSE     0
## 5  1997 F     Alexis  17170          1 FALSE     0
## 6  1998 F     Alexis  19778          1 FALSE     0

We use the fact that cumsum simply increments the spell number every time where difference > 1. So we essentially count spells, and because we have grouped by name, we do it separately for each name.

Now we have done all the preparations: for each name, and for each year we know which spell—consecutive appearance in top 10—it falls into. We just have to find the longest spell for each name. Again, we group by sex, name, but now also by the spell, because in this way we can easily compute the length of the spell—it is just the number of cases there. Because the data is arrange by year it is easy to find the first and last year for each spell. And finally, for each name we pick the longest such spell:

babynames %>%
   group_by(sex, year) %>%
   filter(rank(desc(n)) <= 10) %>%
   group_by(sex, name) %>%
   arrange(year, .by_group=TRUE) %>%
   mutate(difference = year - lag(year)) %>%
   mutate(difference = ifelse(is.na(difference), 0, difference)) %>%
   mutate(gap = difference > 1) %>%
   mutate(spell = cumsum(gap)) %>%
   group_by(sex, name, spell) %>%
                           # work on individual spells
   summarize(length = n(), start = first(year), end=last(year)) %>%
                           # compute length, start/end year of the spell
                           # note: we collapse data now to spells
   group_by(sex, name) %>%
                           # now look at each sex/name only...
   filter(rank(desc(length)) == 1) %>%
                           # and pick only the longest spell
   arrange(desc(length)) %>%
                           # and output it by length
   head(10)
## # A tibble: 10 × 6
## # Groups:   sex, name [10]
##    sex   name        spell length start   end
##    <chr> <chr>       <int>  <int> <dbl> <dbl>
##  1 F     Ashley          0      9  1995  2003
##  2 F     Emily           0      9  1995  2003
##  3 F     Hannah          0      9  1995  2003
##  4 F     Samantha        0      9  1995  2003
##  5 M     Christopher     0      9  1995  2003
##  6 M     Jacob           0      9  1995  2003
##  7 M     Joshua          0      9  1995  2003
##  8 M     Matthew         0      9  1995  2003
##  9 M     Michael         0      9  1995  2003
## 10 F     Alexis          0      8  1996  2003

We can see that 9 names have been in top-10 through every year of our 9-year observation period.

For illustration, let’s repeat the same code on the full babynames dataset, from 1880 till 2017, and write the code in a slightly more efficient manner:

babynames::babynames %>%
   group_by(sex, year) %>%
   filter(rank(desc(n)) <= 10) %>%
   group_by(sex, name) %>%
   arrange(year, .by_group=TRUE) %>%
   mutate(difference = ifelse(is.na(lag(year)), 0,
                              year - lag(year))) %>%
                           # remove missings in the mutate step
   mutate(spell = cumsum(difference > 1)) %>%
                           # we do not need the auxiliary variable gap
                           # we can just cumsum(difference > 1)
   group_by(sex, name, spell) %>%
   summarize(length = n(), start = first(year), end=last(year)) %>%
   group_by(sex, name) %>%
   filter(rank(desc(length)) == 1) %>%
   arrange(desc(length)) %>%
   head(20)
## # A tibble: 20 × 6
## # Groups:   sex, name [20]
##    sex   name        spell length start   end
##    <chr> <chr>       <int>  <int> <dbl> <dbl>
##  1 M     James           0    113  1880  1992
##  2 M     John            0    107  1880  1986
##  3 M     Robert          2    102  1888  1989
##  4 M     William         0     96  1880  1975
##  5 F     Mary            0     92  1880  1971
##  6 M     Charles         0     75  1880  1954
##  7 M     Michael         0     74  1943  2016
##  8 F     Margaret        0     60  1880  1939
##  9 M     George          0     58  1880  1937
## 10 M     David           0     57  1936  1992
## 11 M     Joseph          0     56  1880  1935
## 12 M     Richard         0     51  1920  1970
## 13 F     Helen           0     45  1891  1935
## 14 F     Elizabeth       0     44  1880  1923
## 15 M     Christopher     0     43  1967  2009
## 16 M     Frank           0     43  1880  1922
## 17 F     Anna            0     41  1880  1920
## 18 F     Ruth            0     39  1892  1930
## 19 F     Patricia        0     38  1929  1966
## 20 M     Matthew         0     37  1972  2008

As we can see, James is leading the pack with a 113-years long spell from 1880 till 1992.


We hope these examples illustrate how the basic dplyr-style thinking, combined with certain helper functions, allows us to address fairly complex problems in a reasonably straightforward fashion. Many problems can be disassembled into grouping, filtering and ordering steps, and when it pays off to be able to be able to build workflows from these building blocks.

4.3 More advanced dplyr usage

4.4 Limitations of dplyr

  • indirect variable names
  • API breaks