Chapter 14 More about data manipulation: merging, reshaping and cleaning data

In the dplyr section we discussed a large class of data manipulation operations, such as selecting variables, filtering observations, arranging data in a certain order, and computing new values. Here we add further operations: merging, reshaping, and cleaning data.

14.1 Merging data: joins

When working with real-world data, you’ll often find that the data is stored across multiple files or multiple data frames. This is done for a number of reasons. For one, it can help to reduce memory usage. For example, if you had a data frame containing information on students enrolled in different courses, you could store information about each course, such as the instructor’s name, meeting time, and classroom, in a separate data frame. This saves a lot of space: otherwise you would have to duplicate that information for every student who takes the same course. You may also want to keep your data better organized by having student information in one file and course information in another. This separation and organization of data is a core concern in the design of relational databases.

Below, we discuss how to combine such separately organized information into a single data frame.

14.1.1 Combining data frames

In simple cases, we may just combine datasets row-by-row or column-by-column. Consider a student dataset:

students <- data.frame(
   student = c("Cecilia", "Marco", "Guangqi"),
   "GPA 2023-au" = c(4.0, 3.8, 2.8),
   check.names = FALSE)
students
##   student GPA 2023-au
## 1 Cecilia         4.0
## 2   Marco         3.8
## 3 Guangqi         2.8

This contains the students’ names and their GPA for the 2023-au quarter.

When the following quarter ends, we may store their new GPA in a vector

gpa2024 <- c(3.9, 3.7, 3.0)
gpa2024
## [1] 3.9 3.7 3.0

The numbers here are in the same order as in the original data frame, i.e. 3.9 for Cecilia and 3.0 for Guangqi. Now we can just add the vector to the GPA data frame using cbind() (combine column-wise):

cbind(students, "GPA 2024-wi" = gpa2024)
##   student GPA 2023-au GPA 2024-wi
## 1 Cecilia         4.0         3.9
## 2   Marco         3.8         3.7
## 3 Guangqi         2.8         3.0

This adds another column (vector) to the data frame and calls it “GPA 2024-wi”. gpa2024 is just a single vector of data. But we can also combine two data frames in a similar fashion, row-by-row.

When a new student, Nong, arrives at the college, we can create a separate data frame for his results:

nongsGPA <- data.frame(
   student = "Nong",
   "GPA 2023-au" = 3.6,
   check.names = FALSE)
nongsGPA
##   student GPA 2023-au
## 1    Nong         3.6

Next, we want to add it as a new row to our students’ data. Two data frames can be combined row-wise in a fairly similar fashion using rbind() (combine row-wise):

rbind(students, nongsGPA)
##   student GPA 2023-au
## 1 Cecilia         4.0
## 2   Marco         3.8
## 3 Guangqi         2.8
## 4    Nong         3.6

But note the difference: now we add a row to the data frame, and a row contains different kinds of values, here one character (“Nong”) and one number (3.6). Hence we need to convert the row to a data frame, and use exactly the same column names as in the students data frame above. rbind() then ensures that the correct columns are matched.

The combining we described here is very simple, in the sense that each row is combined with the corresponding row, or each column with the corresponding column. If the students in the 2024-wi data are in a different order, the result will be wrong.

Below, we discuss more complex merging that ensures that even if the order differs, the results are still correct.

14.1.2 Merging by key

The simple way of combining data frames we used above, just stacking them row-wise or column-wise, is fairly limited. What happens if new students join the college, or previous students leave? Or if the data is simply ordered differently? In such cases we cannot merge rows in two tables by their order (row number); instead, we need to find the corresponding rows in both tables. This is merging by key.

14.1.2.1 Merge key

For instance, imagine that in winter 2024 the student list looks somewhat different:

students24 <- data.frame(
   student = c("Cecilia", "Guangqi", "Nong"),
   "GPA 2024-wi" = c(3.9, 3.0, 3.6),
   check.names = FALSE)
students24
##   student GPA 2024-wi
## 1 Cecilia         3.9
## 2 Guangqi         3.0
## 3    Nong         3.6

If we were just to use cbind() here, we’d create quite a mess:

cbind(students, students24$`GPA 2024-wi`)
##   student GPA 2023-au students24$`GPA 2024-wi`
## 1 Cecilia         4.0                      3.9
## 2   Marco         3.8                      3.0
## 3 Guangqi         2.8                      3.6

While all is well with Cecilia, Marco is now assigned Guangqi’s winter quarter grades, and Guangqi in turn gets Nong’s winter grades. This will cause a lot of trouble for the university!

What we need to do instead is to find which row in the first dataset (autumn 2023) corresponds to which row in the second dataset (winter 2024), and combine only the corresponding rows. This is the idea of merging or joining (because you are “merging” or “joining” the data frames together).

Before you can actually do any merging, you need to find some kind of identifiers, values that determine which row in the first data frame corresponds to which row in the second data frame. In this case it is easy: it is the students’ names. We have to ensure that Cecilia’s grades from one quarter are merged with her grades in the next quarter, and the same for Guangqi, Marco and the others.

This value, or set of values, is called the merge key. We cannot do any merging before we have identified the merge key. There are multiple kinds of merge keys:

  • The primary key is the key in the current data frame.
  • The foreign key is the key in the other data frame. It may be a column with the same name, or it may have different names across different datasets. And be aware: sometimes a column with a similar name has a different meaning in different datasets!
  • Often, a single column is enough to identify the records. The example above can be merged correctly by using just name. But sometimes you need multiple columns, such as year and month. This is called a compound key, a key that consists of multiple columns.
  • If there is no good identifier, we may create one from what we have. This is called a surrogate key. For instance, it is common to link personal datasets using name and date of birth (see the sketch after this list). This usually works, but may create problems if two persons with the same name are born on the same day. Another problem with such a key is the habit of writing names in many slightly different forms. For instance, are “John Smith”, “John A. Smith” and “John Alexander Smit” the same person?
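
A minimal sketch of the last point: assuming a hypothetical persons data frame with name and birthdate columns (all three names are mine), such a key can be built with paste():

persons %>%
   mutate(key = paste(name, birthdate, sep = "|"))
                           # e.g. "John Smith|1990-02-21"

Two rows then agree on key exactly when they agree on both name and birthdate.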

14.1.3 What to merge and what not to merge?

Before you even begin with merging, you should ask whether these datasets can be merged at all. And even if you can merge them, does it make sense? Do you want them to be merged? We discuss a few examples here.

Students:

name     course  quarter  year
Ernst    info201 2023-au     1
Ernst    math126 2023-au     1
Ernst    cse120  2024-wi     1
Eberhard arch101 2023-au     2
Eberhard info201 2024-wi     2

Courses:

name    quarter day time  instructor
info201 2023-au Tue 10:30 Schnellenberg
info201 2024-wi Tue 10:30 Schnellenberg
math126 2023-au Thu 1:30  Recke
arch101 2023-au Wed 3:30  Plater

Example students’ and courses’ data.

Imagine you are working at a college and you have access to students’ data (name, day of birth, address, courses they have taken, grades, and so on). You also have access to the college’s course catalog, including course number, quarter, instructor’s name, etc. Can you merge these two datasets? What might be the merge key? Why may you want to do that, what interesting questions can be answered in this way?

These two data frames can easily be merged. The key might be the course number. But should the key contain the quarter? The answer is “maybe”, depending on what kind of questions we are asking.

  • If we ignore the quarter, then we merge a course called “info201” with all students who have taken “info201”. This might be useful to compare that course with other similar courses, to analyze the average grades, which other courses the students tend to take, and other course-specific questions.
  • If we include quarter in the merge key, then we can answer questions about a particular course instance. For instance, we can compare different instructors of the same course, or analyze at which year students tend to take info201.
  • Note that we can use one of the columns the tables share, quarter, as part of the merge key, but we cannot use the other shared column, name, as the key. This is because name means different things in these two tables: it is the student’s name in the first table and the course name in the second table.

All these examples are valid questions that the college might want to analyze.
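
As a sketch, suppose the two tables above are typed in as data frames enrollments and courses (the data frame names are mine; the columns are as shown). A merge that includes the quarter in the key might then look like:

left_join(enrollments, courses,
          by = c("course" = "name", "quarter"))
                           # 'course' matches 'name' in courses,
                           # 'quarter' matches 'quarter'

Note how the compound key mixes a differently-named pair (course/name) with a same-named column (quarter).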

Exercise 14.1 Assume we merge students with courses using just course number and ignoring everything else. Provide additional examples of questions that one might want to answer with the resulting data.

date       temp precip sunrise
2024-02-21    9    0.5    7:03
2024-02-22   12    0.0    7:01
2024-02-23   12    0.0    6:59
2024-02-24    5    1.0    6:57

Example weather data

But imagine now that we have also weather data: for every day, we know temperature, precipitation, wind speed, and also sunrise/sunset time. Can we combine students and weather in a meaningful way? What kind of questions might we answer with such a combination?

  • First, weather is not something that the college can do much about. The schedules are typically planned months in advance, and it is hard to change much based on a last-minute weather forecast. This means weather data is perhaps not that relevant for colleges.
  • Second, what kind of meaningful questions can we answer if we have such combined data? The only questions that come to my mind are to see if there is any relationship between temperature and grades, or more likely, between sunlight hours, what time of day the course is offered, and students’ performance.
  • Third, what might be the merge key? There are no course numbers nor student names in the weather table. The only possible key is date. Neither the student data nor the course data contains a date, but the course data includes quarter. So we can compute the average weather or time of sunset for each quarter from the weather data, merge this with the course data, and finally combine with the students’ data. This sounds complicated, but it is perfectly doable (see the sketch after this list).

These examples are also valid questions, but probably ones that colleges consider less important.
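
As a sketch of the third point: assuming the weather table is stored as a data frame weather, and that we have some way of mapping each date to a quarter label–here a hypothetical helper quarterOf()–the pipeline might look like:

quarterWeather <- weather %>%
   mutate(quarter = quarterOf(date)) %>%
                           # quarterOf() is hypothetical: it maps
                           # a date to a label like "2024-wi"
   group_by(quarter) %>%
   summarize(meanTemp = mean(temp))
courses %>%
   left_join(quarterWeather, by = "quarter")
                           # each course instance gets the average
                           # temperature of its quarter

From here, one more join with the student data completes the three-step combination described above.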

14.1.4 Different types of joins

When merging data, the computer will always find the common key values in both data frames and merge the corresponding rows. But as usual, the devil is in the details. What should the computer do if there is no match for some of the keys in one of the data frames? Or if a key in one table corresponds to multiple lines in the other table? This is where one has to decide what is a good way to proceed. Below, we discuss the details based on left_join(), and thereafter briefly introduce the other joins.

14.1.4.1 When the first data frame matters: left_join()

The left_join() operation is one of the most common merge operations. It looks for matching key columns between the two data frames, and then returns a new data frame that is otherwise identical to the first (“left”) data frame, but with extra columns, taken from the second data frame, added on. Importantly, the left-part of left_join() ensures that all rows from the left data frame (the first table) are preserved, and no new rows are added. This is achieved as follows:
  1. If there is no matching key for a row in the first table, the corresponding entries of the new columns are filled with missings (NA).
  2. If there are rows in the second table that do not have corresponding rows in the first table, then those rows are dropped.

Below is an example of such a command:

left_join(songs, artists)
This will merge the table songs with the table artists. Here songs is the left data frame and artists is the right one (because songs is to the left of artists). The figure below explains the process:
Diagram of the left_join() function (by Nathan Stephens).

How left_join() works: the first table is preserved, the new column is filled with NA if no matching pair is found.

  1. left_join() preserves the left table: the gray songs data frame is preserved intact in the result.
  2. The join detects the merge key. Here it is name–the common column name in these two tables. But you can also specify it with the by argument. The column name of the key(s) may differ between the tables!
  3. The lines in the first table are merged with the corresponding ones in the second: we learn that in “Across the Universe”, John plays guitar; in “Come together”, John plays guitar; and in “Hello, Goodbye”, Paul plays bass.
  4. But there is no match for the artist “Buddy” in “Peggy Sue”. Hence that field will be left NA. This is the final merged dataset (reconstructed in code below).
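
For concreteness, the figure’s tables might be typed in roughly like this (the exact values and column names are my reconstruction of the figure, for illustration only):

songs <- data.frame(
   song = c("Across the Universe", "Come together",
            "Hello, Goodbye", "Peggy Sue"),
   name = c("John", "John", "Paul", "Buddy"))
artists <- data.frame(
   name = c("George", "John", "Paul", "Ringo"),
   plays = c("sitar", "guitar", "bass", "drums"))
left_join(songs, artists)
                           # merges on the common column 'name';
                           # Buddy has no match, so his 'plays' is NA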

Exercise 14.2 What do you think, what will be the result of left_join(artists, songs)? In particular, what song(s) will be assigned to John?

See the solution

Below, we discuss a few more details using another example. Imagine you are a data scientist at a college, and you have two tables: majors for student majors, and contact for student contact information. Both of these tables use column id, the university-wide student id. (You may also have a third table that links the id to their real name, but we are not concerned about that here.)

contact <- data.frame(
   id = c(1, 2, 3, 4),  # student id-s
   email = c("lisoc@college.edu", "zemath@college.edu",
             "bioarlin@college.edu", "uv98@college.edu")
)
contact
##   id                email
## 1  1    lisoc@college.edu
## 2  2   zemath@college.edu
## 3  3 bioarlin@college.edu
## 4  4     uv98@college.edu
majors <- data.frame(
   studentId = c(1, 2, 3),  # student id-s
   major = c("sociology", "math", "biology")
)
majors
##   studentId     major
## 1         1 sociology
## 2         2      math
## 3         3   biology

Both tables have an id-column, allowing you to match the rows from the contact table to the majors table. However, because the two tables name the id-column differently, we need to explicitly tell R what the merge key is. We can left-merge contact information with majors as

left_join(contact, majors,
          by = c("id" = "studentId"))
##   id                email     major
## 1  1    lisoc@college.edu sociology
## 2  2   zemath@college.edu      math
## 3  3 bioarlin@college.edu   biology
## 4  4     uv98@college.edu      <NA>

Note the by argument: it is a named vector (see Section 4.4.4), where the key column name in the first data frame goes on the left and that of the second data frame goes on the right. The result is very similar to the songs-artists example: for students 1, 2, and 3 we have both email and major, but we do not know the major of student 4.
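
As a side note, newer versions of dplyr also offer the join_by() helper for the same purpose, which some find more readable:

left_join(contact, majors,
          by = join_by(id == studentId))

Both forms produce the same result here.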

In short, left_join() returns all of the rows from the first table, with all of the columns from both tables.

Notice that because of how a left join is defined, the argument order matters! The resulting table only has rows for elements in the left (first) table; any unmatched elements in the second table are lost. If you switch the order of the operands, you would only have information for students with majors:

left_join(majors, contact,
          by = c("studentId" = "id"))
##   studentId     major                email
## 1         1 sociology    lisoc@college.edu
## 2         2      math   zemath@college.edu
## 3         3   biology bioarlin@college.edu

You don’t get any information for student #4, because they didn’t have a record in the left-hand table!

14.1.4.2 Other joins

But you do not always want to preserve all rows of the first data frame and drop the “orphans” from the second one. In fact, dplyr (and relational database systems in general) provides a number of different kinds of joins, each of which decides in its own way which rows to include. In all these cases, all columns from both tables will be included, with rows taking on any values from their matches in the second table.

  • left_join All rows from the first (left) data frame are returned. That is, you get all the data from the left-hand table, with extra column values added from the right-hand table. Left-hand rows without a match will have NA in the right-hand columns.

  • right_join All rows from the second (right) data frame are returned. That is, you get all the data from the right-hand table, with extra column values added from the left-hand table. Right-hand rows without a match will have NA in the left-hand columns. This is the “opposite” of a left_join(), and the equivalent of switching the order of the data frames in left_join().

  • inner_join Only rows in both data frames are returned. That is, you get any rows that had matching observations in both tables, with the column values from both tables. There will be no additional NA values created by the join. Observations from the left that had no match in the right, or observations in the right that had no match in the left, will not be returned at all.

  • full_join All rows from both data frames are returned. That is, you get a row for any observation, whether or not it matched. If it happened to match, it will have values from both tables in that row. Observations without a match will have NA in the columns from the other table.

The key to deciding between these is to think about what set of data you want as your set of observations (rows), and which columns you’d be okay with being NA if a record is missing.
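
To make the difference concrete, here is how two of the other joins behave on the contact and majors tables from above:

inner_join(contact, majors,
           by = c("id" = "studentId"))
                           # only students 1, 2, 3: those present
                           # in both tables; no NA-s created
full_join(contact, majors,
          by = c("id" = "studentId"))
                           # all four students; student 4 has
                           # NA for 'major'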

All these joins add columns from one table to another. dplyr also provides filtering joins, which exclude rows based on whether they have a matching observation in another table, and set operations, which combine observations as if they were set elements. See the documentation for more detail on these options, but in this course we’ll be primarily focusing on the mutating joins described above.

TBD: when to use what join

TBD: joins and pipes

TBD: More advanced topics one-to-many merging for map data

14.2 Reshaping data

Above, in Section 11, we discussed how many types of data can be stored in data frames, rectangular tables of rows and columns. But it turns out that the exact same information can be stored in multiple ways. Here we discuss two of the most popular forms, long form and wide form, and reshaping, the process of transforming one into the other.

14.2.1 Motivating example

Example temperature data

Temperature data, collected in two locations (N and S) at two altitudes (L and H).

Imagine you are a researcher at a meteorology institute. Every day you collect air temperature data at two locations, North (N) and South (S), and at two altitudes, Low (L) and High (H). An example of the data you just collected may look like the table at right: here the temperature in the North at low altitude was 20, and in the South at high altitude was 25.

How might you represent these data? The table at right offers one example. It is a rectangular table (data frame) that contains two rows, one for each location, and two columns, one for each altitude.

Obviously, this is not the only way to represent these data. Here are three other ways you may write down these exact same data:
Different ways to represent the same data

Three other ways to represent the same data

The first of these three examples, b), is just a transposition of the original table a). The other two, c) and d), are “made of” the original table, but not by simple operations like transposition. You may notice, though, that c) and d) are basically transpositions of each other. But in any case, the data in these tables is the same. All four agree that the temperature in the South at low altitude was 30, and that the North-high value was 15. It is not that some tables are correct while others are wrong. They are all correct. Let’s now discuss in which sense these tables differ.

The original table a) has location in rows and altitude in columns. So here a row represents a location, and each column is an altitude, corresponding to that location. If you want, different altitude values are grouped together into the corresponding locations. Table b) is similar in many ways. But instead of grouping altitudes together into rows representing locations, it groups locations together into rows representing altitudes.

Finally, tables c) and d) discard any groupings and just put all the measures underneath each other (c) or next to each other (d). Obviously, now each row or column needs to have a label, telling which location and altitude the measure is taken of. This is achieved here through two additional columns (c) or rows (d).

While all these tables are correct, they are not equally good for all purposes. Typically, a) and b) are better for human reading: we can quickly move our eyes both left-and-right and up-and-down, and in this way compare both locations and altitudes. Table c) tends to be the best for computers, but it is not always so. We’ll discuss this more below. Table d) is usually hard to use and is rarely seen.

Are there any other options that were not represented here? Not really. Obviously, we can put S in the first place instead of N in tables a) and b), or alternatively, we can combine Loc and Alt values into a single LocAlt column that might take values like “NH” or “S.L”. But otherwise, these four tables are all we can do.

Tables a) and b) are typically called data in wide form, and c) is called long form. To be more precise, we might call d) the wide form, and a) and b) “middle forms”. However, in practice the word “wide” is normally used for tables like a) and b), and as d) is rarely used, there is no need to talk about middle forms.

Here is a little more about what is the wide form and what is the long form:

  • In order to be able to talk about wide/long form, the data must have some sort of groupings. Here we have two such groupings: location and altitude. We can group values together by similar location, or by similar altitude.
  • Besides groups, we also need some kind of value. Here it is temperature. Values are typically the most valuable part of a dataset. Indeed, the institute does not have to hire a data scientist to tell them that there exists such a thing as high altitude in the North.
  • In the long form, each row represents a single observation. Groups are not collected into the same row in any way. But we need additional columns that denote the group; in the example we need the columns location and altitude. Otherwise we cannot tell which group the value belongs to.
  • In the wide form, at least one group is collected into a single line. In table (a) it is location, in table (b) it is altitude. Table (d) shows an example where both groups are collected into a single line. Now we need either extra rows to tell where the groups belong, or (more commonly) this is encoded into the column names.
  • It is not always clear what the groupings are and what the values are. It is somewhat fuzzy and depends on what you want to do.
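
To make these forms concrete, here is a sketch of tables a) and c) typed in as data frames (the values come from the example above; the column names are mine):

tempWide <- data.frame(Loc = c("N", "S"),
                       L = c(20, 30),
                       H = c(15, 25))
                           # wide form a): a row per location
tempLong <- data.frame(Loc = c("N", "N", "S", "S"),
                       Alt = c("L", "H", "L", "H"),
                       temp = c(20, 15, 30, 25))
                           # long form c): a row per observation,
                           # two columns mark the groups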

Note that the story gets more complicated with more groups and measures.

14.2.2 A look at data

Now it is time to look at real data. Let’s take the 2015 subset of the alcohol disorders data:

disorders <- read_delim("data/alcohol-disorders.csv") %>%
   filter(Year == 2015) %>%
   select(country, disordersM, disordersF)
disorders
## # A tibble: 5 × 3
##   country       disordersM disordersF
##   <chr>              <dbl>      <dbl>
## 1 Argentina          3.07       1.17 
## 2 Kenya              0.747      0.654
## 3 Taiwan             0.891      0.261
## 4 Ukraine            3.90       1.43 
## 5 United States      2.93       1.73

The columns disordersM and disordersF refer to the percentage of male and female population with alcohol disorders.

Which table from Section 14.2.1 does this printout correspond to? Is it (a), (b), (c), or (d)? It is fairly easy to see that it is basically the same as table (a) (or (b)): instead of two locations, the rows represent five countries. And instead of two altitudes, the columns represent two sexes.

You can also see that the data has two grouping dimensions: country and sex. The data is grouped into countries, so each row corresponds to a country, and each column (except country) is a measure of alcohol disorders. The column names also carry important information: they tell which sex the particular numbers correspond to. And finally, all the cells contain a similar measure: percentage of population with alcohol disorders. These data are in wide form.

This form of data is very popular for a few reasons:

  • It is well suited for quick visual comparison. For instance, one can easily compare males in Kenya (0.747) to males in Taiwan (0.891) and find that these values are fairly similar. We can also easily compare men and women in Taiwan (0.891 and 0.261), these figures are rather different.
  • It is easy to extract individual vectors. For instance, if you want to plot alcohol usage for men versus women across the countries, then this form of data is very well suited for the task.

But the table above is not convenient for other common tasks. For instance, to find the sex and country with the largest value of alcohol disorders, one needs to either compare both disorder columns manually, or maybe write a loop over the columns. This can be tricky. But this kind of task is very common when doing data processing.

Instead, we may want to present the data in the long form, as in table c):

## # A tibble: 10 × 3
##    country       sex   disorders
##    <chr>         <chr>     <dbl>
##  1 Argentina     M         3.07 
##  2 Argentina     F         1.17 
##  3 Kenya         M         0.747
##  4 Kenya         F         0.654
##  5 Taiwan        M         0.891
##  6 Taiwan        F         0.261
##  7 Ukraine       M         3.90 
##  8 Ukraine       F         1.43 
##  9 United States M         2.93 
## 10 United States F         1.73

Here the values are not grouped in any way; all the groups (countries and sexes) are spread into separate rows. Also, no row contains multiple instances of the same type of value–here it means no row contains multiple disorder values. We also need one or more distinct columns to mark the group membership: a column for country, and a column for sex. So the long-form data requires three columns: country, sex and disorders. This is how the table above was constructed.

Now finding the maximum value is fairly easy: just find the largest value of the column disorders and report the corresponding row. But unfortunately, comparing males in Kenya to males in Taiwan is somewhat more complicated: first we need to find the group of rows for Kenya, and then the male row in the Kenya group. This is not as easy as with wide form data, at least visually.
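
For instance, assuming the long table above is stored as disordersLong (the name is mine), finding the largest value is a one-liner:

disordersLong %>%
   filter(disorders == max(disorders))
                           # Ukraine, M, 3.90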

But as we explained above, these two are just different forms of the same data, they are not different datasets. Whichever form we use, Kenyan males still have a somewhat lower prevalence of alcohol disorders than Taiwanese males (0.747 versus 0.891). These are the same data.

Let’s look at the difference between the wide form and the long form table once more. In the wide form above, each line represents a country, and each country contains a group of values: disorders for men and for women. So the data is grouped in lines by country. In the long form, there is no such grouping; each line represents a country-sex combination. You can also see that in the long form, each column contains a different kind of value: the first column is country, the second column is sex, and the third one is the disorder percentage. This was not the case with the wide form: in that table, two columns contained the disorder percentage.

Exercise 14.3 Can you sketch an alternative way to present these data in wide form?

The solution

Exercise 14.4 What might the pure wide form data look like, where neither countries nor sexes are in different rows?

See the solution

Example 14.1 Some kind of grouping is needed to make the distinction between wide and long form meaningful. But what constitutes a group may be a bit unclear. For instance, look at the height-weight data. This represents a few health-related characteristics of five teenagers:

sex    age height weight
Female  16    173   58.5
Female  17    165   56.7
Male    17    170   61.2
Male    16    163   54.4
Male    18    170   63.5

Is this dataset in long form or in wide form? In order to answer this question, we need to understand three things: 1) what the rows represent; 2) what the values are–the actual data values in the table (in one or in multiple columns); and 3) what the group indicators are (in column names or in dedicated columns). The answers are:

  1. A row represents a person (this can be inferred, but also learned from the documentation).
  2. We can immediately see that each column represents a value: age, height and weight are, obviously, values. Here sex is a value too: it is not a group, but the sex of that particular person.
  3. Finally, we do not have any group indicators! These data are not grouped in any way; they are individual-level data.
Now you may have two questions:
  1. Why is sex a value, not a group? This is because this dataset is about five individual teenagers. True, it might instead have been about two groups, boys and girls, with three kids measured in each group (the third girl would have NA-s in every column). In that case sex would be a group indicator. This was not how the dataset was constructed, though.

  2. Why aren’t age, height and weight (or maybe even sex, age, height and weight) just different measures for the same group (one person is one group)? You can imagine reshaping this data frame into two columns: measure (this would be either “sex”, “age”, “height” or “weight”) and value (this would be e.g. “Female”, “16”, “173” and so on).

    This is not wrong (except that you cannot mix numeric and character values in a single column), but it is hardly ever useful. True, these are all measures, but there are few tasks where you may want to do similar operations with all these values. It does not make sense to put height and weight on the same plot, and filtering like filter(value > 100) is fairly meaningless.

So it makes sense to say that these data do not have any groupings, but four different kinds of values instead.

What might the wide form of these data look like? Remember: wide form has all values for the group in the same row. But here we have only trivial groups–every person forms their own group–and hence the data is also in wide form! The distinction between wide and long form is meaningful only if the data contains some kind of groupings.

Exercise 14.5 Consider the following dataset: the numbers are visits to a hospital by different kinds of patients:

pregnant male female
yes        NA     11
no         25     20
  • Is this data frame in a wide form or long form?
  • Can you reshape it to the other form?

See the solution

14.2.3 pivot_longer() and pivot_wider(): reshaping between wide and long form

Fortunately, it is straightforward to convert between the wide and long form. This is called reshaping. The tidyr package (part of tidyverse) offers two functions for this: pivot_longer() and pivot_wider(). Let’s first convert the original disorder data frame to the long form using pivot_longer(). As a reminder, here is the disorder data again:

country       disordersM disordersF
Argentina      3.0698864  1.1703126
Kenya          0.7469638  0.6539660
Taiwan         0.8912813  0.2611961
Ukraine        3.8954954  1.4253794
United States  2.9275393  1.7291682
pivot_longer() takes a number of arguments, the most important of which are cols, values_to and names_to.
  • cols tells which columns are to be transformed to long format. We want to transform the value columns, disordersM and disordersF. These columns contain a similar value–percentage of alcohol disorders. We do not want to convert country: it is the grouping dimension, the name of the country, not a percentage of disorders (remember, these data are grouped by country and sex). The value for cols can be specified in a similar fashion as for select(), see Section 12.3.1. Here we can pick e.g. cols = c(disordersM, disordersF) or, equivalently, cols = !country.
  • values_to is just the name of the resulting value column: the two columns we selected, disordersM and disordersF, will be stacked together into a single column (or more precisely, interleaved into a single column). If not specified, it will be called “value”.
  • names_to is the name for the new grouping indicator. As disordersM and disordersF are combined together, we need a new column that tells which row originates from which of these. If not specified, the column will be called just “name”.

Now we are ready to do the actual reshaping. We need to select all columns except country. Let’s call the resulting value column “disorders”, and the other grouping column, the one that will contain the original column names, “sex”:

longDisorders <- disorders %>%
   pivot_longer(!country,
                values_to = "disorders",
                names_to = "sex")
longDisorders
## # A tibble: 10 × 3
##    country       sex        disorders
##    <chr>         <chr>          <dbl>
##  1 Argentina     disordersM     3.07 
##  2 Argentina     disordersF     1.17 
##  3 Kenya         disordersM     0.747
##  4 Kenya         disordersF     0.654
##  5 Taiwan        disordersM     0.891
##  6 Taiwan        disordersF     0.261
##  7 Ukraine       disordersM     3.90 
##  8 Ukraine       disordersF     1.43 
##  9 United States disordersM     2.93 
## 10 United States disordersF     1.73

And here we have it. The long form contains three columns: two of these, country and sex, are grouping indicators, and the third one, disorders, contains the actual values. Note that you may want to do a bit of post-processing and remove “disorders” from the sex column, leaving only “F” or “M”. It happens frequently that the column names do not make nice group indicators.
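
For instance, one way to do this cleanup (the same gsub() approach is used in Section 14.2.4.3 below):

longDisorders %>%
   mutate(sex = gsub("disorders", "", sex))
                           # "disordersM" -> "M", "disordersF" -> "F"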

Long form is good for various kinds of data processing tasks, but sometimes we want to transform it back to wide form. This can be done with pivot_wider(). The most important arguments it takes are names_from and values_from. The former tells which column contains the new column names–it should be one of the grouping dimensions, here sex or country. values_from tells which column contains the corresponding values, here it must be disorders. So let’s get our wide form disorders back:

longDisorders %>%
   pivot_wider(names_from = "sex",
               values_from = "disorders") %>%
   head(5)
## # A tibble: 5 × 3
##   country       disordersM disordersF
##   <chr>              <dbl>      <dbl>
## 1 Argentina          3.07       1.17 
## 2 Kenya              0.747      0.654
## 3 Taiwan             0.891      0.261
## 4 Ukraine            3.90       1.43 
## 5 United States      2.93       1.73

And we got it back 😀

Exercise 14.6

Use the alcohol disorders data from above:
  • Reshape it into a long form. Remove “disorders” from the sex column, so that it only reads “F” and “M”.
  • Reshape it into the wide form, but now put sexes in rows and countries in columns.

The solution

Exercise 14.7

Use the alcohol disorders data from above:
  • Reshape it into a long form, and thereafter into the widest possible form that only contains a single row.
  • Reshape it directly into the widest possible form.

The solution

14.2.4 Multiple groupings and multiple values

The above examples were fairly simple: this was because we were only using two grouping dimensions and only one measure variable. Now we look at the case where we have more dimensions and more values.

14.2.4.1 Three grouping dimensions

Imagine you are working for the meteorology institute again, as in Section 14.2.1. Your task is again to collect temperature data in North (N) and South (S) at Low (L) and High (H) altitude. However, now the data must be measured every day, not just once. So far you have collected data for Yesterday (Y) and Today (T), and today seems to be just one degree warmer across all the measurements.

Temperature data that contains 3 groupings

Temperature data with three groupings: location, altitude, and time. Here in a wide form with respect to altitude and time, and in long form with respect to location.

How might you store the data? The example table at right offers one option. You have stored two locations in two rows, and 2×2 altitude-time combinations in four columns.

Is this a good way to store the data? What are the alternative options? The figure below offers a few examples:
Possible alternative ways to represent the data

Possible alternative ways to represent 3-way grouped data.

Why does your new task–store data over time–offer new ways to make the data frames? This is because now we have three grouping dimensions: location, altitude and time. And not surprisingly, more dimensions offer more ways to represent data. More specifically, tables (a), (b) and (c) are typically referred to as “wide form”, although all these tables have location in rows. Table (d) is pure long form, as the only columns are grouping indicators and the temperature values. Finally, (e) is pure wide form, as no groupings are placed in rows. As in the case of the two-grouping data, wide forms are often used to represent data visually, while the long form (d) is often useful for data processing. As above, the pure wide table (e) is rarely used, because it is cumbersome to process and display.

Exercise 14.8 Are these all possible options? Can you come up with more possible ways to represent these data? How many different ways are there in all to make a data frame of these data?

The solution

Temperature data as data cube

Temperature data as data cube.

Finally, in the case of three grouping dimensions, we can also represent data not as a rectangle but as a 3D “box”. This is commonly called a data cube. Data cubes are not very common, and the data frame tools we use in this course do not support them (base-R multi-dimensional arrays can represent them, though). Data cubes are much harder to visualize; essentially you need to convert them to (2D) data frames to display their content. But they are actually used in exactly such contexts, when collecting 3D spatial data, or 4D spatio-temporal data. In the latter case we would need even a 4D cube. We do not discuss data cubes further in this course. However, it is useful to know about and imagine data cubes when reshaping data frames with multiple grouping dimensions.
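
For the curious, here is a sketch of the three-dimensional temperature example as a base-R array (the numbers are my reading of the example: yesterday’s values, and one degree warmer today):

temps <- array(c(20, 30, 15, 25,
                 21, 31, 16, 26),
               dim = c(2, 2, 2),
               dimnames = list(loc = c("N", "S"),
                               alt = c("L", "H"),
                               time = c("Y", "T")))
                           # first four numbers: yesterday
                           # (N-L, S-L, N-H, S-H), then today
temps["N", "L", "T"]
## [1] 21

Reshaping such a cube into a long data frame amounts to listing all its cells together with their loc, alt and time labels.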

14.2.4.2 Multiple values

But real data may be even more complicated. When you send your expensive weather balloons up, it is unlikely that you measure just temperature. You may measure a bunch of different atmospheric parameters, including humidity, wind speed, sunlight, GPS coordinates and more.

A possible representation of the atmospheric dataset that also contains humidity as a data frame.

First, imagine you also measure humidity. The dataset you collect may now look something like what is displayed at right. It is a data frame where rows represent location-altitude combinations, and columns represent time-measurement type combinations. How many grouping dimensions do we have in this data frame?

The answer is not quite obvious. We definitely have three: location, altitude and time. But will measurement type–temperature or humidity–also count as a grouping dimension?

Temperature-humidity data in a completely long form.

This is not quite clear. From a purely technical point of view, it is as good as time or location, so you can imagine making a long form dataset as seen here. But conceptually, this may not be a very useful thing to do, as there is little need to use the value column for any operations besides filtering it for either temperature or humidity. These two are very different physical concepts, measured in completely different units. You cannot easily put them on the same figure, nor will computing averages, or filtering like filter(value > 50), make much sense. So conceptually you probably do not want to do that. It is more useful to just use three grouping indicators–Location, Altitude and Time–and two value columns: temperature and humidity.
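
As a sketch: if the wide data frame is called wdata and its measurement columns are named like Y_temp, Y_hum, T_temp and T_hum (all these names are mine), then tidyr can produce exactly this shape using the special “.value” marker:

wdata %>%
   pivot_longer(!c(Loc, Alt),
                names_to = c("time", ".value"),
                names_sep = "_")
                           # result: columns Loc, Alt, time,
                           # temp, hum--two value columns

The “.value” entry tells pivot_longer() that the second part of each column name is not a group label but the name of a value column.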

The values we considered here, temperature and humidity, will change across all three grouping dimensions. But not all measures do.

Temperature-humidity data with geographic coordinates. The latter do not depend on time and elevation.

For instance, imagine you add the geographic coordinates to your weather data. The North and South mean large regions, so the small drift of your weather balloons does not matter much. The result may look like the table here at right.

Here I have just marked two sets of coordinates: one for North, one for South. This is because the coordinates do not depend on altitude and time (and they definitely do not depend on the type of measurement). Such partly constant values add some additional considerations when reshaping data. For instance, when converting the table at right to a data frame (the figure is just a handwritten table!), you need to replicate the GPS coordinates: you need to assign the same coordinates to both the Low and High altitude. When converting the data frame to a long form, you probably want to add a column of coordinates to this dataset, one that contains largely identical values. This may be wasteful for large datasets.

Exercise 14.9

Consider the Ice Extent Data. Let’s only consider columns year, month, region, extent and area–remove columns data-type and time.
  1. What are the grouping dimensions in these data?
  2. Are these in a wide form or in long form?
  3. What are the values in these data? Do they change along all the grouping dimensions? Would it make sense to combine the values into an additional grouping dimension?
  4. If you want to convert the dataset into a wide one by month, then what might the column names look like?

The Solution

14.2.4.3 Reshaping multi-dimensional data

Now let’s reshape a more complex dataset. Take the complete alcohol disorder data:

disorders <- read_delim("data/alcohol-disorders.csv")
disorders %>%
   sample_n(4)
## # A tibble: 4 × 6
##   country   Code   Year disordersM disordersF population
##   <chr>     <chr> <dbl>      <dbl>      <dbl>      <dbl>
## 1 Taiwan    TWN    2019      0.903      0.258   23777742
## 2 Ukraine   UKR    2018      3.53       1.33    44446952
## 3 Taiwan    TWN    2018      0.892      0.259   23726186
## 4 Argentina ARG    2017      3.02       1.16    44054616

This dataset has three grouping dimensions–country, year and sex. It has two values, disorders (for males and females), and population. You can also easily see that population differs by country and year, but there is no separate population for males and females–it is a single population figure.

Are these data in a wide or in a long form? We can try to answer this in the following way:
  1. What does a row represent? Here, it is obviously a country-year combination. Hence it is in a long form according to country and year. Sexes are combined into the same row, hence it is in wide form according to sex.

  2. What are the values? The data in the current form contain three value columns: male disorders, female disorders, and population. This means two types of values–disorder percentage (for both men and women), and population.

Year is not a value despite being numeric–it is just a label on the time dimension.

  3. What are the group indicators? It is fairly obvious that both country and year are groups. Moreover, country is given by two columns, country and Code; we might remove one and still have the same information in the data. So we have three group columns, one of which is redundant.

    The final grouping dimension, sex, does not have a separate column; it is embedded in the disorder column names.

Exercise 14.10 Consider a subset of COVID Scandinavia data. It looks like this:

read_delim("data/covid-scandinavia.csv.bz2") %>%
   select(code2, country, date, type, count, lockdown, population) %>%
   slice(c(201:204, 891:895, 2271:2274))
## # A tibble: 13 × 7
##    code2 country date       type      count lockdown   population
##    <chr> <chr>   <date>     <chr>     <dbl> <date>          <dbl>
##  1 DK    Denmark 2020-05-01 Confirmed  9311 2020-03-11    5837213
##  2 DK    Denmark 2020-05-01 Deaths      460 2020-03-11    5837213
##  3 DK    Denmark 2020-05-02 Confirmed  9407 2020-03-11    5837213
##  4 DK    Denmark 2020-05-02 Deaths      475 2020-03-11    5837213
##  5 FI    Finland 2020-05-01 Confirmed  5051 2020-03-18    5528737
##  6 FI    Finland 2020-05-01 Deaths      218 2020-03-18    5528737
##  7 FI    Finland 2020-05-02 Confirmed  5176 2020-03-18    5528737
##  8 FI    Finland 2020-05-02 Deaths      220 2020-03-18    5528737
##  9 FI    Finland 2020-05-03 Confirmed  5254 2020-03-18    5528737
## 10 SE    Sweden  2020-05-01 Confirmed 22133 NA           10377781
## 11 SE    Sweden  2020-05-01 Deaths     3027 NA           10377781
## 12 SE    Sweden  2020-05-02 Confirmed 22432 NA           10377781
## 13 SE    Sweden  2020-05-02 Deaths     3102 NA           10377781
  1. What are the grouping dimensions in these data?
  2. Are these data in a long or a wide form?
  3. Would you count type (Confirmed/Deaths) as a grouping dimension?
  4. How many different values do you see in this dataset?
  5. Do all values change according to all dimensions?
  6. How would you describe the column code2? Is it value? Grouping dimension? Something else? Where would you put it when you’d transform the dataset into a wide form?

The solution

Let’s first convert the dataset into a long form, including long form by sex:

disordersL <- disorders %>%
   pivot_longer(starts_with("disorders"),
                           # all columns, names of which
                           # start with 'disorders'
                names_to = "sex",
                values_to = "disorders") %>%
   mutate(sex = gsub("disorders", "", sex))
                           # better name for sex
disordersL %>%
   head(10)
## # A tibble: 10 × 6
##    country   Code   Year population sex   disorders
##    <chr>     <chr> <dbl>      <dbl> <chr>     <dbl>
##  1 Argentina ARG    2015   43257064 M          3.07
##  2 Argentina ARG    2015   43257064 F          1.17
##  3 Argentina ARG    2016   43668236 M          3.04
##  4 Argentina ARG    2016   43668236 F          1.16
##  5 Argentina ARG    2017   44054616 M          3.02
##  6 Argentina ARG    2017   44054616 F          1.16
##  7 Argentina ARG    2018   44413592 M          3.04
##  8 Argentina ARG    2018   44413592 F          1.17
##  9 Argentina ARG    2019   44745516 M          3.11
## 10 Argentina ARG    2019   44745516 F          1.19

This dataset is in long form. Each row represents a country-year-sex combination, and we have three columns describing these grouping indicators. We have two values: population and disorders. The representation is a little bit wasteful, as exactly the same population figure is replicated twice, once for men and once for women. But this data representation is good for many tasks, e.g. for comparing males and females on a barplot (see the sketch below).
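
For instance, a minimal barplot sketch, picking a single year:

disordersL %>%
   filter(Year == 2019) %>%
   ggplot(aes(country, disorders, fill = sex)) +
   geom_col(position = "dodge")
                           # side-by-side bars for M and F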

Alternatively, we may want to convert it into a form where country and sex are in rows, but years are in columns. In order to achieve this, we should first convert it into a long form, and thereafter make it into a wide form by year. Hence we need to spread the two value columns–population and disorders–into columns by year:

disordersL %>%
   pivot_wider(names_from = Year,
               values_from = c(population, disorders)) %>%
                           # note: both 'population' and 'disorders'
                           # are values
   head(4)
## # A tibble: 4 × 13
##   country   Code  sex   population_2015 population_2016 population_2017
##   <chr>     <chr> <chr>           <dbl>           <dbl>           <dbl>
## 1 Argentina ARG   M            43257064        43668236        44054616
## 2 Argentina ARG   F            43257064        43668236        44054616
## 3 Kenya     KEN   M            46851496        47894668        48948140
## 4 Kenya     KEN   F            46851496        47894668        48948140
##   population_2018 population_2019 disorders_2015 disorders_2016 disorders_2017
##             <dbl>           <dbl>          <dbl>          <dbl>          <dbl>
## 1        44413592        44745516          3.07           3.04           3.02 
## 2        44413592        44745516          1.17           1.16           1.16 
## 3        49953300        50951452          0.747          0.744          0.742
## 4        49953300        50951452          0.654          0.655          0.658
##   disorders_2018 disorders_2019
##            <dbl>          <dbl>
## 1          3.04           3.11 
## 2          1.17           1.19 
## 3          0.752          0.773
## 4          0.662          0.669

Note that values_from takes a vector: the names of the value columns that we want to spread into columns by year.

Exercise 14.11 Take the Ice extent data. Let’s only consider columns year, month, region, extent and area–remove columns data-type and time.

  1. Reshape the dataset in such a way that there is only a single row for each year and region, and the values for each month are in their separate columns.
  2. Reshape it into a wide form so that each year-month combination is in a single row, but North and South regions are in separate columns.

See the solution

14.2.5 Applications

Above we introduced reshaping. But there we assumed that you know what kind of shape you want the data to be in. This section provides a few examples that help you to decide what shape is appropriate for a particular task.

Consider the traffic fatalities data. It looks like

fatalities <- read_delim("data/fatalities.csv")
fatalities %>%
   sample_n(4)
## # A tibble: 4 × 4
##    year state fatal      pop
##   <dbl> <chr> <dbl>    <dbl>
## 1  1984 OR      572 2675998.
## 2  1985 WA      744 4408994.
## 3  1985 OR      559 2686996.
## 4  1986 MN      571 4212996

This dataset has two groupings: year and state, and two values: number of fatalities, and population. It is stored in a long form.

You want to plot the traffic fatalities over time by state as a line plot with different colors denoting different states (see Section 13.4.2). Is this a good shape if you intend to use ggplot?


Traffic fatalities over time by state. Long form data is exceptionally well suited for ggplot.

The crucial part, when using ggplot, is the aesthetics mapping of the form aes(x, y, col = ...). It turns out this is exactly the form of data you want: aes(year, fatal, col = state) is designed for long form data: it takes the fatal column, separates it by state, and draws each state’s line in a separate color:

fatalities %>%
   ggplot(aes(year, fatal,
              col = state)) +
   geom_line()

This is a good approach–it makes sense to let the computer figure out how many different states there are in the state column. It also makes sense to leave it to us to decide which aesthetic we want to depend on state; maybe we want this to be line style instead of color (see the sketch below).
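
For instance, the line style variant is just a different aesthetic:

fatalities %>%
   ggplot(aes(year, fatal,
              linetype = state)) +
   geom_line()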

Do you always want long form data for such a plot? Well, it depends on the software you are using. Not all software is designed to plot data in long form. Base-R matplot(), for instance, works differently: you call it as matplot(x, Y, type = "l"), where x is the vector of x-values and Y is the matrix of y-values. This plots all columns of Y against x. type = "l" asks it to draw lines (otherwise it will be a scatterplot). Here we need wide form data instead:

fWide <- fatalities %>%
   select(!pop) %>%
                           # all vars, except 'pop'
   pivot_wider(names_from = state,
               values_from = fatal)
head(fWide, 3)
## # A tibble: 3 × 4
##    year    MN    OR    WA
##   <dbl> <dbl> <dbl> <dbl>
## 1  1982   571   518   748
## 2  1983   555   550   698
## 3  1984   582   572   746

The same plot, made using base-R matplot(). This function expects data in a wide form.

We also need to separate x, here the column year, and the data to be plotted (everything but year):

matplot(fWide$year,
                           # plot year...
        select(fWide, !year),
                           # ...vs everything but year
        type = "l")

As you can see, the plot is basically the same, save for the different fonts, line styles and such. Typically, ggplot is easier to use when you already have a ready-made dataset, but for plotting individual data vectors and matrices, base-R functions tend to be easier. As this example shows, matplot() is also easier to use if your data is already in wide form.

But now consider a somewhat different task: you want to do a trajectory plot (see Section 13.9.2.1) where fatalities in Oregon are on the x-axis and fatalities in Washington on the y-axis. Such plots are fairly common in many applications (although I have never seen one for traffic accidents). Here, unlike in the previous plot, a single point denotes both Oregon and Washington.


Traffic fatalities, Oregon versus Washington.

In order to use ggplot, you need something like aes(OR, WA). However, there are no separate columns for OR and WA in the original data. Hence we need to separate those–this type of plot requires data in wide form. Here we can just re-use the wide table fWide we created above:

fWide %>%
   ggplot(aes(OR, WA)) +
   geom_path() +
   geom_label(aes(label = year))
                           # year labels
                           # for clarity

Exercise 14.12 Use the ice extent data. Both ice area and extent are surface areas measured in the same units (million km2 in these data), so you can put them on the same graph.

Make a line plot of ice extent and area over time, where time is on the x-axis and the two measures are on the y-axis.

The solution

14.3 Date and time

Date and time data is something we use a lot in everyday life. However, it turns out to be a surprisingly complex data type. Below, we work on a few common tasks with the related data structures.

A lot of the problems boil down to the fact that the common time units, like days, months, weeks, even minutes, are not physical units. The only physical unit is the second, with its precisely defined duration. The other units are convenience units whose duration changes. Even more, time is commonly accounted for in a mixed measurement system, incorporating decimal, 12/24-based, 60-based, and 7-based units. Months, with their 28-31 day length, are an attempt to fit the moon’s revolution into that of the earth. On top of that, different regions have different time zones and daylight saving time, some of which are re-arranged each year. Not surprisingly, this creates a lot of issues when working with time data.

14.3.1 Duration

Consider a common task: you are analyzing the results of an athletic competition. There were four runners, and their times were:

bib#    time
 102   58:30
 101   59:45
 104 1:00:15
 103 1:02:30

So the two best athletes ran slightly under one hour, and the two slower ones slightly over it.

This is a common way to express the results. For time intervals less than one hour, we write MM:SS, e.g. 58:30. From the context we understand what it means; in this case it is 58 minutes and 30 seconds. We could stay strictly within the decimal system and physical units and write 3510 seconds instead, but unfortunately, such numbers are much harder to understand than the everyday “58:30”.

We can load such a data file into R as

results <- read_delim(
   "bib#,time
102,58:30
101,59:45
104,1:00:15
103,1:02:30")

This uses the fact that read_delim() can load not just files, but also file content, fed to it as a string. So we write a multi-line string that is just the content in the form of a csv file.

But now we have a problem: The results look like

results
## # A tibble: 4 × 2
##   `bib#` time    
##    <dbl> <time>  
## 1    102 58:30:00
## 2    101 59:45:00
## 3    104 01:00:15
## 4    103 01:02:30

As you can see, the entry “58:30” is not interpreted as 58 minutes and 30 seconds but as 58 hours and 30 minutes! The runners #102 and #101 apparently spent more than two days on the course! This can be confirmed by computing the time range:

results %>%
   pull(time) %>%
   range()
## Time differences in secs
## [1]   3615 215100

Indeed, the winner has time 3615 seconds (1 hour and 15 seconds–the runner #104), and the slowest time is 215100 seconds. That is what 59 hours and 45 minutes amounts to.

In a small case like this, it is easy to remedy the problem: we can just change “58:30” to “0:58:30” by hand. But in larger data files, this is not feasible. We need to do it in code. The strategy we pursue below is as follows: first, read the time column as text, then add “00:” in front of those values that only contain a single colon, and finally, convert the result to a time duration.

First, we read the data while telling read_delim() that the time column is of type “character” (see the documentation for details):

results <- read_delim(
   "bib#,time
102,58:30
101,59:45
104,1:00:15
103,1:02:30",
col_types = list(time = "character"))
                           # tell that column 'time' is of type 'character'
results
## # A tibble: 4 × 2
##   `bib#` time   
##    <dbl> <chr>  
## 1    102 58:30  
## 2    101 59:45  
## 3    104 1:00:15
## 4    103 1:02:30

Next, we can just use mutate() and ifelse() to add “00:” in front of the values that contain a single colon only:

results %>%
   mutate(time = ifelse(str_count(time, ":") == 1,
                        paste0("00:", time),
                        time))
## # A tibble: 4 × 2
##   `bib#` time    
##    <dbl> <chr>   
## 1    102 00:58:30
## 2    101 00:59:45
## 3    104 1:00:15 
## 4    103 1:02:30

This code:

  1. counts the colons in the column time: str_count(time, ":")
  2. if there is a single colon, pastes “00:” in front of the value: paste0("00:", time)
  3. otherwise keeps time as it is (just time)

Now all the time column values have hour markers in front of the minutes, but the column is still of character type, not time. We can convert it to time using the parse_time() function from the readr package:

results <- results %>%
   mutate(time = ifelse(str_count(time, ":") == 1,
                        paste0("00:", time),
                        time)) %>%
   mutate(time = parse_time(time))

Obviously, we can achieve the same in a single mutate() call.
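
For instance, starting from the character version of results:

results <- results %>%
   mutate(time = parse_time(
      ifelse(str_count(time, ":") == 1,
             paste0("00:", time),
             time)))
                           # pad and parse in one step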

Now the time is understood in a correct manner: the winner is

results %>%
   filter(rank(time) == 1)
## # A tibble: 1 × 2
##   `bib#` time  
##    <dbl> <time>
## 1    102 58'30"
                           # athlete #102 is the winner!
results %>%
   ggplot(aes(time)) +
   geom_histogram()


14.4 More about data cleaning

TBD: more about data cleaning

14.4.1 Separate a single column into multiple columns

TBD: new motivation and more complex drinking data example

But the new long-form drinking data has a problem. Namely, the sex.year variable is a combination of both sex and year, and that makes it difficult to select only certain sexes or years. It would be easier to replace it with two separate columns: sex and year.

Fortunately, the tidyr package has several functions for this task. Here the appropriate one is separate_wider_position(), which splits a column into multiple columns by position. Its main arguments are cols, which tells which columns to separate, and widths, which specifies how many characters go into each new column. There are other arguments too, e.g. whether the original column is to be removed, what to do if the column is too wide or too narrow, etc.

The most important of these arguments is widths. It must be a named vector, describing the widths of the fields that go into columns. Its names will be transformed into the corresponding column names. As the long form drinking data has sex.year values in the form SYYYY, we need to assign width 1 for sex, and width 4 for year. So the argument might look like

widths = c(sex = 1, year = 4)

Here we give the columns natural names, sex for sex and year for year:

longDrinking %>%
   separate_wider_position(sex.year,
                           widths = c(sex = 1, year=4)) %>%
   head()
## Error in eval(expr, envir, enclos): object 'longDrinking' not found

Note that separate_wider_position() does not convert the column types. It is fine that sex is a character, but we may want to convert year to a number. So the final code may look like:

longDrinking %>%
   separate_wider_position(sex.year,
                           widths = c(sex = 1, year=4)) %>%
   mutate(year = as.numeric(year)) %>%
   head()
## Error in eval(expr, envir, enclos): object 'longDrinking' not found

This form is the most convenient one to use for filtering and many other data processing tasks.
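
As the longDrinking data is not shown here, below is a self-contained toy version of the same operation (the data values are made up):

toy <- data.frame(sex.year = c("M2015", "F2015", "M2016"),
                  drinking = c(3.1, 1.2, 3.0))
toy %>%
   separate_wider_position(sex.year,
                           widths = c(sex = 1, year = 4))
                           # 'sex.year' is replaced by two columns:
                           # sex ("M", "F") and year ("2015", ...)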

Besides separating by position, there are also functions that separate based on a delimiter (separate_wider_delim()) and on regular expressions (separate_wider_regex()). And obviously, one may also use the general string processing functionality and regular expressions to extract the relevant substrings.
