Chapter 14 More about data manipulations: merging, reshaping and cleaning data

In the dplyr section we discussed a large class of data manipulation operations, such as selecting variables, filtering observations, arranging data in certain order and computing new values. Here we add additional operations: reshaping.

14.1 Merging data: joins

When working with real-world data, you’ll often find that that data is stored across multiple files or across multiple data frames. This is done for a number of reasons. For one, it can help to reduce memory usage. For example, if you had a data frame containing information on students enrolled in different courses, then you can store information about each course, such as the instructor’s name, meeting time, and classroom, in a separate data frame. This saves a lot of space, otherwise you have to duplicate that information for every student who takes the same course. You may also want to keep your data better organized by having student information in one file, and course information in another. This separation and organization of data is a core concern in the design of relational databases.

Below, we discuss how to combine such separately organized information into a single data frame.

14.1.1 Combining data frames

In simple cases, we may just combine datasets, row-by-row, or column by column. Consider a student dataset

students <- data.frame(
   student = c("Cecilia", "Marco", "Guangqi"),
   "GPA 2023-au" = c(4.0, 3.8, 2.8),
   check.names = FALSE)
students
##   student GPA 2023-au
## 1 Cecilia         4.0
## 2   Marco         3.8
## 3 Guangqi         2.8

This contains the students’ names and their GPA for 2023-au quarter.

When the following quarter ends, we may store their new GPA in a vector

gpa2024 <- c(3.9, 3.7, 3.0)
gpa2024
## [1] 3.9 3.7 3.0

The numbers here is in the same order as in the original data frame, i.e. “3.9” for Cecilia and “2.8” for Guangqi. Now we can just add the vector to the GPA data frame using cbind() (combine column-wise):

cbind(students, "GPA 2024-wi" = gpa2024)
##   student GPA 2023-au GPA 2024-wi
## 1 Cecilia         4.0         3.9
## 2   Marco         3.8         3.7
## 3 Guangqi         2.8         3.0

This adds another column (vector) to the data frame, and calls it “GPA 2024-wi”. gpa2024 it is just a single vector of data. But we can also combine two data frames in the similar fashion, row-by-row.

When a new student Nong arrives to the college, we can create a separate data frame for his results:

nongsGPA <- data.frame(
   student = "Nong",
   "GPA 2023-au" = 3.6,
   check.names = FALSE)
nongsGPA
##   student GPA 2023-au
## 1    Nong         3.6

Next, we want to add it as a new row to our students’ data. Two data frames can be combined row-wise in a fairly similar fashion using rbind() (combine row-wise):

rbind(students, nongsGPA)
##   student GPA 2023-au
## 1 Cecilia         4.0
## 2   Marco         3.8
## 3 Guangqi         2.8
## 4    Nong         3.6

But note the difference–now we add a row to the data frame, and a row contains different kinds of values, here one character (“Nong”) and one number (“3.6”). Hence we need to convert the row to a data frame, and use exactly the same column names as in the students data frame above. rbind() ensures that the correct columns are matched.

The combining we described here is very simple, in a sense that each row will be combined with the corresponding row, or each column with the corresponding column. If the students in the 2024-wi data are in a different order, the result will be wrong.

Below, we discuss more complex merging that ensures that even if the order differs, the results are still correct.

14.1.2 Merging by key

The simple way of combining data frames what we did above–just adding those row-wise, is fairly limited. What happens if new students join the college, or previous students leave? Or if just the data is ordered differently? In such cases we need to merge rows in two tables not by their order (row number), but we need to find the corresponding rows in both tables. This is merging by key.

14.1.2.1 Merge key

For instance, imagine that in winter 2024 the student list looks somewhat different:

students24 <- data.frame(
   student = c("Cecilia", "Guangqi", "Nong"),
   "GPA 2024-wi" = c(3.9, 3.0, 3.6),
   check.names = FALSE)
students24
##   student GPA 2024-wi
## 1 Cecilia         3.9
## 2 Guangqi         3.0
## 3    Nong         3.6

If we were just to use cbind() here, we’d create quite a mess:

cbind(students, students24$`GPA 2024-wi`)
##   student GPA 2023-au students24$`GPA 2024-wi`
## 1 Cecilia         4.0                      3.9
## 2   Marco         3.8                      3.0
## 3 Guangqi         2.8                      3.6

While all is well with Cecilia, Marco is now assigned Guangqi’s winter quarter grades, and Guangqi in turn gets Nong’s winter grades. This will cause a lot of trouble for the university!

What we need to do instead is to find which line in the first dataset (autumn 2023) corresponds to which row in the second dataset (winter 2024), and combine the corresponding rows only. This is the idea of merging or joining (because you are “merging” or “joining” the data frames together).

Before you can actually do any merging, you need to find some kind of identifiers, values that determine which row corresponds in the first data frame corresponds to which row in the second data frame. In this case it is easy–it is the students’ names. We have to ensure that Cecilia’s grades from one quarter are merged with her grades in the next quarter, and the same for Quangqi, Marco and others.

This value, or a set of values, is called merge key. We cannot do any merging, before we have identified the merge key. There are multiple kinds of merge keys:

  • primary key is the key in the current data frame.
  • foreign key is the key in the other data frame. It may be a column with the same name, or it may also have different names across different datasets. And be aware–sometimes a column with a similar name has a different meaning in different datasets!
  • Often, a single column is enough to identify the records. The example above can be merged correctly by using just name. But sometimes you need multiple columns, such as year and month. This is called compound key, a key that consists of multiple columns.
  • If there is no good identifier, we may create one from what we have. This is called surrogate key. For instance, it is common to link personal datasets using name and day of birth. This usually works, but may create problems if two persons of the same name are born in the same day. Another problem with this surrogate key is the habits to write names in many slightly different forms. For instance, are “John Smith”, “John A. Smith” and “John Alexander Smit” the same person?

14.1.3 What to merge and what not to merge?

Before you even begin with merging, you should ask if these datasets can be merged? And even if you can merge these, does it make sense? Do you want these to be merged? We discuss here a few examples.

Students:

name course quarter year
Ernst info201 2023-au 1
Ernst math126 2023-au 1
Ernst cse120 2024-wi 1
Eberhard arch101 2023-au 2
Eberhard info201 2024-wi 2

Courses:

name quarter day time instructor
info201 2023-au Tue 10:30 Schnellenberg
info201 2024-wi Tue 10:30 Schnellenberg
math126 2023-au Thu 1:30 Recke
arch101 2023-au Wed 3:30 Plater

Example students’ and courses’ data.

Imagine you are working at a college and you have access to students’ data (name, day of birth, address, courses they have taken, grades, and so on). You also have access to the college’s course catalog, including course number, quarter, instructor’s name, etc. Can you merge these two datasets? What might be the merge key? Why may you want to do that, what interesting questions can be answered in this way?

These two data frames can easily be merged. The key might be the course number. But should they key contain the quarter? The answer is “maybe”, depending on what kind of questions are we asking.

  • If we ignore the quarter, then we merge a course called “info201” with all students who have taken “info201”. This might be useful to compare that course with other similar courses, to analyze the average grades, which other courses the students tend to take, and other course-specific questions.
  • If we include quarter in the merge key, then we can answer questions about a particular course instance. For instance, we can compare different instructors of the same course, or analyze at which year students tend to take info201.
  • Note that we can use one of the columns with common name, quarter, as the merge key, but we cannot use the other name, as the key. This is because in these two different tables name means different things: it is the student’s name in the first table and the course name in the second table.

All these examples are valid question that the college might want to analyze.

Exercise 14.1 Assume we merge students with courses using just course number and ignoring everything else. Provide additional examples of questions that one might want to answer with the resulting data.

date temp precip sunrise
2024-02-21 9 0.5 7:03
2024-02-22 12 0.0 7:01
2024-02-23 12 0.0 6:59
2024-02-24 5 1.0 6:57

Example weather data

But imagine now that we have also weather data: for every day, we know temperature, precipitation, wind speed, and also sunrise/sunset time. Can we combine students and weather in a meaningful way? What kind of questions might we answer with such a combination?

  • First, weather is not something that college can do much about. The schedules are typically planned months in advance, and it is hard to change much based on the last minute weather forecast. This means weather data is perhaps not that relevant for colleges.
  • Second, what kind of meaningful questions can we answer if we have such combined data? The only questions that come to my mind is to see if there is any relationship between temperature and grades, or more likely, between sunlight hours, what time of day the course is offered, and students’ performance.
  • Third, what might be the merge key? There are no course numbers nor student names in the weather table. The only possible key is date. Neither student data nor course data contains date though–but course data includes quarter. So now we can compute average weather or time of sunset for the quarters from the weather data, merge this with course data, and finally combine with students’ data. This sounds complicated, but it is perfectly doable.

These examples are also valid questions, but probably ones that the colleges consider less important.

14.1.4 Different types of joins

When merging data, the computer will always find the common key values in both data frames, and merge the corresponding rows. But as usually, the devil is in the details. What should the computer do if there is not match for some of the keys in one of the data frames? Or if a key in one table corresponds to multiple lines in the other table? This is where one has to decide what is a good way to proceed. Below, we discuss the details based on left_join(), and thereafter briefly introduce the other joins.

14.1.4.1 When the first data frame matters: left_join()

The left_join() operation is one of the common merge operations. It looks for matching columns between the two data frames, and then returns a new data frame that is otherwise identical to the first (“left”) data frame, but with extra columns, taken from the second data frame added on. Importantly, the left-part of left_join() ensures that all rows from the left data frame (the first table) are preserved, and no new rows are added. This is achieved by:
  1. If there are no matching key for a row in the first table, the corresponding entries of the new column are filled with missings (NA).
  2. If there are rows in the second table that do not have corresponding rows in the first table, then those rows are dropped.

Below is an example of command

left_join(songs, artists)
This will merge a table songs with artists. Here songs is the left data frame and artists is the right data frame (because songs is left of artists). The figure below explains the process:
Diagram of the left_join() function (by Nathan Stephens).

How left_join() works: the first table is preserved, the new column is filled with NA if no matching pair is found.

  1. left-join preserves the left table: the gray songs data frame is preserved intact in the result.
  2. the join detects the merge key. Here it is name–the common column name in these two tables. But you can also specify it with by argument. The column name of the key(s) may differ between the tables!
  3. The lines in the first table are merged with the corresponding ones in the second: we learn that in “Across the Universe”, John plays guitar; in “Come together”, John plays guitar; and in “Hello, Goodbye”, Paul plays bass.
  4. But there is no match for artist “Buddy” in “Peggy Sue”. Hence that field will be left NA. This is the final merged dataset.

Exercise 14.2 What do you think, what will be the result of left_join(artists, songs)? In particular, what song(s) will be assigned to John?

See the solution

Below, we discuss a few more details using another example. Imagine you are a data scientist at a college, and you have two tables: majors for student majors, and contact for student contact information. Both of these tables use column id, the university-wide student id. (You may also have a third table that links the id to their real name, but we are not concerned about that here.)

contact <- data.frame(
   id = c(1, 2, 3, 4),  # student id-s
   email = c("lisoc@college.edu", "zemath@college.edu",
             "bioarlin@college.edu", "uv98@college.edu")
)
contact
##   id                email
## 1  1    lisoc@college.edu
## 2  2   zemath@college.edu
## 3  3 bioarlin@college.edu
## 4  4     uv98@college.edu
majors <- data.frame(
   studentId = c(1, 2, 3),  # student id-s
   major = c("sociology", "math", "biology")
)
majors
##   studentId     major
## 1         1 sociology
## 2         2      math
## 3         3   biology

Both tables have an id-column, allowing you to match the rows from the contact table to the majors table. However, because different tables call the id-column differently, we need to explicitly tell what is the merge key. We can left-merge contact information with majors as

left_join(contact, majors,
          by = c("id" = "studentId"))
##   id                email     major
## 1  1    lisoc@college.edu sociology
## 2  2   zemath@college.edu      math
## 3  3 bioarlin@college.edu   biology
## 4  4     uv98@college.edu      <NA>

Note the by argument: it is a named vector (see Section 4.4.4), where the key column name in the first data frame goes left and of the second data frame goes right. The result is very similar to the songs-artists example: for students 1, 2, and 3 we have both email and major, but we do not know the major of student 4.

In short, left_join() returns all of the rows from the first table, with all of the columns from both tables.

Notice that because of how a left join is defined, the argument order matters! The resulting table only has rows for elements in the left (first) table; any unmatched elements in the second table are lost. If you switch the order of the operands, you would only have information for students with majors:

left_join(contact, majors,
          by = c("id" = "studentId"))
##   id                email     major
## 1  1    lisoc@college.edu sociology
## 2  2   zemath@college.edu      math
## 3  3 bioarlin@college.edu   biology
## 4  4     uv98@college.edu      <NA>

You don’t get any information for student #4, because they didn’t have a record in the left-hand table!

14.1.4.2 Other joins

But it is not always that you want to preserve all rows of the first data frame and drop the “orphans” from the second one. In fact, dplyr (and relational database systems in general) provides a number of different kinds of joins, each of which decides the rows to be included in its own way. In all these cases, all columns from both tables will be included, with rows taking on any values from their matches in the second table.

  • left_join All rows from the first (left) data frame are returned. That is, you get all the data from the left-hand table, with extra column values added from the right-hand table. Left-hand rows without a match will have NA in the right-hand columns.

  • right_join All rows from the second (right) data frame are returned. That is, you get all the data from the right-hand table, with extra column values added from the left-hand table. Right-hand rows without a match will have NA in the left-hand columns. This is the “opposite” of a left_join, and the equivalent of switching the operands.

  • inner_join Only rows in both data frames are returned. That is, you get any rows that had matching observations in both tables, with the column values from both tables. There will be no additional NA values created by the join. Observations from the left that had no match in the right, or observations in the right that had no match in the left, will not be returned at all.

  • full_join All rows from both data frames are returned. That is, you get a row for any observation, whether or not it matched. If it happened to match, it will have values from both tables in that row. Observations without a match will have NA in the columns from the other table.

The key to deciding between these is to think about what set of data you want as your set of observations (rows), and which columns you’d be okay with being NA if a record is missing.

Note that these are all mutating joins, which add columns from one table to another. dplyr also provides filtering joins which exclude rows based on whether they have a matching observation in another table, and set operations which combine observations as if they were set elements. See the documentation for more detail on these options, but in this course we’ll be primarily focusing on the mutating joins described above.

TBD: one-to-many merging for map data

TBD: joins and pipes

TBD: when to use what join

14.2 Cleaning data

TBD

14.3 Date and time

Date and time data is something we use a lot in the everyday life. However, it turns out that it is a surprisingly complex data type. Below, we work on a few common tasks with the related data structures.

A lot of the problems boil down to that the common time units like days, months, weeks, even minutes, are not physial units. Physical unit is just second with a precisely defined duration. The other units are convenience units, duration of which changes. Even more, time is commonly accounted with in a mixed measurement system, incorporating decimal, 12/24-based, 60-based, and 7-based units. Months, with their 28-31 day length, are an attempt to make moon’s revolution to fit into that of earth. On top of that, different regions have different time zones and daylight saving time, some of which is re-arranged each year. Not surprisingly, this creates a lot of issues when working with time data.

14.3.1 Duration

Consider a common task: you are analyzing the results of an athletic competition. There were four runners, and their time was:

bib# time
102 58:30
101 59:45
104 1:00:15
103 1:02:30

So the the two best athletes run slightly below 1hr and the two slower ones slightly over that.

This is a common way to express the results. For time intervals less than 1hr, we write MM:SS, e.g. 58:30. From the context we understand what it means, in this case it is 58 minutes and 30 seconds We can stay strictly within decimal system and physical units and write 3510 seconds instead, but unfortunately, such numbers are much harder to understand than the everyday “58:30”.

We can load such data file into R as

results <- read_delim(
   "bib#,time
102,58:30
101,59:45
104,1:00:15
103,1:02:30")

This uses the fact that read_delim() can load not just files, but also file content, fed to it as a string. So we write a multi-line string that is just the content in the form or a csv file.

But now we have a problem: The results look like

results
## # A tibble: 4 × 2
##   `bib#` time    
##    <dbl> <time>  
## 1    102 58:30:00
## 2    101 59:45:00
## 3    104 01:00:15
## 4    103 01:02:30

As you can see, the entry “58:30” is not interpreted as 58 minutes and 30 seconds but as 58 hours and 30 minutes! The runners #102 and #102 apparently spent more than two days on the course! This can be confirmed by computing the time range:

results %>%
   pull(time) %>%
   range()
## Time differences in secs
## [1]   3615 215100

Indeed the winner has time 3615 seconds (1h 15s, the runner #104), and the slowest one is 215100 seconds. That is what 59 hours and 45 mins amounts to.

In the small case here, it is easy to remedy the problem: we can just change “58:30” to “0:58:30”. But in larger data files, this is not feasible. We need to do this in the code. The strategy we pursue below is as follows: first, read the time column as text, then add “00:” in front of those columns that only contain a single colon, and finally, convert the result to duration.

First, we read data while telling the time has column type “character” (see documentation for details):

results <- read_delim(
   "bib#,time
102,58:30
101,59:45
104,1:00:15
103,1:02:30",
col_types = list(time = "character"))
                           # tell that column 'time' is of type 'character'
results
## # A tibble: 4 × 2
##   `bib#` time   
##    <dbl> <chr>  
## 1    102 58:30  
## 2    101 59:45  
## 3    104 1:00:15
## 4    103 1:02:30

next, we can just use mutate() and ifelse() to add “00:” in front of the columns that contain a single colon only:

results %>%
   mutate(time = ifelse(str_count(time, ":") == 1,
                        paste0("00:", time),
                        time))
## # A tibble: 4 × 2
##   `bib#` time    
##    <dbl> <chr>   
## 1    102 00:58:30
## 2    101 00:59:45
## 3    104 1:00:15 
## 4    103 1:02:30

This code:

  1. counts the colons in the column time: str_count(time, ":")
  2. if there is a single colon, pastes “00:” in front of the value paste0("00:", time)
  3. otherwise keeps time as it is (just time)

Now all the time column values have hour markers in front of minutes, but the column is still of character type, not time. We can convert it to time using parse_time() function from the readr package:

results <- results %>%
   mutate(time = ifelse(str_count(time, ":") == 1,
                        paste0("00:", time),
                        time)) %>%
   mutate(time = parse_time(time))

Obviously, we can achieve the same in a single mutate.

Now the time is understood in a correct manner: the winner is

results %>%
   filter(rank(time) == 1)
## # A tibble: 1 × 2
##   `bib#` time  
##    <dbl> <time>
## 1    102 58'30"
                           # athlete #102 is the winner!
results %>%
   ggplot(aes(time)) +
   geom_histogram()

plot of chunk unnamed-chunk-22

14.4 Reshaping data

Above, in Section 11, we discussed how many types of data can be stored in data frames, rectangular tables of rows and columns. But it turns out that the exact same information can be stored in multiple ways. Here we explain the two most popular forms, long form and wide form, and reshaping–the process of transforming the one to the other.

14.4.1 Long form and wide form data

First, let’s look at drinking data from the Drinking-patterns study across the U.S. states (not included in this repo). Call this dataset drinking. The sample here contains the percentage of respondents in three states who have had at least one alcoholic drink in past 30 days in 2009 to 2012:

state 2009 2010 2011 2012
Tennessee 48.3 48.1 39.6 48.1
North Carolina 60.3 59.7 60.4 59.2
Pennsylvania 36.0 37.3 40.6 41.2

These data is presented in a way that is commonly seen in that sort of tables. Each row in the dataset is a state, and each column (except state) is measures drinking. The column names also carry important information, they tell which year the particular number was measured. And finally, all the cells contain a similar measure: percentage of individuals who have had at least one drink in the past 30 days. This is an example of data in wide form.

This form of data has a number of uses:

  • It is well suited for quick visual comparison. For instance, one can easily compare North Carolina in 2009 with Pennsylvania in 2009 (60.3 versus 36.0), North Carolina in 2011 and 2012 (60.4 versus 41.2).
  • It is easy to extract individual vectors. For instance, if you want to plot alcohol usage in 2009 versus 2012, then this form of data is very well suited for the task.

But the table above is not convenient for other common tasks. For instance, if we want to find the year of largest drinking across the states, then we need to either compare all columns manually, or write a loop over the columns instead. This is tricky. But this kind of tasks are common when doing data processing.

For such tasks, it is better to store the same data in a slightly different format:

state year drinking
Tennessee 2009 48.3
Tennessee 2010 48.1
North Carolina 2009 60.3
North Carolina 2010 59.7
Pennsylvania 2009 36
Pennsylvania 2010 37.3

Note that these two datasets contain exactly the same data. For instance, Tennessee in 2009 had drinking rate 48.3% while in North Carolina in 2010 it was 59.7%. This table is called data in long form. What distinguishes this table from the previous one is that here each column contains different kind of values: the first column is state, the second column is year, and the third one is the drinking percentage. This was not the case with the wide form, in that table all the columns, except the state names, contain drinking percentage.

Let’ take a closer look on how to find a certain value, for instance, the drinking propensity in Tennessee in 2010 (48.1%). In the wide form table, we need to find the row by looking for the state in the first column, and find the matching column through column names. In the long form table table, we need to find first the state throught the state column as well. But now this does not give us a single row for all Tennessee data, but four rows, one for each year. So our second task is to find the year of interest in the year column among the rows that correspond to Tennessee. Here it is 2nd row, and the value is 48.1 again.

For humans, the first approach is probably preferable. But for computers, not necessarily. The problem is that in the first table we were using two different type of methods–filtering rows based on a criterion (state = Tennessee) and also selecting columns based on a criterion (column = 2010). Such different operations may not be well suited for data processing pipelines. In the second case, we just used two similar filtering conditions: first, state = Tennessee, and second, year = 2010. This turns out a much simpler approach for many tasks.

14.4.2 Reshaping between wide and long form

How can we transform the original, wide form data into a long form? And what does the long form and wide form exactly mean?

The main trait of long form data is that each column depicts a different kind of value. If there are any inherent groupings, all the values that belong to the same group are spread into multiple rows. Additionally, we need one or more distinct columns to mark the group membership. In the drinking example, there was only one kind of value (drinking percentage) but two groupings: state and year. So the long-form data would require three columns: state, year and drinking. The first two of these denot groups, and the last one are the values. So the wide form might look like:

state year drinking
Tennessee 2009 48.3
Tennessee 2010 48.1

The wide form, however, has all values of one (or more) groups in the same row. In the drinking example, these were all years for a state:

state 2009 2010 2011 2012
Tennessee 48.3 48.1 39.6 48.1
North Carolina 60.3 59.7 60.4 59.2

Note how we have four separate columns of values, one for each year. But different states are still in separate rows. So in a sense it is not a pure wide form: it is long form in terms of states but wide form in terms of sex-year combinations.

Exercise 14.3 In the example above, states are in rows and years are in columns. Can you swap this around–years in rows and states in columns? How would the result look like?

See the solution

Exercise 14.4 How might the pure wide form data look like where states are not in separate rows?

See the solution

Alternatively, we can think that what distinguishes the wide and long form is what each row represents. In the long form, it is a single group element, in this example it is a state-year combination. In the wide form, it is a group. In the example above, it is just a state–all years are in the same row.

Some kind of groups are needed to make the distinction between wide and long form meaningful. For instance, look at the height-weight data. This represents a few health-related characteristics of five teenagers:

sex age height weight
Female 16 173 58.5
Female 17 165 56.7
Male 17 170 61.2
Male 16 163 54.4
Male 18 170 63.5

Is this dataset in long form or in wide form? In order to answer this question, we need to understand three things: 1) what do the rows represent; 2) what are the values–the actual data values in the table (in one or in multiple columns), and 3) what are the group indicators (in column names or in dedicated columns). The answers are:

  1. a row represents one person (this can be inferred, but also learned from the documentation).
  2. We can immediately see that each column represents a value. age, height and weight are, obviously, values. Here sex is a value too: it is not a group, but the sex of that particular person.
  3. Finally, we do not have any group indicators! This data is not grouped in any way, it is individual data, and hence we do not have group indicators.

How might the wide form of these data look like? Remember: wide form has all values for the group in the same row. But here we have only trivial groups–every person forms their own group and hence the data is also in a wide form!

Distinction between wide and long form is meaningful only if the data contains some kind of groupings.

14.4.3 pivot_longer() and pivot_wider()

Fortunately, it is fairly straightforward to convert one data type to the other. This is called reshaping. The tidyr package (part of tidyverse) offers two functions: pivot_longer() and pivot_wider(). Let’s first convert the original table to the long form using pivot_longer(). It takes number of arguments, most important of which are cols, names_to and values_to.

TBD: fix the thing that we only have one sex in the example above

col tells which columns are to be transformed to long format. in the example above, we want to transform B2002, F2002, M2002, B2003, F2003 and M2003. All these columns contain a similar type of value–percentage of those who are drinking. We do not want to convert state: this is name of the state, not percentage of drinkers. The value for col can be specified in a similar fashion as for select(), see Section 12.3.1.

As a result, all these columns will be converted into two columns: the first one contains the column names B2002, F2002 and so on, and the other one contains the corresponding values, 32.5, 25.2, etc. The two columns are necessary because we need both values, and some kind of indicator that explains what these values are. names_to and values_to are names of these two respective columns. So we can reshape this dataset into a long form as:

longDrinking <- drinking %>%
   pivot_longer(!state, names_to = "sex.year", values_to = "drinking")
longDrinking %>%
   head(7)
## # A tibble: 7 × 3
##   state          sex.year drinking
##   <chr>          <chr>       <dbl>
## 1 Tennessee      B2002        32.5
## 2 Tennessee      F2002        25.2
## 3 Tennessee      M2002        40  
## 4 Tennessee      B2003        35.5
## 5 Tennessee      F2003        27.6
## 6 Tennessee      M2003        43.6
## 7 North Carolina B2002        42.9

Long form is good for various data processing pipelines, but sometimes we want to transform it back to wide form. This can be done with pivot_wider(). The most important arguments it takes are names_from and values_from: which column values should be transformed to column names in the wide form, and which column should contain the corresponding values. So let’s get our wide form drinking back:

longDrinking %>%
   pivot_wider(names_from = "sex.year",
               values_from = "drinking") %>%
   head(5)
## # A tibble: 3 × 7
##   state          B2002 F2002 M2002 B2003 F2003 M2003
##   <chr>          <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Tennessee       32.5  25.2  40    35.5  27.6  43.6
## 2 North Carolina  42.9  33.9  52.2  44.1  35.7  52.8
## 3 Pennsylvania    59.8  51.6  68.4  61    53.2  69

And we got our wide form back 😀

Exercise 14.5 Consider the Ice extent data. Let’s only consider columns year, month, region, and area–select only these four columns.

  1. In terms of region, is it in a wide form or in long form?
  2. What might the column names of the wide form of the same dataset look like?
  3. reshape the dataset in a way that there are only a single row for each month, and northern and southern area are in their separate columns.

See the solution

Exercise 14.6 Consider the data frame about the number of patients in a hospital:

pregnant male female
yes NA 11
no 25 20
  • Is this data frame in a wide form or long form?
  • Can you reshape it to the other form?

See the solution

14.4.4 Multiple groupings and multiple values

Now let’s look at a more complex example of reshaping. Consider alcohol disorder data:

disorders <- read_delim("data/alcohol-disorders.csv")
disorders %>%
   head(3)
## # A tibble: 3 × 6
##   country   Code   Year disordersM disordersF population
##   <chr>     <chr> <dbl>      <dbl>      <dbl>      <dbl>
## 1 Argentina ARG    2015       3.07       1.17   43257064
## 2 Argentina ARG    2016       3.04       1.16   43668236
## 3 Argentina ARG    2017       3.02       1.16   44054616

This dataset contains the percentage of population who suffers of alcohol use disorders for a number of countries and a number of years. It also contains the relevant population measures. For instance, in Argentina in 2015, slightly over 3% of the male population and over 1% of female population suffer from the disorders while the total population was approximately 43M.

Are these data in a wide or in a long form? Let’s try to analyze this using the three questions we used above (in Section 14.4.2):

  1. What does a row represent? Here, it is obviously a country-year combination.

  2. What are the values? These data contain three value columns: male disorders, female disorders, and population. Year is part of the grouping, and both country and Code are grouping as well. Moreover, country and Code contain exactly the same information, each country is uniquely associated with its code.

  3. What are the group indicators? It is fairly obvious that both country and year are groups. Moreover, country is given by two columns, country and Code, we might remove one and still have all the same information in data. So we have three group columns, one of which is redundant.

    But we have more groups: namely males and females. Indeed, disordersM and disordersF contain the same kind of values, percentage of population. Just this group indicator is not in a dedicated column, but spread into names of two columns instead.

TBD: complete multiple grouping reshape.

14.5 Separate a single column into multiple columns

TBD: new motivation and more complex drinking data example

But the new long-form drinking data has a problem. Namely, the sex.year variable is a combination of both sex and year, and that makes it difficult to select only certain genders or years. It would be easier to replace it with two different columns: sex and year.

Fortunately, tidyr package has several functions for this task. Here the appropriate one would be separate_wider_position() that splits a column into multiple columns by position. Its main arguments are cols that tells which columns to separate, and widths that specifices how many characters will go to each new column. There are other arguments, e.g. if the original column is to be removed, what to do if the column is too wide or too narrow, etc.

The most important of these arguments is widths. It must be a named vector, describing the width of the the fields that go into columns. Its names will be transformed to the corresponding column names. As the long form drinking data has sex.year values in the form SYYYY, we need to assign width 1 for sex, and width 4 for year. So the the argument might look like

widths = c(sex = 1, year = 4)

Here we give the columns natural names, sex for sex and year for year:

longDrinking %>%
   separate_wider_position(sex.year,
                           widths = c(sex = 1, year=4)) %>%
   head()
## # A tibble: 6 × 4
##   state     sex   year  drinking
##   <chr>     <chr> <chr>    <dbl>
## 1 Tennessee B     2002      32.5
## 2 Tennessee F     2002      25.2
## 3 Tennessee M     2002      40  
## 4 Tennessee B     2003      35.5
## 5 Tennessee F     2003      27.6
## 6 Tennessee M     2003      43.6

Note that separate does not convert the column types. It is all well that sex is character, but we may want to convert year to a number. So the final code may look like:

longDrinking %>%
   separate_wider_position(sex.year,
                           widths = c(sex = 1, year=4)) %>%
   mutate(year = as.numeric(year)) %>%
   head()
## # A tibble: 6 × 4
##   state     sex    year drinking
##   <chr>     <chr> <dbl>    <dbl>
## 1 Tennessee B      2002     32.5
## 2 Tennessee F      2002     25.2
## 3 Tennessee M      2002     40  
## 4 Tennessee B      2003     35.5
## 5 Tennessee F      2003     27.6
## 6 Tennessee M      2003     43.6

This form is the most convenient one to use for filtering and many other data processing tasks.

Besides separating by position, there are also functions to separate based by a delimiter, and by regular expressions. Obviously, there are many other options, e.g. one may use string processing functionality to separate certain substrings and regular expressions.

thematic::thematic_on(font = thematic::font_spec(scale=1.8))
# theme_set(text = element_text(size = 24))