Chapter 14 More about data manipulations: merging, reshaping and cleaning data
In the dplyr section we discussed a large class of data manipulation operations, such as selecting variables, filtering observations, arranging data in certain order and computing new values. Here we add additional operations: reshaping.
14.1 Merging data: joins
When working with real-world data, you’ll often find that that data is stored across multiple files or across multiple data frames. This is done for a number of reasons. For one, it can help to reduce memory usage. For example, if you had a data frame containing information on students enrolled in different courses, then you can store information about each course, such as the instructor’s name, meeting time, and classroom, in a separate data frame. This saves a lot of space, otherwise you have to duplicate that information for every student who takes the same course. You may also want to keep your data better organized by having student information in one file, and course information in another. This separation and organization of data is a core concern in the design of relational databases.
Below, we discuss how to combine such separately organized information into a single data frame.
14.1.1 Combining data frames
In simple cases, we may just combine datasets, row-by-row, or column by column. Consider a student dataset
data.frame(
students <-student = c("Cecilia", "Marco", "Guangqi"),
"GPA 2023-au" = c(4.0, 3.8, 2.8),
check.names = FALSE)
students
## student GPA 2023-au
## 1 Cecilia 4.0
## 2 Marco 3.8
## 3 Guangqi 2.8
This contains the students’ names and their GPA for 2023-au quarter.
When the following quarter ends, we may store their new GPA in a vector
c(3.9, 3.7, 3.0)
gpa2024 <- gpa2024
## [1] 3.9 3.7 3.0
The numbers here is in the same order as in the original data frame,
i.e. “3.9” for Cecilia and “2.8” for Guangqi.
Now we can
just add the vector to the GPA data frame
using cbind()
(combine column-wise):
cbind(students, "GPA 2024-wi" = gpa2024)
## student GPA 2023-au GPA 2024-wi
## 1 Cecilia 4.0 3.9
## 2 Marco 3.8 3.7
## 3 Guangqi 2.8 3.0
This adds another column (vector) to the data frame, and calls it “GPA 2024-wi”. gpa2024 it is just a single vector of data. But we can also combine two data frames in the similar fashion, row-by-row.
When a new student Nong arrives to the college, we can create a separate data frame for his results:
data.frame(
nongsGPA <-student = "Nong",
"GPA 2023-au" = 3.6,
check.names = FALSE)
nongsGPA
## student GPA 2023-au
## 1 Nong 3.6
Next, we want to add it as a new row to our students’ data.
Two data frames can be combined row-wise in a fairly similar fashion
using rbind()
(combine row-wise):
rbind(students, nongsGPA)
## student GPA 2023-au
## 1 Cecilia 4.0
## 2 Marco 3.8
## 3 Guangqi 2.8
## 4 Nong 3.6
But note the difference–now we add a row to the data frame, and a row
contains different kinds of values, here one character (“Nong”) and
one number (“3.6”). Hence we need to convert the row to a data
frame, and use exactly the same column names as in the students data
frame above. rbind()
ensures
that the correct columns are matched.
The combining we described here is very simple, in a sense that each row will be combined with the corresponding row, or each column with the corresponding column. If the students in the 2024-wi data are in a different order, the result will be wrong.
Below, we discuss more complex merging that ensures that even if the order differs, the results are still correct.
14.1.2 Merging by key
The simple way of combining data frames what we did above–just adding those row-wise, is fairly limited. What happens if new students join the college, or previous students leave? Or if just the data is ordered differently? In such cases we need to merge rows in two tables not by their order (row number), but we need to find the corresponding rows in both tables. This is merging by key.
14.1.2.1 Merge key
For instance, imagine that in winter 2024 the student list looks somewhat different:
data.frame(
students24 <-student = c("Cecilia", "Guangqi", "Nong"),
"GPA 2024-wi" = c(3.9, 3.0, 3.6),
check.names = FALSE)
students24
## student GPA 2024-wi
## 1 Cecilia 3.9
## 2 Guangqi 3.0
## 3 Nong 3.6
If we were just to use cbind()
here, we’d create quite a mess:
cbind(students, students24$`GPA 2024-wi`)
## student GPA 2023-au students24$`GPA 2024-wi`
## 1 Cecilia 4.0 3.9
## 2 Marco 3.8 3.0
## 3 Guangqi 2.8 3.6
While all is well with Cecilia, Marco is now assigned Guangqi’s winter quarter grades, and Guangqi in turn gets Nong’s winter grades. This will cause a lot of trouble for the university!
What we need to do instead is to find which line in the first dataset (autumn 2023) corresponds to which row in the second dataset (winter 2024), and combine the corresponding rows only. This is the idea of merging or joining (because you are “merging” or “joining” the data frames together).
Before you can actually do any merging, you need to find some kind of identifiers, values that determine which row corresponds in the first data frame corresponds to which row in the second data frame. In this case it is easy–it is the students’ names. We have to ensure that Cecilia’s grades from one quarter are merged with her grades in the next quarter, and the same for Quangqi, Marco and others.
This value, or a set of values, is called merge key. We cannot do any merging, before we have identified the merge key. There are multiple kinds of merge keys:
- primary key is the key in the current data frame.
- foreign key is the key in the other data frame. It may be a column with the same name, or it may also have different names across different datasets. And be aware–sometimes a column with a similar name has a different meaning in different datasets!
- Often, a single column is enough to identify the records. The example above can be merged correctly by using just name. But sometimes you need multiple columns, such as year and month. This is called compound key, a key that consists of multiple columns.
- If there is no good identifier, we may create one from what we have. This is called surrogate key. For instance, it is common to link personal datasets using name and day of birth. This usually works, but may create problems if two persons of the same name are born in the same day. Another problem with this surrogate key is the habits to write names in many slightly different forms. For instance, are “John Smith”, “John A. Smith” and “John Alexander Smit” the same person?
14.1.3 What to merge and what not to merge?
Before you even begin with merging, you should ask if these datasets can be merged? And even if you can merge these, does it make sense? Do you want these to be merged? We discuss here a few examples.
Students:
name | course | quarter | year |
---|---|---|---|
Ernst | info201 | 2023-au | 1 |
Ernst | math126 | 2023-au | 1 |
Ernst | cse120 | 2024-wi | 1 |
Eberhard | arch101 | 2023-au | 2 |
Eberhard | info201 | 2024-wi | 2 |
… | … | … | … |
Courses:
name | quarter | day | time | instructor |
---|---|---|---|---|
info201 | 2023-au | Tue | 10:30 | Schnellenberg |
info201 | 2024-wi | Tue | 10:30 | Schnellenberg |
math126 | 2023-au | Thu | 1:30 | Recke |
arch101 | 2023-au | Wed | 3:30 | Plater |
… | … | … | … | … |
Imagine you are working at a college and you have access to students’ data (name, day of birth, address, courses they have taken, grades, and so on). You also have access to the college’s course catalog, including course number, quarter, instructor’s name, etc. Can you merge these two datasets? What might be the merge key? Why may you want to do that, what interesting questions can be answered in this way?
These two data frames can easily be merged. The key might be the course number. But should they key contain the quarter? The answer is “maybe”, depending on what kind of questions are we asking.
- If we ignore the quarter, then we merge a course called “info201” with all students who have taken “info201”. This might be useful to compare that course with other similar courses, to analyze the average grades, which other courses the students tend to take, and other course-specific questions.
- If we include quarter in the merge key, then we can answer questions about a particular course instance. For instance, we can compare different instructors of the same course, or analyze at which year students tend to take info201.
- Note that we can use one of the columns with common name, quarter, as the merge key, but we cannot use the other name, as the key. This is because in these two different tables name means different things: it is the student’s name in the first table and the course name in the second table.
All these examples are valid question that the college might want to analyze.
Exercise 14.1 Assume we merge students with courses using just course number and ignoring everything else. Provide additional examples of questions that one might want to answer with the resulting data.
date | temp | precip | sunrise |
---|---|---|---|
2024-02-21 | 9 | 0.5 | 7:03 |
2024-02-22 | 12 | 0.0 | 7:01 |
2024-02-23 | 12 | 0.0 | 6:59 |
2024-02-24 | 5 | 1.0 | 6:57 |
But imagine now that we have also weather data: for every day, we know temperature, precipitation, wind speed, and also sunrise/sunset time. Can we combine students and weather in a meaningful way? What kind of questions might we answer with such a combination?
- First, weather is not something that college can do much about. The schedules are typically planned months in advance, and it is hard to change much based on the last minute weather forecast. This means weather data is perhaps not that relevant for colleges.
- Second, what kind of meaningful questions can we answer if we have such combined data? The only questions that come to my mind is to see if there is any relationship between temperature and grades, or more likely, between sunlight hours, what time of day the course is offered, and students’ performance.
- Third, what might be the merge key? There are no course numbers nor student names in the weather table. The only possible key is date. Neither student data nor course data contains date though–but course data includes quarter. So now we can compute average weather or time of sunset for the quarters from the weather data, merge this with course data, and finally combine with students’ data. This sounds complicated, but it is perfectly doable.
These examples are also valid questions, but probably ones that the colleges consider less important.
14.1.4 Different types of joins
When merging data, the computer will always
find the common key values in both data
frames, and merge the corresponding
rows. But as usually, the devil
is in the details. What should the computer do if there is not match
for some of the keys in one of the data frames? Or if a key in one
table corresponds to multiple lines in the other table? This is
where one has to decide what is a good way to proceed. Below, we
discuss the details based on left_join()
, and thereafter briefly
introduce
the other joins.
14.1.4.1 When the first data frame matters: left_join()
The left_join()
operation is one of the common merge
operations. It looks for matching columns between the two data
frames, and then returns a new data frame that is otherwise identical
to the first (“left”) data frame, but with extra columns, taken
from the second data frame added on. Importantly, the left
-part of
left_join()
ensures that all rows from the left data frame (the
first table) are preserved, and no new rows are added. This is
achieved by:
- If there are no matching key for a row in the first table, the
corresponding entries of the new column are filled with missings
(
NA
). - If there are rows in the second table that do not have corresponding rows in the first table, then those rows are dropped.
Below is an example of command
left_join(songs, artists)
- left-join preserves the left table: the gray songs data frame is preserved intact in the result.
- the join detects the merge key. Here it is name–the common
column name in these two tables. But you can also specify it with
by
argument. The column name of the key(s) may differ between the tables! - The lines in the first table are merged with the corresponding ones in the second: we learn that in “Across the Universe”, John plays guitar; in “Come together”, John plays guitar; and in “Hello, Goodbye”, Paul plays bass.
- But there is no match for artist “Buddy” in “Peggy Sue”. Hence
that field will be left
NA
. This is the final merged dataset.
Exercise 14.2 What do you think, what will be the result of left_join(artists, songs)
? In particular, what song(s) will be assigned to John?
See the solution
Below, we discuss a few more details using another example. Imagine you are a data scientist at a college, and you have two tables: majors for student majors, and contact for student contact information. Both of these tables use column id, the university-wide student id. (You may also have a third table that links the id to their real name, but we are not concerned about that here.)
data.frame(
contact <-id = c(1, 2, 3, 4), # student id-s
email = c("lisoc@college.edu", "zemath@college.edu",
"bioarlin@college.edu", "uv98@college.edu")
) contact
## id email
## 1 1 lisoc@college.edu
## 2 2 zemath@college.edu
## 3 3 bioarlin@college.edu
## 4 4 uv98@college.edu
data.frame(
majors <-studentId = c(1, 2, 3), # student id-s
major = c("sociology", "math", "biology")
) majors
## studentId major
## 1 1 sociology
## 2 2 math
## 3 3 biology
Both tables have an id-column, allowing you to match
the rows from the contact
table to the majors
table. However,
because different tables call the id-column differently,
we need to explicitly tell what is
the merge key. We can left-merge contact information with majors as
left_join(contact, majors,
by = c("id" = "studentId"))
## id email major
## 1 1 lisoc@college.edu sociology
## 2 2 zemath@college.edu math
## 3 3 bioarlin@college.edu biology
## 4 4 uv98@college.edu <NA>
Note the by argument: it is a named vector (see Section 4.4.4), where the key column name in the first data frame goes left and of the second data frame goes right. The result is very similar to the songs-artists example: for students 1, 2, and 3 we have both email and major, but we do not know the major of student 4.
In short, left_join()
returns all of the rows from the first table, with all of the columns from both tables.
Notice that because of how a left join is defined, the argument order matters! The resulting table only has rows for elements in the left (first) table; any unmatched elements in the second table are lost. If you switch the order of the operands, you would only have information for students with majors:
left_join(contact, majors,
by = c("id" = "studentId"))
## id email major
## 1 1 lisoc@college.edu sociology
## 2 2 zemath@college.edu math
## 3 3 bioarlin@college.edu biology
## 4 4 uv98@college.edu <NA>
You don’t get any information for student #4, because they didn’t have a record in the left-hand table!
14.1.4.2 Other joins
But it is not always that you want to preserve all rows of the first
data frame and drop the “orphans” from the second one. In fact,
dplyr
(and relational database systems in general) provides
a number of different kinds of joins, each of which decides the rows
to be included in its own way.
In all these cases,
all columns from both tables will be included, with rows taking on any values from their matches in the second table.
left_join
All rows from the first (left) data frame are returned. That is, you get all the data from the left-hand table, with extra column values added from the right-hand table. Left-hand rows without a match will haveNA
in the right-hand columns.right_join
All rows from the second (right) data frame are returned. That is, you get all the data from the right-hand table, with extra column values added from the left-hand table. Right-hand rows without a match will haveNA
in the left-hand columns. This is the “opposite” of aleft_join
, and the equivalent of switching the operands.inner_join
Only rows in both data frames are returned. That is, you get any rows that had matching observations in both tables, with the column values from both tables. There will be no additionalNA
values created by the join. Observations from the left that had no match in the right, or observations in the right that had no match in the left, will not be returned at all.full_join
All rows from both data frames are returned. That is, you get a row for any observation, whether or not it matched. If it happened to match, it will have values from both tables in that row. Observations without a match will haveNA
in the columns from the other table.
The key to deciding between these is to think about what set of data you want as your set of observations (rows), and which columns you’d be okay with being NA
if a record is missing.
Note that these are all mutating joins, which add columns from one table to another. dplyr
also provides filtering joins which exclude rows based on whether they have a matching observation in another table, and set operations which combine observations as if they were set elements. See the documentation for more detail on these options, but in this course we’ll be primarily focusing on the mutating joins described above.
TBD: one-to-many merging for map data
TBD: joins and pipes
TBD: when to use what join
14.3 Date and time
Date and time data is something we use a lot in the everyday life. However, it turns out that it is a surprisingly complex data type. Below, we work on a few common tasks with the related data structures.
A lot of the problems boil down to that the common time units like days, months, weeks, even minutes, are not physial units. Physical unit is just second with a precisely defined duration. The other units are convenience units, duration of which changes. Even more, time is commonly accounted with in a mixed measurement system, incorporating decimal, 12/24-based, 60-based, and 7-based units. Months, with their 28-31 day length, are an attempt to make moon’s revolution to fit into that of earth. On top of that, different regions have different time zones and daylight saving time, some of which is re-arranged each year. Not surprisingly, this creates a lot of issues when working with time data.
14.3.1 Duration
Consider a common task: you are analyzing the results of an athletic competition. There were four runners, and their time was:
bib# | time |
---|---|
102 | 58:30 |
101 | 59:45 |
104 | 1:00:15 |
103 | 1:02:30 |
So the the two best athletes run slightly below 1hr and the two slower ones slightly over that.
This is a common way to express the results. For time intervals less
than 1hr, we write MM:SS
, e.g. 58:30
. From the context we
understand what it means, in this case it is 58 minutes and 30 seconds
We can stay strictly within decimal system and physical units and write
3510 seconds instead, but unfortunately, such numbers are much
harder to understand than the everyday “58:30”.
We can load such data file into R as
read_delim(
results <-"bib#,time
102,58:30
101,59:45
104,1:00:15
103,1:02:30")
This uses the fact that read_delim()
can load not just files, but
also file content, fed to it as a string. So we write a multi-line
string that is just the content in the form or a csv file.
But now we have a problem: The results look like
results
## # A tibble: 4 × 2
## `bib#` time
## <dbl> <time>
## 1 102 58:30:00
## 2 101 59:45:00
## 3 104 01:00:15
## 4 103 01:02:30
As you can see, the entry “58:30” is not interpreted as 58 minutes and 30 seconds but as 58 hours and 30 minutes! The runners #102 and #102 apparently spent more than two days on the course! This can be confirmed by computing the time range:
%>%
results pull(time) %>%
range()
## Time differences in secs
## [1] 3615 215100
Indeed the winner has time 3615 seconds (1h 15s, the runner #104), and the slowest one is 215100 seconds. That is what 59 hours and 45 mins amounts to.
In the small case here, it is easy to remedy the problem: we can just change “58:30” to “0:58:30”. But in larger data files, this is not feasible. We need to do this in the code. The strategy we pursue below is as follows: first, read the time column as text, then add “00:” in front of those columns that only contain a single colon, and finally, convert the result to duration.
First, we read data while telling the time has column type “character” (see documentation for details):
read_delim(
results <-"bib#,time
102,58:30
101,59:45
104,1:00:15
103,1:02:30",
col_types = list(time = "character"))
# tell that column 'time' is of type 'character'
results
## # A tibble: 4 × 2
## `bib#` time
## <dbl> <chr>
## 1 102 58:30
## 2 101 59:45
## 3 104 1:00:15
## 4 103 1:02:30
next, we can just use mutate()
and ifelse()
to add “00:” in front
of the columns that contain a single colon only:
%>%
results mutate(time = ifelse(str_count(time, ":") == 1,
paste0("00:", time),
time))
## # A tibble: 4 × 2
## `bib#` time
## <dbl> <chr>
## 1 102 00:58:30
## 2 101 00:59:45
## 3 104 1:00:15
## 4 103 1:02:30
This code:
- counts the colons in the column time:
str_count(time, ":")
- if there is a single colon, pastes “00:” in front of the value
paste0("00:", time)
- otherwise keeps time as it is (just
time
)
Now all the time column values have hour markers in front of minutes,
but the column is still of character type, not time. We can convert
it to time using parse_time()
function from the readr package:
results %>%
results <- mutate(time = ifelse(str_count(time, ":") == 1,
paste0("00:", time),
%>%
time)) mutate(time = parse_time(time))
Obviously, we can achieve the same in a single mutate.
Now the time is understood in a correct manner: the winner is
%>%
results filter(rank(time) == 1)
## # A tibble: 1 × 2
## `bib#` time
## <dbl> <time>
## 1 102 58'30"
# athlete #102 is the winner!
%>%
results ggplot(aes(time)) +
geom_histogram()
14.4 Reshaping data
Above, in Section 11, we discussed how many types of data can be stored in data frames, rectangular tables of rows and columns. But it turns out that the exact same information can be stored in multiple ways. Here we explain the two most popular forms, long form and wide form, and reshaping–the process of transforming the one to the other.
14.4.1 Long form and wide form data
First, let’s look at drinking data from the Drinking-patterns study across the U.S. states (not included in this repo). Call this dataset drinking. The sample here contains the percentage of respondents in three states who have had at least one alcoholic drink in past 30 days in 2009 to 2012:
state | 2009 | 2010 | 2011 | 2012 |
---|---|---|---|---|
Tennessee | 48.3 | 48.1 | 39.6 | 48.1 |
North Carolina | 60.3 | 59.7 | 60.4 | 59.2 |
Pennsylvania | 36.0 | 37.3 | 40.6 | 41.2 |
These data is presented in a way that is commonly seen in that sort of tables. Each row in the dataset is a state, and each column (except state) is measures drinking. The column names also carry important information, they tell which year the particular number was measured. And finally, all the cells contain a similar measure: percentage of individuals who have had at least one drink in the past 30 days. This is an example of data in wide form.
This form of data has a number of uses:
- It is well suited for quick visual comparison. For instance, one can easily compare North Carolina in 2009 with Pennsylvania in 2009 (60.3 versus 36.0), North Carolina in 2011 and 2012 (60.4 versus 41.2).
- It is easy to extract individual vectors. For instance, if you want to plot alcohol usage in 2009 versus 2012, then this form of data is very well suited for the task.
But the table above is not convenient for other common tasks. For instance, if we want to find the year of largest drinking across the states, then we need to either compare all columns manually, or write a loop over the columns instead. This is tricky. But this kind of tasks are common when doing data processing.
For such tasks, it is better to store the same data in a slightly different format:
state | year | drinking |
---|---|---|
Tennessee | 2009 | 48.3 |
Tennessee | 2010 | 48.1 |
… | … | … |
North Carolina | 2009 | 60.3 |
North Carolina | 2010 | 59.7 |
… | … | … |
Pennsylvania | 2009 | 36 |
Pennsylvania | 2010 | 37.3 |
… | … | … |
Note that these two datasets contain exactly the same data. For instance, Tennessee in 2009 had drinking rate 48.3% while in North Carolina in 2010 it was 59.7%. This table is called data in long form. What distinguishes this table from the previous one is that here each column contains different kind of values: the first column is state, the second column is year, and the third one is the drinking percentage. This was not the case with the wide form, in that table all the columns, except the state names, contain drinking percentage.
Let’ take a closer look on how to find a certain value, for instance, the drinking propensity in Tennessee in 2010 (48.1%). In the wide form table, we need to find the row by looking for the state in the first column, and find the matching column through column names. In the long form table table, we need to find first the state throught the state column as well. But now this does not give us a single row for all Tennessee data, but four rows, one for each year. So our second task is to find the year of interest in the year column among the rows that correspond to Tennessee. Here it is 2nd row, and the value is 48.1 again.
For humans, the first approach is probably preferable. But for computers, not necessarily. The problem is that in the first table we were using two different type of methods–filtering rows based on a criterion (state = Tennessee) and also selecting columns based on a criterion (column = 2010). Such different operations may not be well suited for data processing pipelines. In the second case, we just used two similar filtering conditions: first, state = Tennessee, and second, year = 2010. This turns out a much simpler approach for many tasks.
14.4.2 Reshaping between wide and long form
How can we transform the original, wide form data into a long form? And what does the long form and wide form exactly mean?
The main trait of long form data is that each column depicts a different kind of value. If there are any inherent groupings, all the values that belong to the same group are spread into multiple rows. Additionally, we need one or more distinct columns to mark the group membership. In the drinking example, there was only one kind of value (drinking percentage) but two groupings: state and year. So the long-form data would require three columns: state, year and drinking. The first two of these denot groups, and the last one are the values. So the wide form might look like:
state | year | drinking |
---|---|---|
Tennessee | 2009 | 48.3 |
Tennessee | 2010 | 48.1 |
The wide form, however, has all values of one (or more) groups in the same row. In the drinking example, these were all years for a state:
state | 2009 | 2010 | 2011 | 2012 |
---|---|---|---|---|
Tennessee | 48.3 | 48.1 | 39.6 | 48.1 |
North Carolina | 60.3 | 59.7 | 60.4 | 59.2 |
Note how we have four separate columns of values, one for each year. But different states are still in separate rows. So in a sense it is not a pure wide form: it is long form in terms of states but wide form in terms of sex-year combinations.
Exercise 14.3 In the example above, states are in rows and years are in columns. Can you swap this around–years in rows and states in columns? How would the result look like?
See the solution
Exercise 14.4 How might the pure wide form data look like where states are not in separate rows?
See the solution
Alternatively, we can think that what distinguishes the wide and long form is what each row represents. In the long form, it is a single group element, in this example it is a state-year combination. In the wide form, it is a group. In the example above, it is just a state–all years are in the same row.
Some kind of groups are needed to make the distinction between wide and long form meaningful. For instance, look at the height-weight data. This represents a few health-related characteristics of five teenagers:
sex | age | height | weight |
---|---|---|---|
Female | 16 | 173 | 58.5 |
Female | 17 | 165 | 56.7 |
Male | 17 | 170 | 61.2 |
Male | 16 | 163 | 54.4 |
Male | 18 | 170 | 63.5 |
Is this dataset in long form or in wide form? In order to answer this question, we need to understand three things: 1) what do the rows represent; 2) what are the values–the actual data values in the table (in one or in multiple columns), and 3) what are the group indicators (in column names or in dedicated columns). The answers are:
- a row represents one person (this can be inferred, but also learned from the documentation).
- We can immediately see that each column represents a value. age, height and weight are, obviously, values. Here sex is a value too: it is not a group, but the sex of that particular person.
- Finally, we do not have any group indicators! This data is not grouped in any way, it is individual data, and hence we do not have group indicators.
How might the wide form of these data look like? Remember: wide form has all values for the group in the same row. But here we have only trivial groups–every person forms their own group and hence the data is also in a wide form!
Distinction between wide and long form is meaningful only if the data contains some kind of groupings.
14.4.3 pivot_longer()
and pivot_wider()
Fortunately, it is fairly straightforward to convert one data type to
the other. This is called reshaping.
The tidyr package (part of tidyverse) offers two
functions: pivot_longer()
and pivot_wider()
. Let’s first convert
the original table to the long form using pivot_longer()
. It takes
number of arguments, most important of which are cols
, names_to
and values_to
.
TBD: fix the thing that we only have one sex in the example above
col
tells which columns are to be transformed to
long format. in the example above, we want to transform B2002,
F2002, M2002, B2003, F2003 and M2003. All these columns
contain a similar type of value–percentage of those who are
drinking. We do not want to convert state: this is name of the
state, not percentage of drinkers. The value for col
can be
specified in a similar fashion as for select()
, see Section
12.3.1.
As a result, all these columns will be converted into two columns: the
first one contains the column names B2002, F2002 and so on, and
the other one contains the corresponding values, 32.5, 25.2, etc. The
two columns are necessary because we need both values, and some kind
of indicator that explains what these values are.
names_to
and values_to
are names of these two respective columns.
So we can reshape this dataset into a long form as:
drinking %>%
longDrinking <- pivot_longer(!state, names_to = "sex.year", values_to = "drinking")
%>%
longDrinking head(7)
## # A tibble: 7 × 3
## state sex.year drinking
## <chr> <chr> <dbl>
## 1 Tennessee B2002 32.5
## 2 Tennessee F2002 25.2
## 3 Tennessee M2002 40
## 4 Tennessee B2003 35.5
## 5 Tennessee F2003 27.6
## 6 Tennessee M2003 43.6
## 7 North Carolina B2002 42.9
Long form is good for various data processing pipelines, but sometimes
we want to transform it back to wide form. This can be done with
pivot_wider()
. The most important arguments it takes are
names_from
and values_from
: which column values should be
transformed to column names in the wide form, and which column should
contain the corresponding values. So let’s get our wide form drinking
back:
%>%
longDrinking pivot_wider(names_from = "sex.year",
values_from = "drinking") %>%
head(5)
## # A tibble: 3 × 7
## state B2002 F2002 M2002 B2003 F2003 M2003
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Tennessee 32.5 25.2 40 35.5 27.6 43.6
## 2 North Carolina 42.9 33.9 52.2 44.1 35.7 52.8
## 3 Pennsylvania 59.8 51.6 68.4 61 53.2 69
And we got our wide form back 😀
Exercise 14.5 Consider the Ice extent data. Let’s only consider columns year, month, region, and area–select only these four columns.
- In terms of region, is it in a wide form or in long form?
- What might the column names of the wide form of the same dataset look like?
- reshape the dataset in a way that there are only a single row for each month, and northern and southern area are in their separate columns.
See the solution
Exercise 14.6 Consider the data frame about the number of patients in a hospital:
pregnant | male | female |
---|---|---|
yes | NA | 11 |
no | 25 | 20 |
- Is this data frame in a wide form or long form?
- Can you reshape it to the other form?
See the solution
14.4.4 Multiple groupings and multiple values
Now let’s look at a more complex example of reshaping. Consider alcohol disorder data:
read_delim("data/alcohol-disorders.csv")
disorders <-%>%
disorders head(3)
## # A tibble: 3 × 6
## country Code Year disordersM disordersF population
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Argentina ARG 2015 3.07 1.17 43257064
## 2 Argentina ARG 2016 3.04 1.16 43668236
## 3 Argentina ARG 2017 3.02 1.16 44054616
This dataset contains the percentage of population who suffers of alcohol use disorders for a number of countries and a number of years. It also contains the relevant population measures. For instance, in Argentina in 2015, slightly over 3% of the male population and over 1% of female population suffer from the disorders while the total population was approximately 43M.
Are these data in a wide or in a long form? Let’s try to analyze this using the three questions we used above (in Section 14.4.2):
What does a row represent? Here, it is obviously a country-year combination.
What are the values? These data contain three value columns: male disorders, female disorders, and population. Year is part of the grouping, and both country and Code are grouping as well. Moreover, country and Code contain exactly the same information, each country is uniquely associated with its code.
What are the group indicators? It is fairly obvious that both country and year are groups. Moreover, country is given by two columns, country and Code, we might remove one and still have all the same information in data. So we have three group columns, one of which is redundant.
But we have more groups: namely males and females. Indeed, disordersM and disordersF contain the same kind of values, percentage of population. Just this group indicator is not in a dedicated column, but spread into names of two columns instead.
TBD: complete multiple grouping reshape.
14.5 Separate a single column into multiple columns
TBD: new motivation and more complex drinking data example
But the new long-form drinking data has a problem. Namely, the sex.year variable is a combination of both sex and year, and that makes it difficult to select only certain genders or years. It would be easier to replace it with two different columns: sex and year.
Fortunately, tidyr package has several functions for this task.
Here the appropriate one would be separate_wider_position()
that
splits a column into multiple columns by position. Its main arguments
are cols
that tells which columns to separate, and widths
that
specifices how many characters will go to each new column. There are
other arguments, e.g. if the original column is to be removed, what to
do if the column is too wide or too narrow, etc.
The most important of these arguments is widths
. It must be a named
vector, describing the width of the the fields that go into columns.
Its names will be transformed to the corresponding column names. As
the long form drinking data has sex.year values in the form SYYYY,
we need to assign width 1 for sex, and width 4 for year. So the the
argument might look like
c(sex = 1, year = 4) widths =
Here we give the columns natural names, sex for sex and year for year:
%>%
longDrinking separate_wider_position(sex.year,
widths = c(sex = 1, year=4)) %>%
head()
## # A tibble: 6 × 4
## state sex year drinking
## <chr> <chr> <chr> <dbl>
## 1 Tennessee B 2002 32.5
## 2 Tennessee F 2002 25.2
## 3 Tennessee M 2002 40
## 4 Tennessee B 2003 35.5
## 5 Tennessee F 2003 27.6
## 6 Tennessee M 2003 43.6
Note that separate does not convert the column types. It is all well that sex is character, but we may want to convert year to a number. So the final code may look like:
%>%
longDrinking separate_wider_position(sex.year,
widths = c(sex = 1, year=4)) %>%
mutate(year = as.numeric(year)) %>%
head()
## # A tibble: 6 × 4
## state sex year drinking
## <chr> <chr> <dbl> <dbl>
## 1 Tennessee B 2002 32.5
## 2 Tennessee F 2002 25.2
## 3 Tennessee M 2002 40
## 4 Tennessee B 2003 35.5
## 5 Tennessee F 2003 27.6
## 6 Tennessee M 2003 43.6
This form is the most convenient one to use for filtering and many other data processing tasks.
Besides separating by position, there are also functions to separate based by a delimiter, and by regular expressions. Obviously, there are many other options, e.g. one may use string processing functionality to separate certain substrings and regular expressions.
::thematic_on(font = thematic::font_spec(scale=1.8))
thematic# theme_set(text = element_text(size = 24))