Chapter 14 More about data manipulations: merging, reshaping and cleaning data
In the dplyr section we discussed a large class of data manipulation operations, such as selecting variables, filtering observations, arranging data in a certain order, and computing new values. Here we add more operations: merging, reshaping, handling date and time, and cleaning data.
14.1 Merging data: joins
When working with real-world data, you’ll often find that the data is stored across multiple files or across multiple data frames. This is done for a number of reasons. For one, it can help to reduce memory usage. For example, if you had a data frame containing information on students enrolled in different courses, then you could store information about each course, such as the instructor’s name, meeting time, and classroom, in a separate data frame. This saves a lot of space; otherwise you would have to duplicate that information for every student who takes the same course. You may also want to keep your data better organized by having student information in one file, and course information in another. This separation and organization of data is a core concern in the design of relational databases.
Below, we discuss how to combine such separately organized information into a single data frame.
14.1.1 Combining data frames
In simple cases, we may just combine datasets row-by-row or column-by-column. Consider a student dataset
students <- data.frame(
  student = c("Cecilia", "Marco", "Guangqi"),
  "GPA 2023-au" = c(4.0, 3.8, 2.8),
  check.names = FALSE)
students
## student GPA 2023-au
## 1 Cecilia 4.0
## 2 Marco 3.8
## 3 Guangqi 2.8
This contains the students’ names and their GPA for the 2023-au quarter.
When the following quarter ends, we may store their new GPA in a vector
gpa2024 <- c(3.9, 3.7, 3.0)
gpa2024
## [1] 3.9 3.7 3.0
The numbers here are in the same order as in the original data frame, i.e. 3.9 for Cecilia and 3.0 for Guangqi.
Now we can just add the vector to the GPA data frame using cbind() (combine column-wise):
cbind(students, "GPA 2024-wi" = gpa2024)
## student GPA 2023-au GPA 2024-wi
## 1 Cecilia 4.0 3.9
## 2 Marco 3.8 3.7
## 3 Guangqi 2.8 3.0
This adds another column (vector) to the data frame and calls it “GPA 2024-wi”. Here gpa2024 is just a single vector of data. But we can also combine two data frames in a similar fashion, row-by-row.
When a new student Nong arrives at the college, we can create a separate data frame for his results:
nongsGPA <- data.frame(
  student = "Nong",
  "GPA 2023-au" = 3.6,
  check.names = FALSE)
nongsGPA
## student GPA 2023-au
## 1 Nong 3.6
Next, we want to add it as a new row to our students’ data. Two data frames can be combined in a fairly similar fashion using rbind() (combine row-wise):
rbind(students, nongsGPA)
## student GPA 2023-au
## 1 Cecilia 4.0
## 2 Marco 3.8
## 3 Guangqi 2.8
## 4 Nong 3.6
But note the difference–now we add a row to the data frame, and a row contains different kinds of values, here one character (“Nong”) and one number (3.6). Hence we need to convert the row to a data frame, and use exactly the same column names as in the students data frame above. rbind() ensures that the correct columns are matched.
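For instance, rbind() matches columns by name, not by position. Here is a small sketch with a hypothetical extra student, with the column order reversed on purpose:
# hypothetical new student; note the reversed column order
newRow <- data.frame("GPA 2023-au" = 3.2, student = "Aisha",
                     check.names = FALSE)
rbind(students, newRow)  # columns are matched by name, not position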
The combining we described here is very simple, in the sense that each row is combined with the corresponding row, or each column with the corresponding column. If the students in the 2024-wi data are in a different order, the result will be wrong.
Below, we discuss more complex merging that ensures that even if the order differs, the results are still correct.
14.1.2 Merging by key
The simple way of combining data frames we used above–just stacking them row- or column-wise–is fairly limited. What happens if new students join the college, or previous students leave? Or if the data is just ordered differently? In such cases we cannot merge rows in two tables by their order (row number); instead, we need to find the corresponding rows in both tables. This is merging by key.
14.1.2.1 Merge key
For instance, imagine that in winter 2024 the student list looks somewhat different:
students24 <- data.frame(
  student = c("Cecilia", "Guangqi", "Nong"),
  "GPA 2024-wi" = c(3.9, 3.0, 3.6),
  check.names = FALSE)
students24
## student GPA 2024-wi
## 1 Cecilia 3.9
## 2 Guangqi 3.0
## 3 Nong 3.6
If we were just to use cbind() here, we’d create quite a mess:
cbind(students, students24$`GPA 2024-wi`)
## student GPA 2023-au students24$`GPA 2024-wi`
## 1 Cecilia 4.0 3.9
## 2 Marco 3.8 3.0
## 3 Guangqi 2.8 3.6
While all is well with Cecilia, Marco is now assigned Guangqi’s winter quarter grades, and Guangqi in turn gets Nong’s winter grades. This will cause a lot of trouble for the university!
What we need to do instead is to find which line in the first dataset (autumn 2023) corresponds to which row in the second dataset (winter 2024), and combine the corresponding rows only. This is the idea of merging or joining (because you are “merging” or “joining” the data frames together).
Before you can actually do any merging, you need to find some kind of identifiers: values that determine which row in the first data frame corresponds to which row in the second data frame. In this case it is easy–it is the students’ names. We have to ensure that Cecilia’s grades from one quarter are merged with her grades from the next quarter, and the same for Guangqi, Marco and others.
This value, or a set of values, is called the merge key. We cannot do any merging before we have identified the merge key. There are multiple kinds of merge keys:
- The primary key is the key in the current data frame.
- The foreign key is the key in the other data frame. It may be a column with the same name, or it may have different names across different datasets. And be aware–sometimes a column with a similar name has a different meaning in different datasets!
- Often, a single column is enough to identify the records. The example above can be merged correctly by using just name. But sometimes you need multiple columns, such as year and month. This is called a compound key, a key that consists of multiple columns.
- If there is no good identifier, we may create one from what we have. This is called a surrogate key. For instance, it is common to link personal datasets using name and day of birth. This usually works, but may create problems if two persons with the same name are born on the same day. Another problem with this surrogate key is the habit of writing names in many slightly different forms. For instance, are “John Smith”, “John A. Smith” and “John Alexander Smit” the same person? (A small sketch of building such a key follows below.)
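As a small sketch with made-up data, such a surrogate key can be built by simply pasting the components together:
# hypothetical personal data
people <- data.frame(
  name = c("John Smith", "Mary Jones"),
  dob = c("2001-05-14", "1999-11-02"))
# surrogate key: combine name and date of birth
people$key <- paste(people$name, people$dob, sep = "|")
people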
14.1.3 What to merge and what not to merge?
Before you even begin with merging, you should ask whether these datasets can be merged at all. And even if they can, does it make sense? Do you want them to be merged? We discuss a few examples here.
Students:
name | course | quarter | year |
---|---|---|---|
Ernst | info201 | 2023-au | 1 |
Ernst | math126 | 2023-au | 1 |
Ernst | cse120 | 2024-wi | 1 |
Eberhard | arch101 | 2023-au | 2 |
Eberhard | info201 | 2024-wi | 2 |
… | … | … | … |
Courses:
name | quarter | day | time | instructor |
---|---|---|---|---|
info201 | 2023-au | Tue | 10:30 | Schnellenberg |
info201 | 2024-wi | Tue | 10:30 | Schnellenberg |
math126 | 2023-au | Thu | 1:30 | Recke |
arch101 | 2023-au | Wed | 3:30 | Plater |
… | … | … | … | … |
Imagine you are working at a college and you have access to students’ data (name, day of birth, address, courses they have taken, grades, and so on). You also have access to the college’s course catalog, including course number, quarter, instructor’s name, etc. Can you merge these two datasets? What might be the merge key? Why may you want to do that, what interesting questions can be answered in this way?
These two data frames can easily be merged. The key might be the course number. But should the key also contain the quarter? The answer is “maybe”, depending on what kind of questions we are asking.
- If we ignore the quarter, then we merge a course called “info201” with all students who have taken “info201”. This might be useful to compare that course with other similar courses, to analyze the average grades, which other courses the students tend to take, and other course-specific questions.
- If we include quarter in the merge key, then we can answer questions about a particular course instance. For instance, we can compare different instructors of the same course, or analyze at which year students tend to take info201.
- Note that we can use one of the columns with a common name, quarter, as the merge key, but we cannot use the other common column, name, as the key. This is because in these two tables name means different things: it is the student’s name in the first table and the course name in the second table.
All these examples are valid questions that the college might want to analyze.
Exercise 14.1 Assume we merge students with courses using just course number and ignoring everything else. Provide additional examples of questions that one might want to answer with the resulting data.
date | temp | precip | sunrise |
---|---|---|---|
2024-02-21 | 9 | 0.5 | 7:03 |
2024-02-22 | 12 | 0.0 | 7:01 |
2024-02-23 | 12 | 0.0 | 6:59 |
2024-02-24 | 5 | 1.0 | 6:57 |
But imagine now that we also have weather data: for every day, we know temperature, precipitation, wind speed, and also sunrise/sunset time. Can we combine students and weather in a meaningful way? What kind of questions might we answer with such a combination?
- First, weather is not something the college can do much about. The schedules are typically planned months in advance, and it is hard to change much based on a last-minute weather forecast. This means weather data is perhaps not that relevant for colleges.
- Second, what kind of meaningful questions can we answer if we have such combined data? The only questions that come to my mind are to see if there is any relationship between temperature and grades, or more likely, between sunlight hours, the time of day the course is offered, and students’ performance.
- Third, what might be the merge key? There are neither course numbers nor student names in the weather table. The only possible key is date. Neither student data nor course data contains a date though–but course data includes quarter. So we can compute the average weather or time of sunset for each quarter from the weather data, merge this with the course data, and finally combine with the students’ data. This sounds complicated, but it is perfectly doable.
These examples are also valid questions, but probably ones that the colleges consider less important.
14.1.4 Different types of joins
When merging data, the computer will always find the common key values in both data frames, and merge the corresponding rows. But as usual, the devil is in the details. What should the computer do if there is no match for some of the keys in one of the data frames? Or if a key in one table corresponds to multiple lines in the other table? This is where one has to decide what is a good way to proceed. Below, we discuss the details based on left_join(), and thereafter briefly introduce the other joins.
14.1.4.1 When the first data frame matters: left_join()
The left_join() operation is one of the most common merge operations. It looks for matching columns between the two data frames, and then returns a new data frame that is otherwise identical to the first (“left”) data frame, but with extra columns, taken from the second data frame, added on. Importantly, the left-part of left_join() ensures that all rows from the left data frame (the first table) are preserved, and no new rows are added. This is achieved as follows:
- If there is no matching key for a row in the first table, the corresponding entries of the new columns are filled with missings (NA).
- If there are rows in the second table that do not have corresponding rows in the first table, then those rows are dropped.
Below is an example of the command
left_join(songs, artists)
- left-join preserves the left table: the gray songs data frame is preserved intact in the result.
- the join detects the merge key. Here it is name–the common column name in these two tables. But you can also specify it with the by argument. The column names of the key(s) may differ between the tables!
- The lines in the first table are merged with the corresponding ones in the second: we learn that in “Across the Universe”, John plays guitar; in “Come together”, John plays guitar; and in “Hello, Goodbye”, Paul plays bass.
- But there is no match for artist “Buddy” in “Peggy Sue”. Hence that field will be left NA. This is the final merged dataset.
Exercise 14.2 What do you think–what will be the result of left_join(artists, songs)? In particular, what song(s) will be assigned to John?
See the solution
Below, we discuss a few more details using another example. Imagine you are a data scientist at a college, and you have two tables: majors for student majors, and contact for student contact information. Both of these tables use column id, the university-wide student id. (You may also have a third table that links the id to their real name, but we are not concerned about that here.)
contact <- data.frame(
  id = c(1, 2, 3, 4),  # student id-s
  email = c("lisoc@college.edu", "zemath@college.edu",
            "bioarlin@college.edu", "uv98@college.edu")
)
contact
## id email
## 1 1 lisoc@college.edu
## 2 2 zemath@college.edu
## 3 3 bioarlin@college.edu
## 4 4 uv98@college.edu
majors <- data.frame(
  studentId = c(1, 2, 3),  # student id-s
  major = c("sociology", "math", "biology")
)
majors
## studentId major
## 1 1 sociology
## 2 2 math
## 3 3 biology
Both tables have an id-column, allowing you to match the rows from the contact table to the majors table. However, because the two tables name the id-column differently, we need to explicitly tell what the merge key is. We can left-merge contact information with majors as
left_join(contact, majors,
by = c("id" = "studentId"))
## id email major
## 1 1 lisoc@college.edu sociology
## 2 2 zemath@college.edu math
## 3 3 bioarlin@college.edu biology
## 4 4 uv98@college.edu <NA>
Note the by argument: it is a named vector (see Section 4.4.4), where the key column name of the first data frame goes on the left and that of the second data frame goes on the right. The result is very similar to the songs-artists example: for students 1, 2, and 3 we have both email and major, but we do not know the major of student 4.
In short, left_join() returns all of the rows from the first table, with all of the columns from both tables.
Notice that because of how a left join is defined, the argument order matters! The resulting table only has rows for elements in the left (first) table; any unmatched elements in the second table are lost. If you switch the order of the operands, you would only have information for students with majors:
left_join(majors, contact,
by = c("studentId" = "id"))
## studentId major email
## 1 1 sociology lisoc@college.edu
## 2 2 math zemath@college.edu
## 3 3 biology bioarlin@college.edu
You don’t get any information for student #4, because they didn’t have a record in the left-hand table!
14.1.4.2 Other joins
But you do not always want to preserve all rows of the first data frame and drop the “orphans” from the second one. In fact, dplyr (and relational database systems in general) provides a number of different kinds of joins, each of which decides the rows to be included in its own way. In all these cases, all columns from both tables will be included, with rows taking on values from their matches in the second table.
- left_join: All rows from the first (left) data frame are returned. That is, you get all the data from the left-hand table, with extra column values added from the right-hand table. Left-hand rows without a match will have NA in the right-hand columns.
- right_join: All rows from the second (right) data frame are returned. That is, you get all the data from the right-hand table, with extra column values added from the left-hand table. Right-hand rows without a match will have NA in the left-hand columns. This is the “opposite” of a left_join(), and the equivalent of switching the order of the data frames in left_join().
- inner_join: Only rows present in both data frames are returned. That is, you get any rows that had matching observations in both tables, with the column values from both tables. There will be no additional NA values created by the join. Observations from the left that had no match in the right, or observations in the right that had no match in the left, will not be returned at all.
- full_join: All rows from both data frames are returned. That is, you get a row for any observation, whether or not it matched. If it happened to match, it will have values from both tables in that row. Observations without a match will have NA in the columns from the other table. (See the sketch after this list for inner and full joins in action.)
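With the contact and majors tables from above, a quick sketch of the difference:
inner_join(contact, majors, by = c("id" = "studentId"))
# only students 1-3: those present in both tables
full_join(contact, majors, by = c("id" = "studentId"))
# all four students; student 4 gets NA for major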
The key to deciding between these is to think about what set of data you want as your set of observations (rows), and which columns you’d be okay with being NA if a record is missing.
All these joins add columns from one table to another. dplyr also provides filtering joins, which exclude rows based on whether they have a matching observation in another table, and set operations, which combine observations as if they were set elements. See the documentation for more detail on these options, but in this course we’ll primarily focus on the mutating joins described above.
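As a brief illustration of the filtering joins, again using the tables from above:
semi_join(contact, majors, by = c("id" = "studentId"))
# contact rows that have a match in majors (students 1-3)
anti_join(contact, majors, by = c("id" = "studentId"))
# contact rows without a match in majors (student 4)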
TBD: when to use what join
TBD: joins and pipes
TBD: More advanced topics one-to-many merging for map data
14.2 Reshaping data
Above, in Section 11, we discussed how many types of data can be stored in data frames, rectangular tables of rows and columns. But it turns out that the exact same information can be stored in multiple ways. Here we discuss two of the most popular forms, long form and wide form, and reshaping–the process of transforming one into the other.
14.2.1 Motivating example
Imagine you are a researcher at a meteorology institute. Every day you collect air temperature data at two locations: North (N) and South (S), and at two altitudes: Low (L) and High (H). An example of the data you just collected may look like the table at right; here the temperature in North at low altitude was 20, and in South at high altitude was 25.
How might you represent these data? The table at right offers one example. It is a rectangular table (data frame) that contains two rows, one for each location, and two columns, one for each altitude.
The first of these three examples, b), is just a transposition of the original table a). The other two, c) and d), are “made of” the original table, but not by simple operations like transposition. You may notice, though, that c) and d) are basically transpositions of each other. But in any case, the data in these tables is the same. All four agree that the temperature in South at low altitude was 30, and likewise that North-high was 15. It is not that some tables are correct while others are wrong. They are all correct. Let’s now discuss in which sense these tables differ.
The original table a) has location in rows and altitude in columns. So here a row represents a location, and each column is an altitude, corresponding to that location. If you want, different altitude values are grouped together into the corresponding locations. The table b) is similar in many ways. But instead of grouping altitudes together into rows representing locations, it groups locations together into rows representing altitudes.
Finally, tables c) and d) discard any groupings and just put all the measures underneath each other (c) or next to each other (d). Obviously, now each row or column needs to have a label, telling which location and altitude the measure is taken of. This is achieved here through two additional columns (c) or rows (d).
While all these tables are correct, they are not always good for all purposes. Typically, a) and b) are better for human reading–this is because we can quickly move the eyes both left-and-right and up-and-down, and in this way compare both locations and altitude. Table c) tends to be the best for computers, but it is not always so. We’ll discuss it more below. Table d) is usually hard to use and can rarely be seen.
Are there any other options that were not represented here? Not really. Obviously, we can put S in the first place instead of N in tables a) and b), or alternatively, we can combine Loc and Alt values into a single LocAlt that might obtain values like “NH” or “S.L”. But otherwise, these four tables are all we can do.
Tables a) and b) are typically called data in wide form, and c) is called long form. To be more precise, we might call d) the wide form, and a) and b) as “middle form”. However, the words “wide” and “long” usually refer to another alternative, and as d) is rarely used, there is no need to talk about the middle forms.
Here is a little more about what is the wide form and what is the long form:
- In order to be able to talk about wide/long form, the data must have some sort of groupings. Here we have two such groupings: location and altitude. We can group values together by similar location, or by similar altitude.
- Besides groups, we also need some kind of value. Here it is temperature. Values are typically the most valuable part of a dataset. Indeed, the institute does not have to hire a data scientist to tell them that there exists such a thing as high altitude in the North.
- In the long form, each row represents a single observation. Groups are not collected into the same row in any way. But we need additional columns that denote the group; in the example we need the columns location and altitude. Otherwise we cannot tell which group a value belongs to (a sketch of this form follows the list).
- In the wide form, at least one group is collected into a single line. In table (a) it is location, in table (b) it is altitude. Table (d) shows an example where both groups are collected into a single line. Now we need either extra rows to tell where the groups belong, or (more commonly) this is encoded into the column names.
- It is not always clear what the groupings are and what the values are. It is somewhat fuzzy and depends on what you want to do.
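For reference, here is a small sketch of the long form, table c), as an actual data frame, using the temperature values from the text:
# long form: one row per location-altitude combination
data.frame(loc = c("N", "N", "S", "S"),
           alt = c("L", "H", "L", "H"),
           temp = c(20, 15, 30, 25))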
Note that the story gets more complicated with more groups and measures.
14.2.2 A look at data
Now it is time to look at real data. Let’s take the 2015 subset of the alcohol disorders data:
read_delim("data/alcohol-disorders.csv") %>%
disorders <- filter(Year == 2015) %>%
select(country, disordersM, disordersF)
disorders
## # A tibble: 5 × 3
## country disordersM disordersF
## <chr> <dbl> <dbl>
## 1 Argentina 3.07 1.17
## 2 Kenya 0.747 0.654
## 3 Taiwan 0.891 0.261
## 4 Ukraine 3.90 1.43
## 5 United States 2.93 1.73
The columns disordersM and disordersF refer to the percentage of male and female population with alcohol disorders.
Which table from Section 14.2.1 does this printout correspond to? Is it (a), (b), (c), or (d)? It is fairly easy to see that it is basically the same as table (a) (or (b)): instead of two locations, the rows represent five countries. And instead of two altitudes, the columns represent two sexes.
You can also see that the data has two grouping dimensions: country and gender. The data is grouped into countries, so each row corresponds to a country, and each column (except country) is a measure of alcohol disorders. The column names also carry important information, they tell which gender the particular numbers correspond to. And finally, all the cells contain a similar measure: percentage of population with alcohol disorders. These data are in wide form.
This form of data is very popular for a few reasons:
- It is well suited for quick visual comparison. For instance, one can easily compare males in Kenya (0.747) to males in Taiwan (0.891) and find that these values are fairly similar. We can also easily compare men and women in Taiwan (0.891 and 0.261), these figures are rather different.
- It is easy to extract individual vectors. For instance, if you want to plot alcohol usage for men versus women across the countries, then this form of data is very well suited for the task.
But the table above is not convenient for other common tasks. For instance, to find the sex and country of the largest value of alcohol disorders, one needs to either compare both disorder columns manually, or maybe write a loop over the columns instead. This can be tricky. But such tasks are very common when doing data processing.
Instead, we may want to present the data in the long form, as in table c):
## # A tibble: 10 × 3
## country sex disorders
## <chr> <chr> <dbl>
## 1 Argentina M 3.07
## 2 Argentina F 1.17
## 3 Kenya M 0.747
## 4 Kenya F 0.654
## 5 Taiwan M 0.891
## 6 Taiwan F 0.261
## 7 Ukraine M 3.90
## 8 Ukraine F 1.43
## 9 United States M 2.93
## 10 United States F 1.73
Here the values are not grouped in any way; all the groups (countries and sexes) are spread to separate rows. Also, no row contains multiple instances of the same type of value–here it means no row contains multiple disorder values. We also need one or more distinct columns to mark the group membership–a column for country, and a column for sex. So the long-form data requires three columns: country, sex and disorders. This is how the table above was constructed.
Now finding the maximum value is fairly easy: just find the largest value of the column disorders and report the corresponding row. But unfortunately, comparing males in Kenya to males in Taiwan is somewhat more complicated–first we need to find the group of rows for Kenya, and then the male row within the Kenya group. This is not as easy as with wide form data, at least visually.
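For instance, assuming the long table above is stored as disordersLong (a hypothetical name; we construct such a table in Section 14.2.3), finding that row is a one-liner:
disordersLong %>%
  slice_max(disorders, n = 1)  # row with the largest disorder value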
But as we explained above, these two are just different forms of the same data, they are not different datasets. Whichever form we use, Kenyan males still have somewhat lower propensity of alcohol disorders than Taiwanese men (0.747 versus 0.891). These are the same data.
Let’s look at the difference between the wide form and long form table again. In the wide form above, each line represents a country, and each country contains a group of values: disorders for men and for women. So data is grouped in lines by country. In the long form, there is no such grouping, each line represents a country-sex combination. You can also see that in the long form, each column contains different kind of values: the first column is country, the second column is sex, and the third one is the disorder percentage. This was not the case with the wide form, in that table two columns contained the disorder percentage.
Exercise 14.3 Can you sketch an alternative way to present these data in wide form?
Exercise 14.4 What might the pure wide form of these data look like, where neither countries nor sexes are in different rows?
See the solution
Example 14.1 Some kind of grouping is needed to make the distinction between wide and long form meaningful. But what constitutes a group may be a bit unclear. For instance, look at the height-weight data. This represents a few health-related characteristics of five teenagers:
sex | age | height | weight |
---|---|---|---|
Female | 16 | 173 | 58.5 |
Female | 17 | 165 | 56.7 |
Male | 17 | 170 | 61.2 |
Male | 16 | 163 | 54.4 |
Male | 18 | 170 | 63.5 |
Is this dataset in long form or in wide form? In order to answer this question, we need to understand three things: 1) what do the rows represent; 2) what are the values–the actual data values in the table (in one or in multiple columns), and 3) what are the group indicators (in column names or in dedicated columns). The answers are:
- A row represents a person (this can be inferred, but also learned from the documentation).
- We can immediately see that each column represents a value. age, height and weight are, obviously, values. Here sex is a value too: it is not a group, but the sex of that particular person.
- Finally, we do not have any group indicators! This data is not grouped in any way, it is individual data, and hence we do not have group indicators.
Why is sex a value, not a group? This is because this dataset is about five teenagers. True, it might be about two groups, boys and girls, with three kids measured in each group (the third girl would have NA-s in every column). In that case sex would be a group indicator. This was not how the dataset was constructed though.
Why aren’t age, height and weight (or maybe even sex, age, height and weight) just different measures for the same group (one person is one group)? You can imagine reshaping this data frame into two columns: measure (this will be either “sex”, “age”, “height” or “weight”) and value (this would be e.g. “Female”, “16”, “173” and so on).
This is not wrong (except you cannot mix numeric and character values in a single column) but it is hardly ever useful. True, these are all measures, but there are few tasks where you may want to do similar operations with all these values. It does not make sense to put height and weight on the same plot, and filtering like
filter(value > 100)
is fairly meaningless.
So it makes sense to say that these data do not have any groupings, but four different kinds of values instead.
What might the wide form of these data look like? Remember: the wide form has all values for the group in the same row. But here we have only trivial groups–every person forms their own group–and hence the data is also in wide form! The distinction between wide and long form is meaningful only if the data contains some kind of groupings.
Exercise 14.5 Consider the following dataset: the numbers are visits at a hospital by different kinds of patients:
pregnant | male | female |
---|---|---|
yes | NA | 11 |
no | 25 | 20 |
- Is this data frame in a wide form or long form?
- Can you reshape it to the other form?
See the solution
14.2.3 pivot_longer() and pivot_wider(): reshaping between wide and long form
Fortunately, it is straightforward to convert between the wide and long form. This is called reshaping. The tidyr package (part of tidyverse) offers two functions: pivot_longer() and pivot_wider(). Let’s first convert the original disorder data frame to the long form using pivot_longer().
As a reminder, here is the disorder data again:
country | disordersM | disordersF |
---|---|---|
Argentina | 3.0698864 | 1.1703126 |
Kenya | 0.7469638 | 0.6539660 |
Taiwan | 0.8912813 | 0.2611961 |
Ukraine | 3.8954954 | 1.4253794 |
United States | 2.9275393 | 1.7291682 |
pivot_longer() takes a number of arguments, the most important of which are cols, values_to and names_to.
- cols tells which columns are to be transformed to long format. We want to transform the value columns, disordersM and disordersF. These columns contain a similar value–percentage of alcohol disorders. We do not want to convert country: this is the grouping dimension, the name of the country, not a percentage of disorders (remember, these data are grouped by country and sex). The value for cols can be specified in a similar fashion as for select(), see Section 12.3.1. Here we can pick e.g. cols = c(disordersM, disordersF) or, equivalently, cols = !country.
- values_to is just the name of the resulting value column: the two columns we selected, disordersM and disordersF, will be stacked together into a single column (or more precisely, interleaved into a single column). If not specified, it will be called “value”.
- names_to is the name for the new grouping indicator. As disordersM and disordersF are combined together, we need a new column that tells which row originates from which of these. If not specified, the column will be called just “name”.
Now we are ready to do the actual reshaping. We need to select all columns except country. Let’s call the resulting value column “disorders”, and the other grouping column–the one that will contain the original column names–“sex”:
longDisorders <- disorders %>%
  pivot_longer(!country,
               values_to = "disorders",
               names_to = "sex")
longDisorders
## # A tibble: 10 × 3
## country sex disorders
## <chr> <chr> <dbl>
## 1 Argentina disordersM 3.07
## 2 Argentina disordersF 1.17
## 3 Kenya disordersM 0.747
## 4 Kenya disordersF 0.654
## 5 Taiwan disordersM 0.891
## 6 Taiwan disordersF 0.261
## 7 Ukraine disordersM 3.90
## 8 Ukraine disordersF 1.43
## 9 United States disordersM 2.93
## 10 United States disordersF 1.73
And here we have it. The long form contains three columns–two of these, country and sex, are grouping indicators, and the third one, disorders, contains the actual values. Note that you may want to do a bit of post-processing and remove “disorders” from the sex column, leaving only “F” or “M”. It happens frequently that the column names do not make nice group indicators.
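Such post-processing might look like this (the same gsub() clean-up is used in Section 14.2.4.3 below):
longDisorders %>%
  mutate(sex = gsub("disorders", "", sex))
  # sex is now just "M" or "F"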
Long form is good for various kinds of data processing tasks, but sometimes we want to transform it back to wide form. This can be done with pivot_wider(). The most important arguments it takes are names_from and values_from. The former tells which column contains the new column names–it should be one of the grouping dimensions, here sex or country. values_from tells which column contains the corresponding values, here it must be disorders.
So let’s get our wide form disorders back:
longDisorders %>%
  pivot_wider(names_from = "sex",
              values_from = "disorders") %>%
  head(5)
## # A tibble: 5 × 3
## country disordersM disordersF
## <chr> <dbl> <dbl>
## 1 Argentina 3.07 1.17
## 2 Kenya 0.747 0.654
## 3 Taiwan 0.891 0.261
## 4 Ukraine 3.90 1.43
## 5 United States 2.93 1.73
And we got it back 😀
Exercise 14.6 Use the alcohol disorders data from above:
- Reshape it into a long form. Remove “disorders” from the sex column, so that it only reads “F” and “M”.
- Reshape it into the wide form, but now put sexes in rows and countries in columns.
Exercise 14.7 Use the alcohol disorders data from above:
- Reshape it into a long form, and thereafter into the widest possible form that only contains a single row.
- Reshape it directly into the widest possible form.
14.2.4 Multiple groupings and multiple values
The above examples were fairly simple–this was because we were only using two grouping dimensions, and only one measure variable. Now we look at the case where we have more dimensions and more values.
14.2.4.1 Three grouping dimensions
Imagine you are working for the meteorology institute again, as in Section 14.2.1. Your task is again to collect temperature data in North (N) and South (S) at Low (L) and High (H) altitude. However, now the data must be measured every day, not just once. So far you have collected data for Yesterday (Y) and Today (T), and today seems to be just one degree warmer across all the measurements.
How might you store the data? The example table at right offers one option. You have stored two locations in two rows, and \(2\times 2\) altitude-time combinations in four columns.
Why does your new task–storing data over time–offer new ways to construct the data frames? This is because now we have three grouping dimensions: location, altitude and time. And not surprisingly, more dimensions offer more ways to represent data. More specifically, tables (a), (b) and (c) are typically referred to as “wide form”, although all these tables have location in rows. Table (d) is pure long form, as the only columns are grouping indicators and the temperature values. Finally, (e) is pure wide form as no groupings are placed in rows. As in the case of the two-grouping data, wide forms are often used to represent data visually, while the long form (d) is often useful for data processing. As above, the pure wide table (e) is rarely used, because it is cumbersome to process and display.
Exercise 14.8 Are these all possible options? Can you come up with more possible ways to represent these data? How many different ways are there in all to make a data frame of these data?
Finally, in case of three grouping dimensions, we can also represent data not as a rectangle but as a 3D “box”. This is commonly called data cube. Data cubes are not very common, and base-R does not support these. Data cubes are much harder to visualize, essentially you need to convert those to (2D) data frames to display their content. But they are actually used in exactly such context, when collecting 3D spatial data, or 4D spatio-temporal data. In the latter case we would need even a 4D cube. We do not discuss data cubes in this course. However, it is useful to know and imagine data cubes when reshaping data frames with multiple grouping dimensions.
14.2.4.2 Multiple values
But real data may be even more complicated. When you send your expensive weather balloons up, it is unlikely that you measure just temperature. You may measure a bunch of different atmospheric parameters, including humidity, wind speed, sunlight, gps coordinates and more.
First, imagine you also measure humidity. The dataset you collect may now look something like what is displayed at right. It is a data frame where rows represent location-altitude combinations, and columns represent time-measurement type combinations. How many grouping dimensions do we have in this data frame?
The answer is not quite obvious. We definitely have three–location, altitude and time. But will measurement type–temperature or humidity–also count as a grouping dimension?
This is not quite clear. From a purely technical point of view, it is as good as time or location, so you can imagine making a long form dataset as seen here. But conceptually, this may not be a very useful thing to do, as there is little need to use the value column for any operations besides filtering it for either temperature or humidity. These two are very different physical concepts, measured in completely different units. You cannot easily put them on the same figure, nor will computing averages, or filtering like filter(value > 50), make much sense. So conceptually you probably do not want to do that. It is more useful to just use three grouping indicators–Location, Altitude and Time–and two value columns: temperature and humidity.
For instance, imagine you add the geographic coordinates to your weather data. The North and South mean large regions, so the small drift of your weather balloons does not matter much. The result may look like the table here at right.
Here I have just marked two sets of coordinates–one for North, one for South. This is because the coordinates do not depend on altitude and time (and they definitely do not depend on the type of measurement). Such partly constant values add some additional considerations when reshaping data. For instance, when converting the table at right to a data frame (the figure is just a handwritten table!), you need to replicate the GPS coordinates–you need to assign the same coordinates for both Low and High altitude. When converting the data frame to a long form, you probably want to add a column of coordinates to this dataset, one that contains largely identical values. This may be wasteful for large datasets.
Exercise 14.9 Consider the Ice Extent Data. Let’s only consider columns year, month, region, extent and area–remove columns data-type and time.
- What are the grouping dimensions in these data?
- Are these in a wide form or in a long form?
- What are the values in these data? Do they change along all the grouping dimensions? Would it make sense to combine the values into an additional grouping dimension?
- If you want to convert the dataset into a wide form by month, then what might the column names look like?
14.2.4.3 Reshaping multi-dimensional data
Now let’s reshape a more complex dataset. Take the complete alcohol disorder data:
disorders <- read_delim("data/alcohol-disorders.csv")
disorders %>%
  sample_n(4)
## # A tibble: 4 × 6
## country Code Year disordersM disordersF population
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Taiwan TWN 2019 0.903 0.258 23777742
## 2 Ukraine UKR 2018 3.53 1.33 44446952
## 3 Taiwan TWN 2018 0.892 0.259 23726186
## 4 Argentina ARG 2017 3.02 1.16 44054616
This dataset has three grouping dimensions–country, year and sex. It has two values, disorders (for males and females), and population. You can also easily see that population differs by country and year, but there is no separate population for males and females–it is a single population figure.
Are these data in a wide or in a long form? We can try to answer this in the following way: What does a row represent? Here, it is obviously a country-year combination. Hence it is in long form according to country and year. Sexes are combined into the same row, hence it is in wide form according to sex.
What are the values? The data in the current form contain three value columns: male disorders, female disorders, and population. This means two types of values–disorder percentage (for both men and women), and population.
Year is not a value despite being numeric–it is just a label on the time dimension.
What are the group indicators? It is fairly obvious that both country and year are groups. Moreover, country is given by two columns, country and Code; we might remove one and still have the same information in the data. So we have three group columns, one of which is redundant.
The final grouping dimension, sex, does not have a separate column; its values are embedded in the disorder column names.
Exercise 14.10 Consider a subset of COVID Scandinavia data. It looks like this:
read_delim("data/covid-scandinavia.csv.bz2") %>%
select(code2, country, date, type, count, lockdown, population) %>%
slice(c(201:204, 891:895, 2271:2274))
## # A tibble: 13 × 7
## code2 country date type count lockdown population
## <chr> <chr> <date> <chr> <dbl> <date> <dbl>
## 1 DK Denmark 2020-05-01 Confirmed 9311 2020-03-11 5837213
## 2 DK Denmark 2020-05-01 Deaths 460 2020-03-11 5837213
## 3 DK Denmark 2020-05-02 Confirmed 9407 2020-03-11 5837213
## 4 DK Denmark 2020-05-02 Deaths 475 2020-03-11 5837213
## 5 FI Finland 2020-05-01 Confirmed 5051 2020-03-18 5528737
## 6 FI Finland 2020-05-01 Deaths 218 2020-03-18 5528737
## 7 FI Finland 2020-05-02 Confirmed 5176 2020-03-18 5528737
## 8 FI Finland 2020-05-02 Deaths 220 2020-03-18 5528737
## 9 FI Finland 2020-05-03 Confirmed 5254 2020-03-18 5528737
## 10 SE Sweden 2020-05-01 Confirmed 22133 NA 10377781
## 11 SE Sweden 2020-05-01 Deaths 3027 NA 10377781
## 12 SE Sweden 2020-05-02 Confirmed 22432 NA 10377781
## 13 SE Sweden 2020-05-02 Deaths 3102 NA 10377781
- What are the grouping dimensions in these data?
- Are these data in a long or a wide form?
- Would you count type (Confirmed/Deaths) as a grouping dimension?
- How many different values do you see in this dataset?
- Do all values change according to all dimensions?
- How would you describe the column code2? Is it a value? A grouping dimension? Something else? Where would you put it when you transform the dataset into a wide form?
Let’s first convert the dataset into a long form, including long form by sex:
disordersL <- disorders %>%
  pivot_longer(starts_with("disorders"),
               # all columns whose names
               # start with 'disorders'
               names_to = "sex",
               values_to = "disorders") %>%
  mutate(sex = gsub("disorders", "", sex))
               # a better name for sex
disordersL %>%
  head(10)
## # A tibble: 10 × 6
## country Code Year population sex disorders
## <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 Argentina ARG 2015 43257064 M 3.07
## 2 Argentina ARG 2015 43257064 F 1.17
## 3 Argentina ARG 2016 43668236 M 3.04
## 4 Argentina ARG 2016 43668236 F 1.16
## 5 Argentina ARG 2017 44054616 M 3.02
## 6 Argentina ARG 2017 44054616 F 1.16
## 7 Argentina ARG 2018 44413592 M 3.04
## 8 Argentina ARG 2018 44413592 F 1.17
## 9 Argentina ARG 2019 44745516 M 3.11
## 10 Argentina ARG 2019 44745516 F 1.19
This dataset is in long form. Each row represents a country-year-sex combination, and we have three columns describing these grouping indicators. We have two values: population and disorders. The representation is a little bit wasteful, as exactly the same population figure is replicated twice, once for men, once for women. But this data representation is good for many tasks, e.g. for comparing males and females on a barplot.
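For instance, a grouped barplot comparing the sexes might look like this (a sketch, picking a single year for clarity):
disordersL %>%
  filter(Year == 2015) %>%  # one year only
  ggplot(aes(country, disorders, fill = sex)) +
  geom_col(position = "dodge")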
Alternatively, we may want to convert it into a form where country and sex are in rows, but years are in columns. In order to achieve this, we should first convert it into a long form, and thereafter make it into a wide form by year. Hence we need to spread the two value columns–population and disorders–into columns by year:
disordersL %>%
  pivot_wider(names_from = Year,
              values_from = c(population, disorders)) %>%
  # note: both 'population' and 'disorders'
  # are values
  head(4)
## # A tibble: 4 × 13
## country Code sex population_2015 population_2016 population_2017
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Argentina ARG M 43257064 43668236 44054616
## 2 Argentina ARG F 43257064 43668236 44054616
## 3 Kenya KEN M 46851496 47894668 48948140
## 4 Kenya KEN F 46851496 47894668 48948140
## population_2018 population_2019 disorders_2015 disorders_2016 disorders_2017
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 44413592 44745516 3.07 3.04 3.02
## 2 44413592 44745516 1.17 1.16 1.16
## 3 49953300 50951452 0.747 0.744 0.742
## 4 49953300 50951452 0.654 0.655 0.658
## disorders_2018 disorders_2019
## <dbl> <dbl>
## 1 3.04 3.11
## 2 1.17 1.19
## 3 0.752 0.773
## 4 0.662 0.669
Note that values_from takes a vector value: the names of the value columns we want to combine into columns by year.
Exercise 14.11 Take the Ice extent data. Let’s only consider columns year, month, region, extent and area–remove columns data-type and time.
- Reshape the dataset in a way that there is only a single row for each year and region, and the values for each month are in their separate columns.
- Reshape it into a wide form so that each year-month combination is in a single row, but North and South regions are in separate columns.
See the solution
14.2.5 Applications
Above we introduced reshaping. But there we assumed that you know what kind of shape you want the data to be in. This section provides a few examples that help you to decide what shape is appropriate for a particular task.
Consider the traffic fatalities data. It looks like
read_delim("data/fatalities.csv")
fatalities <-%>%
fatalities sample_n(4)
## # A tibble: 4 × 4
## year state fatal pop
## <dbl> <chr> <dbl> <dbl>
## 1 1984 OR 572 2675998.
## 2 1985 WA 744 4408994.
## 3 1985 OR 559 2686996.
## 4 1986 MN 571 4212996
This dataset has two groupings: year and state, and two values: number of fatalities, and population. It is stored in a long form.
You want to plot the traffic fatalities over time by state as a line plot with different colors denoting different states (see Section 13.4.2). Is this a good shape if you intend to use ggplot?
The crucial part, when using ggplot, is the aesthetics mapping in the form aes(x, y, col = ...). It turns out this is exactly the form of data you want: aes(year, fatal, col = state) is designed for long form data: it takes the fatal column, separates it by state, and draws each line in a separate color:
fatalities %>%
  ggplot(aes(year, fatal,
             col = state)) +
  geom_line()
This is a good approach–it makes sense to let the computer figure out how many different states there are in the state column. It also makes sense to let us decide which aesthetic we want to depend on state; maybe we want this to be line style instead of color.
Do you always want long form data for such a plot? Well, it depends on the software you are using. Not all software is designed to plot data in long form. Base-R matplot(), for instance, works differently: you call it matplot(x, Y, type = "l"), where x is the vector of x-values and Y is the matrix of y-values. This plots all columns of Y against x. type = "l" asks it to draw lines (otherwise it will be a scatterplot). Here we need wide form data instead:
fWide <- fatalities %>%
  select(!pop) %>%
  # all vars, except 'pop'
  pivot_wider(names_from = state,
              values_from = fatal)
head(fWide, 3)
## # A tibble: 3 × 4
## year MN OR WA
## <dbl> <dbl> <dbl> <dbl>
## 1 1982 571 518 748
## 2 1983 555 550 698
## 3 1984 582 572 746
We also need to separate x, here the column year, and the data to be plotted (everything but year):
matplot(fWide$year,            # plot year...
        select(fWide, !year),  # ...vs everything but year
        type = "l")
As you can see, the plot is basically the same, save the different fonts, line styles and such. Typically, ggplot is easier to use when you already have a ready-made dataset, but for plotting individual data vectors and matrices, base-R functions tend to be easier. As this example shows, it is also easier to use if your data is already in a wide form.
But now consider a somewhat different task: you want to do a trajectory plot (see Section 13.9.2.1) where fatalities in Oregon are on x-axis and fatalities in Washington on y-axis. Such plots are fairly common in many applications (although I have never seen it for traffic accidents). Here, unlike in the previous plot, a single point denotes both Oregon and Washington.
In order to use ggplot, you need something like aes(OR, WA). However, there are no separate columns for OR and WA in the original data. Hence we need to separate those–this type of plot requires data in a wide form. Here we can just re-use the wide table fWide we created above:
fWide %>%
  ggplot(aes(OR, WA)) +
  geom_path() +
  geom_label(aes(label = year))  # year labels for clarity
Exercise 14.12 Use ice extent data. Both ice area and extent are certain surface areas that are measured in the same units (million km2 in these data) so you can put them on the same graph.
Make a line plot of ice extent and area over time, where time is on x-axis, and the two measures are on y axis.
14.3 Date and time
Date and time data is something we use a lot in everyday life. However, it turns out that it is a surprisingly complex data type. Below, we work through a few common tasks with the related data structures.
A lot of the problems boil down to the fact that common time units like days, months, weeks, even minutes, are not physical units. The only physical unit is the second, with a precisely defined duration. The other units are convenience units, whose duration changes. Even more, time is commonly accounted for in a mixed measurement system, incorporating decimal, 12/24-based, 60-based, and 7-based units. Months, with their 28-31 day length, are an attempt to fit the moon’s revolution into that of the earth. On top of that, different regions have different time zones and daylight saving time, some of which are re-arranged each year. Not surprisingly, this creates a lot of issues when working with time data.
14.3.1 Duration
Consider a common task: you are analyzing the results of an athletic competition. There were four runners, and their time was:
bib# | time |
---|---|
102 | 58:30 |
101 | 59:45 |
104 | 1:00:15 |
103 | 1:02:30 |
So the two best athletes ran slightly under 1 hr and the two slower ones slightly over that.
This is a common way to express the results. For time intervals less than 1 hr, we write MM:SS, e.g. 58:30. From the context we understand what it means; in this case it is 58 minutes and 30 seconds. We could stay strictly within the decimal system and physical units and write 3510 seconds instead, but unfortunately, such numbers are much harder to understand than the everyday “58:30”.
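The conversion itself is simple arithmetic:
58*60 + 30  # 58 minutes and 30 seconds, expressed in seconds
## [1] 3510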
We can load such a data file into R as
results <- read_delim(
  "bib#,time
102,58:30
101,59:45
104,1:00:15
103,1:02:30")
This uses the fact that read_delim() can load not just files, but also file content, fed to it as a string. So we write a multi-line string that is just the content in the form of a csv file.
But now we have a problem. The results look like
results
## # A tibble: 4 × 2
## `bib#` time
## <dbl> <time>
## 1 102 58:30:00
## 2 101 59:45:00
## 3 104 01:00:15
## 4 103 01:02:30
As you can see, the entry “58:30” is not interpreted as 58 minutes and 30 seconds but as 58 hours and 30 minutes! The runners #102 and #101 apparently spent more than two days on the course! This can be confirmed by computing the time range:
results %>%
  pull(time) %>%
  range()
## Time differences in secs
## [1] 3615 215100
Indeed the winner has time 3615 seconds (1h 15s, the runner #104), and the slowest one is 215100 seconds. That is what 59 hours and 45 mins amounts to.
In the small case here, it is easy to remedy the problem: we can just change “58:30” to “0:58:30”. But in larger data files, this is not feasible. We need to do this in the code. The strategy we pursue below is as follows: first, read the time column as text, then add “00:” in front of those values that only contain a single colon, and finally, convert the result to duration.
First, we read the data while specifying that the time column is of type “character” (see the documentation for details):
results <- read_delim(
  "bib#,time
102,58:30
101,59:45
104,1:00:15
103,1:02:30",
  col_types = list(time = "character"))
  # tell that column 'time' is of type 'character'
results
## # A tibble: 4 × 2
## `bib#` time
## <dbl> <chr>
## 1 102 58:30
## 2 101 59:45
## 3 104 1:00:15
## 4 103 1:02:30
Next, we can just use mutate() and ifelse() to add “00:” in front of the values that contain a single colon only:
results %>%
  mutate(time = ifelse(str_count(time, ":") == 1,
                       paste0("00:", time),
                       time))
## # A tibble: 4 × 2
## `bib#` time
## <dbl> <chr>
## 1 102 00:58:30
## 2 101 00:59:45
## 3 104 1:00:15
## 4 103 1:02:30
This code:
- counts the colons in the column time: str_count(time, ":")
- if there is a single colon, pastes “00:” in front of the value: paste0("00:", time)
- otherwise keeps time as it is (just time)
Now all the time column values have hour markers in front of minutes, but the column is still of character type, not time. We can convert it to time using the parse_time() function from the readr package:
results <- results %>%
  mutate(time = ifelse(str_count(time, ":") == 1,
                       paste0("00:", time),
                       time)) %>%
  mutate(time = parse_time(time))
Obviously, we can achieve the same in a single mutate.
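Starting from the character-type time column, the single-mutate version might look like this (a sketch, equivalent to the two steps above):
results %>%
  mutate(time = parse_time(ifelse(str_count(time, ":") == 1,
                                  paste0("00:", time),
                                  time)))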
Now the time is understood in a correct manner: the winner is
results %>%
  filter(rank(time) == 1)
## # A tibble: 1 × 2
## `bib#` time
## <dbl> <time>
## 1 102 58'30"
# athlete #102 is the winner!
results %>%
  ggplot(aes(time)) +
  geom_histogram()
14.4 More about data cleaning
TBD: more about data cleaning
14.4.1 Separate a single column into multiple columns
TBD: new motivation and more complex drinking data example
But the new long-form drinking data has a problem. Namely, the sex.year variable is a combination of both sex and year, and that makes it difficult to select only certain genders or years. It would be easier to replace it with two different columns: sex and year.
Fortunately, the tidyr package has several functions for this task.
Here the appropriate one would be separate_wider_position(), which splits a column into multiple columns by position. Its main arguments are cols, which tells which columns to separate, and widths, which specifies how many characters will go into each new column. There are other arguments too, e.g. whether the original column is to be removed, and what to do if the column is too wide or too narrow.
The most important of these arguments is widths. It must be a named vector, describing the width of the fields that go into the new columns. Its names will become the corresponding column names. As the long form drinking data has sex.year values in the form SYYYY, we need to assign width 1 for sex, and width 4 for year. So the argument might look like
widths = c(sex = 1, year = 4)
Here we give the columns natural names, sex for sex and year for year:
longDrinking %>%
  separate_wider_position(sex.year,
                          widths = c(sex = 1, year = 4)) %>%
  head()
## Error in eval(expr, envir, enclos): object 'longDrinking' not found
Note that separate does not convert the column types. It is fine that sex is character, but we may want to convert year to a number. So the final code may look like:
longDrinking %>%
  separate_wider_position(sex.year,
                          widths = c(sex = 1, year = 4)) %>%
  mutate(year = as.numeric(year)) %>%
  head()
## Error in eval(expr, envir, enclos): object 'longDrinking' not found
This form is the most convenient one to use for filtering and many other data processing tasks.
Besides separating by position, there are also functions that separate based on a delimiter, or on regular expressions. Obviously, there are many other options too, e.g. one may use string processing functionality with substrings and regular expressions.
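For instance, with a hypothetical column sex_year containing values like “M_2015”, the delimiter-based variant separate_wider_delim() might be used as follows (a sketch; 'df' is a made-up data frame):
df %>%
  separate_wider_delim(sex_year, delim = "_",
                       names = c("sex", "year"))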