Chapter 9 Manipulating data: merging and reshaping
In Section Pipes and dplyr we discussed various ways
to manipulate data focusing on dplyr functions select
, filter
,
arrange
, mutate
and summarize
. Those functions allow to operate
line-by-line and aggregate multiple lines into a few summary values.
However, there are other kinds of data manipulations where line-by-line approach is not helpful. These are merge (join) operations, and reshaping between long and wide form.
9.1 Merging dataframes line-by-line and column-by-column
A simple merge operation is just attaching one dataframe next to the
other, either right of it, or underneath it. This can be done using
cbind
(column-bind) and rbind
(row-bind) functions.
In their simplest form, cbind
and rbind
can combine vectors into
matrices:
1:3
x <- 11:13
y <-cbind(x, y) # combine column-wise
## x y
## [1,] 1 11
## [2,] 2 12
## [3,] 3 13
rbind(x, y) # combine row-wise
## [,1] [,2] [,3]
## x 1 2 3
## y 11 12 13
We repeat here that the results are matrices, not data frames. But
you can transform these into data frames by as.data.frame
. Note
also that the lengths of the vectors must match.
But these functions also allow to combine data frames:
data.frame(x=1:3, y=11:13)
df1 <- data.frame(a=c("a", "b", "c"), b=c("A", "B", "C"))
df2 <-cbind(df1, df2)
## x y a b
## 1 1 11 a A
## 2 2 12 b B
## 3 3 13 c C
So cbind
just attaches one dataframe next to the other, given these
have equal number of rows. As we were cbind
-ing data frames, not
vectors, the result is a data frame too.
rbind
attaches the second dataframe underneath the first one, given
the variable match:
data.frame(x=1, y="a")
df1 <- data.frame(x=2, y="b")
df2 <-rbind(df1, df2)
## x y
## 1 1 a
## 2 2 b
As above, the result is a data frame.
rbind
allows us to create a data frame row-by-row by starting with
an empty data frame (one can use the special empty value NULL
), and
adding new data to it, row-by-row:
NULL
df <-for(i in 1:6) {
rbind(df,
df <-data.frame(x=i, x2=i^2))
} df
## x x2
## 1 1 1
## 2 2 4
## 3 3 9
## 4 4 16
## 5 5 25
## 6 6 36
This is often a very handy and convenient approach, although not efficient in case the loops are long.
9.2 Merging by key (database joins)
Merging data with cbind
is all well if the rows of the two
dataframes correspond to the same cases. But this is often not the
case. Even more, they may correspond to conceptually different
objects, e.g. geographic locations and temperatures where the first
dataframe lists locations and the second temperatures for certain
dates and locations. Below we discuss merging using the base-R
functions, a very good introduction to dplyr joins is in R for Data
Science.
In that case we want to merge by key. Merge key is an unique identifier that connects the case in one dataframe to the case in another dataframe.
Consider two data frames:
data.frame(id=c(101, 102, 103),
customers <-name=c("Timur", "Shah Rukh", "Ibn Khaldun"),
yob=c(1336, 1405, 1332))
customers
## id name yob
## 1 101 Timur 1336
## 2 102 Shah Rukh 1405
## 3 103 Ibn Khaldun 1332
data.frame(id=c(1,2,3,4),
purchases <-date=c("2021-10-17", "2021-09-12", "2021-06-12",
"2021-09-03"),
price=c(100, 99.95, 499, 52.30),
customer=c(101, 101, 103, 101))
purchases
## id date price customer
## 1 1 2021-10-17 100.00 101
## 2 2 2021-09-12 99.95 101
## 3 3 2021-06-12 499.00 103
## 4 4 2021-09-03 52.30 101
So we have data about three customers (born over 500 years ago), who collectively did four purchases, all in 2021.
9.2.1 What is key
Obviously, we cannot merge these two dataframes line-by-line as they describe very different objects, persons and purchases. But we can connect each purchase to the corresponding customer. This is where we need a key–a unique identifier that connects cases in these data frames. In this case it is the customer id, called id in the customers’ dataframe and customer in the purchases dataframe. Note that key can contain more than one variable, e.g. we may have different id-s for different dates, in that case the key may contain both id and date. But this is not the case here.
The basic merging by key can be done by merge(df1, df2, by.x=<key1>, by.y=<key2>
). df1
and df2
are the data frames to be merged, and
key1
and key2
are the name of the key in the first and second data
frame:
merge(purchases, customers, by.x="customer", by.y="id")
## customer id date price name yob
## 1 101 1 2021-10-17 100.00 Timur 1336
## 2 101 2 2021-09-12 99.95 Timur 1336
## 3 101 4 2021-09-03 52.30 Timur 1336
## 4 103 3 2021-06-12 499.00 Ibn Khaldun 1332
Here we take the variable customer in the purchases data frame, and find the matching values of id in the customers data frame. For instance, the first purchase (purchase id = 1) is made by customer 101, i.e. by Timur, and so on.
Note that the variable id is used in two different meanings: first as customer id in the customers data, but also as purchase id in the purchases dataset. Despite the rather similar meaning of the variable—in both dataframes id is a unique label for each case—this is not a valid merge key. In these data it is easy to see what is key, but in more complex databases it may require more than superficial knowledge of the data.