Chapter 9 Manipulating data: merging and reshaping

In Section Pipes and dplyr we discussed various ways to manipulate data focusing on dplyr functions select, filter, arrange, mutate and summarize. Those functions allow to operate line-by-line and aggregate multiple lines into a few summary values.

However, there are other kinds of data manipulations where line-by-line approach is not helpful. These are merge (join) operations, and reshaping between long and wide form.

9.1 Merging dataframes line-by-line and column-by-column

A simple merge operation is just attaching one dataframe next to the other, either right of it, or underneath it. This can be done using cbind (column-bind) and rbind (row-bind) functions.

In their simplest form, cbind and rbind can combine vectors into matrices:

x <- 1:3
y <- 11:13
cbind(x, y)  # combine column-wise

##      x  y
## [1,] 1 11
## [2,] 2 12
## [3,] 3 13

rbind(x, y)  # combine row-wise

##   [,1] [,2] [,3]
## x    1    2    3
## y   11   12   13

We repeat here that the results are matrices, not data frames. But you can transform these into data frames by as.data.frame. Note also that the lengths of the vectors must match.

But these functions also allow to combine data frames:

df1 <- data.frame(x=1:3, y=11:13)
df2 <- data.frame(a=c("a", "b", "c"), b=c("A", "B", "C"))
cbind(df1, df2)

##   x  y a b
## 1 1 11 a A
## 2 2 12 b B
## 3 3 13 c C

So cbind just attaches one dataframe next to the other, given these have equal number of rows. As we were cbind-ing data frames, not vectors, the result is a data frame too.

rbind attaches the second dataframe underneath the first one, given the variable match:

df1 <- data.frame(x=1, y="a")
df2 <- data.frame(x=2, y="b")
rbind(df1, df2)

##   x y
## 1 1 a
## 2 2 b

As above, the result is a data frame.

rbind allows us to create a data frame row-by-row by starting with an empty data frame (one can use the special empty value NULL), and adding new data to it, row-by-row:

df <- NULL
for(i in 1:6) {
   df <- rbind(df,
               data.frame(x=i, x2=i^2))
}
df

##   x x2
## 1 1  1
## 2 2  4
## 3 3  9
## 4 4 16
## 5 5 25
## 6 6 36

This is often a very handy and convenient approach, although not efficient in case the loops are long.

9.2 Merging by key (database joins)

Merging data with cbind is all well if the rows of the two dataframes correspond to the same cases. But this is often not the case. Even more, they may correspond to conceptually different objects, e.g. geographic locations and temperatures where the first dataframe lists locations and the second temperatures for certain dates and locations. Below we discuss merging using the base-R functions, a very good introduction to dplyr joins is in R for Data Science.

In that case we want to merge by key. Merge key is an unique identifier that connects the case in one dataframe to the case in another dataframe.

Consider two data frames:

customers <- data.frame(id=c(101, 102, 103),
                        name=c("Timur", "Shah Rukh", "Ibn Khaldun"),
                        yob=c(1336, 1405, 1332))
customers

##    id        name  yob
## 1 101       Timur 1336
## 2 102   Shah Rukh 1405
## 3 103 Ibn Khaldun 1332

purchases <- data.frame(id=c(1,2,3,4),
                        date=c("2021-10-17", "2021-09-12", "2021-06-12",
                               "2021-09-03"),
                        price=c(100, 99.95, 499, 52.30),
                        customer=c(101, 101, 103, 101))
purchases

##   id       date  price customer
## 1  1 2021-10-17 100.00      101
## 2  2 2021-09-12  99.95      101
## 3  3 2021-06-12 499.00      103
## 4  4 2021-09-03  52.30      101

So we have data about three customers (born over 500 years ago), who collectively did four purchases, all in 2021.

9.2.1 What is key

Obviously, we cannot merge these two dataframes line-by-line as they describe very different objects, persons and purchases. But we can connect each purchase to the corresponding customer. This is where we need a key–a unique identifier that connects cases in these data frames. In this case it is the customer id, called id in the customers’ dataframe and customer in the purchases dataframe. Note that key can contain more than one variable, e.g. we may have different id-s for different dates, in that case the key may contain both id and date. But this is not the case here.

The basic merging by key can be done by merge(df1, df2, by.x=<key1>, by.y=<key2>). df1 and df2 are the data frames to be merged, and key1 and key2 are the name of the key in the first and second data frame:

merge(purchases, customers, by.x="customer", by.y="id")

##   customer id       date  price        name  yob
## 1      101  1 2021-10-17 100.00       Timur 1336
## 2      101  2 2021-09-12  99.95       Timur 1336
## 3      101  4 2021-09-03  52.30       Timur 1336
## 4      103  3 2021-06-12 499.00 Ibn Khaldun 1332

Here we take the variable customer in the purchases data frame, and find the matching values of id in the customers data frame. For instance, the first purchase (purchase id = 1) is made by customer 101, i.e. by Timur, and so on.

Note that the variable id is used in two different meanings: first as customer id in the customers data, but also as purchase id in the purchases dataset. Despite the rather similar meaning of the variable—in both dataframes id is a unique label for each case—this is not a valid merge key. In these data it is easy to see what is key, but in more complex databases it may require more than superficial knowledge of the data.

9.2.2 Inner and outer joins

9.3 Reshaping

Tabular data can be stored in a multiple different ways. One of the most useful distinction is between wide form and long form data. This section discusses these two forms of data, and reshaping–how to transform data between wide and long form.

Machine learning in R