F Thinking Big: Data Tables

Data frames are the core structure for data handling in R. However, they suffer from several limitations. One major issue is that data frames are memory hungry and slow. This is not a problem when working with relatively small datasets (say, up to 100,000 rows). However, when your dataset size exceeds gigabytes, data frames may be infeasibly slow and consume too much memory.

F.1 Background: Passing By Value And Passing By Reference

R is (mostly) a pass-by-value language. This means that when you modify data, each step creates a new copy of the complete modified object and stores it in memory, and the former object is freed (garbage-collected) once it is no longer in use.

The main advantage of this approach is consistency: we have the guarantee that functions do not modify their inputs. However, for large objects copying may be slow and, moreover, it requires at least twice as much memory until the old object is freed. In more complex processing pipelines, the memory consumption may be even more than twice the size of the original object.

Data tables implement a number of pass-by-reference functions. In pass-by-reference, the function is not given a fresh copy of its inputs, but is instead told where the object is in memory. Instead of copying gigabytes of data, only a single tiny memory pointer is passed. But this also means that the function now accesses and modifies the original object, not a copy of it. This may sometimes lead to bugs and unexpected behavior, but careful use of the pass-by-reference approach may improve the speed and lower the memory footprint substantially.
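The difference between the two approaches can be sketched with two tiny functions. This is only an illustration: the names add_flag_df, add_flag_dt and dt_demo are made up for this example, and the := operator used here is data.table's assignment by reference.

```r
library(data.table)

# Base R: the function works on its own copy, the caller's object is untouched.
add_flag_df <- function(d) {
  d$flag <- TRUE
  d
}
df <- data.frame(id = 1:3)
res <- add_flag_df(df)
names(df)        # the original still has only the column "id"

# data.table: `:=` adds the column by reference, so the caller's object changes.
add_flag_dt <- function(d) {
  d[, flag := TRUE]
  invisible(d)
}
dt_demo <- data.table(id = 1:3)
add_flag_dt(dt_demo)
names(dt_demo)   # the original now has the columns "id" and "flag"
```

Note that in the second case we did not assign the result to anything, and the object still changed.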

F.2 Data Tables: Introduction

Data tables and most of the related goodies live in the data.table package, so you either have to load the package or specify the namespace when using its functions.

F.2.1 Replacement for Data Frames (Sort of)

Data tables are designed to be largely a replacement for data frames. The syntax is similar and the two are largely interchangeable. For instance, we can create and play with a data table as

library(data.table)
dt <- data.table(id=1:5, x=rnorm(5), y=runif(5))
dt
##       id          x         y
##    <int>      <num>     <num>
## 1:     1 -0.8690057 0.3673254
## 2:     2  0.2356334 0.6302937
## 3:     3  0.6514741 0.3326334
## 4:     4 -1.8085753 0.2318481
## 5:     5  0.6464888 0.1198675

The result looks almost identical to a similar data frame (the visible differences are the colons after the row numbers and the row of column types under the header). Behind the scenes these objects are almost identical too: both are lists of vectors. This structural similarity allows us to use data tables as drop-in replacements for data frames, at least in some circumstances. For instance, we can extract variables with $:

dt$x
## [1] -0.8690057  0.2356334  0.6514741 -1.8085753  0.6464888

or rows with row indices:

dt[c(2, 4), ]
##       id          x         y
##    <int>      <num>     <num>
## 1:     2  0.2356334 0.6302937
## 2:     4 -1.8085753 0.2318481

However, data tables use unquoted variable names (like dplyr) by default:

dt[, x]
## [1] -0.8690057  0.2356334  0.6514741 -1.8085753  0.6464888

In case we need to store the variable name in another variable, we have to use the additional argument with:

var <- "x"
dt[, var, with=FALSE]
##             x
##         <num>
## 1: -0.8690057
## 2:  0.2356334
## 3:  0.6514741
## 4: -1.8085753
## 5:  0.6464888

Note also that instead of a vector, we now get a data.table with a single column “x”. This behavior is the main reason why replacing data frames with data tables may require changing quite a bit of code.
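As a small sketch of the alternatives (the table dt2 below is made up for this illustration): when the column name is stored in a variable, list-style [[ ]] indexing returns a plain vector, while data.table's .. prefix is another way to get the one-column data.table.

```r
library(data.table)

dt2 <- data.table(id = 1:3, x = c(0.5, 1.5, 2.5))
var <- "x"

v <- dt2[[var]]        # plain numeric vector, as with data frames
sub <- dt2[, ..var]    # one-column data.table, like dt2[, var, with=FALSE]

is.numeric(v)
is.data.table(sub)
```

Which form to pick depends on whether the surrounding code expects a vector or a table.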

F.2.2 Fast Reading and Writing

Many data frame users may appreciate the fact that the data input-output functions fread and fwrite run at least an order of magnitude faster on large files. These are largely replacements for read.table and write.table, although their syntax differs noticeably in places. In particular, fread accepts a file name, an http URL, or a shell command that prints output; it automatically detects the column separator, but it does not automatically open compressed files. The latter is not a big deal when using Unix, where one can just issue

data <- fread(cmd = "bzcat data.csv.bz2")

However, decompression is not that simple on Windows, and hence it is hard to write platform-independent code that opens compressed files [1].
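For uncompressed files, a write-and-read round trip can be sketched as follows. The table out and the temporary file are made up for this illustration; only fwrite and fread come from data.table.

```r
library(data.table)

# A temporary file, so that the example does not leave anything behind.
tmp <- tempfile(fileext = ".csv")

out <- data.table(id = 1:3, name = c("a", "b", "c"))
fwrite(out, tmp)      # writes a comma-separated file with a header line
back <- fread(tmp)    # separator and column types are detected automatically

all.equal(out, back)
unlink(tmp)
```

On a table this small the speed difference is of course invisible; it only shows up on large files.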

If your computer has enough memory and speed is not an issue, your interest in data tables may end here. You can just transform a data table into a data frame with setDF (and the other way around with setDT). Let’s transform our data table into a data frame:

setDF(dt)
dt
##   id          x         y
## 1  1 -0.8690057 0.3673254
## 2  2  0.2356334 0.6302937
## 3  3  0.6514741 0.3326334
## 4  4 -1.8085753 0.2318481
## 5  5  0.6464888 0.1198675

Do you see that the colons after the row names are gone? This means dt is now a data frame.

Note that this function behaves very differently from what we have learned earlier: it modifies the object in place (by reference). We do not have to assign the result to a new variable using a construct like df <- setDF(dt) (but we can still write it this way, which is handy when using magrittr pipes). This is a manifestation of the power of data tables: the object is not copied; instead, the same object is modified in memory. setDF and setDT are very efficient: even huge tables are converted instantly, with virtually no need for additional memory.

However, great power comes with great responsibility: it is easy to forget that setDF modifies its argument.
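A minimal sketch of this pitfall, and of one way around it, is below. The helper names unsafe_df and safe_df and the table d are made up for this illustration; copy is data.table's function for making an explicit copy.

```r
library(data.table)

d <- data.table(id = 1:3)

# Converts in place: the caller's object becomes a data frame as well.
unsafe_df <- function(tab) setDF(tab)

# Makes an explicit copy first, so the caller's object stays a data table.
safe_df <- function(tab) setDF(copy(tab))

res1 <- safe_df(d)
is.data.table(d)    # d is still a data table

res2 <- unsafe_df(d)
is.data.table(d)    # d has been converted as a side effect
```

The explicit copy costs memory and time, so it only makes sense where the original object is needed afterwards.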

F.3 Indexing: The Major Workhorse of Data Tables

Data tables’ indexing is much more powerful than that of data frames. The single-bracket indexing is a powerful (albeit confusing) set of functions. Its general syntax is as follows:

dt[i, j, by]

where i specifies what to do with rows (for instance, select certain rows), j tells what to do with columns (such as select columns, compute new columns, aggregate columns), and by contains the grouping variables.
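As a toy sketch of the three slots, consider the made-up table sales below (it is not part of any package): i keeps the rows with amount of at least 10, j computes a sum and a row count, and by does this separately for each shop.

```r
library(data.table)

sales <- data.table(shop   = c("a", "a", "b", "b", "b"),
                    amount = c(10, 20, 5, 15, 40))

# i: filter rows; j: aggregate columns; by: grouping variable
res <- sales[amount >= 10,
             .(total = sum(amount), n = .N),
             by = shop]
res
```

Here .( ) is data.table's shorthand for list( ), and .N is the number of rows in the current group.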

Let’s demonstrate this with the flights data from the nycflights13 package. We load the data and transform it into a data.table:

  1. Automatic decompression is a feature request for data tables.