F Thinking Big: Data Tables
Data frames are core elements for data handling in R. However, they suffer from several limitations. One of the major issues with data frames is that they are memory hungry and slow. This is not a problem when working with relatively small datasets (say, up to 100,000 rows). However, when your dataset size exceeds gigabytes, data frames may be infeasibly slow and too memory hungry.
F.1 Background: Passing By Value And Passing By Reference
R is (mostly) a pass-by-value language. This means that when you modify data, at every step a new copy of the complete modified object is created and stored in memory, and the former object is freed (garbage-collected) when it is no longer in use.
The main advantage of this approach is consistency: we have the guarantee that functions do not modify their inputs. However, for large objects, copying may be slow, and moreover, it requires at least twice as much memory before the old object is freed. For more complex processing pipelines, memory consumption may be even more than twice the size of the original object.
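As a small illustration of pass-by-value semantics (a toy sketch; the function f is made up for this example):
f <- function(x) {
  x[1] <- 99   # modifies only the local copy of x
  x
}
v <- c(1, 2, 3)
f(v)   # returns the modified copy: 99 2 3
v      # the original vector is unchanged: 1 2 3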
Data tables implement a number of pass-by-reference functions. In pass-by-reference, the function is not given a fresh copy of the input, but is instead told where the object is in memory: instead of copying gigabytes of data, only a single tiny memory pointer is passed. But this also means that the function now accesses and modifies the original object, not a copy of it. This may sometimes lead to bugs and unexpected behavior, but careful use of the pass-by-reference approach can improve speed and lower the memory footprint substantially.
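For instance, data.table's := operator modifies a table in place, so changes made inside a function are visible outside of it as well (a minimal sketch; double_a is made up for this example):
library(data.table)
dta <- data.table(a = 1:3)
double_a <- function(d) {
  d[, a := a * 2]   # ':=' modifies d by reference; no copy is made
}
double_a(dta)
dta   # 'a' is now 2 4 6: the original object was changed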
F.2 Data Tables: Introduction
Data tables and most of the related goodies live in the data.table library, so you either have to load the library or specify the namespace when using its functions.
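Both ways look like this (a trivial sketch):
library(data.table)                # either load the library...
data.table::data.table(x = 1:3)    # ...or prefix functions with the namespace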
F.2.1 Replacement for Data Frames (Sort of)
Data tables are designed to be largely a replacement for data frames. The syntax is similar, and the two are largely interchangeable. For instance, we can create and play with a data table as follows:
library(data.table)
dt <- data.table(id=1:5, x=rnorm(5), y=runif(5))
dt
## id x y
## <int> <num> <num>
## 1: 1 -0.8690057 0.3673254
## 2: 2 0.2356334 0.6302937
## 3: 3 0.6514741 0.3326334
## 4: 4 -1.8085753 0.2318481
## 5: 5 0.6464888 0.1198675
The result looks almost identical to a similar data frame (the only difference is the colons after the row numbers). Behind the scenes these objects are almost identical too: both objects are lists of vectors. This structural similarity allows data tables to be used as drop-in replacements for data frames, at least in some circumstances. For instance, we can extract variables with $:
dt$x
## [1] -0.8690057 0.2356334 0.6514741 -1.8085753 0.6464888
or rows with row indices:
dt[c(2,4),]
## id x y
## <int> <num> <num>
## 1: 2 0.2356334 0.6302937
## 2: 4 -1.8085753 0.2318481
However, data tables use unquoted variable names (like dplyr) by default:
dt[,x]
## [1] -0.8690057 0.2356334 0.6514741 -1.8085753 0.6464888
If the variable name is stored in another variable, we have to use the additional argument with=FALSE:
"x"
var <-FALSE] dt[, var, with=
## x
## <num>
## 1: -0.8690057
## 2: 0.2356334
## 3: 0.6514741
## 4: -1.8085753
## 5: 0.6464888
Note also that instead of getting a vector, we now get a data.table with a single column “x”. This behavior is the main reason why replacing data frames with data tables may require changing quite a bit of code.
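Newer versions of data.table also support the .. prefix, which tells the indexing operator to look the name up in the calling environment (a brief sketch):
var <- "x"
dt[, ..var]   # '..' makes data.table look up 'var' outside the table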
F.2.2 Fast Reading and Writing
Many data frame users may appreciate the fact that the data input-output functions fread and fwrite run at least an order of magnitude faster on large files. These are largely replacements for read.table and write.table; however, their syntax differs noticeably in places.
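A typical write-and-read round trip might look like this (a sketch; the file name mydata.csv is made up for this example):
fwrite(dt, "mydata.csv")        # write the table as csv; fast even for large tables
dt2 <- fread("mydata.csv")      # read it back; separator and column types are detected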
In particular, fread accepts either a file name, a URL, or a shell command that prints output; it automatically detects the column separator, but it does not automatically open compressed files. The latter is not a big deal on Unix, where one can just issue
fread("bzcat data.csv.bz2") data <-
However, decompression is not that simple on Windows, and hence it is hard to write platform-independent code that opens compressed files.37
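One possible platform-independent workaround, though not memory-efficient for very large files, is to decompress through R's built-in bzfile() connection into a temporary file first (a sketch):
tmp <- tempfile(fileext = ".csv")                    # temporary decompressed copy
writeLines(readLines(bzfile("data.csv.bz2")), tmp)   # decompress using R itself
data <- fread(tmp)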
If your computer has enough memory and speed is not an issue, your interest in data tables may end here. You can just transform a data table into a data frame with setDF (and the other way around with setDT). Let's transform our data table into a data frame:
setDF(dt)
dt
## id x y
## 1 1 -0.8690057 0.3673254
## 2 2 0.2356334 0.6302937
## 3 3 0.6514741 0.3326334
## 4 4 -1.8085753 0.2318481
## 5 5 0.6464888 0.1198675
Do you see that the colons after the row numbers are gone? This means dt is now a data frame.
Note that this function behaves very differently from what we have learned earlier: it modifies the object in place (by reference). We do not have to assign the result to a new variable using a construct like df <- setDF(dt) (but we still can write it like this, which is handy when using magrittr pipes). This is a manifestation of the power of data tables: the object is not copied, but the same object is modified in memory instead. setDF and setDT are very efficient; even huge tables are converted instantly with virtually no need for additional memory.
However, great power comes hand-in-hand with great responsibility: it is easy to forget that setDF modifies its argument in place.
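If you need to keep the original data table intact, make an explicit deep copy first with data.table's copy() (a brief sketch with a hypothetical table dt2):
dt2 <- data.table(a = 1:3)   # a hypothetical data table
df2 <- copy(dt2)             # deep copy: df2 and dt2 are now independent objects
setDF(df2)                   # df2 becomes a data frame; dt2 remains a data table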
F.3 Indexing: The Major Powerhouse of Data Tables
Data tables’ indexing is much more powerful than that of data frames. The single-bracket indexing is a powerful (albeit confusing) set of functions. Its general syntax is as follows:
dt[i, j, by]
where i specifies what to do with rows (for instance, select certain rows), j tells what to do with columns (such as select columns, compute new columns, aggregate columns), and by contains the grouping variables.
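To make the three parts concrete, here is a brief sketch using the small table from above (first converting it back into a data table with setDT):
setDT(dt)                                 # convert dt back into a data table
dt[x > 0, mean(y)]                        # i: keep rows with positive x; j: mean of y
dt[, mean(y), by = .(positive = x > 0)]   # by: group the computation by the sign of x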
Let’s demonstrate this with the flights data from the nycflights13 package. We load the data and transform it into a data.table:
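The loading step might look like this (a sketch, assuming the nycflights13 package is installed):
library(nycflights13)
flights <- as.data.table(flights)   # copy the data frame into a data table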
Automatic decompression is a feature request for data tables.↩︎