Chapter 19 Data structures

In Section 4.4 above we met the four basic (atomic) data types in R: integer, double, logical and character (string). Here we discuss some of these in more detail, and include a few more data types, most importantly factors, R’s implementation of categorical variables.

19.1 More about strings

Section 4.4.2 introduced strings. Here we discuss more of the related technical details. The section may be a little hard for beginners; it is primarily meant to be a reference for later chapters of the book.

19.1.1 Characters and special characters

Strings are typically made of normal alphabetical characters, enclosed in quotes. For instance

"abc 123"
## [1] "abc 123"
'蘭花'
## [1] "蘭花"

are both valid strings. The fact that the second one contains non-English characters does not make it invalid–these are still characters.

But strings can also contain special characters. These are sequences of a backslash \ and a letter. For instance, \n is the new line and \t is the tab symbol. So the code

cat("a\nb\n")
## a
## b
can be understood as follows:
  • The first character is “a” that will be printed literally
  • The second one is the special character “\n”, the new line, that is printed as … new line (line break).
  • The third character is “b”, printed literally…
  • followed by another new line, so whatever follows “b” will be printed on the new line (nothing here).

(I use cat() here because it shows line breaks as line breaks. If you just type "a\nb\n" at the console, \n is shown literally.) In a similar fashion, in

cat('蘭\t花\n')
## 蘭    花

the \t symbol is converted to tab–you see a long space between “蘭” and “花”.
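
A quick way to convince yourself that \n and \t are single characters, not a backslash followed by a letter, is to count the characters. The lines below are a small sketch using base-R functions only; the variable name s is just for illustration.

```r
s <- "a\nb\n"
nchar(s)          # 4 characters: a, new line, b, new line
print(s)          # print() shows the escape sequences as they were typed
cat(s)            # cat() turns them into actual line breaks
nchar("x\ty")     # 3 characters: the tab counts as one
```

So the two symbols you type for a special character end up as one character in the string.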

19.1.2 How to include special characters in strings

The fact that “\” is a special character makes it a bit complicated to include an actual backslash \ in a string. The solution is to use a double backslash \\ instead. In the same way as backslash-n \n is the special character for new line, backslash-backslash \\ is the special character for backslash. So if you want to print “A\B”, you need to make the string "A\\B":

cat("A\\B")
## A\B

This does not usually create problems–we rarely need to print backslash symbols. But there are a few major exceptions–one of these is Windows file paths.

Traditionally, Windows uses the backslash as the path separator. So instead of the Unix habit of writing paths like "/home/siim/info201", on Windows you typically write "C:\home\siim\info201". Many software packages (including R) support Unix-style forward slashes on Windows too. This may be the best option if you actually want to type the path. But when copy-pasting from a file manager, you end up with the backslashed version. Unfortunately, this does not work:

cat("C:\home\siim\info201")

## Error: '\h' is an unrecognized escape in character string

This is because R will interpret \h (in “\home”), \s and \i as special characters. The error tells us that there is no such special character as \h (\s and \i do not exist either, but we only see the first error).

There are three ways to fix this. First, you can manually change all backslashes to forward slashes, as in "C:/home/siim/info201". Second, you can change them to double backslashes, as in "C:\\home\\siim\\info201". Finally, you can use raw strings (available since R 4.0.0). These are like normal strings, except that special characters are treated as ordinary characters. In R, you create a raw string as r"(<string>)": r must precede the quotes, and the string content must be enclosed in parentheses. So we can write the Windows path as

cat(r"(C:\home\siim\info201)")
## C:\home\siim\info201

This is perhaps the easiest way to copy-paste file paths on Windows.
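
The double-backslash version and the raw string are just two notations for the same string. Here is a small check, assuming R 4.0.0 or newer (older versions do not know raw strings); the variable names are made up for this example.

```r
# Two ways to write the same Windows path
p_escaped <- "C:\\home\\siim\\info201"     # every backslash doubled
p_raw     <- r"(C:\home\siim\info201)"     # raw string, needs R >= 4.0.0

identical(p_escaped, p_raw)   # TRUE: same characters, different notation
nchar(p_raw)                  # 20: each backslash counts once
```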

19.1.3 Line breaks and multiline strings

The special character \n, the line break, can also be inserted as … a line break: just break the string definition into multiple lines. This is handy when printing poems[35]:

cat("What are you carving?
I'm carving a skewer.
  Why are you carving?
To shoot a raven.")
## What are you carving?
## I'm carving a skewer.
##   Why are you carving?
## To shoot a raven.

Note that multiline strings preserve spacing: the third line starts with two extra spaces, and these are preserved in the printout.

This is perhaps the best approach if you want to store a long text with multiple line breaks in a character variable. Just compare with

cat("What are you carving?\nI'm carving a skewer.\nWhy are you carving?\nTo shoot a raven.")
## What are you carving?
## I'm carving a skewer.
## Why are you carving?
## To shoot a raven.

The result is the same, apart from the two leading spaces on the third line, but the code is much harder to read (and write).
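
If you want to verify that a line break typed inside the quotes really is the same thing as \n, you can compare the two versions directly. This is a minimal sketch with made-up variable names:

```r
one_line  <- "first\n  second"
multiline <- "first
  second"

identical(one_line, multiline)      # TRUE: both contain one new line character
strsplit(multiline, "\n")[[1]]      # split back into lines: "first" "  second"
```

The leading spaces of the second line are part of the string in both cases.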

19.2 Factor Variables

Data often contain categorical variables, that is, variables that can only take a small number of predetermined values.[36] Factors are R’s implementation of categorical variables.

19.2.1 Factor basics

Factors are in many ways similar to character strings. In particular, character strings are often treated in a similar fashion (as categorical variables), and factors can contain arbitrary text, exactly like strings. So why do we need a special data type for something that is so similar to characters? There are a few reasons:

  • First, R may not always understand when data is meant to be categorical and when not. A prime example is when the categories are coded as numbers, and we have to explain it explicitly to ggplot (see Section 14.5). In such situations it is convenient to force numbers into a dedicated categorical data type.
  • Second, categorical values only allow for limited operations. For instance, no arithmetic is possible with categories, even if those are coded as numbers.
  • Finally, certain categorical data may be ordered categories. It is useful to have a dedicated data type that only supports the related operations. See Section 19.2.3 for more.

For example, imagine you work with shirt sizes which can only take on the values S, M, and L. You can store these as characters:

shirts <- c("S", "M", "S", "L", "M", "L")

This will usually work, but the computer will happily accept unknown sizes:

shirts[3] <- "U"
shirts
## [1] "S" "M" "U" "L" "M" "L"

Instead of a character vector, you can turn shirts into a factor:

shirts <- factor(c("S", "M", "S", "L", "M", "L"))
shirts
## [1] S M S L M L
## Levels: L M S

As you see, the printout is largely similar–but now R also prints levels, the known permitted categories. Also, replacing an element with an unknown value is not possible any more:

shirts[3] <- "U"
## Warning in `[<-.factor`(`*tmp*`, 3, value = "U"): invalid factor level, NA generated
shirts
## [1] S    M    <NA> L    M    L   
## Levels: L M S
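
If you already know the permitted categories, you can spell them out with the levels = argument, and you can extend the set later if a new category becomes valid. The sketch below uses a made-up vector sizes and a made-up extra size "XL":

```r
# Declare the permitted categories explicitly, including the unused "L"
sizes <- factor(c("S", "M", "S"), levels = c("S", "M", "L"))
levels(sizes)     # "S" "M" "L": in the order given
table(sizes)      # counts every level; "L" shows up with count 0

# Permit an additional category, then use it
levels(sizes) <- c(levels(sizes), "XL")
sizes[3] <- "XL"  # no warning now
sizes
```

This way the factor still protects you against typos, but accepts the values you have explicitly allowed.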

Factors are more limiting when working with numbers:

pclass <- factor(c(1, 2, 3, 2, 1))
pclass
## [1] 1 2 3 2 1
## Levels: 1 2 3
pclass + 1  # cannot do arithmetic!
## Warning in Ops.factor(pclass, 1): '+' not meaningful for factors
## [1] NA NA NA NA NA

19.2.2 Converting factors to other data types

Internally, factors are stored as two things: a) levels (unique labels); and b) integers, numbers that tell which label a particular element has. It is easy to convert the factors to character with as.character():

f <- factor(c(9, 8, 7, 8, 9))
f
## [1] 9 8 7 8 9
## Levels: 7 8 9
as.character(f)
## [1] "9" "8" "7" "8" "9"

However, it may be an unpleasant surprise to see that the analogous conversion to numbers does not work as expected:

as.numeric(f)
## [1] 3 2 1 2 3

This is because as.numeric() ignores the factor labels and just returns the underlying number–the index that tells which level each element has.

A solution is to first convert the factor to character with as.character(), and then to convert the resulting characters to numbers:

as.numeric(as.character(f))
## [1] 9 8 7 8 9

Alternatively, one can also convert just the levels to numbers, and then pick the number based on the underlying integer index:

as.numeric(levels(f))[f]
## [1] 9 8 7 8 9

This approach is probably more efficient, as it only needs to convert a limited number of character levels to numbers. However, it is harder to understand, and for medium-size datasets the gain is probably not worth it.
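
It may help to look at the two internal pieces separately. The following lines are a small sketch that takes the same factor f apart with base-R functions and checks that both conversion approaches agree:

```r
f <- factor(c(9, 8, 7, 8, 9))

levels(f)                  # "7" "8" "9": the labels, stored as character
as.integer(f)              # 3 2 1 2 3: positions in levels(f)
levels(f)[as.integer(f)]   # "9" "8" "7" "8" "9": same as as.character(f)

# Both ways of getting the numbers back give the same result
identical(as.numeric(as.character(f)),
          as.numeric(levels(f))[f])    # TRUE
```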

19.2.3 Ordered factors

Many kinds of categorical data do not have any inherent ordering, for instance college majors or cities. But certain other categories do. For instance, one may evaluate a coworker’s skills as “excellent”, “good”, “average” or “fair”. Here we know that “excellent” is better than “good”, and “fair” is the worst.

You can create ordered factors using factor(..., ordered = TRUE):

skills <- c("good", "fair", "excellent", "average")
factor(skills, ordered = TRUE)
## [1] good      fair      excellent average  
## Levels: average < excellent < fair < good

The printout is fairly similar to that of unordered factors, except that the categories (factor levels) now show the ordering, marked with the < sign between them.

However, this example shows a problem: the skill ordering is messed up with “fair” being better than “excellent” and so on. This is because we did not tell R what order the levels should take, and hence it relied on a simple alphabetical ordering. It is easy to fix by specifying the levels = ... argument, where the levels are supplied in the correct order:

factor(skills, ordered = TRUE,
       levels = c("fair", "average", "good", "excellent"))
## [1] good      fair      excellent average  
## Levels: fair < average < good < excellent

Now the computer knows that “average” is better than “fair”, and “excellent” is the best category.

As data structures, ordered and unordered factors are almost the same. However, many modeling and plotting methods treat them differently. For instance, by default ggplot uses contrasting colors for unordered factors and colors on a gradient scale for ordered factors.
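
One practical difference is that ordered factors support comparisons and sorting by level order. Here is a minimal sketch with the skills example; the name skill_f is made up to keep it apart from the character vector above:

```r
skill_f <- factor(c("good", "fair", "excellent", "average"),
                  ordered = TRUE,
                  levels = c("fair", "average", "good", "excellent"))

skill_f > "average"   # TRUE FALSE TRUE FALSE: compares by level order
max(skill_f)          # excellent: the highest level present
sort(skill_f)         # fair average good excellent
```

With an unordered factor, the comparison would give NA values and a warning instead.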

Exercise 19.1

Use the orange trees data. Make a barplot of average height across the trees for each age, and color the bars according to age.
  1. Use age as unordered factor
  2. … as ordered factor.

Stay with default colors. What’s the difference? Which one would make more sense in this context?

The solution

19.3 Data frames

Section 12 introduces the basics of data frames. Here we discuss a few details in depth, some of which are devilishly annoying.

19.3.1 Data frames and tibbles

Base-R includes a fundamental data structure, the data frame. It can be created with the function data.frame():

df <- data.frame(product = c("apples", "mangoes"),
                 price = c(2.50, 3.50))

It is printed in a straightforward manner:

df
##   product price
## 1  apples   2.5
## 2 mangoes   3.5

A data frame is of class “data.frame”:

class(df)
## [1] "data.frame"

It supports the basic operations, like single-bracket and double-bracket indexing, and dollar-notation (see Section 12.3).

However, tidyverse functions use a slightly different data frame, called a tibble. You can create it using the function tibble(), but more importantly, tidyverse functions may unexpectedly return a tibble. For instance, summarize() creates a new data frame, or more precisely, a tibble:

library(dplyr)  # provides %>%, group_by() and summarize()
tbl <- df %>%
   group_by(product) %>%
   summarize(price = mean(price))
tbl
## # A tibble: 2 × 2
##   product price
##   <chr>   <dbl>
## 1 apples    2.5
## 2 mangoes   3.5

As you can see, tbl is printed slightly differently: the printout marks data types below the column names. tbl is of class tibble:

class(tbl)
## [1] "tbl_df"     "tbl"        "data.frame"

So it uses slightly different printing, and also slightly different indexing functionality.

There are many more subtle differences:

  • If the rows of a data frame are too wide to fit on screen, the whole printout will be wrapped to multiple lines. Tibbles, however, will leave out some of the columns, and tell which ones underneath.
  • The output width for data frames can be adjusted with, e.g., options(width = 70); for tibbles you need to set options(pillar.width = 70).
  • Importantly, and extremely annoyingly, if you extract a single column from a data frame using bracket notation, you’ll get a vector. If you do the same with a tibble, you’ll get a data frame with a single column:
df[, "price"]
## [1] 2.5 3.5
tbl[, "price"]
## # A tibble: 2 × 1
##   price
##   <dbl>
## 1   2.5
## 2   3.5

Fortunately, you can use the drop = argument if you want to ensure that you are using the same indexing convention, whichever flavor of data frame is fed in:

tbl[, "price", drop = TRUE]
## [1] 2.5 3.5
  • Finally, while the base-R functionality, including that of data frames, is very stable, tidyverse functionality changes much more rapidly over time. In this sense, base-R is a better choice for long-term projects.
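
Another way to sidestep the single-column difference described in the list above is to use double brackets or the dollar sign, which return a plain vector for both flavors. Here is a minimal sketch; the tibble part only runs if the tibble package is installed.

```r
df <- data.frame(product = c("apples", "mangoes"),
                 price = c(2.50, 3.50))

df[["price"]]                  # plain vector
df$price                       # plain vector
df[, "price", drop = FALSE]    # one-column data frame, as a tibble would give

# The same double-bracket indexing on a tibble (skipped if tibble is missing)
if (requireNamespace("tibble", quietly = TRUE)) {
  tbl <- tibble::as_tibble(df)
  print(tbl[["price"]])               # also a plain vector
  print(is.data.frame(tbl))           # TRUE: a tibble is still a data frame
  print(class(as.data.frame(tbl)))    # converted back to a plain "data.frame"
}
```

If a function should behave the same way regardless of what it is given, converting the input with as.data.frame() first is one option.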

19.3.2 Data tables

Data tables are built on top of data frames and again share a lot of the functionality. While tibbles attempt to provide a more logical way to interact with data frames, data tables focus on memory footprint and speed. You need a dedicated package, data.table, to use data tables.

You can create data tables in a fairly similar way as data frames:

library(data.table)
dt <- data.table(product = c("apples", "mangoes"),
                 price = c(2.50, 3.50))

It is printed slightly differently though:

dt
##    product price
##     <char> <num>
## 1:  apples   2.5
## 2: mangoes   3.5

You can see that the row numbers are followed by a colon (data tables have no row names, so these are just row positions). You can also see the column type printed underneath its name, although the type names are somewhat different from those of tibbles.

Importantly, the indexing follows the tibble convention: if you extract a single column this way, you get a single-column data table, not a vector:

dt[, "price"]
##    price
##    <num>
## 1:   2.5
## 2:   3.5

But data tables offer quite complex and powerful functionality inside the [ ] brackets. For instance, this code computes the average price by product:

dt[, .(price = mean(price)), by = product]
##    product price
##     <char> <num>
## 1:  apples   2.5
## 2: mangoes   3.5
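
Here are a few more things you can do inside the brackets. This is only a small sketch, assuming the data.table package is installed, and it uses the same dt as above:

```r
library(data.table)
dt <- data.table(product = c("apples", "mangoes"),
                 price = c(2.50, 3.50))

dt[price > 3]            # keep the rows where price exceeds 3 (only mangoes)
dt[, price]              # unquoted column name: returns a plain vector
dt[["price"]]            # the same vector, as with data frames
dt[, .N, by = product]   # .N counts the rows in each group
```

Note that column names can be used without quotes inside the brackets, and that the unquoted dt[, price] gives a vector while the quoted dt[, "price"] gives a data table.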

Data tables are typically much faster and much more memory efficient than data frames and tibbles. However, for small datasets, the more intuitive approach of tidyverse often outweighs the gain in computational efficiency.

See Section F for more about data tables.

19.4 Formulas

TBD: what are formulas


  35. Samish folklore.

  36. See the section about categorical variables in Intro to Data Science for more.