Chapter 18 Data structures

In Section 4.4 above we met the four basic (atomic) data types in R: integer, double, logical and character (string). Here we discuss some of these in more detail, and include a few more data types, most importantly factors, R’s implementation of categorical variables.

18.1 More about strings

Section 4.4.2 introduced strings. Here we discuss more of the related technical details. The section may be a little hard for beginners; it is primarily meant as a reference for later chapters of the book.

18.1.1 Characters and special characters

Strings are typically made of ordinary characters such as letters, digits and spaces, enclosed in quotes. For instance

"abc 123"

## [1] "abc 123"

'蘭花'

## [1] "蘭花"

are both valid strings. The fact that the second one contains non-English characters does not make it invalid–these are still characters.

But strings can also contain special characters (escape sequences). These consist of a backslash \ followed by a letter. For instance, \n is the new line and \t is the tab symbol. So the code

cat("a\nb\n")

## a
## b

can be understood as follows:

The first character is “a” that will be printed literally
The second one is the special character “\n”, the new line, that is printed as … new line (line break).
The third character is “b”, printed literally…
followed by another new line, so whatever follows “b” will be printed on the new line (nothing here).

(I use cat() here as it will show line breaks as line breaks. If you just type "a\nb\n" at the console, then \n will be shown literally.) In a similar fashion, in

cat('蘭\t花\n')

## 蘭    花

the \t symbol is converted to tab–you see a long space between “蘭” and “花”.
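A small sketch may help to see that a special character, although typed as two symbols, is stored as a single character. The variable name s is just an illustration, not something used elsewhere in the book:

```r
s <- "a\nb\n"

# Each escape sequence counts as one character:
# "a", new line, "b", new line
nchar(s)        # 4

# print() shows the escape sequences literally ...
print(s)

# ... while cat() interprets them as line breaks
cat(s)

# The same holds for the tab: "x", tab, "y"
nchar("x\ty")   # 3
```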

18.1.2 How to include special characters in strings

The fact that “\” is a special character makes it a bit complicated to include an actual backslash \ in a string. The solution is to use a double backslash \\ instead. In the same way as backslash-n \n is the special character for new line, backslash-backslash \\ is the special character for backslash. So if you want to print “A\B”, you need to write the string as "A\\B":

cat("A\\B")

## A\B

This does not usually create problems–we rarely need to print backslash symbols. But there are a few major exceptions–one of these is Windows file paths.

Traditionally, Windows uses the backslash as the path separator. So instead of the Unix habit of writing paths like "/home/siim/info201", on Windows you typically write "C:\home\siim\info201". Many software packages (including R) support the Unix-style forward slash on Windows too. This may be the best option if you actually want to type the path. But when copy-pasting from a file manager, you end up with the backslashed version. Unfortunately, this does not work:

cat("C:\home\siim\info201")

## Error: '\h' is an unrecognized escape in character string

This is because R tries to interpret \h (in “\home”), \s and \i as special characters. The error tells us that there is no such special character as \h (\s and \i do not exist either, but we only see the first error).

There are three ways to fix this. First, you can manually change all backslashes to forward slashes, as in "C:/home/siim/info201". Second, you can change them to double backslashes, as in "C:\\home\\siim\\info201". Finally, you can use raw strings. These are otherwise like normal strings, but raw strings treat special characters as normal characters, so a backslash is just a backslash. In R (version 4.0.0 and later), you can create raw strings as r"(<string>)": r must precede the quotes, and the string content must be in parentheses. So we can write the Windows path as

cat(r"(C:\home\siim\info201)")

## C:\home\siim\info201

This is perhaps the easiest way to copy-paste file paths on Windows.
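The following sketch checks that the three approaches describe the same path. It assumes R 4.0.0 or later (needed for raw strings); the variable names are made up for this illustration, and chartr() is just one of several ways to swap the slashes:

```r
# Doubled backslashes and a raw string create the very same string
p_escaped <- "C:\\home\\siim\\info201"
p_raw     <- r"(C:\home\siim\info201)"
identical(p_escaped, p_raw)   # TRUE

# Each backslash is stored as a single character
nchar(p_raw)                  # 20

# Replace every backslash with a forward slash.
# Note that the backslash must be escaped here too.
p_forward <- chartr("\\", "/", p_raw)
p_forward                     # "C:/home/siim/info201"
```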

18.1.3 Line breaks and multiline strings

The special character \n, the line break, can also be inserted as an actual line break, by simply breaking the string definition across multiple lines. This is handy when printing poems:³³

cat("What are you carving?
I'm carving a skewer.
  Why are you carving?
To shoot a raven.")

## What are you carving?
## I'm carving a skewer.
##   Why are you carving?
## To shoot a raven.

Note that multiline strings preserve spacing: the third line starts with two extra spaces, and these are preserved in the printout.

This is perhaps the best approach if you want to store a long text with multiple line breaks in a character variable. Just compare with

cat("What are you carving?\nI'm carving a skewer.\nWhy are you carving?\nTo shoot a raven.")

## What are you carving?
## I'm carving a skewer.
## Why are you carving?
## To shoot a raven.

–the result is the same (except that this version leaves out the two leading spaces on the third line), but the code is much harder to read (and write).
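To confirm that a typed line break and \n are the same thing, here is a minimal sketch. The variable names are illustrative only, and strsplit() is used merely as one way to get the individual lines back:

```r
# A line break typed inside the quotes ...
two_lines <- "What are you carving?
I'm carving a skewer."

# ... is stored as the special character \n
one_line <- "What are you carving?\nI'm carving a skewer."
identical(two_lines, one_line)   # TRUE

# Split the text at line breaks to get a vector of lines
lines <- strsplit(two_lines, "\n")[[1]]
length(lines)                    # 2
lines
```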

18.2 Factor variables

Data often contains categorical variables–variables that can only take a small number of pre-determined values.³⁴ Factors are R’s implementation of categorical variables.

18.2.1 Factor basics

Factors are in many ways similar to character strings. In particular, character strings are often treated as categorical variables too, and factors can contain arbitrary text, exactly like strings. So why do we need a special data type for something that is rather similar to characters? There are a few reasons:

First, R may not always understand when data is meant to be categorical and when not. A prime example is when the categories are coded as numbers, and we have to state this explicitly for ggplot (see Section 14.5). In such a situation it is convenient to force numbers into a dedicated categorical data type.
Second, categorical values only allow for limited operations. For instance, no arithmetic is possible with categories, even if those are coded as numbers.
Finally, certain categorical data may be ordered categories. It is useful to have a dedicated data type that supports only the related operations. See Section 18.2.3 for more.

For example, imagine you work with shirt sizes which can only take on the values S, M, and L. You can store these as characters:

shirts <- c("S", "M", "S", "L", "M", "L")

This will usually work, but the computer will happily accept unknown sizes:

shirts[3] <- "U"
shirts

## [1] "S" "M" "U" "L" "M" "L"

Instead of a character vector, you can turn shirts into a factor:

shirts <- factor(c("S", "M", "S", "L", "M", "L"))
shirts

## [1] S M S L M L
## Levels: L M S

As you see, the printout is largely similar–but now R also prints levels, the known permitted categories. Also, replacing an element with an unknown value is not possible any more:

shirts[3] <- "U"

## Warning in `[<-.factor`(`*tmp*`, 3, value = "U"): invalid factor level, NA generated

shirts

## [1] S    M    <NA> L    M    L   
## Levels: L M S
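If you know in advance which categories are permitted, you can list them yourself with the levels argument of factor(). The sketch below is an illustration only: the vector name sizes and the extra size “XL” are not part of the example above.

```r
# Declare the permitted categories explicitly, including one
# ("XL") that does not occur in the data yet
sizes <- factor(c("S", "M", "S", "L", "M", "L"),
                levels = c("S", "M", "L", "XL"))

# Levels are kept in the order given, not alphabetically
levels(sizes)

# Assigning a declared level works without a warning
sizes[3] <- "XL"
sizes

# table() counts every level, including those with few or no cases
table(sizes)
```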

Factors are also more restrictive when the categories are coded as numbers:

pclass <- factor(c(1, 2, 3, 2, 1))
pclass

## [1] 1 2 3 2 1
## Levels: 1 2 3

pclass + 1  # cannot do arithmetic!

## Warning in Ops.factor(pclass, 1): '+' not meaningful for factors

## [1] NA NA NA NA NA

18.2.2 Converting factors to other data types

Internally, factors are stored as two things: a) levels (unique labels); and b) integers, numbers that tell which label a particular element has. It is easy to convert the factors to character with as.character():

f <- factor(c(9, 8, 7, 8, 9))
f

## [1] 9 8 7 8 9
## Levels: 7 8 9

as.character(f)

## [1] "9" "8" "7" "8" "9"

However, it may be an unpleasant surprise to see that it does not quite work with numbers:

as.numeric(f)

## [1] 3 2 1 2 3

This is because as.numeric() ignores the factor labels and just returns the underlying number–the index that tells which level each element has.

A solution is to first convert the factor to character with as.character(), and then convert the resulting characters to numbers:

as.numeric(as.character(f))

## [1] 9 8 7 8 9

Alternatively, one can also convert just the levels to numbers, and then pick the number based on the underlying integer index:

as.numeric(levels(f))[f]

## [1] 9 8 7 8 9

This approach is probably more efficient, as it only needs to convert a limited number of character levels to numbers. However, it is harder to understand, and for medium-size datasets the gain is probably not worth it.
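The pitfall is easier to spot when the numbers are far from 1, 2, 3. Here is a small sketch with a made-up factor g that shows the two internal parts and both conversion approaches side by side:

```r
g <- factor(c(10, 20, 10, 30))

# The two parts of a factor: the labels ...
levels(g)                     # "10" "20" "30"

# ... and the integer index of each element's level
as.integer(g)                 # 1 2 1 3

# Both conversion approaches recover the original numbers
as.numeric(as.character(g))   # 10 20 10 30
as.numeric(levels(g))[g]      # 10 20 10 30
```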

18.2.3 Ordered factors

Many kinds of categorical data do not have any inherent ordering, for instance college majors or cities. But certain other categories do. For instance, one may evaluate a coworker’s skills as “excellent”, “good”, “average” or “fair”. Here we know that “excellent” is better than “good”, and “fair” is the worst.

You can create ordered factors using factor(..., ordered = TRUE):

skills <- c("good", "fair", "excellent", "average")
factor(skills, ordered = TRUE)

## [1] good      fair      excellent average  
## Levels: average < excellent < fair < good

The printout is fairly similar to that of unordered factors, except that the categories (factor levels) now show the ordering, indicated by the < sign between them.

However, this example shows a problem: the skill ordering is messed up, with “fair” being better than “excellent” and so on. This is because we did not tell R what order the levels should take, and hence it fell back on simple alphabetical ordering. This is easy to fix by specifying the levels = ... argument, where the levels are supplied in the correct order:

factor(skills, ordered = TRUE,
       levels = c("fair", "average", "good", "excellent"))

## [1] good      fair      excellent average  
## Levels: fair < average < good < excellent

Now the computer knows that “average” is better than “fair”, and “excellent” is the best category.
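Once the order is known, ordered factors support comparisons and sorting, which unordered factors do not. A minimal sketch, where the variable name rating is made up for this illustration:

```r
skills <- c("good", "fair", "excellent", "average")
rating <- factor(skills, ordered = TRUE,
                 levels = c("fair", "average", "good", "excellent"))

# Which skills are "good" or better?
rating >= "good"             # TRUE FALSE TRUE FALSE

# The best category present in the data
as.character(max(rating))    # "excellent"

# Sorting follows the level order, not the alphabet
as.character(sort(rating))   # "fair" "average" "good" "excellent"
```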

As data structures, ordered and unordered factors are almost the same. However, many modeling and plotting methods treat them differently. For instance, by default ggplot uses contrasting colors for unordered factors and colors on a gradient scale for ordered factors.

Exercise 18.1

Use the orange trees data. Make a barplot of average height across the trees for each age, and color the bars according to age.

Use age as unordered factor
… as ordered factor.

Stay with default colors. What’s the difference? Which one would make more sense in this context?

The solution

18.3 Data frames

TBD: data frames, data tables, tibbles

18.4 Formulas

TBD: what are formulas

Samish folklore.↩︎
See the Intro to Data Science section about categorical variables for more.↩︎