Chapter 17 Data structures

In Section 2.5 above we met the four basic (atomic) data types in R: integer, double, logical and character (string). Here we discuss some of these in more detail, and include a few more data types, most importantly factors, R’s implementation of categorical variables.

17.1 More about strings

Strings are typically made of alphabetical characters, enclosed in quotes. For instance

"abc"
## [1] "abc"
'蘭花'
## [1] "蘭花"

are both valid strings. The fact that the second one contains non-English characters does not make it invalid–these are still characters.

Strings can also contain special characters (escape sequences). These are written as a backslash \ followed by a letter. For example, \n is the newline character and \t is the tab character. In

cat("a\nb\n")
## a
## b

the \n is not printed literally, but replaced by a line break instead.29 In a similar fashion, in

cat('蘭\t花\n')
## 蘭    花

the \t symbol is converted to tab–you see multiple whitespaces between “蘭” and “花”.

This makes it a bit complicated to include an actual backslash \ in a string. The solution is to use a double backslash \\ instead. Just as \n is the special character for a new line, \\ is the special character for a backslash. So if you want to print “A\B”, you need

cat("A\\B")
## A\B
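The escape rules above can be checked with a short sketch. The variable name s is only illustrative. It shows that an escape sequence counts as a single character, and that print() displays the escaped form while cat() displays the actual characters:

```r
# "A\\B\n" is typed with 6 symbols between the quotes but holds 4 characters:
# A, one backslash, B, and a newline
s <- "A\\B\n"
nchar(s)   # 4

# print() shows the string as you would type it, escapes included
print(s)   # [1] "A\\B\n"

# cat() writes the characters themselves: A\B followed by a line break
cat(s)
```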

This does not usually create many problems–we rarely need to print backslash symbols. But there are a few major exceptions–one of these is Windows file paths.

Traditionally, Windows uses the backslash as its path separator. So instead of the Unix habit of writing paths like “/home/siim/info201”, you typically write “C:\home\siim\info201”. Much software (including R) supports Unix-style forward slashes on Windows too, and this may be the best option if you actually type the path by hand. But when copy-pasting from a file manager, you end up with the backslashed version. Unfortunately, this does not work:

cat("C:\home\siim\info201")

## Error: '\h' is an unrecognized escape in character string

This is because R interprets \h (in “\home”), \s and \i as special characters. The error tells us that there is no such special character as \h (\s and \i do not exist either, but we only see the first error).

Obviously, you can manually change all backslashes to forward slashes or to double backslashes. Another solution is to use raw strings. These are otherwise like normal strings, but a raw string treats special characters as ordinary characters, so a backslash is just a backslash. In R (version 4.0.0 and later), you can create raw strings as r"(<string>)": r preceding the quotes, and the quoted content in parentheses. So we can write the Windows path as

cat(r"(C:\home\siim\info201)")
## C:\home\siim\info201

This is perhaps the easiest way to copy-paste file paths on Windows.
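As a small sketch of how the options relate (the variable names are illustrative, and raw strings assume R 4.0.0 or later): the raw string and the double-backslash string are the same value, and chartr() can turn a pasted path into the forward-slash form.

```r
# Two ways to write the same Windows path
raw_path     <- r"(C:\home\siim\info201)"
escaped_path <- "C:\\home\\siim\\info201"

# Both spellings produce exactly the same string
identical(raw_path, escaped_path)   # TRUE

# Replace every backslash with a forward slash
forward_path <- chartr("\\", "/", raw_path)
forward_path                        # "C:/home/siim/info201"
```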

TBD: line break in string definition

17.2 Factor variables

Factors are a way of optimizing variables that consist of a finite set of categories (i.e., they are categorical (nominal) variables).

For example, imagine that you had a vector of shirt sizes which could only take on the values small, medium, or large. If you were working with a large dataset (thousands of shirts!), it would take a lot of memory to store the character strings (5+ letters per word at 1 or more bytes per letter) for each one of those values.

A factor, on the other hand, stores an integer code for each of these strings: for example, 1 for small, 2 for medium, and 3 for large (by default R assigns the codes in the sorted order of the labels, so the actual numbers may differ). The distinct labels are called the levels of the factor, and R remembers which integer corresponds to which level. Since each integer takes only 4 bytes however long the label is, factors can let R keep much more information in memory.

# Start with a character vector of shirt sizes
shirt_sizes <- c("small", "medium", "small", "large", "medium", "large")

# Convert to a vector of factor data
shirt_sizes_factor <- as.factor(shirt_sizes)

# View the factor and its levels
print(shirt_sizes_factor)

# The length of the factor is still the length of the vector, not the number of levels
length(shirt_sizes_factor) # 6

When you print out the shirt_sizes_factor variable, R still (intelligently) prints out the labels that you are presumably interested in. It also indicates the levels, which are the only possible values that elements can take on.
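To see the integer codes and levels behind the printed labels, a minimal sketch like the following may help. The name sized is only illustrative. By default the levels are the sorted unique labels, and the levels argument of factor() lets you choose the order yourself:

```r
shirt_sizes <- c("small", "medium", "small", "large", "medium", "large")
shirt_sizes_factor <- as.factor(shirt_sizes)

# The levels are the distinct labels, sorted alphabetically by default
levels(shirt_sizes_factor)       # "large" "medium" "small"
nlevels(shirt_sizes_factor)      # 3

# The underlying integer codes: one per element, pointing into the levels
as.integer(shirt_sizes_factor)   # 3 2 3 1 2 1

# Count how many elements fall into each level
table(shirt_sizes_factor)

# Choose the level order explicitly instead of relying on sorting
sized <- factor(shirt_sizes, levels = c("small", "medium", "large"))
as.integer(sized)                # 1 2 1 3 2 3
```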

It is worth restating: factors do not behave like ordinary vectors. This means that most of the operations and functions you want to use on vectors will not work:

# Create a factor of numbers (factors need not be strings)
num_factors <- as.factor(c(10,10,20,20,30,30,40,40))

# Print the factor to see its levels
print(num_factors)

# Multiply the numbers by 2
num_factors * 2  # Warning: '*' not meaningful for factors
                 # returns vector of NA instead

# Changing entry to a level is fine
num_factors[1] <- 40

# Changing entry to a value that ISN'T a level fails
num_factors[1] <- 50  # Warning: invalid factor level, NA generated
                      # num_factors[1] is now NA
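If you do need to do arithmetic on a factor whose labels are numbers, one common approach is to convert the labels back to numbers first. The sketch below (the name values is only illustrative) also shows the usual pitfall: as.numeric() on the factor itself returns the integer codes, not the labels.

```r
num_factors <- as.factor(c(10, 10, 20, 20, 30, 30, 40, 40))

# Pitfall: this gives the internal codes, not the original numbers
as.numeric(num_factors)                           # 1 1 2 2 3 3 4 4

# Go through the labels: factor -> character -> numeric
values <- as.numeric(as.character(num_factors))
values                                            # 10 10 20 20 30 30 40 40

# Now ordinary arithmetic works
values * 2                                        # 20 20 40 40 60 60 80 80
```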

Before R 4.0.0, a string vector used as a data frame column (for example, when reading a file with read.csv()) was automatically converted to a factor unless you explicitly said otherwise. Since R 4.0.0 the default is the opposite: strings stay strings unless you set stringsAsFactors = TRUE. Stating the argument explicitly makes the code behave the same in either version:

# Vector of shirt sizes
shirt_size <- c("small", "medium", "small", "large", "medium", "large")

# Vector of costs (in dollars)
cost <- c(15.5, 17, 17, 14, 12, 23)

# Data frame of inventory (strings converted to factors)
shirts_factor <- data.frame(shirt_size, cost, stringsAsFactors = TRUE)

# The shirt_size column is a factor
is.factor(shirts_factor$shirt_size) # TRUE

# Can convert the column back to strings; but better to fix how the data is loaded
as.character(shirts_factor$shirt_size) # a character vector

# Data frame of orders (without factoring)
shirts <- data.frame(shirt_size, cost, stringsAsFactors = FALSE)

# The shirt_size column is NOT a factor
is.factor(shirts$shirt_size) # FALSE
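The same choice applies when loading data with read.csv(). Because the default handling of string columns has differed between R versions, passing stringsAsFactors explicitly makes the result independent of that default. The sketch below reads from an in-memory string through the text argument instead of a file. The data and the names csv_text, as_strings and as_factors are made up for illustration:

```r
# A tiny CSV "file" held in a string (illustrative data)
csv_text <- "shirt_size,cost
small,15.5
medium,17
large,14"

# Keep string columns as character vectors
as_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)
is.factor(as_strings$shirt_size)     # FALSE
is.character(as_strings$shirt_size)  # TRUE

# Convert string columns to factors while reading
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
is.factor(as_factors$shirt_size)     # TRUE

# Numeric columns are not affected by the argument
is.numeric(as_factors$cost)          # TRUE
```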

This is not to say that factors can’t be useful (beyond just saving memory)! They offer easy ways to group and process data using specialized functions:

shirt_size <- c("small", "medium", "small", "large", "medium", "large")
cost <- c(15.5, 17, 17, 14, 12, 23)

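# Data frame of inventory (with factors; stated explicitly, since the
# default for string columns depends on the R version)
shirts_factor <- data.frame(shirt_size, cost, stringsAsFactors = TRUE)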

# Produce a list of data frames, one for each factor level
# first argument is the data frame to split, second is the factor to split by
shirt_size_frames <- split(shirts_factor, shirts_factor$shirt_size)


# Apply a function (mean) to each factor level
#   first argument is the vector to apply the function to,
#   second argument is the factor to split by
#   third argument is the name of the function
tapply(shirts_factor$cost, shirts_factor$shirt_size, mean)
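To make the results of split() and tapply() concrete, here is a short sketch of how one might inspect them. It repeats the setup so that it runs on its own. The name mean_cost is only illustrative:

```r
shirt_size <- c("small", "medium", "small", "large", "medium", "large")
cost <- c(15.5, 17, 17, 14, 12, 23)
shirts_factor <- data.frame(shirt_size, cost, stringsAsFactors = TRUE)

# split() returns a named list with one data frame per level
shirt_size_frames <- split(shirts_factor, shirts_factor$shirt_size)
names(shirt_size_frames)        # "large" "medium" "small"
shirt_size_frames$small         # the two rows with small shirts

# tapply() returns one value per level, named by the level
mean_cost <- tapply(shirts_factor$cost, shirts_factor$shirt_size, mean)
mean_cost["small"]              # 16.25, the mean of 15.5 and 17
mean_cost["large"]              # 18.5, the mean of 14 and 23
```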

TBD: convert factor numbers to numbers

17.2.1 Ordered factors

TBD: ordered factors

17.3 Data frames

TBD: data frames, data tables, tibbles

17.4 Formulas

TBD: what are formulas


  29. cat() shows line breaks as actual line breaks. If you just type "a\nb\n" at the console, the \n is shown literally.