Chapter 17 Data structures
In Section 2.5 above we met the four basic (atomic) data types in R: integer, double, logical and character (string). Here we discuss some of these in more detail, and include a few more data types, most importantly factors, R’s implementation of categorical variables.
17.1 More about strings
Section 2.5.2 introduced strings. Here we discuss more of the related technical details. The section may be a little hard for beginners; it is primarily meant to be a reference for later chapters of the book.
17.1.1 Characters and special characters
Strings are typically made of normal alphabetical characters, enclosed in quotes. For instance
"abc 123"
## [1] "abc 123"
'蘭花'
## [1] "蘭花"
are both valid strings. The fact that the second one contains non-English characters does not make it invalid–these are still characters.
But strings can also contain special characters. These are sequences of a backslash \ and a letter. For instance, \n is the new line and \t is the tab symbol. So the code
cat("a\nb\n")
## a
## b
can be understood as follows:
- The first character is “a” that will be printed literally
- The second one is the special character “\n”, the new line, that is printed as … new line (line break).
- The third character is “b”, printed literally…
- followed by another new line, so whatever follows “b” will be printed on the new line (nothing here).
(I use cat() here as it shows line breaks as line breaks; if you just type "a\nb\n", then \n will be shown literally.)
In a similar fashion, in
cat('蘭\t花\n')
## 蘭 花
the \t symbol is converted to a tab: you see a long space between “蘭” and “花”.
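To make the point that an escape sequence is a single character, here is a minimal sketch. The variable names s and tabbed are only for illustration; nchar() counts the characters in a string.

```r
s <- "a\nb\n"
nchar(s)          # 4 characters: "a", new line, "b", new line
print(s)          # print() shows the escape sequences as typed: [1] "a\nb\n"
cat(s)            # cat() shows the actual line breaks

tabbed <- "x\ty"
nchar(tabbed)     # 3 characters: "x", tab, "y"
```

So although we type two symbols for \n or \t, R stores each of them as one character.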
17.1.2 How to include special characters in strings
The fact that “\” is a special character makes it a bit complicated to include an actual backslash \ in a string. The solution is to use a double backslash \\ instead. Just as backslash-n \n is the special character for new line, backslash-backslash \\ is the special character for backslash. So if you want to print “A\B”, you need to make the string "A\\B":
cat("A\\B")
## A\B
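A small sketch of what is actually stored (the variable name b is just an illustration): the double backslash is a single character in the string, even though print() displays it in its escaped form.

```r
b <- "A\\B"
nchar(b)      # 3 characters: "A", backslash, "B"
print(b)      # print() shows the escaped form: [1] "A\\B"
cat(b, "\n")  # cat() shows the string content: A\B
```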
This does not usually create problems–we rarely need to print backslash symbols. But there are a few major exceptions–one of these is Windows file paths.
Traditionally, Windows uses backslash as the path separator. So instead of the unix habit of writing paths like "/home/siim/info201", on Windows you typically write "C:\home\siim\info201". Many software packages (including R) support the unix-style forward slash on Windows too. This may be the best option if you actually want to type the path. But when copy-pasting from a file manager, you end up with the backslashed version. Unfortunately, this does not work:
cat("C:\home\siim\info201")
## Error: '\h' is an unrecognized escape in character string
This is because R will interpret \h (in “\home”), \s and \i as special characters. The error tells us that there is no such special character as \h (\s and \i do not exist either, but we only see the first error).
There are three ways to fix this:
First, you can manually change all backslashes to forward slashes, as in "C:/home/siim/info201". Second, you can change them to double backslashes, as in "C:\\home\\siim\\info201".
Finally, you can use raw strings (available since R 4.0.0). These are otherwise similar to normal strings, but raw strings treat special characters as normal characters. In R, you can create raw strings as r"(<string>)": r must precede the quotes, and the string content must be in parentheses. So we can write the Windows path as
cat(r"(C:\home\siim\info201)")
## C:\home\siim\info201
This is perhaps the easiest way to copy-paste file paths on Windows.
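The three fixes can be compared side by side. This is only a sketch: the variable names are made up for the example, and the gsub() call is one possible way to convert backslashes to forward slashes, not the only one. Raw strings need R 4.0.0 or newer.

```r
p_raw <- r"(C:\home\siim\info201)"     # raw string, backslashes typed as-is
p_esc <- "C:\\home\\siim\\info201"     # normal string, backslashes doubled
identical(p_raw, p_esc)                # TRUE: both store the same characters

# fixed = TRUE makes gsub() treat the pattern as plain text, not a regular
# expression, so "\\" here means exactly one backslash
p_fwd <- gsub("\\", "/", p_raw, fixed = TRUE)
p_fwd                                  # "C:/home/siim/info201"
```

The raw and the double-backslash versions are just two ways to type the same string; the forward-slash version is a different string that happens to refer to the same location.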
17.1.3 Line breaks and multiline strings
The special character \n, the line break, can also be inserted as … a line break, by just breaking the string definition into multiple lines. This is handy when printing poems:[30]
cat("What are you carving?
I'm carving a skewer.
  Why are you carving?
To shoot a raven.")
## What are you carving?
## I'm carving a skewer.
##   Why are you carving?
## To shoot a raven.
Note that multiline strings preserve spacing: the third line starts with two extra spaces, and these are preserved in the printout.
This is perhaps the best approach if you want to store a long text with multiple line breaks into a character variable. Just compare with
cat("What are you carving?\nI'm carving a skewer.\n  Why are you carving?\nTo shoot a raven.")
## What are you carving?
## I'm carving a skewer.
##   Why are you carving?
## To shoot a raven.
–the result is the same, but the code is much harder to read (and write).
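One can check that a line break typed inside the quotes is stored as \n. The sketch below uses a made-up three-line text and splits it back into lines with strsplit(); the names txt and txt_lines are just for illustration.

```r
txt <- "first line
  second line, indented
third line"

# the multiline definition is the same string as the one written with \n
identical(txt, "first line\n  second line, indented\nthird line")

# split the text into a character vector, one element per line
txt_lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
length(txt_lines)   # 3
txt_lines[2]        # the leading spaces are part of the string
```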
17.2 Factor Variables
Data often contains categorical variables, that is, variables that can only take a small number of pre-determined values.[31] Factors are how R implements categorical variables.
17.2.1 Factor basics
Factors are in many ways similar to character strings. In particular, character strings are often treated in a similar fashion (as categorical variables), and factors can contain arbitrary text, exactly like strings. So why do we need a special data type for something that is rather similar to characters? There are a few reasons:
- First, R may not always understand when data is meant to be categorical and when not. A prime example is when the categories are coded as numbers, and we have to explain it explicitly to ggplot (see Section 13.5). In such situations it is convenient to force numbers into a dedicated categorical data type.
- Second, categorical values only allow for limited operations. For instance, no arithmetic is possible with categories, even if those are coded as numbers.
- Finally, certain categorical data may be ordered categories. It is useful to have a dedicated data type that only supports the related operations. See Section 17.2.3 for more.
For example, imagine you work with shirt sizes which can only take on the values S, M, and L. You can store these as characters:
shirts <- c("S", "M", "S", "L", "M", "L")
This will usually work, but the computer can happily accept unknown types:
shirts[3] <- "U"
shirts
## [1] "S" "M" "U" "L" "M" "L"
Instead of a character vector, you can make shirts into a factor:
shirts <- factor(c("S", "M", "S", "L", "M", "L"))
shirts
## [1] S M S L M L
## Levels: L M S
As you see, the printout is largely similar–but now R also prints levels, the known permitted categories. Also, replacing an element with an unknown value is not possible any more:
shirts[3] <- "U"
## Warning in `[<-.factor`(`*tmp*`, 3, value = "U"): invalid
## factor level, NA generated
shirts
## [1] S M <NA> L M L
## Levels: L M S
Factors are more limiting when working with numbers:
pclass <- factor(c(1, 2, 3, 2, 1))
pclass
## [1] 1 2 3 2 1
## Levels: 1 2 3
pclass + 1  # cannot do arithmetic!
## Warning in Ops.factor(pclass, 1): '+' not meaningful for
## factors
## [1] NA NA NA NA NA
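The permitted categories can also be inspected and changed explicitly. The following is a sketch using base R's levels() and table(); the variable name sizes and the extra category "XL" are made up for the example.

```r
# specify the permitted categories (and their order) up front
sizes <- factor(c("S", "M", "S", "L", "M", "L"), levels = c("S", "M", "L"))
levels(sizes)    # "S" "M" "L"
table(sizes)     # counts for each level

# to allow a new category, first add it to the levels...
levels(sizes) <- c(levels(sizes), "XL")
# ...after that the assignment works without a warning
sizes[3] <- "XL"
sizes
```

Adding the level first is what tells R that "XL" is now a known category; without that step the assignment would produce NA as above.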
17.2.2 Converting factors to other data types
Internally, factors are stored as two things: a) levels (unique labels); and b) integers, numbers that tell which label a particular element has. It is easy to convert the factors to character with as.character():
f <- factor(c(9, 8, 7, 8, 9))
f
## [1] 9 8 7 8 9
## Levels: 7 8 9
as.character(f)
## [1] "9" "8" "7" "8" "9"
However, it may be an unpleasant surprise to see that it does not quite work with numbers:
as.numeric(f)
## [1] 3 2 1 2 3
This is because as.numeric() ignores the factor labels and just returns the underlying number, the index that tells which level each element has.
A solution is to first convert the factor to character with as.character(), and thereafter to convert the resulting characters to numbers:
as.numeric(as.character(f))
## [1] 9 8 7 8 9
Alternatively, one can also convert just the levels to numbers, and then pick the number based on the underlying integer index:
as.numeric(levels(f))[f]
## [1] 9 8 7 8 9
This approach is probably more efficient, as it only needs to convert the limited number of character levels to numbers. However, it is harder to understand, and for medium-size datasets the gain is probably not worth it.
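The two internal pieces of a factor can be looked at separately, which may help to see why the conversions above behave as they do. This sketch re-creates the same factor f and uses only base R functions.

```r
f <- factor(c(9, 8, 7, 8, 9))

levels(f)                   # the labels, stored as characters: "7" "8" "9"
as.integer(f)               # the integer codes: 3 2 1 2 3

# indexing the labels by the codes reproduces as.character(f)
levels(f)[as.integer(f)]    # "9" "8" "7" "8" "9"

# converting the labels first and then indexing gives the numbers back
as.numeric(levels(f))[f]    # 9 8 7 8 9
```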
17.2.3 Ordered factors
Many kinds of categorical data do not have any inherent ordering, for instance college majors or cities. But certain other categories do. For instance, one may evaluate a coworker’s skills as “excellent”, “good”, “average” or “fair”. Here we know that “excellent” is better than “good”, and “fair” is the worst.
You can create ordered factors using factor(..., ordered = TRUE):
skills <- c("good", "fair", "excellent", "average")
factor(skills, ordered = TRUE)
## [1] good fair excellent average
## Levels: average < excellent < fair < good
The printout is fairly similar to that of the unordered factors, except that the categories (factor levels) now show the ordering, stressed with the < sign between them.
However, this example shows a problem: the skill ordering is messed up, with “fair” being better than “excellent” and so on. This is because we did not tell R what order the levels should take, and hence it relied on a simple alphabetical ordering. It is easy to fix by specifying the levels = ... argument, where the levels are supplied in the correct order:
factor(skills, ordered = TRUE,
       levels = c("fair", "average", "good", "excellent"))
## [1] good fair excellent average
## Levels: fair < average < good < excellent
Now the computer knows that “average” is better than “fair”, and “excellent” is the best category.
As data structures, ordered and unordered factors are almost the same. However, many modeling and plotting methods treat them differently. For instance, by default ggplot uses contrasting colors for unordered factors and colors on a gradient scale for ordered factors.
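One practical difference is that ordered factors support comparisons, which are not meaningful for unordered factors. Here is a minimal sketch in base R; the variable name skills_ord is chosen for this example only.

```r
skills_ord <- factor(c("good", "fair", "excellent", "average"),
                     levels = c("fair", "average", "good", "excellent"),
                     ordered = TRUE)

skills_ord < "good"   # which skills are worse than "good"?
max(skills_ord)       # the best category present in the data
sort(skills_ord)      # sorted by level order, not alphabetically
```

Here the comparison gives FALSE TRUE FALSE TRUE, as only “fair” and “average” are below “good”.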
Exercise 17.1
Use the orange trees data. Make a barplot of average height across the trees for each age, and color the bars according to age.
- Use age as an unordered factor
- … as an ordered factor.
Stay with default colors. What’s the difference? Which one would make more sense in this context?
[30] Samish folklore.
[31] See the Intro to Data Science section on categorical variables for more about categorical variables.