Chapter 11 Data Frames
We have already learned several different types of variables, all of which can hold data. These include numeric and logical variables, vectors and lists. However, these types are too limited to work with many types of actual data. This chapter introduces data frame objects, one of the most popular data structures to store and manipulate data. Data frames are basically two-dimensional “tables” where you can operate both on rows and columns. They are in many ways similar to tables in Excel or Google docs. But instead of interacting with this data structure through a UI, we’ll learn how to do it through programming. This allows us to perform more complex and replicable analysis. This chapter covers various ways of creating, describing, and accessing information in data frames, as well as how they are related to other data types in R.
11.1 What is a Data Frame?
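You can think of data frames as tables where data is organized into rows and columns. For example, consider the following table of names, heights, and weights:
name | height | weight |
---|---|---|
Ada | 58 | 115 |
Bob | 59 | 117 |
Chris | 60 | 120 |
Diya | 61 | 123 |
Emma | 62 | 126 |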
The fundamental concepts in this data frame are rows and columns. Each row (called an observation, case or record) represents a similar object, here a person. All these objects have similar properties, here name, height and weight. Each column (called a variable, attribute or feature) represents a unique type of information collected for each observation (each person). So the name column only contains names while the weight column only contains weights.
This is the essence of a data frame: it is a table of similar objects, and for each object we know multiple distinct types of information. It turns out to be a very powerful way to represent data, and data frames are incorporated not just into R but also into many other analysis frameworks.
Note that the example data frame is rectangular: we have three measures for each patient (name, height and weight), and each measure is there for all five patients. This is a fundamental property of data frames: they are rectangular, each observation must have the same number of variables, and each variable must be there for each observation. (But individual values may be missing, see Section 11.6.)
In the example above, it is fairly easy to see that a row represents a person (a patient). But sometimes it is more complicated. For instance, a person can be measured multiple times. Imagine a similar health dataset for children: each child may have been measured multiple times at different ages, resulting in multiple rows for each child.
name | age | height |
---|---|---|
Sophie | 5 | 110 |
Sophie | 6 | 115 |
Helge | 4 | 100 |
Helge | 6 | 112 |
In this example, a row represents a child-age combination. You can think about columns name and age as identifiers, and height as data. Here we have a single type of data for each combination of identifiers.
Exercise 11.1 Consider the following dataset about orange trees (see Data Appendix):
Tree | age | circumference |
---|---|---|
1 | 118 | 30 |
1 | 484 | 58 |
1 | 664 | 87 |
2 | 118 | 33 |
2 | 484 | 69 |
2 | 664 | 111 |
Tree is the tree id, age is its age (days), and circumference is the circumference of the trunk (mm).
What does a row represent here?
See the solution
Exercise 11.2 Consider the dataset about COVID cases/deaths in Scandinavia in 2020 (see Data Appendix). Here is a small example of it:
country | date | type | count | lockdown |
---|---|---|---|---|
Denmark | 2020-05-01 | Confirmed | 9311 | 2020-03-11 |
Denmark | 2020-05-01 | Deaths | 460 | 2020-03-11 |
Denmark | 2020-05-02 | Confirmed | 9407 | 2020-03-11 |
Denmark | 2020-05-02 | Deaths | 475 | 2020-03-11 |
Finland | 2020-05-01 | Confirmed | 5051 | 2020-03-18 |
Finland | 2020-05-01 | Deaths | 218 | 2020-03-18 |
Finland | 2020-05-02 | Confirmed | 5176 | 2020-03-18 |
Finland | 2020-05-02 | Deaths | 220 | 2020-03-18 |
Norway | 2020-05-01 | Confirmed | 7783 | 2020-03-12 |
Norway | 2020-05-01 | Deaths | 210 | 2020-03-12 |
Norway | 2020-05-02 | Confirmed | 7809 | 2020-03-12 |
Norway | 2020-05-02 | Deaths | 211 | 2020-03-12 |
type denotes different types of covid measures (confirmed cases and deaths), count is the corresponding count (e.g. 9311 confirmed covid cases in Denmark on 2020-05-01), and lockdown is the date when major lockdown rules were put in place.
What does a row represent here?
See the solution
In R, data frames are made of lists (see Section 6) in which each element is a vector of the same length. Each vector represents a column, not a row, and each row of the data frame corresponds to the element at a certain position in each of the vectors in the list. You can think of a data frame as a printed list that has been rotated 90° clockwise. This is why it is important to understand lists when working with data frames.
Through their list origin, data frames can contain variables of different types. In the example above, name is character while height is numeric. These are different components of the list. But all names must be character, as they are elements of the same vector in the list.
You can work with data frames using the same tools as when working with lists. But data frames include additional properties that are usually more convenient.
11.2 Working with data frames
This section describes the basic functionality of data frames. The most important topic–accessing data in data frames–is described in Section 11.3.
11.2.1 Creating Data Frames
This section shows how to create data frames manually. This is a skill that is rarely used–normally you load data from external sources, such as csv files (see Section 11.5 below). But it comes in handy when you need to debug or test some functionality on data frames.
Data frames can be made with the function data.frame(). As arguments, you have to provide the variables: data vectors of equal length. Here is code that creates the same example patients’ data frame from above:
## create data vectors
name <- c("Ada", "Bob", "Chris", "Diya", "Emma")
height <- 58:62
weight <- c(115, 117, 120, 123, 126)
## combine data vectors into a data frame
patients <- data.frame(name, height, weight)
patients
## name height weight
## 1 Ada 58 115
## 2 Bob 59 117
## 3 Chris 60 120
## 4 Diya 61 123
## 5 Emma 62 126
You can see that a data frame is printed as a nice rectangular table, not as a ragged list. This is one of the additional properties of data frames.
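The rectangular requirement is enforced by data.frame() itself. The following is a minimal sketch, using two made-up vectors a and b (they are not part of the patients data), of what happens when the column lengths cannot be reconciled:

```r
# Two illustrative vectors whose lengths (3 and 2) do not match
a <- 1:3
b <- c("x", "y")

# data.frame() refuses to build a non-rectangular table;
# tryCatch() captures the error message instead of stopping the script
result <- tryCatch(
  data.frame(a, b),
  error = function(e) conditionMessage(e)
)
result
```

Note that a shorter vector is recycled without complaint when the longer length is an exact multiple of the shorter one, so it is worth checking vector lengths before combining them.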
Because data frames are lists, you can access the values of patients using the same dollar notation or double-bracket notation as in the case of lists:
patients$weight  # retrieve weights
## [1] 115 117 120 123 126
patients[["height"]]  # retrieve heights
## [1] 58 59 60 61 62
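The list nature of a data frame can also be checked directly. This is a small sketch that rebuilds the patients data so that it runs on its own; the variable column_types is only an illustrative name:

```r
# Rebuild the example data (same values as in the text)
patients <- data.frame(
  name = c("Ada", "Bob", "Chris", "Diya", "Emma"),
  height = 58:62,
  weight = c(115, 117, 120, 123, 126)
)

is.list(patients)        # a data frame is a list ...
is.data.frame(patients)  # ... that also behaves as a data frame
length(patients)         # number of list components, i.e. columns

# Each component (column) is a vector with its own type
column_types <- sapply(patients, class)
column_types
```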
Exercise 11.3 Create a data frame that contains three variables: country name, its capital’s name, and its population (at least five countries). Choose suitable names for your three variables.
- Now extract country name using the dollar-notation.
- Extract country population using double-bracket notation
- Extract capital using indirect variable name (See Section 6.3.2).
See the solution
11.2.2 Describing Data Frames
One of the first steps you do when you encounter a new data frame is to get a basic idea what it contains. Here we describe a number of functions that provide such summary data. A brief summary of the functions is below (assume df is a data frame):
Function | Description |
---|---|
nrow(df) | Number of rows |
ncol(df) | Number of columns |
dim(df) | Dimensions (rows, columns) |
names(df) | Column names |
colnames(df) | Column names |
rownames(df) | Row names |
head(df, n) | The first n rows (as a new data frame) |
tail(df, n) | The last n rows (as a new data frame) |
For instance, if we want to know how many rows there are in the patients data frame, we can find it with
nrow(patients)
## [1] 5
The function ncol() behaves in a similar manner. However, dim() returns both the number of rows and columns: it returns a vector with the first element being the former and the second element the latter:
dim(patients)
## [1] 5 3
names() and colnames() are synonyms and return the variable names of the data frame:
names(patients)
## [1] "name" "height" "weight"
You can also give names to rows of data frames. For instance, instead of having a separate column for names, one can put the name as a row name. But currently row names are just row numbers:
rownames(patients)
## [1] "1" "2" "3" "4" "5"
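As a sketch of the idea above, the names can be moved from a column into the row names, after which rows can be looked up by name. The data is rebuilt here so the example is self-contained; ada_height is an illustrative variable name:

```r
patients <- data.frame(
  name = c("Ada", "Bob", "Chris", "Diya", "Emma"),
  height = 58:62,
  weight = c(115, 117, 120, 123, 126)
)

# Use the names as row names, then drop the now redundant column
rownames(patients) <- patients$name
patients$name <- NULL

rownames(patients)
# Rows can now be addressed by name
ada_height <- patients["Ada", "height"]
ada_height
```

Row names must be unique, so this only works when no name occurs twice.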
Finally, head() and tail() show the first few and the last few lines of the data frame (by default six lines). To show the last two lines you can do
tail(patients, 2)
## name height weight
## 4 Diya 61 123
## 5 Emma 62 126
There is also an RStudio-specific option, View(), that opens the data frame in an RStudio window with a spreadsheet-like interface. However, you cannot use View() in many contexts. For example, if you compile your results to an html or pdf file, then the result must be viewable with a browser or a pdf-viewer, and hence the RStudio-specific View() will give an error.
Some of these description functions can also be used to modify the data frame. For example, you can use names() to assign new names to the variables in the data:
names(patients) <- c("Name", "Inches", "Pounds")
patients
## Name Inches Pounds
## 1 Ada 58 115
## 2 Bob 59 117
## 3 Chris 60 120
## 4 Diya 61 123
## 5 Emma 62 126
Note how we assigned new values to the column names, and as a result, the data frame has new variable names.
11.3 Accessing Data in Data Frames
But we cannot use data frames for much unless we are able to manipulate and access these data. First, we discuss perhaps the most important tasks: selecting desired variables and filtering based on certain conditions. Afterwards, we show even more indexing methods.
There is an alternative way of accessing data in data frames: the way of pipes and dplyr. That approach, which is much more intuitive, is discussed in Section 12.
We use a small data frame of emperors:
name <- c("Qin Shi Huang", "Napoleon Bonaparte", "Nicholas II",
          "Mehmed VI", "Naruhito")
born <- c(-259, 1769, 1868, 1861, 1960)  # negative: BC
throned <- c(-221, 1804, 1894, 1918, 2019)
ruled <- c("China", "France", "Russia", "Ottoman Empire", "Japan")
died <- c(-210, 1821, 1918, 1926, NA)  # Naruhito is alive
emperors <- data.frame(name, born, throned, ruled, died)
emperors
## name born throned ruled died
## 1 Qin Shi Huang -259 -221 China -210
## 2 Napoleon Bonaparte 1769 1804 France 1821
## 3 Nicholas II 1868 1894 Russia 1918
## 4 Mehmed VI 1861 1918 Ottoman Empire 1926
## 5 Naruhito 1960 2019 Japan NA
Note that as Naruhito is still alive, we do not know his year of death. We use a special value NA (not available) in its place. See more in Section 11.6.
11.3.1 Selecting variables
Typically, when working with data, one of the most important tasks is to extract certain variables. Data frames make it (relatively) easy in two different ways.
Dollar notation is easier to write: it is just dataframe$variable. For instance, if we want to pull out the emperors’ names, we can write
emperors$name
## [1] "Qin Shi Huang" "Napoleon Bonaparte" "Nicholas II"
## [4] "Mehmed VI" "Naruhito"
Remember–data frames are made of lists, and it is the same dollar notation we used for extracting list components in Section 6.3.4.
Double-bracket notation is also similar to that of lists (Section 6.3.2): you put the name (as a string) in double brackets like dataframe[["variable"]]. So we can get exactly the same vector of emperors’ names as
emperors[["name"]]
## [1] "Qin Shi Huang" "Napoleon Bonaparte" "Nicholas II"
## [4] "Mehmed VI" "Naruhito"
The dollar notation is usually easier to write, and hence the double-bracket notation is mainly used for indirect variable names (see Section 6.3.2). For instance:
var <- "name"
emperors[[var]]
## [1] "Qin Shi Huang" "Napoleon Bonaparte" "Nicholas II"
## [4] "Mehmed VI" "Naruhito"
Exercise 11.4 What happens if you try to use indirect variable names with dollar-notation?
See the solution
This was about extracting individual variables. But real datasets may contain a large number of columns, most of which we do not need for the particular analysis. So another task we often do when we start to work with a new dataset is to limit the number of columns to a smaller and more manageable set. It is often easier to work with a smaller “sub-dataframe” than with the huge original data frame: with fewer numbers printed on screen it is easier to find what you need, and if the datasets are large, we may also gain in terms of computing performance and memory usage.
The most obvious approach here is just to list the variable names we want to preserve. For instance, if we are only interested in name and year of birth, we can write
emperors[c("name", "born")]
## name born
## 1 Qin Shi Huang -259
## 2 Napoleon Bonaparte 1769
## 3 Nicholas II 1868
## 4 Mehmed VI 1861
## 5 Naruhito 1960
Technically, this is almost like list indexing by name (see Section 6.3.2)–the list indexing returns a sublist that only contains the named components. But as emperors is a data frame, it returns a sub-dataframe, not a sublist.
This approach is a good one if we only want to preserve a few variables and “forget” the others. But other times we only want to remove a few variables and keep everything else. This can be achieved by setting those variables to NULL, a special symbol for an empty element. This will remove the component, exactly as in the case of lists (see Section 6.4). We can remove the throned variable as
emperors$throned <- NULL
emperors
## name born ruled died
## 1 Qin Shi Huang -259 China -210
## 2 Napoleon Bonaparte 1769 France 1821
## 3 Nicholas II 1868 Russia 1918
## 4 Mehmed VI 1861 Ottoman Empire 1926
## 5 Naruhito 1960 Japan NA
Note that if the variable does not exist, setting it to NULL is silently ignored:
emperors$marriage <- NULL
emperors
## name born ruled died
## 1 Qin Shi Huang -259 China -210
## 2 Napoleon Bonaparte 1769 France 1821
## 3 Nicholas II 1868 Russia 1918
## 4 Mehmed VI 1861 Ottoman Empire 1926
## 5 Naruhito 1960 Japan NA
There are no warnings, and the data frame is unchanged.
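When several variables should be removed at once, one possible approach is to compute the names to keep and then select those by name. This is only a sketch: it rebuilds the full emperors data so that it runs on its own, and the names keep and smaller are illustrative.

```r
emperors <- data.frame(
  name = c("Qin Shi Huang", "Napoleon Bonaparte", "Nicholas II",
           "Mehmed VI", "Naruhito"),
  born = c(-259, 1769, 1868, 1861, 1960),
  throned = c(-221, 1804, 1894, 1918, 2019),
  ruled = c("China", "France", "Russia", "Ottoman Empire", "Japan"),
  died = c(-210, 1821, 1918, 1926, NA)
)

# All column names except the ones we want to drop
keep <- setdiff(names(emperors), c("throned", "died"))
# Select the remaining columns by name; the original is left untouched
smaller <- emperors[keep]
names(smaller)
```

Unlike assigning NULL, this does not modify emperors itself, which can be handy if the full data is still needed later.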
We’ll learn additional, easier and more powerful methods to select variables in Section 12.
Exercise 11.5 Sometimes you need to do similar tasks with all variables in the data frame. A good way to do it is a for-loop.
- Print the column names in your Seahawks data frame.
- Write a loop over all columns in your data frame. In the loop, print the variable name (use cat() for printing).
- Write a loop over all columns in your data frame. In the loop, print the variable name (use cat()), and the variable itself (use print()).
- Write a loop over all columns in your data frame. In the loop, print the variable name (use cat()), and TRUE/FALSE, depending on whether the variable is numeric.
Hint: use is.numeric(df$col) to test if the column is numeric.
- Write a loop over all columns in your data frame. Inside the loop, print the variable name (use cat()), and its minimum value if the variable is numeric!
Hint: min(df$col) finds the minimum.
See the solution
11.3.2 Filtering rows of data frames
As discussed above, since data frames are lists, it’s possible to use both dollar notation (data$variable) and double-bracket notation (data[["variable"]]) to access the data variables. If used in this way, the results are vectors and hence individual elements can be accessed as elements in any other vector. For instance, we can extract names of all emperors who were born before 1800 as
emperors$name[emperors$born < 1800]
## [1] "Qin Shi Huang" "Napoleon Bonaparte"
Here the first dollar-notation expression, emperors$name, is just a vector of names. The second dollar-notation expression, emperors$born < 1800 (inside the brackets), is a logical vector where TRUE corresponds to those who were born before 1800. Needless to say, you can also use double-bracket notation here instead of one or both of these dollar-notations if you want to use indirect variable names.
Note that the expression looks somewhat heavy and bloated: the need to write emperors$ twice seems unnecessary, and it also makes the code harder to read. We’ll learn a more intuitive filtering method in Section 12.3.2.
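Within base R, one way to reduce the repetition is with(), which evaluates an expression inside the data frame so that columns can be referred to by name only. The sketch below rebuilds a reduced emperors data frame (without the throned column) so that it runs on its own; born_early and later_rulers are illustrative names. It also shows how two conditions can be combined with &.

```r
emperors <- data.frame(
  name = c("Qin Shi Huang", "Napoleon Bonaparte", "Nicholas II",
           "Mehmed VI", "Naruhito"),
  born = c(-259, 1769, 1868, 1861, 1960),
  ruled = c("China", "France", "Russia", "Ottoman Empire", "Japan"),
  died = c(-210, 1821, 1918, 1926, NA)
)

# Same filter as in the text, but the data frame is named only once
born_early <- with(emperors, name[born < 1800])
born_early

# Two conditions combined with `&`: born before 1900 and not ruling China
later_rulers <- emperors$name[emperors$born < 1900 & emperors$ruled != "China"]
later_rulers
```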
Exercise 11.6 Extract names of all emperors who died before year 1800
- Do it using only dollar notation
- Use double-bracket notation in the first place and dollar notation in the second place.
- Use solely double-bracket notation.
- Explain what the NA you see there is.
See the solution
11.3.3 Using single-bracket notation to extract both rows and columns
11.3.3.1 Basics of the single-bracket-notation
Perhaps the most powerful way to extract information from data frames is a variation of single-bracket notation. This allows you to specify both rows and columns when extracting data. Here you need to put two index values in the brackets and separate them with a comma (,):
df[row index, column index]
The first index specifies rows and the second specifies columns. The indices should be similar to how you index elements in vectors and lists: they can be numbers, logical values, or names, and you can mix these three types. Below are a few examples using the emperors’ data from above. As a reminder, the emperors data is
emperors
## name born ruled died
## 1 Qin Shi Huang -259 China -210
## 2 Napoleon Bonaparte 1769 France 1821
## 3 Nicholas II 1868 Russia 1918
## 4 Mehmed VI 1861 Ottoman Empire 1926
## 5 Naruhito 1960 Japan NA
Extract a single element of 2nd row, 3rd column:
emperors[2, 3]  # vector
## [1] "France"
Extract 2nd and 4th row, 3rd column:
emperors[c(2,4), 3]  # vector
## [1] "France" "Ottoman Empire"
Extract 2nd and 4th row, variable “died”:
emperors[c(2,4), "died"]  # vector
## [1] 1821 1926
Extract 2nd and 4th row, variables “name” and “ruled”:
emperors[c(2,4), c("name", "ruled")]  # data frame
## name ruled
## 2 Napoleon Bonaparte France
## 4 Mehmed VI Ottoman Empire
Usually, the result of such index operations is a vector if only a single column is returned, and a data frame if multiple columns are requested. For instance, the death years of two emperors form a vector, but if we ask for both the name and the country they ruled, we get a data frame.
We can also ask for all rows or all columns, by just leaving out the corresponding index:
emperors[, 3]  # all rows, 3rd column
## [1] "China" "France" "Russia" "Ottoman Empire"
## [5] "Japan"
emperors[emperors$ruled == "China",]  # Chinese emperors, all columns
## name born ruled died
## 1 Qin Shi Huang -259 China -210
11.3.3.2 Certain confusing results
Handling of missing data is somewhat counter-intuitive. For instance, if we extract all emperors who died before year 1:
emperors[emperors$died < 1,]
## name born ruled died
## 1 Qin Shi Huang -259 China -210
## NA <NA> NA <NA> NA
We see Qin Shi Huang, which is correct. But we also see a line of NA-s, which seems weird. This is because there is an emperor, Naruhito, whose year of death we do not know, and hence the logical index vector is
emperors$died < 1
## [1] TRUE FALSE FALSE FALSE NA
The last element of the vector is NA, and this causes the line of missings in the outcome. We just do not know if we have another emperor in the line.
An easy solution is to use the which() function that converts the logical vector into a numeric one, marking which elements are true, and ignoring missings:
which(emperors$died < 1) # no NA-s here
## [1] 1
or when extracting emperors:
emperors[which(emperors$died < 1),]  # no NA-s here
## name born ruled died
## 1 Qin Shi Huang -259 China -210
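An alternative to which(), sketched below, is to rule out the missing values explicitly with is.na(), so that the logical index contains only TRUE and FALSE. The reduced emperors data is rebuilt here so the example runs on its own; early and early_emperors are illustrative names.

```r
emperors <- data.frame(
  name = c("Qin Shi Huang", "Napoleon Bonaparte", "Nicholas II",
           "Mehmed VI", "Naruhito"),
  born = c(-259, 1769, 1868, 1861, 1960),
  ruled = c("China", "France", "Russia", "Ottoman Empire", "Japan"),
  died = c(-210, 1821, 1918, 1926, NA)
)

# FALSE & NA evaluates to FALSE, so the missing value no longer
# leaks into the index
early <- !is.na(emperors$died) & emperors$died < 1
early

early_emperors <- emperors[early, ]
early_emperors
```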
Exercise 11.7 Use the emperors’ dataset.
- extract 3rd and 4th row.
- extract all emperors who died in 20th century (all information about them)
- extract name and country for all emperors who died in 20th century
See the solution
Another frequent source of confusion is the difference between extracting a column as a vector and extracting a column as a single-column data frame. A column as a vector can be extracted as
emperors[, "ruled"]  # note: comma
## [1] "China" "France" "Russia" "Ottoman Empire"
## [5] "Japan"
while a data frame is
emperors["ruled"]  # note: no comma
## ruled
## 1 China
## 2 France
## 3 Russia
## 4 Ottoman Empire
## 5 Japan
Note the difference: the first result is printed as a vector and the latter as a data frame.
Why does a comma cause such a difference? This is because a comma between the brackets tells R to use the data frame–specific single-bracket notation. If there is no comma, we are doing list indexing, and extracting a sublist of a single component. And because the list is a data frame here, we get a sub–data frame with a single column only.
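If you want to keep the comma syntax but still get a data frame back, the single-bracket method for data frames accepts a drop argument. A minimal sketch, using a two-column version of the emperors data and the illustrative names as_vector and as_frame:

```r
emperors <- data.frame(
  name = c("Qin Shi Huang", "Napoleon Bonaparte", "Nicholas II",
           "Mehmed VI", "Naruhito"),
  ruled = c("China", "France", "Russia", "Ottoman Empire", "Japan")
)

# With a comma, a single column is simplified ("dropped") to a vector
as_vector <- emperors[, "ruled"]
# drop = FALSE keeps the result as a one-column data frame
as_frame <- emperors[, "ruled", drop = FALSE]

class(as_vector)
class(as_frame)
```

This can be useful in code where the number of selected columns is not known in advance and the result should always be a data frame.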
11.3.3.3 Summary
Here is a brief summary of the main tools:
Syntax | Description | Example |
---|---|---|
df[row_num, col_num] | Element by row and column indices | patients[2,3] (element in the second row, third column) |
df[row_name, col_name] | Element by row and column names | df['Ada','height'] (element in row named Ada and column named height; the height of Ada. The patients data does not have row names.) |
df[row, col] | Element by row and col; can mix indices and names | patients[2,'height'] (second element in the height column) |
df[row, ] | All elements (columns) in row index or name | df[2,] (all columns in the second row) |
df[, col] | All elements (rows) in a col index or name, as a vector | df[,'height'] (complete height column as a vector) |
df[col] | All elements (rows) in a col index or name, as a one-column data frame | df['height'] (data frame containing only height column) |
Take special note of the 4th option’s syntax (for retrieving rows): you still include the comma (,), but because you leave the column index blank, you get all of the columns!
# Extract the second row
df[2, ]  # comma
# Extract the second column AS A VECTOR
df[, 2]  # comma
# Extract the second column AS A DATA FRAME
df[2]  # no comma
Extracting more than one column will produce a sub-data frame; extracting just one column with the comma syntax will produce a vector.
11.3.4 Modifying data frames
The previous tools can also be used to modify existing data frames.
For instance, we can add a new variable, “age”, to the “patients” data frame using dollar-notation as
patients$age <- c(22, 33, 44, 55, 66)
patients
## Name Inches Pounds age
## 1 Ada 58 115 22
## 2 Bob 59 117 33
## 3 Chris 60 120 44
## 4 Diya 61 123 55
## 5 Emma 62 126 66
Instead of adding new variables, we can also overwrite the existing ones in the same way. Here is an example of how to use double-bracket notation for replacing “Pounds”:
patients[["Pounds"]] <- c(120, 120, 130, 130, 140)
patients
## Name Inches Pounds age
## 1 Ada 58 120 22
## 2 Bob 59 120 33
## 3 Chris 60 130 44
## 4 Diya 61 130 55
## 5 Emma 62 140 66
It is also possible to replace only parts of the data frame. For instance, let’s add 10 lb of weight to everyone who is over 40:
patients$Pounds[patients$age > 40] <-
   patients$Pounds[patients$age > 40] + 10
patients
## Name Inches Pounds age
## 1 Ada 58 120 22
## 2 Bob 59 120 33
## 3 Chris 60 140 44
## 4 Diya 61 140 55
## 5 Emma 62 150 66
This syntax is somewhat awkward, so let’s explain what it does:
- on both sides of the assignment, we ensure we only work with those who are over 40 (patients$age > 40).
- on both sides we work only with “Pounds” (patients$Pounds).
- on the right-hand side we add “10” to the Pounds of everyone who is over 40 (there are 3 such patients).
- and finally, on the left-hand side, we assign such new weights to the “Pounds” in the data frame.
This is conceptually similar to vector operations (see Section 4.5). In fact, these are vector operations as patients$Pounds is a vector!
Finally, the same task can be achieved with ifelse() (see Section 8.3.3):
patients$Pounds <- ifelse(patients$age > 40,
                          # who is over 40?
                          patients$Pounds + 10,
                          # add 10lb to those
                          patients$Pounds)
                          # otherwise keep the weight
patients
## Name Inches Pounds age
## 1 Ada 58 120 22
## 2 Bob 59 120 33
## 3 Chris 60 150 44
## 4 Diya 61 150 55
## 5 Emma 62 160 66
This is perhaps clearer than the indexed assignment above. It will become even easier to read when using dplyr tools (see Section 12).
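To see that the two approaches give the same result, one can apply each of them to the same starting values and compare. This is a sketch only: it rebuilds a small version of the patients data, and by_index and by_ifelse are illustrative names.

```r
patients <- data.frame(
  Name = c("Ada", "Bob", "Chris", "Diya", "Emma"),
  Pounds = c(120, 120, 130, 130, 140),
  age = c(22, 33, 44, 55, 66)
)

# Approach 1: indexed assignment on a copy of the column
by_index <- patients$Pounds
by_index[patients$age > 40] <- by_index[patients$age > 40] + 10

# Approach 2: ifelse() on the whole column
by_ifelse <- ifelse(patients$age > 40,
                    patients$Pounds + 10,
                    patients$Pounds)

# Both start from the same weights, so they should agree
isTRUE(all.equal(by_index, by_ifelse))
```

Working on copies instead of patients$Pounds itself avoids applying the 10 lb increase twice by accident.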
Exercise 11.8 A year has passed and everyone has aged by a year. Add one year to everyone’s age!
Here is some helper code to create data:
Name <- c("Ada", "Bob", "Chris", "Diya", "Emma")
Inches <- c(58, 59, 60, 61, 62)
Pounds <- c(120, 120, 150, 150, 160)
age <- c(22, 33, 44, 55, 66)
See the solution
11.4 R built-in datasets
Above, we created simple data frames manually. R also has a number of built-in data frames, and a number of packages that contain even more data frames. These are designed for testing and demonstration purposes; for actual analysis you usually need to load datasets from disk (see Section 11.5.3).
What is a built-in dataset? All programming languages, including R, contain certain built-in values. Examples of such values are the constant pi (3.1416), the logical constants TRUE and FALSE, and other similar values.
In a similar fashion, R also has built-in datasets. Depending on how they are set up, it may be possible to access those directly, or using the data() function.
For instance, iris is a built-in dataset about the size of iris flowers (see Section I.11). You can just use it through its name, iris. Let’s take a quick look:
head(iris, 3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
It contains five columns: four flower size measures and the species. Importantly, from the usage perspective, the built-in dataset is just a data frame, exactly the same kind of data frame as what you created manually above.
Not all built-in data are data frames. For instance, state.abb is a character vector of 2-letter abbreviations of the U.S. states (see Section I.13):
head(state.abb, 10)
## [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA"
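A quick way to find out what kind of object a built-in dataset is, sketched here for the two examples above, is to ask for its class and size before using it:

```r
# iris is a data frame, so the data frame tools from this chapter apply
is.data.frame(iris)
dim(iris)

# state.abb is a plain character vector, not a data frame
is.data.frame(state.abb)
class(state.abb)
length(state.abb)
```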
If datasets are provided by other packages, these can be loaded with the data() command. For instance, the ice cream data from the Ecdat package (see Section I.9) can be loaded as
data(Icecream, package = "Ecdat")
head(Icecream, 4)
## cons income price temp
## 1 0.386 78 0.270 41
## 2 0.374 79 0.282 56
## 3 0.393 81 0.277 63
## 4 0.425 80 0.280 68
An advantage of this approach is that you tell exactly from which package the dataset should be loaded. This is important if multiple packages provide datasets of the same name.
You can learn more about both built-in and package-provided datasets with the command
??datasets
Exercise 11.9 What is the data structure of the built-in dataset co2? What do the values represent?
See the answer
11.5 Working with CSV Data
So far you’ve been constructing your own data frames. It is a good skill and a good way to learn to handle data frames, but in practice, such manual coding is rarely needed besides debugging. It’s much more common to load data from somewhere, for instance from a file on your computer or from the internet.
11.5.1 CSV file format
While R is able to ingest data from a variety of sources, this chapter will focus on reading tabular data in comma separated value (CSV) format, often stored in a file with an extension .csv.
In this format, each line represents a record (row) of data, while each feature (column) of that record is separated by a comma. In csv format, the health data from above might look like:
Ada,58,115
Bob,59,117
Chris,60,120
Diya,61,123
Emma,62,126
There are a variety of csv file format flavors. Another popular option is to separate the columns not by a comma but by a tab symbol. The health data would now look like
Ada 58 115
Bob 59 117
Chris 60 120
Diya 61 123
Emma 62 126
The advantage of tab-separated files is that they are somewhat easier for humans to read. There are many other options, e.g. in languages where the comma is the standard decimal separator, it is common to use a semicolon to separate columns instead. But despite the different separators, all those formats are typically (confusingly) referred to as “comma-separated files”, and the character that separates the columns is often called the separator or delimiter. Different separators across different datasets are a frequent source of confusion for beginners. It is critical to get the separator right, but fortunately the separator can usually be detected automatically by the computer. Also, a wrong separator usually results in very distinct problems that are easy to diagnose.
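The effect of the separator can be tried out without any external data. The sketch below writes two tiny temporary files and reads them back with the base-R functions read.csv() and read.delim(). The header line and the file and variable names are assumptions made for this illustration; the health data shown above has no header.

```r
# Write the same small table once comma-separated, once tab-separated
csv_file <- tempfile(fileext = ".csv")
writeLines(c("name,height,weight", "Ada,58,115", "Bob,59,117"), csv_file)
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("name\theight\tweight", "Ada\t58\t115", "Bob\t59\t117"), tsv_file)

# Matching reader for each separator: both give the same data frame
from_csv <- read.csv(csv_file)
from_tsv <- read.delim(tsv_file)
isTRUE(all.equal(from_csv, from_tsv))

# Wrong separator: the tab-separated file read as comma-separated
# ends up as a single column that contains whole lines
wrong <- read.csv(tsv_file)
ncol(wrong)
```

A data frame with a single, oddly named column is a typical symptom of a wrong separator.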
Spreadsheet programs like Microsoft Excel or Google Sheets can also load, export and manipulate data that is saved in this format. However, one cannot save formatting and colors in a .csv file; the .csv format can only handle data.
There are multiple ways to load csv data into R, but before we get to loading, we need to talk about the working directory.
11.5.2 R Working Directory
One of the biggest sources of frustration that beginners encounter when loading .csv files is telling R where on the computer the data is located. Normally, you should use a relative path to navigate to the data file in your project. Remember, a relative path is relative to the current working directory (see Section B.3.1). But what exactly is the current working directory?
As the file will be loaded by R, it is fairly obvious that we need to code the path relative to R’s current working directory. Remember–all programs have their working directory for exactly such tasks.
Like the command-line, the R interpreter (running inside RStudio) has a current working directory from which all file paths are relative. But it is not necessarily the directory of the current script file!
This makes sense if you think about it: you can run R commands through the console without having a script, and you can open multiple script files from separate folders that are all interacting with the same execution environment. Even more, you can have multiple R consoles open (although not in RStudio). Hence a running R console needs its own working directory.
Just as you can view the current working directory when on the command line (using pwd), you can use an R function to view the current working directory when in R:
# get the absolute path to the current working directory
getwd()
It is unfortunate that R uses a different command for the working directory than the shell, but that is what we have to live with.
If you have set up an RStudio “Project”, R’s working directory will be that project folder. This is perhaps the best way to ensure that you have a consistent working directory: you set up an RStudio project in the folder where you are currently working.
But if, for some reason, you haven’t created a project, or if you need to work in a folder outside of the project, you may want to change your working directory. Again, unfortunately the command is not cd but setwd() instead. This function accepts both relative and absolute paths, so you can change directory both inside of the project (relative) or move, e.g., to your desktop (absolute). In general, it is advisable to avoid using setwd() inside of scripts, as people usually do not expect scripts to change directories.
Another way to change the working directory is through the RStudio menus, namely Session -> Set Working Directory. You can either set the working directory To Source File Location (the folder containing whichever script you are currently editing; this is usually what you want), or you can browse for a particular directory with Choose Directory.
It is normally enough to set the working directory once per session. The next important thing is to understand what the relative path of your data file is and how to load it.
11.5.3 Loading csv files
There are multiple ways to load csv files into R. Base R includes functions (among others) read.csv() for reading comma-separated files and read.delim() for reading tab-separated files. Below, we focus on read_delim() in the package readr, which will automatically detect the correct separator.
readr is a separate package that needs to be installed (using install.packages("readr")) and thereafter loaded using library(readr). See Section 3.6 for more about how to install and load packages.
read_delim() reads the given csv file and returns its content as a data frame. It will automatically figure out the correct separator, but you can also specify it manually in case the automatic detection fails. It can read compressed files, often ending with .csv.bz2 or .csv.gz, so if you have such a file, there is no need to decompress it.
Normally you want to assign its returned data frame to
a variable (otherwise
it will be just printed and forgotten). Typical usage, reading data
from file.csv and storing it into workspace variable data looks
like:
data <- read_delim('file.csv')
Let’s load a tiny height-weight data from directory data inside of the current working directory. If you want to replicate this exercise, you should either download the dataset into the same folder, or adapt the path in the command below.
library(readr)
hw <- read_delim("data/height-weight.csv")
This function will return a data frame, and we save it into variable hw.
Note how the file name is now specified not just as a file name but as a relative path: "data/height-weight.csv" means to first go into the folder data (inside the current working directory), and thereafter grab the file height-weight.csv from there. If you are unsure whether your current working directory contains the folder data, then you should check
- what is your current working directory? (getwd())
- what are the files and folders there as R sees them? (list.files())
When reading is successful, read_delim() reports a few basic facts about the file. Here we see that it contains 5 rows and 4 columns. We also see that its delimiter is tab–"\t" is the tab symbol, and it contains one string variable (sex), and three numeric variables (age, height, weight; dbl, double, stands for numeric variables).
Finally, now we can also print the dataset:
hw
## # A tibble: 5 × 4
## sex age height weight
## <chr> <dbl> <dbl> <dbl>
## 1 Female 16 173 58.5
## 2 Female 17 165 56.7
## 3 Male 17 170 61.2
## 4 Male 16 163 54.4
## 5 Male 18 170 63.5
A note about file paths on Windows. R supports the unix-style forward slashes / as path separators, i.e. you can always write "data/file.csv". On Windows, one can also use Windows-standard backslashes \. However, as backslash is also an escape character inside of strings, these must be written as double backslashes: "data\\file.csv". In this book we use forward slash as path separator.
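One way to sidestep the separator question is base-R function file.path(), which assembles a path from its pieces using forward slashes. Here is a small sketch; the folder and file names are just the ones used above, and the file does not need to exist for this to run.

```r
p1 <- file.path("data", "file.csv")   # builds "data/file.csv"
p1

# the same path with Windows-style separators: the backslash is typed twice
p2 <- "data\\file.csv"
cat(p2, "\n")     # cat() shows the string as it really is: data\file.csv
nchar(p2)         # 13 characters: "\\" in code is a single backslash
```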
11.5.4 Troubleshooting loading files
Two common problems that beginners face when loading data are using the wrong file path and using the wrong separator. Here we discuss a few ways to understand and fix these problems.
If you get the file path wrong then you’ll see an error message like
read_delim("non-existent-file.csv")
## Error: 'non-existent-file.csv' does not exist in current working directory ('/home/siim/tyyq/info201-book').
The error message tells exactly what the problem is: the file non-existent-file.csv does not exist where you are looking for it. Unfortunately the error alone is not enough to suggest a solution. But here are a few steps you should take.
- Ensure you understand the file system tree and the relative path (see Section @(cmd-file-system-tree-working-dir)).
- Make sure you know where you put the file. Is it in fact in the place you think it is? Note that some computers may have multiple Desktop folders, and you may be looking at the wrong one!
- What is the current working directory of R? Use getwd() to find it out. Is the relative path of the file with respect to the current working directory correct?
- Is the file name correct? You may have mis-spelled it, or there may be an extension that is normally hidden in the graphical file viewer. (ls on terminal always shows the full file name.)
We strongly recommend that you learn to use the file name completion feature in RStudio: each time you need to write a file name, start by typing its first few characters inside the quotes, e.g. "../", and then hit the TAB button.
You may achieve similar tasks as RStudio’s file name completion by using R commands as well. For instance, to view files in the “data” folder, you can issue command
list.files("data/")
## [1] "alcohol-disorders.csv" "country-concept-similarity.csv.bz2"
## [3] "covid-scandinavia.csv.bz2" "height-weight.csv"
## [5] "ice-extent.csv.bz2" "orange-trees.csv"
## [7] "readme.md" "readme.md~"
## [9] "titanic.csv.bz2" "ukraine-oblasts-population.csv"
## [11] "ukraine-with-regions_1530.geojson"
You’ll receive a character vector of all files that R found in the folder “data/”. And importantly–you’ll see them in the exact same way as R sees them. Note that the folder name you pass, here "data/", is itself interpreted relative to the current R working directory.
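The same checks can be done in code before you attempt to load anything. The sketch below builds a throw-away folder inside R's temporary directory so that it runs anywhere; the folder name demo_data and the mis-spelled file name are made up for the illustration.

```r
# create a small demo folder with one file in it
demo_dir <- file.path(tempdir(), "demo_data")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("sex\tage", file.path(demo_dir, "height-weight.csv"))

# does the file exist where we think it is?
file.exists(file.path(demo_dir, "height-weight.csv"))   # TRUE
file.exists(file.path(demo_dir, "height_weight.csv"))   # FALSE: underscore, not dash

# list only the csv files in the folder
csv_files <- list.files(demo_dir, pattern = "\\.csv$")
csv_files
```

file.exists() returns a plain TRUE or FALSE, which makes it handy for a quick test of a path you are unsure about.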
Another common problem is to use a wrong delimiter. read_delim() will normally get it right, but not always. Also, a web search may give you different suggestions, e.g. read.csv(), which assumes that the columns are separated by commas. Here is an example of what happens when we get the separator wrong by using read.csv():
hw <- read.csv("data/height-weight.csv")  # assumes comma-separated
dim(hw)
## [1] 5 1
First, dim() tells us the right number of rows (5) as the separators do not mess up lines. But the number of columns–1–is clearly wrong. This is because read.csv() is expecting to see commas that separate columns, but as it cannot find any, it lumps all values into a single column.
When looking at data
hw
## sex.age.height.weight
## 1 Female\t16\t173\t58.5
## 2 Female\t17\t165\t56.7
## 3 Male\t17\t170\t61.2
## 4 Male\t16\t163\t54.4
## 5 Male\t18\t170\t63.5
we can see a repeated pattern of \t
present in it. This is the tab
symbol (in a string, you mark tab symbol as "\t"
). This also gives
a strong hint that data was loaded with a wrong separator, and the
correct one is tab. So you should use read.delim()
(this assumes
the separator is tab), or just read_delim()
that can detect the
separator automatically.
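You can reproduce this behavior without any data file at hand. The sketch below writes a tiny tab-separated file (made-up values in the spirit of the height-weight data) into a temporary location and reads it back with base-R functions only; the variable names are our own.

```r
# write a tiny tab-separated file into a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("sex\tage\theight\tweight",
             "Female\t16\t173\t58.5",
             "Male\t17\t170\t61.2"),
           path)

wrong <- read.csv(path)                    # expects commas, finds none
right <- read.delim(path)                  # expects tabs
also_right <- read.csv(path, sep = "\t")   # tell read.csv() the separator

dim(wrong)     # 2 rows, 1 column
dim(right)     # 2 rows, 4 columns
names(right)
```

If the number of columns is 1 although you expect more, the separator is the first thing to check.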
11.6 Learning to know your data
Data is treacherous. There are many things that can go wrong when working with data, and before you even start any serious work, you should have an overview of what exactly is there in the dataset. Below, we describe a few things that you should do each time you start working with a new dataset.
11.6.1 Did you load data correctly?
Your first task should be to check if you loaded data correctly–and if you loaded the correct dataset. For instance, if you want to analyze Titanic data, you ought to load it along these lines:
titanic <- read_delim("data/titanic.csv.bz2")
But what exactly did you load? Maybe read_delim()
was not the right
way to load this file? Maybe the file got corrupted somehow? Maybe
titanic.csv is not the file you want in the first place? Let’s
check!
Perhaps the first and simplest way to check the dataset is just to print its dimension (rows and columns):
dim(titanic)
## [1] 1309 14
This looks encouraging: 1309 rows and 14 columns feels about right for passenger list data. But this was only about rows and columns–we still do not know what these contain. I recommend printing a few lines of data to get a visual idea of what is there:
head(titanic, 3)
## # A tibble: 3 × 14
## pclass survived name sex age sibsp parch ticket
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 1 1 Allen, Miss. Elisabeth Walton female 29 0 0 24160
## 2 1 1 Allison, Master. Hudson Trevor male 0.917 1 2 113781
## 3 1 0 Allison, Miss. Helen Loraine female 2 1 2 113781
## fare cabin embarked boat body home.dest
## <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 211. B5 S 2 NA St Louis, MO
## 2 152. C22 C26 S 11 NA Montreal, PQ / Chesterville, ON
## 3 152. C22 C26 S <NA> NA Montreal, PQ / Chesterville, ON
sample_n(titanic, 3)
## # A tibble: 3 × 14
## pclass survived name sex age sibsp parch ticket
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 3 0 Svensson, Mr. Johan male 74 0 0 347060
## 2 1 0 Hilliard, Mr. Herbert Henry male NA 0 0 17463
## 3 2 1 Kantor, Mrs. Sinai (Miriam Sternin) female 24 1 0 244367
## fare cabin embarked boat body home.dest
## <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 7.78 <NA> S <NA> NA <NA>
## 2 51.9 E46 S <NA> NA Brighton, MA
## 3 26 <NA> S 12 NA Moscow / Bronx, NY
It is convenient to check the first few lines with head()
, but
sometimes the beginning of datasets looks good, while further down it
is empty or garbled. In that case a random sample may give you a
better idea.
Currently, both of these functions show a similar picture. And this
picture is plausible–you see columns with reasonable names, and
meaningful values. It is fine if you do not understand all of
these–at least they look plausible. It is also fine if some values are
missing–most datasets contain many-many missing values.
But what happens if something goes wrong with data loading? Obviously, this depends on what exactly is wrong. If you have accidentally deleted your dataset then you may see a “no such file” error. If the dataset is empty, you may find that your data frame contains zero rows. Here is an example of what happens if you load it with the wrong delimiter (see also Section 11.5.4):
twrong <- read.delim("data/titanic.csv.bz2")  # assume delimiter is tab, it is not!
dim(twrong)
## [1] 1309 1
This gives the first warning sign: the dataset contains a single column only. A single column is not necessarily wrong, but if you have the slightest idea about your dataset, you should be able to tell whether it makes sense.
Next, printing a few lines shows:
sample_n(twrong, 2)
## pclass.survived.name.sex.age.sibsp.parch.ticket.fare.cabin.embarked.boat.body.home.dest
## 1 3,0,Sage, Mr. Douglas Bullen,male,,8,2,CA. 2343,69.5500,,S,,,
## 2 1,1,Simonius-Blumer, Col. Oberst Alfons,male,56,0,0,13213,35.5000,A26,C,3,,Basel, Switzerland
It shows that the data is there, just not correctly arranged into columns.
Finally, we may also check names of the mis-read dataset:
names(twrong)
## [1] "pclass.survived.name.sex.age.sibsp.parch.ticket.fare.cabin.embarked.boat.body.home.dest"
Again, all names are there, but they are combined into a single
column. Here these problems are caused by read.delim()
that does
not figure out the correct delimiter.
11.6.2 Data types
After the data is correctly loaded, you probably want to take a quick look at column types. We discussed the basic types–numbers, texts, and similar in Section 2.5. But there are more data types, e.g. dates and timestamps (see Section 14.3), categorical data (see Section 17.2), and others. It is helpful to check whether the column types are what you might expect.
An easy way to see the data types is just to print a few lines of the dataset. The printout will include the abbreviated column type:22
head(titanic, 1)
## # A tibble: 1 × 14
## pclass survived name sex age sibsp parch ticket fare
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 1 Allen, Miss. Elisabeth Walton female 29 0 0 24160 211.
## cabin embarked boat body home.dest
## <chr> <chr> <chr> <dbl> <chr>
## 1 B5 S 2 NA St Louis, MO
We can see that pclass is of type dbl (“double”, i.e. a “double-precision” number), name is chr (“character”, i.e. text), and so on.
Exercise 11.10 What do you think, why is type of the boat above “chr” (text), not number, although its value is the number “2”? Are the ticket, fare, and cabin data types what you expect?
Understanding the data types will help to spot problems in the data. For instance, age is currently a number (“dbl”). This is probably what we might expect–when talking about “age”, we usually mean age in years. But what if age turns out to be text (“chr”)? How might that happen? This may indicate problems with coding–the column contains values that cannot be converted to a number. For instance, someone’s age may be listed as “middle-aged”, “twenties” or “younger than 50”. Alternatively, there may just be typos or other errors in the column, e.g. age may be listed as “4t” or as “male”. The former is probably a typo, the latter probably means that someone has entered sex into the wrong column.
You need to know the data type before you do any computations. Otherwise, you may get surprising results, e.g. you may learn that someone in their twenties is older than a 50-year old 🙄
age <- c(50, "twenties")  # character vector
max(age)  # alphabetically, "5" is before "t"
## [1] "twenties"
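When a column that should be numeric turns out to be text, a quick way to find the offending entries is to attempt the conversion and see which values fail. The following is a minimal sketch with made-up values, not a complete cleaning recipe.

```r
age <- c("50", "twenties", "4t", "31")   # made-up values, read in as text

# as.numeric() turns text it cannot interpret into NA (and warns about it)
age_num <- suppressWarnings(as.numeric(age))
age_num

# which original entries failed to convert?
age[is.na(age_num)]

# a numeric comparison now; na.rm = TRUE skips the NA-s
max(age_num, na.rm = TRUE)
```

The entries listed by the second command are the ones you need to inspect and fix by hand.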
In case you cannot or do not want to print the few lines of data as
above, you can use the function class()
. For instance, let’s load
the Steam CS-GO data:
csgo <- read_delim("data/csgo-reviews.csv.bz2")
names(csgo)
## [1] "rating" "nHelpful" "nFunny" "nScreenshots" "date"
## [6] "hours" "nGames" "nReviews"
We can manually see that both rating and date are text:
class(csgo$rating)
## [1] "character"
class(csgo$date)
## [1] "character"
If you want to know the type of all columns, then you can use a for-loop:
for(col in names(csgo)) {
cat(col, ": ", class(csgo[[col]]), "\n", sep = "")
}
## rating: character
## nHelpful: numeric
## nFunny: numeric
## nScreenshots: numeric
## date: character
## hours: numeric
## nGames: numeric
## nReviews: numeric
Or better, sapply()
(see Section 16.2.1):
sapply(csgo, class)
## rating nHelpful nFunny nScreenshots date hours
## "character" "numeric" "numeric" "numeric" "character" "numeric"
## nGames nReviews
## "numeric" "numeric"
Exercise 11.11 Write a for loop over all columns of the csgo data frame. Print out the average value for the columns, but only for those columns that are numeric!
11.6.3 Missing values
After you have seen that you loaded the data correctly, you may want to check how much information there is in the data. One of the persistent problems with real-world data is missing values. Missing values are normally displayed as NA (for “Not Available”), no matter for what reason the data is missing. It may be missing because it is not applicable for a given case (e.g. time of death of a living person, see Section 11.3), or it may be that the information exists but we do not know it, or it may be that whoever created the dataset just forgot to enter it. It may also be caused by errors in data pre-processing code.
R has a dedicated function, is.na(), that takes in a vector and returns a logical vector: TRUE if the element is missing and FALSE if it is not. For instance,
x <- c(1,2,NA,4,NA)
is.na(x)
## [1] FALSE FALSE TRUE FALSE TRUE
As you see, elements 3 and 5 are TRUE (they are missing) while the others are FALSE (they are valid values).
We can use it to count the number of missings (see Section 4.6):
sum(is.na(x))
## [1] 2
or the share of missings (0.4 means 40%):
mean(is.na(x))
## [1] 0.4
We can also remove missings from x as
x[!is.na(x)]
## [1] 1 2 4
Note what happens here: first we invert the meaning of is.na()
with
the logical NOT !
and get
!is.na(x)
## [1] TRUE TRUE FALSE TRUE FALSE
i.e. TRUE corresponds to non-missing elements and FALSE to missing elements. And thereafter we use logical indexing (see Section 4.4.3) to extract only non-missing elements.
Exercise 11.12 Create a data frame with two columns: one of these is the x above with two missing values. The other column should not contain any missings.
Extract only those rows from the data frame that do not contain missings.
Before any further analysis, you want to check how many missings there are in the variables you want to use in your analysis. For instance, assume you want to use variables age, survived, and boat in the titanic dataset. You may compute
sum(is.na(titanic$age))
## [1] 263
to see that there are 263 missing age values. Or alternatively,
mean(is.na(titanic$boat))
## [1] 0.6287242
indicates that a very large share of boat is missing, casting doubt on whether it can be used for any analysis at all.23
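Instead of checking the columns one by one, you can count the missings in all columns at once. Here is a sketch on a small made-up data frame (the values are invented, not taken from the Titanic data); it relies on the fact that is.na() applied to a data frame returns a logical matrix.

```r
# a small made-up data frame with a few missing values
df <- data.frame(
  age  = c(29, NA, 2, 30, NA),
  fare = c(211.3, 151.6, NA, 151.6, 26.0),
  sex  = c("female", "male", "female", "male", "male")
)

colSums(is.na(df))    # number of missings in each column
colMeans(is.na(df))   # share of missings in each column
```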
11.6.4 Range and implausible values
Unfortunately, not all missing values are coded as NA. First, it is
a common habit in many survey datasets, but also elsewhere, to denote
missing values with certain implausible numbers, e.g. “-1” or “999”.
These numbers do not show up with is.na()
because, well, they are
valid numbers! I stress here that they are valid numbers, not
necessarily valid ages, ticket prices or whatever the variable is describing.
There are two good strategies to assess such problems.
First, one should look up the data documentation. If the dataset is carefully designed and coded, the documentation will probably tell what values have special meaning.
Example 11.1
World Value Survey asks many questions like “How important is family in your life” with answers ranging from “1” (very important) to “4” (not at all important). But the interviewer is also told to use negative numbers as
- -1: don’t know
- -2: no answer
- -3: not applicable
These are missing values–we do not know how important family is if we get no answer. But these are valid numbers and hence not picked up by is.na().
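Once you know which numbers are special codes, you can recode them as proper missing values. Below is a minimal sketch with invented answers that use the three codes from the example above; the variable names are our own.

```r
# made-up answers: 1-4 are real answers, negative codes mean "missing"
family <- c(1, 2, -1, 4, -2, 1, -3)

sum(is.na(family))    # 0: is.na() does not see the special codes
mean(family)          # misleading: the codes are treated as real answers

# recode the special codes as NA, keeping the original vector untouched
family_clean <- family
family_clean[family_clean %in% c(-1, -2, -3)] <- NA

sum(is.na(family_clean))           # 3
mean(family_clean, na.rm = TRUE)   # average over the real answers only
```

Note how the average of the raw values falls below 1, outside the range of valid answers. This is a typical sign that special codes are hiding in the data.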
But too often, the documentation is incomplete or missing altogether, and you are left to guess what certain values mean. In that case, a good option is to look at maximum and minimum values (for numeric data), or to find all text values there (if categorical data). For instance, let’s analyze the age in Titanic data. We can guess it is passengers’ age in years, but is it? What is its minimum value:
min(titanic$age)
## [1] NA
Why is minimum value missing? This is because some age values are
missing, and hence we do not know what is the true smallest age.
This is a good but annoying way to remind us that we cannot
compute the minimum value of data that contains missings.
Fortunately, an easy workaround is to set the argument na.rm to TRUE:
min(titanic$age, na.rm = TRUE)
## [1] 0.1667
na.rm = TRUE tells R to ignore missing values when computing the minimum. The maximum age value works in a similar fashion; one can also use range() that prints both the minimum and maximum age value:
range(titanic$age, na.rm = TRUE)
## [1] 0.1667 80.0000
This indicates that the youngest passenger was 2 months, and the oldest one 80 years old. Both are plausible values and so we can conclude that all passenger age values are plausible in Titanic data–they all must be in the range from 2 months till 80 years.
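If the range looks suspicious, logical indexing helps to find the problematic values. The sketch below uses a made-up age vector, and the plausibility bounds of 0 and 120 years are our own assumption, not something the data documentation states.

```r
age <- c(29, 0.1667, NA, 80, 250, -1)   # made-up; 250 and -1 are not plausible

range(age, na.rm = TRUE)

# TRUE where the value is known and outside the plausible range
implausible <- !is.na(age) & (age < 0 | age > 120)
age[implausible]    # the offending values
sum(implausible)    # how many there are
```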
Exercise 11.13
Load Ice extent data. Let’s focus on ice extent (column extent) and area (area).
Do these columns include any missing values (NA-s)?
What is a plausible range for ice extent and area? Can you suggest a lower bound and an upper bound that the ice extent/area plausibly cannot exceed?
You may need to consult the documentation in Section I.10 to understand these variables.
Are all extent and area values plausible? Explain what you see!
If the column we are interested in is categorical, then we cannot just compute its minimum and maximum. If possible, one should look at all values that are there. This can be done with unique() (that displays all unique values), or table(), that also shows how many times each value occurs in data.
Let’s take a look at the boat column in Titanic data:
unique(titanic$boat)
## [1] "2" "11" NA "3" "10" "D" "4" "9"
## [9] "6" "B" "8" "A" "5" "7" "C" "14"
## [17] "5 9" "13" "1" "15" "5 7" "8 10" "12" "16"
## [25] "13 15 B" "C D" "15 16" "13 15"
The result contains a number of boat names (boat numbers). It also contains NA for missing boat. Importantly, it also contains a few examples where an individual has been assigned multiple boats, such as “13 15 B”. I do not know what it means. It is possible that someone was transferred from one boat to another, but this is unlikely. Maybe the data collectors just weren’t sure which boat the passenger was in.
The table()
function will give a broadly similar picture, but
importantly, it also tells the frequency of these values:
table(titanic$boat)
##
## 1 10 11 12 13 13 15 13 15 B 14 15 15 16
## 5 29 25 19 39 2 1 33 37 1
## 16 2 3 4 5 5 7 5 9 6 7 8
## 23 13 26 31 27 2 1 20 23 23
## 8 10 9 A B C C D D
## 1 25 11 9 38 2 20
The result may look a little hard to understand at first because it may be unclear where the boat names are and where the corresponding counts are. But it contains pairs of rows–the boat names in the upper row and the respective counts in the lower row. For instance, boat “1” was assigned to 5 passengers, boat “10” to 29, and so on. From the table we can see that there are only 7 problematic multi-boat values, covering 10 passengers in total (two for “13 15”, one for “13 15 B”, and so on). Hence we can conclude that the vast majority of cases contain valid boat names.
Exercise 11.14 Analyze the home/destination column of Titanic data (home.dest). Do you see implausible values? Missing values? How many different values do you have? Do you think this approach–checking the individual values–is a good way to go here?
Exercise 11.15 When you count the number of different values, you’ll find that
unique()
and table()
give you a slightly different number:
length(unique(titanic$boat))
## [1] 28
length(table(titanic$boat))
## [1] 27
Which value is missing from table()? How can you include all the values in the table (see the documentation)?
11.6.5 Descriptive analysis
TBD: mean(na.rm=TRUE)
There is also a handy function summary() that displays a few summary statistics for vectors, including the number of missings:
summary(titanic$fare)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 7.896 14.454 33.295 31.275 512.329 1
summary() is a good way to get a quick overview of individual columns. In code we usually prefer single values though, and then it makes more sense to use dedicated functions like min(), max() and mean().
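As a small illustration of the difference, the sketch below computes the overview and the single values for a made-up vector of ticket prices (the numbers are invented). Remember that the dedicated functions need na.rm = TRUE if the data contains missings.

```r
fare <- c(7.25, 71.28, NA, 8.05, 53.10)   # made-up ticket prices

summary(fare)    # quick overview, including the number of NA-s

# single values, convenient for further computations
avg_fare <- mean(fare, na.rm = TRUE)
med_fare <- median(fare, na.rm = TRUE)
max_fare <- max(fare, na.rm = TRUE)

avg_fare
med_fare
max_fare
```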
11.6.6 Making simple plots
A powerful tool to understand data is to visualize it. Later (Section 13) we will learn about the ggplot2 library that is designed for plotting data. Here we will give a brief overview of base-R plotting functionality that is simpler, but may need more work to create plots that fit our needs.
The central function in base-R plotting is plot()
. It normally
takes two arguments, \(x\) and \(y\), and makes a scatterplot of the \((x, y)\) pairs. Here is an example using random numbers:
x <- rnorm(50)  # 50 random numbers
y <- rnorm(50)
plot(x, y)
This creates a simple scatterplot of 50 points, marked as empty circles. But note that the labels are just the variable names, it also does not have title or explanation. All those must be adjusted or added manually if desired.
The plot() function supports a plethora of additional arguments that can be used for adjusting and customizing the plot (see the example below). These include:
- xlab, ylab: \(x\) and \(y\)-axis labels. Defaults are the variable names, but frequently those need to be adjusted.
- main: main title, a text put above the plot in bold font.
- col: color of the objects (dots in the previous example). It can be a numbered color (1 is black, 2 is red, and so on), a simple color as string (e.g. "red" or "green"), or a complex R color (e.g. "cornsilk2" or "orangered3", see e.g. this image or just google “r colors”). It can also be a hex color code, e.g. "#998877". There are more options, check out functions rgb(), hsv() and others.
- pch: point type. 1, the default, draws empty circles, 2 empty triangles, 4 crosses, 16 filled circles. You can see them with plot(1:20, pch=1:20).
- lwd: line width for line plots. lwd=1 is the default, lwd=2 makes thick lines, lwd=0.5 makes thin lines. You can experiment with different numbers.
- cex: size of the points. Default is cex=1, try cex=2 for large points.
Here is a tuned example of the previous plot:
plot(x, y,
main = "Scatterplot demo",
xlab = "A random number",
ylab = "Another random number",
pch=10, cex=2.5, col="firebrick")
We use large reddish crossed circle–shaped dots and custom labels.
Another very important option for the plot() function is type. This tells the plot type with "p" (default) meaning points (scatterplot), "l" meaning line plot, and "b" meaning both points and lines. It is probably a bad idea to connect these random points with lines, but line plots have their place for displaying, e.g. time series data.
Exercise 11.16 Why is it a bad idea to make a line plot of random dots? Use
type="l"
in the previous example to find it out!
See the solution
We demonstrate this with the beavers data, namely the built-in dataset beaver2 (see Section 11.4). It records the body temperature of a beaver over a day and looks like
head(beaver2, 3)
## day time temp activ
## 1 307 930 36.58 0
## 2 307 940 36.73 0
## 3 307 950 36.93 0
It contains four variables–date, time, body temperature (°C) and
activity status (“0”: in its nest, “1”: outside). See ?beavers
for
more details.
plot(beaver2$temp, type="l",
col=3, lwd=1.5,
xlab="Measurement",
ylab="Temp (deg C)")
We can see how the body temperature changes over time between 37° and
38°. Note that we have done some plot tuning here: specified
color 3
(green), made the line a bit wider, and adjusted the labels.
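You can also combine lines and points in a single figure. The sketch below draws the same temperature line and then marks the measurements taken while the beaver was active; it uses the base-R function points(), which is not discussed above and adds points to an existing plot. The colors and sizes are arbitrary choices.

```r
active <- beaver2$activ == 1    # TRUE when the beaver is outside its nest

plot(beaver2$temp, type = "l",
     col = "gray40",
     main = "Beaver body temperature",
     xlab = "Measurement",
     ylab = "Temp (deg C)")
# overlay filled dots on the measurements taken while the beaver was active
points(which(active), beaver2$temp[active],
       pch = 16, cex = 0.8, col = "firebrick")
```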
Base-R offers a simple way to save plots. Normally, plots are displayed on screen (this is the “default device”). But you can pick a different “device”, e.g. a pdf file. One can save the pdf of the figure as
pdf("plot.pdf", width=6, height=4) # width, height in inches
plot(...) # do your plotting here
dev.off() # finish the pdf file.
Instead of pdf, you can also save png, jpg and other file formats and
supply many other options. See ?pdf
and ?png
for more
information.
There are three things to be aware of:
- The order of tasks is this: a) open the device (e.g. pdf()); b) do your plotting; c) close the device (dev.off()). If you do plotting first, you’ll get an empty file as the plot was still done on screen.
- dev.off() is needed. If you leave this command out, the file will be incomplete and will probably not display. This is the command that ensures the image is completely written to the file.
- The plot, including its fonts and point sizes, will be fitted to the image on disk. It typically has different dimensions than what you have on screen, and hence you may be surprised to see fonts and lines that are too large or too narrow. You can either adjust the image size when writing the file, or the font/point/line sizes when doing the plotting.
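The three steps can be put together as in the following sketch. It saves the beaver temperature plot into R's temporary folder, so it should run anywhere; the file name beaver-temp.pdf is a made-up example, and in your own work you would use a path inside your project instead.

```r
out_file <- file.path(tempdir(), "beaver-temp.pdf")   # hypothetical file name

pdf(out_file, width = 6, height = 4)    # a) open the device
plot(beaver2$temp, type = "l",          # b) do the plotting
     xlab = "Measurement", ylab = "Temp (deg C)")
dev.off()                               # c) close the device

file.exists(out_file)   # TRUE if the file was written
file.size(out_file)     # size in bytes; an empty file hints at a problem
```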
Finally, ggplot (Section 13) provides an extended functionality to save plots but it still supports the base-R devices as described above.
The need for rectangular structure is suitable for many tasks, but not for all tasks. For instance, if different patients have different number of measurements, then data frames may not be the best way to represent data.↩︎
It may sound counter-intuitive that the computer does not know whether Naruhito–who is alive in 2023–died before year 1. We know that this is impossible. But computers know nothing, unless we explain it to them. And we haven’t explained it.↩︎
readr is also part of tidyverse (see Section 12.3), so if you have installed and loaded the latter, there is no need to load readr separately.↩︎
Strictly speaking, this is only true for the tibble-flavor of the data frames. Tibbles are the data frames that are loaded with read_delim(), you can also create them manually. But the data.frame() function does not create tibbles.↩︎
Here it is actually not a problem with data. The reason that boat is missing for over 60% of entries is the sad fact that over 60% of passengers did not get into a boat.↩︎