Chapter 2 R: Programming Language and a Statistical System
R is a popular programming language and a statistical/data-analysis frontend. It is widely used for statistical tasks, in the social and biological sciences, and in data science. R attempts to find a balance between being easy to use and providing universal and powerful tools for analysis. On a scale between simplicity and power, it is more complex and powerful than certain other environments, such as Excel or Stata, but easier to use than more programming-oriented environments, such as python or java. R is dynamically typed, which makes it easy to use for quick scripting.
Base R is a reasonably powerful language that contains a large number of tools suitable for computing and data analysis. It also has flexible tools to handle and program the language itself, including classes and environments. Base R is not very hard to learn (but harder than e.g. base python). R has a rich infrastructure of libraries. While the base language is mostly consistent, the libraries follow different styles and are of varying quality. However, the more popular ones are almost compulsory for effective R usage.
R can be used in three main ways:
- One can write R programs and thereafter run those on the command line as batch files. This is useful if you use R as a backend, in makefiles, or in order to run tasks that take a long time. Modern IDEs typically also support such execution with a single click.
- An alternative and perhaps the most popular way to use R is through the interactive console. One can open the console by running R on the command line, but nowadays the most popular way is to use the built-in console in IDEs, such as RStudio. IDEs offer a lot of additional functionality, such as syntax coloring, debugging and nice data editors.
- Finally, another popular usage of R is in literate programming. Literate programming means mixing written text with computer code. When the document is compiled, the code is replaced with its output, and hence one can create reports that can be easily updated as new data arrives. Perhaps the most popular literate programming environment is rmarkdown, but there are other options. See more in Section RMarkdown.
R can also be used in Jupyter notebooks, although this is a less common usage.
Needless to say, a chunk of R code runs in the same way whichever of these you end up using, so in terms of the language there is no additional learning if you switch from one context to another.
2.1 Base language
The base R language is in many ways similar to the other traditional programming languages, such as java or python. It is relatively simple and dynamically typed. It is designed more to be a good interactive environment than a fast and effective language. This makes it a very suitable choice for interactive data analysis and numerical computing. However, R is less popular as an industrial backend tool.
2.1.1 Variables and assignment
Variables in R are names for in-memory data storage, just as in other traditional languages. As in python, but unlike in java or C, the variables do not have to be declared, and you can change their type on the fly (it is a dynamically typed language).
Variable assignment is normally done using the left-assignment
operator <-
(but there are at least 4 other options).
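As a brief sketch of the alternatives mentioned above: base R also accepts the equals sign, the right-assignment operator `->`, the function `assign()`, and the super-assignment `<<-`, which updates a variable in an enclosing environment. The variable and function names below are arbitrary and only serve as illustration.

```r
x1 <- 1              # the usual left-assignment
x2 = 2               # the equals sign also assigns
3 -> x3              # right-assignment: value first, name last
assign("x4", 4)      # assignment through a function, name given as a string

counter <- 0
increment <- function() {
   counter <<- counter + 1   # super-assignment: updates 'counter' outside the function
}
increment()

c(x1, x2, x3, x4, counter)
## [1] 1 2 3 4 1
```

In everyday code `<-` remains the conventional choice; the other forms are mostly useful in special situations.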
The most important data types are doubles (floating-point numbers),
integers, logicals, and characters (strings). The following
example demonstrates assignment and
all these data types:
a <- 1  # double
b <- 2L  # integer
λ <- FALSE
s <- 'text'
When declaring a numeric variable, it is normally considered to be a double, unless explicitly declared to be an integer using the L suffix (originating from the word ‘long’). Note that R supports UTF-8 letters in variable names, as visible with the variable λ. We can query the internal data type (storage mode) of a variable with storage.mode:
storage.mode(a)
## [1] "double"
storage.mode(b)
## [1] "integer"
If needed, one type can be explicitly cast into another type:
as.integer(λ) # convert to integer
## [1] 0
as.character(a) # convert to string
## [1] "1"
One can see that FALSE
is converted to zero as integer.
2.1.2 Atomic variables are vectors
One of the most distinctive traits of the R language is that all elementary (called atomic) variables are vectors. This means that the variables we defined above are not just single values but can contain more than a single value. All base operations on atomic variables are vectorized, i.e. they operate on all the values in one go. This is best demonstrated by examples.
We can create vectors by the concatenation function c:
v <- c(1, 2, 3)  # three doubles in a vector
v
## [1] 1 2 3
If we do simple operations with v, the operation will be performed on all elements of the vector:
v + 1
## [1] 2 3 4
2*v
## [1] 2 4 6
as.character(v)
## [1] "1" "2" "3"
Note that we did not use an explicit loop to repeat the same operation on all elements. The operators + and *, and the function as.character are already vectorized, i.e. they do the necessary loops internally.
Another popular way to create vectors is to use the colon sequence operator:
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
This is perhaps the most popular way to run loops a pre-determined number of times. Note also that : is a shortcut to the seq function; the latter allows many more options.
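To give an idea of what these extra options look like, here is a small sketch using `seq` and its base-R relatives `seq_len` and `seq_along`. The variable names are made up for illustration; the expected results are given in the comments.

```r
by3    <- seq(1, 10, by = 3)           # 1 4 7 10: custom step
parts  <- seq(0, 1, length.out = 5)    # 0 0.25 0.5 0.75 1: fixed number of values
down   <- seq(10, 1, by = -3)          # 10 7 4 1: descending sequence
first4 <- seq_len(4)                   # 1 2 3 4: same as 1:4
along  <- seq_along(c("a", "b", "c"))  # 1 2 3: one index per element

# a common pitfall: 1:0 counts downward, while seq_len(0) is empty
length(1:0)
## [1] 2
length(seq_len(0))
## [1] 0
```

For loops over possibly empty collections, `seq_len` and `seq_along` are therefore often a safer choice than the colon operator.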
Another popular way to create and manipulate vectors is to add more components to it using the same concatenation function c. Here is an example:
people <- "Feliciano"
people <- c(people, "Lynn")
people <- c("Matías", people)
people
## [1] "Matías" "Feliciano" "Lynn"
In this example, the variable (vector) people
first only contains
one name, “Feliciano”. Next we concatenate “Lynn” to the end
of it, and thereafter “Matías” in front of it.
Vectorized operations are very fast and efficient, and one should always consider vectorizing the code if possible. There are many more ways to work with and manipulate vectors, see Section Indexing and Named Vectors.
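As a minimal sketch of what vectorizing means in practice, the following code computes the sum of squares of the numbers 1 to 1000 twice: once with an explicit loop and once in vectorized form. The variable names are chosen for this illustration only.

```r
x <- 1:1000

# explicit loop: add the squares one by one
total <- 0
for(xi in x) {
   total <- total + xi^2
}

# vectorized: square all elements in one go, then sum them
total_vec <- sum(x^2)

total == total_vec
## [1] TRUE
```

Both versions give the same answer, but the vectorized one is shorter and, for long vectors, typically much faster.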
2.1.3 Mathematical, logical and other operators
The mathematical operators are (mostly) traditional: +, -, *, / for addition, subtraction, multiplication and division, and ^ for exponentiation. Other useful mathematical operations are %/% for integer division, and %% for modulo:
7 %/% 2 # 3
## [1] 3
7 %% 2 # 1
## [1] 1
Logical operations work mostly as expected too, in particular >, <, >=, and <=. As in several other languages, equality is tested with double equal signs ==. Inequality can be tested with !=, and logical negation is !:
a <- 1
a > 1  # FALSE
## [1] FALSE
a >= 1  # TRUE
## [1] TRUE
a == 1  # TRUE
## [1] TRUE
a != 1  # FALSE
## [1] FALSE
!(a == 1)  # FALSE
## [1] FALSE
The logical and is &, logical or is | (note: a single & and a single |!), and logical not is !.
All these operators are vectorized.
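A short sketch of how the vectorized logical operators combine conditions element by element. The vector `age` is a made-up example; the expected results are given in the comments.

```r
age <- c(12, 25, 40, 67)

working_age <- age >= 18 & age < 65   # FALSE TRUE TRUE FALSE
dependent   <- age < 18 | age >= 65   # TRUE FALSE FALSE TRUE
adult       <- !(age < 18)            # FALSE TRUE TRUE TRUE
```

The doubled forms && and || also exist in R, but they are meant for single TRUE/FALSE values, for instance inside an if condition.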
Exercise 2.1 Create vector 1, 4, 9, 16, 25 in two ways:
- using sequence and mathematical operations
- using the c function
See the solution
2.1.4 Control structures
The basic control structures in R are fairly similar to those in python or java. These include conditional execution with if and else, looping with for and while, leaving a loop early with break, and skipping to the next iteration with next (R has no continue keyword).
Let us demonstrate this with a simple example:
for(i in 1:10) {
   cat(i, "\n")
   if(i > 3) {
      cat("too much\n")
      break
   }
}
## 1
## 2
## 3
## 4
## too much
The for-loop wants to run 10 times, each time assigning the next consecutive integer to i. cat is just the R way of printing. However, if i is more than 3, the code prints “too much” and terminates the loop. Note that the loop content and the if-block are enclosed in curly braces { .. }, exactly like in java (you can leave out the curly braces if the block contains just a single expression). Unlike in python, indentation does not play any role from the syntax point of view.
As R does not require any distinct line ending marker, such as ; in java, it uses some heuristics to figure out where certain code blocks end. In particular, the else keyword should be on the same line where the if block ends. For instance, the following code always works:
for(i in 1:10) {
   cat(i, " ")  # note: no newline here
   if(i %% 2 == 0) {
      cat("even\n")
   } else {  # 'else' on the same line where if block ends with '}'
      cat("odd\n")
   }
}
## 1 odd
## 2 even
## 3 odd
## 4 even
## 5 odd
## 6 even
## 7 odd
## 8 even
## 9 odd
## 10 even
If you put else
on a separate line, R may stop with “unexpected
‘else’” error.
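The word “may” matters here. The following sketch illustrates the difference by handing two small code snippets (stored as strings) to the base-R functions `parse` and `eval`. At the top level, R considers the if statement complete at the end of its line and then stumbles over else. Inside curly braces R keeps reading until the closing brace, so the same layout is accepted. The strings `bad` and `good` are made up for this illustration.

```r
# 'else' on a new line at the top level: a syntax error
bad <- "if(TRUE) 'yes'\nelse 'no'"
result <- tryCatch(parse(text = bad),
                   error = function(e) "syntax error")
result
## [1] "syntax error"

# the same lines wrapped in curly braces parse and run fine
good <- "{\nif(TRUE) 'yes'\nelse 'no'\n}"
eval(parse(text = good))
## [1] "yes"
```

Keeping else on the same line as the closing brace of the if block avoids the problem in both situations.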
Exercise 2.2 Can you afford going out with friends? Let’s find it out:
- How many friends do you have? Put it into a variable
- What is your budget? Put it into a variable
- Print a message I am going out with X friends where X is your number of friends. Hint: use the cat function like cat("I am going out with", X, "friends\n")
- What does the meal cost? Put it in a variable
- Compute total meal price for your whole company. Do not forget to buy a meal for yourself too!
- Add 15% tip to the total price
- Print either can afford or cannot afford, depending on if the total cost exceeds/does not exceed the budget
See the solution
2.1.4.1 Creating vectors in loop
Quite often we need to compute a value for every element in a collection, and store all the results in a single vector (or list). For instance, one may want to see how many observations there are in a number of data files, or how many ingredients there are in different recipes. A common solution in such a case is the following: first create an empty vector, and thereafter loop over the collection and append the computed values to it one by one. For instance, here is code that creates a vector of squares of numbers:
squares = NULL
for(i in 1:10)
   squares <- c(squares, i^2)
squares
## [1] 1 4 9 16 25 36 49 64 81 100
This is a handy and frequently used algorithm. It is not particularly efficient though, and becomes very slow if the collection is large. The problem is that vectors are created with a fixed length, and when you add new elements, you run out of the allocated space. The computer has to allocate new space and copy the former data into the new location. But it works well for small vectors.
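If the final length is known in advance, one common alternative is to pre-allocate the vector and fill it by index. The sketch below uses the base-R function `numeric` to create a vector of zeros; the variable names are chosen for illustration.

```r
n <- 10
squares <- numeric(n)      # pre-allocate a vector of n zeros
for(i in 1:n) {
   squares[i] <- i^2       # fill an existing position, the vector does not grow
}

# for such a simple task the vectorized form gives the same result
identical(squares, (1:n)^2)
## [1] TRUE
```

When the computation can be expressed through vectorized operations, as here, the loop is not needed at all.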
2.1.5 Strings and printing
Strings in R (called character) can be constructed in traditional ways, using either single or double quotes:
a <- "what"
b <- 'is'
Both of these are equivalent. The former is useful for creating a string that contains a single quote like a <- "what's", and the latter is better if you want to include double quotes (and you also save one press of shift key 😄).
Strings can be concatenated with paste
or paste0
function. The
former leaves a space between the strings, the latter does not:
paste(a, b)
## [1] "what is"
paste0(a, b)
## [1] "whatis"
(for paste
you can also specify the desired separator using sep
argument.)
One can concatenate numbers and strings in a similar fashion, numbers are automatically converted to strings:
a <- 3
paste0("a=", a)
## [1] "a=3"
paste("1/a =", 1/a)
## [1] "1/a = 0.333333333333333"
The latter example may give us undesired precision. A solution is to round the numbers before printing:
paste("1/a =", round(1/a, 3))
## [1] "1/a = 0.333"
A more powerful way to format numbers is formatC
function that uses
C-style formatting strings. Base R contains a large number of string
functions, and there are more in add-on packages.
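As a small sketch of C-style formatting, the following lines use `formatC` and the related base-R function `sprintf`, which takes a whole format string. The particular format choices are just examples.

```r
a <- 3
formatC(1/a, digits = 3, format = "f")    # fixed notation with 3 decimals
## [1] "0.333"
formatC(7L, width = 3, flag = "0")        # pad an integer with leading zeros
## [1] "007"
sprintf("1/a = %.3f", 1/a)                # format string with a placeholder
## [1] "1/a = 0.333"
```

Both functions return strings, so the results can be passed on to paste or cat.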
Exercise 2.3 Create a string height is 5’3". Hint: paste two strings you create using single/double quotes
See the solution
R has two main printing functions: cat and print. cat is the standard way of printing messages: it prints as many arguments as needed, and is suitable for fitting everything on a single line. print only prints a single object at a time and is better suited to print more complex objects that need several lines. This includes model summaries, data frames, and functions. Note that cat does not add a newline at the end, so you have to do it yourself if this is needed. A typical printing looks like
result <- sqrt(2)
cat("The result is", result, "\n")
## The result is 1.414214
If working in an interactive session, or using r-markdown, R also automatically prints the results that are not saved in variables. So instead of the above, you may just do
sqrt(2)
## [1] 1.414214
However, this does not work in all contexts, for instance inside loops and functions, or when running a script through source(). This often confuses beginners as the output that was there a second ago is suddenly gone with no obvious error message. But even when the output remains visible, printing results without any explanatory messages is not a good habit when writing anything longer than a few lines of code: it is hard to understand what the result means. In longer and more complex code it is advisable to stay with explicit printing.
2.1.6 Functions
Functions in R behave very much like in other traditional
programming languages. Functions are objects, created with the
keyword function
and are normally assigned to variables (which are
then called “functions”). Functions always return a value, this may
be done explicitly using the return
keyword, but if it is not done,
it implicitly returns the value of the last evaluated expression. For
instance, we may define a function to add two values:
add <- function(x, y) {
   z <- x + y
   return(z)
}
add(4,5)
## [1] 9
In this example we compute the sum, assign it to variable z, and explicitly return the latter.
R functions also support default values, for instance:
multiply <- function(x, y=2) {
   x*y  # implicitly return the product
}
multiply(4)
## [1] 8
multiply(4, 3)
## [1] 12
In this example we do not assign the product to a temporary variable, and we rely on implicit return.
Functions may have both side effects (such as printing and plotting), and return values.
2.2 Indexing and Named Vectors
Indexing refers to manipulating individual elements in vectors. There are three ways of indexing in R:
- Integer indexing, selecting elements by position. This is typically used to extract objects from known position (or from random position).
- Logical indexing, selecting by logical condition. Logical conditions form the basis for data processing and allow such tasks as extracting all positive elements, or all x-values where the age-value is less than 16.
- Character indexing, selecting elements by their name (given the vector has names). This can be used to create key-value lookup tables where names are the keys.
All of these play an important role for vectors, and form a basis for data manipulation (see Section Data Frames). In this section we just discuss vectors. Indexing data frames works in a largely similar fashion; two-dimensional data objects just add more options and more complexity.
2.2.1 Integer indexing
The elements can be accessed using square brackets, inside of which is the position. R indices are 1-based (the first element is at position 1, not at 0, like julia but unlike python and java). This is useful for human referencing, but less useful for complex lookups. For instance
v <- c(1,2,3)
v[1]  # 1st element
## [1] 1
v[2]  # 2nd element
## [1] 2
v[3] <- -4  # assignment
v
## [1] 1 2 -4
Negative indices mean to exclude these elements (unlike in python where it starts counting from end):
v[-2]  # 1st and 3rd element only
## [1] 1 -4
So if you want to delete elements, you can exclude these from the vector, and reassign the result to the same vector
alphabet = c("α", "β", "γ", "δ", "ε")
alphabet <- alphabet[-3]  # remove γ
alphabet
## [1] "α" "β" "δ" "ε"
One can access more than one vector element by providing more than a single index (in a form of an index vector):
alphabet[c(1,3)]  # 1st and 3rd element
## [1] "α" "δ"
alphabet[1:3]  # 1st till 3rd element
## [1] "α" "β" "δ"
alphabet[c(-1,-2)]  # everything except 1st and 2nd element
## [1] "δ" "ε"
We can also add more than one element at time:
alphabet[c(5,6)] <- c("ζ", "η")
alphabet
## [1] "α" "β" "δ" "ε" "ζ" "η"
Note that this example included assigning to the non-existing positions (5 and 6). As a result the vector was automatically lengthened to length 6. Obviously, we can also assign to existing positions and overwrite the previous values.
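One detail worth knowing is what happens when the assignment skips some positions. In the sketch below (a made-up vector), the skipped positions are filled with the missing value NA:

```r
v <- c(1, 2, 3)
v[6] <- 10     # positions 4 and 5 do not exist yet
v
## [1]  1  2  3 NA NA 10
length(v)
## [1] 6
```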
Exercise 2.4 Manipulate elements from the vector
Consider the vector alphabet above:
- extract 1st, 3rd, and 5th element
- add θ as 8th element
- extract all elements except the first and the last
See the solution
2.2.2 Logical indexing
Indexing a vector by a logical vector plays an enormously important role in R. This allows us to extract elements according to a certain condition, e.g. only positive cases, or only observations reported after a certain date. We’ll walk through logical indexing step by step using examples.
Consider a data vector
v <- c(-1, 2, -3, 4, -5, 6)
The very basic idea of logical indexing is to use a logical vector of the same length, made of TRUEs and FALSEs, to extract elements from this vector. TRUE means to retain the corresponding element and FALSE means to drop it. So if we want to extract all positive elements (i.e. elements at positions 2, 4, 6), we can proceed as follows:
i <- c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
v[i]
## [1] 2 4 6
Indeed, only the positive elements were extracted as a result.
However, it would be extremely hard to create such tailor-made indices
for long vectors. Fortunately this can be done automatically using
logical operators. Remember: logical operators are vectorized and
they work on all elements on the vector. So instead of creating the
index vector i
manually, we can write
i <- v > 0
i
## [1] FALSE TRUE FALSE TRUE FALSE TRUE
v[i]
## [1] 2 4 6
This approach achieves exactly the same result, except that we now compute i, and if you look at the code, you can easily see our intention. We can write this task in an even more compact form by not creating the helper variable i:
v[v > 0]
## [1] 2 4 6
This approach turns out to be extremely handy in all kinds of data filtering and manipulation tasks.
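A few closely related idioms, sketched on the same data vector: counting the matching elements with `sum`, finding their positions with `which`, and using a logical index on the left-hand side of an assignment to replace the matching elements.

```r
v <- c(-1, 2, -3, 4, -5, 6)
sum(v > 0)      # count the positive elements: each TRUE counts as 1
## [1] 3
which(v > 0)    # positions of the positive elements
## [1] 2 4 6
v[v < 0] <- 0   # replace all negative elements with zero
v
## [1] 0 2 0 4 0 6
```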
But we can also extract elements not just based on the values of the vector itself, but also based on the values of other vectors. Consider another example with two data vectors:
temp <- c(11,12,27,33,18)
date <- c("2020-09-15", "2020-09-16", "2020-09-17", "2020-09-18", "2020-09-19")
temp[date < "2020-09-17"]  # temperatures before Sept 17th
## [1] 11 12
temp[(date >= "2020-09-16") & (date < "2020-09-19")]  # b/w Sept 16 and 19
## [1] 12 27 33
First we extract the temperature recordings before Sept 17th, and thereafter the recordings for a date interval.
Exercise 2.5 Separate data into “good” and “bad” cases
Consider two data vectors:
result <- c(32, 33, 23, 14, 45, 33)
quality <- c("good", "good", "bad", "good", "bad", "good")
Use logical indexing to create two new vectors: one for good-quality results, the other for bad-quality results. See the solution
2.2.3 Named vectors and indexing by name
The third way to access individual elements is by their names. But before we can work with names, we have to assign names to the elements. Consider the air quality data vector:
airquality <- c(40, 50, 80, 180, 22)
names(airquality) <- c("Toronto", "Teheran", "Tokyo", "Tel Aviv", "Tacoma")
airquality
## Toronto Teheran Tokyo Tel Aviv Tacoma
## 40 50 80 180 22
This is an example of a named vector, a vector where every element has a name.
Now we can access individual elements by name:
airquality["Tacoma"]
## Tacoma
## 22
airquality[c("Toronto", "Tokyo")] <- 0
airquality
## Toronto Teheran Tokyo Tel Aviv Tacoma
## 0 50 0 180 22
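Named vectors can also serve as simple key-value lookup tables, with the names acting as keys. Here is a minimal sketch; the vectors `capital` and `visited` are invented for this illustration, and the expected results are given in the comments.

```r
# names are the keys, elements are the values
capital <- c(Estonia = "Tallinn", Chile = "Santiago", Kenya = "Nairobi")

# look up several keys at once; keys may repeat
visited <- c("Chile", "Estonia", "Chile")
found <- capital[visited]

names(found)     # "Chile" "Estonia" "Chile"
unname(found)    # "Santiago" "Tallinn" "Santiago"
```

The function `unname` drops the names if only the values are needed.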
2.4 Data frames
This section describes data frames, the central structures for holding data in memory. Here we describe bare R techniques to handle data frames. For a different and (typically) more intuitive approach, see Section 4.
2.4.1 What is data frame
A central data structure for data analysis is the data frame. It is a rectangular block of data (often numbers) that has a certain number of rows and columns. It is in many ways similar to Excel tables, except that Excel tables may be more complex than rectangular data blocks.
Data in data frames is typically laid out by observations and variables. Observations (aka cases) are rows that denote different individual objects we are investigating, such as different persons, different dates, or different geographic locations. Different columns (variables, aka features) are different measures, different bits of data we have collected about the objects we are analyzing. If the data is about people, the variables might be their gender, income, age, and health condition. In case of geographic location, the variables might be temperature, precipitation, and population size.
Below is an example dataframe that contains information about the US president G.W.Bush approval rate in fall 2001.
## date approve disapprove dontknow
## 8 2001 Oct 5-6 87 10 3
## 9 2001 Sep 21-22 90 6 4
## 10 2001 Sep 14-15 86 10 4
## 11 2001 Sep 7-10 51 39 10
## 12 2001 Aug 24-26 55 36 9
## 13 2001 Aug 16-19 57 34 9
The rows (observations) in this case are different dates, ranging from August till October 2001. The first, unnamed column, is the row number (we display only a few selected rows from a larger dataset here) and does not play a big role. The following column (variable) date is the date of the poll (more precisely the date range when the poll was conducted), and the following three columns tell how many respondents approved, disapproved, or did not have opinion about the president’s leadership. So our data frame (actually a subset from a larger one) contains 6 observations and 4 variables.
Note also that the poll dates are in the reverse order. Data frame as a data structure does not care about the order of observations, anything goes. But for a particular analysis you may prefer the data to be arranged in a certain order.
2.4.2 Workspace variables and data variables
Note that the columns of a data frame are often called variables, exactly as in Section 2.1.1. The exact meaning of variable is often clear from the context. But we distinguish between workspace variables and data variables when such a distinction is explicitly needed.
Workspace variables are the variables we create using <-
assignment
operator, and normally they live in the R workspace (and can be saved
when exiting R). For instance
x <- 2
creates a new workspace variable x
(or maybe updates an existing
one).
Data variables, in contrast, live in the data frame they are part of, and are normally not accessible from the workspace: the data frame is stored in the workspace, but the data variables are inside of the data frame. See below for how to access and assign data variables.
2.4.3 Extracting and assigning individual variables in data frames
There are three main ways to extract the individual variables from data frames:
2.4.3.1 Dollar-notation
Dollar-notation uses the construct dataframe$variable to extract variable variable from data frame dataframe. The variable name can be quoted but for simplicity it is almost always used without quotes.
For instance, if the G.W. Bush approval ratings data frame above is called approval, we can extract the variable dontknow as
approval$dontknow
## [1] 3 4 4 10 9 9
This results in a vector where individual components correspond to the
dontknow
values for different cases. It is in the same order as the
original data frame.
Dollar-notation is a good choice when working interactively, and also when writing code where you already know the variable name at the time of writing.
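Dollar-notation also works on the left-hand side of an assignment, either to overwrite an existing data variable or to create a new one. The sketch below re-creates the six rows displayed above by hand (so the row names are 1 to 6, not 8 to 13) and adds a new data variable; the name `total` is made up for this illustration.

```r
approval <- data.frame(
   date = c("2001 Oct 5-6", "2001 Sep 21-22", "2001 Sep 14-15",
            "2001 Sep 7-10", "2001 Aug 24-26", "2001 Aug 16-19"),
   approve = c(87, 90, 86, 51, 55, 57),
   disapprove = c(10, 6, 10, 39, 36, 34),
   dontknow = c(3, 4, 4, 10, 9, 9)
)

# create a new data variable: the three percentages added together
approval$total <- approval$approve + approval$disapprove + approval$dontknow
approval$total
## [1] 100 100 100 100 100 100
names(approval)
## [1] "date"       "approve"    "disapprove" "dontknow"   "total"
```

Note that the computation is vectorized: the sum is computed for all rows in one go.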
2.4.3.2 Matrix-style indexing
Matrix-style indexing treats data frames as matrices with two indices: one for rows and another for columns. We can leave the row index empty and use column names:
approval[,"dontknow"]
## [1] 3 4 4 10 9 9
Note two important bits of syntax here:
* the comma before "dontknow" signals that we talk about columns. Leaving out the comma will result in a single-column data frame instead.
* "dontknow" must be quoted: it must be a valid character string.
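The difference between the two forms can be checked directly. The following sketch assumes an ordinary base-R data frame (tibbles behave differently here) and uses a cut-down, hand-made version of the approval data with the first three rows and two columns only. The argument `drop = FALSE` is the base-R way to keep the data frame structure even with the comma.

```r
approval <- data.frame(approve = c(87, 90, 86),
                       dontknow = c(3, 4, 4))

asVector  <- approval[, "dontknow"]                 # with comma: a plain vector
asFrame   <- approval["dontknow"]                   # no comma: single-column data frame
keptFrame <- approval[, "dontknow", drop = FALSE]   # comma, but keep the data frame

is.data.frame(asVector)
## [1] FALSE
is.data.frame(asFrame)
## [1] TRUE
is.data.frame(keptFrame)
## [1] TRUE
```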
This construct is less convenient to use interactively, but comes in handy if one wants to determine the variable name later. For instance, we can write
whichVar <- "dontknow"
approval[,whichVar]
## [1] 3 4 4 10 9 9
In this example the (workspace) variable whichVar contains the name of the (data) variable dontknow, and hence R returns the content of that data variable.
Matrix-style indexing allows us to extract more than a single variable in one go, e.g. variables approve and disapprove:
approval[,c("approve", "disapprove")]
## approve disapprove
## 8 87 10
## 9 90 6
## 10 86 10
## 11 51 39
## 12 55 36
## 13 57 34
As two variables cannot be extracted as a single vector, the result is a sub-dataframe with these two variables.
2.4.3.3 List-style indexing
Data frames are internally lists, and hence most of the list properties carry over to data frames too. One of the more useful ones is list-style indexing with double brackets [[ ]]. We can extract a single variable from a data frame using
approval[["dontknow"]]
## [1] 3 4 4 10 9 9
This gives us a single vector, exactly as the matrix-style indexing does when requesting a single variable. We can also extract a sub-dataframe with only selected variables in the same way as extracting selected list components:
approval[c("approve", "disapprove")]
## approve disapprove
## 8 87 10
## 9 90 6
## 10 86 10
## 11 51 39
## 12 55 36
## 13 57 34
Matrix- and list-style indexing also works with logical indexing, e.g. one can write expressions like
approval[approval$approve > 60, c("approve", "disapprove")]
## approve disapprove
## 8 87 10
## 9 90 6
## 10 86 10
for matrix-style, and
approval[c("approve", "disapprove")][approval$approve > 60,]
## approve disapprove
## 8 87 10
## 9 90 6
## 10 86 10
for list-style indexing.
The existence of such a plethora of indexing systems is quite confusing for beginners. It is advisable to start with a single approach and stay with it until you feel reasonably confident. Only then explore the other options.
It should be noted that the dplyr and data.table packages introduce even more ways to access the data frame elements.
2.4.4 Loading data
It is useful to be able to create data frames manually–but the usage is normally limited to debugging and testing purposes. We almost always load data from files or databases.
2.4.4.1 CSV files
CSV files are one of the most popular ways to store data that can be transformed into data frames. CSV stands for “comma-separated values”. These are simple text files where the values are separated either by commas or by another consistent separator. For instance, a few lines from the HADCrut global temperature data, stored as csv, look like
year,anomaly,lower2.5,upper2.5
1850,-0.41765878,-0.589203,-0.24611452
1851,-0.2333498,-0.41186792,-0.054831687
1852,-0.22939907,-0.40938243,-0.04941572
1853,-0.27035445,-0.43000934,-0.110699534
1854,-0.29163003,-0.43282393,-0.15043613
...
The example displays the first six lines of the file, consisting of a header that contains the variable names, and five lines of data. Each line contains the data values (year, anomaly, …), and in each line the values are separated by commas. While comma is perhaps the most popular csv separator, there are many other common ones, e.g. the “tab” character (often coded as "\t"), semicolon, pipe (|) and others. Note that the file is still called “csv file”, even if it is separated with something else, not with comma.
2.4.4.2 Loading CSV files into data frames
There are a variety of ways to load csv files into data frames. Here we use the function read_delim() from the tidyverse library [1]. As an example, let’s load the same HADCrut global temperature data, see the example above:
library(tidyverse)
hadcrut <- read_delim("../data/hadcrut-5.0.1.0-annual.csv.bz2")
dim(hadcrut)
## [1] 173 4
head(hadcrut, 5)
## # A tibble: 5 × 4
## year anomaly lower2.5 upper2.5
## <dbl> <dbl> <dbl> <dbl>
## 1 1850 -0.418 -0.589 -0.246
## 2 1851 -0.233 -0.412 -0.0548
## 3 1852 -0.229 -0.409 -0.0494
## 4 1853 -0.270 -0.430 -0.111
## 5 1854 -0.292 -0.433 -0.150
This displays the same data as the CSV example above, just now as a data frame, not as csv text. You can compare the values and see that they are the same, except that the standard way of displaying data frames rounds the values to three significant digits and includes some additional information, such as row numbers and variable types. Note also that the commas are gone: commas were just column markers, not values.
read_delim() is convenient because it will detect the separator itself. However, being part of the tidyverse-world, it requires using an additional package, and it returns the tibble-flavor of the data frame. Alternatively, one may use the base-R loading functions read.csv() and read.delim(). But these functions do not detect the separator themselves: by default, the former reads comma-separated files and the latter tab-separated files. What happens if you get it wrong? For instance, what happens if we assume that the temperature data is stored as a tab-separated file and load it using read.delim()? Here is the result:
d <- read.delim("../data/hadcrut-5.0.1.0-annual.csv.bz2")
dim(d)
## [1] 173 1
names(d)
## [1] "year.anomaly.lower2.5.upper2.5"
head(d, 5)
## year.anomaly.lower2.5.upper2.5
## 1 1850,-0.41765878,-0.589203,-0.24611452
## 2 1851,-0.2333498,-0.41186792,-0.054831687
## 3 1852,-0.22939907,-0.40938243,-0.04941572
## 4 1853,-0.27035445,-0.43000934,-0.110699534
## 5 1854,-0.29163003,-0.43282393,-0.15043613
We can immediately see multiple problems: first, the data frame now
contains only a single column. This is because read.delim()
looks
for tab-symbols to separate the columns, and unable to find any in the
csv lines, it assumes that all these belong to a single column.
Second, the column has a weird name that is a combination of all
individual names. Finally, the data itself is also all mixed up. But
the data values give a clear hint that we got the separator
wrong–each value consists of plausible values separated by commas.
As read.delim()
looks for tab symbols, it assumes that commas are
part of the column and hence they are visible in the values.
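The same effect, and two ways to fix it, can be reproduced without any file on disk. This is a sketch only: it feeds the first few lines of the csv example above to the base-R readers through their `text` argument instead of a file name, and the variable names are made up for this illustration.

```r
csv <- "year,anomaly,lower2.5,upper2.5
1850,-0.41765878,-0.589203,-0.24611452
1851,-0.2333498,-0.41186792,-0.054831687
1852,-0.22939907,-0.40938243,-0.04941572"

wrong <- read.delim(text = csv)          # expects tabs: everything ends up in one column
dim(wrong)
## [1] 3 1

right <- read.csv(text = csv)            # expects commas
dim(right)
## [1] 3 4

alsoRight <- read.delim(text = csv, sep = ",")   # or state the separator explicitly
dim(alsoRight)
## [1] 3 4
```

So if the separator is known, both base-R functions can read the data; they just do not guess it.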
It is a common source of confusion for beginners: the functions read_delim() and read.delim() look and behave in a very similar manner, but only the former can automatically detect the correct separator.
[1] More specifically, it is a function in the readr package, which is itself a component of tidyverse.