Chapter 2 R: Programming Language and a Statistical System

R is a popular programming language and a statistical/data-analysis frontend. It is widely used for statistical tasks in the social and biological sciences and in data science. R attempts to strike a balance between being easy to use and providing universal and powerful tools for analysis. On a scale between simplicity and power, it is more complex and powerful than certain other environments, such as Excel or Stata, but easier to use than more programming-oriented environments, such as Python or Java. R is dynamically typed, which makes it easy to use for quick scripting.

Base R is a reasonably powerful language that contains a large number of tools suitable for computing and data analysis. It also has flexible tools to handle and program on the language itself, including classes and environments. Base R is not very hard to learn (but harder than e.g. base Python). R has a rich infrastructure of libraries. While the base language is mostly consistent, the libraries follow different styles and are of varying quality. However, the more popular ones are almost compulsory for effective R usage.

R can be used in three main ways:

One can write R programs and thereafter run those on the command line as batch files. This is useful if you use R as a backend, in makefiles, or in order to run tasks that take a long time. Modern IDEs typically also support such execution with a single click.
An alternative and perhaps the most popular way to use R is through the interactive console. One can open the console by running R on the command line, but nowadays the most popular way is to use the built-in console in IDEs, such as RStudio. IDEs offer a lot of additional functionality, such as syntax coloring, debugging and nice data editors.
Finally, another popular usage of R is in literate programming. Literate programming means mixing written text with computer code. When the document is compiled, the code is replaced with its output, and hence one can create reports that can be easily updated as new data arrive. Perhaps the most popular literate programming environment is rmarkdown, but there are other options. See more in Section RMarkdown.

R can also be used in Jupyter notebooks, although this is a less common usage.

Needless to say, a chunk of R code runs in a similar way whichever approach you end up using, so in terms of the language there is nothing additional to learn if you switch from one context to another.

2.1 Base language

The base R language is in many ways similar to other traditional programming languages, such as Java or Python. It is relatively simple and dynamically typed. It is designed more to be a good interactive environment than a fast and efficient language. This makes it a very suitable choice for interactive data analysis and numerical computing. However, R is less popular as an industrial backend tool.

2.1.1 Variables and assignment

Variables in R are names for in-memory data storage, just as in other traditional languages. As in Python, but unlike in Java or C, the variables do not have to be declared, and you can change their type on the fly (it is a dynamically typed language).

Variable assignment is normally done using the left-assignment operator <- (but there are at least 4 other options). The most important data types are doubles (floating-point numbers), integers, logicals, and characters (strings). The following example demonstrates assignment and all these data types:

a <- 1  # double
b <- 2L  # integer
λ <- FALSE
s <- 'text'

When creating a numeric variable, R normally treats it as a double unless it is explicitly marked as an integer with the L suffix (originating from the word ‘long’). Note that R supports non-ASCII letters in variable names, as visible with the variable λ (which letters are accepted may depend on the locale). We can query the data type of a variable with storage.mode (the related function class reports "numeric" for doubles):

storage.mode(a)

## [1] "double"

storage.mode(b)

## [1] "integer"

If needed, one type can be explicitly cast into another type:

as.integer(λ)  # convert to integer

## [1] 0

as.character(a)  # convert to string

## [1] "1"

One can see that FALSE is converted to zero as integer.

2.1.2 Atomic variables are vectors

One of the most distinct traits of the R language is that all elementary (called atomic) variables are vectors. This means that the variables we defined above are not just single values but can contain more than a single value. All base operations on atomic variables are vectorized, i.e. they operate on all the values in one go. This is best demonstrated by examples.

We can create vectors by the concatenation function c:

v <- c(1, 2, 3)  # three doubles in a vector
v

## [1] 1 2 3

If we do simple operations with v, the operation will be performed on all elements of the vector:

v + 1

## [1] 2 3 4

2*v

## [1] 2 4 6

as.character(v)

## [1] "1" "2" "3"

Note that we did not use an explicit loop to repeat the same operation on all elements. The operators +, *, and the function as.character are already vectorized, i.e. they do the necessary loops internally.

Another popular way to create vectors is to use the colon sequence operator:

1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

This is perhaps the most popular way to run a loop a pre-determined number of times. Note also that : is a shortcut for the seq function; the latter allows many more options.
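
As a brief illustration of those extra options, the sketch below uses the arguments by and length.out of seq, and the related helper seq_len. All of these are part of base R; the particular values are chosen for illustration only.

```r
# step of 2 instead of the default 1
odd <- seq(1, 10, by = 2)
odd    # 1 3 5 7 9

# five equally spaced values between 0 and 1
grid <- seq(0, 1, length.out = 5)
grid   # 0.00 0.25 0.50 0.75 1.00

# seq_len(n) is a safer alternative to 1:n when n may be zero
n <- 0
length(1:n)          # 2, because 1:0 counts down: 1 0
length(seq_len(n))   # 0, so a loop over it does not run at all
```

The last two lines matter when the upper limit of a loop is computed from data: a loop over 1:n runs twice when n is zero, which is rarely what one wants.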

Another popular way to create and manipulate vectors is to add more components to an existing vector using the same concatenation function c. Here is an example:

people <- "Feliciano"
people <- c(people, "Lynn")
people <- c("Matías", people)
people

## [1] "Matías"    "Feliciano" "Lynn"

In this example, the variable (vector) people first only contains one name, “Feliciano”. Next we concatenate “Lynn” to the end of it, and thereafter “Matías” in front of it.

Vectorized operations are very fast and efficient, and one should always consider vectorizing the code if possible. There are many more ways to work with and manipulate vectors, see Section Indexing and Named Vectors.

2.1.3 Mathematical, logical and other operators

The mathematical operators are (mostly) traditional: +, -, *, / for addition, subtraction, multiplication and division, and ^ for exponentiation. Other useful mathematical operations are %/% for integer division, and %% for modulo:

7 %/% 2  # 3

## [1] 3

7 %% 2  # 1

## [1] 1

Comparison operators work mostly as expected too, in particular >, <, >=, and <=. As in several other languages, equality is tested with double equal signs ==. Inequality can be tested with !=, and logical negation is !:

a <- 1
a > 1  # FALSE

## [1] FALSE

a >= 1  # TRUE

## [1] TRUE

a == 1  # TRUE

## [1] TRUE

a != 1  # FALSE

## [1] FALSE

!(a == 1)  # FALSE

## [1] FALSE

The logical and is &, logical or is | (note: single & and a single |!), and logical not is !. All these operators are vectorized.
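
To see what vectorized logical operators do, here is a small sketch on a three-element vector. The vector x and its values are made up for illustration. The double operators && and || shown at the end are also part of base R; they are intended for single values, such as the condition of an if statement.

```r
x <- c(1, 5, 10)

both <- x > 2 & x < 8     # FALSE  TRUE FALSE: both conditions must hold
either <- x < 2 | x > 8   #  TRUE FALSE  TRUE: at least one condition holds
neither <- !(x > 2)       #  TRUE FALSE FALSE: negation, element by element

# && and || work on single values; the right-hand side is only
# evaluated if it is needed to decide the result
if (length(x) > 0 && x[1] == 1) {
   cat("x starts with one\n")
}
```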

Exercise 2.1 Create vector 1, 4, 9, 16, 25 in two ways:

using sequence and mathematical operations
using c function

See the solution

2.1.4 Control structures

The basic control structures in R are fairly similar to those in Python or Java. These include conditional execution with if and else, looping with for and while, leaving a loop with break, and skipping to the next iteration with next (R has no continue keyword).

Let us demonstrate this with a simple example:

for(i in 1:10) {
   cat(i, "\n")
   if(i > 3) {
      cat("too much\n")
      break
   }
}

## 1 
## 2 
## 3 
## 4 
## too much

The for-loop wants to run 10 times, each time assigning the next consecutive integer to i. cat is just the R way of printing. However, if i is more than 3, the code prints “too much” and terminates the loop. Note that the loop content and the if-block are enclosed in curly braces {..}, exactly like in Java (you can leave out the curly braces if the block contains just a single expression). Unlike in Python, indentation does not play any role from the syntax point of view.
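
The example above uses for, if and break. The two remaining constructs, next and while, can be sketched in the same spirit. The numbers are arbitrary and only serve as an illustration.

```r
# 'next' skips the rest of the current iteration
for (i in 1:6) {
   if (i %% 2 == 0) {
      next  # skip even numbers
   }
   cat(i, "\n")
}
# prints 1, 3 and 5, each on its own line

# 'while' repeats the block as long as the condition holds
n <- 1
while (n < 100) {
   n <- 2*n
}
n  # 128, the first power of two that is not below 100
```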

As R does not require a distinct statement-ending marker, such as ; in Java, it has to figure out itself where statements and code blocks end: a statement ends at the end of a line as soon as it is syntactically complete. In particular, else should be on the same line where the if block ends. For instance, the following code always works:

for(i in 1:10) {
   cat(i, " ")  # note: no newline here
   if(i %% 2 == 0) {
      cat("even\n")
   } else {  # 'else' on the same line where if block ends with '}'
      cat("odd\n")
   }
}

## 1  odd
## 2  even
## 3  odd
## 4  even
## 5  odd
## 6  even
## 7  odd
## 8  even
## 9  odd
## 10  even

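If you put else on a separate line, R may stop with an “unexpected ‘else’” error. This happens when the if statement is at the top level, i.e. not inside another pair of curly braces: R considers the if statement complete at the end of its line, and the following else has nothing to attach to.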
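
One way to see this behavior without breaking a script is to hand the code to the parser as text. The sketch below uses the base R functions parse and tryCatch; the variable names broken, wrapped, result and ok are just for illustration.

```r
# At top level the first line is already a complete statement,
# so the 'else' on the second line cannot be attached to it.
broken <- "if (TRUE) 1\nelse 2"
result <- tryCatch(parse(text = broken), error = function(e) e)
inherits(result, "error")   # TRUE: the parser complains about 'else'

# Wrapped in curly braces the same two lines are fine, because R
# keeps reading until the closing brace.
wrapped <- paste0("{\n", broken, "\n}")
ok <- parse(text = wrapped)
length(ok)   # 1: a single, complete expression
```

This is also why else on its own line usually works inside functions and loops, where the surrounding braces are still open.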

Exercise 2.2 Can you afford going out with friends? Let’s find it out:

How many friends do you have? Put it into a variable
What is your budget? Put it into a variable
Print a message I am going out with X friends where X is your number of friends. Hint: use cat function like cat("I am going out with", X, "friends\n")
What does the meal cost? Put it in a variable
Compute total meal price for your whole company. Do not forget to buy a meal for yourself too!
Add 15% tip to the total price
Print either can afford or cannot afford, depending on if the total cost exceeds/does not exceed the budget

See the solution

2.1.4.1 Creating vectors in loop

Quite often we need to compute a value for every element in a collection, and store all the results in a single vector (or list). For instance, one may want to see how many observations there are in a number of data files, or how many ingredients there are in different recipes. A common solution in such a case is the following: first create an empty vector, and thereafter loop over the collection and append the computed values to it one-by-one. For instance, here is code that creates a vector of squares of numbers:

squares = NULL
for(i in 1:10)
   squares <- c(squares, i^2)
squares

##  [1]   1   4   9  16  25  36  49  64  81 100

This is a handy and frequently used algorithm. It is not particularly efficient though, and becomes very slow if the collection is large. The problem is that vectors are created with a fixed length, so there is no spare room for new elements. Each time you append an element, the computer has to allocate space for a new, longer vector and copy the existing data into the new location. But it works well for small vectors.
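
For larger collections, two common alternatives avoid the repeated copying. The sketch below shows both for the same squares example: filling a vector that is allocated in advance with the base R function numeric, and using a vectorized expression with no loop at all. The name squares2 is only used to keep the two results apart.

```r
n <- 10

# Option 1: allocate the full vector first, then fill it in by position
squares <- numeric(n)   # a vector of n zeros
for (i in 1:n) {
   squares[i] <- i^2
}
squares

# Option 2: no loop at all, as '^' is vectorized
squares2 <- (1:n)^2
squares2
```

Both give the same ten values as the loop with c above. The second form is possible here because the computation is a simple vectorized operation; the first form also works when each value needs more complex code.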

2.1.5 Strings and printing

Strings in R (called character) can be constructed in traditional ways, using either single or double quotes:

a <- "what"
b <- 'is'

Both of these are equivalent. The former is useful for creating a string that contains a single quote like a <- "what's", and the latter is better if you want to include double quotes (and you also save one press of shift key 😄).

Strings can be concatenated with paste or paste0 function. The former leaves a space between the strings, the latter does not:

paste(a, b)

## [1] "what is"

paste0(a, b)

## [1] "whatis"

(For paste you can also specify the desired separator using the sep argument.)

One can concatenate numbers and strings in a similar fashion, numbers are automatically converted to strings:

a <- 3
paste0("a=", a)

## [1] "a=3"

paste("1/a =", 1/a)

## [1] "1/a = 0.333333333333333"

The latter example may give us undesired precision. A solution is to round the numbers before printing:

paste("1/a =", round(1/a, 3))

## [1] "1/a = 0.333"

A more powerful way to format numbers is the formatC function that uses C-style formatting strings. Base R contains a large number of string functions, and there are more in add-on packages.
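
As a small sketch of what such formatting looks like, the lines below use formatC with its digits, format, width and flag arguments, and the base R function sprintf, which follows the same C conventions. The variable names are made up for illustration.

```r
a <- 3
fixed <- formatC(1/a, digits = 3, format = "f")
fixed    # "0.333": fixed notation with 3 decimals

padded <- formatC(7L, width = 3, flag = "0")
padded   # "007": integer padded with zeros to width 3

viaSprintf <- sprintf("%.3f", 1/a)
viaSprintf   # "0.333": the same formatting through sprintf

paste("1/a =", fixed)   # "1/a = 0.333"
```

Unlike round, these functions return strings, so the result is meant for printing, not for further computation.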

Exercise 2.3 Create a string height is 5’3”. Hint: paste two strings you create using single/double quotes

See the solution

R has two main printing functions: cat and print. cat is the standard way of printing messages. It prints as many arguments as needed, and is suitable for fitting everything on a single line. print only prints a single object at a time and is better suited for printing more complex objects that need several lines. This includes model summaries, data frames, and functions. Note that cat does not add a newline at the end, so you have to do it yourself if this is needed. A typical printing looks like

result <- sqrt(2)
cat("The result is", result, "\n")

## The result is 1.414214

If working in an interactive session, or using rmarkdown, R also automatically prints the results that are not saved in variables. So instead of the above, you may just do

sqrt(2)

## [1] 1.414214

However, this does not work in all contexts, for instance inside loops and functions, or when a script file is run with source(). This often confuses beginners as the output that was there a second ago is suddenly gone with no obvious error message. But even when the output remains visible, printing results without any explanatory messages is not a good habit when writing anything longer than a few lines of code, as it is hard to understand what the result means. In longer and more complex code it is advisable to stay with explicit printing.
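
A loop is perhaps the easiest place to see the difference. In the sketch below the first loop computes the squares but shows nothing, while the second one prints explicitly with cat.

```r
# Inside a loop the values are computed but not shown
for (i in 1:3) {
   i^2
}

# With explicit printing the output appears in every context
for (i in 1:3) {
   cat("square of", i, "is", i^2, "\n")
}
# prints three lines: "square of 1 is 1", "square of 2 is 4", "square of 3 is 9"
```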

2.1.6 Functions

Functions in R behave very much like in other traditional programming languages. Functions are objects, created with the keyword function and are normally assigned to variables (which are then called “functions”). Functions always return a value, this may be done explicitly using the return keyword, but if it is not done, it implicitly returns the value of the last evaluated expression. For instance, we may define a function to add two values:

add <- function(x, y) {
   z <- x + y
   return(z)
}
add(4,5)

## [1] 9

In this example we compute the sum, assign it to variable z, and explicitly return the latter.

R functions also support default values, for instance:

multiply <- function(x, y=2) {
   x*y  # implicitly return the product
}
multiply(4)

## [1] 8

multiply(4, 3)

## [1] 12

In this example we do not assign the product to a temporary variable, and we rely on implicit return.

Functions may have both side effects (such as printing and plotting), and return values.
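
Here is a small sketch of a function that does both. The function report and its arguments are hypothetical names chosen for this illustration. The example also shows that arguments can be passed by name, in which case their order does not matter.

```r
report <- function(x, label = "value") {
   cat(label, "is", x, "\n")  # side effect: print a message
   x^2                        # return value: the square of x
}

y <- report(3)   # prints "value is 3"
y                # 9

z <- report(label = "side", x = 4)  # prints "side is 4"
z                # 16
```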

2.1.7 Categorical variables

TBD

2.2 Indexing and Named Vectors

Indexing refers to manipulating individual elements in vectors. There are three ways of indexing in R:

Integer indexing, selecting elements by position. This is typically used to extract objects from known position (or from random position).
Logical indexing, selecting by logical condition. Logical conditions form the basis for data processing and allow such tasks as extracting all positive elements, or all x-values where age-value is less than 16.
Character indexing, selecting elements by their name (given the vector has names). This can be used to create key-value lookup tables where names are the keys.

All of these play an important role for vectors, and form a basis for data manipulation (see Section Data Frames). In this section we just discuss vectors. Indexing data frames works in a largely similar fashion; two-dimensional data objects just add more options and more complexity.

2.2.1 Integer indexing

The elements can be accessed using square brackets, inside of which is the position. R indices are 1-based: the first element is at position 1, not at 0 (as in Julia, but unlike in Python and Java). This is useful for human referencing, but less useful for complex lookups. For instance

v <- c(1,2,3)
v[1]  # 1st element

## [1] 1

v[2]  # 2nd element

## [1] 2

v[3] <- -4  # assignment
v

## [1]  1  2 -4

Negative indices mean to exclude these elements (unlike in Python, where negative indices count from the end):

v[-2]  # 1st and 3rd element only

## [1]  1 -4

So if you want to delete elements, you can exclude these from the vector, and reassign the result to the same vector:

alphabet = c("α", "β", "γ", "δ", "ε")
alphabet <- alphabet[-3]  # remove γ
alphabet

## [1] "α" "β" "δ" "ε"

One can access more than one vector element by providing more than a single index (in a form of an index vector):

alphabet[c(1,3)]  # 1st and 3rd element

## [1] "α" "δ"

alphabet[1:3]  # 1st till 3rd element

## [1] "α" "β" "δ"

alphabet[c(-1,-2)]  # everything except 1st and 2nd element

## [1] "δ" "ε"

We can also add more than one element at time:

alphabet[c(5,6)] <- c("ζ", "η")
alphabet

## [1] "α" "β" "δ" "ε" "ζ" "η"

Note that this example included assigning to the non-existing positions (5 and 6). As a result the vector was automatically lengthened to length 6. Obviously, we can also assign to existing positions and overwrite the previous values.

Exercise 2.4 Manipulate elements from the vector

Consider the vector alphabet above:

extract 1st, 3rd, and 5th element
add θ as 8th element
extract all elements except the first and the last

See the solution

2.2.2 Logical indexing

Indexing a vector by a logical vector plays an enormously important role in R. This allows us to extract elements according to a certain condition, e.g. only positive cases, or only observations reported after a certain date. We’ll walk through logical indexing step-by-step through examples.

Consider a data vector

v <- c(-1, 2, -3, 4, -5, 6)

The very basic idea of logical indexing is to use a logical vector of the same length, consisting of TRUE-s and FALSE-s, to extract elements from this vector. TRUE means to retain the corresponding element and FALSE means to drop it. So if we want to extract all positive elements (i.e. elements at positions 2, 4, 6), we can proceed as follows:

i <- c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
v[i]

## [1] 2 4 6

Indeed, only the positive elements were extracted as a result.

However, it would be extremely hard to create such tailor-made indices for long vectors. Fortunately this can be done automatically using logical operators. Remember: logical operators are vectorized and they work on all elements of the vector. So instead of creating the index vector i manually, we can write

i <- v > 0
i

## [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

v[i]

## [1] 2 4 6

This approach achieves exactly the same result, except that we now compute i, and if you look at the code, you can easily see our intention. We can write this task in an even more compact form by not creating the helper variable i:

v[v > 0]

## [1] 2 4 6

This approach turns out to be extremely handy in all kinds of data filtering and manipulation tasks.
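
Logical vectors are useful for more than extraction. The sketch below applies the base R functions which, sum and mean to the same condition, and combines two conditions with &. The variable names positions, count, share and small are made up for illustration.

```r
v <- c(-1, 2, -3, 4, -5, 6)

positions <- which(v > 0)
positions   # 2 4 6: positions of the positive elements

count <- sum(v > 0)
count       # 3: TRUE counts as 1 and FALSE as 0

share <- mean(v > 0)
share       # 0.5: share of positive elements

small <- v[v > 0 & v < 5]
small       # 2 4: positive elements that are less than 5
```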

But we can also extract elements not just based on the values of the vector itself, but also based on the values of other vectors. Consider another example with two data vectors:

temp <- c(11,12,27,33,18)
date <- c("2020-09-15", "2020-09-16", "2020-09-17", "2020-09-18", "2020-09-19")
temp[date < "2020-09-17"]  # temperatures before Sept 17th

## [1] 11 12

temp[(date >= "2020-09-16") & (date < "2020-09-19")]  # b/w Sept 16 and 19

## [1] 12 27 33

First we extract the temperature recordings before Sept 17th, and thereafter the recordings for a date interval.
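
Note that the dates above are plain strings. The comparison works because dates written as year-month-day sort alphabetically in the same order as chronologically. A more robust approach, sketched below as an alternative that is not part of the example above, is to convert the strings to the Date class with the base R function as.Date.

```r
temp <- c(11, 12, 27, 33, 18)
date <- as.Date(c("2020-09-15", "2020-09-16", "2020-09-17",
                  "2020-09-18", "2020-09-19"))

early <- temp[date < as.Date("2020-09-17")]
early   # 11 12, as with the string comparison

# Date objects also support arithmetic, which strings do not
lastTwo <- temp[date >= max(date) - 1]
lastTwo   # 33 18: the last day and the day before it
```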

Exercise 2.5 Separate data into “good” and “bad” cases

Consider two data vectors:

result <- c(32, 33, 23, 14, 45, 33)
quality <- c("good", "good", "bad", "good", "bad", "good")

Use logical indexing to create two new vectors: one for good-quality results, the other for bad-quality results. See the solution

2.2.3 Named vectors and indexing by name

The third way to access individual elements is by their names. But before we can work with names, we have to assign names to the elements. Consider the air quality data vector:

airquality <- c(40, 50, 80, 180, 22)
names(airquality) <- c("Toronto", "Teheran", "Tokyo", "Tel Aviv", "Tacoma")
airquality

##  Toronto  Teheran    Tokyo Tel Aviv   Tacoma 
##       40       50       80      180       22

This is an example of a named vector, a vector where every element has a name.

Now we can access individual elements by name:

airquality["Tacoma"]

## Tacoma 
##     22

airquality[c("Toronto", "Tokyo")] <- 0
airquality

##  Toronto  Teheran    Tokyo Tel Aviv   Tacoma 
##        0       50        0      180       22
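
Named vectors can also serve as simple key-value lookup tables, where the names are the keys. The sketch below uses a hypothetical table fullName of state abbreviations; the names inside c() are given directly as name = value pairs, which is an alternative to the names function used above.

```r
# hypothetical lookup table: names are the keys
fullName <- c(WA = "Washington", OR = "Oregon", CA = "California")

codes <- c("CA", "WA", "CA")
fullName[codes]           # one value per key, keys may repeat

translated <- unname(fullName[codes])
translated                # the same values without the names

names(fullName)           # "WA" "OR" "CA": all keys
```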

2.3 Lists

TBD

2.4 Data frames

This section describes data frames, the central structures for holding data in memory. Here we describe bare R techniques to handle data frames. For a different and (typically) more intuitive approach, see Section 4.

2.4.1 What is data frame

A central data structure for data analysis is the data frame. It is a rectangular block of data (often numbers) that has a certain number of rows and columns. It is in many ways similar to Excel tables, except that Excel tables may be more complex than rectangular data blocks.

Data in data frames is typically laid out by observations and variables. Observations (aka cases) are rows that denote different individual objects we are investigating, such as different persons, different dates, or different geographic locations. Different columns (variables, aka features) are different measures, different bits of data we have collected about the objects we are analyzing. If the data is about people, the variables might be their gender, income, age, and health condition. In case of geographic location, the variables might be temperature, precipitation, and population size.

Below is an example data frame that contains information about the approval rate of the US president G.W. Bush in fall 2001.

##              date approve disapprove dontknow
## 8    2001 Oct 5-6      87         10        3
## 9  2001 Sep 21-22      90          6        4
## 10 2001 Sep 14-15      86         10        4
## 11  2001 Sep 7-10      51         39       10
## 12 2001 Aug 24-26      55         36        9
## 13 2001 Aug 16-19      57         34        9

The rows (observations) in this case are different dates, ranging from August till October 2001. The first, unnamed column is the row number (we display only a few selected rows from a larger dataset here) and does not play a big role. The next column (variable), date, is the date of the poll (more precisely, the date range when the poll was conducted), and the remaining three columns tell what percentage of respondents approved, disapproved, or did not have an opinion about the president’s leadership (each row adds up to 100). So our data frame (actually a subset of a larger one) contains 6 observations and 4 variables.

Note also that the poll dates are in reverse order. A data frame as a data structure does not care about the order of observations, anything goes. But for a particular analysis you may prefer the data to be arranged in a certain order.

2.4.2 Workspace variables and data variables

Note that the columns of a data frame are often called variables, exactly like the variables in Section 2.1.1. The exact meaning of variable is often clear from the context. But we distinguish between workspace variables and data variables when such a distinction is explicitly needed.

Workspace variables are the variables we create using <- assignment operator, and normally they live in the R workspace (and can be saved when exiting R). For instance

x <- 2

creates a new workspace variable x (or maybe updates an existing one). Data variables, in contrast, live in the data frame they are part of, and are normally not accessible from the workspace: the data frame is stored in the workspace, but the data variables are inside of the data frame. See below for how to access and assign data variables.
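
The distinction can be sketched with a small hand-made data frame. The names df, heightCm and weightKg are hypothetical and chosen so that they are unlikely to exist in the workspace already; exists is a base R function that tells whether a variable of a given name can be found.

```r
# a small data frame with two data variables
df <- data.frame(heightCm = c(160, 175), weightKg = c(55, 80))

exists("df")         # TRUE: the data frame is a workspace variable
exists("heightCm")   # FALSE: 'heightCm' lives inside of df
df$heightCm          # 160 175: accessed through the data frame
```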

2.4.3 Extracting and assigning individual variables in data frames

There are three main ways to extract the individual variables from data frames:

2.4.3.1 Dollar-notation

Dollar-notation uses the construct dataframe$variable to extract variable variable from data frame dataframe. The variable name can be quoted but for simplicity it is almost always used without quotes. For instance, if the G.W. Bush approval ratings data frame above is called approval, we can extract the variable dontknow as

approval$dontknow

## [1]  3  4  4 10  9  9

This results in a vector where individual components correspond to the dontknow values for different cases. It is in the same order as the original data frame.

Dollar-notation is a good choice when working interactively. It also works well in scripts, as long as you know the variable name at the time you write the code.
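
Dollar-notation works for assignment too. The sketch below first types in the six rows shown above by hand (an assumption made to keep the example self-contained) and then creates a new data variable and overwrites an existing one. The variable name net is made up for illustration.

```r
# the six rows of the approval data, typed in by hand
approval <- data.frame(
   date = c("2001 Oct 5-6", "2001 Sep 21-22", "2001 Sep 14-15",
            "2001 Sep 7-10", "2001 Aug 24-26", "2001 Aug 16-19"),
   approve = c(87, 90, 86, 51, 55, 57),
   disapprove = c(10, 6, 10, 39, 36, 34),
   dontknow = c(3, 4, 4, 10, 9, 9),
   row.names = 8:13
)

# create a new data variable: net approval
approval$net <- approval$approve - approval$disapprove
approval$net   # 77 84 76 12 19 23

# overwrite an existing data variable: percentages to shares
approval$dontknow <- approval$dontknow/100
approval$dontknow
```

Assigning to a name that does not exist yet adds a new column, while assigning to an existing name replaces its content.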

2.4.3.2 Matrix-style indexing

Matrix-style indexing treats data frames as matrices with two indices–one for rows and another for columns. We can leave the row index empty and use column names:

approval[,"dontknow"]

## [1]  3  4  4 10  9  9

Note two important bits of syntax here:

the comma before "dontknow" signals that we talk about columns. Leaving out the comma will result in a single-column data frame instead.
"dontknow" must be quoted: it must be a character string (or a variable that contains one).

This construct is less convenient to use interactively, but comes in handy if one wants to determine the variable name later. For instance, we can write

whichVar <- "dontknow"
approval[,whichVar]

## [1]  3  4  4 10  9  9

In this example the (workspace) variable whichVar contains the name of the (data) variable dontknow, and hence R returns the content of that data variable.

Matrix-style indexing allows us to extract more than a single variable in one go, e.g. variables approve and disapprove:

approval[,c("approve", "disapprove")]

##    approve disapprove
## 8       87         10
## 9       90          6
## 10      86         10
## 11      51         39
## 12      55         36
## 13      57         34

As two variables cannot be extracted as a single vector, the result is a sub-dataframe with these two variables.
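
When a single variable is requested, matrix-style indexing of a base R data frame returns a vector by default. If a one-column data frame is wanted instead, one can add the argument drop = FALSE. The sketch below demonstrates this on a shortened, hand-typed version of the approval data; the names asVector and asFrame are made up for illustration.

```r
# first three rows of two variables, typed in by hand
approval <- data.frame(approve = c(87, 90, 86), dontknow = c(3, 4, 4))

asVector <- approval[, "dontknow"]
is.data.frame(asVector)   # FALSE: a plain numeric vector

asFrame <- approval[, "dontknow", drop = FALSE]
is.data.frame(asFrame)    # TRUE: a data frame with a single column
dim(asFrame)              # 3 1
```

This is handy in code where the number of selected variables is not known in advance and the result should always be a data frame.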

2.4.3.3 List-style indexing

Data frames are internally lists of columns, and hence most of the list properties carry over to data frames too. One of the more useful ones is list-style indexing with double brackets [[ ]]. We can extract a single variable from a data frame using

approval[["dontknow"]]

## [1]  3  4  4 10  9  9

This gives us a single vector, exactly as the matrix-style indexing does when requesting a single variable. We can also extract a sub-dataframe with only selected variables in the same way as extracting selected list components, using single brackets and no comma:

approval[c("approve", "disapprove")]

##    approve disapprove
## 8       87         10
## 9       90          6
## 10      86         10
## 11      51         39
## 12      55         36
## 13      57         34

Matrix- and list-style indexing also work with logical indexing, e.g. one can write expressions like

approval[approval$approve > 60, c("approve", "disapprove")]

##    approve disapprove
## 8       87         10
## 9       90          6
## 10      86         10

for matrix-style, and

approval[c("approve", "disapprove")][approval$approve > 60,]

##    approve disapprove
## 8       87         10
## 9       90          6
## 10      86         10

for list-style indexing.

The existence of such a plethora of indexing systems is quite confusing for beginners. It is advisable to start with a single approach and stay with that until you feel reasonably confident with this system. Only then explore the other options.

It should be noted that the dplyr and data.table packages introduce even more ways to access the data frame elements.

2.4.4 Loading data

It is useful to be able to create data frames manually–but the usage is normally limited to debugging and testing purposes. We almost always load data from files or databases.

2.4.4.1 CSV files

CSV files are one of the most popular ways to store data that can be transformed into data frames. CSV stands for “comma-separated values”. These are simple text files where the values are separated either by comma or another consistent separator. For instance, a few lines from the HADCrut global temperature data, stored as csv, look like

year,anomaly,lower2.5,upper2.5
1850,-0.41765878,-0.589203,-0.24611452
1851,-0.2333498,-0.41186792,-0.054831687
1852,-0.22939907,-0.40938243,-0.04941572
1853,-0.27035445,-0.43000934,-0.110699534
1854,-0.29163003,-0.43282393,-0.15043613
...

The example displays the first six lines of the file, consisting of a header that contains the variable names, and five lines of data. Each line contains the data values (year, anomaly, …), and in each line the values are separated by commas. While comma is perhaps the most popular csv separator, there are many other common ones, e.g. the “tab” character (often coded as "\t"), semicolon, pipe (|) and others. Note that the file is still called “csv file”, even if it is separated with something else, not with comma.

2.4.4.2 Loading CSV files into data frames

There are a variety of ways to load csv files into data frames. Here we use the function read_delim() from the tidyverse library.¹ As an example, let’s load the same HADCrut global temperature data as in the example above:

library(tidyverse)
hadcrut <- read_delim("../data/hadcrut-5.0.1.0-annual.csv.bz2")
dim(hadcrut)

## [1] 173   4

head(hadcrut, 5)

## # A tibble: 5 × 4
##    year anomaly lower2.5 upper2.5
##   <dbl>   <dbl>    <dbl>    <dbl>
## 1  1850  -0.418   -0.589  -0.246 
## 2  1851  -0.233   -0.412  -0.0548
## 3  1852  -0.229   -0.409  -0.0494
## 4  1853  -0.270   -0.430  -0.111 
## 5  1854  -0.292   -0.433  -0.150

This displays the same data as the CSV example above, just now as a data frame, not as the csv text. You can compare the values and see that they are the same. The only difference is that the standard way to display data frames rounds the values to three significant digits and includes some additional information, such as row numbers and variable types. Note also that the commas are gone: commas were just column markers, not values.

read_delim() is convenient because it will detect the separator itself. However, being part of the tidyverse-world, it requires using an additional package, and it returns the tibble-flavor of the data frame. Alternatively, one may use the base-R loading functions read.csv() and read.delim(). But these functions do not detect the separator themselves: by default the former reads comma-separated files and the latter tab-separated files. What happens if you get it wrong? For instance, what happens if we assume that the temperature data is stored as a tab-separated file and load it using read.delim()? Here is the result:

d <- read.delim("../data/hadcrut-5.0.1.0-annual.csv.bz2")
dim(d)

## [1] 173   1

names(d)

## [1] "year.anomaly.lower2.5.upper2.5"

head(d, 5)

##              year.anomaly.lower2.5.upper2.5
## 1    1850,-0.41765878,-0.589203,-0.24611452
## 2  1851,-0.2333498,-0.41186792,-0.054831687
## 3  1852,-0.22939907,-0.40938243,-0.04941572
## 4 1853,-0.27035445,-0.43000934,-0.110699534
## 5  1854,-0.29163003,-0.43282393,-0.15043613

We can immediately see multiple problems: first, the data frame now contains only a single column. This is because read.delim() looks for tab-symbols to separate the columns, and unable to find any in the csv lines, it assumes that all these belong to a single column. Second, the column has a weird name that is a combination of all individual names. Finally, the data itself is also all mixed up. But the data values give a clear hint that we got the separator wrong–each value consists of plausible values separated by commas. As read.delim() looks for tab symbols, it assumes that commas are part of the column and hence they are visible in the values.

This is a common source of confusion for beginners: the functions read_delim() and read.delim() look and behave in a very similar manner, but only the former can automatically detect the correct separator.
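
With the base-R functions one can still state the separator explicitly through the sep argument. The sketch below demonstrates this on the first lines of the csv example, supplied as a string through the text argument instead of a file so that the code is self-contained. The names csvText, good, wrong and fixed are made up for illustration.

```r
# header and the first two data lines of the csv example
csvText <- "year,anomaly,lower2.5,upper2.5
1850,-0.41765878,-0.589203,-0.24611452
1851,-0.2333498,-0.41186792,-0.054831687"

good <- read.csv(text = csvText)                 # expects commas
dim(good)     # 2 4

wrong <- read.delim(text = csvText)              # expects tabs
dim(wrong)    # 2 1

fixed <- read.delim(text = csvText, sep = ",")   # separator given explicitly
dim(fixed)    # 2 4
```

The same sep argument can be used for semicolons, pipes and other separators.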

2.5 Other data structures

2.5.1 Factors–categorical variables

2.5.1.1 Converting between factors, strings and numbers

TBD: convert from factors to numbers

¹ More specifically, it is a function in the readr package, which is itself a component of tidyverse.