Chapter 2 Introduction to R

R logo. © 2016 The R Foundation (CC-BY-SA 4.0).

R is a powerful programming language and environment for data analysis. It is one of the most popular data science tools because it is designed from the ground up for statistics and data analysis. It is the programming language used throughout this book.

This chapter is primarily designed for readers who have little to no experience with programming, and hence we devote quite a bit of space to topics like variables and data types. If you have programming experience, you may quickly skim through this chapter to just learn the basic R syntax, and how to use RStudio.

2.1 What is R and why do you want to use it?

R is a programming language that allows you to write code to work with data. It is designed from the ground up for this task–statistics and data processing.

R is called “R” because it was inspired by and comes after the language “S”, a language for Statistics developed by AT&T.

There are many other languages that are good for working with data. We have selected R because of its simplicity–as a language that is designed for such tasks from the ground up, its tools are rather simple. This is also a reason why R is very popular in areas like health and social sciences–data processing in R is typically easier and requires less coding than in more general languages.

Working with R (and other programming languages) means writing formal instructions for your computer, which the computer then executes. The instructions can be written in different “languages”, more precisely programming languages, and the computer needs a tool to understand each of these. The R software you installed above (see Section 1.1) is one such tool.

As projects grow, it will become useful not to issue the instructions one-by-one, but to write them all down in a single file, and then tell the computer to execute all of those instructions. This list of instructions is called a script or program or code. Writing scripts is called programming or coding. Executing or “running” a script will cause each instruction (line of code) to be run in order, one after the other, just as if you had typed them in one by one. Writing scripts allows you to save, share, and re-use your work. By saving instructions in a file (or set of files), you can easily check, change, and re-execute the list of instructions as you figure out how to use data to answer questions.

As you begin working with data in R, you will be writing multiple instructions (lines of code) and saving them in files with the .R extension, representing R scripts. Throughout this course we use RStudio for this task, but if you wish, you can use any text editor.

2.2 First look at RStudio

The primary way to use R in this course is through RStudio (see below). However, R can be used also without RStudio.

RStudio is an open-source integrated development environment (IDE) that provides an informative user interface for interacting with the R interpreter. If you haven’t done this already, make sure to download and install the free version of RStudio (see Section 1.2 above). IDEs are glorified text editors that provide various other handy tools for programming. For instance, RStudio lets you edit your program, colors your code to make it easier to understand (syntax coloring), allows you to execute it with a simple keypress, lets you explore data, workspace variables and your command history, install packages, and much more.

2.2.1 RStudio layout

RStudio’s default user interface. Red texts are annotations.

When you open RStudio (either by searching for it, or double-clicking on a desktop icon), you’ll see an interface that looks something like the one shown here. By default, the RStudio interface consists of 4 panes–small windows for different tasks (you can customize this layout if you wish):

  • Console: The bottom-left pane is a console, the R command line for entering R commands. The console will also show your code.

    Normally you use the console for quick computations and short sequences of 1-2 lines of code. Longer blocks of code are usually easier to write as scripts.

  • Script: The top-left pane is a text editor for writing R code, markdown, and other files. It contains a plethora of tools to work with R (and some other) code, including syntax coloring (coloring code according to its function), “auto-complete”, text formatting, and easy ways to execute your code. Note that this pane is hidden if there are no open scripts; select File > New File > R Script from the menu to create a new script file.

  • Environment: The top-right pane displays information about the current R environment (workspace)—specifically, information that you have stored inside of workspace variables (see Section 2.4 below). In the example in RStudio script window, the value 201 is stored in a variable called x. You’ll often create dozens of variables within a script, and the Environment pane helps you keep track of which values you have stored in what variables.

  • Plots, packages, help, etc.: The bottom right pane contains multiple tabs for accessing various information about your files and code. When you create visualizations, those plots will also be in that pane. Most importantly, this is also where you can access the documentation. If you have a question about how something in R works, this is a good place to start!

Note, you can use the small spaces between the panes to adjust the size of each area to your liking. You can also use menu options to reorganize the panes if you wish. The most useful tools are focusing and zooming. Focusing means moving your cursor and input into a particular pane, e.g. Ctrl + 1 makes the script pane active and Ctrl + 2 makes console pane active. Using keyboard shortcuts to move your focus is much faster than grabbing the mouse.

Zooming is similar to focusing, except that it also hides the other panes and makes the zoomed one full size. Ctrl + Shift + 1 zooms to script, Ctrl + Shift + 2 zooms to console, and Ctrl + Shift + 0 restores the original 4-pane view. Zooming to individual panes is very useful if you are working on a small screen. See the View > Panes menu and the options therein; the menus also list the keyboard shortcuts.

See Section J for more information.

2.2.2 Creating projects

RStudio organizes your work into “projects”. Projects are just separate sets of files and folders that are relevant for different tasks. For instance, you may create one project for all the work you do in this class, and you may also want to create separate projects for different problem sets. Afterward, you can just click on the corresponding project icon to re-open RStudio in the correct folder with all the relevant files open.

RStudio menu to create project in an existing folder

Creating a project in an existing folder.

In order to create a project you broadly need to follow these steps:

  • Create the folder where you want to keep the files that you need for this project. As a minimum, I recommend having a separate folder for this class, but you may also want to have separate ones for some homework assignments.
  • From the menu, select File -> New Project. It offers you a few different options (see the figure).
  • Pick Existing Directory, browse to the folder you created above, and click Choose. Finally, click Create Project.

RStudio restarts, and now you are working within the project.

Rstudio project icon

RStudio with a demo project open (upper window) and the RStudio project icon in the folder (lower window).

In the example image here, we see a project that contains just one file–hello.R. This is opened in RStudio above. Below, you see the project folder (called info201) that contains the same file, and the blue RStudio project icon (highlighted).

Next time when you want to work on the same project, you just double-click on the project icon. This will ensure that you have the same files and folders open that are relevant for this project.

2.3 Basic R

Here we introduce the very basics of the R language. We start by typing simple commands on the console, and thereafter switch to scripts. If your task requires just 1-2 commands, then it is often easier to type those directly on the console (the lower-left pane in RStudio), while longer sequences are typically better written as a separate script (see below).

2.3.1 Entering commands on console

R prompt

R prompt in the R console window.

The R console is a small window where you can type in R commands.1 The commands must be typed after the R prompt >. The prompt is a marker that R is ready and is waiting for your commands.

We can start with simple arithmetic. Write 1 + 1 in the R command prompt and hit enter. R replies with [1] 2. Below we write these steps as:

1 + 1
## [1] 2

The first block shows the commands you issue in the R console, and underneath is ## followed by R’s reply (the answer). R’s reply contains the answer, 2, and a marker [1]. The marker is related to the fact that one command may produce many answers, and this is the first of those (see more in Section 4 below).

This is how we can use R as a powerful calculator. The other arithmetic operations are pretty easy and intuitive: - for subtraction, * for multiplication, / for division and ^ for exponentiation. Only exponentiation is somewhat non-standard, as different programming languages have different habits here. R knows that multiplication must be done before addition; if you want the opposite, then you need parentheses:

1 + 2*3
## [1] 7
(1 + 2)*3
## [1] 9
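The remaining operators can be tried out the same way. Below is a small sketch with arbitrarily chosen numbers; each line can be typed at the console on its own. The last two lines illustrate that exponentiation is done before addition unless parentheses say otherwise.

```r
7 - 2        # subtraction
## [1] 5
3 * 4        # multiplication
## [1] 12
10 / 4       # division
## [1] 2.5
2^10         # exponentiation: 2 multiplied by itself 10 times
## [1] 1024
2 + 3^2      # exponentiation comes before addition
## [1] 11
(2 + 3)^2    # parentheses force the addition first
## [1] 25
```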

Let’s now compute something that is hard to do manually–namely the length of a light-year. A light-year is the distance that light, moving 300,000 kilometers per second, covers in one year:

300000*60*60*24*365
## [1] 9.4608e+12

Here we take the speed of light, and multiply it by seconds in a minute (60), minutes in an hour (60), hours in a day (24) and days in a year (365). R prints the answer in exponential form; it must be understood as \(9.46\cdot 10^{12}\), i.e. almost 10 trillion kilometers.

You cannot just click on the previously entered command and edit it. But in RStudio, you can use the up arrow to retrieve the previously entered command, edit it, and re-run.

See more in Section J.

2.3.2 Writing scripts

How to open new script in RStudio

One can open a new script through RStudio menus, the corresponding keyboard shortcut is visible as well.

Next, let’s re-write these calculations as a script. The easiest way to write scripts is using the RStudio script editor. Depending on your exact configuration, an “Untitled” script may already be open, or you can choose from menu File -> New File -> R Script (or Ctrl - Shift - N). This opens a new R script in a dedicated window (top left in RStudio).

Let’s put the same R command in that window. Now the command (or more often, a collection of commands) is called a script or a computer program.2 So the content of your script window will look like

300000*60*60*24*365
This is a script, a very simple, one-line computer program.
“Source” button in R Script window

Location of “Source” button in R Script window in RStudio.

The next task is to run the script, which means executing all the commands in it (or in this case, the only command we have there). RStudio offers several ways to do it:

  • “Source” (Ctrl + Shift + S) will execute (source in R parlance) the program. It will not show the code that you execute, nor any results that are not explicitly printed (see Section 2.6).
  • “Source with Echo” (Ctrl + Shift + Enter) will also execute the code, but will show both the code and output, even if not explicitly printed.
“Run” button in R Script window

Location of “Run” button in R Script window in RStudio.

Another handy way to execute code is to use the “Run” button (Ctrl + Enter / ⌘ + Enter). This executes either the region that is highlighted, or the command where the cursor is currently located if there is no highlight. In the example figure, this will execute the line “1 + 2 + 3”, and show both the code and the result in the “Console” window.

Finally, you may want to save your script using a better name than “Untitled”. Use the menu: File -> Save As… to pick a good name.

R continuation prompt

If R thinks that you are not ready with the command, it shows + instead of >.

Sometimes it happens that you either write your script wrong, or you run only a part of it. In that case you may notice that the normal command prompt > is replaced by continuation prompt +. This is because R thinks that you are not ready with the command and expects you to continue.

The image here demonstrates how this can happen. After entering 2 * (1 +, R does not see the closing parenthesis and concludes that you will continue the command. You may notice that “nothing works” until R has understood that the command is finished. Here you can just enter the closing parenthesis ), but otherwise the Esc key will help. Pressing Esc will interrupt the incomplete command and restore the normal command prompt.

2.3.3 Comments

One of the extremely handy and simple features of scripts (and computer programs in general) is comments. These are parts of the code that are ignored by the computer. They are just notes for the human reader (including you!) to make it easier to understand what the code does. Since programs can be opaque and difficult to understand, comments are widely used to add explanations. Even your own code may be quite incomprehensible a few months after writing it.

Comments should be clear, concise, and helpful—they should provide information that is not otherwise present or “obvious” in the code itself.

In R, we mark text as a comment by putting it after the pound/hashtag symbol (#). Everything from the # until the end of the line is a comment. It is common to put descriptive comments immediately above the code they describe, and sometimes immediately afterwards. One can also put short notes at the end of a line of code.

The commented light-year script might look like this:

## Length of light-year:
## c by seconds in minute by minutes in hour by
## .. by hours in day by days in year
300000*60*60*24*365
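Short notes at the end of a line can be sketched as follows. This is the same light-year computation, split into two lines purely for illustration; the split is not part of the original script.

```r
60*60*24*365       # seconds in a year
## [1] 31536000
300000*31536000    # speed of light (km/s) times seconds in a year
## [1] 9.4608e+12
```

R runs the code to the left of each # and ignores the note to the right of it.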

Note that these comments start with a double hash sign ##: only one is needed, but as the computer ignores everything after the first one, it will also ignore the second one. So any number of hash signs is fine!

See Section 7.5.2 for more about how to write good comments.

You can “execute” comments and enter those on the console, but it is not very useful as they do not do anything.

Comments are also used for temporarily “deleting” parts of the code–if you add comment signs # in front of every line in some parts of your code, these lines will be ignored by the computer. But you can easily get these back if you need those again.
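As a minimal sketch of “commenting out”, the script below contains two versions of the light-year computation. The first one, which uses the more precise speed of light 299792.458 km/s (a value added here for illustration), is switched off with #, so only the second line runs.

```r
# 299792.458*60*60*24*365    # switched off: R ignores this whole line
300000*60*60*24*365          # this line still runs
## [1] 9.4608e+12
```

Removing the leading # from the first line would bring that computation back.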

In RStudio, you can turn highlighted lines into comments and back by pressing Ctrl - Shift - C. See more in Section J.


From now on, you can write (or copy) the example code directly into the script window and execute it using “Source” or “Run”.

2.4 Variables

Since computer programs involve working with lots of data, we need a way to store and refer to this information. We do this using variables.

2.4.1 What are variables

For instance, if we want to add numbers, we can just write it as

2 + 5
## [1] 7

This is a good way to compute something where we know the inputs (numbers “2” and “5”) and we just want to print the output. But quite often we want to do something similar when we do not know what the numbers are. It may sound a bit counter-intuitive–how on earth can we compute something if we do not know the inputs?–but there are many valid reasons for that. For instance, we may ask the user for the input. Or the input may be a date or time, and we do not know when someone will run our program. Or the input is read from a dataset, and it may be one of many datasets. In such cases we cannot “hardcode” our computations like 2 + 5. We must keep the program open to learn the actual input values later. This can be done using variables.

The same example above, just using variables, may look like3

x <- 2
y <- 5
x + y
## [1] 7

So what is the difference? After all, we still got the same number?

However, now our code stores the numbers, “2” and “5”, in memory under two separate labels (variable names) “x” and “y”. You can think of variables as labeled “boxes” for data. You can use the label to refer to the data inside. The numbers can be stored into the boxes (variables) using a special assignment operator <-; it is like an arrow that puts number “2” into a box labelled “x” and number “5” into the box “y”. This process is called assignment. Note that the variable name goes on the left and the value on the right.4 Later, we just use the box labels (variable names) to perform tasks with the data that is inside the boxes (variables).

In RStudio, use Alt-- (Alt-minus) to get the <- operator.

See Section J for more.

Now you can imagine that instead of x <- 2 and y <- 5, we may write code that asks the user for x, and reads y from a dataset. But the computation, adding x and y, will remain the same. This is the beauty of variables: as long as the computations are the same, we can use the same code.5

But variables can also be used to remember and retrieve the values later. This requires a slightly different code, for instance:

x <- 2
y <- 5
z <- x + y
z
## [1] 7

Note that we store the result of x + y in “z” in a fairly similar manner as how we stored numbers into “x” and “y”. Just what goes into the box “z” is a result of a calculation, not a given number as above. Now we have an additional “box” in memory, labeled as “z”. You can see your variables in RStudio “Environment” pane. You can also see all the variables using command ls():

ls()
## [1] "x" "y" "z"

This shows that we have defined three variables: “x”, “y” and “z”.

More specifically, we are talking here about workspace variables or environment variables. These are the variables that are part of R workspace, and that you can see on the top-right “Environment” tab in RStudio. These are what programming languages typically call just variables. Later, in Section 11, we will encounter data variables, stored in the datasets and not in the workspace.

A note about the last line–it is just “z” and nothing else. This is for printing the result. The R console normally only prints a result if it is not assigned to a variable. If we instead wrote the code like

x <- 2
y <- 5
z <- x + y

then we do not see any result. The result is still computed, just not printed on screen. The last lonely “z” prints it in a simple manner (see Section 2.6 for more about printing).

We can use any variable to do computations and store it in any variable. So we can also do like this:

## to begin with, 'z' contains value '7'
z <- z + 1  # take z, add 1, and store result back in z
z  # now it is '8'
## [1] 8

Here we take the number from the “box z”, add “1” to it, and “put it back into the same box”. This is perfectly valid computer code, and in fact widely used for various tasks, such as counting.

2.4.2 Variable names

In the example above, we used single-letter variable names. But they do not need to be single letters; they may be much longer. In fact, you are fairly free to choose any kind of names you want, but there are some rules: variable names must begin with a letter and can contain any combination of letters, numbers, periods (.), or underscores (_).

Here are a few examples of valid variable names:

x <- 1
xx <- 2
x1 <- 3
anotherX <- 4  # camelCase
one_more_x <- 5  # snake_case
beta.2 <- 6

All these styles have their advantages and disadvantages. In general, pick shorter names for shorter scripts and long descriptive names for large complex projects. You can pick all kinds of variable names, but they should be descriptive and informative about what the “boxes” contain. Confusing or misleading variable names are a major problem in programming. See more in Section 7.5.1.

A good example of how to use variables and choose variable names is here:

minutes_in_day <- 60*24

Variable names are case-sensitive, so “x” and “X” are two different variables. In the example above, Minutes_in_day will not work:

Minutes_in_day
## Error in eval(expr, envir, enclos): object 'Minutes_in_day' not found
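To see how descriptive names help, here is a sketch of the earlier light-year computation rewritten with variables. The names speed_of_light, seconds_in_year and light_year are made up for this illustration; any other valid names would work equally well.

```r
speed_of_light <- 300000                        # km per second
seconds_in_year <- 60*60*24*365                 # 60 s, 60 min, 24 h, 365 days
light_year <- speed_of_light*seconds_in_year    # km
light_year
## [1] 9.4608e+12
```

Compared to the single line 300000*60*60*24*365, the names document what each number means.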

Here are some examples of invalid variable names:

1x <- 7  # starts with a number
new x <- 7  # contains space
price$ <- 8  # contains $

This code will not work; it produces errors.

Exercise 2.1 When coding, it is important to understand the error messages. Type these invalid assignments in RStudio console. What are the exact error messages you get?

See the solution

Variable names must begin with a letter, but it does not have to be an English letter. Any UTF-8 letter is fine. So you can write code like this, here with Greek letters as variable names:

α <- 2
β <- 3
α + β
## [1] 5

You can see what value is inside any variable by typing that variable name as a line of code:

x
## [1] 1

2.5 Data Types

In the previous section, we were only working with numeric values. We did some computations and stored those in variables. But there are data that are not numbers.

The two most important non-numeric data types are text (strings) and logical values. Using other data types is very similar to using numbers. For instance,

greeting <- "Hi!"  # text
answer <- TRUE  # logical

R is intelligent enough to understand that if we have code x <- 7, then x will contain a numeric value (and so we can do math with it!), and if you write y <- "blah-blah-blah", then it is text, and we can convert it to upper case instead.6

There are four “basic types” (called atomic data types) in R that we encounter in this book.

2.5.1 Numeric

The default computational data type in R is numeric data. It can represent real numbers (numbers that contain decimals). We can use mathematical operators (such as +, -, *, ^, see below in this section) to do computations with numeric data. There are also numerous functions that work on numeric data (such as calculating sums, averages and square roots).

Numeric data is normally printed in a fairly obvious way, e.g.

1/2
## [1] 0.5

If the decimal expansion does not terminate, only the first few digits are printed:

-1/7
## [1] -0.1428571

If numbers are too large, or too small, then they are printed in exponential form:

1000*2000*3000*4000/1.1
## [1] 2.181818e+13
1/1000/2000/3000
## [1] 1.666667e-10

The exponential form must be understood as \(2.181818\cdot10^{13}\) in the former case, and as \(1.666667\cdot 10^{-10}\) in the latter case. Exponential form can also be used to enter numbers, e.g.

x <- -3e-2  # -0.03
x
## [1] -0.03

Naturally, there are various ways to adjust the way the numbers are printed.
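As a small sketch of such adjustments, the base-R functions round(), signif() and format() are shown below. They are not introduced in the text above (functions come later in the book), so treat this only as a preview. Note that format() returns text, which is why its result is printed in quotes.

```r
round(-1/7, 3)       # keep 3 decimals
## [1] -0.143
signif(-1/7, 2)      # keep 2 significant digits
## [1] -0.14
format(9.4608e12, scientific = FALSE)                  # no exponential form
## [1] "9460800000000"
format(9.4608e12, big.mark = ",", scientific = FALSE)  # add thousands separators
## [1] "9,460,800,000,000"
```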

There are also special mathematical constants: pi is \(\pi \approx 3.1415927\), and Inf is infinity. You can get infinities when you do certain operations, e.g. divide by zero. You can also use infinity if you need a constant that is larger than any number.

One can use mathematical operators with numeric values. Mathematical operators are the common signs like + and - that allow you to do basic mathematics (to “operate”), plus a few others:
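A few console lines illustrate these constants. The circle with radius 2 is an arbitrary example chosen here.

```r
pi
## [1] 3.141593
pi*2^2     # area of a circle with radius 2
## [1] 12.56637
1/0        # dividing by zero gives infinity
## [1] Inf
-1/0
## [1] -Inf
Inf + 1    # infinity plus anything finite is still infinity
## [1] Inf
```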

  • +: addition
  • -: subtraction
  • *: multiplication
  • /: division
  • ^: exponentiation (i.e. 2^3 means 2*2*2).

These are defined for most numbers, except for a few corner cases, such as division by zero. The other way to do math, besides operators, is with functions. We’ll talk more about those below in Section 3.2.

Besides these well known mathematical operations, there are more, for instance

  • %/% is integer division: e.g. 7 %/% 2 equals 3. This is a division that only returns the integer part and ignores the remainder.
  • %% is modulo, e.g. 7 %% 2 equals 1–when you divide 7 by 2, then 1 is “left over”.
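The two operators are often used together. As a sketch with made-up numbers, the code below splits 135 minutes into whole hours and leftover minutes.

```r
135 %/% 60    # whole hours in 135 minutes
## [1] 2
135 %% 60     # minutes left over
## [1] 15
2*60 + 15     # putting the two parts back together
## [1] 135
```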

There are many more mathematical operators, such as matrix product or outer product. We do not discuss these in this book.

Exercise 2.2 Use integer division to transform years to decades. E.g. 1966 → 1960 and 2023 → 2020.

See the solution

2.5.2 Character

Another very common task we do is to perform simple text manipulations. Text data is called character or string data in R. This may include simple tasks like storing a single letter in a variable, or changing words to upper case; but it may also include quite complicated text analysis.

You can tell R that something is character data by putting it in quotes (both single quotes ' and double quotes " will do). For instance, we can store the name of a certain well-known playwright in a variable:

famous_poet <- "Bill Shakespeare"

Note that character data is still data, so it can be assigned to a variable just like numeric data! We can print its value by just typing its name on the console, or using dedicated printing functions (see Section 2.6). There are no special operators for character data, though there are many functions for working with strings.

Note that it is not the content but the type of the content that decides if the variable is numeric or character:

x <- 1  # this is numeric
y <- "1"  # this is character

Both variables contain “one”, but in case of “x” this is stored as a number, while in “y” it is stored as a string. This is because 1 (without quotes) is a number and "1" (with quotes) is a character, and the variable automatically “knows” what type of data you put in there. Hence we can do mathematical operations with “x” but not with “y”, and text functions with “y” but not with “x”:

x + 1
## [1] 2

will work but y + 1 will give an error. If you are unsure what type a particular variable is, you can query it with the function class(), e.g.

class(y)
## [1] "character"
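The same query works for the other types. The sketch below repeats the definitions of x and y so that it can be run on its own. The last line uses as.numeric(), a base-R conversion function that is not discussed in the text above; it is included only to hint at how text that looks like a number can be turned into one.

```r
x <- 1
y <- "1"
class(x)
## [1] "numeric"
class(y)
## [1] "character"
class(TRUE)
## [1] "logical"
as.numeric(y) + 1    # convert the text "1" to a number first, then add
## [1] 2
```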

Exercise 2.3 Try to add a number to y. What is the exact error message? Do you understand what it tells?

There are no dedicated character operators but there is a plethora of functions dedicated to manipulating text.
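As a brief sketch of what such functions look like, here are three from base R: toupper() converts text to upper case, nchar() counts characters, and paste() glues pieces of text together. The choice of these three is only for illustration.

```r
famous_poet <- "Bill Shakespeare"
toupper(famous_poet)                  # convert to upper case
## [1] "BILL SHAKESPEARE"
nchar(famous_poet)                    # number of characters, including the space
## [1] 16
paste(famous_poet, "wrote Hamlet")    # join two strings, separated by a space
## [1] "Bill Shakespeare wrote Hamlet"
```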

2.5.3 Logical

The third extremely important variable type is logical variables (a.k.a. Boolean variables). These can only store two values–“true” or “false”. In R, these two values are written as TRUE and FALSE. Importantly, these are not the strings "TRUE" or "FALSE"; logical values are a different type! If you write these values in the RStudio script window, you see that it has a special color for these “logical constants”.

Logical values are called “booleans” after mathematician and logician George Boole.

But why do we need such “powerless” variables that can only contain two values? Wouldn’t it be more useful to use numbers or strings that can contain much more? It turns out that logical values are extremely important. Namely, most decision-making is logical. We either do this, or we do not do this. And there is a lot of decision-making in computer code. We have to check if our results are correct (or not), if the user input makes sense (or not), if we are done with all inputs or not, and so forth. All these decisions involve only two values, and R has many decision-making tools that rely on such logical values (see Section 8).

You can create logical variables directly, like a <- TRUE but that is rarely useful. Most commonly we see those as the result of applying comparison operators to data. These are

  • <: less than
  • >: greater than
  • <=: less-than-or-equal
  • >=: greater-than-or-equal
  • ==: equal
  • !=: not-equal

Note that equality is tested with double equal signs ==, not with single equal sign! For instance

2 == 3
## [1] FALSE

gives you FALSE but you cannot use single equal sign for comparison, 2 = 3 gives an error instead.

Comparison operators behave in many ways exactly as mathematical operators like + and *, just they result in logical values:

3 < 4
## [1] TRUE
3.14 < 3
## [1] FALSE

We can store these values in variables exactly like in case of numbers or strings:

a <- 3
b <- 4
c <- a == b  # does 3 equal 4?
c
## [1] FALSE

Exercise 2.4 Are you more than 20 years old? Assign your age to a variable, compare this to 20, and store the result in another variable. Finally print it; it should print TRUE or FALSE, depending on whether you are older than 20 or not.

See the solution

One can also compare strings. While equality is fairly obvious, then for instance

"cat" > "dog"
## [1] FALSE

turns out to be false. This has nothing to do with the size of the corresponding mammals–the fact that cat is “smaller” here means it is located before dog when written in alphabetic order.
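A few more string comparisons, as a sketch with arbitrarily chosen words:

```r
"cat" < "dog"        # "cat" comes before "dog" in alphabetic order
## [1] TRUE
"cat" == "cat"       # identical strings are equal
## [1] TRUE
"cat" == "Cat"       # upper and lower case letters are different characters
## [1] FALSE
"cat" < "catfish"    # a word comes before its longer extensions
## [1] TRUE
```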

Logical values also have additional operators, called logical operators or boolean operators. These work only with logical values and they produce logical values. This allows you to make more complex logical expressions. Although their behavior is very similar to that of mathematical operators, logical operators are often confusing for beginners. We are used to working with numbers but not with logical values.

Logical operators include & (logical and), | (logical or), and ! (logical not). The meaning of these logical operators corresponds rather closely (but not exactly!) to their meaning in everyday language. In particular true AND true is true, for instance

x <- 3
y <- 5

x < 4  # TRUE
## [1] TRUE
y > 4  # TRUE
## [1] TRUE
x < 4 & y > 4  # TRUE and TRUE is TRUE
## [1] TRUE

But if any of the involved logical values is false, then logical AND will produce false:

x > 4 & y > 4  # FALSE and TRUE is FALSE
## [1] FALSE

However, you can use logical NOT, ! to reverse the condition:

!(x > 4) & y > 4  # not FALSE and TRUE is TRUE
## [1] TRUE

Note that we need to put x > 4 in parentheses to tell R that ! applies to x > 4, not to x alone!

Logical OR behaves otherwise similarly, but it is true if at least one of the values involved is true:

pet <- "dog"
weather <- "rain"

# Check if pet is "cat" OR "dog"
pet == "cat" | pet == "dog"
## [1] TRUE
# Check if pet is dog OR weather is sunny
pet == "dog" | weather == "sunny"
## [1] TRUE
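The behavior of the three operators can be summarized by applying them directly to the constants TRUE and FALSE. This is a sketch of the most informative combinations, not a complete list.

```r
TRUE & TRUE      # AND: both are true
## [1] TRUE
TRUE & FALSE     # AND: one is false
## [1] FALSE
FALSE & FALSE
## [1] FALSE
TRUE | FALSE     # OR: at least one is true
## [1] TRUE
FALSE | FALSE    # OR: neither is true
## [1] FALSE
!TRUE            # NOT reverses the value
## [1] FALSE
```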

It’s easy to write complex expressions with logical operators. If you find yourself getting lost, I recommend rethinking your question to see if there is a simpler way to express it!

Exercise 2.5 Use the pet example above to deduce if you are happy and it is raining today. You may write it in a way as

mood <- ...
weather <- ...
...

Your code should print TRUE or FALSE depending on your mood and weather.

See the solution

2.5.4 Integer

The final “atomic” data type we encounter in this book is integer. These are numbers like “numeric”, but these can only hold integer values. Now again, one may ask why do we need such limited numbers, but there are a few reasons for this.

  • First, and most importantly, integer arithmetic is precise. This is not guaranteed to be the case for floating point “numerics”–computers cannot represent an infinite number of decimals, and hence they often produce results that are close to, but not exactly, right.
  • The other reason why integers are sometimes preferred is that integer arithmetic may be faster and consume less memory. However, for the computations we encounter in this class, the storage and computation speed do not matter.
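The precision issue can be seen with a classic example. In the sketch below, the suffix L (as in 1L) is R syntax for writing a number as an integer; it is not introduced in the text above and is used here only to make the contrast visible.

```r
0.1 + 0.2 == 0.3    # numeric: a tiny rounding error makes this FALSE
## [1] FALSE
0.1 + 0.2 - 0.3     # the size of the rounding error
## [1] 5.551115e-17
1L + 2L == 3L       # integer: the arithmetic is exact
## [1] TRUE
```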

Integers are produced by certain operations, e.g. when creating sequences.
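For instance, the colon operator : creates a sequence of consecutive integers. The operator is not explained in the text above, so take the following as a small preview:

```r
1:5           # the sequence 1, 2, 3, 4, 5
## [1] 1 2 3 4 5
class(1:5)    # sequences made with : are integers
## [1] "integer"
class(5)      # a number typed in directly is numeric
## [1] "numeric"
```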

Base R has two additional “basic types” that we do not discuss in this book:

  • Complex: Complex (imaginary) numbers have their own data storage type in R; they are created using the i syntax: c <- 1 + 2i.
  • Raw: is a sequence of “raw” data. It is good for storing a “raw” sequence of bytes, such as image data. R does not interpret raw data in any particular way.

2.6 Producing output: cat and print

When you just compute on R console, or even when you write small scripts, it is not necessary to dedicate any extra effort to printing. The results are automatically printed. This is a common behavior in R console: the last result will be printed. It is a handy but limited feature.

Output depends on the way the code is executed. The same script is first “run”, that produces the first lines of output on the console, including the result “1”. Thereafter it is “sourced”. The only prints the source() command, but no output.

Automatic printing is limited in several ways. First, it only prints the “last” value (unless assigned to a variable). Second, it only works in certain environments, e.g. in the RStudio console when running the program, but not when “sourcing” it (see Section 2.3.2). Third, when writing longer programs, you may want to see more results than the last one, and maybe also add some explanatory notes. Finally, the result depends on what exactly the “last” value means–the code can either be fed line-by-line, in which case every value is the last one, or all at once, in which case only the last line is the last one…
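The difference between automatic and explicit printing when sourcing can be tried out with a small sketch. It writes a two-line script into a temporary file and sources it; tempfile(), writeLines(), and the variable name script are not part of the text above and are used here only to keep the example self-contained:

```r
## Write a tiny two-line script to a temporary file
script <- tempfile(fileext = ".R")
writeLines(c("1 + 1", "print(1 + 1)"), script)
## Sourcing it shows only the explicitly printed result
source(script)
## [1] 2
```

The first line of the script computes 1 + 1 as well, but because nothing prints it explicitly, it leaves no trace in the output.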

All this suggests that instead of relying on automatic printing, in more complex projects you may want to use dedicated printing functions. R has two printing commands: cat and print. cat is useful if you want to print simple objects, but potentially more than one object. These may be one or more numbers, strings, and explanatory text. print can output complex objects, but only one at a time.

Next, we illustrate the usage of cat:

## Compute length of light-year
ly <- 300000*60*60*24*365
cat("Length of light-year is", ly, "km\n")
## Length of light-year is 9.4608e+12 km

This short script computes the length of a light-year and prints it with a small informative message. Alternatively, we can just compute this number and let the R console print it automatically:

ly <- 300000*60*60*24*365
ly
## [1] 9.4608e+12

Why should we use cat then? Automatic printing is good enough if you work interactively on the console, or just run very short code snippets. But if the code is not run on the R console, the number may not be printed at all. And if the script computes and prints many results, the user easily gets confused about what these numbers mean. So it is a good habit to output your results together with a brief explanation.

The syntax of cat is pretty simple: it takes a list of arguments–the texts, variables, and numbers you want to print–and prints them separated by spaces. One very useful symbol you may want to add is the newline character "\n". (Note: it uses a backslash, "\n", not a forward slash, "/n".) This forces printing to jump to the next line:

## output on single line:
cat("hi there\n")
## hi there
## output on multiple lines
cat("hi\n there\n")  # jump to new line
## hi
##  there

print is somewhat similar to cat but designed to output more complex objects, such as vectors, lists, and data frames. print may produce multi-line output, but it does not let you add explanatory messages. You have to cat the message first and print your complex object thereafter.
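A minimal sketch of this cat-then-print pattern might look as follows. The variable temperatures and its values are made up for illustration, and c(), which combines several numbers into a vector, is used here without having been introduced in the text above:

```r
## A small vector of made-up values
temperatures <- c(21.5, 19, 23.2)
## cat the explanatory message, then print the object itself
cat("Temperatures this week:\n")
## Temperatures this week:
print(temperatures)
## [1] 21.5 19.0 23.2
```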

Obviously, output does not have to be printed on the console; it may also be sent to a file, uploaded to the internet, or played as audio instead. But whatever the exact format, it is important to ensure the user has enough information to understand what the output is.

Finally, let’s use the tools we learned above, and re-write the light-year script in a way that looks more like normal computer code:

## Compute the length of lightyear
c <- 300000  # speed of light (km/s)
lightMinute <- c*60
lightHour <- lightMinute*60
lightDay <- lightHour*24
lightYear <- lightDay*365
cat("Lightyear is", lightYear, "km\n")
## Lightyear is 9.4608e+12 km

Exercise 2.6 How long does it take for sound to travel around Earth?

  • Speed of sound is 0.34 km/s
  • Circumference of earth is 42,000 km
  • Write a similar script that computes the time in seconds, hours, and days.
  • It should print something like Sound travels around Earth in xxx seconds or in yyy hours, or zzz days

See the solution

2.7 Getting Help

Humans make errors. It is impossible to write anything resembling a substantial computer program without dozens of errors in the process. Programmers spend a considerable amount of time trying to find and correct errors (this is called debugging). Here are a few suggestions about how to get help.

  1. Read the error messages: If there is an issue with the way you have written or executed your code, R will often print out a red error message in your console. Do your best to understand the message–read it carefully, and think about what is meant by each word in the message. You may also put it directly into Google and see if you can get better explanations. You’ll soon get the hang of interpreting these messages if you put the time into trying to understand them.

  2. Google: When you’re trying to figure out how to do something, it should be no surprise that Google is often the best resource. Try searching for queries like "how to <DO THING> in R". More frequently than not, your question will lead you to a Q/A forum called StackOverflow (see below), which is a great place to find potential answers.

  3. StackOverflow: StackOverflow is an amazing Q/A forum for asking/answering programming questions. Indeed, most basic questions have already been asked/answered here. However, don’t hesitate to post your own questions to StackOverflow. Familiarize yourself with how to ask questions on StackOverflow though.
    It happens often that by the time I can articulate the question clearly enough to post it, I’ve figured out my problem anyway.

  4. Documentation: R’s documentation is actually quite good. Functions and behaviors are all described in the same format, and often contain helpful examples. To search the documentation within R (or in RStudio), simply type ? followed by the name of the function you’re using (more on functions coming soon). You can also search the documentation by typing two question marks (??SEARCH).

    You can also look up help by using the help() function (e.g., help(print) will look up information on the print() function, just like ?print does). There is also an example() function you can call to see examples of a function in action (e.g., example(print)).

    rdocumentation.org has a lovely searchable and readable interface to the R documentation.

  5. ChatGPT and similar AI applications can generate code for you, if you know what to ask. The code may not be correct, and it may not be exactly what you want, but it is advisable to familiarize yourself with such tools. They are not a substitute for basic manual coding though–it is important that you know the basic programming tools and syntax, which, among other things, also helps you evaluate the suitability of AI-offered solutions.

See Section 7 for more information about learning, getting help, and debugging.

2.8 Summary

  • R prompt: marker “>” in R Console, marking that R is ready to accept your commands. See Section 2.3.1.

    Sometimes it turns into the continuation prompt “+”, which does not accept new commands; press Esc to get back to “>”. See Section 2.3.2.

  • variables: labeled locations (“boxes”) in memory that contain values. Variable names must begin with a letter, and can contain letters, numbers, underscores _ and dots. Variable names can be used instead of the corresponding values. See Section 2.4.

Resources


  1. RStudio also contains a system command shell, labeled “Terminal”. Do not mistake it for the R console, labeled “Console”. It will not understand R commands.

  2. There is no clear distinction between script and program. Typically, one calls simple programs “scripts” and more complex programs “programs”. Also, programs written in compiled languages are rarely called “scripts”. So writing scripts is “programming”.

  3. Those who are familiar with statically typed languages, such as Java or C++, may notice that we have to declare neither the variables nor their types. R will figure it out automatically. One can also change the variable type with no extra effort–it is a dynamically typed language.

  4. R also has a (rarely used) right-assignment operator ->, so you can write 2 -> x instead.

  5. This is analogous to mathematical formulas, e.g. \(S = \pi\cdot r^2\). The formula remains the same, whatever the value of \(r\).

  6. It may seem that dynamic typing, the fact that a language can automatically determine the data type, is a great thing to have. It may be so. But it also has distinct downsides; in particular, it makes it easier to make hard-to-find mistakes.