K Exercise solutions
K.1 Introduction to R
K.1.1 Variables
K.1.1.1 Invalid variable names
K.1.2 Data Types
K.1.2.1 Years to decades
If we integer-divide year by “10”, then we get the decade (without the trailing “0”). E.g.
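For instance, taking 1960 as the example year:

```r
1960 %/% 10   # integer division drops the trailing digit
```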
## [1] 196
Now we just multiply the result by 10:
## [1] 1960
Or, to make the order of operation more clear:
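For instance (the year 2024 is my example; any year of the 2020s gives the same decade):

```r
year <- 2024
(year %/% 10) * 10
```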
## [1] 2020
K.1.2.2 Are you above 20?
There are many ways to do it, here is just one possible solution:
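For instance, assuming an age of 25 (the value is my example):

```r
age <- 25
older <- age > 20
older
```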
## [1] TRUE
Note the variable names: age is fairly self-explanatory, older is much less so. In complex projects one may prefer a name like age_over_20 or something similar. But in a few-line script, even a and o may do.
K.1.3 Producing output
K.1.3.1 Sound around earth
We can follow the lightyear example fairly closely:
s <- 0.34 # speed of sound, km/s
distance <- 42000
tSec <- distance/s
tHrs <- tSec/3600
tDay <- tHrs/24
cat("It takes", tSec, "seconds, or",
tHrs, "hours, \nor", tDay,
"days for sound to travel around earth\n")
## It takes 123529.4 seconds, or 34.31373 hours,
## or 1.429739 days for sound to travel around earth
Note how we injected the new line, \n
in front of “or” for days.
This makes the output lines somewhat shorter and easier to read.
Now it does not happen often that sound actually travels around the world, but the pressure wave of the 1883 Krakatoa volcanic eruption was actually measured circumnavigating the world 3 times in 5 days. See the Wikipedia entry.
K.2 Functions
K.2.1 For-loops
K.2.1.1 Odd numbers only
The form of seq() we need here is seq(from, to, by), so that the sequence runs from from to to with step by. So we can write
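One possible loop:

```r
for(i in seq(1, 9, by = 2)) {
   cat(i, "^2 = ", i^2, "\n", sep = "")
}
```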
## 1^2 = 1
## 3^2 = 9
## 5^2 = 25
## 7^2 = 49
## 9^2 = 81
K.2.1.2 Multiply 7
We can just follow the loop example in Section 3.1:
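For instance:

```r
for(i in 10:0) {
   cat("7*", i, " = ", 7*i, "\n", sep = "")
}
```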
## 7*10 = 70
## 7*9 = 63
## 7*8 = 56
## 7*7 = 49
## 7*6 = 42
## 7*5 = 35
## 7*4 = 28
## 7*3 = 21
## 7*2 = 14
## 7*1 = 7
## 7*0 = 0
Note the differences:
- we go down from “10” to “0” using 10:0;
- we need to specify that the numbers and strings we print should not be separated by a space, using the sep="" argument for cat();
- we could have created a separate variable i7 <- i*7, but we chose to write this expression directly as an argument for cat().
K.2.1.3 Print carets ^
This is very simple: we just need to use cat("^")
10 times in a
loop:
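For instance:

```r
for(i in 1:10) {
   cat("^")
}
cat("\n")   # end the line after the loop
```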
## ^^^^^^^^^^
Note that we end the line after the loop: this is because we do not want whatever follows it to be printed on the same line.
K.2.1.4 Asivärk
The trick here is to use the caret-printing example, but now we need to do it not 10 times, but a different number of times in each row. We can call this number n, and change n in another, outer for-loop, from 1 to 10:
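The nested loops may look like this:

```r
for(n in 1:10) {
   for(i in 1:n) {
      cat("^")
   }
   cat("\n")   # end the row
}
```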
## ^
## ^^
## ^^^
## ^^^^
## ^^^^^
## ^^^^^^
## ^^^^^^^
## ^^^^^^^^
## ^^^^^^^^^
## ^^^^^^^^^^
Note how the middle rows are essentially the caret-printing example; the only difference is 1:n instead of 1:10 in the loop header. This ensures that the outer loop index n can change the number of carets printed.
K.2.1.5 Cloud and Rain
This is a somewhat more complicated example, but the broad idea is
similar to that of Asivärk. We need nested
loops here too: first, the outer loop should count the number of
v
-s. Second, we need three inner loops: for dashes at left, v
-s in
the middle, and dashes at right. All these loops should nest inside the
outer loop:
for(n in seq(10, 2, by=-2)) {
   # n is the number of v-s in each row:
   # 10, 8, 6, ...
   nDash <- (12 - n)/2   # how many raindrops at each side of the cloud
   ## Left raindrops
   for(i in 1:nDash) {
      cat("-")
   }
   ## Center cloud
   for(i in 1:n) {
      cat("v")
   }
   ## Right raindrops
   for(i in 1:nDash) {
      cat("-")
   }
   cat("\n")   # row ends here
}
## -vvvvvvvvvv-
## --vvvvvvvv--
## ---vvvvvv---
## ----vvvv----
## -----vv-----
K.2.3 Writing functions
K.2.3.1 M87 black hole in km
The function might look similar to feet2m, but we may need to compute the length of a single light-year inside the function:
ly2km <- function(distance) {
   c <- 300000
   ly <- c*60*60*24*365   # length of a single light-year:
                          # speed of light * seconds in minute *
                          # minutes in hour * hours in day *
                          # days in year
   distance*ly
}
And we find the distance to the black hole as
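The output below corresponds to a distance of 55 million light-years, so the call was presumably:

```r
ly2km(55e6)   # M87 black hole: about 55 million light-years away
```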
## [1] 5.20344e+20
or maybe it is easier to write it as
## [1] 5.20344e+20
If this number does not tell you much then you are not alone–such big distances are beyond what we on Earth can perceive.
K.2.3.2 Years to decades
Perhaps the most unintuitive part here is the integer division %/%: it just divides the numbers, but discards the fractional part. For instance,
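for instance with the year 2022 (my example; any year of the 2020s gives 202):

```r
2022 %/% 10
```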
## [1] 202
In order to make this into the decade, we just need to multiply the result by 10 again. So the function might look like:
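A possible definition (the function name and the example years below are my choices; any year within a given decade produces the same result):

```r
year2decade <- function(year) {
   year %/% 10 * 10
}
year2decade(2022)
year2decade(1934)
year2decade(1965)
year2decade(1978)
```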
## [1] 2020
## [1] 1930
## [1] 1960
## [1] 1970
K.2.4 Output versus return
We can create such a function by just using paste0
:
hi <- function(name) {
   paste0("Hi ", name, ", isn't it a nice day today?")
   # remember: paste0 does not leave spaces b/w arguments
}
This function returns the result of paste0, the character string that combines the greeting and the name. It does not output anything–there is no print nor cat command. We can show it works as expected: when called on the R console, its returned value, the greeting, is automatically printed:
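```r
hi("Arthur")
```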
## [1] "Hi Arthur, isn't it a nice day today?"
and if the result is assigned to a variable then nothing is printed:
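For instance (the variable name greeting is my choice):

```r
greeting <- hi("Arthur")   # nothing is printed
```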
K.3 Vectors
K.3.1 Vectorized operations
K.3.1.1 Extract April month row numbers
We just need to make a sequence from 3 till no more than 350 (number of rows) with step 12:
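```r
seq(3, 350, by = 12)
```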
## [1] 3 15 27 39 51 63 75 87 99 111 123 135 147 159 171 183 195 207 219 231
## [21] 243 255 267 279 291 303 315 327 339
K.3.1.2 Yu Huang and Guanyin in liquor store
We can just call the data age and cashier:
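The exact values are not shown here, so the data below is my example: the first two customers are served by Yu Huang and the third by Guanyin, and only the second customer is at least 21:

```r
age <- c(18, 30, 19)
cashier <- c("Yu Huang", "Yu Huang", "Guanyin")
```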
In normal language: you are able to buy if you are at least 21 years old, or if your cashier is Guanyin. This means the first customer cannot buy, but the other two can.
The expression is pretty much exactly the sentence above, written in R syntax:
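```r
age >= 21 | cashier == "Guanyin"
```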
## [1] FALSE TRUE TRUE
Note that we use >= to test that the age is at least 21, and == to test equality.
So the first customer cannot get the drink but the two others can.
K.3.1.3 Descriptive statistics
## [1] 5.5
## [1] 5.5
## [1] 5.5
So all averages are the same.
## [1] 5.5
## [1] 5.5
## [1] 1
Medians of x and y are the same, but that of z is just 1.
## [1] 1 10
## [1] -11 22
## [1] 1 55
Here the range is easily visible from how the vectors were created, so computation is not really needed. But this is usually not the case when the vectors originate from a large dataset.
## [1] 9.166667
## [1] 99.16667
## [1] 243
Variances are hard to judge manually, but they are different too.
So we summarized each of these vectors into five numbers (two of them for the range), despite the fact that the vectors were of different lengths.
K.3.1.4 Recycling where lengths do not match
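The expression in question (visible in the warning below) is:

```r
c(10, 20, 30, 40) + 1:3
```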
## Warning in c(10, 20, 30, 40) + 1:3: longer object length is not a multiple of
## shorter object length
## [1] 11 22 33 41
This is the warning message: as you can see, this operation results in incomplete recycling, where only the first element of the shorter vector, 1, was reused for the fourth component.
K.3.2 Vector indices
K.3.2.2 Extract positive numbers
We have data
height <- c(160, 170, 180, 190, 175) # cm
weight <- c(50, 60, 70, 80, 90) # kg
name <- c("Kannika", "Nan", "Nin", "Kasem", "Panya")
Height of everyone at least 180cm:
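```r
height[height >= 180]
```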
## [1] 180 190
Names of those at least 180cm:
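```r
name[height >= 180]
```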
## [1] "Nin" "Kasem"
Weight of all patients who are at least 180cm tall
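```r
weight[height >= 180]
```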
## [1] 70 80
Names of everyone who weighs less than 70kg
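```r
name[weight < 70]
```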
## [1] "Kannika" "Nan"
Names of everyone who is either taller than 170, or weighs more than 70.
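```r
name[height > 170 | weight > 70]
```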
## [1] "Nin" "Kasem" "Panya"
K.3.2.3 Character indexing: state abbreviations
First, we can set names to the state.abb
variable:
Note that we need to be sure that the names and abbreviations are in the same order! (They are, this is how the data is defined, see Section I.13.) This results in a named vector:
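```r
names(state.abb) <- state.name
head(state.abb, 5)
```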
## Alabama Alaska Arizona Arkansas California
## "AL" "AK" "AZ" "AR" "CA"
Now we can just extract the abbreviations:
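```r
state.abb[c("Utah", "Connecticut", "Nevada")]
```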
## Utah Connecticut Nevada
## "UT" "CT" "NV"
This is a common way to create lookup tables in R.
K.3.3 Modifying vectors
K.3.3.1 Wrong number of items
Feeding in a single item works perfectly:
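Assuming the initial vector was something like the one below (only the first element, “backpack”, is visible in the output; the other two items are my guesses):

```r
supplies <- c("backpack", "pen", "pencil")
supplies[c(2, 3)] <- "ipad"
supplies
```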
## [1] "backpack" "ipad" "ipad"
Now both elements 2 and 3 are “ipad”. This is because of the recycling rules (see Section 4.3.4): the shorter item (here “ipad”) will just be replicated as many times as needed (here twice).
But feeding in 3 elements results in a warning:
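The replacement, visible in the warning below, is:

```r
supplies[c(2, 3)] <- c("tablet", "book", "paper")
supplies
```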
## Warning in supplies[c(2, 3)] <- c("tablet", "book", "paper"): number of items to
## replace is not a multiple of replacement length
## [1] "backpack" "tablet" "book"
Otherwise the replacement works; just the last item, “paper”, is ignored.
K.3.3.2 Absolute value
We can do it explicitly in multiple steps:
x <- c(0, 1, -1.5, 2, -2.5)
iNegative <- x < 0 # which elements are negative
positive <- -x[iNegative] # flip the sign for negatives
# so you get the corresponding
# positives
x[iNegative] <- positive # replace negatives
x
## [1] 0.0 1.0 1.5 2.0 2.5
However, it is much more concise if done in a shorter form:
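One possibility:

```r
x <- c(0, 1, -1.5, 2, -2.5)
x[x < 0] <- -x[x < 0]   # flip the sign of the negatives in place
x
```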
## [1] 0.0 1.0 1.5 2.0 2.5
K.3.3.3 Managers’ rent
Here is the data:
income <- c(Shang = 1000, Zhou = 2000, Qin = 3000, Han = 4000)
rent <- c(Shang = 200, Zhou = 1000, Qin = 1700, Han = 2800)
This problem can be solved in two ways. First, the way it is stated in the problem:
b <- c(0, 0, 0, 0) # to begin with, befit "0" for everyone
iHR <- rent > 0.5*income # who is rent-burdened?
iHR # just for check
## Shang Zhou Qin Han
## FALSE FALSE TRUE TRUE
So Qin and Han are rent-burdened.
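```r
b[iHR] <- 0.25*rent[iHR]   # benefit: 25% of the rent
b
```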
## [1] 0 0 425 700
Here we replaced benefits for two people–we had to use iHR on both sides of the assignment.
We can also solve it the other way around (not asked in the problem statement): first we can compute the benefit for everyone, and thereafter replace it for the non-rent burdened with “0”:
b <- 0.25*rent # benefits to everyone
iLR <- rent <= 0.5*income # whose rent is low?
b[iLR] <- 0 # replace their benefits by 0.
b
## Shang Zhou Qin Han
## 0 0 425 700
Note that all replacement elements have the same value here, “0”.
K.4 Lists
K.4.1 Vectors and lists
The vector will be
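assuming the components 1, 2:4 and 5 from the exercise:

```r
c(1, 2:4, 5)
```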
## [1] 1 2 3 4 5
and the list
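```r
list(1, 2:4, 5)
```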
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2 3 4
##
## [[3]]
## [1] 5
The printout clearly shows that in the case of the vector we end up with 5 similar elements (just numbers). But the list contains three elements: the first and last are single numbers (well, more precisely, length-1 vectors), while the middle component is a length-3 vector.
As this example shows, one cannot easily print all list elements on a single row as is the case with vectors.
K.4.2 Print employee list
First re-create the same persons:
person <- list(name = "Ada", job = "Programmer", salary = 78000,
               union = TRUE)
person2 <- list("Ji", 123000, FALSE)
employees <- list(person, person2)
The printout looks like
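```r
employees
```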
## [[1]]
## [[1]]$name
## [1] "Ada"
##
## [[1]]$job
## [1] "Programmer"
##
## [[1]]$salary
## [1] 78000
##
## [[1]]$union
## [1] TRUE
##
##
## [[2]]
## [[2]][[1]]
## [1] "Ji"
##
## [[2]][[2]]
## [1] 123000
##
## [[2]][[3]]
## [1] FALSE
We can see our two employees here, Ada (at the first position) and Ji (at the second position). All element names for Ada are preceded with [[1]] and for Ji with [[2]]. These indicate the corresponding positions. The data for Ada and Ji is printed slightly differently, reflecting the fact that Ada’s components have names while Ji’s components do not. So Ada’s components use a $name-style tag while Ji’s components use a positional tag like [[1]].
K.5 How to write code
K.5.1 Divide and conquer
K.5.1.1 Patient names and weights
The recipe to display the names might sound like
- Take the vector of weights
- Find which weights are above 60kg
- Get names that correspond to those weights
- Print those
This recipe is a bit ambiguous though–“which weights” is not quite clear, and if you know how to work with vectors, it may mean either numeric positions (3 and 4) or a logical index (FALSE, FALSE, TRUE, TRUE, FALSE). But if you know the tools, you also know that both of these approaches are fine, so the ambiguity is maybe even a strength.
Second, if you know the tools, then you know that explicit printing may not be needed.
The recipe to display the weights may be like
- Take the vector of weights
- Find which weights are above 60kg
- Display those
This recipe works well if we have access to vectorized operations and indexing like what we have in R. But if we do not have access to these tools, we may instead write
- Take the array of weights
- Walk over every weight in this array
- Is the weight over 60kg?
- If yes, print it!
Which recipe do you prefer? Obviously, it depends on the tools you have access to.
Here is example code:
## Data
name <- c("Bao-chai", "Xiang-yun", "Bao-yu", "Xi-chun", "Dai-yu")
weight <- c(55, 56, 65, 62, 58) # kg
## Names
name[weight > 60] # simple, but does not follow the recipe closely
## [1] "Bao-yu" "Xi-chun"
## more complex, but follows the recipe more closely
i <- weight > 60
heavies <- name[i]
cat(heavies, "\n")
## Bao-yu Xi-chun
For weights, we have two similar options
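using the data defined above:

```r
weight[weight > 60]   # concise
i <- weight > 60      # or step-by-step, following the recipe
weight[i]
```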
## [1] 65 62
## [1] 65 62
K.5.2 Learning more
K.5.2.1 Time difference in days
Nowadays AI-based tools are fairly good at doing this. The figure at right shows ChatGPT’s answer (incorporated in Bing) to such a question. This answer is correct and can be incorporated into your code with only little adjustment. However, one should still look up what these functions do and what format = "%b %d, %Y" means.
However, the answer may not be enough if you do not know the basics of R, e.g. what the assignment operator <- or the comment character # is. Also, it lacks some context, and it does not discuss more efficient or simpler ways to achieve the same task. For instance, it does not suggest writing the dates in the ISO format YYYY-mm-dd, which would simplify the solution.
The as.Date() help page offers much more information than what ChatGPT gives. In particular, tryFormats and its default values are very useful. However, it also assumes more understanding of the workings of R, e.g. what ## S3 method for class 'character' exactly means, and which of the functions listed there one actually needs.
So AI tools are not a substitute for documentation (nor the other way around). AI is great for quickly getting a solution; in order to evaluate the solution, you need to know more. But as your time is valuable too–use AI for tasks where you do not need to go in depth, and learn the most important tools in depth.
Here is a simplified version of the ChatGPT-suggested solution:
dates <- as.Date(c("2023-10-16", "2023-11-12", "2014-07-03"))
# ISO dates do not need format specification
difftime(dates[2], dates[1], units="days")
## Time difference of 27 days
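The 3419-day difference below is presumably between the second and the third date:

```r
difftime(dates[2], dates[3], units = "days")
```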
## Time difference of 3419 days
When working with dates, you should also be familiar with the lubridate library and the tools therein.
K.5.3 Coding style
K.5.3.1 Variable names for election data
One of the decisions you need to make here is how to name the political parties. You definitely do not want to use the full names as those are very long. Here we are actually in a very good situation, as these parties have standard English abbreviation (BJP, INC and YSRCP).
Below is one option:
- The original data: elections. If there are more election-related things besides the dataset, we may call it electionData to stress that this is a dataset.
- Corrected original: electionsFixed.
- 2019 only: elections2019. This assumes we do not need a 2019 non-fixed version.
- Sub-datasets for parties: electionsBJP, electionsINC, electionsYSRCP.
- Winning districts only: winsBJP, winsINC, winsYSRCP.
Obviously, there are more options, e.g. if the project is very short, then you may replace elections with just e. If you need more, e.g. also 2024 election data, you may need variable names like elections2019BJP and wins2024INC.
You may also think about what to do if the data is about Japan instead, and the party you are interested in, 公明党, is abbreviated as 公明. (See Komeito.)
K.6 Conditional statements
K.6.1 if-statement
K.6.1.1 Tell if second string longer
This is quite a simple application of if and else:
compareStrings <- function(s1, s2) {
   if(nchar(s2) > nchar(s1)) {
      ## if the 2nd string is longer, then print
      cat("The second string is longer\n")
   }
   ## otherwise do nothing
}
compareStrings("a", "aa") # prints
## The second string is longer
K.6.1.2 Print if number even
- Here the logic is as follows:
  - print the number
  - if even, print “ - even”.
for(i in 1:10) {
   cat(i, "\n")   # print the number (and new line)
   if(i %% 2 == 0) {
      cat(" - even\n")   # print 'even' (and new line)
   }
}
## 1
## 2
## - even
## 3
## 4
## - even
## 5
## 6
## - even
## 7
## 8
## - even
## 9
## 10
## - even
- Now we need to think more about printing. It goes as follows:
  - print the number (no new line)
  - if even, print “ - even” (no new line)
  - add a new line, unconditionally.
for(i in 1:10) {
   cat(i)   # print the number, but do not switch to new line
   if(i %% 2 == 0) {
      cat(" - even")   # print 'even', do not switch to new line
   }
   cat("\n")   # switch to new line at the end of the row,
               # whatever number it is
}
## 1
## 2 - even
## 3
## 4 - even
## 5
## 6 - even
## 7
## 8 - even
## 9
## 10 - even
K.6.1.3 Print even/odd
The code is simple, and the printing is a bit simpler too:
for(i in 1:10) {
   cat(i)   # print the number, but do not switch to new line
   if(i %% 2 == 0) {
      cat(" even\n")   # print 'even' and new line
   } else {
      cat(" odd\n")
   }
}
## 1 odd
## 2 even
## 3 odd
## 4 even
## 5 odd
## 6 even
## 7 odd
## 8 even
## 9 odd
## 10 even
K.6.1.4 Going out with friends
money <- 200
nFriends <- 5
price <- 30
sum <- (nFriends + 1)*price   # friends + myself
total <- sum*1.15             # add tip
if(total > money) {
   cat("Cannot afford 😭\n")
} else {
   cat("Can afford ✌\n")
}
## Cannot afford 😭
K.6.1.5 Test porridge temperature
We just need to remove assignments and return()
:
test_food_temp <- function(temp) {
   if(temp > 120) {
      "This porridge is too hot!"
   } else if(temp < 70) {
      "This porridge is too cold!"
   } else {
      "This porridge is just right!"
   }
}
## The test results are the same:
test_food_temp(119) # just right!
## [1] "This porridge is just right!"
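The remaining two calls were presumably made with a temperature below 70 and above 120; for instance (the exact values are my choice):

```r
test_food_temp(69)    # too cold
test_food_temp(121)   # too hot
```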
## [1] "This porridge is too cold!"
## [1] "This porridge is too hot!"
In my opinion, shorter code is easier to read, but different people may have different opinions.
K.6.2 Conditional statements and vectors
K.6.2.1 Should you go to the boba place?
The problem is worded in a somewhat vague manner, so you may need to make it more specific. Here we assume that you only go if you can afford a drink–at least one drink. You do not need that all drinks are affordable.
This means you need to write code that checks if any tea is cheaper than $7.
K.6.2.2 Can you get a drink?
With the original prices:
price <- c(5, 6, 7, 8)
if(any(price <= 7)) {
   cat("You can get a drink\n")
} else {
   cat("This is a too expensive place\n")
}
## You can get a drink
If they raise the price by $3 across the board, then we can just add “3” to the price vector:
price <- price + 3
if(any(price <= 7)) {
   cat("You can get a drink\n")
} else {
   cat("This is a too expensive place\n")
}
## This is a too expensive place
The results are intuitively obvious–it is affordable using the original prices but not with the new prices.
K.6.2.3 absv() of a vector
The code crashes with a message
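The function in question was presumably the scalar version built with if-else, something like this (in recent R versions, calling it on a length-2 vector stops with an error):

```r
absv <- function(x) {
   if(x < 0) {
      -x
   } else {
      x
   }
}
absv(c(-3, 3))   # error: the condition has length > 1
```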
## Error in if (x < 0) {: the condition has length > 1
This is because here the code needs to make two decisions: one for “-3” and another for “3”. But if-else can only handle a single decision!
Note that the decisions for these two values differ–in the first case the code needs to flip the sign, and in the second case the sign must be preserved. But this does not play a role in terms of the error message; the problem here is two decisions, not the fact that the decisions are different.
K.6.2.4 Step function
The function produces different output depending on whether \(x \le 0\) or not. Hence we can use the condition x <= 0. From the step function definition, the true value is 0 and the false value 1:
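assuming x <- c(-1, 2, 0, 3) (my example values, chosen to reproduce the output):

```r
x <- c(-1, 2, 0, 3)
ifelse(x <= 0, 0, 1)
```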
## [1] 0 1 0 1
Alternatively, we can use the opposite condition x > 0 and flip the true and false values:
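```r
ifelse(x > 0, 1, 0)
```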
## [1] 0 1 0 1
Obviously, we can also define a function instead of just using ifelse(), although here it does not help us much because the code is so short:
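(the function name step is my choice):

```r
step <- function(x) {
   ifelse(x <= 0, 0, 1)
}
step(c(-1, 2, 0, 3))
```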
## [1] 0 1 0 1
K.6.2.5 Leaky relu
As leaky relu needs different behavior depending on whether \(x > 0\) or not, we can use the logical condition x > 0. From its definition, the true value is just x and the false value is 0.1*x:
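assuming x <- c(-3, 3, -1, 1) (my example values, chosen to match the output):

```r
x <- c(-3, 3, -1, 1)
ifelse(x > 0, x, 0.1*x)
```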
## [1] -0.3 3.0 -0.1 1.0
K.6.2.6 Sign function
This case is slightly more complex, but we can describe it as two separate cases:
- Pick the condition, for instance \(x < 0\). Now the true value is -1, but the false value depends on \(x\).
- In the false case, we have essentially the step function:
  - if \(x > 0\), the value is “1”
  - otherwise, the value is “0”

Note that we can only get to the “otherwise” if \(x = 0\), because if \(x < 0\), the first condition has already produced -1. So we can write here just ifelse(x > 0, 1, 0).
Combining these two ifelses, we have
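assuming x <- c(-3, 3, -1, 1, 0) (my example values):

```r
x <- c(-3, 3, -1, 1, 0)
ifelse(x < 0, -1, ifelse(x > 0, 1, 0))
```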
## [1] -1 1 -1 1 0
K.6.2.7 Are bowls too hot?
First we can use ifelse()
to find if the porridge is too hot or not:
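Assuming the temperatures temp <- c(110, 130, 100, 125) (my example values that match the output; the variable name msg is also my choice):

```r
temp <- c(110, 130, 100, 125)
msg <- ifelse(temp > 120, "too hot", "all right")
msg
```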
## [1] "all right" "too hot" "all right" "too hot"
Next, let’s compose the bowl id message
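(the variable name bowl is my choice):

```r
bowl <- paste("Bowl", 1:4)
bowl
```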
## [1] "Bowl 1" "Bowl 2" "Bowl 3" "Bowl 4"
Now it is just to combine these two messages:
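```r
paste(bowl, "is", msg)
```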
## [1] "Bowl 1 is all right" "Bowl 2 is too hot" "Bowl 3 is all right"
## [4] "Bowl 4 is too hot"
All this can also be achieved in a shorter form:
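```r
paste("Bowl", 1:length(temp), "is",
      ifelse(temp > 120, "too hot", "all right"))
```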
## [1] "Bowl 1 is all right" "Bowl 2 is too hot" "Bowl 3 is all right"
## [4] "Bowl 4 is too hot"
Note that I created the sequence of the correct length here using 1:length(temp) instead of hard-coding 1:4 as above.
K.6.3 A few useful and useless tools
K.6.3.1 Are elements in the set?
This is a straightforward application of %in%, all() and any():
vec <- c("a", "b", "c")
set <- c("c", "b", "d")
if(all(vec %in% set)) {
   cat("All in!\n")
} else if(any(vec %in% set)) {
   cat("Some in!\n")
} else {
   cat("None in!\n")
}
## Some in!
Note another advantage of %in% over a chain of OR operators: we can define the set of interest in a single place, and use it multiple times.
K.6.3.2 Southern states
Let’s start by defining the vector of states, and the set of southern states:
states <- c("Madhya Pradesh", "Orissa", "Andra Pradesh",
"Karnataka", "Gujarat", "Andra Pradesh",
"Kerala", "West Bengal",
"Punjab", "Karnataka")
south <- c("Telangana", "Andra Pradesh", "Karnataka",
"Tamil Nadu", "Kerala", "Puducherry")
Now we can easily test which state is in South:
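```r
states %in% south
```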
## [1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE
As creating the corresponding character vector involves a separate decision for each element of the states vector, we need to use ifelse():
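```r
ifelse(states %in% south, "South", "Not South")
```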
## [1] "Not South" "Not South" "South" "South" "Not South" "South"
## [7] "South" "Not South" "Not South" "South"
K.6.3.3 x == TRUE
versus x
- The condition x == TRUE can only be TRUE if x equals TRUE–either it is the logical TRUE, or a value that compares equal to it, such as the number 1.
- In if(x), if x is of a different type, it will be implicitly converted to logical, if possible. (R cannot automatically convert more complex data types, such as lists.) The number “0” will be converted to FALSE, and all other numbers to TRUE. Character strings are, in general, not converted: only strings like "TRUE", "true" or "T" (and their FALSE counterparts) can be converted; anything else causes an error inside if().
- The example code:
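The example presumably used a number that is truthy but does not equal TRUE; for instance (the value 2 and the printed strings are my guesses):

```r
x <- 2
if(x == TRUE) {
   print("equal to TRUE")
}
if(x) {
   print("true")
}
```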
## [1] "true"
The first expression, x == TRUE, results in FALSE because, well, x does not equal TRUE. However, the second if converts x to logical. This will be TRUE, and hence the message is printed.
So if(x == TRUE) and if(x) are not exactly the same. But it is bad practice to write code in a way where x can be of different types, sometimes logical, sometimes not. Such code is too hard to understand.
K.7 File system tree
K.7.1 File system tree
K.7.1.2 Sketch your picture folder tree
Here is mine. I have picked mostly shorter example names, just to fit those on the figure.
K.7.1.4 Matlab accessing matrix.dat
From amath352 to matrix.dat we can move as (see the figure):
- up (into UW)
- up (into Documents)
- up (into Yucun’s stuff)
- into Downloads
- grab matrix.dat from there
Or in the short form:
"../../../Downloads/matrix.dat"
Again, we should not start with amath352 as we already are there.
K.7.1.5 Get picture from info201
Again, this is different on your computer. But given that my file system tree looks like the one above, the corresponding list of instructions is:
- up (to teaching)
- up (to tyyq)
- up (to my stuff)
- into Pictures
- into Nature
- grab the green-lake-ice.jpg from there.
In the short form, it is
"../../../Pictures/Nature/green-lake-ice.jpg"
Note that I do not keep pictures directly in the Pictures folder, but in subfolders inside it. If you do, the descent into Nature will be unnecessary.
K.7.1.7 Absolute path of an image
Suppose I have an image “fractal.png” inside my Pictures folder which, in turn, is in my home folder. Assume further that I am using Windows and my home folder is on the “D:” drive. The long directions might look like:
- start at root “This PC”
- go to drive “D:”
- go to “Users”
- go to “siim” (assume “siim” is my user name)
- go to “Pictures”
- grab “fractal.png” from there.
In the short form it is
D:/Users/siim/Pictures/fractal.png
Note that we do not use the root “This PC” when writing paths on Windows.
K.7.1.8 Absolute path of the home folder
Obviously, this is different for every user and every computer. Here is mine on my home computer. I have marked a few other folders (etc – system configuration files, and usr – installed applications).
There are multiple ways to see where in the file system tree it is located, one option is to use file managers. Here is an example that shows the path in Gnome file selector. Note that root is denoted by a hard disk icon, and the home folder siim is combined with a home icon.
K.7.1.9 Yucun moving his project
- If he is using an absolute path (it might be "/Users/yucun/Documents/data/data.csv"), then it does not change. This is because an absolute path always starts from the file system root, and the file system root does not change if you move your files and folders around–as long as the file in question (data.csv) remains in place.
- If he moves the data to a different computer… then he probably has to change the paths. Most importantly, the other computer may not have the data folder inside the Documents folder, but somewhere else. Second, the other computer may also have a different file system tree, e.g. if the other one is a PC, his home folder may be "C:/Users/yucun" instead. A relative path is of no help here, unless the other computer has a similar file and folder layout.
K.7.2 Accessing files from R
K.7.2.1 R working directory path type
This is an absolute path: you can see this because “/home/siim/tyyq/info201-book” starts with the root symbol /. See more in Sections 9.1.2 and 9.1.3.
K.7.2.2 RStudio console working directory
The only way to see it is to run getwd() in the RStudio console. You can run it directly, or you can also execute a line of a script. What matters is that it runs on the console.
The example here shows “/home/siim/tyyq/teaching/info201/inclass” as the current working directory.
K.7.2.3 List files in R and in graphical manager
Assume the current working directory is “/home/siim/tyyq/teaching/info201/inclass” as in the exercise above.
We can use list.files()
to see files here.
And here are the same files, seen through the eyes of a graphical file manager (PCManFM). Note the navigation bar above the icons that displays the absolute path of the folder, and the side pane that displays the file system tree (a small view of it only).
It is easy to see that the files are the same. Note that R normally sorts files alphabetically, but file managers may show these in different ways, either alphabetically, by creation time, or you may even manually position individual icons. All this may be configured differently on your computer!
You can also see that here, both R and the file manager show all names in the same way, including the complete extensions like .R or .jpg. This may be different on your computer (and can be changed).
K.8 Data Frames
K.8.1 What is data frame
K.8.2 Working with data frames
K.8.2.1 Countries and capitals
Appropriate names are country for the country, capital for its capital, and population for the population. We call the data frame countries (plural) to distinguish it from the individual variables. Obviously, one can come up with other names. We can create the data frame as
countries <- data.frame(
country = c("Gabon", "Congo", "DR Congo", "Uganda", "Kenya"),
capital = c("Libreville", "Brazzaville", "Kinshasa", "Kampala", "Nairobi"),
population = c(2.340, 5.546, 108.408, 45.854, 55.865))
countries
## country capital population
## 1 Gabon Libreville 2.340
## 2 Congo Brazzaville 5.546
## 3 DR Congo Kinshasa 108.408
## 4 Uganda Kampala 45.854
## 5 Kenya Nairobi 55.865
where population is in Millions (2022 estimates from Wikipedia).
We can extract the country names by dollar notation as
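```r
countries$country
```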
## [1] "Gabon" "Congo" "DR Congo" "Uganda" "Kenya"
and population with double brackets as
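```r
countries[["population"]]
```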
## [1] 2.340 5.546 108.408 45.854 55.865
Capital using indirect name:
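(the variable name v is my choice):

```r
v <- "capital"
countries[[v]]
```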
## [1] "Libreville" "Brazzaville" "Kinshasa" "Kampala" "Nairobi"
K.8.3 Accessing Data in Data Frames
K.8.3.1 Indirect variable name with dollar notation
With dollar notation, R interprets the name of the workspace variable (which contains the data variable name) as the data variable name itself:
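For instance, using the countries data frame from above (the workspace variable var holds a valid column name):

```r
var <- "country"
countries$var   # R looks for a column literally called "var"
```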
## NULL
As you see, R is looking for a data variable called var. As it cannot find it, it returns NULL, the special code for an empty element.
K.8.3.2 Loop of columns of a data frame
- Column names. No loop needed here:
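using the emperors data frame from the text:

```r
names(emperors)
```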
## [1] "name" "born" "throned" "ruled" "died"
- Print names in loop. We can just loop over the names:
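```r
for(n in names(emperors)) {
   cat(n, "\n")
}
```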
## name
## born
## throned
## ruled
## died
- Print name and column. We need indirect access here as the column name is now stored in a variable (called n below). So we can access it as emperors[[n]]:
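```r
for(n in names(emperors)) {
   cat(n, "\n")
   print(emperors[[n]])
}
```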
## name
## [1] "Qin Shi Huang" "Napoleon Bonaparte" "Nicholas II"
## [4] "Mehmed VI" "Naruhito"
## born
## [1] -259 1769 1868 1861 1960
## throned
## [1] -221 1804 1894 1918 2019
## ruled
## [1] "China" "France" "Russia" "Ottoman Empire"
## [5] "Japan"
## died
## [1] -210 1821 1918 1926 NA
- Print name and type. This is similar to the above, except now we print is.numeric(emperors[[n]]).
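```r
for(n in names(emperors)) {
   cat(n, "is numeric:", is.numeric(emperors[[n]]), "\n")
}
```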
## name is numeric: FALSE
## born is numeric: TRUE
## throned is numeric: TRUE
## ruled is numeric: FALSE
## died is numeric: TRUE
- Print name and minimum. Now use the TRUE/FALSE result in a logical test, and only print the minimum if it is TRUE:
for(n in names(emperors)) {
   cat(n, "")
   if(is.numeric(emperors[[n]])) {
      cat(min(emperors[[n]]))
   }
   cat("\n")
}
## name
## born -259
## throned -221
## ruled
## died NA
Note: you may want to use min(emperors[[n]], na.rm = TRUE) to avoid the missing minimum for the died column.
K.8.3.3 Emperors who died before 1800
Pure dollar notation is almost exactly the same as the example in the text:
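```r
emperors$name[emperors$died < 1800]
```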
## [1] "Qin Shi Huang" NA
When using double brackets at the first place, we have
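```r
emperors[["name"]][emperors$died < 1800]
```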
## [1] "Qin Shi Huang" NA
Note that we have a weird construct here: [[...]][...]. It looks odd, but it works perfectly well: emperors[["name"]] is a vector, and a vector can be indexed using [...].
When we put double brackets in both places, we get
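```r
emperors[["name"]][emperors[["died"]] < 1800]
```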
## [1] "Qin Shi Huang" NA
This is perhaps the “heaviest” notation, where it may be hard to keep track of the brackets. However, it is a perfectly valid way to extract emperors!
Finally, the NA in the output is related to Naruhito. As we do not know his year of death, R sends a message that there is one name for which we do not know whether he died before 1800. It is a little stupid–as Naruhito is alive today, he cannot have died before 1800. But we haven’t explained this knowledge to R.
K.8.3.4 Single-bracket data access (emperors)
Extract 3rd and 4th row:
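```r
emperors[3:4, ]
```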
## name born throned ruled died
## 3 Nicholas II 1868 1894 Russia 1918
## 4 Mehmed VI 1861 1918 Ottoman Empire 1926
All emperors who died in 20th century:
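using the condition died > 1900 (for this data, the years 1918 and 1926 qualify):

```r
emperors[emperors$died > 1900, ]
```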
## name born throned ruled died
## 3 Nicholas II 1868 1894 Russia 1918
## 4 Mehmed VI 1861 1918 Ottoman Empire 1926
## NA <NA> NA NA <NA> NA
This will still give us NA for Naruhito–we have not told R in any way that someone who was alive in 2023 cannot have died in the 20th century. If an NA is not desired, one can use which():
## name born throned ruled died
## 3 Nicholas II 1868 1894 Russia 1918
## 4 Mehmed VI 1861 1918 Ottoman Empire 1926
Name and country of those emperors
## name ruled
## 3 Nicholas II Russia
## 4 Mehmed VI Ottoman Empire
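The hidden extraction code might look as follows (a sketch, again using a data frame reconstructed from the printout above):

```r
emperors <- data.frame(
  name = c("Qin Shi Huang", "Napoleon Bonaparte", "Nicholas II",
           "Mehmed VI", "Naruhito"),
  born = c(-259, 1769, 1868, 1861, 1960),
  throned = c(-221, 1804, 1894, 1918, 2019),
  ruled = c("China", "France", "Russia", "Ottoman Empire", "Japan"),
  died = c(-210, 1821, 1918, 1926, NA)
)
emperors[3:4, ]  # 3rd and 4th row
## died in the 20th century: the NA row for Naruhito shows up
emperors[emperors$died >= 1900 & emperors$died < 2000, ]
## which() drops the NA row
i <- which(emperors$died >= 1900 & emperors$died < 2000)
emperors[i, ]
emperors[i, c("name", "ruled")]  # name and country only
```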
K.8.3.5 Patients aging
First create the data frame:
Name <- c("Ada", "Bob", "Chris", "Diya", "Emma")
Inches <- c(58, 59, 60, 61, 62)
Pounds <- c(120, 120, 150, 150, 160)
age <- c(22, 33, 44, 55, 66)
patients <- data.frame(Name, Inches, Pounds, age)
patients
## Name Inches Pounds age
## 1 Ada 58 120 22
## 2 Bob 59 120 33
## 3 Chris 60 150 44
## 4 Diya 61 150 55
## 5 Emma 62 160 66
Adding a single year of age involves just modifying data, but we do not need to filter anything as this applies to everyone:
## Name Inches Pounds age
## 1 Ada 58 120 23
## 2 Bob 59 120 34
## 3 Chris 60 150 45
## 4 Diya 61 150 56
## 5 Emma 62 160 67
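The aging step itself (hidden above) might be as simple as incrementing the age column in place:

```r
patients <- data.frame(Name = c("Ada", "Bob", "Chris", "Diya", "Emma"),
                       Inches = c(58, 59, 60, 61, 62),
                       Pounds = c(120, 120, 150, 150, 160),
                       age = c(22, 33, 44, 55, 66))
patients$age <- patients$age + 1  # everyone gets one year older
patients
```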
K.8.4 R built-in datasets
K.8.4.1 co2 data
Let’s take a look at the data:
## [1] 315.42 316.31 316.50 317.56 318.13 318.00
It looks like a numeric vector, but more specifically it is a time series (“ts”) object, as can be seen with class():
## [1] "ts"
The name suggests that this is some kind of CO2 data. The help page (accessed with ?co2) indicates that this is Mauna Loa observatory CO2 data, measured in parts per million (ppm), available for each month from 1959 till 1997.
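As co2 ships with base R, these checks are easy to reproduce:

```r
head(co2)       # the first values, in ppm
class(co2)      # "ts" -- a time series object
start(co2)      # 1959, month 1
end(co2)        # 1997, month 12
frequency(co2)  # 12 observations per year, i.e. monthly data
```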
K.8.5 Learning to know your data
K.8.5.1 CSGO column averages
This can be achieved in a fairly simple fashion by extending the example with for-loop:
csgo <- read_delim("data/csgo-reviews.csv.bz2")
for(col in names(csgo)) {
if(class(csgo[[col]]) == "numeric") {
cat(col, ": ", mean(csgo[[col]]), "\n", sep = "")
}
}
## nHelpful: 620.1343
## nFunny: 6.217729
## nScreenshots: 215.5503
## hours: 805.3682
## nGames: 100.7909
## nReviews: 7.609077
Actually, it is better to write the code not as class(x) == "numeric" but as inherits(x, "numeric"). This is because a column may have multiple classes, and in that case the == comparison returns more than one value, and if() will give an error.
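The problem is easy to demonstrate with an object that has more than one class; Sys.time() is just a convenient example:

```r
tx <- Sys.time()
class(tx)                # c("POSIXct", "POSIXt") -- two classes
class(tx) == "POSIXct"   # a length-2 logical; if() on it errors in recent R
inherits(tx, "POSIXct")  # TRUE -- a single, robust answer
```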
K.8.5.2 Implausible ice extent/area values
Load data:
## [1] 1062 7
Are there any NA-s?
## [1] 0
## [1] 0
Apparently, all values are valid.
The area cannot be negative. The same is true for extent, which is also an area–the area of a specific ice concentration. It is harder to come up with a maximum plausible value, but sea ice area cannot exceed the total world sea surface (361M km2 according to Wikipedia). Hence the plausible values must be in the range \([0, 361]\).
Are all values plausible?
## [1] -9999.00 19.76
## [1] -9999.00 15.75
All is well with the upper limit–it is much smaller than 361. But some of the values are negative, in particular \(-9999\). This cannot be a valid value and appears to be a way to code missing data.
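A natural follow-up is to recode the −9999 sentinel as missing. On a toy vector (recoding the actual ice columns would work the same way):

```r
x <- c(19.76, -9999, 15.75, -9999)  # toy data with -9999 sentinels
x[x == -9999] <- NA                 # recode the sentinel as missing
range(x, na.rm = TRUE)              # now within the plausible range
```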
K.8.5.3 Explore home/destination
We can explore the destinations in a similar fashion as above:
[1] "St Louis, MO"
[2] "Montreal, PQ / Chesterville, ON"
[3] "New York, NY"
[4] "Hudson, NY"
[5] "Belfast, NI"
[6] "Bayside, Queens, NY"
[7] "Montevideo, Uruguay"
[8] "Paris, France"
[9] NA
[10] "Hessle, Yorks"
[11] "Montreal, PQ"
...
The excerpt here shows a number of plausible values, such as “St Louis, MO”. We also see that some values are missing. Unfortunately, there are too many different values,
## [1] 370
so it is very hard to look at all of these manually and decide whether all are plausible.
If necessary, one can try other options, e.g. test whether the locations contain only valid characters, or even attempt to geo-locate these places with e.g. the Google Maps API.
K.8.5.4 Which value is missing in table()
If you compare the values carefully, you see that NA is missing in the table.
The documentation of table()
shows:
?table
...
useNA: whether to include ‘NA’ values in the table. See ‘Details’.
Can be abbreviated.
...
This means that you can ask the table to include missings through the useNA argument, e.g.
##
## 1 10 11 12 13 13 15 13 15 B 14 15 15 16
## 5 29 25 19 39 2 1 33 37 1
## 16 2 3 4 5 5 7 5 9 6 7 8
## 23 13 26 31 27 2 1 20 23 23
## 8 10 9 A B C C D D <NA>
## 1 25 11 9 38 2 20 823
"ifany" will show the number of missings, if there are any missings. Here we have 823 missings.
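The behavior is easy to see on a small toy vector:

```r
x <- c("A", "B", NA, "B", NA, NA)  # toy data with missings
table(x)                           # NA-s are silently dropped
table(x, useNA = "ifany")          # NA-s get their own category
```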
K.9 dplyr
K.9.1 Grammar of data manipulation
K.9.1.1 How many trees over size 100?
We can do something like this:
- Take the orange tree dataset
- keep only rows that have size > 100
- pull out the tree number
- find all unique trees
- how many unique trees did you find?
Obviously, you can come up with different lists, e.g. items 4 and 5 might be combined into one. They are kept separate here because each of these two items corresponds to a single function in base R.
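Items 4 and 5 correspond to the base-R functions unique() and length(), so for the built-in Orange dataset the whole recipe collapses into a single base-R line:

```r
## trees with at least one measurement over size 100
length(unique(Orange$Tree[Orange$circumference > 100]))  # 5
```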
K.9.1.2 Two ways to find the largest tree
The difference is in how the recipe breaks ties for the largest tree. If there are two largest trees of equal size, they will be put in an arbitrary order. The first recipe picks one of the largest trees, but not both. The second recipe extracts all trees of maximum size, so it can find all such trees.
In practice, it is more useful not to order the trees and pick the first, but rank them with an explicit way to break ties. For instance
## [1] 1 3 1
will tell us that both the first and the third tree share “first place” in descending order. See more with ?rank.
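The tie-breaking can be reproduced on a toy vector; ties.method = "min" is one explicit choice (the hidden code may use dplyr's desc() instead of the minus sign):

```r
x <- c(120, 100, 120)          # two trees tie for the largest size
rank(-x, ties.method = "min")  # 1 3 1: both share first place
```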
K.9.2 Most important dplyr functions
K.9.2.1 Add decade to babynames
We can compute decade by first integer-dividing year by 10, and then multiplying the result by 10:
## # A tibble: 5 × 6
## year sex name n prop decade
## <dbl> <chr> <chr> <int> <dbl> <dbl>
## 1 1973 F Arletta 11 0.00000708 1970
## 2 1990 M Iven 6 0.00000279 1990
## 3 2000 M George 3037 0.00145 2000
## 4 1994 F Falicia 40 0.0000205 1990
## 5 2014 F Tabetha 5 0.00000256 2010
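The decade computation (hidden above) can be sketched as follows; the toy tibble stands in for babynames, which requires the babynames package:

```r
library(dplyr)

baby <- tibble(year = c(1973, 1990, 2000, 1994, 2014))
baby <- baby %>%
  mutate(decade = year %/% 10 * 10)  # integer-divide, then scale back up
baby
```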
K.9.2.2 How many names over all years
We just need to add up the count variable n:
## # A tibble: 1 × 1
## n
## <int>
## 1 348120517
K.9.2.3 Shiva for boys/girls
The task list might look like this:
- filter to keep only boys (or only girls)
- filter to keep only name “Shiva”
- summarize this dataset by adding up all counts n
There are, obviously, other options; for instance, you can swap the filter by sex and the filter by name.
## # A tibble: 1 × 1
## `sum(n)`
## <int>
## 1 397
## # A tibble: 1 × 1
## `sum(n)`
## <int>
## 1 249
K.9.3 Combining dplyr operations
The tasklist for this question (see above) might be:
- Take the orange tree dataset
- keep only rows that have size > 100
- pull out the tree number
- find all unique trees
- how many unique trees did you find?
This can be translated to code as:
## [1] 5
So there are 5 different trees.
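The translated code (hidden above) might look like this in dplyr, using the built-in Orange dataset:

```r
library(dplyr)

nTrees <- Orange %>%
  filter(circumference > 100) %>%  # keep only rows with size > 100
  pull(Tree) %>%                   # pull out the tree number
  unique() %>%                     # find all unique trees
  length()                         # how many did we find?
nTrees  # 5
```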
K.9.4 Grouped operations
K.9.4.1 Titanic fare by class
The computations are pretty much the same as the example in the text:
titanic %>%
group_by(pclass) %>%
summarize(avgFare = mean(fare, na.rm=TRUE),
maxFare = max(fare, na.rm=TRUE),
avgAge = mean(age, na.rm=TRUE),
maxAge = max(age, na.rm=TRUE)
)
## # A tibble: 3 × 5
## pclass avgFare maxFare avgAge maxAge
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 87.5 512. 39.2 80
## 2 2 21.2 73.5 29.5 70
## 3 3 13.3 69.6 24.8 74
The results make sense, as the first class is the most expensive, and the third class the cheapest option. However, it is hard to see why the most expensive 3rd-class ticket cost so much more than the 2nd-class average. It is also reasonable that older people are more likely to travel in the upper classes, as they may be wealthier, and their health may be more fragile.
K.9.4.2 Most distinct names
Here we compute the number of distinct names for each year, order the result by \(n\), and print the first three lines:
## # A tibble: 3 × 2
## year n
## <dbl> <int>
## 1 2008 32510
## 2 2007 32416
## 3 2009 32242
Apparently, these years are all in the late 2000-s.
K.9.4.3 Most popular boy and girl names
The only difference here is to group by year and sex:
babynames %>%
filter(between(year, 2002, 2006)) %>%
group_by(year, sex) %>%
arrange(desc(n), .by_group = TRUE) %>%
summarize(name = name[1])
## # A tibble: 10 × 3
## # Groups: year [5]
## year sex name
## <dbl> <chr> <chr>
## 1 2002 F Emily
## 2 2002 M Jacob
## 3 2003 F Emily
## 4 2003 M Jacob
## 5 2004 F Emily
## 6 2004 M Jacob
## 7 2005 F Emily
## 8 2005 M Jacob
## 9 2006 F Emily
## 10 2006 M Jacob
As we can see, these are just Emily and Jacob.
K.9.4.4 Three most popular names
The first 3 names in terms of popularity can just be filtered using the condition rank(desc(n)) <= 3:
## # A tibble: 15 × 5
## # Groups: year [5]
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2002 M Jacob 30568 0.0148
## 2 2002 M Michael 28246 0.0137
## 3 2002 M Joshua 25986 0.0126
## 4 2003 F Emily 25688 0.0128
## 5 2003 M Jacob 29630 0.0141
## 6 2003 M Michael 27118 0.0129
## 7 2004 F Emily 25033 0.0124
## 8 2004 M Jacob 27879 0.0132
## 9 2004 M Michael 25454 0.0121
## 10 2005 F Emily 23937 0.0118
## 11 2005 M Jacob 25830 0.0121
## 12 2005 M Michael 23812 0.0112
## 13 2006 M Jacob 24841 0.0113
## 14 2006 M Michael 22632 0.0103
## 15 2006 M Joshua 22317 0.0102
As you can see, these are various combinations of “Jacob”, “Michael”, “Joshua” and “Emily”.
K.9.4.5 10 most popular girl names after 2000
This is just about keeping girls only, and arranging by popularity afterward:
babynames %>%
filter(sex == "F",
year > 2000) %>%
group_by(name) %>%
summarize(n = sum(n)) %>%
filter(rank(desc(n)) <= 5) %>%
arrange(desc(n))
## # A tibble: 5 × 2
## name n
## <chr> <int>
## 1 Emma 327254
## 2 Emily 298119
## 3 Olivia 290625
## 4 Isabella 285307
## 5 Sophia 265572
We can see that “Emma” has been the most popular.
K.9.4.6 Most popular name by decade
This is a noticeably trickier task:
- First we need to compute the decade; this can be done using integer division %/% as (year %/% 10)*10.
- Thereafter, we need to add all counts n for each name and decade. Hence we group by name and decade, and sum n.
- Thereafter, we need to rank the popularity for each decade. Hence we group again, but now just by decade.
We can do it along these lines:
babynames %>%
mutate(decade = year %/% 10 * 10) %>%
group_by(name, decade) %>%
summarize(n = sum(n)) %>%
group_by(decade) %>%
filter(rank(desc(n)) == 1) %>%
arrange(decade)
## # A tibble: 14 × 3
## # Groups: decade [14]
## name decade n
## <chr> <dbl> <int>
## 1 Mary 1880 92030
## 2 Mary 1890 131630
## 3 Mary 1900 162188
## 4 Mary 1910 480015
## 5 Mary 1920 704177
## 6 Robert 1930 593451
## 7 James 1940 798225
## 8 James 1950 846042
## 9 Michael 1960 836934
## 10 Michael 1970 712722
## 11 Michael 1980 668892
## 12 Michael 1990 464249
## 13 Jacob 2000 274316
## 14 Emma 2010 158715
We see that in the early years, “Mary” was leading the pack; later, mostly the boy names have dominated.
Note the third line group_by(name, decade). For each decade, this makes groupings based on name only, not separately for name and sex. Hence for names that were given to both boys and girls, we add up all instances across genders.
K.9.4.7 “Mei” by decade
The final code might look like
babynames %>%
filter(sex == "F") %>%
mutate(decade = (year %/% 10) * 10) %>%
group_by(name, decade) %>%
summarize(n = sum(n)) %>% # popularity over all 10 years!
group_by(decade) %>%
mutate(k = rank(desc(n))) %>%
filter(name == "Mei")
## # A tibble: 8 × 4
## # Groups: decade [8]
## name decade n k
## <chr> <dbl> <int> <dbl>
## 1 Mei 1940 18 6274.
## 2 Mei 1950 15 8015
## 3 Mei 1960 36 7082
## 4 Mei 1970 111 5149
## 5 Mei 1980 136 5356.
## 6 Mei 1990 191 5176
## 7 Mei 2000 385 3788.
## 8 Mei 2010 356 3560.
We see that “Mei” has gained in popularity over time, moving from around 6000th place in the 1940-s up to around 3500th place in the 2010-s.
A reminder here: the counts n in the table are probably underestimates–names are only included if they are given at least 5 times.
K.9.5 More advanced dplyr usage
K.9.5.1 Sea and Creek 1980-2000
We can just filter the required years and the required names, both using %in%:
## # A tibble: 2 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1985 M Sea 6 0.00000312
## 2 2000 M Creek 7 0.00000335
We can see that these names were not popular, but both were given over five times to boys.
K.9.5.2 Name popularity frequency table
Here we want to count how many times the values \(n=5\), \(n=6\), and so on occur. So we just count them:
## # A tibble: 5 × 2
## n nn
## <int> <int>
## 1 544 1
## 2 2295 1
## 3 843 1
## 4 526 2
## 5 1101 2
Over all time: we need to aggregate \(n\):
## # A tibble: 5 × 2
## n nn
## <int> <int>
## 1 8423 1
## 2 196 60
## 3 1347 3
## 4 10512 1
## 5 993 6
K.10 ggplot2
K.10.1 Basic plotting with ggplot2
K.10.2 Most important plot types
K.10.2.1 COVID-Scandinavia with combined line-point plot
covS <- read_delim(
"data/covid-scandinavia.csv.bz2") %>%
filter(date > "2020-03-01",
date < "2020-07-01") %>%
filter(type == "Deaths") %>%
select(country, date, count)
covS %>%
ggplot(aes(date, count,
col = country)) +
geom_line() +
geom_point()
Here the result does look less appealing than just the line plot. The reason is that the points are too densely placed. In the Swedish case we can still distinguish points but not see any lines between those, in the other cases all dots overlap, essentially forming thicker lines.
Combined plots are only useful if the data points are sparse. Here there are points everywhere on the curves, so marking their locations only makes the result more confusing.
K.10.2.2 Orange tree barplot in different colors
We can just add the aesthetic fill=Tree to make the bar colors different for different trees:
Remember that it is the fill aesthetic that controls the fill color, not the col aesthetic!
But here the colors do not contain any information that is not already embedded in the bars. While colors are usually a nice visual feature, they may be misleading in some cases, making the viewer believe that the colors have a distinct meaning, separate from the bars.
K.10.2.3 Histogram of Titanic data
Here is the age histogram:
30 bins seems a good choice here.
Here is the fare histogram:
A larger number of bins is better here, in order to make more bins available for the cheaper tickets, less than 100£, where we have most of the data.
As you see, age is distributed broadly normally, but fare is more like log-normal, with a long right tail of very expensive tickets. Why is it like that? It is broadly related to the fact that human age has a pretty hard upper limit, but no such limit exists for wealth. There were very wealthy passengers, but no-one could have been 500 years old.
K.10.2.4 Iris’ petal length distribution
The histogram is clearly bimodal:
One group of iris flowers have petals shorter than 2cm, the other group has petals that are about 5cm long.
In my opinion, it resembles neither the price nor the age histogram–although the age diagram shows a small second peak for children.
The reason for such bimodal distribution can be understood by looking at the petal dimension for individual species:
## # A tibble: 3 × 3
## Species min max
## <fct> <dbl> <dbl>
## 1 setosa 1 1.9
## 2 versicolor 3 5.1
## 3 virginica 4.5 6.9
Setosa petals are all less than 2cm long while versicolor and virginica have petals that are at least 3cm long. Hence the bimodal distribution indicates that we have different groups of observations, here different species.
Note also that we can easily differentiate setosa from the two other species, but we cannot easily disentangle versicolor and virginica.
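The species summary shown above can be reproduced with a grouped summary (iris ships with R; the book's own code is collapsed here):

```r
library(dplyr)

iris %>%
  group_by(Species) %>%
  summarize(min = min(Petal.Length),
            max = max(Petal.Length))
```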
K.10.2.5 Diamond price in a narrow range
Here is the price distribution for mass range \([0.45,0.5]\)ct.
And here for \([0.95,1]\)ct.
Now it is fairly obvious that better cut is associated with higher price.
K.10.2.6 Petal length by species as boxplot
We can see that all setosa petals are shorter than any of the versicolor and virginica petals. This is because the largest setosa outlier (1.9cm) is smaller than any versicolor outlier (3cm). virginica does not have any outliers shown, hence its smallest value is the lower whisker (4.5cm). This is the same message that we got from the exercise above.
In a similar fashion, we see that the upper whisker of versicolor is above the lower whisker of virginica. This means that the longest petals of versicolor are longer than the shortest petals of virginica. Hence we have an overlap.
K.10.2.7 Which plot type?
- Average ticket price is a continuous value while passenger class is a discrete value. Barplot is well suited for this task, but scatterplot and line plot may also work.
- Here you want to display relationship between a continuous distribution (age) and a categorical variable (passenger class). Boxplot is designed for this task, but you may also try density plot, violin plot, and multiple histograms.
K.10.2.8 Fatalities by state
Here we can just make three lines (or line/point combinations) of distinct color–one for each state:
read_delim("data/fatalities.csv") %>%
ggplot(aes(year, fatal, col = state)) +
geom_line() +
geom_point()
We can see that Minnesota and Oregon have comparable numbers of traffic deaths, but there are more fatalities in Washington. However, the figure does not tell whether one state is larger than another one.
K.10.2.9 Covid cases by country
Here we do not specify the col aesthetic, and use group instead:
## Load and filter data
covS <- read_delim(
"data/covid-scandinavia.csv.bz2") %>%
filter(date > "2020-03-01",
date < "2020-07-01") %>%
filter(type == "Confirmed") %>%
select(country, date, count)
## Make the plot
covS %>%
ggplot(aes(date, count,
group = country)) +
# group is important!!!
geom_line() +
theme(text = element_text(size=15))
You may want to label the countries, see Section 13.8.1.
K.10.2.10 Point shape to mark the population size
This will just give an error:
read_delim("data/fatalities.csv") %>%
ggplot(aes(year, fatal,
shape = pop)) +
# point shape depends on population
geom_line() +
geom_point()
## Error in `geom_line()`:
## ! Problem while computing aesthetics.
## ℹ Error occurred in the 1st layer.
## Caused by error in `scale_f()`:
## ! A continuous variable cannot be mapped to the shape aesthetic.
## ℹ Choose a different aesthetic or use `scale_shape_binned()`.
K.10.2.11 Orange tree growth with/without factors
Here is the plot without converting Tree to a factor:
And here is the same plot, but now with converting Tree to a factor:
Most importantly, in the first case the different trees are not separated into different lines, as R does not know that the numeric tree id-s are actually discrete values. However, in the second case we tell it, and hence the lines are distinct.
In the first case we also have continuous colors (shades of blue); in the second case we have discrete colors.
K.10.2.12 Titanic age distribution with/without factors
Here is the plot without converting pclass to a factor:
And here is the same plot, but now with converting pclass to a factor:
Now we have a separate box for each class. We can see that upper classes are older.
When we attempt to compare the distributions by different values of a continuous variable, we are in a similar situation as when trying to split lines according to a continuous value: ggplot does not know which continuous values should be grouped together, and hence does not do any grouping at all. A solution is to convert the continuous variable to a discrete one using factor().
K.10.3 Inheritance
K.10.3.1 Ice extent in January
Everything in color:
ice <- read_delim("data/ice-extent.csv.bz2")
ice %>%
filter(month == 2) %>%
ggplot(aes(year, extent, col = region)) +
geom_line() +
geom_point()
Gray lines:
ice %>%
filter(month == 2) %>%
ggplot(aes(year, extent, col = region)) +
geom_line(aes(group = region),
col = "gray80",
linewidth = 2) +
geom_point()
3 Months in north:
ice %>%
filter(month %in% c(2, 5, 9)) %>%
filter(region == "N") %>%
ggplot(aes(year, extent, col = factor(month))) +
geom_line() +
geom_point()
3 Months in north, gray lines
ice %>%
filter(month %in% c(2, 5, 9)) %>%
filter(region == "N") %>%
ggplot(aes(year, extent, col = factor(month))) +
geom_line(aes(group = month),
col = "gray30",
linewidth = 2) +
geom_point()
K.10.3.2 Monthly ice extent over years
Here is the code with some explanations. First, load and clean the data:
ice <- read_delim("data/ice-extent.csv.bz2") %>%
filter(region == "N") %>%
select(year, month, extent) %>%
filter(extent > 0)
# cleaning
Next, let’s find the first and the last year in the dataset. You are welcome to do it just by analyzing the dataset manually, but here we compute these years. However, there is a problem: we only have a few months for the first and for the last year. Hence, we use instead the first year that contains January data, and the last year that contains December data. Obviously, you are welcome to choose these years differently:
## What is the first year with January data?
y1 <- ice %>%
filter(month == 1) %>%
filter(rank(year) == 1) %>%
pull(year)
y1 # 1979
## [1] 1979
## The last year where we have December data
y2 <- ice %>%
filter(month == 12) %>%
filter(rank(desc(year)) == 1) %>%
pull(year)
y2 # 2022
## [1] 2022
Below, we’ll create the respective datasets on the fly, by specifying geom_line(data = filter(ice, year == y1)).
Now let’s compute the decadal averages:
avgExtent <- ice %>%
mutate(decade = year %/% 10 * 10) %>%
group_by(decade, month) %>%
summarize(extent = mean(extent))
ggplot(ice,
aes(month, extent)) +
geom_line(col = "gray77",
# all years light gray
aes(group = year)) +
# ensure different lines for different years
geom_line(data = filter(ice,
year == 2012),
# 2012 data
col = "yellow") +
geom_line(data = filter(ice,
year == y2),
col = "orangered2") +
# last year
geom_line(data = filter(ice,
year == y1),
col = "gold") +
# first year
geom_line(data = avgExtent,
aes(col = decade,
group = decade))
In order to make the plot look good, you may need some more fiddling, e.g. you may want to ensure the colors are easy to distinguish, maybe make some lines thicker or semi-transparent, and improve the x-scale. But the information is all here.
K.10.4 Tuning your plots
K.10.4.1 Political parties with one color not specified
Let’s leave out INC and write
data.frame(party = c("BJP", "INC", "AITC"),
seats = c(303, 52, 23)) %>%
ggplot(aes(party, seats, fill=party)) +
geom_col() +
scale_fill_manual(
values = c(BJP="orange2",
AITC="springgreen3")
)
As you see, it does not result in an error but in a gray bar for INC. The gray value can be adjusted with na.value, e.g. as scale_fill_manual(na.value = "red").
K.10.4.2 Manually specifying a continuous scale
I do not know how one might be able to manually specify colors for a continuous scale. The problem is that continuous variables can take an infinite number of values–and you cannot specify an infinite number of values manually.
The closest existing option is scale_color_gradientn(). This allows you to link a number of data values to specific colors, and tells ggplot to use a gradient for whatever values lie in-between.
K.10.4.4 March ice extent
ice <- read_delim("data/ice-extent.csv.bz2")
## create a separate filtered df--
## we need it for both plotting
## and for computing the average
ice3 <- ice %>%
filter(month == 3,
region == "N")
avg <- ice3$extent %>%
mean()
ggplot(ice3,
aes(year, extent, fill = extent)) +
geom_col() +
scale_fill_gradient2(low = "red",
mid = "white",
high = "blue",
midpoint = avg)
Here one might want to make a plot not of the extent, but of the difference between the extent and its average (baseline) value.
K.10.4.5 Adjust text labels
Here is an example solution:
fts <- read_delim("data/fatalities.csv")
ftsLast <- fts %>%
group_by(state) %>%
filter(rank(desc(year)) == 1)
ggplot(fts,
aes(year, fatal,
group = state)) +
geom_line() +
geom_label(data = ftsLast,
aes(label = state),
nudge_x = -0.3) +
labs(
y = "Number of traffic fatalities",
title = "Traffic fatalities over time in
Washington, Oregon and Minnesota") +
theme(axis.title.x = element_blank())
It moves the plot labels slightly left (nudge_x = -0.3) and removes the year label by using theme(). It also demonstrates the use of multi-line strings for the title.
K.10.4.6 Line-text-plot
Here is an example solution:
ggplot(fts,
aes(year, fatal,
group = state)) +
geom_line(col = "gray70") +
geom_text(aes(label = state,
col = state),
alpha = 0.8)
I made the lines light gray (gray70), and gave the labels for different states different colors. I also made the labels somewhat transparent (alpha = 0.8) to reduce the problem of overlapping.
However, the figure is not great. Most importantly, labeling the points with exactly the same labels while also connecting them with lines seems unnecessary and noisy. One label per state would be sufficient here.
Also, the “MN” and “OR” labels partly overlap, which is not visually pleasant. The ggrepel package might help here.
Finally, the color key is completely unnecessary–the labels already convey the exact information. It can be removed easily with + guides(col = "none").
K.10.4.7 Diamonds with log scale
Here is an example with both x and y on a log scale:
diamonds %>%
sample_n(1000) %>%
ggplot(aes(carat, price)) +
geom_point() +
scale_x_log10() +
scale_y_log10()
As you can see, the graph is now fairly evenly populated with dots (diamonds). The relationship also looks remarkably linear.
Which graph is the best is debatable. The log-log plot here clearly solves the oversaturated lower-left corner problem of the original image, and the linear relationship looks appealing. However, humans are not that good at understanding log scales. The relationship is curved in the linear scale–larger diamonds are not just more expensive, but the value of an extra carat increases with weight. This fact is not obvious from the log-log figure.
The two log-linear plots, with only one axis logged, are not that useful in my opinion.
K.10.4.8 Arctic Death Spiral
ice <- read_delim(
"data/ice-extent.csv.bz2") %>%
filter(extent > 0,
region == "N") %>%
select(year, month, extent)
ggplot(ice,
aes(month, extent,
col = year,
group = year)) +
  geom_line(linewidth = 0.3) +
coord_polar() +
scale_color_gradient(
low = "dodgerblue2",
high = "orangered2") +
scale_y_continuous(limits = c(0, NA)) +
scale_x_continuous(breaks = 1:12,
limits = c(0,12))
I keep year continuous, as otherwise the plot would contain too many discrete colors for years. However, I need to group the lines by year, otherwise ggplot would just show a single line.
Some work is needed with breaks and limits: I tell ggplot to set the center of the plot to 0 and leave the outer limit for it to figure out (scale_y_continuous(limits = c(0, NA))). I tell it that I want to mark months 1 to 12 (breaks = 1:12), but the angle should start at month 0 (limits = c(0, 12)). Otherwise December and January will overlap.
## data for 'month0'
month0 <- ice %>%
filter(month == 12) %>%
# take December data
mutate(month = 0,
# set month = 0
year = year + 1)
# ...it is for next year
rbind(ice, month0) %>%
# merge month 0 to data
ggplot(aes(month, extent,
col = year,
group = year)) +
  geom_line(linewidth = 0.3) +
coord_polar() +
scale_color_gradient(
low = "dodgerblue2",
high = "orangered2") +
scale_y_continuous(limits = c(0, NA)) +
scale_x_continuous(breaks = 1:12,
limits = c(0,12))
The month-0 data is created by just picking the December data, and thereafter manually setting the month to 0 and the year to the following year. Thereafter, month0 is merged with the ice data frame using rbind() (see Section 14.1.1). The plotting code is exactly the same as above.
K.10.5 More geoms and plot types
K.10.5.2 Colored violinplot
The solution is just to add fill = cut to the aes() function. I have also added alpha = 0.6 as I like the transparent colors.
ggplot(diamonds,
aes(cut, price,
fill = cut)) +
geom_violin(alpha = 0.6) +
theme(axis.text.x =
element_text(angle=80,
hjust = 0.9)) +
guides(fill = "none")
theme(...) rotates the x-axis labels, and guides(...) removes the redundant color key.
K.10.5.3 All years on ice extent-area plot
Here is a solution. The main trick is to use data inheritance and to plot first all years in gray, and thereafter the selected ones with a custom color:
ice <- read.delim(
"data/ice-extent.csv.bz2") %>%
filter(extent > 0, area > 0) %>%
# clean
filter(region == "N") %>%
# only northern hemisphere
arrange(year, month)
# ensure in temporal order
ggplot(ice, aes(extent, area)) +
geom_path(alpha = 0.3) +
# semi-transparent
geom_path(data = filter(ice,
year == 2022),
aes(col = month))
We may want to give the plot better labels, and maybe mark a few more years.
If you want to display more than one year on this graph, then it may be better to display years using different colors, and label the months with numbers on the graph (using geom_text() or similar).
K.10.5.4 All diamonds with different methods
First the hexagonal bins:
(Note that you need to install the hexbin package for geom_hex() to work.)
And now the density plot:
Note that by default, the density plot only covers the most dense area of the diamonds’ distribution.
As you can see, the hexagonal histogram looks more beautiful, but the density contours are clearer to read. Personally, I prefer the colored version for presentations, but the contour lines are easier to read when one wants to understand the details.
K.10.5.5 Cut versus price
There are, obviously, many ways to display the relationship. Here is the best that I was able to come up with:
diamonds %>%
sample_n(6000) %>%
ggplot(aes(carat, price,
col = cut)) +
geom_point(col = "gray",
size = 0.3,
alpha = 0.3) +
geom_smooth(se = FALSE) +
scale_x_log10() +
scale_y_log10()
I find the points too noisy, hence I plot a subset of them small and transparent, and in gray. Instead, I use color-coded scatterplot smoothers to indicate the average price by cut. Finally, log scale ensures that the densest part of the distribution, the one at low carat and price, is clearly visible.
As you can see, the “fair” cut clearly commands an inferior price, but for most other cuts, the price difference is very small.
K.11 More about data manipulations
K.11.1 Merging data: joins
K.11.1.1 Merge artists, songs
left_join(artists, songs) should put the artists first and add a column song at the end of it. Something like
name plays song
John guitar Come Together
Paul bass Hello, Goodbye
But the problem is that John is playing in two songs, so a single song name may not be sufficient. One can come up with multiple solutions. For instance, you can list the first song where John plays. Or you can create two lines for John, one for each song. You may also create two columns for songs, one for each song.
left_join() picks the option of creating two lines, one for each song:
songs <- data.frame(song = c("Across the Universe", "Come Together",
"Hello, Goodbye", "Peggy Sue"),
name = c("John", "John", "Paul", "Buddy"))
artists <- data.frame(name = c("George", "John", "Paul", "Ringo"),
plays = c("sitar", "guitar", "bass", "drums"))
left_join(artists, songs)
## name plays song
## 1 George sitar <NA>
## 2 John guitar Across the Universe
## 3 John guitar Come Together
## 4 Paul bass Hello, Goodbye
## 5 Ringo drums <NA>
K.11.2 Reshaping
K.11.2.1 Alcohol disorders wide form
Instead of grouping the values by country, you can group them by sex. So sex will be in rows and countries in columns. The result might look like
sex | Argentina | Kenya | Taiwan | Ukraine | United States |
---|---|---|---|---|---|
M | 3.069886 | 0.7469638 | 0.8912813 | 3.895495 | 2.927539 |
F | 1.170313 | 0.6539660 | 0.2611961 | 1.425379 | 1.729168 |
We have essentially rotated the data by 90°. This table is also easy to understand. In terms of Section 14.2.1, we use table (b) instead of (a).
K.11.2.2 Alcohol disorder data in pure wide form
If we do not have countries in separate rows, then we need more columns. Currently we have two sexes for each country. We still need those two, but now they must be in the same row for every country, so we end up with a peculiar data frame containing a single row and a large number of columns, one for each country-sex combination. There will be no distinct “country” column nor a separate “sex” column. It might look like
MArgentina | MKenya | MTaiwan | MUkraine | MUnited States | FArgentina | FKenya | FTaiwan | FUkraine | FUnited States |
---|---|---|---|---|---|---|---|---|---|
3.069886 | 0.7469638 | 0.8912813 | 3.895495 | 2.927539 | 1.170313 | 0.653966 | 0.2611961 | 1.425379 | 1.729168 |
Note that we now need to add the country name to the column names to make clear which “M” refers to Argentina and which one to Taiwan.
K.11.2.3 Reshape patients data
This data frame is in a wide form, as there are two columns, male and female, that contain counts. The NA is somewhat misleading; it would be more appropriate to put “0” in that place instead.
Hence we can reshape it into a long form:
patients <- data.frame(pregnant = c("yes", "no"),
male = c(NA, 25),
female = c(11, 20))
patients %>%
pivot_longer(!pregnant,
names_to = "sex",
values_to = "count")
## # A tibble: 4 × 3
## pregnant sex count
## <chr> <chr> <dbl>
## 1 yes male NA
## 2 yes female 11
## 3 no male 25
## 4 no female 20
The result has three columns: pregnant, sex, and count. We may want to remove the NA row.
K.11.2.4 Alcohol disorders with sexes in rows
First the long form with better sex names:
longDisorders <- disorders %>%
pivot_longer(!country,
values_to = "disorders",
names_to = "sex") %>%
mutate(sex = gsub("disorders", "", sex))
longDisorders
## # A tibble: 10 × 3
## country sex disorders
## <chr> <chr> <dbl>
## 1 Argentina M 3.07
## 2 Argentina F 1.17
## 3 Kenya M 0.747
## 4 Kenya F 0.654
## 5 Taiwan M 0.891
## 6 Taiwan F 0.261
## 7 Ukraine M 3.90
## 8 Ukraine F 1.43
## 9 United States M 2.93
## 10 United States F 1.73
And now reshape it to the alternate wide form:
## # A tibble: 2 × 6
## sex Argentina Kenya Taiwan Ukraine `United States`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 M 3.07 0.747 0.891 3.90 2.93
## 2 F 1.17 0.654 0.261 1.43 1.73
As we want to put countries in columns, we need to use names_from = "country".
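As a sketch of that call, rebuilding the long data by hand with the rounded values printed above:

```r
library(tidyr)

# Long-form data as in longDisorders, using the rounded values shown above
longDisorders <- data.frame(
  country = rep(c("Argentina", "Kenya", "Taiwan",
                  "Ukraine", "United States"), each = 2),
  sex = rep(c("M", "F"), times = 5),
  disorders = c(3.07, 1.17, 0.747, 0.654, 0.891,
                0.261, 3.90, 1.43, 2.93, 1.73))

# countries go into columns: one row per sex, one column per country
pivot_wider(longDisorders,
            names_from = country,
            values_from = disorders)
```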
K.11.2.5 Alcohol disorders widest possible
First the long form (with shorter sex names), exactly as above:
longDisorders <- disorders %>%
pivot_longer(!country,
values_to = "disorders",
names_to = "sex") %>%
mutate(sex = gsub("disorders", "", sex))
Here the groups are in two columns: country and sex. We want both of these to go into the column names, hence we need names_from = c(country, sex). There is only a single column of disorder values, so values_from = disorders will stay the same:
## # A tibble: 1 × 10
## Argentina_M Argentina_F Kenya_M Kenya_F Taiwan_M Taiwan_F Ukraine_M Ukraine_F
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.07 1.17 0.747 0.654 0.891 0.261 3.90 1.43
## `United States_M` `United States_F`
## <dbl> <dbl>
## 1 2.93 1.73
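A sketch of this reshape, again rebuilding the long data by hand (rounded values as printed above):

```r
library(tidyr)

longDisorders <- data.frame(
  country = rep(c("Argentina", "Kenya", "Taiwan",
                  "Ukraine", "United States"), each = 2),
  sex = rep(c("M", "F"), times = 5),
  disorders = c(3.07, 1.17, 0.747, 0.654, 0.891,
                0.261, 3.90, 1.43, 2.93, 1.73))

# both country and sex go into the column names: a single row remains
pivot_wider(longDisorders,
            names_from = c(country, sex),
            values_from = disorders)
```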
When reshaping it directly, without transforming to the long form first, we have one grouping column, country, and two value columns. Hence we need names_from = country and values_from = c(disordersM, disordersF):
## # A tibble: 1 × 10
## disordersM_Argentina disordersM_Kenya disordersM_Taiwan disordersM_Ukraine
## <dbl> <dbl> <dbl> <dbl>
## 1 3.07 0.747 0.891 3.90
## `disordersM_United States` disordersF_Argentina disordersF_Kenya disordersF_Taiwan
## <dbl> <dbl> <dbl> <dbl>
## 1 2.93 1.17 0.654 0.261
## disordersF_Ukraine `disordersF_United States`
## <dbl> <dbl>
## 1 1.43 1.73
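Sketched on a hand-built version of the original wide data (columns disordersM and disordersF, values rounded as printed above):

```r
library(tidyr)

disorders <- data.frame(
  country = c("Argentina", "Kenya", "Taiwan", "Ukraine", "United States"),
  disordersM = c(3.07, 0.747, 0.891, 3.90, 2.93),
  disordersF = c(1.17, 0.654, 0.261, 1.43, 1.73))

# country goes into the names; both value columns are spread out
pivot_wider(disorders,
            names_from = country,
            values_from = c(disordersM, disordersF))
```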
K.11.2.6 Different ways to represent location-altitude-time data
It is fairly easy to see that there are 8 options to present these data as a data frame. Namely, the data have 3 groupings: location, altitude, and time. Each of these groupings can be either in rows or in columns, independently of the others. So there are 2 possibilities for each grouping, and hence \(2\times2\times2 = 8\) in total.
What are missing from the example tables are:

- Loc, Alt in columns; Time in rows
- Loc, Time in columns; Alt in rows
- Loc in columns; Time, Alt in rows
K.11.2.7 Ice extent data grouping dimensions
As a refresher, the ice extent data looks like
ice <- read.delim(
"data/ice-extent.csv.bz2") %>%
filter(extent > 0, area > 0) %>%
# clean
select(year, month, region, extent, area)
ice %>%
head(3)
## year month region extent area
## 1 1978 11 N 11.65 9.04
## 2 1978 11 S 15.90 11.69
## 3 1978 12 N 13.67 10.90
The grouping dimensions are a bit tricky. One of these is region (North and South). The other one can be time (the year-month combination), or you can talk about two other dimensions: year and month. In my opinion, it makes more sense to consider linear time (the year-month combination) as a single dimension. But if you want to compare the same month across different years, it may make more sense to talk about two dimensions, year and month.
In terms of the region, the dataset is in long form. There is only a single column region that contains region type (“N” and “S”).
Values are extent and area. And yes, they differ by year and month.
Should these be combined together into an additional grouping dimension? Maybe. Unlike the example of temperature and humidity, they are measured in the same physical units (million km2 in these data), they have fairly similar values, and hence they can be represented on the same graph. But computing averages or filtering may still not make sense. Personally, I’d keep them separate.
We need to combine the month name and value column name. For instance, area2 for area in February and extent11 for November extent.
K.11.2.8 Groupings for COVID Scandinavia data
This dataset contains 2 or 3 groupings: country and date are grouping dimensions for sure; type may be counted as such.
These data are in the long form: one row is a combination of country-date-type; no grouping is spread out into separate columns.
Personally, I would not count type (= Confirmed/Deaths) as a grouping, as these are quite different measures. But technically, it can be counted as one.
But I cannot imagine what you can do with the count column without filtering out either deaths or confirmed cases.
I see 3 values in the current form: count (a number), lockdown (a date), and population (a number). If we do not treat type as a grouping dimension, then instead of count we have two columns, Confirmed and Deaths; in that case we have 4 values.
Only confirmed and death counts change along country and date. Population and lockdown are different for different countries, but do not change over time.
code2 is just another name for the country (its 2-letter ISO code). It is equivalent to the country column and contains no independent information. So if you want to preserve it, it should be handled exactly the same way as country: if each row is a country, you want to add a code2 column to that row. If you want to put countries in columns, then you probably want to remove code2 altogether.
K.11.2.9 Reshape ice extent
Reshape the data to a wide form:
## # A tibble: 4 × 26
## year region area_11 area_12 area_1 area_2 area_3 area_4 area_5 area_6 area_7
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1978 N 9.04 10.9 NA NA NA NA NA NA NA
## 2 1978 S 11.7 6.97 NA NA NA NA NA NA NA
## 3 1979 N 8.37 10.6 12.4 13.2 13.2 12.5 11.1 9.34 6.69
## 4 1979 S 11.3 6.24 3.47 2.11 2.66 5.45 8.3 11.2 13.3
## area_8 area_9 area_10 extent_11 extent_12 extent_1 extent_2 extent_3 extent_4
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA NA NA 11.6 13.7 NA NA NA NA
## 2 NA NA NA 15.9 10.4 NA NA NA NA
## 3 5.06 4.58 6.19 10.9 13.3 15.4 16.2 16.3 15.4
## 4 13.8 14.3 13.7 15.3 9.24 5.4 3.14 4 7.49
## extent_5 extent_6 extent_7 extent_8 extent_9 extent_10
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 13.9 12.5 10.3 8.04 7.05 8.75
## 4 10.8 14.2 16.5 17.7 18.2 17.8
As you see, by default the variable names are extent_12 and so on. The default, alphabetic order is not the best one; you may want to convert the month numbers into 2-digit form, so that January becomes “01”, in order to ensure that the alphabetic order corresponds to the logical order. You can also see that year-month combinations that are missing in the dataset are filled with NA.
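For instance, a sketch on a hand-built fragment of the ice data (values rounded as printed above):

```r
library(dplyr)
library(tidyr)

# a tiny fragment of the ice data, values as printed above
ice <- data.frame(year = c(1978, 1978, 1978, 1978),
                  month = c(11, 11, 12, 12),
                  region = c("N", "S", "N", "S"),
                  extent = c(11.65, 15.90, 13.67, 10.4),
                  area = c(9.04, 11.69, 10.90, 6.97))

ice %>%
   mutate(month = sprintf("%02d", month)) %>%  # zero-pad: 1 becomes "01"
   pivot_wider(names_from = month,
               values_from = c(extent, area))
```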
Wide by region:

## # A tibble: 4 × 6
## year month area_N area_S extent_N extent_S
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1978 11 9.04 11.7 11.6 15.9
## 2 1978 12 10.9 6.97 13.7 10.4
## 3 1979 1 12.4 3.47 15.4 5.4
## 4 1979 2 13.2 2.11 16.2 3.14
This dataset contains only 4 value columns and is easier to grasp.
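Judging from the column names extent_N, area_S and so on, this form comes from names_from = region; a sketch on the same hand-built fragment (values rounded as printed):

```r
library(tidyr)

# a tiny fragment of the ice data, values as printed above
ice <- data.frame(year = c(1978, 1978, 1978, 1978),
                  month = c(11, 11, 12, 12),
                  region = c("N", "S", "N", "S"),
                  extent = c(11.65, 15.90, 13.67, 10.4),
                  area = c(9.04, 11.69, 10.90, 6.97))

# regions go into columns: one row per year-month combination
pivot_wider(ice,
            names_from = region,
            values_from = c(extent, area))
```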
K.11.2.10 Ice extent and area over time
First, calculate the average area and extent by year. This can be done with just group_by(), see Section 12.5.
Now, the plot. ggplot is designed in a way that makes it easier to plot a single column (call it Mkm2) and split it into two lines of different color by another column (call it type). However, in these data we have separate columns for area and extent. So we want to transform it into a long form according to these measures:
ice %>%
group_by(year) %>%
summarize(extent = mean(extent),
area = mean(area)) %>%
pivot_longer(c(extent, area),
values_to = "Mkm2",
names_to = "type") %>%
ggplot(aes(year, Mkm2,
col = type)) +
geom_line()
Note that we might get the same result differently, by using geom_line() twice, first with the aesthetics mapping aes(year, extent) and thereafter with aes(year, area). Note also that the first and last averages may be off, depending on which months are included or excluded for those years.
K.12 Making maps
K.12.1 Shapefiles and GeoJSON
K.12.1.1 Difference between spatial data frame and manual map data frame
There are multiple differences:
- Perhaps most importantly, the hand-made NZ map in Section 15.1.1 is stored as one vertex per row, while the spatial data frame stores one polygon per row. This makes spatial data frames much smaller: for instance, you do not need to replicate the same color value for every single vertex, a single value for the polygon is enough.
- Another important difference is the presence of a coordinate reference system (CRS). This makes it easy to transform one coordinate system into another, and in this way to use spatial data that are stored using different systems.
K.12.1.2 Why left_join()?
We use a left join to merge map and population. Remember: this retains all the rows of map but drops the rows of population that have no corresponding region on the map. Hence we retain all regions (rows of map), with potentially NA as the population value. This is a reasonable approach.
Alternatively:

- An inner join would remove regions without population data from the map. That would leave holes in the map. It is probably better to keep those regions and use a dedicated NA color, such as gray, instead.
- An outer join will preserve all regions, but also the population information for regions that are not present on the map. This will probably not be a serious problem; it may just clutter your data frame with unnecessary rows.
- Finally, a right join will combine the worst of both worlds: leave holes in the map for regions with missing population data, while also cluttering the final dataset with population rows that are not on the map.
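The difference can be seen on a tiny made-up example (region names and population numbers are hypothetical, not the actual map data):

```r
library(dplyr)

# hypothetical stand-ins: map attribute table, and a population table
# with one region missing (Tasman) and one extra (Westland)
map <- data.frame(region = c("Auckland", "Otago", "Tasman"))
population <- data.frame(region = c("Auckland", "Otago", "Westland"),
                         pop = c(1700000, 250000, 8600))

left_join(map, population, by = "region")   # 3 rows: Tasman kept, pop is NA
inner_join(map, population, by = "region")  # 2 rows: Tasman dropped
full_join(map, population, by = "region")   # 4 rows: Tasman and Westland kept
right_join(map, population, by = "region")  # 3 rows: Tasman dropped, Westland kept
```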