K Exercise solutions
K.1 Introduction to R
K.1.1 Variables
K.1.1.1 Invalid variable names
![](img/solutions/invalid-variable-names.png)
K.1.2 Data Types
K.1.2.1 Years to decades
If we integer-divide year by “10”, then we get the decade (without the trailing “0”). E.g.
1966 %/% 10
## [1] 196
Now we just multiply the result by 10:
1966 %/% 10 * 10
## [1] 1960
Or, to make the order of operations clearer:
(2023 %/% 10)*10
## [1] 2020
K.1.2.2 Are you above 20?
There are many ways to do it, here is just one possible solution:
age <- 53
older <- age > 20
older
## [1] TRUE
Note the variable names: age is fairly self-explanatory, older is much less so. In complex projects one may prefer a name like age_over_20 or something similar. But in a few-line script, even a and o may do.
K.1.3 Producing output
K.1.3.1 Sound around earth
We can follow the lightyear example fairly closely:
s <- 0.34 # speed of sound, km/s
distance <- 42000
tSec <- distance/s
tHrs <- tSec/3600
tDay <- tHrs/24
cat("It takes", tSec, "seconds, or",
    tHrs, "hours, \nor", tDay,
    "days for sound to travel around earth\n")
## It takes 123529.4 seconds, or 34.31373 hours,
## or 1.429739 days for sound to travel around earth
Note how we injected the new line, \n, in front of “or” for days.
This makes the output lines somewhat shorter and easier to read.
Now it does not happen often that sound actually travels around the world, but the pressure wave of the 1883 Krakatoa volcanic eruption was measured circumnavigating the world 3 times in 5 days. See the Wikipedia entry.
K.2 Functions
K.2.1 For-loops
K.2.1.1 Odd numbers only
The form of seq() we need here is seq(from, to, by) so that the sequence runs from from to to with a step by. So we can write
for(i in seq(1, 9, 2)) {
   i2 <- i^2
   cat(i, "^2 = ", i2, "\n", sep="")
}
## 1^2 = 1
## 3^2 = 9
## 5^2 = 25
## 7^2 = 49
## 9^2 = 81
K.2.1.2 Multiply 7
We can just follow the loop example in Section 3.1:
for(i in 10:0) {
cat("7*", i, " = ", 7*i, "\n", sep="")
}
## 7*10 = 70
## 7*9 = 63
## 7*8 = 56
## 7*7 = 49
## 7*6 = 42
## 7*5 = 35
## 7*4 = 28
## 7*3 = 21
## 7*2 = 14
## 7*1 = 7
## 7*0 = 0
Note the differences:
- we go down from “10” to “0” using 10:0
- we need to specify that the numbers and strings we print should not be separated by space, using the sep="" argument for cat.
- we could have created a separate variable i7 <- i*7 but we chose to write this expression directly as an argument for cat().
K.2.1.3 Print carets ^
This is very simple: we just need to use cat("^") 10 times in a loop:
for(i in 1:10) {
   cat("^")
}
cat("\n") # end the line here
## ^^^^^^^^^^
Note that we end the line after the loop: we do not want whatever follows to be on the same line.
K.2.1.4 Asivärk
The trick here is to use the caret-printing example, but now we need to do it not 10 times, but a different number of times in each row. We can call this number n, and change n in another, outer for-loop, from 1 to 10:
for(n in 1:10) {
   for(i in 1:n) {
      cat("^")
   }
   cat("\n")
}
## ^
## ^^
## ^^^
## ^^^^
## ^^^^^
## ^^^^^^
## ^^^^^^^
## ^^^^^^^^
## ^^^^^^^^^
## ^^^^^^^^^^
Note how the middle rows are essentially the caret-printing example, the only difference is 1:n instead of 1:10 in the loop header. This ensures that the outer loop index n can change the number of carets printed.
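As an alternative sketch (not the nested-loop approach the exercise is about), the inner loop can be replaced with the base-R function strrep(), which repeats a string a given number of times:

```r
# Alternative to the nested loop: strrep() repeats "^" n times
for(n in 1:10) {
   row <- strrep("^", n)   # e.g. "^^^" for n = 3
   cat(row, "\n", sep="")
}
```

The nested loop is still the better exercise for understanding how loops work; strrep() is just a shortcut once that is clear.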
K.2.1.5 Cloud and Rain
This is a somewhat more complicated example, but the broad idea is similar to that of Asivärk. We need nested loops here too: first, the outer loop should count the number of v-s. Second, we need three inner loops: for dashes at left, v-s in the middle, and dashes at right. All these loops should nest inside the outer loop:
for(n in seq(10, 2, by=-2)) {
   # n is the number of v-s each row
   # 10, 8, 6, ...
   nDash <- (12 - n)/2 # how many raindrops each side of the cloud
   ## Left raindrops
   for(i in 1:nDash) {
      cat("-")
   }
   ## Center cloud
   for(i in 1:n) {
      cat("v")
   }
   ## Right raindrops
   for(i in 1:nDash) {
      cat("-")
   }
   cat("\n") # row ends here
}
## -vvvvvvvvvv-
## --vvvvvvvv--
## ---vvvvvv---
## ----vvvv----
## -----vv-----
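One caveat with loop headers like 1:nDash: if the count were ever 0, then 1:0 would produce the sequence 1, 0 and the loop body would run twice. It does not happen here because nDash is at least 1, but a more defensive sketch uses seq_len(), which gives an empty sequence for 0:

```r
nDash <- 0
length(1:nDash)        # 2: the sequence is 1, 0
length(seq_len(nDash)) # 0: empty, so the loop body is skipped
for(i in seq_len(nDash)) {
   cat("-")            # never runs when nDash is 0
}
```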
K.2.3 Writing functions
K.2.3.1 M87 black hole in km
The function might look similar to feet2m, but we may need to compute the length of a single light-year inside of the function:
ly2km <- function(distance) {
   c <- 300000 # speed of light, km/s
   ly <- c*60*60*24*365 # length of a single light-year:
                        # speed of light * seconds in minute *
                        # minutes in hour * hours in day *
                        # days in year
   distance*ly
}
And we find the distance to the black hole as
ly2km(55000000)
## [1] 5.20344e+20
or maybe it is easier to write it as
ly2km(55e6) # 55*10^6
## [1] 5.20344e+20
If this number does not tell you much then you are not alone: distances this big are beyond what we on earth can perceive.
K.2.3.2 Years to decades
Perhaps the most un-intuitive part here is the integer division %/%: it just divides the numbers, but discards all fractional parts. For instance,
2024 %/% 10
## [1] 202
In order to make this into the decade, we just need to multiply the result by 10 again. So the function might look like:
decade <- function(year) {
   (year %/% 10)*10
}
decade(2024)
## [1] 2020
decade(1931)
## [1] 1930
decade(1969)
## [1] 1960
decade(1970)
## [1] 1970
K.2.3.3 Dates
date <- function(day, month, year = 2024) {
   paste0(year, "-", month, "-", day)
}
date(30, 3, 2012)
## [1] "2012-3-30"
date(30, 3)
## [1] "2024-3-30"
Note that the order of arguments is somewhat arbitrary, you can also use function(month, day, year) or any other order. But obviously, later you need to supply the actual arguments in the corresponding order.
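If remembering the order is a concern, arguments can also be passed by name, in which case their order no longer matters. A minimal sketch, where the function name makeDate is only an illustrative stand-in for the date function above:

```r
# makeDate is a hypothetical name, used here to keep the example self-contained
makeDate <- function(day, month, year = 2024) {
   paste0(year, "-", month, "-", day)
}
makeDate(month = 3, day = 30)              # same as makeDate(30, 3)
makeDate(year = 2012, day = 30, month = 3) # same as makeDate(30, 3, 2012)
```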
K.2.4 Output versus return
We can create such a function by just using paste0:
hi <- function(name) {
   paste0("Hi ", name, ", isn't it a nice day today?")
   # remember: paste0 does not leave spaces b/w arguments
}
This function returns the result of paste0, the character string that combines the greeting and the name. It does not output anything: there is no print nor cat command. We can show it works as expected: when called on R console, its returned value, the greeting, is automatically printed:
hi("Arthur") # remember: name (it is text) must be quoted
## [1] "Hi Arthur, isn't it a nice day today?"
and if the result is assigned to a variable then nothing is printed:
greeting <- hi("Arthur")
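For contrast, here is a sketch of a function that outputs instead of returning. The name hiCat is hypothetical, used only for this illustration. cat() prints the text but returns NULL, so there is nothing useful to store:

```r
hiCat <- function(name) {
   cat("Hi ", name, ", isn't it a nice day today?\n", sep="")
}
greeting <- hiCat("Arthur")  # the greeting is printed even though we assign
is.null(greeting)            # TRUE: cat() returned NULL, no text was stored
```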
K.3 Vectors
K.3.1 Vectorized operations
K.3.1.1 Extract April month row numbers
We just need to make a sequence from 3 till no more than 350 (number of rows) with step 12:
seq(3, 350, by = 12)
## [1] 3 15 27 39 51 63 75 87 99 111 123 135 147 159 171 183 195 207 219 231
## [21] 243 255 267 279 291 303 315 327 339
K.3.1.2 Yu Huang and Guanyin in liquor store
We can just call the data age and cashier:
age <- c(16, 20, 24)
cashier <- c("Yu Huang", "Guanyin", "Yu Huang")
In normal language–you are able to buy if you are at least 21 years old or your cashier is Guanyin. This means the first customer cannot, but the other two can buy the drink.
The expression is pretty much exactly the sentence above, written in R syntax:
age >= 21 | cashier == "Guanyin"
## [1] FALSE TRUE TRUE
Note that we use >= to test age at least 21, and == to test equality.
So the first customer cannot get the drink but the two others can.
K.3.1.3 Descriptive statistics
x <- 1:10
y <- -11:22
z <- c(1,1,1,1,1,1,1,1,1,1,1, 55)
mean(x)
## [1] 5.5
mean(y)
## [1] 5.5
mean(z)
## [1] 5.5
So all averages are the same.
median(x)
## [1] 5.5
median(y)
## [1] 5.5
median(z)
## [1] 1
Medians of x and y are the same, but that of z is just 1.
range(x)
## [1] 1 10
range(y)
## [1] -11 22
range(z)
## [1] 1 55
Here the range is easily visible from how the vectors were created, so computation is not really needed. But this is usually not the case when the vectors originate from a large dataset.
var(x)
## [1] 9.166667
var(y)
## [1] 99.16667
var(z)
## [1] 243
Variances are hard to judge manually, but they are different too.
So we summarized these vectors into five different numbers (two for range), despite the fact that they were of different lengths.
K.3.1.4 Recycling where length do not match
c(10, 20, 30, 40) + 1:3
## Warning in c(10, 20, 30, 40) + 1:3: longer object length is not a multiple of
## shorter object length
## [1] 11 22 33 41
This is the warning message. As you can see, this operation results in an incomplete recycling where only the first component 1 of the shorter vector was reused.
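To see what the recycling did, we can spell it out by hand. A small sketch using rep_len(), which repeats a vector until it reaches the requested length:

```r
long <- c(10, 20, 30, 40)
short <- rep_len(1:3, length(long))  # 1 2 3 1: the leading 1 is reused
long + short                         # same values as above, but no warning
```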
K.3.2 Vector indices
K.3.2.1 Extract positive numbers
This is a simple application of logical indexing:
v <- -5:5
v[v > 0]
## [1] 1 2 3 4 5
K.3.2.2 Extract positive numbers
We have data
height <- c(160, 170, 180, 190, 175) # cm
weight <- c(50, 60, 70, 80, 90) # kg
name <- c("Kannika", "Nan", "Nin", "Kasem", "Panya")
Height of everyone at least 180cm:
height[height >= 180]
## [1] 180 190
Names of those at least 180cm:
name[height >= 180]
## [1] "Nin" "Kasem"
Weight of all patients who are at least 180cm tall
weight[height >= 180]
## [1] 70 80
Names of everyone who weighs less than 70kg
name[weight < 70]
## [1] "Kannika" "Nan"
Names of everyone who is either taller than 170, or weighs more than 70.
name[height > 170 | weight > 70]
## [1] "Nin" "Kasem" "Panya"
K.3.2.3 Character indexing: state abbreviations
First, we can set names to the state.abb variable:
names(state.abb) <- state.name
Note that we need to be sure that the names and abbreviations are in the same order! (They are, this is how the data is defined, see Section I.11.) This results in a named vector:
state.abb[1:5]
## Alabama Alaska Arizona Arkansas California
## "AL" "AK" "AZ" "AR" "CA"
Now we can just extract the abbreviations:
state.abb[c("Utah", "Connecticut", "Nevada")]
## Utah Connecticut Nevada
## "UT" "CT" "NV"
This is a common way to create lookup tables in R.
K.3.3 Modifying vectors
K.3.3.1 Wrong number of items
Feeding in a single item works perfectly:
supplies <- c("backpack", "laptop", "pen")
supplies[c(2, 3)] <- "ipad"
supplies
## [1] "backpack" "ipad" "ipad"
Now both elements 2 and 3 are “ipad”. This is because of the recycling rules (see Section 4.3.4): the shorter item (here “ipad”) will just be replicated as many times as needed (here two).
But feeding in 3 elements results in a warning:
supplies[c(2, 3)] <- c("tablet", "book", "paper")
## Warning in supplies[c(2, 3)] <- c("tablet", "book", "paper"): number of items to
## replace is not a multiple of replacement length
supplies
## [1] "backpack" "tablet" "book"
Otherwise, the replacement works, just the last item, “paper”, is ignored.
K.3.3.2 Absolute value
We can do it explicitly in multiple steps:
x <- c(0, 1, -1.5, 2, -2.5)
iNegative <- x < 0 # which elements are negative
positive <- -x[iNegative] # flip the sign for negatives
                          # so you get the corresponding
                          # positives
x[iNegative] <- positive # replace negatives
x
## [1] 0.0 1.0 1.5 2.0 2.5
However, it is much more concise if done in a shorter form:
x <- c(0, 1, -1.5, 2, -2.5)
x[x < 0] <- -x[x < 0]
x
## [1] 0.0 1.0 1.5 2.0 2.5
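In practice one would rarely write this by hand, as base R has the built-in abs() function that does the same thing. A quick check that the two approaches agree:

```r
x <- c(0, 1, -1.5, 2, -2.5)
manual <- x
manual[manual < 0] <- -manual[manual < 0]  # the replacement approach from above
identical(manual, abs(x))                  # both give 0.0 1.0 1.5 2.0 2.5
```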
K.3.3.3 Managers’ rent
Here is the data:
income <- c(Shang = 1000, Zhou = 2000, Qin = 3000, Han = 4000)
rent <- c(Shang = 200, Zhou = 1000, Qin = 1700, Han = 2800)
This problem can be solved in two ways. First, the way it is stated in the problem:
b <- c(0, 0, 0, 0) # to begin with, benefit "0" for everyone
iHR <- rent > 0.5*income # who is rent-burdened?
iHR # just for check
## Shang Zhou Qin Han
## FALSE FALSE TRUE TRUE
So Qin and Han are rent-burdened.
b[iHR] <- 0.25*rent[iHR] # compute their benefit
b
## [1] 0 0 425 700
Here we replaced benefits for two people–we had to use iHR on both sides of the assignment.
We can also solve it the other way around (not asked in the problem statement): first we can compute the benefit for everyone, and thereafter replace it for the non-rent burdened with “0”:
b <- 0.25*rent # benefits to everyone
iLR <- rent <= 0.5*income # whose rent is low?
b[iLR] <- 0 # replace their benefits by 0.
b
## Shang Zhou Qin Han
## 0 0 425 700
Note that all replacement elements have the same value here, “0”.
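A third option, not part of the exercise, is the vectorized ifelse() function that picks between two values for each element. A sketch with the same data:

```r
income <- c(Shang = 1000, Zhou = 2000, Qin = 3000, Han = 4000)
rent <- c(Shang = 200, Zhou = 1000, Qin = 1700, Han = 2800)
# for each manager: 25% of rent if rent-burdened, 0 otherwise
b <- ifelse(rent > 0.5*income, 0.25*rent, 0)
b
```

This combines the logical test and the replacement into a single step, at the cost of hiding the intermediate index that is useful for checking.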
K.4 Lists
K.4.1 Vectors and lists
The vector will be
c(1, 2:4, 5)
## [1] 1 2 3 4 5
and the list
list(1, 2:4, 5)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2 3 4
##
## [[3]]
## [1] 5
The printout clearly shows that in case of vector we end up with a vector of 5 similar elements (just numbers). But the list contains three elements, the first and last are single numbers (well, more precisely length-1 vectors), while the middle component is a length-3 vector.
As this example shows, one cannot easily print all list elements on a single row as is the case with vectors.
K.4.2 Print employee list
First re-create the same persons:
person <- list(name = "Ada", job = "Programmer", salary = 78000,
               union = TRUE)
person2 <- list("Ji", 123000, FALSE)
employees <- list(person, person2)
The printout looks like
employees
## [[1]]
## [[1]]$name
## [1] "Ada"
##
## [[1]]$job
## [1] "Programmer"
##
## [[1]]$salary
## [1] 78000
##
## [[1]]$union
## [1] TRUE
##
##
## [[2]]
## [[2]][[1]]
## [1] "Ji"
##
## [[2]][[2]]
## [1] 123000
##
## [[2]][[3]]
## [1] FALSE
We can see our two employees here, Ada (at first position) and Ji (at second position). All element names for Ada are preceded with [[1]] and for Ji with [[2]]. These indicate the corresponding positions.
The data for Ada and Ji is printed out slightly differently, reflecting the fact that Ada’s components have names while Ji’s components do not. So Ada’s components use tags like $name and Ji’s components use similar positional tags like [[1]].
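These tags are also how the components can be extracted. A short sketch, re-creating the same list so that it runs on its own:

```r
person <- list(name = "Ada", job = "Programmer", salary = 78000,
               union = TRUE)
person2 <- list("Ji", 123000, FALSE)
employees <- list(person, person2)
employees[[1]]$name    # Ada's name, by component name
employees[[2]][[1]]    # Ji's name, by position
employees[[2]]$name    # NULL: Ji's components have no names
```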
K.5 How to write code
K.5.1 Divide and conquer
K.5.1.1 Patient names and weights
The recipe to display the names might sound like
- Take the vector of weights
- Find which weights are above 60kg
- Get names that correspond to those weights
- Print those
This recipe is a bit ambiguous though–the which weights is not quite clear, and if you know how to work with vectors, it may mean both numeric position (3 and 4) or logical index (FALSE, FALSE, TRUE, TRUE, FALSE). But if you know the tools, you also know that both of these approaches are fine, so the ambiguity is maybe even its strength.
Second, if you know the tools, then you know that explicit printing may not be needed.
The recipe to display the weights may be like
- Take the vector of weights
- Find which weights are above 60kg
- Display those
This recipe works well if we have access to vectorized operations and indexing like what we have in R. But if we do not have access to these tools, we may instead write
- Take the array of weights
- Walk over every weight in this array
- Is the weight over 60kg?
- If yes, print it!
Which recipe do you prefer? Obviously, it depends on the tools you have access to.
Here is example code:
## Data
name <- c("Bao-chai", "Xiang-yun", "Bao-yu", "Xi-chun", "Dai-yu")
weight <- c(55, 56, 65, 62, 58) # kg
## Names
name[weight > 60] # simple, but does not follow the recipe closely
## [1] "Bao-yu" "Xi-chun"
## more complex, but follows the recipe more closely
i <- weight > 60
heavies <- name[i]
cat(heavies, "\n")
## Bao-yu Xi-chun
For weights, we have two similar options
## Short
weight[weight > 60]
## [1] 65 62
## More along the recipe
i <- weight > 60
weight[i] # implicit printing
## [1] 65 62
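For comparison, the loop-based recipe (walk over every weight and print it if it is over 60kg) can also be written in R. This is only a sketch to illustrate that recipe; the vectorized versions above are the more idiomatic choice:

```r
weight <- c(55, 56, 65, 62, 58)  # kg
for(w in weight) {        # walk over every weight
   if(w > 60) {           # is the weight over 60kg?
      cat(w, "\n")        # if yes, print it
   }
}
```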
K.5.2 Learning more
K.5.2.1 Time difference in days
![](img/solutions/how-learn-timediff.png)
ChatGPT may give very good code examples that only require minimal adjustments.
Nowadays AI-based tools are fairly good at doing this. The figure at right shows chatGPT’s answer (incorporated in Bing) to such a question. This answer is correct and can be incorporated into your code with only small adjustments. However, one should still look up what these functions do and what format = "%b %d, %Y" means.
However, the answer may not be enough if you do not know the basics of R, e.g. what is the assignment operator <- or the comment character #. Also, it lacks some context and it does not discuss more efficient or simpler ways to achieve the same task. For instance, it does not suggest writing the dates in the ISO format YYYY-mm-dd, which would simplify the solution.
![](img/solutions/as.Date-help.png)
The first page of as.Date help (you can get it with ?as.Date).
The as.Date() help page offers much more information than what chatGPT gives. In particular, the tryFormats argument and its default values are very useful. However, it also assumes more understanding of the workings of R, e.g. what exactly ## S3 method for class 'character' means, and which of the functions listed there one actually needs.
So AI-tools are not a substitute for documentation (nor the other way around). AI is great for quickly getting a solution. In order to evaluate the solution, you need to know more. But as your time is valuable too, use AI for tasks where you do not need to go in depth, but learn the most important tools in depth.
Here is a simplified version of the chatGPT-suggested solution:
dates <- as.Date(c("2023-10-16", "2023-11-12", "2014-07-03"))
# ISO dates do not need format specification
difftime(dates[2], dates[1], units="days")
## Time difference of 27 days
difftime(dates[2], dates[3], units="days")
## Time difference of 3419 days
When working with dates, you should also be familiar with the lubridate library and the tools therein.
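Two related details, sketched in base R under the assumption that the dates come as text: Date objects can also be subtracted directly, and dates that are not in the ISO format need an explicit format argument. The day.month.year string below is made up for illustration:

```r
d1 <- as.Date("2023-10-16")                       # ISO format, no format needed
d2 <- as.Date("12.11.2023", format = "%d.%m.%Y")  # non-ISO, format required
d2 - d1               # a time difference, like difftime(d2, d1)
as.numeric(d2 - d1)   # 27, as a plain number without the units
```

A purely numeric format is used here on purpose: formats with month names, such as %b, depend on the language settings of the computer.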
K.5.3 Coding style
K.5.3.1 Variable names for election data
One of the decisions you need to make here is how to name the political parties. You definitely do not want to use the full names as those are very long. Here we are actually in a very good situation, as these parties have standard English abbreviations (BJP, INC and YSRCP).
Below is one option:
- The original data:
  - elections. If there are more election-related things besides the dataset, we may call it electionData to stress this is a dataset.
- Corrected original
  - electionsFixed
- 2019 only
  - elections2019. This assumes we do not need 2019 non-fixed version.
- Sub-datasets for parties.
  - electionsBJP
  - electionsINC
  - electionsYSRCP.
- Winning districts only
  - winsBJP
  - winsINC
  - winsYSRCP
Obviously, there are more options, e.g. if the project is very short, then you may replace elections with just e. If you need more, e.g. also 2024 election data, you may need variable names like elections2019BJP and wins2024INC.
You may also think what to do if the data is about Japan instead, and the party you are interested in, 公明党, is abbreviated as 公明. (See Komeito).
K.6 Conditional statements
K.6.1 if-statement
K.6.1.1 Tell if second string longer
This is quite a simple application of if and else:
compareStrings <- function(s1, s2) {
   if(nchar(s2) > nchar(s1)) {
      ## if 2nd string longer then print
      cat("The second string is longer\n")
   }
   ## Do nothing else
}
compareStrings("a", "aa") # prints
## The second string is longer
compareStrings("aa", "a") # does not print
K.6.1.2 Print if number even
- Here the logic is as follows:
  - print the number
  - if even, print " - even".
for(i in 1:10) {
   cat(i, "\n") # print the number (and new line)
   if(i %% 2 == 0) {
      cat(" - even\n") # print 'even' (and new line)
   }
}
## 1
## 2
## - even
## 3
## 4
## - even
## 5
## 6
## - even
## 7
## 8
## - even
## 9
## 10
## - even
- Now we need to think more about printing. It goes as follows:
  - print the number (no new line)
  - if even, print " - even" (no new line)
  - add new line, unconditionally.
for(i in 1:10) {
   cat(i) # print the number, but do not switch to new line
   if(i %% 2 == 0) {
      cat(" - even") # print 'even', do not switch to new line
   }
   cat("\n") # switch to new line at the end of line here
             # whatever number it is
}
## 1
## 2 - even
## 3
## 4 - even
## 5
## 6 - even
## 7
## 8 - even
## 9
## 10 - even
K.6.1.3 Print even/odd
The code is simple, and printing is a bit simpler too
for(i in 1:10) {
   cat(i) # print the number, but do not switch to new line
   if(i %% 2 == 0) {
      cat(" even\n") # print 'even' and new line
   } else {
      cat(" odd\n")
   }
}
## 1 odd
## 2 even
## 3 odd
## 4 even
## 5 odd
## 6 even
## 7 odd
## 8 even
## 9 odd
## 10 even
K.6.1.4 Going out with friends
money <- 200
nFriends <- 5
price <- 30
sum <- (nFriends + 1)*price # friends + myself
total <- sum*1.15 # add tip
if(total > money) {
   cat("Cannot afford 😭\n")
} else {
   cat("Can afford ✌\n")
}
## Cannot afford 😭
K.6.1.5 Test porridge temperature
We just need to remove assignments and return():
test_food_temp <- function(temp) {
   if(temp > 120) {
      "This porridge is too hot!"
   } else if(temp < 70) {
      "This porridge is too cold!"
   } else {
      "This porridge is just right!"
   }
}
## The test results are the same:
test_food_temp(119) # just right!
## [1] "This porridge is just right!"
test_food_temp(60) # too cold!
## [1] "This porridge is too cold!"
test_food_temp(150) # too hot!
## [1] "This porridge is too hot!"
In my opinion, shorter code is easier to read, but different people may have different opinions.
K.6.2 Conditional statements and vectors
K.6.2.1 Should you go to boba place?
The problem is worded in a somewhat vague manner, so you may need to make it more specific. Here we assume that you only go if you can afford a drink.
This means you need to write code that checks if any tea is cheaper than $7.
K.6.2.2 Can you get a drink?
With the original prices:
price <- c(5, 6, 7, 8)
if(any(price <= 7)) {
   cat("You can get a drink\n")
} else {
   cat("This is a too expensive place\n")
}
## You can get a drink
If they raise the price by $3 across the board then we can just add “3” to the price vector:
price <- price + 3
if(any(price <= 7)) {
   cat("You can get a drink\n")
} else {
   cat("This is a too expensive place\n")
}
## This is a too expensive place
The results are intuitively obvious–it is affordable using the original prices but not with the new prices.
K.7 File system tree
K.7.1 File system tree
K.7.1.1 Sketch your file system tree
This is, obviously, different for everyone, but here is mine:
![fig: sketch of my file system tree](figs/solutions/my-file-system-tree.png)
A subset of the file system tree in my computer. Black boxes denote folders, blue boxes are files.
K.7.1.2 Sketch your picture folder tree
Here is mine. I have picked mostly shorter example names, just to fit those on the figure.
![fig: sketch of my Pictures folder tree](figs/solutions/my-pics-folder-tree.png)
A subset of the Pictures folder in my computer. Black boxes denote folders, blue boxes are files.
K.7.1.4 Matlab accessing matrix.dat
![fig: Get matrix.dat from Downloads](figs/solutions/yucun-matlab-matrixdat.png)
How to navigate to matrix.dat in Downloads from amath352.
From amath352 to matrix.dat we can move as (see the figure):
- up (into UW)
- up (into Documents)
- up (into Yucun’s stuff)
- into Downloads
- grab matrix.dat from there
Or in the short form:
"../../../Downloads/matrix.dat"
Again, we should not start by going up to amath352 as we already are there.
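The exercise is about Matlab, but the path itself is the same in R, where it can be written as a single string or assembled from its components with file.path(). A small sketch:

```r
# ".." means "up one folder"; three of them take us from amath352
# to Yucun's stuff, and from there we descend into Downloads
path <- file.path("..", "..", "..", "Downloads", "matrix.dat")
path   # "../../../Downloads/matrix.dat"
```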
K.7.1.5 Get picture from info201
Again, this is different on your computer. But given my file system tree looks like above, my path will be:
![fig: sketch of my file system tree](figs/solutions/info201-to-pics.png)
How I can access green-lake-ice.jpg from my info201 folder.
- up (to teaching)
- up (to tyyq)
- up (to my stuff)
- into Pictures
- into Nature
- grab the green-lake-ice.jpg from there.
In the short form, it is
"../../../Pictures/Nature/green-lake-ice.jpg"
Note that I do not have pictures in Pictures folder, but in subfolders inside there. If you do, the descent into Nature will be unnecessary.
K.7.1.7 Absolute path of an image
Suppose I have an image “fractal.png” inside of my Pictures folder that, in turn, is in my home folder. Assume further that I am using Windows and my home folder is on “D:” drive. The long directions might look like:
- start at root “This PC”
- go to drive “D:”
- go to “Users”
- go to “siim” (assume “siim” is my user name)
- go to “Pictures”
- grab “fractal.png” from there.
In the short form it is
D:/Users/siim/Pictures/fractal.png
Note that we do not use the root “This PC” when writing paths on Windows.
K.7.1.8 Absolute path of the home folder
![fig: Home folder location in file system tree](figs/solutions/home-folder-abs-path.png)
Obviously, this is different for every user and every computer. Here is mine on my home computer. I have marked a few other folders (etc for system configuration files, and usr for installed applications).
![fig: path in Gnome file selector](img/solutions/gnome-file-selector-path.png)
Absolute path, here root - home - siim, as shown in Gnome file selector.
There are multiple ways to see where in the file system tree it is located, one option is to use file managers. Here is an example that shows the path in Gnome file selector. Note that root is denoted by a hard disk icon, and the home folder siim is combined with a home icon.
K.7.1.9 Yucun moving his project
- If he is using an absolute path (it might be "/Users/yucun/Documents/data/data.csv"), then it does not change. This is because an absolute path always starts from the file system root, and the file system root does not change if you move around your files and folders, as long as the file in question (data.csv) remains in place.
- If he moves data to a different computer… then he probably has to change the paths. Most importantly, the other computer may not have the data folder inside of the Documents folder, but somewhere else. Second, the other computer may also have a different file system tree, e.g. if the other one is a PC, his home folder may be "C:/Users/yucun" instead. A relative path is of no help here, unless the other computer has a similar file and folder layout.
K.7.2 Accessing files from R
K.7.2.1 R working directory path type
This is an absolute path: you see this because “/home/otoomet/tyyq/info201-book” starts with the root symbol /. See more in Sections 9.1.2 and 9.1.3.
K.7.2.2 RStudio console working directory
![fig: Typing ‘getwd()’ in rstudio console](img/solutions/rstudio-console-getwd.png)
The only way to see it is to run getwd() in the RStudio console. You can run it directly, or you can also execute a line of a script. What matters is that it runs in the console.
The example here shows “/home/siim/tyyq/teaching/info201/inclass” as the current working directory.
K.7.2.3 List files in R and in graphical manager
Assume the current working directory is “/home/siim/tyyq/teaching/info201/inclass” as in the exercise above.
![fig: ‘list.files()’ showing files in the current working directory](img/solutions/rstudio-console-listfiles.png)
We can use list.files()
to see files here.
![fig: Files in current working directory as displayed graphically](img/solutions/mgr-files-in-wd.png)
And here are the same files, seen through the eyes of a graphical file manager (PCManFM). Note the navigation bar above the icons that displays the absolute path of the folder, and the side pane that displays the file system tree (a small view of it only).
It is easy to see that the files are the same. Note that R normally sorts files alphabetically, but file managers may show these in different ways, either alphabetically, by creation time, or you may even manually position individual icons. All this may be configured differently on your computer!
You can also see that here, both R and the file manager show all names in the same way, including the complete extensions like .R or .jpg. This may be different on your computer (and can be changed).
K.8 Data Frames
K.8.1 What is data frame
K.8.2 Working with data frames
K.8.2.1 Countries and capitals
Appropriate names are country for the country, capital for its capital, and population for the population. We call the data frame countries (plural) to distinguish it from the individual variable. Obviously, one can come up with other names. We can create the data frame as
countries <- data.frame(
   country = c("Gabon", "Congo", "DR Congo", "Uganda", "Kenya"),
   capital = c("Libreville", "Brazzaville", "Kinshasa", "Kampala", "Nairobi"),
   population = c(2.340, 5.546, 108.408, 45.854, 55.865))
countries
## country capital population
## 1 Gabon Libreville 2.340
## 2 Congo Brazzaville 5.546
## 3 DR Congo Kinshasa 108.408
## 4 Uganda Kampala 45.854
## 5 Kenya Nairobi 55.865
where population is in millions (2022 estimates from Wikipedia).
We can extract the country names by dollar notation as
countries$country
## [1] "Gabon" "Congo" "DR Congo" "Uganda" "Kenya"
and population with double brackets as
countries[["population"]]
## [1] 2.340 5.546 108.408 45.854 55.865
Capital using indirect name:
var <- "capital"
countries[[var]]
## [1] "Libreville" "Brazzaville" "Kinshasa" "Kampala" "Nairobi"
K.8.3 Accessing Data in Data Frames
K.8.3.1 Indirect variable name with dollar notation
R will treat the name of the workspace variable (the one that contains the data variable name) as if it were the data variable name itself:
var <- "population"
countries$var # NULL
## NULL
As you see, R is looking for a data variable var. As it cannot find it, it returns NULL, the special code for an empty element.
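The two notations can be compared side by side. A minimal sketch with a small made-up data frame (df and its values are only for illustration):

```r
df <- data.frame(population = c(2.3, 5.5))
var <- "population"
df$var      # NULL: looks for a column literally called "var"
df[[var]]   # 2.3 5.5: uses the name stored in var
```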
K.8.3.2 Loop of columns of a data frame
- Column names. No loop needed here:
names(emperors)
## [1] "name" "born" "throned" "ruled" "died"
- Print names in loop. We can just loop over the names:
for(n in names(emperors)) {
cat(n, "\n")
}
## name
## born
## throned
## ruled
## died
- Print name and column. We need indirect access here as the column name is now stored in the variable (called n below). So we can access it as emperors[[n]]:
for(n in names(emperors)) {
cat(n, "\n")
print(emperors[[n]])
}
## name
## [1] "Qin Shi Huang" "Napoleon Bonaparte" "Nicholas II"
## [4] "Mehmed VI" "Naruhito"
## born
## [1] -259 1769 1868 1861 1960
## throned
## [1] -221 1804 1894 1918 2019
## ruled
## [1] "China" "France" "Russia" "Ottoman Empire"
## [5] "Japan"
## died
## [1] -210 1821 1918 1926 NA
- Print name and type. This is similar to the above, except now we print is.numeric(emperors[[n]]).
for(n in names(emperors)) {
cat(n, "is numeric:", is.numeric(emperors[[n]]), "\n")
}
## name is numeric: FALSE
## born is numeric: TRUE
## throned is numeric: TRUE
## ruled is numeric: FALSE
## died is numeric: TRUE
- Print name and minimum. Now use the TRUE/FALSE for a logical test, only print the minimum if this is true:
for(n in names(emperors)) {
   cat(n, "")
   if(is.numeric(emperors[[n]])) {
      cat(min(emperors[[n]]))
   }
   cat("\n")
}
## name
## born -259
## throned -221
## ruled
## died NA
Note: you may want to use min(emperors[[n]], na.rm = TRUE) to avoid the missing minimum for the died column.
K.8.3.3 Emperors who died before 1800
Pure dollar notation is almost exactly the same as the example in the text:
emperors$name[emperors$died < 1800]
## [1] "Qin Shi Huang" NA
When using double brackets in the first place, we have
emperors[["name"]][emperors$died < 1800]
## [1] "Qin Shi Huang" NA
Note that we have a weird construct here, [[...]][...]. It looks weird, but it works perfectly. emperors[["name"]] is a vector, and a vector can be indexed using [...].
When we put double brackets in both places, we get
emperors[["name"]][emperors[["died"]] < 1800]
## [1] "Qin Shi Huang" NA
This is perhaps the “heaviest” notation, where it may be hard to keep track of the brackets. However, it is a perfectly valid way to extract emperors!
Finally, NA in the output is related to Naruhito. As we do not know his year of death, R signals that there is one name where we do not know if he died before 1800. It is a little stupid: as Naruhito is alive today, he cannot have died before 1800. But we haven’t explained this knowledge to R.
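The source of the NA is easy to see if we look at the logical index alone. A sketch using just the years of death from the printout above:

```r
died <- c(-210, 1821, 1918, 1926, NA)
died < 1800          # TRUE FALSE FALSE FALSE NA: the last test is unknown
which(died < 1800)   # 1: which() keeps only the positions that are TRUE
```

Indexing with a logical NA produces an NA element, while indexing with the positions from which() does not.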
K.8.3.4 Single-bracket data access (emperors)
Extract 3rd and 4th row:
emperors[3:4,] # alternatively, emperors[c(3,4),]
## name born throned ruled died
## 3 Nicholas II 1868 1894 Russia 1918
## 4 Mehmed VI 1861 1918 Ottoman Empire 1926
All emperors who died in 20th century:
emperors[emperors$died >= 1900 & emperors$died < 2000,]
## name born throned ruled died
## 3 Nicholas II 1868 1894 Russia 1918
## 4 Mehmed VI 1861 1918 Ottoman Empire 1926
## NA <NA> NA NA <NA> NA
This will still give us NA for Naruhito: we haven’t explained to R in any way that someone who was alive in 2023 cannot have died in the 20th century. If an NA is not desired, one can use which():
emperors[which(emperors$died >= 1900 & emperors$died < 2000),]
## name born throned ruled died
## 3 Nicholas II 1868 1894 Russia 1918
## 4 Mehmed VI 1861 1918 Ottoman Empire 1926
Name and country of those emperors:
emperors[which(emperors$died >= 1900 & emperors$died < 2000),
         c("name", "ruled")]
## name ruled
## 3 Nicholas II Russia
## 4 Mehmed VI Ottoman Empire
K.8.3.5 Patients aging
First create the data frame:
Name <- c("Ada", "Bob", "Chris", "Diya", "Emma")
Inches <- c(58, 59, 60, 61, 62)
Pounds <- c(120, 120, 150, 150, 160)
age <- c(22, 33, 44, 55, 66)
patients <- data.frame(Name, Inches, Pounds, age)
patients
## Name Inches Pounds age
## 1 Ada 58 120 22
## 2 Bob 59 120 33
## 3 Chris 60 150 44
## 4 Diya 61 150 55
## 5 Emma 62 160 66
Adding a single year of age involves just modifying data, but we do not need to filter anything as this applies to everyone:
patients$age <- patients$age + 1
patients
## Name Inches Pounds age
## 1 Ada 58 120 23
## 2 Bob 59 120 34
## 3 Chris 60 150 45
## 4 Diya 61 150 56
## 5 Emma 62 160 67
K.9 dplyr
K.9.1 Grammar of data manipulation
K.9.1.1 How many trees over size 100?
We can do something like this:
- Take the orange tree dataset
- keep only rows that have size > 100
- pull out the tree number
- find all unique trees
- how many unique trees did you find?
Obviously, you can come up with different lists, e.g. items 4 and 5 might be combined into one. They are kept separate here so that each of these two items corresponds to a single function in base-R.
K.9.1.2 Two ways to find the largest tree
The difference is in how the recipe breaks ties for the largest tree. If there are two largest trees of equal size, these will be put in an arbitrary order. If we pick the first line below, we’ll get one of the largest trees, but not both. The second recipe extracts all trees of maximum size, so it can find all such trees.
In practice, it is more useful not to order the trees and pick the first, but rank them with an explicit way to break ties. For instance
size <- c(20, 10, 20)
rank(desc(size), ties.method="min")
## [1] 1 3 1
This tells us that both the first and the third tree are in “first place” in descending order. See more with ?rank.
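To see how the tie-breaking rule matters, here is a short base-R sketch. It uses -size in place of dplyr's desc() so that it runs without loading any package; that substitution is a choice made for this illustration only.

```r
size <- c(20, 10, 20)

rank(-size)                         # 1.5 3.0 1.5: by default, ties share the average rank
rank(-size, ties.method = "min")    # 1 3 1: both largest trees are in first place
rank(-size, ties.method = "first")  # 1 3 2: ties are broken by order of appearance
```

The default "average" method is also the reason why ranks may show up as fractional numbers.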
K.9.2 Most important dplyr functions
K.9.2.1 Add decade to babynames
We can compute decade by first integer-dividing year by 10, and then multiplying the result by 10:
babynames %>%
   mutate(decade = year %/% 10 * 10) %>%
   sample_n(5)  # just show it works
## # A tibble: 5 × 6
## year sex name n prop decade
## <dbl> <chr> <chr> <int> <dbl> <dbl>
## 1 2015 M Zixuan 7 0.00000343 2010
## 2 2011 F Aleaya 6 0.0000031 2010
## 3 1932 M Arlen 178 0.000166 1930
## 4 1975 F Tiffiney 35 0.0000224 1970
## 5 2007 F Coral 178 0.0000842 2000
K.9.2.2 How many names over all years
We just need to add the count variable n:
babynames %>%
   summarize(n = sum(n))
## # A tibble: 1 × 1
## n
## <int>
## 1 348120517
K.9.2.3 Shiva for boys/girls
The task list might look like this:
- filter to keep only boys (or only girls)
- filter to keep only name “Shiva”
- summarize this dataset by adding up all counts n
There are, obviously, other options. For instance, you can swap the filter by sex and the filter by name.
## for boys
babynames %>%
   filter(sex == "M") %>%
   filter(name == "Shiva") %>%
   summarize(sum(n))
## # A tibble: 1 × 1
## `sum(n)`
## <int>
## 1 397
## for girls
babynames %>%
   filter(sex == "F") %>%
   filter(name == "Shiva") %>%
   summarize(sum(n))
## # A tibble: 1 × 1
## `sum(n)`
## <int>
## 1 249
K.9.3 Combining dplyr operations
The tasklist for this question (see above) might be:
- Take the orange tree dataset
- keep only rows that have size > 100
- pull out the tree number
- find all unique trees
- how many unique trees did you find?
This can be translated to code as:
Orange %>%
   filter(circumference > 100) %>%
   pull(Tree) %>%
   unique() %>%
   length()
## [1] 5
So there are 5 different trees.
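For comparison, the same task list can be sketched in base R, one step per line. The intermediate names big and trees are made up for this illustration; Orange is the dataset built into R.

```r
big <- Orange[Orange$circumference > 100, ]  # keep only rows with size > 100
trees <- unique(big$Tree)                    # pull out the tree number, keep unique ones
length(trees)                                # how many unique trees: 5
```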
K.9.4 Grouped operations
K.9.4.1 Titanic fare by class
The computations are pretty much the same as the example in the text:
titanic %>%
   group_by(pclass) %>%
   summarize(avgFare = mean(fare, na.rm=TRUE),
             maxFare = max(fare, na.rm=TRUE),
             avgAge = mean(age, na.rm=TRUE),
             maxAge = max(age, na.rm=TRUE)
             )
## # A tibble: 3 × 5
## pclass avgFare maxFare avgAge maxAge
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 87.5 512. 39.2 80
## 2 2 21.2 73.5 29.5 70
## 3 3 13.3 69.6 24.8 74
The results make sense, as the first class is the most expensive, and the third class the cheapest option. However, it is hard to see why the most expensive 3rd class option cost much more than the 2nd class average. It is also reasonable that older people are more likely to travel in upper classes, as they may be wealthier, and their health may be more fragile.
K.9.4.2 Most distinct names
Here we compute the number of distinct names for each year, order the result by \(n\), and print the first three lines:
babynames %>%
   group_by(year) %>%
   summarize(n = n_distinct(name)) %>%
   arrange(desc(n)) %>%
   head(3)
## # A tibble: 3 × 2
## year n
## <dbl> <int>
## 1 2008 32510
## 2 2007 32416
## 3 2009 32242
Apparently, all these years are in the late 2000s.
K.9.4.3 Most popular boy and girl names
The only difference here is to group by year and sex:
babynames %>%
   filter(between(year, 2002, 2006)) %>%
   group_by(year, sex) %>%
   arrange(desc(n), .by_group = TRUE) %>%
   summarize(name = name[1])
## # A tibble: 10 × 3
## # Groups: year [5]
## year sex name
## <dbl> <chr> <chr>
## 1 2002 F Emily
## 2 2002 M Jacob
## 3 2003 F Emily
## 4 2003 M Jacob
## 5 2004 F Emily
## 6 2004 M Jacob
## 7 2005 F Emily
## 8 2005 M Jacob
## 9 2006 F Emily
## 10 2006 M Jacob
As we can see, these are just Emily and Jacob.
K.9.4.4 Three most popular names
The first 3 names in terms of popularity can just be filtered using the condition rank(desc(n)) <= 3:
babynames %>%
   filter(between(year, 2002, 2006)) %>%
   group_by(year) %>%
   filter(rank(desc(n)) <= 3)
## # A tibble: 15 × 5
## # Groups: year [5]
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2002 M Jacob 30568 0.0148
## 2 2002 M Michael 28246 0.0137
## 3 2002 M Joshua 25986 0.0126
## 4 2003 F Emily 25688 0.0128
## 5 2003 M Jacob 29630 0.0141
## 6 2003 M Michael 27118 0.0129
## 7 2004 F Emily 25033 0.0124
## 8 2004 M Jacob 27879 0.0132
## 9 2004 M Michael 25454 0.0121
## 10 2005 F Emily 23937 0.0118
## 11 2005 M Jacob 25830 0.0121
## 12 2005 M Michael 23812 0.0112
## 13 2006 M Jacob 24841 0.0113
## 14 2006 M Michael 22632 0.0103
## 15 2006 M Joshua 22317 0.0102
As you can see, these are various combinations of “Jacob”, “Michael”, “Joshua” and “Emily”.
K.9.4.5 10 most popular girl names after 2000
This is just about keeping girls only, and arranging by popularity afterward:
babynames %>%
   filter(sex == "F",
          year > 2000) %>%
   group_by(name) %>%
   summarize(n = sum(n)) %>%
   filter(rank(desc(n)) <= 5) %>%  # top 5 shown here; use <= 10 for ten names
   arrange(desc(n))
## # A tibble: 5 × 2
## name n
## <chr> <int>
## 1 Emma 327254
## 2 Emily 298119
## 3 Olivia 290625
## 4 Isabella 285307
## 5 Sophia 265572
We can see that “Emma” has been the most popular.
K.9.4.6 Most popular name by decade
This is a noticeably more tricky task:
- First we need to compute decade. This can be done using integer division %/% as (year %/% 10)*10.
- Thereafter, we need to add all counts n for each name and decade. Hence we group by name and decade, and sum n.
- Thereafter, we need to rank the popularity for each decade. Hence we group again, but now just by decade.
We can do it along these lines:
babynames %>%
   mutate(decade = year %/% 10 * 10) %>%
   group_by(name, decade) %>%
   summarize(n = sum(n)) %>%
   group_by(decade) %>%
   filter(rank(desc(n)) == 1) %>%
   arrange(decade)
## # A tibble: 14 × 3
## # Groups: decade [14]
## name decade n
## <chr> <dbl> <int>
## 1 Mary 1880 92030
## 2 Mary 1890 131630
## 3 Mary 1900 162188
## 4 Mary 1910 480015
## 5 Mary 1920 704177
## 6 Robert 1930 593451
## 7 James 1940 798225
## 8 James 1950 846042
## 9 Michael 1960 836934
## 10 Michael 1970 712722
## 11 Michael 1980 668892
## 12 Michael 1990 464249
## 13 Jacob 2000 274316
## 14 Emma 2010 158715
We see that in the early years “Mary” was leading the pack; later, mostly boy names have dominated.
Note the third line, group_by(name, decade). For each decade, this makes groupings based on name only, not separately for name and sex. Hence for names that were given to both boys and girls, we add up all instances across genders.
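The effect of the grouping choice can be sketched on a tiny made-up table. The name "Riley" and all counts here are hypothetical; only the grouping logic mirrors the solution.

```r
library(dplyr)

# Hypothetical counts: "Riley" is given to both girls and boys
toy <- data.frame(
   name   = c("Riley", "Riley", "Mary"),
   sex    = c("F", "M", "F"),
   decade = c(2000, 2000, 2000),
   n      = c(10, 7, 12)
)

# Grouping by name and decade adds up both sexes: Riley gets 10 + 7 = 17
by_name <- toy %>%
   group_by(name, decade) %>%
   summarize(n = sum(n), .groups = "drop")

# Adding sex to the grouping keeps the two Rileys apart: 10 and 7
by_name_sex <- toy %>%
   group_by(name, sex, decade) %>%
   summarize(n = sum(n), .groups = "drop")
```

With the first grouping, "Riley" would outrank "Mary" in this toy decade; with the second it would not.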
K.9.4.7 “Mei” by decade
The final code might look like
babynames %>%
   filter(sex == "F") %>%
   mutate(decade = (year %/% 10) * 10) %>%
   group_by(name, decade) %>%
   summarize(n = sum(n)) %>%  # popularity over all 10 years!
   group_by(decade) %>%
   mutate(k = rank(desc(n))) %>%
   filter(name == "Mei")
## # A tibble: 8 × 4
## # Groups: decade [8]
## name decade n k
## <chr> <dbl> <int> <dbl>
## 1 Mei 1940 18 6274.
## 2 Mei 1950 15 8015
## 3 Mei 1960 36 7082
## 4 Mei 1970 111 5149
## 5 Mei 1980 136 5356.
## 6 Mei 1990 191 5176
## 7 Mei 2000 385 3788.
## 8 Mei 2010 356 3560.
We see that “Mei” has gained in popularity over time, moving from around 6000th place in the 1940s to around 3500th place in the 2010s.
A reminder here: the counts n in the table are probably underestimates, as names are only included if they are given at least 5 times.
K.9.5 More advanced dplyr usage
K.9.5.1 Sea and Creek 1980-2000
We can just filter the required years and the required names, both using %in%:
babynames %>%
   filter(year %in% c(1980, 1985, 1990, 1995, 2000),
          name %in% c("Sea", "Creek"))
## # A tibble: 2 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1985 M Sea 6 0.00000312
## 2 2000 M Creek 7 0.00000335
We can see that these names were not popular, but both were given over five times to boys.
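As a quick reminder of what %in% does, here is a sketch on a small made-up vector: it tests, element by element, whether the left-hand value appears anywhere in the right-hand vector.

```r
year <- c(1980, 1983, 1985, 2001)  # made-up values for illustration
year %in% c(1980, 1985, 1990, 1995, 2000)  # TRUE FALSE TRUE FALSE
```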
K.9.5.2 Name popularity frequency table
Here we want to count how many times each value \(n=5\), \(n=6\), and so on occurs. So we just count it:
p <- babynames %>%
   filter(year == 2004,
          sex == "F") %>%
   count(n)
p %>%
   sample_n(5)
## # A tibble: 5 × 2
## n nn
## <int> <int>
## 1 789 1
## 2 111 11
## 3 423 3
## 4 292 3
## 5 713 1
Over all time: we need to aggregate \(n\):
p <- babynames %>%
   group_by(name) %>%
   summarize(n = sum(n)) %>%
   count(n)
p %>%
   sample_n(5)
## # A tibble: 5 × 2
## n nn
## <int> <int>
## 1 4593 1
## 2 3435 2
## 3 11556 1
## 4 3404 1
## 5 2570 2
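The idea of a frequency table can also be sketched with base-R table() on a handful of made-up counts. This is an alternative illustration, not the method used in the solution above.

```r
n <- c(5, 5, 6, 7, 7, 7)  # hypothetical name counts
freq <- table(n)          # how many times each distinct value occurs
freq
```

Here the value 5 occurs twice, 6 once and 7 three times.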
K.10 ggplot2
K.10.1 Basic plotting with ggplot2
K.10.1.1 Length versus width
![fig: plot of chunk unnamed-chunk-100](.fig/solutions/unnamed-chunk-100-1.png)
The only confusing part here is that both the data variables and aesthetics are called x and y. So we need the aesthetic mapping aes(x = x, y = y):
ggplot(d1000,
aes(x = x, y = y)) +
geom_point()
We see that most diamonds have very similar x and y, hence they are almost circular when seen from above.
K.10.1.2 Two aes()-s in one plot
![fig: plot of chunk unnamed-chunk-101](.fig/solutions/unnamed-chunk-101-1.png)
This works beautifully:
ggplot(d1000,
aes(x = carat, y = price)) +
geom_point(aes(col = cut))
In fact, for the current plot, this is equivalent to specifying all aesthetics in ggplot(), or specifying those in geom_point().
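The two equivalent variants might be written as in the sketch below. Here d is a stand-in for d1000, built from the first 1000 rows of diamonds; how d1000 was actually created is not shown in this section, so that is an assumption.

```r
library(ggplot2)

d <- head(diamonds, 1000)  # stand-in for d1000 (an assumption)

# all aesthetics in ggplot()
p1 <- ggplot(d, aes(x = carat, y = price, col = cut)) +
   geom_point()

# all aesthetics in geom_point()
p2 <- ggplot(d) +
   geom_point(aes(x = carat, y = price, col = cut))
```

The difference only starts to matter when there are several layers: aesthetics given in ggplot() are inherited by every layer, those given in geom_point() apply to that layer alone.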
K.10.1.3 Amended color plot
![fig: plot of chunk unnamed-chunk-102](.fig/solutions/unnamed-chunk-102-1.png)
Specifying x and y in ggplot() and fixed aesthetics in geom_point():
ggplot(d1000,
aes(x = carat, y = price)) +
geom_point(col = "limegreen",
size = 3,
alpha = 0.3)
Transparency helps to understand the dense region of small diamonds: there is a lot of overplotting, and otherwise it is hard to tell what is going on there.
K.10.2 Most important plot types
K.10.2.1 Orange tree barplot in different colors
![fig: plot of chunk sol-ggplot-types-orange-bar-colors](.fig/solutions/sol-ggplot-types-orange-bar-colors-1.png)
Using aes(..., fill=Tree) uses the values of the data variable Tree to determine the color of the bars.
We can just add the aesthetic fill=Tree to make the bar colors different for different trees:
ggplot(avg,
aes(Tree, size, fill=Tree)) +
geom_col()
Remember that it is the fill aesthetic that controls the fill color, not the col aesthetic!
But here the colors do not contain any information that is not already embedded in the bars. While colors are usually a nice visual feature, they may be misleading in some cases, making the viewer believe that the colors have a distinct meaning, separate from the bars.
K.10.2.2 Histogram of Titanic data
![fig: plot of chunk unnamed-chunk-105](.fig/solutions/unnamed-chunk-105-1.png)
Here is the age histogram:
ggplot(titanic,
aes(age)) +
geom_histogram(bins = 30,
fill = "mediumpurple4",
col = "gold1")
30 bins seems a good choice here.
![fig: plot of chunk unnamed-chunk-106](.fig/solutions/unnamed-chunk-106-1.png)
Here is the fare histogram:
ggplot(titanic,
aes(fare)) +
geom_histogram(bins = 70,
fill = "mediumpurple4",
col = "gold1")
A larger number of bins is better here, in order to make more bins available for cheaper tickets, below £100, where we have most data.
As you see, age is distributed broadly normally, but fare is more like log-normal with a long right tail of very expensive tickets. Why is it like that? It is broadly related to the fact that human age has a pretty hard upper limit, but no such limit exists for wealth. There were very wealthy passengers, but no-one could have been 500 years old.
K.10.2.3 Diamond price in a narrow range
![fig: plot of chunk unnamed-chunk-107](.fig/solutions/unnamed-chunk-107-1.png)
Here is the price distribution for mass range \([0.45,0.5]\)ct.
diamonds %>%
   filter(between(carat, 0.45, 0.5)) %>%
   ggplot(aes(cut, price)) +
   geom_boxplot()
![fig: plot of chunk unnamed-chunk-108](.fig/solutions/unnamed-chunk-108-1.png)
And here for \([0.95,1]\)ct.
diamonds %>%
   filter(between(carat, 0.95, 1)) %>%
   ggplot(aes(cut, price)) +
   geom_boxplot()
Now it is fairly obvious that better cut is associated with higher price.
K.10.3 Inheritance
K.10.3.1 Ice extent in January
Everything in color:
ice <- read_delim("data/ice-extent.csv.bz2")
ice %>%
   filter(month == 2) %>%
   ggplot(aes(year, extent, col = region)) +
   geom_line() +
   geom_point()
![fig: plot of chunk unnamed-chunk-109](.fig/solutions/unnamed-chunk-109-1.png)
Gray lines:
ice %>%
   filter(month == 2) %>%
   ggplot(aes(year, extent, col = region)) +
   geom_line(aes(group = region),
             col = "gray80",
             linewidth = 2) +
   geom_point()
![fig: plot of chunk unnamed-chunk-110](.fig/solutions/unnamed-chunk-110-1.png)
3 Months in north:
ice %>%
   filter(month %in% c(2, 5, 9)) %>%
   filter(region == "N") %>%
   ggplot(aes(year, extent, col = factor(month))) +
   geom_line() +
   geom_point()
![fig: plot of chunk unnamed-chunk-111](.fig/solutions/unnamed-chunk-111-1.png)
3 Months in north, gray lines:
ice %>%
   filter(month %in% c(2, 5, 9)) %>%
   filter(region == "N") %>%
   ggplot(aes(year, extent, col = factor(month))) +
   geom_line(aes(group = month),
             col = "gray30",
             linewidth = 2) +
   geom_point()
![fig: plot of chunk unnamed-chunk-112](.fig/solutions/unnamed-chunk-112-1.png)
K.10.4 Tuning your plots
K.10.4.1 Political parties with one color not specified
![fig: plot of chunk ggplot-tuning-loksabha-missing](.fig/solutions/ggplot-tuning-loksabha-missing-1.png)
A party whose color is unspecified is displayed in gray, more specifically in the color given by the argument na.value of scale_fill_manual().
Let’s leave out INC and write
data.frame(party = c("BJP", "INC", "AITC"),
seats = c(303, 52, 23)) %>%
ggplot(aes(party, seats, fill=party)) +
geom_col() +
scale_fill_manual(
values = c(BJP="orange2",
AITC="springgreen3")
)
As you see, it does not result in an error but a gray bar for INC. The gray value can be adjusted with na.value, e.g. as scale_fill_manual(na.value="red").
K.10.4.2 Manually specifying a continuous scale
I do not know how one might be able to manually specify colors for a continuous scale. The problem is that continuous variables can take an infinite number of values–and you cannot specify an infinite number of values manually.
The closest existing option to this is scale_color_gradientn(). This allows you to link a number of data values to specific colors, and tell ggplot to use a gradient for whatever values there are in-between.
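A minimal sketch of this idea follows. The data frame df and the variable z are made up for illustration, and the anchor values are chosen to span exactly the range of z; scale_color_gradientn() expects anchor positions on a 0-1 scale, which is why scales::rescale() is applied to them.

```r
library(ggplot2)

# Hypothetical data: a continuous variable z mapped to color
df <- data.frame(x = 1:50, y = 1:50,
                 z = seq(-10, 40, length.out = 50))

p <- ggplot(df, aes(x, y, col = z)) +
   geom_point() +
   scale_color_gradientn(
      colors = c("blue", "white", "red"),
      # anchor the three colors at z = -10, 0 and 40,
      # converted to positions 0, 0.2 and 1
      values = scales::rescale(c(-10, 0, 40))
   )
```

Values of z between the anchors get interpolated colors, so 0 is white even though it is not in the middle of the data range.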
K.10.4.3 Using wrong scales
![fig: plot of chunk unnamed-chunk-114](.fig/solutions/unnamed-chunk-114-1.png)
Using the wrong scale (col instead of fill) is silently ignored:
gdp <- data.frame(GDP=c(1000, 1050),
                  year=c(2023, 2024))
ggplot(gdp,
aes(year, GDP, fill=factor(year))) +
geom_col() +
scale_color_manual(
values = c("2023"="black",
"2024" = "white")
)
K.10.4.4 March ice extent
![fig: plot of chunk ice-gradient2](.fig/solutions/ice-gradient2-1.png)
Coloring bars according to the value
ice <- read_delim("data/ice-extent.csv.bz2")
## create a separate filtered df--
## we need it for both plotting
## and for computing the average
ice3 <- ice %>%
   filter(month == 3,
          region == "N")
avg <- ice3$extent %>%
   mean()
ggplot(ice3,
aes(year, extent, fill = extent)) +
geom_col() +
scale_fill_gradient2(low = "red",
mid = "white",
high = "blue",
midpoint = avg)
Here one might want to plot not the extent itself, but the difference between the extent and its average (baseline) value.
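Such a difference plot might be sketched as follows. The numbers are a made-up stand-in for the filtered ice data, and the column name diff is introduced only for this example. As the baseline is subtracted, the midpoint of the color scale becomes 0.

```r
library(ggplot2)

# Hypothetical stand-in for the March northern ice extent
ice3 <- data.frame(year = 2000:2005,
                   extent = c(15.2, 15.6, 15.4, 15.5, 15.0, 14.7))
# difference from the average (baseline) value
ice3$diff <- ice3$extent - mean(ice3$extent)

p <- ggplot(ice3, aes(year, diff, fill = diff)) +
   geom_col() +
   scale_fill_gradient2(low = "red",
                        mid = "white",
                        high = "blue",
                        midpoint = 0)
```

Bars now point up for years above the baseline and down for years below it, so the color and the direction of the bar carry the same message.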
K.10.5 More geoms and plot types
K.10.5.1 Titanic fare by passenger class
![fig: plot of chunk fare-density-pclass](.fig/solutions/fare-density-pclass-1.png)
Here is the example:
titanic %>%
   ggplot(aes(fare,
              fill = factor(pclass))) +
   geom_density(alpha = 0.5) +
   coord_cartesian(xlim = c(0, 100),
                   ylim = c(0, 0.05))
We limit the plot region to \([0, 100] \times [0, 0.05]\) to zoom into the more interesting area. Alternatively, one may consider log-scale.
K.11 More about data manipulations
K.11.1 Merging data: joins
K.11.1.1 Merge artists, songs
left_join(artists, songs) should put the artists first and add a column song at the end of it. Something like
name | plays | song |
---|---|---|
John | guitar | Come Together |
Paul | bass | Hello, Goodbye |
But the problem is that John is playing in two songs, so a single song name may not be sufficient. One can come up with multiple solutions. For instance, you can list the first song where John plays. Or you can create two lines for John, one for each song. You may also create two columns for songs, one for each song.
left_join() picks the option of creating two lines, one for each song:
songs <- data.frame(song = c("Across the Universe", "Come Together",
                             "Hello, Goodbye", "Peggy Sue"),
                    name = c("John", "John", "Paul", "Buddy"))
artists <- data.frame(name = c("George", "John", "Paul", "Ringo"),
                      plays = c("sitar", "guitar", "bass", "drums"))
left_join(artists, songs)
## name plays song
## 1 George sitar <NA>
## 2 John guitar Across the Universe
## 3 John guitar Come Together
## 4 Paul bass Hello, Goodbye
## 5 Ringo drums <NA>
K.11.2 Reshaping
K.11.2.1 Drinking data with years in rows
This is fairly easy and fairly logical. Essentially, we need to rotate the original wide form data by 90°:
drinking <- data.frame(
   state = c("Tennessee", "North Carolina", "Pennsylvania"),
   `2009` = c(48.3, 60.3, 36),
   `2010` = c(48.1, 59.7, 37.3),
   `2011` = c(39.6, 60.4, 40.6),
   `2012` = c(48.1, 59.2, 41.2),
   check.names = FALSE
)
drinking %>%
   pivot_longer(!state, names_to = "year", values_to = "pct") %>%
   pivot_wider(names_from = "state", values_from = "pct")
## # A tibble: 4 × 4
## year Tennessee `North Carolina` Pennsylvania
## <chr> <dbl> <dbl> <dbl>
## 1 2009 48.3 60.3 36
## 2 2010 48.1 59.7 37.3
## 3 2011 39.6 60.4 40.6
## 4 2012 48.1 59.2 41.2
This table is easy to understand. Putting years in rows is also widely used in the literature.
K.11.2.2 Drinking data in pure wide form
If we do not have states in separate rows, then we need more columns. Currently we have 6 sex-year combinations for each state. We still need all six of those, but now they must be in the same row for all states. So we’ll have a peculiar data frame with a single row only! The resulting dataset will contain that single row and a large number of columns, a separate set for each state. But there will be no distinct “state” column. It might look like
2009_Tennessee | 2009_North Carolina | 2009_Pennsylvania | 2010_Tennessee | 2010_North Carolina | 2010_Pennsylvania | 2011_Tennessee | 2011_North Carolina | 2011_Pennsylvania | 2012_Tennessee | 2012_North Carolina | 2012_Pennsylvania |
---|---|---|---|---|---|---|---|---|---|---|---|
48.3 | 60.3 | 36 | 48.1 | 59.7 | 37.3 | 39.6 | 60.4 | 40.6 | 48.1 | 59.2 | 41.2 |
Note that we now need to add state name to the column names to make clear which “2009” means Tennessee and which one North Carolina.
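One way to produce such a single-row table is sketched below, assuming the year-only drinking data frame from the previous exercise. Passing two columns to names_from makes pivot_wider() paste them together with "_", giving names like "2009_Tennessee". The column order may differ from the table above.

```r
library(dplyr)
library(tidyr)

drinking <- data.frame(
   state = c("Tennessee", "North Carolina", "Pennsylvania"),
   `2009` = c(48.3, 60.3, 36),
   `2010` = c(48.1, 59.7, 37.3),
   `2011` = c(39.6, 60.4, 40.6),
   `2012` = c(48.1, 59.2, 41.2),
   check.names = FALSE
)

wide <- drinking %>%
   pivot_longer(!state, names_to = "year", values_to = "pct") %>%
   # combine year and state into the column names
   pivot_wider(names_from = c(year, state), values_from = pct)

dim(wide)  # 1 row, 12 columns
```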
K.11.2.3 Reshape ice extent
As a refresher, the ice extent data looks like
ice %>%
   select(year, month, region, area) %>%
   head(3)
## # A tibble: 3 × 4
## year month region area
## <dbl> <dbl> <chr> <dbl>
## 1 1978 11 N 9.04
## 2 1978 11 S 11.7
## 3 1978 12 N 10.9
- In terms of the region, the dataset is in long form. There is only a single column region that contains region type (“N” and “S”).
- In wide form, the column names might be Narea and Sarea, for instance.
- reshape to wide:
ice %>%
   select(year, month, region, area) %>%
   pivot_wider(names_from = "region", values_from = "area") %>%
   head(4)
## # A tibble: 4 × 4
## year month N S
## <dbl> <dbl> <dbl> <dbl>
## 1 1978 11 9.04 11.7
## 2 1978 12 10.9 6.97
## 3 1979 1 12.4 3.47
## 4 1979 2 13.2 2.11
As you see, by default the variable names are “N” and “S”, the same values that were in the region column.
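If names like Narea and Sarea are preferred, one option is the names_glue argument of pivot_wider(). The sketch below uses a four-row excerpt typed in by hand from the output above instead of the full dataset.

```r
library(tidyr)

# Hand-typed excerpt of the ice data in long form
ice <- data.frame(year = c(1978, 1978, 1978, 1978),
                  month = c(11, 11, 12, 12),
                  region = c("N", "S", "N", "S"),
                  area = c(9.04, 11.7, 10.9, 6.97))

wide <- pivot_wider(ice,
                    names_from = region,
                    values_from = area,
                    # build the new column names from the region value
                    names_glue = "{region}area")
names(wide)  # "year" "month" "Narea" "Sarea"
```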
K.11.2.4 Reshape patients data
This data frame is in a wide form as there are two columns, male and female, that contain counts. The NA is somewhat misleading, it would be more appropriate to put “0” in that place instead.
Hence we can reshape it into a long form:
patients <- data.frame(pregnant = c("yes", "no"),
                       male = c(NA, 25),
                       female = c(11, 20))
patients %>%
   pivot_longer(!pregnant,
                names_to = "sex",
                values_to = "count")
## # A tibble: 4 × 3
## pregnant sex count
## <chr> <chr> <dbl>
## 1 yes male NA
## 2 yes female 11
## 3 no male 25
## 4 no female 20
The result has three columns: pregnant, sex and count. If needed, we can remove the NA-row.
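One possible way to remove the NA-row is to let pivot_longer() drop it while reshaping, through its values_drop_na argument. This is a sketch of one option; filtering with !is.na(count) afterward would work as well.

```r
library(tidyr)

patients <- data.frame(pregnant = c("yes", "no"),
                       male = c(NA, 25),
                       female = c(11, 20))

long <- pivot_longer(patients, !pregnant,
                     names_to = "sex",
                     values_to = "count",
                     values_drop_na = TRUE)  # drops the pregnant-male row
nrow(long)  # 3
```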
K.12 Making maps
K.12.1 Shapefiles and GeoJSON
K.12.1.1 Difference between spatial data frame and manual map data frame
There are multiple differences:
- Perhaps most importantly, the hand-made NZ map in Section 15.1.1 is stored as one vertex per row, while the spatial data frame is stored one polygon per row. This makes spatial data frames much smaller, for instance, you do not need to replicate the same color value for every single vertex–a single value for the polygon is enough.
- Another important difference is the presence of a coordinate reference system (CRS). This makes it easy to transform one coordinate system to another, and in this way to use spatial data that is stored using different systems.
K.12.1.2 Why left_join()?
We use left join to merge map and population. Remember: this retains all the rows of map but drops the lines of population where there is no corresponding region on the map. Hence we retain all regions (rows of map), with potentially NA as the population value. This is a reasonable approach.
Alternatively:
- Inner join would remove from the map the regions where population is missing. That would leave holes in the map. It is probably better to keep those regions and use a dedicated NA-color, such as gray, instead.
- Outer join will preserve all regions, but also population information for those regions that are not present on the map. This will probably not be a serious problem, it may just clutter your data frame with unnecessary rows.
- Finally, right join will combine the worst of both worlds: leave holes in the map for missing population data, while also cluttering the final dataset.
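The four joins can be compared on two tiny made-up tables. The region names and population numbers below are hypothetical, and map is a plain data frame standing in for the spatial data frame: region "A" has no population data, and region "D" is not on the map.

```r
library(dplyr)

# Hypothetical map regions and population table
map <- data.frame(region = c("A", "B", "C"))
population <- data.frame(region = c("B", "C", "D"),
                         pop = c(100, 200, 300))

nrow(left_join(map, population, by = "region"))   # 3: all map regions, NA for A
nrow(inner_join(map, population, by = "region"))  # 2: A is dropped, a hole in the map
nrow(full_join(map, population, by = "region"))   # 4: all map regions plus a row for D
nrow(right_join(map, population, by = "region"))  # 3: A is dropped and D is added
```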