K Exercise solutions

K.1 Introduction to R

K.1.1 Variables

K.1.1.1 Invalid variable names

You just type the assignments into the RStudio console. Here is the screenshot (as of R 4.3.1). As you can see, the errors are not all the same, and they do not explicitly say that the variable name is wrong.
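
One way to check names without triggering the errors is base R's make.names(), which turns any string into a syntactically valid name; a name that it leaves unchanged is valid. The sketch below uses made-up candidate names (assumptions for illustration, not necessarily the ones in the exercise):

```r
# Hypothetical candidate names, chosen for illustration only
candidates <- c("x", "my_var", "2fast", "my var", "TRUE", ".hidden")
# make.names() rewrites invalid names; valid ones come back unchanged
isValid <- make.names(candidates) == candidates
data.frame(candidates, isValid)
```

Here "2fast" (starts with a digit), "my var" (contains a space) and "TRUE" (a reserved word) are flagged as invalid.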

K.1.2 Data Types

K.1.2.1 Years to decades

If we integer-divide year by “10”, then we get the decade (without the trailing “0”). E.g.

1966 %/% 10
## [1] 196

Now we just multiply the result by 10:

1966 %/% 10 * 10
## [1] 1960

Or, to make the order of operations clearer:

(2023 %/% 10)*10
## [1] 2020

K.1.2.2 Are you above 20?

There are many ways to do it, here is just one possible solution:

age <- 53
older <- age > 20
older
## [1] TRUE

Note the variable names: age is fairly self-explanatory, older is much less so. In complex projects one may prefer a name like age_over_20 or something similar. But in a script of a few lines, even a and o may do.

K.1.2.3 Happy and raining

We can basically copy the pets example and write

mood <- "happy"
weather <- "sunny"
mood == "happy" & weather == "rainy"  # false, as not rainy
## [1] FALSE

Alternatively, we can use parentheses to stress what is compared:

(mood == "happy") & (weather == "rainy")
## [1] FALSE

K.1.3 Producing output

K.1.3.1 Sound around earth

We can follow the lightyear example fairly closely:

s <- 0.34  # speed of sound, km/s
distance <- 42000
tSec <- distance/s
tHrs <- tSec/3600
tDay <- tHrs/24
cat("It takes", tSec, "seconds, or",
    tHrs, "hours, \nor", tDay,
    "days for sound to travel around earth\n")
## It takes 123529.4 seconds, or 34.31373 hours, 
## or 1.429739 days for sound to travel around earth

Note how we injected the new line, \n in front of “or” for days. This makes the output lines somewhat shorter and easier to read.

Sound does not often travel around the world, but the pressure wave of the 1883 Krakatoa volcanic eruption was actually measured circumnavigating the world 3 times in 5 days. See the Wikipedia entry.
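
As a rough plausibility check of that figure, we can reuse the numbers above. This is only a sketch that assumes the same constant speed of sound and the same distance around earth for every trip:

```r
s <- 0.34          # speed of sound, km/s (as above)
distance <- 42000  # km around earth (as above)
tDay <- distance/s/3600/24  # days for one trip around earth
laps <- 3
cat(laps, "trips take about", round(laps*tDay, 1), "days\n")
```

The result is a bit over 4 days, which is consistent with 3 trips being observed within 5 days.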

K.2 Functions

K.2.1 For-loops

K.2.1.1 Odd numbers only

The form of seq() we need here is seq(from, to, by), so that the sequence runs from the value “from” to the value “to” with the step “by”. So we can write

for(i in seq(1, 9, 2)) {
   i2 <- i^2
   cat(i, "^2 = ", i2, "\n", sep="")
}
## 1^2 = 1
## 3^2 = 9
## 5^2 = 25
## 7^2 = 49
## 9^2 = 81

K.2.1.2 Multiply 7

We can just follow the loop example in Section 3.1:

for(i in 10:0) {
   cat("7*", i, " = ", 7*i, "\n", sep="")
}
## 7*10 = 70
## 7*9 = 63
## 7*8 = 56
## 7*7 = 49
## 7*6 = 42
## 7*5 = 35
## 7*4 = 28
## 7*3 = 21
## 7*2 = 14
## 7*1 = 7
## 7*0 = 0

Note the differences:

  • we go down from “10” to “0” using 10:0
  • we need to specify that the numbers and strings we print should not be separated by spaces, using the sep="" argument for cat.
  • we could have created a separate variable i7 <- i*7 but we chose to write this expression directly as an argument for cat().

K.2.1.3 Print carets ^

This is very simple: we just need to use cat("^") 10 times in a loop:

for(i in 1:10) {
   cat("^")
}
cat("\n")  # end the line here
## ^^^^^^^^^^

Note that we end the line after the loop because we do not want whatever follows to be on the same line.

K.2.1.4 Asivärk

The trick here is to use the caret-printing example, but now we need to do it not 10 times, but a different number of times in each row. We can call this number n, and change n in another, outer for-loop, from 1 to 10:

for(n in 1:10) {
   for(i in 1:n) {
      cat("^")
   }
   cat("\n")
}
## ^
## ^^
## ^^^
## ^^^^
## ^^^^^
## ^^^^^^
## ^^^^^^^
## ^^^^^^^^
## ^^^^^^^^^
## ^^^^^^^^^^

Note how the middle rows of the code are essentially the caret-printing example; the only difference is 1:n instead of 1:10 in the loop header. This lets the outer loop index n control the number of carets printed.
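
If the nested loop is not required, base R's strrep() (available since R 3.3.0) can repeat the character directly. This is an alternative sketch, not the loop-based solution the exercise asks for:

```r
rows <- strrep("^", 1:10)  # "^", "^^", ..., one string per row
cat(rows, sep = "\n")      # print the rows, separated by new lines
cat("\n")                  # end the last line
```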

K.2.1.5 Cloud and Rain

This is a somewhat more complicated example, but the broad idea is similar to that of Asivärk. We need nested loops here too: first, the outer loop should count the number of v-s. Second, we need three inner loops: for dashes at left, v-s in the middle, and dashes at right. All these loops should nest inside the outer loop:

for(n in seq(10, 2, by=-2)) {
                           # n is the number of v-s each row
                           # 10, 8, 6, ...
   nDash <- (12 - n)/2  # how many raindrops each side of the cloud
   ## Left raindrops
   for(i in 1:nDash) {
      cat("-")
   }
   ## Center cloud
   for(i in 1:n) {
      cat("v")
   }
   ## Right raindrops
   for(i in 1:nDash) {
      cat("-")
   }
   cat("\n")  # row ends here
}
## -vvvvvvvvvv-
## --vvvvvvvv--
## ---vvvvvv---
## ----vvvv----
## -----vv-----
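
One caveat of 1:nDash: if the cloud were as wide as the whole row (n = 12, which is not part of the exercise), nDash would be 0, and 1:0 counts down through 1 and 0, so each dash loop would still run twice. A sketch that avoids this uses seq_len(), which gives an empty sequence for 0. The helper name cloudRow and its width argument are made up for this illustration:

```r
cloudRow <- function(n, width = 12) {
   nDash <- (width - n)/2  # dashes on each side
   row <- ""
   for(i in seq_len(nDash)) {  # runs zero times if nDash is 0
      row <- paste0(row, "-")
   }
   for(i in seq_len(n)) {
      row <- paste0(row, "v")
   }
   for(i in seq_len(nDash)) {
      row <- paste0(row, "-")
   }
   row  # return the complete row as a single string
}
cat(cloudRow(12), "\n")  # full-width cloud, no dashes
cat(cloudRow(10), "\n")  # same as the first row above
```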

K.2.2 What are functions

K.2.2.1 Value/side effects of seq()

Documentation (and experimentation) shows that seq() will return a sequence of numbers and does not produce any side effects:

x <- seq(1, 5)  # no side effects
x  # the returned sequence
## [1] 1 2 3 4 5

K.2.3 Writing functions

K.2.3.1 M87 black hole in km

The function might look similar to feet2m, but we may need to compute the length of a single light-year inside of the function:

ly2km <- function(distance) {
   c <- 300000  # speed of light, km/s
   ly <- c*60*60*24*365  # length of a single light-year:
                           # speed of light * seconds in minute *
                           # minutes in hour * hours in day *
                           # days in year
   distance*ly
}

And we find the distance to the black hole as

ly2km(55000000)
## [1] 5.20344e+20

or maybe it is easier to write it as

ly2km(55e6)  # 55*10^6
## [1] 5.20344e+20

If this number does not tell you much then you are not alone: such big distances are beyond what we on earth can perceive.

K.2.3.2 Years to decades

Perhaps the most un-intuitive part here is the integer division %/%: it just divides the numbers, but discards all fractional parts. For instance,

2024 %/% 10
## [1] 202

In order to make this into the decade, we just need to multiply the result by 10 again. So the function might look like:

decade <- function(year) {
   (year %/% 10)*10
}

decade(2024)
## [1] 2020
decade(1931)
## [1] 1930
decade(1969)
## [1] 1960
decade(1970)
## [1] 1970
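
An equivalent approach, shown here as a sketch, is to subtract the remainder year %% 10 instead. Both versions also work on a whole vector of years at once. The name decade2 is made up for this comparison:

```r
decade <- function(year) {
   (year %/% 10)*10
}
decade2 <- function(year) {
   year - year %% 10  # subtract the last digit of the year
}
years <- c(2024, 1931, 1969, 1970)
decade(years)   # 2020 1930 1960 1970
decade2(years)  # the same decades
```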

K.2.3.3 Dates

date <- function(day, month, year = 2024) {
   paste0(year, "-", month, "-", day)
}

date(30, 3, 2012)
## [1] "2012-3-30"
date(30, 3)
## [1] "2024-3-30"

Note that the order of arguments is somewhat arbitrary, you can also use function(month, day, year) or any other order. But obviously, later you need to supply the actual arguments in the corresponding order.
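
Two related points can be sketched in code. First, if you name the arguments in the call, their order no longer matters. Second, sprintf() can pad day and month with zeros, which gives proper ISO dates. The function name isoDate is an assumption for this example and not part of the exercise:

```r
isoDate <- function(day, month, year = 2024) {
   # %02d prints a whole number with at least 2 digits, padding with 0
   sprintf("%d-%02d-%02d", year, month, day)
}
isoDate(30, 3, 2012)                       # "2012-03-30"
isoDate(year = 2012, month = 3, day = 30)  # named arguments, any order
isoDate(30, 3)                             # default year is used
```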

K.2.4 Output versus return

We can create such a function by just using paste0:

hi <- function(name) {
   paste0("Hi ", name, ", isn't it a nice day today?")
                           # remember: paste0 does not leave spaces b/w arguments
}

This function returns the result of paste0, the character string that combines the greeting and the name. It does not output anything–there is no print nor cat command. We can show it works as expected: when called on R console, its returned value, the greeting, is automatically printed:

hi("Arthur")  # remember: name (it is text) must be quoted
## [1] "Hi Arthur, isn't it a nice day today?"

and if the result is assigned to a variable then nothing is printed:

greeting <- hi("Arthur")
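
For contrast, here is a sketch of a version that outputs the greeting instead of returning it. The name hiCat is made up for this comparison. It prints the greeting even when the result is assigned, and the assigned value is just NULL, because cat() does not return the text:

```r
hiCat <- function(name) {
   cat("Hi ", name, ", isn't it a nice day today?\n", sep = "")
}
greeting2 <- hiCat("Arthur")  # the greeting is printed anyway
is.null(greeting2)            # TRUE: there is no returned text to store
```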

K.3 Vectors

K.3.1 Vectorized operations

K.3.1.1 Extract April month row numbers

We just need to make a sequence from 3 till no more than 350 (number of rows) with step 12:

seq(3, 350, by = 12)
##  [1]   3  15  27  39  51  63  75  87  99 111 123 135 147 159 171 183 195 207 219 231
## [21] 243 255 267 279 291 303 315 327 339

K.3.1.2 Yu Huang and Guanyin in liquor store

We can just call the data age and cashier:

age <- c(16, 20, 24)
cashier <- c("Yu Huang", "Guanyin", "Yu Huang")

In normal language–you are able to buy if you are at least 21 years old or your cashier is Guanyin. This means the first customer cannot, but the other two can buy the drink.

The expression is pretty much exactly the sentence above, written in R syntax:

age >= 21 | cashier == "Guanyin"
## [1] FALSE  TRUE  TRUE

Note that we use >= to test age at least 21, and == to test equality.

So the first customer cannot get the drink but the two others can.

K.3.1.3 Descriptive statistics

x <- 1:10
y <- -11:22
z <- c(1,1,1,1,1,1,1,1,1,1,1, 55)
mean(x)
## [1] 5.5
mean(y)
## [1] 5.5
mean(z)
## [1] 5.5

So all averages are the same.

median(x)
## [1] 5.5
median(y)
## [1] 5.5
median(z)
## [1] 1

Medians of x and y are the same, but that of z is just 1.

range(x)
## [1]  1 10
range(y)
## [1] -11  22
range(z)
## [1]  1 55

Here range is easily visible from how the vectors were created, so computation is not really needed. But this is usually not the case when the vectors originate from a large dataset.

var(x)
## [1] 9.166667
var(y)
## [1] 99.16667
var(z)
## [1] 243

Variances are hard to judge manually, but they are different too.

So we summarized each of these vectors into five numbers (two for range), despite the fact that the vectors were of different lengths.

K.3.1.4 Recycling where length do not match

c(10, 20, 30, 40) + 1:3
## Warning in c(10, 20, 30, 40) + 1:3: longer object length is not a multiple of
## shorter object length
## [1] 11 22 33 41

This is the warning message. As you can see, this operation results in incomplete recycling, where only the first component, 1, of the shorter vector was used a second time.
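
To see what R does here, we can spell out the recycling by hand with rep() and its length.out argument. This is a sketch for illustration; the variable names are made up:

```r
a <- c(10, 20, 30, 40)
b <- 1:3
# repeat b until it is as long as a
bLong <- rep(b, length.out = length(a))
bLong      # 1 2 3 1
a + bLong  # same numbers as above, but no warning
```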

K.3.2 Vector indices

K.3.2.1 Extract positive numbers

This is a simple application of logical indexing:

v <- -5:5
v[v > 0]
## [1] 1 2 3 4 5

K.3.2.2 Extract positive numbers

We have data

height <- c(160, 170, 180, 190, 175)  # cm
weight <- c(50, 60, 70, 80, 90)  # kg
name <- c("Kannika", "Nan", "Nin", "Kasem", "Panya")

Height of everyone at least 180cm:

height[height >= 180]
## [1] 180 190

Names of those at least 180cm:

name[height >= 180]
## [1] "Nin"   "Kasem"

Weight of all patients who are at least 180cm tall

weight[height >= 180]
## [1] 70 80

Names of everyone who weighs less than 70kg

name[weight < 70]
## [1] "Kannika" "Nan"

Names of everyone who is either taller than 170, or weighs more than 70.

name[height > 170 | weight > 70]
## [1] "Nin"   "Kasem" "Panya"

K.3.2.3 Character indexing: state abbreviations

First, we can set names to the state.abb variable:

names(state.abb) <- state.name

Note that we need to be sure that the names and abbreviations are in the same order! (They are, this is how the data is defined, see Section I.11.) This results in a named vector:

state.abb[1:5]
##    Alabama     Alaska    Arizona   Arkansas California 
##       "AL"       "AK"       "AZ"       "AR"       "CA"

Now we can just extract the abbreviations:

state.abb[c("Utah", "Connecticut", "Nevada")]
##        Utah Connecticut      Nevada 
##        "UT"        "CT"        "NV"

This is a common way to create lookup tables in R.
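
The same idea works in the other direction. The sketch below builds a lookup table from abbreviations to names with setNames(), which returns a named copy and leaves the built-in vectors untouched. The variable name abbr2name is made up for this example:

```r
abbr2name <- setNames(state.name, state.abb)  # values: names, labels: abbreviations
abbr2name[c("UT", "CT", "NV")]
```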

K.3.3 Modifying vectors

K.3.3.1 Wrong number of items

Feeding in a single item works perfectly:

supplies <- c("backpack", "laptop", "pen")
supplies[c(2, 3)] <- "ipad"
supplies
## [1] "backpack" "ipad"     "ipad"

Now both elements 2 and 3 are “ipad”. This is because of the recycling rules (see Section 4.3.4): the shorter item (here “ipad”) will just be replicated as many times as needed (here two).

But feeding in 3 elements results in a warning:

supplies[c(2, 3)] <- c("tablet", "book", "paper")
## Warning in supplies[c(2, 3)] <- c("tablet", "book", "paper"): number of items to
## replace is not a multiple of replacement length
supplies
## [1] "backpack" "tablet"   "book"

Otherwise the replacement works; only the last item, “paper”, is ignored.

K.3.3.2 Absolute value

We can do it explicitly in multiple steps:

x <- c(0, 1, -1.5, 2, -2.5)
iNegative <- x < 0  # which elements are negative
positive <- -x[iNegative]  # flip the sign for negatives
                           # so you get the corresponding
                           # positives
x[iNegative] <- positive  # replace negatives
x
## [1] 0.0 1.0 1.5 2.0 2.5

However, it is much more concise if done in a shorter form:

x <- c(0, 1, -1.5, 2, -2.5)
x[x < 0] <- -x[x < 0]
x
## [1] 0.0 1.0 1.5 2.0 2.5

K.3.3.3 Managers’ rent

Here is the data:

income <- c(Shang = 1000, Zhou = 2000, Qin = 3000, Han = 4000)
rent <- c(Shang = 200, Zhou = 1000, Qin = 1700, Han = 2800)

This problem can be solved in two ways. First, the way it is stated in the problem:

b <- c(0, 0, 0, 0)  # to begin with, benefit "0" for everyone
iHR <- rent > 0.5*income  # who is rent-burdened?
iHR  # just for check
## Shang  Zhou   Qin   Han 
## FALSE FALSE  TRUE  TRUE

So Qin and Han are rent-burdened.

b[iHR] <- 0.25*rent[iHR]  # compute their benefit
b
## [1]   0   0 425 700

Here we replaced benefits for two people–we had to use iHR on both sides of the assignment.

We can also solve it the other way around (not asked in the problem statement): first we can compute the benefit for everyone, and thereafter replace it for the non-rent burdened with “0”:

b <- 0.25*rent  # benefits to everyone
iLR <- rent <= 0.5*income  # whose rent is low?
b[iLR] <- 0  # replace their benefits by 0.
b
## Shang  Zhou   Qin   Han 
##     0     0   425   700

Note that all replacement elements have the same value here, “0”.
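
A third option, sketched here and not required by the problem, is the vectorized ifelse() function, which picks between two values element by element:

```r
income <- c(Shang = 1000, Zhou = 2000, Qin = 3000, Han = 4000)
rent <- c(Shang = 200, Zhou = 1000, Qin = 1700, Han = 2800)
# where rent-burdened: 25% of rent; otherwise: 0
b <- ifelse(rent > 0.5*income, 0.25*rent, 0)
b
```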

K.4 Lists

K.4.1 Vectors and lists

The vector will be

c(1, 2:4, 5)
## [1] 1 2 3 4 5

and the list

list(1, 2:4, 5)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2 3 4
## 
## [[3]]
## [1] 5

The printout clearly shows that in case of vector we end up with a vector of 5 similar elements (just numbers). But the list contains three elements, the first and last are single numbers (well, more precisely length-1 vectors), while the middle component is a length-3 vector.

As this example shows, one cannot easily print all list elements on a single row as is the case with vectors.
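
The difference can also be checked with length(), and with lengths(), which reports the length of each list element separately. A small sketch with made-up variable names:

```r
v <- c(1, 2:4, 5)
l <- list(1, 2:4, 5)
length(v)   # 5 elements
length(l)   # 3 elements
lengths(l)  # length of each list element: 1 3 1
```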

K.4.2 Print employee list

First re-create the same persons:

person <- list(name = "Ada", job = "Programmer", salary = 78000,
               union = TRUE)
person2 <- list("Ji", 123000, FALSE)
employees <- list(person, person2)

The printout looks like

employees
## [[1]]
## [[1]]$name
## [1] "Ada"
## 
## [[1]]$job
## [1] "Programmer"
## 
## [[1]]$salary
## [1] 78000
## 
## [[1]]$union
## [1] TRUE
## 
## 
## [[2]]
## [[2]][[1]]
## [1] "Ji"
## 
## [[2]][[2]]
## [1] 123000
## 
## [[2]][[3]]
## [1] FALSE

We can see our two employees here, Ada (at the first position) and Ji (at the second position). All elements for Ada are preceded by [[1]] and those for Ji by [[2]]. These indicate the corresponding positions.

The data for Ada and Ji are printed slightly differently, reflecting the fact that Ada’s components have names while Ji’s components do not. So Ada’s components use name tags like $name, and Ji’s components use positional tags like [[1]].
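
The tags in the printout are also the way to extract the individual components. A short sketch that re-creates the list and picks out both names:

```r
person <- list(name = "Ada", job = "Programmer", salary = 78000,
               union = TRUE)
person2 <- list("Ji", 123000, FALSE)
employees <- list(person, person2)
employees[[1]]$name  # Ada's name: by position, then by name
employees[[2]][[1]]  # Ji's name: by position twice
```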

K.5 How to write code

K.5.1 Divide and conquer

K.5.1.1 Patient names and weights

The recipe to display the names might sound like

  1. Take the vector of weights
  2. Find which weights are above 60kg
  3. Get names that correspond to those weights
  4. Print those

This recipe is a bit ambiguous though: the phrase “which weights” is not quite clear, and if you know how to work with vectors, it may mean either numeric positions (3 and 4) or a logical index (FALSE, FALSE, TRUE, TRUE, FALSE). But if you know the tools, you also know that both of these approaches are fine, so the ambiguity is maybe even a strength.

Second, if you know the tools, then you know that explicit printing may not be needed.

The recipe to display the weights may be like

  1. Take the vector of weights
  2. Find which weights are above 60kg
  3. Display those

This recipe works well if we have access to vectorized operations and indexing like what we have in R. But if we do not have access to these tools, we may instead write

  1. Take the array of weights
  2. Walk over every weight in this array
  3. Is the weight over 60kg?
  4. If yes, print it!

Which recipe do you prefer? Obviously, it depends on the tools you have access to.

Here is example code:

## Data
name <- c("Bao-chai", "Xiang-yun", "Bao-yu", "Xi-chun", "Dai-yu")
weight <- c(55, 56, 65, 62, 58)  # kg

## Names
name[weight > 60]  # simple, but does not follow the recipe closely
## [1] "Bao-yu"  "Xi-chun"
## more complex, but follows the recipe more closely
i <- weight > 60
heavies <- name[i]
cat(heavies, "\n")
## Bao-yu Xi-chun

For weights, we have two similar options:

## Short
weight[weight > 60]
## [1] 65 62
## More along the recipe
i <- weight > 60
weight[i]  # implicit printing
## [1] 65 62
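
Both readings of “which weights”, and the loop-based recipe, can be sketched as follows. The loop uses an if-statement, and is only meant to show what the recipe without vectorized tools would look like in R:

```r
weight <- c(55, 56, 65, 62, 58)  # kg
which(weight > 60)  # numeric positions: 3 4
weight > 60         # logical index
## Walk over every weight and print the ones over 60kg
for(w in weight) {
   if(w > 60) {
      cat(w, "\n")
   }
}
```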

K.5.2 Learning more

K.5.2.1 Time difference in days

ChatGPT may give very good code examples that only require minimal adjustments.

Nowadays AI-based tools are fairly good at doing this. The figure at right shows chatGPT’s answer (incorporated in Bing) to such a question. This answer is correct and can be incorporated into your code with only small adjustments. However, one should still look up what these functions do and what format = "%b %d, %Y" means.

However, the answer may not be enough if you do not know the basics of R, e.g. what the assignment operator <- or the comment character # is. Also, it lacks some context and it does not discuss more efficient or simpler ways to achieve the same task. For instance, it does not suggest writing the dates in the ISO format YYYY-mm-dd, which would simplify the solution.

The first page of as.Date help (you can get it with ?as.Date).

The as.Date() help page offers much more information than what chatGPT gives. In particular, tryFormats and its default values are very useful. However, it also assumes more understanding of the workings of R, e.g. what exactly ## S3 method for class 'character' means, and which of the functions listed there one actually needs.

So AI tools are not a substitute for documentation (nor the other way around). AI is great for quickly getting a solution. In order to evaluate the solution, you need to know more. But as your time is valuable too, use AI for tasks where you do not need to go in depth, and learn the most important tools in depth.

Here is a simplified version of the chatGPT-suggested solution:

dates <- as.Date(c("2023-10-16", "2023-11-12", "2014-07-03"))
                           # ISO dates do not need format specification
difftime(dates[2], dates[1], units="days")
## Time difference of 27 days
difftime(dates[2], dates[3], units="days")
## Time difference of 3419 days

When working with dates, you should also be familiar with lubridate library and tools therein.
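
As a sketch of two points from the discussion above: dates can also be subtracted directly, and the format string tells as.Date() how a non-ISO date is laid out (%b is the abbreviated month name, %d the day, %Y the 4-digit year). The date string "Oct 16, 2023" is a made-up example, and parsing month names depends on your locale, so the last line assumes English month names:

```r
d1 <- as.Date("2023-10-16")
d2 <- as.Date("2023-11-12")
d2 - d1              # a time difference
as.numeric(d2 - d1)  # just the number of days
# Non-ISO date: needs a format (assumes an English locale for "Oct")
as.Date("Oct 16, 2023", format = "%b %d, %Y")
```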

K.5.3 Coding style

K.5.3.1 Variable names for election data

One of the decisions you need to make here is how to name the political parties. You definitely do not want to use the full names as those are very long. Here we are actually in a very good situation, as these parties have standard English abbreviations (BJP, INC and YSRCP).

Below is one option:
  1. The original data:
    • elections. If there are more election-related things besides the dataset, we may call it electionData to stress this is a dataset.
  2. Corrected original
    • electionsFixed
  3. 2019 only
    • elections2019. This assumes we do not need 2019 non-fixed version.
  4. Sub-datasets for parties.
    • electionsBJP
    • electionsINC
    • electionsYSRCP.
  5. Winning districts only
    • winsBJP
    • winsINC
    • winsYSRCP

Obviously, there are more options, e.g. if the project is very short, then you may replace elections with just e. If you need more, e.g. also 2024 election data, you may need variable names like elections2019BJP and wins2024INC.

You may also think what to do if the data is about Japan instead, and the party you are interested in, 公明党, is abbreviated as 公明. (See Komeito).

K.6 Conditional statements

K.6.1 if-statement

K.6.1.1 Tell if second string longer

This is quite a simple application of if and else:

compareStrings <- function(s1, s2) {
   if(nchar(s2) > nchar(s1)) {
      ## if 2nd string longer then print
      cat("The second string is longer\n")
   }
   ## Do nothing else
}

compareStrings("a", "aa")  # prints
## The second string is longer
compareStrings("aa", "a")  # does not print

K.6.1.2 Print if number even

  1. Here the logic is as follows:
    1. print the number
    2. if even, print " - even".
for(i in 1:10) {
   cat(i, "\n")  # print the number (and new line)
   if(i %% 2 == 0) {
      cat(" - even\n")  # print 'even' (and new line)
   }
}
## 1 
## 2 
##  - even
## 3 
## 4 
##  - even
## 5 
## 6 
##  - even
## 7 
## 8 
##  - even
## 9 
## 10 
##  - even
  2. Now we need to think more about printing. It goes as follows:
    1. print the number (no new line)
    2. if even, print " - even" (no new line)
    3. add new line, unconditionally.
for(i in 1:10) {
   cat(i)  # print the number, but do not switch to new line
   if(i %% 2 == 0) {
      cat(" - even")  # print 'even', do not switch to new line
   }
   cat("\n")  # switch to new line at the end of line here
              # whatever number it is
}
## 1
## 2 - even
## 3
## 4 - even
## 5
## 6 - even
## 7
## 8 - even
## 9
## 10 - even

K.6.1.3 Print even/odd

The code is simple, and printing is a bit simpler too

for(i in 1:10) {
   cat(i)  # print the number, but do not switch to new line
   if(i %% 2 == 0) {
      cat(" even\n")  # print 'even' and new line
   } else {
      cat(" odd\n")
   }
}
## 1 odd
## 2 even
## 3 odd
## 4 even
## 5 odd
## 6 even
## 7 odd
## 8 even
## 9 odd
## 10 even

K.6.1.4 Going out with friends

money <- 200
nFriends <- 5
price <- 30
sum <- (nFriends + 1)*price  # friends + myself
total <- sum*1.15  # add tip
if(total > money) {
   cat("Cannot afford 😭\n")
} else {
   cat("Can afford ✌\n")
}
## Cannot afford 😭

K.6.1.5 Test porridge temperature

We just need to remove assignments and return():

test_food_temp <- function(temp) {
  if(temp > 120) {
     "This porridge is too hot!"
  } else if(temp < 70) {
     "This porridge is too cold!"
  } else {
     "This porridge is just right!"
  }
}
## The test results are the same:
test_food_temp(119)  # just right!
## [1] "This porridge is just right!"
test_food_temp(60)   # too cold!
## [1] "This porridge is too cold!"
test_food_temp(150)  # too hot!
## [1] "This porridge is too hot!"

In my opinion, shorter code is easier to read, but different people may have different opinions.

K.6.1.6 Implement absolute value

absv <- function(x) {
   if(x > 0) {
      x
   } else {
      -x
   }
}
absv(-2)
## [1] 2
absv(3)
## [1] 3
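
Note that absv() is meant for a single number, as if() needs a single TRUE or FALSE. A possible vectorized sketch uses ifelse(); the name absVec is made up here, and in practice the built-in abs() does the same job:

```r
absVec <- function(x) {
   ifelse(x > 0, x, -x)  # element-by-element choice
}
x <- c(0, 1, -1.5, 2, -2.5)
absVec(x)
abs(x)  # built-in function, same result
```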

K.6.2 Conditional statements and vectors

K.6.2.1 Should you go to boba place?

The problem is worded in a somewhat vague manner, so you may need to make it more specific. Here we assume that you only go if you can afford a drink.

This means you need to write code that checks whether any tea costs $7 or less.

K.6.2.2 Can you get a drink?

With the original prices:

price <- c(5, 6, 7, 8)
if(any(price <= 7)) {
   cat("You can get a drink\n")
} else {
   cat("This is a too expensive place\n")
}
## You can get a drink

If they raise the price by $3 across the board then we can just add “3” to the price vector:

price <- price + 3
if(any(price <= 7)) {
   cat("You can get a drink\n")
} else {
   cat("This is a too expensive place\n")
}
## This is a too expensive place

The results are intuitively obvious–it is affordable using the original prices but not with the new prices.
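
The same test can also be written with min(): there is an affordable drink exactly when the cheapest one is affordable. A small sketch, using the same $7 limit as the code above:

```r
price <- c(5, 6, 7, 8)
any(price <= 7)  # TRUE
min(price) <= 7  # TRUE, the same answer
price <- price + 3
any(price <= 7)  # FALSE
min(price) <= 7  # FALSE
```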

K.7 File system tree

K.7.1 File system tree

K.7.1.1 Sketch your file system tree

This is, obviously, different for everyone, but here is mine:
sketch of my file system tree

A subset of the file system tree in my computer. Black boxes denote folders, blue boxes are files.

K.7.1.2 Sketch your picture folder tree

Here is mine. I have picked mostly shorter example names, just to fit those on the figure.
sketch of my Pictures folder tree

A subset of the Pictures folder in my computer. Black boxes denote folders, blue boxes are files.

K.7.1.3 Navigate the file tree

Get cheatsheet.pdf from cse142

How to navigate to cheatsheet.pdf from cse142.

From cse142 to cheatsheet.pdf we can move as (see the figure):

  1. up (into UW)
  2. into info201
  3. grab cheatsheet.pdf from there

Or in the short form:

"../info201/cheatsheet.pdf"

Note that we should not start by going up to cse142 as we already are there.
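
In R, such a path can be written as a string directly, or built from its components with file.path(). A minimal sketch; file.exists() will only return TRUE if your working directory really has such a layout around it:

```r
path <- file.path("..", "info201", "cheatsheet.pdf")
path               # "../info201/cheatsheet.pdf"
file.exists(path)  # TRUE only if the file is there on your computer
```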

K.7.1.4 Matlab accessing matrix.dat

Get matrix.dat from Downloads

How to navigate to matrix.dat in Downloads from amath352.

From amath352 to matrix.dat we can move as (see the figure):

  1. up (into UW)
  2. up (into Documents)
  3. up (into Yucun’s stuff)
  4. into Downloads
  5. grab matrix.dat from there

Or in the short form:

"../../../Downloads/matrix.dat"

Again, we should not start by going up to amath352 as we already are there.

K.7.1.5 Get picture from info201

Again, this is different on your computer. But given my file system tree looks like above, my path will be
sketch of my file system tree

How I can access green-lake-ice.jpg from my info201 folder.

The corresponding list of instructions is:
  1. up (to teaching)
  2. up (to tyyq)
  3. up (to my stuff)
  4. into Pictures
  5. into Nature
  6. grab the green-lake-ice.jpg from there.

In the short form, it is

"../../../Pictures/Nature/green-lake-ice.jpg"

Note that I do not have pictures in Pictures folder, but in subfolders inside there. If you do, the descent into Nature will be unnecessary.

K.7.1.6 Absolute path

Yucun’s file system tree (Mac version)

For reference, here is the file system tree again.

Now the initial location (the working directory of the Java program) is irrelevant. One has to start from root:

  1. Start at root “/”
  2. into “Users”
  3. into “yucun”
  4. into “Pictures”
  5. grab “Ross Lake.jpg” from there

Or in the computer way:

/Users/yucun/Pictures/Ross Lake.jpg
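
Although the exercise is not about R, base R functions can take such a path apart, which may help to see its structure. A sketch, assuming the Mac/Linux convention that an absolute path starts with "/":

```r
path <- "/Users/yucun/Pictures/Ross Lake.jpg"
dirname(path)          # the folder: "/Users/yucun/Pictures"
basename(path)         # the file: "Ross Lake.jpg"
startsWith(path, "/")  # TRUE: starts at root, so it is absolute
```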

K.7.1.7 Absolute path of an image

Suppose I have an image “fractal.png” inside my Pictures folder that, in turn, is in my home folder. Assume further that I am using Windows and my home folder is on “D:” drive. The long directions might look like:

  1. start at root “This PC”
  2. go to drive “D:”
  3. go to “Users”
  4. go to “siim” (assume “siim” is my user name)
  5. go to “Pictures”
  6. grab “fractal.png” from there.

In the short form it is D:/Users/siim/Pictures/fractal.png

Note that we do not use the root “This PC” when writing paths on Windows.

K.7.1.8 Absolute path of the home folder

Home folder location in file system tree

Obviously, this is different for every user and every computer. Here is mine on my home computer. I have marked a few other folders (etc: system configuration files, and usr: installed applications).

path in Gnome file selector

Absolute path, here root - home - siim, as shown in Gnome file selector.

There are multiple ways to see where in the file system tree it is located, one option is to use file managers. Here is an example that shows the path in Gnome file selector. Note that root is denoted by a hard disk icon, and the home folder siim is combined with a home icon.

K.7.1.9 Yucun moving his project

  1. If he is using an absolute path (it might be "/Users/yucun/Documents/data/data.csv"), then it does not change. This is because an absolute path always starts from the file system root, and the file system root does not change if you move around your files and folders, as long as the file in question (data.csv) remains in place.
  2. If he moves the data to a different computer, then he probably has to change the paths. Most importantly, the other computer may not have the data folder inside of the Documents folder, but somewhere else. Second, the other computer may also have a different file system tree, e.g. if the other one is a PC, his home folder may be "C:/Users/yucun" instead. A relative path is of no help here, unless the other computer has a similar file and folder layout.

K.7.2 Accessing files from R

K.7.2.1 R working directory path type

This is an absolute path: you can see this because “/home/otoomet/tyyq/info201-book” starts with the root symbol /. See more in Sections 9.1.2 and 9.1.3.

K.7.2.2 RStudio console working directory

Typing ‘getwd()’ in rstudio console

The only way to see it is to run getwd() in the RStudio console. You can run it directly, or you can execute a line of a script. What matters is that it runs in the console.

The example here shows “/home/siim/tyyq/teaching/info201/inclass” as the current working directory.

K.7.2.3 List files in R and in graphical manager

Assume the current working directory is “/home/siim/tyyq/teaching/info201/inclass” as in the exercise above.

‘list.files()’ showing files in the current working directory

We can use list.files() to see files here.

Files in current working directory as displayed graphically

And here are the same files, seen through the eyes of a graphical file manager (PCManFM). Note the navigation bar above the icons that displays the absolute path of the folder, and the side pane that displays the file system tree (a small view of it only).

It is easy to see that the files are the same. Note that R normally sorts files alphabetically, but file managers may show these in different ways, either alphabetically, by creation time, or you may even manually position individual icons. All this may be configured differently on your computer!

You can also see that here, both R and the file manager show all names in the same way, including the complete extensions like .R or .jpg. This may be different on your computer (and can be changed).
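
To experiment with list.files() without touching your own folders, you can create a few empty files in a temporary folder. This is a sketch; the folder and file names are made up for this example:

```r
dir <- file.path(tempdir(), "inclass-demo")  # hypothetical folder name
dir.create(dir, showWarnings = FALSE)
file.create(file.path(dir, c("plot.R", "cat.jpg", "data.R")))
list.files(dir)                     # all files, sorted alphabetically
list.files(dir, pattern = "\\.R$")  # only names ending with .R
```

The pattern argument is a regular expression, so "\\.R$" means a literal dot followed by R at the end of the name.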

K.8 Data Frames

K.8.1 What is data frame

K.8.1.1 What does a row in orange tree data represent?

In this data frame a row represents a tree-age combination: each tree has been measured at several different ages.

K.8.1.2 What does a row in covid data represent?

In this data frame a row represents a country-date-type combination. You can see that as each country is there multiple times, each date is there multiple times, and each type is there multiple times. But these combinations are unique.

K.8.2 Working with data frames

K.8.2.1 Countries and capitals

Appropriate names are country for the country, capital for its capital, and population for the population. We call the data frame countries (plural) to distinguish it from the individual variable. Obviously, one can come up with other names. We can create the data frame as

countries <- data.frame(
   country = c("Gabon", "Congo", "DR Congo", "Uganda", "Kenya"),
   capital = c("Libreville", "Brazzaville", "Kinshasa", "Kampala", "Nairobi"),
   population = c(2.340, 5.546, 108.408, 45.854, 55.865))
countries
##    country     capital population
## 1    Gabon  Libreville      2.340
## 2    Congo Brazzaville      5.546
## 3 DR Congo    Kinshasa    108.408
## 4   Uganda     Kampala     45.854
## 5    Kenya     Nairobi     55.865

where population is in millions (2022 estimates from Wikipedia).

We can extract the country names by dollar notation as

countries$country
## [1] "Gabon"    "Congo"    "DR Congo" "Uganda"   "Kenya"

and population with double brackets as

countries[["population"]]
## [1]   2.340   5.546 108.408  45.854  55.865

Capital using indirect name:

var <- "capital"
countries[[var]]
## [1] "Libreville"  "Brazzaville" "Kinshasa"    "Kampala"     "Nairobi"

K.8.3 Accessing Data in Data Frames

K.8.3.1 Indirect variable name with dollar notation

With dollar notation, R treats the name of the workspace variable itself (here var) as the data variable name, instead of using the name that is stored in it:

var <- "population"
countries$var  # NULL
## NULL

As you see, R is looking for a data variable var. As it cannot find it, it returns NULL, the special code for empty element.
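
The two notations can be compared side by side. This sketch uses a shortened version of the countries data frame, so that it can be run on its own:

```r
countries <- data.frame(
   country = c("Gabon", "Congo"),
   population = c(2.340, 5.546))
var <- "population"
countries$var     # NULL: looks for a column called "var"
countries[[var]]  # uses the name stored in var
```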

K.8.3.2 Loop of columns of a data frame

  1. Column names. No loop needed here:
names(emperors)
## [1] "name"    "born"    "throned" "ruled"   "died"
  2. Print names in loop. We can just loop over the names:
for(n in names(emperors)) {
   cat(n, "\n")
}
## name 
## born 
## throned 
## ruled 
## died
  3. Print name and column. We need indirect access here as the column name is now stored in the variable (called n below). So we can access it as emperors[[n]]:
for(n in names(emperors)) {
   cat(n, "\n")
   print(emperors[[n]])
}
## name 
## [1] "Qin Shi Huang"      "Napoleon Bonaparte" "Nicholas II"       
## [4] "Mehmed VI"          "Naruhito"          
## born 
## [1] -259 1769 1868 1861 1960
## throned 
## [1] -221 1804 1894 1918 2019
## ruled 
## [1] "China"          "France"         "Russia"         "Ottoman Empire"
## [5] "Japan"         
## died 
## [1] -210 1821 1918 1926   NA
  4. Print name and type. This is similar to the above, except now we print is.numeric(emperors[[n]]).
for(n in names(emperors)) {
   cat(n, "is numeric:", is.numeric(emperors[[n]]), "\n")
}
## name is numeric: FALSE 
## born is numeric: TRUE 
## throned is numeric: TRUE 
## ruled is numeric: FALSE 
## died is numeric: TRUE
  5. Print name and minimum. Now use the TRUE/FALSE for a logical test, and only print the minimum if this is true:
for(n in names(emperors)) {
   cat(n, "")
   if(is.numeric(emperors[[n]])) {
      cat(min(emperors[[n]]))
   }
   cat("\n")
}
## name 
## born -259
## throned -221
## ruled 
## died NA

Note: you may want to use min(emperors[[n]], na.rm = TRUE) to avoid the missing minimum for died column.
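
Here is a sketch of that variant. The emperors data frame is rebuilt by hand from the printout above, so that the example runs on its own:

```r
## emperors data, re-typed from the output shown above
emperors <- data.frame(
   name = c("Qin Shi Huang", "Napoleon Bonaparte", "Nicholas II",
            "Mehmed VI", "Naruhito"),
   born = c(-259, 1769, 1868, 1861, 1960),
   throned = c(-221, 1804, 1894, 1918, 2019),
   ruled = c("China", "France", "Russia", "Ottoman Empire", "Japan"),
   died = c(-210, 1821, 1918, 1926, NA))
for(n in names(emperors)) {
   cat(n, "")
   if(is.numeric(emperors[[n]])) {
      ## na.rm = TRUE ignores missing values when computing the minimum
      cat(min(emperors[[n]], na.rm = TRUE))
   }
   cat("\n")
}
```

With na.rm = TRUE, the minimum for died is computed over the four known years only, so it is -210 instead of NA.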

K.8.3.3 Emperors who died before 1800

Pure dollar notation is almost exactly the same as the example in the text:

emperors$name[emperors$died < 1800]
## [1] "Qin Shi Huang" NA

When using double brackets at the first place, we have

emperors[["name"]][emperors$died < 1800]
## [1] "Qin Shi Huang" NA

Note that we have a weird construct here, [[...]][...]. It looks weird, but it works perfectly: emperors[["name"]] is a vector, and a vector can be indexed using [...].

When we put double brackets in both places, we get

emperors[["name"]][emperors[["died"]] < 1800]
## [1] "Qin Shi Huang" NA

This is perhaps the “heaviest” notation, where it may be hard to keep track of the brackets. However, it is a perfectly valid way to extract emperors!

Finally, the NA in the output is related to Naruhito. As we do not know his year of death, the comparison emperors$died < 1800 is NA for him, and R returns NA to signal that it does not know whether he died before 1800. This is a little silly: as Naruhito is alive today, he cannot have died before 1800. But we haven’t explained this knowledge to R.
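
The mechanism can be sketched with two short vectors (a cut-down copy of the emperors data, typed in by hand for this illustration). A comparison with NA gives NA, and indexing with NA gives an NA element:

```r
died <- c(-210, 1821, NA)
name <- c("Qin Shi Huang", "Napoleon Bonaparte", "Naruhito")
early <- died < 1800   # comparing NA with a number gives NA
early
viaNA <- name[early]   # the NA index produces an NA element
viaNA
## keep only the cases where we know the condition is TRUE
known <- name[!is.na(early) & early]
known
```

The last line is one possible way to say "drop the cases we do not know about"; it keeps "Qin Shi Huang" only.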

K.8.3.4 Single-bracket data access (emperors)

Extract 3rd and 4th row:

emperors[3:4,]  # alternatively, emperors[c(3,4),]
##          name born throned          ruled died
## 3 Nicholas II 1868    1894         Russia 1918
## 4   Mehmed VI 1861    1918 Ottoman Empire 1926

All emperors who died in the 20th century:

emperors[emperors$died >= 1900 & emperors$died < 2000,]
##           name born throned          ruled died
## 3  Nicholas II 1868    1894         Russia 1918
## 4    Mehmed VI 1861    1918 Ottoman Empire 1926
## NA        <NA>   NA      NA           <NA>   NA

This still gives us an NA row for Naruhito: we haven’t told R in any way that someone who was alive in 2023 cannot have died in the 20th century. If the NA is not desired, one can use which():

emperors[which(emperors$died >= 1900 & emperors$died < 2000),]
##          name born throned          ruled died
## 3 Nicholas II 1868    1894         Russia 1918
## 4   Mehmed VI 1861    1918 Ottoman Empire 1926

Name and country of those emperors:

emperors[which(emperors$died >= 1900 & emperors$died < 2000),
         c("name", "ruled")]
##          name          ruled
## 3 Nicholas II         Russia
## 4   Mehmed VI Ottoman Empire

K.8.3.5 Patients aging

First create the data frame:

Name <- c("Ada", "Bob", "Chris", "Diya", "Emma")
Inches <- c(58, 59, 60, 61, 62)
Pounds <- c(120, 120, 150, 150, 160)
age <- c(22, 33, 44, 55, 66)
patients <- data.frame(Name, Inches, Pounds, age)
patients
##    Name Inches Pounds age
## 1   Ada     58    120  22
## 2   Bob     59    120  33
## 3 Chris     60    150  44
## 4  Diya     61    150  55
## 5  Emma     62    160  66

Adding a single year of age involves just modifying data, but we do not need to filter anything as this applies to everyone:

patients$age <- patients$age + 1
patients
##    Name Inches Pounds age
## 1   Ada     58    120  23
## 2   Bob     59    120  34
## 3 Chris     60    150  45
## 4  Diya     61    150  56
## 5  Emma     62    160  67

K.8.4 Learning to know your data

K.8.4.1 Line plot of random dots

We can just add type="l" to any of the previous examples. Let’s do it with the simple untuned example above:

x <- rnorm(50)  # 50 random numbers
y <- rnorm(50)
plot(x, y, type="l")

The result is just an indescribable mish-mash.

K.9 dplyr

K.9.1 Grammar of data manipulation

K.9.1.1 How many trees over size 100?

We can do something like this:

  1. Take the orange tree dataset
  2. keep only rows that have size > 100
  3. pull out the tree number
  4. find all unique trees
  5. how many unique trees did you find?

Obviously, you can come up with different lists, e.g. items 4 and 5 might be combined into one. They are kept separate here because each of these two items corresponds to a single function in base R.
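
The recipe can be sketched in base R, one line per step. This assumes that “size” refers to the circumference column of the built-in Orange dataset; the variable names big, trees, uniqueTrees and nTrees are made up for this sketch:

```r
## base-R version of the recipe; the Orange data ships with R
big <- Orange[Orange$circumference > 100, ]  # step 2: keep large sizes
trees <- big$Tree                            # step 3: pull out tree number
uniqueTrees <- unique(trees)                 # step 4: unique trees
nTrees <- length(uniqueTrees)                # step 5: how many?
nTrees
```

Steps 4 and 5 are unique() and length() respectively, which is why they are listed as separate items.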

K.9.1.2 Two ways to find the largest tree

The difference is in how the recipe breaks ties for the largest tree. If there are two largest trees of equal size, these will be put in an arbitrary order. If we pick the first line below, we’ll get one of the largest trees, but not both. The second recipe extracts all trees of maximum size, so it can find all such trees.

In practice, it is more useful not to order the trees and pick the first, but rank them with an explicit way to break ties. For instance

size <- c(20, 10, 20)
rank(desc(size), ties.method="min")
## [1] 1 3 1

This tells us that both the first and the third tree are in “first place” in descending order. See more with ?rank.
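
Both recipes and the explicit ranking can be compared on a made-up data frame with a tie (the data and variable names are hypothetical). The sketch uses base R only, so -size stands in for dplyr’s desc(size):

```r
## toy data with two equally large trees, made up for illustration
trees <- data.frame(tree = c("a", "b", "c"),
                    size = c(20, 10, 20))
## recipe 1: order by size and pick the first row
ordered <- trees[order(trees$size, decreasing = TRUE), ]
first <- ordered[1, ]
## recipe 2: keep all rows where size equals the maximum
largest <- trees[trees$size == max(trees$size), ]
## explicit ranking: tied trees share the smallest rank
r <- rank(-trees$size, ties.method = "min")
nrow(first)    # a single tree only
nrow(largest)  # both largest trees
r
```

The first recipe returns a single row no matter how many trees share the maximum, while the second recipe and the rank-based condition r == 1 keep both.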

K.9.2 Most important dplyr functions

K.9.2.1 Add decade to babynames

We can compute decade by first integer-dividing year by 10, and then multiplying the result by 10:

babynames %>%
   mutate(decade = year %/% 10 * 10) %>%
   sample_n(5)  # just show it works
## # A tibble: 5 × 6
##    year sex   name         n       prop decade
##   <dbl> <chr> <chr>    <int>      <dbl>  <dbl>
## 1  2015 M     Zixuan       7 0.00000343   2010
## 2  2011 F     Aleaya       6 0.0000031    2010
## 3  1932 M     Arlen      178 0.000166     1930
## 4  1975 F     Tiffiney    35 0.0000224    1970
## 5  2007 F     Coral      178 0.0000842    2000

K.9.2.2 How many names over all years

We just need to add the count variable n:

babynames %>%
   summarize(n = sum(n))
## # A tibble: 1 × 1
##           n
##       <int>
## 1 348120517

K.9.2.3 Shiva for boys/girls

The task list might look like this:
  1. filter to keep only boys (or only girls)
  2. filter to keep only name “Shiva”
  3. summarize this dataset by adding up all counts n

There are, obviously, other options. For instance, you can swap the filter by sex and the filter by name.

## for boys
babynames %>%
   filter(sex == "M") %>%
   filter(name == "Shiva") %>%
   summarize(sum(n))
## # A tibble: 1 × 1
##   `sum(n)`
##      <int>
## 1      397
## for girls
babynames %>%
   filter(sex == "F") %>%
   filter(name == "Shiva") %>%
   summarize(sum(n))
## # A tibble: 1 × 1
##   `sum(n)`
##      <int>
## 1      249

K.9.3 Combining dplyr operations

The tasklist for this question (see above) might be:

  1. Take the orange tree dataset
  2. keep only rows that have size > 100
  3. pull out the tree number
  4. find all unique trees
  5. how many unique trees did you find?

This can be translated to code as:

Orange %>%
   filter(circumference > 100) %>%
   pull(Tree) %>%
   unique() %>%
   length()
## [1] 5

So there are 5 different trees.

K.9.4 Grouped operations

K.9.4.1 Titanic fare by class

The computations are pretty much the same as the example in the text:

titanic %>%
   group_by(pclass) %>%
   summarize(avgFare = mean(fare, na.rm=TRUE),
             maxFare = max(fare, na.rm=TRUE),
             avgAge = mean(age, na.rm=TRUE),
             maxAge = max(age, na.rm=TRUE)
             )
## # A tibble: 3 × 5
##   pclass avgFare maxFare avgAge maxAge
##    <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
## 1      1    87.5   512.    39.2     80
## 2      2    21.2    73.5   29.5     70
## 3      3    13.3    69.6   24.8     74

The results make sense, as the first class is the most expensive, and the third class the cheapest option. However, it is hard to see why the most expensive 3rd class option cost much more than the 2nd class average. It is also reasonable that older people are more likely to travel in upper classes, as they may be wealthier, and their health may be more fragile.

K.9.4.2 Most distinct names

Here we compute the number of distinct names for each year, order the result by \(n\), and print the first three lines:

babynames %>%
   group_by(year) %>%
   summarize(n = n_distinct(name)) %>%
   arrange(desc(n)) %>%
   head(3)
## # A tibble: 3 × 2
##    year     n
##   <dbl> <int>
## 1  2008 32510
## 2  2007 32416
## 3  2009 32242

Apparently, these years are the late 2000s.

K.9.4.3 Most popular boy and girl names

The only difference here is to group by year and sex:

babynames %>%
   filter(between(year, 2002, 2006)) %>%
   group_by(year, sex) %>%
   arrange(desc(n), .by_group = TRUE) %>%
   summarize(name = name[1])
## # A tibble: 10 × 3
## # Groups:   year [5]
##     year sex   name 
##    <dbl> <chr> <chr>
##  1  2002 F     Emily
##  2  2002 M     Jacob
##  3  2003 F     Emily
##  4  2003 M     Jacob
##  5  2004 F     Emily
##  6  2004 M     Jacob
##  7  2005 F     Emily
##  8  2005 M     Jacob
##  9  2006 F     Emily
## 10  2006 M     Jacob

As we can see, these are just Emily and Jacob.

K.9.4.4 Three most popular names

The first 3 names in terms of popularity can just be filtered using the condition rank(desc(n)) <= 3:

babynames %>%
   filter(between(year, 2002, 2006)) %>%
   group_by(year) %>%
   filter(rank(desc(n)) <= 3)
## # A tibble: 15 × 5
## # Groups:   year [5]
##     year sex   name        n   prop
##    <dbl> <chr> <chr>   <int>  <dbl>
##  1  2002 M     Jacob   30568 0.0148
##  2  2002 M     Michael 28246 0.0137
##  3  2002 M     Joshua  25986 0.0126
##  4  2003 F     Emily   25688 0.0128
##  5  2003 M     Jacob   29630 0.0141
##  6  2003 M     Michael 27118 0.0129
##  7  2004 F     Emily   25033 0.0124
##  8  2004 M     Jacob   27879 0.0132
##  9  2004 M     Michael 25454 0.0121
## 10  2005 F     Emily   23937 0.0118
## 11  2005 M     Jacob   25830 0.0121
## 12  2005 M     Michael 23812 0.0112
## 13  2006 M     Jacob   24841 0.0113
## 14  2006 M     Michael 22632 0.0103
## 15  2006 M     Joshua  22317 0.0102

As you can see, these are various combinations of “Jacob”, “Michael”, “Joshua” and “Emily”.

K.9.4.5 10 most popular girl names after 2000

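This is just about keeping girls only, adding up the counts for each name, and arranging by popularity afterward. Note that the code below keeps the top 5 names only; change the threshold from 5 to 10 to get the ten most popular names: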

babynames %>%
   filter(sex == "F",
          year > 2000) %>%
   group_by(name) %>%
   summarize(n = sum(n)) %>%
   filter(rank(desc(n)) <= 5) %>%
   arrange(desc(n))
## # A tibble: 5 × 2
##   name          n
##   <chr>     <int>
## 1 Emma     327254
## 2 Emily    298119
## 3 Olivia   290625
## 4 Isabella 285307
## 5 Sophia   265572

We can see that “Emma” has been the most popular.

K.9.4.6 Most popular name by decade

This is a noticeably trickier task:

  • First we need to compute decade, this can be done using integer division %/% as (year %/% 10)*10.
  • Thereafter, we need to add all counts n for each name and decade. Hence we group by name and decade, and sum n.
  • Thereafter, we need to rank the popularity for each decade. Hence we group again, but now just by decade.

We can do it along these lines:

babynames %>%
   mutate(decade = year %/% 10 * 10) %>%
   group_by(name, decade) %>%
   summarize(n = sum(n)) %>%
   group_by(decade) %>%
   filter(rank(desc(n)) == 1) %>%
   arrange(decade)
## # A tibble: 14 × 3
## # Groups:   decade [14]
##    name    decade      n
##    <chr>    <dbl>  <int>
##  1 Mary      1880  92030
##  2 Mary      1890 131630
##  3 Mary      1900 162188
##  4 Mary      1910 480015
##  5 Mary      1920 704177
##  6 Robert    1930 593451
##  7 James     1940 798225
##  8 James     1950 846042
##  9 Michael   1960 836934
## 10 Michael   1970 712722
## 11 Michael   1980 668892
## 12 Michael   1990 464249
## 13 Jacob     2000 274316
## 14 Emma      2010 158715

We see that in the early years “Mary” was leading the pack, while later mostly boy names have dominated.

Note the third line group_by(name, decade). For each decade, this makes groupings based on name only, not separately for name and sex. Hence for names that were given to both boys and girls, we add up all instances across genders.
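
The effect of the grouping choice can be seen on a tiny made-up dataset (toy and its counts are hypothetical, not taken from babynames). Grouping by name and decade merges boys and girls, while adding sex to the grouping keeps them apart:

```r
library(dplyr)
## made-up counts: the same name is given to both girls and boys
toy <- data.frame(
   year = c(2001, 2001, 2002, 2002),
   sex  = c("F", "M", "F", "M"),
   name = c("Riley", "Riley", "Riley", "Riley"),
   n    = c(10, 20, 30, 40))
## group by name and decade: counts are added across sexes
byName <- toy %>%
   mutate(decade = year %/% 10 * 10) %>%
   group_by(name, decade) %>%
   summarize(n = sum(n), .groups = "drop")
## group by name, sex and decade: one row per sex
byNameSex <- toy %>%
   mutate(decade = year %/% 10 * 10) %>%
   group_by(name, sex, decade) %>%
   summarize(n = sum(n), .groups = "drop")
byName
byNameSex
```

Here byName contains a single row with all 100 instances, while byNameSex has separate rows for girls (40) and boys (60).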

K.9.4.7 “Mei” by decade

The final code might look like

babynames %>%
   filter(sex == "F") %>%
   mutate(decade = (year %/% 10) * 10) %>%
   group_by(name, decade) %>%
   summarize(n = sum(n)) %>%  # popularity over all 10 years!
   group_by(decade) %>%
   mutate(k = rank(desc(n))) %>%
   filter(name == "Mei")
## # A tibble: 8 × 4
## # Groups:   decade [8]
##   name  decade     n     k
##   <chr>  <dbl> <int> <dbl>
## 1 Mei     1940    18 6274.
## 2 Mei     1950    15 8015 
## 3 Mei     1960    36 7082 
## 4 Mei     1970   111 5149 
## 5 Mei     1980   136 5356.
## 6 Mei     1990   191 5176 
## 7 Mei     2000   385 3788.
## 8 Mei     2010   356 3560.

We see that “Mei” has gained in popularity over time, moving from around 6000th place in the 1940s to around 3500th place in the 2010s.

A reminder here: the counts n in the table are probably underestimates, as names are only included if they are given at least 5 times.
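
A side note on the rank column k: some ranks are printed with a trailing dot (e.g. 6274.), which suggests they are not whole numbers. This is most likely because of ties: by default, rank() gives tied values the average of their ranks. A minimal base-R sketch with made-up counts (again with -n in place of desc(n)):

```r
n <- c(50, 18, 18, 7)   # made-up counts with one tie
avgRank <- rank(-n)     # default: ties.method = "average"
minRank <- rank(-n, ties.method = "min")
avgRank
minRank
```

With the default method, the two tied names share the rank 2.5, while ties.method = "min" puts both in 2nd place.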

K.9.5 More advanced dplyr usage

K.9.5.1 Sea and Creek 1980-2000

We can just filter the required years and the required names, both using %in%:

babynames %>%
   filter(year %in% c(1980, 1985, 1990, 1995, 2000),
          name %in% c("Sea", "Creek"))
## # A tibble: 2 × 5
##    year sex   name      n       prop
##   <dbl> <chr> <chr> <int>      <dbl>
## 1  1985 M     Sea       6 0.00000312
## 2  2000 M     Creek     7 0.00000335

We can see that these names were not popular, but both were given over five times to boys.

K.9.5.2 Name popularity frequency table

Here we want to count how many names there are with \(n=5\), \(n=6\), and so on. So we just count them:

p <- babynames %>%
   filter(year == 2004,
          sex == "F") %>%
   count(n)
p %>%
   sample_n(5)
## # A tibble: 5 × 2
##       n    nn
##   <int> <int>
## 1   789     1
## 2   111    11
## 3   423     3
## 4   292     3
## 5   713     1

Over all time: we need to aggregate \(n\):

p <- babynames %>%
   group_by(name) %>%
   summarize(n = sum(n)) %>%
   count(n)
p %>%
   sample_n(5)
## # A tibble: 5 × 2
##       n    nn
##   <int> <int>
## 1  4593     1
## 2  3435     2
## 3 11556     1
## 4  3404     1
## 5  2570     2

K.10 ggplot2

K.10.1 Basic plotting with ggplot2

K.10.1.1 Length versus width

The only confusing part here is that both the data variables and aesthetics are called x and y. So we need aesthetic mapping aes(x = x, y = y):

ggplot(d1000,
       aes(x = x, y = y)) +
   geom_point()

We see that most diamonds have very similar x and y, hence they are almost circular when seen from above.

K.10.1.2 Two aes()-s in one plot

This works beautifully:

ggplot(d1000,
       aes(x = carat, y = price)) +
   geom_point(aes(col = cut))

In fact, for the current plot, this is equivalent to specifying all aesthetics in ggplot(), or specifying those in geom_point().

K.10.1.3 Amended color plot

Specifying x and y in ggplot() and fixed aesthetics in geom_point():

ggplot(d1000,
       aes(x = carat, y = price)) +
   geom_point(col = "limegreen",
              size = 3,
              alpha = 0.3)

Transparency helps to understand the dense region of small diamonds. As there is a lot of overplotting, it is otherwise hard to tell what is going on there.

K.10.2 Most important plot types

K.10.2.1 Orange tree barplot in different colors

Using aes(..., fill=Tree) uses the values of the data variable Tree to determine the color of the bars.

We can just add the aesthetic fill=Tree to make the bar colors different for different trees:

ggplot(avg,
       aes(Tree, size, fill=Tree)) +
   geom_col()

Remember that it is the fill aesthetic that controls the fill color, not the col aesthetic!

But here the colors do not contain any information that is not already embedded in the bars. While colors are usually a nice visual feature, they may be misleading in some cases, making the viewer believe that the colors have a distinct meaning, separate from the bars.

K.10.2.2 Histogram of Titanic data

Here is the age histogram:

ggplot(titanic,
       aes(age)) +
   geom_histogram(bins = 30,
                  fill = "mediumpurple4",
                  col = "gold1")

30 bins seems a good choice here.

Here is the fare histogram:

ggplot(titanic,
       aes(fare)) +
   geom_histogram(bins = 70,
                  fill = "mediumpurple4",
                  col = "gold1")

A larger number of bins is better here, so that more bins cover the cheaper tickets (below £100), where we have most of the data.

As you see, age is distributed broadly normally, but fare is more like log-normal with a long right tail of very expensive tickets. Why is it like that? It is broadly related to the fact that human age has a pretty hard upper limit, but no such limit exists for wealth. There were very wealthy passengers, but no one could have been 500 years old.

K.10.2.3 Diamond price in a narrow range

Here is the price distribution for mass range \([0.45,0.5]\)ct.

diamonds %>%
   filter(between(carat, 0.45, 0.5)) %>%
   ggplot(aes(cut, price)) +
   geom_boxplot()

And here for \([0.95,1]\)ct.

diamonds %>%
   filter(between(carat, 0.95, 1)) %>%
   ggplot(aes(cut, price)) +
   geom_boxplot()

Now it is fairly obvious that better cut is associated with higher price.

K.10.3 Inheritance

K.10.3.1 Ice extent in January

Everything in color:

ice <- read_delim("data/ice-extent.csv.bz2")
ice %>%
   filter(month == 2) %>%
   ggplot(aes(year, extent, col = region)) +
   geom_line() +
   geom_point()

Gray lines:

ice %>%
   filter(month == 2) %>%
   ggplot(aes(year, extent, col = region)) +
   geom_line(aes(group = region),
             col = "gray80",
             linewidth = 2) +
   geom_point()

3 Months in north:

ice %>%
   filter(month %in% c(2, 5, 9)) %>%
   filter(region == "N") %>%
   ggplot(aes(year, extent, col = factor(month))) +
   geom_line() +
   geom_point()

3 Months in north, gray lines:

ice %>%
   filter(month %in% c(2, 5, 9)) %>%
   filter(region == "N") %>%
   ggplot(aes(year, extent, col = factor(month))) +
   geom_line(aes(group = month),
             col = "gray30",
             linewidth = 2) +
   geom_point()

K.10.4 Tuning your plots

K.10.4.1 Political parties with one color not specified

The party whose color is unspecified is displayed in gray, more specifically in the color given by the argument na.value of scale_fill_manual().

Let’s leave out INC and write

data.frame(party = c("BJP", "INC", "AITC"),
           seats = c(303, 52, 23)) %>%
ggplot(aes(party, seats, fill=party)) +
   geom_col() +
   scale_fill_manual(
      values = c(BJP="orange2",
                 AITC="springgreen3")
   )

As you see, it does not result in an error but a gray bar for INC. The gray value can be adjusted with na.value, e.g. as scale_fill_manual(na.value="red").

K.10.4.2 Manually specifying a continuous scale

I do not know how one might be able to manually specify colors for a continuous scale. The problem is that continuous variables can take an infinite number of values–and you cannot specify an infinite number of values manually.

The closest existing option to this is scale_color_gradientn(). This allows you to link a number of data values to specific colors, and tell ggplot to use gradient for whatever values there are in-between.
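
A minimal sketch of how scale_color_gradientn() might be used. The data frame d, the colors and the anchor values 0, 30 and 90 are all made up for illustration. The values argument expects positions rescaled to the interval [0, 1]; as z runs from 0 to 90 here, dividing by 90 does the job:

```r
library(ggplot2)
## made-up data, for illustration only
d <- data.frame(x = 1:10, y = 1:10,
                z = seq(0, 90, by = 10))
p <- ggplot(d, aes(x, y, col = z)) +
   geom_point(size = 3) +
   scale_color_gradientn(
      colors = c("blue", "white", "red"),
      ## positions of the three colors on the [0, 1] scale:
      ## z = 0 is blue, z = 30 is white, z = 90 is red
      values = c(0, 30, 90) / 90)
p
```

For data values between the anchors, ggplot interpolates the colors.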

K.10.4.3 Using wrong scales

Using the wrong scale (a col scale where the plot maps the fill aesthetic) is silently ignored:

gdp <- data.frame(GDP=c(1000, 1050),
                  year=c(2023, 2024))
ggplot(gdp,
       aes(year, GDP, fill=factor(year))) +
   geom_col() +
   scale_color_manual(
      values = c("2023"="black",
                 "2024" = "white")
   )

K.10.4.4 March ice extent

Coloring bars according to the value

ice <- read_delim("data/ice-extent.csv.bz2")
## create a separate filtered df--
## we need it for both plotting
## and for computing the average
ice3 <- ice %>%
   filter(month == 3,
          region == "N")
avg <- ice3$extent %>%
   mean()
ggplot(ice3,
       aes(year, extent, fill = extent)) +
   geom_col() +
   scale_fill_gradient2(low = "red",
                        mid = "white",
                        high = "blue",
                        midpoint = avg)

Here one might want to plot not the extent itself, but the difference between the extent and its average (baseline) value.
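
Such a plot of differences might be sketched as follows. The numbers below are made up (they are not the real ice extent data), and the column name diff is just a suggestion. As the differences are centered at zero, the midpoint of the color scale is now 0:

```r
library(ggplot2)
## made-up extents, for illustration only (not the real ice data)
ice3 <- data.frame(year = 2000:2005,
                   extent = c(15.2, 15.5, 15.3, 15.4, 15.0, 14.7))
## difference from the average (baseline) value
ice3$diff <- ice3$extent - mean(ice3$extent)
p <- ggplot(ice3,
            aes(year, diff, fill = diff)) +
   geom_col() +
   scale_fill_gradient2(low = "red",
                        mid = "white",
                        high = "blue",
                        midpoint = 0)
p
```

Now the bars above the baseline point up and the bars below it point down.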

K.10.5 More geoms and plot types

K.10.5.1 Titanic fare by passenger class

Here is the example:

titanic %>%
   ggplot(aes(fare,
              fill = factor(pclass))) +
   geom_density(alpha = 0.5) +
   coord_cartesian(xlim = c(0, 100),
                   ylim = c(0, 0.05))

We limit the plot region to \([0, 100] \times [0, 0.05]\) to zoom into the more interesting area. Alternatively, one may consider log-scale.

K.11 More about data manipulations

K.11.1 Merging data: joins

K.11.1.1 Merge artists, songs

left_join(artists, songs) should put the artists first and add a column song at the end of it. Something like

name   plays              song
John  guitar     Come Together
Paul    bass    Hello, Goodbye

But the problem is that John is playing in two songs, so a single song name may not be sufficient. One can come up with multiple solutions. For instance, you can list the first song where John plays. Or you can create two lines for John, one for each song. You may also create two columns for songs, one for each song.

left_join() picks the option of creating two lines, one for each song:

songs <- data.frame(song = c("Across the Universe", "Come Together",
                             "Hello, Goodbye", "Peggy Sue"),
                    name = c("John", "John", "Paul", "Buddy"))
artists <- data.frame(name = c("George", "John", "Paul", "Ringo"),
                      plays = c("sitar", "guitar", "bass", "drums"))
left_join(artists, songs)
##     name  plays                song
## 1 George  sitar                <NA>
## 2   John guitar Across the Universe
## 3   John guitar       Come Together
## 4   Paul   bass      Hello, Goodbye
## 5  Ringo  drums                <NA>

K.11.2 Reshaping

K.11.2.1 Drinking data with years in rows

This is fairly easy and fairly logical. Essentially, we need to rotate the original wide form data by 90°:

drinking <- data.frame(
   state = c("Tennessee", "North Carolina", "Pennsylvania"),
   `2009` = c(48.3, 60.3, 36),
   `2010` = c(48.1, 59.7, 37.3),
   `2011` = c(39.6, 60.4, 40.6),
   `2012` = c(48.1, 59.2, 41.2),
   check.names = FALSE
)
drinking %>%
   pivot_longer(!state, names_to = "year", values_to = "pct") %>%
   pivot_wider(names_from = "state", values_from = "pct")
## # A tibble: 4 × 4
##   year  Tennessee `North Carolina` Pennsylvania
##   <chr>     <dbl>            <dbl>        <dbl>
## 1 2009       48.3             60.3         36  
## 2 2010       48.1             59.7         37.3
## 3 2011       39.6             60.4         40.6
## 4 2012       48.1             59.2         41.2

This table is easy to understand. Putting years in rows is also widely used in the literature.

K.11.2.2 Drinking data in pure wide form

If we do not have states in separate rows, then we need more columns. Currently we have four yearly values for each state. We still need all of those, but now they must be in the same row for all states. So we’ll have a peculiar data frame with a single row only and a large number of columns, one for each combination of year and state. But there will be no distinct “state” column. It might look like

2009_Tennessee 2009_North Carolina 2009_Pennsylvania 2010_Tennessee 2010_North Carolina 2010_Pennsylvania 2011_Tennessee 2011_North Carolina 2011_Pennsylvania 2012_Tennessee 2012_North Carolina 2012_Pennsylvania
48.3 60.3 36 48.1 59.7 37.3 39.6 60.4 40.6 48.1 59.2 41.2

Note that we now need to add state name to the column names to make clear which “2009” means Tennessee and which one North Carolina.
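
One possible way to produce such a single-row data frame is to first reshape to long form, and then spread over both year and state at the same time. This is only a sketch; the names long and wide are made up, and the drinking data is copied from the previous exercise:

```r
library(tidyr)
drinking <- data.frame(
   state = c("Tennessee", "North Carolina", "Pennsylvania"),
   `2009` = c(48.3, 60.3, 36),
   `2010` = c(48.1, 59.7, 37.3),
   `2011` = c(39.6, 60.4, 40.6),
   `2012` = c(48.1, 59.2, 41.2),
   check.names = FALSE
)
## first to long form: one row per state and year
long <- pivot_longer(drinking, !state,
                     names_to = "year", values_to = "pct")
## then to pure wide form: column names combine year and state
wide <- pivot_wider(long, names_from = c(year, state),
                    values_from = pct)
dim(wide)
```

By default, pivot_wider() joins the year and state with an underscore, giving column names like 2009_Tennessee.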

K.11.2.3 Reshape ice extent

As a refresher, the ice extent data looks like

ice %>%
   select(year, month, region, area) %>%
   head(3)
## # A tibble: 3 × 4
##    year month region  area
##   <dbl> <dbl> <chr>  <dbl>
## 1  1978    11 N       9.04
## 2  1978    11 S      11.7 
## 3  1978    12 N      10.9
  1. In terms of the region, the dataset is in long form. There is only a single column region that contains region type (“N” and “S”).
  2. In wide form, the column names might be Narea and Sarea, for instance.
  3. Reshape to wide:
ice %>%
   select(year, month, region, area) %>%
   pivot_wider(names_from = "region", values_from = "area") %>%
   head(4)
## # A tibble: 4 × 4
##    year month     N     S
##   <dbl> <dbl> <dbl> <dbl>
## 1  1978    11  9.04 11.7 
## 2  1978    12 10.9   6.97
## 3  1979     1 12.4   3.47
## 4  1979     2 13.2   2.11

As you see, by default the variable names are “N” and “S”, the same values that were in the region column.

K.11.2.4 Reshape patients data

This data frame is in a wide form as there are two columns, male and female, that contain counts. The NA is somewhat misleading, it would be more appropriate to put “0” in that place instead.

Hence we can reshape it into a long form:

patients <- data.frame(pregnant = c("yes", "no"),
                       male = c(NA, 25),
                       female = c(11, 20))
patients %>%
   pivot_longer(!pregnant,
                names_to = "sex",
                values_to = "count")
## # A tibble: 4 × 3
##   pregnant sex    count
##   <chr>    <chr>  <dbl>
## 1 yes      male      NA
## 2 yes      female    11
## 3 no       male      25
## 4 no       female    20

The result has three columns: pregnant, sex and count. If needed, we can remove the NA-row.
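
Removing the NA-row might be sketched in base R like this. The long-form data frame is typed in by hand from the output above, and the names long and complete are made up:

```r
## long-form data, re-typed from the output above
long <- data.frame(
   pregnant = c("yes", "yes", "no", "no"),
   sex = c("male", "female", "male", "female"),
   count = c(NA, 11, 25, 20))
## keep only rows where the count is not missing
complete <- long[!is.na(long$count), ]
complete
```

Alternatively, pivot_longer() has an argument values_drop_na = TRUE that drops such rows while reshaping.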

K.12 Making maps

K.12.1 Shapefiles and GeoJSON

K.12.1.1 Difference between spatial data frame and manual map data frame

There are multiple differences:

  • Perhaps most importantly, the hand-made NZ map in Section 15.1.1 is stored as one vertex per row, while the spatial data frame is stored as one polygon per row. This makes spatial data frames much smaller: for instance, you do not need to replicate the same color value for every single vertex, as a single value for the polygon is enough.
  • Another important difference is the presence of a coordinate reference system (CRS). This makes it easy to transform one coordinate system to another, and in this way to use spatial data that is stored using different systems.

K.12.1.2 Why left_join()?

We use left join to merge map and population. Remember: this retains all the rows of map but drops the lines of population where there is no corresponding region on the map. Hence we retain all regions (rows of map), with potentially NA as the population value. This is a reasonable approach.

Alternatively:
  • Inner join would remove from the map those regions where population is missing. That would leave holes in the map. It is probably better to keep those regions and use a dedicated NA-color, such as gray, instead.
  • Outer join will preserve all regions, but also the population information for those regions that are not present on the map. This will probably not be a serious problem, it may just clutter your data frame with unnecessary rows.
  • Finally, right join will combine the worst of both worlds: leave holes in the map for missing population data, while also cluttering the final dataset.
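
The four options can be compared on made-up data. In this sketch map is a plain data frame rather than a spatial one, and all region names and numbers are hypothetical. It uses base-R merge() in place of the dplyr joins: all.x = TRUE corresponds to left join, the default to inner join, all = TRUE to outer join, and all.y = TRUE to right join:

```r
## toy data, made up for illustration
map <- data.frame(region = c("A", "B", "C"))
population <- data.frame(region = c("B", "C", "D"),
                         pop = c(10, 20, 30))
left  <- merge(map, population, by = "region", all.x = TRUE)
inner <- merge(map, population, by = "region")
outer <- merge(map, population, by = "region", all = TRUE)
right <- merge(map, population, by = "region", all.y = TRUE)
left   # all map regions, NA population for "A"
inner  # region "A" is gone: a hole in the map
outer  # all regions, plus "D" that is not on the map
right  # "A" is gone and "D" is added
```

Only the left join keeps exactly the regions that are on the map.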

K.13 Command line