Chapter 15 Solutions

15.1 R Language

15.1.1 Operators

15.1.1.1 Create vectors

Using operators and sequences:

v <- 1:5
v^2

## [1]  1  4  9 16 25

or you can do it just as

(1:5)^2

## [1]  1  4  9 16 25

Using c:

c(1,4,9,16,25)

## [1]  1  4  9 16 25

15.1.1.2 Going out with friends

friends <- 7
budget <- 150
cat("I am going out with", friends, "friends\n")

## I am going out with 7 friends

mealprice <- 14
price <- mealprice*(friends + 1)
total <- price*1.15
cat("total price:", total, "\n")

## total price: 128.8

if(total <= budget) {
   cat("can afford\n")
} else {
   cat("cannot afford\n")
}

## can afford

15.1.2 Strings

15.1.2.1 Paste strings

f <- "5'"
i <- '3"'
result <- paste0("height is ", f, i)
result  # prints in a quoted form

## [1] "height is 5'3\""

cat(result, "\n")  # prints w/o quotes

## height is 5'3"

15.1.3 Collections

15.1.3.1 Manipulate elements from the vector

alphabet = c("α", "β", "γ", "δ", "ε", "ζ", "η")
## extract 1, 3, 5th:
alphabet[c(1,3,5)]

## [1] "α" "γ" "ε"

## add _θ_
alphabet[8] <- "θ"
## all except the first and the last
alphabet[c(-1,-8)]

## [1] "β" "γ" "δ" "ε" "ζ" "η"

15.1.3.2 Separate data into “good” and “bad” cases

results <- c(32, 33, 23, 14, 45, 33)
quality <- c("good", "good", "bad", "good", "bad", "good")
goodResults <- results[quality == "good"]
goodResults

## [1] 32 33 14 33

badResults <- results[quality == "bad"]
badResults

## [1] 23 45

15.2 Pipes and dplyr

15.2.1 Pipes

15.2.1.1 Print mean using pipes

1:4 %>%
   mean() %>%
   cat("approximately\n")

## 2.5 approximately

15.2.2 dplyr

15.2.2.1 Select Titanic variables

## one way
titanic %>%
   select(pclass, survived, sex, age) %>%
   head(2)

##   pclass survived    sex     age
## 1      1        1 female 29.0000
## 2      1        1   male  0.9167

## another way
titanic %>%
   select(pclass:age & !name) %>%
   head(2)

##   pclass survived    sex     age
## 1      1        1 female 29.0000
## 2      1        1   male  0.9167

Note combining two criteria with logical and & in the second example.

for the following answers we need the subset of babynames data:

babynames <- babynames::babynames %>%
   select(-prop) %>%
   filter(between(year, 1995, 2003))

15.2.2.2 Longest name

We can just compute the length of the names, arrange by the length in descending order, and pick the first names:

babynames %>%
   mutate(len = nchar(name)) %>%
   group_by(sex) %>%
   arrange(desc(nchar(name)), by_group=TRUE) %>%
   slice_head(n=3)

## # A tibble: 6 × 5
## # Groups:   sex [2]
##    year sex   name                n   len
##   <dbl> <chr> <chr>           <int> <int>
## 1  1997 F     Hannahelizabeth     5    15
## 2  1995 F     Mariaguadalupe     21    14
## 3  1995 F     Mariadelcarmen     11    14
## 4  1995 M     Johnchristopher    13    15
## 5  1995 M     Christopherjame     9    15
## 6  1995 M     Franciscojavier     8    15

This results in longest names printed for both boys and girls.

Alternatively, we can use min_rank to implicitly order the names, and pick the names with rank 1:

babynames %>%
   group_by(sex) %>%
   filter(min_rank(desc(nchar(name))) == 1)

## # A tibble: 48 × 4
## # Groups:   sex [2]
##     year sex   name                n
##    <dbl> <chr> <chr>           <int>
##  1  1995 M     Johnchristopher    13
##  2  1995 M     Christopherjame     9
##  3  1995 M     Franciscojavier     8
##  4  1995 M     Christiandaniel     7
##  5  1995 M     Christopherjohn     6
##  6  1995 M     Jonathanmichael     6
##  7  1996 M     Franciscojavier    13
##  8  1996 M     Jonathanmichael     9
##  9  1996 M     Christopherjohn     8
## 10  1996 M     Christophermich     8
## # ℹ 38 more rows

We can see that there are quite a few 15-character long names, apparently compound names. But we also see another problem–names are counted multiple times if they appear in multiple years. In order to solve that problem, we may group by name and sex, and sum over n, before we check the name length:

babynames %>%
   group_by(name, sex) %>%
   summarize(n = sum(n)) %>%
   group_by(sex) %>%
   filter(min_rank(desc(nchar(name))) == 1)

## # A tibble: 19 × 3
## # Groups:   sex [2]
##    name            sex       n
##    <chr>           <chr> <int>
##  1 Christiananthon M        12
##  2 Christiandaniel M         7
##  3 Christianjoseph M        10
##  4 Christianjoshua M         5
##  5 Christianmichae M         6
##  6 Christopheranth M         6
##  7 Christopherjame M        45
##  8 Christopherjohn M        39
##  9 Christopherjose M        12
## 10 Christophermich M        16
## 11 Christopherryan M         5
## 12 Davidchristophe M         5
## 13 Franciscojavier M        52
## 14 Hannahelizabeth F         5
## 15 Johnchristopher M        39
## 16 Jonathanmichael M        20
## 17 Jordanchristoph M         5
## 18 Matthewalexande M        11
## 19 Ryanchristopher M        22

This table lists 19 names, all of these 15 letters long. Interestingly, only one of these was given to girls. It appears that 15 letters is the limit in this dataset, e.g. Christianmichae is probably either Christianmichael or more likely Christian Michael.

15.3 Cleaning data

15.3.1 Missing values

15.3.1.1 Use `na_if` and filtering

We use na_if to convert empty strings to NA-s and thereafter remove all explicit missings:

titanic %>%
   mutate(boat = na_if(boat, "")) %>%
   filter(!is.na(boat)) %>%
   nrow()

## [1] 486

This is exactly the same number as in Section 6.1.2.

15.3.1.2 Replace missing value in ice extent data

First (re)create the data:

extent <- data.frame(month=9:12, extent=c(7.28, 9.05, 11.22, -9999))
extent

##   month   extent
## 1     9     7.28
## 2    10     9.05
## 3    11    11.22
## 4    12 -9999.00

Obviously, missing is coded as -9999. As extent is a measure of area, it cannot be negative, so even if documentation is not there, we can conclude that the value is invalid, and we should replace it.
Convert it to NA:

extent %>%
   mutate(extent = na_if(extent, -9999))

##   month extent
## 1     9   7.28
## 2    10   9.05
## 3    11  11.22
## 4    12     NA

use the previous value with tidyr::fill. Previous values will be used when `.direction = “down”:

extent %>%
   mutate(extent = na_if(extent, -9999)) %>%
   fill(extent, .direction="down")

##   month extent
## 1     9   7.28
## 2    10   9.05
## 3    11  11.22
## 4    12  11.22

15.4 ggplot

15.4.0.1 Orange trees as scatterplots

data(Orange)
Orange %>%
   ggplot(aes(age, circumference, col=Tree)) +
   geom_point()

plot of chunk unnamed-chunk-19

15.4.0.2 Orange trees as points and lines

data(Orange)
Orange %>%
   ggplot(aes(age, circumference, col=Tree)) +
   geom_point(col="black") +
   geom_line()  # note: later geom is on top of the previous

plot of chunk unnamed-chunk-20

15.4.0.3 Orange tree growth in log scale

data(Orange)
Orange %>%
   ggplot(aes(age, circumference, col=Tree)) +
   geom_line() +
   scale_y_log10()

plot of chunk unnamed-chunk-21

15.5 Regression models

15.5.1 Linear Regression

15.5.1.1 Manually find the best virginica petal parameters

Load data

data(iris)
virginica <- iris %>%
   filter(Species == "virginica")
x <- virginica$Petal.Length
y <- virginica$Petal.Width

Define the function:

sse <- function(b0, b1) {
   yHat <- b0 + b1*x
   sum((y - yHat)^2)
}

Now start from the best value got above, $(-0.1, 0.4)$ and manipulate $\beta_1$:

sse(-0.1, 0.4)

## [1] 4.6196

sse(-0.1, 0.3)

## [1] 14.2024

sse(-0.1, 0.5)

## [1] 26.16

Out of these values, $(-0.1, 0.4)$ is the best combination. Now manipulate $\beta_0$:

sse(-0.2, 0.4)

## [1] 4.1716

sse(-0.3, 0.4)

## [1] 4.7236

The best value is $(-0.2, 0.4)$. Next, manipulate $\beta_1$ again, just now in smaller steps:

sse(-0.2, 0.39)

## [1] 4.284536

sse(-0.2, 0.41)

## [1] 4.369896

We cannot beat $(-0.2, 0.40)$. Hence change $\beta_0$ again:

sse(-0.19, 0.40)

## [1] 4.1714

sse(-0.21, 0.40)

## [1] 4.1818

sse(-0.18, 0.40)

## [1] 4.1812

Out of these, $(-0.19, 0.40)$ is the best option.

We can keep going like that, each time making the step smaller. At the end, it will lead to the correct linear regression result.

15.5.1.2 Manually optimize virginica sepal regression

We essentially follow the same example as what we did where we optimized petals manually.

First, load data and define $x$ and $y$:

data(iris)
virginica <- iris %>%
   filter(Species == "virginica")
x <- virginica$Sepal.Length
y <- virginica$Sepal.Width

We use the same sse function as above:

sse <- function(b0, b1) {
   yHat <- b0 + b1*x
   sum((y - yHat)^2)
}

Start with certain $(\beta_0, \beta_1)$ values $(0, 0.5) with a similar reasoning $\beta_0 = 0$ means zero-length leaves are of zero width, and $\beta_1 = 0.5$ means that 1cm longer leaves are 0.5cm wider. Both are a reasonable starting point:

sse(0, 0.5)  # a

## [1] 10.575

Next, let’s try make $\beta_1$ smaller–if it helps:

sse(0, 0.4)  # b

## [1] 10.33

The new result, 10.33, is only a little bit better than the previous one, but it is still better.

As a third attempt, we set $\beta_0 = -0.1$ while keeping our previous best value $\beta_1 = 0.4$

sse(-0.1, 0.4)  # c

## [1] 14.218

This resulted in a worse outcome that $\beta_0 = 0$, hence lets try the other way around in increase $\beta_0$ to 0.1:

sse(0.1, 0.4)  # d

## [1] 7.442

Now we get a better result, so we can move even further into the positive direction:

sse(0.2, 0.4)  # e

## [1] 5.554

We see an additional improvement, although smaller than earlier. Let’s continue:

sse(0.3, 0.4)  # f

## [1] 4.666

Now we got $SSE$ below 5!

Finally, plot data and all the attempts:

## Create data frame of the three attempted parameters
lines <- data.frame(
   ## 'letters' are lowers case letters
   attempt = letters[1:6],
   intercept = c(0, 0, -0.1, 0.1, 0.2, 0.3),
   slope = c(0.5, 0.4, 0.4, 0.4, 0.4, 0.4))
## add predicted lines to the original scatterplot:
ggplot(virginica,
       aes(Sepal.Length, Sepal.Width)) +
   geom_jitter() +
   labs(x = "Petal length (cm)",
        y = "Petal width (cm)") +
   geom_abline(data=lines,
               aes(intercept=intercept,
                   slope=slope,
                   col=attempt))

We use pre-defined variable letters to label the attempts, it is simply a vector of 26 English lower-case letters.

15.5.1.3 Titanic pclass as number and categorical

First, use pclass as a number:

lm(survived ~ pclass + sex, data = titanic) %>%
   summary()

## 
## Call:
## lm(formula = survived ~ pclass + sex, data = titanic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8952 -0.2451 -0.0998  0.2501  0.9002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.04054    0.03369   30.89   <2e-16 ***
## pclass      -0.14531    0.01313  -11.07   <2e-16 ***
## sexmale     -0.50481    0.02297  -21.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3948 on 1306 degrees of freedom
## Multiple R-squared:  0.3413, Adjusted R-squared:  0.3403 
## F-statistic: 338.3 on 2 and 1306 DF,  p-value: < 2.2e-16

The intercept here means that 0-class women had survival probability 1.04. This is out of range of possible probabilities, and artifact of linear probability models, but we also did not have 0-class passengers.

pclass estimate -0.14 means that passengers in one unit larger class had 14 pct points lower probability to survive.

Next, convert pclass to a factor:

lm(survived ~ factor(pclass) + sex,
   data = titanic) %>%
   summary()

## 
## Call:
## lm(formula = survived ~ factor(pclass) + sex, data = titanic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8990 -0.2364 -0.1015  0.2587  0.8985 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.89898    0.02540  35.398  < 2e-16 ***
## factor(pclass)2 -0.15771    0.03237  -4.872 1.24e-06 ***
## factor(pclass)3 -0.29264    0.02671 -10.957  < 2e-16 ***
## sexmale         -0.50487    0.02298 -21.974  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3949 on 1305 degrees of freedom
## Multiple R-squared:  0.3414, Adjusted R-squared:  0.3399 
## F-statistic: 225.5 on 3 and 1305 DF,  p-value: < 2.2e-16

Here the reference category is 1st class females, the intercept tells that their survival probability was 0.90. 2nd class passengers had 15 pct pt lower probability to survive, and 3rd class passengers had 29 pct pt lower probability to survive.

Treating pclass as a number and as a categorical gives fairly similar results here because, for some reason, the survival probability differs by about equal amount between 1st and 2nd, and between 2nd and 3rd class passengers. In general, it is not so though.

15.5.1.4 Log age instead of age

This is just about replacing age with log(age) in the formula:

treatment <- read_delim("../data/treatment.csv.bz2")
treatment %>%
   filter(re78 > 0) %>%
   lm(log(re78) ~ educ + log(age),
      data = .) %>%
   summary()

## 
## Call:
## lm(formula = log(re78) ~ educ + log(age), data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5702 -0.2372  0.1301  0.4290  2.1656 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.097644   0.195998   31.11   <2e-16 ***
## educ        0.110124   0.005101   21.59   <2e-16 ***
## log(age)    0.688133   0.050924   13.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7399 on 2341 degrees of freedom
## Multiple R-squared:  0.1979, Adjusted R-squared:  0.1972 
## F-statistic: 288.8 on 2 and 2341 DF,  p-value: < 2.2e-16

Now the $R^2$ is 0.1979, originally it was 0.1908. Hence it improved the model a little bit.