Chapter 16 Solutions
16.1 R Language
16.1.1 Operators
16.1.1.1 Create vectors
Using operators and sequences:
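The code itself is not preserved in this extract; one way, a sketch using seq() and the ^ operator, is:
seq(1, 5)^2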
## [1] 1 4 9 16 25
or you can do it just as well with:
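Presumably with the : operator:
(1:5)^2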
## [1] 1 4 9 16 25
Using c():
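For example:
c(1, 4, 9, 16, 25)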
## [1] 1 4 9 16 25
16.2 Pipes and dplyr
16.2.2 dplyr
16.2.2.1 Select Titanic variables
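The solution code is not preserved in this extract; a dplyr sketch that would produce output of this shape (assuming the data frame is named titanic) is:
titanic %>%
  select(pclass, survived, sex, age) %>%
  head(2)
## the second example additionally filters on two criteria:
titanic %>%
  filter(pclass == 1 & survived == 1) %>%
  select(pclass, survived, sex, age) %>%
  head(2)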
## pclass survived sex age
## 1 1 1 female 29.0000
## 2 1 1 male 0.9167
## pclass survived sex age
## 1 1 1 female 29.0000
## 2 1 1 male 0.9167
Note combining two criteria with the logical and (&) in the second example.
For the following answers we need the subset of the babynames data:
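The subsetting code is not preserved; judging by the years that appear in the results below, it was presumably something like:
library(babynames)
babynames <- babynames %>%
  filter(1995 <= year & year <= 1998)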
16.2.2.2 Longest name
We can just compute the length of the names, arrange by the length in descending order, and pick the first names:
babynames %>%
  mutate(len = nchar(name)) %>%
  group_by(sex) %>%
  arrange(desc(len), .by_group = TRUE) %>%
  slice_head(n = 3)
## # A tibble: 6 × 5
## # Groups: sex [2]
## year sex name n len
## <dbl> <chr> <chr> <int> <int>
## 1 1997 F Hannahelizabeth 5 15
## 2 1995 F Mariaguadalupe 21 14
## 3 1995 F Mariadelcarmen 11 14
## 4 1995 M Johnchristopher 13 15
## 5 1995 M Christopherjame 9 15
## 6 1995 M Franciscojavier 8 15
This prints the three longest names for both boys and girls.
Alternatively, we can use min_rank() to implicitly order the names, and pick the names with rank 1:
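The code is missing from this extract; a version consistent with the output below (and with the grouped variant shown later) is:
babynames %>%
  group_by(sex) %>%
  filter(min_rank(desc(nchar(name))) == 1)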
## # A tibble: 48 × 4
## # Groups: sex [2]
## year sex name n
## <dbl> <chr> <chr> <int>
## 1 1995 M Johnchristopher 13
## 2 1995 M Christopherjame 9
## 3 1995 M Franciscojavier 8
## 4 1995 M Christiandaniel 7
## 5 1995 M Christopherjohn 6
## 6 1995 M Jonathanmichael 6
## 7 1996 M Franciscojavier 13
## 8 1996 M Jonathanmichael 9
## 9 1996 M Christopherjohn 8
## 10 1996 M Christophermich 8
## 11 1996 M Christopherjose 7
## 12 1996 M Christiananthon 6
## 13 1996 M Christianmichae 6
## 14 1996 M Christianjoseph 5
## 15 1996 M Christopherjame 5
## 16 1996 M Davidchristophe 5
## 17 1996 M Jordanchristoph 5
## 18 1996 M Ryanchristopher 5
## 19 1997 F Hannahelizabeth 5
## 20 1997 M Christophermich 8
## 21 1997 M Johnchristopher 8
## 22 1997 M Franciscojavier 7
## 23 1997 M Ryanchristopher 7
## 24 1997 M Christopherjame 6
## 25 1997 M Christopherjohn 6
## 26 1997 M Christopherryan 5
## 27 1998 M Christopherjame 12
## 28 1998 M Christopherjohn 9
## # ℹ 20 more rows
We can see that there are quite a few 15-character names, apparently compound names. But we also see another problem: names are counted multiple times if they appear in multiple years. To solve this, we can group by name and sex and sum over n before checking the name length:
babynames %>%
  group_by(name, sex) %>%
  summarize(n = sum(n)) %>%
  group_by(sex) %>%
  filter(min_rank(desc(nchar(name))) == 1)
## # A tibble: 19 × 3
## # Groups: sex [2]
## name sex n
## <chr> <chr> <int>
## 1 Christiananthon M 12
## 2 Christiandaniel M 7
## 3 Christianjoseph M 10
## 4 Christianjoshua M 5
## 5 Christianmichae M 6
## 6 Christopheranth M 6
## 7 Christopherjame M 45
## 8 Christopherjohn M 39
## 9 Christopherjose M 12
## 10 Christophermich M 16
## 11 Christopherryan M 5
## 12 Davidchristophe M 5
## 13 Franciscojavier M 52
## 14 Hannahelizabeth F 5
## 15 Johnchristopher M 39
## 16 Jonathanmichael M 20
## 17 Jordanchristoph M 5
## 18 Matthewalexande M 11
## 19 Ryanchristopher M 22
This table lists 19 names, all of them 15 letters long. Interestingly, only one of these was given to girls. It appears that 15 letters is the maximum name length in this dataset: e.g., Christianmichae is probably either Christianmichael or, more likely, Christian Michael.
16.3 Cleaning data
16.3.1 Missing values
16.3.1.1 Use na_if and filtering
We use na_if() to convert empty strings to NAs, and thereafter remove all explicit missings:
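The code itself is not preserved here; the general pattern, sketched with a hypothetical data frame df and variable x, is:
df %>%
  mutate(x = na_if(x, "")) %>%  # empty strings become NA
  filter(!is.na(x)) %>%         # remove the explicit missings
  nrow()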
## [1] 486
This is exactly the same number as in Section 6.1.2.
16.3.1.2 Replace missing value in ice extent data
First (re)create the data:
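The creation code is not shown; a data frame matching the printout below (the name ice is an assumption) is:
ice <- data.frame(
  month = 9:12,
  extent = c(7.28, 9.05, 11.22, -9999))
ice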
## month extent
## 1 9 7.28
## 2 10 9.05
## 3 11 11.22
## 4 12 -9999.00
- Obviously, missing is coded as -9999. As extent is a measure of area, it cannot be negative, so even without documentation we can conclude that the value is invalid and should be replaced.
- Convert it to NA:
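A sketch of the conversion, assuming the data frame is called ice as above:
ice <- ice %>%
  mutate(extent = na_if(extent, -9999))
ice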
## month extent
## 1 9 7.28
## 2 10 9.05
## 3 11 11.22
## 4 12 NA
- Use the previous value with tidyr::fill(). Previous values will be used when .direction = "down":
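A sketch of the corresponding call:
ice %>%
  tidyr::fill(extent, .direction = "down")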
## month extent
## 1 9 7.28
## 2 10 9.05
## 3 11 11.22
## 4 12 11.22
16.4 ggplot
16.5 Regression models
16.5.1 Linear Regression
16.5.1.1 Manually find the best virginica petal parameters
Load data:
data(iris)
virginica <- iris %>%
  filter(Species == "virginica")
x <- virginica$Petal.Length
y <- virginica$Petal.Width
Define the function:
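The definition itself is missing from this extract; a sum-of-squared-errors function consistent with how sse is used below is:
sse <- function(beta0, beta1) {
  yhat <- beta0 + beta1*x   # predicted petal width
  sum((y - yhat)^2)         # sum of squared errors
}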
Now start from the best values found above, \((-0.1, 0.4)\), and manipulate \(\beta_1\):
## [1] 4.6196
## [1] 14.2024
## [1] 26.16
Out of these values, \((-0.1, 0.4)\) is the best combination. Now manipulate \(\beta_0\):
## [1] 4.1716
## [1] 4.7236
The best value is \((-0.2, 0.4)\). Next, manipulate \(\beta_1\) again, now in smaller steps:
## [1] 4.284536
## [1] 4.369896
We cannot beat \((-0.2, 0.40)\). Hence change \(\beta_0\) again:
## [1] 4.1714
## [1] 4.1818
## [1] 4.1812
Out of these, \((-0.19, 0.40)\) is the best option.
We can keep going like this, each time making the step smaller. Eventually, this will lead to the correct linear regression result.
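As a check (not part of the original solution), the exact least-squares values can be found with lm():
coef(lm(Petal.Width ~ Petal.Length, data = virginica))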
16.5.1.2 Manually optimize virginica sepal regression
We essentially follow the same approach as in the previous exercise, where we optimized the petal parameters manually.
First, load data and define \(x\) and \(y\):
data(iris)
virginica <- iris %>%
  filter(Species == "virginica")
x <- virginica$Sepal.Length
y <- virginica$Sepal.Width
We use the same sse() function as above:
Start with the values \((\beta_0, \beta_1) = (0, 0.5)\), with reasoning similar to before: \(\beta_0 = 0\) means that zero-length sepals are of zero width, and \(\beta_1 = 0.5\) means that 1 cm longer sepals are 0.5 cm wider. Both are reasonable starting points:
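The corresponding call (the code is not preserved in this extract, but the parameter values follow from the text and from the lines data frame defined below):
sse(0, 0.5)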
## [1] 10.575
Next, let's try to make \(\beta_1\) smaller, to see if it helps:
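Presumably:
sse(0, 0.4)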
## [1] 10.33
The new result, 10.33, is only a little better than the previous one, but it is still an improvement.
As a third attempt, we set \(\beta_0 = -0.1\) while keeping our previous best value \(\beta_1 = 0.4\):
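Presumably:
sse(-0.1, 0.4)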
## [1] 14.218
This resulted in a worse outcome than \(\beta_0 = 0\), hence let's try the other way around and increase \(\beta_0\) to 0.1:
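Presumably:
sse(0.1, 0.4)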
## [1] 7.442
Now we get a better result, so we can move even further in the positive direction:
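Presumably:
sse(0.2, 0.4)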
## [1] 5.554
We see an additional improvement, although smaller than earlier. Let’s continue:
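Presumably:
sse(0.3, 0.4)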
## [1] 4.666
Now we got \(SSE\) below 5!

Finally, plot data and all the attempts:
## Create data frame of the six attempted parameter combinations
lines <- data.frame(
  ## 'letters' is the built-in vector of lower-case letters
  attempt = letters[1:6],
  intercept = c(0, 0, -0.1, 0.1, 0.2, 0.3),
  slope = c(0.5, 0.4, 0.4, 0.4, 0.4, 0.4))
## add the attempted lines to the original scatterplot:
ggplot(virginica,
       aes(Sepal.Length, Sepal.Width)) +
  geom_jitter() +
  labs(x = "Sepal length (cm)",
       y = "Sepal width (cm)") +
  geom_abline(data = lines,
              aes(intercept = intercept,
                  slope = slope,
                  col = attempt))
We use the pre-defined variable letters to label the attempts; it is simply a vector of the 26 English lower-case letters.
16.5.1.3 Titanic pclass as number and categorical
First, use pclass as a number:
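The call, reconstructed from the Call: line of the output below:
summary(lm(survived ~ pclass + sex, data = titanic))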
##
## Call:
## lm(formula = survived ~ pclass + sex, data = titanic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8952 -0.2451 -0.0998 0.2501 0.9002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.04054 0.03369 30.89 <2e-16 ***
## pclass -0.14531 0.01313 -11.07 <2e-16 ***
## sexmale -0.50481 0.02297 -21.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3948 on 1306 degrees of freedom
## Multiple R-squared: 0.3413, Adjusted R-squared: 0.3403
## F-statistic: 338.3 on 2 and 1306 DF, p-value: < 2.2e-16
The intercept here means that 0-class women had survival probability 1.04. This is outside the range of possible probabilities, an artifact of linear probability models; but then we did not have any 0-class passengers either.
The pclass estimate, -0.145, means that passengers in a one-unit larger class number had a 14.5 percentage point lower probability of surviving.
Next, convert pclass to a factor:
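Again reconstructed from the Call: line below:
summary(lm(survived ~ factor(pclass) + sex, data = titanic))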
##
## Call:
## lm(formula = survived ~ factor(pclass) + sex, data = titanic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8990 -0.2364 -0.1015 0.2587 0.8985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.89898 0.02540 35.398 < 2e-16 ***
## factor(pclass)2 -0.15771 0.03237 -4.872 1.24e-06 ***
## factor(pclass)3 -0.29264 0.02671 -10.957 < 2e-16 ***
## sexmale -0.50487 0.02298 -21.974 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3949 on 1305 degrees of freedom
## Multiple R-squared: 0.3414, Adjusted R-squared: 0.3399
## F-statistic: 225.5 on 3 and 1305 DF, p-value: < 2.2e-16
Here the reference category is 1st-class females; the intercept tells us that their survival probability was 0.90. 2nd-class passengers had an about 16 percentage point lower probability of surviving, and 3rd-class passengers an about 29 percentage point lower one.
Treating pclass as a number and as a categorical variable gives fairly similar results here because, as it happens, the survival probability differs by roughly the same amount between 1st and 2nd class as between 2nd and 3rd class. In general, this is not the case, though.
16.5.1.4 Log age instead of age
This is just about replacing age with log(age) in the formula:
treatment <- read_delim("../data/treatment.csv.bz2")
treatment %>%
  filter(re78 > 0) %>%
  lm(log(re78) ~ educ + log(age),
     data = .) %>%
  summary()
##
## Call:
## lm(formula = log(re78) ~ educ + log(age), data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5702 -0.2372 0.1301 0.4290 2.1656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.097644 0.195998 31.11 <2e-16 ***
## educ 0.110124 0.005101 21.59 <2e-16 ***
## log(age) 0.688133 0.050924 13.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7399 on 2341 degrees of freedom
## Multiple R-squared: 0.1979, Adjusted R-squared: 0.1972
## F-statistic: 288.8 on 2 and 2341 DF, p-value: < 2.2e-16
Now \(R^2\) is 0.1979, while originally it was 0.1908; hence the log transformation improved the model a little.
16.6 Assessing model goodness
16.6.1 Confusion matrix
16.6.1.1 Treatment and age with base-R tools
The initial model
treatment <- read_delim("../data/treatment.csv.bz2")
m <- glm(treat ~ age, family = binomial(),
         data = treatment)
## predict the category: is participation probability above 0.5?
yhat <- predict(m, type = "response") > 0.5
## construct the confusion matrix
table(Predicted = yhat, Actual = treatment$treat)
## Actual
## Predicted FALSE TRUE
## FALSE 2490 185
The confusion matrix only contains a single row (labeled FALSE). This is because the model predicts that no-one participates.
Accuracy is simple:
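For instance (a sketch; the original call is not preserved):
mean(yhat == treatment$treat)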
## [1] 0.9308411
Recall is obviously 0: we do not identify any cases of actual participation. It can still be computed as
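A sketch: true positives divided by all actual positives:
sum(yhat & treatment$treat) / sum(treatment$treat)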
## [1] 0
But precision is undefined: 0 cases out of 0 predicted participations are correct, and 0/0 cannot be calculated.
Add more variables
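The exact specification is not preserved; a sketch with additional regressors (variable names assumed, not confirmed by this extract):
m <- glm(treat ~ age + educ + ethn + re74 + re75,
         family = binomial(), data = treatment)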
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
## predict the category as above
yhat <- predict(m, type = "response") > 0.5
## construct the confusion matrix
table(Predicted = yhat, Actual = treatment$treat)
## Actual
## Predicted FALSE TRUE
## FALSE 2469 63
## TRUE 21 122
Now the table has all four entries: we predict some persons to participate and some not to participate.
Accuracy:
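Presumably:
mean(yhat == treatment$treat)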
## [1] 0.9685981
Recall:
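Presumably:
sum(yhat & treatment$treat) / sum(treatment$treat)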
## [1] 0.6594595
Precision:
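Presumably true positives divided by all predicted positives:
sum(yhat & treatment$treat) / sum(yhat)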
## [1] 0.8531469
Precision is larger than recall: the model does not capture that many participants, but those persons it does predict to be participants are mostly correct.
16.6.1.2 Treatment and age with tidymodels
The initial model
library(tidymodels)
## outcome needs to be a factor
treatment <- treatment %>%
  mutate(y = factor(treat))
m <- logistic_reg() %>%
  fit(y ~ age, data = treatment)
yhat <- predict(m, new_data = treatment)
## combine actual and predicted values into a single data frame,
## calling them 'y' and 'yhat'
res <- yhat %>%
  cbind(y = treatment$y) %>%
  rename(yhat = .pred_class)
res %>%
  conf_mat(y, yhat)
## Truth
## Prediction FALSE TRUE
## FALSE 2490 185
## TRUE 0 0
Unlike the base-R table(), this confusion matrix contains both rows, for FALSE and TRUE, although the latter is filled with zeros, as the model predicts that no-one participates.
Accuracy can be computed as:
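Presumably with the yardstick accuracy() function:
res %>%
  accuracy(y, yhat)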
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.931
but it may be easier to use the previous approach. Recall:
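Presumably (assuming the event of interest, TRUE, is the second factor level, hence event_level = "second"):
res %>%
  recall(y, yhat, event_level = "second")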
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 recall binary 0
Precision is undefined, and will result in a warning and NA:
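Presumably:
res %>%
  precision(y, yhat, event_level = "second")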
## Warning: While computing binary `precision()`, no predicted events
## were detected (i.e. `true_positive + false_positive = 0`).
## Precision is undefined in this case, and `NA` will be
## returned.
## Note that 185 true event(s) actually occurred for the
## problematic event level, TRUE
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 precision binary NA
Add more variables
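A sketch with the same assumed additional regressors as in the base-R example:
m <- logistic_reg() %>%
  fit(y ~ age + educ + ethn + re74 + re75, data = treatment)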
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
yhat <- predict(m, new_data = treatment)
res <- yhat %>%
  cbind(y = treatment$y) %>%
  rename(yhat = .pred_class)
res %>%
  conf_mat(y, yhat)
## Truth
## Prediction FALSE TRUE
## FALSE 2469 63
## TRUE 21 122
Now all four cells of the table contain positive values.
Accuracy:
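Presumably:
res %>%
  accuracy(y, yhat)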
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.969
Alternatively, we can use the vector-oriented functions; these return plain values, not data frames. Here is an example for recall:
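A sketch using recall_vec():
recall_vec(res$y, res$yhat, event_level = "second")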
## [1] 0.6594595
Now precision is defined:
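Presumably:
res %>%
  precision(y, yhat, event_level = "second")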
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 precision binary 0.853
The computed values are, obviously, the same. The tidymodels approach is more suitable if you want to vary the models and manipulate the results as data frames. Otherwise, base R may be simpler, as it operates on vectors and returns plain values, not data frames.