Chapter 15 Solutions
15.1 R Language
15.1.1 Operators
15.1.1.1 Create vectors
Using operators and sequences:
v <- 1:5
v^2
## [1] 1 4 9 16 25
Or you can do it simply as
(1:5)^2
## [1] 1 4 9 16 25
Using c():
c(1, 4, 9, 16, 25)
## [1] 1 4 9 16 25
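As yet another option (a sketch): seq() generates the same base sequence as 1:5, and the ^ operator is vectorized, so it squares every element at once:
seq(1, 5)^2
## [1] 1 4 9 16 25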
15.1.1.2 Going out with friends
friends <- 7
budget <- 150
cat("I am going out with", friends, "friends\n")
## I am going out with 7 friends
mealprice <- 14
price <- mealprice*(friends + 1)
total <- price*1.15
cat("total price:", total, "\n")
## total price: 128.8
if(total <= budget) {
  cat("can afford\n")
} else {
  cat("cannot afford\n")
}
## can afford
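As a quick check of the other branch (a sketch with a hypothetical meal price of 20 instead of 14): the total would be 20*8*1.15 = 184, which exceeds the budget of 150, so the same if-statement prints the other message:
mealprice <- 20  # hypothetical, more expensive meal
total <- mealprice*(friends + 1)*1.15
if(total <= budget) {
  cat("can afford\n")
} else {
  cat("cannot afford\n")
}
## cannot afford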
15.2 Pipes and dplyr
15.2.2 dplyr
15.2.2.1 Select Titanic variables
## one way
titanic %>%
  select(pclass, survived, sex, age) %>%
  head(2)
## pclass survived sex age
## 1 1 1 female 29.0000
## 2 1 1 male 0.9167
## another way
titanic %>%
  select(pclass:age & !name) %>%
  head(2)
## pclass survived sex age
## 1 1 1 female 29.0000
## 2 1 1 male 0.9167
Note the combination of two criteria with logical and (&) in the second example.
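The same kind of combination works with other selection helpers too; a minimal sketch on the built-in iris data, selecting the petal columns except the length:
iris %>%
  select(starts_with("Petal") & !ends_with("Length")) %>%
  head(2)
##   Petal.Width
## 1         0.2
## 2         0.2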
For the following answers we need the subset of the babynames data:
babynames <- babynames::babynames %>%
  select(-prop) %>%
  filter(between(year, 1995, 2003))
15.2.2.2 Longest name
We can just compute the length of the names, arrange by the length in descending order, and pick the first few names:
babynames %>%
  mutate(len = nchar(name)) %>%
  group_by(sex) %>%
  arrange(desc(nchar(name)), .by_group=TRUE) %>%
  slice_head(n=3)
## # A tibble: 6 × 5
## # Groups: sex [2]
## year sex name n len
## <dbl> <chr> <chr> <int> <int>
## 1 1997 F Hannahelizabeth 5 15
## 2 1995 F Mariaguadalupe 21 14
## 3 1995 F Mariadelcarmen 11 14
## 4 1995 M Johnchristopher 13 15
## 5 1995 M Christopherjame 9 15
## 6 1995 M Franciscojavier 8 15
This prints the longest names for both boys and girls.
Alternatively, we can use min_rank to implicitly order the names, and pick the names with rank 1:
babynames %>%
  group_by(sex) %>%
  filter(min_rank(desc(nchar(name))) == 1)
## # A tibble: 48 × 4
## # Groups: sex [2]
## year sex name n
## <dbl> <chr> <chr> <int>
## 1 1995 M Johnchristopher 13
## 2 1995 M Christopherjame 9
## 3 1995 M Franciscojavier 8
## 4 1995 M Christiandaniel 7
## 5 1995 M Christopherjohn 6
## 6 1995 M Jonathanmichael 6
## 7 1996 M Franciscojavier 13
## 8 1996 M Jonathanmichael 9
## 9 1996 M Christopherjohn 8
## 10 1996 M Christophermich 8
## # ℹ 38 more rows
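The reason we get several names per sex is that min_rank() assigns the same (minimal) rank to ties, so every name tied for the maximal length gets rank 1. A minimal sketch with a toy vector:
min_rank(desc(c(3, 5, 5, 1)))
## [1] 3 1 1 4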
We can see that there are quite a few 15-character long names, apparently compound names. But we also see another problem: names are counted multiple times if they appear in multiple years. In order to solve that problem, we may group by name and sex, and sum over n, before we check the name length:
babynames %>%
  group_by(name, sex) %>%
  summarize(n = sum(n)) %>%
  group_by(sex) %>%
  filter(min_rank(desc(nchar(name))) == 1)
## # A tibble: 19 × 3
## # Groups: sex [2]
## name sex n
## <chr> <chr> <int>
## 1 Christiananthon M 12
## 2 Christiandaniel M 7
## 3 Christianjoseph M 10
## 4 Christianjoshua M 5
## 5 Christianmichae M 6
## 6 Christopheranth M 6
## 7 Christopherjame M 45
## 8 Christopherjohn M 39
## 9 Christopherjose M 12
## 10 Christophermich M 16
## 11 Christopherryan M 5
## 12 Davidchristophe M 5
## 13 Franciscojavier M 52
## 14 Hannahelizabeth F 5
## 15 Johnchristopher M 39
## 16 Jonathanmichael M 20
## 17 Jordanchristoph M 5
## 18 Matthewalexande M 11
## 19 Ryanchristopher M 22
This table lists 19 names, all of them 15 letters long. Interestingly, only one of these was given to girls. It appears that 15 letters is the maximum name length in this dataset: e.g. Christianmichae is probably a truncated Christianmichael, or more likely Christian Michael.
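We can also check the apparent 15-character limit directly by computing the longest name length in our subset (a sketch, output not shown); it should return 15:
babynames %>%
  summarize(longest = max(nchar(name)))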
15.3 Cleaning data
15.3.1 Missing values
15.3.1.1 Use na_if and filtering
We use na_if to convert empty strings to NA-s, and thereafter remove all explicit missings:
titanic %>%
  mutate(boat = na_if(boat, "")) %>%
  filter(!is.na(boat)) %>%
  nrow()
## [1] 486
This is exactly the same number as in Section 6.1.2.
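The same count can also be obtained without na_if, by filtering out the empty strings directly (a sketch, assuming the missing boat values are all coded as empty strings; output not shown):
titanic %>%
  filter(boat != "") %>%
  nrow()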
15.3.1.2 Replace missing value in ice extent data
First (re)create the data:
extent <- data.frame(month=9:12, extent=c(7.28, 9.05, 11.22, -9999))
extent
## month extent
## 1 9 7.28
## 2 10 9.05
## 3 11 11.22
## 4 12 -9999.00
- Obviously, missing is coded as -9999. As extent is a measure of area, it cannot be negative, so even if the documentation is missing, we can conclude that the value is invalid and should be replaced.
- Convert it to NA:
extent %>%
  mutate(extent = na_if(extent, -9999))
## month extent
## 1 9 7.28
## 2 10 9.05
## 3 11 11.22
## 4 12 NA
- Use the previous value with tidyr::fill. Previous values are used with .direction = "down":
extent %>%
  mutate(extent = na_if(extent, -9999)) %>%
  fill(extent, .direction="down")
## month extent
## 1 9 7.28
## 2 10 9.05
## 3 11 11.22
## 4 12 11.22
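For comparison (a sketch, output not shown), .direction = "up" would fill with the next non-missing value instead; as December is the last row here, its NA would remain unfilled:
extent %>%
  mutate(extent = na_if(extent, -9999)) %>%
  fill(extent, .direction="up")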
15.4 ggplot
15.4.0.1 Orange trees as scatterplots
data(Orange)
Orange %>%
  ggplot(aes(age, circumference, col=Tree)) +
  geom_point()
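A possible refinement (a sketch): adding geom_line() connects the measurements of each tree, which makes the individual growth curves easier to follow:
Orange %>%
  ggplot(aes(age, circumference, col=Tree)) +
  geom_point() +
  geom_line()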
15.5 Regression models
15.5.1 Linear Regression
15.5.1.1 Manually find the best virginica petal parameters
Load the data:
data(iris)
virginica <- iris %>%
  filter(Species == "virginica")
x <- virginica$Petal.Length
y <- virginica$Petal.Width
Define the function:
sse <- function(b0, b1) {
  yHat <- b0 + b1*x
  sum((y - yHat)^2)
}
Now start from the best values found above, \((-0.1, 0.4)\), and manipulate \(\beta_1\):
sse(-0.1, 0.4)
## [1] 4.6196
sse(-0.1, 0.3)
## [1] 14.2024
sse(-0.1, 0.5)
## [1] 26.16
Out of these values, \((-0.1, 0.4)\) is the best combination. Now manipulate \(\beta_0\):
sse(-0.2, 0.4)
## [1] 4.1716
sse(-0.3, 0.4)
## [1] 4.7236
The best value is \((-0.2, 0.4)\). Next, manipulate \(\beta_1\) again, this time in smaller steps:
sse(-0.2, 0.39)
## [1] 4.284536
sse(-0.2, 0.41)
## [1] 4.369896
We cannot beat \((-0.2, 0.40)\). Hence change \(\beta_0\) again:
sse(-0.19, 0.40)
## [1] 4.1714
sse(-0.21, 0.40)
## [1] 4.1818
sse(-0.18, 0.40)
## [1] 4.1812
Out of these, \((-0.19, 0.40)\) is the best option.
We can keep going like this, each time making the steps smaller. Eventually this will lead to the correct linear regression result.
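For comparison (a sketch, output not shown): lm() finds the exact least-squares coefficients in one step, and they should be close to the values we arrived at by hand, roughly \((-0.19, 0.40)\):
lm(Petal.Width ~ Petal.Length, data = virginica) %>%
  coef()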
15.5.1.2 Manually optimize virginica sepal regression
We essentially follow the same approach as in the previous exercise, where we optimized the petal parameters manually.
First, load data and define \(x\) and \(y\):
data(iris)
virginica <- iris %>%
  filter(Species == "virginica")
x <- virginica$Sepal.Length
y <- virginica$Sepal.Width
We use the same sse function as above:
sse <- function(b0, b1) {
  yHat <- b0 + b1*x
  sum((y - yHat)^2)
}
Start with the values \((\beta_0, \beta_1) = (0, 0.5)\), with a similar reasoning: \(\beta_0 = 0\) means zero-length leaves are of zero width, and \(\beta_1 = 0.5\) means that 1 cm longer leaves are 0.5 cm wider. Both are reasonable starting points:
sse(0, 0.5) # a
## [1] 10.575
Next, let’s try to make \(\beta_1\) smaller and see if it helps:
sse(0, 0.4) # b
## [1] 10.33
The new result, 10.33, is only a little better than the previous one, but it is an improvement nevertheless.
As a third attempt, we set \(\beta_0 = -0.1\) while keeping our previous best value \(\beta_1 = 0.4\):
sse(-0.1, 0.4) # c
## [1] 14.218
This resulted in a worse outcome than \(\beta_0 = 0\), hence let’s try the other way around and increase \(\beta_0\) to 0.1:
sse(0.1, 0.4) # d
## [1] 7.442
Now we get a better result, so we can move even further into the positive direction:
sse(0.2, 0.4) # e
## [1] 5.554
We see an additional improvement, although smaller than earlier. Let’s continue:
sse(0.3, 0.4) # f
## [1] 4.666
Now we got \(SSE\) below 5!
Finally, plot data and all the attempts:
## Create data frame of the six attempted parameters
## 'letters' is the vector of lower-case letters
lines <- data.frame(
  attempt = letters[1:6],
  intercept = c(0, 0, -0.1, 0.1, 0.2, 0.3),
  slope = c(0.5, 0.4, 0.4, 0.4, 0.4, 0.4))
## add the predicted lines to the original scatterplot:
ggplot(virginica,
       aes(Sepal.Length, Sepal.Width)) +
  geom_jitter() +
  labs(x = "Sepal length (cm)",
       y = "Sepal width (cm)") +
  geom_abline(data=lines,
              aes(intercept=intercept,
                  slope=slope,
                  col=attempt))
We use the pre-defined variable letters to label the attempts; it is simply a vector of the 26 English lower-case letters.
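For instance:
letters[1:6]
## [1] "a" "b" "c" "d" "e" "f"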
15.5.1.3 Titanic pclass as number and categorical
First, use pclass as a number:
lm(survived ~ pclass + sex, data = titanic) %>%
summary()
##
## Call:
## lm(formula = survived ~ pclass + sex, data = titanic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8952 -0.2451 -0.0998 0.2501 0.9002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.04054 0.03369 30.89 <2e-16 ***
## pclass -0.14531 0.01313 -11.07 <2e-16 ***
## sexmale -0.50481 0.02297 -21.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3948 on 1306 degrees of freedom
## Multiple R-squared: 0.3413, Adjusted R-squared: 0.3403
## F-statistic: 338.3 on 2 and 1306 DF, p-value: < 2.2e-16
The intercept here means that 0-class women had survival probability 1.04. This is outside the range of possible probabilities, an artifact of linear probability models; but then, we also did not have any 0-class passengers.
The pclass estimate of -0.14 means that passengers in a one unit larger class had a 14 pct point lower probability of surviving.
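For instance, using the reported coefficients, the predicted survival probability of a 2nd-class woman is the intercept plus two times the pclass effect:
## predicted survival probability for a 2nd-class female
1.04054 - 0.14531*2
## [1] 0.74992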
Next, convert pclass to a factor:
lm(survived ~ factor(pclass) + sex,
   data = titanic) %>%
  summary()
##
## Call:
## lm(formula = survived ~ factor(pclass) + sex, data = titanic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8990 -0.2364 -0.1015 0.2587 0.8985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.89898 0.02540 35.398 < 2e-16 ***
## factor(pclass)2 -0.15771 0.03237 -4.872 1.24e-06 ***
## factor(pclass)3 -0.29264 0.02671 -10.957 < 2e-16 ***
## sexmale -0.50487 0.02298 -21.974 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3949 on 1305 degrees of freedom
## Multiple R-squared: 0.3414, Adjusted R-squared: 0.3399
## F-statistic: 225.5 on 3 and 1305 DF, p-value: < 2.2e-16
Here the reference category is 1st-class females; the intercept tells us that their survival probability was 0.90. 2nd-class passengers had a 15 pct pt lower probability of surviving, and 3rd-class passengers a 29 pct pt lower probability.
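For instance, the predicted survival probability of a 3rd-class male is the intercept plus the 3rd-class and male effects:
## predicted survival probability for a 3rd-class male
0.89898 - 0.29264 - 0.50487
## [1] 0.10147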
Treating pclass as a number and as a categorical variable gives fairly similar results here because, for whatever reason, the survival probability differs by about the same amount between 1st and 2nd class and between 2nd and 3rd class passengers. In general, though, this is not the case.
15.5.1.4 Log age instead of age
This is just about replacing age with log(age) in the formula:
treatment <- read_delim("../data/treatment.csv.bz2")
treatment %>%
  filter(re78 > 0) %>%
  lm(log(re78) ~ educ + log(age),
     data = .) %>%
  summary()
##
## Call:
## lm(formula = log(re78) ~ educ + log(age), data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5702 -0.2372 0.1301 0.4290 2.1656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.097644 0.195998 31.11 <2e-16 ***
## educ 0.110124 0.005101 21.59 <2e-16 ***
## log(age) 0.688133 0.050924 13.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7399 on 2341 degrees of freedom
## Multiple R-squared: 0.1979, Adjusted R-squared: 0.1972
## F-statistic: 288.8 on 2 and 2341 DF, p-value: < 2.2e-16
Now the \(R^2\) is 0.1979, while originally it was 0.1908. Hence using log(age) improved the model a little bit.
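For reference (a sketch, output not shown), the original specification that this is compared to used age directly instead of log(age):
treatment %>%
  filter(re78 > 0) %>%
  lm(log(re78) ~ educ + age,
     data = .) %>%
  summary()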