Chapter 13 Assessing model goodness

13.1 Regression model performance measures

In Section 12.2 we discussed how to predict values based on linear regression with base-R modeling tools. Remember, the task there was to predict the distance of galaxies based on their radial velocity using Hubble data. Here we continue with the same topic, but now the task is to compute model performance indicators, in particular RMSE and \(R^2\).

13.1.1 Base-R tools

First, load the Hubble data and fit the linear regression model:

library(readr)  # provides read_delim()
hubble <- read_delim("../data/hubble.csv")
m <- lm(Rmodern ~ vModern, data = hubble)

Remember, RMSE is root mean squared error, where error stands for the prediction error, \(\hat y - y\). So we can calculate it as

y <- hubble$Rmodern  # just to make it clear
yhat <- predict(m)
e <- yhat - y  # error
RMSE <- sqrt(mean(e^2))  # root-mean-squared error
RMSE
## [1] 2.61722
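
Note that RMSE divides the sum of squared errors by the number of observations \(N\). The related residual standard error that base-R reports divides by the residual degrees of freedom instead, so the two values differ slightly. A quick check, using the model m from above:

sqrt(mean(e^2))  # RMSE: divide by N
sigma(m)  # residual standard error: divide by N - 2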

Exercise 13.1 Use the predicted values above to manually compute \(R^2\): \[\begin{equation*} R^{2} = 1 - \frac{ \frac{1}{N} \sum_{i} (y_{i} - \hat y_{i})^{2} }{ \frac{1}{N} \sum_{i} (y_{i} - \bar y)^{2} } \end{equation*}\]

Compare your result with the regression output table: do you get the same value?

13.1.2 Tidymodels tools

Let's now compute the same measures with tidymodels tools. First, load the library and fit the model:

library(tidymodels)
mf <- linear_reg() %>%
   fit(Rmodern ~ vModern, data = hubble)

We can combine the original and predicted values into a data frame for easy calculation of model goodness metrics:

results <- predict(mf, new_data = hubble) %>%
   cbind(hubble) %>%
   select(velocity = vModern, distance = Rmodern, .pred)
head(results, 3)
##   velocity distance    .pred
## 1    158.1  0.06170 3.865956
## 2    278.0  0.04997 5.385886
## 3    -57.0  0.50000 1.139209

So now we have a data frame with three columns: velocity, the measured radial velocity; distance, the actual measured distance; and .pred, the predicted distance. One can easily compute RMSE and \(R^2\) using the dedicated functions:

results %>%
   rmse(distance, .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        2.62
results %>%
   rsq(distance, .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.819

rmse() and rsq() are model goodness functions that take three arguments: the data frame, the column with the actual values, and the column with the predicted values. (There are many more such model goodness measures.)
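
For instance, the mean absolute error follows exactly the same pattern (a quick illustration with yardstick's mae()):

results %>%
   mae(distance, .pred)  # mean absolute error: same argument pattern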

If interested, one can combine rmse() and rsq() into a single multi-metric function using metric_set():

goodness <- metric_set(rmse, rsq)
results %>%
   goodness(distance, .pred)
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       2.62 
## 2 rsq     standard       0.819

This is handy when calculating the same metrics for multiple models over and over again.
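
For instance, a second model can be evaluated with the exact same metric set. A sketch, where the quadratic specification is purely for illustration:

mf2 <- linear_reg() %>%
   fit(Rmodern ~ poly(vModern, 2), data = hubble)  # hypothetical second model
predict(mf2, new_data = hubble) %>%
   cbind(hubble) %>%
   goodness(Rmodern, .pred)  # the same metric set as above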

13.2 Confusion matrix

The confusion matrix (CM) is one of the central tools for assessing classification quality. Below, we use the Titanic data to demonstrate the related tools.

titanic <- read_delim("../data/titanic.csv.bz2")
dim(titanic)
## [1] 1309   14

[Figure: the standard form of the confusion matrix as used in this book; the relevant ("positive") cases are marked in bold.]

This section presents the CM in the following form:
  • Actual values in columns, predicted values in rows.
  • Relevant ("positive") values second, "negative" values first.

"Relevant" are those cases for which we compute precision and recall, marked in bold in the figure. Sometimes the relevant category is obvious, but this is not always the case.

13.2.1 Base-R tools to display confusion matrix

In order to display the confusion matrix, you need to fit a model and predict the categories on a dataset (see Section 12.2.2):

y <- titanic$survived
m <- glm(survived ~ factor(pclass) + sex,
         family = binomial(),
         data = titanic)
phat <- predict(m, type = "response")  # predicted probabilities
yhat <- phat > 0.5  # predict survival if probability above 0.5

(Here we predicted on the training data, but that does not have to be the case.)
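
For instance, one might fit the model on a training subset and predict on held-out data instead. A minimal sketch, assuming a random 80/20 split (below we nevertheless continue with the full-data predictions yhat from above):

i <- sample(nrow(titanic), round(0.8*nrow(titanic)))  # training row indices
mTrain <- glm(survived ~ factor(pclass) + sex,
              family = binomial(),
              data = titanic[i,])  # fit on training rows only
phatTest <- predict(mTrain, newdata = titanic[-i,],
                    type = "response")  # predict on held-out rows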

The confusion matrix is just a basic cross-table that can be generated using the table() function:

table(yhat, y)  # table(): first argument in rows, second in columns
##        y
## yhat      0   1
##   FALSE 682 161
##   TRUE  127 339

Depending on your variable names, it may be a bit unclear whether the actual values are in rows or in columns, so it is advisable to name the arguments:

table(Predicted = yhat, Actual = y)
##          Actual
## Predicted   0   1
##     FALSE 682 161
##     TRUE  127 339

Base-R does not have dedicated functions to compute the CM-based metrics (but other packages do). In fact, manually computing accuracy, the percentage of correct predictions, is easier than loading a dedicated package. Accuracy is just

mean(yhat == y)
## [1] 0.7799847

Computing precision is a bit more complex; here we just follow the definition, the share of true positives among the predicted positives:

Pr <- sum(yhat == 1 & y == 1)/sum(yhat == 1)
Pr
## [1] 0.7274678

The same approach works for recall, the share of true positives among the actual positives:

Re <- sum(yhat == 1 & y == 1)/sum(y == 1)
Re
## [1] 0.678

The F-score is easiest to compute from precision and recall, as their harmonic mean:

2/(1/Pr + 1/Re)
## [1] 0.7018634
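
Equivalently, the harmonic mean can be written as \(2 \cdot Pr \cdot Re / (Pr + Re)\), which gives exactly the same number:

2*Pr*Re/(Pr + Re)  # the same F-score, written differently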

Exercise 13.2 Use treatment data and base-R tools to estimate a logistic regression model in the form \[\begin{equation*} \Pr(\mathit{treat}_{i} = \mathit{TRUE}) = \Lambda(\beta_{0} + \beta_{1} \cdot \mathit{age}_{i}). \end{equation*}\]

Construct the confusion matrix, and compute accuracy, precision, and recall. Explain your results.

Add more variables to your model: ethn, u74, u75, and re75. Repeat the above.

The solution

However, even if you use base-R modeling tools, there is nothing wrong with using dedicated functions from other libraries for precision, recall, and other related measures. Here is an example using tidymodels (see Section 13.2.2):

yardstick::f_meas_vec(factor(y), factor(as.numeric(yhat)),
                      event_level = "second")
## [1] 0.7018634

This example uses the f_meas_vec() function from the yardstick package, a part of tidymodels. It expects two vectors, truth and estimate, both of which must be factors with the same levels. That is why we first convert the logical predictions into numbers, so that the resulting factor has the levels "0" and "1", matching those of y. By default, the function considers the first level (here "0", died) the relevant one; with event_level = "second" we tell it to compute the F-score for the second level ("1", survived). As you can see, the result is exactly the same.

13.2.2 tidymodels approach

Tidymodels (see Section 12.3) offers tools to build unified workflows for predictive modeling, and CM-related functionality is well integrated into that toolkit.

library(tidymodels)

As in the case of base-R tools, you need to fit a model and predict the outcome class in order to construct the CM:

y <- factor(titanic$survived)  # for easier notation
mf <- logistic_reg() %>%
   fit(y ~ factor(pclass) + sex,
       data = titanic)
yhat <- predict(mf, new_data = titanic)

Tidymodels is built in a way that makes it advantageous to combine the actual and predicted values into a single data frame. You can add the predicted values to your original data, but here we create a new data frame. We also rename the precise but somewhat awkward column name .pred_class to yhat:

results <- cbind(yhat, y) %>%
   rename(yhat = .pred_class)

The confusion matrix can be constructed with the function conf_mat(). It expects a data frame and two column names: the actual values first and the predicted values second:

results %>%
   conf_mat(y, yhat) # conf_mat: actual first, predicted second
##           Truth
## Prediction   0   1
##          0 682 161
##          1 127 339

Note that this is the opposite of the order we used with base-R table() above, where we gave the predicted values first and the actual values second. The CM itself is exactly the same as with base-R table(); conf_mat() just automatically provides clear dimension names.

Accuracy, precision, and other measures can be queried with dedicated functions. The examples here all expect similar arguments: a data frame that contains both the actual and predicted values; the column names for the actual (first) and predicted (second) values; and the factor level to be considered "relevant" (the first level, by default):

results %>%
   accuracy(y, yhat, event_level = "second")
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.780
results %>%
   precision(y, yhat, event_level = "second")
## # A tibble: 1 × 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 precision binary         0.727
results %>%
   recall(y, yhat, event_level = "second")
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 recall  binary         0.678
results %>%
   f_meas(y, yhat, event_level = "second")
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 f_meas  binary         0.702

Note that all these functions return data frames, not single values!
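
If you need the bare number, e.g. for further computations, you can extract it from the .estimate column, for instance with dplyr's pull() (a small sketch):

results %>%
   accuracy(y, yhat, event_level = "second") %>%
   pull(.estimate)  # just the accuracy value, not a data frame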

There are also versions of the functions that expect two vectors (actual and predicted) instead of a data frame, for instance

recall_vec(y, yhat$.pred_class, event_level = "second")
## [1] 0.678

Unlike the data frame version, the vector version of recall returns a single value.

Exercise 13.3 Repeat the exercise above with tidymodels tools: Use treatment data to estimate a logistic regression model in the form \[\begin{equation*} \Pr(\mathit{treat}_{i} = \mathit{TRUE}) = \Lambda(\beta_{0} + \beta_{1} \cdot \mathit{age}_{i}). \end{equation*}\]

Construct the confusion matrix, and compute accuracy, precision, and recall. Explain your results.

Add more variables to your model: ethn, u74, u75, and re75. Repeat the above.

Comment on how the output differs between the base-R and tidymodels tools.

The solution

If you are often computing a set of metrics, you may want to combine multiple functions into a single call using metric_set():

pr <- metric_set(precision, recall)
results %>%
   pr(truth = y, estimate = yhat, event_level = "second")
## # A tibble: 2 × 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 precision binary         0.727
## 2 recall    binary         0.678

This will compute both precision and recall with a single call, and return a data frame with one row for each.
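
The set is not limited to two metrics. For instance, one might include the F-score as well (a sketch along the same lines):

prf <- metric_set(precision, recall, f_meas)
results %>%
   prf(truth = y, estimate = yhat, event_level = "second")  # three rows now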

Finally, the summary() method for the CM provides all these metrics, and much more:

results %>%
   conf_mat(y, yhat) %>%
   summary(event_level = "second")
## # A tibble: 13 × 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.780
##  2 kap                  binary         0.528
##  3 sens                 binary         0.678
##  4 spec                 binary         0.843
##  5 ppv                  binary         0.727
##  6 npv                  binary         0.809
##  7 mcc                  binary         0.529
##  8 j_index              binary         0.521
##  9 bal_accuracy         binary         0.761
## 10 detection_prevalence binary         0.356
## 11 precision            binary         0.727
## 12 recall               binary         0.678
## 13 f_meas               binary         0.702

This may be good for a quick evaluation of your work, but if you are writing a report for others to read, you should focus on one or two relevant measures only.
