Chapter 13 Assessing model goodness
13.1 Regression model performance measures
In Section 12.2 we discussed how to predict values based on linear regression with base-R modeling tools. Remember, the task there was to predict the distance of galaxies based on their radial velocity using Hubble data. Here we continue the same topic, but now the task is to compute model performance indicators, in particular RMSE and \(R^2\).
13.1.1 Base-R tools
First, load the Hubble data and fit the linear regression model:
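A minimal sketch (the file name and the loading call are assumptions; Rmodern and vModern are the modern distance and radial velocity measurements):
library(tidyverse)   # for read_delim() and %>%
hubble <- read_delim("hubble.csv")   # file name is an assumption
m <- lm(Rmodern ~ vModern, data = hubble)   # distance as a function of velocity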
Remember, RMSE is root mean squared error, where error stands for the prediction error, \(\hat y - y\). So we can calculate it as
y <- hubble$Rmodern # just to make it clear
yhat <- predict(m)
e <- yhat - y # error
RMSE <- sqrt(mean(e^2)) # root mean squared error
RMSE
## [1] 2.61722
Exercise 13.1 Use the predicted values above to manually compute \(R^2\): \[\begin{equation*} R^{2} = 1 - \frac{ \frac{1}{N} \sum_{i} (y_{i} - \hat y_{i})^{2} }{ \frac{1}{N} \sum_{i} (y_{i} - \bar y)^{2} } \end{equation*}\]
Compare your result with the regression output table: do you get the same value?
13.1.2 Tidymodels tools
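The fitted tidymodels model mf used below is the linear regression from Section 12.3; it can be re-created roughly as follows (a sketch; linear_reg() uses the lm engine by default):
library(tidymodels)
mf <- linear_reg() %>%
   fit(Rmodern ~ vModern, data = hubble)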
We can combine the original and predicted values into a data frame for easy calculation of model goodness metrics:
results <- predict(mf, new_data = hubble) %>%
   cbind(hubble) %>%
   select(velocity = vModern, distance = Rmodern, .pred)
head(results, 3)
## velocity distance .pred
## 1 158.1 0.06170 3.865956
## 2 278.0 0.04997 5.385886
## 3 -57.0 0.50000 1.139209
So now we have a data frame with three columns: velocity, the measured radial velocity; distance, the actual measured distance; and .pred, the predicted distance. One can easily compute RMSE and \(R^2\) using the dedicated functions:
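rmse(results, truth = distance, estimate = .pred)   # yardstick: data, truth, estimate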
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.62
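Similarly for \(R^2\):
rsq(results, truth = distance, estimate = .pred)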
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.819
rmse() and rsq() are model goodness measure functions that take three arguments: the dataset, and the column names for the actual and predicted values. (There are many more model goodness measures.) If interested, one can combine these two into a single multi-metric function using metric_set(), for instance:
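metrics <- metric_set(rmse, rsq)   # the name 'metrics' is arbitrary
results %>%
   metrics(truth = distance, estimate = .pred)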
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.62
## 2 rsq standard 0.819
This is handy if calculating the same metrics for multiple models over and over again.
13.2 Confusion matrix
The confusion matrix (CM) is one of the central tools for assessing classification quality. Below, we use the Titanic data to demonstrate these tools.
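The data can be loaded along these lines (the file name is an assumption):
titanic <- read_delim("titanic.csv")   # readr; file name is an assumption
dim(titanic)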
## [1] 1309 14

This section presents the CM in the following form; the positive cases are the “relevant” ones, marked in bold:
- Actual values in columns, predicted values in rows
- Relevant (“positive”) values second, “negative” values first.
The “relevant” cases are those for which we compute precision and recall, marked in bold in the figure. Sometimes it is obvious which category is the relevant one, but this is not always the case.
13.2.1 Base-R tools to display confusion matrix
In order to display the confusion matrix, you need to fit a model and predict the categories on the dataset (see Section 12.2.2):
y <- titanic$survived
m <- glm(survived ~ factor(pclass) + sex,
         family = binomial(),
         data = titanic)
phat <- predict(m, type = "response")  # predicted survival probability
yhat <- phat > 0.5  # predicted class: TRUE = survived
(Here we did it on training data, but that does not have to be the case.)
The confusion matrix is just a basic cross-table that can be generated using the table() function:
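table(yhat, y)   # predicted values in rows, actual values in columns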
## y
## yhat 0 1
## FALSE 682 161
## TRUE 127 339
Depending on your variable names it may be a bit unclear whether the actual values are in rows or in columns, so it is advisable to name the arguments:
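table(Predicted = yhat, Actual = y)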
## Actual
## Predicted 0 1
## FALSE 682 161
## TRUE 127 339
Base-R does not have dedicated functions to compute the CM-based metrics (but you can use other packages). In fact, manually computing accuracy, the percentage of correct predictions, is easier than loading a dedicated package. Accuracy is just
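mean(yhat == y)   # share of correct predictions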
## [1] 0.7799847
Computing precision is a bit more complex; here we just follow the definition:
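precision <- sum(yhat & (y == 1)) / sum(yhat)   # TP / (TP + FP)
precision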
## [1] 0.7274678
The same is true for recall:
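recall <- sum(yhat & (y == 1)) / sum(y == 1)   # TP / (TP + FN)
recall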
## [1] 0.678
The F-score is best computed from precision and recall:
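2 * precision * recall / (precision + recall)   # harmonic mean of the two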
## [1] 0.7018634
Exercise 13.2 Use treatment data and base-R tools to estimate a logistic regression model in the form \[\begin{equation*} \Pr(\mathit{treat}_{i} = \mathit{TRUE}) = \Lambda(\beta_{0} + \beta_{1} \cdot \mathit{age}_{i}). \end{equation*}\]
Construct the confusion matrix, and compute accuracy, precision, and recall. Explain your results.
Add more variables to your model: ethn, u74, u75, and re75. Repeat the above.
However, even if you use base-R modeling tools, there is nothing wrong with using dedicated functions from other libraries for precision, recall, and other related measures. Here is an example using tidymodels (see Section 13.2.2):
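# one possible way: convert both vectors to factors with levels "0"/"1"
library(yardstick)   # part of tidymodels
f_meas_vec(factor(y), factor(as.numeric(yhat)),
           event_level = "second")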
## [1] 0.7018634
This example uses the f_meas_vec() function from the yardstick package, a part of tidymodels. It expects two vectors, truth and estimate, both of which must be factors with the same levels; that is why we need to convert the predictions (logical) into numeric, and then both vectors into factors. By default it considers the first level (here “0”, died) the relevant one, so here we tell it to compute the F-score for the second level (“1”, survived). Anyway, as you see, the result is exactly the same.
13.2.2 tidymodels approach
Tidymodels (see Section 12.3) offers tools to build unified workflows for predictive modeling, and CM-related functionality is well integrated into that toolkit.
As with base-R tools, you need to fit a model and predict the outcome class in order to construct the CM:
y <- factor(titanic$survived) # for easier notation
mf <- logistic_reg() %>%
   fit(y ~ factor(pclass) + sex,
       data = titanic)
yhat <- predict(mf, new_data = titanic) # a tibble with column .pred_class
Tidymodels is built in such a way that it is advantageous to combine the actual and predicted values into a single data frame. You can add the predicted values to your original data, but here we create a new data frame. We also rename the precise but somewhat awkward column name .pred_class to yhat:
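results <- data.frame(y = y, yhat = yhat$.pred_class)   # one possible way to combine and rename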
The confusion matrix can be constructed with the function conf_mat(). It expects a data frame and two column names: the actual values first and the predicted values second:
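conf_mat(results, truth = y, estimate = yhat)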
## Truth
## Prediction 0 1
## 0 682 161
## 1 127 339
Note that this is the other way around from how the base-R table() behaves: there we supplied the predicted values first and the actual values second. But the CM itself is exactly the same as the one produced by the base-R table(); conf_mat() just automatically provides clear dimension names.
Accuracy, precision, and other measures can be queried with dedicated functions. The examples here all expect similar arguments: a data frame that contains both the actual and predicted values; the column names for the actual (first) and predicted (second) values; and which factor level is to be considered “relevant” (the first one, by default):
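accuracy(results, truth = y, estimate = yhat)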
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.780
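Precision, for the “relevant” level (here the second level, “1”, survived):
precision(results, truth = y, estimate = yhat, event_level = "second")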
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 precision binary 0.727
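Recall in the same fashion:
recall(results, truth = y, estimate = yhat, event_level = "second")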
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 recall binary 0.678
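And the F-score:
f_meas(results, truth = y, estimate = yhat, event_level = "second")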
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 f_meas binary 0.702
Note that all these functions return data frames, not single values!
There are also versions of these functions that expect two vectors (actual and predicted) instead of a data frame, for instance:
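recall_vec(results$y, results$yhat, event_level = "second")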
## [1] 0.678
Unlike the data frame version, the vector version of recall returns a single value.
Exercise 13.3 Repeat the exercise above with tidymodels tools: Use treatment data to estimate a logistic regression model in the form \[\begin{equation*} \Pr(\mathit{treat}_{i} = \mathit{TRUE}) = \Lambda(\beta_{0} + \beta_{1} \cdot \mathit{age}_{i}). \end{equation*}\]
Construct the confusion matrix, and compute accuracy, precision, and recall. Explain your results.
Add more variables to your model: ethn, u74, u75, and re75. Repeat the above.
Comment on the differences between the base-R and the tidymodels output.
If you are often computing a set of metrics, you may want to combine multiple functions into a single call using metric_set():
pr <- metric_set(precision, recall)
results %>%
   pr(truth = y, estimate = yhat, event_level = "second")
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 precision binary 0.727
## 2 recall binary 0.678
This will compute both precision and recall with a single call, and return a data frame with one row for each.
Finally, the summary() method for the CM provides all these metrics, and much more:
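conf_mat(results, truth = y, estimate = yhat) %>%
   summary(event_level = "second")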
## # A tibble: 13 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.780
## 2 kap binary 0.528
## 3 sens binary 0.678
## 4 spec binary 0.843
## 5 ppv binary 0.727
## 6 npv binary 0.809
## 7 mcc binary 0.529
## 8 j_index binary 0.521
## 9 bal_accuracy binary 0.761
## 10 detection_prevalence binary 0.356
## 11 precision binary 0.727
## 12 recall binary 0.678
## 13 f_meas binary 0.702
This may be good for a quick evaluation of your work, but if you are writing a report for others to read, you should focus on one or two relevant measures only.