Chapter 12 Prediction and Predictive Modeling
Up to now, we have been discussing inferential modeling. The main purpose of that approach is to learn something about the data and the world behind it. Now our focus shifts to predictive modeling: models that predict the data as well as possible. Whether these models also tell us something about the deeper relationships is of secondary importance.
12.1 Manually predicting regression outcomes
Manual prediction is usually not needed. It is useful, though, for better understanding the underlying models, and sometimes for debugging.
12.1.1 Linear regression
Linear regression predictions are fairly simple to reproduce manually.
12.1.1.1 Hubble data example
Let’s use the example of Hubble data, but this time predict distance based on velocity, and use the modern estimates Rmodern and vModern.
The distance-velocity relationship is shown here.
Below, we are using different modeling tools to make predictions, based on this relationship, formally written as \[\begin{equation} R_i^{\mathit{Modern}} = \beta_0 + \beta_1 \cdot v_i^{\mathit{Modern}} + \epsilon_i. \end{equation}\]
Note that this is technically a very easy case, as the model does not include any categorical variables. If those are needed, we must first convert the categorical variables to dummies and use the dummies instead.
12.1.1.2 Doing predictions
Before we can start with predictions, we need to find the intercept \(\beta_0\) and slope \(\beta_1\). This is best done with the lm() function of base-R (see the section on OLS modeling with lm()):
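A minimal sketch of the fitting code (the data frame name hubble and the distance column Rmodern are assumptions here):

m <- lm(Rmodern ~ vModern, data = hubble)  # fit distance on velocity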
The estimated parameter values are
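coef(m)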
## (Intercept) vModern
## 1.86177763 0.01267665
For simplicity, let’s extract the intercept and slope, and call those b0 and b1 respectively:
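b0 <- coef(m)[1]  # intercept
b1 <- coef(m)[2]  # slope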
Finally, we can just plug these values in to predict distance \(\hat R\): \[\begin{equation} \hat{R} = \hat{\beta}_0 + \hat{\beta}_1 \cdot v \end{equation}\] where \(\hat{\beta}_0\) and \(\hat{\beta}_1\) denote the coefficients we calculated above. For instance, for the first 3 galaxies in the dataset, we can compute the predicted distance as
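b0 + b1 * hubble$vModern[1:3]  # predicted distance, first 3 galaxies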
## [1] 3.865956 5.385886 1.139209
Alternatively, if we have velocity measurements 500, 1000, and 1500 km/s, we can compute the predicted distance as
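v <- c(500, 1000, 1500)  # velocities in km/s
b0 + b1 * v              # predicted distances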
## [1] 8.200102 14.538426 20.876751
12.1.2 Logistic regression
In the case of logistic regression, we can predict both the probability of the outcome (called \(\hat P\) or phat below) and the outcome category (called \(\hat y\) or yhat below). Logistic regression is somewhat more complex to work with manually.
12.1.2.1 Titanic data example
Below we do an example, using Titanic data. We model survival as a function of age only, as the other two main predictors (sex and pclass) are categorical and hence hard to handle in manual work. So we are fitting a logistic regression model \[\begin{equation} \Pr(S = 1) = \Lambda \big( {\beta}_0 + {\beta_1} \cdot \mathit{age} \big) = \Lambda(\eta) \end{equation}\] where \[\begin{equation} \eta = {\beta}_0 + {\beta_1} \cdot \mathit{age} \end{equation}\] is the linear predictor (the link), \[\begin{equation} \Lambda(\eta) = \frac{1}{1 + \exp(-\eta)} \end{equation}\] is the logistic function (aka sigmoid function), and \(S = 1\) denotes that the passenger survived.
First we fit the model:
library(tidyverse)  # for read_delim() and the pipe
titanic <- read_delim("../data/titanic.csv.bz2") %>%
   filter(!is.na(age))  # many age values are missing
m <- glm(survived ~ age, family = binomial, data = titanic)
This command computes the best \(\beta_0\) and \(\beta_1\); we can display them with coef():
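coef(m)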
## (Intercept) age
## -0.136531206 -0.007898598
12.1.2.2 Predicting probabilities
From the above, we can estimate the survival probability of passengers as follows:
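A sketch of the manual computation, here for the first three passengers:

b <- coef(m)                           # intercept and slope
eta <- b[1] + b[2] * titanic$age[1:3]  # linear predictor (link)
phat <- 1/(1 + exp(-eta))              # logistic transformation
phat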
## [1] 0.4096069 0.4641188 0.4619914
All these numbers are less than 0.5, so all of these passengers were predicted more likely to die than to survive.
Here we should stop for a moment and ask: how do we know that the probabilities above are survival probabilities, not death probabilities? This is related to how the outcome variable survived is coded. Here it is coded as “0” for death and “1” for survival, and glm will predict the probability of “1”. If the response variable is coded as text, the values are converted to categorical (using the factor() function); this normally makes the alphabetically first value the “failure” and the second one the “success”. For instance, if instead of “0” and “1” the codes were “no boat” and “boat” (for drowning and surviving), then “boat” would be considered failure and “no boat” success. See ?glm and ?factor.
If we need to predict probabilities for passengers who were, for instance, 10, 30, and 50 years old, we can do it as follows:
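age <- c(10, 30, 50)
eta <- b[1] + b[2] * age   # linear predictor for the new ages
phat <- 1/(1 + exp(-eta))  # predicted survival probabilities
phat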
## [1] 0.4463283 0.4076982 0.3701762
12.1.2.3 Predicting categories
Sometimes probabilities are all we want, but other times we want a “yes” or “no” answer–was someone predicted to survive or die?
The basics are quite easy: we normally predict “success” if the probability of success is larger than the probability of failure. Here, in the case of only two categories, this is equivalent to predicting success if the predicted \(\Pr(S = 1) > 0.5\):
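phat > 0.5  # TRUE = predicted to survive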
## [1] FALSE FALSE FALSE
We predicted “FALSE” for all these example persons, hence all of them were predicted to die.
Again, you need to be aware of what “success” and “failure” mean here. If you want to recover the original categories, I’d recommend writing it like this:
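ifelse(phat > 0.5, 1, 0)  # back to the original 0/1 coding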
## [1] 0 0 0
12.2 Base R modeling tools
Base-R offers a pretty good modeling framework. The central tool for predicting model outcomes is the predict method. It has two important arguments: the fitted model and, optionally, the new data. There may be more arguments for particular types of models; e.g., in the case of logistic regression it is possible to predict more than just the outcome probability.
12.2.1 Linear regression
Here we replicate the example from Section 12.1.1.2 using the base-R modeling tools. First fit the linear regression model:
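As above, the data frame name hubble and the distance column Rmodern are assumptions:

m <- lm(Rmodern ~ vModern, data = hubble)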
The central tool for making predictions from a fitted model is the function predict(). Its first argument is the model. For instance, we can get the first 3 predicted distances as
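predict(m)[1:3]  # predictions on the original data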
## 1 2 3
## 3.865956 5.385886 1.139209
You can check that the results are exactly the same as in Section 12.1.1.2.
Alternatively, if we want to predict the distance to galaxies that are not in the original dataset we used for fitting, we can create a new data frame with these measurements, and supply it as an additional argument newdata to the predict function. For instance,
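newData <- data.frame(vModern = c(500, 1000, 1500))  # assumed name for the new data frame
predict(m, newdata = newData)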
## 1 2 3
## 8.200102 14.538426 20.876751
Again, these results are exactly the same as in Section 12.1.1.2.
Note that the new data frame must contain all variables that are used in the model (here just vModern). It may contain other columns too; those are just ignored.
12.2.2 Logistic regression
With base-R modeling tools, one can use predict() in a fairly similar fashion as in the case of linear regression. However, now we have to tell predict() what exactly to predict. The default option, predicting the link \(\eta\), is of little use. So to predict the survival probability for everyone on Titanic, you can use
m <- glm(survived ~ age, family = binomial, data = titanic)
phat <- predict(m, type = "response")
phat[1:3]
## 1 2 3
## 0.4096069 0.4641188 0.4619914
These numbers are exactly the same as above in Section 12.1.2.2. In case we want to predict survival for a person who is not in the original dataset, we need to supply a data frame with the newdata argument, see above in Section 12.2.1.
predict() for glm() does not do categories. If you need categories instead of probabilities, you need to follow the steps in Section 12.1.2.3.
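For instance, a minimal sketch along those lines, reusing phat from above:

yhat <- ifelse(phat > 0.5, 1, 0)  # predicted survival category
yhat[1:3]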
12.3 Using tidymodels
The modeling tools in base-R are mainly developed with inferential modeling in mind. But R also has several frameworks that are dedicated to predictive modeling. Such packages typically provide a more unified interface for models that are commonly used in prediction, model assessment tools such as the confusion matrix and other model goodness measures, and tools to optimize parameters in pipelines. Unfortunately, as the tasks are different, so are the tools, and hence they bring a substantial learning curve.
This section discusses predictions using one such framework, the tidymodels package. It works well together with tidyverse, but you need to install and load it separately:
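# install.packages("tidymodels")  # install once, if needed
library(tidymodels)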
12.3.1 Basic modeling
First, we discuss the basics of modeling with tidymodels. Its approach is heavily based on pipes, but unlike in the case of dplyr, it is not data that flows through the pipe from function to function: it is the model that flows through the fitting and prediction steps.
12.3.1.1 Linear regression
As the first step, let’s replicate the task in Section 12.1.1.1, predicting the distance of galaxies by their speed, using tidymodels.
First, we define the model:
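m <- linear_reg()  # an empty (unfitted) linear regression model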
This creates an empty linear regression model m. It has no defined parameters and no defined formula. It is just an empty model that one can fit (or use in pipelines, see Section 14).
Next, we fit the model with data, using a formula:
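mf <- m %>%
   fit(Rmodern ~ vModern, data = hubble)  # same assumed names as above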
Note how the pipe forwards not data (which is given as a separate argument) but the model into the fit() function. fit() creates a new object, a fitted model, that we can use for different tasks. For instance, here is a tidy-style version of the regression table:
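tidy(mf)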
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.86 0.799 2.33 0.0294
## 2 vModern 0.0127 0.00127 9.96 0.00000000130
This is the exact same table as one gets with the base-R lm() and summary() methods, just printed in a slightly different way. For instance, the \(t\)-values are called statistic instead. Unfortunately, summary(mf) does not produce anything meaningful for fitted tidymodels objects.
Predicting with tidymodels works in a fairly similar way as with the base-R tools. Here is an example of how we can use the model to predict the distance of two galaxies, with speeds 500 and 1500 km/s:
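newData <- data.frame(vModern = c(500, 1500))
predict(mf, new_data = newData)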
## # A tibble: 2 × 1
## .pred
## <dbl>
## 1 8.20
## 2 20.9
You can see two differences: the new data argument is called new_data, and more importantly, the result is a data frame where the column of predictions is called .pred (remember: in base-R it is a vector).
There is another important difference: the new_data argument is required (in base-R, if you leave it out, the model predicts for the original data). It is not that common in predictive modeling to predict on the training data, so this difference is understandable. But here is how you can do it:
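Just supply the training data as new_data; the output below shows the first four rows:

predict(mf, new_data = hubble) %>%
   head(4)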
## # A tibble: 4 × 1
## .pred
## <dbl>
## 1 14.5
## 2 7.06
## 3 1.14
## 4 -1.95
12.3.1.2 Logistic regression
Let’s demonstrate the tidymodels approach with a classification problem. The basic workflow is very similar.
First we define the model; this time it is logistic_reg():
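m <- logistic_reg()  # an empty logistic regression model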
And next, you need to fit it. logistic_reg() requires the outcome variable to be categorical, so we need to use factor(survived) instead of just survived:
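A sketch of the fitting step (the exact formula is an assumption; judging from the examples below, the model uses sex and pclass as predictors):

mf <- m %>%
   fit(factor(survived) ~ sex + pclass, data = titanic)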
As in the case of the linear regression, the coefficient table (not the marginal effects!) can be extracted with tidy().
But predicting probabilities and outcomes is much more straightforward with tidymodels. Both of these are computed by predict(mf, new_data = ...). In order to get probabilities, we need to add type = "prob":
predict(mf, new_data = titanic,  # new data required; here, compute for everyone
        type = "prob") %>%       # predict probabilities
   head(3)
## # A tibble: 3 × 2
## .pred_0 .pred_1
## <dbl> <dbl>
## 1 0.103 0.897
## 2 0.591 0.409
## 3 0.103 0.897
This returns a data frame with two columns: .pred_0 and .pred_1. These are the predicted probabilities for drowning (survived = 0) and surviving (survived = 1). For instance, the first passenger is predicted to survive with probability 0.897, while the second one has survival chances of only 0.409.
We can also predict the class directly by specifying type = "class" (or by just leaving it out):
predict(mf, type = "class",
# can leave type out as well
new_data = titanic) %>%
# new data required!
head(3)
## # A tibble: 3 × 1
## .pred_class
## <fct>
## 1 1
## 2 0
## 3 1
This, again, produces a data frame with a single column, .pred_class. And indeed, we can see that the first passenger is predicted to survive while the second one is predicted to die.
Finally, a small example of how to predict the outcome for new cases, here a 1st-class man and a 2nd-class woman. We have specified age as well, but the model is not using it.
newData <- data.frame(sex = c("male", "female"),
                      pclass = c(1, 2),
                      age = c(10, 20))
predict(mf, new_data = newData)
## # A tibble: 2 × 1
## .pred_class
## <fct>
## 1 0
## 2 1
Instead of writing phat <- 1/(1 + exp(-eta)) above, one can use the logistic probability function plogis(), as in phat <- plogis(eta).