Chapter 12 Prediction and Predictive Modeling
Up to now we have been discussing inferential modeling. The main purpose of that approach is to learn something about the data and the world behind it. Now our focus shifts to predictive modeling: models that predict the data as well as possible. Whether these models also tell us something about the deeper relationships is of secondary importance.
12.1 Manually predicting regression outcomes
Manual prediction is usually not needed. It is useful, though, for better understanding the underlying models, and sometimes for debugging.
12.1.1 Linear regression
Linear regression predictions are fairly simple to reproduce manually.
12.1.1.1 Hubble data example
Let's use an example of Hubble data, but this time predict distance based on velocity, and use the modern estimates Rmodern and vModern. The distance-velocity relationship is shown here:
library(tidyverse)  # for read_delim(), %>%, and ggplot()
hubble <- read_delim("../data/hubble.csv")
hubble %>%
   ggplot(aes(vModern, Rmodern)) +
   geom_point(size = 2) +
   geom_smooth(method = "lm",
               se = FALSE)
Below, we use different modeling tools to make predictions based on this relationship, formally written as \[\begin{equation} R_i^{\mathit{Modern}} = \beta_0 + \beta_1 \cdot v_i^{\mathit{Modern}} + \epsilon_i. \end{equation}\]
Note that this is technically a very easy case, as it does not include any categorical variables. If categorical variables are needed, then we must first convert them to dummies, and use the dummies instead, as sketched below.
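For instance, here is a minimal sketch of how base R's model.matrix() turns a factor into dummy columns (the variable x is made up purely for illustration):

x <- factor(c("a", "b", "b", "c"))
model.matrix(~ x)  # one intercept column plus dummies for "b" and "c"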
12.1.1.2 Doing predictions
Before we can start with predictions, we need to find the intercept \(\beta_0\) and slope \(\beta_1\). This is best done with the lm() function of base R (see Section @ref(reg-ols-modeling-lm)):
m <- lm(Rmodern ~ vModern, data = hubble)
The estimated parameter values are
coef(m)
## (Intercept) vModern
## 1.86177763 0.01267665
For simplicity, let's extract the intercept and slope, and call those b0 and b1, respectively:
b0 <- coef(m)[1]
b1 <- coef(m)[2]
Finally, we can just plug these values in to predict distance \(\hat R\): \[\begin{equation} \hat{R} = \hat{\beta}_0 + \hat{\beta}_1 \cdot v \end{equation}\] where \(\hat{\beta}_0\) and \(\hat{\beta}_1\) denote the coefficients we calculated above. For instance, for the first 3 galaxies in the dataset, we can compute the predicted distance as
b0 + b1 * hubble$vModern[1:3]
## [1] 3.865956 5.385886 1.139209
Alternatively, if we have velocity measurements 500, 1000, and 1500 km/s, we can compute the predicted distance as
v <- c(500, 1000, 1500)
b0 + b1 * v
## [1] 8.200102 14.538426 20.876751
12.1.2 Logistic regression
In the case of logistic regression, we can predict both the probability of the outcome (called \(\hat P\) or phat below) and the outcome category (called \(\hat y\) or yhat below). Logistic regression is somewhat more complex to work with manually.
12.1.2.1 Titanic data example
Below we do an example, using Titanic data. We model survival as a function of age only, as the other two main predictors (sex and pclass) are categorical and hence hard to handle for manual work. So we are fitting a logistic regression model \[\begin{equation} \Pr(S = 1) = \Lambda \big( {\beta}_0 + {\beta}_1 \cdot \mathit{age} \big) = \Lambda(\eta) \end{equation}\] where \[\begin{equation} \eta = {\beta}_0 + {\beta}_1 \cdot \mathit{age} \end{equation}\] is the linear predictor (the link), \[\begin{equation} \Lambda(\eta) = \frac{1}{1 + \exp(-\eta)} \end{equation}\] is the logistic function (aka sigmoid function), and \(S = 1\) denotes that the passenger survived.
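As a quick sanity check (a small sketch, not part of the original analysis), we can verify that the logistic function maps any \(\eta\) into the interval \((0, 1)\):

Lambda <- function(eta) 1/(1 + exp(-eta))
Lambda(c(-Inf, 0, Inf))  # extreme and central values of eta
## [1] 0.0 0.5 1.0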
First we fit the model:
read_delim("../data/titanic.csv.bz2") %>%
titanic <- filter(!is.na(age)) # many age's missing
glm(survived ~ age, family = binomial, data = titanic) m <-
This command computes the best \(\beta_0\) and \(\beta_1\), here
coef(m)
## (Intercept) age
## -0.136531206 -0.007898598
b0 <- coef(m)[1]
b1 <- coef(m)[2]
12.1.2.2 Predicting probabilities
From above, we can estimate the survival probability of passengers as⁷
eta <- b0 + b1*titanic$age
phat <- 1/(1 + exp(-eta))
## print for first 3 passengers:
phat[1:3]
## [1] 0.4096069 0.4641188 0.4619914
All these numbers are less than 0.5, so all of these passengers were predicted more likely to die than to survive.
Here we should stop for a moment and ask: how do we know that the probabilities above are survival probabilities, not death probabilities? This is related to how the outcome variable survived is coded. Here it is coded as "0" for death and "1" for survival, and glm() will predict the probability of "1". If the response variable is coded as text, the values are converted to a categorical (using the factor() function); this normally makes the alphabetically first value count as "failure", and the second one as "success". For instance, if instead of "0" and "1" the codes were "no boat" and "boat" (for drowning and surviving), then "boat" would be considered failure and "no boat" success. See ?glm and ?factor.
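A quick way to see this ordering in action is to inspect the factor levels directly; a small sketch using the hypothetical codes above:

levels(factor(c("no boat", "boat", "boat")))
## [1] "boat"    "no boat"

The alphabetically first level, "boat", is listed first and would be treated as the baseline ("failure").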
If we need to predict probabilities for certain passengers who were, for instance, 10, 30, and 50 years old, we can achieve this as
age <- c(10, 30, 50)
eta <- b0 + b1*age
phat <- 1/(1 + exp(-eta))
phat
## [1] 0.4463283 0.4076982 0.3701762
12.1.2.3 Predicting categories
Sometimes probabilities are all we want, but other times we want a "yes" or "no" answer: was someone predicted to survive or die?
The basic idea is quite easy: we normally predict success if the probability of "success" is larger than the probability of "failure". Here, in the case of only two categories, this is equivalent to predicting success if the predicted \(\Pr(S = 1) > 0.5\):
yhat <- phat > 0.5
yhat
## [1] FALSE FALSE FALSE
We predicted “FALSE” for all these example persons, hence all of them were predicted to die.
Again, you need to be aware of what "success" and "failure" mean here. If you want to recover the original categories, I'd recommend writing it like this:
ifelse(phat > 0.5, 1, 0) # all predicted to die
## [1] 0 0 0
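If the data had instead used the hypothetical "no boat"/"boat" text coding discussed above, the same pattern would recover those labels (remember: "no boat", the alphabetically second level, would be the modeled "success"):

ifelse(phat > 0.5, "no boat", "boat")
## [1] "boat" "boat" "boat"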
12.2 Base R modeling tools
Base R offers a pretty good modeling framework. The central tool for predicting model outcomes is the predict() method. It has two important arguments: the fitted model, and potentially also the new data. There may be more arguments for different types of models; e.g., in the case of logistic regression it is possible to predict other quantities besides the outcome probability.
12.2.1 Linear regression
Here we replicate the example from Section 12.1.1.2 using the base-R modeling tools. First fit the linear regression model:
m <- lm(Rmodern ~ vModern, data = hubble)
The central tool for making predictions from a fitted model is the function predict(). Its first argument is the model. For instance, we can get the first 3 predicted distances as
predict(m)[1:3]
## 1 2 3
## 3.865956 5.385886 1.139209
You can check that the results are exactly the same as in Section 12.1.1.2.
Alternatively, if we want to predict the distance to galaxies that are not in the original dataset we used for fitting, we can create a new data frame with these measurements, and supply it as an additional argument newdata to the predict function. For instance,
newdata <- data.frame(vModern = c(500, 1000, 1500))
predict(m, newdata = newdata)
## 1 2 3
## 8.200102 14.538426 20.876751
Again, these results are exactly the same as in Section 12.1.1.2.
Note that the new data frame must contain all variables that are used in the model (here just vModern). It may contain other columns too; those are just ignored, as the example below demonstrates.
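As a quick check (not in the original text), we can add a made-up extra column and confirm that it changes nothing:

newdata2 <- data.frame(vModern = c(500, 1000, 1500),
                       note = c("a", "b", "c"))  # irrelevant extra column
predict(m, newdata = newdata2)  # same predictions as above
## 1 2 3
## 8.200102 14.538426 20.876751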
12.2.2 Logistic regression
With base-R modeling tools, one can use predict() in a fairly similar fashion as in the case of linear regression. However, now we have to tell predict() what exactly to predict. The default option, predicting the link \(\eta\), is of little use. So to predict the survival probability for everyone on Titanic, you can use
m <- glm(survived ~ age, family = binomial, data = titanic)
phat <- predict(m, type = "response")
phat[1:3]
## 1 2 3
## 0.4096069 0.4641188 0.4619914
These numbers are exactly the same as above in Section 12.1.2.2. In case we want to predict survival for a person who is not in the original dataset, we need to supply a data frame with the newdata argument, see above in Section 12.2.1.
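For instance, a small sketch reproducing the manual predictions for the 10-, 30-, and 50-year-old passengers from Section 12.1.2.2:

newdata <- data.frame(age = c(10, 30, 50))
predict(m, newdata = newdata, type = "response")
## 1 2 3
## 0.4463283 0.4076982 0.3701762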
predict() for glm() does not do categories. If you need categories instead of probabilities, you need to follow the steps in Section 12.1.2.3.
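A minimal sketch combining the two steps, predicting probabilities with predict() and then converting them to 0/1 categories as in Section 12.1.2.3:

phat <- predict(m, type = "response")
yhat <- ifelse(phat > 0.5, 1, 0)
yhat[1:3]
## 1 2 3
## 0 0 0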
12.3 Using the caret library
caret (Classification And REgression Training) is a package dedicated to predictive modeling, and so it includes a number of convenience tools (and many more models).
Its usage is fairly similar to the base-R functionality, but also different enough that you need to adapt the code. You cannot write code for base R and expect it to run with caret.
12.3.1 Linear regression
caret wraps model training into the train() function. For instance, the Hubble regression model from Section 12.1.1.2 can be done with

library(caret)
m <- train(Rmodern ~ vModern,
           method = "lm",
           data = hubble)

Note that coef() does not work on the train object directly (it returns NULL); the underlying lm fit is stored in m$finalModel:

coef(m$finalModel)
## (Intercept) vModern
## 1.86177763 0.01267665
Note the differences and similarities with base R:

- The first argument is the formula, exactly as in the case of lm().
- The model is trained with train(), not with lm().
- The model type lm is supplied as the method argument. You can also supply other arguments to the lm() function.
- data is supplied in exactly the same way as in the case of lm().
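Predictions work in a familiar fashion too: predict() accepts the fitted train object, optionally with newdata. A small sketch, reusing the velocities from Section 12.2.1 (the same numbers should result, as caret fits the same underlying lm):

predict(m, newdata = data.frame(vModern = c(500, 1000, 1500)))
## [1] 8.200102 14.538426 20.876751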
⁷ Instead of writing phat <- 1/(1 + exp(-eta)) below, one can use the logistic probability function plogis() as phat <- plogis(eta).