Chapter 12 Prediction and Predictive Modeling

Up to now we have been discussing inferential modeling. The main purpose of that approach is to learn something about the data and the world behind it. Now our focus shifts to predictive modeling: models that predict the data as well as possible. Whether these models also tell us something about the deeper relationships is of secondary importance.

12.1 Manually predicting regression outcomes

Manual prediction is usually not needed. It is useful, though, for better understanding the underlying models, and sometimes for debugging.

12.1.1 Linear regression

Linear regression predictions are fairly simple to reproduce manually.

12.1.1.1 Hubble data example

Let’s use the Hubble data example again, but this time predict distance based on velocity, and use the modern estimates Rmodern and vModern.
Figure: the distance-velocity relationship in the Hubble data, with the fitted regression line. It is produced by the following code:

library(tidyverse)  # for read_delim(), the pipe, and ggplot()
hubble <- read_delim("../data/hubble.csv")
hubble %>%
   ggplot(aes(vModern, Rmodern)) +
   geom_point(size = 2) +
   geom_smooth(method = "lm",  # add the regression line
               se = FALSE)     # without the confidence band

Below, we use different modeling tools to make predictions based on this relationship, formally written as \[\begin{equation} R_i^{\mathit{Modern}} = \beta_0 + \beta_1 \cdot v_i^{\mathit{Modern}} + \epsilon_i. \end{equation}\]

Note that this is technically a very easy case, as the model does not include any categorical variables. If categorical variables are needed, we first have to convert them to dummies and use the dummies instead.
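
Here is a minimal sketch of such a conversion, using base-R’s model.matrix(); the data frame and the variable group are made up for illustration:

## hypothetical data frame with a categorical variable 'group'
df <- data.frame(group = c("a", "b", "c"))
## model.matrix() returns an intercept column plus one 0/1 dummy
## for every level except the (alphabetically first) reference level
model.matrix(~ group, data = df)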

12.1.1.2 Doing predictions

Before we can start with predictions, we need to find the intercept \(\beta_0\) and slope \(\beta_1\). This is best done with the lm() function of base-R (see Section @ref(reg-ols-modeling-lm)):

m <- lm(Rmodern ~ vModern, data = hubble)

The estimated parameter values are

coef(m)
## (Intercept)     vModern 
##  1.86177763  0.01267665

For simplicity, let’s extract the intercept and slope, and call those b0 and b1 respectively:

b0 <- coef(m)[1]  # intercept
b1 <- coef(m)[2]  # slope

Finally, we can just plug these values in to predict the distance \(\hat R\): \[\begin{equation} \hat{R} = \hat{\beta}_0 + \hat{\beta}_1 \cdot v \end{equation}\] where \(\hat{\beta}_0\) and \(\hat{\beta}_1\) denote the coefficients we estimated above. For instance, for the first 3 galaxies in the dataset, we can compute the predicted distance as

b0 + b1 * hubble$vModern[1:3]
## [1] 3.865956 5.385886 1.139209

Alternatively, if we have velocity measurements of 500, 1000, and 1500 km/s, we can compute the predicted distances as

v <- c(500, 1000, 1500)
b0 + b1 * v
## [1]  8.200102 14.538426 20.876751

12.1.2 Logistic regression

In the case of logistic regression, we can predict both the outcome probability (called \(\hat P\) or phat below) and the outcome category (called \(\hat y\) or yhat below). Logistic regression is somewhat more complex to work with manually.

12.1.2.1 Titanic data example

Below we work through an example using the Titanic data. We model survival as a function of age only, as the other two main predictors (sex and pclass) are categorical and hence hard to handle manually. So we are fitting the logistic regression model \[\begin{equation} \Pr(S = 1) = \Lambda \big( \beta_0 + \beta_1 \cdot \mathit{age} \big) = \Lambda(\eta) \end{equation}\] where \[\begin{equation} \eta = \beta_0 + \beta_1 \cdot \mathit{age} \end{equation}\] is the linear predictor (the link value), \[\begin{equation} \Lambda(\eta) = \frac{1}{1 + \exp(-\eta)} \end{equation}\] is the logistic function (aka the sigmoid function), and \(S = 1\) denotes that the passenger survived.

First we fit the model:

titanic <- read_delim("../data/titanic.csv.bz2") %>%
   filter(!is.na(age))  # many ages are missing
m <- glm(survived ~ age, family = binomial, data = titanic)

This command computes the best \(\beta_0\) and \(\beta_1\):

coef(m)
##  (Intercept)          age 
## -0.136531206 -0.007898598
b0 <- coef(m)[1]
b1 <- coef(m)[2]

12.1.2.2 Predicting probabilities

From above, we can estimate the survival probability of the passengers as¹

eta <- b0 + b1*titanic$age  # the link value
phat <- 1/(1 + exp(-eta))   # the logistic transformation
## print for first 3 passengers:
phat[1:3]
## [1] 0.4096069 0.4641188 0.4619914

All these numbers are less than 0.5, so all of these passengers are predicted to be more likely to die than to survive.

Here we should stop for a moment and ask: how do we know that the probabilities above are survival probabilities, not death probabilities? This is related to how the outcome variable survived is coded. Here it is coded as “0” for death and “1” for survival, and glm() predicts the probability of “1”. If the response variable is coded as text, the values are converted to a categorical (using the factor() function); this normally makes the alphabetically first value the “failure” and the second one the “success”. For instance, if instead of “0” and “1” the codes were “no boat” and “boat” (for drowning and surviving), then “boat” would be considered failure and “no boat” success. See ?glm and ?factor.
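
A quick way to check which category comes first is to look at the factor levels; the first level is the “failure”:

levels(factor(c("no boat", "boat")))
## [1] "boat"    "no boat"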

If we need to predict probabilities for passengers of given ages, for instance 10, 30, and 50 years old, we can achieve this as

age <- c(10, 30, 50)
eta <- b0 + b1*age
phat <- 1/(1 + exp(-eta))
phat
## [1] 0.4463283 0.4076982 0.3701762

12.1.2.3 Predicting categories

Sometimes probabilities are all we want, but other times we want a “yes” or “no” answer: was someone predicted to survive or die?

The basic idea is quite easy: we normally predict “success” if the probability of success is larger than the probability of failure. With only two categories, this is equivalent to predicting success if the predicted \(\Pr(S = 1) > 0.5\):

yhat <- phat > 0.5
yhat
## [1] FALSE FALSE FALSE

We predicted “FALSE” for all these example persons, hence all of them were predicted to die.

Again, you need to be aware of what “success” and “failure” mean here. If you want to recover the original categories, I recommend writing it like this:

ifelse(phat > 0.5, 1, 0)  # all predicted to die
## [1] 0 0 0

12.2 Base R modeling tools

Base-R offers a pretty good modeling framework. The central tool for predicting model outcomes is the predict() method. It has two important arguments: the fitted model, and potentially also new data. There may be more arguments for different types of models; e.g., in the case of logistic regression it is possible to predict other quantities besides the outcome probability.

12.2.1 Linear regression

Here we replicate the example from Section 12.1.1.2 using the base-R modeling tools. First fit the linear regression model:

m <- lm(Rmodern ~ vModern, data = hubble)

The central tool for making predictions from a fitted model is the function predict(). Its first argument is the model. For instance, we can get the first 3 predicted distances as

predict(m)[1:3]
##        1        2        3 
## 3.865956 5.385886 1.139209

You can check that the results are exactly the same as in Section 12.1.1.2.

Alternatively, if we want to predict the distance to galaxies that are not in the original dataset we used for fitting, we can create a new data frame with these measurements, and supply it as an additional argument newdata to the predict function. For instance,

newdata <- data.frame(vModern = c(500, 1000, 1500))
predict(m, newdata = newdata)
##         1         2         3 
##  8.200102 14.538426 20.876751

Again, these results are exactly the same as in Section 12.1.1.2.

Note that the new data frame must contain all variables that are used in the model (here just vModern). It may also contain other columns; those are simply ignored.
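
For instance, a version of the data frame above with an additional, hypothetical name column gives exactly the same predictions:

newdata <- data.frame(vModern = c(500, 1000, 1500),
                      name = c("A", "B", "C"))  # not in the model: ignored
predict(m, newdata = newdata)
##         1         2         3 
##  8.200102 14.538426 20.876751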

12.2.2 Logistic regression

With base-R modeling tools, one can use predict() in a fairly similar fashion as in the case of linear regression. However, now we have to tell predict() what exactly to predict. The default option, predicting the link \(\eta\), is of little use here. So to predict the survival probability for everyone on the Titanic, you can use

m <- glm(survived ~ age, family = binomial, data = titanic)
phat <- predict(m, type = "response")
phat[1:3]
##         1         2         3 
## 0.4096069 0.4641188 0.4619914

These numbers are exactly the same as above in Section 12.1.2.2. In case we want to predict survival for a person who is not in the original dataset, we need to supply a data frame via the newdata argument; see Section 12.2.1 above.
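
As mentioned, the default option is to predict the link \(\eta\). A minimal sketch of how it relates to the probabilities, using plogis() (see the footnote in Section 12.1.2.2):

eta <- predict(m)       # the default: the link value eta
phat2 <- plogis(eta)    # logistic function: 1/(1 + exp(-eta))
all.equal(phat, phat2)  # the same as type = "response"
## [1] TRUE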

predict() for glm objects does not produce categories: if you need categories instead of probabilities, you have to follow the steps in Section 12.1.2.3, as in the sketch below.
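
A minimal sketch that combines the predicted probabilities from above with the threshold rule from that section:

yhat <- ifelse(phat > 0.5, 1, 0)  # 1 = predicted to survive
## as all phat are below 0.5 here, everyone is predicted to die
table(yhat)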

12.3 Using the caret library

caret (Classification And REgression Training) is a package dedicated to predictive modeling, and so it includes a number of convenience tools (and many more models).

Its usage is fairly similar to the base-R functionality, but also different enough that you need to adapt the code: you cannot write code for base-R and expect it to run with caret.

12.3.1 Linear regression

caret wraps model training into the train() function. For instance, the Hubble regression model from Section 12.1.1.2 can be fitted with

library(caret)
m <- train(Rmodern ~ vModern,
           method = "lm",
           data = hubble)
coef(m$finalModel)  # the underlying lm fit is stored in m$finalModel
## (Intercept)     vModern 
##  1.86177763  0.01267665
Note the differences and similarities with base-R:
  • the first argument is the formula, exactly as in the case of lm().
  • the model is trained with train(), not with lm().
  • the model type lm is supplied as the method argument. You can also pass additional arguments on to the underlying lm() function.
  • data is supplied in exactly the same way as in the case of lm().
  • the result is a wrapper object: the underlying lm fit is stored as m$finalModel, which is why we extract the coefficients with coef(m$finalModel) above.

12.3.2 Logistic regression
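
With caret, logistic regression follows the same pattern: glm is supplied as the method. Below is a minimal sketch based on the Titanic model from Section 12.1.2. For classification, train() expects the outcome to be a factor; the titanic2 name and the “no”/“yes” labels are choices made for this example (the labels must be valid R names if you want class probabilities). Note that predict() now returns categories by default, unlike in the case of base-R glm():

titanic2 <- titanic %>%
   mutate(survived = factor(survived,
                            levels = c(0, 1),
                            labels = c("no", "yes")))
m <- train(survived ~ age,
           method = "glm",
           family = binomial,
           data = titanic2)
## categories ("no"/"yes") are the default prediction:
predict(m)[1:3]
## probabilities require type = "prob":
predict(m, type = "prob")[1:3,]

Otherwise the workflow, including the newdata argument, is the same as for linear regression above.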


¹ Instead of writing phat <- 1/(1 + exp(-eta)), one can use the logistic distribution function plogis() as phat <- plogis(eta).